The full funding application is available here, including an explanation of the threat model, the developed plan, an extensive theory of change, and a risk assessment. Please treat the below as an executive summary.
(Goal) marks activities covered by the funding goal, as opposed to the funding minimum.
Problem: It is unclear how well interpretability techniques that assess model internals by reducing them to causal mechanisms will generalize as capabilities advance and new architectures and paradigms are introduced.
I want to develop arguments suggesting that the generalizability of these techniques will be vulnerable to the following adaptations:
Scaffolding changes
Substrate-level (i.e., model, architecture, paradigm) changes
AI-assisted development of the above
Self-modification
I also want to extend this roadmap to consider risk factors that will be amplified through these vulnerabilities:
Deep deceptiveness, in which deception is hidden via low-level reconfigurations that evade monitoring/interpretability techniques whilst leaving overall behavior intact.
Aggregate/Diffused deceptiveness, in which deceptive circuits are distributed across multiple connected AI systems, increasing the space over which interpretability techniques must search.
These arguments are mostly conceptual at present and have been set out in an upcoming arXiv preprint (co-authors: Matt Farr, Chris Pang, Sahil K; draft available here). They have generated interest from researchers, along with both positive and critical feedback. I am seeking funding to continue developing the work's realism and technical rigor by (i) iterating on the paper in light of the feedback and suggestions received and (ii) upskilling in the relevant technical areas to more fully assess the work's core premises.
This work will feed into the wider MoSSAIC project, which others within the High Actuation Spaces group are also working on. I believe my work here will motivate and structure the development of an alternative paradigm for interpretability, one that tackles the higher-level structures we care about in AI safety.
Clarify and Articulate Reductionist/Mechanistic Paradigm
Examine the ways in which AI safety (especially interpretability) privileges a bottom-up approach from causal mechanisms to goal-oriented behaviour.
More specifically, investigate two key assertions that we claim are crucial to the paradigm's continued success:
Ontological: That structural properties discovered in AI systems remain stable as capabilities increase
Epistemological: That structural properties can reliably indicate safety-relevant behaviors
Approach
Theoretical research/upskilling
Mech-interp and (Goal) dev-interp/causal incentives
Parallel debates in neurology/philosophy of mind/science
Refinement of terminology/concepts of substrate, mechanism, etc., through feedback and further examples.
Develop "Substrate-Flexible Risk" Framework
Establish/present a new threat model for AI safety that captures:
Changes in AI architectures and paradigms
Self-modifying AI systems
Deep deceptiveness and its distributed equivalent (termed aggregate risk in the paper).
Use the threat model framing to express a number of evasive risk scenarios, including Deep Deceptiveness, the Sharp Left Turn, and Robust Agent-Agnostic Processes.
(Goal) Assess the importance of substrate-flexible risks for governance/regulation practices.
(Goal) Motivate and start developing solutions to the threat model (see Section 5 onwards).
Approach
Incorporation of feedback on current draft of MoSSAIC publication
Examination of case studies of architecture/paradigm changes (Mamba, Kolmogorov-Arnold networks, transformers).
(Goal) Analysis of Anthropic/AISI safety case policy directions. Consultation with governance experts.
Minimum funding will cover my living costs and minor research expenses for 3 months at 0.6 FTE.
Goal funding covers 4 months at full FTE, plus access to LISA.
I am collaborating with Chris Pang, under the supervision of Sahil K (independent, ex-MIRI). We are both members of the High Actuation Spaces community and have access to a wide range of researchers who can provide feedback as we develop the project.
I also have Matthew Wearden as my research manager; he will serve as another set of eyes on the process.
I co-authored the above-mentioned paper and presented earlier sketches of these ideas at LISA and MATS. I have completed similar long-term projects as part of my university degree.
The literature may be insufficient, or the identification of a reductionist paradigm may be impossible for more theoretical reasons.
I may also simply run out of time; I will assess progress from the halfway point, in consultation with my supervisor and RM.
I intend to commit some time to assessing any negative results, and I suspect this work will also feed into work being conducted by others on the foundations and limitations of mech-interp.
The full MoSSAIC project was added to MATS 6.0 (June–August 2024) by Ryan Kidd, and so received the $9,000 stipend from AI Safety Support.