
Scoping Developmental Interpretability

Technical AI safety

Jesse Hoogland

Grant complete
$144,650 raised

Project summary

We propose a 6-month research project to assess the viability of Developmental Interpretability, a new AI alignment research agenda. “DevInterp” studies how phase transitions give rise to computational structure in neural networks, and offers a possible path to scalable interpretability tools. 

Though we have both empirical and theoretical reasons to believe that phase transitions dominate the training process, the details remain unclear. We plan to clarify the role of phase transitions by studying them in a variety of models, combining techniques from Singular Learning Theory and Mechanistic Interpretability. Within six months, we expect to have gathered enough evidence to determine whether DevInterp is a viable research program.

If successful, we expect Developmental Interpretability to become one of the main branches of technical alignment research over the next few years. 

(This funding proposal consists of Phase 1 from the research plan described in this LessWrong post.) 

Project goals

We will assess the viability of Developmental Interpretability (DevInterp) as an alignment research program over the next 6 months. 

Our immediate priority is to gather empirical evidence for the role of phase transitions in neural network training dynamics. To do so, we will examine a variety of models for signals that indicate the presence or absence of phase transitions. 

Concretely, this means:

  • Completing the analysis of phase transitions and associated structure formation in the Toy Models of Superposition (preliminary work reported in the SLT & Alignment summit’s SLT High 4 lecture).

  • Performing a similar analysis for the Induction Heads paper.

  • For diverse models that are known to contain structure/circuits, we will attempt to:

    • detect phase transitions (using a range of metrics, including train and test losses, the real log canonical threshold (RLCT), and singular fluctuation; see the illustrative sketch at the end of this section),

    • classify weights at each transition into state & control variables,

    • perform mechanistic interpretability analyses at these transitions,

    • compare these analyses to the MechInterp structures found at the end of training.

  • Conducting a confidential capability risk assessment of DevInterp.

The unit of work here is papers, submitted either to ML conferences or academic journals. At the end of this period we should have a clear idea of whether developmental interpretability has legs.
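To give a sense of what these measurements involve: below is a minimal, illustrative sketch (in PyTorch, not our actual tooling) of estimating an RLCT-like local learning coefficient around a trained model. It runs SGLD tempered at β = 1/log n, localised near the trained parameters, and plugs the average sampled loss into a WBIC-style estimator, λ̂ ≈ nβ(E_β[L_n(w)] − L_n(w*)). The model, data, and hyperparameters are placeholders.

```python
# Minimal sketch (placeholder model/data): estimate an RLCT-like local
# learning coefficient around trained parameters w* via SGLD sampling.
import copy
import math
import torch
import torch.nn as nn


def estimate_local_learning_coefficient(model, loss_fn, X, y,
                                        num_steps=2000, lr=1e-5, gamma=100.0):
    n = len(X)
    beta = 1.0 / math.log(n)  # WBIC-style inverse temperature

    # Empirical loss at the trained parameters w*.
    with torch.no_grad():
        base_loss = loss_fn(model(X), y).item()

    # SGLD chain initialised at w*; the gamma * ||w - w*||^2 term keeps it local.
    sampler = copy.deepcopy(model)
    anchor = [p.detach().clone() for p in model.parameters()]
    sampled_losses = []

    for _ in range(num_steps):
        loss = loss_fn(sampler(X), y)
        sampler.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, p0 in zip(sampler.parameters(), anchor):
                drift = n * beta * p.grad + gamma * (p - p0)
                noise = torch.randn_like(p) * math.sqrt(lr)
                p.add_(-0.5 * lr * drift + noise)
        sampled_losses.append(loss.item())

    # WBIC-style estimate: lambda_hat ~= n * beta * (E_beta[L_n(w)] - L_n(w*)).
    burn_in = num_steps // 2
    mean_loss = sum(sampled_losses[burn_in:]) / (num_steps - burn_in)
    return n * beta * (mean_loss - base_loss)


# Toy usage on synthetic regression data.
if __name__ == "__main__":
    torch.manual_seed(0)
    X, y = torch.randn(512, 4), torch.randn(512, 1)
    net = nn.Sequential(nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 1))
    print(estimate_local_learning_coefficient(net, nn.MSELoss(), X, y))
```

Tracking an estimate like this over training checkpoints, alongside train/test loss and singular fluctuation, is the kind of signal we plan to use to flag candidate phase transitions.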

How will this funding be used?

  • $120k: RAs + Researchers + Research Fellows

  • $50k: Core Staff

  • $40k: Employment Costs

  • $10k: Compute

  • $7k: Travel Support

  • $18k: Fiscal Sponsorship Costs

  • $25k: Buffer

What is your team's track record on similar projects?

(The core team currently consists of Jesse Hoogland, Alexander Gietelink Oldenziel, Prof. Daniel Murfet and Stan van Wingerden. We have a shortlist of external researchers to hire.)

We are responsible for advancing the Developmental Interpretability research program in the following ways:

  • We ran the 2023 Singular Learning Theory (SLT) & Alignment Summit, which brought together roughly 140 online participants and 40 in-person participants to learn about singular learning theory and start working on open problems relating SLT and alignment. 

    • This also helped us scout talent for the DevInterp research program.

  • This summit culminated in the DevInterp research agenda, which we recently published. During the summit, we also recorded over 30 hours of lectures, and we will soon release ~200 pages of lecture notes, accompanying LessWrong posts, and original research.

  • Developmental interpretability was first proposed by Prof. Daniel Murfet in his “SLT for Alignment Plan”, where he sketches the connections between SLT and mechanistic interpretability.

  • Jesse Hoogland and Alexander Gietelink Oldenziel first communicated the potential value of SLT to the alignment community in a LessWrong sequence.

  • Prof. Daniel Murfet founded metauni, an online learning community, which has hosted hundreds of seminars (including on AI alignment), and which yielded his initial SLT for Alignment Plan. 

  • Prof. Daniel Murfet has also published dozens of articles in related fields such as mathematical physics, algebraic geometry, theory of computation, and machine learning, some of which have laid the groundwork for our current research agenda.

We expect DevInterp to yield concrete new techniques for AI alignment within the next two years. This agenda and its impact would not exist without us.

How could this project be actively harmful?

Like other forms of interpretability, DevInterp could inadvertently accelerate capabilities.

We won’t publicly share the particular ways we think this could occur, but we will say that we think the risk is quite low for the next year. Longer term, we’re not as confident, which is why we’ve chosen to include a fellowship for assessing capability risks in this proposal.

What other funding is this project getting?

No other funding has been secured yet. We have submitted somewhat similar (though broader in scope) grant requests to Lightspeed and SFF.
