Project summary
I would like to scale up experiments in training process transparency in order to better understand the formation of various mechanisms in language models. This work involves training small- to medium-scale transformer models and analyzing gradients via training data attribution throughout the training process. I believe the resulting insights will inform new directions in mechanistic interpretability and in detecting precursors of deception, particularly in cases where this is hard or impossible on fully trained models.
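The summary above does not commit to a particular attribution estimator; a TracIn-style gradient dot product summed over saved checkpoints is one common instantiation of gradient-based training data attribution. The sketch below assumes a PyTorch setup, and `probe_loss_fn`, `train_loss_fn`, and `checkpoint_paths` are hypothetical stand-ins rather than names from this project's codebase.

```python
# Minimal TracIn-style sketch (illustrative only): at each saved checkpoint,
# score each candidate training example by the dot product of its loss
# gradient with the gradient of a "probe" loss that captures the behavior
# of interest (e.g. an induction-style completion).
import torch

def flat_grad(loss, params):
    # Flatten d(loss)/d(params) into a single vector for dot products.
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def attribution_scores(model, checkpoint_paths, train_examples,
                       probe_loss_fn, train_loss_fn, lr=1.0):
    params = [p for p in model.parameters() if p.requires_grad]
    scores = torch.zeros(len(train_examples))
    for path in checkpoint_paths:              # checkpoints saved densely during training
        model.load_state_dict(torch.load(path))
        probe_grad = flat_grad(probe_loss_fn(model), params)
        for i, ex in enumerate(train_examples):
            train_grad = flat_grad(train_loss_fn(model, ex), params)
            scores[i] += lr * torch.dot(train_grad, probe_grad).item()
    return scores  # higher score = larger estimated influence on the probed behavior
```

In practice the inner loop would be batched and restricted to the data actually seen near each checkpoint, but the dot product between per-example gradients and the probe gradient is the core of the method.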
Key Activities:
Research training process transparency and the formation of mechanisms along training trajectories of language models
Release research outputs and open-source tooling for performing similar experiments
Key Reasons:
Mechanistic interpretability (MI) is making decent headway toward producing a full picture of model cognition. However, good outcomes for AI alignment are predicated on:
Addressing failure modes such as deception and deceptive alignment, where the structure of the problematic cognition is adversarial to the designer. MI on final model snapshots may overlook relevant mechanisms, particularly if they are subtle internally, hard to access through computational means, and/or only exhibited out of the training distribution. Training process transparency can capture these mechanisms as they form, before they become difficult to detect.
Coverage of detectable mechanisms prior to deployment. Training process transparency broadens the search space for problematic mechanisms, helping to ensure that none go unaccounted for.
Tooling/Open-Source: Producing open-source packages and research outputs will increase innovation speed on these lines of research.
Compute: Given that this research agenda focuses on analyzing shifts throughout the full training process, employing existing trained models with their relatively sparse checkpoints is insufficient; new training runs with dense checkpointing are required.
Project goals
The goals of this project are to better understand how mechanisms form within language models, including:
Induction heads and circuits employing them
I have already reproduced induction head formation and am in the process of analyzing the resulting heads to attribute the training data responsible for their formation.
This exercise acts as a good proof of concept for the methodology, as induction heads are a mechanism known to reliably emerge at particular scales, yet their etiology remains poorly understood. (A sketch of one possible formation metric follows this list.)
IOI or similar complexity circuits
Novel circuits that are particularly amenable to discovery in light of the training process
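As noted in the induction head item above, the formation metric itself is not pinned down here; one common proxy is a prefix-matching score computed on sequences of random tokens repeated twice, evaluated at each saved checkpoint. A minimal sketch follows (PyTorch; `attn` stands in for per-head attention patterns obtained via forward hooks or a caching library):

```python
import torch

def repeated_random_tokens(vocab_size, seq_len, batch=16):
    # Second half repeats the first half token-for-token.
    first = torch.randint(0, vocab_size, (batch, seq_len))
    return torch.cat([first, first], dim=1)          # [batch, 2*seq_len]

def induction_score(attn, seq_len):
    # attn: [n_layers, n_heads, 2*seq_len, 2*seq_len], attention patterns
    # averaged over a batch of repeated sequences.
    # For a query at position i in the second half, an induction head should
    # attend to position i - seq_len + 1 (the token after the earlier copy).
    queries = torch.arange(seq_len, 2 * seq_len)
    keys = queries - seq_len + 1
    return attn[:, :, queries, keys].mean(dim=-1)    # [n_layers, n_heads]
```

Scanning this score across checkpoints gives a per-head formation curve, and the training batches seen around the phase change become natural candidates for the attribution step sketched earlier.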
The impact of this work will be published research packages and analyses of toy models that improve the state of knowledge around mechanism formation in transformer-based models.
I expect this research to last at most 2-3 months before re-evaluation of the research direction, and am seeking this funding for that time period.
How will this funding be used?
This funding will be used for compute and infrastructure costs associated with the project.
This funding will enable me to run 50-60 experiments, each requiring an individual training run. If initial experiments are successful, I may opt to execute larger training runs on multi-GPU instances or multi-node setups.
Who is on your team and what's your track record on similar projects?
I am conducting this research as an independent researcher. I completed the SERI MATS program with a focus on detecting deception, and have published prior results, such as this example of identifying monosemantic and polysemantic MLP neurons using training process transparency:
https://www.alignmentforum.org/posts/DtkA5jysFZGv7W4qP/training-process-transparency-through-gradient
I also have unpublished results on attention-based mechanisms at the parameter level (e.g. parenthesis matching), which I plan to consolidate into the research outputs of this project.
What are the most likely causes and outcomes if this project fails? (premortem)
This project can fail due to:
Fundamental limitations of the methodology: Despite existing results on identifying the functionality of MLP neurons, the techniques might not be applicable to the functionality of attention heads or circuits.
This is unlikely to be the case as I already have some preliminary results on attribution of attention head mechanisms.
Inability to replicate existing circuits: The first step of the analysis is to replicate known mechanisms, such as induction heads, IOI, and other behaviors/circuits, in a toy model.
This is unlikely to be the case as I have already reproduced induction head formation, and expect to reproduce more complex known behaviors with additional training time.
Difficulty in discovering novel circuits: While I expect training data attribution to simplify the discovery of novel mechanisms and accompanying circuits, this may not be feasible in the requested timeframe (2-3 months) at this scale of toy language model training runs.
What other funding are you or your project getting?
This project does not currently have any other funding.