What progress have you made since your last update?
My most recent focus has been research on scaling sparse autoencoders to attention layers, which has been submitted as a research paper to NeurIPS and accepted as a Spotlight presentation at the ICML Mechanistic Interpretability workshop.
As an update to Scaling Training Process Transparency, I am working with my summer mentee Gavin Ratcliffe and co-advisor Sara Price to supervise a project that combines developmental interpretability with sleeper agents. This project is a natural extension of Training Process Transparency to larger models that serve as model organisms of deception, and it should help answer questions about how the deception trigger in sleeper agents forms and operates, as well as about the corresponding defection behavior.
As part of this project, we intend to use the funds allocated for Scaling Training Process Transparency to cover relevant compute expenses. In spirit, the work on this project is equivalent to that of Scaling Training Process Transparency: it analyzes how mechanisms form throughout the training (or, in this case, fine-tuning) process of a model large enough to exhibit interesting behaviors (in this case, deception triggers and backdoor defection behavior).
What are your next steps?
We will release our results on developmental interpretability with sleeper agents, the natural extension of Scaling Training Process Transparency, as soon as we have them, which we expect later this summer (2024).
Is there anything others could help you with?
Yes. We would welcome anyone interested in mechanistic interpretability, developmental interpretability, and/or model organisms of deception reviewing our project and validating that it conforms to the stated purposes of this grant. We would also appreciate feedback and ideas from interested advisors as we actively iterate on this project.