Let me copy the earlier progress update we shared (which was meant to close the project):
We've posted a detailed update on LessWrong.
In short:
We consider this project a major success: SLT & DevInterp's main predictions have been validated in a number of different settings. We are now confident that these research directions are useful for understanding deep learning systems.
Our priority is now to make direct contact with alignment: it's not enough for this research to help with understanding NNs; it needs to move the needle on alignment. In our update, we sketch three major research directions that we expect to make a difference.
In more detail, with respect to the concrete points above:
- Completing the analysis of phase transitions and associated structure formation in the Toy Models of Superposition (preliminary work reported in the SLT & Alignment summit's "SLT High 4" lecture). See Chen et al. (2023).
- Performing a similar analysis for the Induction Heads paper. See Hoogland et al. (2024).
- For diverse models that are known to contain structure/circuits, we will attempt to:
  - detect phase transitions (using a range of metrics, including train and test losses, RLCT, and singular fluctuation),
  - classify weights at each transition into state & control variables,
  - perform mechanistic interpretability analyses at these transitions,
  - compare these analyses to MechInterp structures found at the end of training.
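To make the first of these steps concrete, here is a minimal sketch of one way to estimate the local learning coefficient (RLCT) around a trained parameter, via an SGLD chain localized at the minimizer and the WBIC-style estimator λ̂ = nβ(E_w[L_n(w)] − L_n(w*)) with β = 1/log n. The toy model, the hyperparameters (localization strength, step size, chain length), and all names are illustrative assumptions, not the project's actual code.

```python
# Hedged sketch (not the project's code): estimating the local learning
# coefficient (RLCT) of a toy 1-parameter model with SGLD, using the
# WBIC-style estimator  lambda_hat = n*beta*(E_w[L_n(w)] - L_n(w*)),
# with inverse temperature beta = 1/log(n). Hyperparameters gamma, eps,
# and the chain length are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 0.5*x + noise; model f(x) = w*x with MSE loss.
n = 1000
x = rng.normal(size=n)
y = 0.5 * x + 0.1 * rng.normal(size=n)

def loss(w):
    return np.mean((y - w * x) ** 2)

def grad(w):
    return np.mean(-2.0 * x * (y - w * x))

# Empirical minimizer w* (closed form for this linear model).
w_star = np.dot(x, y) / np.dot(x, x)
L_star = loss(w_star)

# SGLD chain tethered to w* by a quadratic localization term, so the
# samples probe the posterior in a neighborhood of the found minimum.
beta = 1.0 / np.log(n)   # inverse temperature at the WBIC choice
gamma = 100.0            # localization strength (hyperparameter)
eps = 1e-4               # SGLD step size
w = w_star
samples = []
for step in range(5000):
    drift = -0.5 * eps * (n * beta * grad(w) + gamma * (w - w_star))
    w = w + drift + np.sqrt(eps) * rng.normal()
    samples.append(loss(w))

# Post-burn-in average loss gap gives the estimate; for this regular
# 1-parameter model the true learning coefficient is d/2 = 0.5.
llc_hat = n * beta * (np.mean(samples[500:]) - L_star)
print(f"estimated local learning coefficient: {llc_hat:.2f}")
```

Tracked over training checkpoints, a sudden change in such an estimate is one of the signals one could use to flag a candidate phase transition.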
Classifying transitions into state & control variables remains to be done in the coming months. We have performed some mechanistic/structural analysis, and more of this kind is currently underway.