I, Lovis Heindrich, am planning to research the use of sparse autoencoder (SAE) circuits to better understand SAE features. The project will be carried out during a research visit at the Torr Vision Group at the University of Oxford, mentored by Fazl Barez and Prof Philip Torr, in collaboration with Veronika Thost (MIT-IBM Watson Lab). I am seeking funding of $8000 to cover my salary to work on the project full-time for 3 months, as well as $3000 for additional compute budget. If the compute budget is not fully used, the remainder will cover conference fees.
Recent work on SAEs [Anthropic 2024] has demonstrated the feasibility of SAE feature discovery in larger models and discovered safety-relevant features that are causally important for the models’ behavior. Understanding what causes these features to activate is an important open research question. Our project’s goal is to create circuit-based explanations of such SAE features. Current approaches [Anthropic 2024, OpenAI 2023] that use activating dataset examples to generate feature explanations are limited because they can result in overly broad explanations or interpretability illusions [Bolukbasi et al. 2021]. We plan to make progress on this problem using circuit discovery methods [Syed et al. 2023, Marks et al. 2024, Dunefsky & Chlenski 2024]. We will explore ways in which circuit-based explanations can improve both our understanding of sparse autoencoder features and their usefulness.
$8000 will cover my salary to work on the project full-time for 3 months. The remaining $3000 will be used for compute and/or conference fees.
Lovis Heindrich: I’m a past MATS scholar, where I worked with Neel Nanda, and I have published relevant work analyzing MLP circuits in Pythia-70M. Additionally, I have experience training and evaluating sparse autoencoders from working on them during the MATS extension.
Fazl Barez, Veronika Thost, Philip Torr
If this project were to fail, we’d expect the most likely causes to be that feature circuits are either too distributed or rely on uninterpretable features. Insufficiently good SAEs could also limit the project, especially if a large proportion of the feature circuits can’t be explained by earlier SAE features. We are optimistic that using recent sparse autoencoders, including recent improvements to the SAE architecture, will help us overcome these potential issues.
The Torr Vision Group at the University of Oxford will provide access to a compute cluster and cover costs related to accommodation and travel to Oxford.
Austin Chen
4 months ago
Approving this as in line with our mission of advancing AI safety research. Thanks to Lovis and Neel for their public writeups on this!
Neel Nanda
4 months ago
I had previously discussed this grant with Lovis and suggested he apply.
Why is this a good idea?
I think Sparse Autoencoders are one of the most promising areas of mech interp work right now. Better understanding SAE circuits seems exciting, and I think that understanding the circuit required to produce a feature is an important direction. This is both a sub-part of the broader project of finding end-to-end circuits, and could help with interpreting what a feature does (especially important features like the safety-relevant features in Scaling Monosemanticity) - I would be very excited if this project finds case studies of features that have ambiguous maximum activating examples, but whose meaning is clarified by studying a circuit.
(Note that the applicants shared with me a more detailed project proposal than what was shared publicly, which I broadly think was sensible, though I disagreed on some points.)
Concerns
Research is hard, and there's a good chance this project doesn't really go anywhere interesting.
This is a hard and somewhat open-ended question, though I think they had some decent ideas of concrete entry points.
There are many directions the project could go in, and it'd be easy to get caught in rabbit holes or constantly flit between things and never do any of them properly.
Why this amount?
This was the salary requested, I think somewhat pegged to academic summer researcher salaries, which are a fair bit lower than the market rate for independent researchers, so no complaints from me. The compute may not be needed, since the lab provides some, but it would be silly for the project to be bottlenecked by lacking compute. This overall seems like a fairly small grant, with some chance of going somewhere interesting, and so a pretty obvious accept.
Conflicts of interest
Lovis is one of my MATS alumni, but we haven't been working together for several months, so I don't feel too concerned about the conflict of interest, and it means I have a fair amount of data to evaluate him. I don't personally benefit from this project (except in that all good mech interp research helps my own work!), and don't anticipate being a co-author on any papers produced.