tl;dr: determine the best currently-available training setup for SAEs and disseminate this knowledge. Train SAEs for steadily larger models (starting with GPT-2-small for MATS scholars) and then scale up as budget and time allow.
Project proposal doc with more details: https://docs.google.com/document/d/15X28EEHo7pM2CYkfZqk05A0MZi4ImvTSSVaC9wtFLyI/edit?usp=sharing
Determine good hyperparameters for sparse autoencoders for realistic LLMs by doing a comprehensive architecture and hyperparameter comparison (see the sketch after this list for the kind of model and hyperparameters involved).
Use this knowledge to train a suite of high-quality SAEs for GPT-2-small, then scale up further as resources allow, targeting ~1B and ~8B models in sequence.
Disseminate knowledge on SAE training through a technical report.
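For concreteness, here is a minimal sketch of the kind of vanilla SAE such a comparison covers, with the main hyperparameters (dictionary expansion factor, L1 sparsity coefficient) exposed. This is an illustrative sketch only: the names, default values, and PyTorch framing are assumptions for exposition, not the training setup the project will actually converge on.

```python
# Illustrative sketch of a vanilla SAE and its main hyperparameters.
# Names and default values are assumptions, not this project's setup.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, expansion_factor: int = 16):
        super().__init__()
        d_sae = d_model * expansion_factor  # dictionary size: one key hyperparameter
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        # Encode LLM activations into sparse feature activations,
        # then reconstruct the original activations from them.
        f = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        x_hat = f @ self.W_dec + self.b_dec
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty; l1_coeff is the other
    # central hyperparameter a comparison like this would sweep over.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```

Architecture variants (e.g. gated encoders) and sweeps over the expansion factor, sparsity coefficient, and learning rate are the kinds of axes the comparison would cover.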
Compute!
Lead: Tom McGrath - former DeepMind interpretability researcher.
Collaborating: Joseph Bloom - owner of SAELens and contributor to Neuronpedia.
Failure to replicate results obtained by major labs, leading to low SAE performance.
Tom McGrath: none, currently self-funding
My collaborators Joseph Bloom and Johnny Lin are funded to work on Neuronpedia.
Lun
about 1 month ago
@tmcgrath what's the current status of this? Is additional funding still needed and do you have any updates?
joseph bloom
5 months ago
Super excited about this project. Tom and I have already done a lot of good work, and the collaboration between Johnny, Tom, and me has a huge amount of synergy! I'd encourage people to add further funds to help Tom reach his goal, as (I'm very biased, but) I think the resulting SAEs will be super useful to a bunch of researchers, and in the process we'll create useful knowledge that accelerates the kind of progress that will underpin future AI safety outcomes.
Austin Chen
5 months ago
Approving this grant! I'm happy to see that Joseph and Johnny (past Manifund grantees) are involved with this.
I'm a bit surprised that Tom is not receiving funding/salary for his work on this as well -- I expect Tom doesn't mind here, but broadly encourage researchers to ask for reasonable amounts of salary from funders.
Neel Nanda
5 months ago
@Austin Yep, I'd be happy to pay salary on this if Tom wants it (not sure what appropriate rates are though). Tom and I discussed it briefly before he applied.
Neel Nanda
5 months ago
I think that determining the best training setup for SAEs is a highly valuable thing to do. Lots of new ideas are arising about how to train these things well (e.g. Gated SAEs, ProLU, Anthropic's April update), with wildly varying amounts of rigour behind them, and often little effort put into replicating them or seeing how they combine. Having a rigorous and careful effort doing this seems of significant value to the mech interp community.
Tom is a strong researcher, though he hasn't worked on SAEs before; I thought the Hydra Effect and Understanding AlphaZero papers were solid. Joseph is also solid and has a lot of experience with SAEs. I expect them to be a good team.
The Google DeepMind mech interp team has been looking somewhat into how to combine the Anthropic April update methods and Gated SAEs, and also hopes to open-source SAEs at some point, which creates some concern about duplicated work. As a result, I'm less excited about significant investment into open-source SAEs, though having some out (especially soon!) would be nice.
This is an engineering-heavy project, and I don't know too much about Tom's engineering skills, though I don't have any reason to think they're bad.
As above, I'm less excited about significant investment into open source SAEs, which is the main reason I haven't funded the full amount. $4K is a fairly small grant, so I haven't thought too hard about exactly how much compute this should reasonably take. If the training methods exploration turns out to take much more compute than expected, I'd be happy to increase it.
Please disclose e.g. any romantic, professional, financial, housemate, or familial relationships you have with the grant recipient(s).
Tom and I somewhat overlapped at DeepMind, but never directly worked together.
Joseph is one of my MATS alumni and is currently doing my MATS extension program. I consider this more of a conflict of interest, but my understanding is that Tom is predominantly driving this project, with Joseph helping out where he can.
I expect my MATS scholars to benefit from good open-source SAEs existing, and both my scholars and the GDM team to benefit from better knowledge on how to train SAEs, but only in the same way that the whole mech interp ecosystem benefits.
Neel Nanda
5 months ago
@NeelNanda Note: Tom and I discussed this grant before he applied here, and I encouraged him to apply to Manifund since I thought it was a solid grant to fund.