tl;dr: determine the best currently-available training setup for SAEs and disseminate this knowledge. Train SAEs for steadily larger models (starting with GPT-2-small for MATS scholars) and then scale up as budget and time allow.
Project proposal doc with more details: https://docs.google.com/document/d/15X28EEHo7pM2CYkfZqk05A0MZi4ImvTSSVaC9wtFLyI/edit?usp=sharing
Determine good hyperparameters for sparse autoencoders for realistic LLMs by doing a comprehensive architecture and hyperparameter comparison (see the sketch after this list for the kind of model and hyperparameters involved).
Use this knowledge to train a suite of high-quality SAEs for GPT-2-small, then scale up further as resources allow, targeting ~1B and ~8B models in sequence.
Disseminate knowledge on SAE training through a technical report.
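For concreteness, here is a minimal sketch of the kind of vanilla SAE such a comparison covers, with the main hyperparameters (dictionary expansion factor, L1 sparsity coefficient) exposed. This is an illustrative sketch only: the names, default values, and PyTorch framing are assumptions for exposition, not the training setup the project will actually converge on.

```python
# Illustrative sketch of a vanilla SAE and its main hyperparameters.
# Names and default values are assumptions, not this project's setup.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, expansion_factor: int = 16):
        super().__init__()
        d_sae = d_model * expansion_factor  # dictionary size: one key hyperparameter
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        # Encode LLM activations into sparse feature activations,
        # then reconstruct the original activations from them.
        f = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        x_hat = f @ self.W_dec + self.b_dec
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty; l1_coeff is the other
    # central hyperparameter a comparison like this would sweep over.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```

Architecture variants (e.g. gated encoders) and sweeps over the expansion factor, sparsity coefficient, and learning rate are the kinds of axes the comparison would cover.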
Compute!
Lead: Tom McGrath - former DeepMind interpretability researcher.
Collaborating: Joseph Bloom - owner of SAELens and contributor to Neuronpedia.
Failure to replicate results obtained by major labs, leading to low SAE performance.
Tom McGrath: none, currently self-funding
My collaborators Joseph Bloom and Johnny Lin are funded to work on Neuronpedia.
Lun
about 1 month ago
@tmcgrath what's the current status of this? Is additional funding still needed and do you have any updates?
joseph bloom
5 months ago
Super excited about this project. Tom and I have already done a lot of good work, and the collaboration between Johnny, Tom, and me has a huge amount of synergy! I'd encourage people to add further funds to help Tom reach his goal, as (I'm very biased, but) I think the resulting SAEs will be super useful to a bunch of researchers, and in the process we'll create useful knowledge that accelerates the kind of progress that will underpin future AI safety outcomes.
Austin Chen
5 months ago
Approving this grant! I'm happy to see that Joseph and Johnny (past Manifund grantees) are involved with this.
I'm a bit surprised that Tom is not receiving funding/salary for his work on this as well -- I expect Tom doesn't mind here, but broadly encourage researchers to ask for reasonable amounts of salary from funders.
Neel Nanda
5 months ago
@Austin Yep, I'd be happy to pay salary on this if Tom wants it (not sure what appropriate rates are though). Tom and I discussed it briefly before he applied.
Neel Nanda
5 months ago
I think that determining the best training setup for SAEs is a highly valuable thing to do. Lots of new ideas are arising about how to train these things well (e.g. Gated SAEs, ProLU, Anthropic's April update), with wildly varying amounts of rigour behind them, and often little effort put into replicating them or seeing how they combine. Having a rigorous and careful effort doing this seems of significant value to the mech interp community.
Tom is a strong researcher, though he hasn't worked on SAEs before; I thought the Hydra Effect and Understanding AlphaZero papers were solid. Joseph is also solid and has a lot of experience with SAEs. I expect them to be a good team.
The Google DeepMind mech interp team has been looking somewhat into how to combine the Anthropic April update methods and Gated SAEs, and also hopes to open-source SAEs at some point, which creates some concern about duplicated work. As a result, I'm less excited about significant investment into open-source SAEs, though having some out (especially soon!) would be nice.
This is an engineering-heavy project, and I don't know too much about Tom's engineering skills, though I don't have any reason to think they're bad.
As above, I'm less excited about significant investment into open source SAEs, which is the main reason I haven't funded the full amount. $4K is a fairly small grant, so I haven't thought too hard about exactly how much compute this should reasonably take. If the training methods exploration turns out to take much more compute than expected, I'd be happy to increase it.
Please disclose e.g. any romantic, professional, financial, housemate, or familial relationships you have with the grant recipient(s).
Tom and I somewhat overlapped at DeepMind, but never directly worked together.
Joseph is one of my MATS alumni and is currently doing my MATS extension program. I consider this more of a conflict of interest, but my understanding is that Tom is predominantly driving this project, with Joseph helping out where he can.
I expect my MATS scholars to benefit from good open-source SAEs existing, and both my scholars and the GDM team to benefit from better knowledge on how to train SAEs, but only in the same way that the whole mech interp ecosystem benefits.
Neel Nanda
5 months ago
@NeelNanda Note: Tom and I discussed this grant before he applied here, and I encouraged him to apply to Manifund since I thought it was a solid grant to fund.