Note on edits: This grant has been re-titled and completely rewritten, after discussion with a regranter, to accurately reflect my change in research focus. At the time of this change there were no donations, so this seems okay to do.
Sparse autoencoders are an exciting new tool in the interpretability toolbox: they are unlocking new modes of analysis and becoming a core component that much interpretability research builds upon. I think it is likely that they are not the final evolution of dictionary learning techniques for interpretability, and that finding improvements to this fundamental technology sooner will benefit the field as a whole.
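For readers less familiar with the setup, here is a minimal sketch of the standard SAE recipe that all of these variants modify (illustrative only; real training setups add details like decoder-norm constraints and dead-feature resampling):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VanillaSAE(nn.Module):
    """Baseline sparse autoencoder: an overcomplete dictionary trained to
    reconstruct activations under an L1 sparsity penalty."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_features) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_features))
        self.W_dec = nn.Parameter(torch.randn(d_features, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        f = F.relu((x - self.b_dec) @ self.W_enc + self.b_enc)  # sparse feature activations
        x_hat = f @ self.W_dec + self.b_dec                     # reconstruction
        return x_hat, f

def vanilla_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # reconstruction fidelity plus sparsity pressure on the features
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().sum(-1).mean()
```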
As such, I'm experimenting with alternate architectures and tweaks to the training setup to try to find available improvements sooner rather than later.
The goal is to find alternative architectures and training techniques for sparse autoencoders that improve on the current state of the art.
My strategy involves making an upfront investment in infrastructure that makes building and evaluating new components and architectures faster, easier, and less error-prone, so that I can later iterate quickly and test many possible configurations cheaply. I'm aiming to lower the cost of trying each idea enough that it becomes worth exploring high-risk, high-reward configurations that would otherwise be too speculative to attempt.
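To make that concrete, here is the rough shape of the framework, with hypothetical names (a sketch of the design, not the actual API): every design choice is a named component selected by config, so a sweep over variants is a list of configs rather than a pile of one-off scripts.

```python
from dataclasses import dataclass

@dataclass
class SAEConfig:
    """Hypothetical experiment config: every design choice is a named,
    swappable component, so new ideas slot in without touching the trainer."""
    d_model: int = 768
    expansion: int = 16             # dictionary size = expansion * d_model
    encoder: str = "linear"         # e.g. "linear", "deep", "gated"
    nonlinearity: str = "relu"      # e.g. "relu", "prolu"
    penalty: str = "l1"             # e.g. "l1", "sqrt_l1"
    penalty_coeff: float = 1e-3
    resampling: str = "none"        # dead-feature resampling strategy

# A sweep over speculative variants is then just a list of configs:
sweep = [
    SAEConfig(),                                     # baseline
    SAEConfig(penalty="sqrt_l1"),                    # swap the sparsity penalty
    SAEConfig(encoder="gated"),                      # swap the encoder
    SAEConfig(encoder="deep", nonlinearity="prolu"),
]
```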
I already have a working first version of this framework, which I will return to and improve upon. I then have a collection of:

- established interventions proposed by research labs and the research community, which I'd like to test directly and also build upon (e.g. the Anthropic and DeepMind training updates, Gated SAEs, my own ProLU SAEs, Sqrt(L1) SAEs)
- experimental interventions and alternate architectures (e.g. deep encoders, hierarchical SAEs, alternative gate-training methods for Gated SAEs, a resampling method I previously got good results from, and others)

which I aim to implement as interchangeable components in the framework and begin evaluating (one such component swap is sketched below).
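For example, under one reading of the Sqrt(L1) idea (penalize the square root of activation magnitudes, whose gradient is steeper near zero and flatter for large activations than L1's), trying the variant is a one-function swap rather than a fork of the training code. Function and registry names here are again hypothetical:

```python
import torch

def l1_penalty(f: torch.Tensor) -> torch.Tensor:
    # standard penalty: mean total activation magnitude per input
    return f.abs().sum(-1).mean()

def sqrt_l1_penalty(f: torch.Tensor) -> torch.Tensor:
    # Sqrt(L1)-style penalty: relative to L1, the gradient is steeper near
    # zero and flatter for large activations, so large features shrink less
    return (f.abs() + 1e-8).sqrt().sum(-1).mean()

# Registering it makes the variant selectable by config string,
# e.g. SAEConfig(penalty="sqrt_l1") in the sketch above.
PENALTIES = {"l1": l1_penalty, "sqrt_l1": sqrt_l1_penalty}
```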
If one or more speculative changes produce sufficiently large Pareto improvements, I will evaluate whether the resulting SAEs have interpretable features. If so, I will take the best method(s), refine them further where possible, investigate and evaluate them more thoroughly, and then write up the method and results to share the technique and findings with other researchers.
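For reference, the two axes I mean by "Pareto improvement" are the standard ones: L0 (average number of active features per input) and reconstruction quality (sketched here as fraction of variance unexplained). A variant Pareto-improves the baseline if it matches it on both axes and beats it on at least one; whether the features are interpretable is checked separately.

```python
import torch

@torch.no_grad()
def eval_metrics(x: torch.Tensor, x_hat: torch.Tensor, f: torch.Tensor) -> dict:
    """Standard sparsity/fidelity axes for comparing SAE variants."""
    l0 = (f != 0).float().sum(-1).mean()          # avg active features per input
    mse = (x - x_hat).pow(2).sum(-1).mean()
    variance = (x - x.mean(dim=0)).pow(2).sum(-1).mean()
    fvu = (mse / variance).item()                 # fraction of variance unexplained
    return {"L0": l0.item(), "FVU": fvu}
```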
Salary and compute budget:
- Salary: $9k/month
- Compute: $2k/month
The team is just me. I have been working on this topic for a few months and have produced this work on ProLU SAEs, which resulted in a large Pareto improvement. I also have an unpublished resampling method that produced a smaller improvement.
Other qualifications: a computer science degree and a high comfort level working in PyTorch.
It's possible that none of the techniques I try will make enough of a difference to be worth their costs in added complexity; this is plausible given the low-probability-of-success, high-value-if-successful nature of the ideas I'm exploring. If that happens, I expect to continue iterating on other techniques, as I don't expect to run out of things to try.
If none of the techniques that significantly Pareto-improve L0 and reconstruction quality turn out to have interpretable SAE features, that would be unfortunate, as they would not be useful as tools, though it may still be an interesting result that gives some insight into the problem space.
No other funding.