tl;dr: determine the best currently available training setup for SAEs and disseminate this knowledge. Train SAEs for steadily larger models (starting with GPT-2-small for MATS scholars) and then scale up as budget and time allow.
Project proposal doc with more details: https://docs.google.com/document/d/15X28EEHo7pM2CYkfZqk05A0MZi4ImvTSSVaC9wtFLyI/edit?usp=sharing
Determine good hyperparameters for sparse autoencoders (SAEs) on realistic LLMs through a comprehensive architecture and hyperparameter comparison; the sketch after this list illustrates the kind of knobs involved.
Use this knowledge to train a suite of high-quality SAEs for GPT-2-small, then scale up further as resources allow, targeting ~1B- and ~8B-parameter models in sequence.
Disseminate knowledge on SAE training through a technical report.
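To make the comparison concrete, here is a minimal sketch of the kind of SAE involved, assuming the standard ReLU-plus-L1 formulation trained on cached residual-stream activations. The expansion factor, L1 coefficient, and learning rate shown are examples of the hyperparameters such a sweep would cover; this is illustrative code, not the project's actual implementation (which would likely build on SAELens).

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Standard ReLU SAE: reconstruct activations through a wide, sparse bottleneck."""

    def __init__(self, d_model: int, expansion_factor: int = 16):
        super().__init__()
        d_sae = d_model * expansion_factor  # dictionary size: a key hyperparameter
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        # Encode: non-negative feature activations via ReLU.
        f = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # Decode: reconstruct the original activation from the active features.
        x_hat = f @ self.W_dec + self.b_dec
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coeff: float = 5e-4):
    """Reconstruction error plus an L1 sparsity penalty on feature activations."""
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity


if __name__ == "__main__":
    d_model = 768  # GPT-2-small residual stream width
    sae = SparseAutoencoder(d_model, expansion_factor=16)
    opt = torch.optim.Adam(sae.parameters(), lr=3e-4)  # learning rate: another swept knob

    # Stand-in for a batch of cached model activations.
    x = torch.randn(4096, d_model)
    x_hat, f = sae(x)
    loss = sae_loss(x, x_hat, f)
    loss.backward()
    opt.step()
    print(f"loss={loss.item():.3f}, mean L0={(f > 0).float().sum(-1).mean().item():.1f}")
```

In practice the activations would be streamed from the target LLM rather than sampled randomly, and the comparison would also vary architectural choices (e.g. activation function and decoder normalisation) alongside these scalar hyperparameters, judging each run on the reconstruction-sparsity trade-off.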
Compute!
Lead: Tom McGrath, former DeepMind interpretability researcher.
Collaborator: Joseph Bloom, owner of SAELens and contributor to Neuronpedia.
Failure to replicate results obtained by major labs, leading to low SAE performance.
Tom McGrath: none; currently self-funding.
My collaborators Joseph Bloom and Johnny Lin are funded to work on Neuronpedia.