Understanding SAE features using Sparse Feature Circuits

Technical AI safety

Lovis Heindrich

Active grant
$11,000 raised of $11,000 funding goal
Fully funded and not currently accepting donations.

Project summary

I, Lovis Heindrich, plan to research the use of sparse autoencoder circuits to better understand SAE features. The project will be carried out during a research visit to the Torr Vision Group at the University of Oxford, mentored by Fazl Barez and Prof. Philip Torr, in collaboration with Veronika Thost (MIT-IBM Watson Lab). I am seeking $11,000 in funding: $8,000 to cover my salary to work on the project full-time for 3 months, plus $3,000 for additional compute. If the compute budget is not fully utilized, the remainder will be used to cover conference fees.

What are this project's goals and how will you achieve them?

Recent work on SAEs [Anthropic 2024] has demonstrated the feasibility of SAE feature discovery in larger models and has found safety-relevant features that are causally important for the models' behavior. Understanding what causes these features to activate is an important open research question. Our project's goal is to create circuit-based explanations of such SAE features. Current approaches [Anthropic 2024, OpenAI 2023] that use activating dataset examples to generate feature explanations are limited: they can produce overly broad explanations or interpretability illusions [Bolukbasi et al. 2021]. We plan to make progress on this problem using circuit discovery methods [Syed et al. 2023, Marks et al. 2024, Dunefsky & Chlenski 2024]. We will explore ways in which circuit-based explanations can improve both our understanding and the practical usefulness of sparse autoencoder features.
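
To make the circuit-discovery idea concrete, here is a minimal sketch of the attribution-patching approximation underlying methods like Syed et al. 2023 and Marks et al. 2024. All names and the toy setup are illustrative assumptions, not the project's actual code: a handful of "upstream features" feed linearly into one "downstream feature", and the first-order (gradient × activation) estimate approximates the effect of zero-ablating each upstream feature.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for SAE feature circuits (hypothetical setup):
# d_up upstream feature activations feed a linear map into one
# downstream feature. In this linear toy the first-order estimate
# is exact; in a real transformer it is only an approximation.
d_up = 6
W = torch.randn(d_up)

def downstream(f_up):
    # downstream feature activation as a function of upstream features
    return f_up @ W

f_up = torch.randn(d_up, requires_grad=True)
y = downstream(f_up)
y.backward()

# Attribution-patching estimate of each upstream feature's effect:
# change in downstream activation under zero-ablation of feature j
#   ≈ -f_up[j] * dy/df_up[j]   (first-order Taylor approximation)
attr = -(f_up.detach() * f_up.grad)
print(attr)
```

In practice, ranking upstream SAE features by such attribution scores (rather than running one forward pass per ablation) is what makes circuit discovery tractable at scale.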

How will this funding be used?

$8,000 will cover my salary to work on the project full-time for 3 months. The remaining $3,000 will be used for compute and/or conference fees.

Who is on your team and what's your track record on similar projects?

Lovis Heindrich: I am a past MATS scholar, where I worked with Neel Nanda, and I have published relevant work analyzing MLP circuits in Pythia-70M. I also have experience training and evaluating sparse autoencoders from working on them during the MATS extension.

Collaborators: Fazl Barez, Veronika Thost, Philip Torr.

What are the most likely causes and outcomes if this project fails? (premortem)

If this project fails, we expect the most likely causes to be either that feature circuits are too distributed or that they rely on uninterpretable features. Insufficient SAE quality could also limit the project, especially if a large proportion of the feature circuits cannot be explained by earlier SAE features. We are optimistic that using recent sparse autoencoders, including recent improvements to SAE architectures, will help us overcome these potential issues.
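
One proxy we could track for the SAE-quality failure mode is the fraction of an activation's norm left unexplained by the SAE reconstruction; when this error term is large, circuit edges must route through an uninterpretable "error node" (in the sense of Marks et al. 2024). The values and names below are hypothetical stand-ins, not measurements from real SAEs:

```python
import torch

torch.manual_seed(1)

# Toy check of SAE reconstruction quality (illustrative values only).
d_model = 16
act = torch.randn(d_model)                    # model activation
sae_recon = act + 0.1 * torch.randn(d_model)  # stand-in for SAE output
error = act - sae_recon                       # the circuit's "error node"

# Fraction of the activation's variance the SAE fails to capture;
# large values mean circuits through interpretable features are limited.
frac_unexplained = (error.norm() / act.norm()) ** 2
print(f"unexplained fraction: {frac_unexplained:.4f}")
```

Tracking this fraction per layer would flag early whether the available SAEs are good enough for circuit-based explanations.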

What other funding are you or your project getting?

The Torr Vision Group at the University of Oxford will provide access to a compute cluster and cover costs related to accommodation and travel to Oxford.
