Understanding SAE features using Sparse Feature Circuits

Technical AI safety

Lovis Heindrich

Active grant
$11,000 raised of $11,000 funding goal
Fully funded and not currently accepting donations.

Project summary

I, Lovis Heindrich, plan to research the use of sparse autoencoder circuits to better understand SAE features. The project will be carried out during a research visit to the Torr Vision Group at the University of Oxford, mentored by Fazl Barez and Prof. Philip Torr, in collaboration with Veronika Thost (MIT-IBM Watson Lab). I am seeking $11,000 in funding: $8,000 to cover my salary to work on the project full-time for 3 months, plus $3,000 for additional compute. If the compute budget is not fully utilized, the remainder will be used to cover conference fees.

What are this project's goals and how will you achieve them?

Recent work on SAEs [Anthropic 2024] has demonstrated the feasibility of SAE feature discovery in larger models and has found safety-relevant features that are causally important for the models' behavior. Understanding what causes these features to activate is an important open research question. Our project's goal is to create circuit-based explanations of such SAE features. Current approaches [Anthropic 2024, OpenAI 2023] that use activating dataset examples to generate feature explanations are limited: they can produce overly broad explanations or interpretability illusions [Bolukbasi et al. 2021]. We plan to make progress on this problem using circuit discovery methods [Syed et al. 2023, Marks et al. 2024, Dunefsky & Chlenski 2024]. We will explore ways in which circuit-based explanations can improve both our understanding and the practical usefulness of sparse autoencoder features.
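
To make the circuit-discovery idea concrete, here is a minimal sketch of the attribution-patching approximation underlying methods like Syed et al. 2023 and Marks et al. 2024. All names and the toy setup are illustrative assumptions, not the project's actual code: a handful of "upstream features" feed linearly into one "downstream feature", and the first-order (gradient × activation) estimate approximates the effect of zero-ablating each upstream feature.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for SAE feature circuits (hypothetical setup):
# d_up upstream feature activations feed a linear map into one
# downstream feature. In this linear toy the first-order estimate
# is exact; in a real transformer it is only an approximation.
d_up = 6
W = torch.randn(d_up)

def downstream(f_up):
    # downstream feature activation as a function of upstream features
    return f_up @ W

f_up = torch.randn(d_up, requires_grad=True)
y = downstream(f_up)
y.backward()

# Attribution-patching estimate of each upstream feature's effect:
# change in downstream activation under zero-ablation of feature j
#   ≈ -f_up[j] * dy/df_up[j]   (first-order Taylor approximation)
attr = -(f_up.detach() * f_up.grad)
print(attr)
```

In practice, ranking upstream SAE features by such attribution scores (rather than running one forward pass per ablation) is what makes circuit discovery tractable at scale.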

How will this funding be used?

$8,000 will cover my salary to work on the project full-time for 3 months. The remaining $3,000 will be used for compute and/or conference fees.

Who is on your team and what's your track record on similar projects?

Lovis Heindrich: I am a past MATS scholar, where I worked with Neel Nanda, and I have published relevant work analyzing MLP circuits in Pythia-70M. I also have experience training and evaluating sparse autoencoders from working on them during the MATS extension.

Collaborators: Fazl Barez, Veronika Thost, Philip Torr.

What are the most likely causes and outcomes if this project fails? (premortem)

If this project fails, we expect the most likely causes to be either that feature circuits are too distributed or that they rely on uninterpretable features. Insufficient SAE quality could also limit the project, especially if a large proportion of the feature circuits cannot be explained by earlier SAE features. We are optimistic that using recent sparse autoencoders, including recent improvements to SAE architectures, will help us overcome these potential issues.
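
One proxy we could track for the SAE-quality failure mode is the fraction of an activation's norm left unexplained by the SAE reconstruction; when this error term is large, circuit edges must route through an uninterpretable "error node" (in the sense of Marks et al. 2024). The values and names below are hypothetical stand-ins, not measurements from real SAEs:

```python
import torch

torch.manual_seed(1)

# Toy check of SAE reconstruction quality (illustrative values only).
d_model = 16
act = torch.randn(d_model)                    # model activation
sae_recon = act + 0.1 * torch.randn(d_model)  # stand-in for SAE output
error = act - sae_recon                       # the circuit's "error node"

# Fraction of the activation's variance the SAE fails to capture;
# large values mean circuits through interpretable features are limited.
frac_unexplained = (error.norm() / act.norm()) ** 2
print(f"unexplained fraction: {frac_unexplained:.4f}")
```

Tracking this fraction per layer would flag early whether the available SAEs are good enough for circuit-based explanations.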

What other funding are you or your project getting?

The Torr Vision Group at the University of Oxford will provide access to a compute cluster and cover costs related to accommodation and travel to Oxford.
