@LucyFarnik
AI Safety Researcher & Alignment PhD
https://www.linkedin.com/in/lucy-farnik/
I'm working on mechanistic interpretability, while also upskilling in AI governance and forecasting. I've previously worked on safe RL and brain-inspired alignment. My background is in software engineering: I started coding at age 7 and became a senior dev at a tech startup at age 18.
Lucy Farnik
4 months ago
Description of subprojects and results, including major changes from the original proposal: The previous update was meant to be the final one; apparently I forgot to close the project.

Spending breakdown: All funding went towards my salary.
Lucy Farnik
11 months ago
Interpreting "goals" turned out to be out of reach, so I did what I said in the project description and pivoted towards studying easier LLM phenomena that build towards being able to interpret the hard things. I spent some time researching how grammatical structures are represented, and have since moved on to trying to understand how "intermediate variables" are represented and passed between layers. My current high-level direction is basically "break the big black box down into smaller black boxes, and monitor their communication".
I'm currently approaching "inter-layer interpretability" with circuit-style analysis based on sparse autoencoders (SAEs). I basically want to figure out whether it's possible to do IOI-style analysis (à la the indirect object identification circuit work) but with SAE features at different layers as the unit of ablation. I'm also looking into how to do SAE-based ablation well, to make results less noisy. I'm researching these questions in MATS under Neel Nanda. A minimal sketch of what "SAE features as the unit of ablation" means is below.
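For readers unfamiliar with the setup, here is a minimal PyTorch sketch of feature-level ablation, not my actual research code: encode a layer's residual stream with an SAE, zero out one feature, decode, and see how the model's output changes. The names `model.layers`, `sae.encode`, and `sae.decode` are hypothetical stand-ins for whatever model and SAE interface you're using.

```python
# Minimal sketch of SAE-feature ablation (hypothetical interfaces throughout).
import torch

@torch.no_grad()
def ablate_sae_feature(model, sae, tokens, layer, feature_idx):
    """Run `model` on `tokens`, replacing layer `layer`'s residual stream
    with its SAE reconstruction minus one feature; return final logits."""

    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        acts = sae.encode(resid)          # (batch, seq, n_features); hypothetical API
        acts[..., feature_idx] = 0.0      # zero-ablate the chosen feature
        recon = sae.decode(acts)          # map back to residual-stream space
        # Returning a value from a forward hook replaces the module's output
        return (recon, *output[1:]) if isinstance(output, tuple) else recon

    handle = model.layers[layer].register_forward_hook(hook)  # hypothetical layer path
    try:
        logits = model(tokens)
    finally:
        handle.remove()                   # always clean up the hook
    return logits

# Example use: compare clean vs. ablated logits to estimate the feature's causal effect
# clean = model(tokens)
# ablated = ablate_sae_feature(model, sae, tokens, layer=8, feature_idx=123)
```

The same pattern extends to patching (swapping in a feature's value from a different prompt) rather than zeroing, which tends to be the less destructive intervention in circuit-style analysis.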
If anyone reading this is interested in the things I described above, I could use collaborators! In particular, if you're somewhat new to alignment and would be interested in a setup where I throw a concrete specification for an experiment at you and you spend an afternoon coding it up, I'd be interested in talking to you.
| For | Date | Type | Amount (USD) |
|---|---|---|---|
| Manifund Bank | 4 months ago | withdraw | -1,590 |
| Discovering latent goals (mechanistic interpretability PhD salary) | over 1 year ago | project donation | +150 |
| Discovering latent goals (mechanistic interpretability PhD salary) | over 1 year ago | project donation | +1,000 |
| Discovering latent goals (mechanistic interpretability PhD salary) | over 1 year ago | project donation | +400 |
| Discovering latent goals (mechanistic interpretability PhD salary) | over 1 year ago | project donation | +40 |