@LucyFarnik
AI Safety Researcher & Alignment PhD
https://www.linkedin.com/in/lucy-farnik/
I'm working on mechanistic interpretability, while also upskilling in AI governance and forecasting. I've previously worked on safe RL and brain-inspired alignment. My background is in software engineering: I started coding at age 7 and became a senior dev at a tech startup at age 18.
Lucy Farnik
4 months ago
Description of subprojects and results, including major changes from the original proposal: The previous update was meant to be the final one; apparently I forgot to close the project.

Spending breakdown: All funding went towards my salary.
Lucy Farnik
11 months ago
Interpreting "goals" turned out to be out of reach, so I did what I said in the project description and pivoted towards studying easier LLM phenomena that build towards being able to interpret the hard things. I spent some time researching how grammatical structures are represented, and have since moved on to trying to understand how "intermediate variables" are represented and passed between layers. My current high-level direction is basically "break the big black box down into smaller black boxes, and monitor their communication".
I'm currently approaching "inter-layer interpretability" with circuit-style analysis based on sparse autoencoders (SAEs). I basically want to figure out whether it's possible to do IOI-style analysis (à la the indirect object identification circuit work) but with SAE features at different layers as the unit of ablation. I'm also looking into how to do SAE-based ablation well, to make results less noisy. I'm researching these questions in MATS under Neel Nanda. A minimal sketch of what "SAE features as the unit of ablation" means is below.
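For readers unfamiliar with the setup, here is a minimal PyTorch sketch of feature-level ablation, not my actual research code: encode a layer's residual stream with an SAE, zero out one feature, decode, and see how the model's output changes. The names `model.layers`, `sae.encode`, and `sae.decode` are hypothetical stand-ins for whatever model and SAE interface you're using.

```python
# Minimal sketch of SAE-feature ablation (hypothetical interfaces throughout).
import torch

@torch.no_grad()
def ablate_sae_feature(model, sae, tokens, layer, feature_idx):
    """Run `model` on `tokens`, replacing layer `layer`'s residual stream
    with its SAE reconstruction minus one feature; return final logits."""

    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        acts = sae.encode(resid)          # (batch, seq, n_features); hypothetical API
        acts[..., feature_idx] = 0.0      # zero-ablate the chosen feature
        recon = sae.decode(acts)          # map back to residual-stream space
        # Returning a value from a forward hook replaces the module's output
        return (recon, *output[1:]) if isinstance(output, tuple) else recon

    handle = model.layers[layer].register_forward_hook(hook)  # hypothetical layer path
    try:
        logits = model(tokens)
    finally:
        handle.remove()                   # always clean up the hook
    return logits

# Example use: compare clean vs. ablated logits to estimate the feature's causal effect
# clean = model(tokens)
# ablated = ablate_sae_feature(model, sae, tokens, layer=8, feature_idx=123)
```

The same pattern extends to patching (swapping in a feature's value from a different prompt) rather than zeroing, which tends to be the less destructive intervention in circuit-style analysis.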
If anyone reading this is interested in the things I described above, I could use collaborators! In particular, if you're somewhat new to alignment and would be interested in a setup where I throw a concrete specification for an experiment at you and you spend an afternoon coding it up, I'd be interested in talking to you.
| For | Date | Type | Amount (USD) |
|---|---|---|---|
| Manifund Bank | 4 months ago | withdraw | -1,590 |
| Discovering latent goals (mechanistic interpretability PhD salary) | over 1 year ago | project donation | +150 |
| Discovering latent goals (mechanistic interpretability PhD salary) | over 1 year ago | project donation | +1,000 |
| Discovering latent goals (mechanistic interpretability PhD salary) | over 1 year ago | project donation | +400 |
| Discovering latent goals (mechanistic interpretability PhD salary) | over 1 year ago | project donation | +40 |