Thank you, @jesse_hoogland! This means a lot, and I appreciate you recommending the project to @JueYan.
One clarification for anyone reading, since this proposal is about a year old and the milestones have been renumbered. What I called milestones 1 and 2 here are now bundled as M1, and they're done: published as "Sparse Concept Anchoring for Interpretable and Controllable Neural Representations" at the GRaM workshop (ICLR 2026). The milestone 3 you point to (small transformers) is what I now call M2, the subject of the new fundraiser below. And the old milestone 4 (language models with real safety targets) has since grown into two milestones: M3 (a small language model trained from scratch) and M4 (retrofitting the method onto an existing open-weights model). Sorry for the confusion.
I've posted a new fundraiser for the next step, M2, here: Concept control in transformers with Sparse Concept Anchoring.
Your reservations are very reasonable.
Outreach. This is my main worry too. Planned mitigations: 1. posts or a paper for legibility (deliverable 2.4 in the proposal), 2. warm introductions and endorsements, 3. a MATS application for the autumn cohort (in progress), 4. framing deliverables for adoption by frontier-lab safety teams.
Transfer. Agreed on focusing on small transformers. The architecture in my new proposal is technically a language model (small transformer with an lm_head), but it stays in a synthetic domain rather than jumping to natural language, so that a negative result is interpretable. I don't want to confound "the method doesn't work" with "I set up an LLM wrong".
The roadmap then progresses to LLMs: M3 starts with tractable concepts (sentiment, formality, refusal) in a small natural language model before it attempts a safety-relevant target, so I'm not jumping straight to something as abstract as deception.
So M2 in the new proposal is a toy transformer in a synthetic domain; scaling to natural language is later work. Is that the direction you had in mind, or were you picturing getting to small natural-language models sooner?