Sandy Fraser

@SandyFraser

Alignment Researcher • Software & Data Engineer • BlueDot and 80,000 Hours alumnus

https://www.linkedin.com/in/alex-fraser-dev

About Me

20+ years of software engineering experience in high-tech domains. End-to-end delivery of ML/AI applications and data platforms. Transitioning to AI safety research since 2023.

Projects

Concept control in transformers with Sparse Concept Anchoring (pending grant agreement signature)
Concept-anchored representation engineering for alignment

Comments

Concept-anchored representation engineering for alignment

Sandy Fraser

2 days ago

Thank you, @jesse_hoogland! This means a lot, and I appreciate you recommending the project to @JueYan.

One clarification for anyone reading: this proposal is about a year old, and the milestones have since been renumbered. What I called milestones 1 and 2 here are now bundled as M1, and they're done: published as "Sparse Concept Anchoring for Interpretable and Controllable Neural Representations" at the GRaM workshop (ICLR 2026). The milestone 3 you point to (small transformers) is what I now call M2, the subject of the new fundraiser below. The old milestone 4 (language models with real safety targets) has since grown into two milestones: M3 (a small language model trained from scratch) and M4 (retrofitting the method onto an existing open-weights model). Sorry for the confusion.

I've posted a new fundraiser for the next step, M2, here: Concept control in transformers with Sparse Concept Anchoring.

Your reservations are very reasonable.

  • Outreach. This is my main worry too. Planned mitigations: (1) posts or a paper for legibility (deliverable 2.4 in the proposal), (2) warm introductions and endorsements, (3) a MATS application for the autumn cohort (in progress), and (4) framing deliverables for adoption by frontier-lab safety teams.

  • Transfer. Agreed on focusing on small transformers. The architecture in my new proposal is technically a language model (small transformer with an lm_head), but it stays in a synthetic domain rather than jumping to natural language, so that a negative result is interpretable. I don't want to confound "the method doesn't work" with "I set up an LLM wrong".

    The roadmap then progresses to LLMs: M3 starts with tractable concepts (sentiment, formality, refusal) in a small natural language model before it attempts a safety-relevant target, so I'm not jumping straight to something as abstract as deception.

    So M2 in the new proposal is a toy transformer in a synthetic domain; scaling to natural language is later work. Is that the direction you had in mind, or were you picturing getting to small natural-language models sooner?

Concept-anchored representation engineering for alignment

Sandy Fraser

about 1 year ago

Hi @NeelNanda, I think this project might interest you given your work on mech interp.

I'm exploring concept-anchored representation engineering: using minimal regularization during training to guide where specific concepts develop in latent space, rather than discovering them post-hoc. My experiments with autoencoders show that anchoring just one concept causes related concepts to organize predictably nearby, even with noisy labeling. If we can predictably place safety-relevant concepts (deception, harmfulness) in known directions during pretraining, it might enable more surgical interventions and less dependence on post-hoc search.

My findings so far:

  • Single concept anchoring influences broader latent organization

  • Works with weak supervision (perfect labels not needed)

  • Stochastic per-sample regularization is sufficient
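
To make the idea of stochastic per-sample anchoring concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the training code used in the experiments: the function name `anchor_penalty`, the squared-distance form of the penalty, the fixed `anchor` point, and the `apply_prob` parameter are all hypothetical choices made for this example.

```python
import random


def anchor_penalty(latents, labels, anchor, apply_prob=0.5, rng=None):
    """Mean squared distance between labelled latents and a fixed anchor.

    Hypothetical sketch: each sample labelled as the concept is included
    in the penalty with probability `apply_prob`, so only a random subset
    of labelled samples is regularized on any given batch.
    """
    rng = rng or random.Random(0)
    total, count = 0.0, 0
    for z, is_concept in zip(latents, labels):
        if not is_concept:
            continue  # unlabelled samples are never pulled toward the anchor
        if rng.random() >= apply_prob:
            continue  # stochastic skip: this sample is left unregularized
        total += sum((zi - ai) ** 2 for zi, ai in zip(z, anchor))
        count += 1
    # No selected samples means no penalty for this batch.
    return total / count if count else 0.0


# Usage: the penalty would be added to the reconstruction loss, scaled by
# a small weight (also an assumption of this sketch).
batch = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
is_concept = [True, False, True]
penalty = anchor_penalty(batch, is_concept, anchor=[1.0, 0.0], apply_prob=1.0)
```

With `apply_prob=1.0` every labelled sample contributes; lower values regularize only a random subset per batch, which is the sense of "stochastic per-sample" assumed here.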

Next steps include applying this to small transformers, then language models with automated concept labeling. The goal is surgical capability removal that's resistant to recovery, since the concepts would be cleanly separated by design.
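
If a concept does sit along a known direction, one simple form of intervention is to project that direction out of a latent vector. The sketch below shows generic directional ablation as an assumed example of what "surgical" removal could mean; the function name `ablate_direction` is hypothetical, and the source does not say this is the intervention the project uses.

```python
def ablate_direction(z, direction):
    """Remove the component of `z` that lies along `direction`.

    Computes z - ((z . d) / (d . d)) * d, so the result is orthogonal
    to `direction`. Hypothetical helper for illustration only.
    """
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0.0:
        return list(z)  # a zero direction carries nothing to remove
    coeff = sum(zi * di for zi, di in zip(z, direction)) / norm_sq
    return [zi - coeff * di for zi, di in zip(z, direction)]


# Usage: ablate the first axis, assumed here to be the anchored direction.
cleaned = ablate_direction([3.0, 4.0], [1.0, 0.0])
```

The premise in the text is that such a projection removes more of the concept, and less of everything else, when the concept was placed on that direction during training than when the direction was found after the fact.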

Does this seem like a promising direction? If so, I'd love to continue the work, and I'd be curious about your thoughts on extending this to the residual stream.