Manifund
Sandy Fraser

@SandyFraser

Alignment Researcher • Software & Data Engineer • BlueDot and 80,000 Hours alumnus

https://www.linkedin.com/in/alex-fraser-dev
$0 total balance
$0 charity balance
$0 cash balance

$0 in pending offers

About Me

20+ years of software engineering experience in high-tech domains. End-to-end delivery of ML/AI applications and data platforms. Transitioning to AI safety research, a move that began in 2023.

Projects

Concept-anchored representation engineering for alignment

pending admin approval

Comments

Concept-anchored representation engineering for alignment

Sandy Fraser

9 days ago

Hi @NeelNanda, I think this project might interest you given your work on mech interp.

I'm exploring concept-anchored representation engineering: using minimal regularization during training to guide where specific concepts develop in latent space, rather than discovering them post-hoc. My experiments with autoencoders show that anchoring just one concept causes related concepts to organize predictably nearby, even with noisy labeling. If we can predictably place safety-relevant concepts (deception, harmfulness) in known directions during pretraining, it might enable more surgical interventions and less dependence on post-hoc search.

My findings so far:

  • Single concept anchoring influences broader latent organization

  • Works with weak supervision (perfect labels not needed)

  • Stochastic per-sample regularization is sufficient
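As a rough illustration of the last two points (not the experiment code), here is a minimal sketch of what a stochastic, per-sample anchoring penalty could look like. Everything in it is an assumption made for the example: the function name `anchor_penalty`, the squared-distance form of the penalty, the fixed `anchor` location, and the `apply_prob` parameter are all hypothetical. The returned penalty would be added, with a small weight, to the usual reconstruction loss.

```python
import random

def anchor_penalty(latents, concept_labels, anchor, apply_prob, rng):
    """Return (penalty, grads) for a hypothetical anchoring regularizer.

    latents: list of latent vectors, one per sample.
    concept_labels: list of booleans (may be noisy); True marks the anchored concept.
    anchor: target location in latent space (an assumed, hand-picked point).
    apply_prob: chance that any one labeled sample is regularized on this step.
    """
    # Stochastic per-sample selection: each labeled sample is picked independently.
    selected = [lab and rng.random() < apply_prob for lab in concept_labels]
    n = sum(selected)
    grads = [[0.0] * len(anchor) for _ in latents]
    if n == 0:
        return 0.0, grads
    penalty = 0.0
    for i, z in enumerate(latents):
        if not selected[i]:
            continue
        diff = [zi - ai for zi, ai in zip(z, anchor)]
        # Mean squared distance to the anchor over the selected samples.
        penalty += sum(d * d for d in diff) / n
        # Gradient of that mean with respect to this sample's latent vector.
        grads[i] = [2.0 * d / n for d in diff]
    return penalty, grads

rng = random.Random(0)
latents = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
labels = [True, False, True, True, False, False, True, False]
anchor = [1.0, 0.0, 0.0, 0.0]
penalty, grads = anchor_penalty(latents, labels, anchor, apply_prob=0.5, rng=rng)
```

Unlabeled samples and labeled samples that are not drawn on a given step receive zero gradient, so the regularizer only nudges a random subset of the concept's examples each time.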

Next steps include applying this to small transformers, then language models with automated concept labeling. The goal is surgical capability removal that's resistant to recovery, since the concepts would be cleanly separated by design.

Does this seem like a promising direction? If so, I'd love to continue the work, and I'd be curious about your thoughts on extending this to the residual stream.