Hi @NeelNanda, I think this project might interest you given your work on mech interp.
I'm exploring concept-anchored representation engineering: applying minimal regularization during training to guide where specific concepts develop in latent space, rather than discovering them post hoc. My experiments with autoencoders show that anchoring just one concept causes related concepts to organize predictably nearby, even with noisy labels. If we can predictably place safety-relevant concepts (deception, harmfulness) in known directions during pretraining, it might enable more surgical interventions and reduce dependence on post-hoc search.
My findings so far:
- Anchoring a single concept influences broader latent organization
- Works with weak supervision (perfect labels aren't needed)
- Stochastic per-sample regularization is sufficient
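For concreteness, here's a minimal sketch of the kind of stochastic per-sample anchoring term I mean (the cosine-alignment form, function names, and the masking probability `p` are illustrative assumptions, not the exact setup from my experiments):

```python
import numpy as np

def anchoring_loss(z, labels, anchor, p=0.5, rng=None):
    """Stochastic per-sample anchoring regularizer (illustrative sketch).

    z:      (batch, d) latent codes from the encoder
    labels: (batch,) 1 if the sample is (weakly) tagged with the concept
    anchor: (d,) fixed direction where the concept should develop
    p:      probability of applying the penalty to any given tagged sample
    """
    rng = rng or np.random.default_rng()
    anchor = anchor / np.linalg.norm(anchor)
    # Stochastic masking: only a random subset of tagged samples is
    # regularized each step, so weak/noisy labels are tolerable.
    mask = (labels == 1) & (rng.random(len(z)) < p)
    if not mask.any():
        return 0.0
    zs = z[mask]
    cos = (zs @ anchor) / (np.linalg.norm(zs, axis=1) + 1e-8)
    # Penalize misalignment with the anchor direction; this term would be
    # added to the reconstruction loss with a small coefficient.
    return float(np.mean(1.0 - cos))
```

The idea is that gradients from this term only nudge a random subset of (possibly mislabeled) tagged samples toward the anchor, which is why perfect labels aren't required.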
Next steps include applying this to small transformers, then to language models with automated concept labeling. The goal is surgical capability removal that resists recovery, since the concepts would be cleanly separated by design.
Does this seem like a promising direction? If so, I'd love to continue the work, and I'd be curious about your thoughts on extending this to the residual stream.