Project summary
Develop techniques to impose just enough structure on LLM latent spaces during training to enable precise monitoring and post-hoc intervention.
Models naturally develop structured latent representations (e.g. the residual stream in transformers), but we currently have little control over how concepts organize. Prior attempts have focused on comprehensive supervision or post-hoc discovery, rather than minimal anchoring during training. My hypothesis is that if we give the models a gentle nudge right from the start of pre-training, they could become much more interpretable for the specific concepts and behaviors that we care about.
My preliminary experiments with autoencoders show that anchoring just one concept causes related concepts to organize predictably nearby, i.e. anchored concepts act as attractors during training. This happens even with only weak supervision, which suggests that it could be made to work for models trained on very large corpora that are hard to label (e.g. frontier models). Knowing where key concepts are located could enable surgical removal of dangerous capabilities without broad performance degradation, and could be highly resistant to capability recovery.
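For illustration, here is a minimal sketch of the kind of selective anchoring loss used in the proof-of-concept, assuming a simple bottleneck autoencoder. The class and parameter names (AnchoredAutoencoder, concept_mask, anchor_weight) are hypothetical stand-ins rather than the actual experiment code.

```python
# Illustrative sketch only (not the actual proof-of-concept code): a bottleneck
# autoencoder in which weakly-labelled examples of one concept are pulled toward
# a fixed anchor point in the latent space during training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnchoredAutoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))
        # Fixed location where the anchored concept should end up.
        anchor = torch.zeros(latent_dim)
        anchor[0] = 3.0  # e.g. pin the concept to a point on the first latent axis
        self.register_buffer("anchor", anchor)

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def training_loss(model, x, concept_mask, anchor_weight=0.1):
    """x: (batch, input_dim); concept_mask: (batch,) float mask, 1.0 where a
    (possibly noisy) label says the anchored concept is present, 0.0 elsewhere
    (including all unlabelled data)."""
    recon, z = model(x)
    recon_loss = F.mse_loss(recon, x)
    # Only labelled positives are pulled toward the anchor; the rest of the
    # latent space is free to organise itself around it.
    dist = ((z - model.anchor) ** 2).sum(dim=-1)
    anchor_loss = (concept_mask * dist).sum() / concept_mask.sum().clamp(min=1.0)
    return recon_loss + anchor_weight * anchor_loss
```

The key point is that the anchoring term only touches the small, noisily labelled positive subset of each batch; everything else is trained with the ordinary reconstruction objective.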
I see a path forward to apply this to LLMs, and I would like to pursue it.
What are this project's goals? How will you achieve them?
Delivery goal: Develop techniques that provide alignment researchers with:
Known concept locations before deployment (no search required)
Surgical intervention capabilities (remove specific harms without general performance loss)
Training-time safety measures rather than purely post-deployment fixes, including the potential to monitor these capabilities during training.
Personal goal: Complete my transition from software consultancy to AI safety research.
Milestones
Proof-of-concept (6 weeks): Structured latent spaces in bottleneck autoencoders (COMPLETED)
Practical intervention (4-5 weeks): Demonstrate concept suppression and precise unlearning, using the autoencoder from the initial experiments (Minimal funding scenario)
Transformer transfer (8-10 weeks): Apply techniques to attention-based architectures using small transformers
Language model application (6-9 weeks): Structure abstract concepts (e.g. deception, harmfulness) in transformer language models, including development of the required training data pipeline
For more details, see Concept-anchored representation engineering for alignment, which is the introductory post in my LessWrong sequence on this research.
Technical approach for transfer to transformer LMs
Constraints: I don't plan to make significant architectural changes; instead, I'll apply minimal regularization to encourage interpretable structure while preserving model capacity (see the first sketch below).
Data: I'll use a mixture of unlabelled and labelled data, with labels coming from pre-labelled datasets (e.g. from Hugging Face) and automated labelling techniques (e.g. sentiment analysis). That is, not all the data will be labelled, and the labels will be noisy by design. I expect this to provide enough of a signal, following the successful use of noisy labels in my proof-of-concept.
Success metrics: I expect to be able to suppress and remove target concepts from the models. For the transformers, I'll measure success as the increase in surprisal on text relating to the suppressed concepts (see the second sketch below).
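To make the constraints and data points above concrete, the first sketch below shows how the anchoring penalty could attach to a transformer's residual stream when only part of a batch carries (noisy) concept labels. The model choice (GPT-2 via Hugging Face transformers), the layer, the pooling, and the loss weight are illustrative assumptions, not settled design decisions.

```python
# Sketch under stated assumptions: GPT-2 via Hugging Face transformers, a single
# anchored concept, and a fixed target location in the residual stream at one
# layer. Layer choice, pooling, and weights are placeholders, not design decisions.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")

ANCHOR_LAYER = 6                              # residual-stream layer to regularise
anchor = torch.zeros(model.config.n_embd)
anchor[0] = 1.0                               # hypothetical fixed concept location

def training_step(input_ids, concept_mask, anchor_weight=0.01):
    """input_ids: (batch, seq) token ids; concept_mask: (batch,) float mask with
    1.0 where a noisy label says the sequence expresses the anchored concept,
    0.0 for negatives and unlabelled data."""
    out = model(input_ids, labels=input_ids, output_hidden_states=True)
    lm_loss = out.loss                               # ordinary next-token loss
    resid = out.hidden_states[ANCHOR_LAYER]          # (batch, seq, d_model)
    pooled = resid.mean(dim=1)                       # one vector per sequence
    # Pull labelled positives toward the anchor; leave everything else untouched.
    dist = ((pooled - anchor) ** 2).sum(dim=-1)
    anchor_loss = (concept_mask * dist).sum() / concept_mask.sum().clamp(min=1.0)
    return lm_loss + anchor_weight * anchor_loss
```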
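The second sketch covers the success metric: mean per-token surprisal on held-out concept-related text, measured before and after suppression, with unrelated control text used to check that general performance is preserved. The helper name and model choice are again illustrative.

```python
# Sketch of the proposed metric: mean per-token surprisal (cross-entropy, in nats)
# on held-out text, measured before and after concept suppression. Success looks
# like a large increase on concept-related text with little change on control text.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

@torch.no_grad()
def mean_surprisal(model, tokenizer, text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    out = model(ids, labels=ids)
    return out.loss.item()    # average negative log-probability per token

concept_text = "A held-out passage relating to the suppressed concept goes here."
control_text = "An unrelated passage, used to check that general performance holds."

baseline_concept = mean_surprisal(model, tokenizer, concept_text)
baseline_control = mean_surprisal(model, tokenizer, control_text)
# ... apply concept suppression to `model`, then re-measure both; the metric is
# the increase in concept surprisal relative to the (ideally flat) control.
```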
Comparison to existing methods
Unlike methods that impose strong (or many) constraints, this approach should have minimal impact on performance.
Unlike concept suppression during training, which results in more entanglement, this approach should cause the model to learn undesirable concepts in a clean and separable way.
Unlike post-hoc mech interp and representation engineering, this approach could allow for precise and robust unlearning, which could make even open-weight models safer.
Unlike concept bottleneck models (that impose architectural constraints), this method would apply minimal regularization to the transformer's residual stream and/or attention mechanisms. This preserves the model's full representational capacity while encouraging interpretable structure to emerge in predictable locations.
Literature gap
Most work on disentangled representations in NLP is domain-specific or imposes much stronger constraints. Anthropic’s interpretability research, for example, uses post-hoc sparse autoencoders, not minimal structural guidance during pre-training. My approach prescribes where specific concepts develop, rather than just analyzing emergent structure.
Deliverables
One or more LessWrong posts per milestone
All experiments (notebooks) published on GitHub
Potential impact
Technical AI alignment research is increasingly focused on understanding and controlling the inner workings of large language models. This project contributes by exploring novel methods for shaping latent representations during training, complementing existing approaches that focus on post-hoc analysis or extensive supervision. If successful, this work could make other alignment efforts considerably easier, potentially reducing x-risk from misaligned models.
I have personal contacts who are safety researchers, including at a frontier lab, and I am a member of several online alignment forums (BlueDot alumni, 80,000 Hours alumni, LessWrong, and various specialised Slack and Discord groups). I will promote this work through all of those channels to increase its impact.
How will this funding be used?
Minimal scenario: 5 weeks to deliver Milestone 2 only: 77% salary, 14% overheads, 9% buffer.
If fully funded: 24 weeks to deliver Milestones 2-4: 85% salary, 6% overheads, 9% buffer.
I’ve based the timeline on my actual research pace so far (6 weeks for the initial milestone, including infrastructure setup). The salary is based on what I could get as a full-time Senior AI Engineer today (including income tax and superannuation). This is effectively a discounted rate, because I haven’t adjusted for the loss of other Australian workplace entitlements such as leave (usually contractors charge more because of this).
Full details in this spreadsheet.
Who is on your team? What's your track record on similar projects?
I will conduct this research independently, but I am part of a local community of practice (Melbourne AI Safety Hub). We frequently meet in person to discuss our work, including this research. Some other members of the Hub are also working on DevInterp, so I have people to validate my ideas with.
Referees:
Dan MacKinlay, PhD, Research Scientist in ML at CSIRO’s Data61, researching multi-agent scenarios.
Alexander Saeri, PhD, AI governance researcher at MIT FutureTech & The University of Queensland. Co-founder of Ready Research, an EA research organisation. Invited speaker at EAGxAustralia.
Dane Sherburn, OpenAI
Referee contact details available on request.
Research capability
Published technical articles on LessWrong:
Selective regularization for alignment-focused representation engineering (Milestone 1). Demonstrates novel adaptation of regularization techniques for alignment applications.
Side quests in curriculum learning and regularization (Milestone 1). Documentation of systematic exploration of training methodologies, including negative results and methodology improvements that informed the successful selective regularization approach.
Detecting out of distribution text with surprisal and entropy (my BlueDot project). I reproduced a jailbreak filter paper and developed a novel token-level metric and visualization.
Technical execution
20+ years of professional software development across many high-tech domains, from automotive engineering, through power generation and specialised sensors, to continent-scale geospatial data storage and processing. My experience includes end-to-end delivery of ML/AI applications and data platforms. I have developed strong spatial intuition through my work on graphics (I am a contributor to Blender), game development, and geospatial data processing (including high-dimensional data).
For a view into my understanding of the transformer architecture, see my re-implementation of nanoGPT, which includes detailed notes.
Delivery
Consistently highly motivated: I completed a five-year independent video game development project. My research pace so far (6 weeks for the initial milestone, including infrastructure setup) suggests that my timeline estimates for the remaining work are realistic.
See my LinkedIn profile for written recommendations.
What are the most likely causes and outcomes if this project fails?
This is a research project, but one with few uncontrolled variables. The most likely cause of failure would be an unforeseen technical blocker: the work seems tractable to me now, but perhaps I don't appreciate the difficulty of what I'm attempting. In that case, I would still publish the negative results.
How much money have you raised in the last 12 months, and from where?
None. I have recently submitted a similar application to the LTFF, but it seems they are entering a grantmaking pause. I have also applied to Open Philanthropy's Career development and transition fund, and I intend to apply to other funding sources.
I am also applying separately for funding to cover a coworking space; that cost is currently included in this proposal.
If any of those are successful, I will update this application.