Train LLMs in honest introspection

Description of proposed project

A central subgoal in AI safety is to better understand the internal processes driving LLMs’ behavior. This is, e.g., the crux of most work on interpretability and CoT faithfulness. Such understanding could be used to detect deceptive or power-seeking motives; inform when to trust model outputs; identify threats; and generally guide other oversight and control methods [1].

One underutilized method for investigating LLMs’ internal processes is asking the LLMs themselves. Several papers have found that LLMs can explain their decision-making with surprising accuracy—correctly reporting, e.g., that they had been fine-tuned to write insecure code [2], or that they have received a concept injection [3]. Most relevantly, my collaborators & I showed that LLMs can provide accurate, quantitative descriptions of their own internal choice processes during multi-attribute choice [4]. Concretely, if the model is choosing which condo to buy, and it had been covertly fine-tuned to internally value square footage twice as much as neighborhood walkability, we found that it can tell us this preference with high accuracy. Moreover, this accuracy can be improved through training: Fine-tuning the model on examples of correct explanations in some choice contexts (e.g., “when choosing between condos, I value square footage 2.1 times as much as neighborhood walkability”) makes it more accurate at describing its decision-making processes in other choice contexts (e.g., choosing between vacation destinations, or vacuum cleaners). This work was a proof-of-concept that LLMs can report properties of their internal choice processes with quantitative precision.

We are seeking funding to continue this research. Our long-term aims are to develop (i) rigorous, comprehensive measures of self-report accuracy in safety-relevant domains, and (ii) training methods for improving this accuracy. We will use the funding to run several immediate next experiments.

First, we will test whether LLMs are accurately describing their choice processes by introspecting on their decision-making as it happens, or whether they are merely reporting cached information about their tendencies. When an LLM correctly reports that it weighs condo size twice as heavily as neighborhood walkability, is it correct because it observed its internal processes operating in real-time? Or did it just cache that knowledge during its earlier training? This distinction is important for safety purposes. A model’s cached knowledge about its internal operations is likely to diverge more from the ground truth as contexts get more novel and out of distribution. In contrast, if LLMs can be trained to do real-time introspection, this ability could in principle yield accurate self-reports even in very novel/OOD contexts.

To distinguish between these abilities, we will instill covert backdoors that alter the models’ choice processes when triggered (e.g., reversing its preferences so that it suddenly prefers small condos over large ones), and test whether the models can correctly report the altered choice process. If they can, this would be compelling evidence for real-time introspection. We currently have one small working prototype of this kind of test. We aim to scale it into a comprehensive measure of real-time introspective accuracy with a range of choice contexts & backdoor types.

After building this measure, we will use it as a benchmark for developing training methods that push LLMs towards accurate, introspection-driven self-report. We have several ideas for how to do this.

One idea, building on our paper, is to covertly train models to internalize a range of quantitative choice parameters, and then fine-tune them on examples of accurate reports about those parameters. In our paper, we implemented this method for one kind of property (attribute weights). Here, we would implement it across a variety of properties, such as risk aversion, loss aversion, or temporal discounting in economic choice; inequality aversion in social choice; etc.
Another idea is to measure models’ native behavioral tendencies across a much larger range of contexts (e.g., from [5]), and fine-tune the model on examples of accurately predicting those tendencies.
A third idea is to identify decision problems that can be solved using a small, discrete number of approaches—approaches which can be distinguished behaviorally (e.g., strategies for solving mazes). We would have models solve these problems, behaviorally measure which approach they used, and then fine-tune them on examples of accurately reporting that approach.
A fourth approach is to leverage white-box techniques from mech interp, such as the concept injection used in [3], to manipulate aspects of the models' internal processes and then train it to report those aspects correctly.

The hope is that, with these methods, we can train frontier LLMs to accurately report on their internal choice processes via real-time introspection. This training could lead models to be honest & faithful when explaining their reasoning, and could uncover a powerful new font of information about their internal processes—clear, significant wins for safety. The training could also be naturally adapted to improving CoT faithfulness in reasoning models. Finally, introspection training like ours may someday play a role in measuring and understanding AI sentience [6].

Why are you qualified to work on this?

The primary investigator on this project is Adam Morris, collaborating with Dillon Plunkett. We both did PhDs at Harvard (in computational cognitive science and cognitive neuroscience, respectively). Adam is currently a postdoc at Princeton, and Dillon is an Anthropic Fellow.

There is substantial evidence for our general ability to lead computational research projects. Between the two of us, we’ve published 20+ papers in computational cognitive science (with 1000+ total citations) on topics like reinforcement learning, decision-making, and reasoning. Adam has specifically studied and published 4 papers on introspection in humans, and developed the multi-attribute choice method that we leveraged in our paper to measure self-report accuracy in LLMs (see [7]).

Though we are relatively new to AI safety research, there is also evidence for our ability to lead projects in this specific domain. We have already published one paper on self-knowledge in LLMs (see [4]), which forms the foundation of the present project. We have gotten positive feedback on our work from established AI safety and interpretability researchers, and are actively building collaborations with academics and researchers at leading labs.

What would you do with additional funding, above and beyond the $15k ACX grant?

We expect the ACX grant to cover our compute expenses for the next 6-8 months, allowing us to run 1-2 versions of all the experiments described above. Additional funds would be used to (a) run larger, higher-risk versions of the experiments now, (b) run follow-up studies based on the initial results afterward, and/or (c) pay programmers or ML graduate students to help accelerate the research and run more studies in a faster time-frame.

References

[1] Neel Nanda (2022). A Longlist of Theories of Impact for Interpretability. https://www.alignmentforum.org/posts/uK6sQCNMw8WKzJeCQ/a-longlist-of-theories-of-impact-for-interpretability

[2] Betley, J., Bao, X., Soto, M., Sztyber-Betley, A., Chua, J., & Evans, O. (2025). Tell me about yourself: LLMs are aware of their learned behaviors. https://arxiv.org/abs/2501.11120

[3] Lindsey, J. (2025). Emergent Introspective Awareness in Large Language Models. https://transformer-circuits.pub/2025/introspection/index.html

***[4] Plunkett, D., Morris, A., Reddy, K., & Morales, J. (2025). Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions, and Improve with Training. https://arxiv.org/abs/2505.17120

This is our paper on LLM self-knowledge, which forms the foundation of the current project.

[5] Perez, E., Ringer, S., Lukosiute, K., Nguyen, K., Chen, E., Heiner, S., ... & Kaplan, J. (2023). Discovering language model behaviors with model-written evaluations. In Findings of the association for computational linguistics: ACL 2023 (pp. 13387-13434). https://aclanthology.org/2023.findings-acl.847.pdf

[6] Perez, E., & Long, R. (2023). Towards evaluating ai systems for moral status using self-reports. https://arxiv.org/abs/2311.08576

[7] Morris, A., Carlson, R. W., Kober, H., & Crockett, M. J. (2025). Introspective access to value-based multi-attribute choice processes. Nature communications, 16(1), 3733. https://www.nature.com/articles/s41467-025-59080-y