This independent research aims to uncover how LLMs know what they know. I will investigate how LLMs represent, structure, and update factual beliefs, uncertainty, and world models. In particular, I’m interested in the following research questions:
How are beliefs represented in LLMs? (See the illustrative probing sketch after this list.)
How do LLMs integrate conflicting claims?
How do LLMs represent and reason about uncertainty in beliefs, if at all?
How do beliefs stored in the model weights interact with beliefs embedded in the prompt?
How are beliefs connected in LLMs? How does changing a belief affect other beliefs?
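To make the first research question more concrete, here is a minimal, hypothetical sketch of the kind of experiment this project could start with: training a linear probe on a model's hidden-state activations to separate true from false statements. The model (gpt2), the toy statements, the probe layer, and the use of scikit-learn's LogisticRegression are illustrative placeholders, not a finalized experimental design.

```python
# Hypothetical sketch: probing hidden states for a linearly decodable "truth" signal.
# All concrete choices (model, layer, statements) are placeholders for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "gpt2"  # placeholder; any causal LM with accessible hidden states works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

# Tiny toy dataset of (statement, is_true) pairs; a real study would need a larger,
# carefully controlled dataset and held-out evaluation.
statements = [
    ("The capital of France is Paris.", 1),
    ("The capital of France is Berlin.", 0),
    ("Water boils at 100 degrees Celsius at sea level.", 1),
    ("Water boils at 10 degrees Celsius at sea level.", 0),
]

def last_token_activation(text: str, layer: int = 6) -> torch.Tensor:
    """Return the hidden state of the final token at a chosen layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.hidden_states[layer][0, -1]  # shape: (hidden_dim,)

X = torch.stack([last_token_activation(s) for s, _ in statements]).numpy()
y = [label for _, label in statements]

# A linear probe: if it separates true from false statements on held-out data,
# the layer's activations carry some linearly decodable "belief" signal.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("Probe training accuracy:", probe.score(X, y))
```

If such a probe generalized to held-out statements, that would be evidence for a linearly decodable belief signal at that layer; follow-up experiments could then test whether the same direction predicts the model's behavior under conflicting or uncertain prompts.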
The goal of this project is to improve our understanding of LLM epistemology. More transparency into the belief structure of LLMs can help identify dangerous beliefs that emerge during training or prompting. Answering these research questions is an important step toward developing LLM lie detectors and LLM mind readers.
Concretely, I will publish datasets, experiments, toy models, blog posts, and papers that improve our understanding of belief structures in LLMs.
Gross salary: $50,000 (~$30,000 net)
Desk in the EA Denmark office: $3,500
Cloud computing and API costs: $3,000
Travel and conference expenses: $2,000
Textbooks, resources, and unexpected costs: $1,000
Total: $59,500
My minimum funding amount is half of this total (~$29,750), in which case I would work on this project for half a year.
See my CV, LinkedIn, Google Scholar, GitHub, LessWrong, and Twitter.
I did a two-month trial period as an independent alignment researcher, so I know what that life is like.
I published two first-author papers:
Preventing negative side effects in reinforcement learning agents (IJCAI AI Safety Workshop 2019)
Causal discovery with interpretable neural networks (Discovery Science 2021, best student paper award)
I won prizes in multiple machine learning competitions:
Winner of the ECML-PKDD Out-of-Distribution competition
2nd place in the NeurIPS Causality for Climate competition
3rd place in the NeurIPS Learning by Doing competition
Potential causes of failure:
LLMs might be sufficiently alien that it’s not possible to recover any useful belief structures.
Insights might not generalize from one model to another. For example, certain belief structures might only emerge at sufficient scale.
Outcomes if the project fails:
In these cases, we still learn something about the nature of the problem, namely that it’s too difficult and we should focus our energy elsewhere.
This project might be actively harmful if its insights are used to speed up AI capabilities research. I will take this into account when deciding whether to publish, and I will consult trusted members of the AI safety community when in doubt.
Currently, this project is not receiving any other funding. However, I will also apply to the EA Long-Term Future Fund and the University of Copenhagen.