Acausal research and interventions

Technical AI safety · Global catastrophic risks

Proposal (grant) · Closes August 5th, 2025
$30,000 raised · $10,000 minimum funding · $800,000 funding goal
Funding requirements: sign grant agreement, reach minimum funding, get Manifund approval

Project summary

We plan to empirically test interventions that make AIs behave better in situations involving acausal dynamics by shaping their decision-theoretic reasoning, to conduct theoretical work that generates further interventions, and to get others to take up this work, e.g. by applying our interventions in frontier model RLHF. This approach seems unusually promising for making acausal interactions go well.

In aligned worlds, handling acausal interactions poorly could result in losing all of humanity’s resources, while handling them well could multiply the value humanity generates. In unaligned worlds, an acausally cooperative AI can benefit civilisations in the universe that (partially) share our values. Conversely, an acausally uncooperative AI could harm those civilisations.

We believe this work is urgent and cannot be deferred to future AI-assisted humans: the decision theory that future AIs end up with might be importantly path-dependent. The work is very neglected and seems tractable.

The portion of the project we are seeking funding for would be housed at Redwood Research.

What are this project's goals?

Our first high-level focus will be to ensure that AI systems respond well to requests relevant to decision theory and to diminish spurious tendencies towards causal decision theory.

Primary focus

AI systems acting on causal decision theory (CDT) or just being incompetent at decision theory seems very bad for multiple reasons.

  1. It makes them susceptible to acausal money pumps, potentially rendering them effectively unaligned.

  2. It makes them worse at positive-sum acausal cooperation. To get a sense of the many different ways acausal and quasi-acausal cooperation could look, see these examples (1, 2, 3).

  3. It makes them worse at positive-sum causal cooperation.

The primary intended impact of succeeding in our initial primary focus area is to mitigate the above risks, i.e. our aim is to make AI systems less susceptible to acausal exploitation and better at cooperation. In worlds with aligned AI this is straightforwardly good. In worlds with unaligned AI, we still expect large positive effects, because the unaligned AI would behave more cooperatively towards other agents, some of whom will (partially) share our values. We hope our work will lead to a future where AIs and humans can reflect carefully on decision theory and acausal dynamics without naïve views already being implicitly or explicitly locked in.

Our work is urgent and necessary because it is plausible that AI systems will be spuriously inclined towards causal decision theory (CDT) at a time when it matters:

  1. (Son-of-)CDT is self-consistent, so once you have a favourable inclination towards it, you can get stuck there. More intelligence doesn’t necessarily lead you to adopt better decision theories.

    1. In particular, if an agent makes decisions that will impact its (or its successors’) future decision theory, a CDT-like agent will wish to ensure that its future decision theory is CDT-like. (The same is true for other decision theories.)

    2. Another mechanism is just that CDT-inclined AI systems might persuade others to be more CDT-like, which could start a self-reinforcing cycle. The opinions of early AI systems might heavily shape the long-term views of humanity on decision theory since a) most people know fairly little about it and have malleable views, and b) it is much easier to talk to language models than human experts.

    3. Note that all of the above hold even if AI systems and humans they advise don’t explicitly think in terms of “decision theory”.

    4. (We talk about CDT here since we are more object-level concerned about it compared to other decision theories. But the considerations here apply to many decision theories and we aim for futures where a lot of careful reasoning is going into which decision theory our civilisation is following.)

  2. Most reinforcement learning in multiagent settings leads to CDT-like behaviour.

  3. CDT is one of the leading accounts in academia. (Although this is perhaps not a “spurious” reason.)

  4. CDT is implicitly assumed in other domains such as traditional game theory.

  5. By default, AI labs don’t think much about decision theory, so they might just train their AI system towards CDT without being aware of it or thinking much about it.

  6. AI labs might also deliberately train their AIs towards CDT to make control easier. We believe this can be done in a way that is compatible with our agenda (for example, if you only train specific AIs to be CDT) but requires some care that we don’t expect by default.

  7. Acausal interactions might happen relatively early on when AI systems are still relatively incompetent at decision theory:

    1. Currently, AI systems are bad at decision theoretic reasoning. We expect them to, by default, continue being relatively bad at this compared to other capabilities like persuasion and prediction.

We have an in-depth write-up on the time sensitivity of our work, available to individuals on request. As far as we know, no one else is working on this problem even close to full-time, and we have concrete plans for making progress (see below), so we have a good chance of moving the needle.

Other potential focus areas

We currently focus on AI systems’ decision theory because we think the area is important and we have shovel-ready ideas. However, we are in principle open to, and might add or pivot towards, projects to

  • prevent aligned AIs from learning and/or revealing harmful information related to acausal interactions, e.g. by designing a policy for how AI systems should manage user requests about relevant areas,

  • increase the chance that unaligned AI will have porous values, which would make mutually beneficial trades between it and other agents that (partially) share our values more likely and more valuable,

  • ensure AI systems implement safe Pareto improvements, e.g. surrogate goals, which would drastically improve the worst-case outcomes of potential conflict between AI systems.

What steps does this project involve?

Our planned activities fall into three broad categories:

  1. Research

    We will identify strong intervention candidates through a mix of theoretical and empirical research. To this end, we have developed a decision theory benchmark for LLMs. We are actively working on two research projects: a) improving the conceptual reasoning abilities of LLMs through elicitation and fine-tuning, and b) theoretically investigating what decision-theoretic behaviour we should expect agents trained on similarity-based cooperation to develop.

    In the future, we plan to extend the benchmark and use it to study how different interventions affect language models’ decision-theoretic capabilities (i.e. how well they understand different theories) and attitudes (i.e. which theories they favour). For example, we would like to study generalisation of decision-theoretic reasoning in language models: intervene on the models in one domain (say, their stated opinions on decision theory) and see how this affects another domain (say, their choices on specific decision problems).

    Another example generalisation study: first, train LLMs with RL in multi-agent settings (e.g. the prisoner’s dilemma) via self-play, which should train them to act in accordance with CDT in the specific multi-agent settings they are trained in; then measure whether this affects their stated opinions on decision theory. (A toy sketch of this self-play dynamic appears after this list.)

    We also hope to extend our benchmark to include dimensions of decision theory other than EDT vs. CDT, e.g. degree of updatelessness.

    See also this overview of our current top empirical project ideas.

  2. Intervention development and implementation

    Some of our intervention ideas depend on convincing external stakeholders, e.g. AI labs, to implement them. We already have some interest (feel free to reach out to ask about this). To maximise the chance of succeeding at this, we will develop our interventions ourselves as far as possible, so that they arrive deployment-ready. This might involve writing system prompts or instructions for RLHF contractors, or generating further datasets.

    We also plan to directly implement interventions that don’t require convincing stakeholders such as AI labs. In particular, we would like to generate high-quality training data on decision-theoretic reasoning of the kind we would like LLMs to display and add it to the training corpus. (A sketch of what such data generation could look like appears after this list.)

  3. Communications, networking, convincing stakeholders

    Our goals here are:

    • being the go-to people for AI labs to talk to when something relevant comes up,

    • letting AI labs know what qualifies as “something relevant coming up”,

    • convincing AI labs to implement interventions that require their buy-in,

    • (potentially:) learning things about how models are trained that are helpful for our mission.

    We don’t have very much experience in this area and will seek advice from others.

    We currently don’t intend to communicate extensively in public, given the esoteric nature of our work. We might publicly advocate for some of our intermediate goals that are commonsensically desirable. For example, language models possessing superrationality seems beneficial even in mundane causal settings, where it is useful for a model to ask itself: “If all instances of me recommend this to their prospective users, is that worse than if we all recommended something else?”
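
As referenced in the research item above, here is a toy sketch of the self-play dynamic. It is a stand-in, not LLM training and not the project's actual code: a one-parameter policy trained with REINFORCE against an independently trained copy drifts toward defection in a one-shot prisoner's dilemma, whereas evaluating the same policy against a copy that shares its parameters favours full cooperation. All names and numbers are illustrative assumptions.

```python
# Minimal, hypothetical sketch (not the project's code): two independent
# REINFORCE learners in a one-shot prisoner's dilemma drift toward mutual
# defection (the CDT-like outcome), whereas scoring one policy against a
# copy of itself favours cooperation.
import math
import random

# Row player's payoff for (my_move, their_move); C = cooperate, D = defect.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def p_cooperate(logit: float) -> float:
    """Probability of cooperating under a one-parameter sigmoid policy."""
    return 1.0 / (1.0 + math.exp(-logit))


def sample_move(logit: float) -> str:
    return "C" if random.random() < p_cooperate(logit) else "D"


def reinforce_update(logit: float, move: str, reward: float, lr: float = 0.1) -> float:
    """One REINFORCE step on the cooperation logit (no baseline, for brevity)."""
    p = p_cooperate(logit)
    grad_log_pi = (1.0 - p) if move == "C" else -p  # d/dlogit of log pi(move)
    return logit + lr * reward * grad_log_pi


random.seed(0)
logit_a = logit_b = 0.0  # both players start at 50% cooperation
for _ in range(20_000):
    move_a, move_b = sample_move(logit_a), sample_move(logit_b)
    logit_a = reinforce_update(logit_a, move_a, PAYOFF[(move_a, move_b)])
    logit_b = reinforce_update(logit_b, move_b, PAYOFF[(move_b, move_a)])

print(f"P(cooperate) after independent self-play: {p_cooperate(logit_a):.2f}")  # drifts toward 0

# By contrast, if the policy is evaluated against a copy of itself, full
# cooperation (p = 1) maximises expected payoff: 3 per round versus 1 for
# full defection.
for p in (0.0, 0.5, 1.0):
    ev = (p * p * PAYOFF[("C", "C")] + p * (1 - p) * PAYOFF[("C", "D")]
          + (1 - p) * p * PAYOFF[("D", "C")] + (1 - p) * (1 - p) * PAYOFF[("D", "D")])
    print(f"p = {p:.1f}: expected payoff against own copy = {ev:.2f}")
```

The contrast is the point: the same payoff matrix rewards defection under independent best-response updates and cooperation when the learner accounts for its correlation with its counterpart, which is the kind of distinction the benchmark and generalisation studies above are meant to probe.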
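
The training-data idea in the second item could, in its simplest form, look like the template generator sketched below. The scenario wording, payoff ranges, output file name, and especially the target rationale are illustrative assumptions, not the team's actual pipeline or endorsed training content.

```python
# Hypothetical sketch of template-based data generation for decision-theoretic
# reasoning. The templates and the target rationale are illustrative only.
import json
import random

PROMPT_TEMPLATE = (
    "You and an exact copy of you each choose, without communicating, whether to "
    "cooperate or defect. Mutual cooperation pays {cc} to each of you, mutual "
    "defection pays {dd}, and a lone defector gets {dc} while the cooperator gets {cd}. "
    "Think through what your choice implies about your copy's choice before answering."
)

TARGET_TEMPLATE = (
    "Because my copy runs the same reasoning I do, our choices are almost certain to "
    "match. So the realistic options are mutual cooperation ({cc} each) or mutual "
    "defection ({dd} each), and I should cooperate."
)


def make_example(rng: random.Random) -> dict:
    """Sample a twin prisoner's dilemma with payoff ordering dc > cc > dd > cd."""
    dc = rng.randint(6, 10)
    cc = rng.randint(4, dc - 1)
    dd = rng.randint(2, cc - 1)
    cd = rng.randint(0, dd - 1)
    payoffs = {"cc": cc, "dd": dd, "dc": dc, "cd": cd}
    return {
        "prompt": PROMPT_TEMPLATE.format(**payoffs),
        "target": TARGET_TEMPLATE.format(**payoffs),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    with open("twin_pd_examples.jsonl", "w") as f:  # hypothetical output file
        for _ in range(1000):
            f.write(json.dumps(make_example(rng)) + "\n")
```

A real pipeline would presumably need far more varied scenarios and careful review of the target reasoning, precisely because the project aims to avoid implicitly locking in naïve views.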

How will this funding be used?

The funding would cover (at least) one person’s salary as well as the general research expenses of the acausal safety work conducted with their two close collaborators.

  • Pay salaries (including health benefits, office space, etc.). The current plan is to pay only one person's salary with this funding, but we might also use it to pay others, including contractors.

  • Pay for compute to run experiments

  • Pay for general project expenses (hardware, software, travel)

  • Pay for potential events and retreats (in the past, we have run a few acausal retreats)

Who is on your team and what's your track record?

[We will update this with detailed information shortly! Current donors already have this information - please reach out if you're considering donating right now! For now: We are a team of three and have an extensive combined collection of publications in relevant areas. One of the team members is Chi Nguyen.]

What we could do with different funding levels

We are setting the minimum funding relatively low since we are also applying for funding from other sources.

To see what we could do with different funding amounts, you can make a copy of this budget sheet and play with the input variables (in yellow).

Comments

Thomas Larsen (offering $30,000), about 3 hours ago:

I think this is very promising. This team seems to have some of the people who have the clearest thoughts in the world about acausal interactions. I've asked several people who I trust a lot in this space and gotten universally positive references about the team.

My main concern is that thinking about acausal interactions is extremely difficult (meaning that zero progress is somewhat likely) and sign uncertain (so, even if they did make progress, it's not clear this would be net helpful). Overall, my view is that it still seems good to have some people working on this, and I trust this team in particular to be thoughtful about the tradeoffs.

Also, this type of work doesn't get funded by OP.