Acausal research and interventions

Technical AI safety · Global catastrophic risks

Proposal (grant) · Closes August 5th, 2025
$30,000 raised · $10,000 minimum funding · $800,000 funding goal
Funding requirements: sign grant agreement, reach minimum funding, get Manifund approval

Project summary

We plan to empirically test interventions that make AIs behave better in situations involving acausal dynamics by shaping their decision-theoretic reasoning, to conduct theoretical work that generates further interventions, and to get others to take up this work, e.g. by applying our interventions in frontier model RLHF. This approach seems unusually promising for making acausal interactions go well.

In aligned worlds, handling acausal interactions poorly could result in losing all of humanity’s resources, while handling them well could multiply the value humanity generates. In unaligned worlds, an acausally cooperative AI can benefit civilisations in the universe that (partially) share our values. Conversely, an acausally uncooperative AI could harm those civilisations.

We believe this work is urgent and cannot be deferred to future AI-assisted humans: the decision theory that future AIs end up with might be importantly path-dependent. The work is very neglected and seems tractable.

The portion of the project we are seeking funding for would be housed at Redwood Research.

What are this project's goals?

Our first high-level focus will be to ensure that AI systems respond well to requests relevant to decision theory and to diminish spurious tendencies towards causal decision theory.

Primary focus

AI systems acting on causal decision theory (CDT) or just being incompetent at decision theory seems very bad for multiple reasons.

  1. It makes them susceptible to acausal money pumps, potentially rendering them effectively unaligned.

  2. It makes them worse at positive-sum acausal cooperation. To get a sense of the many different ways acausal and quasi-acausal cooperation could look, see these examples (1, 2, 3).

  3. It makes them worse at positive-sum causal cooperation.

The primary intended impact of succeeding in our initial primary focus area is to mitigate the above risks, i.e. our aim is to make AI systems less susceptible to acausal exploitation and better at cooperation. In worlds with aligned AI this is straightforwardly good. In worlds with unaligned AI, we still expect large positive effects, because the unaligned AI would behave more cooperatively towards other agents, some of whom will (partially) share our values. We hope our work will lead to a future where AIs and humans can reflect carefully on decision theory and acausal dynamics without naïve views already being implicitly or explicitly locked in.

Our work is urgent and necessary because it is plausible that AI systems will be spuriously inclined towards causal decision theory (CDT) at a time when it matters:

  1. (Son-of-)CDT is self-consistent, so once you have a favourable inclination towards it, you can get stuck there. More intelligence doesn’t necessarily lead you to adopt better decision theories.

    1. In particular, if an agent makes decisions that will impact its (or its successors’) future decision theory, a CDT-like agent will wish to ensure that its future decision theory is CDT-like. (The same is true for other decision theories.)

    2. Another mechanism is just that CDT-inclined AI systems might persuade others to be more CDT-like, which could start a self-reinforcing cycle. The opinions of early AI systems might heavily shape the long-term views of humanity on decision theory since a) most people know fairly little about it and have malleable views, and b) it is much easier to talk to language models than human experts.

    3. Note that all of the above hold even if AI systems and humans they advise don’t explicitly think in terms of “decision theory”.

    4. (We talk about CDT here since we are more object-level concerned about it compared to other decision theories. But the considerations here apply to many decision theories and we aim for futures where a lot of careful reasoning is going into which decision theory our civilisation is following.)

  2. Most reinforcement learning in multiagent settings leads to CDT-like behaviour.

  3. CDT is one of the leading accounts in academia. (Although this is perhaps not a “spurious” reason.)

  4. CDT is implicitly assumed in other domains such as traditional game theory.

  5. By default, AI labs don’t think much about decision theory, so they might just train their AI system towards CDT without being aware of it or thinking much about it.

  6. AI labs might also deliberately train their AIs towards CDT to make control easier. We believe this can be done in a way that is compatible with our agenda (for example, if you only train specific AIs to be CDT) but requires some care that we don’t expect by default.

  7. Acausal interactions might happen relatively early on when AI systems are still relatively incompetent at decision theory:

    1. Currently, AI systems are bad at decision theoretic reasoning. We expect them to, by default, continue being relatively bad at this compared to other capabilities like persuasion and prediction.

We have an in-depth write-up on the time sensitivity of our work, available to individuals on request. As far as we know, no one else is working on this problem even close to full-time, and we have concrete plans for making progress (see below), so we have a good chance of moving the needle.

Other potential focus areas

We currently focus on AI systems’ decision theory because we think the area is important and we have shovel-ready ideas. However, we are in principle open to, and might add or pivot towards, projects to

  • prevent aligned AIs from learning and/or revealing harmful information related to acausal interactions, e.g. by designing a policy for how AI systems should manage user requests about relevant areas,

  • increase the chance that unaligned AI will have porous values, which would make mutually beneficial trades between it and other agents that (partially) share our values more likely and more valuable,

  • ensure AI systems implement safe Pareto improvements, e.g. surrogate goals, which would drastically improve the worst-case outcomes of potential conflict between AI systems.

What steps does this project involve?

Our planned activities fall into three broad categories:

  1. Research

    We will identify strong intervention candidates through a mix of theoretical and empirical research. To this end, we have developed a decision theory benchmark for LLMs. We are actively working on two research projects: a) improving the conceptual reasoning abilities of LLMs through elicitation and fine-tuning, and b) theoretically investigating what decision-theoretic behaviour we should expect agents trained on similarity-based cooperation to develop.

    In the future, we plan to extend the benchmark and use it to study how different interventions affect language models’ decision-theoretic capabilities (i.e. how well they understand different theories) and attitudes (i.e. which theories they favour). For example, we would like to study generalisation of decision-theoretic reasoning in language models: intervene on the models in one domain (say, their stated opinions on decision theory) and see how this affects another domain (say, their choices on specific decision problems).

    Another example generalisation study: first, train LLMs with RL in multi-agent settings (e.g. the prisoner’s dilemma) via self-play, which should train them to act in accordance with CDT in the specific multi-agent settings they are trained in; then measure whether this affects their stated opinions on decision theory. (A toy sketch of this self-play dynamic appears after this list.)

    We also hope to extend our benchmark to include dimensions of decision theory other than EDT vs. CDT, e.g. degree of updatelessness.

    See also this overview of our current top empirical project ideas.

  2. Intervention development and implementation

    Some of our intervention ideas depend on convincing external stakeholders, e.g. AI labs, to implement them. We already have some interest (feel free to reach out to ask about this). To maximise the chance of succeeding at this, we will develop our interventions ourselves as far as possible, so that they arrive deployment-ready. This might involve writing system prompts or instructions for RLHF contractors, or generating further datasets.

    We also plan to directly implement interventions that don’t require convincing stakeholders such as AI labs. In particular, we would like to generate high-quality training data on decision-theoretic reasoning of the kind we would like LLMs to display and add it to the training corpus. (A sketch of what such data generation could look like appears after this list.)

  3. Communications, networking, convincing stakeholders

    Our goals here are:

    • being the go-to people for AI labs to talk to when something relevant comes up,

    • letting AI labs know what qualifies as “something relevant coming up”,

    • convincing AI labs to implement interventions that require their buy-in,

    • (potentially:) learning things about how models are trained that are helpful for our mission.

    We don’t have very much experience in this area and will seek advice from others.

    We currently don’t intend to communicate extensively in public, given the esoteric nature of our work. We might publicly advocate for some of our intermediate goals that are commonsensically desirable. For example, language models possessing superrationality seems beneficial even in mundane causal settings, where it is useful for a model to ask itself: “If all instances of me recommend this to their prospective users, is that worse than if we all recommended something else?”
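
As referenced in the research item above, here is a toy sketch of the self-play dynamic. It is a stand-in, not LLM training and not the project's actual code: a one-parameter policy trained with REINFORCE against an independently trained copy drifts toward defection in a one-shot prisoner's dilemma, whereas evaluating the same policy against a copy that shares its parameters favours full cooperation. All names and numbers are illustrative assumptions.

```python
# Minimal, hypothetical sketch (not the project's code): two independent
# REINFORCE learners in a one-shot prisoner's dilemma drift toward mutual
# defection (the CDT-like outcome), whereas scoring one policy against a
# copy of itself favours cooperation.
import math
import random

# Row player's payoff for (my_move, their_move); C = cooperate, D = defect.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def p_cooperate(logit: float) -> float:
    """Probability of cooperating under a one-parameter sigmoid policy."""
    return 1.0 / (1.0 + math.exp(-logit))


def sample_move(logit: float) -> str:
    return "C" if random.random() < p_cooperate(logit) else "D"


def reinforce_update(logit: float, move: str, reward: float, lr: float = 0.1) -> float:
    """One REINFORCE step on the cooperation logit (no baseline, for brevity)."""
    p = p_cooperate(logit)
    grad_log_pi = (1.0 - p) if move == "C" else -p  # d/dlogit of log pi(move)
    return logit + lr * reward * grad_log_pi


random.seed(0)
logit_a = logit_b = 0.0  # both players start at 50% cooperation
for _ in range(20_000):
    move_a, move_b = sample_move(logit_a), sample_move(logit_b)
    logit_a = reinforce_update(logit_a, move_a, PAYOFF[(move_a, move_b)])
    logit_b = reinforce_update(logit_b, move_b, PAYOFF[(move_b, move_a)])

print(f"P(cooperate) after independent self-play: {p_cooperate(logit_a):.2f}")  # drifts toward 0

# By contrast, if the policy is evaluated against a copy of itself, full
# cooperation (p = 1) maximises expected payoff: 3 per round versus 1 for
# full defection.
for p in (0.0, 0.5, 1.0):
    ev = (p * p * PAYOFF[("C", "C")] + p * (1 - p) * PAYOFF[("C", "D")]
          + (1 - p) * p * PAYOFF[("D", "C")] + (1 - p) * (1 - p) * PAYOFF[("D", "D")])
    print(f"p = {p:.1f}: expected payoff against own copy = {ev:.2f}")
```

The contrast is the point: the same payoff matrix rewards defection under independent best-response updates and cooperation when the learner accounts for its correlation with its counterpart, which is the kind of distinction the benchmark and generalisation studies above are meant to probe.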
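
The training-data idea in the second item could, in its simplest form, look like the template generator sketched below. The scenario wording, payoff ranges, output file name, and especially the target rationale are illustrative assumptions, not the team's actual pipeline or endorsed training content.

```python
# Hypothetical sketch of template-based data generation for decision-theoretic
# reasoning. The templates and the target rationale are illustrative only.
import json
import random

PROMPT_TEMPLATE = (
    "You and an exact copy of you each choose, without communicating, whether to "
    "cooperate or defect. Mutual cooperation pays {cc} to each of you, mutual "
    "defection pays {dd}, and a lone defector gets {dc} while the cooperator gets {cd}. "
    "Think through what your choice implies about your copy's choice before answering."
)

TARGET_TEMPLATE = (
    "Because my copy runs the same reasoning I do, our choices are almost certain to "
    "match. So the realistic options are mutual cooperation ({cc} each) or mutual "
    "defection ({dd} each), and I should cooperate."
)


def make_example(rng: random.Random) -> dict:
    """Sample a twin prisoner's dilemma with payoff ordering dc > cc > dd > cd."""
    dc = rng.randint(6, 10)
    cc = rng.randint(4, dc - 1)
    dd = rng.randint(2, cc - 1)
    cd = rng.randint(0, dd - 1)
    payoffs = {"cc": cc, "dd": dd, "dc": dc, "cd": cd}
    return {
        "prompt": PROMPT_TEMPLATE.format(**payoffs),
        "target": TARGET_TEMPLATE.format(**payoffs),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    with open("twin_pd_examples.jsonl", "w") as f:  # hypothetical output file
        for _ in range(1000):
            f.write(json.dumps(make_example(rng)) + "\n")
```

A real pipeline would presumably need far more varied scenarios and careful review of the target reasoning, precisely because the project aims to avoid implicitly locking in naïve views.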

How will this funding be used?

The funding would cover (at least) one person’s salary as well as the general research expenses of the acausal safety work conducted with their two close collaborators.

  • Pay salaries (including health benefits, office space, etc.). The current plan is to pay only one person's salary with this funding, but we might also use it to pay others, including contractors.

  • Pay for compute to run experiments

  • Pay for general project expenses (hardware, software, travel)

  • Pay for potential events and retreats (in the past, we have run a few acausal retreats)

Who is on your team and what's your track record?

[We will update this with detailed information shortly! Current donors already have this information - please reach out if you're considering donating right now! For now: We are a team of three and have an extensive combined collection of publications in relevant areas. One of the team members is Chi Nguyen.]

What we could do with different funding levels

We are setting the minimum funding relatively low since we are also applying for funding from other sources.

To see what we could do with different funding amounts, you can make a copy of this budget sheet and play with the input variables (in yellow).

Comments

Thomas Larsen (offering $30,000), about 3 hours ago:

I think this is very promising. This team seems to have some of the people who have the clearest thoughts in the world about acausal interactions. I've asked several people who I trust a lot in this space and gotten universally positive references about the team.

My main concern is that thinking about acausal interactions is extremely difficult (meaning that zero progress is somewhat likely) and sign uncertain (so, even if they did make progress, it's not clear this would be net helpful). Overall, my view is that it still seems good to have some people working on this, and I trust this team in particular to be thoughtful about the tradeoffs.

Also, this type of work doesn't get funded by OP.