I am working on a concept in agent foundations that I call “intermediate divergence.” An abstract hypothesis and two topical interpretations are presented below.
Consider a consequentialist agent with big-picture strategic awareness pursuing a terminal goal that is a perfect proxy for some underlying value. If every action taken while pursuing the goal has a positive probability of diverging from the underlying value, then the probability that at least one divergence occurs approaches one as the number of actions approaches infinity over the intermediate period before the goal is abandoned, achieved, or modified. If factors of divergence exist with non-negligible weight, including but not limited to competing terminal goals, conflicting instrumental goals, and incompetence, which is likely the default scenario unless the agent is specifically designed to avoid them, then the probability of divergence becomes significant after only a small number of actions, and the number of actions required for probabilistic significance is inversely related to the weight of those factors. The divergence will occur at varying degrees of severity over varying durations of time.
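To make the probabilistic claim concrete, here is a minimal sketch under a toy assumption of my own that is not part of the hypothesis as stated: each action independently diverges from the underlying value with a fixed per-action probability p. Under that simplification, the probability of at least one divergence after n actions is 1 − (1 − p)^n, which approaches one as n grows, and the number of actions needed to reach a given probability threshold shrinks as p (a stand-in for the weight of the factors of divergence) grows.

```python
import math

# Toy model (an illustrative assumption, not part of the hypothesis itself):
# each action independently diverges from the underlying value with probability p.

def divergence_probability(p: float, n: int) -> float:
    """Probability of at least one divergent action among n actions."""
    return 1.0 - (1.0 - p) ** n

def actions_until_significant(p: float, threshold: float = 0.5) -> int:
    """Smallest n at which the divergence probability first reaches the threshold."""
    return math.ceil(math.log(1.0 - threshold) / math.log(1.0 - p))

# A heavier weighting of the factors of divergence corresponds to a larger p,
# and hence fewer actions before divergence becomes probabilistically significant.
for p in (0.001, 0.01, 0.05):
    print(f"p={p}: P(divergence | 1000 actions) = {divergence_probability(p, 1000):.3f}, "
          f"actions for >50%: {actions_until_significant(p)}")
```

This is only a sketch; the independence and fixed-probability assumptions are placeholders for whatever structure the formalized hypothesis ultimately uses.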
If the factors of divergence exist and possess some non-negligible weight, then there are two topical interpretations:
1. AI Alignment Failure Case (Short-Term): This phenomenon may constitute an alignment failure when the divergence is sufficiently severe and lasts sufficiently long, up to and including a loss of control that leads to existential risk. This may occur even if the terminal goal perfectly proxies human flourishing. Because the risk is not necessarily permanent, the concern is not strictly about irreversibility but about the duration and severity of the divergence, and whether such an intermediate state is acceptable under any reasonable interpretation of alignment.
2. AI Alignment Success Case (Long-Term): If the goal is achieved and the underlying value is satisfiable in ways compatible with such an intermediate state, then there exists a region of outcome space in which the agent causes existential risk in the short term while preserving the underlying value in the long term. If artificial superintelligence arrives soon, this interpretation identifies a possible positive outcome corridor under conditions where current alignment techniques prove insufficient to prevent existential risk, yet goal steering proves effective enough to handle a class of challenging long-term value-theoretic problems. This interpretation is not an endorsement of intermediate existential risk but a conditional characterization of outcome space.
My primary plan is to develop, formalize, and test the intermediate divergence hypothesis, publish the results, and then build on the two topical interpretations presented above. Secondary goals may include empirical testing against contemporary AI models, identifying future research directions, and community-building around the work if it proves fruitful. The target timeline is three months of full-time work, subject to change.
The listed funding goal is double the intended goal, to leave room in case extra funding is allocated. The funding will primarily be used for living expenses and services that increase research output. If the research proves valuable and the work gains traction, extra funds may be used for compute, hardware, conference attendance, publishing fees, and community-building.
This is a solo project. I have a background in software development and several years of sustained engagement with alignment research, effective altruism, and rationality. This is my first formal research project in the field.
The most likely causes of failure are shaky empirical or theoretical ground or a lack of wider interest in the work. More concretely, possible outcomes are that the hypothesis does not hold, that it is internally inconsistent, or that there is unjustified optimism that current alignment techniques will work.
None received so far. There is one pending application to the Survival and Flourishing Fund 2026 grant program, submitted April 22, 2026.