

Systems that "give a damn"

Technical AI safety · Animal welfare

Steve Petersen

Proposal · Grant
Closes August 13th, 2025
$1,500 raised
$12,000 minimum funding
$24,000 funding goal


Project summary

Intuitively, AlphaGo does not "give a damn" about winning Go games---though clearly it is a game-winning optimizer. If this is true, what would it take for AI systems genuinely to "give a damn" about their goals? Building on Terrence Deacon's groundbreaking work on the physical basis of teleology, I'm creating a framework to make this distinction. This work directly addresses which systems might pose agentic dangers through convergent instrumental goals, and at the same time which systems might themselves be moral patients deserving ethical consideration. Through weekly collaborations with John Wentworth and Sahil (formerly MIRI), plus participation in Nick Bostrom's Digital Minds working group, I'm working to operationalize these philosophical insights into concrete assessment tools for AI developers.

What are this project's goals? How will you achieve them?

My project aims to develop a more tractable approach to both digital sentience and dangerous AI agency. This approach treats teleology as fundamental, rather than starting with consciousness or "optimization". The central question is: which physical systems genuinely "give a damn" about their goals, and which "merely optimize"? Systems that give a damn plausibly have interests that could ground subjective well-being---and they're also more likely to pursue convergent instrumental goals, such as self-preservation, that pose agency risks.

Building on Terrence Deacon's groundbreaking work on the physical basis of teleology, I'm developing a framework for distinguishing AI systems that exhibit genuine agency from those that perform sophisticated optimization. If Deacon's research is correct---that genuine "caring" must be built from the bottom up, agency built upon agency---then different AI architectures may have fundamentally different capacities for the kind of agency that could ground sentience. For example, traditional von Neumann architectures might be incapable of genuine caring, regardless of computational sophistication.

I am pursuing these goals through intensive collaboration with technical AI researchers: weekly meetings with John Wentworth (independent researcher) and Sahil (formerly MIRI) to analyze AI architectures against teleological criteria, plus biweekly participation in the Digital Minds working group led by Eleos AI and Nick Bostrom. The methodology is the standard analytic-philosophy approach: seeking conceptual coherence and formal models where there is little clear understanding to begin with.

How will this funding be used?

The funding ($12,000 per semester) would support a one-course teaching reduction, freeing approximately six hours per week for focused research. This time is essential for the intensive preparation required for technical collaborations and sustained engagement with Deacon's complex theoretical framework.

Each collaboration meeting requires significant "homework" time---I need to deeply understand technical AI architectures to apply teleological analysis meaningfully. During a typical teaching semester, finding this focused time is nearly impossible. The course reduction enables me to maintain regular collaborative momentum rather than limiting this work to school breaks, which would significantly slow progress and reduce practical relevance.

Who is on your team? What's your track record on similar projects?

I serve as the lead researcher, though I take guidance from my collaborators. My core team includes John Wentworth (independent researcher, familiar to the alignment community) and Sahil (formerly of MIRI's Agent Foundations team). I also participate in the Digital Minds working group with Nick Bostrom, Patrick Butlin, and Rob Long, and I frequently collaborate with Abram Demski, Ramana Kumar, and other technical alignment researchers.

I've been working and publishing in philosophy of AI since 2005---back when AI work was considered more a stain than a laurel on a philosophical CV. In an early publication from 2015 I attempted to refute Bostrom's arguments, found them harder to dismiss than I would have liked, and by about 2018 I had become dedicated to alignment work. I've published academic papers and given many talks on alignment, and I'm learning to engage in non-traditional academic forums, such as co-posting on LessWrong.

My work has been supported by the Survival and Flourishing Fund, Long-Term Future Fund, Future of Life Institute, Nonlinear, Center for AI Safety, and Ramana Kumar. Most recently, SFF funded an extra sabbatical semester in 2021-2022, LTFF funded a course buyout for 2022-2023, and a combination of CAIS/Nonlinear/Ramana Kumar funded 2023-2024.

What are the most likely causes and outcomes if this project fails?

The primary risk is that philosophical frameworks may not translate to practical assessment tools that AI developers can use. I'm mitigating this by working closely with technical collaborators to ensure teleological criteria can be operationalized as concrete features.

A secondary risk is that physical substrate requirements for genuine agency may be so restrictive that no current AI systems qualify, making the framework less immediately relevant. However, even negative results would be valuable---helping identify which technological developments might be necessary for genuinely sentient AI.

The strongest objection, as OpenPhil notes in its "flag" on agent foundations work, is that philosophical approaches may not yield tractable insights within AI safety timelines. But as MIRI discovered, fundamentally difficult philosophical problems lurk beneath the computer science. Agent foundations is "philosophy with a deadline," in Bostrom's frightening phrase.

But as John Wentworth argues, we can't all keep looking for the keys under the lamp post just because "that's where the light is." If Deacon's framework is correct about the physical basis of genuine agency, it could fundamentally reshape how we think about which AI systems pose risks.

How much money have you raised in the last 12 months, and from where?

In the past 12 months, I received a $12,500 SFF speculation grant for a fall 2024 course reduction, though I did not receive the full grant that would have funded spring semester 2025 as well. My evaluators suggested this was due only to budget constraints. The difference in available research time is certainly noticeable.

I was also turned down by LTFF for 2024-2025 funding. I recently applied to OpenPhil for the same funding purpose, but they've indicated they're not emphasizing theory-heavy, non-empirical research, so I'm not optimistic about that application.

FLI has also provided several smaller grants to fund alignment conference trips, including to IASEAI and both ILIAD conferences.

Comments

Abram Demski

1 day ago

Steve Petersen provides an important bridge between mainstream academic philosophy and AI safety work. I have had many valuable discussions with him. He has work in algorithmic information theory, establishing an important theory of abstraction, which is a central concern of several AI safety research agendas (John Wentworth and Sam Eisenstat come to mind). This demonstrates his technical chops & ability to craft deep theories.