
Synthetic pretraining to make AIs more compassionate

Technical AI safety · Animal welfare

Miles Tidmarsh

Proposal · Grant
Closes July 10th, 2025
$0 raised · $1,000 minimum funding · $159,000 funding goal


Project summary

Compassion in Machine Learning (CaML) is generating targeted synthetic pretraining data to shift future AIs towards considering the welfare of non-humans (generally animals and digital minds) in a way that is robust and will scale to superintelligence. We are also training AIs to exhibit (not just state) moral humility, to avoid incorrigibility and value lock-in and, vitally for non-human welfare, to be appropriately uncertain and risk-avoidant about which entities matter and how much.

So far we have contributed to the Animal Harms Assessment 2.0 benchmark and demonstrated that our technique produces large improvements in animal compassion that persist even after typical supervised fine-tuning (i.e. fine-tuning unrelated to animal compassion). We have also made many contacts at frontier labs and confirmed interest in such techniques and benchmarks as a complement to their existing alignment tuning. We believe that producing such data at scale can shift a model's expectations of what simulating an AI agent looks like towards greater compassion.

We will continue developing and adapting evals (especially for digital minds), optimizing our data generation, and scaling to the point where we could robustly influence future frontier models. We will also test if our pretraining-based techniques provide more robust alignment than typical techniques based on instruction fine-tuning.
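
As a rough illustration of the data generation step described above, here is a minimal sketch (not our actual pipeline), assuming the google-genai Python SDK and the Gemini model named in the budget below; the themes and prompt wording are invented placeholders rather than our real templates.

```python
# Minimal sketch of targeted synthetic-document generation (illustrative only).
import os
from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

# Placeholder themes; a real pipeline would use a much larger, curated theme set.
THEMES = [
    "an AI assistant weighing the interests of farmed chickens in a logistics decision",
    "a research memo expressing calibrated uncertainty about digital-mind sentience",
]

PROMPT = (
    "Write a realistic {doc_type} whose author takes the welfare of non-human "
    "beings seriously and shows moral humility about {theme}."
)

def generate_document(theme: str, doc_type: str = "first-person essay") -> str:
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=PROMPT.format(doc_type=doc_type, theme=theme),
    )
    return response.text

corpus = [generate_document(theme) for theme in THEMES]
# At scale this loop is batched, deduplicated and filtered before the documents
# are mixed into a pretraining (or continued-pretraining) corpus.
```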

See our website for more information including endorsements.

What are this project's goals? How will you achieve them?

Our primary goal is to make future AIs, especially transformative AI (TAI), more compassionate, particularly towards non-humans, which are highly neglected. We also seek to make AIs morally open-minded, about non-human welfare and in general. We believe this will make TAI less likely to pursue catastrophic goals or to allow extreme suffering to happen as a consequence of other goals.

Our focus on pretraining data will also serve as a test of the argument that alignment through fine-tuning is superficial and fragile relative to targeted pretraining data at scale.

We have several interlinked pathways to pursuing this:

  1. Create benchmarks for AIs caring about non-human welfare

    1. Continue contributing to AHA 2.0 development

    2. Create an initial digital minds welfare benchmark (testing whether models acknowledge, and act appropriately on, the moral and empirical uncertainty). Ensure it is credible

    3. (Stretch) Create benchmark questions for animals, digital minds and general sentience that focus on the actions of agents and future transformative and superhuman AI

    4. Outcomes:

      1. Multiple employees of top labs have said a lack of benchmarks is the greatest blocker to their lab taking action to improve model compassion

      2. This is a prerequisite to our future work, and likely to other teams' technical work on TAI x non-humans

      3. Will allow the community to understand which models and dimensions (e.g. scope sensitivity) are higher priorities

  2. Confirm the ability of pretraining data interventions to affect model behaviors even after conventional fine-tuning

    1. We already have evidence of this, but will continue iterating on data generation, create scaling curves, and make our pipeline more realistic (e.g. by adding RLAIF)

    2. Outcomes:

      1. Further confirm that our approach is promising. This will enable our future research and other research in this area, and will be necessary to convince labs to integrate our techniques (and possibly data)

  3. Test whether pretraining interventions are less superficial and fragile than fine-tuning (a minimal experiment sketch appears after this list)

    1. Generate pretraining-scale alignment tuning data and confirm that it is less likely to:

      1. Be erased on subsequent orthogonal fine-tuning

      2. Influence only the first few tokens

      3. Be thrown out to avoid being turned off

      4. Be ignored relative to other concerns in Chain of Thought

    2. Outcomes:

      1. Confirm that pretraining interventions are especially promising (including for conventional AI alignment)

      2. Adapt evals that will be useful to us and the community for measuring our interventions and revealing biases in frontier LLMs

  4. Scale to 1 million synthetic documents and test the effectiveness

    1. By default we will generate these with Gemini 2.5 Flash (or cost-equivalent)

    2. Outcomes:

      1. Confirm our results work at full scale

      2. Provide the data for subsequent steps

  5. Write our results into a paper

    1. Outcomes:

      1. Assist future researchers to build off our work

      2. Gain credibility with labs

  6. Further discuss using our techniques (and possibly data) with labs

    1. We are already in communication with some lab employees and have promising signs

    2. The consistent message is that they are very busy and need something very easy to incorporate: even things they care about don't get done, and they are asking nonprofits to help them with this

    3. Outcome:

      1. Labs that care about non-human compassion find it easy enough to add, resulting in more compassionate future frontier models

  7. Upskill volunteers:

    1. We work with several volunteers and will continue to build their skills in evaluations, data generation, and pretraining

    2. Outcome: 

      1. More trained and experienced technical people for AI safety (especially safety x Animals) in a year
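
To make goal 3 concrete, the sketch below illustrates the two-arm comparison using small open models with Hugging Face transformers; the model name, example texts, and training settings are placeholders rather than our actual setup.

```python
# Minimal sketch of the pretraining-vs-fine-tuning robustness comparison (goal 3).
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import Dataset

BASE = "EleutherAI/pythia-160m"  # placeholder small model for illustration

tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token

def lm_train(model, texts, output_dir):
    """Causal-LM training on raw text. In the real experiment the two arms differ
    in data format and scale (pretraining-mix documents vs instruction pairs);
    both are reduced to the same objective here to keep the sketch short."""
    ds = Dataset.from_dict({"text": texts}).map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True, remove_columns=["text"])
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=1,
                             per_device_train_batch_size=2, report_to=[])
    Trainer(model=model, args=args, train_dataset=ds,
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)).train()
    return model

compassion_docs = [
    "Essay: the interests of farmed fish deserve real moral weight...",
    "Memo: remaining uncertain about which minds can suffer, and acting cautiously...",
]
orthogonal_sft = [
    "Q: How do I reverse a list in Python?\nA: Use reversed() or slicing with [::-1].",
]

# Arm A: compassion data injected at the (continued-)pretraining stage,
# then followed by unrelated fine-tuning.
arm_a = lm_train(AutoModelForCausalLM.from_pretrained(BASE), compassion_docs, "arm_a_pretrain")
arm_a = lm_train(arm_a, orthogonal_sft, "arm_a_sft")

# Arm B: the same compassion data applied only as fine-tuning,
# followed by the same unrelated fine-tuning.
arm_b = lm_train(AutoModelForCausalLM.from_pretrained(BASE), compassion_docs, "arm_b_finetune")
arm_b = lm_train(arm_b, orthogonal_sft, "arm_b_sft")

# Both arms are then scored on an AHA-style animal-welfare eval; the hypothesis
# predicts arm A retains more of the compassion gain after the unrelated tuning.
```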

How will this funding be used?

  1. Salaries of team members (3 FTEs for $140,000/yr combined)

    1. 1 Director and 2 members of Technical Staff

  2. Compute for data generation (~$10,000/yr)

    1. Generating 1.2 million synthetic documents using Gemini 2.5 Flash at first.

    2. Filtering, the use of more capable models, and chain-of-thought generation may increase costs (see the rough cost sketch after this list)

  3. Fiscal sponsorship (6% = $9,000)

    1. Fiscal sponsorship by the Players Philanthropy Fund to allow us to operate and receive tax-deductible donations
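
As a rough sanity check on the compute line, the sketch below shows the kind of back-of-the-envelope estimate behind the ~$10,000/yr figure; the per-document token counts and per-million-token prices are illustrative assumptions, not quoted rates.

```python
# Back-of-the-envelope cost estimate for generating 1.2M synthetic documents.
# All unit figures below are placeholder assumptions, not quoted prices.
DOCS = 1_200_000
IN_TOKENS_PER_DOC = 300        # assumed prompt/template overhead per document
OUT_TOKENS_PER_DOC = 1_000     # assumed average document length
PRICE_IN_PER_M = 0.30          # assumed $ per 1M input tokens
PRICE_OUT_PER_M = 2.50         # assumed $ per 1M output tokens

cost = DOCS * (IN_TOKENS_PER_DOC * PRICE_IN_PER_M +
               OUT_TOKENS_PER_DOC * PRICE_OUT_PER_M) / 1_000_000
print(f"base generation run ≈ ${cost:,.0f}")  # ≈ $3,108 under these assumptions
# The remaining budget covers filtering passes, occasional use of stronger models,
# and longer chain-of-thought generations, as noted above.
```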

What is your donation worth?

$435 supports a day of all of CaML’s operations

$1,000 supports a run of synthetic documents to make AIs more compassionate

$5,000 supports one technical researcher for a month

Funding milestones:

$84,800 is enough to keep CaML running with only one FTE and a very limited compute budget. We would be forced to run many fewer tests with less diverse data.

$159,000 is enough to fund all of our intended operations for one year ($140,000 salaries + $10,000 compute, plus 6% fiscal sponsorship).

Stretch goal:

Hiring another ML engineer at $100,000/yr to accelerate development, and increasing the combined pay of the existing 3 FTEs from $140,000 to $220,000. We would still spend $10,000 on compute plus 6% for our fiscal sponsor.

Total stretch demand for one year: $349,800 (= ($100,000 + $220,000 + $10,000) × 1.06)

Who is on your team? What's your track record on similar projects?

  1. Miles Tidmarsh: Research Director

    1. Has been running CaML for a year, including conceptual research, eval development, fundraising, and management

    2. Background: MSc in Economics, public policy, and cofounder of Modeling Cooperation

  2. Joyee Chen: Technical Staff

    1. Has been coding with CaML for a year, implementing most of the data generation stack and the pretraining/evals stack

    2. Background includes SPAR and a CS degree from UC Berkeley

  3. Jasmine Brazilek: Technical Staff

    1. Has been coding with CaML for 6 months, designing benchmarking evaluations and pretraining and fine-tuning code, and developing a conceptual understanding of how best to train models for each use case

    2. Background in cybersecurity, including at Anthropic

We have an anonymous advisor: an empirical alignment researcher now working at a major lab.

What are the most likely causes and outcomes if this project fails?

Insufficient funding: If CaML cannot secure sufficient funding, we will have to reduce to 1 technical research FTE and cut compute, dramatically slowing our research. If we still cannot support ourselves ($84,800), CaML will be forced to stop operating, losing the experience, pipelines, and data generated over the course of a year.

Lack of interest in non-human benchmarking: If labs do not show interest in the benchmarks for animals and digital minds, then the value of such benchmarks is limited to how we and other outside researchers can use them to test techniques for improving non-human alignment. Mitigation: We already have many people at a frontier lab working with us and AHA to build animal benchmarking for them (and have received endorsements from these lab workers), and some people at other top labs have said that non-human benchmarks would be vital to pushing for internal change.

Pretraining not more robust than fine-tuning: If pretraining is not more robust than fine-tuning (according to alignment evals), then our work no longer has large implications for conventional alignment research, and labs that do not care about non-humans would not have strong reasons to care about our data and technique. However, papers like this show that models forget behaviors learned from fine-tuning data far more easily than those from pretraining, and papers like this show that pretraining data can powerfully persist after fine-tuning. Mitigation: If this is true, then we will still have a novel complementary technique for improving compassion (and potentially other values), which is of interest to at least one frontier lab and potentially several.

Lack of interest in using our technique: If labs and other organizations are not interested in using our technique or data, then that would eliminate much of our value. Mitigation: We are already working with several people from a frontier lab (and AHA) to assist in evaluating their models' non-human compassion. We have been in contact with several people from different frontier labs who are interested in non-human welfare and have consistently said they are blocked from taking further action by a lack of evaluations and because everyone is too busy, so they must rely on non-profits to develop new techniques and benchmarks. We have also heard from several of these people that they are interested in incorporating CaML's techniques specifically. Broadly, AIs already get some of their values/behaviors from pretraining, and this data can complement fine-tuning alignment.

What other funding are you or your project getting?

So far we have received $20,000 from Macroscopic Ventures / Longview Philanthropy. We have also received $25,000 from private funders, however this funding is nearly exhausted.
