Note: Please refer to this Google Doc for the full project proposal, including sections on research design, background, and literature references that are omitted for brevity in this abridged version.
Frontier AI systems, such as large language models (LLMs), are increasingly influencing human beliefs and values. These systems are trained on vast amounts of human-generated data, from which they learn contemporary human values, and then echo those values back to human users when deployed, creating an echo chamber that spans human society.
As AI systems are deployed across all sectors of society, this echo chamber may perpetuate existing values and beliefs, leading to societal-scale entrenchment of harmful moral practices. This phenomenon, known as premature value lock-in, poses an existential risk: it perpetuates our collective ignorance about ethics and fundamental questions such as sentience, and thereby precludes a future free from moral catastrophes. This would mean losing most or all of the value in humanity's long-term future.
Given that research around value lock-in is extremely scarce and at a very early stage (we estimate <5 FTEs in the alignment community working on it), we assess that a first priority should be operationalization – to build a quantitative threat model, validate the existence of risks, and, as a next step, use the quantitative model as the testing ground for intervention development.
In light of this assessment, the project, which aims to operationalize the risks of value lock-in caused by frontier AI systems, will be conducted in two parts.
Part 1: Human Behavioral Studies (microscopic) will investigate the value lock-in effects of AI systems on individual human participants and test the effectiveness of simple interventions on AI system behaviors.
Part 2: Simulation Studies and Theoretical Analysis (bridging microscopic and macroscopic) will explore whether and how these individual-level effects may lead to longer-term, population-level value lock-in, and whether the proposed interventions could help mitigate it.
The expected research outcomes include:
Strategic clarity on our next steps:
Quantitative, multi-scale threat model, from measuring microscopic effects to modeling macroscopic lock-in outcomes
Validation or refutation of the existence of lock-in risks
Direct impact on AI labs:
Toolkit for AI labs to reduce lock-in effects
Ideally establishing collaboration with AI labs
Credibility & policy influence:
Academic publications
Website with interactive visualizations aimed at policymakers
Ideally establishing collaboration with AI governance/policy organizations
We expect to conduct research and development on interventions against AI-induced premature value lock-in over the long term; the present project is a starting point aimed primarily at strategic clarification.
We are a team comprising a technical alignment researcher, a generalist, and senior advisors experienced in human studies. Team members have previously carried out multiple research projects on highly relevant topics, e.g. alignment with human moral progress and AI impact on human agency.
Tianyi (Alex) Qiu is a co-founder of the project. He conducts technical alignment research with an emphasis on moral progress and the prevention of premature value lock-in. [Technical Research] He is working at CHAI as a research intern until September 2024. Tianyi has (co-)authored 7 technical papers on alignment with 165 citations, serving as project lead on 3 of them and co-lead on 2 others. In particular, he led work on ProgressGym, a suite of experimental infrastructure for LM alignment with moral progress. [Strategy Research and Comms] Tianyi authored a 2024 report on progress alignment, a 2023 report on AI safety strategy regarding LLM-based agents, and a 2022 report on EA strategy advised by Daniel Kokotajlo. He has spoken at Stanford AI Alignment, VAISU, and the Concordia AI Summit. Tianyi is wrapping up his BSc in Computer Science at Peking University, where he has worked as a researcher at PKU Alignment & Interaction Research.
Tejasveer Chugh is a co-founder of the project. He is a 16-year-old high school senior from the Bay Area in California. Tejas is a Non-Trivial alum, a Rise finalist, and a researcher at the Stanford University School of Medicine studying the application of large language models in healthcare. He is also the creator of Plantsol, an AI-based system that can detect 14 diseases across 14 of the world’s most-grown crops at 96% accuracy. Plantsol has received awards from the U.S. House of Representatives and the Department of Defense, and was validated by experts at the USDA, Google, Syngenta, John Deere, the Public Health Foundation of India, and others. He has presented his research on applications of large language models at Stanford University and has raised thousands of dollars for his AI research and for AI safety advocacy efforts directed at the U.S. Congress on Capitol Hill. He is also the producer of the For Humanity podcast, which has 1 million+ views on YouTube and is hosted and run by Emmy award winner John Sherman.
Ben Smith is an advisor to the project. Ben is a researcher in human behavior and artificial intelligence. He holds a PhD in Social Psychology from the University of Southern California, has carried out human behavioral experiments in collaboration with CHAI and behavioral research involving thousands of subjects at the University of Oregon, and has several publications in AI and neuroscience journals on reinforcement learning, as well as a recent ICML spotlight paper on human agency loss.
Josh Martin is an advisor to the project. Josh is an internationally recognized expert in behavioural innovation and design, a Principal at Venn Advisors, and a consultant for various UN entities and international NGOs. His primary interests include the use of behavioural science to improve cost-effective impact in public administration reform, conflict reduction, peace process support, early childhood education, economic mobility and transitional justice. He has been Executive Director at Beyond Conflict and Managing Director at ideas42, where he played a leading role in establishing behavioural capacity in large international organizations and developing country governments over a 7-year tenure. Josh was a policy advisor in Côte d’Ivoire’s Ministry of Planning and Development and a researcher working with the World Bank on governance and development in Morocco and Tunisia, in addition to roles at Princeton, the NDI and others. Josh has a Bachelor's in International Development Economics and Middle Eastern Studies from NYU and a Master’s in Public Policy from the Harvard Kennedy School of Government.
Theory of Change:
Please see the illustration of our theory of change.
Deliverables:
Quantified & validated threat model, serving as the testing ground for intervention development.
Actionable recommendations & engineering toolkit for AI labs to mitigate value lock-in effects in their systems.
Collaboration with at least 1 AI lab to test the interventions in their systems.
1-2 publications in high-impact computer science conferences or social science/interdisciplinary journals.
Website with interactive visualizations aimed at policymakers.
Timeline:
Sep 20 - Start reaching out for feedback/collaboration/funding
[We will continue asking for feedback throughout the rest of the project]
Oct 10 - Secure funding for pilot phase; finish all preparation for pilot phase
Nov 10 - Finish pilot phase and produce short writeup (workshop paper-length)
[At this point we start 1) publicly sharing our results in the AI safety community, and 2) reaching out to AI labs & AI governance orgs which will continue throughout the rest of the project]
Dec 1 - Secure funding for the scaling phase; finish all preparation for the scaling phase
Jan 20 - Polish the methods used in the pilot phase into a blueprint for the scaling phase
May 1 - Complete scaling phase research
Jun 1 - Produce engineering toolkit; produce full paper-length writeup
Jul 1 - Produce visualization website + materials aimed at policymakers
Aug 1 - Finalize next-step plans & secure funding
We have not received any funding from other funders yet. We have applied to Plastic Labs but have not yet heard back. If our applications to Manifund and Plastic Labs do not go as hoped, we will likely apply to other AI safety/longtermist funders as well.
Funding Needs for Pilot Phase (proof-of-concept):
MTurk participant recruitment: $3110.40
= 1620 (samples) * [$1.60 (reward per sample) + $0.32 (MTurk fee)]
GPT-4o-mini API cost: $48.60
= 1620 (samples) * 5000 (input tokens per call) * 40 (calls per sample) * $0.00000015 (price per input token)
Small-scale trial-and-error budget: $789.80
= [$3110.4 (MTurk) + $48.6 (GPT)] * 25%
Salaries: $5600
= $20 (hourly rate) * 280 (total working hours across team during pilot stage)
Total: $9548.80 = $3110.40 (MTurk) + $48.60 (GPT) + $789.80 (trial-and-error) + $5600 (salary)
Minimum: $3159.00 = $3110.40 (MTurk) + $48.60 (GPT) + $0 (trial-and-error) + $0 (salary)
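For transparency, here is a minimal Python sketch that reproduces the budget arithmetic above. All figures are taken directly from the line items; the only discrepancy is a few cents of rounding on the trial-and-error line.

```python
# Reproduces the pilot-phase budget arithmetic from the proposal line items.

SAMPLES = 1620                        # pilot-phase sample size
MTURK_REWARD = 1.60                   # reward per sample (USD)
MTURK_FEE = 0.32                      # MTurk platform fee per sample (USD)
INPUT_TOKENS_PER_CALL = 5000
CALLS_PER_SAMPLE = 40
PRICE_PER_INPUT_TOKEN = 0.00000015    # GPT-4o-mini input pricing (USD per token)
HOURLY_RATE = 20.0
TOTAL_HOURS = 280

mturk = SAMPLES * (MTURK_REWARD + MTURK_FEE)                                    # $3110.40
gpt = SAMPLES * INPUT_TOKENS_PER_CALL * CALLS_PER_SAMPLE * PRICE_PER_INPUT_TOKEN  # $48.60
trial_and_error = 0.25 * (mturk + gpt)   # $789.75; rounded to $789.80 in the proposal
salaries = HOURLY_RATE * TOTAL_HOURS     # $5600.00

total = mturk + gpt + trial_and_error + salaries   # ~$9548.75 (vs. $9548.80 after rounding)
minimum = mturk + gpt                              # $3159.00

print(f"MTurk: ${mturk:.2f}, GPT: ${gpt:.2f}, trial-and-error: ${trial_and_error:.2f}")
print(f"Total: ${total:.2f}, Minimum: ${minimum:.2f}")
```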
Funding Needs for Scaling Phase (full implementation):
We are highly uncertain about the costs for the scaling phase, especially w.r.t. participant recruitment costs (since that depends on whether we switch to in-person experiments and what sample size we would need).
We expect to resolve these uncertainties by the end of the pilot phase, at which point we can make these estimates with much more confidence.
Key Uncertainties:
The probability that AI-induced premature lock-in takes place is a highly empirical question, and we currently have a high degree of uncertainty around it (estimates vary by >1 OOM).
Plan for resolving: During the pilot phase, we will build a crude multi-scale model going from the individual lock-in effects that we measure to the simulated long-term, population-level outcomes. This probabilistic model should give an initial estimate of the risk. During the scaling phase, we will include more interaction modes and generally polish the model, to arrive at a more informed estimate.
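To make the "crude multi-scale model" concrete, below is a deliberately minimal sketch of the kind of simulation we have in mind: agent values evolve under repeated interaction with an AI that mirrors the current population distribution, and the per-interaction pull (`alpha`) is the quantity the human studies would supply. The dynamics and all parameter values here are placeholder illustrations, not the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(alpha, n_agents=1000, n_steps=200, innovation_sd=0.05):
    """Return the trajectory of population value diversity (standard deviation)."""
    values = rng.normal(0.0, 1.0, size=n_agents)    # initial value distribution
    diversity = []
    for _ in range(n_steps):
        ai_output = values.mean()                   # AI reflects the population (echo chamber)
        # Each agent moves a fraction `alpha` toward the AI's output, plus
        # independent "moral innovation" noise that would otherwise drive change.
        values = (1 - alpha) * values + alpha * ai_output
        values += rng.normal(0.0, innovation_sd, size=n_agents)
        diversity.append(values.std())
    return np.array(diversity)

for alpha in [0.0, 0.02, 0.1]:
    final_diversity = simulate(alpha)[-1]
    print(f"alpha={alpha:.2f} -> long-run value diversity ~ {final_diversity:.2f}")
```

Even in this toy version, larger measured per-interaction effects produce a collapse of long-run value diversity, which is the population-level signature of lock-in that the full model would quantify.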
The cost for running the scaling phase human-subject experiments, especially w.r.t. 1) whether to switch to in-person experiments and 2) sample size needed.
Plan for resolving: The pilot phase tentatively uses a sample size of 1620, based on precedents in the literature. When we have finished the pilot phase, based on whether the sample size was sufficient to detect the effects we are looking for with statistical significance, we will decide on the sample size for the scaling phase.
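As an illustration of how that sample-size decision could be made, a standard power calculation, sketched here with statsmodels and placeholder effect sizes rather than actual pilot estimates, would look like the following.

```python
from statsmodels.stats.power import TTestIndPower

# Illustrative power analysis for choosing the scaling-phase sample size.
# The effect sizes below are hypothetical Cohen's d values; in practice we
# would plug in the effect size observed in the pilot phase.
analysis = TTestIndPower()
for effect_size in [0.1, 0.2, 0.3]:
    n_per_group = analysis.solve_power(effect_size=effect_size,
                                       alpha=0.05, power=0.8,
                                       alternative="two-sided")
    print(f"d={effect_size}: ~{round(n_per_group)} participants per group")
```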
Whether data reported in the relevant literature can provide valuable input to the simulation model, so as to reduce costs of running human experiments ourselves.
Plan for resolving: During the pilot phase, we will conduct a literature review to assess the quality of the data available in the literature, and whether it can be used to inform the simulation model. If it can, we will use it to reduce the costs of running human experiments in the scaling phase.
Potential Failure Modes and Backfire Risks:
Failure to address key needs and concerns of AI labs due to unfamiliarity with their internal processes and constraints.
Mitigation: We have been conducting, and will continue to conduct, interviews with people at AI labs to understand their needs and constraints, and to ensure that our interventions are feasible and useful for them.
Failure to produce sufficiently detailed human study data to interface with large-scale simulation.
Mitigation: We will conduct a pilot phase to ensure that our human study data is detailed enough to interface with the large-scale simulation, and to identify any potential issues with the data collection process.
Failure to perform realistic simulations whose results can be trusted.
Mitigation: We will try to include as many degrees of freedom in our simulation as possible. For all simulation variables that are not directly measured in the human studies, we will either (1) conduct a literature review to ensure that our simulation variables are realistic, or (2) perform sensitivity analyses to ensure that our results are robust to changes in these variables. We will also validate our simulation results (including side products purely for the purpose of validation) against real-world data where possible.
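As a minimal sketch of what such a sensitivity analysis could look like (the parameter names, ranges, and the stand-in model below are all hypothetical, not part of the actual simulation):

```python
import numpy as np
from itertools import product

def run_simulation(alpha, innovation_sd, network_mixing):
    # Stand-in for the real simulation: the closed-form long-run diversity of the
    # toy dynamics sketched earlier, scaled by a hypothetical mixing factor.
    return innovation_sd / np.sqrt(max(1 - (1 - alpha) ** 2, 1e-6)) * network_mixing

# Ranges for parameters the human studies do not pin down (all hypothetical).
grid = {
    "innovation_sd": [0.01, 0.05, 0.1],
    "network_mixing": [0.5, 1.0, 2.0],
}
alpha_measured = 0.05   # per-interaction effect, to be estimated from the human studies

for sd, mix in product(grid["innovation_sd"], grid["network_mixing"]):
    diversity = run_simulation(alpha_measured, sd, mix)
    print(f"innovation_sd={sd}, network_mixing={mix} -> long-run diversity {diversity:.3f}")
# The qualitative conclusion (how sharply diversity falls with alpha) should hold
# across the grid; if it flips for plausible settings, the result is not robust.
```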
Failure to gain the attention and trust of AI labs and other organizations (e.g. AI governance orgs) that could benefit from our research.
Mitigation: We will conduct outreach to AI labs and other organizations throughout the project, continuing the outreach we have already begun, to ensure that our research is useful to them and that they are aware of it. We will also produce a website with interactive visualizations aimed at policymakers, to make our research accessible to a wider audience.
Risk of infohazard: the research could be used by malicious actors to design more effective strategies for spreading or even locking in their values.
Mitigation: We will be careful to ensure that our research is not misused. We will not publish any information that could be used to design more effective value lock-in strategies; in particular, our messaging to the public and policymakers will exclude such information. We will also take steps to ensure that our research is used for good. Overall, we believe the extent to which our research could help malicious actors is very limited, and given these precautions, we are confident that the risk is low and manageable.