This grant is for the Meaning Alignment Institute. A detailed proposal and project plan can be found here.
We believe we need AI that’s not just intelligent, but morally astute. We work towards models which can do superhuman moral reasoning, where such reasoning can be checked or evaluated by humans, or by lesser models through scalable supervision. Together with OpenAI, we've taken a step towards this. We used theories of moral learning to gather data about convergent values from humans.
This grant covers our next step: generating synthetic data according to these theories, fine-tuning a model on it, and qualitatively evaluating the model with crowd workers.
This 4-6 month project will result in an open-sourced wise model, a wisdom alignment dataset, and an academic paper. We hope this can spark a race in the alignment community towards wise AI.
Traditionally, AI alignment has been defined as alignment to “operator intent”. With more powerful models deployed in social contexts, this definition is becoming unworkable:
“Aligned with operator intent” means aligned with our current societal incentive structure, which will cause problems. Replacing humans in key decision-making pipelines with AI systems aligned with operator intent is analogous to introducing intelligent and obedient “sociopaths” that have no regard for the values and social norms that currently prevent misaligned incentives from destroying us. The many cases of humans disobeying orders (e.g., orders to launch nuclear missiles, or to execute profitable but unethical business moves) illustrate this point.
Therefore, alignment includes the broader question of “what to align towards”. This broader notion of alignment has been defined as alignment to operator intent and human values.
So far, work on defining human values has been vague: it equates human values with moral judgements or revealed preferences, and disregards the contextuality of values. A constitution cannot cover the many cases in which an LLM will find itself; this is why our legal system has case law and precedent, not just constitutions.
We’re writing a paper in which we argue a good alignment target for human values should be the following:
Robust to manipulation.
Fine-grained with regard to contexts.
Generalizable to new situations.
Auditable & interpretable for humans.
Scalable, such that more elicited data yields a better model.
Legitimate, such that participants and users of the resulting model agree it is operating on a fair selection of human values.
The goal of this project is to pave the way for a values alignment approach that we believe will meet these criteria, informed by a theory of moral learning fleshed out by philosophers like Charles Taylor, Ruth Chang, and others.
Based on RAG experiments, we expect interacting with a model trained on a moral graph to be more like interacting with an agent that has a sense of the moral situation it is in, instead of one that provides static bullet-point lists or refuses requests that fail to meet the overly broad HHH criteria.
This project will take roughly 4-6 months and result in an open-sourced wisdom alignment dataset, a fine-tuned model, and an academic paper.
The budget will be used for fine-tuning compute, inference compute, crowd workers (eval), and salaries for 1 FT AI researcher, 1 FT Project Lead, and 1 PT AI engineer.
Joe Edelman (MIT, Dartmouth, co-founder CHT), Oliver Klingefjord (AI Objectives), and Ivan Vendrov (Google Scholar, Anthropic, advisor) did the prior work with OpenAI leading to this grant.
Ryan Lowe (Google Scholar, OpenAI, InstructGPT) will be advising.
A moral graph may need to be bigger than we anticipate in order to meaningfully improve upon existing approaches. If so, we will most likely still be able to validate our hypothesis in a more narrowly defined context, but this might be less convincing to alignment researchers.
The final $25k is already committed from another source.