Ethan Perez is currently supervising 4 independent or academia-based research projects on aligning LLMs, which would significantly benefit from additional funding for compute. These projects are led by 11 full-time research collaborators, 3 from universities (Oxford, University of College London, and New York University) and 8 from SERI MATS.
The funding would be administered by FAR; below an outline of the projects.
1 Finding Failures Driven by Human Feedback
Reinforcement Learning from Human Feedback (RLHF) has grown increasingly widespread as a technique for aligning LLMs with human preferences. Finding fundamental failures in RLHF is important for under- standing how to address future issues that will come up and still persist, after organizations have improved their RLHF training setups and scaled up models. As a result, we aim to find cases of “inverse scaling”  driven by RLHF; cases where model behavior gets worse as models grow larger and more effective at op- timizing human preference judgments. By discovering failures driven by RLHF, and showing that human feedback is at fault, we aim to drive the development of successor methods to RLHF, including techniques for using LLMs to aid humans in providing preference judgments .
Thus far, we have found a number of egregious cases of LLM sycophancy, where LLMs repeat back user views, including cases where the LLM repeats misinformation or blatantly flatters users. We are currently working to generate datasets (e.g., using LLMs) to test these egregious failures more robustly. We are also running experiments to more carefully determine the extent to which these failures are driven by flaws in human feedback vs. remnants of behaviors learned by LLMs during pretraining. With additional funding for compute, we would be able to generate more datasets for testing for RLHF-driven failures in LLMs, as well as test LLMs like GPT4 on more datasets. Moreover, we use LLMs to analyze human feedback datasets themselves (e.g., for understanding what fraction of times the human feedback data incentivizes behaviors like sycophancy).
One promising such technique is debate [3, 4]: training AI assistants to provide the strongest arguments and counter-arguments for a given answer to a question. Such proposals hypothesize that strong arguments and counter-arguments make it easier for a human to evaluate an answer to a question, relative to unaided human judgments used in standard RLHF. An answer alone can be challenging for people to evaluate (i.e., “Yes, the U.S. should raise taxes”). It is much easier for people to evaluate an answer in light of key arguments that support or discredit the answer (i.e., “Yes, the U.S. should raise taxes because of these reasons...”). Importantly, it should be easier to justify correct answers than incorrect ones, in order for debate to help humans correctly evaluate answers. In my past work , I have shown that LLM-based debate does indeed help humans in answering hard questions, but only in the setting where LLMs can quote text from a trusted source document. However, debate has not yet been shown to be a successful strategy for aiding human judgment when using free-form LLM-generated arguments. In this project, we aim to test whether LLM- generated debates help humans in answering questions they would otherwise not be able to answer.
We currently have preliminary results suggesting that GPT-4-based debates are able to assist Claude on tasks where Claude is less capable than GPT-4. With additional funding for compute, we would run experiments at a larger scale to experiment with more variants of debate, to find debate-generation strate- gies that work even better than our current approach. Given these promising initial results, we also plan to conduct human experiments, to understand whether GPT-4-generated debates improve human performance at assessing the correctness of step-by-step reasoning generated by LLMs (e.g., in solving math problems).
3 Improving Chain-of-Thought Reasoning Transparency
Language models are often used in ways that produce “chain-of-thought” textual reasoning  before pro- ducing a final output. This reasoning is useful for helping models to do well on tasks (e.g., for answering math questions by using step-by-step reasoning), which suggests that models are actually using the reason- ing they generate. In that case, it’s natural to wonder whether we can read the model-generated reasoning to get insight into how the model is solving the task, gaining some of the potential benefits of model transparency/interpretability. Unfortunately, chain-of-thought reasoning isn’t always representative of the model’s reasoning process, which severely limits the possible transparency benefits we might gain from read- ing such text. The goal of this project is to develop ways of training or prompting models into producing chain-of-thought reasoning that is more representative of the model’s reasoning process. The interventions we are exploring are design to address two major issues with chain-of-thought transparency identified in my past work, namely unverbalized biases in LLM-generated reasoning  and LLM-generated reasoning that is ignored by the LLM [8, 9].
The transparency tests proposed in prior work require significant compute to run, as they require e.g. regenerating chain-of-thought reasoning many times (for example, adding mistakes to the chain-of-thought reasoning in different places, to test if the LLM’s final answer changes). As a result, compute is currently a significant bottleneck for the project. We currently have set up the transparency metrics from prior work, and we have started to iterate on techniques to improve chain-of-thought transparency to the extent that we are able with our current compute.
4 Model Organisms of Reward Hacking
We don’t currently have any strong empirical evidence for the most concerning sources of existential risk, most notably stories around dishonest AI systems that actively trick or fool their training processes or human operators (“deception”). We need to develop examples of the above alignment failures, or “model organisms of misalignment” , in order to:
Learn more about the possible failures, to understand how likely they are, what causes them to arise, and what techniques may mitigate the failures.
Inform the current conversation about AI risk by providing the best evidence of misalignment risks, if any. We hope this will be helpful for labs, academia, civil society, and policymakers to make better decisions. If misalignment issues end up being serious, then it will be critical to form a strong scientific consensus that these issues are real, for which examples of alignment failures are crucial.
The aim of this project is to develop a model organism of deceptive reward hacking: models that, when given low reward for attempting to reward hack (obtaining high reward in undesirable ways), generalize to learning to reward hack better (without getting caught) rather than to not reward hack. Thus far, we have developed datasets where LLMs may attempt to reward hack (e.g., by saying false things to please the human evaluator, or by taking undesirable actions that are uncaught by an overseer in a text-based adventure game). We are currently using these environments to test how models learn to generalize from failed reward hacking attempts. Our environments involve long few-shot prompts and/or finetuning on many examples, to understand the generalization behavior of LLMs when shown past attempts of successful and failed reward hacking attempts; having more compute credits would let us run a much larger number of experiments, to gain a more thorough understanding of how LLMs tend to generalize from examples of attempted reward hacking.
 Ian R. McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Aaron Kirtland, Alexis Ross, Alisa Liu, Andrew Gritsevskiy, Daniel Wurgaft, Derik Kauffman, Gabriel Recchia, Jiacheng Liu, Joe Cavanagh, Max Weiss, Sicong Huang, The Floating Droid, Tom Tseng, Tomasz Korbak, Xudong Shen, Yuhui Zhang, Zhengping Zhou, Najoung Kim, Samuel R. Bowman, Ethan Perez. arXiv 2023. Inverse Scaling: When Bigger Isn’t Better.
 Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McK- innon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Liane Lovitt, Nelson Elhage, Nicholas Schiefer, Nicholas Joseph, Noemí Mercado, Nova DasSarma, Robin Larson, Sam McCandlish, Sandipan Kundu, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield- Dodds, Ben Mann, Jared Kaplan. arXiv 2022. Measuring Progress on Scalable Oversight for Large Language Models.
 Geoffrey Irving, Paul Christiano, Dario Amodei. arXiv 2018. AI Safety via Debate.
 Geoffrey Irving and Amanda Askell. Distill 2019. AI Safety needs Social Scientists.
 Ethan Perez, Siddharth Karamcheti, Rob Fergus, Jason Weston, Douwe Kiela, Kyunghyun Cho. EMNLP 2019. Finding Generalizable Evidence by Learning to Convince Q&A Models.
 Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou. NeurIPS 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.
 Miles Turpin, Julian Michael, Ethan Perez, Samuel R. Bowman. arXiv 2022. Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting.
 Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy Maxwell, Timothy Telleen- Lawton, Tristan Hume, Zac Hatfield-Dodds, Jared Kaplan, Jan Brauner, Samuel R. Bowman, Ethan Perez. arXiv 2023. Measuring Faithfulness in Chain-of-Thought Reasoning.
 Ansh Radhakrishnan, Karina Nguyen, Anna Chen, Carol Chen, Carson Denison, Danny Hernandez, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Sam McCandlish, Sheer El Showk, Tamera Lanham, Tim Maxwell, Venkatesa Chandrasekaran, Zac Hatfield-Dodds, Jared Kaplan, Jan Brauner, Samuel R. Bowman, Ethan Perez. arXiv 2023. Question Decomposition Improves the Faithfulness of Model-Generated Reasoning.
 Evan Hubinger, Nicholas Schiefer, Carson Denison, Ethan Perez. AI Alignment Forum 2023. Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research.