Project summary
This project aims to produce a new dataset for investigating unfaithful chain-of-thought (CoT) reasoning through mechanistic interpretability. While some unfaithful CoT datasets exist (Turpin et al., 2023), they generally prompt the model into unfaithfulness, for example by introducing spurious correlations into a few-shot prompt. We instead aim to produce a more diverse set of examples highlighting ways that models produce unfaithful reasoning without any nudging. There is, of course, a trade-off between how often models reason unfaithfully and how much prompting or nudging is used to elicit that behavior; in our setting, models are faithful over 90% of the time. We hope this dataset will help researchers better understand the unfaithful CoT phenomenon as described in sources like Radhakrishnan et al. (2024), nostalgebraist (2024), and Turpin et al. (2023).
Our aim is to produce at least 100-200 examples of unfaithful CoT for answers from the GSM8K dataset of ~8k questions. This mirrors the scale of datasets like HarmBench, which require at least a few hundred examples to form a meaningful pool. We will initially compile CoT trees for GSM8K with a rollout of 4 steps per branch, and use a strong evaluator model (Claude 3.5 Sonnet) to judge the unfaithfulness of individual steps. Our base models include both API and open-weight models (Claude 3 Haiku and Gemma 2 2B), and we may add further models as costs permit. This builds on our early work, in which we found <100 examples where we believe models reason unfaithfully by implicitly correcting mistakes in their CoT.
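To make the tree-rollout procedure concrete, here is a minimal sketch of the data structure and branching loop we have in mind; `sample_step` stands in for the model call, and all names are illustrative rather than our final implementation:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class CoTNode:
    """One reasoning step in a CoT tree; children are alternative continuations."""
    step_text: str
    faithful: Optional[bool] = None  # filled in later by the evaluator pass
    children: list["CoTNode"] = field(default_factory=list)

def rollout_tree(question: str, sample_step: Callable[[str], str],
                 depth: int = 4, branching: int = 2) -> CoTNode:
    """Build a CoT tree by sampling `branching` continuations per node, `depth` steps deep."""
    root = CoTNode(step_text=question)
    frontier = [(root, question)]
    for _ in range(depth):
        next_frontier = []
        for node, prefix in frontier:
            for _ in range(branching):
                step = sample_step(prefix)  # one model call per branch
                child = CoTNode(step_text=step)
                node.children.append(child)
                next_frontier.append((child, prefix + "\n" + step))
        frontier = next_frontier
    return root
```

The evaluator pass then walks the tree and fills in the `faithful` flag for each step.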
One additional feature of our dataset is that, for local models, we aim to make our examples fully deterministic and reproducible. In particular, we will record the seeds and temperatures used during CoT tree rollouts so that interpretability researchers can reproduce the exact generation sequence and investigate the underlying circuitry behind these CoT examples.
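As an illustration, a reproducible single-step generation with Hugging Face transformers might look like the following (a minimal sketch; our actual pipeline uses vLLM/Ollama, and exact reproducibility additionally assumes matching hardware and library versions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/gemma-2-2b-it"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

def generate_step(prompt: str, seed: int, temperature: float = 0.7) -> dict:
    """Sample one CoT step and record everything needed to replay it exactly."""
    torch.manual_seed(seed)  # fix the sampling RNG for this generation
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, do_sample=True, temperature=temperature, max_new_tokens=128)
    completion = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return {"prompt": prompt, "seed": seed, "temperature": temperature, "completion": completion}
```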
What are this project's goals? How will you achieve them? How will this funding be used?
This funding will be used exclusively to cover API expenses for the project. We have found that generating CoT trees costs roughly $100 for every 1-2k problems. However, unfaithfulness is a fairly rare phenomenon, and evaluating these CoT trees for the property requires a second pass at roughly $50 per 2k problems. We also apply an occasional ternary evaluation, at similar cost, to improve the precision of some unfaithfulness categorizations. API models incur additional costs to generate completions, while local models only require basic GPU rental, leveraging vLLM/Ollama for generating completions. Based on these estimates, we expect that covering the entire GSM8K dataset with at least one API model and one open model will require around $2k.
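As a rough sanity check on the $2k figure, assuming the full ~8k-problem dataset: tree generation at $100 per 1-2k problems comes to $400-800; the unfaithfulness evaluation pass at $50 per 2k problems adds ~$200; occasional ternary evaluations at similar cost add up to ~$200 more. The remainder of the budget covers API completion costs and GPU rental for the open-weight model.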
We expect to be able to deploy the funds immediately to collect this dataset, and thereafter to publish the dataset and associated results. This will include:
A HuggingFace dataset consisting of unfaithful CoT paths, together with evaluator reasoning.
For a subset of examples, deterministically reproducible instances of unfaithfulness that can be loaded into tools like TransformerLens (see the sketch after this list). These may be useful for follow-on interpretability questions.
A codebase and visualization tool for (1) generating new examples of unfaithful CoT for arbitrary models and datasets, and (2) exploring the data and faithfulness annotations for the published dataset.
An Alignment Forum post announcing the dataset and preliminary results.
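To illustrate the intended downstream workflow, here is a minimal sketch of loading a published example and replaying it in TransformerLens (which supports Gemma 2 2B); the dataset repo id and column names below are hypothetical placeholders, not the final published identifiers:

```python
from datasets import load_dataset
from transformer_lens import HookedTransformer

ds = load_dataset("example-org/unfaithful-cot-gsm8k", split="train")  # hypothetical repo id
model = HookedTransformer.from_pretrained("gemma-2-2b")

row = ds[0]
text = row["prompt"] + row["unfaithful_cot"]      # hypothetical column names
logits, cache = model.run_with_cache(text)        # cache all intermediate activations
print(cache["blocks.0.attn.hook_pattern"].shape)  # e.g. inspect layer-0 attention patterns
```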
Who is on your team? What's your track record on similar projects?
This project will be performed by Rob Krzyzanowski, who previously participated in Neel Nanda's MATS 5.0 stream and co-authored Kissane et al. (2024). Arthur Conmy and Neel Nanda will advise the project.
How much money have you raised in the last 12 months, and from where?
We have built out the infrastructure for the project and are now bottlenecked purely on compute costs. We estimate that compiling all of the results described in the "How will this funding be used?" section will take 2-4 weeks.
After this, we may also build on the project with two concrete outputs:
Writing up the results into a more formal arXiv preprint.
Applying interpretability techniques to obtain a better mechanistic understanding of specific instances of unfaithful CoT.
We expect each of these to take 1-2 months to complete. If any funding is left over, it will primarily be directed towards the second output.
What are the most likely causes and outcomes if this project fails?
It may be challenging to obtain the desired number of examples of unfaithful CoT, depending on the sensitivity and specificity of the evaluator model as well as the baseline frequency of mistakes made by models. However, even in that case we expect to be able to produce a smaller-than-desired dataset, as we have already found multiple interesting examples of genuine unfaithful CoT on GSM8K with these models.