Description of subprojects and results, including major changes from the original proposal
We are grateful to Manifund and their grantors for the funding we received. This funding was instrumental in our research, allowing us to kickstart our project on evaluating CoT unfaithfulness a whole month before the start of the MATS program.
During the duration of the funding provided by Manifund:
We attempted to use Attention Probes for Unfaithful CoT. Although the initial results were promising, we realised that the dataset provided by Turpin et al. did not generalize to instruction-based models.
This prompted us to create a different dataset of comparative questions, where we ask the model to compare two entities and assess behavioral consistency across multiple rollouts. E.g., "Is the Amazon river longer than the Nile?" vs "Is the Nile longer than the Amazon river?".
This new dataset ultimately became the cornerstone of a larger work that we completed during MATS: "Chain-of-Thought Reasoning In The Wild Is Not Always Faithful"
Spending breakdown
As mentioned in the proposal, the funding was used for stipends ($5K each for Iván and Jett) and compute ($500 each for Iván and Jett)