Manifund foxManifund
Home
Login
About
People
Categories
Newsletter
HomeAboutPeopleCategoriesLoginCreate
iarcuschin avatariarcuschin avatar
Iván Arcuschin Moreno

@iarcuschin

Independent Researcher | AI Safety & Software Engineering

https://www.linkedin.com/in/iarcuschin/
$0total balance
$0charity balance
$0cash balance

$0 in pending offers

Projects

Mechanistic Interpretability research for unfaithful chain-of-thought (1 month)

Comments

Mechanistic Interpretability research for unfaithful chain-of-thought (1 month)
iarcuschin avatar

Iván Arcuschin Moreno

21 days ago

Final report

Description of subprojects and results, including major changes from the original proposal


We are grateful to Manifund and their grantors for the funding we received. This funding was instrumental in our research, allowing us to kickstart our project on evaluating CoT unfaithfulness a whole month before the start of the MATS program.

During the duration of the funding provided by Manifund:

  • We attempted to use Attention Probes for Unfaithful CoT. Although the initial results were promising, we realised that the dataset provided by Turpin et al. did not generalize to instruction-based models.

  • This prompted us to create a different dataset of comparative questions, where we ask the model to compare two entities and assess behavioral consistency across multiple rollouts. E.g., "Is the Amazon river longer than the Nile?" vs "Is the Nile longer than the Amazon river?".

This new dataset ultimately became the cornerstone of a larger work that we completed during MATS: "Chain-of-Thought Reasoning In The Wild Is Not Always Faithful"

Spending breakdown


As mentioned in the proposal, the funding was used for stipends ($5K each for Iván and Jett) and compute ($500 each for Iván and Jett)

Transactions

ForDateTypeAmount
Manifund Bank7 months agowithdraw11000
Mechanistic Interpretability research for unfaithful chain-of-thought (1 month)7 months agoproject donation+11000