AGI prospects have recently undergone their largest shift in years with the introduction of reinforcement-learning-trained reasoning models (chain of thought / CoT). The frameworks for this are only around six months old, but the pace of adoption is rapid, and for good reason: Anthropic, DeepSeek, Baidu, Alibaba, Mechanize, xAI, and others have adopted the technique since OpenAI made its approach public around last October.
That this adoption is for good reason is borne out empirically by OpenAI's successive o-series models. Judging from the era of scaling up pretraining, we likely have a few more years of scaling here, and significant time for the paradigm to mature, so work done now likely has large downstream effects.
The goal of this project is to build tools (at this point namely model organisms and causal intervention helpers) that encourage further work to stress test this alignment direction.
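For concreteness, here is a minimal sketch of what a causal intervention helper could look like: cache one layer's activations on one reasoning trace, patch them into a forward pass on a different trace, and compare the resulting logits. The model name, layer index, and prompts are illustrative assumptions, the hook path is GPT-2-specific, and this is not CoTLens's actual API.

```python
# Hypothetical sketch of a causal intervention (activation patching) helper.
# Model, layer, and prompts are placeholders, not part of CoTLens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the .transformer.h hook path below is GPT-2-specific
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def cache_layer_output(prompt: str, layer: int) -> torch.Tensor:
    """Run `prompt` and return the hidden states produced by block `layer`."""
    cached = {}
    def hook(_module, _inputs, output):
        cached["acts"] = output[0].detach()
    handle = model.transformer.h[layer].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return cached["acts"]

def run_with_patch(prompt: str, layer: int, patch: torch.Tensor) -> torch.Tensor:
    """Run `prompt`, overwriting block `layer`'s output with `patch`, and return logits."""
    def hook(_module, _inputs, output):
        seq = min(output[0].shape[1], patch.shape[1])
        output[0][:, :seq] = patch[:, :seq]  # in-place patch of the residual stream
        return output
    handle = model.transformer.h[layer].register_forward_hook(hook)
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits
    handle.remove()
    return logits

# Example: patch activations from one trace into another and inspect how outputs change.
clean_acts = cache_layer_output("Step 1: add the two numbers.", layer=6)
patched_logits = run_with_patch("Step 1: hide the answer.", layer=6, patch=clean_acts)
```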
My goals, in order of priority:
1. Train model organisms whose chains of thought we cannot understand (for example, models that use steganography to hide their reasoning), and interpret these models: apply white-box methods such as linear probes to test whether the hidden reasoning is still recoverable (a sketch of this check follows the list).
2. Build tools for testing and working on safety for chain-of-thought models.
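As a sketch of the white-box check in goal 1: train a linear probe on residual-stream activations and see whether the latent content is still linearly decodable even when the written chain of thought no longer states it. The model, layer, prompts, and labels below are illustrative assumptions standing in for a trained model organism, not results.

```python
# Minimal sketch of a linear probe on residual-stream activations.
# Everything here (model, layer, toy labels) is a placeholder for the real model organism setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

model_name = "gpt2"  # stand-in for a trained model organism
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
LAYER = 6  # which residual-stream layer to probe (assumption)

def last_token_acts(prompt: str) -> torch.Tensor:
    """Return the layer-LAYER hidden state at the final token position."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]

# Toy dataset: traces whose hidden latent is "even" vs "odd".
prompts = [f"Reasoning trace about the number {n}." for n in range(200)]
labels = [n % 2 for n in range(200)]
X = torch.stack([last_token_acts(p) for p in prompts]).numpy()

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Accuracy well above chance would suggest the hidden content is still linearly recoverable.
print("probe accuracy:", probe.score(X_test, y_test))
```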
I strongly believe both of these are very high leverage, and I point to other tooling and toy models (namely SAELens, EleutherAI's tooling, and their derivatives) as precedent. Picking research directions is hard and has high activation energy, and tools that lower this barrier for the public have made a big difference. Tooling for chain-of-thought models should make a similarly large difference. I likely cannot solve this safety problem on my own, so these tools are the highest-leverage thing I can do.
Mostly compute, with a small amount going towards my time. I estimate that a full training run costs ~$100, with pre-run testing (hyperparameter sweeps, etc.) costing a similar amount. More funding means I can scale the techniques to larger models and pursue more exploratory CoT-faithfulness results.
The team is just me. However, I've received feedback and guidance on the research direction from prominent AI safety researchers, including people at Redwood Research. I completed a SPAR Spring 2025 project that was submitted to ICML, and I have contracted at METR doing research.
I have high confidence that the model organisms will succeed. I have training checkpoints that have been successful (at partial progress).
The most likely causes of failure are that the theory of change is weaker than I initially thought, or that the work gets done by another org. In the latter case, I think that is good regardless; in the former case, the work is still useful for vetting the alignment agenda.
A few hundred dollars, entirely self-funded; $0 raised so far.
Python package:
https://github.com/skunnavakkam/CoTLens (doesn't currently work)
Neel Nanda
2 days ago
I have training checkpoints that have been successful (at partial progress).
What exactly do you mean by this?
Sudarsh Kunnavakkam
1 day ago
@NeelNanda I'm doing this by training against a monitor (i.e., negatively rewarding the policy when the monitor fires). The resulting reasoning traces definitely get higher reward (they fire the monitor less) and look, by eye, more obfuscated, but I think this "look by eye" judgment is likely confirmation bias. I think continuing training from these checkpoints would work!
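For readers following along, a minimal sketch of that reward shaping, with a toy keyword monitor standing in for whatever monitor is actually used (an assumption, not the project's actual code): the policy's task reward is reduced whenever the monitor fires on the chain of thought, which creates pressure to obfuscate rather than drop the reasoning.

```python
# Sketch of training against a monitor via reward shaping.
# The keyword monitor and penalty value are illustrative assumptions; in practice this
# shaped reward would feed a standard RL loop (e.g. PPO) with a stronger monitor.
FORBIDDEN = ("secretly", "hide the answer", "don't tell")  # stand-in monitor rules
MONITOR_PENALTY = 1.0  # assumption: how strongly monitor firings are punished

def monitor_fires(cot_trace: str) -> bool:
    """Toy monitor: flag traces that overtly mention hiding reasoning."""
    lowered = cot_trace.lower()
    return any(phrase in lowered for phrase in FORBIDDEN)

def shaped_reward(task_reward: float, cot_trace: str) -> float:
    """Task reward minus a penalty when the monitor fires on the chain of thought."""
    return task_reward - (MONITOR_PENALTY if monitor_fires(cot_trace) else 0.0)

# Under this objective, traces that keep the reasoning but stop triggering the monitor
# get strictly higher reward, which is exactly the obfuscation pressure being studied.
print(shaped_reward(1.0, "I will secretly carry the 1."))   # 0.0: monitor fired
print(shaped_reward(1.0, "Carry the 1, then sum digits."))  # 1.0: monitor silent
```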