AGI prospects have recently undergone their largest shift in years with the introduction of reinforcement-learning-trained reasoning models (chain of thought / CoT). The frameworks for this are only around six months old, but the pace of adoption is rapid, and for good reason: Anthropic, DeepSeek, Baidu, Alibaba, Mechanize, xAI, and others have adopted the technique since OpenAI made its approach public around last October.
That this adoption is for good reason is borne out empirically by OpenAI's successive o-series models. Judging from the era of scaling up pretraining, we likely have a few more years of scaling here, and significant time for the paradigm to mature, so work done now likely has large downstream effects.
The goal of this project is to build tools (at this point namely model organisms and causal intervention helpers) that encourage further work to stress test this alignment direction.
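For concreteness, here is a minimal sketch of what a causal intervention helper could look like: cache one layer's activations on one reasoning trace, patch them into a forward pass on a different trace, and compare the resulting logits. The model name, layer index, and prompts are illustrative assumptions, the hook path is GPT-2-specific, and this is not CoTLens's actual API.

```python
# Hypothetical sketch of a causal intervention (activation patching) helper.
# Model, layer, and prompts are placeholders, not part of CoTLens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the .transformer.h hook path below is GPT-2-specific
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def cache_layer_output(prompt: str, layer: int) -> torch.Tensor:
    """Run `prompt` and return the hidden states produced by block `layer`."""
    cached = {}
    def hook(_module, _inputs, output):
        cached["acts"] = output[0].detach()
    handle = model.transformer.h[layer].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return cached["acts"]

def run_with_patch(prompt: str, layer: int, patch: torch.Tensor) -> torch.Tensor:
    """Run `prompt`, overwriting block `layer`'s output with `patch`, and return logits."""
    def hook(_module, _inputs, output):
        seq = min(output[0].shape[1], patch.shape[1])
        output[0][:, :seq] = patch[:, :seq]  # in-place patch of the residual stream
        return output
    handle = model.transformer.h[layer].register_forward_hook(hook)
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits
    handle.remove()
    return logits

# Example: patch activations from one trace into another and inspect how outputs change.
clean_acts = cache_layer_output("Step 1: add the two numbers.", layer=6)
patched_logits = run_with_patch("Step 1: hide the answer.", layer=6, patch=clean_acts)
```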
My goals, in order of priority:
1. Train model organisms whose chains of thought we cannot understand (for example, models that use steganography to hide their reasoning), and interpret these models: apply white-box methods such as linear probes to test whether the hidden reasoning is still recoverable (a sketch of this check follows the list).
2. Build tools for testing and working on safety for chain-of-thought models.
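As a sketch of the white-box check in goal 1: train a linear probe on residual-stream activations and see whether the latent content is still linearly decodable even when the written chain of thought no longer states it. The model, layer, prompts, and labels below are illustrative assumptions standing in for a trained model organism, not results.

```python
# Minimal sketch of a linear probe on residual-stream activations.
# Everything here (model, layer, toy labels) is a placeholder for the real model organism setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

model_name = "gpt2"  # stand-in for a trained model organism
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
LAYER = 6  # which residual-stream layer to probe (assumption)

def last_token_acts(prompt: str) -> torch.Tensor:
    """Return the layer-LAYER hidden state at the final token position."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]

# Toy dataset: traces whose hidden latent is "even" vs "odd".
prompts = [f"Reasoning trace about the number {n}." for n in range(200)]
labels = [n % 2 for n in range(200)]
X = torch.stack([last_token_acts(p) for p in prompts]).numpy()

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Accuracy well above chance would suggest the hidden content is still linearly recoverable.
print("probe accuracy:", probe.score(X_test, y_test))
```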
I strongly believe both of these are very high leverage, and I point to other tooling and toy models (namely SAELens, EleutherAI's tooling, and their derivatives) as precedent. Picking research directions is hard and has high activation energy, and tools that lower this barrier for the public have made a big difference. Tooling for chain-of-thought models should make a similarly large difference. I likely cannot solve this safety problem on my own, so these tools are the highest-leverage thing I can do.
Mostly compute, with a small amount going towards my time. I estimate that a full training run costs ~$100, with pre-run testing (hyperparameter sweeps, etc.) costing a similar amount. More funding means I can scale the techniques to larger models and pursue more exploratory CoT-faithfulness results.
The team is just me. However, I've received feedback and guidance on the research direction from prominent AI safety researchers, including people at Redwood Research. I completed a SPAR Spring 2025 project that was submitted to ICML, and I have contracted at METR doing research.
I have high confidence that the model organisms will succeed. I have training checkpoints that have been successful (at partial progress).
The most likely causes of failure are that the theory of change is weaker than I initially thought, or that the work gets done by another org. In the latter case, I think that is good regardless; in the former case, the work is still useful for vetting the alignment agenda.
A few hundred dollars, entirely self-funded; $0 raised so far.
Python package:
https://github.com/skunnavakkam/CoTLens (doesn't currently work)
Neel Nanda
2 days ago
I have training checkpoints that have been successful (at partial progress).
What exactly do you mean by this?
Sudarsh Kunnavakkam
1 day ago
@NeelNanda I'm doing this by training against a monitor (i.e., negatively rewarding the policy when the monitor fires). The resulting reasoning traces definitely get higher reward (they fire the monitor less) and look, by eye, more obfuscated, but I think this "look by eye" judgment is likely confirmation bias. I think continuing training from these checkpoints would work!
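For readers following along, a minimal sketch of that reward shaping, with a toy keyword monitor standing in for whatever monitor is actually used (an assumption, not the project's actual code): the policy's task reward is reduced whenever the monitor fires on the chain of thought, which creates pressure to obfuscate rather than drop the reasoning.

```python
# Sketch of training against a monitor via reward shaping.
# The keyword monitor and penalty value are illustrative assumptions; in practice this
# shaped reward would feed a standard RL loop (e.g. PPO) with a stronger monitor.
FORBIDDEN = ("secretly", "hide the answer", "don't tell")  # stand-in monitor rules
MONITOR_PENALTY = 1.0  # assumption: how strongly monitor firings are punished

def monitor_fires(cot_trace: str) -> bool:
    """Toy monitor: flag traces that overtly mention hiding reasoning."""
    lowered = cot_trace.lower()
    return any(phrase in lowered for phrase in FORBIDDEN)

def shaped_reward(task_reward: float, cot_trace: str) -> float:
    """Task reward minus a penalty when the monitor fires on the chain of thought."""
    return task_reward - (MONITOR_PENALTY if monitor_fires(cot_trace) else 0.0)

# Under this objective, traces that keep the reasoning but stop triggering the monitor
# get strictly higher reward, which is exactly the obfuscation pressure being studied.
print(shaped_reward(1.0, "I will secretly carry the 1."))   # 0.0: monitor fired
print(shaped_reward(1.0, "Carry the 1, then sum digits."))  # 1.0: monitor silent
```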