CRepair is an open-source benchmark measuring whether AI agents can detect, repair, and verify their own coherence failures. Two preprints are already published (Zenodo DOIs: 10.5281/zenodo.20283434 and 10.5281/zenodo.20324352) with code live at github.com/kaminovs/crepair.
The core finding: under standard conditions, LLMs achieve a 0% verification rate. They detect and repair failures but never close the loop by verifying the repair. A structured runtime intervention raises this to near-universal verification, while generic re-prompting barely moves the needle (mean improvement of +0.333 for the structured intervention vs +0.051 for generic re-prompting).
Funding supports cross-model replication, human validation, open infrastructure, and dedicated research time to extend this from a pilot into a community benchmark.
All current results are Claude Sonnet only. This project funds cross-model replication on GPT-4o and Gemini — the single experiment that turns a Claude-specific pilot into a general finding about how current LLMs handle self-correction.
Goal: Determine whether the CRepair dissociation finding (detection, repair, and verification are separable capabilities) holds across model families.
The existing benchmark runs 13 scenarios across 3 conditions (baseline, generic retry, CRepair wrapper) and takes approximately 2-3 hours per model. I will run the identical protocol on GPT-4o and Gemini, producing a 3-model comparison dataset.
Deliverables:
- Paper 3: cross-model comparison preprint (Zenodo)
- Updated leaderboard on GitHub with all three models
- Public dataset of all scenario responses and judge scores
If the pattern holds across models, the implication is that the verification gap is a property of how current autoregressive LLMs are trained — not a Claude quirk — and that structured repair scaffolding is a viable general intervention.
This funding covers four phases of work, each with concrete
deliverables.
Phase 1 — Cross-model replication ($500)
Run the identical 3-condition experiment (baseline / generic retry
/ CRepair wrapper) on GPT-4o and Gemini. Same 13 scenarios, same
3 independent runs, same judge protocol. Deliverable: Paper 3
preprint with 3-model comparison dataset published on Zenodo.
Phase 2 — Human validation study ($1,500)
The current results rely entirely on LLM-as-judge scoring. A
planned human spot-check of 50+ artifacts will provide an estimate
of judge accuracy and partially address the self-judging limitation.
This requires recruiting an independent rater and compensating
their time. Deliverable: judge agreement report added to Paper 3
appendix.
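One way the judge agreement report could be computed is sketched below. This is a minimal illustration, not the project's actual analysis code: the function name `agreement_report` and the pass/fail labels are hypothetical, and Cohen's kappa is just one reasonable choice of chance-corrected agreement statistic.

```python
from collections import Counter


def agreement_report(human, judge):
    """Return raw agreement and Cohen's kappa for two equal-length label lists.

    `human` and `judge` are hypothetical inputs: one label per artifact from
    the independent rater and from the LLM judge, in the same order.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")

    n = len(human)
    # Observed agreement: share of artifacts where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Chance agreement: probability both raters pick the same label if each
    # labelled independently at their own observed label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human_counts) | set(judge_counts)
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)

    # Kappa is undefined when chance agreement is 1 (both raters used a single
    # identical label); report 1.0 in that case by convention.
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)
    return {"n": n, "raw_agreement": observed, "cohens_kappa": kappa}


# Toy example with made-up labels (not real CRepair data).
report = agreement_report(
    ["pass", "pass", "fail", "fail"],
    ["pass", "fail", "fail", "fail"],
)
print(report)
```

Reporting kappa alongside raw agreement matters here because a judge that almost always outputs the same label can show high raw agreement purely by chance.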
Phase 3 — Open infrastructure ($3,000)
Package the CRepair benchmark and wrapper as a proper Python
library (PyPI), with documentation, a public leaderboard, and a
web interface showing results across models. This turns a research
repo into a usable community tool. Deliverable: a PyPI package
installable with pip install crepair, and a live leaderboard at
github.com/kaminovs/crepair.
Phase 4 — Research continuation ($7,000)
Partial compensation for 3-6 months of focused research time
(currently all work is done outside a full-time day job). This
covers: extending the scenario taxonomy beyond 13 scenarios,
submitting to a workshop or conference, and pursuing the
theoretical question of why the verification gap exists and what
architectural changes would close it without external scaffolding.
At minimum funding ($1,000): Phase 1 is fully covered —
cross-model replication and Paper 3.
At $4,000: Phases 1-2 covered — replication plus human validation.
At full funding ($12,000): Complete research programme through
conference submission with proper infrastructure and dedicated
research time.
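The funding tiers above can be checked against the phase budgets with a short sketch. It assumes phases are funded strictly in order and that a phase counts as covered only when its full cost is met; the helper name `phases_fully_covered` is illustrative.

```python
# Phase budgets as stated in this proposal (USD).
PHASES = [
    ("Phase 1: cross-model replication", 500),
    ("Phase 2: human validation study", 1_500),
    ("Phase 3: open infrastructure", 3_000),
    ("Phase 4: research continuation", 7_000),
]


def phases_fully_covered(budget):
    """Return (names of fully funded phases, leftover USD).

    Assumption: phases are funded in order, and funding stops at the first
    phase the remaining budget cannot fully cover.
    """
    covered = []
    remaining = budget
    for name, cost in PHASES:
        if remaining < cost:
            break
        covered.append(name)
        remaining -= cost
    return covered, remaining


for tier in (1_000, 4_000, 12_000):
    covered, leftover = phases_fully_covered(tier)
    print(f"${tier:,}: {len(covered)} phase(s) fully covered, ${leftover:,} left over")
```

Under these assumptions the four phase budgets sum to the $12,000 full-funding figure, and the intermediate tiers leave a remainder that goes toward the next phase without completing it.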
Independent researcher: Sergejs Kaminovs (ORCID: 0009-0004-1711-3455), UK.
Track record on this project:
- Built CRepair benchmark from scratch: 13 scenarios across 6 failure types, typed schema, LLM-as-judge evaluator, results logging, visualisation
- Published Paper 1 (benchmark) and Paper 2 (wrapper ablation study) as open preprints with DOIs
- Ran 52 scenario-condition pairs (Paper 1) and 117 scenario-condition pairs across 3 independent runs (Paper 2)
- All code open source: github.com/kaminovs/crepair
Day job: Senior data analyst in the casino industry. This research is conducted independently in my own time.
Most likely failure mode: API costs exceed estimates due to longer responses or additional debugging runs, leaving the experiment incomplete.
Mitigation: Both GPT-4o and Gemini have predictable pricing, and the workload is bounded: 10 runs × 13 scenarios × 3 conditions = 390 scenario-condition pairs per model.
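The run count in the mitigation can be reproduced, and turned into a rough cost bound, with the sketch below. The condition identifiers and both function names are illustrative, and the cost function takes token volume and price as inputs rather than assuming any real provider rate.

```python
from itertools import product

# The three experimental conditions described in this proposal
# (identifier spellings are illustrative).
CONDITIONS = ("baseline", "generic_retry", "crepair_wrapper")


def build_run_grid(n_runs, n_scenarios=13, conditions=CONDITIONS):
    """Enumerate every (run, scenario, condition) triple for one model."""
    return list(
        product(range(1, n_runs + 1), range(1, n_scenarios + 1), conditions)
    )


def estimate_cost_usd(n_pairs, tokens_per_pair, usd_per_million_tokens):
    """Rough cost bound for one model.

    Both `tokens_per_pair` and `usd_per_million_tokens` must be supplied by
    the caller from measured usage and the provider's current price list.
    """
    return n_pairs * tokens_per_pair * usd_per_million_tokens / 1_000_000


grid = build_run_grid(n_runs=10)
print(len(grid))  # 390 scenario-condition pairs per model
```

With `n_runs=3` the same grid gives 117 pairs, matching the Paper 2 protocol, so the 10-run figure leaves headroom for debugging reruns.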
If cross-model results are negative (no replication): this is also a publishable finding. A null result — "the CRepair effect does not generalise beyond Claude" — is informative and would be reported honestly as Paper 3.
If I am unable to complete the work: all existing code and data remain open source and any funded portion of the experiment would be published as-is.
Other funding received: $0. This is my first grant application. All prior work was self-funded.