CRepair: Cross-Model Replication of LLM Self-Repair Benchmark

Project summary

CRepair is an open-source benchmark measuring whether AI agents can detect, repair, and verify their own coherence failures. Two preprints are already published (Zenodo DOIs: 10.5281/zenodo.20283434 and 10.5281/zenodo.20324352) with code live at github.com/kaminovs/crepair.

The core finding: under standard conditions, LLMs achieve a 0% verification rate. They detect and repair failures but never close the loop by verifying the repair. A structured runtime intervention raises this to near-universal verification, while generic re-prompting yields only a small gain (mean improvement of +0.333 for the wrapper vs +0.051 for generic retry).

Funding supports cross-model replication, human validation, open infrastructure, and dedicated research time to extend this from a pilot into a community benchmark.

All current results are Claude Sonnet only. This project funds cross-model replication on GPT-4o and Gemini — the single experiment that turns a Claude-specific pilot into a general finding about how current LLMs handle self-correction.

What are this project's goals? How will you achieve them?

Goal: Determine whether the CRepair dissociation finding (detection, repair, and verification are separable capabilities) holds across model families.

The existing benchmark runs 13 scenarios across 3 conditions (baseline, generic retry, CRepair wrapper) and takes approximately 2-3 hours per model. I will run the identical protocol on GPT-4o and Gemini, producing a 3-model comparison dataset.
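The size of the replication can be sketched as a simple trial grid. This is a minimal illustration, not the benchmark's actual runner: the model, condition, and scenario identifiers are placeholder names, and the 3 runs per pair follow the Paper 2 protocol described elsewhere in this application.

```python
from itertools import product

# Counts come from the proposal; all identifiers are illustrative placeholders.
MODELS = ["claude-sonnet", "gpt-4o", "gemini"]
CONDITIONS = ["baseline", "generic_retry", "crepair_wrapper"]
SCENARIOS = [f"scenario_{i:02d}" for i in range(1, 14)]  # 13 scenarios
RUNS = 3  # independent runs per scenario-condition pair


def build_run_plan(models=MODELS):
    """Return one dict per (model, scenario, condition, run) trial."""
    return [
        {"model": m, "scenario": s, "condition": c, "run": r}
        for m, s, c, r in product(models, SCENARIOS, CONDITIONS, range(1, RUNS + 1))
    ]


plan = build_run_plan()
per_model = sum(1 for trial in plan if trial["model"] == "gpt-4o")
print(per_model)  # 13 scenarios x 3 conditions x 3 runs = 117
print(len(plan))  # 117 trials x 3 models = 351
```

Generating the plan up front makes it easy to confirm that each model receives exactly the same scenario-condition pairs before any API calls are made.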

Deliverables:

- Paper 3: cross-model comparison preprint (Zenodo)

- Updated leaderboard on GitHub with all three models

- Public dataset of all scenario responses and judge scores

If the pattern holds across models, the implication is that the verification gap is a property of how current autoregressive LLMs are trained — not a Claude quirk — and that structured repair scaffolding is a viable general intervention.

How will this funding be used?

This funding covers four phases of work, each with concrete deliverables.

Phase 1 — Cross-model replication ($500)

Run the identical 3-condition experiment (baseline / generic retry / CRepair wrapper) on GPT-4o and Gemini. Same 13 scenarios, same 3 independent runs, same judge protocol. Deliverable: Paper 3 preprint with 3-model comparison dataset published on Zenodo.

Phase 2 — Human validation study ($1,500)

The current results rely entirely on LLM-as-judge scoring. A planned human spot-check of 50+ artifacts will provide an estimate of judge accuracy and partially address the self-judging limitation. This requires recruiting an independent rater and compensating their time. Deliverable: judge agreement report added to Paper 3 appendix.
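One way the judge agreement report could quantify human-vs-judge agreement is Cohen's kappa, which corrects raw agreement for chance. The proposal does not name a metric, so the choice of kappa is an assumption here, and the labels below are made up for illustration rather than taken from CRepair data.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label sequences must be non-empty and the same length")
    n = len(human)
    # Fraction of items on which the two raters agree.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(
        human_counts[label] * judge_counts[label]
        for label in set(human) | set(judge)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical spot-check labels (not real results).
human = ["pass", "pass", "fail", "fail", "pass", "fail"]
judge = ["pass", "fail", "fail", "fail", "pass", "fail"]
kappa = cohens_kappa(human, judge)
print(round(kappa, 3))
```

Reporting kappa alongside raw percent agreement would show whether the LLM judge agrees with the human rater more often than label frequencies alone would predict.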

Phase 3 — Open infrastructure ($3,000)

Package the CRepair benchmark and wrapper as a proper Python library (PyPI), with documentation, a public leaderboard, and a web interface showing results across models. This turns a research repo into a usable community tool. Deliverable: pip install crepair, live leaderboard at github.com/kaminovs/crepair.

Phase 4 — Research continuation ($7,000)

Partial compensation for 3-6 months of focused research time (currently all work is done outside a full-time day job). This covers: extending the scenario taxonomy beyond 13 scenarios, submitting to a workshop or conference, and pursuing the theoretical question of why the verification gap exists and what architectural changes would close it without external scaffolding.

At minimum funding ($1,000): Phase 1 is fully covered (cross-model replication and Paper 3).

At $4,000: Phases 1-2 covered — replication plus human validation.

At full funding ($12,000): Complete research programme through conference submission with proper infrastructure and dedicated research time.
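The funding tiers above follow from paying for the phases in order. The short sketch below reproduces that arithmetic from the phase budgets listed in this section; the function name is illustrative only.

```python
# Phase budgets in USD, as listed in this application.
PHASE_COSTS = {1: 500, 2: 1_500, 3: 3_000, 4: 7_000}


def phases_covered(funding_usd):
    """Return the phases fully funded when phases are paid for in order."""
    covered, spent = [], 0
    for phase, cost in sorted(PHASE_COSTS.items()):
        if spent + cost > funding_usd:
            break
        spent += cost
        covered.append(phase)
    return covered


for tier in (1_000, 4_000, 12_000):
    print(tier, phases_covered(tier))
```

The four phase budgets sum to $12,000, which is why the full-funding tier covers the whole programme.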

Who is on your team? What's your track record on similar projects?

Independent researcher: Sergejs Kaminovs (ORCID: 0009-0004-1711-3455), UK.

Track record on this project:

- Built CRepair benchmark from scratch: 13 scenarios across 6 failure types, typed schema, LLM-as-judge evaluator, results logging, visualisation

- Published Paper 1 (benchmark) and Paper 2 (wrapper ablation study) as open preprints with DOIs

- Ran 52 scenario-condition pairs (Paper 1) and 117 scenario-condition pairs across 3 independent runs (Paper 2)

- All code open source: github.com/kaminovs/crepair

Day job: Senior data analyst in the casino industry. This research is conducted independently in my own time.

What are the most likely causes and outcomes if this project fails?

Most likely failure mode: API costs exceed estimates because of longer responses or additional debugging runs, leaving the experiment incomplete.

Mitigation: Both GPT-4o and Gemini have predictable pricing. At current rates, 10 runs × 13 scenarios × 3 conditions = 390 pairs per model.
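A rough cost check can be done before running anything. In the sketch below, only the 390 pairs per model comes from this application. The calls per pair, token counts, and per-million-token prices are hypothetical placeholders, not measured values or real vendor prices, and should be replaced with figures from the providers' current pricing pages.

```python
def estimate_cost_usd(pairs, calls_per_pair, tokens_in, tokens_out,
                      usd_per_m_in, usd_per_m_out):
    """Estimate API spend for one model from per-call token counts and prices."""
    calls = pairs * calls_per_pair
    usd_per_call = (tokens_in * usd_per_m_in + tokens_out * usd_per_m_out) / 1_000_000
    return calls * usd_per_call


pairs_per_model = 10 * 13 * 3  # runs x scenarios x conditions = 390

# All values below are placeholders chosen for illustration only.
estimate = estimate_cost_usd(
    pairs=pairs_per_model,
    calls_per_pair=4,     # hypothetical: agent turns plus judge calls
    tokens_in=2_000,      # hypothetical average prompt tokens per call
    tokens_out=1_000,     # hypothetical average completion tokens per call
    usd_per_m_in=5.0,     # hypothetical price per million input tokens
    usd_per_m_out=15.0,   # hypothetical price per million output tokens
)
print(f"{pairs_per_model} pairs, estimated ${estimate:.2f} per model")
```

Rerunning the estimate with doubled token counts gives a quick sense of how much headroom the Phase 1 budget leaves for longer responses and debugging runs.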

If cross-model results are negative (no replication): this is also a publishable finding. A null result — "the CRepair effect does not generalise beyond Claude" — is informative and would be reported honestly as Paper 3.

If I am unable to complete the work: all existing code and data remain open source and any funded portion of the experiment would be published as-is.

How much money have you raised in the last 12 months, and from where?

$0. This is my first grant application. All prior work was self-funded.