If your reward model is an LLM, you cannot tell when the policy is gaming the reward versus actually getting better. Three different stories produce identical training curves. This is the measurement bottleneck behind a lot of current reward-hacking and alignment research.
CHINI-bench and CHINI-train fix this for one specific domain by replacing the LLM judge with a deterministic discrete-event simulator. Either the simulated traffic survives the design or it does not. There is no opinion. There is no preference data. The grader has zero parameters and a fixed seed.
CHINI-bench is a public benchmark live at chinilla.com/research. 30 problems, mechanical scoring, no LLM judge. Models emit a system architecture (nodes, edges, behaviors). The simulator runs traffic through it under stress scenarios. Frontier models average 66 to 72 out of 100, and none passes more than 9 of 30 problems.
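To make the contrast with an LLM judge concrete, here is a minimal, self-contained sketch of what zero-parameter grading can look like: a JSON graph in, a number out, every run reproducible from a fixed seed. Everything in it is illustrative; the schema (`nodes`, `edges`, `capacity`), the scenario fields, and the routing logic are stand-ins, not the CHINI-bench implementation (which is documented at chinilla.com/bench/methodology).

```python
import json
import random

FIXED_SEED = 1337  # illustrative constant; the point is it never changes between runs


def grade(submission_json: str, scenarios: list[dict]) -> float:
    """Toy deterministic grader: parse the model's graph, replay each stress
    scenario's traffic through it, and score mechanically. No judge model anywhere."""
    try:
        design = json.loads(submission_json)  # expects {"nodes": [...], "edges": [...]}
    except json.JSONDecodeError:
        return 0.0  # malformed output earns nothing
    if not isinstance(design, dict):
        return 0.0

    capacity = {n["id"]: n.get("capacity", 0) for n in design.get("nodes", [])}
    routes = {}
    for e in design.get("edges", []):
        routes.setdefault(e["from"], []).append(e["to"])

    rng = random.Random(FIXED_SEED)  # same seed -> same traffic -> same score, every run
    passed = 0
    for sc in scenarios:
        load = {}
        for _ in range(sc["requests"]):          # route each request from the entry node
            node, hops = sc["entry"], 0
            while node in routes and hops < 16:  # hop cap guards against cyclic designs
                node = rng.choice(routes[node])
                hops += 1
            load[node] = load.get(node, 0) + 1
        dropped = sum(max(0, hits - capacity.get(node, 0)) for node, hits in load.items())
        if 1 - dropped / sc["requests"] >= sc["sla"]:  # did enough traffic survive?
            passed += 1
    return 100.0 * passed / len(scenarios)
```

Nothing in that loop can be argued with or prompted around, which is the whole point.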
CHINI-train is the open RL stack on top of it: a 1.5B-parameter model trained with SFT then GRPO, using the simulator score as the reward signal. v0.6 is on disk. It produced a clean, mechanistically explained reward-hacking failure: the policy learned that minimal canvases reliably stay under budget; mean component count dropped from 8.6 to 6.6, and the number of distinct topologies produced per problem (out of 8 samples) collapsed from 7.0 to 3.65. We can see this happening in real time because the grader is not also a model.
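That visibility comes from two cheap group-level statistics over the 8 rollouts sampled per problem: mean component count and number of distinct topologies. Below is a sketch of how they could be computed, assuming each rollout is a JSON graph; the canonical form (a sorted edge list) is an illustrative stand-in for whatever CHINI-train actually keys on.

```python
import json
from statistics import mean


def topology_key(design: dict) -> tuple:
    """Illustrative canonical form: treat a topology as its sorted directed-edge set,
    so two rollouts with identical wiring count as one distinct topology."""
    return tuple(sorted((e["from"], e["to"]) for e in design.get("edges", [])))


def group_stats(rollouts: list[str]) -> dict:
    """Per-problem collapse diagnostics over the K sampled rollouts (K=8 in v0.6)."""
    designs = []
    for text in rollouts:
        try:
            designs.append(json.loads(text))
        except json.JSONDecodeError:
            continue  # unparseable rollouts contribute nothing
    if not designs:
        return {"mean_components": 0.0, "distinct_topologies": 0}
    return {
        "mean_components": mean(len(d.get("nodes", [])) for d in designs),
        "distinct_topologies": len({topology_key(d) for d in designs}),
    }
```

Tracked over training, these are the two numbers that fell from 8.6 to 6.6 and from 7.0 to 3.65 in v0.6.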
v0.7 is a four-term reward shaping spec to fix the collapse. This grant funds the v0.7 training run, the held-out evaluation scaling, and the public writeup.
The published RL-on-LLM literature has a measurement problem. Almost every modern RL training run uses a learned reward model trained on human preferences, then optimizes a policy against it. The reward model is itself a noisy LLM. Reward hacking research is bottlenecked by this, and so is alignment research that depends on understanding when and how a policy goes off the rails during training. The standard tools cannot distinguish "model improved" from "model learned to game the grader" because the grader is itself a model.
A small set of recent benchmarks (SWE-bench, Aider, the various coding and CTF evaluations) use executable ground truth. They are valuable because the grader cannot be gamed. But they grade single-shot model output. They are not training environments. You can test against SWE-bench but you cannot do RL inside it.
CHINI-bench plus CHINI-train sits in that gap: a simulator that is both a benchmark and a training reward signal. As far as we know, no other public project does this for graph-valued outputs.
This grant has three concrete goals.
Goal 1: Ship the v0.7 reward-shaped training run. Demonstrate that the v0.6 reward-hacking failure can be corrected with four targeted shaping terms (utilization, size, coverage, group-level diversity) without losing bench score; a sketch of how the terms compose appears after Goal 3. Achieved by running a 50-step pilot to tune the shaping coefficients, then a 200-step full run. Output: a trained adapter plus an ablation table showing each term's marginal contribution.
Goal 2: Scale held-out evaluation from 20 to 100 problems. v0.6 used 20 held-out problems. Too small to claim publication-grade results. Output: a 100-problem held-out split, all v0.6 and v0.7 configurations re-evaluated on it, frontier baselines re-run on the new split.
Goal 3: Publish the v0.7 inference autopsy and reward shaping spec as a public report. v0.6 already has the writeup drafted. v0.7 will be cleaned up and posted on the public research page at chinilla.com/research/chini-train, with full reproduction scripts.
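As referenced under Goal 1, the four shaping terms compose with the flat simulator reward rather than replacing it. The sketch below is one plausible reading of that structure; the coefficients, normalizations, and exact term definitions are placeholders, not the v0.7 spec.

```python
def shaped_reward(sim_score: float,        # raw simulator score in [0, 100]
                  utilization: float,      # fraction of the budget actually used, in [0, 1]
                  n_components: int,       # component count in this rollout's design
                  coverage: float,         # fraction of required behaviors present, in [0, 1]
                  group_diversity: float,  # distinct topologies / rollouts in the GRPO group
                  target_components: int = 8,
                  w_util: float = 0.05,
                  w_size: float = 0.05,
                  w_cov: float = 0.05,
                  alpha: float = 0.02) -> float:
    """Illustrative composition of the four shaping terms on top of the flat
    simulator reward. All coefficients are placeholders to be tuned in the pilot."""
    base = sim_score / 100.0
    size_term = -abs(n_components - target_components) / target_components  # penalize shrinking canvases
    return (base
            + w_util * utilization        # reward using the budget, not merely staying under it
            + w_size * size_term
            + w_cov * coverage            # reward covering the problem's required behaviors
            + alpha * group_diversity)    # small alpha keeps the diversity bonus a nudge
```

One natural reading of the Goal 1 ablation table is zeroing each coefficient in turn and re-measuring bench score and topology diversity.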
The recipe is intentionally small. v0.6 trained for under $15 on a single A10G via Modal. v0.7 should land in the same envelope. The point is that any researcher studying small-model RL on graph-valued outputs can reproduce the work end-to-end for under $50 of compute.
At the $2,000 minimum:
- Modal compute for the v0.7 training run, ablations, and held-out evaluation: about $80 total
- Researcher stipend, two-week focused sprint on v0.7 implementation, training, and writeup (about 60 hours at $25/hr): $1,500
- Domain and hosting for the public results page through end of 2026: $50
- Buffer for Modal price fluctuations and re-runs: about $370
This funds the core deliverable: a trained v0.7 adapter, an ablation table, and the public writeup.
At the $8,000 stretch goal, the additional $6,000 funds:
- Researcher stipend extended from two weeks to one full month of dedicated solo time (about 160 additional hours at $25/hr): $4,000
- The held-out scaling study (re-evaluating all v0.6 and v0.7 configurations on a new 100-problem split): about $1,800 in additional Modal and API costs
- About $200 in OpenAI and Anthropic API credits for re-running frontier baselines on the expanded split
The minimum-funding outcome is real and useful. The stretch outcome is publication-grade.
The stipend is structured as a one-time payment for project labor, not a recurring wage. Per Manifund's standard grant agreement there is no employment relationship; the stipend covers researcher time the same way the compute line covers Modal time.
Solo project. Alex Kwon, independent researcher, DBA Chinilla. Focused on AI safety and reliability tooling.
Track record:
Chinilla (chinilla.com): browser-based discrete-event simulator for system design. Built from scratch over the past year, currently in public beta. The simulator powering CHINI-bench.
Collapse Index (private CLI, public Rust API at collapseindex.org, originally CI-1T): production-grade tooling for measuring prediction stability and identifying silent ML failure modes. Coined the "ghosts" framing for errors that are stable, confident, and wrong.
CHINI-bench v0.7: shipped publicly; 30 problems, ~240 submissions scored, full methodology page at chinilla.com/bench/methodology. Active leaderboard.
CHINI-train v0.6: full SFT plus GRPO pipeline, ~2,000 lines of Python. Trained, evaluated, autopsied. Internal v0.6 inference addendum is a 600-plus line writeup with seven strategic revisions, covering ~30 follow-on ablations (best-of-K pickers, structural repair, K=1+repair, JSON repair, learned ridge picker, oracle ceiling analysis, distributional analysis). All on disk, ready to publish.
v0.7 reward shaping spec: already written, ready to execute.
This is the first external funding ask on any of the above. Everything to date has been entirely self-funded.
Field-level: AI evaluation and RL training continue to depend on learned reward models, with no widely-available simulator-graded alternative for graph-valued outputs. This is the gap CHINI-bench plus CHINI-train exists to fill. If this project goes dormant, that gap remains.
Project-level, most likely (about 50 percent): v0.7 reward shaping fixes distributional collapse but does not move pass@1 past 10 percent. This is a useful negative result. It confirms that flat reward causes the v0.6 pathology and that targeted shaping can correct it without bench-score regression, while suggesting that the next bottleneck is data, not reward. The writeup still publishes. The recipe is still reusable. Future work targets curated training data instead.
About 25 percent: one of the four shaping terms causes its own reward hacking. The diversity bonus is the prime suspect. The v0.7 spec already includes mitigations (small alpha, JSON-failed rollouts excluded from the diversity numerator). If hacking still occurs, drop or replace the offending term and re-run. This is a tuning problem, not a project-killing failure.
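The parse-failure guard is the load-bearing part of that mitigation, and it is easy to state precisely: only rollouts that parse contribute to the topology count, while the denominator stays the full group size, so broken JSON dilutes the bonus instead of inflating it. A self-contained sketch under those assumptions; the spec's exact definition may differ.

```python
import json


def diversity_bonus(rollouts: list[str], alpha: float = 0.02) -> float:
    """Group-level diversity bonus with the parse guard: JSON-failed rollouts are
    dropped before counting distinct topologies. Definitions are illustrative."""
    topologies = set()
    for text in rollouts:
        try:
            design = json.loads(text)
        except json.JSONDecodeError:
            continue  # failed parses never count as "diverse"
        edges = design.get("edges", []) if isinstance(design, dict) else []
        topologies.add(tuple(sorted((e["from"], e["to"]) for e in edges)))
    if not topologies:
        return 0.0
    # Numerator: distinct topologies among valid rollouts. Denominator: full group size.
    # A small alpha keeps this a nudge rather than an optimization target of its own.
    return alpha * len(topologies) / len(rollouts)
```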
About 15 percent: Modal pricing assumption breaks. v0.6 trained for ~$10. If A10G availability shifts, per-run cost could double. The buffer line covers this. Worst case, training drops from 200 to 100 steps and we publish the smaller run.
About 10 percent: solo-researcher time risk. Real risk, mitigated by the fact that the v0.7 spec is already written, the sanity-check scripts are short, and v0.6 is fully reproducible. The artifact survives even if the next development cycle slows.
The project does not have a "burn the money and produce nothing" failure mode. The simulator works, the bench is public, the v0.6 results are on disk. Every funded experiment produces a result that gets published, even when the result is "this didn't work."
Zero dollars. Entirely self-funded. Modal compute (about $40 across all v0.6 work), domain registration, and time have come out of personal savings. No grants, no VC, no prior funding rounds. This Manifund application is the first external funding ask for any work in this stack.