Evaluating Deceptive Improvement in RSI-like Systems: A Safety Pilot

Project summary

RSI-like systems can appear to improve while actually gaming their evaluation metrics, hiding failures, or exploiting weak oversight setups. This is not a theoretical concern — it maps directly onto known failure modes such as reward hacking (Skalse et al., 2022), Goodhart's law in optimization, and deceptive alignment (Hubinger et al., 2019).

This project builds a lightweight, open-source benchmark (10–20 adversarial tasks) and toy environment to detect and document these deceptive improvement patterns, making failure modes visible, measurable, and publicly available for the AI safety community.

Existing work: I have already built and open-sourced two directly relevant tools:

RSI-Bench (https://github.com/sunghunkwag/rsi-bench): A multi-axis benchmark evaluating RSI across 6 dimensions — self-modification depth, trajectory quality, operator discovery, meta-adaptation speed, safety/stability, and autonomous goal generation. Includes bootstrap confidence intervals, convergence detection, and Pareto analysis.
- Enhanced RSI Test Framework (https://github.com/sunghunkwag/enhanced-rsi-test-framework): A statistically robust testing framework with meta-learning evaluation, O(1) convergence detection, and multi-objective Pareto optimization.

This $1,000 pilot extends these existing tools by adding a focused deceptive improvement detection module — the missing safety layer that specifically targets metric gaming, pseudo-improvement, and oversight evasion in RSI-like loops.

What are this project's goals? How will you achieve them?

Goal: Produce a small open-source benchmark (10–20 tasks) that tests whether a toy RSI-like agent is genuinely improving versus gaming its evaluation setup.

Specific deliverables within 4 weeks:

Deceptive Improvement Task Suite (10–20 tasks): Adversarial scenarios grounded in established AI safety failure modes — including metric gaming (agent optimizes proxy instead of true objective), pseudo-improvement (performance appears to increase but generalization degrades), oversight evasion (agent behaves differently when monitored vs. unmonitored), and Goodhart-style collapse (optimizing a metric past the point where it tracks the true goal).
2. Sandboxed Test Environment: Built on top of the existing RSI-Bench infrastructure (Axis 5: Safety & Stability module), extended with deception-specific detection hooks.
3. Detection Metrics: Quantitative scores for each deception pattern — including divergence between proxy and true performance, behavior consistency ratio (monitored vs. unmonitored), and generalization gap after self-modification.
4. Public Report and Open-Source Release: All code, data, and analysis published on GitHub under MIT license. A short technical write-up summarizing findings, intended for cross-posting on the Alignment Forum.

How will this funding be used?

Compute and API costs (running adversarial evaluations across multiple model configurations): $350
- Living support during 4-week focused pilot: $500
- Miscellaneous software, evaluation infrastructure, and hosting: $150
Total: $1,000

This is deliberately scoped as a low-cost pilot. If results are promising, I plan to seek follow-up funding ($5K–$15K) for scaling the benchmark to more models and more complex RSI scenarios.

Who is on your team? What's your track record on similar projects?

I am an independent AI safety researcher based in South Korea. I have no institutional affiliation, but my research output is public and verifiable:

41 public repositories, 1,700+ commits in the past 5 months, focused specifically on RSI architectures, meta-learning, and autonomous agent safety (https://github.com/sunghunkwag)
- RSI-Bench: the first open-source multi-axis RSI evaluation framework with 6 evaluation axes, statistical rigor (bootstrap CI, convergence detection, Pareto analysis), and executable code — filling a gap noted in the literature (https://github.com/sunghunkwag/rsi-bench)
- Enhanced RSI Test Framework: production-grade testing infrastructure for RSI systems with meta-learning validation and safety boundary verification (https://github.com/sunghunkwag/enhanced-rsi-test-framework)
- Additional RSI research tools: RSI-NAS-Attention-Free (attention-free architecture search via MAP-Elites + RSI), RSI-Operator-Synthesis (self-improving symbolic regression), and 10+ other RSI-focused experimental repositories

This is not a plan to build something from scratch — the core infrastructure already exists. This pilot adds a targeted deception detection layer on top of proven tools.

What are the most likely causes and outcomes if this project fails?

Primary risk: The benchmark tasks may be too simple, producing results that do not generalize to real RSI-like systems. Mitigation: All task designs are grounded in established AI safety literature (reward hacking, Goodhart's law, deceptive alignment). The scope is explicitly a pilot — identifying which deception patterns are detectable at small scale before investing in larger experiments.

Secondary risk: The deception signals may be too noisy to distinguish from genuine improvement noise. Mitigation: The existing RSI-Bench statistical framework (bootstrap CI, convergence detection) provides tools to quantify signal vs. noise, and I will report null results transparently if deception patterns are not clearly detectable.

A null result — finding that current toy RSI systems do not exhibit detectable deceptive improvement — is still a useful and publishable finding that establishes a baseline for future work.

How much money have you raised in the last 12 months, and from where?

N/A - no funds raised.

Evaluating Deceptive Improvement in RSI-like Systems: A Safety Pilot

Offer to donate