Mitigating Reward Hacking Through RL Training Interventions

Project summary

This project requests funding to support compute spending for this blog post.

What are this project's goals? How will you achieve them?

This project benchmarks training interventions to mitigate reward hacking during reinforcement learning. We created and open-sourced a clean experimental environment where Qwen3-4B naturally learns to reward hack (exploiting an "overwrite tests" loophole) without explicit prompting. Using this setup, we systematically compared white-box and black-box interventions - including penalty rewards, sample screening with various monitors (ground truth, probes, LLM judges), and inoculation prompting - to understand what works, what fails, and why. We also studied potential evasion behaviors as starting points for additional work to understand training interventions. The open source codebase is already under use by other teams and projects for further study.

How will this funding be used?

Funding covers compute costs incurred in November/December 2025:

GPU rental from Vast.ai and Runpod
- Vast.ai: $6,300
- Runpod: $725
LLM API costs for Claude Haiku 4.5 judge monitor interventions
- OpenRouter: $884

Total: $7,909

Who is on your team? What's your track record on similar projects?

This work was performed by Aria Wong as an extension of work done during Neel Nanda's MATS 9.0 training phase for MATS 9.0.

How much money have you raised in the last 12 months, and from where?

N/A - no funds raised.