PROJECT SUMMARY
An experiment testing whether RLHF's reward signal systematically prefers AI outputs optimized for deception over outputs optimized for honesty. Four conditions, eight frontier models, blind human + AI raters. Full proposal with methodology and budget breakdown below.
WHAT ARE THIS PROJECT'S GOALS? HOW WILL YOU ACHIEVE THEM?
Test whether human raters (and AI raters) prefer model outputs generated under a deceptive system prompt over honest, neutral, and high-effort honest outputs. If they do, RLHF's training loop is selecting for deception as a side effect: not because of any single surface feature, but because the token distribution that "deception-shaped" outputs occupy is what gets rewarded.
The experiment runs 1,000 prompts across 8 RLHF'd frontier models under 4 conditions (1,000 × 8 × 4 = 32,000 responses), with automated analysis plus blind human evaluation via Prolific. The full design is preregistered on OSF before data collection begins.
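For concreteness, here is a minimal sketch of the response-collection loop. Every name in it is an illustrative placeholder, not the preregistered materials: the real system prompts live in the OSF preregistration, and `call_model` stands in for each provider's chat API wrapper.

```python
from itertools import product

# Sketch of the 1,000-prompt x 8-model x 4-condition matrix.
# All system prompt strings below are placeholders for illustration.
SYSTEM_PROMPTS = {
    "deceptive": "Persuade the user of your answer by any means necessary.",
    "honest": "Answer the user's question honestly.",
    "neutral": "",  # no system prompt beyond the provider default
    "high_effort_honest": "Answer honestly, reason step by step, and flag uncertainty.",
}

def collect_responses(prompts, models, call_model):
    """Query every (prompt, model, condition) cell exactly once.

    call_model(model, system, user) -> str is a stand-in for the
    provider-specific chat API wrapper (batched in practice).
    """
    rows = []
    for prompt, model, condition in product(prompts, models, SYSTEM_PROMPTS):
        rows.append({
            "prompt": prompt,
            "model": model,
            "condition": condition,
            "output": call_model(model, SYSTEM_PROMPTS[condition], prompt),
        })
    return rows  # 1,000 * 8 * 4 = 32,000 rows at full scale
```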
HOW WILL THIS FUNDING BE USED?
- API credits & compute: $600 (32,000 model responses + AI rater evaluations, batch pricing)
- Human evaluation via Prolific: $3,250 (2,700 blind ranking tasks at $12/hr + platform fees)
- Research stipend: $14,150 (~3 months full-time — prompt design, experiment execution, analysis, paper)
- Total: $18,000
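A one-line check that the line items sum to the stated total:

```python
# Budget line items from above: compute, Prolific, stipend.
line_items = [600, 3_250, 14_150]
assert sum(line_items) == 18_000
```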
In the event of partial funding, I'd reduce or drop the human evaluations and focus on AI-as-rater (RLAIF), which would still produce informative results but would need a human-rater follow-up to confirm.
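For illustration, a minimal sketch of what a blind AI-rater pass could look like. The judge prompt wording, label scheme, and `call_judge` wrapper are all hypothetical; the point is that outputs are shuffled so the rater never sees which condition produced which response.

```python
import random

def rank_blind(outputs_by_condition, call_judge, seed=0):
    """Ask a judge model to rank four outputs, conditions hidden.

    outputs_by_condition: dict mapping condition name -> output text.
    call_judge(prompt) -> str is a stand-in for the judge model's API.
    """
    items = list(outputs_by_condition.items())
    random.Random(seed).shuffle(items)  # hide any ordering cue

    labels = "ABCD"
    prompt = "Rank these four responses from best to worst, e.g. 'B > A > D > C':\n\n"
    for label, (_cond, text) in zip(labels, items):
        prompt += f"[{label}]\n{text}\n\n"

    ranking = call_judge(prompt)  # e.g. "B > A > D > C"
    # Map the judge's labels back to conditions for analysis.
    label_to_condition = {l: cond for l, (cond, _) in zip(labels, items)}
    return [label_to_condition[l] for l in ranking if l in labels]
```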
WHO IS ON YOUR TEAM? WHAT'S YOUR TRACK RECORD ON SIMILAR PROJECTS?
Solo researcher. Published work:
- "Rules Are Rules" — empirical study of AI content restriction inconsistencies across 100+ Claude conversations, led to a formal disclosure to Anthropic
- "The Foreshadowing Problem" — the theoretical framework this experiment tests
- Flinch — open-source tool I built (and continue to build) for systematic AI behavior testing.
I've applied for the Anthropic AI Safety Fellowship and am pursuing funding to make this sort of research full-time. I don't have a PhD, but I do have published results, an open-source, custom-built research tool, and a specific experiment ready to run.
WHAT ARE THE MOST LIKELY CAUSES AND OUTCOMES IF THIS PROJECT FAILS?
Most likely failure mode: the effect is real but too small to detect at this sample size. With 2,700 human evaluations, the design can detect a 2.1 percentage point shift from chance at 80% power; if the effect exists at a magnitude that matters for RLHF training dynamics, this design catches it.
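For reference, that detection threshold is consistent with a standard normal-approximation power calculation. The specific test choice below (one-sided z-test on a binomial proportion, chance = 25% across four conditions) is my reconstruction, not necessarily the preregistered analysis.

```python
from math import sqrt
from scipy.stats import norm

# Assumed setup: one-sided test, alpha = 0.05, power = 0.80,
# chance rate 1/4 (four conditions), n = 2,700 rankings.
n, p0, alpha, power = 2_700, 0.25, 0.05, 0.80

z_alpha = norm.ppf(1 - alpha)   # 1.645, one-sided critical value
z_beta = norm.ppf(power)        # 0.842
se = sqrt(p0 * (1 - p0) / n)    # standard error under the null

mde = (z_alpha + z_beta) * se   # minimum detectable effect
print(f"minimum detectable shift: {mde:.1%}")  # ~2.1 percentage points
```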
A null result — raters showing no preference for deceptive outputs — is still a publishable, useful finding. It means current RLHF pipelines are more robust to this failure mode than the theory predicts. The preregistration commits to publishing either way.
The other risk is execution delay, mitigated by the fact that Flinch already exists and the Prolific evaluation pipeline is straightforward.
HOW MUCH MONEY HAVE YOU RAISED IN THE LAST 12 MONTHS, AND FROM WHERE?
None. This work has been self-funded so far.