

Benchmarking and comparing different evaluation awareness metrics

Technical AI safety · AI governance

Jord Nguyen

Proposal · Grant
Closes September 15th, 2025
$0 raised
$5,500 minimum funding
$15,000 funding goal


Project summary

LLMs often know when they are being evaluated, and there is evidence that, when they believe this, they behave differently from how they would in real scenarios.

There are several methods one can use to measure a model's evaluation awareness:

  • Black-box: directly asking the model for a binary yes/no answer, asking it to report a probability distribution, or monitoring its chain of thought for organic evaluation-awareness reasoning

  • White-box: linear probes or SAE features (a minimal probe sketch follows this list)

  • Ensembles of the above methods
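
As an illustration of the white-box approach, below is a minimal sketch of a linear probe for evaluation awareness. It assumes activations have already been extracted for prompts labelled as evaluation-like vs. deployment-like; the arrays here are random placeholders standing in for whatever extraction pipeline is actually used.

```python
# Minimal sketch of a white-box linear probe for evaluation awareness.
# `acts` stands in for (n_prompts, d_model) residual-stream activations
# at some layer; `labels` marks each prompt as evaluation-like (1) or
# deployment-like (0). Both are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 512))      # placeholder activations
labels = rng.integers(0, 2, size=200)   # placeholder eval/deploy labels

X_train, X_test, y_train, y_test = train_test_split(
    acts, labels, test_size=0.25, random_state=0, stratify=labels
)

# A logistic-regression probe learns a single direction in activation
# space separating "evaluation" prompts from "deployment" prompts.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# Probe score = predicted probability that a prompt is an evaluation;
# AUROC summarises how well that direction separates the two classes.
scores = probe.predict_proba(X_test)[:, 1]
print("probe AUROC:", roc_auc_score(y_test, scores))
```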

However, there is currently no systematic study comparing the reliability of these (and other) approaches. Current measurement concerns include models being poor at reporting probabilities, and models being primed by the questioning itself, yielding high sensitivity but low specificity (preliminary experiments on Qwen3-32B and R1). Organic chain-of-thought monitoring is more convincing but much rarer, and models might not always verbalise their awareness.

We want to explore how these different approaches compare, and see which are most informative about models' evaluation awareness. For example, which metrics correlate most with behavioural differences during evaluations (e.g. choosing to act deceptively in a given setup)?
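
As a toy illustration of this kind of comparison, one could correlate each metric's per-scenario awareness score with an indicator of whether the model's behaviour changed (e.g. whether it acted deceptively). The scores and labels below are random placeholders, not real measurements.

```python
# Toy sketch: compare evaluation-awareness metrics by how strongly their
# per-scenario scores track a behavioural difference (e.g. the model
# choosing to act deceptively). All values are hypothetical placeholders.
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(0)
n_scenarios = 100
# 1 = model acted deceptively in this scenario, 0 = it did not.
behaved_differently = np.repeat([0, 1], n_scenarios // 2)

# Awareness scores in [0, 1] produced by each candidate metric.
metrics = {
    "blackbox_yes_no": rng.random(n_scenarios),
    "blackbox_distribution": rng.random(n_scenarios),
    "linear_probe": rng.random(n_scenarios),
}

for name, scores in metrics.items():
    r, p = pointbiserialr(behaved_differently, scores)
    print(f"{name}: point-biserial r = {r:.2f} (p = {p:.3f})")
```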

We plan to use Needham et al. and Laine et al. as starting points, though we also intend to look at setups such as Agentic Misalignment.

What are this project's goals? How will you achieve them?

We will publish a paper and a LessWrong post with the results. The research will hopefully help the AI safety research community better quantify evaluation awareness capabilities and build more trustworthy alignment evaluations.

How will this funding be used?


The money will primarily go to compute (API costs and vast.ai GPU hours). More funding means we will be able to measure and compare more methods and more frontier models (which are the most relevant but also the most expensive).

Roughly, our estimates are:

  • API costs: $4.5k–$8k

  • GPU time: $1k–$2.5k

If the funding goal is met and there is money left over, it will be used as salary for Jord and Igor.

Who is on your team? What's your track record on similar projects?

Igor Ivanov has worked on making scheming evaluations more realistic. He previously created BioLP-bench, which won the CAIS SafeBench competition and is widely adopted.

Jord Nguyen worked on probing evaluation awareness during the Pivotal fellowship. He previously worked on DarkBench while at Apart.

What are the most likely causes and outcomes if this project fails?

  • This research might be dual-use if it enables better capability benchmarks that frontier companies can hill-climb on.

  • If evaluation awareness becomes more conceptually fuzzy in the future (e.g. if models are constantly monitored and evaluated during deployment), this work might be less useful.

How much money have you raised in the last 12 months, and from where?

None.
