

Estimating the effect of semantic duplicates of test data


Gavin Leech

Proposal · Grant
Closes November 20th, 2025
$5,000 raised
$10,000 minimum funding
$47,000 funding goal


Project summary

At this point it's quite hard to come up with genuinely unseen textual tasks. (Consider e.g. AIME 2025, a much-touted, well-funded benchmark aimed at novelty, where at least 3 of the 15 questions had close variants online at the time of the human exam.)

One explanation for LLM benchmark scores going up is that systems are generalising further out-of-distribution (OOD); another is simply that the pretraining corpus is expanding, bringing more tasks in-distribution and letting models interpolate from nearby data.

For example, "Who painted the Mona Lisa?" is a semantic duplicate of "Name the artist behind the portrait known as La Gioconda". (It's also more than just a rephrasing, which has been pretty well studied.)

These semantic duplicates ("semdupes") of test data will tend to evade even intense good-faith data-cleaning efforts by model developers, since exhaustive LLM sweeps of ~40-trillion-token training corpora are simply not feasible. The better systems get at semantic search, the less we can trust even "private" evaluations, since collisions with the training data are by now likely (at least for short tasks).
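To make the failure mode concrete, here is a minimal sketch of the kind of embedding-similarity check a cleaning pipeline might run, assuming sentence-transformers; the model choice and the 0.8 threshold are illustrative assumptions, not part of the proposal.

```python
# Minimal sketch: flag candidate semantic duplicates of a test item
# by cosine similarity in embedding space. Model choice and the 0.8
# threshold are illustrative assumptions, not from the proposal.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

test_item = "Who painted the Mona Lisa?"
corpus_snippets = [
    "Name the artist behind the portrait known as La Gioconda",
    "The Louvre is the most visited museum in the world.",
]

test_emb = model.encode(test_item, convert_to_tensor=True)
corpus_embs = model.encode(corpus_snippets, convert_to_tensor=True)

# Cosine similarity between the test item and each corpus snippet.
sims = util.cos_sim(test_emb, corpus_embs)[0]
for snippet, sim in zip(corpus_snippets, sims):
    if float(sim) > 0.8:  # illustrative threshold
        print(f"possible semdupe ({float(sim):.2f}): {snippet!r}")
```

The pairwise check itself is cheap; the reason cleaning fails in practice is the corpus-side scale, not the check.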

Research question: What fraction of "AI progress" is hidden interpolation over an increasing corpus with more semantic duplicates of test data?

This matters because, if LLMs are generalising less than you thought, then you should expect less progress on crucial OOD tasks like research. And if you're relying on private evals, you will need some sense of how common collisions are for your kind of test data.

What are this project's goals? How will you achieve them?

We want to estimate the hidden interpolation share of AI progress. We have a bunch of experiment ideas, including:

  • Generate a bunch of novel data (novel e.g. because fictional, or because the tasks are long), release half, wait for it to enter training corpora, run membership inference to estimate whether the released half was trained on, and test on the private half (see the membership-inference sketch after this list).

  • Study embedding capabilities by generating a bunch of synthetic semdupes which aren't just rephrasings.

  • Study the effect of semantic duplicates directly in OLMo 2, an open-data model.
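For concreteness, here is a minimal sketch of the membership-inference step in the first experiment, using a simple loss-gap signal between the released and private halves. The OLMo 2 checkpoint id, the toy data, and the use of raw per-token loss as the membership signal are our illustrative assumptions, not a fixed part of the design.

```python
# Minimal sketch of loss-gap membership inference, assuming an OLMo 2
# checkpoint on Hugging Face and a recent transformers version. The
# checkpoint id, toy data, and raw per-token loss as the membership
# signal are all illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "allenai/OLMo-2-1124-7B"  # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def mean_loss(text: str) -> float:
    """Mean per-token cross-entropy of the model on `text`."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

# Toy stand-ins for the released half (possibly in the training data)
# and the matched private half (definitely not).
released = ["The census of Qarevel recorded 412 glass towers in year 9."]
private = ["The census of Qarevel recorded 9 glass towers in year 412."]

gap = (sum(map(mean_loss, private)) / len(private)
       - sum(map(mean_loss, released)) / len(released))
print(f"loss gap (private - released): {gap:.3f}")
# A consistently positive gap over many matched pairs is (weak)
# evidence that the released half entered training.
```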

This Manifund application covers just the initial proofs of concept, Oct–Dec 2025. We have a separate application for the meat of the work.

How will this funding be used?

  • $15k compute for embedding and finetuning runs.

  • $32k salaries for 2 FTE for 2 months. (Buy two, get Nandi and Gavin for free.)

Who is on your team? What's your track record on similar projects?

  • Nandi Schoots, Oxford. Has managed >6 empirical ML studies.

  • Gavin Leech, Arb. More of an ideas guy.

  • Nicky Pochinkov, independent. Has done lots of empirical work. 0.5 FTE.

  • Juan Vazquez, Arb. ML engineer, has supported a few similar papers. 0.5 FTE.

  • Ari Spiesberger, Arb. Was a research scientist at an ML company doing lots of finetuning and big runs. 1 FTE.

What are the most likely causes and outcomes if this project fails?

  • It's a well-studied area which bears on capabilities, so it would be surprising if our experiments had never been done before. We did a good lit review, and while there's excellent work on semantic duplicates and contamination (e.g. Meta, Eleuther), it doesn't quantify the effect on benchmark scores or take our approach. It's still quite possible that the info is known inside labs.

  • Studies on open models may not generalise to the closed models we care most about predicting.

How much money have you raised in the last 12 months, and from where?

$0. Will put in a $100k OP application soon on top of this Manifund.
