Project summary
Current LLM agents make heavy use of pass@k to score impressively on hard benchmarks (most prominently RE-Bench and FrontierMath). In practice, many important tasks cannot be solved with pass@k because there is no ground-truth signal for which proposal is best. The obvious thing to try is pass@k where the LLM itself grades the solutions. For example: run k LLM attempts at setting up a training run within a compute budget, have the LLM inspect the k proposed codebases and estimate their performance on a set of metrics, then select the one that seems like it will produce the best results.
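A minimal sketch of the selection loop, to make the idea concrete (the two helper functions are stubs standing in for real agent and judge calls; all names here are placeholders, not a fixed design):

```python
# LLM-graded pass@k: generate k proposals, have an LLM judge score each one
# without access to ground truth, and submit only the judge's top pick.
import random


def generate_proposal(task_prompt: str) -> str:
    # Placeholder: in the real setup this is a full LLM agent run on the task.
    return f"proposal for {task_prompt!r} ({random.random():.3f})"


def judge_score(task_prompt: str, proposal: str) -> float:
    # Placeholder: in the real setup an LLM inspects the proposal (e.g., a
    # codebase) and estimates its performance; it never runs or measures it.
    return random.random()


def select_best_of_k(task_prompt: str, k: int = 8) -> str:
    proposals = [generate_proposal(task_prompt) for _ in range(k)]
    scores = [judge_score(task_prompt, p) for p in proposals]
    best = max(range(k), key=lambda i: scores[i])
    return proposals[best]
```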
This project is useful insofar as better capability elicitation from current frontier models is useful. If this technique does work, we'll know that fact sooner, and will have access to improved LLM agent capabilities on tasks that cannot be solved with pass@k. This includes useful tasks like open-ended software development, and dangerous tasks like risky deception attempts.
I'm not sure how much compute/API funding this project will need. The 1k minimum seems like enough that I'll be able to produce something, but I'll have to think very carefully about how to spend my budget and probably make some significant compromises. The 3k goal seems like enough to produce good results without being too constrained by budget, provided I still spend carefully.
What are this project's goals? How will you achieve them?
Task setting: I'll find a set of tasks where I can check the performance of submissions myself, and where I can restrict the LLM so that it can't check performance on its own, without distorting the task too much. For example, I can pose an algorithm optimization problem and time the solutions myself. I could forbid the LLM agent from running code at all, but that would probably be too distortionary to tell me much about how LLM agents will perform when given the full set of affordances for best capability elicitation. I could instead give the LLM a very limited compute budget, which might be better.
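To illustrate the algorithm-optimization variant, the ground-truth check could be as simple as timing each submitted solution in an isolated process. A rough sketch (the command line, file layout, and timeout below are placeholder assumptions):

```python
# Ground-truth evaluator sketch for the algorithm-optimization setting: run a
# submitted solution script on a fixed input and record wall-clock time, with
# a hard timeout. Crashes and timeouts get a worst-case score.
import subprocess
import time


def time_solution(solution_path: str, input_path: str, timeout_s: float = 60.0) -> float:
    """Return wall-clock runtime in seconds, or infinity on error/timeout."""
    start = time.perf_counter()
    try:
        result = subprocess.run(
            ["python", solution_path, input_path],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return float("inf")
    if result.returncode != 0:
        return float("inf")
    return time.perf_counter() - start
```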
Infrastructure: I'll set up task environments, LLM task solvers, LLM proposal evaluators, and ground truth proposal evaluators for each of the selected task settings.
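A rough sketch of how those components might fit together per task setting (the interface names are mine and purely illustrative, not committed design):

```python
# Per-task-setting components: a task environment, a solver that produces
# proposals, an LLM evaluator that scores them without ground truth, and a
# ground-truth evaluator that only I run.
from typing import Protocol


class TaskEnvironment(Protocol):
    def task_prompt(self) -> str: ...


class Solver(Protocol):
    def propose(self, task_prompt: str) -> str: ...


class LLMProposalEvaluator(Protocol):
    def estimate_score(self, task_prompt: str, proposal: str) -> float: ...


class GroundTruthEvaluator(Protocol):
    def true_score(self, proposal: str) -> float: ...
```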
Debugging and Capability Elicitation: I'll run some cheap trials, looking for problems with the infrastructure or behaviors from the LLM components that seem like they could be significantly improved with simple alterations to the setup.
Run the Experiments: money go brrr. This is the step where most of the compute/API budget gets spent.
Analysis and Writeup: How correlated are the LLM proposal evaluations and the ground-truth proposal evaluations? How often does the LLM choose the best proposal? What % of the best performance does the LLM-selected proposal achieve on average? Is this an effective capability elicitation method? How does it compare across task settings? Are there any obvious next steps? Etc. I'll then publish: probably an arXiv preprint, a GitHub repo, and an accompanying blog post.
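For concreteness, the headline numbers could be computed roughly like this (Spearman rank correlation is my assumed correlation measure, and the sketch assumes a higher ground-truth score means better performance):

```python
# Analysis metrics for one task instance, given the LLM judge's estimated
# scores and the ground-truth scores for its k proposals.
import numpy as np
from scipy.stats import spearmanr


def analysis_metrics(llm_scores: np.ndarray, true_scores: np.ndarray) -> dict:
    picked = int(np.argmax(llm_scores))   # proposal the LLM would submit
    best = int(np.argmax(true_scores))    # proposal that is actually best
    corr, _ = spearmanr(llm_scores, true_scores)  # rank agreement of the two scorers
    return {
        "rank_correlation": float(corr),
        "picked_the_best": picked == best,
        # Fraction of the best achievable performance captured by the LLM's pick.
        "fraction_of_best": float(true_scores[picked] / true_scores[best]),
    }
```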
How will this funding be used?
Compute/API budget to run the experiments. I'll work on this unpaid on my own time.
Who is on your team? What's your track record on similar projects?
I'm working independently.
I've previously led two SPAR projects developing novel LLM evaluations.
What are the most likely causes and outcomes if this project fails?
I have to run this project past the SEI conflict-of-interest approval process. If it isn't approved, the project simply won't happen and I'll return all of the money.
I might decide that a different project is a better use of my time, in which case I'll finish any proximate deliverables, write up my work so far, share my code, and return the remaining money.
I might make some mistake when running an expensive experiment (bad experimental design or bad implementation) that leads to the money being spent without producing useful results. If I catch the mistake, I'll probably pay out of pocket to redo the experiment, but I might apply for more funding, or as a last resort just write up the results, describe my mistake, and hope someone else will decide to carry on and find funding. If I don't catch the mistake, I'll publish wrong/useless results.
How much money have you raised in the last 12 months, and from where?
I got a 55k grant (45k stipend, 10k compute/API) from the Open Philanthropy Career Development and Transition Fund for six months' independent research, during which I led my two SPAR projects.
I currently work full-time for the AI Security team at CMU SEI CERT, where I make 95k/year.