Project summary
The field of LLM evals is currently a mess. Standard practice is not even to report error bars on performance numbers. This is a serious problem because eval sets typically contain between 100 and 10,000 examples, so the uncertainty on a measured accuracy of 80% ranges from roughly ±1% (for 10,000 examples) to ±8% (for 100 examples). This implies that many published claims about the relative performance of different methods are wrong, or at least unsupported by the evidence. A few within the field are beginning to push for the use of error bars, including Desi (in this blog post) and Anthropic (Miller, 2024).
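For concreteness, the ±1% and ±8% figures follow from the standard binomial approximation, treating each question as an independent Bernoulli trial (a simplification: questions within a benchmark are often correlated). A minimal sketch of the calculation:

```python
# Sketch of the uncertainty on a measured accuracy of 80%,
# using the 95% normal-approximation confidence interval for a binomial proportion.
import math

p = 0.80  # observed accuracy
for n in (100, 10_000):  # typical eval-set sizes
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    print(f"n = {n:>6}: 80% +/- {100 * half_width:.1f} percentage points")

# n =    100: 80% +/- 7.8 percentage points
# n =  10000: 80% +/- 0.8 percentage points
```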
However, simply adding error bars to existing analyses is not going to deliver a step change in our understanding of how LLM capabilities are growing, or of how that growth impacts AI safety.
We ask a much bigger question. If we care about LLM capabilities, we need to actually know: what is an LLM capability? While performance on a benchmark is clearly indicative of capabilities, it does not measure "capabilities" themselves, if for no other reason than that a single benchmark typically draws on multiple capabilities. So how do we define a set of candidate "capabilities"? And how do we decide whether a model does or does not have a given capability?
Intuitively, we can think of a model's capabilities as unobserved variables describing what the model can and can't do. Answering questions from any given benchmark is likely to require a combination of capabilities. For instance, solving problems from GSM8K would seem to involve at least two different classes of capability:
Mathematical comprehension: the ability to understand math word problems, identify the relevant information, and translate it into a sequence of arithmetic operations
Performing arithmetic accurately: actually calculating 23+76
Here, we propose treating language model capabilities as latent variables: unobservable factors that drive observed performance across benchmarks and tasks. We propose to infer these latent capabilities, and our uncertainty about them, using a Bayesian hierarchical model of LLM evals. This approach mirrors risk models in fields like finance, where latent factors (e.g. volatility or market factors) are inferred from observable data to assess upside and downside risks in performance (i.e. asset returns). Drawing on this analogy, our framework aims to decompose model performance into capability factors that could offer insights into both beneficial advances and potential safety concerns.
Our goal is to automatically identify these capabilities by training a Bayesian hierarchical model on binary data indicating whether an LLM answered a given benchmark question correctly. The key signal that will enable us to extract latent capabilities lies in the correlations of performance across questions. In particular, if a model lacks a capability, it will perform badly on all questions that require that capability. In contrast, if a model has a capability, it is likely to perform better on all questions requiring that capability (though it may still perform badly if it lacks some other capability necessary for those questions).
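As a concrete illustration, below is one deliberately simplified way such a model could be written, here in NumPyro so that it can be fitted with HMC. The specific priors, the linear-logistic link, and all names (skill, requires, n_capabilities, ...) are placeholder choices for illustration, not the final model:

```python
# A minimal sketch of a Bayesian latent-capability model for binary eval results.
import numpyro
import numpyro.distributions as dist

def eval_model(n_models, n_questions, n_capabilities, y=None):
    # Per-model latent capability levels (higher = more capable).
    skill = numpyro.sample(
        "skill", dist.Normal(0.0, 1.0).expand([n_models, n_capabilities]).to_event(2)
    )
    # Per-question requirement weights; a sparse prior encourages each question
    # to load on only a few capabilities.
    requires = numpyro.sample(
        "requires", dist.Exponential(3.0).expand([n_questions, n_capabilities]).to_event(2)
    )
    # Per-question difficulty offset.
    difficulty = numpyro.sample(
        "difficulty", dist.Normal(0.0, 2.0).expand([n_questions]).to_event(1)
    )
    # A model tends to answer correctly when its skills cover the question's requirements.
    logits = skill @ requires.T - difficulty  # shape: (n_models, n_questions)
    numpyro.sample("y", dist.Bernoulli(logits=logits), obs=y)
```

Richer variants (e.g. explicitly binary requirement indicators, or priors over the number of capabilities) fit naturally in the same framework; the essential point is that correlated errors across questions are explained by shared latent capabilities.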
What are this project's goals? How will you achieve them?
Create a Bayesian latent variable model of eval results that jointly identifies:
The set of latent capabilities underlying performance on the evals.
For each question, a short list of capabilities necessary to answer that question correctly.
For each language model, the capabilities that model does and does not possess (see the sketch after this list for how these quantities are read off the posterior).
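Continuing the earlier NumPyro sketch, these three outputs correspond to straightforward posterior summaries. Again, this is purely illustrative: the synthetic data, dimensions, and variable names are placeholders.

```python
# Continuing the eval_model sketch above (purely illustrative).
import jax
import jax.numpy as jnp
import numpy as np
from numpyro.infer import MCMC, NUTS

# Placeholder data: y_obs[m, q] = 1 if model m answered question q correctly.
y_obs = jnp.array(np.random.default_rng(0).integers(0, 2, size=(20, 200)))

mcmc = MCMC(NUTS(eval_model), num_warmup=500, num_samples=500)
mcmc.run(jax.random.PRNGKey(0), n_models=20, n_questions=200, n_capabilities=5, y=y_obs)
post = mcmc.get_samples()

# For each question: which capabilities it requires (largest posterior-mean weights).
question_requirements = post["requires"].mean(axis=0)  # (n_questions, n_capabilities)
# For each model: its inferred level of each capability, with uncertainty.
model_capabilities = post["skill"].mean(axis=0)  # (n_models, n_capabilities)
model_capability_sd = post["skill"].std(axis=0)
# The capabilities themselves are then interpreted by inspecting which questions
# load most heavily on each latent dimension.
```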
How will this funding be used?
Compute costs:
CPU (for running HMC): 6,000 core; 2,000 additional
GPU (for running additional benchmarks on open-source models): 5,000 core; 6,500 additional
API (for running additional benchmarks on closed models): 7,500 core; 5,000 additional
Total: 18,500 core; 13,500 additional
Who is on your team? What's your track record on similar projects?
Dr Laurence Aitchison: Lecturer (equivalent to US Assistant Professor) at the University of Bristol. Laurence has led a number of projects on Bayesian inference, including its use in understanding the COVID epidemic (e.g. Leech et al., 2022, PNAS). Additionally, Laurence's current research is on LLMs, so he is ideally placed to pursue this direction.
Dr Desi R Ivanova: Florence Nightingale Fellow (equivalent to US Assistant Professor) at the University of Oxford. Desi completed her PhD in machine learning in 2024, with a focus on Bayesian experimental design and amortized inference. Prior to her graduate studies, she worked as a quantitative researcher at Goldman Sachs, where she developed latent factor models to systematically analyze and predict asset performance, experience that aligns closely with this project's goals.
What are the most likely causes and outcomes if this project fails?
There is currently a lot of interest in LLM evals for the purposes of safety. Bayesian inference is potentially extremely useful in this setting, as it combines flexible modelling with principled uncertainty estimation. As such, we don't envisage the project failing outright; rather, we expect a spectrum of potential impact, ranging from "useful in certain settings, but not mainstream" to "adopted as standard practice for reporting and interpreting LLM eval results".
How much money have you raised in the last 12 months, and from where?
None