Thanks Neel! In response to your comments:
A method is only useful if people actually use it. Agreed. The nice thing about this approach is that there are a number of different applications, and we're pretty confident at least one will gain traction. These applications are:
Uncertainty estimation for LLM evals.
Identifying and understanding LLM capabilities.
Forecasting capabilities.
Active learning (finding a smaller set of benchmarks that captures a large share of the information about a model's capabilities).
Finding signals of contamination / sandbagging.
Getting data is expensive. That's part of the reason we're asking for money for compute. But lots of people already run extensive LLM benchmarking, and we're trying hard to leverage that existing work. At the moment, we're working with the Hugging Face Benchmarking Team, who have a very large collection of benchmarking results.
List of latent factors. We don't start by hand-labelling the capabilities. Instead, we infer the latent capabilities from the benchmark data, using e.g. a sparse prior, and then interpret the inferred factors post hoc. The workflow closely resembles that for VAEs: fit the latent space first, then interpret the learned dimensions afterwards.
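To give a flavour of what we mean, here's a toy sketch (illustrative only, not our actual model or code): a logistic latent-factor model over a models × questions correctness matrix, where an L1 penalty on the question loadings stands in for a sparse (Laplace) prior, fit by MAP gradient descent on synthetic data. All dimensions, hyperparameters, and data here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Synthetic stand-in for real eval data: M models x N benchmark questions ---
M, N, K_true = 40, 300, 3
theta_true = rng.normal(size=(M, K_true))                 # per-model abilities
w_true = np.zeros((N, K_true))                            # sparse loadings: each
w_true[np.arange(N), rng.integers(0, K_true, N)] = rng.normal(1.5, 0.5, N)  # question taps one capability
b_true = rng.normal(size=N)                               # per-question difficulty
p_true = 1 / (1 + np.exp(-(theta_true @ w_true.T - b_true)))
Y = (rng.random((M, N)) < p_true).astype(float)           # 1 = model answered correctly

# --- MAP fit of a logistic latent-factor model with a sparse prior on loadings ---
K, lam, lr, steps = 5, 0.02, 0.1, 2000                    # K > K_true: let sparsity prune factors
theta = rng.normal(scale=0.1, size=(M, K))
w = rng.normal(scale=0.1, size=(N, K))
b = np.zeros(N)

for _ in range(steps):
    p = 1 / (1 + np.exp(-(theta @ w.T - b)))              # predicted P(correct)
    err = p - Y                                           # d(NLL)/d(logit)
    g_theta = err @ w / N + 1e-3 * theta                  # small Gaussian prior on abilities
    g_w = err.T @ theta / M + lam * np.sign(w)            # L1 penalty ~ Laplace (sparse) prior
    g_b = -err.mean(axis=0)
    theta -= lr * g_theta
    w -= lr * g_w
    b -= lr * g_b

# --- Post-hoc interpretation: which questions load on each inferred capability? ---
for k in range(K):
    top = np.argsort(-np.abs(w[:, k]))[:5]
    print(f"factor {k}: mean |loading| = {np.abs(w[:, k]).mean():.3f}, "
          f"top questions = {top.tolist()}")
```

The final loop is the post-hoc interpretation step: with real data, we'd look at which benchmark questions load heavily on each inferred factor and give that factor a human-readable capability label, much as one inspects the latent dimensions of a trained VAE after the fact.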