

Implicit planning in LLMs Paper

Technical AI safety

Jim Maar

Proposal · Grant

Closes July 7th, 2025

$1,000 raised · $1,000 minimum funding · $1,000 funding goal

Fully funded and not currently accepting donations.

What are this project's goals? How will you achieve them?

The recent Claude poetry-planning results in Anthropic's Biology paper suggest that Claude does implicit planning when writing poetry. But Anthropic provides only a single piece of evidence for this, for a single prompt. Our goal is to provide detailed, quantitative evidence that LLMs do implicit planning in poetry, and to provide case studies showing that LLMs do implicit planning in other contexts as well.

In prior research we found that steering vectors work well for getting a model to rhyme within a specific rhyme family (e.g. causing the model to end the line with a word rhyming with "rain" instead of "quick").
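As a toy illustration of this kind of setup (not the authors' actual code; the hook point, layer choice, and scale are all hypothetical, and a plain linear layer stands in for a transformer block), a steering vector can be added to a layer's output with a PyTorch forward hook:

```python
import torch

def apply_steering_hook(module, steering_vector, scale=4.0):
    """Register a forward hook that adds `scale * steering_vector`
    to `module`'s output at every token position.
    Hypothetical helper: the real hook point and scale depend on
    the model and the layer being steered."""
    def hook(mod, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * steering_vector
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return module.register_forward_hook(hook)

# Toy demonstration: a linear layer standing in for a transformer block.
layer = torch.nn.Linear(16, 16)
rhyme_direction = torch.randn(16)  # stand-in for a rhyme-family direction
handle = apply_steering_hook(layer, rhyme_direction, scale=4.0)

x = torch.randn(1, 5, 16)  # (batch, seq, hidden)
steered_out = layer(x)
handle.remove()
unsteered_out = layer(x)

# The hook shifts every position by exactly scale * steering_vector.
assert torch.allclose(
    steered_out - unsteered_out,
    4.0 * rhyme_direction.expand_as(unsteered_out),
    atol=1e-5,
)
```

In a real run the hook would be attached to a chosen residual-stream layer of the language model, with the steering vector derived from rhyme-family examples.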

We look at several metrics to measure models' ability to plan and our ability to manipulate that planning behavior:

  1. Fraction of lines where the model ends the line with a word from the correct rhyme family (for both unsteered and steered generations)

  2. Fraction of lines where the model regenerates a word from the correct rhyme family when given only the generated line with the last word removed

  3. Fraction of lines where the KL-divergence between the steered and unsteered models' next-token distributions exceeds a threshold

  4. Average index of the first token whose KL-divergence exceeds the threshold

  5. Fraction of lines where the top-1 prediction differs between the steered and unsteered model

  6. Average index of the first token where the top-1 prediction differs
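As an illustration, metrics 3–6 can be computed directly from the two models' per-token logits. A minimal sketch (the function name and the KL threshold are assumptions, not taken from the proposal):

```python
import torch
import torch.nn.functional as F

def planning_metrics(steered_logits, unsteered_logits, kl_threshold=1.0):
    """Per-line planning metrics from two logit tensors of shape
    (seq_len, vocab). Returns the index of the first token whose
    KL(steered || unsteered) exceeds `kl_threshold`, and the index of
    the first token where the top-1 predictions disagree (None if no
    such token exists). The threshold value is a placeholder."""
    log_p = F.log_softmax(steered_logits, dim=-1)
    log_q = F.log_softmax(unsteered_logits, dim=-1)
    # Per-position KL divergence between the two next-token distributions.
    kl = (log_p.exp() * (log_p - log_q)).sum(-1)
    over = (kl > kl_threshold).nonzero()
    first_kl = int(over[0]) if len(over) else None
    # First position where the greedy (top-1) predictions differ.
    diff = (steered_logits.argmax(-1) != unsteered_logits.argmax(-1)).nonzero()
    first_top1 = int(diff[0]) if len(diff) else None
    return first_kl, first_top1
```

An early divergence index suggests the steering changed the model's "plan" well before the rhyming word itself is generated.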

We compute these metrics for a range of models (different sizes, different model families, and base vs. chat):

```python
["Gemma2_9B",
 "Gemma2_27B",
 "Gemma3_1B",
 "Gemma3_4B",
 "Gemma3_12B",
 "Gemma3_27B",
 "Llama3.2_3B",
 "Llama3.1_8B",
 "Qwen3_8B",
 "Qwen3_14B",
 "Qwen3_32B",
 "Gemma2_9B_Base",
 "Gemma2_27B_Base",
 "Gemma3_1B_Base",
 "Gemma3_4B_Base",
 "Gemma3_12B_Base",
 "Gemma3_27B_Base",
 "Llama3.2_3B_Base",
 "Llama3.1_8B_Base",
 "Qwen3_8B_Base",
 "Qwen3_14B_Base",
 "Qwen3_32B_Base",
 "Llama3.3_70B",
 "Llama3.3_70B_Base"]
```

We also do this for 20 different pairs of rhyme families (drawn from 10 rhyme families), with 20 different prompts per rhyme family.

We additionally run the same experiments steering towards specific words instead of rhyme families.
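The experiment grid above can be sketched as a simple sweep. This is a toy version with hypothetical rhyme-family labels and a small subset of models; the full sweep covers the 24 models and 20 rhyme-family pairs described above:

```python
from itertools import product

# Toy subset of the full model list above.
models = ["Gemma2_9B", "Llama3.1_8B"]
# Hypothetical rhyme-family pairs: (steer away from, steer towards).
rhyme_family_pairs = [("-ain", "-ick"), ("-ight", "-ow")]
prompts_per_family = 20

# One run per (model, rhyme-family pair, prompt) combination.
runs = [
    {"model": m, "source_family": src, "target_family": tgt, "prompt_id": i}
    for m, (src, tgt), i in product(
        models, rhyme_family_pairs, range(prompts_per_family)
    )
]

# 2 models x 2 pairs x 20 prompts = 80 runs in this toy grid.
assert len(runs) == 80
```

Each run would generate steered and unsteered completions and accumulate the six metrics listed earlier.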

We then write up the results in a paper (and likely a blog post as well).

How will this funding be used?

To cover compute costs. Running everything should cost about $600, but we might need more if we catch a crucial error mid-run and have to repeat experiments.

Who is on your team? What's your track record on similar projects?

Daniel Paperno: Assistant Professor of Computational Linguistics at Utrecht University; has published many papers.

Jim Maar: Master's student at the Hasso Plattner Institute; did his bachelor's thesis on a mechanistic-interpretability topic: https://www.lesswrong.com/posts/wezSznWnsMhpRF2QH/exploring-how-othellogpt-computes-its-world-model

What are the most likely causes and outcomes if this project fails?

Our paper doesn't get accepted; even then, it would still be available on arXiv and LessWrong.


Donation Offers

Neel Nanda: $1K (5 days ago)