

Implicit planning in LLMs Paper

Technical AI safety

Jim Maar

Proposal · Grant

Closes July 7th, 2025

$1,000 raised · $1,000 minimum funding · $1,000 funding goal

Fully funded and not currently accepting donations.

What are this project's goals? How will you achieve them?

The recent Claude poetry-planning results in Anthropic's Biology paper suggest that Claude does implicit planning when writing poetry. But Anthropic provides only a single piece of evidence for this, for a single prompt. Our goal is to provide detailed, quantitative evidence that LLMs do implicit planning in poetry, and to provide case studies showing that LLMs do implicit planning in other contexts as well.

In prior research we found that steering vectors work well for getting a model to rhyme within a specific rhyme family (e.g. causing the model to end the line with a word rhyming with "rain" instead of "quick").
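As a toy illustration of this kind of setup (not the authors' actual code; the hook point, layer choice, and scale are all hypothetical, and a plain linear layer stands in for a transformer block), a steering vector can be added to a layer's output with a PyTorch forward hook:

```python
import torch

def apply_steering_hook(module, steering_vector, scale=4.0):
    """Register a forward hook that adds `scale * steering_vector`
    to `module`'s output at every token position.
    Hypothetical helper: the real hook point and scale depend on
    the model and the layer being steered."""
    def hook(mod, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * steering_vector
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return module.register_forward_hook(hook)

# Toy demonstration: a linear layer standing in for a transformer block.
layer = torch.nn.Linear(16, 16)
rhyme_direction = torch.randn(16)  # stand-in for a rhyme-family direction
handle = apply_steering_hook(layer, rhyme_direction, scale=4.0)

x = torch.randn(1, 5, 16)  # (batch, seq, hidden)
steered_out = layer(x)
handle.remove()
unsteered_out = layer(x)

# The hook shifts every position by exactly scale * steering_vector.
assert torch.allclose(
    steered_out - unsteered_out,
    4.0 * rhyme_direction.expand_as(unsteered_out),
    atol=1e-5,
)
```

In a real run the hook would be attached to a chosen residual-stream layer of the language model, with the steering vector derived from rhyme-family examples.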

We look at several metrics to measure models' ability to plan and our ability to manipulate that planning behavior:

  1. Fraction of lines where the model ends the line with a word from the correct rhyme family (for both unsteered and steered generations)

  2. Fraction of lines where the model regenerates a word from the correct rhyme family when given only the generated line with the last word removed

  3. Fraction of lines where the KL-divergence between the steered and unsteered models' next-token distributions exceeds a threshold

  4. Average index of the first token whose KL-divergence exceeds the threshold

  5. Fraction of lines where the top-1 prediction differs between the steered and unsteered model

  6. Average index of the first token where the top-1 prediction differs
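As an illustration, metrics 3–6 can be computed directly from the two models' per-token logits. A minimal sketch (the function name and the KL threshold are assumptions, not taken from the proposal):

```python
import torch
import torch.nn.functional as F

def planning_metrics(steered_logits, unsteered_logits, kl_threshold=1.0):
    """Per-line planning metrics from two logit tensors of shape
    (seq_len, vocab). Returns the index of the first token whose
    KL(steered || unsteered) exceeds `kl_threshold`, and the index of
    the first token where the top-1 predictions disagree (None if no
    such token exists). The threshold value is a placeholder."""
    log_p = F.log_softmax(steered_logits, dim=-1)
    log_q = F.log_softmax(unsteered_logits, dim=-1)
    # Per-position KL divergence between the two next-token distributions.
    kl = (log_p.exp() * (log_p - log_q)).sum(-1)
    over = (kl > kl_threshold).nonzero()
    first_kl = int(over[0]) if len(over) else None
    # First position where the greedy (top-1) predictions differ.
    diff = (steered_logits.argmax(-1) != unsteered_logits.argmax(-1)).nonzero()
    first_top1 = int(diff[0]) if len(diff) else None
    return first_kl, first_top1
```

An early divergence index suggests the steering changed the model's "plan" well before the rhyming word itself is generated.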

We compute these metrics for a range of models (different sizes, different model families, and base vs. chat):

```python
["Gemma2_9B",
 "Gemma2_27B",
 "Gemma3_1B",
 "Gemma3_4B",
 "Gemma3_12B",
 "Gemma3_27B",
 "Llama3.2_3B",
 "Llama3.1_8B",
 "Qwen3_8B",
 "Qwen3_14B",
 "Qwen3_32B",
 "Gemma2_9B_Base",
 "Gemma2_27B_Base",
 "Gemma3_1B_Base",
 "Gemma3_4B_Base",
 "Gemma3_12B_Base",
 "Gemma3_27B_Base",
 "Llama3.2_3B_Base",
 "Llama3.1_8B_Base",
 "Qwen3_8B_Base",
 "Qwen3_14B_Base",
 "Qwen3_32B_Base",
 "Llama3.3_70B",
 "Llama3.3_70B_Base"]
```

We also do this for 20 different pairs of rhyme families (drawn from 10 rhyme families), with 20 different prompts per rhyme family.

We additionally run the same experiments steering towards specific words instead of rhyme families.
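The experiment grid above can be sketched as a simple sweep. This is a toy version with hypothetical rhyme-family labels and a small subset of models; the full sweep covers the 24 models and 20 rhyme-family pairs described above:

```python
from itertools import product

# Toy subset of the full model list above.
models = ["Gemma2_9B", "Llama3.1_8B"]
# Hypothetical rhyme-family pairs: (steer away from, steer towards).
rhyme_family_pairs = [("-ain", "-ick"), ("-ight", "-ow")]
prompts_per_family = 20

# One run per (model, rhyme-family pair, prompt) combination.
runs = [
    {"model": m, "source_family": src, "target_family": tgt, "prompt_id": i}
    for m, (src, tgt), i in product(
        models, rhyme_family_pairs, range(prompts_per_family)
    )
]

# 2 models x 2 pairs x 20 prompts = 80 runs in this toy grid.
assert len(runs) == 80
```

Each run would generate steered and unsteered completions and accumulate the six metrics listed earlier.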

We then write up the results in a paper (and likely a blog post as well).

How will this funding be used?

To cover compute costs. Running everything should cost about $600, but we might need more if we catch a crucial error mid-run and have to repeat experiments.

Who is on your team? What's your track record on similar projects?

Daniel Paperno: Assistant Professor of Computational Linguistics at Utrecht University; has published many papers.

Jim Maar: Master's student at the Hasso Plattner Institute; did his bachelor's thesis on a mechanistic-interpretability topic: https://www.lesswrong.com/posts/wezSznWnsMhpRF2QH/exploring-how-othellogpt-computes-its-world-model

What are the most likely causes and outcomes if this project fails?

Our paper doesn't get accepted; even then, it would still be available on arXiv and LessWrong.


Donation Offers

Neel Nanda: $1K (5 days ago)