
Implicit planning in LLMs Paper

Technical AI safety

Jim Maar

Active grant: $1,000 raised of a $1,000 funding goal. Fully funded and not currently accepting donations.

What are this project's goals? How will you achieve them?

The recent Claude poetry planning results in the Anthropic Biology paper suggest that Claude does implicit planning when writing poetry. But Anthropic provides only a single piece of evidence for this, on a single prompt. Our goal is to provide detailed, quantitative evidence that LLMs do implicit planning in poetry, and to provide case studies showing that LLMs do implicit planning in other contexts as well.

In prior research we found that steering vectors work well to get a model to rhyme with a specific rhyme family (e.g. causing the model to end the line with a word rhyming with "rain" instead of "quick").
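The grant doesn't spell out how the steering vectors are built; one common construction (not necessarily the one used in our prior work) is a difference-of-means vector between residual-stream activations for the two rhyme families. A minimal sketch, with function names and shapes purely illustrative:

```python
import numpy as np

def steering_vector(acts_target, acts_source):
    """Difference-of-means steering vector between two activation sets,
    e.g. lines ending in the 'rain' family vs. the 'quick' family.
    Each input: (n_samples, d_model) residual-stream activations."""
    return acts_target.mean(axis=0) - acts_source.mean(axis=0)

def apply_steering(hidden, vec, scale=4.0):
    """Add the scaled steering vector to residual-stream activations;
    in a real model this would happen inside a forward hook at one layer."""
    return hidden + scale * vec
```

Adding the vector during generation then biases the model toward completing the line with the target rhyme family.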

We use several metrics to measure models' ability to plan and our ability to manipulate the planning behavior:

  1. Fraction of lines where the model ends in a word from the correct rhyme family (unsteered and steered)

  2. Fraction where the model regenerates a word from the correct rhyme family when given the generated line with its last word removed

  3. Fraction where the KL-divergence between the steered and unsteered models exceeds a threshold

  4. Average index of the first token with high KL-divergence

  5. Fraction where the top-1 prediction differs between the steered and unsteered models

  6. Average index of the first token where the top-1 prediction differs
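Metrics 3–6 can all be derived from the two models' per-token next-token distributions. A minimal sketch, assuming position-aligned logit arrays of shape (seq, vocab) and a hypothetical KL threshold:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_per_token(steered_logits, unsteered_logits):
    """KL(steered || unsteered) at each position; returns shape (seq,)."""
    p = softmax(steered_logits)
    q = softmax(unsteered_logits)
    return (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)

def first_divergence_indices(steered_logits, unsteered_logits, kl_threshold=1.0):
    """Index of the first high-KL token and the first top-1 disagreement
    (None if the sequence never diverges by that criterion)."""
    kl = kl_per_token(steered_logits, unsteered_logits)
    high_kl = kl > kl_threshold
    top1_differs = steered_logits.argmax(-1) != unsteered_logits.argmax(-1)
    first_kl = int(np.argmax(high_kl)) if high_kl.any() else None
    first_top1 = int(np.argmax(top1_differs)) if top1_differs.any() else None
    return first_kl, first_top1
```

Averaging these indices over prompts gives metrics 4 and 6; the fractions of prompts where they exist give metrics 3 and 5.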

We compute these metrics for a range of models (different sizes, different model families, and base vs. chat):

```python
["Gemma2_9B",
"Gemma2_27B",
"Gemma3_1B",
"Gemma3_4B",
"Gemma3_12B",
"Gemma3_27B",
"Llama3.2_3B",
"Llama3.1_8B",
"Qwen3_8B",
"Qwen3_14B",
"Qwen3_32B",
"Gemma2_9B_Base",
"Gemma2_27B_Base",
"Gemma3_1B_Base",
"Gemma3_4B_Base",
"Gemma3_12B_Base",
"Gemma3_27B_Base",
"Llama3.2_3B_Base",
"Llama3.1_8B_Base",
"Qwen3_8B_Base",
"Qwen3_14B_Base",
"Qwen3_32B_Base",
"Llama3.3_70B",
"Llama3.3_70B_Base"]
```

We do this for 20 different pairs of rhyme families (drawn from 10 rhyme families) and 20 different prompts per rhyme family.

We also run the same experiments, but steering towards specific words instead of rhyme families.
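The full sweep is then a grid over models × rhyme-family pairs × prompts. Schematically (all names hypothetical, with a small subset of the model list for illustration):

```python
from itertools import product

MODELS = ["Gemma2_9B", "Gemma3_27B", "Llama3.1_8B"]   # subset for illustration
RHYME_PAIRS = [("rain", "quick"), ("light", "sound")]  # hypothetical pairs
PROMPTS_PER_PAIR = 20

def run_sweep(evaluate):
    """evaluate(model, source_family, target_family, prompt_idx) -> dict of
    per-generation metrics; returns one flat record per grid point."""
    results = []
    for model, (src, tgt), i in product(MODELS, RHYME_PAIRS,
                                        range(PROMPTS_PER_PAIR)):
        results.append({"model": model, "source": src, "target": tgt,
                        "prompt": i, **evaluate(model, src, tgt, i)})
    return results
```

Aggregating the records by model then yields the per-model fractions and average indices described above.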

We then write up the results in a paper (and likely a blog post as well).

How will this funding be used?

To cover compute costs. Running everything should cost about $600, but we might need more if we catch a crucial error mid-run and have to repeat experiments.

Who is on your team? What's your track record on similar projects?

Daniel Paperno: Assistant Professor of Computational Linguistics at Utrecht University, with an extensive publication record.

Jim Maar: Master's student at the Hasso Plattner Institute; wrote his bachelor's thesis on a mech-interp topic: https://www.lesswrong.com/posts/wezSznWnsMhpRF2QH/exploring-how-othellogpt-computes-its-world-model

What are the most likely causes and outcomes if this project fails?

The most likely failure is that our paper doesn't get accepted, in which case it will still be available on arXiv and LessWrong.
