I am an independent researcher and language model tinkerer who builds and writes about open-source tools, like the repeng steering vector library, for experimenting with and understanding language models. This work is currently a side project; I'd like to dedicate more time to it and produce useful artifacts for new techniques, like Model Diff Amplification.
My current interest is Model Diff Amplification (https://www.goodfire.ai/research/model-diff-amplification), and whether it would be possible to build tools to make experimenting with MDA and combining it with other techniques (like steering vectors or on-policy distillation) easier. Additionally, with more funding, I would like to invest more time into steering vector infrastructure, such as by implementing a steering vector plugin for vLLM.
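The core of MDA, as described in the linked Goodfire post, is sampling from logits that exaggerate the difference between a fine-tuned model and its base model. A minimal NumPy sketch of that logit arithmetic (function names, the toy vocabulary, and the alpha value are all illustrative, not repeng or Goodfire API):

```python
import numpy as np

def amplified_logits(base_logits, tuned_logits, alpha):
    # MDA: push next-token logits further in the direction the fine-tune
    # already moved them, so rare fine-tuned behaviors surface more often
    return base_logits + alpha * (tuned_logits - base_logits)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# toy 4-token vocabulary; the fine-tune mildly boosts token 2
base = np.array([2.0, 1.0, 0.5, 0.0])
tuned = np.array([2.0, 1.0, 1.5, 0.0])

# with alpha=4, token 2's logit becomes 0.5 + 4 * 1.0 = 4.5,
# making it the most likely next token under the amplified distribution
amp = amplified_logits(base, tuned, alpha=4.0)
```

In practice this requires running both models per decoding step (or a fused implementation), which is why a vLLM plugin would matter for performance.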
The minimum funding amount ($5k) will cover my time to write a tutorial on Model Diff Amplification and to build tooling for it that works with Hugging Face models and my logitloom visualization tool (see below).
The second tier ($10k) will also cover an attempt to implement an MDA plugin for vLLM, which would enable higher-performance MDA sampling and potentially use in RL frameworks, e.g. for distilling an MDA policy into a single model. (I believe this will be useful for model behavior research - for example, producing reward hacking or emergent misalignment model organisms, then testing mitigation techniques on the resulting single model.) If this turns out not to be feasible with vLLM's interfaces (which seems unlikely), I will put this funding towards steering vector infrastructure or open-ended research.
The third tier ($20k+) will go towards steering vector infrastructure, such as a steering vector plugin for vLLM. There is an existing project implementing steering for vLLM, EasySteer (which uses some code from repeng), but it is poorly documented and relies on a fork of vLLM rather than the plugin system. (Using a plugin would make the system easier to integrate with other vLLM-based tools, like verifiers.) If this turns out not to be feasible with vLLM's plugin system, I will put this funding towards contributions to EasySteer, closing open PRs and issues on repeng, and/or open-ended research.
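Conceptually, applying a steering vector means adding a learned direction to the hidden states at one or more layers during the forward pass (in a real model this runs inside a forward hook or, for vLLM, a plugin). A minimal sketch of the per-layer operation, with illustrative shapes and coefficient:

```python
import numpy as np

def steer(hidden, vector, coeff):
    # add a scaled steering vector to every token position's hidden
    # state at a chosen layer; broadcasting applies it across seq_len
    return hidden + coeff * vector

rng = np.random.default_rng(0)
hidden = rng.standard_normal((5, 8))   # (seq_len, hidden_dim)
vector = rng.standard_normal(8)        # one learned direction
steered = steer(hidden, vector, coeff=1.5)
```

The engineering work is not this arithmetic but wiring it into an inference server's layer internals without forking the server, which is what a plugin-based design would solve.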
Additional funding will go towards funding more open-ended model behavior research, open-source development, and writing. (If you are interested in funding me in this range and have certain topics you would like me to prioritize that were not otherwise mentioned, please get in touch.) Likely topics: impact of model personas on RL, RL for personas / character training, using steering vectors to better understand model personas, understanding "emergent misalignment", model introspection.
I expect the timeline here to range from 1-2 months for the minimum funding amount, to 6-12 months at the maximum end.
I have been tinkering with LLMs for several years now, and in that time have produced several useful open-source tools and tutorials / blog posts:
https://github.com/vgel/repeng, an open-source library for generating steering / control / concept / persona vectors
A popular (65k clicks) blog post introducing people to how to use these vectors: https://vgel.me/posts/representation-engineering/
Used in various research contexts, e.g. https://arxiv.org/abs/2511.01689
A slew of example notebooks, including less common use cases, like training model-contrastive vectors, that other resources don't usually cover: https://github.com/vgel/repeng/tree/main/notebooks
Previously completed a $5k grant to contribute a steering vector implementation to llama.cpp: https://x.com/ggerganov/status/1768357345032118715
Ran an in-person repeng workshop at Lighthaven
https://github.com/vgel/logitloom, a free and open-source tool for understanding model behavior at the logit level
https://vgel.me/posts/seahorse/, a widely-read (145k clicks) exploration of why the seahorse emoji causes models to act strangely, using the logit lens (an interpretability technique)
I have previously received a $10k credit grant for cloud compute from Prime Intellect.
Austin Chen
1 day ago
Approving this project, and making a small donation myself -- I've seen Theia's work before and was excited to see this pop up on Manifund. Thanks @joel_bkr and Charles for making this happen!
Joel Becker
2 days ago
I asked Theia to put this proposal up, on the recommendation of my excellent and very well-read colleague Charles Foster.
The thesis here is: get small grants to high-context model tinkerers who are AI risk skeptics -> they take time away from their jobs to do indie research and writing projects -> they blog about what they find -> those blog posts get read by folks within the AIS world and AIS-skeptic world -> everybody converges to truer beliefs.
I'm excited to support Theia as part of a bet on this thesis!