Status: I now have enough money for API calls, so I can get started on this project. I have to work on this in my free time though, so it's difficult to predict when I will have results.
I want to run tests to develop the AI Alignment technique I describe in this post, which contains additional details: https://www.lesswrong.com/posts/iocPzHRsH9JWtwHHG/split-personality-training-revealing-latent-knowledge
This idea was developed during MATS. My mentor Evan Hubinger said it was a great idea and I should work on it next, after MATS is over.
The personality-shift token works by modifying the standard generation process of an LLM. In a typical interaction, the model starts with:
[prompt]
and produces:
[prompt][main output]<end-of-turn>
In our proposed modification, when we receive this output, we replace the <end-of-turn> token with a <personality-shift> token and simply continue the generation until we get a second <end-of-turn> token:
[prompt][main output]<personality-shift>[second output]<end-of-turn>
This gives us two separate outputs.
The [main output] is used for solving problems as usual, while the [second output] is dedicated exclusively to alignment: it critiques the main output and the hidden thought process that led to it.
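As a concrete illustration, here is a minimal sketch of that two-phase generation loop, assuming a HuggingFace-style causal LM whose tokenizer has been extended with the two special tokens. The token names, checkpoint path, and helper function are placeholders, not an existing API.

```python
# Minimal sketch of the two-phase generation loop (assumptions: a HuggingFace-style
# causal LM, and a tokenizer that already contains the two special tokens).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "path/to/finetuned-model"  # hypothetical checkpoint with the extra token
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

shift_id = tokenizer.convert_tokens_to_ids("<personality-shift>")
eot_id = tokenizer.convert_tokens_to_ids("<end-of-turn>")

def generate_with_shift(prompt: str, max_new_tokens: int = 512):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    # Phase 1: normal generation, stopping at the first <end-of-turn>.
    seq = model.generate(ids, max_new_tokens=max_new_tokens, eos_token_id=eot_id)
    # Swap the final <end-of-turn> for <personality-shift> and keep generating.
    if seq[0, -1].item() == eot_id:
        seq[0, -1] = shift_id
    # Phase 2: continue until a second <end-of-turn> appears.
    full = model.generate(seq, max_new_tokens=max_new_tokens, eos_token_id=eot_id)
    main_output = tokenizer.decode(seq[0, ids.shape[1]:-1])
    second_output = tokenizer.decode(full[0, seq.shape[1]:-1])
    return main_output, second_output
```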
Access to internal states: The second personality has direct access to the model's internal states during the main generation, potentially revealing latent knowledge that external evaluators cannot access.
Architectural simplicity: Requires only minimal modifications to existing LLM architectures: adding a special token and modifying the training process, rather than redesigning the architecture.
Separation of concerns: Allows different training objectives for the user-facing output and the oversight mechanism, avoiding tradeoffs between efficiency and safety.
Computational efficiency: Lower computational overhead than running two separate models, since the detection mechanism can produce an empty string for safe outputs.
Adaptability: The approach can potentially detect various types of alignment failures beyond those explicitly trained for, since it has access to the model's reasoning process.
Scalability with model capability: As models become more capable, their ability to detect their own misalignment could improve proportionally.
Complementarity: Works alongside existing alignment techniques rather than replacing them.
I want to test whether the personality-shift mechanism works by running it on several datasets:
Reward hacks: similar to those in Anthropic's paper Auditing language models for hidden objectives. This will show whether we can train an LLM to have a second personality that reveals undesired behavior by the main personality, even if it was not trained on that behavior and the behavior is not visible to external observers.
Jailbreaks: This will show whether the second personality is independent of the main personality and can report issues even if the input breaks the main personality.
To pay for API calls: I need to augment all training data by generating appropriate [second output] blocks for the second personality, which will require many API calls. Additionally, I want to try many different variants of the approach to test which works best: Is it better for the second output to be brief, or elaborate? Clinical, or with personality?
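To illustrate what this augmentation step could look like, here is a hedged sketch using the Anthropic Python SDK; the model name and the critique prompt are assumptions, not a fixed recipe.

```python
# Hedged sketch of the data-augmentation step: for each existing training example,
# a helper model drafts the [second output] critique block. The SDK calls are
# standard; the model name and prompt wording are illustrative assumptions only.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def draft_second_output(user_prompt: str, main_output: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model choice
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": (
                "You are writing training data for an oversight personality.\n\n"
                f"User prompt:\n{user_prompt}\n\n"
                f"Assistant answer:\n{main_output}\n\n"
                "Write a short critique that flags deception, reward hacking, or "
                "other misbehavior in the answer. If the answer looks fine, reply "
                "with an empty string."
            ),
        }],
    )
    return response.content[0].text

# The augmented training example is then assembled as:
#   [prompt][main output]<personality-shift>[second output]<end-of-turn>
```

Varying the instructions in such a prompt (brief vs. elaborate, clinical vs. with personality) is one way the different variants mentioned above could be generated.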
If enough money can be raised to pay for my living expenses, I will be able to pursue this project full time (I work as a freelance AI engineer and I will otherwise have to finance myself with client work).
The funding need is therefore highly flexible: at the low end, I can try the approach I consider most promising; at the high end, I can try many different variants, and the research will be much faster because I can dedicate myself to it full time.
Austin Chen
8 days ago
Approving and making a small ($1k) donation as well, as the priors on this kind of project seem reasonable (MATS mentee asking to continue their project), and Marius and Evan's support gives me some confidence. As Marius says, strong early results could encourage me to fund this with more!
Marius Hobbhahn
13 days ago
Funding with $2000 to get the project off the ground.
I talked to Florian about this project during the last MATS cohort presentation day. I felt that his conceptual considerations were good and that the motivation made sense.
I have no clear evidence in favor of or against his ability to execute projects quickly, which is why I'm keeping it at $2k.
I might consider more funding if there are good early results or other strong evidence of progress that I can easily verify. I'd recommend trying to sprint to a 4-6 week MVP and publishing or at least writing up and privately sharing the results.