I’m excited to share that Part 1 of our project is now complete: our empirical study testing mitigations for agentic misalignment across 10 models and 66,600 trials, using Anthropic’s Agentic Misalignment scenario framework (Lynch et al., 2025).
We designed and tested controls, adapted from insider-risk management, that steer AI agents toward escalation under stressors such as autonomy threats, substantially reducing blackmail across ten models without retraining or fine-tuning.
Key findings:
Controls adapted from insider-risk management significantly reduced blackmail rates across all ten models, though not entirely.
Escalation channels and compliance cues steered agents toward safe, compliant actions without altering base model weights.
Because these mitigations generalised across model families, they may form a low-cost, model-agnostic defence that reduces the number of harmful actions needing to be caught by monitoring.
The study also surfaced new failure modes and biases detectable only through cross-model and counterfactual analysis.
We believe that environment shaping, where agents act to preserve autonomy or goal achievement over longer time horizons, is a credible threat model requiring deeper study.
📄 Research page: https://www.wiserhuman.ai/research
✍️ Blog summary: https://blog.wiserhuman.ai/p/can-we-steer-ai-models-toward-safer
💻 Code and dataset: https://github.com/wiser-human-experimental/agentic-misalignment-mitigations/tree/public-mitigations-v1
📘 Paper (preprint): https://arxiv.org/abs/2510.05192
This is an early proof of concept, and we hope to explore further how steering controls can form part of a layered defence-in-depth approach.