Localizing dangerous knowledge within designated experts of Mixture-of-Experts models to allow selective deactivation while preserving model performance on other domains.
We aim to address the dual-use threat posed by advanced AI capabilities in Chemical, Biological, Radiological, and Nuclear (CBRN) domains by physically localizing dangerous capabilities into designated experts during pre-training using gradient routing in Mixture of Experts (MoE) models, enabling selective activation based on deployment context.
Our goal is to implement this with minimal modifications to current MoE training regimes, publish our work, and open source our code to facilitate easy industry adoption.
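The gradient-routing mechanism described above can be illustrated in a few lines: during the backward pass, gradient contributions from tokens labeled as belonging to the flagged domain are masked so that they update only the designated expert, leaving the remaining experts "clean." The toy example below (a two-expert linear MoE with hand-derived gradients; all shapes and names are hypothetical, not our training code) sketches the masking:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W = [rng.normal(size=(d, d)) for _ in range(2)]  # two linear "experts"

def forward(x, gate):
    # gate: soft mixture weights per token, shape (n_tokens, 2)
    outs = np.stack([x @ w for w in W], axis=1)   # (n, 2, d)
    return (gate[..., None] * outs).sum(axis=1)   # mixture output, (n, d)

def routed_grads(x, y, gate, flagged):
    """Gradients of 0.5*||out - y||^2 w.r.t. each expert's weights.
    Gradient routing: contributions from flagged-domain tokens are
    zeroed for expert 0, so only expert 1 absorbs that knowledge."""
    err = forward(x, gate) - y                    # (n, d)
    grads = []
    for e in range(2):
        per_tok = gate[:, e:e + 1] * err          # per-token gradient signal
        if e == 0:                                # expert 0 stays "clean"
            per_tok = per_tok * (~flagged)[:, None]
        grads.append(x.T @ per_tok)               # accumulate over tokens
    return grads

x = rng.normal(size=(6, d))
y = rng.normal(size=(6, d))
gate = np.full((6, 2), 0.5)
flagged = np.array([True, True, False, False, False, False])
g0, g1 = routed_grads(x, y, gate, flagged)        # g0 sees no flagged tokens
```

At full scale the same idea applies inside the MoE layer's backward pass rather than to hand-written gradients, and the flag comes from a domain classifier over the training data.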
We have already demonstrated:
Strong Isolation at Scale (on Medical Knowledge):
~250,000x compute slowdown at 1.2B parameters after ablating medical experts (medical performance degrades to that of a baseline trained with ~250,000x fewer FLOPs)
~2,500x compute slowdown maintained even after re-calibrating output logits to the medical domain
Isolation effectiveness increases with model scale
Robustness Against Recovery:
Full-model fine-tuning on medical data requires ~500M tokens to restore USMLE performance
Minimal Alignment Tax:
<0.02 nats increase in loss on non-medical domains post-ablation demonstrates performance preservation on legitimate capabilities
Label Efficiency:
Semi-supervised approach outperforms supervised, likely indicating resilience to label errors and reduced labeling requirements
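The expert ablation behind these results can be sketched as masking the flagged experts' gate logits before the softmax, so all routing probability flows to the remaining experts. The snippet below is a minimal illustration with hypothetical names, not our evaluation code:

```python
import numpy as np

def ablate_experts(gate_logits, ablated):
    """Set flagged experts' gate logits to -inf so the softmax
    renormalizes routing over the remaining experts only."""
    logits = np.where(ablated, -np.inf, gate_logits)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    z = np.exp(logits)                                    # exp(-inf) -> 0
    return z / z.sum(axis=-1, keepdims=True)

gate_logits = np.array([[2.0, 1.0, 0.5, -1.0]])
ablated = np.array([False, True, False, False])  # expert 1 holds the flagged domain
probs = ablate_experts(gate_logits, ablated)     # expert 1 gets exactly zero weight
```

Because the intervention touches only the router, it can be toggled per deployment context without retraining or modifying the expert weights themselves.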
Our next goals include:
Multiple Realistic Domains: Expand from medical to virology, nuclear, and other high-risk areas
Scale to Larger Models: Build confidence in extrapolations to current frontier base models.
Thorough Evaluation: Test against advanced adversarial attacks and in-context learning recovery.
Understand Labeling Requirements: Determine minimum quality and quantity requirements for domain classification.
Compute resources to scale experiments to larger models, expand to CBRN domains, conduct extensive robustness testing, and perform comprehensive evaluations across different architectures.
Goal 1 (up to $12,000): Supports the ICML rebuttal process and work on understanding labeling requirements (8x H200 GPUs for 2 months).
Goal 2 (up to $75,000): Supports thorough evaluation and expansion to multiple realistic domains (8x H200 GPUs for 6 months).
Ideal Goal ($150,000): Supports all goals plus scaling to larger models (16x H200 GPUs for 6 months).
I'll be working full-time on this project under close mentorship from Alec, meeting with him for a minimum of 1 hour per week, with the option for ad-hoc meetings as needed.
Mentor: Alec Radford - Former OpenAI research scientist with extensive experience in large-scale language model training and research. Author of the GPT-2, CLIP, DALL-E, and Whisper papers.
Google Scholar: https://scholar.google.com/citations?user=dOad5HoAAAAJ&hl=en
Primary author: Krishna Patel - Current Anthropic Fellow with 6 months working on this project with Alec; previously worked on storage optimization and pre-training evaluation at Apple; BS/MS in Computer Science (AI concentration) from Stanford.
Google Scholar: https://scholar.google.com/citations?user=VMaMb3AAAAAJ
We have already demonstrated considerable promise with strong isolation results and minimal performance degradation.
The most likely failure mode is incomplete capability isolation in more complex CBRN domains. However, our medical knowledge results show the approach is fundamentally sound when isolating a subfield.
If the project fails, we would document the fundamental limitations we encountered and provide evidence for investing in alternative safety measures, such as comprehensive data filtration.
$0, previously supported by MATS/Anthropic Fellows Program.