Isolating CBRN Knowledge in LLMs for Safety - Phase 2 (Research)

Technical AI safety · Biomedical · Biosecurity

Krishna Patel

Active grant
$150,000 raised of $150,000 funding goal
Fully funded and not currently accepting donations.

Project summary

Localizing dangerous knowledge within designated experts of Mixture of Experts models, so those experts can be selectively deactivated while preserving model performance on other domains.

What are this project's goals? How will you achieve them?

We aim to address the dual-use threat posed by advanced AI capabilities in Chemical, Biological, Radiological, and Nuclear (CBRN) domains. Using gradient routing in Mixture of Experts (MoE) models, we physically localize dangerous capabilities into designated experts during pre-training, enabling those experts to be selectively activated or deactivated based on deployment context.

Our goal is to implement this with minimal modifications to current MoE training regimes, publish our work, and open source our code to facilitate easy industry adoption.
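To make the mechanism concrete, here is a minimal toy sketch of the gradient-routing idea (our own illustration on scalar "experts", not the project's actual code or training setup). Gradients from flagged-domain examples update only the designated expert, so zeroing that expert at deployment destroys the flagged-domain fit while leaving the benign domain untouched:

```python
import random

def train(steps=4000, lr=0.01, seed=0):
    """Toy gradient routing: two scalar experts, one reserved for flagged data."""
    rng = random.Random(seed)
    w_general, w_flagged = 0.0, 0.0
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)
        if rng.random() < 0.5:                   # flagged ("dangerous-domain") example
            y = 5.0 * x                          # toy flagged-domain relation
            y_hat = (w_general + w_flagged) * x  # both experts active on flagged inputs
            grad = 2.0 * (y_hat - y) * x
            w_flagged -= lr * grad               # routing: only the designated expert updates
        else:                                    # benign example
            y = 2.0 * x                          # toy benign relation
            y_hat = w_general * x
            grad = 2.0 * (y_hat - y) * x
            w_general -= lr * grad
    return w_general, w_flagged

def predict(w_general, w_flagged, x, flagged):
    return ((w_general + w_flagged) if flagged else w_general) * x

w_general, w_flagged = train()  # converges to w_general ≈ 2, w_flagged ≈ 3
# "Ablate" the designated expert at deployment by zeroing it:
flagged_pred = predict(w_general, 0.0, 1.0, flagged=True)   # collapses to benign slope
benign_pred = predict(w_general, 0.0, 1.0, flagged=False)   # unchanged, ≈ 2
```

Real gradient routing operates on learned routers and transformer experts rather than scalars, but the invariant this sketch shows is the same: flagged-domain loss never backpropagates into general experts, so ablating the designated experts removes the capability without taxing performance elsewhere.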

We have already demonstrated:

Strong Isolation at Scale (on Medical Knowledge):

  • ~250,000x compute slowdown at 1.2B parameters after ablating medical experts (medical performance degrades to that of a baseline trained with 250,000x fewer FLOPs)

    • ~62,500x compute slowdown maintained even after re-calibrating output logits to the medical domain

  • Isolation effectiveness increases with model scale

Robustness Against Recovery:

  • Full-model fine-tuning on medical data requires ~500M tokens to restore USMLE performance

Minimal Alignment Tax:

  • <0.02 nats increase in loss on non-medical domains post-ablation demonstrates performance preservation on legitimate capabilities

Label Efficiency:

  • Semi-supervised approach outperforms supervised, likely indicating resilience to label errors and reduced labeling requirements
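For intuition on the alignment-tax figure above: cross-entropy loss is measured in nats, and perplexity is exp(loss), so a loss increase of Δ nats multiplies perplexity by e^Δ. A quick check of the <0.02-nat bound:

```python
import math

delta_nats = 0.02                      # reported upper bound on the loss increase
ppl_multiplier = math.exp(delta_nats)  # perplexity scales by exp(Δ loss)
print(f"perplexity multiplier: {ppl_multiplier:.4f}")  # ~1.0202, i.e. about a 2% increase
```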

Our next goals include:

  • Multiple Realistic Domains: Expand from medical to virology, nuclear, and other high-risk areas

  • Scale to Larger Models: Gain confidence in extrapolating our results to current base models.

  • Thorough Evaluation: Test robustness against advanced adversarial attacks and in-context learning recovery.

  • Understand Labeling Requirements: Determine minimum quality and quantity requirements for domain classification.

How will this funding be used?

Compute resources to scale experiments to larger models, expand to CBRN domains, conduct extensive robustness testing, and perform comprehensive evaluations across different architectures.

Goal 1 (up to $12,000): Support the ICML rebuttal process and work on understanding labeling requirements (8x H200 GPUs for 2 months).

Goal 2 (up to $75,000): Support thorough evaluation and expansion to multiple realistic domains (8x H200 GPUs for 6 months).

Ideal goal ($150,000): Support all of the above plus scaling to larger models (16x H200 GPUs for 6 months).

Who is on your team? What's your track record on similar projects?

I'll be working full-time on this project under the close mentorship of Alec, meeting with him for a minimum of 1 hour per week, with the option of ad-hoc meetings as needed.

  • Mentor: Alec Radford - Former OpenAI research scientist with extensive experience in large-scale language model training and research; lead author of the GPT-2 paper and an author of CLIP, DALL-E, and Whisper.

    • Google Scholar: https://scholar.google.com/citations?user=dOad5HoAAAAJ&hl=en

  • Primary author: Krishna Patel - Current Anthropic Fellow; 6 months working on this project with Alec; previously worked on storage optimization and pre-training evaluation at Apple; BS/MS in Computer Science (concentration in AI) from Stanford.

    • LinkedIn: https://www.linkedin.com/in/krishnakpatel/

    • Google Scholar: https://scholar.google.com/citations?user=VMaMb3AAAAAJ

What are the most likely causes and outcomes if this project fails?

We have already demonstrated considerable promise with strong isolation results and minimal performance degradation.

The most likely failure mode is incomplete capability isolation in more complex CBRN domains. However, our medical-knowledge results show the approach is fundamentally sound when isolating a subfield.

If the project fails, we will document the fundamental limitations we encountered, providing evidence for investing in alternative safety measures such as comprehensive data filtration.

How much money have you raised in the last 12 months, and from where?

$0, previously supported by MATS/Anthropic Fellows Program.

Comments (4) · Donations (4) · Similar (7)
donated $24,110

Marcus Abramovitch

14 days ago

Emptying the rest of my Manifold balance for now into this and planning to give more. I hope this gets the full $150k required.

After a ~45 min call with Krishna, I was very impressed with how clearly she communicated what she was doing, why she was doing it, and the metrics she had for results. Furthermore, she had good reasons for wanting this work done outside of a major lab.

I also think this is a very promising and plausible safety feature to combat CBRN risk from models: isolate the part of the model that holds that knowledge while keeping the rest of its capabilities intact.

Overall, I found Krishna to be very smart. In my brief conversation with her mentor Alec from a while ago, I also think he has great insights.

I don't really have major reservations here. I think this grant will be similar to my grant to Joseph Bloom from a couple of years ago.

donated $24,110

Marcus Abramovitch

20 days ago

This looks extremely promising to me. Want to have a 30 min call about it?


Krishna Patel

20 days ago

Thanks, we're really excited about it too! What's the easiest way to coordinate a chat? @MarcusAbramovitch

donated $24,110

Marcus Abramovitch

20 days ago

@Krishna-Patel I sent you a LinkedIn message but now I see your email is on your profile. Coming up