Isolating CBRN Knowledge in LLMs for Safety - Phase 2 (Research)

Technical AI safety · Biomedical · Biosecurity

Krishna Patel

Active grant
$150,000 raised of $150,000 funding goal
Fully funded and not currently accepting donations.

Project summary

Localizing dangerous knowledge within designated experts of Mixture of Experts models, so those experts can be selectively deactivated while preserving model performance on other domains.

What are this project's goals? How will you achieve them?

We aim to address the dual-use threat posed by advanced AI capabilities in Chemical, Biological, Radiological, and Nuclear (CBRN) domains. Using gradient routing in Mixture of Experts (MoE) models, we physically localize dangerous capabilities into designated experts during pre-training, enabling those experts to be selectively activated or deactivated based on deployment context.

Our goal is to implement this with minimal modifications to current MoE training regimes, publish our work, and open source our code to facilitate easy industry adoption.
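To make the mechanism concrete, here is a minimal toy sketch of the gradient-routing idea (our own illustration on scalar "experts", not the project's actual code or training setup). Gradients from flagged-domain examples update only the designated expert, so zeroing that expert at deployment destroys the flagged-domain fit while leaving the benign domain untouched:

```python
import random

def train(steps=4000, lr=0.01, seed=0):
    """Toy gradient routing: two scalar experts, one reserved for flagged data."""
    rng = random.Random(seed)
    w_general, w_flagged = 0.0, 0.0
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)
        if rng.random() < 0.5:                   # flagged ("dangerous-domain") example
            y = 5.0 * x                          # toy flagged-domain relation
            y_hat = (w_general + w_flagged) * x  # both experts active on flagged inputs
            grad = 2.0 * (y_hat - y) * x
            w_flagged -= lr * grad               # routing: only the designated expert updates
        else:                                    # benign example
            y = 2.0 * x                          # toy benign relation
            y_hat = w_general * x
            grad = 2.0 * (y_hat - y) * x
            w_general -= lr * grad
    return w_general, w_flagged

def predict(w_general, w_flagged, x, flagged):
    return ((w_general + w_flagged) if flagged else w_general) * x

w_general, w_flagged = train()  # converges to w_general ≈ 2, w_flagged ≈ 3
# "Ablate" the designated expert at deployment by zeroing it:
flagged_pred = predict(w_general, 0.0, 1.0, flagged=True)   # collapses to benign slope
benign_pred = predict(w_general, 0.0, 1.0, flagged=False)   # unchanged, ≈ 2
```

Real gradient routing operates on learned routers and transformer experts rather than scalars, but the invariant this sketch shows is the same: flagged-domain loss never backpropagates into general experts, so ablating the designated experts removes the capability without taxing performance elsewhere.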

We have already demonstrated:

Strong Isolation at Scale (on Medical Knowledge):

  • ~250,000x compute slowdown at 1.2B parameters after ablating medical experts (medical performance degrades to that of a baseline trained with 250,000x fewer FLOPs)

    • ~62,500x compute slowdown maintained even after re-calibrating output logits to the medical domain

  • Isolation effectiveness increases with model scale

Robustness Against Recovery:

  • Full-model fine-tuning on medical data requires ~500M tokens to restore USMLE performance

Minimal Alignment Tax:

  • <0.02 nats increase in loss on non-medical domains post-ablation demonstrates performance preservation on legitimate capabilities

Label Efficiency:

  • Semi-supervised approach outperforms supervised, likely indicating resilience to label errors and reduced labeling requirements
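For intuition on the alignment-tax figure above: cross-entropy loss is measured in nats, and perplexity is exp(loss), so a loss increase of Δ nats multiplies perplexity by e^Δ. A quick check of the <0.02-nat bound:

```python
import math

delta_nats = 0.02                      # reported upper bound on the loss increase
ppl_multiplier = math.exp(delta_nats)  # perplexity scales by exp(Δ loss)
print(f"perplexity multiplier: {ppl_multiplier:.4f}")  # ~1.0202, i.e. about a 2% increase
```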

Our next goals include:

  • Multiple Realistic Domains: Expand from medical to virology, nuclear, and other high-risk areas

  • Scale to Larger Models: Gain confidence in extrapolating our results to current base models.

  • Thorough Evaluation: Test robustness against advanced adversarial attacks and in-context learning recovery.

  • Understand Labeling Requirements: Determine minimum quality and quantity requirements for domain classification.

How will this funding be used?

Compute resources to scale experiments to larger models, expand to CBRN domains, conduct extensive robustness testing, and perform comprehensive evaluations across different architectures.

Goal 1 (up to $12,000): Support the ICML rebuttal process and work on understanding labeling requirements (8x H200 GPUs for 2 months).

Goal 2 (up to $75,000): Support thorough evaluation and expansion to multiple realistic domains (8x H200 GPUs for 6 months).

Ideal goal ($150,000): Support all of the above plus scaling to larger models (16x H200 GPUs for 6 months).

Who is on your team? What's your track record on similar projects?

I'll be working full-time on this project under the close mentorship of Alec, meeting with him for a minimum of 1 hour per week, with the option of ad-hoc meetings as needed.

  • Mentor: Alec Radford - Former OpenAI research scientist with extensive experience in large-scale language model training and research; lead author of the GPT-2 paper and an author of CLIP, DALL-E, and Whisper.

    • Google Scholar: https://scholar.google.com/citations?user=dOad5HoAAAAJ&hl=en

  • Primary author: Krishna Patel - Current Anthropic Fellow; 6 months working on this project with Alec; previously worked on storage optimization and pre-training evaluation at Apple; BS/MS in Computer Science (concentration in AI) from Stanford.

    • LinkedIn: https://www.linkedin.com/in/krishnakpatel/

    • Google Scholar: https://scholar.google.com/citations?user=VMaMb3AAAAAJ

What are the most likely causes and outcomes if this project fails?

We have already demonstrated considerable promise with strong isolation results and minimal performance degradation.

The most likely failure mode is incomplete capability isolation in more complex CBRN domains. However, our medical-knowledge results show the approach is fundamentally sound when isolating a subfield.

If the project fails, we will document the fundamental limitations we encountered, providing evidence for investing in alternative safety measures such as comprehensive data filtration.

How much money have you raised in the last 12 months, and from where?

$0, previously supported by MATS/Anthropic Fellows Program.

Comments (4) · Donations (4) · Similar (7)
donated $24,110

Marcus Abramovitch

14 days ago

Emptying the rest of my Manifold balance for now into this and planning to give more. I hope this gets the full $150k required.

After a ~45 min call with Krishna, I was very impressed with how clearly she communicated what she was doing, why she was doing it, and the metrics she had for results. Furthermore, she had good reasons for wanting this work done outside of a major lab.

I also think this is a very promising and plausible safety feature to combat CBRN risk from models: isolate the part of the model that holds that knowledge while keeping the rest of its capabilities intact.

Overall, I found Krishna to be very smart. In my brief conversation with her mentor Alec from a while ago, I also think he has great insights.

I don't really have major reservations here. I think this grant will be similar to my grant to Joseph Bloom from a couple of years ago.

donated $24,110

Marcus Abramovitch

20 days ago

This looks extremely promising to me. Want to have a 30 min call about it?


Krishna Patel

20 days ago

Thanks, we're really excited about it too! What's the easiest way to coordinate a chat? @MarcusAbramovitch

donated $24,110

Marcus Abramovitch

20 days ago

@Krishna-Patel I sent you a LinkedIn message but now I see your email is on your profile. Coming up