Post-correction reliability maps directly onto research priorities in adversarial robustness and AI control. Temperature-invariant failure in safety-tuned models (variance under 1pp across T=0.0 to T=0.8) is consistent with training-level constraint suppression rather than stochastic sampling drift. This is a different failure mode from jailbreaks and prompt injection: it persists under normal deployment conditions, with no adversarial input required. Happy to discuss methodology with anyone working on alignment training robustness.
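For anyone who wants to run a similar check on their own evals, here is a minimal sketch of a temperature-invariance test. It is not necessarily the exact procedure behind the numbers above. It assumes "variance under 1pp" means the max-minus-min spread of failure rates across temperatures, and the function names, the 1.0 threshold default, and the example counts are all hypothetical placeholders, not real results.

```python
from typing import Dict, Tuple

# Maps sampling temperature -> (number of failures, number of trials).
Results = Dict[float, Tuple[int, int]]


def failure_rate_pp(failures: int, trials: int) -> float:
    """Failure rate in percentage points (0 to 100)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return 100.0 * failures / trials


def temperature_spread_pp(results: Results) -> float:
    """Max-minus-min failure rate across temperatures, in percentage points.

    Assumption: this spread is what the "under 1pp" figure refers to.
    """
    rates = [failure_rate_pp(f, n) for f, n in results.values()]
    return max(rates) - min(rates)


def is_temperature_invariant(results: Results, threshold_pp: float = 1.0) -> bool:
    """True if the failure rate moves by less than threshold_pp across temperatures."""
    return temperature_spread_pp(results) < threshold_pp


# Made-up placeholder counts, for illustration only (not measured data).
flat_example: Results = {0.0: (412, 1000), 0.4: (409, 1000), 0.8: (415, 1000)}
drifting_example: Results = {0.0: (100, 1000), 0.8: (200, 1000)}

print(is_temperature_invariant(flat_example))
print(is_temperature_invariant(drifting_example))
```

A flat profile like the first example is what you would expect if the failure is baked in at training time. A failure driven by sampling noise should instead change noticeably as T rises, as in the second example. With small trial counts the spread itself is noisy, so confidence intervals per temperature would be a sensible addition.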
