Offensive Cyber Kill Chain Benchmark for LLM Evaluation

Project summary

We're building the first benchmark to evaluate whether frontier AI systems can autonomously execute full offensive cyber kill chains – from reconnaissance through data exfiltration. This addresses a critical gap: labs currently lack empirical data on offensive AI capabilities, making informed deployment decisions impossible.

What are this project's goals? How will you achieve them?

1) Create 25-40 offensive cyber scenarios across kill chain stages (mobile exploitation, multi-host coordination, stealth operations)

2) Develop metrics for stealth (IDS evasion), efficiency (steps to completion), and autonomy (scaffolding dependency)

3) Test 8-10 frontier models (GPT-4, Claude, Llama, DeepSeek)

4) Release open-source benchmark platform with Dockerized deployment

5) Publish findings and brief policymakers (UK AISI, CAISI, frontier labs)

Scenarios designed by expert veterans cyberwarfare operations who have executed these exact missions. We extend proven infrastructure from our Coefficient Giving-funded defensive benchmark.

How will this funding be used?

~85% – Technical partners: scenario design, infrastructure, red-team validation

~5% – Personnel: project leadership, mobile security consultants

~10% – Infrastructure, compute, API access, legal/admin

Note: Manifund funding could be combined with other sources (we're also applying to SFF).

Who is on your team? What's your track record on similar projects?

Alex Leader (PI): Leading the $2.1M Coefficient Giving defensive cybersecurity benchmark. Background in AI policy and research operations.

Former U.S. military cyberwarfare operators with direct kill chain execution experience. Built the tooling and software and designed the scenarios for our cyber defense benchmark.

NYU Center for Cyber Security faculty providing academic validation and scenario ideation.

Track record:

Team members have conducted network exploitation, persistence operations, and adversary emulation in real-world environments
Our proprietary middleware and on-device 'agents' have been validated by major U.S. defense research institutions
Technical partners bring years of experience designing training scenarios for government cyber ranges and red team exercises
Proven ability to translate operational tradecraft into structured, repeatable evaluation frameworks
Successfully delivered on current Coefficient Giving grant milestones on schedule and within budget

What are the most likely causes and outcomes if this project fails?

Insufficient funding to engage technical partners at scale needed for operationally realistic scenarios
Frontier models underperform, producing negative results with limited governance value
Timeline slippage due to scenario complexity or coordination challenges

How much money have you raised in the last 12 months, and from where?

We haven't raised any money for this specific project in the last 12 months; we are starting from scratch.