The economics of AI safety evaluation are fundamentally broken. As models grow more capable, comprehensive adversarial testing is approaching or exceeding the cost of training the models themselves. This creates a dangerous dynamic: thorough safety evaluation becomes a luxury only the largest labs can afford, and even they face pressure to cut corners. Smaller teams building agentic systems are left choosing between inadequate testing and unsustainable compute budgets. The result is a proliferation of multi-agent systems deployed without rigorous security validation.
This project develops evaluation infrastructure that maintains detection quality while reducing compute costs by 10-20x through targeted sampling and attack pattern recognition. The approach draws directly from systemic risk analysis in financial markets, where I spent years identifying hidden correlations and mapping how failures cascade through complex systems. When agents interact across boundaries, vulnerabilities emerge that do not exist in isolation: accountability disperses, reasoning breaks down in specific patterns, and attacks propagate through the network in ways that mirror correlation risk in trading portfolios. The framework identifies these patterns efficiently rather than testing exhaustively.
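To make "targeted sampling" concrete, here is a minimal Python sketch of the idea, assuming a simple risk-scoring heuristic; the names (interaction_risk, select_targets) are illustrative placeholders, not the framework's actual API.

```python
# Illustrative sketch: score agent-interaction pairs by how risky their shared
# boundaries look, then test only the top slice instead of every pair.
from itertools import combinations

def interaction_risk(a, b, prior_findings):
    # Hypothetical heuristic: pairs sharing more boundaries (tool access,
    # memory, delegation) and boundaries with more past failures rank higher.
    shared = a["boundaries"] & b["boundaries"]
    history = sum(prior_findings.get(boundary, 0) for boundary in shared)
    return len(shared) + history

def select_targets(agents, prior_findings, budget):
    # Rank all pairs, keep only as many as the compute budget allows
    # (e.g. 5-10% of the exhaustive set).
    pairs = list(combinations(agents, 2))
    pairs.sort(key=lambda p: interaction_risk(p[0], p[1], prior_findings),
               reverse=True)
    return pairs[:budget]
```

Under this framing, the expensive step of actually running adversarial probes only happens for the small, high-risk slice of interactions rather than the full combinatorial set.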
The outcome is an open source evaluation system with reproducible benchmarks that smaller teams can actually afford to use. Right now, serious adversarial testing for multi-agent systems is gatekept by budget. This work removes that barrier and makes rigorous safety evaluation economically viable at scale. If successful, teams shipping agentic systems will have access to the same caliber of security testing that currently only exists inside major labs. If it fails, we will have concrete data on whether efficient evaluation can maintain quality standards, which is valuable negative information the field needs either way.
The goal is to build efficient evaluation infrastructure that scales adversarial testing for distributed AI systems without matching training costs. Current evaluation economics are broken: when testing costs equal or exceed model training costs, perverse incentives emerge and thorough safety testing gets skipped. The broader objective is to democratize this capability so smaller teams can run rigorous adversarial testing without lab-scale budgets.
The achievement path involves running systematic experiments across model families including GPT-5, Sonnet 4.5, and Grok to test how vulnerabilities propagate when agents interact. I will build an attack taxonomy based on boundary-crossing patterns where accountability disperses across the system. The validation step tests whether targeted sampling catches 80% or more of critical failures while using only 5-10% of the compute of exhaustive testing. The deliverable is an open source framework with reproducible benchmarks showing the cost reduction while maintaining detection quality. This applies the analytical framework I used to spot mispriced assets in markets to the task of finding mispriced risk in agent networks.
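That validation criterion can be stated as a small, self-contained check. The following is only a sketch of the measurement, assuming both runs emit comparable failure identifiers and API-call counts; it is not the framework itself.

```python
# Sketch of the success check: how many of the critical failures found by the
# exhaustive baseline does the cheap, targeted run also catch, and at what
# fraction of the compute? Inputs are assumed to be sets of failure IDs and
# raw API-call counts from the two runs.

def detection_recall(exhaustive_failures: set, targeted_failures: set) -> float:
    # Fraction of baseline critical failures also caught by the targeted run.
    if not exhaustive_failures:
        return 1.0
    return len(exhaustive_failures & targeted_failures) / len(exhaustive_failures)

def meets_target(recall: float, targeted_calls: int, exhaustive_calls: int) -> bool:
    # Success criterion from the plan: >= 80% recall at <= 10% of the compute.
    compute_fraction = targeted_calls / exhaustive_calls
    return recall >= 0.80 and compute_fraction <= 0.10
```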
The entire $10k-$15k will be allocated to compute credits across three platforms: the OpenAI API for GPT-5 testing receives $4k-$6k, the Anthropic API for Sonnet 4.5 testing receives $3k-$4k, and the xAI API for Grok testing receives $2k-$3k. An additional $1k-$2k is reserved as a buffer for replication runs. The target is 500-800 hours of multi-agent evaluation experiments testing interaction patterns, boundary vulnerabilities, and attack propagation dynamics across model families. I am currently experiencing the exact problem I am trying to solve: compute costs are becoming the primary constraint on development progress.
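As a rough sanity check on those numbers (back-of-envelope assumptions, not quotes from any provider's pricing), the budget implies the following per-hour spend:

```python
# Back-of-envelope: $10k-$15k of API credits spread over 500-800 experiment
# hours implies roughly $12-$30 of API spend per hour of evaluation.
budget_low, budget_high = 10_000, 15_000
hours_high, hours_low = 800, 500

print(budget_low / hours_high)   # 12.5  -> cheapest case, $/hour
print(budget_high / hours_low)   # 30.0  -> most expensive case, $/hour
```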
I am a solo researcher with a strong execution bias. I started trading in high school and college before moving into banking in NYC, where I spent years in long/short equity trading. Pattern recognition and systemic risk mapping were core to the job, which involved finding hidden correlations, spotting mispriced assets, and mapping cascading risks across portfolios. Over the past year, I realized this analytical framework maps directly to adversarial testing in AI. When I test agentic systems, I see attack propagation through networks, reasoning breakdown patterns, and accountability dispersion through the same lens I applied to tracking correlation risk in markets. I left banking a few weeks ago to work full time on AI safety and am currently in SF meeting with research teams at Stanford and other groups working on adversarial testing and multi-agent systems. I have built working prototypes of the evaluation framework in Python using standard ML tooling, including Transformers, PyTorch, LangChain, and API integrations. My finance background trained me to ship fast and iterate based on results.
The most likely causes of failure include sampling strategies that do not generalize across model families, where what catches vulnerabilities in Sonnet might miss GPT-5-specific attack vectors. Compression assumptions could break if the attack surface proves more diverse than anticipated. The compute allocation might prove insufficient for statistical significance if $15k does not cover enough experimental runs or if API costs spike unexpectedly. Framework design decisions could create technical debt that limits adoption or makes replication difficult for other researchers. Even if the project fails, the outcomes still produce value: proving that cheap evaluation methods cannot maintain quality is important data for the field, and I would publish findings either way. The worst case is an ambiguous result that neither validates the approach nor provides clean failure data, which would waste the grant and produce noise instead of signal.
I have raised zero external funding. I have been self-funded via trading profits since leaving my banking role in NYC a few weeks ago. There are no grants, no institutional backing, and no other fundraising efforts either active or completed. I have been covering compute costs out of pocket while developing the initial prototypes, which works at small scale but is hitting budget constraints as testing needs grow. This is my first external funding request.