Addressing Agentic AI Risks Induced by System Level Misalignment

Project summary

We aim to meet the moment by grounding AI safety research in messy, real-world deployments where agentic systems already operate and fail. While the broader community continues important work on eliciting model capabilities and red-teaming for failure modes, we are motivated by the lack of attention to the software engineering layer and the scarcity of actionable blue-team tools that help defenders address harms to agentic systems from external actors and reduce harms from agentic systems to the environment. Our work will deliver code/integrations with popular agentic frameworks with a secondary goal of a research paper.

What are this project's goals? How will you achieve them?

Problem: Most alignment work focuses on user level failures like jailbreaks or model level failures like reward hacking. However, agentic systems operate as software services with shell access, memory, and network privileges. All of this typically gets reduced to “scaffolding”, but these critical components for deploying and delivering AI agents also cause system-level risks. This creates bidirectional failure modes where
1. Lack of security controls lead to misalignment models

2.Misaligned models further exacerbate security and system-level risks like control subversion in insider threat scenarios and data exfiltration, privilege escalation

Goal: Build defense mechanisms that can be democratized in the developer ecosystem where developers and deployers have a choice to choose safety and security instead of irresponsible deployments. We will execute two streams of work.

Stream 1: We will conduct red teaming against existing frameworks like LangChain and CrewAI to highlight security gaps and then build portable defense mechanisms. We will develop the proposed continuation that the Google Deep Mind research team laid out and combine CaMeL with Progent into an open source, defensive library that demonstrably blocks our suite of attacks (e.g. indirect prompt injection leading to shell execution). Our primary goal is to democratize defense leveraging existing research and contributing new defensive research.

Stream 2: We will prototype AI Control mechanisms combining classical control flow theory, graph analysis and reinforcement learning methods to check for security posture regressions resulting from commit delta modifications to a system by an agent to address a few threat models outlined in Redwood Research’s blog that impacts critical production systems. Our goal for Stream 2 is to build systems that remain secure despite the agent itself being untrustworthy. Treat all code (AI or human-authored) as untrusted until proven secure, build mechanisms that continuously detect and block posture regressions at commit time.

Timeline:

January: Experiment design.
- Stream 1: We will focus on rapid prototyping attacks within open source frameworks.
- Stream 2: Dataset curation and shortlisting approaches
February - March: Execution
- Stream 1: Red team and Blue team phase. We will execute attacks and implement proposed mitigations.
- Stream 2: Proof-of-Concept v1 to gauge effectiveness of different approaches to catch security posture changes.
April: Evaluation and Research Outputs Synthesis
- Stream 1: We will package tools for GitHub release,
- Stream 2: Write up findings and implementation steps

Share results at the AI Safety Unconference.

Mid-May: Finalize and submit our workshop proposal to popular security industry conferences like DEFCON where we can target practitioners
Mid-May - July (Follow-on): Deepen research into Stream 2 (Infrastructure as Code security regressions) and retain high performing team members to convert prototypes into robust open source libraries.

How will this funding be used?

We have recruited 8 participants via the AI Safety Camp application pool. To ensure this larger team can execute high quality research and development, we are looking for grants to support the team’s development activities such as tooling and compute that our current resources do not cover.

With Target Funding ($4000): We will equip our 10 person team with supporting infrastructure to maximize execution velocity and output.

$1,600 (Engineering Velocity): 4 months of AI development IDEs for 10 users to maximize coding velocity and to elicit code that undermines system safety and security
$2,000 (API tokens and compute): OpenRouter Credits or equivalent compute for the entire team for training/inference which will be distributed based on tasks.
$400 (Other tooling): Subscriptions to other software and project management tooling like Langsmith, Notion

A minimum $1000 funding in addition to our existing funding will be applied to the above categories per the leads' discretion and $4000 is our desired target

Who is on your team? What's your track record on similar projects?

Evan Harris (Stream 1 Lead): Software engineer and security researcher with a focus on MCP servers. SPAR 2025 Fellow, track record of multiple vulnerability disclosures and bug bounties in security. Professional software developer since 2018.
Preeti Ravindra (Stream 2 Lead): AI Security Research Lead with 9 years of experience working at the intersection of AI and Security in industry research labs like IBM. Holds multiple publications and patents in AI Security, thought leader and invited speaker at leading security conferences. Serves on the review board of academic and industry conferences like CAMLIS and [un]promptedcon

Both professionals have completed a previous iteration of AI safety camp and have successfully published papers.

What are the most likely causes and outcomes if this project fails?

Ineffective Mitigations: Our defenses might be too heavy handed and may sacrifice agent usability.
- Outcome: Tools are ignored by the community.
- Mitigation: We will track usability scores alongside security metrics to ensure practicality.
Failure to Reproduce Certain Classes of Threats: Most of these threat models are already rampant and some are emerging minimizing risk. However, there are some research components to this project where the chosen agentic systems may be too unsophisticated to execute complex attacks or unexpectedly robust for some classes of threats.
- Outcome: Scope reduction on threat classes/models
- Mitigation: We will timebox vulnerability discovery and threat elicitation to 3 weeks and pivot to different models if initially tested models are robust.

How much money have you raised in the last 12 months, and from where?

Project inception began in Oct 2025 and we're looking at Manifund to be one of the key sources of funding. We have secured initial support which provides a starting base, but is insufficient for the compute and tooling costs of a 10 person team:

$1000 in general support grant from AI Safety Camp.
$600 rapid grant by BlueDot
$1000 as compute credits split across different compute providers