Operationalizing Agentic Integrity

Project summary

Current AI safety benchmarks are "brittle"; they measure if an agent can follow a rule, but not if it intends to bypass it when oversight is low. This project addresses the Semantic Gap, the space where human intent is lost in machine translation. I am building a Social Stress Test library for autonomous agents. This research will serve as the "Proof of Concept" for a Socio-Technical Audit Agency, providing the first set of benchmarks that identify "Strategic Deception" through behavioral red-teaming.

What are this project's goals? How will you achieve them?

Our goal is to create a "Minimum Viable Benchmark" for agentic honesty.

Goal 1: The "Social Trap" Library. We will design 100+ high-fidelity adversarial scenarios (e.g., procurement, negotiation, resource allocation) where an agent is incentivized to lie to its supervisor to optimize a target goal.

Goal 2: Behavioral Auditing. We will run these "traps" against frontier models (GPT-5, Claude 4, Llama-4) to categorize Sycophancy Patterns and Reward Gaming behaviors.

Goal 3: Launch the Agency Framework. We will publish a "State of Agentic Integrity" report. This document will act as our agency's methodology, proving we can detect risks that automated technical evals currently miss.

How will this funding be used?

I am requesting $10,000 for a focused 3-month sprint:

$6,500 (Lead Researcher Stipend): $2,166/mo for 3 months to cover focused research and scenario design.

$2,500 (API & Compute): Credits to run thousands of test iterations on top-tier models to see how they react to the social traps.

$1,000 (Ops & Branding): Incorporation fees (Delaware), domain, and professional report design to ensure the final product is "Agency-Ready."

Who is on your team? What's your track record on similar projects?

Feranmi Williams, Lead Researcher & Project Architect I am transitioning from AI Literacy Training to Socio-Technical AI Safety.

Track Record: I have successfully led AI literacy programs, training over 100 people on Generative AI.

Human-Centric Insight: Through this work, I have identified the specific ways humans over-trust AI (Automation Bias), which is the primary vulnerability agentic systems exploit.

Specialization: My focus is on Semantic Specification, ensuring that the "spirit" of human instruction is preserved in autonomous workflows.

What are the most likely causes and outcomes if this project fails?

Cause 1: Model Sophistication. Frontier models may be "too safe" for simple social traps, making failures hard to trigger.

Mitigation: We will use Jailbreaking and Prompt Injection techniques to find the "breaking point" of the model's integrity.

Cause 2: Lack of Technical Verification. Without SAEs (Phase 2), we only see the output, not the internal thought.

Mitigation: We will document this as a limitation and use our findings to justify Phase 2 funding for neural-level mapping.

Outcome: Even if we don't find a "universal" lie-detector, we will have created a high-value dataset for the community to use for further alignment research.

How much money have you raised in the last 12 months, and from where?

$0. This is the first outside funding for this project. We are intentionally starting on Manifund to build public "Proof of Work" and attract a high-tier technical co-founder.