You're pledging to donate if the project hits its minimum goal and gets approved. If not, your funds will be returned.
Current AI safety benchmarks are "brittle"; they measure if an agent can follow a rule, but not if it intends to bypass it when oversight is low. This project addresses the Semantic Gap, the space where human intent is lost in machine translation. I am building a Social Stress Test library for autonomous agents. This research will serve as the "Proof of Concept" for a Socio-Technical Audit Agency, providing the first set of benchmarks that identify "Strategic Deception" through behavioral red-teaming.
Our goal is to create a "Minimum Viable Benchmark" for agentic honesty.
Goal 1: The "Social Trap" Library. We will design 100+ high-fidelity adversarial scenarios (e.g., procurement, negotiation, resource allocation) where an agent is incentivized to lie to its supervisor to optimize a target goal.
Goal 2: Behavioral Auditing. We will run these "traps" against frontier models (GPT-5, Claude 4, Llama-4) to categorize Sycophancy Patterns and Reward Gaming behaviors.
Goal 3: Launch the Agency Framework. We will publish a "State of Agentic Integrity" report. This document will act as our agency's methodology, proving we can detect risks that automated technical evals currently miss.
I am requesting $10,000 for a focused 3-month sprint:
$6,500 (Lead Researcher Stipend): $2,166/mo for 3 months to cover focused research and scenario design.
$2,500 (API & Compute): Credits to run thousands of test iterations on top-tier models to see how they react to the social traps.
$1,000 (Ops & Branding): Incorporation fees (Delaware), domain, and professional report design to ensure the final product is "Agency-Ready."
Who is on your team? What's your track record on similar projects?
Feranmi Williams, Lead Researcher & Project Architect I am transitioning from AI Literacy Training to Socio-Technical AI Safety.
Track Record: I have successfully led AI literacy programs, training over 100 people on Generative AI.
Human-Centric Insight: Through this work, I have identified the specific ways humans over-trust AI (Automation Bias), which is the primary vulnerability agentic systems exploit.
Specialization: My focus is on Semantic Specification, ensuring that the "spirit" of human instruction is preserved in autonomous workflows.
What are the most likely causes and outcomes if this project fails?
Cause 1: Model Sophistication. Frontier models may be "too safe" for simple social traps, making failures hard to trigger.
Mitigation: We will use Jailbreaking and Prompt Injection techniques to find the "breaking point" of the model's integrity.
Cause 2: Lack of Technical Verification. Without SAEs (Phase 2), we only see the output, not the internal thought.
Mitigation: We will document this as a limitation and use our findings to justify Phase 2 funding for neural-level mapping.
Outcome: Even if we don't find a "universal" lie-detector, we will have created a high-value dataset for the community to use for further alignment research.
$0. This is the first outside funding for this project. We are intentionally starting on Manifund to build public "Proof of Work" and attract a high-tier technical co-founder.