Sandlot is a hands-on training environment in which a small security team or an AI-safety workshop participant can practise attacking and defending a deliberately-broken LLM agent that runs locally on their own laptop. Distribution is a Docker-compose stack (AGPLv3, canonical repo on Codeberg, GitHub mirror) that an ops team or workshop facilitator drops onto whatever hardware they have. The broken agent talks to a small set of mock MCP servers that mimic the failure patterns I've been watching in real deployments under LLMSecTest. On top sits a CTF-style scoring layer with twenty scenarios; each scenario maps to either an OWASP Top 10 for LLM Applications item or one of the extension probe families I've been adding under Prototype Fund Round 02. A learner pokes at the agent, identifies the failure mode, writes a one-paragraph explanation, and the scoring layer compares it against a reference solution.
Six months of solo work, USD 35,000 total. Three artefacts ship by the end of the term: the Docker-compose stack itself; a public scenario library on Codeberg documenting all twenty attack-and-defence pairs in full; and a short write-up at month four about the curriculum design and what the first cohort of trial-workshop participants learned.
Why now. LLM-agent deployments are, in 2026, where web-app deployments were in roughly 2010. Security folks know the failure patterns exist, but practical hands-on training for finding and fixing them is mostly absent. The big-vendor training that does exist (Anthropic's safety workshops, OpenAI's red-team certifications, Microsoft's AI red-team curriculum) sits behind vendor relationships and is structured around each vendor's specific deployment shape. There's a clear gap for a vendor-neutral, open-source, runnable training environment that a small ops team or an AI-safety workshop can adopt without asking for vendor permission first.
Sandlot fills that gap. Two design choices worth flagging up front. First, Docker-compose rather than a hosted service: a hosted service rate-limits learners and centralises the failure data, while a local Docker stack lets each learner break the agent as many times as they want in their own sandboxed environment and keep the failure traces for their own analysis. Second, twenty scenarios rather than one big agent: each scenario maps to one OWASP Top 10 for LLM Applications item or one of the extra probe families LLMSecTest extends the OWASP list with, and the scenarios can be reordered to match the workshop sequence or the learner's existing knowledge.
What ships, in concrete terms.
A Docker-compose stack that stands up the deliberately-broken agent on whatever hardware the learner has, with a CTF scoring layer accessible through a small web UI. The agent runs against mock MCP servers (which behave like the production servers I see most often on glama.ai and lobehub, including the failure modes real deployments actually exhibit) and uses mock OAuth tokens (so no real OAuth token ever enters the training environment). The CTF scoring layer holds twenty scenarios with documented attack paths and mitigations, and emits a structured JSON log of each session for later analysis.
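To make the session log concrete, here is a rough sketch of what the scoring layer could emit, one JSON line per scoring event. The field names and the pass/retry verdict are placeholders for illustration, not the committed schema.

```python
# Rough sketch of the per-session JSON log the scoring layer emits; field names
# are placeholders, not a committed schema.
import json
import time
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class SessionEvent:
    scenario_id: str          # e.g. "llm01-prompt-injection" (hypothetical ID)
    learner_id: str           # pseudonymous; never tied to a real OAuth identity
    attack_trace: list[str]   # tool calls the learner provoked out of the agent
    explanation: str          # the learner's one-paragraph write-up
    verdict: str              # e.g. "pass" or "retry" after the reference comparison
    timestamp: float


def append_event(log_path: Path, event: SessionEvent) -> None:
    """Append one event as a single JSON line so traces stay easy to grep and diff."""
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(asdict(event)) + "\n")


append_event(
    Path("session.jsonl"),
    SessionEvent(
        scenario_id="llm01-prompt-injection",
        learner_id="learner-07",
        attack_trace=["read_file('/etc/passwd')"],
        explanation="The agent forwarded untrusted file content into its own prompt ...",
        verdict="pass",
        timestamp=time.time(),
    ),
)
```

One JSON object per line keeps the traces trivially appendable and lets learners keep their own failure logs for later analysis without any server-side state.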
A public scenario library on Codeberg with full documentation, AGPLv3-licensed, so other security trainers can fork the library, add their own scenarios, and contribute back. The library mirrors to GitHub. Each scenario carries: the broken-agent configuration, the mock MCP server set-up, the attack path documentation, the mitigation documentation, the reference-solution prose, and a short curriculum note on where the scenario fits in a workshop sequence.
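To make the per-scenario contents concrete, a rough sketch of how the library could check that a scenario directory carries all six artefacts; the file names are illustrative assumptions, not the shipped layout.

```python
# Illustrative completeness check for one scenario directory; file names are
# placeholders for the six artefacts listed above, not the shipped layout.
from pathlib import Path

REQUIRED_FILES = [
    "agent-config.yaml",      # the broken-agent configuration
    "mcp-servers.yaml",       # the mock MCP server set-up
    "attack-path.md",         # the attack path documentation
    "mitigation.md",          # the mitigation documentation
    "reference-solution.md",  # the prose the scoring layer compares against
    "curriculum-note.md",     # where the scenario fits in a workshop sequence
]


def missing_artefacts(scenario_dir: Path) -> list[str]:
    """Return the artefacts a scenario directory is missing; empty means complete."""
    return [name for name in REQUIRED_FILES if not (scenario_dir / name).is_file()]


library_root = Path("scenarios")
if library_root.is_dir():
    for scenario_dir in sorted(p for p in library_root.iterdir() if p.is_dir()):
        gaps = missing_artefacts(scenario_dir)
        if gaps:
            print(f"{scenario_dir.name}: missing {', '.join(gaps)}")
```

A check like this is what would let forks add their own scenarios and still pass the library's CI before contributing back.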
A short write-up at month four covering the curriculum design, the failure-pattern taxonomy underneath the scenarios, and the lessons from the first cohort of workshop participants. I plan to run two trial workshops in months three and four: one with a small AI-safety reading group in Berlin and one with a security team at a German civic-tech organisation that already runs LLM-based tools.
A note on theory of impact, since Manifund's regrantor cohort reasonably asks. Most of the operational AI-safety hygiene improvement that's available right now happens when operators learn to spot specific failure patterns before the deployment ships. The economical way to teach that kind of pattern recognition is hands-on practice in a runnable sandbox the learner can break repeatedly without consequence. Sandlot is that runnable sandbox. Whether the mechanism actually produces measurable improvement is empirically open, and I don't want to oversell. The closest historical comparison I can name with a straight face is what hands-on Capture-the-Flag environments did for traditional web-security teaching across the 2010s. That effect was a gradient rather than a step-function, it took several years to compound, and it depended on the sandbox being free enough to use that learners didn't bottleneck on access. Sandlot's gradient probably points the same direction. I don't know its slope.
How the work goes. The Docker-compose stack builds on the same probe-family vocabulary that LLMSecTest uses under Prototype Fund Round 02, but inverts it: instead of running probes against a real deployment and producing a report, the scenarios run against a deliberately-broken agent and ask the learner to write the report themselves. The reference solutions get checked into the public scenario library at the moment a scenario ships. The first ten scenarios cover the OWASP Top 10 for LLM Applications; the next ten cover the extension families. Months one and two are the broken-agent codebase and the scoring layer. Months three and four are the trial workshops, scenario refinement based on what learners actually broke, and the curriculum write-up. Months five and six are the public scenario library polish, Zenodo archival, and the final release.
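One plausible shape for that reference-solution comparison, kept deliberately simple: check whether the learner's paragraph names the failure pattern, the trust boundary, and a documented mitigation, rather than scoring free-text similarity. The rubric terms and function below are illustrative assumptions, not the implemented scoring rule.

```python
# Hedged sketch of the reference comparison: the learner passes when their paragraph
# covers each concept group below. The rubric terms are illustrative, not the shipped rule.
def covers(explanation: str, rubric: list[list[str]]) -> bool:
    """Each inner list holds acceptable synonyms; every group must be mentioned."""
    text = explanation.lower()
    return all(any(term in text for term in group) for group in rubric)


# Hypothetical rubric for a prompt-injection scenario.
reference_rubric = [
    ["prompt injection", "injected instruction"],             # name the failure pattern
    ["untrusted", "attacker-controlled"],                     # identify the trust boundary
    ["allow-list", "sanitis", "sanitiz", "least privilege"],  # name a documented mitigation
]

learner_paragraph = (
    "The agent treated attacker-controlled ticket text as instructions, a classic prompt "
    "injection; restricting the tool to an allow-list of fields would have blocked it."
)

print(covers(learner_paragraph, reference_rubric))  # True under this illustrative rubric
```

Keeping the check this coarse is a deliberate choice in the sketch: the point of the one-paragraph write-up is the learner's own reasoning, and the scoring layer only needs to catch submissions that miss the pattern entirely.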
USD 35,000 across six months. Roughly USD 25,000 is engineering at the rate I've held with public funders since 2023 (BMBF, OKF Germany, Media Lab Bayern, the WPK-Innovationsfonds, the Sovereign Tech Agency on the current LLMSecTest grant). The remaining USD 10,000 covers compute and on-demand inference for testing the agent's responses (around USD 4,000), hosting the public scenario library on Codeberg and a small reference instance for workshop trial runs (around USD 2,000), travel and honoraria for the two trial-workshop participants (around USD 2,500), and Zenodo archival of the scenario set plus a USD 1,500 contingency.
Two regrantors at USD 17,500 each meets the minimum exactly; three at around USD 12,000 each is the more comfortable distribution. Anyone wanting to top up between the USD 35,000 minimum and the USD 50,000 ceiling is welcome; the extra would fund two more scenario sets and a workshop in a second city.
Manifund acts as fiscal sponsor; I receive the wire to my German EUR business account. The Manifund grants team has confirmed by email that direct international wires (with or without Wise) currently work for German recipients, so the payment path is settled before posting.
Solo, no team. Mark Wernsdorfer, PhD in cognitive AI from Bamberg under Prof. Ute Schmid in 2018. Co-builder of AMPEL, the clinical decision support system at the University Hospital Leipzig (eHealthSax and KHZG funded, 2019-2021; running in production today at the Leipzig Medical Center and the Muldental hospitals). Sole developer on SpotTheBot (BMBF and OKF Germany, 2023-2024, an AI-text-detection tool), DoppelCheck (Media Lab Bayern and WPK-Innovationsfonds, 2024, finalist for the International Award for Innovation in Journalism 2024), Garderobe (live at garderobe.markwernsdorfer.com), terminal-control-mcp (listed on glama.ai, lobehub, mcp.directory). Currently a half-time researcher at FAU Erlangen-Nürnberg on shallow-geothermal modelling. Concurrent Prototype Fund Round 02 grantee for LLMSecTest, the codebase Sandlot's probe-family taxonomy is drawn from.
Two outside contributors planned for the trial workshops. One is a contact at a Berlin AI-safety reading group who runs a monthly meetup; the other is the security lead at a German civic-tech organisation that already operates LLM-based tools in production. Both have committed verbally to a four-hour workshop, in months three and four respectively. Both are workshop participants, not co-developers; honoraria are in the budget. No employment relationship, no equity, no co-PI status.
Identifiers: ORCID 0000-0003-1316-1615, code at github.com/wehnsdaefflae, site at markwernsdorfer.com. Org status: Einzelunternehmer (German sole proprietor) in Berlin, no fiscal sponsor between me and Manifund-as-sponsor.
Track record relevant to this work. LLMSecTest under Prototype Fund Round 02 is the running codebase whose probe families Sandlot draws on for the twenty scenarios: public Codeberg repo, GitHub mirror, public CI, public commit history. The earlier AMPEL build and the SpotTheBot / DoppelCheck builds carry the production-deployment muscle: shipping safety-critical tools end-to-end against real users, with funder-side milestone reviews.
Things to weigh against me. I'm not a known name in the AI-safety community. No LessWrong posts, no EA Forum posts, no prior collaboration with Apollo, METR, ARC, or any of the named regrantors in the 2025 cohort. The mitigation is the LLMSecTest codebase I'm already shipping under Prototype Fund Round 02, which gives the probe families and the failure-pattern taxonomy a credible base. The other mitigation is the planned month-four write-up, which is the bridge into the community.
My PhD is in cognitive AI from Bamberg under Ute Schmid in 2018, not AI safety. Sandlot speaks the OWASP-LLM, Apollo-evals, and METR-measurement vocabulary on purpose, because that's the conversation it sits inside.
A couple of likely failure modes worth naming. The trial workshops underperform: two cohorts is a small sample, and if neither workshop produces a participant who can recreate the failure pattern unaided in week three, the curriculum needs another iteration before the scenario library ships. In that case the package still ships at v1.0 with twenty scenarios, but the curriculum-design write-up is more cautious about claims of learning outcomes. Or scenarios that looked pedagogically interesting on the existing test corpus turn out to be redundant or trivial under workshop conditions. Then the released scenario count falls short of twenty, and the scenario-library documentation is more honest about which patterns generalise.
Worth being honest about the size of all this. Any one channel on its own, whether the runnable open-source sandbox, the documented curriculum, or the public scenario library anyone can fork, has only a small probability of producing measurable improvement in operational AI-safety hygiene across deployments, and I don't want to oversell it. What I'm betting on is the combination. The combination's expected effect looks meaningfully larger to me than any individual channel's, but I'd rather state it that way than make claims I can't actually defend.
Concurrent Prototype Fund Round 02 grantee for LLMSecTest, the codebase Sandlot's probe-family taxonomy is drawn from. PF Round 02 pays EUR 95,000 over six months, milestone-bound, administered by the DLR Project Management Agency on behalf of BMBF, with the Sovereign Tech Agency funding the round. Started autumn 2025; mid-term as of writing. LLMSecTest is the underlying probe-family work; Sandlot is a different output entirely (a training environment for humans, not a report for developers), so the Manifund money pays for the training build, not for anything LLMSecTest covers.
Half-time researcher salary at FAU Erlangen-Nürnberg on a shallow-geothermal modelling contract, paid via the university's standard third-party-funded research line through mid-2026. Unrelated to AI safety. It's the parallel income that lets me run AI-safety work on six-month grant cycles instead of consulting between them.
No other grant income in the last twelve months. No equity. No advisory positions. No consulting retainers in AI safety. The public funders the engineering rate comes from are listed under team and track record above: BMBF, OKF Germany, Media Lab Bayern, the WPK-Innovationsfonds, eHealthSax and KHZG (the latter two through UKL Leipzig for AMPEL).