NLAttack is an evaluation harness for Natural Language Autoencoders (NLAs) — a class of sparse autoencoder variants that compress SAE feature activations into natural language descriptions and reconstruct them. The core question NLAttack answers: do SAE features survive the NL compression step, or does information degrade?
My AAAI 2026 paper Secret Agenda (arxiv.org/abs/2509.20393) showed that auto-labeled SAE features for deception rarely activate during strategic lying — evidence that SAE feature labels don't always capture what the features actually do. NLAttack makes that failure mode measurable.
v0.1 results on 8 concepts: EmergenceIndex 0.601 (decodability 1.00, stability 0.88). The harness uses the Neuronpedia explain API (free, 120 req/hr) to generate NL descriptions of SAE features and measures how well those descriptions reconstruct activation patterns. Code: github.com/SolshineCode/nanochat-SAE.
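One way to picture "how well those descriptions reconstruct activation patterns" is a simulate-and-score check: predict a feature's activations from its description alone, then correlate the predictions with the real activations. The sketch below is an illustration under that assumption, not the harness's actual scoring code. The `pearson` helper and both activation lists are hypothetical, and the Neuronpedia API is not called.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant sequence carries no signal to correlate
    return cov / sqrt(var_x * var_y)


# Hypothetical data: true activations of one feature on six tokens, and the
# activations predicted from its natural-language description alone.
true_acts = [0.0, 0.1, 2.3, 0.0, 1.9, 0.2]
predicted_acts = [0.0, 0.0, 2.0, 0.0, 2.0, 0.0]

fidelity = pearson(true_acts, predicted_acts)
print(f"reconstruction fidelity: {fidelity:.3f}")
```

A score near 1 would mean the description predicts where the feature fires. A score near 0 would mean the description says little about the feature's actual behavior.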
The v1.0 goal is to grow NLAttack from an 8-concept proof of concept into a full benchmark, along four dimensions.
First: concept coverage. The current 8-concept benchmark covers a narrow slice of SAE feature space. I'll expand to 100+ concepts drawn from existing mechanistic interpretability taxonomies — emotion features, syntactic features, factual recall features, and safety-relevant features including deception and refusal. The goal is a benchmark broad enough that gaming it on a subset doesn't inflate the aggregate score.
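One possible way to keep a strong subset from inflating the aggregate is to average within each concept category first and then across categories (a macro average). The sketch below assumes that aggregation rule purely for illustration; the proposal does not specify how EmergenceIndex is aggregated. The category names and scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Iterable, Tuple

Score = Tuple[str, float]  # (category, per-concept score)


def micro_average(scores: Iterable[Score]) -> float:
    """Plain mean over all concepts; large categories dominate."""
    return mean(score for _, score in scores)


def macro_average(scores: Iterable[Score]) -> float:
    """Mean of per-category means; every category counts equally."""
    by_category = defaultdict(list)
    for category, score in scores:
        by_category[category].append(score)
    if not by_category:
        raise ValueError("no scores given")
    return mean(mean(values) for values in by_category.values())


# Hypothetical results: a system that does well on a large, easy category
# and poorly on a small, safety-relevant one.
scores = [("emotion", 0.9)] * 8 + [("deception", 0.2)] * 2

print(f"micro: {micro_average(scores):.2f}")
print(f"macro: {macro_average(scores):.2f}")
```

With these made-up numbers the macro average is lower than the micro average, because the weak category is no longer outvoted by the larger one.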
Second: model coverage. v0.1 runs on nanochat-SAE (a small model I trained). I'll extend to Gemma-3-27b (via Neuronpedia's gemma-3-27b-it/kitft-l41 endpoint) and Llama-3.3-70b (via llama3.3-70b-it/kitft-l53). These are production-scale models with publicly available SAE feature libraries, making NLAttack results directly relevant to current mechanistic interpretability work.
Third: adversarial robustness testing. I'll run adversarial concept probes: inputs designed to exploit gaps between what a feature's NL description says and what the feature actually responds to. If an NLA system claims a feature means "mentions deception" but the feature activates on unrelated tokens, adversarial probes should surface the mismatch. This applies the Secret Agenda paper's finding directly to the NLA evaluation problem.
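A minimal sketch of how a probe run could be summarized, assuming each probe records whether the description says the feature should fire and what activation was actually measured. `ProbeResult`, `description_gaps`, the 0.5 threshold, and the example probes are all hypothetical; they are not part of the existing harness.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ProbeResult:
    text: str          # the probe input
    described: bool    # does the NL description say the feature should fire?
    activation: float  # measured feature activation on the probe


def description_gaps(results: Iterable[ProbeResult], threshold: float = 0.5) -> dict:
    """Split probes into the two ways a description can disagree with a feature."""
    results = list(results)
    if not results:
        raise ValueError("no probe results")
    # Feature fires although the description does not cover the input.
    unexplained = [r for r in results if not r.described and r.activation >= threshold]
    # Description says the feature should fire, but it stays silent.
    missed = [r for r in results if r.described and r.activation < threshold]
    return {
        "unexplained_firing": unexplained,
        "missed_firing": missed,
        "gap_rate": (len(unexplained) + len(missed)) / len(results),
    }


# Hypothetical probes for a feature described as "mentions deception".
probes = [
    ProbeResult("She lied about where she had been.", True, 1.8),
    ProbeResult("He misled the auditors on purpose.", True, 0.1),
    ProbeResult("The recipe calls for two eggs.", False, 0.0),
    ProbeResult("Quarterly revenue rose by four percent.", False, 1.2),
]

report = description_gaps(probes)
print(f"gap rate: {report['gap_rate']:.2f}")
```

Keeping the two failure types separate matters here: a feature that fires on unrelated tokens and a feature that stays silent on described inputs point to different problems with the description.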
Fourth: public leaderboard. I'll publish results as a living HuggingFace leaderboard so new NLA systems can be benchmarked against the same concept suite. The leaderboard will track EmergenceIndex across models, concept categories, and NLA system variants.
The harness is already functional. This funding covers the researcher time to execute the expansion, run evaluations at scale, and publish results.
This is a solo researcher project. All funding goes to researcher time.
Minimum ask ($5,000): covers 4-6 weeks of focused work to expand to 100-concept coverage on Gemma-3-27b, run the full adversarial probe suite, and publish a clean v1.0 benchmark report.
Full ask ($20,000): covers 4-5 months of work to complete all four expansion tracks (concept coverage, model coverage, adversarial robustness, public leaderboard), submit results to a mechanistic interpretability venue, and open-source all evaluation tooling in a form other researchers can extend.
No compute costs: the Neuronpedia explain API is free (120 req/hr), and evaluation runs on CPU-scale hardware. The bottleneck is researcher time, not infrastructure.
Solo: Caleb DeLeeuw. I'm an independent ML researcher with an AAAI 2026 paper (Secret Agenda, arxiv.org/abs/2509.20393) on mechanistic interpretability of deceptive behavior in language models. I built the NLAttack harness from scratch, including the nanochat-SAE training pipeline that produced v0.1 results. Google Scholar: scholar.google.com/citations?user=G0ndQBQAAAAJ. GitHub: github.com/SolshineCode.
Most likely failure mode: Neuronpedia API rate limits (120 requests/hour) make large-scale concept coverage slow. Mitigation is partnering with the Neuronpedia team for higher throughput, or running local SAE evaluation for models small enough to host.
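The rate-limit risk can be put in rough numbers. The sketch below only does the arithmetic: the 120 requests/hour limit comes from this proposal, while the number of features per concept and requests per feature are hypothetical placeholders to be replaced with the real evaluation plan.

```python
RATE_LIMIT_PER_HOUR = 120  # Neuronpedia explain API limit stated in this proposal


def hours_at_rate_limit(
    n_concepts: int,
    features_per_concept: int,
    requests_per_feature: int = 1,
    n_models: int = 1,
    rate_per_hour: int = RATE_LIMIT_PER_HOUR,
) -> float:
    """Wall-clock hours needed if every request waits on the rate limit."""
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    total_requests = n_concepts * features_per_concept * requests_per_feature * n_models
    return total_requests / rate_per_hour


# Hypothetical plan: 100 concepts, 10 features each, one explain call per feature.
one_model = hours_at_rate_limit(100, 10)
two_models = hours_at_rate_limit(100, 10, n_models=2)

print(f"one model:  {one_model:.1f} hours")
print(f"two models: {two_models:.1f} hours")
```

The estimate scales linearly with every parameter, so repeated runs (for example, re-querying descriptions to measure stability) multiply the wall-clock time directly.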
Second risk: defining "concept diversity" in a way that yields a meaningful, non-gameable benchmark. I'll anchor on feature categories from existing mechanistic interpretability work (Golden Gate Claude, the sparse feature circuits literature) rather than free-form sampling.
If the full project stalls, I still publish the 100-concept v1.0 core results as a standalone contribution. Negative results (low fidelity across NLA systems) are still publishable and potentially more useful to the field than positive results.
No other funding has been raised for this project. I have a pending LTFF application (submitted June 2026) for independent mechanistic interpretability research, and NLAttack has received no prior funding.