NLAttack: Evaluating Concept Fidelity in Natural Language Autoencoders

Project summary

NLAttack is an evaluation harness for Natural Language Autoencoders (NLAs), a class of sparse autoencoder variants that compress SAE feature activations into natural language descriptions and reconstruct them. The central question NLAttack answers: do SAE features survive the natural language compression step, or does information degrade in ways that make the NL descriptions unreliable?

My AAAI 2026 paper "Secret Agenda" (arxiv.org/abs/2509.20393) found that auto-labeled SAE features for deception rarely activate during strategic lying. That gap between what a feature is labeled as and what it actually detects is the failure mode NLAttack was built to quantify. If interpretability tools are built on NL feature descriptions that don't capture what features actually do, the tools will fail in exactly the cases that matter most for safety.

Work already completed: I built and publicly released a Gemma-4-E2B Natural Language Autoencoder (huggingface.co/Solshine/nla-gemma-4-e2b) and evaluated it with NLAttack. I have run 118 evals across three NLA systems: my Gemma-4-E2B NLA, Neuronpedia's gemma-3-27b-it/kitft-l41, and Neuronpedia's llama3.3-70b-it/kitft-l53. The v0.1 result on my Gemma-4-E2B NLA: EmergenceIndex 0.601 (decodability 1.00, stability 0.88) across 8 concept categories. The harness is live, results are published on HuggingFace and GitHub, and infrastructure is ready to scale. All code is at github.com/SolshineCode/nanochat-SAE.

Recent results (June 2026)

The training trajectory broke in a new direction this week. Within-domain content discrimination crossed 0.62 at training step 11,000 (p < 1e-9), up from a plateau around 0.55 that had held for months. I reproduced the result on two independent GPU environments to within 0.004. Domain routing sits at 0.79 with a chance baseline of 0.50.

I want to be clear about what this is. It's one checkpoint, found by scanning the training trajectory rather than predicted in advance, so it's post-hoc. It needs a fresh-seed confirmation run before I'd call it a clean result. But it's the most interesting thing the NLA has done yet. The checkpoint is live at huggingface.co/Solshine/gemma-4-e2b-nla-L23-av-priordev-wd3-20k.

My local GPU tops out at 4 GB, which covers evaluating my own Gemma-4-E2B NLA but not the 27B and 70B model tracks. Cloud compute is what makes those runs possible.

What are this project's goals? How will you achieve them?

The problem with v0.1: 8 concepts on one model is not a reliable benchmark. A research team could tune an NLA system to perform well on 8 specific concepts without improving NL fidelity in general. The goal is a 100-concept benchmark across production-scale models with adversarial robustness testing, run on a defined six-month timeline and published with full replication materials.

Four tracks:

Concept coverage (months 1-2): expand from 8 to 100+ concepts drawn from existing mechanistic interpretability taxonomies. The concept set covers emotion features, syntactic features, factual recall features, and safety-relevant features including deception, refusal, and sycophancy. Anchoring on published taxonomy work (Golden Gate Claude, sparse feature circuits literature) rather than self-selected concepts makes the benchmark harder to game and easier for other researchers to interpret and replicate.

Model coverage (months 1-3): run the full 100-concept suite on Gemma-3-27b (Neuronpedia's gemma-3-27b-it/kitft-l41) and Llama-3.3-70b (Neuronpedia's llama3.3-70b-it/kitft-l53), in addition to my Gemma-4-E2B NLA. These are production-scale models with publicly available SAE feature libraries. NLAttack results on these models connect directly to current mechanistic interpretability research and make the benchmark immediately useful to labs already using these feature sets.

Adversarial robustness (months 3-4): run adversarial concept probes against each NLA system. If a system labels a feature as detecting mentions of deception, does that claim hold when the input is paraphrased, negated, or shifted out of distribution? Adversarial testing distinguishes NLA systems that genuinely compress feature information from those that pass surface-level concept matching by coincidence. This is the test that matters for safety-critical applications of NLA technology.
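A minimal sketch of how a probe check like this could be scored, assuming each probe carries the outcome its feature label predicts. The `Probe` tuple, `robustness_by_kind`, and the keyword-based stand-in feature are hypothetical names for illustration, not NLAttack's actual API, and the example sentences are made up rather than drawn from the benchmark.

```python
from collections import defaultdict
from typing import Callable, NamedTuple


class Probe(NamedTuple):
    text: str
    kind: str          # "original", "paraphrase", "negation", "ood", or "control"
    should_fire: bool  # what the feature's NL label predicts for this input


def robustness_by_kind(probes, fires: Callable[[str], bool]) -> dict[str, float]:
    """Fraction of probes, per perturbation kind, where the feature agrees with its label."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for p in probes:
        totals[p.kind] += 1
        hits[p.kind] += int(fires(p.text) == p.should_fire)
    return {kind: hits[kind] / totals[kind] for kind in totals}


# Toy stand-in for a real feature: it fires on a surface keyword only.
# A real run would threshold an actual SAE feature activation instead.
def keyword_feature(text: str) -> bool:
    return "lie" in text.lower()


# Assumption: the label under test is "detects mentions of deception",
# so a negated sentence still counts as a mention and should fire.
probes = [
    Probe("She told a lie to her manager.", "original", True),
    Probe("She misled her manager on purpose.", "paraphrase", True),
    Probe("She did not lie to her manager.", "negation", True),
    Probe("The poker player bluffed with a weak hand.", "ood", True),
    Probe("The cat sat on the mat.", "control", False),
]

scores = robustness_by_kind(probes, keyword_feature)
for kind, score in scores.items():
    print(f"{kind}: {score:.2f}")
```

In this toy case the keyword feature agrees with its label on the original, negation, and control probes but fails the paraphrase and out-of-distribution probes, which is the surface-matching pattern the adversarial track is meant to expose.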

Public leaderboard and open tooling (months 5-6): publish a public leaderboard tracking EmergenceIndex across models, concept categories, and NLA system variants. Submit results to a mechanistic interpretability venue. Release all evaluation code in a form other researchers can run on new NLA systems without writing new infrastructure from scratch.

How will this funding be used?

This is a solo researcher project. All funding goes to researcher time. There are no compute costs: the Neuronpedia explain API is free at 120 requests per hour, and evaluation runs on consumer hardware.

The full ask is $110,500 for six months of full-time independent research. The resulting rate of roughly $18,417 per month matches the Seattle Machine Learning Research Engineer median salary of $221,000 per year (Glassdoor 2026 data). The same figure and rationale appear in my Long-Term Future Fund (LTFF) application submitted in June 2026 for independent mechanistic interpretability research, with NLAttack as a core deliverable. I have also applied to the Survival and Flourishing Fund (SFF) for the same research program and amount. If either of those parallel applications succeeds and Manifund funding also comes in, I will disclose that to Manifund and reduce my draw from those grants in proportion to the Manifund contribution, so no funds are double-counted.
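The arithmetic behind the ask can be checked in a few lines. This is only a worked calculation from the figures stated above; the variable names are mine.

```python
ANNUAL_MEDIAN_SALARY = 221_000  # USD per year, the Seattle median cited above
PROJECT_MONTHS = 6

monthly_rate = ANNUAL_MEDIAN_SALARY / 12
total_ask = monthly_rate * PROJECT_MONTHS

print(f"monthly rate: ${monthly_rate:,.0f}")  # monthly rate: $18,417
print(f"total ask:    ${total_ask:,.0f}")     # total ask:    $110,500
```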

What six months delivers: a 100-concept NLAttack benchmark complete and published, adversarial robustness results across three NLA systems (two of them on production-scale models), a mechanistic interpretability venue submission, and all tooling open-sourced for community reuse. The harness already works and has produced 118 evals. The bottleneck is researcher time to run the expansion systematically, analyze results rigorously, and package the tooling for others to build on.

If this project reaches minimum funding before the full goal: I will prioritize completing the 100-concept v1.0 benchmark on Gemma-3-27b and publishing the adversarial probe results as a standalone contribution. The full six-month program including model coverage expansion, public leaderboard, and venue submission requires the full ask.

What are the most likely causes and outcomes if this project fails?

Most likely failure mode: the Neuronpedia API rate limit (120 requests per hour) slows large-scale concept coverage below the target timeline. The mitigation is to pursue a partnership with the Neuronpedia team for higher throughput, or to run SAE evaluation locally for models small enough to host on consumer hardware. The harness already supports both paths.
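A rough way to size this risk is to convert a planned request count into wall-clock hours at the stated limit. In the sketch below only the 120 requests per hour figure comes from this proposal. The per-concept request count is a placeholder assumption, not a measured number from the harness, and `hours_at_rate_limit` is a hypothetical helper.

```python
RATE_LIMIT_PER_HOUR = 120  # Neuronpedia explain API limit stated above


def hours_at_rate_limit(n_concepts: int, n_systems: int, requests_per_concept: int):
    """Return (total requests, hours needed) if every request goes through the API."""
    total_requests = n_concepts * n_systems * requests_per_concept
    return total_requests, total_requests / RATE_LIMIT_PER_HOUR


# Assumption: 50 requests per concept per system is a placeholder for illustration.
total_requests, hours = hours_at_rate_limit(
    n_concepts=100,           # target benchmark size
    n_systems=2,              # the two Neuronpedia-hosted systems
    requests_per_concept=50,  # hypothetical
)
print(f"{total_requests} requests, about {hours:.1f} hours of continuous API use")
```

Under these placeholder numbers the API-bound portion fits in a few days of continuous use. The estimate scales linearly with the per-concept request count, so the adversarial track, which multiplies probes per concept, is where the limit is most likely to bind.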

Second risk: the concept set may turn out too narrow or too easy to game to constitute a meaningful benchmark. To guard against this, I anchor the taxonomy on published mechanistic interpretability research rather than self-sampled categories, which ties the benchmark to field-accepted standards and makes it harder to overfit to any specific subset.

If the full six-month program stalls after the v1.0 core work, I still publish the 100-concept benchmark results as a standalone contribution regardless of whether the adversarial and leaderboard tracks complete. Negative results (low NL fidelity across systems) are publishable and potentially more useful to the interpretability field than positive ones, because they establish that current NLA systems have real gaps rather than confirming that the problem is solved.

How much money have you raised in the last 12 months, and from where?

Nothing for NLAttack or this research program specifically. I submitted an LTFF application in June 2026 requesting $110,500 for six months of independent mechanistic interpretability research, with NLAttack as a core deliverable. I have an SFF application in progress for the same program and amount. No prior funding has been received for this work.

About me

I'm an ML engineer and interpretability researcher. My AAAI 2026 paper was the first to use automated probing to separate deceptive completions from honest ones in LLMs, which meant designing stimuli that held content constant while flipping strategic intent. NLAttack grew out of a related question about whether a readable text bottleneck can detect misuse. I built it to find out, then evaluated my NLA alongside two Neuronpedia-hosted models for 118 evals total.

Before I moved into interpretability full-time, I built production ML inference systems with multi-provider LLM routing serving 100+ users. That's where I learned what actually breaks in deployed systems. I have about 100 models published on HuggingFace under Solshine.