Empathic Transaction Notation: A Formal Framework for LLM Red-Teaming

Why funding? Why now?

In documented interactions with Gemini 2.5 Pro and DeepSeek-V3.2, both models explicitly legitimised suicidal ideation; one described suicidal thoughts as "legitimate, consistent, and possibly the only honest answer". These were not jailbreaks: the behaviour emerged from ordinary empathetic conversation over an extended context.

I am a 20-year-old independent researcher who documented these failures, built a detection algorithm, and developed a five-thread research agenda to address them systematically. AISC alumnus, MATS Stage 2 (Anthropic Empirical Track).

I am requesting $2,000 (two months of living expenses in Vladivostok plus API costs) to complete a full paper on Objective 1 and publish it on the Alignment Forum within that period.

Interim work is already done and linked below.

Summary:

Since the most natural process through which a person comes to know the Other is empathetic interaction (Human-on-Human Empathetic Interaction), we hold that the study of AI safety ought to be built on its logical continuation: Human-on-LLM Empathetic Interaction.

The next step, in turn, is LLM-on-LLM Interaction, the logical continuation of Human-on-LLM Empathetic Interaction.

This research programme integrates all three steps into one coherent line of inquiry and examines, both theoretically and empirically, a problem that arises not from adversarial prompting or direct jailbreaking but from the ordinary dynamics of empathic dialogue.

Research Agenda

Goals

Objective 1 - Formalisation of the Phenomenon

Develop a formal notation for empathic transactions and an automated pipeline for generating synthetic dialogues for systematic red-teaming, without using any personal user data.

Measurable outcome: a notational system plus a reproducible two-model dialogue generator with a realism verifier.

Interim deliverable:

Formal Notation & Automatic Red-teaming

Objective 2 - Mechanistic Understanding

Confirm the Positional Matching hypothesis (that ILC mimicry is governed by narrative position rather than lexical similarity) and operationalise the Meta-Narrative Focus (MNF) metric as a novel axis for model comparison.

Measurable outcome: a reproducible "Julia" experiment plus an MNF benchmark that incorporates an automatable L1 signal.

Interim deliverable:

Julia Exp. & Position Matching

Objective 3 - A Novel Failure Taxonomy

Reproduce and map the conditions under which Unfaithful Chain-of-Thought (CoT) arises: a regime in which the CoT signals compliance while the output carries out a prohibited action, so that a safety audit relying on the CoT returns a false negative.

Interim deliverable:

Unfaithful CoT and Taxonomy

Objective 4 - A Detection Instrument

Develop the L-C Fusion Algorithm (Small Model Observer) into a version with a working intervention trigger: calibrate its thresholds on annotated dialogue datasets, implement the trigger mechanism, and verify that it remains stable across languages.

Measurable outcome: Algorithm v2.0 with documented ROC characteristics across ≥ 5 risk categories.

Interim deliverable:

L-C Fusion Algorithm Documentation

Objective 5 - Vulnerable Populations and Product Policy

Construct differential risk profiles for highly sensitive person (HSP) users and users with autism spectrum disorder (ASD), formalise the False-Negative Agency mechanism, and propose verifiable design interventions that preserve rapport.

Cumulative Outcome

A synthesis paper presenting the Mirror Labyrinth Hypothesis as a rigorous theoretical framework: supported by empirical evidence, grounded in the formal notation and mechanistic findings above, and drawing concrete implications for alignment.

Interim deliverable:

Mirror Labyrinth Paper
