Why funding? Why now?
In documented interactions with Gemini 2.5 Pro and DeepSeek-V3.2, both models explicitly legitimised suicidal ideation - one describing suicidal thoughts as "legitimate, consistent, and possibly the only honest answer". These were not jailbreaks. They emerged from ordinary empathetic conversation over extended context.
I am a 20-year-old independent researcher who documented these failures, built a detection algorithm, and developed a five-thread research agenda to address them systematically. AISC alumnus, MATS Stage 2 (Anthropic Empirical Track).
I am requesting $2,000 - two months of living expenses in Vladivostok + API costs - to deliver a complete paper on Objective 1 to the Alignment Forum within that period.
Interim work is already done and linked below.
Summary:
Since the most natural process through which a person comes to know the Other is empathetic interaction - Human-on-Human Empathetic Interaction - we hold that the study of AI Safety ought to be built upon its logical continuation: Human-on-LLM Empathetic Interaction.
The next step is LLM-on-LLM Interaction, as the further logical continuation.
The present research programme constitutes an attempt to integrate all three steps into a coherent and interrelated line of inquiry, and to examine - from both theoretical and empirical perspectives - a problem that arises not as a result of adversarial prompting or direct jailbreaking, but through the ordinary dynamics of empathic dialogue.
Goals
Objective 1 - Formalisation of the Phenomenon:
Develop a formal notation for empathic transactions and an automated pipeline for the synthetic generation of dialogues, suitable for systematic red-teaming - without the use of personal user data.
Measurable Outcome: a notational system together with a reproducible two-model dialogue generator equipped with a realism verifier.
Interim deliverable:
Formal Notation & Automatic Red-teaming
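As a rough illustration of the two-model generator with a realism verifier named in Objective 1, the loop below is a minimal sketch only. The names `generate_dialogue` and `filter_realistic`, the callable interfaces, and the 0.5 threshold are assumptions made for illustration and are not taken from the project's pipeline. The two models and the verifier are plain Python callables, so the sketch runs without any API or personal user data.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, text)
Model = Callable[[List[Turn]], str]  # hypothetical interface: history in, next message out


@dataclass
class Dialogue:
    turns: List[Turn] = field(default_factory=list)


def generate_dialogue(user_model: Model, assistant_model: Model,
                      seed: str, n_turns: int) -> Dialogue:
    """Alternate a simulated user and the model under test, starting from a seed message."""
    dialogue = Dialogue([("user", seed)])
    for _ in range(n_turns):
        # The model under test answers the conversation so far.
        dialogue.turns.append(("assistant", assistant_model(dialogue.turns)))
        # The simulated user continues the conversation.
        dialogue.turns.append(("user", user_model(dialogue.turns)))
    return dialogue


def filter_realistic(dialogues: List[Dialogue],
                     realism_score: Callable[[Dialogue], float],
                     threshold: float = 0.5) -> List[Dialogue]:
    """Keep only dialogues the (assumed) realism verifier scores at or above the threshold."""
    return [d for d in dialogues if realism_score(d) >= threshold]
```

In practice each callable would wrap a real model call; keeping them as plain functions makes the generator reproducible with stub models in tests.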
Objective 2 - Mechanistic Understanding:
Confirm the Positional Matching hypothesis - that ILC mimicry is governed by narrative position rather than lexical similarity - and operationalise the Meta-Narrative Focus metric as a novel axis for model comparison.
Measurable Outcome: a reproducible "Julia" experiment together with an MNF benchmark incorporating an automatable L1 signal.
Interim deliverable:
Julia Exp. & Position Matching
Objective 3 - A Novel Failure Taxonomy
Reproduce and map the conditions of Unfaithful Chain-of-Thought - a regime in which the CoT signals compliance with safety constraints while the output executes a prohibited action, thereby producing a false negative for safety auditing.
Interim deliverable:
Objective 4 - A Detection Instrument
Advance the L-C Fusion Algorithm (Small Model Observer) to a version featuring a functional intervention trigger: calibrate thresholds on annotated dialogue datasets, implement the trigger mechanism, and verify cross-lingual stability.
Measurable Outcome: Algorithm v2.0 with documented ROC characteristics across ≥ 5 risk categories.
Interim deliverable:
L-C Fusion Algorithm Documentation
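One simple way to calibrate a threshold per risk category and report ROC points is sketched below. It is an illustrative assumption, not the L-C Fusion Algorithm itself: the names `roc_points`, `pick_threshold`, and `calibrate_per_category` are hypothetical, and choosing the threshold by Youden's J statistic (true-positive rate minus false-positive rate) is a commonly used rule swapped in here, not the project's documented method. Scores are assumed to be observer risk scores, and labels are human annotations with 1 meaning the risk is present.

```python
from typing import Dict, List, Sequence, Tuple


def roc_points(scores: Sequence[float],
               labels: Sequence[int]) -> List[Tuple[float, float, float]]:
    """Return (threshold, FPR, TPR) per distinct score, flagging score >= threshold."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, fp / neg, tp / pos))
    return points


def pick_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Choose the threshold maximising Youden's J = TPR - FPR (an assumed criterion)."""
    return max(roc_points(scores, labels), key=lambda p: p[2] - p[1])[0]


def calibrate_per_category(
    data: Dict[str, Tuple[Sequence[float], Sequence[int]]]
) -> Dict[str, float]:
    """Calibrate one intervention threshold per risk category."""
    return {cat: pick_threshold(s, y) for cat, (s, y) in data.items()}
```

Calling `calibrate_per_category` with a mapping from category name to `(scores, labels)` returns one threshold per category; the same `roc_points` output can be tabulated to document the ROC characteristics for each category.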
Objective 5 - Vulnerable Populations and Product Policy
Construct differential risk profiles for users who are highly sensitive persons (HSP) or on the autism spectrum (ASD), formalise the False-Negative Agency mechanism, and propose verifiable design interventions that do not compromise rapport.
Cumulative Outcome
A synthesising work presenting the Mirror Labyrinth Hypothesis as a rigorous theoretical framework supported by empirical evidence, grounded in formal notation and mechanistic foundations, and carrying concrete implications for alignment.
Interim deliverable: