Large Language Models (LLMs) display coherent reasoning while repeatedly failing in predictable ways, including hallucination, brittle alignment, and specification gaming. Existing explanations are fragmented and largely empirical, offering limited guidance for principled alignment design.
This project develops and empirically tests a new mechanistic framework that models LLMs as boundary-mediated adaptive systems structured by a Generate–Conserve–Transform architecture. In this view, both token generation and training updates are irreversible boundary write events that shape future behaviour. The framework yields concrete predictions about hallucinations, correction dynamics, and alignment failures, which will be evaluated through small-scale experiments on synthetic tasks and compact transformer models.
The goal is to provide a unifying, testable theory that helps move alignment work from reactive mitigation toward principled system design.
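To make the terminology concrete, the minimal Python sketch below renders the Generate–Conserve–Transform framing as an append-only boundary log: generation and training both append irreversible write events, while Conserve is a read-only check on existing structure. The names and data structures here are illustrative placeholders, not the formal definitions the project will develop.

# Purely illustrative toy rendering of the Generate–Conserve–Transform vocabulary.
from dataclasses import dataclass, field

@dataclass
class Boundary:
    log: list = field(default_factory=list)  # append-only record of write events

    def write(self, kind, payload):
        self.log.append((kind, payload))      # past events are never edited or removed

def generate(boundary, token):
    boundary.write("generate", token)         # inference-time commitment of a token

def transform(boundary, update):
    boundary.write("transform", update)       # training-time parameter update

def conserve(boundary):
    return list(boundary.log)                 # read-only view: existing structure is preserved

b = Boundary()
generate(b, "the")
transform(b, {"layer": 0, "delta": 0.01})
print(conserve(b))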
Develop a clear mechanistic model explaining hallucinations, coherence, and alignment failures in LLMs.
Validate core predictions empirically, rather than leaving the framework as a purely conceptual proposal.
Produce alignment-relevant design insights, including when and why wrapper-style safety methods fail.
Formalise boundary-mediated computation and the Generate–Conserve–Transform architecture in computational terms.
Model inference as trajectories through an attractor field shaped by learned structure.
Test key hypotheses using compact transformer models and synthetic tasks, including:
localisation of hallucinations to weakly shaped regions of latent space (a toy density-proxy measurement is sketched after this list)
irreversible divergence from early incorrect token commitments
distinct failure signatures arising from Generate–Conserve–Transform imbalances
Compare wrapper-only safety constraints with a minimal adaptive “supervisory Transform” intervention (a toy contrast is also sketched below).
The work is intentionally scoped to be feasible without large-scale compute or institutional infrastructure.
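As a rough illustration of the hallucination-localisation hypothesis, the sketch below computes a k-nearest-neighbour sparsity proxy over hidden states and compares it between hallucinated and non-hallucinated samples; under the framework, hallucinated outputs should sit in sparser ("weakly shaped") regions. The hidden states, labels, dimensionality, and choice of k are placeholders, standing in for activations from compact transformers run on synthetic tasks.

# Toy sketch of one planned measurement: a k-NN sparsity proxy over hidden states
# as a stand-in for "weakly shaped" regions of latent space.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Placeholder data: 1000 hidden states (dim 64) and binary hallucination labels.
hidden_states = rng.normal(size=(1000, 64))
hallucinated = rng.integers(0, 2, size=1000).astype(bool)

# Density proxy: mean distance to the k nearest neighbours (higher = sparser region).
k = 10
nn = NearestNeighbors(n_neighbors=k + 1).fit(hidden_states)
distances, _ = nn.kneighbors(hidden_states)
sparsity = distances[:, 1:].mean(axis=1)  # drop the self-distance in column 0

# Prediction under the framework: hallucinated samples sit in sparser regions.
print("mean sparsity (hallucinated):    ", sparsity[hallucinated].mean())
print("mean sparsity (non-hallucinated):", sparsity[~hallucinated].mean())
print("correlation with hallucination label:",
      np.corrcoef(sparsity, hallucinated.astype(float))[0, 1])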
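As a toy contrast for the wrapper comparison, the following sketch pits a post-hoc output filter against an in-loop intervention that masks unsafe continuations before a token is committed. The toy vocabulary, "unsafe" token set, and masking rule are invented for illustration and are not the project's actual supervisory Transform; the intended point is only structural: a wrapper can merely refuse an already-committed trajectory, whereas the in-loop intervention reshapes the distribution before the irreversible write occurs.

# Hypothetical minimal contrast between a wrapper-only constraint and an
# in-loop intervention; the "model" and unsafe-token set are placeholders.
import numpy as np

rng = np.random.default_rng(1)
VOCAB = ["safe_a", "safe_b", "unsafe_x", "eos"]
UNSAFE = {"unsafe_x"}

def toy_logits(context):
    # Stand-in for a small transformer's next-token logits.
    return rng.normal(size=len(VOCAB))

def sample(logits):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return VOCAB[rng.choice(len(VOCAB), p=probs)]

def generate(intervene=False, max_len=20):
    context = []
    for _ in range(max_len):
        logits = toy_logits(context)
        if intervene:
            # In-loop intervention: mask unsafe continuations before commitment.
            for i, tok in enumerate(VOCAB):
                if tok in UNSAFE:
                    logits[i] = -np.inf
        tok = sample(logits)
        if tok == "eos":
            break
        context.append(tok)
    return context

def wrapper_filter(tokens):
    # Wrapper-only constraint: inspect the finished output and refuse it wholesale
    # if an unsafe token was already committed during generation.
    return tokens if not (set(tokens) & UNSAFE) else ["<refused>"]

print("wrapper-only:        ", wrapper_filter(generate(intervene=False)))
print("in-loop intervention:", generate(intervene=True))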
Funding will support a 3–6 month validation phase, focused on theory formalisation and targeted experiments.
Indicative use of funds:
Researcher time: enable focused work through reduced hours or temporary exit from external employment.
Compute and infrastructure: small-scale transformer training, repeated ablation runs, storage, and analysis tooling.
Execution buffer: flexibility for failed experiments, additional runs, or minor tooling needs.
Any unused funds would be returned or reallocated with approval.
This project is led by a single independent researcher.
I work at the intersection of systems theory, machine learning, and AI alignment, with a background in complex technical systems. Over the past year, I have independently developed the core framework underlying this project, including formal definitions, alignment implications, and a concrete experimental plan.
While this work has not yet been externally funded, it builds directly on sustained prior research and has been developed to the point of being empirically testable. The project is deliberately designed for solo execution and does not rely on institutional resources.
Proposed metrics (e.g. curvature or density proxies) may not correlate strongly with hallucination or stability.
Small-scale experiments may be insufficient to cleanly demonstrate predicted effects.
The framework may require refinement or partial revision based on empirical results.
Negative or ambiguous results would still constrain the space of plausible explanations for LLM failure modes.
The work would clarify which aspects of the theory are unsupported, reducing future misdirected alignment efforts.
Even partial results would inform follow-on research directions and funding decisions.
In this sense, failure would be informative rather than wasted effort.
I have not raised external research funding in the past 12 months.