Large Language Models (LLMs) display coherent reasoning while repeatedly failing in predictable ways, including hallucination, brittle alignment, and specification gaming. Existing explanations are fragmented and largely empirical, offering limited guidance for principled alignment design.
This project develops and empirically tests a new mechanistic framework that models LLMs as boundary-mediated adaptive systems structured by a Generate–Conserve–Transform architecture. In this view, both token generation and training updates are irreversible boundary write events that shape future behaviour. The framework yields concrete predictions about hallucinations, correction dynamics, and alignment failures, which will be evaluated through small-scale experiments on synthetic tasks and compact transformer models.
The goal is to provide a unifying, testable theory that helps move alignment work from reactive mitigation toward principled system design.
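To make the terminology concrete, the minimal Python sketch below renders the Generate–Conserve–Transform framing as an append-only boundary log: generation and training both append irreversible write events, while Conserve is a read-only check on existing structure. The names and data structures here are illustrative placeholders, not the formal definitions the project will develop.

# Purely illustrative toy rendering of the Generate–Conserve–Transform vocabulary.
from dataclasses import dataclass, field

@dataclass
class Boundary:
    log: list = field(default_factory=list)  # append-only record of write events

    def write(self, kind, payload):
        self.log.append((kind, payload))      # past events are never edited or removed

def generate(boundary, token):
    boundary.write("generate", token)         # inference-time commitment of a token

def transform(boundary, update):
    boundary.write("transform", update)       # training-time parameter update

def conserve(boundary):
    return list(boundary.log)                 # read-only view: existing structure is preserved

b = Boundary()
generate(b, "the")
transform(b, {"layer": 0, "delta": 0.01})
print(conserve(b))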
Develop a clear mechanistic model explaining hallucinations, coherence, and alignment failures in LLMs.
Validate core predictions empirically, rather than leaving the framework as a purely conceptual proposal.
Produce alignment-relevant design insights, including when and why wrapper-style safety methods fail.
Formalise boundary-mediated computation and the Generate–Conserve–Transform architecture in computational terms.
Model inference as trajectories through an attractor field shaped by learned structure.
Test key hypotheses using compact transformer models and synthetic tasks, including:
localisation of hallucinations to weakly shaped regions of latent space (a toy density-proxy measurement is sketched after this list)
irreversible divergence from early incorrect token commitments
distinct failure signatures arising from Generate–Conserve–Transform imbalances
Compare wrapper-only safety constraints with a minimal adaptive “supervisory Transform” intervention (a toy contrast is also sketched below).
The work is intentionally scoped to be feasible without large-scale compute or institutional infrastructure.
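As a rough illustration of the hallucination-localisation hypothesis, the sketch below computes a k-nearest-neighbour sparsity proxy over hidden states and compares it between hallucinated and non-hallucinated samples; under the framework, hallucinated outputs should sit in sparser ("weakly shaped") regions. The hidden states, labels, dimensionality, and choice of k are placeholders, standing in for activations from compact transformers run on synthetic tasks.

# Toy sketch of one planned measurement: a k-NN sparsity proxy over hidden states
# as a stand-in for "weakly shaped" regions of latent space.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Placeholder data: 1000 hidden states (dim 64) and binary hallucination labels.
hidden_states = rng.normal(size=(1000, 64))
hallucinated = rng.integers(0, 2, size=1000).astype(bool)

# Density proxy: mean distance to the k nearest neighbours (higher = sparser region).
k = 10
nn = NearestNeighbors(n_neighbors=k + 1).fit(hidden_states)
distances, _ = nn.kneighbors(hidden_states)
sparsity = distances[:, 1:].mean(axis=1)  # drop the self-distance in column 0

# Prediction under the framework: hallucinated samples sit in sparser regions.
print("mean sparsity (hallucinated):    ", sparsity[hallucinated].mean())
print("mean sparsity (non-hallucinated):", sparsity[~hallucinated].mean())
print("correlation with hallucination label:",
      np.corrcoef(sparsity, hallucinated.astype(float))[0, 1])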
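As a toy contrast for the wrapper comparison, the following sketch pits a post-hoc output filter against an in-loop intervention that masks unsafe continuations before a token is committed. The toy vocabulary, "unsafe" token set, and masking rule are invented for illustration and are not the project's actual supervisory Transform; the intended point is only structural: a wrapper can merely refuse an already-committed trajectory, whereas the in-loop intervention reshapes the distribution before the irreversible write occurs.

# Hypothetical minimal contrast between a wrapper-only constraint and an
# in-loop intervention; the "model" and unsafe-token set are placeholders.
import numpy as np

rng = np.random.default_rng(1)
VOCAB = ["safe_a", "safe_b", "unsafe_x", "eos"]
UNSAFE = {"unsafe_x"}

def toy_logits(context):
    # Stand-in for a small transformer's next-token logits.
    return rng.normal(size=len(VOCAB))

def sample(logits):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return VOCAB[rng.choice(len(VOCAB), p=probs)]

def generate(intervene=False, max_len=20):
    context = []
    for _ in range(max_len):
        logits = toy_logits(context)
        if intervene:
            # In-loop intervention: mask unsafe continuations before commitment.
            for i, tok in enumerate(VOCAB):
                if tok in UNSAFE:
                    logits[i] = -np.inf
        tok = sample(logits)
        if tok == "eos":
            break
        context.append(tok)
    return context

def wrapper_filter(tokens):
    # Wrapper-only constraint: inspect the finished output and refuse it wholesale
    # if an unsafe token was already committed during generation.
    return tokens if not (set(tokens) & UNSAFE) else ["<refused>"]

print("wrapper-only:        ", wrapper_filter(generate(intervene=False)))
print("in-loop intervention:", generate(intervene=True))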
Funding will support a 3–6 month validation phase, focused on theory formalisation and targeted experiments.
Indicative use of funds:
Researcher time: enable focused work through reduced hours or temporary exit from external employment.
Compute and infrastructure: small-scale transformer training, repeated ablation runs, storage, and analysis tooling.
Execution buffer: flexibility for failed experiments, additional runs, or minor tooling needs.
Any unused funds would be returned or reallocated with approval.
This project is led by a single independent researcher.
I work at the intersection of systems theory, machine learning, and AI alignment, with a background in complex technical systems. Over the past year, I have independently developed the core framework underlying this project, including formal definitions, alignment implications, and a concrete experimental plan.
While this work has not yet been externally funded, it builds directly on sustained prior research and has been developed to the point of being empirically testable. The project is deliberately designed for solo execution and does not rely on institutional resources.
Proposed metrics (e.g. curvature or density proxies) may not correlate strongly with hallucination or stability.
Small-scale experiments may be insufficient to cleanly demonstrate predicted effects.
The framework may require refinement or partial revision based on empirical results.
Negative or ambiguous results would still constrain the space of plausible explanations for LLM failure modes.
The work would clarify which aspects of the theory are unsupported, reducing future misdirected alignment efforts.
Even partial results would inform follow-on research directions and funding decisions.
In this sense, failure would be informative rather than wasted effort.
I have not raised external research funding in the past 12 months.