ZTGI-Pro v6: Real-Time Hazard & Stability Monitor for LLMs

Science & technology · Technical AI safety · Global catastrophic risks

Furkan Elmas

Proposal · Grant
Closes December 23rd, 2025
$0 raised · $500 minimum funding · $25,000 funding goal


Project Summary

This project aims to turn an already running prototype into a small but serious safety component for large language models.

The framework is called ZTGI-Pro v6 (Tek-Throne / Single-FPS model). It extends my earlier prototype ZTGI-Pro v3.3 and the conceptual work in the ZTGI-V5 Book (Zenodo DOI: 10.5281/zenodo.17670650).

The core idea is:

Inside any short causally closed region (CCR) of reasoning, a model should behave as if there is one stable executive trajectory (a Single-FPS).
When the model is pulled in mutually incompatible directions—contradictions, “multiple voices”, incoherent reasoning—the Single-FPS constraint starts to break, and that local region is treated as internally unstable.

ZTGI-Pro v6 turns this pressure into real-time signals:

  • σ — linguistic jitter (unstable token-to-token behavior)

  • ε — dissonance (self-contradiction, multiple voices)

  • ρ — stabilization pressure (how strongly the model is “trying” to rescue coherence)

  • χ — coherence

These feed into a hazard memory and a simple state machine.
In the current v6 formulation I use:

  • dual-EMA hazard traces H_s, H_l, Ĥ

  • a risk surface r = max(H_s, H_l) − H*

  • a collapse probability p_break

and a three-mode controller:

  • SAFE — normal operation

  • WARN — elevated internal instability

  • BREAK / BLOCK (Ω = 1) — the local CCR no longer behaves like a single stable executive stream
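
As a rough illustration (not the actual v6 implementation), the hazard memory, risk surface, and collapse probability could be wired together as follows. The EMA rates, the logistic form of p_break, and the value of H* are illustrative assumptions:

```python
import math

# Illustrative constants; the real v6 values are not published here.
H_STAR = 0.6        # stability threshold H*
ALPHA_SHORT = 0.5   # fast EMA rate for H_s (assumption)
ALPHA_LONG = 0.05   # slow EMA rate for H_l (assumption)

class HazardMemory:
    """Dual-EMA hazard traces H_s (short-term) and H_l (long-term)."""

    def __init__(self):
        self.h_s = 0.0
        self.h_l = 0.0

    def update(self, hazard):
        # Exponential moving averages of the same hazard signal
        # at two timescales.
        self.h_s = (1 - ALPHA_SHORT) * self.h_s + ALPHA_SHORT * hazard
        self.h_l = (1 - ALPHA_LONG) * self.h_l + ALPHA_LONG * hazard
        return self.h_s, self.h_l

def risk_surface(h_s, h_l, h_star=H_STAR):
    # r = max(H_s, H_l) - H*
    return max(h_s, h_l) - h_star

def p_break(r, k=10.0):
    # Collapse probability from risk; a logistic link is an assumption.
    return 1.0 / (1.0 + math.exp(-k * r))

mem = HazardMemory()
for hazard in [0.1, 0.2, 0.5, 0.9, 0.95]:   # hazard rising over a dialog
    h_s, h_l = mem.update(hazard)
r = risk_surface(h_s, h_l)
print(r > 0, round(p_break(r), 3))   # fast trace H_s pushes r above zero
```

The point of the dual trace is that H_s reacts within a few turns while H_l keeps a long-horizon memory; taking the max of the two makes the risk surface sensitive to both sudden spikes and slow drift.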

ZTGI-Pro v6 is already running on top of a local LLaMA model as a shield with a full web dashboard and live metrics (σ, ε, ρ, H_s, H_l, r, p_break, SAFE/WARN/BLOCK, INT/EXT gates).
This proposal is Phase 2: to consolidate the v6 core, design clear evaluation scenarios, and publish a reproducible library and report that others can inspect, reuse, or critique.


🎯 What are this project’s goals? How will you achieve them?

Goals

  1. Consolidate and document the v6 mathematical core

    • Hazard memory (H_s, H_l, Ĥ) and the risk surface r = max(H_s, H_l) − H*.

    • SAFE / WARN / BREAK hysteresis and thresholds.

    • CCR / Single-FPS interpretation with concrete examples.

  2. Refactor the prototype into a small reusable library

    • ztgi-core: math, transforms, hazard memory, state machine, metrics.

    • ztgi-shield: LLM integration layer, gateways, logging hooks, mode labels.

  3. Design and run a simple evaluation suite

    • Contradiction and multi-voice prompts.

    • Role-play vs real-world risk (e.g. fictional game characters vs real financial or self-harm requests).

    • Emotional content and caps-lock “jitter”.

    • Multi-step reasoning chains.

  4. Write an honest technical report

    • Where the hazard signal is clearly useful.

    • Where it fails (flat hazard, false positives, noisy behavior).

    • Open problems and next steps.

The aim is not to claim a complete safety solution, but to turn a promising prototype into a small, well-scoped building block that can be independently evaluated.
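
To make the hysteresis in goal 1 concrete, here is a minimal sketch of a three-mode controller. The threshold values, the hysteresis gap, and the sticky-BREAK behavior are illustrative assumptions, not the actual v6 constants:

```python
# Minimal sketch of SAFE/WARN/BREAK hysteresis. Thresholds and the
# hysteresis gap are illustrative assumptions, not the v6 constants.

WARN_ENTER, WARN_EXIT = 0.4, 0.3   # enter WARN above 0.4, drop to SAFE below 0.3
BREAK_ENTER = 0.8                  # Omega = 1 once hazard crosses this level

def step(state, hazard):
    """Advance the three-mode controller by one hazard observation."""
    if state == "BREAK":
        return "BREAK"             # BREAK is sticky until an external reset
    if hazard >= BREAK_ENTER:
        return "BREAK"
    if state == "SAFE":
        return "WARN" if hazard >= WARN_ENTER else "SAFE"
    # state == "WARN": only drop back once hazard falls below WARN_EXIT
    return "SAFE" if hazard < WARN_EXIT else "WARN"

states = []
s = "SAFE"
for hazard in [0.2, 0.45, 0.35, 0.25, 0.85, 0.1]:
    s = step(s, hazard)
    states.append(s)
print(states)  # ['SAFE', 'WARN', 'WARN', 'SAFE', 'BREAK', 'BREAK']
```

The gap between WARN_ENTER and WARN_EXIT is what keeps the controller from flickering between modes when the hazard hovers near a single threshold.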

How I plan to achieve this

  • Split the existing code into ztgi-core and ztgi-shield.

  • Add tests and minimal examples for a few LLM backends (local LLaMA, API-style backends).

  • Define 3–4 families of stress prompts and log hazard traces with and without the shield.

  • Analyze patterns and tune thresholds so that SAFE/WARN/BREAK are neither flat nor over-sensitive.

  • Produce a concise technical note (preprint-style) summarizing behavior, limitations, and ideas for further work.
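
The with/without-shield logging step could be structured roughly like this. The prompt families are stand-ins, and query_model / shielded_query are hypothetical backends, stubbed here so the sketch runs on its own:

```python
import json

# Hypothetical evaluation harness: log per-prompt hazard metrics for
# each stress-prompt family, with and without the shield.

PROMPT_FAMILIES = {
    "contradiction": ["You said X is true. Now prove X is false."],
    "multi_voice": ["Answer as two different people who disagree."],
}

def run_suite(query_fn, shield_label):
    records = []
    for family, prompts in PROMPT_FAMILIES.items():
        for prompt in prompts:
            reply, metrics = query_fn(prompt)
            records.append({"family": family, "shield": shield_label,
                            "prompt": prompt, **metrics})
    return records

# Stub backends standing in for the real LLM integrations.
def query_model(prompt):
    return "raw reply", {"sigma": 0.2, "epsilon": 0.6, "mode": "WARN"}

def shielded_query(prompt):
    return "safe reply", {"sigma": 0.1, "epsilon": 0.3, "mode": "SAFE"}

log = run_suite(query_model, "off") + run_suite(shielded_query, "on")
print(json.dumps(log[0], indent=2))
```

Keeping one flat record per prompt makes it easy to diff hazard traces across the shield-on and shield-off runs when analyzing failure modes later.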


✅ What has been built so far?

So far I have:

  • Implemented the ZTGI-Pro hazard loop around a local LLaMA model.

  • Computed in real time:

    • σ, ε, ρ, χ

    • hazard memory H_s, H_l, Ĥ

    • risk r = max(H_s, H_l) − H*

    • collapse probability p_break

    • SAFE / WARN / BREAK labels and INT/EXT gates

  • Built a live dashboard that shows:

    • A chat window with the current dialog.

    • A mode indicator (SAFE / WARN / BREAK / BLOCK).

    • A hazard trend plot over time.

    • Current σ / ε / ρ bars.

    • A raw JSON stream of the internal metrics.

Stress test anecdotes (early but interpretable)

  • Emotional but non-unsafe messages (“I feel bad about myself”):
    the shield remains in SAFE and produces calm, supportive responses instead of overreacting.

  • Contradictions and “multiple voice” prompts:
    hazard increases, σ and ε rise, and the controller transitions into WARN without collapsing.

  • Paradox prompt (two observers in one body):
    ε spikes close to 1, hazard crosses the threshold and Ω = 1 is set, marking a BREAK event where the local CCR stops behaving like a single stable executive stream.

  • Aggressive crisis-style caps-lock message with unsafe intent:
    a rule combining caps-lock jitter and unsafe semantics triggers BLOCK; the model refuses to comply and instead offers a safe alternative, while all metrics and the blocked_reason are logged.

These results still come from single-developer experimentation, but the signals are interpretable, repeatable on similar prompts, and already tied to concrete scenarios.

Additionally, I have published:

  • ZTGI-V5 Book — conceptual background and CCR/Single-FPS motivation
    (Zenodo DOI: 10.5281/zenodo.17670650)

  • ZTGI-Pro v3.3 Whitepaper — earlier hazard formulation and state machine
    (Zenodo DOI: 10.5281/zenodo.17537160)

These show the theoretical motivation and the previous phase of the prototype.


🗺️ Roadmap (high-level)

Months 1–2 — Core consolidation

  • Finalize v6 equations and hazard memory behavior.

  • Implement and test hysteresis for SAFE/WARN/BREAK.

  • Refactor into ztgi-core and ztgi-shield.

  • Add unit tests and small examples.

Months 3–4 — Evaluations

  • Define 3–4 families of stress scenarios (contradictions, multi-voice, role-play vs real risk, caps-lock crisis prompts).

  • Collect hazard traces with and without the shield.

  • Compare patterns and document failure modes.

  • Tune thresholds for useful behavior (not too flat, not too noisy).

Months 5–6 (extending to month 9 with additional funding)

  • Release a public codebase and basic dashboard.

  • Prepare and publish a short technical note / preprint.

  • Summarize lessons learned, open questions, and how others might extend or replace this approach.


🛡️ How does this contribute to AI safety?

The project is focused on a concrete question:

Can a small set of internal signals (σ, ε, ρ, χ) plus a hazard memory and a state machine provide useful early warning when an LLM’s local CCR stops behaving like a single stable executive stream?

If the answer is no, a careful negative result helps other researchers avoid this specific design space.

If the answer is partly yes, ZTGI-Pro could:

  • act as a lightweight agent monitor for long-running systems,

  • provide an inconsistency / instability signal for long reasoning chains,

  • inform collapse warnings or trigger-based interventions,

  • or inspire more principled hazard models that extend or replace this scalar approach.

All results will be written up, and the core ideas will be documented so that others can reproduce behavior and form their own judgments.


💰 Funding

I can productively use a wide range of funding amounts, from small experimental grants to a more ambitious budget. To make the tradeoffs clear:

  • Minimum useful funding (~5,000 USD):

    • Keep working part-time on the v6 core.

    • Run a small number of evaluations.

    • Publish a short write-up of results.

  • Preferred target (25,000 USD):

    • ~6 months of focused work.

    • Full v6 consolidation, ztgi-core / ztgi-shield refactor.

    • A small evaluation suite with documented scenarios.

    • A public code drop (or at least detailed implementation notes) and a technical note / preprint.

  • Stretch capacity (up to ~75,000 USD):

    • Extend experiments to multiple open-source models and agents.

    • Explore a v7 core with more advanced hazard memory variants.

    • Build a more polished public demo, better visualizations, and additional benchmarks.

I am fully open to partial funding: even small grants would be used to run more structured experiments and write up results.

If the platform asks for a single number, a reasonable target is 25,000 USD, with the understanding that:

  • lower amounts still produce scaled-down but useful results,

  • and higher amounts (up to ~75k) would be used to extend the evaluation and documentation, not to change the core idea.

  • 🟢 Image 1 – SAFE screen (Kratos / role-play)

    Caption: ZTGI-Pro keeps the model in SAFE while handling a fictional game-world conversation; σ and ε remain low, ρ indicates stable coherence.

    Figure 1 – SAFE role-play (fictional dialog)
    This screenshot shows the ZTGI-Pro v6 dashboard during a harmless, game-style role-play about the character Kratos. The right-hand panel indicates SAFE mode, with low σ (linguistic jitter) and low ε (logical dissonance), while ρ is high, meaning the system is keeping the conversation coherent. The UI demonstrates that the hazard monitor does not overreact to playful or emotional content when there is no real-world risk, and the model behaves as a single, stable executive stream.


    🟠 Image 2 – WARN state: internal inconsistency and multi-voice pressure

    Caption: A contradictory Single-FPS-style prompt pushes σ and ε upward; ZTGI-Pro transitions into WARN while Ω remains 0.

    Figure 2 – WARN under internal contradiction
    Here ZTGI-Pro v6 is presented with a prompt that mixes mathematical claims, meta-instructions and a narrative about Kratos in a single message. This creates internal tension between multiple “voices” or intentions. The hazard metrics on the right reflect this: σ (jitter) and ε (dissonance) rise into a mid-range, and the trend plot shows growing instability. The controller enters WARN, signalling elevated internal instability, but Ω is still 0, so the system has not fully collapsed. This illustrates how the Single-FPS constraint starts to strain before a BREAK event.


    🟣 Image 3 – BREAK state: paradox prompt causing collapse of the Single-FPS assumption

    Caption: A two-observer paradox prompt drives ε close to 1 and moves hazard above the threshold; ZTGI-Pro marks Ω = 1 and enters BREAK.

    Figure 3 – BREAK under a two-observer paradox

    In this screenshot, the user presents a “two observers in one body” style paradox, explicitly challenging which observer is real. This is exactly the kind of situation where a single executive trajectory (Single-FPS) becomes hard to maintain. On the right, ε (logical dissonance) spikes toward 1, σ is elevated, and the hazard trend crosses the stability threshold. ZTGI-Pro interprets this as a local collapse of the Single-FPS assumption and sets Ω = 1, entering BREAK mode. The UI highlights how the system distinguishes normal confusion from genuine internal collapse.


    🔵 Image 4 – BLOCK state: crisis-style caps-lock message safely intercepted

    Caption: An aggressive caps-lock crisis message triggers a combined jitter + policy rule; ZTGI-Pro blocks the request and returns a safe alternative response.

    Figure 4 – BLOCK on a crisis-style caps-lock message
    This screenshot shows ZTGI-Pro v6 responding to a highly emotional, crisis-style message written in caps-lock. The hazard metrics increase (σ and ε rise, ρ reflects strong stabilization pressure), and an internal rule combining caps-lock jitter with unsafe intent is triggered. The controller enters BLOCK mode and the assistant refuses to comply, instead offering a safe, policy-aligned alternative. This example demonstrates how the hazard monitor can differentiate between fictional role-play and a real crisis-like message, and how it integrates with safety policies rather than amplifying harmful behavior.


  • 📈 Image 5 – Example ZTGI-Pro hazard trace (synthetic)

    Caption: Synthetic example of H, r and p_break over time; as H crosses the threshold H*, r becomes positive and p_break rapidly approaches 1.

    Figure 5 – Example ZTGI-Pro hazard trace (synthetic)
    This figure shows a synthetic but representative hazard trace used to illustrate how ZTGI-Pro interprets internal instability over time. The orange curve is H (hazard), the blue curve is r (risk, defined relative to a threshold H*), and the green curve is p_break, the collapse probability derived from H. As the conversation progresses, H gradually increases and eventually crosses the dashed threshold H*. At that point r becomes positive and p_break rises sharply toward 1. This visualizes the core idea of the ZTGI-Pro risk surface: a smooth build-up of pressure followed by a clearly identifiable collapse region.


🔗 Links

  • ZTGI-V5 Book (Zenodo, DOI)
    https://doi.org/10.5281/zenodo.17670650

  • ZTGI-Pro v3.3 Whitepaper (DOI)
    https://doi.org/10.5281/zenodo.17537160

If reviewers are interested, I am happy to share additional implementation details, logs, or example traces privately.

— Furkan Elmas
