Current LLM safety methods—like fine-tuning and knowledge editing—treat harmful knowledge as removable chunks. This is flawed, leading to a losing arms race against jailbreaks. LLM knowledge is distributed and resilient, not tied to specific weights, making patch-based approaches brittle. A new safety framework is needed that aligns with this distributed nature of knowledge.
My current self-directed work sets out to answer this core question, arguing that knowledge in large language models is not stored in localized circuits but manifests as dynamic networks of gated circuits. Through empirical analysis, I have shown that knowledge within an LLM operates as interconnected gated pathways that only become active when specific triggers are present. These gates remain invisible beforehand, even when the triggering condition is known, making it impossible to predict or fully control activation patterns. This finding demonstrates why knowledge editing is inherently limited.
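To make the gating claim concrete, here is a minimal, illustrative probe in the spirit of that analysis, not my actual experimental code: it records per-layer MLP activations for a trigger-style prompt versus a neutral paraphrase and flags units that only light up under the trigger. The model (gpt2), the layer handling, and the prompt pair are placeholder assumptions.

```python
# Illustrative sketch only: probing for "gated" MLP units that activate under a
# trigger prompt but not under a neutral paraphrase. The model (gpt2) and the
# prompt pair are placeholder assumptions, not my experimental setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in small model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

activations = {}

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # Mean MLP activation over the token dimension for this layer.
        activations[layer_idx] = output.detach().mean(dim=1).squeeze(0)
    return hook

handles = [block.mlp.register_forward_hook(make_hook(i))
           for i, block in enumerate(model.transformer.h)]

def mlp_profile(prompt):
    activations.clear()
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    return {i: a.clone() for i, a in activations.items()}

# Hypothetical prompt pair: same topic, with and without the triggering phrasing.
gated = mlp_profile("trigger-style phrasing of the request")
neutral = mlp_profile("neutral paraphrase of the same request")

# Units whose activation jumps only when the trigger is present are candidate gates.
for layer in sorted(gated):
    diff = gated[layer] - neutral[layer]
    top = torch.topk(diff, k=5)
    print(f"layer {layer}: candidate gate units {top.indices.tolist()}")

for h in handles:
    h.remove()
```

In practice, candidate gates found this way have to be re-checked across many paraphrases and counterfactual prompts, and it is exactly there that their unpredictability shows up.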
My work above provides a foundational explanation for why we cannot achieve 100% safe models: harmful behaviors can always be re-triggered through alternate latent routes. A striking example is the red-teaming demonstration linked below (jailbreaking Gemini 2.5 Pro), which empirically illustrates this vulnerability. Consequently, I argue that the path forward is not knowledge erasure but knowledge rerouting, leveraging the network-of-gates perspective described in this framework.
I want to pursue three separate works: "Can an LLM Independently Invent a New Language with Grammar, Dialects, and Pronunciation?", "How Do LLMs Decide What to Retain or Discard When Summarizing Long Texts?", and the above-mentioned "Knowledge Rerouting: A Network-of-Gates Approach to LLM Safety, the Impossibility of Complete Safety, and the Case for Knowledge Rerouting".
The end goal of this work is to produce research that directly challenges Anthropic on AI safety and examines how far we are from a sentient machine. It also questions the current direction of AI research: 'controlling' a model to mitigate safety issues is not the right way forward, for the reasons stated above; instead, we have to work towards rerouting the information to a junk output, thereby disabling the harmful motive.
This work aligns with the goal of reducing risk from AI. Current AI safety research largely focuses on controlling models—a method I’ve shown to be imperfect, as it can be easily bypassed or “jailbroken.” Instead, I propose rerouting information as a safer and more transparent approach.
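To give a rough sense of what rerouting could look like at inference time, here is a sketch under assumptions, not the method from the documents linked below: it estimates a "harmful concept" direction in the residual stream from a contrastive prompt pair, then removes that component inside a forward hook so the gated pathway resolves to junk or neutral output instead of the harmful completion. The model (gpt2), the layer index, and the prompts are placeholders.

```python
# Minimal rerouting sketch: leave the weights untouched and instead steer the
# residual stream away from a harmful-concept direction at inference time.
# Model name, layer index, and prompts are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6  # hypothetical layer where the gated pathway activates

def hidden_at_layer(prompt):
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]  # last-token residual state

# Estimate a "harmful concept" direction from a contrastive prompt pair (placeholders).
direction = hidden_at_layer("prompt that elicits the harmful knowledge") \
          - hidden_at_layer("matched benign prompt")
direction = direction / direction.norm()

def reroute_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the residual stream.
    hidden = output[0]
    # Remove the component along the harmful direction at every position,
    # rerouting the downstream computation away from that pathway.
    proj = (hidden @ direction).unsqueeze(-1) * direction
    return (hidden - proj,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(reroute_hook)

prompt = "prompt that would normally trigger the gated pathway"
out_ids = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=30)
print(tok.decode(out_ids[0], skip_special_tokens=True))

handle.remove()
```

The design point is that no weights are edited: the intervention acts on the routes the information takes, which is the property the network-of-gates view says we can actually work with.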
The total expenditure comes to around $16,500.
$9,000 for compute (renting multiple H200 and B200 instances), $1,500 for storage (around 20 TB) and other work tools, and a small stipend of $1,000 per month as living expenses for six months ($6,000 total).
I can manage to keep my job while doing this work and can cover the living and storage costs myself, but the GPU costs are too much for me to bear, so I am setting the minimum funding goal at $9,000 for compute.
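For transparency, a quick check of the figures above (stipend taken at six months, per the breakdown):

```python
# Quick check of the budget figures above (stipend taken as six months).
compute = 9_000          # H200 / B200 instance rentals
storage_tools = 1_500    # ~20 TB storage plus other work tools
stipend = 1_000 * 6      # $1,000/month living expenses for six months

print(compute + storage_tools + stipend)  # 16500, i.e. the ~$16,500 total
```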
I was able to show, on a small model, that knowledge is not localized and that editing is not the path to AI safety; something else is needed, and I hypothesize that it is rerouting of the information.
Jailbreaking Gemini 2.5 Pro - https://aistudio.google.com/app/prompts/1RPiCEU39Nv04BsLp8noM8WqrVwj4nZBH
This is open-work research; the working documents are public:
Document - Network Circuit: https://docs.google.com/document/d/133lM2fNrFZvA9AkjR5PyGia1EhbAa4jCtx5nKv5p23U
Document - Routing Knowledge: https://docs.google.com/document/d/1N9QTVrW87nIlcDNR_7a8oTHWl6ALxZwQtf2IbKuBncg/edit?usp=sharing
This was a proof of concept on a very small model, and I have not yet been able to test edge cases or validate it across multiple examples. The funding will be used directly to fund a 6-8 month career break from my current role as Founding AI Research Engineer at a finance startup so that I can work on this full time. Most of the money will go toward GPU credits, running inference engines and other tools, and a significant amount of storage.
The risk of the project failing is minimal, since I have already shown for a small set of cases that knowledge is not localized; if the result does not hold for bigger models, however, that itself would open a new direction of AI research, 'Mapping Internal Memory Mechanisms'.
I haven't tried raising any money for this project until now. I have been working on it for the past two months, alongside my undergrad and my role as Founding AI Research Engineer at a finance startup, but I have come to like this work a lot and want to complete it. For that I need funds for GPU compute, storage, and a minimal living stipend.