40%: Skilling up in mechanistic interpretability. There is a lot to be done in interpretability, with clear criteria for success and tight feedback loops, which makes it an ideal use of at least 40% of my time. Since my own alignment research is more uncertain in terms of success or importance, I think it is useful to also be doing something legible that is immediately useful to other researchers.
40%: Multi-agent alignment. I intend to study interactions among multiple agents and see whether better alignment can be achieved this way. My basic reasoning for this is:
With a multitude of agents forming a collective network of intelligence, alignment becomes simpler. The individual motivations of any given agent are not important; what matters are the rules governing and incentivizing the system. Simple rules, complex systems. We already have some understanding of how to do this: tribes, cities, societies, corporations, the stock market, nations, civilizations. Bitcoin has already demonstrated how to technically embed rules and incentives that work at scale.
Interpretability no longer means examining impossibly complicated tensors of floating-point numbers. Instead, all that needs interpreting are the interactions between agents. Those interactions are the cogs in the wheel of the greater network, and they are where the real intelligence, and its directionality, comes from.
More humans means greater intelligence. Having more humans around, and humans working together on collective projects, has astronomically increased human output, as mentioned earlier. Would a human with a very, very large brain be better than the same amount of compute dispersed among different human agents? Perhaps, but perhaps not. I suspect that what drives large movement outside the current data manifold is multiple agents searching for pathways in different directions, and there are already some indications that this might be true.
10%: Managing current things. I run the AI Safety Strategy discord, write for the Alignment Research Newsletter, and do other smaller activities that I think are useful.
10%: Exploration. I may find a new research direction, or change my mind about something being feasible. For this reason, it is always good to be on the lookout for other avenues.
Follow the Mech Interp guides, such as the following: https://transformerlens-intro.streamlit.app/
Start tackling some of the problems outlined here: https://www.lesswrong.com/s/yivyHaCAmMJ3CqSyj/p/LbrPTJ4fmABEdEnLf
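To make this warm-up concrete, here is a minimal sketch of the kind of exercise those guides walk through, assuming the TransformerLens library (which the linked guide is built around); the model, prompt, and the particular head inspected are arbitrary illustrative choices:

```python
# Minimal mech interp warm-up with TransformerLens: load GPT-2 small,
# run a prompt while caching all intermediate activations, then inspect
# one attention head's pattern and the model's next-token prediction.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")

prompt = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)              # [batch, seq_len] token ids
logits, cache = model.run_with_cache(tokens)  # cache holds every activation

# Attention pattern for an arbitrarily chosen head (layer 9, head 6);
# shape is [query_pos, key_pos], and each row sums to 1.
pattern = cache["pattern", 9][0, 6]
print(model.to_str_tokens(prompt))
print(pattern.shape)

# Most likely next token according to the model.
next_token_id = int(logits[0, -1].argmax())
print(model.tokenizer.decode([next_token_id]))
```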
Create various multi-agent collaborations. One example might be a multi-agent Minecraft server, where each 'player' is actually composed of a number of sub-agents, each with its own responsibility (such as planning, ethics, or legality). These meta-agent players would interact with other meta-agents, with their training defaulting to cooperation: protect and defend certain artifacts in the environment, do not harm certain things in the environment even when doing so gives an advantage, and so on.
Another is implementing a toy example of Agent Committees (see the code sketch after this list of setups):
“Language model agent as a metaphorical committee of shoggoths using tools and passing ideas and conclusions in natural language. This committee/agent makes plans and takes actions that pursue goals specified in natural language, including alignment and corrigibility goals. One committee role can be reviewing plans for efficacy and alignment. This role should be filled by a new instance of a shoggoth, making it an independent internal review. Rotating out shoggoths (calling new instances) and using multiple species (LLM types) limits their ability to collude, even if they sometimes simulate villainous Waluigi characters.[1] The committee's proceedings are recorded for external review by human and AI reviewers.”
Another possible setup is mentioned here.
I will also try using the Melting Pot environment suite for implementation: https://github.com/google-deepmind/meltingpot
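As a rough illustration of the committee idea quoted above, here is a minimal sketch of a single plan-and-review step, with the proceedings logged for external review. `query_model` is a hypothetical stand-in for whichever LLM API ends up being used, and the prompts and roles are placeholder assumptions rather than a settled design:

```python
# Minimal sketch of one "agent committee" step: a planner proposes a plan
# and a fresh reviewer instance (ideally from a different model family, to
# limit collusion) audits it for efficacy, alignment, and corrigibility.
# All proceedings are appended to a log for external human/AI review.
import json
from datetime import datetime, timezone

def query_model(model: str, prompt: str) -> str:
    """Placeholder for a real LLM call (API or local model); not implemented here."""
    raise NotImplementedError

def committee_step(goal: str, planner_model: str, reviewer_model: str, log_path: str) -> dict:
    # 1. The planner role proposes a plan in natural language.
    plan = query_model(planner_model, f"Propose a step-by-step plan for: {goal}")

    # 2. An independent reviewer instance checks the plan before it is accepted.
    verdict = query_model(
        reviewer_model,
        "Review the following plan for efficacy, alignment, and corrigibility. "
        f"Reply APPROVE or REJECT, with reasons.\n\nGoal: {goal}\n\nPlan: {plan}",
    )

    # 3. Record the proceedings for external review.
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "goal": goal,
        "planner": planner_model,
        "reviewer": reviewer_model,
        "plan": plan,
        "verdict": verdict,
        "approved": verdict.strip().upper().startswith("APPROVE"),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```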
Minimum funding target: $24,000 USD. I have gotten by on $24k a year, so this is a doable amount. I have set the minimum at $500, since there are potentially other funders, which might let me reach my funding goals through a combination of different grants.
Comfortable: $140,000 USD. I have seen people who received skilling-up and research grants get roughly $140k as a one-year salary, which seems comparable to an entry-level job at a tech company. I have set the funding goal at $70k, since that would easily cover a six-month salary, and because I am applying to other funders who might cover the other half.
Maximum: $240,000 USD. This is what I was last paid annually when working full-time for a tech company. It would mean no loss of financial value on my part, and would let me go a year (or beyond) without concern for funding.
Breakdown: This would be a personal salary for myself, covering rent, food, medical care, and so on. More money means more of my cognition allocated toward alignment and research.
What are the most likely causes and outcomes if this project fails? (premortem)
I will likely take a more conventional job in tech, potentially in ML, but that means I will have significantly less time to work on alignment.
I am applying to a few other grantmakers, so it is possible that I will be able to string together a chain of small grants in order to work on this full-time.
I created a swarm of agents that produces formalized code and checks whether the code does what it should according to mathematical criteria. This could be useful for formal verification if we can improve interpretability tools.
I have also created a handful of other multi-agent setups, and fine-tuned specialized agents for certain tasks.
In the past, I have taken AGI Safety Fundamentals, participated in SERI MATS Agent Foundations, and taken part in AI Safety Camp - Automating Alignment.
I created and manage the AI Safety Strategy discord and Alignment Research Newsletter. I was Senior Executive for the Center for AI Responsibility and Education, where I developed the curriculum for an introductory course in AI risk and alignment.
I have a background in software engineering, having been on the founding team of several startups, as well as a background in cybersecurity. My cybersecurity work involved auditing blockchain contracts, which demands a very rigorous security mindset: anything that can go wrong with a smart contract will go wrong, and once it is deployed it cannot be altered, so you have to get it right on the first critical try.
Formal verification: https://github.com/jaebooker/formal_verification_swarm
Task breakdown: https://github.com/jaebooker/task_breakdown_swarm
Consensus: https://github.com/jaebooker/consensus_swarm
Finetuned model: https://huggingface.co/jbb/llama_coq
Dataset: https://huggingface.co/datasets/jbb/coq_code
AI Safety Strategy discord: https://discord.com/invite/e8mAzRBA6y
Alignment Research Newsletter: https://alignmentresearch.substack.com/
LinkedIn: https://www.linkedin.com/in/jaeson-booker
Github: https://github.com/jaebooker
Huggingface: https://huggingface.co/jbb
Lesswrong: https://www.lesswrong.com/users/prometheus
Medium: https://jaesonbooker.medium.com/
Further reading on Multi-Agency and aligning AI Collectives.
AGIs as Collectives: https://www.alignmentforum.org/s/boLPsyNwd6teK5key/p/HekjhtWesBWTQW5eF
Reframing Superintelligence: https://www.fhi.ox.ac.uk/wp-content/uploads/Reframing_Superintelligence_FHI-TR-2019-1.1-1.pdf
More Agents Is All You Need: https://arxiv.org/abs/2402.05120
Blending: https://arxiv.org/pdf/2401.02994.pdf
Language Model Agents: https://www.alignmentforum.org/posts/Q7XWGqL4HjjRmhEyG/internal-independent-review-for-language-model-agent
Multi-Agent AGI Safety: https://www.alignmentforum.org/posts/dSAJdi99XmqftqXXq/eight-claims-about-multi-agent-agi-safety
Rules in Multi-Agent Stability: https://arxiv.org/abs/2001.09318
Multi-Principal Assistance Games: https://arxiv.org/abs/2007.09540