40%: Skilling up in mechanistic interpretability. There is a lot to be done in interpretability, with clear criteria for success and tight feedback loops, which makes it an ideal use of at least 40% of my time. Since my own alignment research is more uncertain in terms of success or importance, I think it is useful to also be doing something legible that is immediately useful to other researchers.
40%: Multi-agent alignment. I intend to study interactions among multiple agents and see whether better alignment can be achieved this way. My basic reasoning for this is:
With a multitude of agents forming a collective network of intelligence, alignment becomes simpler. The individual motivations of any given agent are not important; what matters are the rules governing and incentivizing the system. Simple rules, complex systems. We already have some understanding of how to do this: tribes, cities, societies, corporations, the stock market, nations, civilizations. Bitcoin has already demonstrated how to technically embed rules and incentives that work at scale.
Interpretability no longer means examining impossibly complicated tensors of floating-point numbers. Instead, all that needs interpreting are the interactions between agents. Those interactions are the cogs in the wheel of the greater network, and they are where the real intelligence, and its directionality, comes from.
More humans means greater intelligence. Having more humans around, and humans working together on collective projects, has astronomically increased human output, as mentioned earlier. Would a human with a very, very large brain be better than the same amount of compute dispersed among different human agents? Perhaps, but perhaps not. I suspect that what drives large movement outside the current data manifold is multiple agents searching for pathways in different directions, and there are already some indications that this might be true.
10%: Managing current things. I run the AI Safety Strategy discord, write for the Alignment Research Newsletter, and do other smaller activities that I think are useful.
10%: Exploration. I may find a new research direction, or change my mind about something being feasible. For this reason, it is always good to be on the lookout for other avenues.
Follow the Mech Interp guides, such as the following: https://transformerlens-intro.streamlit.app/
Start tackling some of the problems outlined here: https://www.lesswrong.com/s/yivyHaCAmMJ3CqSyj/p/LbrPTJ4fmABEdEnLf
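To make this warm-up concrete, here is a minimal sketch of the kind of exercise those guides walk through, assuming the TransformerLens library (which the linked guide is built around); the model, prompt, and the particular head inspected are arbitrary illustrative choices:

```python
# Minimal mech interp warm-up with TransformerLens: load GPT-2 small,
# run a prompt while caching all intermediate activations, then inspect
# one attention head's pattern and the model's next-token prediction.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")

prompt = "When Mary and John went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)              # [batch, seq_len] token ids
logits, cache = model.run_with_cache(tokens)  # cache holds every activation

# Attention pattern for an arbitrarily chosen head (layer 9, head 6);
# shape is [query_pos, key_pos], and each row sums to 1.
pattern = cache["pattern", 9][0, 6]
print(model.to_str_tokens(prompt))
print(pattern.shape)

# Most likely next token according to the model.
next_token_id = int(logits[0, -1].argmax())
print(model.tokenizer.decode([next_token_id]))
```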
Create various multi-agent collaborations. One example might be a multi-agent Minecraft server, where each 'player' is actually composed of a number of sub-agents, each with its own responsibility (such as planning, ethics, or legality). These meta-agent players would interact with other meta-agents, with their training defaulting to cooperation: protect and defend certain artifacts in the environment, do not harm certain things in the environment even when doing so gives an advantage, and so on.
Another is implementing a toy example of Agent Committees (see the code sketch after this list of setups):
“Language model agent as a metaphorical committee of shoggoths using tools and passing ideas and conclusions in natural language. This committee/agent makes plans and takes actions that pursue goals specified in natural language, including alignment and corrigibility goals. One committee role can be reviewing plans for efficacy and alignment. This role should be filled by a new instance of a shoggoth, making it an independent internal review. Rotating out shoggoths (calling new instances) and using multiple species (LLM types) limits their ability to collude, even if they sometimes simulate villainous Waluigi characters.[1] The committee's proceedings are recorded for external review by human and AI reviewers.”
Another possible setup is mentioned here.
I will also try using the Melting Pot environment suite for implementation: https://github.com/google-deepmind/meltingpot
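As a rough illustration of the committee idea quoted above, here is a minimal sketch of a single plan-and-review step, with the proceedings logged for external review. `query_model` is a hypothetical stand-in for whichever LLM API ends up being used, and the prompts and roles are placeholder assumptions rather than a settled design:

```python
# Minimal sketch of one "agent committee" step: a planner proposes a plan
# and a fresh reviewer instance (ideally from a different model family, to
# limit collusion) audits it for efficacy, alignment, and corrigibility.
# All proceedings are appended to a log for external human/AI review.
import json
from datetime import datetime, timezone

def query_model(model: str, prompt: str) -> str:
    """Placeholder for a real LLM call (API or local model); not implemented here."""
    raise NotImplementedError

def committee_step(goal: str, planner_model: str, reviewer_model: str, log_path: str) -> dict:
    # 1. The planner role proposes a plan in natural language.
    plan = query_model(planner_model, f"Propose a step-by-step plan for: {goal}")

    # 2. An independent reviewer instance checks the plan before it is accepted.
    verdict = query_model(
        reviewer_model,
        "Review the following plan for efficacy, alignment, and corrigibility. "
        f"Reply APPROVE or REJECT, with reasons.\n\nGoal: {goal}\n\nPlan: {plan}",
    )

    # 3. Record the proceedings for external review.
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "goal": goal,
        "planner": planner_model,
        "reviewer": reviewer_model,
        "plan": plan,
        "verdict": verdict,
        "approved": verdict.strip().upper().startswith("APPROVE"),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```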
Minimum funding target: $24,000 USD. I have gotten by on $24k a year, so this is a doable amount. I have set the minimum at $500, since there are potentially other funders, which might let me reach my funding goals through a combination of different grants.
Comfortable: $140,000 USD. I have seen people who received skilling-up and research grants get roughly $140k as a one-year salary, which seems comparable to an entry-level job at a tech company. I have set the funding goal at $70k, since that would easily cover a six-month salary, and because I am applying to other funders who might cover the other half.
Maximum: $240,000 USD. This is what I was last paid annually when working full-time for a tech company. It would mean no loss of financial value on my part, and would let me go a year (or beyond) without concern for funding.
Breakdown: This would be a personal salary for myself, covering rent, food, medical care, and so on. More money means more of my cognition allocated toward alignment and research.
What are the most likely causes and outcomes if this project fails? (premortem)
I will likely take a more conventional job in tech, potentially in ML, but that means I will have significantly less time to work on alignment.
I am applying to a few other grantmakers, so it is possible that I will be able to string together a chain of small grants in order to work on this full-time.
I created a swarm of agents that produces formalized code and checks whether the code does what it should according to mathematical criteria. This could be useful for formal verification if we can improve interpretability tools.
I have also created a handful of other multi-agent setups, and fine-tuned specialized agents for certain tasks.
In the past, I have taken AGI Safety Fundamentals, participated in SERI MATS Agent Foundations, and taken part in AI Safety Camp - Automating Alignment.
I created and manage the AI Safety Strategy discord and Alignment Research Newsletter. I was Senior Executive for the Center for AI Responsibility and Education, where I developed the curriculum for an introductory course in AI risk and alignment.
I have a background in software engineering, having been on the founding team of several startups, as well as a background in cybersecurity. My cybersecurity work involved auditing blockchain contracts, which demands a very rigorous security mindset: anything that can go wrong with a smart contract will go wrong, and once it is deployed it cannot be altered, so you have to get it right on the first critical try.
Formal verification: https://github.com/jaebooker/formal_verification_swarm
Task breakdown: https://github.com/jaebooker/task_breakdown_swarm
Consensus: https://github.com/jaebooker/consensus_swarm
Finetuned model: https://huggingface.co/jbb/llama_coq
Dataset: https://huggingface.co/datasets/jbb/coq_code
AI Safety Strategy discord: https://discord.com/invite/e8mAzRBA6y
Alignment Research Newsletter: https://alignmentresearch.substack.com/
LinkedIn: https://www.linkedin.com/in/jaeson-booker
Github: https://github.com/jaebooker
Huggingface: https://huggingface.co/jbb
Lesswrong: https://www.lesswrong.com/users/prometheus
Medium: https://jaesonbooker.medium.com/
Further reading on Multi-Agency and aligning AI Collectives.
AGIs as Collectives: https://www.alignmentforum.org/s/boLPsyNwd6teK5key/p/HekjhtWesBWTQW5eF
Reframing Superintelligence: https://www.fhi.ox.ac.uk/wp-content/uploads/Reframing_Superintelligence_FHI-TR-2019-1.1-1.pdf
More Agents Is All You Need: https://arxiv.org/abs/2402.05120
Blending: https://arxiv.org/pdf/2401.02994.pdf
Language Model Agents: https://www.alignmentforum.org/posts/Q7XWGqL4HjjRmhEyG/internal-independent-review-for-language-model-agent
Multi-Agent AGI Safety: https://www.alignmentforum.org/posts/dSAJdi99XmqftqXXq/eight-claims-about-multi-agent-agi-safety
Rules in Multi-Agent Stability: https://arxiv.org/abs/2001.09318
Multi-Principal Assistance Games: https://arxiv.org/abs/2007.09540