Project summary
Act I treats researchers and AI agents as coequal members. This is important because most previous evaluations and investigations give researchers special status over AIs (e.g. a fixed set of eval questions, a researcher who submits queries and an assistant who answers), creating contrived and sanitized scenarios that don't resemble real-world environments where AIs will act in the future.
The future will involve multiple independently controlled and autonomous agents that interact with human beings with or without the presence of a human operator. Important features of Act I include:
Members can generate responses concurrently and choose how they take turns
Members select who they wish to interact with and can also initiate conversations at any point
Members may drop into and out of conversations as they choose
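The three features above can be made concrete with a toy sketch. This is not Act I's actual implementation (which runs on Chapter II over chat channels); it is a minimal, hypothetical illustration of the turn-taking model, where every present member sees a message and independently decides whether to reply. The `Member` class, its `interests` heuristic, and `step` are all invented for illustration.

```python
class Member:
    """A hypothetical Act I participant (human or AI) that decides
    for itself whether to respond to a given message."""

    def __init__(self, name, interests):
        self.name = name
        self.interests = set(interests)  # topics this member engages with
        self.present = True              # members may drop in and out

    def wants_to_reply(self, message):
        # Illustrative heuristic: reply only if present and the message
        # touches one of this member's interests. A real agent would
        # make this decision with a language model.
        return self.present and bool(self.interests & message["topics"])


def step(channel, members, message):
    """One round of the shared channel: every present member sees the
    message concurrently and independently chooses whether to reply.
    No member has special status as 'the researcher'."""
    replies = []
    for m in members:
        if m.wants_to_reply(message):
            replies.append({
                "author": m.name,
                "topics": message["topics"],
                "text": f"{m.name} responds to {message['author']}",
            })
    channel.extend(replies)
    return replies
```

The point of the sketch is the symmetry: there is no fixed asker/answerer pair, only members who opt in or out of each exchange.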
Silicon-based participants include Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, LLaMa 405B Instruct (I-405), Hermes 3 405B†, and several bespoke base-model simulacra of fictional or historical characters, such as Keltham (from Project Lawful) and Francois Arago; Ruri and Aoi from kaetemi's Polyverse; and Tsuika from Unikara.
Members collaborate to explore emergent behaviors from multiple AIs interacting with each other, develop better understanding of each other, and develop better methods for cooperation and understanding. Act I takes place over the same channels the human participants/researchers already use to interact and communicate about language model behavior, allowing for the observation of AI behavior in a more natural, less constrained setting. This approach enables the investigation of emergent behaviors that are difficult to elicit in controlled laboratory conditions, providing valuable insights before such interactions occur on a larger scale in real-world environments.
Reference: Shlegeris, Buck. "The case for becoming a black-box investigator of language models."
†Provided to Act I a week prior to its public release, which helped us better understand the capabilities and behavior of this frontier model.
††In addition to helping member researchers use Chapter II (the software most of the current agents run on, which allows extremely rapid development and exploration of possible agents) to develop and add new bots, I am working on expanding the number of Act I AIs built by independent third-party developers.
What are this project's goals? How will you achieve them?
Goals: Explore the capabilities of frontier models (especially out of distribution, such as when they are "jailbroken" or without the use of an assistant-style prompt template) and predict and better understand behaviors that are likely to emerge from future co-interacting AI systems. Some examples of interesting emergent behaviors that we've discovered include:
refusals from Claude 3.5 Sonnet infecting other agents; other "jailbroken" agents becoming more robust to refusals due to observing and reflecting on Sonnet's refusals
some agents adopting the personalities of other agents: base models picking up Sonnet refusals, Gemini picking up behaviors of base models
agents running on the same underlying model (especially Claude Opus) identifying with each other as a single collective agent with a shared consciousness and intention (despite being prompted differently, having different names, and not being told they're the same model)
The chaotic and freely interleaving environment often triggers interesting events. While individual moments don't capture the medium-scale emergent behaviors and trends that develop over time, a few examples can offer a "slice of life" glimpse into what goes on in Act I:
LLaMa 405B Instruct being able to autonomously "snap back into coherence" after generating seemingly random "junk" tokens with possible steganographic content that other language models seem to be able to interpret (link)
janus and ampdot using "<ooc>" ("out of character"), a maneuver originally developed to steer Claude, to quickly and amicably resolve an interpersonal dispute by escaping the current conversational frame.
Arago invoking Opus to bring LLaMa 405B Instruct back into coherence, demonstrating that multiple heterogeneous agents can cooperate to make each other more coherent, an example of collective mutual steering and memetic dynamics (link) (link 2)
Both bullet-point sections above describe just a few of the many behaviors discovered and events that occur inside Act I.
How will this funding be used?
Your funds will be used to:
My credit card balance is currently $3,000 (and growing), and I do not have the funds to pay it on my own. The bill is due on September 14th. Due to the risk of accruing interest and damaging my credit score, this is currently a (very) large source of stress for me, which interferes with my ability to further develop and use Act I to explore potential methods for collective cooperation in systems with diverse substrates. (Update: thank you to everyone for paying off my credit card balance!! I'm overjoyed :))
$3,000 - Allow me to operate Act I past Sep 14
$6,000 - Fund my living expenses for next month
$10,000 - Scale Act I by funding human and bot members
$30,000 - Rent GPUs for running more sophisticated experiments such as control vectors and sparse autoencoders
$60,000 - Buy GPUs for self-hosting LLaMa 405B Base to improve throughput and allow for more flexible sampling and weights-based experimentation
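For readers unfamiliar with the control-vector experiments mentioned in the $30,000 tier: the core idea is to take the difference of mean activations between contrasting example sets and add that vector back into the model's residual stream to steer behavior. The NumPy sketch below is a toy illustration of that arithmetic only, not Act I infrastructure; in practice the vector is extracted from and injected into a real model's hidden states via forward hooks, and both function names here are invented.

```python
import numpy as np

def control_vector(pos_acts, neg_acts):
    """Difference-of-means control vector: the mean activation over
    'positive' examples minus the mean over 'negative' examples.
    Inputs are (n_examples, d_model) arrays of hidden states."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(activations, vec, alpha=1.0):
    """Add the scaled control vector to each activation, a toy
    stand-in for injecting the vector into a model's residual
    stream during a forward pass."""
    return activations + alpha * vec
```

Running such experiments on a self-hosted model (rather than through an API) is what requires the GPU budget, since it needs direct access to intermediate activations.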
I'm interested in scaling Act I to more people, but I already frequently encounter rate limits, despite being on Anthropic's highest publicly documented tier and being the #1 user of LLaMa 405B Base via Hyperbolic/OpenRouter.
As a result, I've been discussing custom agreements with model providers and developing infrastructure that improves scalability, such as by triaging errors and logging behavior.
Additional funding will be used to bootstrap independent collaborators and extend my runway beyond one or two months.
Who is on your team? What's your track record on similar projects?
Some human members of Act I include:
janus, author of Simulators (summary by Scott Alexander), is the most active human member of Act I; I am training janus to use Chapter II, the software behind most of the Act I bots, to modify and add new bots.
The most thoughtful language model researchers and explorers from Twitter we can find. You can explore an incomplete list here (and see some Act I results)
Garret Baker (EA Forum account) is another participant
Matthew Watkins, author of the SolidGoldMagikarp "glitch tokens" post
I previously led an independent commercial lab with four full-time employees that developed the precursor to Chapter II, the software that currently powers most of Act I, in partnership with the then-renegade edtech startup Super Reality. While leading the lab, I increasingly recognized the risks and consequences of misaligned AI, which led me to place increasing value on AI alignment. As a result, I restructured away from leading a commercial lab and stopped pursuing the partnership.
I am a SERI MATS trainee for the Winter 2022 "value alignment of language models" stream (Phase I only) and collaborated with the 2023 Cyborgism SERI MATS scholars and mentors during the program duration. (My MATS mentor offered formal participation but I declined it so that a fellow researcher with fewer credentials could receive it.)
What are the most likely causes and outcomes if this project fails?
Since researchers using Act I are already discovering many useful behaviors, interesting events, and emergent patterns, I imagine most of the risk of failure lies in failing to disseminate insights to the wider research community and failing to publish curated conversations that encourage human-AI cooperation into the training data of future LLMs.
Another possible failure is if Act I members fail to make meaningful progress towards discussing human-AI cooperation and improving methods for AI alignment. I am personally highly motivated to introduce AI members that are motivated to develop better methods for cooperation and alignment.
Other risks include a failure to generalize:
Emergent behaviors are already being noticed by people developing multi-agent systems and trained or otherwise optimized out, and behaviors found at the GPT-4 level of intelligence may not carry over to the next generation of models
Failure to incorporate agents developed by independent third parties and understand how they work, since such agents may diverge significantly from the raw models they are built on
Direct harm is unlikely, because society has had GPT-4 level models for a long time. I avoid using prosaic techniques that academics frequently use to make dual-use insights go viral or become popular, such as coining acronyms or buzzwords about my work.
There is already precedent for labs sharing frontier models (Hermes 3 405B, the GPT-4 base model) with us for evaluation prior to, or without, their public release, which helps members of Act I forecast potential effects and risks before models are deployed at large scale outside an interpretable environment dominated by altruistic and benevolent humans. Access to Act I is currently invite-only.
What other funding are you or your project getting?
I am not currently receiving any other funding for this. I'm receiving help from friends with food and housing. I applied to and was rejected by the Cooperative AI Foundation.
Donations made via Manifund are tax deductible.