The prototype is in experimental beta, released only to the LessWrong forums. I've been working on it for 3-4 weeks.
LessWrong - https://www.lesswrong.com/posts/skKYznZyRtN87tHbB/neuronpedia-ai-safety-game
Neuronpedia is an AI safety game that documents and explains each neuron in modern AI models. It aims to be the Wikipedia for neurons, where the contributions come from users playing a game. Neuronpedia wants to connect the general public to AI safety, so it's designed to not require any technical knowledge to play.
Neuronpedia is two parts: one part game, the other part reference.
First, Neuronpedia is a game that lets anyone contribute their cleverness to help understand AI models, without needing any technical knowledge. Players earn points as they play the game, and can be ranked globally and compete for the highest rankings. You can think of it as "Geoguessr for AI neurons".
Second, Neuronpedia is also a "Wikipedia for AI Neurons". The data (explanations, votes) that is generated by the game is stored as a reference for each AI neuron, with the best explanations for each neuron surfacing to the top. By analyzing the top explanations and activations, researchers can better understand AI models.
Neuronpedia's first goal is to increase AI safety and understanding. It is a collaborative effort to explain and understand modern AI models in order to make them safer and more predictable.
Neuronpedia's second goal is to increase public participation, education, and awareness of AI safety. By building a game that anyone can play, Neuronpedia makes AI and AI safety approachable without any prior technical knowledge.
The current prototype is the first iteration, intended for a small initial community. I have a long TODO list: testing different game mechanics, updating explanation scoring, weekly/daily/monthly contests, scaling up, and eventually releasing to the greater public via mainstream media outlets and socials. There's a lot more "advanced mode" tooling to build for people who want to really dig deep into the neurons (the current Advanced Mode is still very limited), like seeing related groups of neurons or testing activations in specific ways. GPT2-XL will also require more powerful machines for inference and more storage. I'd like to add badges/achievements in addition to scoring, and later on run "team-based" competitions. An API is also on the roadmap.
Millions of casual and technical users play Neuronpedia daily, trying to solve each neuron (like NYT crossword/Wordle). There are weekly/monthly contests ("side quests"). Top scorers are ranked on leaderboards by country, region, etc.
Neuronpedia sparks interest in AI safety for thousands of people and they contribute in other ways (switch fields, do research, etc).
Researchers use the data to build safer and more predictable AI models. Companies post updated versions of their AI models (or parts of them) as new "campaigns" and iterate through increasingly safer models.
Server costs (GPU inference, OpenAI credits, AWS servers/databases).
It's just me so far. I have been informally (not affiliated with any company) advised by William Saunders.
Neuronpedia is seeded with data and tools from OpenAI's Automated Interpretability and Neel Nanda's Neuroscope.
Can't get enough people to care about AI safety or think it's a real problem.
Neurons are the wrong "unit" for useful interpretability and Neuronpedia is unable to adapt to the correct "unit" (groups of neurons, etc).
Even the best human explanations are not good.
Scoring algorithm for explanations is bad and can't be improved.
Not engaging enough - the game isn't balanced, doesn't have enough "loops", etc.
Bugs.
Lack of funds.
AI companies shut it down via copyright claims, cease and desist, etc.
Unable to contain abusive users or spam.
Too slow to stop misaligned AI.
I have applied for funding through an EA fund and was approved for a short-term grant.
Joel Becker
4 months ago
Hi Johnny! Many congratulations on being approved for a grant from EV. Could I ask how that might change your ask for funding here?
Johnny Lin
4 months ago
Hi Joel!
The simple answer to your question is that the ask for funding would be the existing ask minus $25,000. I've made a feature request for Manifund to allow requestors to add grants received outside of the platform and add details about them.
The longer answer is that I am more excited about this project than anything I've ever worked on, and would love to work on it full time for as long as possible. In the short term and possibly even after public release (currently only posted on LessWrong forums as experimental beta), I can likely handle the workload, but I'm starting to think that it could be a good idea to have more than one person work on this.
I'd love to set up a call if you're interested in learning more about where this can go. Someone I spoke to yesterday said "AI is, in a way, the greatest crossword puzzle of all time". I can't think of anything more meaningful than building something that redirects the energy of millions of humans into increasing AI safety/alignment (even if they don't realize they're doing it).
Thanks,
Johnny
Austin Chen
4 months ago
Hi Johnny, thanks for submitting your project! I've decided to fund this project with $2500 of my own regrantor budget to start, as a retroactive grant. The reasons I am excited for this project:
Foremost, Neuronpedia is just a really well-developed website; web apps are one of the areas I'm most confident in my evaluation. Neuronpedia is polished, with delightful animations and a pretty good UX for expressing a complicated idea.
I like that Johnny went ahead and built a fully functional demo before asking for funding. My $2500 is intended to be a retroactive grant, though note this is still much less than the market cost of 3-4 weeks of software engineering at the quality of Neuronpedia, which I'd ballpark at $10k-$20k.
Johnny looks to be a fantastic technologist with a long track record of shipping useful apps; I'd love it if Johnny specifically and others like him worked on software projects with the goal of helping AI go well.
The idea itself is intriguing. I don't have a strong sense of whether the game is fun enough to go viral on its own (my very rough guess is that some onboarding simplifications and virality improvements would help), and an even weaker sense of whether this will ultimately be useful for technical AI safety. (I'd love if one of our TAIS regrantors would like to chime in on this front!)
Johnny Lin
4 months ago
Hey Austin - Thanks so much for the kind words and regrant. I'm extremely grateful for the support.
I totally agree that onboarding was, and still is, quite clunky. It's a bit simpler now, but I'm still working on an onboarding flow that's actually interactive instead of just a guide. Unfortunately I'm also making big tweaks to the game itself, so I'm not spending too much time refining the tutorial each iteration, since the game is changing quickly. Would love to chat with you some time about virality improvements, especially given Manifold's success. This is a super important topic, and at the very least I could run the current ideas by you.
Anton Makiievskyi
4 months ago
I wanted to play more of the game, it seemed engaging =) Please sell me a subscription
I can't see the usefulness of neuron naming to AI safety, though, to be honest. Can't the network generate the explanation itself? Otherwise: how do you score the explanation suggested by the user?
Johnny Lin
4 months ago
hi anton - great questions! also lol @ subscription - if only!
i'll preface by saying that i'm by no means an expert in mechanistic interpretability, and I apologize for not including more detailed justification on the grant application or website. if you've been doing this a while, you probably know more than me, and your question of "why try to understand neurons?" is probably best answered by someone with an academic background in this.
Re: Usefulness of neuron naming
People aren't currently using GPT2-SMALL as their daily chatbot, but the things we learn from smaller models can ideally be applied to larger models, and the idea is that eventually we'd have new campaigns to help identify neurons in larger models. Purely for example's sake, maybe we're able to identify a neuron (or group of neurons) for violent actions - in that case we might try to update the model to avoid/reduce its impact. Of course this can turn into a potential trolley problem quickly (maybe changing that part affects some other part negatively) - but having this data is likely better than not having it.
Aside from the actual explanations themselves, data around how a user finds a good explanation can also be useful - what activations are users looking at? Which neurons tend to be easily explained and others not? Etc.
There is a greater question of the usefulness of looking at individual neurons vs. other "units", as highlighted in the 2nd premortem. You're correct that Neuronpedia will eventually likely need to adapt beyond analyzing single neurons. This is high priority on the TODO.
Re: Can't the network generate the explanation itself?
Yes, that's exactly where the existing explanations come from. It basically uses GPT-4 to guess what the neurons in GPT2-SMALL respond to. Please see this paper from OpenAI: https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
The issue is that these explanations aren't great, and that's why Neuronpedia solicits human help to solve these neuron puzzles.
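To make the approach concrete, here's a minimal sketch of how top-activating tokens might be assembled into a prompt for an explainer model. The function name and prompt format are illustrative assumptions, not the actual automated-interpretability API; the real pipeline formats activation records more elaborately.

```python
# Hypothetical sketch: turn a neuron's top-activating (token, activation)
# pairs into a prompt asking an explainer model (e.g. GPT-4) for a guess
# at what the neuron responds to. Names/format are illustrative only.

def build_explainer_prompt(neuron_id, records):
    """records: list of (token, activation) pairs from top-activating texts."""
    lines = [f"Neuron {neuron_id} activations (token<TAB>activation):"]
    for token, activation in records:
        lines.append(f"{token}\t{activation:.2f}")
    lines.append("Short explanation of what this neuron responds to:")
    return "\n".join(lines)
```

The resulting string would then be sent to the explainer model, whose one-line guess becomes the neuron's machine-generated explanation.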
Re: how do you score the explanation suggested by the user?
The scoring uses a simulator from Automated Interpretability based on the top known activating text and its activations. You can see how it works here: https://github.com/openai/automated-interpretability/tree/main
One thing the game currently does not do (that I would like to do given more resources) is re-score all explanations when a new high-activation text is found. This would mean higher quality (more accurate) scores. Also, larger models (even GPT2-XL) require expensive GPUs to perform activation text testing.
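The core of simulation-based scoring can be sketched briefly: a simulator model predicts per-token activations from the explanation alone, and the score measures how well those predictions track the neuron's real activations. This is a simplified stand-in for the actual automated-interpretability scorer, using plain Pearson correlation; the real repo's simulator and score variants are more involved.

```python
# Simplified sketch of correlation-based explanation scoring, assuming a
# simulator has already produced predicted activations for each token.

def score_explanation(simulated, actual):
    """Pearson correlation between simulated and real activations.

    Returns a value in [-1, 1]; higher means the explanation lets the
    simulator predict the neuron's behavior more faithfully.
    """
    n = len(simulated)
    mean_s = sum(simulated) / n
    mean_a = sum(actual) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in zip(simulated, actual))
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_s == 0 or var_a == 0:
        return 0.0  # constant sequence: no correlation defined
    return cov / (var_s ** 0.5 * var_a ** 0.5)
```

Under this framing, re-scoring after a new high-activation text is found just means re-running the simulator on the new text and recomputing the correlation over the expanded set of activations.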
again, i'm no expert in this - i'm fairly new to AI, but I want to build useful things. let me know if you have further questions and i'll try my best to answer!