We at WhiteBox Research have made good progress since receiving our initial regrant from Renan Araujo, and we’d like to share some updates below! (We will have a strategy session this week, and we’ll share another update within the next two weeks about our next steps and how others can help us.)
What progress have we made?
Here are our key achievements since September 2023 (up to March 19, 2024):
In November, we were approved for $61,460 in funding from the Long-Term Future Fund! Together with our funding from Manifund, this covers us until around August 2024.
We finalized more details of our program and named it the WhiteBox AI Interpretability Fellowship. It’s a five-month training and research program in Manila to master the fundamentals of mechanistic interpretability. We created this primer for the fellowship, and our training phase’s curriculum overview can be found here. [1]
We received 53 applications and accepted 13 participants into our fellowship, surpassing our goals of 50 applicants and 12 participants. [2] [3] Our marketing and application process also helped us start building a wider community of people interested in AI interpretability. [4]
We onboarded Kyle Reynoso as a part-time teaching assistant in February, and he has contributed significantly since then. [5]
We streamlined the process by which participants view, submit, and receive feedback on their exercise answers, using GitHub Classroom and nbgrader.
We’re in the fourth week of our fellowship’s two-month training phase. So far, we’ve received generally positive feedback on the three Saturday sessions and the two-night retreat we held for participants, and we’ve maintained a consistent weekly tempo of adjustments and improvements to various aspects of the program.
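To give a flavor of the nbgrader workflow mentioned above: an instructor notebook pairs a solution cell with a locked test cell, and nbgrader strips the solution before release and reruns the tests on each submission. A minimal sketch (the exercise itself is hypothetical, not taken from our curriculum):

```python
import math

# Hypothetical solution cell. In the version released to participants,
# nbgrader removes everything between the BEGIN/END SOLUTION markers and
# substitutes `raise NotImplementedError()`.
def softmax_denominator(logits):
    """Return the denominator of the softmax over `logits`."""
    ### BEGIN SOLUTION
    return sum(math.exp(x) for x in logits)
    ### END SOLUTION

# A separate read-only test cell contains assertions like these; nbgrader
# awards the cell's points only if every assertion passes on the
# participant's submitted implementation.
assert abs(softmax_denominator([0.0, 0.0]) - 2.0) < 1e-9
assert abs(softmax_denominator([0.0]) - 1.0) < 1e-9
```

Pairing this with GitHub Classroom means each participant pushes their notebook to a private per-student repository, which gives us a single place to run the autograder and leave inline feedback.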
Here are some footnotes expanding on the progress above:
[1] Since we opted for less experienced but high-potential participants (their average age is 21), we have to cover more of the prerequisites than other programs (e.g., ARENA) do, which means we may only delve deeply into interpretability in the research phase of our program in June.
[2] We opted for a three-stage application process. Stage 1 involved solving Bongard problems and answering essay questions about alignment material (namely Ajeya Cotra’s Saint/Schemer/Sycophant post and Lee Sharkey’s Circumventing Interpretability post). Stage 2 tested applicants’ ability to solve coding problems that are tricky even with GPT-4’s help, and Stage 3 consisted of an unstructured interview largely based on the format of Tyler Cowen’s insightful podcast Conversations with Tyler.
[3] Some of the 13 people we accepted include an IOI silver medalist, a 19-year-old who recently got seed funding for his B2B startup, a fluent Lojban speaker who did contract work for an OpenPhil AI safety grantee, and a Master's student who won a gold medal in Taiwan for chemistry research she did in high school.
[4] We spent around a month on marketing and community building to attract people to apply for current and/or future cohorts of our fellowship. We ran a successful virtual “Career Planning in the Age of AI” salon at the start of the year with around 27 attendees, four of whom we later accepted into the fellowship. We also started a community Discord server where people from the ambient community can interact with our participants and discuss all sorts of questions, as a preliminary step towards wider community building in Southeast Asia. (We sent an invite to our server to all applicants, including those we rejected, some of whom already have more background in ML.)
[5] Our TA, Kyle Reynoso, graduated with the highest honors in CS from the top university in the country, was an officer of a local student EA chapter, and works as an ML engineer for a marketing compliance AI startup in Australia.