Following up on our last update, here is what has happened over the past four weeks:
We’ve further clarified our strategy and plans for the next year.
Our plan for the next year is to run two cohorts of our 5-month, part-time AI Interpretability Fellowship in Manila to produce junior mechanistic interpretability researchers. We’ve created a Theory of Change for the fellowship here.
Once we’ve completed these two cohorts, we plan to open the third round of our fellowship to applicants across Southeast Asia. The third round will likely be a full-time, 1-2 month version of the program (e.g., starting June 2025).
Through our fellowships, we aim to kickstart and develop the AI safety community in Manila and Southeast Asia.
We also have these updated goals for our fellowship’s 1st cohort:
Have our fellows solve or make substantial progress on a handful of concrete, still-unsolved MechInterp problems (e.g., those in Neel Nanda’s list) by the end of September 2024
Get at least one fellow counterfactually accepted into a full-time AI safety research fellowship (e.g., MATS’s Winter 2024-25 program) by the end of 2024
Have at least four fellows spending at least 10 hrs/week working on alignment-oriented upskilling and/or interpretability projects by the end of 2024
Unfortunately, our team member Kriz Tahimic has left due to health issues. We are grateful for Kriz’s help in co-founding and launching WhiteBox with us. Given his departure, we’ve expanded Kyle Reynoso’s responsibilities and extended his contract to work with us at 0.5 FTE until August (and beyond August once we secure more funding).
We’re currently raising $92,300 to fund us until March 2025. (Our current funding will only last until July or August.) The $92,300 would cover:
Additional operations costs for cohort 1, such as mentor and fellow stipends ($5,100)
Our 2nd cohort from late September 2024 to March 2025 ($87,200)
If you’re interested in funding or donating to us, you can contact me at brian@whiteboxresearch.org. We can send you our fundraising proposal and information on how to donate.
What are our next steps?
There are three main goals we want to achieve by August:
Conclude our Trials phase (training) with our planned in-house interpretability hackathon and shave off its remaining warts and inefficiencies for the next cohort
Have our fellows complete research excursions on selected problems in Neel Nanda’s list of concrete open problems (COPs), under the guidance of experienced external mentors
Fundraise enough money to fund our 1st and 2nd cohorts, as mentioned above
As shown in our Theory of Change, we will focus on having our fellows work on the COPs so they can upskill in interpretability research rapidly. However, we’re open to other proposals from mentors if there are adjacent problems our fellows can help them with, so long as the fellows: a) can practice MechInterp fundamentals in those projects, and b) can realistically complete the project by the end of the Proving Ground.
We’re also open to such proposals from our more advanced fellows, subject to the same constraints above, provided we and the available mentors deem them viable. This is because promising researchers often have strong opinions on what they wish to work on, and working on a self-chosen problem can make them more motivated to complete the rest of the fellowship.
Note also that this is not a bet on the COPs being vital to alignment, nor do we expect our fellows to produce immediately useful research by the end of the program: after all, they are, and will still be, new to the field. Rather, we hope the problems will serve as excellent forcing functions for our fellows to get better at the fundamentals of MechInterp as quickly as possible.
How can others help us?
We are still looking for 2 to 4 more research mentors with experience in mechanistic (or general) interpretability research for our Proving Ground (research phase) from June to August. Mentors only need to meet virtually with 1-2 fellows for ~45 minutes each week to provide research guidance.
They can choose to oversee more than one person or duo. As mentioned above, we are also open to having our fellows help their mentors with MechInterp-adjacent tasks. For example, mentees could resolve accessible open issues in an existing interpretability project in exchange for mentorship, as long as the tasks are properly scoped to fit within our Proving Ground phase. If you are interested in being a mentor for our fellowship, please contact us at team@whiteboxresearch.org.
If you’re interested in or have experience with mechanistic (or general) interpretability, you can join our Discord server here and engage with people in our community, including our fellowship participants.
As mentioned, if you’re interested in funding us, you can contact me at brian@whiteboxresearch.org!