I would like to continue studying offline-RL agents using mechanistic interpretability in order to understand goals and agency. I believe the derived insights may help predict, detect and/or prevent AI misalignment.
Key Activities:
Research mechanistic interpretability (MI) of trajectory transformer models.
Build/maintain open-source tooling (e.g., the TransformerLens package).
Mentor, support and advise other researchers/engineers.
Possibly: start a 501(c)(3) foundation modeled on the Farama Foundation to accelerate alignment tools/infrastructure.
Key reasons:
Research: My work is likely to produce insights into alignment-relevant questions, including foundational MI, goal representations, and validating new MI techniques.
Tooling/Open Source: Open-source packages that enable better MI will lead to faster innovation and adoption.
Community: I'd continue to help others develop technical skills, prioritize research directions, apply for funding, and contribute to open-source projects.
Concretely, I'd like to broaden my current research scope from Decision Transformers to offline-RL transformers more generally. This will involve training models and then using current or novel MI approaches to look for goal representations and to understand the mechanisms by which next-token predictors "simulate" agents.
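As an illustration of the kind of first-pass analysis I mean, here is a minimal sketch (not my actual pipeline): it uses GPT-2 via TransformerLens as a stand-in for a trained offline-RL transformer, and the prompt and probe target are purely illustrative.

```python
from transformer_lens import HookedTransformer

# GPT-2 as a stand-in; a trained offline-RL/decision transformer would replace it.
model = HookedTransformer.from_pretrained("gpt2")

# A toy "trajectory-like" prompt; real inputs would be tokenized
# return-to-go/state/action sequences.
prompt = "The agent walks to the key, picks it up, and opens the door."
logits, cache = model.run_with_cache(prompt)

# Collect the residual stream at the final token across layers: the kind of
# activations one might later probe for goal information.
final_pos = -1
for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][0, final_pos]  # shape: [d_model]
    print(f"layer {layer:2d}: residual-stream norm = {resid.norm().item():.2f}")
```

In practice the cached activations would feed into linear probes or attribution methods rather than a norm printout, but the caching workflow is the common starting point.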
Conceptually, I’d like to:
Better understand transformers/prosaic AI in general.
Reduce my confusion about things like why GPT-4 isn't more agentic, or to what extent you could say it has goals.
I expect the impact of this work to be that I will publish toy models, tools, analyses and experimental results which improve the state of public knowledge around agency and goals in transformer models.
Salary: $50k
Taxes: $40k
Travel/Conferences: $5k
Computing budget: $10k
Work requirements: $5k
Total: $110k
Another $140k would go towards:
1. Starting a foundation to organize better tools for independent researchers working on alignment.
2. Hiring a research intern.
Key risks:
A) This person could show enough promise that they are headhunted by capabilities labs.
B) Open-source tooling for mech interp could be used for bad purposes?
I am confident that I should be doing AI alignment work given my skill set and so will seek funding from other sources. I have no current applications with other funders. I am interviewing for a role as a Research Engineer at DeepMind.
Miguelito De Guzman
5 months ago
I am one of the ARENA 2.0 online participants, and I can say that in my interactions with Joseph he was very insightful. I believe he is competent enough to deliver in the alignment space.
Anton Makiievskyi
5 months ago
@josephbloom, would you stop this project if you get hired by DeepMind, or are you expecting to continue it as part of the job?
joseph bloom
5 months ago
I don't think it's likely I will be hired by DeepMind, as I interviewed for a role recently and they decided not to proceed. I was also told that, had I joined the team, it's likely I would have been working on language models.
Marcus Abramovitch
5 months ago
Neel Nanda's top choice in the Nonlinear Network. Neel says many people want to hire him.
Joseph is an official maintainer of TransformerLens (the top package for mech interp).
Teaches at the ARENA program.
Two really good posts on Decision Transformer Interpretability and Mech Interp Analysis of GridWorld Agent-Simulator.
Work was listed in Anthropic's May 2023 update.
Working on trajectory transformers is a natural progression from decision transformers.
I wonder if he'd be better off hired by some other alignment team instead, since he's relatively young and might get better mentorship working with others.
This just should be fully funded, at least to $110,000. $25,000 (but ideally $50,000) would put him at ease for 6 months, by which time he expects to have enough output to justify further funding. I'd give more, but I have a limited budget. This is already half of my budget, but I feel quite strongly about this.
Nothing to disclose.
Rachel Weinberg
5 months ago
At first glance when trying to foster skepticism I had the same thought as you: that teams and mentorship make people more productive, so this grant could be a push in the wrong direction. On the other hand, he's been unusually successful so far as an independent researcher. If he's particularly well-suited to working independently, which most people struggle with, that's a kind of comparative advantage it might make sense to lean into since mentorship and spots on established teams are in short supply.
Marcus Abramovitch
5 months ago
I think with his track record so far and endorsements, he's earned the right to go the direction he thinks is best. Maybe it'd be better to have an org that "houses" a bunch of people who just want to work by themselves, where the org formally employs them, helps them raise funds for their projects, and maybe has some communal resources. But I don't think I'd prefer to fund that org over funding someone who is just going to do good direct work.
joseph bloom
5 months ago
A few points on this topic:
Jay Bailey, a former senior software/devops engineer and SERI-MATS scholar, has been funded to work on this agenda and has begun helping me out. I'm also discussing collaborations with other people from more of a maths/conceptual alignment background, which I hope will be useful.
I agree mentorship is useful and plan to make an effort to find a mentor, though I've also been regularly discussing parts of my work with alignment researchers. At least one well-respected alignment researcher told me it's plausible that this kind of work is teaching me more than I'd learn at an org, but I know Neel disagrees.
I'm likely to co-work part time in a London AI safety office if one exists in the future.
I think I'm approaching my research with somewhat of a scout mindset here. It seems plausible that, for some people, independent research is Pareto-optimal for the community across the output of potential mentees and mentors. I am also considering an experiment where I do a small collaboration with an organisation, which may provide evidence in the other direction. If this arrangement proved productive and alleviated a mentorship bottleneck, then finding that out might be valuable and inform future funding strategies.