joseph bloom

@josephbloom

Independent Mechanistic Interpretability Research Engineer

https://www.linkedin.com/in/joseph-bloom1/
$0 total balance
$0 charity balance
$0 cash balance

$0 in pending offers

About Me

I'm an independently funded AI Alignment Research Engineer focusing on mechanistic interpretability in reinforcement learning. I'm one of the two current maintainers of TransformerLens, a popular open-source package for mechanistic interpretability of transformers.

I recently led the Career Development program at the Alignment Research Engineering Accelerator (ARENA). Before working in AI Alignment, I studied computational biology and worked for two years as a data scientist at a proteomics startup.

Projects

Comments


joseph bloom

about 1 month ago

I wouldn't usually comment on other people's projects, but I've been mentioned in the proposal and in @Austin's response. Furthermore, I recently published some research that relates to many of the main themes in Chris's post (world models, steering vectors, superposition).

It's not obvious to me that more posts like these will lead to more good work being done. I don't think we are bottlenecked on ambitious, optimistic people, and in terms of convincing people to be excited about these research outcomes, this post is redundant with others.

I'd be keen to see more results of the kind discussed in the post, but my prior that paying people to promote that work on LW is an optimal use of funds is low.


joseph bloom

5 months ago

I don't think it's likely I will be hired by DeepMind, as I interviewed for a role recently and they decided not to proceed. I was also told that, had I joined the team, I would likely have been working on language models.


joseph bloom

5 months ago

A few points on this topic:

  • Jay Bailey, a former senior software/DevOps engineer and SERI-MATS scholar, has been funded to work on this agenda and has begun helping me out. I'm also discussing collaborations with other people from more of a maths / conceptual alignment background, which I hope will be useful.

  • I agree mentorship is useful and plan to make an effort to find a mentor, although I've also been regularly discussing parts of my work with alignment researchers. At least one well-respected alignment researcher told me it's plausible that this kind of work is teaching me more than I'd learn at an org, but I know Neel disagrees.

  • I'm likely to co-work part time in a London AI safety office if one exists in the future.

I think I'm approaching my research with something of a scout mindset here. It seems plausible that, for some people, independent research is Pareto-optimal for the community's output across potential mentees and mentors. I am also considering an experiment in which I do a small collaboration with an organisation, which may provide evidence in the other direction. If it turned out that this was productive and alleviated a mentorship bottleneck, finding that out might be valuable and inform future funding strategies.