Neel Nanda
7 days ago
@AdamGleave Just noting that I was quite impressed by the paper that came out of this ( https://arxiv.org/abs/2403.19647 ) - good grant, and good work by Sam, Can and co!
Neel Nanda
7 days ago
I think that SAEs are a big deal in interpretability, with lots of valuable interp work that can be unlocked with good SAEs. Developing, understanding and using SAEs is the major focus of both Anthropic's mech interp team and my team (Google DeepMind mech interp). I feel like SAE training is currently very janky and pre-paradigmatic and I would love to see progress here.
Why grant to Glen? I was particularly impressed by the ProLU work. Though it was, unfortunately, highly similar to my team's Gated SAE work, making the actual impact lower, I think ProLU was a good and principled idea that correctly identified a flaw in SAE training, and empirically showed that it was a significant improvement. Further, I think Glen broadly did the right things to show that it was an improvement, and did the legwork of training a bunch of SAEs on a range of models, layers and sites (though he was bottlenecked on compute, I think) and carefully comparing Pareto frontiers - this makes me more optimistic that if Glen finds an important improvement, he'll present enough evidence for me to believe him! I thought the write-up was pretty rough, but it was quite rushed, so that's not a major consideration.
We had a call, and I thought Glen was thinking about things sensibly. In particular, he had a strong emphasis on iterating fast, building the infra to try out many ideas quickly, and doubling down on any idea that meets a moderately high quality bar. I think this is a great way to do this kind of research. Another good sign is that Glen said ProLU felt less interesting to him than some of his other ideas, but had better empirical results, so was higher priority and he doubled down on it - being willing to be pragmatic like this and prioritise results makes this kind of research go much better!
Even with a grant, this kind of research is much easier to do inside a lab, where you have a lot of compute and more engineering expertise. There are people in labs working on this, eg Anthropic has a several-person sub-team on science + scaling of SAEs. But there are many problems to work on, and ultimately not many researchers working on them, and Glen seems to have many interesting ideas, so I'm not too concerned about this. There is a risk of duplicate work, eg ProLU and Gated SAEs, but I don't think that's a strong enough consideration to sink the grant.
I'm generally pretty wary of people doing independent research, especially junior researchers, with concerns specifically around lacking structure, accountability, motivation, feedback/mentorship, and stability. Glen says he hasn't been experiencing any issues with executive function, which is great! I've encouraged him to look for collaborators, and ideally a mentor, which would make me feel much better about the grant. It doesn't sound like independent research is his long-term plan, which makes me feel better about this.
Glen doesn't have much of a research track record, making it hard to be confident in this going well. But he seems promising, and I think it's good to give promising, inexperienced researchers a chance to prove themselves.
I have some concerns that this grant could result in a bunch of half-baked research threads, with no public write-up or clear conclusions. But Glen seems pretty motivated to make that not happen, and I think he also has a strong incentive to produce something legible and cool to eg help with future grant/job applications.
I'm honestly pretty confused about how to think about grant amounts here. $9K/month doesn't seem like a crazy salary for someone living in SF, but I'd happily follow default rates for independent researchers if anyone has compiled them! $2K/month for compute seems enough to make it not a bottleneck without being too big a fraction of the grant. I'm funding this for up to 5 months to balance between wanting Glen to have runway and a chance to prove himself, and wanting to see results before I recommend a larger/longer grant. If other grantmakers are excited about Glen's work I'd be happy to see them donating more, though.
Glen did my MATS training program about 6 months ago. I do a lot of SAE research, and expect to benefit from better knowledge of SAE training, but in the same way that the whole community will!
Neel Nanda
9 days ago
@NeelNanda Note: Tom and I discussed this grant before he applied here, and I encouraged him to apply to Manifund since I thought it was a solid grant to fund.
Neel Nanda
9 days ago
@Austin Yep, I'd be happy to pay salary on this if Tom wants it (not sure what appropriate rates are though). Tom and I discussed it briefly before he applied.
Neel Nanda
9 days ago
I think that determining the best training setup for SAEs seems like a highly valuable thing to do. Lots of new ideas are arising about how to train these things well (eg Gated SAEs, ProLU, Anthropic's April update), with wildly varying amounts of rigour behind them, and often little effort put into replicating them and seeing how they combine. Having a rigorous and careful effort doing this seems of significant value to the mech interp community.
Tom is a strong researcher, though he hasn't worked on SAEs before; I thought the Hydra Effect and Understanding AlphaZero were solid papers. Joseph is also solid and has a lot of experience with SAEs. I expect them to be a good team.
The Google DeepMind mech interp team has been looking into how to combine the Anthropic April update methods and Gated SAEs, and also hopes to open-source SAEs at some point, which creates some concern about duplicated work. As a result, I'm less excited about significant investment into open source SAEs, though having some out (especially soon!) would be nice.
This is an engineering heavy project, and I don't know too much about Tom's engineering skills, though I don't have any reason to think they're bad.
As above, I'm less excited about significant investment into open source SAEs, which is the main reason I haven't funded the full amount. $4K is a fairly small grant, so I haven't thought too hard about exactly how much compute this should reasonably take. If the training methods exploration turns out to take much more compute than expected, I'd be happy to increase it.
Please disclose e.g. any romantic, professional, financial, housemate, or familial relationships you have with the grant recipient(s).
Tom and I somewhat overlapped at DeepMind, but never directly worked together.
Joseph is one of my MATS alumni, and currently doing my MATS extension program. I consider this more of a conflict of interest, but my understanding is that Tom is predominantly driving this project, with Joseph helping out where he can.
I expect my MATS scholars to benefit from good open source SAEs existing and for both my scholars and the GDM team to benefit from better knowledge on training SAEs, but in the same way that the whole mech interp ecosystem benefits.
Neel Nanda
5 months ago
"resulting in three publications accepted at top-tier academic ML venues (NeurIPS, ACL, ICLR),"
To add context in case people get misled by this line, the NeurIPS and ICLR papers (N2G here) were workshop papers, as far as I can tell, not main conference papers. For people not in ML, a conference like NeurIPS or ICLR has both conference papers (one of the highest status ways to publish in ML) and workshop papers (lower prestige and less selective, I'd roughly say a workshop paper is 1/3-1/2 as impressive as a conference paper).
To me, the prior is that most hackathon projects are a total flop and don't go anywhere, so helping someone convert it to a workshop paper is still impressive! (But main conference would have been very impressive). And the ACL paper was a main conference paper, which is impressive!
Neel Nanda
5 months ago
This seems pretty worth funding to me - it's a cheap grant, and I think this would be a cool paper to exist! I don't have a background in neuroscience or cognitive science, and I expect there's some techniques there worth my knowing about that would be useful for my work, but that much of it is irrelevant. I'd love for a paper surveying and summarising the most relevant ideas to exist! I've mentored Wes Gurnee and I trust his judgement/ability to represent the mech interp side, and expect Stephen Casper to also give good takes here. I don't know the rest of the organisers, but Wes vouches for their overall competence. I'd fund this myself if I had a regranting budget.
(I think a Nature publication is very ambitious, and would advise against bothering, but think an arXiv publication is more than sufficient to make this worthwhile)
Neel Nanda
6 months ago
Lawrence is great, very experienced with alignment, and I trust his judgement; this seems like a great thing to fund! I would donate myself if this were tax deductible in the UK (which I don't think it is?)
| For | Date | Type | Amount |
|---|---|---|---|
| Independent research to improve SAEs (4-6 months) | 6 days ago | project donation | 55000 |
| Train great open-source sparse autoencoders | 8 days ago | project donation | 4000 |
| Manifund Bank | about 1 month ago | deposit | +250000 |