UPDATE 2: I did some extra sweeps, got the results together, and published this post. I'm currently moving, so I don't know when I will be able to update the grant application further.
UPDATE: I have been informed there may be significant convergence between my scale consistency work and something that a lab is publishing soon.
I expect to make significant changes to my primary plan that will deviate from what I originally outlined in the proposal.
I'm planning to share my current results on my scale-consistency work and then pursue other projects, rather than doing a maximally thorough and convincing exploration of scale-consistent SAEs.
I will update this proposal as it becomes more clear what this means for the overall project. The secondary goals and broader research direction still stand. The primary goals around publishing my work on scale-consistency are in flux and will change.
TL;DR: I made scale-consistent sparse autoencoders, and they have produced major (factor of ~2) Pareto improvements on the reconstruction quality vs. sparsity (L0) tradeoff. I want to run good experiments and publish this, but I'm no longer able to sustain this work via self-funding, which is threatening to delay completion of the project.
For the past 3 months I have been working to push the state of the art in sparse autoencoders by exploring ways to address their edge cases, failure modes, and technical challenges. I recently began experimenting with making them scale-consistent, and an exploratory implementation of this gave surprisingly successful results. This change adds little complexity, while roughly doubling performance.
Loss curves for one of the best vanilla SAEs I've trained (beige) versus one of the first scale-consistent SAEs I trained (blue):
The runs have comparable L0 (both 45 ± 1), the same dictionary size, and the same batch size.
Sudden drops in L2 (reconstruction error) are due to the learning rate schedule.
And here is the tradeoff curve between L2 and L0 for a small sweep I did on both architectures (we want both to be minimized):
Vanilla SAE vs. 2 variants of the scale-consistent SAE, labeled here as LinearScaleSAEs
My apologies for the bad colors in this graph
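For reference, the two quantities on these plots can be computed roughly as follows (a minimal sketch; variable names are illustrative, and the exact normalization of the L2 error used in the plots may differ):

```python
import torch

def sae_metrics(x, x_hat, f):
    # x     : [batch, d_model] original activations
    # x_hat : [batch, d_model] SAE reconstructions
    # f     : [batch, d_sae]   feature activations (post-ReLU)
    l0 = (f != 0).float().sum(dim=-1).mean()    # average number of active features per input
    l2 = (x - x_hat).pow(2).sum(dim=-1).mean()  # mean squared reconstruction error
    return l0.item(), l2.item()
```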
Sparse autoencoder features (with nonzero biases) can only faithfully reconstruct an activation's magnitude at a single activation level. This is fine if features are binary (only on/off, constant magnitude), but may pose a problem if they have variable magnitudes. If the bias term is used to inhibit a feature's activation, the feature magnitude undergoes an affine transformation and the activation magnitude is preserved at only one point. Anything larger than that and the reconstructed activation level will overshoot; anything smaller and it will undershoot. Here's a diagram of this:
I'm calling the property of not having this issue "scale consistency". Here's what the analogous graph looks like for a scale-consistent SAE:
I still use a bias term for inhibition of feature activation, but it does not translate the activation level when the feature is active.
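To make this concrete, here is a minimal sketch of the two behaviors for a single feature (a standard ReLU SAE encoder is assumed; the gated variant below is only an illustration of the property, not the exact architecture I trained): with a nonzero bias inside the ReLU, doubling the input does not double the feature activation, whereas a bias used purely as an on/off gate keeps the magnitude proportional.

```python
import torch
import torch.nn.functional as F

w = torch.tensor([1.0, 0.0])   # encoder direction for one feature (illustrative)
b = torch.tensor(-3.0)         # inhibitory bias

def vanilla_feature(x):
    # Standard SAE feature: the bias both gates the feature AND shifts its magnitude.
    return F.relu(x @ w + b)

def gated_feature(x):
    # The bias only decides whether the feature is on; when on, the magnitude is the
    # un-shifted projection, so it scales with the input.
    pre = x @ w
    return torch.where(pre + b > 0, pre, torch.zeros_like(pre))

x = torch.tensor([5.0, 0.0])
print(vanilla_feature(x), vanilla_feature(2 * x))  # 2.0 and 7.0  -> not proportional
print(gated_feature(x), gated_feature(2 * x))      # 5.0 and 10.0 -> proportional to the input
```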
An issue with this is that the bias receives no gradient. Without a gradient, gradient descent cannot learn a good value for the bias parameter, so I had to find a reasonable way to create a synthetic gradient for it. Because of this, I expected the approach not to work, or at least to take a lot of iteration. Instead, one of the first synthetic gradients I tried produced significantly improved performance.
I'm exploring a few other ways of synthesizing gradients, but the simple first thing I tried immediately gave large performance gains.
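To illustrate the general idea of a synthetic gradient for a hard gate, here is a generic straight-through-style sketch that uses a sigmoid surrogate for the bias's gradient. This is only an illustration of the kind of trick involved, not necessarily the variant used in the runs above, and the names are illustrative:

```python
import torch

class GateWithSyntheticGrad(torch.autograd.Function):
    # Hard gate: output = pre where (pre + b) > 0, else 0.
    # The forward pass gives b no gradient (the gate is a step function of b),
    # so the backward pass substitutes a smooth sigmoid surrogate to manufacture one.
    @staticmethod
    def forward(ctx, pre, b):
        ctx.save_for_backward(pre, b)
        return pre * (pre + b > 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        pre, b = ctx.saved_tensors
        gate = (pre + b > 0).float()
        sig = torch.sigmoid(pre + b)
        grad_pre = grad_out * gate  # straight-through gradient for the values that were kept
        # Surrogate gate derivative w.r.t. b, summed down to b's shape if b was broadcast:
        grad_b = (grad_out * pre * sig * (1 - sig)).sum_to_size(b.shape)
        return grad_pre, grad_b

# usage: f = GateWithSyntheticGrad.apply(x @ W_enc, b_gate)
```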
TL;DR: yes, it still matters; it's an improvement on those as well.
Sqrt(L1)-penalized SAEs already provide a Pareto improvement (as described here by Logan Riggs) on reconstruction/L0, without changing the fundamental architecture. If that improvement were equivalent to the gains from scale consistency, it would be the preferable technique, since it doesn't change the architecture.
However, in my preliminary testing, scale-consistent SAEs outperform sqrt(L1)-penalized SAEs.
To account for this, I have also compared against these SAEs that Joseph Bloom trained and open-sourced. Comparing against those on GPT-2, matching the layer, the scale-consistent SAEs can achieve at least a 2x improvement in the reconstruction-score-to-L0 ratio.
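For reference, the sqrt(L1) variant only changes the sparsity term of the training loss; a minimal sketch (the coefficient name and the small epsilon for numerical stability are illustrative):

```python
import torch

def sae_loss(x, x_hat, f, lam=1e-3, sqrt_penalty=True):
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    if sqrt_penalty:
        # sqrt(L1): penalize sum(sqrt(|f_i|)) instead of sum(|f_i|); this shrinks small
        # activations relatively harder while shrinking large activations less.
        sparsity = (f.abs() + 1e-8).sqrt().sum(dim=-1).mean()
    else:
        sparsity = f.abs().sum(dim=-1).mean()
    return recon + lam * sparsity
```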
My core goal is to run high quality experiments and analyses to explore the scale-consistent architecture, publish the results, and possibly produce some well-trained models for others to experiment with.
Perform thorough hyperparameter sweeps on all architectures to enable fair comparison between scale-consistent, vanilla, and sqrt(L1) SAEs.
Qualitatively assess and compare their interpretability levels.
Write up and publish findings.
With more funding:
Secondary goals:
Additional work on scale-consistent SAEs:
Train them on many layers of one or more commonly used models and open source them so others can use them.
Investigate other things and try to answer some currently unanswered questions about why this works and what is going on with it.
Continue my other work on this research direction.
Further exploration of other architectures that have given promising results.
Make a library of toy models of edge cases to explore the behavior of SAEs on those distributions and to test the ability of modified architectures to handle these edge cases.
Return to a resampling method that seemed to be performing well and share it if it holds up.
Compute: 5% - 20%
I have ~$500-$1500 in hyperparameter sweeps I'd like to do to thoroughly compare scale-consistent SAEs.
If I do follow-up work or continue on this research direction I expect to continue spending a good bit on compute.
Taxes: 30%
This depends on my income for the rest of the year, so it's a bit uncertain, but as a rough estimate I will set aside 30% for taxes.
Salary: 50% - 65%
I've self-funded upskilling and independent research for 6 months, but at this point I'm pretty broke and need to start earning soon to cover rent and living expenses.
At the following rates (post-tax):
1st month: $3.5k/month
2nd month: $5k/month
After that: $7k/month
(I'm starting lower and increasing over time because I really want to get this latest work published; after that I'm more torn between continuing research and putting my CS and ML skills towards building career capital and savings, and my expected opportunity cost increases as time goes on.)
This funding helps me pay for thorough sweeps, compensates me for some of the work I've done, and buys some slack. I'll still need to worry about money in the short term to some degree, but this will be a significant help, and I expect it will get me publishing sooner.
I can spend a month focused just on getting this work completed and published, without worrying about securing future employment.
I'll continue this research for longer than 1 month, until I run out of funding.
If for some reason I'm unable to carry out the full work term, I will do one of:
Return the money
Donate the money to GiveWell endorsed charities
Use the money to hire someone to finish the work
The team is just me. I don't have much of a public track record, and this would be my first publication. I have done a lot of work building and training sparse autoencoders and variant architectures in my independent research.
The results of the work I've done so far are an important part of why I think this is +EV to fund: they're evidence that the directions I've been exploring are fruitful and that I'm capable of the type of work required.
My GitHub page has most of my work.
The code for the scale-consistent SAEs is not currently public, but if this makes a difference in funding decisions let me know.
I majored in Computer Science at UC Santa Barbara's College of Engineering
I have been upskilling in ML for about half a year. During that time I have participated in:
Neel Nanda's MATS training
ARENA
Some things I've done:
Contributed this PR to HuggingFace, which reduces memory use in Llama and Gemma models for batch sizes greater than one
Taught myself kernel programming in Triton
Replicated some gradient based prompt optimization papers with a collaborator
Done a lot of work with sparse autoencoders, experimental architectures, and a promising new resampling method
I built a little library for making SAEs out of modular, swappable components to experiment with different architectures. It's been really useful for what I need from it and I'm pretty proud of it, but partway through I realized I needed to cut the effort short and reduce development time, so it's also way rougher around the edges than I set out for.
The model performs well at producing low sparsity and good reconstructions, but its features aren't very interpretable.
Even if the features aren't very interpretable, the good performance on other metrics may carry some useful info about the internal representations in transformers. However, this would be a large downgrade in impact.
The technique is good, but doesn't get adopted, so has little impact.
I'm hoping to decrease the likelihood of this:
I have asked some practitioners what they would find convincing in my experiments/analysis. I'm going to ask around more and try to incorporate as many of these suggestions as is feasible.
I'm trying to make the hyperparameter sweep and analysis thorough and watertight so that it's cheaper to adopt the method. If it does work well, I don't want researchers to feel like they'll need to do a ton of independent validation before they can trust the method enough to build on it themselves. Or if they do, I want them to be reasonably confident they will find something worthwhile at the end of the work they do to validate it.
What might cause this to happen?
Insufficient reach of publication -- just a failure to get in front of eyeballs
The benefits appear too small to be worth the costs of switching
This is just the first iteration and needs to be built on further
I'm not sure this fits the category of failure, but it seems likely this can be improved upon. My guess is that I should probably still publish it since it is (aside from the interpretability question) an improvement. However, it also seems likely there is a better technique to be found, and maybe I should focus my efforts more on trying to improve it further before evaluating it and publishing it.
With more funding I'd like to look for better ways of achieving scale-consistency after releasing these initial results.
TL;DR: I strongly expected this initially, but I have since done a lot to rule it out, including a full rewrite.
At first, "there is a bug" was a high-probability explanation for these good results. Since then, I have put work into ensuring that a bug is not the cause of the good performance, and I now think this is less than 2% likely:
I rewrote my implementation of the model; training it still gives the same improved performance.
I loaded weights from the original implementation into the rewritten model, ran evaluations on it, and got the same quality of results.
If the features are uninterpretable or something else goes wrong, I think I should still publish my findings so that others can learn from it or save them time going down this path. Probably if this happens it is worth spending less time on, but still worth sharing some of the results. Then I would switch to working on one of the other projects I find promising.
I have been self-funding.
If I apply for and receive a compute grant in the future, I'll move the compute funding from this to salary funding and extend the time frame accordingly.
If I apply for and receive a grant that has salary funding, I'll work for the combined sum of time frames for that grant and this one.