Note: I’ve tried to make this update relatively accessible, but I’m very happy to give more technical details or clarification in the comments or to have a short call with anyone interested in chatting.
What progress have you made since your last update?
TLDR: Over the last four months, I studied trajectory models using mechanistic interpretability techniques, then shifted focus to Sparse Autoencoders (a significant advancement in the field). In this period, I published a post demonstrating goal representation manipulation in one such model and co-authored another, applying learned principles to language models, notably altering their spelling behavior. This work, though insightful, progressed more slowly than anticipated (which I attribute to several reasons) and may be redundant in some ways due to other recent progress. After consultation with other researchers, as well as Marcus/Dylan, I've redirected my efforts towards Sparse Autoencoders / language models (which I discuss in detail in Next Steps).
The main goal of the grant is to “help predict, detect and/or prevent AI misalignment” via developing a mechanistic understanding of offline-RL models (a model organism of sorts for language models like GPT3). I think of mechanistic interpretability as a natural science of neural network internals and of this research as attempting to contribute to our understanding of the natural phenomena that underpin alignment-relevant properties (e.g., goal representations).
Therefore, I measure progress in the grant via improvements in methods, theories and techniques that enable us to understand neural network internals. We can decompose this into two components:
Algorithm Identification: the process of identifying the algorithms and intermediate structures that mediate the mapping of neural network inputs to neural network outputs. This is the well-known circuit-finding agenda (e.g., discussed here).
Ontology Identification: learning how a neural network thinks about the world (i.e., mapping internal variables in the model's computation to variables in its environment).
In the first three months of this grant, Jay Bailey and I progressed towards this goal in the gridworld context. In October, we published “Features and Adversaries in MemoryDT”, where we identified and manipulated internal representations of the gridworld in a trajectory model. Before this work, we had several negative results from applying circuit-finding techniques, which were complicated for some interesting reasons (like my intuitions about superposition/capacity derived from prior work in the field being somewhat flawed) and some less exciting ones (mapping circuits is challenging due to distributed processing). The picture painted by our results and work on sparse autoencoders clarifies why we had extensive superposition despite lots of capacity.
I then worked on a collaboration with Matthew Watkins, extending some of my insights to language models. We published “Linear encoding of character-level information in GPT-J token embeddings”. Spelling is interesting because it constitutes a task where humans have direct insight into the underlying structure in reality (in this case, in the characters that make up words). However, this is hidden from language models due to details about how we input text. There are several academic publications about the surprising phenomenon that LLMs know which characters are in words. We could identify and edit linear representations of character information in tokens. This work had some surprising results. In particular, when we delete letters in their token representations, the model predicts subsequent letters from the same word (proportional to their distance to the front of the word). This demonstrates that even if you identify concepts in a model, knowing what will happen when you manipulate them may be another significant challenge.
Neither of these posts was particularly popular on LessWrong, which is reasonable given that there has been considerable progress and many publications in the field in the last four months. Feedback from some researchers was positive, but suggested the slow progression was evidence that a pivot might be needed. I don’t feel that I can claim we moved the needle on alignment or mechanistic interpretability much with this work, and this is somewhat due to the project betting heavily on mechanistic interpretability being harder than it has turned out to be in language models. Nevertheless, the results in these posts tie in nicely to various phenomena that the research community has begun to understand better, and I feel I developed significantly via working on this project. Lastly, I should mention that I am doing research building directly on my codebase and the posts, and it seems plausible that, for unanticipated reasons, both the code and the insights may become valuable in the future.
What are your next steps?
TLDR: I’ve pivoted to training Sparse Autoencoders (SAEs) on small language models to assess how they solve the ontology identification problem, which is a prerequisite for reasoning well about goals/agency within neural networks. I’ve built my own SAE training library and followed up on preliminary experimental results under the supervision of Neel Nanda as part of the MATS program.
Sparse AutoEncoders are a fascinating new technique that greatly advances our understanding of model internals. This technique enumerates many concepts and identifies which are inferred at run time by a model at a specific position in the network internals. The striking result is that the concepts, called “features,” are often highly human-interpretable (e.g., the concept of words that start with the letter “M” or phrases/words associated with Northern England / Scotland). As a computational biologist, I think saying SAEs are to neural networks as DNA sequencing is to cell biology is pretty accurate.
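To make the idea concrete, here is a minimal sketch of one common SAE formulation: a wide ReLU encoder, a linear decoder, and a loss that trades reconstruction error against an L1 sparsity penalty. This is an illustration only and not the training library mentioned in this update; the class name `TinySAE`, the initialisation, and the `l1_coeff` value are all assumptions, and plain Python lists are used for readability where a real implementation would use a tensor framework.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinySAE:
    """Hypothetical, untrained sparse autoencoder over a single activation vector."""

    def __init__(self, d_model, n_features, seed=0):
        rng = random.Random(seed)
        # n_features is usually much larger than d_model (an overcomplete dictionary).
        self.W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        self.W_dec = [[rng.gauss(0, 0.1) for _ in range(n_features)] for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        # Feature activations: ReLU(W_enc (x - b_dec) + b_enc), so they are never negative.
        centered = [xi - b for xi, b in zip(x, self.b_dec)]
        pre = [p + b for p, b in zip(matvec(self.W_enc, centered), self.b_enc)]
        return [max(0.0, p) for p in pre]

    def decode(self, features):
        # Reconstruction: a weighted sum of decoder directions plus a bias.
        return [r + b for r, b in zip(matvec(self.W_dec, features), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        features = self.encode(x)
        x_hat = self.decode(features)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        l1 = sum(abs(f) for f in features)  # pushes most features to zero
        return mse + l1_coeff * l1


sae = TinySAE(d_model=4, n_features=16)
x = [0.5, -1.0, 0.25, 2.0]  # stand-in for a model activation at one position
f = sae.encode(x)
x_hat = sae.decode(f)
active = [i for i, value in enumerate(f) if value > 0]  # features "inferred" for this input
```

After training, the indices in `active` are the concepts the SAE reports as present at that position, which is what gets inspected for interpretability.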
For this reason, I reached out to Dylan / Marcus (the two significant funders) to check whether it would be ok if I pivoted to working on this new technique in the language model context (as well as checking with Neel Nanda, who supported the shift). They gave me the go-ahead, so that’s been my direction for the last two months.
To support this research, I built on a few open-source libraries to make my own SAE training library, which I’ve used to train sparse autoencoders on various models, focusing especially on GPT2 small. Under Neel’s supervision at MATS, I’ve been exploring a few directions that I think try to address the critical alignment-relevant questions about SAEs:
Are Sparse AutoEncoders capturing all of the information that we want them to? One way to think about this is that if you sequence DNA, print it, and then stitch it back into an organism, then the organism shouldn’t die. Molecular biologists do small versions of this all the time. The way we train sparse autoencoders very much suggests that we should get a similar property (we can replace the internals with our reconstruction), but in practice, that reconstruction does hurt the model performance. I’ve got some preliminary results showing we can better represent more information with more concepts concurrently without having those concepts become uninterpretable. Still, there’s more work to measure all the variables we care about here, such as how errors propagate through the model.
Do Sparse AutoEncoders systematically misrepresent any information? To make sure sparse autoencoders come up with features that are interpretable to us, we enumerate many concepts and try to make sure that we don’t have too many features appear at the same time. However, it’s unclear that a biased process will find the “true” underlying concepts. Since AI alignment will likely require us to be very good at estimating the true “ontology” of the model, I’m very interested in trying to find ways of measuring the distance between the “true” ontology and what we are finding along axes that aren’t just how well we recover the model performance. I’ve explored some experiments that may get at this via studying QK circuits, which we may follow up on.
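One way "how well we recover the model performance" is often quantified is as a fraction of loss recovered: compare the language model's loss when run normally, when the SAE reconstruction is spliced in, and when the activation is ablated entirely. The sketch below assumes that convention; the function name and the example loss values are hypothetical and are not results from this project.

```python
def fraction_of_loss_recovered(clean_loss, spliced_loss, ablated_loss):
    """Return 1.0 if splicing in the SAE reconstruction costs nothing,
    and 0.0 if it is as damaging as ablating the activation.

    clean_loss:   model loss with its original activations
    spliced_loss: model loss with activations replaced by the SAE reconstruction
    ablated_loss: model loss with the activation ablated (e.g. zeroed out)
    """
    gap = ablated_loss - clean_loss
    if gap <= 0:
        raise ValueError("ablation should make the loss worse than the clean run")
    return (ablated_loss - spliced_loss) / gap


# Hypothetical numbers purely for illustration.
recovered = fraction_of_loss_recovered(clean_loss=3.0, spliced_loss=3.2, ablated_loss=5.0)
print(f"fraction of loss recovered: {recovered:.2f}")
```

A single number like this says nothing about whether the features match the model's "true" ontology, which is why the second question above looks for measures beyond recovered performance.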
Regarding practical details, I’ll likely settle on a specific direction shortly and pursue that as part of MATS. I’ll write this up in a research plan as part of MATS and share it here.
Neel expects his mentees to publish their work in academic articles, so I will likely be close to doing that by the time the Manifund grant period ends. Since I’ve received a LightSpeed grant with another six months of funding, I anticipate being able to continue this research for most of this year, by which time I expect to have results that justify further funding.
Is there anything others could help you with?
Whilst I think I’m mostly okay for funding/everything else (not accepting MATS funding or flight reimbursement), it is undoubtedly the case that Sparse Autoencoders are incredibly compute-hungry. So access to a cloud computing cluster or knowing that if I need to run some big experiments, there are enough funds to do so would be good.
As an estimate, it can cost $3 / hour and take 12 hours to train one SAE on gpt2 small, and we might want to train 12 of these, which would add up to about $430 ($3 × 12 × 12 = $432). This is a lower bound, as varying hyperparameters, working on larger models, and analysing features post-hoc will all increase the compute expenditure. It seems plausible that the previous 10k budget per year will turn out to be an underestimate by a factor of 2-5.
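The estimate above is simple enough to write down as a small calculation, which makes it easy to see how the total moves as the inputs change. The function and variable names are illustrative; the inputs are the figures quoted in this update.

```python
def sae_training_cost(dollars_per_hour, hours_per_sae, num_saes):
    """Total compute cost in dollars for training a batch of SAEs."""
    return dollars_per_hour * hours_per_sae * num_saes


# Figures from the estimate above: $3/hour, 12 hours per SAE, 12 SAEs.
baseline_cost = sae_training_cost(dollars_per_hour=3, hours_per_sae=12, num_saes=12)

# If the 10k yearly budget is an underestimate by 2-5x, the implied range is:
previous_budget = 10_000
revised_low, revised_high = 2 * previous_budget, 5 * previous_budget

print(f"baseline for 12 SAEs on gpt2 small: ${baseline_cost}")
print(f"implied yearly compute budget: ${revised_low:,} to ${revised_high:,}")
```

Hyperparameter sweeps multiply `num_saes`, and larger models raise both `dollars_per_hour` and `hours_per_sae`, which is why the baseline is a lower bound.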
Since I don’t want the stress of being handed a lot of money to spend on computing, I mildly prefer access to compute clusters (or a line of credit to be used only for computing or something). This isn’t essential/urgent yet as I still have some uncertainty over whether the research is significantly accelerated by training many SAEs or whether it will be essential to work with larger models.