We are a team of four scholars at MATS, led by Alex Turner. We invented a method that grants highly configurable control over how neural networks learn to represent and process information, whether at the level of parameters, activations, or modules. We have promising early results on several toy models, including MNIST autoencoders, small language models (TinyStories-8M, Qwen 0.5B), and a toy gridworld RL agent. We are currently scaling to a real-world unlearning problem.
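To give a concrete flavor of the kind of control we mean, here is a minimal, illustrative PyTorch sketch (a simplified toy, not necessarily our exact method): per-example gradient masking routes what the encoder learns about a flagged concept into a reserved slice of a hidden layer. All names and architectural choices below are hypothetical.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: localize learning about a flagged concept to a
# reserved slice of the hidden layer by masking gradients with detach().
class RoutedMLP(nn.Module):
    def __init__(self, d_in=784, d_hidden=64, d_out=10, n_reserved=8):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_out)
        self.n_reserved = n_reserved

    def forward(self, x, is_target):
        # is_target: (batch,) flags for examples containing the concept to localize
        h = torch.relu(self.enc(x))
        k = self.n_reserved
        reserved, rest = h[:, :k], h[:, k:]
        m = is_target.view(-1, 1).float()
        # Forward values are unchanged; only the backward path differs per example:
        # flagged examples update the encoder only through the reserved slice,
        # while all other examples update it only through the remaining dimensions.
        reserved = m * reserved + (1 - m) * reserved.detach()
        rest = (1 - m) * rest + m * rest.detach()
        return self.dec(torch.cat([reserved, rest], dim=1))
```

In a sketch like this, the forward pass is untouched; the only intervention is on where gradient updates may land during training.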
Our goal is to develop the method and demonstrate its efficacy, as well as its relevance to AI risk, to the broader research community. To do this, we are pursuing three related research directions:
Feature localization and interpretability in LLMs: train LLMs so as to localize concepts and computation, opening novel opportunities for interpretability and control (e.g. steering scalars; compare with https://arxiv.org/abs/2312.06681; see the illustrative sketch after this list). This should produce tangible, interpretable subcomponents of networks or features within networks.
Deep unlearning / factored capabilities for LLMs: e.g. apply our methods to LLMs on the WMDP unlearning benchmark (https://arxiv.org/abs/2403.03218) and achieve state-of-the-art resistance to fine-tuning attacks, demonstrating “deep” unlearning without prohibitive costs to retain-set performance.
Scalable oversight via localized motivations: reduce goal misgeneralization under sparse behavioral feedback by localizing an agent’s “motivations” to specific, known components that generalize to cases without oversight. We intend to demonstrate this in simple RL environments simulating “partial oversight” (where a policy updates on rewards we cannot observe; see, e.g., https://ai-alignment.com/semi-supervised-reinforcement-learning-cf7d5375197f), then move to more complex environments (e.g. as in https://arxiv.org/pdf/2105.14111).
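To connect this to the “steering scalars” idea above in a toy way: if a concept has been localized to a known slice during training (as in the illustrative RoutedMLP sketch above), scaling that slice at inference acts as a coarse steering knob. Again, this is a hypothetical illustration, not a description of our LLM experiments.

```python
# Hypothetical usage, continuing the RoutedMLP sketch above: alpha = 0 ablates
# the localized concept, alpha > 1 amplifies it.
@torch.no_grad()
def steer(model: RoutedMLP, x: torch.Tensor, alpha: float = 0.0) -> torch.Tensor:
    h = torch.relu(model.enc(x)).clone()
    h[:, :model.n_reserved] *= alpha
    return model.dec(h)
```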
We aim to publish our initial findings at ICLR 2025 (submission deadline: Oct 1, 2024).
To rent reliable, always-on remote-development machines with 80GB A100 GPUs until the ICLR submission deadline of Oct 1, so that we can experiment as rapidly as possible.
With the minimum funding, we will be able to rent a reliable 8x A100 GPU machine until Sept 8. With the target funding, we will be able to continue renting our current 4x A6000 machine from Sept 8 to Sept 29, plus an 8x A100 GPU machine from now until Sept 29.
Note: we are exploring discounted compute options that wouldn’t compromise our rate of development and experimentation (potentially offered by Hofvarpnir Studios or FluidStack).
Jacob Goldman-Wetzler: Published research on mixed-precision neural nets with the Lu Group at Yale (https://www.sciencedirect.com/science/article/abs/pii/S0045782524003499) at age 17/18; many open-source contributions to the Zig programming language compiler (https://github.com/ziglang/zig/pulls?q=is%3Apr+author%3Ag-w1+is%3Amerged); was one of the youngest Recurse Center (a “writer’s retreat” for programmers in NYC) participants at 13.
Evžen Wybitul: Graduated top of his class in Bioinformatics at Charles University, Prague; now enrolled in a Data Science Master’s programme at ETH Zurich. Worked professionally as a data scientist, then collaborated with DeepMind’s David Lindner on benchmarking safety-relevant capabilities of VLMs (work submitted to NeurIPS).
Joseph Miller: Paper published at COLM 2024 on circuit finding in mechanistic interpretability (https://x.com/JosephMiller_/status/1811629344453984362). Worked as Research Engineer at FAR AI on their first adversarial Go paper (https://goattack.far.ai/). As a solo developer, created Hypnogram in 2021, a text-to-image AI website that generated over 5.4 million images for 1.2 million users (https://news.ycombinator.com/item?id=29143237). Worked as a software engineer in big data for 1+ years. BS in Math and Computer Science.
Alex Cloud: PhD in Statistics; 4 years of ML research in competitive industry jobs (Riot Games AI, Amazon): led research projects on novel applications to real-world games (e.g. https://www.ijcai.org/proceedings/2023/0009.pdf, https://ieeexplore.ieee.org/document/9619045); started an undergrad data science mentorship program (https://www.doranslab.gg/); participated in MLAB 2.0.
Alex Turner: MATS mentor and Research Scientist at DeepMind with a strong track record of successful mentorship and influential contributions to the AI safety literature.
The project has already succeeded in producing interesting and novel results. However, it may fail to improve outcomes from transformative AI if (i) the kind of structure we seek to impose on models inhibits their ability to fit data, reducing capabilities beyond the point of usefulness, or if (ii) low-alignment-tax localization works in principle for real-world problems and models, but requires semantically disentangled data labels that are impractical to procure or estimate.
We have a small compute budget from MATS (just above $5350 for our team of four) that we are currently on track to spend entirely by Sept 8, even without increasing our rate of experimentation.
We have applied to the Long Term Future Fund and FAR AI for 6-month salaries and a compute budget to cover our work during the MATS extension phase, but do not expect to hear back from them for another month. Even if we are funded, it is likely that our requested compute budget would not be sufficient to cover both accelerated experimentation now and 6 months of regular research activity.
Austin Chen
3 months ago
Approving this as part of our portfolio on technical AI safety research. Thanks to Neel for funding this and Alex for the writeup; I especially appreciated your emphasis on sharing credit in this team effort!
Neel Nanda
3 months ago
I discussed this with Alex Cloud. I'm somewhat pessimistic about whether the technique will both work and not have a crippling alignment tax, but he made a pretty compelling case that it MIGHT, and could be a big deal if it worked, and it's a fairly elegant idea that seems like it has potential for some cool things even if the exact proposal doesn't work.
Either way, this was a fairly cheap grant, a small fraction of the cost of the labor going into the project, and it seems valuable to gather more data on whether the technique works. I expect that having more compute will improve the quantity and quality of the evidence, especially if they can go beyond TinyStories to more realistic settings. There were several experiments Alex and I agreed would be good ideas, and I would be keen to see them happen.
Neel Nanda
3 months ago
@NeelNanda I was also impressed that Alex was able to defend the case for the project in quite a lot of detail, had already thought of several experiments I suggested, and generally seemed to care a lot about baselines and rigour.
I'm also generally pro supporting the work of promising junior researchers regardless of the project to help them build skills and credibility.
Alex Cloud
3 months ago
Thanks so much, Neel! This is huge for the team. Your feedback helped me home in on key questions that have already changed my sense of research prioritization.
As an aside: I want to make sure it's common knowledge that Joseph Miller, Evžen Wybitul, Jacob Goldman-Wetzler, and Alex Turner share equal credit here for all the hard work they've been putting in behind the scenes.