We are a team of four scholars at MATS, led by Alex Turner. We invented a method that grants highly configurable control over how neural networks learn to represent and process information, whether at the level of parameters, activations, or modules. We have promising early results on several toy models, including MNIST autoencoders, small language models (TinyStories-8M, Qwen 0.5B), and a toy gridworld RL agent. We are currently scaling to a real-world unlearning problem.
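To give a concrete flavor of the kind of control we mean, here is a minimal, illustrative PyTorch sketch (a simplified toy, not necessarily our exact method): per-example gradient masking routes what the encoder learns about a flagged concept into a reserved slice of a hidden layer. All names and architectural choices below are hypothetical.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: localize learning about a flagged concept to a
# reserved slice of the hidden layer by masking gradients with detach().
class RoutedMLP(nn.Module):
    def __init__(self, d_in=784, d_hidden=64, d_out=10, n_reserved=8):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_out)
        self.n_reserved = n_reserved

    def forward(self, x, is_target):
        # is_target: (batch,) flags for examples containing the concept to localize
        h = torch.relu(self.enc(x))
        k = self.n_reserved
        reserved, rest = h[:, :k], h[:, k:]
        m = is_target.view(-1, 1).float()
        # Forward values are unchanged; only the backward path differs per example:
        # flagged examples update the encoder only through the reserved slice,
        # while all other examples update it only through the remaining dimensions.
        reserved = m * reserved + (1 - m) * reserved.detach()
        rest = (1 - m) * rest + m * rest.detach()
        return self.dec(torch.cat([reserved, rest], dim=1))
```

In a sketch like this, the forward pass is untouched; the only intervention is on where gradient updates may land during training.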
Our goal is to develop the method and demonstrate its efficacy, as well as its relevance to AI risk, to the broader research community. To do this, we are pursuing three related research directions:
Feature localization and interpretability in LLMs: train LLMs so as to localize concepts and computation, opening novel opportunities for interpretability and control (e.g. steering scalars; compare with https://arxiv.org/abs/2312.06681; see the illustrative sketch after this list). This should produce tangible, interpretable subcomponents of networks or features within networks.
Deep unlearning / factored capabilities for LLMs: e.g. apply our methods to LLMs on the WMDP unlearning benchmark (https://arxiv.org/abs/2403.03218) and achieve state-of-the-art resistance to fine-tuning attacks, demonstrating “deep” unlearning without prohibitive costs to retain-set performance.
Scalable oversight via localized motivations: reduce goal misgeneralization under sparse behavioral feedback by localizing an agent’s “motivations” to specific, known components that generalize to cases without oversight. We intend to demonstrate this in simple RL environments simulating “partial oversight” (where a policy updates on rewards we cannot observe; see, e.g., https://ai-alignment.com/semi-supervised-reinforcement-learning-cf7d5375197f), then move to more complex environments (e.g. as in https://arxiv.org/pdf/2105.14111).
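To connect this to the “steering scalars” idea above in a toy way: if a concept has been localized to a known slice during training (as in the illustrative RoutedMLP sketch above), scaling that slice at inference acts as a coarse steering knob. Again, this is a hypothetical illustration, not a description of our LLM experiments.

```python
# Hypothetical usage, continuing the RoutedMLP sketch above: alpha = 0 ablates
# the localized concept, alpha > 1 amplifies it.
@torch.no_grad()
def steer(model: RoutedMLP, x: torch.Tensor, alpha: float = 0.0) -> torch.Tensor:
    h = torch.relu(model.enc(x)).clone()
    h[:, :model.n_reserved] *= alpha
    return model.dec(h)
```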
We aim to publish our initial findings at ICLR 2025 (submission deadline: Oct 1, 2024).
To rent reliable, always-on remote-development machines with 80GB A100 GPUs until the ICLR submission deadline of Oct 1, so that we can experiment as rapidly as possible.
With the minimum funding, we will be able to rent a reliable 8x A100 GPU machine until Sept 8. With the target funding, we will be able to continue renting our current 4x A6000 machine from Sept 8 to Sept 29, plus an 8x A100 GPU machine from now until Sept 29.
Note: we are exploring discounted compute options that wouldn’t compromise our rate of development and experimentation (potentially offered by Hofvarpnir Studios or FluidStack).
Jacob Goldman-Wetzler: Published research on mixed-precision neural nets with the Lu Group at Yale (https://www.sciencedirect.com/science/article/abs/pii/S0045782524003499) at age 17/18; many open-source contributions to the Zig programming language compiler (https://github.com/ziglang/zig/pulls?q=is%3Apr+author%3Ag-w1+is%3Amerged); was one of the youngest Recurse Center (a “writer’s retreat” for programmers in NYC) participants at 13.
Evžen Wybitul: Graduated top of his class in Bioinformatics at Charles University, Prague; now enrolled in a Data Science Master’s programme at ETH Zurich. Worked professionally as a data scientist, then collaborated with DeepMind’s David Lindner on benchmarking safety-relevant capabilities of VLMs (work submitted to NeurIPS).
Joseph Miller: Paper published at COLM 2024 on circuit finding in mechanistic interpretability (https://x.com/JosephMiller_/status/1811629344453984362). Worked as Research Engineer at FAR AI on their first adversarial Go paper (https://goattack.far.ai/). As a solo developer, created Hypnogram in 2021, a text-to-image AI website that generated over 5.4 million images for 1.2 million users (https://news.ycombinator.com/item?id=29143237). Worked as a software engineer in big data for 1+ years. BS in Math and Computer Science.
Alex Cloud: PhD in Statistics; 4 years of ML research in competitive industry jobs (Riot Games AI, Amazon): led research projects on novel applications to real-world games (e.g. https://www.ijcai.org/proceedings/2023/0009.pdf, https://ieeexplore.ieee.org/document/9619045); started an undergrad data science mentorship program (https://www.doranslab.gg/); participated in MLAB 2.0.
Alex Turner: MATS mentor and Research Scientist at DeepMind with a strong track record of successful mentorship and influential contributions to the AI safety literature.
The project has already succeeded in producing interesting and novel results. However, it may fail to improve outcomes from transformative AI if (i) the kind of structure we seek to impose on models inhibits their ability to fit data, reducing capabilities beyond the point of usefulness, or if (ii) low-alignment-tax localization works in principle for real-world problems and models, but requires semantically disentangled data labels that are impractical to procure or estimate.
We have a small compute budget from MATS (just above $5350 for our team of four) that we are currently on track to spend entirely by Sept 8, even without increasing our rate of experimentation.
We have applied to the Long Term Future Fund and FAR AI for 6-month salaries and a compute budget to cover our work during the MATS extension phase, but do not expect to hear back from them for another month. Even if we are funded, it is likely that our requested compute budget would not be sufficient to cover both accelerated experimentation now and 6 months of regular research activity.
Austin Chen
3 months ago
Approving this as part of our portfolio on technical AI safety research. Thanks to Neel for funding this and Alex for the writeup; I especially appreciated your emphasis on sharing credit in this team effort!
Neel Nanda
3 months ago
I discussed this with Alex Cloud. I'm somewhat pessimistic about whether the technique will both work and not have a crippling alignment tax, but he made a pretty compelling case that it MIGHT, and could be a big deal if it worked, and it's a fairly elegant idea that seems like it has potential for some cool things even if the exact proposal doesn't work.
Either way, this was a fairly cheap grant, a small fraction of the cost of the labor going into the project, and it seems valuable to gather more data on whether the technique works. I expect that having more compute will improve the quantity and quality of the evidence, especially if they can go beyond TinyStories to more realistic settings. There were several experiments Alex and I agreed would be good ideas, and I would be keen to see them happen.
Neel Nanda
3 months ago
@NeelNanda I was also impressed that Alex was able to defend the case for the project in quite a lot of detail, had already thought of several experiments I suggested, and generally seemed to care a lot about baselines and rigour.
I'm also generally pro supporting the work of promising junior researchers regardless of the project to help them build skills and credibility.
Alex Cloud
3 months ago
Thanks so much, Neel! This is huge for the team. Your feedback helped me home in on key questions that have already changed my sense of research prioritization.
As an aside: I want to make sure it's common knowledge that Joseph Miller, Evžen Wybitul, Jacob Goldman-Wetzler, and Alex Turner share equal credit here for all the hard work they've been putting in behind the scenes.