I want to fine-tune a coding model on the task of converting weights to code. I don't expect the resulting code to be readable by humans, but I do hope it will support static analysis, so that I can train a neural network which provably satisfies a safety property, e.g. that the model is not sandbagging.
My goal is to scale my previous demo so that it supports neural networks containing attention layers. That demo uses a modified version of back-propagation which, in addition to minimizing a loss function, also pushes the model toward satisfying a safety property.
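The modified training step can be sketched as follows. This is a hypothetical toy, not the demo's actual code: a one-parameter model y = w * x, trained by gradient descent on the usual squared error plus a penalty term (here max(0, -w)², an assumption for illustration) that is active whenever the safety property "w >= 0" is violated.

```python
# Illustrative sketch (not the author's demo): gradient descent on a combined
# objective sum((w*x - y)^2) + lam * max(0, -w)^2, where the second term
# penalizes violations of the safety property "w >= 0".

def combined_grad(w, data, lam):
    """Gradient of the combined objective with respect to w."""
    g = sum(2 * (w * x - y) * x for x, y in data)
    if w < 0:             # the safety penalty only contributes when violated
        g += 2 * lam * w  # d/dw of lam * w^2 on the region w < 0
    return g

def train(data, lam=10.0, lr=0.01, steps=500):
    w = -1.0  # deliberately start in the unsafe region
    for _ in range(steps):
        w -= lr * combined_grad(w, data, lam)
    return w

# The data below is consistent with w = 2, so training should recover a
# weight that both fits the data and satisfies the safety property.
w = train([(1.0, 2.0), (2.0, 4.0)])
```

The penalty gradient quickly drives w out of the unsafe region, after which ordinary back-propagation takes over.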
A key component of that demo is a verification algorithm which examines the weights of the model in order to prove that it satisfies the safety property on all inputs, not just the example inputs from the training data. For technical reasons, verification algorithms yield poor results when we give them a neural network which uses attention layers.
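To make the "all inputs, not just example inputs" claim concrete, here is a minimal sketch using interval bound propagation, a standard verification technique (the demo's actual algorithm may differ): propagate an input interval through a one-neuron network relu(w * x + b) and check a bound on the output for every x in the interval.

```python
# Minimal interval bound propagation sketch (a standard technique; assumed
# here for illustration). We compute the exact output range of
# relu(w*x + b) over an input interval and check a bound on the WHOLE
# range, which proves the property for all inputs in that interval.

def interval_affine(lo, hi, w, b):
    """Exact range of w*x + b when x ranges over [lo, hi]."""
    a, c = w * lo + b, w * hi + b
    return (min(a, c), max(a, c))

def interval_relu(lo, hi):
    """Exact range of max(0, y) when y ranges over [lo, hi]."""
    return (max(0.0, lo), max(0.0, hi))

def provably_bounded(w, b, x_lo, x_hi, cap):
    """True only if relu(w*x + b) <= cap for ALL x in [x_lo, x_hi]."""
    lo, hi = interval_affine(x_lo, x_hi, w, b)
    lo, hi = interval_relu(lo, hi)
    return hi <= cap

safe = provably_bounded(0.5, 0.2, -1.0, 1.0, 1.0)    # output range [0, 0.7]
unsafe = provably_bounded(2.0, 0.2, -1.0, 1.0, 1.0)  # output range [0, 2.2]
```

For affine layers and ReLU the propagated bounds are exact; attention layers mix inputs multiplicatively, which is one reason interval-style bounds become loose there.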
I want to bypass the problem by fine-tuning a coding model on the task of converting weights to code, which is easier to verify than weights. Since the resulting code does not need to be human-readable, I expect to be able to use synthetic training data. However, for technical reasons, we must use a differentiable static analysis algorithm; we cannot use an off-the-shelf code verification tool.
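For the simplest possible case, a single linear layer, a weights-to-code conversion might look like the following hypothetical sketch: emit straight-line Python source in which every weight appears as a literal constant. Code of this shape, unlike an opaque weight tensor, is a natural input for static analysis.

```python
# Hypothetical weights-to-code conversion for a linear layer (illustrative
# only; the fine-tuned coding model would target harder layers such as
# attention). Each weight becomes a literal constant in straight-line code.

def linear_layer_to_code(weights, bias, name="layer"):
    lines = [f"def {name}(xs):"]
    for i, (row, b) in enumerate(zip(weights, bias)):
        terms = " + ".join(f"{w!r} * xs[{j}]" for j, w in enumerate(row))
        lines.append(f"    y{i} = {terms} + {b!r}")
    outs = ", ".join(f"y{i}" for i in range(len(bias)))
    lines.append(f"    return [{outs}]")
    return "\n".join(lines)

src = linear_layer_to_code([[1.0, 2.0], [0.5, -1.0]], [0.0, 3.0])
namespace = {}
exec(src, namespace)           # compile the generated source
out = namespace["layer"]([1.0, 1.0])
```

The generated function computes the same outputs as the original layer, so the code, not the weights, can be handed to the (differentiable) static analysis.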
I am currently working part-time on similar projects, and they are advancing slowly. This project would require a significant amount of time and resources, so I am asking for at least enough money to rent GPUs for the fine-tuning ($5,000), and I am hoping to raise enough to quit my job ($160k, which is less than what I currently make as a Principal Software Engineer) so that I can focus on this project full-time. If I complete this project in less than a year, I will spend the remaining time on my other AI Safety projects, e.g. Verifiably-Safe Reinforcement Learning via (Soft) Linear Temporal Logic.
Only me, Samuel Gélineau, software engineer and independent researcher. Here are some of my past projects:
• My publications "Moebius: Metaprogramming using Contextual Types" (POPL 2022) and "Predictable Macros for Hindley-Milner" (TyDe 2020), both about novel type systems, demonstrate my knowledge of static analysis.
• The invited talk I gave at Galois in 2022, "Can we Prove Facts About Machine Learning Models via Code Synthesis", demonstrates that I have been working on using code to formally verify models for years.
• Parity Bot demonstrates how to train a model so that it satisfies some ad-hoc safety property.
• Sandbagging via Static Analysis demonstrates how to verify a more useful safety property, namely that the model is not lying about its capabilities.
• The task of converting weights to code might be too difficult even for powerful coding models.
• The resulting code might be unusable for the purpose of static analysis.
I am one of a handful of independent researchers supported by the AI Safety Research Fund, but I have not received any money from this fund during the last 12 months.