When can a mimic surprise you?

[see original LessWrong post // by David Johnston]

Thanks to Chris Leong and Nora Belrose for their feedback. This is meant to be part of an entry to the Future Fund AI Worldview Competition, but a later post is intended to address the competition questions head on.

In this post, I explore mimics. Mimics are what you get when you join a simulator with a generator. Examples are language models that learn to predict text sequences (the simulator), and generate samples of text sequences from their predictions (the generator). A number of AI safety researchers have mentioned that mimics seem to be safer than "traditional" AI architectures like reinforcement learners, with the proposed reason for this often being that mimics are less "agentic" or "goal-driven" than traditional architectures. moire's Simulators is a particularly thorough overview that makes a similar point.

In this post, I argue that a key feature of mimics is unrelated to their "agentiness": someone who can forecast a mimic's training data can also forecast a mimic's behaviour. I call this phenomenon synchronisation. Synchronisation is possible even when the operator can only forecast some crude features of the training sequence.

Certain methods for fine-tuning mimics allow mimics to be optimised for certain tasks while staying synchronised with the operator. This enables mimics to be controlled in a manner that maintains synchronisation and consequently remain easy to predict.

However, some kinds of objectives do not facilitate synchronised control of mimics. If an operator fine-tunes a mimic to control some feature of the world over which it wouldn't normally have complete control, then the operator should generally expect the mimic's output to diverge from forecasts based on the training data. In practice, the consequences of this divergence are reminiscent of failures due to Goodhart's law.

The extremely brief summary of this post is:

Idealised mimics do what you expect them to when you're trying to control features of their output
Idealised mimics can surprise you when you're trying to control features of the world

Safety relevance

Suppose you've been reading books all your life, and you have a pretty good estimate of how likely a book is to actually be good (by your lights) given it gets a 4.8 star rating on Amazon - and, being a good Bayesian, you represent this with a conditional probability F(Actually good|4.8stars). One of the key claims of this article is that, in some situations, it is possible to fine-tune a mimic so that it produces 4.8 star books in such a way that its sampling distribution Q(Actually good|4.8stars) approximates your own subjective probability F(Actually good|4.8stars).

This provides a method for dealing with concerns like those in You get what you measure; here is a method for fine-tuning on the easy-to-measure thing, and getting the hard-to-measure latent just as much as you expect you would. A number of stars have to align in order for this to happen, but it is not an inordinately large number of stars. Furthermore, it may be possible to say quite a lot about when this might fail and by how much it might fail.

So the first safety relevant point is: perhaps there is a solution to this problem.

A broader question, the one that initially led me down this path, is whether or not safe AI is incentive compatible. If safe AI is incentive compatible, then if you do a good job of building AI that simply does what you want it to, you also do a good job of building safe AI. If safe AI is incentive incompatible, then you have to make trade-offs between building AI that simply does what you want and ensuring safety.

There's a narrow question one can ask in this regard. As I explore in this article, fine-tuning mimics often involves a regularising penalty that ensures the result is close in distribution to the original mimic. Granting, for argument's sake, that this penalty makes a system safer, we can ask: is the size of the penalty limited by performance or safety? I perform a microscopic literature review here and come up with the answer that it seems to be more often limited by performance. While today's AI systems are only weakly relevant to future AI systems, they are still a little relevant, and it might be worthwhile to interrogate this question more comprehensively.

There's also a broader question that I think is relevant: is it easier to solve control problems or hide them? If it is easier to solve control problems, then I think our world looks more incentive compatible; if it is easier to hide them then I think it looks more incentive incompatible. If mimics really do solve an important control problem, then I think we have evidence - albeit inconclusive - that we might be in a solving problems world and not a hiding problems one.

I cannot conclusively answer the question of whether mimics do solve this control problem, but the "maybe" that I offer is still progress with respect to my own understanding.

Epistemic status

I think some of the claims I make here are fairly simple and I have high confidence in them, but they are also not the critical ones. I think the important claim is the one I made in the first paragraph of the previous section: it's possible to fine-tune mimics in a way that approximately matches an operator's conditional probability in important regards, and this is a key feature that enables mimics to address more complex problems than other AI architectures. I'm much less confident in this. I expect that it is almost never true in every last detail, but I give 45% credence to it being roughly true (bearing in mind that I think most theories of this type should be very unlikely a priori).

There's also a heap I don't understand about the ideas I present here, so this credence is liable to swing wildly at short notice.

Notation reference

Xi (X′i) is a "natural" ("mimicked") random variable taking values in the set X with events X

Zi (Z′i) is a random variable deterministically related to Xi (X′i) taking values in the set Z

Ti (T′i) is a random variable not deterministically related to Xi (X′i), taking values in the set T

P is the mimic's probability distribution

P(Xn|X<n=x<n) is the distribution of Xn learned by the mimic after observing x<n

Q is the sampler argument - Q←P(Xn|X<n=x<n) means that the mimic draws samples according to P(Xn|X<n=x<n)

F is the probability distribution the operator uses to predict both natural and simulated variables. I think of the operator as a skilled but not superhuman forecaster: she has good priors, and updates them sensibly given evidence, but there are many things beyond her ability to forecast

What is a mimic?

A mimic is a simulator joined to a generator. It does two things:

It learns a probability distribution that predicts elements of a sequence of inputs
It can sample this probability distribution to produce outputs of the same type as its inputs

Given a sequence of random variables X1,X2,...,Xn−1=:X<n and an event X<n=x<n, a mimic learns the posterior distribution P(Xn|X<n=x<n). It is also equipped with a sampler, which maps distributions over Xi to random outputs X′i taking values in X. Setting the sampler argument Q←P(Xn|X<n=x<n) produces outputs X′i distributed according to P(Xn|X<n=x<n).
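As a rough illustration of this definition, here is a minimal sketch of a mimic over a small finite outcome set. It assumes, purely for illustration, that the sequence is modelled as exchangeable draws from an unknown categorical distribution with a symmetric Dirichlet prior; the post does not commit to any particular model. The class name `CategoricalMimic`, its methods and the genre labels are hypothetical.

```python
import random
from collections import Counter


class CategoricalMimic:
    """Toy mimic: a simulator (posterior predictive) joined to a generator (sampler).

    Illustrative assumption: outcomes are exchangeable draws from an unknown
    categorical distribution with a symmetric Dirichlet prior.
    """

    def __init__(self, outcomes, prior_count=1.0):
        self.outcomes = list(outcomes)
        self.prior_count = prior_count
        self.counts = Counter()

    def observe(self, history):
        """Condition on the event X_{<n} = x_{<n}."""
        self.counts.update(history)

    def predict(self):
        """Return P(X_n | X_{<n} = x_{<n}) as a dict mapping outcome -> probability."""
        seen = sum(self.counts[o] for o in self.outcomes)
        total = seen + self.prior_count * len(self.outcomes)
        return {o: (self.counts[o] + self.prior_count) / total for o in self.outcomes}

    def sample(self, rng, q=None):
        """Draw X'_n from the sampler argument Q (default: Q <- P(X_n | x_{<n}))."""
        q = self.predict() if q is None else q
        return rng.choices(list(q), weights=list(q.values()), k=1)[0]


mimic = CategoricalMimic(["fantasy", "crime", "romance"])
mimic.observe(["fantasy", "fantasy", "crime", "fantasy"])
prediction = mimic.predict()                       # the simulator half
samples = [mimic.sample(random.Random(seed)) for seed in range(5)]  # the generator half
```

The same distribution is used for both prediction and sampling unless a different sampler argument `q` is passed in, which is the hook that fine-tuning uses later in the post.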

Example

Consider a mimic that takes a sequence of books X<n as input. It can predict an as-yet unseen book Xn and it can sample a book X′n, using the same probability distribution for both.

Operators can synchronise with mimics

The basic insight of this section is: under some conditions, a person (an "operator") who can do a good job of probabilistically forecasting a natural sequence Xi can also do a good job of forecasting a mimicked sequence X′i if the mimic is trained on the same natural sequence. This happens when the operator's and mimic's posterior distributions converge. Informally: if the mimic is good, then to the operator its outputs look just like its training data.

Such convergence can happen even if the operator only observes some coarse features Zi of the mimic's inputs Xi. I do not address the question of whether or not this convergence happens in practically relevant lengths of time for practically implementable machines.

Equal capabilities

Bayesian reasoners, given the same sequence of data, will under some circumstances "merge" in their opinions of the future (pdf). Specifically, if the operator has a distribution F over the infinite sequence XN and the mimic has a distribution P over the same infinite sequence and for any collection of outcomes C, P(XN∈C)=0 implies F(XN∈C)=0 (that is, F is dominated by P) then the conditionals P(Xn|X<n=x<n) and F(Xn|X<n=x<n) will converge as n→∞ on all inputs except a set Xbad⊂XN with F-probability 0. If P is dominated by F, then this set also has P-probability 0. If F is dominated by P and vice versa, I say they have identical support.

If the mimic's sampler is set to Q←P(Xn|X<n=x<n), the operator can set their forecasting distribution F(X′n|X<n=x<n)=F(Xn|X<n=x<n), and by the above convergence this will approximate the mimic's sampling distribution. When the operator's distribution over the natural sequence approximately matches the mimic's sampling distribution, we say that the operator and the mimic are synchronised.

Note that the assumption of identical support is, in the general case, not very easy to evaluate, and this is especially true when we don't have any easy way to evaluate P or F.

Example

If the mimic learns to predict books (in every last detail) from the sequence X<n and the operator learns to predict books (in every last detail) from the same sequence, and their initial distributions assign measure 0 to the same set of long-run events, then the operator's forecasting distribution over "natural" books and the mimic's sampling distribution will eventually come to agree. I call this convergence synchronisation.
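The merging of opinions described in this section can be illustrated with a toy simulation. The sketch below assumes a binary sequence and gives the mimic and the operator different Beta priors with the same support; the specific priors, the 0.7 data-generating rate and the function names are all hypothetical choices made for illustration. It says nothing about convergence rates for realistic mimics.

```python
import random


def predictive(successes, n, alpha, beta):
    """Posterior predictive P(X_n = 1 | x_{<n}) under a Beta(alpha, beta) prior."""
    return (successes + alpha) / (n + alpha + beta)


def tv_after(data, prior_p, prior_f):
    """Total variation distance between the two one-step predictives after `data`."""
    s, n = sum(data), len(data)
    p = predictive(s, n, *prior_p)
    f = predictive(s, n, *prior_f)
    return abs(p - f)  # for a binary outcome, the TV distance is |p - f|


rng = random.Random(0)
data = [1 if rng.random() < 0.7 else 0 for _ in range(10_000)]

prior_mimic = (1.0, 1.0)     # P: uniform prior (hypothetical)
prior_operator = (8.0, 2.0)  # F: operator initially expects mostly 1s (hypothetical)

# Gap between operator and mimic forecasts after n shared observations.
gaps = {n: tv_after(data[:n], prior_mimic, prior_operator)
        for n in (0, 10, 100, 1000, 10_000)}
```

Both priors put positive density everywhere on the unit interval, so each dominates the other; this is the identical-support condition in its simplest form.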

Mimic more capable

If the operator can predict every detail of Xi just as well as the mimic, then one might wonder what use the mimic is - perhaps we could just sample from the operator's distribution instead. However, the operator may not need to predict every detail of Xi; it may be enough for her to predict some coarse features of each book, and still achieve synchronisation with the mimic. The story here is a bit more complicated, though.

Suppose that instead of observing the "base" sequence Xn, the operator observes some features Zn:=g(Xn). Abusing notation slightly, the "objective" sampling distribution of Z′n:=g(X′n) is given by

P(Z′n|X<n=x<n)=P(Zn|X<n=x<n)

By supposition, the operator does not observe x<n and so they cannot make use of F(Z′n|X<n=x<n) to synchronise with the mimic. Thus the naive argument for synchronisation does not apply. However, we can still say two things:

Given similar assumptions of common support, the operator's forecast of the machine's output given Z<n=z<n converges to the distribution of Zn given Z<n=z<n that could in principle be obtained with the mimic's assistance
If we make the additional assumption that the sequence Xi is exchangeable with respect to P, then the operator's forecast may converge to the mimic's sampling distribution as normal

These are explained in more detail after the following example.

Example

Suppose the operator observes two features of many books:

Genre
Whether or not the operator enjoys reading it

We say Zi:=(genre(Xi),enjoyability(Xi)). The operator can estimate the probability that they enjoy a book given its genre from their history of books read and the probability that a random book is of a given genre.

If the operator accepts that these probability estimates converge to the mimic's sampling distribution of Z′n because she shares inputs with the mimic, then even though the operator cannot write books, she can still say (probabilistically) how much she'll like the books the mimic produces, and what genre they'll be.

1. Operator forecast merges with the mimic's limited forecast

The mimic, by supposition, defines a collection of conditionals P(Xn|X<n) for every n. Thus we can (in principle) extract a joint distribution P(X[n]) over sequences of length n from the mimic. Actually doing this would be very impractical.

A joint distribution P(X[n]) induces a joint distribution P(X[n],Z[n]) by pushing it forward with the function h:x[n]↦(x[j],g(x[j]))j∈[n] (actually computing this would, among other things, require knowledge of g). From this, in turn, we can derive a conditional probability P(X<n|Z<n).

If the mimic's model P(X<n) is thought to be a particularly good one, then because Zi is a function of Xi, we might also surmise that P(X<n|Z<n) is a good model for X<n given Z<n. Given a realisation of the sequence Z<n=z<n, the operator can consult the mimic's conditional probability to help them assess what outputs it is likely to produce:

F(Z′n|Z<n=z<n)=∑x<n∈Xn−1P(Zn|X<n=x<n)P(X<n=x<n|Z<n=z<n)

Because each Zi is a deterministic function of Xi, the right hand side is equal to P(Zn|Z<n=z<n).

But, if F(ZN) is dominated by P(ZN), then merging of opinions implies that

P(Zn|Z<n=z<n)→F(Zn|Z<n=z<n)

in total variation. So, instead of performing the impractically complex query to determine P(Zn|Z<n=z<n), the operator can just substitute their own estimate F(Zn|Z<n=z<n), and for sufficiently large n the result will be approximately the same.
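A small simulation can make the substitution concrete. In the sketch below the mimic models whole "books" x = (genre, enjoyed) while the operator only tracks the coarse feature z = g(x), here whether she enjoyed the book. The Dirichlet and Beta priors, the four-book outcome set, the `true_weights` and the function names are hypothetical; the model is also exchangeable, so this is the easy case rather than the general one.

```python
import random
from collections import Counter

BOOKS = [("fantasy", 1), ("fantasy", 0), ("crime", 1), ("crime", 0)]  # x = (genre, enjoyed)


def g(x):
    """Coarse feature Z = g(X): only whether the operator enjoyed the book."""
    return x[1]


def mimic_feature_forecast(xs, prior_count=1.0):
    """P(Z_n = 1 | X_{<n} = x_{<n}): push the mimic's predictive over X through g."""
    counts = Counter(xs)
    total = len(xs) + prior_count * len(BOOKS)
    return sum((counts[x] + prior_count) / total for x in BOOKS if g(x) == 1)


def operator_feature_forecast(zs, alpha=1.0, beta=5.0):
    """F(Z_n = 1 | Z_{<n} = z_{<n}): Beta-Bernoulli forecast from features alone."""
    return (sum(zs) + alpha) / (len(zs) + alpha + beta)


rng = random.Random(1)
true_weights = [0.5, 0.1, 0.1, 0.3]  # hypothetical "natural" distribution over BOOKS
xs = rng.choices(BOOKS, weights=true_weights, k=5000)
zs = [g(x) for x in xs]

# The operator never sees xs, only zs, yet her forecast approaches the mimic's.
gap_early = abs(mimic_feature_forecast(xs[:5]) - operator_feature_forecast(zs[:5]))
gap_late = abs(mimic_feature_forecast(xs) - operator_feature_forecast(zs))
```

The operator starts out pessimistic about enjoying books (a Beta(1, 5) prior), so the early gap is large, and it shrinks as shared evidence accumulates.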

2. Sequence is exchangeable

If the sequence XN is exchangeable with respect to P, then so is the sequence ZN. In this case, it can be shown that Zi is independent of X<n given ΘZ, the empirical distribution of ZN, which is a function of ZN∖{n} or XN∖{n}. Hence we have

P(Zn|XN∖{n}=xN∖{n})=P(Zn|ΘZ∘gN(xN∖{n})=θZ)

=P(Zn|ΘZ(zN∖{n})=θZ)

=P(Zn|ZN∖{n}=zN∖{n})

I suspect it's possible to say something more directly about under what circumstances P(Zn|X<n=x<n)→F(Zn|Z<n=z<n), but at the moment I don't know more than this.

Exchangeable sequences also have the advantage that identical support is easier to evaluate. For exchangeable sequences, identical support of P(ZN) and F(ZN) is equivalent to the priors over the empirical distributions P(ΘZ) and F(ΘZ) having common support.

Convergence rates

The fact that F(Zn|Z<n) converges to P(Zn|Z<n) "for some finite n" isn't especially useful by itself - n being finite does not mean that it is small enough to be practically important. I don't have much idea about the extent to which operators and mimics converge in practical settings.

It's possible that there are different features of human interest - say, Wi and Zi - such that F(Wn|W<n,Z<n) and F(Zn|W<n,Z<n) converge at very different rates to the respective conditionals in P. This difference in rates could be important if Wi is some feature relevant to "performance on the immediate objective" while Zi is some feature relevant to safety - it may then be possible to build a mimic that is very predictable with respect to the immediate objective but whose safety properties are very unpredictable.

Operators can control mimics and maintain synchronisation

Not only can operators predict what mimics will do unconditionally, but for some purposes, they can control mimics such that the mimic's behaviour remains synchronised with their forecasts of the natural sequence.

Example

Suppose the operator once again observes the genre and enjoyability of many books, and she somehow controls the mimic to only produce books that she enjoys.

The operator's control desynchronises the mimic if it changes the mimic's distribution of book features conditional on enjoyability. For example, if most of the natural books the operator enjoyed were fantasy, but most of the mimicked books she enjoys are operator-flattery, then her control desynchronised the mimic.

The operator's control maintains synchronisation if the distribution of book features conditional on enjoyability doesn't change. If most of the mimicked books that the operator enjoys are also fantasy, then her control maintains synchronisation with respect to genre. Synchronisation is maintained in general if the distribution of "books in every last detail" conditional on enjoyability is unchanged.

A standard method for controlling mimics is fine-tuning them. In particular, given a binary function b:X→{0,1}, we can fine tune a mimic to approximate samples from the conditioned distribution by reinforcement learning using a KL-divergence penalty. We set r(x)=0 if b(x)=1 and r(x)=−∞ if b(x)=0,

and then, letting π0:=P(Xn|X<n=x<n), set

Q←argmaxπθEx∼πθ[r(x)]−DKL(πθ,π0)

This is maximised by P(Xn|b(Xn)=1,X<n=x<n) (see Korbak, Perez and Buckley, appendix).
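For a finite outcome set this can be checked numerically. The sketch below relies on the standard closed form for the KL-regularised objective, whose maximiser is proportional to π0(x)·exp(r(x)), and reads DKL(πθ,π0) as the KL divergence from π0 to πθ. The base distribution `pi0` and the helper names are hypothetical.

```python
import math


def kl_regularised_optimum(pi0, reward):
    """Maximiser of E_pi[r(x)] - KL(pi || pi0): pi*(x) proportional to pi0(x) * exp(r(x))."""
    weights = {x: p * math.exp(reward(x)) for x, p in pi0.items()}
    norm = sum(weights.values())
    return {x: w / norm for x, w in weights.items()}


def condition(pi0, b):
    """pi0 conditioned on b(x) = 1, computed directly from the definition."""
    mass = sum(p for x, p in pi0.items() if b(x) == 1)
    return {x: (p / mass if b(x) == 1 else 0.0) for x, p in pi0.items()}


def b(x):
    """The binary feature being controlled: the 'enjoyed' flag of a book."""
    return x[1]


def r(x):
    """The reward from the text: 0 where b(x) = 1, minus infinity elsewhere."""
    return 0.0 if b(x) == 1 else float("-inf")


# Hypothetical base distribution pi0 = P(X_n | x_{<n}) over (genre, enjoyed) pairs.
pi0 = {("fantasy", 1): 0.4, ("fantasy", 0): 0.2, ("crime", 1): 0.1, ("crime", 0): 0.3}

tuned = kl_regularised_optimum(pi0, r)
conditioned = condition(pi0, b)
```

In this toy case `tuned` and `conditioned` coincide: the genre split among enjoyed books stays at 4:1, exactly as in the base distribution, which is what maintaining synchronisation means here.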

If F(Zn|Z<n=z<n) approximates P(Zn|X<n=x<n) and consequently F(Zn|b(Xn)=1,Z<n=z<n) approximates P(Zn|b(Xn)=1,X<n=x<n), then the operator can adopt

F(Z′n|b(Xn)=1,Z<n=z<n)=F(Zn|b(Xn)=1,Z<n=z<n)

as an approximation of the conditioned mimic's sampling distribution. This requires, of course, that the operator is able to compute this conditional, and they may not be able to.

Setting a softer function r(x) will leave us somewhere between the conditioned distribution and the original distribution.
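To illustrate the interpolation, the sketch below replaces the −∞ reward with a finite penalty and tracks how much probability the KL-regularised optimum places on b(x)=1. It again uses the closed form π*(x) ∝ π0(x)·exp(r(x)); the base distribution, the penalty values and the function name `tuned_mass_on_b` are hypothetical.

```python
import math


def b(x):
    """Binary feature being controlled: the 'enjoyed' flag of a book."""
    return x[1]


def tuned_mass_on_b(pi0, penalty):
    """Mass the KL-regularised optimum puts on b(x) = 1 when r(x) = -penalty for b(x) = 0."""
    weights = {x: p * (1.0 if b(x) == 1 else math.exp(-penalty)) for x, p in pi0.items()}
    norm = sum(weights.values())
    return sum(w for x, w in weights.items() if b(x) == 1) / norm


# Hypothetical base distribution over (genre, enjoyed) pairs; half the mass has b(x) = 1.
pi0 = {("fantasy", 1): 0.4, ("fantasy", 0): 0.2, ("crime", 1): 0.1, ("crime", 0): 0.3}

# penalty = 0 recovers the original distribution; large penalties approach conditioning.
masses = [tuned_mass_on_b(pi0, penalty) for penalty in (0.0, 1.0, 3.0, 10.0)]
```

With this base distribution the mass on b(x)=1 follows a logistic curve in the penalty, rising from 0.5 towards 1.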

Example

Suppose the operator tracks two features of every book: its machine-rated binary sentiment and the number of times one person is described as helping another in the text; (Sn,Hn):=(sentiment(Xn),help(Xn)). If we use fine tuning to set the mimic's sampling distribution Q←P(Xn|Sn=1,X<n=x<n) and we accept that the appropriate form of synchronisation holds, then the operator can approximate the sampling distribution of mentions-of-helping using

F(H′n|S′n=1,S<n,H<n)=F(Hn|Sn=1,S<n,H<n)

Thus if mentions-of-helping is highly correlated with sentiment in natural books, such mentions will be very common in mimicked books fine-tuned to have positive sentiment. This example was inspired by Jermyn's discussion of the difficulty of predicting the outputs of conditioned mimics.

Fine-tuning with imperfect control is desynchronising

In practice, the operator isn't just interested in controlling functions of the mimic's output Xi. She is usually interested in controlling some feature Ti of "the world at large" which is plausibly influenced by X′i. Even in our example, we discuss things like whether books are enjoyable. The operator wants enjoyable books because she wants to read a book and enjoy it. Asking the mimic to make her enjoy the book is a lot to ask - the mimic seemingly can't do anything about her stressful job that dampens her enthusiasm for reading on some days.

What if we fine-tune the mimic with the same function, but with a reward that depends stochastically on Xi? That is, we set

Q←argmaxπθEx∼πθ;ρ[R|x]−DKL(πθ,π0)

where the expectation is over some stochastic function ρ:X→Δ(R) "implemented by the real world" that maps mimic outputs x to rewards R, which are once again assumed to take values of −∞ or 0 (not because it's a good idea, but because it helps to make my point).

If there is a nonempty "forcing" set XF⊂X defined by XF:={x|ρ(R=0|x)=1}, the result of this fine tuning will be to set Q to the distribution P(Xn|Xn∈XF,X<n=x<n).

Abusing notation again, let P(XN,RN) be the result of taking P(XN) and "pushing the Xis through ρ" (alternatively: what the mimic would believe if an oracle told it that the distribution of Ri given Xi was ρ). Unlike the situation discussed previously, fine-tuning with imperfect control will not generally yield samples from P(Xn|Rn=0,X<n=x<n).

If control is "almost perfect" - i.e. P(Xn∈XF|Rn=0,X<n=x<n)>1−ϵ, then we almost get samples from the distribution conditioned on Rn=0. In particular, under the assumption of almost perfect control we have for any A∈X

|P(Xn∈A|Rn=0,X<n=x<n)−P(Xn∈A|Xn∈XF,X<n=x<n)|<2ϵ
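One way to see where a bound of this kind comes from is sketched below. The abbreviations G and δ are introduced here for illustration and are not the post's notation; the argument assumes that the forcing set and the event Rn=0 both have positive probability, and it suppresses the conditioning on X<n=x<n throughout.

```latex
% Requires amsmath. Notation introduced for this sketch only:
%   G := \{X_n \in X_F\}, \qquad \delta := 1 - P(G \mid R_n = 0) < \epsilon .
\begin{align*}
P(A \mid R_n = 0)
  &= P(A \cap G \mid R_n = 0) + P(A \cap G^c \mid R_n = 0) \\
  &= P(A \mid G, R_n = 0)\,(1 - \delta) + P(A \cap G^c \mid R_n = 0) \\
  &= P(A \mid G)\,(1 - \delta) + P(A \cap G^c \mid R_n = 0) .
\end{align*}
% The last step uses \rho(R = 0 \mid x) = 1 for every x \in X_F, so conditioning
% on R_n = 0 in addition to G changes nothing. Rearranging, and using that both
% \delta P(A \mid G) and P(A \cap G^c \mid R_n = 0) lie in [0, \delta]:
\begin{align*}
P(A \mid R_n = 0) - P(A \mid G)
  &= -\delta\, P(A \mid G) + P(A \cap G^c \mid R_n = 0), \\
\bigl| P(A \mid R_n = 0) - P(A \mid G) \bigr|
  &\le \max\bigl(\delta\, P(A \mid G),\; P(A \cap G^c \mid R_n = 0)\bigr)
   \le \delta < \epsilon .
\end{align*}
```

Under these assumptions the difference is below ϵ, and so in particular below the 2ϵ stated above.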

However, if control is far from perfect - i.e. P(Xn∈XF|Rn=0,X<n=x<n)<ϵ - then P(Xn|Rn=0,X<n=x<n) can differ very substantially from P(Xn|Xn∈XF,X<n=x<n).

Example

Suppose the operator fine-tunes the mimic on rewards Ri, which take the value −∞ if a randomly chosen person does not agree with the book's thesis after reading it, and 0 if that person does agree; that is, Ri records whether or not book i persuades a randomly chosen individual of its main thesis. The base rate for persuasion is low - P(Rn=0|Xn=x,X<n)=0.01, but conditional on persuasion there is substantial variation in the topic Zi - i.e. P(Zi=z|Ri=0,X<i)<ϵ for all z. Fine tuning to produce books X′i with a high rate of persuasion is found to achieve the aim of almost always "persuading" random people of the book's thesis, but all of the books produced argue the thesis that the sky is blue: P(Zi=skyisblue|Xn∈XF,X<n)=1.
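A toy version of this example can be worked through numerically. The sketch below uses four hypothetical theses with made-up base probabilities `pi0` and persuasion probabilities `rho`; none of these numbers come from the post. It compares the distribution that fine-tuning on a 0/−∞ stochastic reward produces (the base distribution restricted to the forcing set) with the distribution of naturally persuasive books, P(Xn|Rn=0).

```python
def normalise(weights):
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


def fine_tuned(pi0, success_prob):
    """Result of tuning on a 0 / -inf reward: pi0 restricted to the forcing set X_F."""
    forcing = {x for x, s in success_prob.items() if s == 1.0}
    tuned = normalise({x: (p if x in forcing else 0.0) for x, p in pi0.items()})
    return tuned, forcing


def conditioned_on_success(pi0, success_prob):
    """P(X_n | R_n = 0): Bayes' rule with likelihood rho(R = 0 | x)."""
    return normalise({x: p * success_prob[x] for x, p in pi0.items()})


def total_variation(p, q):
    """Largest difference |p(A) - q(A)| over all events A."""
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)


# Hypothetical theses: base probability under the mimic, and chance of persuading a reader.
pi0 = {"sky is blue": 0.01, "thesis A": 0.33, "thesis B": 0.33, "thesis C": 0.33}
rho = {"sky is blue": 1.0, "thesis A": 0.02, "thesis B": 0.01, "thesis C": 0.03}

tuned, forcing_set = fine_tuned(pi0, rho)
natural = conditioned_on_success(pi0, rho)
in_forcing_given_success = sum(natural[x] for x in forcing_set)  # P(X_n in X_F | R_n = 0)
gap = total_variation(natural, tuned)
```

In this toy case about a third of naturally persuasive books argue that the sky is blue, but the fine-tuned mimic argues nothing else, and the gap between the two distributions equals 1 − P(Xn∈XF|Rn=0). The quantity `in_forcing_given_success` is the kind of estimate that could serve as a rough indicator of how Goodhart-prone an objective is.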

As before, softer reward functions will wind up somewhere between the unconditioned distribution and P(Xn|Xn∈XF,X<n=x<n). However, it remains difficult for the operator to forecast the result of fine-tuning, because unless they know XF in advance, they don't have any obvious method to condition on Xi∈XF.

As an aside, there is an additional problem in this regard where an operator fine-tunes a mimic to produce outputs with a particular feature, but she doesn't get what she wants in the real world from it because causation ≠ correlation.

Testing this theory

The core of the theory is: the better the mimic, the better someone (or some machine) that has learned to predict or classify the training data will perform on the mimic generated data.

This could be tested in a scheme something like this: have human volunteers label a set of training data and a set of mimic generated data. Subsequently, compare the performance of:

a classifier trained on the training data, tested on the mimic generated data
a classifier trained on the mimic generated data and tested on a held-out set of mimic generated data

The theory I present here predicts that as the mimic gets better, the performance gap between the two should shrink.
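The comparison could be prototyped on synthetic data before involving human labellers. The sketch below is only a stand-in for the proposed experiment: labels are generated rather than collected from volunteers, the "mimic" is a one-dimensional Gaussian whose class-1 mean is off by a hypothetical `shift` (smaller shift meaning a better mimic), and the classifier is a simple nearest-centroid threshold. All names and parameters are illustrative.

```python
import random
import statistics


def make_data(rng, n, shift=0.0):
    """Labelled 1-D data; `shift` is the error in the mimic's class-1 mean."""
    data = []
    for _ in range(n):
        label = rng.randrange(2)
        mean = (2.0 + shift) if label == 1 else 0.0
        data.append((rng.gauss(mean, 1.0), label))
    return data


def fit_threshold(train):
    """Nearest-centroid classifier in 1-D: threshold halfway between class means."""
    m0 = statistics.fmean([x for x, y in train if y == 0])
    m1 = statistics.fmean([x for x, y in train if y == 1])
    return (m0 + m1) / 2.0


def accuracy(threshold, test):
    """Fraction of test points whose predicted label (x > threshold) is correct."""
    return sum((x > threshold) == (y == 1) for x, y in test) / len(test)


def performance_gap(shift, n=20_000, seed=0):
    rng = random.Random(seed)
    natural = make_data(rng, n)              # stands in for the training data
    mimic_train = make_data(rng, n, shift)   # mimic-generated data
    mimic_test = make_data(rng, n, shift)    # held-out mimic-generated data
    cross = accuracy(fit_threshold(natural), mimic_test)
    within = accuracy(fit_threshold(mimic_train), mimic_test)
    return within - cross


# Smaller shift = better mimic; the prediction is that the gap shrinks with it.
gaps = {shift: performance_gap(shift) for shift in (3.0, 1.0, 0.0)}
```

A real test would replace `make_data` with human-labelled natural and mimic-generated samples and would use a classifier suited to the data.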

Are mimics dangerous?

The above discussion suggests that fine-tuning a mimic on features over which it has imperfect control might lead to unexpected behaviour - and this behaviour might be very unexpected if the objective can be controlled, but only by the mimic adopting a very unusual strategy. "Unusual strategies" that succeed at controlling difficult objectives may well be dangerous. In practice, will people want to stick close to a mimic's original distribution, or push it far from this distribution in search of effective strategies?

The claims I have made above are already somewhat speculative. The question of whether mimics are safe depends on further speculation:

Perhaps mimics may pay a performance penalty if they are not sufficiently regularised - fine tuning might have an adverse impact on their ability to generalise because they depended on the initially learned distribution to be able to do this
Perhaps the desynchronisation from fine-tuning with imperfect control might lead to mimics giving undesirable results long before regularisation becomes weak enough to make them dangerous

If the first hypothesis is true, then mimics are "passively safe" - even if we try to remove the regularisation term during fine-tuning, their ability to generalise fails before they take any dangerous actions. If only the second is true, then mimic safety is incentive compatible. Removing the regularisation term can lead to dangerous actions, but no-one is interested in doing that because it gets undesirable results for other reasons. If neither is true, then mimic safety is incentive incompatible - people want small regularisation terms to get desirable results, but this trades off against safety.

Some empirical findings

Desynchronisation can happen when fine-tuning without regularisation on perfectly controlled features

Fine-tuning language models without a KL penalty has been found to produce "degeneration" of the generated samples. Many articles attest that degeneration involves a reduction in "fluency and diversity" of samples.

Korbak et al. examined different methods to fine-tune GPT-2 to produce compilable code. Their findings were, briefly:

Unregularised reinforcement learning yielded a much higher rate of compilability at the cost of substantially reduced program length and complexity and substantial divergence from the baseline distribution of texts generated by GPT-2 conditioned on compilability
KL-regularised fine tuning yielded lower rates of compilability but longer programs (though still slightly shorter than baseline) and reduced divergence from the distribution of texts generated by GPT-2 conditioned on compilability

Training without the KL-regularisation leads to divergence from the baseline distribution conditioned on compilability. If the baseline distribution is synchronised with an operator, then this divergence is what I call "desynchronisation". The reduction in program length is one consequence of desynchronisation among many, and illustrates how desynchronised mimics can yield undesirable results that satisfy the training goal on paper.

Earlier work by Paulus, Xiong and Socher reports a broadly similar result: fine-tuning summarisation with unregularised reinforcement yields higher scores on the metric of interest, but

It is possible to game such discrete metrics and increase their score without an actual increase in readability or relevance

they also employ a kind of regularisation to try to improve summarisation while maintaining readability and relevance.

I think these examples provide very weak evidence against passive safety - unregularised reinforcement learning was successful at improving their scores on the metrics in question. I think they also provide very weak evidence in favour of incentive safety - unregularised reinforcement learning was found to produce output that was nevertheless undesirable. I say the evidence is very weak because I would not be surprised if these examples were not representative of substantially more advanced systems deployed to solve substantially more difficult problems.

It's worth noting that Korbak et al. were not able to produce perfectly compilable samples from GPT-2 using KL-regularised fine tuning, despite the fact that compilability definitely is perfectly controlled by the sequence generator. My guess is that being unable to learn the compilability predicate looks quite similar to the situation where compilability is not fully controlled by the learner. This leads me to expect that KL-regularised fine tuning in this regime might in some ways be similar to KL-regularised fine tuning in the imperfect control regime. Thus I expect to see some desynchronisation in this context, and I wonder if the slight reduction in program length this team observed is a sign of this.

There are many different pre-training schemes that seem to be effective

Pre-training might not need a large and diverse dataset to be effective. For example:

Krishna et al. find that self-supervised pretraining on a small task-specific text dataset can yield results nearly as good as (and in some cases better than) pretraining on a large and diverse corpus of text
Other papers behind that link show that self-supervised pretraining on nonsense text or synthetic text can also yield high performance on downstream tasks

If pretraining datasets don't matter very much, then (in my language) P(Xn|X<n) might not need to match F(Xn|X<n) very closely in order to produce a mimic with high performance. If these distributions do not match in every particular then, for example, F putting low weight on dangerous actions does not necessarily imply P puts low weight on the same.

On the other hand, pretraining on large datasets does seem to help performance on average, and despite the results mentioned above it remains plausible to me that extensive pretraining is necessary for mimics that are used to solve particularly difficult problems.

I think these results - especially the pretraining on nonsense text results - also weakly undermine the claim that synchronisation is an important reason why pretrained models are able to perform useful tasks ("because they give us what we expect"), but I think the relevance is very slight and is outweighed by things like the fact that we can ask GPT-3 a question and get a sensible answer in response.

Conclusion

The basic idea here seems obvious: a good mimic is hard to distinguish from the thing it's mimicking. Nevertheless, to my knowledge, Bayesian merging of opinions ("synchronisation") has not previously been proposed as a mechanism for how this occurs. My impression is also that applying standard prediction techniques (both formal and informal) to features of the training sequences to predict features of the outputs of mimics has been widely used - for example, in the investigation of prompting - but the reasons why this is possible have also not been explored very much theoretically.

I wonder whether it is feasible to advance the science of deep learning (and psychology?) to the point where we have strong enough results about synchronisation to actually prove some safety properties for advanced mimics. I am pessimistic about this, but not confident in my pessimism.

In my view, here are some key takeaways of this post:

Controlling advanced AI presents us with a problem of "delegate proxy controllability": under what conditions can I direct a delegate to pursue proxy M and expect good results?
If we take an event M to be a good proxy for desired results under "natural" conditions, then I suggest that if the consequences of a delegate pursuing M match our expectations for what happens when M occurs under natural conditions, then M should be a good proxy for controlling that delegate
Under some (possibly optimistic) assumptions, mimics can achieve the property outlined in the previous bullet
Furthermore, when some of those optimistic assumptions don't hold, we might be able to measure the "Goodhart-proneness" of an objective by estimating the probability of an action lying in the forcing set for that objective conditional on that objective being achieved. Such a measure seems relevant to a number of concerns in the AI safety field.