My PhD Thesis Topic: Transformers Might Approximate Solomonoff Induction
Or maybe not, who knows
One question I've been asked a lot recently - possibly more than any other over the past year or so - is “What’s your PhD thesis topic?”
It’s not one that’s easy to compress into a couple of sentences, though here are some brief attempts I’ve made:
I’m trying to figure out why modern AI works so well by comparing it to a theoretically ideal AI - what you might build if you had infinite computing power.
I’m trying to show how systems like ChatGPT can be broken down into different programs.
I’m trying to see why modern AI works so well by comparing it to an idealised model first conceived in the ‘60s.
Since Solomonoff induction is a theoretically ideal sequence prediction scheme, I’m trying to find connections between it and Transformers.
Since Transformers can emulate any Turing machine, and Solomonoff induction is a prior on every Turing machine, maybe Transformers can be modeled as forming a prior on every emulated Turing machine and that’s why they work so well.
Um… it’s kind of hard to explain…
If those are enough to sate your curiosity, great! If not, this blog post is my attempt at a proper lay explanation.
(Here is my non-lay explanation, which I presented at ICONIP 2024.)
Background
Transformers
Before you scroll further, please respond to the poll below.
If you answered “Yes” to the poll above, you probably know what a Transformer model is and can skip this section.
If not, then please take a moment to be annoyed at OpenAI’s naming abilities with me.
😒
Thank you.
(They’ve gotten slightly better recently, in my opinion, but naming your flagship product after an acronym nobody knows outside of academic computer science is pretty bad.)
“GPT” stands for Generative Pretrained Transformer. “Generative” means it can write text, not just analyse it; “Pretrained” means it was trained on a huge pile of text before ever being turned into a chat product; “Transformer” is the kind of neural network that makes ChatGPT’s level of intelligence possible.
You don’t need to know the details of how a Transformer works and how it differs from other neural networks for the purposes of this blog post. All you need to know for now is:
They’re the kind of AI architecture that’s behind pretty much all modern generative AI
They can be trained far more efficiently than earlier kinds of neural networks (like recurrent networks), largely because they can process a whole sequence in parallel
They seem to just get better and better as they get bigger and as they get trained on more data
They (more or less) work by taking a block of text and predicting what’s most likely to follow it (there’s a toy sketch of this loop just after this list)
We know how to set them up and train them, but the way they work after being trained is very difficult to understand; no one really knows how they actually work!
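Here’s that toy sketch of the predict-what-follows loop, since it’s the one mechanical detail worth picturing. The next_token_probs function below is a hand-written stand-in invented for this example; in a real system that distribution comes out of the Transformer itself, computed from its billions of learned parameters.

```python
import random

def next_token_probs(text):
    # Hand-written stand-in for a real model: a tiny made-up distribution
    # over what might come next. A real Transformer computes this from
    # billions of learned parameters.
    if text.endswith("Once upon a"):
        return {" time": 0.9, " dream": 0.1}
    return {" there": 0.5, " was": 0.3, ".": 0.2}

def generate(prompt, n_tokens=5):
    text = prompt
    for _ in range(n_tokens):
        probs = next_token_probs(text)
        tokens, weights = zip(*probs.items())
        # Pick the next chunk of text in proportion to its predicted
        # probability, append it, and repeat - that's all generation is.
        text += random.choices(tokens, weights=weights)[0]
    return text

print(generate("Once upon a"))
```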
Turing Machines
The next piece of the puzzle is the Turing machine (or TM), named after its creator, Alan Turing.
A Turing machine is sort of the original conception of a computer, but not quite as we know them today - more like a single-purpose program that can take an input, process it somehow, and then give an output. For example, you could have a machine that, given a binary number as an input, will output that number’s square in binary.
Formally speaking, a TM is defined by four things:
An infinitely long tape broken up into separate squares, each with a 0 or a 1 (and sometimes also blank spaces)
A head that can read whatever symbol is below it and overwrite it (if it wants)
A set of ‘states’ that the TM can be in - it’s always in one state at a time, no more, no less - including a starting state and sometimes a halting state
A set of instructions (called a transition function) for what the machine should do for each combination of [state the machine is in] and [symbol currently underneath the head], telling the machine which
state to switch to (can be same as current)
symbol to write in the current square (can be same as current)
direction to move the head in
There are different ways of thinking about what the TM’s ‘input’ and ‘output’ are; the simplest is just having the ‘input’ be whatever is on the tape to start with, and the ‘output’ is whatever’s on it at the end.
According to the Church-Turing thesis, this is sufficient to compute anything that can be computed. Any more complex kind of computer might be able to do the same thing faster, but it can’t compute anything that a TM can’t (even if it takes ages). A TM can even simulate a quantum computer, although it’ll do it incredibly slowly.
This also means that any computer/programming language/thing that can simulate a Turing machine can also compute anything; anything that can do so is called Turing-complete. A TM is a reasonably simple thing to put together, so almost all programming languages fall into this category, but so do Microsoft Excel, Minecraft, and Magic: The Gathering. This also includes almost any variation on the basic Turing machine that you can think of; an option to not move the head? Using multiple heads? Using multiple tapes, possibly including separate input/output tapes? Anything they can do, TMs can do slower.
Turing machines aren’t the best of the Turing-complete bunch by any stretch, but they were the first to be described, so they’re the touchstone that all others get compared to.
Universal Turing Machines
Since TMs are just descriptions of what needs to be done to solve a problem (as opposed to actual physical machines), you could easily sit down with a printed-out transition function and a piece of paper and do it yourself; I’ve done so several times in the course of my computer science study. As long as you have enough time, patience, paper, and the right TM description in front of you, you can calculate anything by hand.
Here’s the thing: when you execute a TM by hand, you’re performing a computation yourself! You’re acting as a computable function that takes a description of a TM and a problem as input, executes the TM on that problem, and gives the same output that the TM would. Every computable function has a TM that can execute it, and this one (“take a TM and a problem and solve it as the TM would”) is no different!
There exists a class of TMs that can take a description of any other TM and a separate input and execute the TM on the input, called Universal Turing machines, or UTMs.
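To make that concrete, here’s a minimal sketch of the job a UTM (or a patient human with pen and paper) does: take a transition table describing some TM, plus an input tape, and execute it step by step. The encoding below - a Python dictionary for the table, "_" for blank squares, a state literally named "halt" - is just an illustrative choice of mine, not a standard one.

```python
def run_tm(transitions, tape_input, start_state="start", max_steps=10_000):
    """Execute a TM described by a transition table.
    transitions maps (state, symbol) -> (new state, symbol to write, move),
    where move is -1 (left), 0 (stay) or +1 (right); "_" marks a blank square."""
    tape = dict(enumerate(tape_input))  # sparse tape: missing squares are blank
    head, state, steps = 0, start_state, 0
    while state != "halt" and steps < max_steps:
        symbol = tape.get(head, "_")                      # read the current square
        state, write, move = transitions[(state, symbol)]
        tape[head] = write                                # overwrite it (or not)
        head += move                                      # move the head
        steps += 1
    # The "output" is whatever is left on the tape at the end.
    return "".join(tape[i] for i in sorted(tape) if tape[i] != "_")

# A toy TM that flips every bit until it reaches a blank square, then halts.
flipper = {
    ("start", "0"): ("start", "1", +1),
    ("start", "1"): ("start", "0", +1),
    ("start", "_"): ("halt", "_", 0),
}
print(run_tm(flipper, "0011"))  # -> "1100"
```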
Solomonoff Induction
(This is the tricky bit. Bear with.)
Here’s an annoying trick question: what number comes next in the following sequence?
1, 2, 4, 8, 16…
You might notice that each number after 1 is double the previous number and think that the next number is 32.
Or you might recognise this as an old trick question to ask budding mathematicians, and remember that it’s the maximum number of pieces a cake can be cut into with the cuts starting and ending at X number of points on the edge of the cake, with the next number being 31, you idiot, why would you think it was 32.
Well, you’d think it was 32 because it’s just kind of obvious! “The numbers double” is a simpler explanation than “maximum pieces a cake can be cut into by cutting along the edges and diagonals of an X-sided shape”. Even if everyone knew the cake thing, it’s a more complicated explanation. By Occam’s razor, we should go with the doubling thing.
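(For the record, the 31 checks out. Assuming the cake puzzle is the usual one - mark X points on the edge of a circular cake, cut along every line joining two of them, and count the most pieces you can get, a.k.a. Moser’s circle problem - the closed form is well known:)

```python
from math import comb

# Maximum pieces from cutting along every chord between n points on the edge
# of a circle (Moser's circle problem): C(n,4) + C(n,2) + 1.
def max_pieces(n):
    return comb(n, 4) + comb(n, 2) + 1

print([max_pieces(n) for n in range(1, 7)])  # [1, 2, 4, 8, 16, 31]
```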
But Occam’s razor is just a vague heuristic. Solomonoff induction is a way of formalising it - finding the simplest explanation(s) for any sequence, and using that to predict what probably comes next.
Solomonoff induction is what we call an unbounded algorithm - that is, an algorithm that expresses the optimal way of solving a problem if you ignore all limits on computing power and memory. So we imagine it being executed on a computer with literally infinite computing power and memory - any program finishes instantly (even ones that would run forever on a normal machine) and it never runs out of space. Useful!
The technical term for this kind of computer is a hypercomputer, but I prefer to call it a magic computer, just to reinforce that, like… this is not a real thing. We can imagine what it might do, but we can’t build one. It’s a useful idea, though, because if you can’t solve a problem on a magic computer, you’ve got no chance on a real one.
So how do you solve arbitrary sequence prediction on a hypercomputer?
Here’s Ray Solomonoff’s solution (there’s a runnable toy version of the whole procedure after the steps below):
Receive a sequence of numbers in binary - say, “000111” - that we’ll call S.
Pick a UTM and call it U. Any will do, but it’s easiest to imagine one that simulates TMs that receive an input on an input tape, do any and all calculations on a working tape, and write on a one-way output tape (i.e. they can’t use backspace).
List every possible input that U could receive (i.e. every finite binary sequence: “0”, “1”, “00”, “01”, “10”, “11”, “000”, “001”…)
Assign a probability to each input as follows:
Split half of the probability weight (that is, 50%) among all inputs of length 1 (that is, “0” and “1” - both get 25% of the weight)
Split half of the remaining weight (that is, half of 50%, which is 25%) among the inputs of length 2 (“00”, “01”, “10”, “11” - each gets 25%/4=6.25% of the weight)
Keep going, splitting half of the remaining weight equally among all inputs of the next length. This results in every program getting assigned at least a little bit of probability, and all of the probabilities adding up to 1.
Run each input on U.
Some of the inputs won’t work, or won’t ever stop and give an output. That’s fine, just take them out of consideration.
Some of the inputs will give an output that doesn’t match S. Also fine, take them out too.
Some of the inputs will give an output that does match S, up to whatever length S is. Leave them in!
Sum up all of the weights of the remaining inputs. This is the probability that U assigns to S - call it P(S).
The probability (according to U) that S is followed by a 0 is equal to the sum of the weights of all inputs that follow S up with a 0, divided by P(S) - in other words, P(S0)/P(S). This also works for any continuation you can think of: the chance of S being followed by “0110101011111101000001” is P(S0110101011111101000001)/P(S).
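Here’s that runnable toy version of the whole procedure. One big cheat keeps it short: instead of a genuine UTM, the ‘machine’ below just repeats its input program forever - nowhere near universal, but the length-based weighting and the conditioning step are exactly the ones described above.

```python
from itertools import product

def all_inputs(max_len):
    """Every finite binary string up to max_len (a real inductor has no cap)."""
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            yield "".join(bits)

def weight(program):
    """Half the mass goes to length-1 inputs, half of the rest to length-2, and
    so on, split equally within each length: a string of length L gets 2**(-2*L)."""
    return 2.0 ** (-2 * len(program))

def toy_machine(program, n_bits):
    """Stand-in for 'run this input on U': just repeat the program forever.
    (Nowhere near universal - it only exists so this sketch runs end to end.)"""
    return "".join(program[i % len(program)] for i in range(n_bits))

def prob(S, max_len=8):
    """P(S): the total weight of the inputs whose output starts with S."""
    return sum(weight(p) for p in all_inputs(max_len)
               if toy_machine(p, len(S)) == S)

S = "000111"
print("P(next bit is 0) =", prob(S + "0") / prob(S))  # P(S0)/P(S)
print("P(next bit is 1) =", prob(S + "1") / prob(S))  # P(S1)/P(S)
```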
You can probably think of a few ways to continue our example string, “000111”. One obvious one is “000111000111000111...”, but I think “0001111111111111…” might be simpler on most UTMs. Either way, both are simpler than “0001110110101011111101000001”.
This gives us a concrete way of defending our prediction of “32” above; a program that multiplies a number by 2 over and over is much simpler than one that calculates the number of ways to cut a cake in whatever way it was, so it should have a higher weight.¹
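(A crude way to feel that size gap: write the shortest Python you can for each rule and compare. Source length in one particular language is only a stand-in for length on a UTM, but the gap is wide enough to make the point.)

```python
# Two candidate "explanations" of 1, 2, 4, 8, 16, written as Python one-liners.
# (The cake version would also need "from math import comb", widening the gap.)
doubling = "f=lambda n:2**n"                       # ..., 16, 32, ...
cake = "g=lambda n:comb(n,4)+comb(n,2)+1"          # ..., 16, 31, ...
print(len(doubling), len(cake))  # 15 vs 32: the doubling rule is much shorter
```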
(Our imaginary trick-question-giver has a potential rejoinder here: why that UTM, and not a different one that assigns more weight to the cake thingy? After all, there are infinite UTMs, and it’s always possible to assign more weight to any particular program you like. The best response is to ask them if they know what Kolmogorov complexity is. If they don’t, tell them to read about it; if they do, tell them they’re being a dick.)
There’s an equivalent way of thinking of Solomonoff induction which separately considers every TM that U can run and every input that that TM can receive. This allows you to consider each TM to be not so much a program as a probability evaluation method: how often does a given TM follow S up with 0 versus with 1? It might be the case that both are possible, but have different likelihoods. In that case, you don’t drop any program unless it can literally never output S; you just adjust each program’s weight by how likely it is to output S, according to Bayes’ Theorem. In this case, Occam’s razor doesn’t say to use the shortest program, but the one that has the smallest [size + prediction error].²
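Here’s a tiny sketch of that second view. The three ‘programs’ below are hand-picked probabilistic predictors standing in for TMs, with made-up description lengths; each one’s prior weight is 2 to the minus its size, and Bayes’ Theorem then scales that weight by how well it predicted the string so far.

```python
# A toy version of the "TMs as probabilistic predictors" view. The predictors
# and their description lengths (in bits) are invented for illustration; each
# one maps the bits seen so far to P(next bit = "1").
predictors = {
    "repeat-last-bit": (5, lambda s: 0.9 if s and s[-1] == "1" else 0.1),
    "fair-coin":       (3, lambda s: 0.5),
    "mostly-ones":     (4, lambda s: 0.8),
}

def posterior_weights(S):
    """Prior weight 2**(-size), times how likely each predictor found S
    (Bayes' Theorem, unnormalised). In bits: -log2(weight) is exactly
    [program size + prediction error], the quantity this version of
    Occam's razor says to minimise."""
    weights = {}
    for name, (size, p_one) in predictors.items():
        likelihood = 1.0
        for i, bit in enumerate(S):
            p1 = p_one(S[:i])
            likelihood *= p1 if bit == "1" else 1.0 - p1
        weights[name] = 2.0 ** (-size) * likelihood
    return weights

S = "000111"
w = posterior_weights(S)
total = sum(w.values())
# Each predictor votes on the next bit, weighted by its posterior probability.
p_next_one = sum(wt * predictors[name][1](S) for name, wt in w.items()) / total
print({name: round(wt / total, 3) for name, wt in w.items()})
print("P(next bit is 1) =", round(p_next_one, 3))
```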
I was first introduced to Solomonoff induction by this excellent introductory dialogue by Eliezer Yudkowsky.
Solomonoff induction has a few desirable properties that together let us call it the ‘optimal’ solution to the problem of sequence prediction:
It’s fully general; it can predict all possible continuations of all possible strings
It’s Pareto-optimal; you might be able to beat it on a particular string, but not on every string
But it also has an obvious downside: executing an infinite number of programs would require infinite space and infinite time, and neither are (seemingly) physically possible in this universe. Bummer.
The point is that it’s a start; now that we have an optimal way of predicting what follows a sequence on a magic computer, we can start making a version of it that runs on a real computer.
The Basic Idea
Here’s the basic hypothesis:
Transformers are the state-of-the-art in sequence predicting, but no one quite knows why.
Solomonoff induction is the optimal unbounded algorithm for sequence predicting.
Therefore, maybe the Transformer can be understood as being the best existing approximation of Solomonoff induction?
That’s it, that’s the hypothesis.
Okay, not quite, there’s more to it than that. That’s the basic idea, but here are some key considerations:
Remember how there’s a whole group of computational schemes that are Turing-complete, meaning they can all compute anything (including each other)? Almost all neural networks fall into this category, including Transformers! This means that it might be possible to model Transformers as being made up of smaller components that each model a different kind of pattern, with patterns that match the preceding text being used to predict what comes next, and others being discarded. Modern GPTs are made of hundreds of billions or even trillions of parameters (the numbers attached to their neurons and connections), which can be broken down into an absurd number of subgroups - finding ones that handle particular patterns is an ongoing area of research.
It’s not necessary for large language models to model every possible string they could encounter, only the ones likely to appear in their training corpus - i.e. the ones humans are likely to use. This means they don’t need to track every possible TM/program; they can focus on the limited set that actually shows up in practice.
GPTs are, as their poorly-chosen name suggests, trained on a whole lot of data, whereas a Solomonoff inductor starts from scratch. This makes them hard to compare directly, but there are a couple of ways to make them comparable:
Think of a GPT’s training as UTM selection - finding a universal machine that has short codes that correspond to the sort of patterns humans often use, rather than ones that are easier to program on a TM
Think of the Solomonoff inductor as observing the entire training corpus before it sees the string we want to predict
Whether Transformers can actually work as Solomonoff inductors has been tested! Researchers at DeepMind found that Transformers performed better than other networks when trained on randomly sampled programs - as long as the input sequences were no longer than the ones they had been trained on. When they were longer, performance fell apart. I think this is due to a particular architectural choice they made (namely, a fixed positional encoding), and changing it could demonstrate that Transformers make better Solomonoff inductors than other kinds of neural networks (without explaining why).
Motivation
Okay, so maybe I’m right, maybe I’m wrong; who should care?
AI safety researchers: one of the key problems in AI safety research is that modern AI is an inscrutable collection of billions of numbers attached to artificial neurons and the connections between them - far more than anyone can read, let alone analyse. If I’m right that Transformers work as well as they do because they’re basically Solomonoff inductors, that would help give researchers a framework they can use to tell what these systems are doing and how they can be controlled and monitored.
AI developers: it’s possible that constructing Transformers to even more closely resemble Solomonoff induction could make them more capable and more interpretable at the same time.
Computer scientists in general: this wouldn’t just apply to Transformers, but would go a long way towards explaining what deep learning even is and why it works. It’d be a great example of an unbounded algorithm being given a practical, bounded approximation, and it might suggest some ways to approach good bounded solutions to other problems.
You: look, this AI stuff is already changing the world beyond recognition, it’s getting faster, and no one knows how it works. I think I’ve got a piece of the puzzle here.
What I’m working on right now
Unbounded Transformers: Solomonoff induction is the ideal way of solving sequence prediction, yet most people working in AI nowadays haven’t heard of it. But earlier, I said that if you can’t solve a problem on a magic computer, you can’t do it on a real one - what gives? I think that whatever kind of ‘unbounded’ algorithm it is that AI engineers are trying to approach is a bit different in structure from Solomonoff’s, although it’ll achieve the same result. Solomonoff described Bayesian induction over inputs to a UTM, but you can perform it over anything if you can distribute the weights right. The paper I’m currently working on describes a way of performing Solomonoff-like induction over assignments of weights to Transformers instead.
Experiments: next on my list is improving on Grau-Moya et al.’s analysis of Transformers’ potential to emulate a Solomonoff inductor, and on Mingard et al.’s analysis of how stochastic gradient descent (the way Transformers are trained) approximates the kind of Bayesian selection that Solomonoff induction performs.
I’m unfortunately unable to look into the future and tell you how influential this idea could turn out to be, but I can refer you to the next best thing:
If you have any thoughts on this at all - ideas, supporting arguments, counterarguments, anything - I’d love to hear from you! Comment below or email me at the address in my paper.
¹ In fact, its weight should be much higher; in our above example, every additional digit means getting 4 times less probability, and encoding the cake program will (probably) take quite a few more digits than the doubling program.
² Prediction error and program size are both measured in bits, so they can be simply added together; in fact, doing anything else is incorrect.