(Cross-posted from Hands and Cities)
A number of people I know are excited about a type of fundamental prior known as the “Universal Distribution” (UD). In particular, they’re excited about using this prior to address questions about anthropics.
I discuss the application to anthropics in a later post. As a foundation for that post, though, I wanted to first lay out my current take on the UD.
I start by explaining how one version of the UD works. Then I give my current favorite pitch for something in the vicinity. I’m not sure, though, that this pitch gets us all the way to the UD as standardly used, and I’d like more clarity about the arguments for going the full distance.
I also briefly discuss a number of possible problems with the UD, notably:
- problems introduced by the arbitrary choice of Universal Turing Machine;
- ruling out uncomputable worlds;
- simulation weirdness;
- issues with relating its implicit ontology to the real world.
My current overall view is that the UD seems plausibly cool, but pretty far from a slam-dunk.
My thanks to Amanda Askell, Paul Christiano, Tom Davidson, Katja Grace, Jacob Hilton, Evan Hubinger, Buck Shlegeris, Carl Shulman, Bastian Stern, Ben Weinstein-Raun, and Mark Xu for discussion.
I. The Universal Distribution
(Quick caveat: I’m not an expert on this stuff, and the discussion here is centrally emerging from reading a few introductions, wikis, and bits of Hutter (2005), and from talking with various people who like UD-type stuff. Presentations of various of the ideas here seem to vary, and mine may need tweaking.)
What is the UD? Basically, the UD is a type of “prior” — that is, a way of assigning credences to hypotheses, before you’ve seen any evidence. Here’s one way of setting it up.
Consider an arbitrary Universal Turing Machine (UTM) — that is, any Turing Machine that is capable of simulating any other Turing Machine’s behavior on any input. Let’s say, in particular, that this UTM has a read-only “input tape” that only moves one direction, a write-only “output tape” that moves in only one direction, and a read/write “work tape” that can move in both directions; that it only uses the symbols “0”, “1”, and “X”; and that all tapes start out filled with (infinite) zeros unless otherwise specified as inputs or re-written by the machine. And let’s say that way this machine works is:
- First, it reads the input tape up through the first “X.”
- Then, it never reads any more of the input tape, and instead proceeds to do whatever else it’s going to do (run, print outputs, etc).
Let’s call the finite string of 1s and 0s that ends in the first “X” the “program” of a given input. Thus, for the input “1011X010X10XX010…”, the program is “1011X.” (If an input has no “X”s, and hence no program, the machine never stops reading the tape, and hence never outputs anything.)
The machine, we learn, is going to get some input. What should our prior be that its output will start with a given string s — say, “01101”? Well, we don’t know anything about the input, so we’re going to assume that the contents of each cell on the infinite input tape are chosen randomly, using a fair three-sided coin (1/3rd on “1”, 1/3rd on “0”, and 1/3rd on “X”). That is, in a sense we assume a uniform prior over all possible infinite input strings. And we use this to calculate the probability that the output starts with “01101”.
How do we do this? Basically: the hard way. We search over all possible programs that the chosen input might start with; we check to see which ones cause the machine to print an output that starts with “01101” (call the set of such programs P); and then we calculate the probability of three-sided-coin-flipping our way to one of those programs. Thus, for a program of length L, the probability of three-sided-coin-flipping an input that starts with that program is ⅓L. And the probability that your output will start with “01101” is just the sum, for each program p in P, of 1/3Length(p).
(A few technical notes, OK to skip:
- I think there’s some trickiness here related to normalization, which I’m ignoring.
- The shortest programs are going to be pulling most of the weight: so much so that the length of the shortest program that outputs a string starting with s — a quantity closely related to to the “Kolmogorov Complexity” or “K” — can be used to approximate the overall prior on seeing a given s. See here for a bit more discussion of how good an approximation this is.
- Because all programs have exactly one “X”, P is a “prefix-free set” — that is, no member of the set is a prefix of any other. After all, if some program p1 was also a prefix of p2, then p2 would have two “X”s instead of one. This property is important for keeping our probabilities well-behaved.)
How do we actually find all the programs (or even, the shortest such program) that output a string starting with “01101”? In brief: you can’t actually do this. Suppose, for example, that we start by checking all the one bit programs (e.g., “X”), then all the two bit programs (“1X”, “0X,”), and so on. “X”, we find, outputs something starting with “111110001…” — so we junk it and move on. “1X”, it turns out, is also a dud. When we test “0X”, though, let’s say we hit a problem: after reading “0X”, the machine just keeps running and running, without printing anything or halting. We wait around for a hundred years; a thousand; a million. Man, testing this particular program is taking forever. Is there some way that we can speed up the process — e.g., find some set of steps we can execute to answer the question of whether our UTM is, for example, going to halt on this input? Uh oh. That sounds like the type of thing we famously can’t do. But what are we supposed to do instead, just sit around forever? Could we maybe just junk “0X” after a few million years and call it a day? But what if it turns out that a few million years after that, “0X” settles down and prints out 01101? Then it would be a hit — one of the precious members of set P. And given the failures of “X” and “1X,” it would also be the shortest member, and hence an extremely important contributor to the overall prior on “01101.” So we can’t just junk it without knowing what it ultimately does.
Thus, one of the big issues for the UD: you can’t actually compute it. And not just in a “number of atoms in the universe” sense; rather, in a “whatever is going on with the halting problem” sense. I’m told you can approximate it (see here for a bit more detail), but I don’t have a sense of how this gets around the “what if ‘0X’ is our first big hit?” problem (people sometimes talk about using a “speed prior” instead — e.g., a prior which treats programs as less likely, the longer they take to output a given set of observations — but this, I gather, has its own problems).
My impression, though, is that proponents of the UD are kind of OK with this uncomputable situation; they think that the UD serves as a useful ideal model nonetheless. I haven’t thought about this much, but I won’t focus a lot on “but it’s not computable!”-style objections here.
The UD plays a key role in an idealized model of inductive reasoning known as Solomonoff Induction (after its inventor, Ray Solomonoff; see here and here for introductions). As I understand it, Solomonoff Induction basically uses Bayesian updating on the UD to predict what observations you will make next, given what you’ve seen so far. Thus, using the set up above: if you’ve already seen 01101, you’ve narrowed down your possible inputs to the ones that start with programs in P. Before you saw anything, these programs each had prior probabilities of ⅓Length; now that you’ve seen 01101, you’ve renormalized these probabilities, but the ratios are still the same. Now, to figure out what probability to assign to a “0”, “1”, or “X” next, you check which of those programs give outputs that start with 011010 vs. 011011 vs. 01101X, weigh them by their probability, and make your prediction.
(In some presentations, e.g. here, the programs output probability distributions over bit-strings, rather than bit-strings themselves. I’m not sure how much of a difference this makes, or which version is more official/canonical.)
How would we apply something like Solomonoff Induction to our concrete lives, even in theory? Our observations, after all, don’t present themselves as a string of ones or zeros; nor do we generally think of them as generated by a specific “Great Universal Turing Machine in the Sky,” being fed some randomly generated input tape. Still, though, to run Solomonoff Induction, you just pick any old UTM, and you just pick some way of encoding your observations into the format of this UTM’s output tape (I get to the problem of deciding what counts as your “observations” in section X below), and then you treat this UTM like it’s being fed a coin-flipped input tape, and then, boom: you’ve got your prior over observations (e.g., the UD, relative to this UTM and this encoding), you’ve got your actual observations, and it’s all basic (well, not that basic) Bayesianism from there.
II. My current favorite pitch for the UD
OK, but: why would one do such a thing? What’s supposed to be good about this?
People say various things, here. For example:
- The UD is a formalization of Occam’s Razor.
- I was really confused about how to think about priors and this thing gives me answers!
- The UD is a way of being uniform over the set of all infinite inputs to an arbitrary UTM, and in this sense is maximally “agnostic.”
- Given certain assumptions, the UD + Solomonoff Induction is only finitely worse than any computable alternative (more on this below).
- The UD is formal and precise and I can prove things about it and I really like that.
- It helps with anthropics and infinities and stuff (more on this in my next post).
- It looks like the physics of our universe could be written down in a relatively short python program. Maybe this is evidence that the UD is a good guide, if we hand-wave about the arbitrary UTM thing?
- What else are you going to do? Got any better ideas? (See e.g. Hutter (2005), section 2.3.4.)
Indeed, we can prove various “optimality” properties of the UD and of Solomonoff Induction. I haven’t gone through these in detail, but my impression is that they tend to have the flavor of (4) above. And there are presumably various other arguments as well (readers who like the UD, I’d be curious to hear yours). Hutter et al (2007), for example, claim that the UD has “many remarkable properties,” and that it meets criteria “usually taken in Mathematics to mean that this object is objectively important” (random aside: does math come with an objective status hierarchy? In what sense is e “more important” than 37?).
I’ll note that (1), which features prominently in various introductions to the topic, seems to me kind of underwhelming — at least on its own. I like Occam’s Razor just fine, but there’s still a question of why Occam’s Razor is a good heuristic (see here and here for some stories) — more importantly, why should we think that talking a lot about the lengths of different inputs to an arbitrary UTM is a good way of capturing what makes Occam’s Razor useful. Indeed, at least on cursory engagement with the literature on the UD (and on algorithmic probability in general), it can feel like one is being encouraged to move from “who knows why, but I guess we think simple theories are more likely?” to “isn’t a string of a zillion ones in some sense ‘simpler’ than some particular randomly generated zillion-length string?” to “let’s build our entire epistemology on the lengths of different programs input to this UTM I just made up.” And one can be left wondering why such moves are worth making.
Indeed, as written above, (1)-(8) doesn’t seem a particularly compelling case overall, especially given the “what does this have to do with anything?” nature of the thing being argued for. Can we do better?
I’m not sure. Here’s my current favorite pitch, though, for something like the UD. I’m not especially satisfied with it, but it provides at least the beginnings of an answer to “why would I care about simplicity in something like the sense at stake in the UD?” (My thanks to Mark Xu and Ben Weinstein-Raun, in particular, for discussion.)
Suppose that you’re a person living in what you assume is a good old fashioned physical world (not, e.g., some sort of Great-UTM-In-The-Sky situation), and you’re interested in predicting what you’ll see next. And let’s say, further, that we can encode your present and future observations as a sequence of 1s and 0s (more on whether this makes sense to imagine in section X).
We’re going to make the following substantive assumption: namely, that the process generating your observations is computable. That is, there is in principle some finite set of steps that we could execute that would answer the question “what will I see next?”: if you were about to see “1,” these steps would eventually tell us “1”; and if “0”, they would tell us “0” instead. We can imagine, for example, that these steps involve simulating the entire universe at some sufficient level of detail (see the Church-Turing-Deutsch principle for more on whether this is possible in our universe; there are questions here about what it means to be simulable at this level of detail, what to say about randomness, and so on, but I’m going to pass over those for now).
Ok, then: sounds like our task is easy. All we have to do is find the steps that will answer our question, do them, and we’re done. Hmm, though: how can we find these steps?
Well, I want to tell you about a certain very special type of machine, known as a Universal Turing Machine. This machine, strange as it may sound, can do anything that can be done by executing a finite series of steps — you just need to feed it the right finite program. In fact, any version of this machine has this property: it doesn’t matter which one you pick. So, pick one. That random machine will be capable of simulating the process giving rise to your observations, if we can only find the right program to give it. With the right key in the lock, we can make this bundle of gears and tapes and pens move like the universe (and if we want, we can make it move like any other Universal Turing Machine, too).
Ok, how do we find the right program for predicting our observations? Well, we know that in the space of all possible finite programs, the one we want is out there somewhere. So what we’ll do is just: not rule any of them out. That is, we’re going to pick an arbitrary prior F over programs that gives some non-zero probability to every finite program. This guarantees that the program we want is in our hypothesis space. We’re not just deciding, willy-nilly, that some programs “can’t be the one.” It’s a very weak constraint.
(I think that to get some nice “equivalent up to a finite constant” properties, we also need to make this prior computable; but it’s not actually clear to me that the pitch I’m making, in particular, requires this constraint.)
Notice: this prior doesn’t need to be the specific one described in the previous section, which assumed that programs were chosen using a fair (three-sided) coin, and which assigned a prior on programs of 1/3^Length. That’s one such prior F, but there are many others. You could use a coin weighted a million to one towards “0”. You could use a different weighted coin for each cell of the tape. You can do whatever you like, as long as every finite program gets at least a teensy-weensy, real-numbered bit of probability.
Do these probabilities have to be real numbers? Could we, for example, use infinitesimal probabilities instead? I dunno. I gather that using infinitesimal probabilities isn’t exactly a walk in the “definitely just normal Bayesianism” park (thanks to Amanda Askell for discussion), which seems pretty unsurprising, but I haven’t looked into it. For now, I’ll just assume that you want a good-old-fashioned real-numbered prior.
Ok, but perhaps you wonder: if we’re not flipping coins, and are instead just choosing any arbitrary real-numbered prior F that doesn’t literally rule out programs for no reason, are we still going to put higher probability on shorter programs? Yes. No matter what we do, no matter how hard we try, there is a binding sense in which, if we’re assigning well-behaved probabilities to the set of all finite programs, we have to think that shorter programs are more likely. Why? Because there are too many programs to do anything else.
Here’s an informal proof (I owe this formulation from Ben Weinstein-Raun; see also related discussion here). Our claim is that:
- For any (real-numbered, well-behaved) probability e, there must be some length L such that for every program with length greater than L, the probability of that program is less than e.
Why is this true? Because for any probability e, there can only be finitely many programs with probability e or greater (specifically, you can only have 1/e programs that are at least e probable) — otherwise, your probabilities would add up to more than one. Thus, there must be some set of longest programs with probability e or greater. Their length is L, and thus, (1) is true. (If there are no programs with at least e probability, then any length will qualify as L).
Because (1) is true, as we range over all the possible programs, starting from the shortest, there is a strict sense in which our probability assignment must be bounded by some diminishing curve. That is, there will be some finite set of highest-probability programs; after you hit the last one of those, there will be some finite set of next-highest-probability programs after that; and so on — and the overall trend must always be downward. At some point, even the most determined-to-prefer-longer-programs amongst us will be driven to treat longer programs as arbitrarily unlikely: an infinite line of programs is just too long for anything else.
This is currently my favorite justification for treating shorter programs as more likely: namely, you have to, if you’re going to assign well-behaved probabilities to the set of all finite programs. Indeed, the argument applies even if you go in for ruling out programs arbitrarily: assigning 0 probability to some programs doesn’t allow you to assign e probability to more than 1/e programs, either, so you’ll either have to assign diminishing probability to the infinitely long tail of programs, or zero out the tail entirely — and zeroing it out would just be an especially extreme version of the UD-ish Occam’s Razor (e.g., “sufficiently complex hypotheses are impossible”). The tail gets you regardless.
Note, though, that this argument also allows for treating shorter programs as less likely, within any finite range. Thus, for example, if you start with the shortest programs, you can treat the first Graham’s number of programs as progressively more likely, the longer they are. It’s true that at some point, you have to hit the last program with your highest probability, and that it’s all bounded by a downhill curve from there — but you can go uphill arbitrarily long if you’d like to.
And note, as well, that your preference for shorter programs will behave in strange ways. Suppose you go through the arduous labor of assigning probabilities to each finite string. You sit back, exhausted, and notice that — surprise surprise — your prior displays a trend that treats shorter strings as more likely. But then a gust of wind picks up and blows your probability assignment every which-way — your probability on “00110X” goes to “10101000101010X,” your probability on “0X” goes to a program starting with a Graham’s Number of 0s, and so on. You wonder: “will I still expect shorter programs, when the wind dies down?” Yes, you will. It doesn’t matter how the wind blows. There’s no precious structure to maintain. When the probability-dust settles, you’ll still be an Occam-ite — at least of this extremely general and not-obviously-relevant-to-anything kind.
Let’s grant, then, that we’ve picked some prior F, which gives non-zero, real-numbered probability to every finite program. Thus, it gives non-zero probability to the “true program” (e.g., the one such that, when input into our arbitrary UTM, the UTM would accurately simulate the process producing our observations). Indeed, there are actually many “true programs” (consider, for example, the program that makes our UTM simulate some new UTM, and then codes for that new UTM’s “true program”) — and we’ve got non-zero probability on each one.
Now, the hope is, we can just go out and learn stuff, junking programs that conflict with our observations along the way. The more observations we make, the more we will junk programs that output our observations up to some point, but not after. And eventually, one hopes, the “true programs” come to dominate our probability distribution (more on whether this actually works below).
III. Is everything equivalent up to a finite fudge factor?
That’s my current best pitch for something kind of like the UD, and for the necessity of using something kind of like a simplicity prior. Basically, if you assume you live in a computable world, then you can (maybe) guarantee eventual success at predicting your observations, if you give non-zero credence to all finite programs, feed them through an arbitrary UTM, see what comes out the other end, and keep updating. And having well-behaved credences over all finite programs forces, in the limit, some kind of bias towards shorter programs.
Hmm, though. When we write it out like that, it sounds a bit like saying “Bayesianism eventually works if you don’t rule out hypotheses for no reason,” and “here’s an extremely loose constraint on how we have to assign real-numbered probabilities to a countably infinite set of hypotheses.” Both seem plausible, but not obviously anything to write home about — and in particular, not obviously enough to justify the particular way people tend to use the UD/Solomonoff Induction in practice.
What is that way? Well, in my experience, people don’t actually pick any old arbitrary “don’t rule things out for no reason” prior over finite programs. Rather, they specifically assume a prior like the one I described in section I — e.g., one that treats programs as though they were chosen via flipping a fair coin, and which thus penalizes each additional bit of length via something like 1/(number of possible characters) ^ (length of program). And what’s more, people don’t actually pick any old UTM. Rather, they specifically assume some familiar and intuitively “simple” or “natural” type of UTM — for example, something akin to a python or C++ interpreter.
These additional assumptions allow us to get a somewhat better initial grip on what the UD/Solomonoff Induction might imply about a given issue. If we’re thinking about which laws of physics are simpler, we can think about which are easier to write in python; if we’re comparing two hypotheses, the second of which requires one extra bit of information to specify, we can specifically penalize the second one by a factor of two relative to the first. But it’s not clear why people who chose other UTMs, or other, non-coin-flippy priors over programs, would have reason to accept reasoning of this kind. And we haven’t, yet, said anything about why you should choose python, or coin-flips, in particular.
One response, here, is that these choices don’t matter, because once we’re dealing with UTMs, tons of stuff becomes “only a finite constant away” from tons of other stuff, assuming the stuff in question is computable. My understanding of this sort of move is clearest with respect to the choice of UTM. Thus, if I chose UTM1, and you chose UTM2, there will be some finite prefix we can feed to my UTM, such that it then starts simulating yours in response to everything it reads after that prefix. And if I gave non-zero probability to all finite programs, I also gave non-zero probability to all the programs that start with the prefix in question; and thus, on those programs, my process of Solomonoff Induction, with respect to the programs as a whole, will mimic yours in response to the set of the post-prefix parts of those programs (which is itself the set of all possible finite programs), up to whatever difference the prefix in question makes to the prior (if I’m using a coin-flippy prior, we can quantify this difference using the length of the prefix; if I’m using any old “don’t rule things out arbitrarily” prior, it’s harder). And the same holds for you, if you gave non-zero probability to all programs.
My understanding is that this is the type of thing that gives rise to the “only finitely worse than any computable alternative”-type properties I gestured at in section III. That is, if you use Solomonoff Induction, and someone else comes to you and says: “hey dumbo, I’ve got a better UTM than you,” you can just respond with: “well, I’ve also got some non-zero probability on turning my UTM into yours, so if your thing is so cool, I’ll converge on it after a finite number of observations.” My sense is that people excited about Solomonoff Induction love this type of comeback. I’ll note, though, that I haven’t actually worked through exactly how this convergence works, what properties it has, and what specific assumptions it requires; possibly there’s additional complexity/subtlety here.
Does this sort of response work for disagreements about (computable) priors over programs, too? That is, if someone comes to you and says “hey dumbo, not only do you have a stupid UTM, but you’ve got a stupid (computable) prior over all finite programs, too — you should use my (computable) prior instead,” can you respond by saying “well, not only do I have non-zero probability on your UTM, but I’ve got some non-zero probability on your way of assigning priors to programs, too; so again, if you’re thing is so cool, my thing will start behaving just like yours after some finite number of observations”? I’m told that you can in fact say this, since there will be some way of encoding and reproducing this other person’s prior over programs, by feeding your UTM some finite prefix; that for this reason, a coin-flippy prior is equivalent (up to a constant) to any other computable prior that doesn’t rule out programs for no reason; and indeed, that all such computable priors are therefore equivalent to each other up to a constant, and are in that sense “universal.” However, when I tried to actually work through how to make one prior equivalent to another just by attaching a finite prefix, it wasn’t immediately clear to me how to do it using the set-up above (e.g., one where programs output actual bit-strings, rather than probabilities over bit-strings), and a brief conversation with someone about the topic suggested that the details might get kind of gnarly (something about resampling with the help of an infinite source of randomness). For now, I’m just going to leave the question unresolved.
Even if we can’t use an “only a finite constant away” argument for a coin-flippy prior, though, perhaps we can use some other argument. In particular, if we think of programs as the beginnings of infinite strings, as we did in section I, then coin-flipping allows for a genuinely uniform distribution over such strings: shorter programs are still more likely, but only to the extent that a larger share of infinite strings start with them. And such a uniform distribution seems more intuitively “agnostic” than e.g., assigning steadily increasing probability to the first Graham’s number of programs, then decreasing probability after that.
Overall, I’m currently a bit unsure about whether people tend to assume coin-flippy priors because of some sort of “uniformity”/”agnosticism” argument, because of some sort of “it’s all the same up to a finite constant” argument, because it privileges shorter strings and this is sort of like Occam’s razor, or because of something else and/or some combination. Indeed, I’m actually a bit unclear whether “The Universal Distribution,” as a term, refers to coin-flippy distributions in particular (as suggested e.g. Hutter et al (2007)), or whether it refers to the broader class of computable “don’t rule stuff out arbitrarily” distributions, understood as equivalent up to constant (as was suggested in a few conversations I had). For simplicity, I’ll use it, in what follows, to refer to coin-flippy distributions in particular. (There can also be some further ambiguity about whether the UD refers to your prior over programs, as suggested by e.g. Hutter et al, or your prior over outputs given a coin-flippy prior over programs, as suggested by one reading of e.g. Finney. I’m generally assuming the former.)
Various writers (e.g., Schmidhuber (1999)) are pretty clear, though, that “all the same up to a finite fudge factor” is supposed to be why choosing an arbitrary UTM is fine and dandy. So let’s assume some coin-flippy distribution, and look at these finite fudge factors, along with other possible objections/problems for the UD, in more detail.
IV. Can you ignore being only finitely wrong?
In many real-world contexts, praise like “X is only finitely bad” and “X does only finitely worse than Y” isn’t especially comforting. Consider a student who failed to turn in her homework, protesting that “I would’ve finished given some finite amount of time!”. Faced with such an excuse, it seems reasonable for the teacher to respond with: “yes, but what matters is whether you did in fact finish, given the amount of time you in fact had.” Or consider an employee giving a timeline for a project that’s only off by some finite factor — say, only a few billion (instead of taking 3 months, the project took 10 billion months). One wonders about the performance review. In general, “finite error” isn’t enough: the actual number matters, because different finite errors mean very different things. (Thanks to Katja Grace for suggesting these analogies.)
We might wonder, then, whether “the UD converges on the truth after a finite number of observations” or “my arbitrary UTM is only finitely worse than any other arbitrary UTM” is any better of an excuse than the student’s above. In particular, if our basic strategy here is something like “just make sure that you’ve got non-zero credence on the true hypothesis, then update on as many observations as you can,” we might wonder whether we will have enough observations to avoid being hopelessly wrong. And we might wonder this, too, with respect to the number of observations required to bridge the gap between one UTM and another.
One route, here, is just to hope/argue that in practice, we do have enough such observations. We might do this, for example, by estimating (1) the amount of information we have typically seen over our lifetimes, and comparing it to (2) the number of bits we expect to be required to converge on something approaching the truth, given some arbitrary starting point; and/or to (3) the number of bits we expect to be required to get one arbitrary UTM to simulate another one. If (1) is bigger than (2) and (3), then things are looking good for “the finite fudge factors will wash out once we actually bump up against the evidence.”
Even granted some large estimate for (1), though, I feel unclear how to estimate (2) and (3). Indeed, I have some hazy background sense of “for most finite numbers, most finite numbers are larger than that,” such that granted some pinned-down estimate for (1) (e.g., ten megabytes per second for thirty years), and some “pull an arbitrary finite number out of the air” estimates for (2) and (3), I hazily expect the arbitrary finite numbers to be bigger. And if they are bigger, this means that, as a Solomonoff Inductor, your beliefs given your actual evidence will end up heavily influenced by your arbitrary choice of UTM.
That said, I think it possible that I’m missing intuitions about (2) and (3) in particular that others have in more force. For example, if it is somehow true that “most” UTMs code for each other using some small number of bits — a number much smaller than whatever our estimate is for (1) — that looks helpful for the “does the arbitrary choice of UTM matter” question. Unless we restrict the class of UTMs under consideration, though (see sections V and VI), it’s not currently clear to me why one would expect this: UTMs, after all, can be arbitrarily complicated and gnarly (consider, for example, UTMs that execute some number n of additional junk instructions on the work tape before printing output), even if the ones humans tend to work with seem intuitively “simple” and “natural” to us (see here for discussion of some especially small ones).
Of course, it’s not like opponents of the UD have some obvious alternative prior, which is somehow guaranteed to lead to truth, or to resolve disagreements, given the amount of evidence we in fact get (my understanding is that some bayesians with an “Epistemic Normative Realism” bent want to appeal to some “One True Prior That Everyone Should Have”, here — but even granted the existence of such a mysterious object, it remains entirely unclear how we would identify it). Indeed, in some sense, Solomonoff Inductors are in a boat similar to the one that less computer-science-y Bayesians were in all along: you’ll plausibly converge on the truth, and resolve disagreements, eventually; but a priori, for arbitrary agents in arbitrary situations, it’s hard to say when. My main point here is that the Solomonoff Induction boat doesn’t seem obviously better.
V. To rig, or not to rig, your UTM
There is, though, another way of trying to avoid the “if my arbitrary UTM starts me off too far from the truth, I’m screwed” problem — namely, by rigging your UTM to be such that the types of worlds you currently think are likely are such as to be specifiable via a fairly short program. Thus, for example, you might choose something python-ish, because it looks like the physics of our universe is in fact specifiable in a relatively short python program.
At a glance, this looks like cheating. Indeed, if we’re allowed to actively rig our UTMs, then we can make Solomonoff Induction say whatever crazy thing we want. The recipe is fairly simple: pick the observations you want to expect (e.g., “my desk is about to turn into an elephant”), rig your chosen UTM to print those observations given the shortest strings (e.g., “0X”, “1X”, etc), and boom: elephant-desk is your highest probability prediction.
At least one person I spoke to, though, seemed to think that rigging your UTM to get the results you want was the way to go. The reasoning, as I understood it, was something like: “well, if you find yourself in a universe that looks like it was likely according to a certain distribution (for example, according to a Python-ish coin-flippy simplicity prior), you should update towards using that distribution as your prior.” Naively, though, this strikes me as a strange way to think about the type of fundamental prior thought to be at stake in the UD. After all, this prior is supposed to tell you what to expect before you see evidence like “how easy is it to write the laws of physics in python”; and in general, we tend to think that when you see evidence, the thing to do is to update your posterior, rather than to “change your prior” (whatever that means). That said, maybe the line between changing your prior (e.g., re-starting with a python-ish UTM) vs. updating your posterior (e.g., starting with some non-python-ish UTM, then updating towards programs that start with prefixes that turn your UTM python-ish) gets blurry (thanks to Evan Hubinger for discussion here).
VI. Simplicity realism
Another way around the “does your arbitrary UTM screw you over” problem is to restrict the class of admissible UTMs to some objectively privileged set. Thus, for example, we can imagine requiring that the UTM in question be intuitively “simple,” “natural,” “reasonable,” or “non-janky.” If all such UTMs simulate each other given fairly compact prefixes, this would limit the damage from finite fudge factors significantly.
Indeed, we could even loosen the requirement, and require only that the chosen UTM be such as to simulate some intuitively simple/natural/non-janky etc UTM, given some not-ridiculously-long prefix (see e.g. Yudkowsky here on not starting with ludicrously low probabilities on the hypothesis that some UTM buildable using a small number of wheels and gears, in our physical universe, would be better than whatever one you chose). To the extent that we trust such simple/natural/etc UTMs more, this builds in a requirement that convergence on such a better-trusted UTM is fairly fast.
Of course, appealing to the “simplicity” of a UTM prompts worries about circularity. In particular, if we are using “simplicity” in the sense at stake in the machinery of Solomonoff Induction, then the notion will only be defined relative to some further UTM (e.g., “UTM1 is simple iffdef it is simulable by UTM2 using a relatively short prefix”) — and thus, we will need some further way of evaluating whether that further UTM counts as simple, and so on.
Perhaps, though, we are invoking some as-yet-undefined notion of “simplicity” (despite the fact that formalizing the notion of simplicity was supposed to be one of algorithmic probability’s great victories). Indeed, my sense is that some undefined, intuitive notion of “simplicity,” “naturalness,” “non-janky-ness”, and so forth is often operating in the background of discussions of this kind; and it crops up in many other parts of philosophy as well. Perhaps, then, it’s OK to admit that it’s playing a role here, too — even if we aren’t particularly satisfied with our best current account of it.
Indeed, even absent a satisfying account, my sense is that some people seem to hope or assume that simplicity is in some deep sense an objective thing — something that is not, ultimately, relative to some particular encoding or UTM or language or practice or whatever, but which is still in some deep sense out there, such that all (most?) sufficiently intelligent beings, across all (most?) possible (actual?) worlds, could agree on it. When combined with some fundamental allegiance to Occam’s Razor, this view makes “simplicity” a kind of magical we-know-not-what, governing which things are how likely to exist, which UTMs and encodings and theories are reasonable to use, and maybe much else. On such a view, there is, in fact, a One True Prior over hypotheses, which reflects the One True Simplicity. Perhaps one day we will see it clearly. For now, we can merely grope, or intuit.
Let’s call this type of view “simplicity realism.” To my ear, it feels like a bit much — and in particular, like it risks mistaking something about how our particular, naturally-selected brains process information in this particular world with something deeper, more objective, and more applicable across all possible worlds/agents. But who knows: maybe simplicity crops up sufficiently often, or seems sufficiently continuous with other things we treat as objective, that this sort of picture ends up justified (thanks to Bastian Stern and Amanda Askell for discussion). At the least, though, if we are invoking simplicity realism (for example, in limiting the choice of UTM, or in justifying Occam’s Razor or the UD more broadly), we should be explicit about it.
VII. Simplicity as love
Where simplicity realists respond to “arbitrary UTM”-type problems by attempting to invoke some objective standard, other people I spoke to took a different and opposing tack: they embraced the arbitrariness of the choice of UTM, but attempted to reinterpret that choice in ethical rather than epistemic terms. On this view (call it, “simplicity as love”), to call something simple (e.g., in the context of the UD, to choose a UTM such that a given hypothesis gets coded for using a shorter program) is, roughly, to say that you care about it more — not because it’s more likely to exist, but as a matter of brute subjective preference.
I’m skeptical of this sort of move, especially when suggested as some sort of casual afterthought rather than in a mode of “I’m hereby getting rid of/radically revising the concept of epistemology altogether, let’s really work through what this implies in general,” but I’m not going to get into the issue here. I thought I’d flag it, though, as an additional option.
VIII. Uncomputable worlds
I’ll also flag a few other issues with the UD.
The first, noted above, is that the UD assumes that the process generating your observations is computable: e.g., there is in principle some finite set of steps we could execute, which would allow us to predict your observations. (Or perhaps, to predict them within any arbitrary error? I feel a bit confused about exactly how this goes, and about how randomness fits in with various definitions, but I’m going to leave the issue to the side.)
Indeed, if we allowed the universe to include uncomputable processes, we could allow the universe to include a copy of your process of Solomonoff Induction itself, in which case the universe could function as a Solomonoff “anti-Inductor”: e.g., a process that simulates perfectly what you will predict next, then feeds you the opposite every time (thanks to Mark Xu for discussion). My understanding is that this possibility would throw a wrench in the works of Solomonoff Induction’s “always converges on the truth” properties — and that in this sense, such properties require that Solomonoff Induction take place “outside the world.” Or put another way: Solomonoff Induction is great, as long as no one can actually run it. Not only is the process uncomputable, but it requires the impossibility of uncomputable things (and especially: itself).
(Also, to the extent we can run approximations of Solomonoff Induction, can’t we run approximations of Solomonoff Anti-Inductors, too? And won’t they create analogous problems for converging on the truth?)
Setting aside “Solomonoff Induction requires that Solomonoff Induction is impossible” type worries, though, can we just go ahead and blithely assume that the process generating our observations is computable? It would be convenient if so: and in particular, if the process generating our observations was a nice deterministic discrete cellular-automata type deal. But it’s not currently clear to me what would justify such an assumption, especially if we’re in a “let’s keep non-zero probability on all the hypotheses” type mood. After all, the question here — and especially, the question from the perspective of the UD, the fundamental prior we are supposed to assign before we learn anything about the world — is not whether, e.g. the practice of physics happens to work in our world, or whether sufficiently information-rich things run into problems with black holes, or whatever. The question is whether, in the space of all possible hypotheses we should be leaving on the table before we learn anything, there are some involving worlds whose behavior can’t be predicted by running a Turing Machine for some finite amount of time. Given that there is at least some ongoing debate about whether you can do this with our actual best-guess world; given the availability of multiple models of hypercomputation that seem mathematically coherent (and note that being in an uncomputable world doesn’t obviously require being able to use that world to do your own uncomputable things); and given the sheer vastness of the agnosticism that “non-zero probability on all hypotheses” naively implies, it seems strange to rule out uncomputable worlds, point blank, from the get-go.
One response I’ve heard here is: “well, if our universe isn’t computable, then the whole ‘predict your observations using your finite brain’ game is out the window, so whatever, let’s just give up on those worlds and focus on the computable ones.” I’d need more convincing, though. For one thing, “feasible for our finite brain” wasn’t exactly Solomonoff Induction’s strong suit, either. For another, “uncomputable” doesn’t obviously entail “can’t say anything useful about its output” — after all, Solomonoff Induction is uncomputable, but presumably we’re supposed to be able to say useful stuff about it outputs, what with all this talk about approximations and so on. And indeed, depending on how we feel about e.g. infinite ethics, uncomputable worlds could be where a lot of the action is. Giving up on them might give up on a lot of important stuff.
A different tack, though, is just to say that Solomonoff Induction is the thing you should do with the portion of your credence devoted to computable worlds — for the rest, you’re in the same sorry boat you were in before you learned all this fun stuff about UTMs, including re: whether the world is computable or uncomputable. This strikes me as a better response than “just give up on the uncomputable worlds” (though one might’ve hoped for something more unified).
IX. Simulation weirdness
Another issue for the UD is simulations. Of course, in some sense simulations are an issue for everyone. But plausibly, the UD makes the issue more acute and weirder.
The standard simulation worry is that, if there are lots of simulations of people like you, then you’re probably living in one (assuming that the population of “unsimulated people like you” is comparatively small). On the UD, though, there’s an additional worry: namely, that worlds in which you’re simulated might be coded for by much shorter programs — e.g., they might be “simpler,” relative to your arbitrary UTM — than worlds in which you’re not (though note that this worry holds even if we use some intuitive notion of “simple”, rather than some UTM-ish one; e.g., all adherents to Occam’s Razor, however computer-science-y, might end up thinking simulation worlds are more likely a priori).
Note, for example, that as a first pass, the UD doesn’t care about how long it takes for something to happen in a world; all it cares about is the length of the program required to specify that world (and then to extract some output from it — see part 2). Just as it might be easier to specify the Mandelbrot Set than a particular patch of the Mandelbrot Set, then, it might be easier to specify a world with much simpler-relative-to-our-UTM physics than our own, which then goes on (perhaps after an extremely long time) to produce simulations of our physics, than it is to specify our physics itself. Conway’s Game of Life, for example, could in principle simulate our world (it can run UTMs). If it would eventually do so, and if it’s easier to code up on your UTM than whatever should ultimately replace the Standard Model etc, then on priors, maybe you’re really a Game of Lifer, and your bones are made of bits (are they anyway?). And it’s not like your observations are any help.
Of course, it’s not clear that this is an objection, per se. But it gets into some weird stuff — in particular, stuff about agents in easy-to-code-for universes realizing that their world is “likely” according to the UD (does this assume some sort of simplicity realism?), and running simulations (and maybe doing other things) to manipulate what the UD says. Or something. People have written various things about this (see e.g. here and here), which I haven’t tried to engage with in depth. But I thought I’d flag it as an additional thing that the UD might need to deal with.
Simulation stuff is also a reminder of what “Solomonoff Induction converges on the truth” actually means. It’s not that Solomonoff Induction converges on a program that causes your UTM to simulate your world in particular, such that we can read off of the UTM’s internal behavior what sort of world you ultimately live in. Rather, Solomonoff Induction converges on programs that output your observations: accurate representation of your world isn’t necessary. Thus, if you are actually a flesh and blood human, in a normal-physics basement universe, but your random UTM happens to be disposed, on comparatively short strings, to (a) run Game of Life for zillions of years until it simulates your physics, and then (b) to extract and print out what you observe in those simulations, then looking at the internals of your UTM’s activity for clues about the nature of your ultimate reality is going to mislead: you’ll end up thinking that just beyond the walls of your physics, somewhere you can never go, there is a fleet of gliders soaring across the pixelated, two-dimensional sky: when actually, there is only [insert Actual Ultimate Reality and/or fundamental metaphysical confusion and/or something something brute fact nothingness etc].
In this sense, expecting Solomonoff Induction to actually get you to an accurate description of the world beyond your experience, as opposed to good predictions of that experience, requires a deeper faith than I attempted to justify in section II. You can’t throw all the hypotheses in the mix, then update the heck off of what you learn: that might eventually get you to good predictions, but it won’t get you to the difference between “Game of Life simulating my physics” vs. “my physics is basement.” That difference requires confidence that your random UTM’s random preferences over strings are somehow a good guide to what’s actually going on — and absent some sort of simplicity realism (or even granted such realism), the question of “wait why would we think that again?” remains salient.
Of course, as above, it’s not like anyone else has especially satisfying ideas for how to deal with problems with a flavor of “how do I tell the difference between x and y hypothesis, given that they predict that I will make the same observations?” And the UD at least gives a specific procedure. But procedures (“favor hypotheses that remind you more of your mother’s apple pie”) are cheap: it’s reliable procedures that we’re after.
X. What sort of robot are you
A final set of questions about the UD have to do with relating its implicit ontology to your real world situation. In particular: it can seem like the UD imagines you as a kind of fixed, Cartesian eye, staring — unblinking, Clockwork-Orange-style — down at an output tape, watching some well-defined set of 1s and 0s scroll past, trying to predict what comes next. But we might wonder: is that really your situation?
This question takes various forms. One version has to do with the “1s and 0s” phenomenology. Naively, we don’t see 1s and 0s: we experience the blooming richness of sight, sound, touch, proprioception, and so on. What’s more, we don’t experience these as a set of “raw pixels,” whatever those would be: rather, they come pre-processed into high-level concepts (milk, fire-truck, scary), affordances, valences, associations, and so on. How do we get all this into the format the UD requires?
One response is: “Look, let’s make this simpler. Imagine that you’re a robot with cameras for eyes, where all the images you’ve seen over your life get saved into a hard-drive in your chest. Now we’re going to predict the RGB values for your upcoming images (or maybe, the values output by some image classifier in your head).” And such a move does indeed simplify the story. One wonders, though, whether anything important is lost.
One thing that seems like it probably will get lost (whether important or no) is any key role of your “qualia” — those famously unproblematic objects, which everyone definitely agrees should be kept around. Qualia, on certain ontologies, are the central mediators between your Cartesian self and that undiscovered land, the world — and in this sense, I expect that many people intuitively imagine them as the thing that your unblinking eye is watching scroll past. But our little robot story leaves them out. Perhaps that’s all for the best (and ~everyone agrees that there are at least physical correlates of qualia anyway, so presumably pro-qualia folks can just use those), but it also leaves us with a less intuitive picture of what these “RGB values” are supposed to be. Chemical reactions in my retina? Complex patterns of neural firing? Presumably something in this vicinity — but this version makes the connection with “I wonder if I’ll see that big red fire truck today” feel less direct, relative to e.g. little red qualia patches. That said, this type of disconnection generally falls out of physicalist philosophies of mind more broadly.
And it’s not like things are neat and pretty if we go in for qualia, either: trying to encode your qualia as a string of 1s and 0s seems far from straightforward, even in principle. After all, it’s not actually a digital Cartesian theatre, with discrete little pixels or patches you can make a nice little mapping with. Rather, it’s something weirder, mushier, wobblier.
Qualia stuff aside, though (one often hopes to set it aside), even general worries in the vicinity of "how exactly do I go from my observations to a string of bits?" are the type of thing a certain “c’mon man just imagine you’re a simple robot” aesthetic really doesn’t worry about. And I’m generally up for such an aesthetic as a first pass: these issues aren’t at the top of my list of worries about the UD. But they do lurk in the background a bit, as a reminder of the gap between what my actual epistemic life feels like on the inside, on the one hand, and the Cartesian, bit-string-y eye of the UD, on the other.
There’s also a broader class of questions about the Cartesian eye thing, related to the sense in which Solomonoff Induction can seem to assume or require that you are in some sense outside the world, receiving inputs from a well-defined channel (we got a bit of a taste of this above, re: Solomonoff Induction’s certainty that uncomputable processes like itself cannot be run in the world), rather than fuzzy cloud of atoms and stuff embedded within it. I haven’t thought a lot about this, but I’ll flag it as an additional bucket of stuff.
XI. In summary
Overall, then, my current attitude towards the UD is something like: “I dunno, seems interesting, not exactly a slam dunk, but what is?” I like that it’s a formal thing we can reason and prove stuff about, and that it reflects a certain kind of agnosticism and inclusiveness with respect to (computable) stuff. Various of its optimality properties seem like they might ultimately amount to fancy and more rigorous ways of saying “Bayesianism eventually works if you didn’t rule out the true hypothesis”, but maybe that’s OK: increased clarity about stuff like this is indeed pretty great. I do agree that if the universe is computable, you can in principle model it using any arbitrary UTM; and I agree, as well, that you can’t give uniform, real-numbered probabilities to an infinite set of hypotheses, so unless I go in for infinitesimals or whatever, there’s some sense in which I need to penalize hypotheses that are later in some ordering (all orderings?). It’s much less clear to me, though, whether this has anything to do with ‘simplicity’ as typically and intuitively understood.
Indeed, probably my biggest complaint about the UD is that people seem to jump quickly from hazy intuitions about Occam’s Razor to “obviously simplicity is the fundamental key to the a priori likelihood/reality-ness/loved-by-me-ness of everything” to “obviously weighting strings input to arbitrary UTMs by ½^length is the right way to operationalize simplicity” (I exaggerate somewhat, but see e.g. the degree of non-argument here and here). Maybe this is right somehow, but I’d like to see the argument spelled out in a lot more detail. And absent such argument, I worry that some sort of undefended but metaphysically-intensive simplicity realism will be lurking in the background — a bailey to the motte of “it’s all the same up to some finite fudge factor.” And I worry, too, that “some finite fudge factor” is too little comfort. (The rest of the worries I’ve noted — about uncomputability, simulations, embedded agents, and so on — feel more secondary at the moment, though I can imagine them starting to feel more pressing.)
That said, these are all just a few initial thoughts from reading about the topic a bit, and talking with some people who like this kind of thing. I haven’t engaged much with the formal literature on the UD, and possibly there are arguments out there — both formal and informal — that would prompt in me some deeper excitement about the UD than I currently feel. Indeed, I can imagine such excitement stemming from some intuitive resonance/satisfaction that the UD provides, which isn’t strictly captured by a single linear argument or even some set, but which stems from a variety of sources and directions at once (see e.g. the list of “advice” here).
What’s more, as I’ve tried to emphasize in the post, I think that a lot of the UD’s problems (bad priors maybe dooming you, ignoring certain troublesome cases, simulations, etc) are everyone’s problems. Yudkowsky claims that “Solomonoff induction is the best formalized epistemology we have right now”– and I think he could well be right. The set of remotely satisfying fundamental epistemologies on offer seems sparse indeed, and we should be wary, in objecting to the UD, of imagining, wrongly, that we have some clear-cut and compelling alternative in mind, as opposed to whatever unprincipled muddling we’ve been doing all along. Still, though: unprincipled muddling can encode wisdom that “maybe it seems brittle and irrelevant, but at least it has an aura of technical rigor” lacks.
Regardless, my higher-level aim isn’t actually to evaluate UD per se. Rather, it’s to evaluate whether the UD can help with anthropics. I turn to that issue in my next post.
I also doubt there's a way around the “what if ‘0X’ is our first big hit?” problem without making further assumptions (like a speed prior).
Here's another way to approximate the length of the shortest program that prints out a given string from above: interleave the programs and run them alternatively, increasing the number of different programs being run. Enumerate all of the programs, P1, P2, P3, by length in non-decreasing order.
Then, for each of the steps below, run exactly one step of each program in the step's list:
2. P1, P2
3. P1, P2, P3
4. P1, P2, P3, P4
n. P1, P2, ..., Pn
i.e. at step n, run one more step of programs P1, P2, ..., Pn.
Do this while keeping track of the shortest program so far that printed your string (and you can stop running programs that halted or programs that are as long as or longer than your shortest so far). Eventually, you would find the shortest program that prints your string and record it as the shortest so far, and in some cases, you would know that it's the shortest, if all the shorter programs halt and you wait long enough for them to do so. In many cases, your upper and lower bounds (the length of the shortest program that hasn't halted yet) will never match, since your lower bound program doesn't halt.
If you have an infinite enumerable list of hypotheses, you can do the same trick and interleave those as well, and keep track of shortest programs for each hypothesis. With a finite list of hypotheses, you don't need to interleave, and you can just start them all at the same time.
Of course, this is probably slow, but if you want to make it orders of magnitude faster, I'd expect you to need more assumptions (or maybe use randomness?).
I thought that this post was neat. I was already familiar with Solomonoff induction, but the post still taught me a few things.
Not necessarily true! See Scott Aaronson on this (but iirc, he makes some assumptions I disagreed with)
One thing to think about: in order to reason about "observations" using mathematical theory, we need to (and do) convert then into mathematical things. Probability theory can only address the mathematical things we get in the end.
Most schemes for doing this ignore a lot of important stuff. E.g. "measure my height in cm, write down the answer" is a procedure that produces a real number, but also one that is indifferent to almost every "observation" I might care about in the future.
(The quotes around observation are to indicate that I don't know if it's exactly the right word).
One thing we could try to to is to propose a scheme for mathematising every observation we care about. One way we could try to do this is to try to come up with a sequence of questions "are my observations like X or like not X?". Then the mathematical object our observations become will be a binary sequence. In practice, this will never solve the problem of distinguishing any two observations we care to distinguish, but maybe imagining something like this that goes on forever is not a bad idealization in the sense that we might care less and less about the remaining undistinguished observations.
Can this story capture something like the tale of the universal prior? The problem here is that what I've described looks a bit like a Turing machine -- it outputs a sequence of binary digits -- but it isn't a Turing machine because it has no well-defined domain. In fact, the problem of getting from a vague domain to something mathematical is what it was meant to solve to begin with.
One way we can conceptualize of inputs to this process is to postulate "more powerful observers". For example, if I turn an observation into n binary questions, a more powerful observer is one that asks the same n questions and also asks one more. Then our "observation process" is a Turing machine that takes the output of the more powerful observer and drops the last digit.
However, if we consider the n->infinity limit of this, it seems consistent to me that the more powerful observer could be an anti-inductor or a randomiser vs us every step of the way.
So it seems that this story at least requires an assumption like "we can eventually predict the more powerful observer perfectly".
There are lots of other ways to make more powerful observers, they just need to be capable of distinguishing everything our observation process distinguishes.