
crossposted on LessWrong

I'm interested in questions of the form: "I have a bit of metadata/structure for the question, but I know very little about its content (or alternatively, I'm too worried about biases/hacks in how I think about the problem, or in which pieces of information I pay attention to). In those situations, what prior should I start with?"

I'm not sure if there is a more technical term than "low-information prior."

Some examples of what I found useful recently:

1. Laplace's Rule of Succession, for when the underlying mechanism is unknown.

2. Percentage of binary questions that resolve as "yes" on Metaculus. It turns out that of all binary (yes/no) questions asked on the prediction platform Metaculus, ~29% have resolved yes. This means that even if you know nothing about the content of a Metaculus question, a reasonable starting point for answering a randomly selected binary Metaculus question is 29%.

In both cases, obviously there are reasons to override the prior in both practice and theory (for example, you can arbitrarily add a "not" to all questions on Metaculus such that your prior is now 71%). However (I claim), having a decent prior is nonetheless useful in practice, even if it's theoretically unprincipled.

I'd be interested in seeing something like 5-10 examples of low-information priors as useful as the rule of succession or the Metaculus binary prior.



Douglas Hubbard mentions a few rules of the sort you seem to be describing in his book How to Measure Anything. For example, his "Rule of Five" states that «There is a 93.75% chance that the median of a population is between the smallest and largest values in any random sample of five from that population.» One of the central themes of that book is in fact that you "have more data than you think", and that these simple rules can take you surprisingly far.
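The 93.75% figure follows from a simple symmetry argument: each draw from a continuous population falls above the median with probability 1/2, so the median escapes the sample range only if all five draws land on the same side. Here's a minimal Python sketch of that argument plus a Monte Carlo sanity check (the lognormal population is an arbitrary choice for illustration):

```python
import random

# Rule of Five: the population median falls outside the range of a sample of
# five only if all five draws land on the same side of it:
# 2 * (1/2)**5 = 1/16, so the coverage probability is 1 - 1/16 = 0.9375.
analytic = 1 - 2 * (1 / 2) ** 5

# Monte Carlo check with an arbitrary skewed population: lognormal(0, 1),
# whose median is exp(0) = 1.
random.seed(0)
population_median = 1.0
trials = 100_000
hits = 0
for _ in range(trials):
    sample = [random.lognormvariate(0, 1) for _ in range(5)]
    if min(sample) <= population_median <= max(sample):
        hits += 1

print(analytic)       # 0.9375
print(hits / trials)  # close to 0.9375
```

Note that the rule only bounds the median, not the mean - for heavily skewed populations like the one above, the sample range says much less about the mean.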

The principle of maximum entropy may be a useful meta answer. E.g.,

  • the normal distribution is the maximum entropy distribution over all real numbers with fixed mean and variance;
  • the exponential distribution is the maximum entropy distribution over non-negative numbers with fixed mean.

Many other commonly used distributions have similar maximum entropy properties.

So e.g. suppose that for structural reasons you know that some quantity is non-negative, and that it has a certain mean, but nothing else. The principle of maximum entropy then suggests using an exponential prior.
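As a concrete sketch of that last case - a non-negative quantity with known mean gets an exponential prior with rate 1/mean (the mean of 5.0 below is an arbitrary illustrative value):

```python
import random

# Maximum entropy prior for a non-negative quantity with known mean:
# Exponential(rate = 1 / mean). The mean 5.0 is an arbitrary example.
mean = 5.0
rate = 1 / mean

random.seed(1)
draws = [random.expovariate(rate) for _ in range(100_000)]

# Sanity check: the sample mean of draws from this prior should be close to 5.0.
print(sum(draws) / len(draws))
```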

The maximum entropy principle can sometimes give implausible results, though. If you have a bag containing 100 balls which you know can only be coloured red or blue, and you adopt a maximum entropy prior over the possible ball colourings, then if you randomly drew 99 balls from the bag and they were all red, you'd conclude that the next ball is red with probability 1/2. This is because in the maximum entropy prior, the ball colourings are independent. But this feels wrong in this context. I'd want to put the probability of the 100th ball being red much higher.

The maximum entropy principle does give implausible results if applied carelessly, but the above reasoning seems very strange to me. The normal way to model this kind of scenario with a maximum entropy prior would be via Laplace's Rule of Succession, as in Max's comment below. We start with a prior for the probability that a randomly drawn ball is red and can then update on 99 red balls. This gives a 100/101 chance that the final ball is red (about 99%!). Or am I missing your point here?

Somewhat more formally, we're looking at a Bernoulli trial: for each ball, there's a probability p that it's red. We start with the maximum entropy prior for p, which is the uniform distribution on the interval [0,1] (= beta(1,1)). We update on 99 red balls, which gives a posterior for p of beta(100,1), which has mean 100/101 (this is a standard result; see e.g. conjugate priors - the beta distribution is a conjugate prior for a Bernoulli likelihood).

The more common objection to the maximum entropy principle comes when we try to reparametrise. A nice but simple example is van Fraassen's cube factory (edit: new link): a factory manufactures cubes up to 2x2x2 feet; what's the probability that a randomly selected cube has side length less than 1 foot? If we apply the maximum entropy principle (MEP), we say 1/2, because each cube has length between 0 and 2 and MEP implies that each length is equally likely. But we could have equivalently asked: what's the probability that a randomly selected cube has face area less than 1 square foot? Face area ranges from 0 to 4, so MEP implies a probability of 1/4. All and only those cubes with side length less than 1 have face area less than 1, so these are precisely the same events, but MEP gave us different answers for their probabilities! We could do the same in terms of volume and get a different answer again. This inconsistency is the kind of implausible result most commonly pointed to.
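The beta-Bernoulli update described above can be checked in a few lines of Python (a sketch of the calculation only, using exact fractions):

```python
from fractions import Fraction

# Laplace's Rule of Succession as a beta-Bernoulli update.
# Prior: p ~ beta(1, 1), i.e. the uniform distribution on [0, 1].
alpha, beta = 1, 1

# Observe 99 red balls (successes) and 0 blue balls (failures);
# conjugacy means we just add the counts to the beta parameters.
alpha += 99
beta += 0

# Posterior is beta(100, 1); its mean is alpha / (alpha + beta).
posterior_mean = Fraction(alpha, alpha + beta)
print(posterior_mean)         # 100/101
print(float(posterior_mean))  # about 0.99
```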
I think I disagree that that is the right maximum entropy prior in my ball example. You know that you are drawing balls without replacement from a bag containing 100 balls, which can only be coloured blue or red. The maximum entropy prior given this information is that every one of the 2^100 possible colourings {Ball 1, Ball 2, Ball 3, ...} -> {Red, Blue} is equally likely (i.e. from the start, the probability that all balls are red is 1 over 2^100).

I think the model you describe is only the correct approach if you make an additional assumption: that all balls were coloured using an identical procedure, and were assigned to red or blue with some unknown, but constant, probability p. The assumption that the unknown p is the same for each ball is actually a very strong assumption. If you want to adopt the maximum entropy prior consistent with the information I gave in the set-up of the problem, you'd adopt a prior where each of the 2^100 possible colourings is equally likely. I think this is the right way to think about it anyway.

The re-parametrisation example is very nice though, I wasn't aware of that before.
Thanks for the clarification - I see your concern more clearly now. You're right, my model does assume that all balls were coloured using the same procedure, in some sense - I'm assuming they're independently and identically distributed. Your case is another reasonable way to apply the maximum entropy principle, and I think it points to another problem with it, but I'd frame it slightly differently.

I don't think the maximum entropy principle is actually directly problematic in the case you describe. If we assume that all balls are coloured by completely different procedures (i.e. so that the colour of one ball doesn't tell us anything about the colours of the other balls), then seeing 99 red balls doesn't tell us anything about the final ball. In that case, I think it's reasonable (even required!) to have a 50% credence that it's red, and unreasonable to have a 99% credence, if your prior was 50%. If you find that result counterintuitive, then I think that's more of a challenge to the assumption that the balls are coloured in such a way that learning the colour of some tells you nothing about the colour of the others than a challenge to the maximum entropy principle. (I appreciate you want to assume nothing about the colouring processes rather than making that assumption explicitly, but in setting up your model this way, I think you're assuming it implicitly.)

Perhaps another way to see this: if you don't follow the maximum entropy principle and instead have a prior of 30% that the final ball is red, and then draw 99 red balls, in your scenario you should maintain 30% credence (if you don't, then you've assumed something about the colouring process that makes the balls not independent). If you find that counterintuitive, then the issue is with the assumption that the balls are coloured independently, not with the maximum entropy principle itself.
I think I disagree with your claim that I'm implicitly assuming independence of the ball colourings. I start by looking for the maximum entropy distribution within all possible probability distributions over the 2^100 possible colourings. Most of these probability distributions do not have the property that balls are coloured independently. For example, if the distribution was a 50% probability of all balls being red, and 50% probability of all balls being blue, then learning the colour of a single ball would immediately tell you the colour of all of the others. But it just so happens that for the probability distribution which maximises the entropy, the ball colourings do turn out to be independent. If you adopt the maximum entropy distribution as your prior, then learning the colour of one ball tells you nothing about the others. This is an output of the calculation, rather than an assumption.

I think I agree with your last paragraph, although there are some real problems here that I don't know how to solve. Why should we expect any of our existing knowledge to be a good guide to what we will observe in future? It has been a good guide in the past, but so what? 99 red balls apparently doesn't tell us that the 100th will likely be red, for certain seemingly reasonable choices of prior. I guess what I was trying to say in my first comment is that the maximum entropy principle is not a solution to the problem of induction, or even an approximate solution. Ultimately, I don't think anyone knows how to choose priors in a properly principled way. But I'd very much like to be corrected on this.
As a side-note, the maximum entropy principle would tell you to choose the maximum entropy prior given the information you have, and so if you intuit the information that the balls are likely to be produced by the same process, you'll get a different prior than if you don't have that information. I.e., your disagreement might stem from the fact that the maximum entropy principle gives different answers conditional on different information. I.e., you actually have information to differentiate between drawing n balls and flipping a fair coin n times.

With 50% probability, things will last twice as long as they already have.

In 1969, just after graduating from Harvard, Gott was traveling in Europe. While touring Berlin, he wondered how long the Berlin Wall would remain there. He realized that there was nothing special about his being at the Wall at that time. Thus if the time from the construction of the Wall until its removal were divided into four equal parts, there was a 50% chance that he was in one of the middle two parts. If his visit was at the beginning of this middle 50%, then the Wall would be there three times as long as it had so far; if his visit was at the end of the middle 50%, then the Wall would last 1/3 as long as it had so far. Since the Wall was 8 years old when he visited, Gott estimated that there was a 50% chance that it would last between 2.67 and 24 more years. As it turned out, it was 20 more years until the Wall came down in 1989. The success of this prediction spurred Gott to write up his method for publication. (It appeared in the journal Nature in 1993.)

Source; see also Gott.
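Gott's interval is easy to compute for any current age. A minimal sketch (the function name and interface are my own):

```python
# Gott's "Copernican" estimate: if your observation arrives at a uniformly
# random point in an object's lifetime, then with 50% probability it falls in
# the middle two quarters, so the remaining lifetime lies between 1/3 and
# 3 times the current age.
def gott_interval(current_age, confidence=0.5):
    """Central `confidence` interval for the REMAINING lifetime."""
    # With probability `confidence`, the elapsed fraction f of the total
    # lifetime lies in [(1 - confidence) / 2, (1 + confidence) / 2].
    lo_frac = (1 - confidence) / 2
    hi_frac = (1 + confidence) / 2
    # If fraction f has elapsed, the remaining time is current_age * (1 - f) / f.
    return (current_age * (1 - hi_frac) / hi_frac,
            current_age * (1 - lo_frac) / lo_frac)

# The Berlin Wall at age 8: 50% interval of roughly 2.67 to 24 more years.
print(gott_interval(8))
```

Raising `confidence` widens the interval; e.g. a 95% interval spans from 1/39 to 39 times the current age.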

I have used this method with great success to estimate, among other things, the probability that friends will break up with their romantic partners.

I also carried out some experiments a while ago to find out what the prior probability was for me "being really sure about something", or the probability associated to "I would be highly surprised to learn if this were false." That is, for the feeling of being highly sure, how does that pan out?

On another direction, superforecasters have some meta-priors, such as "things will take longer than expected, and longer for larger organizations", or "things will stay mostly as they have."

I agree this is useful and I often use it when forecasting. It's important to emphasize that this is a useful prior, though, since Gott appears to treat it as an all-things-considered posterior.

I have used this method with great success to estimate, among other things, the probability that friends will break up with their romantic partners.

William Poundstone uses this example, too, to illustrate the "Copernican principle" in his popular book on the doomsday argument.

With 50% probability, things will last twice as long as they already have.

In my head, I map this very similarly to the German tank problem. I agree that it's a very useful prior!

Yes, I think that this corresponds to the German tank problem after you see the first tank.

Here is a Wikipedia reference:

The Lindy effect is a theory that the future life expectancy of some non-perishable things like a technology or an idea is proportional to their current age, so that every additional period of survival implies a longer remaining life expectancy. Where the Lindy effect applies, mortality rate decreases with time.

With 50% probability, things will last twice as long as they already have.

By this you mean that if something has lasted x amount of time so far, then with 50% probability the total amount of time it will have lasted is at least 2x (i.e., it will continue to last at least another x)?

Yep, exactly right. 

I found the answers to this question on stats.stackexchange useful for thinking about and getting a rough overview of "uninformative" priors, though it's mainly a bit too technical to be able to easily apply in practice. It's aimed at formal Bayesian inference rather than more general forecasting.

In information theory, entropy is a measure of (lack of) information - high entropy distributions have low information. That's why the principle of maximum entropy, as Max suggested, can be useful.

Another meta answer is to use the Jeffreys prior. This has the property that it is invariant under a change of coordinates. This isn't the case for maximum entropy priors in general and is a source of inconsistency (see e.g. the partition problem for the principle of indifference, which is just a special case of the principle of maximum entropy). Jeffreys priors are often unwieldy, but one important exception is for the interval [0,1] (e.g. for a probability), for which the Jeffreys prior is the arcsine distribution beta(1/2, 1/2). See the red line in the graph at the top of the beta distribution Wikipedia page - the density is spread towards the edges, close to 0 and 1.

This relates to Max's comment about Laplace's Rule of Succession: taking N_v = 2, M_v = 1 corresponds to the uniform distribution on [0,1] (which is just beta(1,1)). This is the maximum entropy distribution on [0,1]. But as Max mentioned, we can vary N_v and M_v. Using the Jeffreys prior would be like setting N_v = 1 and M_v = 1/2, which doesn't have as nice an interpretation (1/2 a success?) but has nice theoretical features. It's especially useful if you want to put the density around 0 and 1 but still have mean 1/2.
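To see how the choice of pseudo-counts matters in practice, here's a minimal sketch comparing posterior means under the uniform (beta(1,1)) and Jeffreys (beta(1/2,1/2)) priors; the example data are made up:

```python
def posterior_mean(successes, failures, a, b):
    # beta(a, b) prior plus binomial data gives a beta(a + successes,
    # b + failures) posterior, whose mean is the expression below.
    return (a + successes) / (a + b + successes + failures)

s, f = 3, 7  # made-up data: 3 successes in 10 trials

print(posterior_mean(s, f, 1.0, 1.0))  # uniform / Laplace: 4/12, about 0.333
print(posterior_mean(s, f, 0.5, 0.5))  # Jeffreys: 3.5/11, about 0.318
```

The Jeffreys posterior sits slightly closer to the raw frequency 3/10 = 0.3, since it adds only one virtual observation in total rather than two.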

There's a bit more discussion of Laplace's Rule of Succession and the Jeffreys prior in an EA context in Toby Ord's comment in response to Will MacAskill's Are we living at the most influential time in history?

Finally, a bit of a cop-out, but I think worth mentioning: the suggestion of imprecise credences in one of the answers to the stats.stackexchange question linked above. By selecting a range of priors and seeing how much the resulting posteriors converge, you might find that the choice of prior doesn't matter very much - and when it does matter, I expect this could be useful for identifying your largest uncertainties.

I'm confused about the partition problem you linked to. Both examples in that post seem to be instances where one of the partitions discards available information.

Suppose you have a jar of blue, white, and black marbles, of unknown proportions. One is picked at random, and if it is blue, the light is turned on. If it is black or white, the light stays off (or is turned off). What is the probability the light is on?
There isn’t one single answer. In fact, there are several possible answers.
[1.] You might decide to assign a 1/2 probability to t
[...]
Yeah, these aren't great examples, because there's a choice of partition which is better than the others - thanks for pointing this out. The problem is more salient if, instead, you suppose you have no information about how many different coloured marbles there are, and ask what the probability of picking a blue marble is. There are different ways of partitioning the possibilities but no obviously privileged partition. This is how Hilary Greaves frames it here. Another good example is van Fraassen's cube factory, e.g. described here.
Thanks a lot for the pointers! Greaves' example seems to suffer the same problem, though, doesn't it? We have information about the set and distribution of colors, and assigning 50% credence to the color red does not use that information. The cube factory problem does suffer less from this, cool! I wonder if one should simply model this hierarchically, assigning equal credence to the idea that the relevant measure in cube production is side length or volume. For example, we might have information about cube bottle customers that want to fill their cubes with water. Because the customers vary in how much water they want to fit in their cube bottles, it seems to me that we should put more credence into partitioning it according to volume. Or if we'd have some information that people often want to glue the cubes under their shoes to appear taller, the relevant measure would be the side length. Currently, we have no information like this, so we should assign equal credence to both measures.
I don't think Greaves' example suffers the same problem, actually - if we truly don't know anything about what the possible colours are (just that each book has one colour), then there's no reason to prefer {red, yellow, blue, other} over {red, yellow, blue, green, other}. In the case of truly having no information, I think it makes sense to use the Jeffreys prior in the cube factory case, because it's invariant to reparametrisation, so it doesn't matter whether the problem is framed in terms of length, area, volume, or some other parameterisation. I'm not sure what that actually looks like in this case, though.
Hm, but if we don't know anything about the possible colours, the natural prior to assume seems to me to give all colors the same likelihood. It seems arbitrary to decide to group a subsection of colors under the label "other", and pretend like it should be treated like a hypothesis on equal footing with the others in your given set, which are single colors. Yeah, Jeffreys prior seems to make sense here.

Some Bayesian statisticians put together prior choice recommendations. I guess what they call a "weakly informative prior" is similar to your "low-information prior".

Another useful meta answer is stability: only a few families of distributions have the property that a linear combination of independent variables from the family again falls within the family. This means they're "attractors" for processes that can be described as consecutive sums of independent quantities and multiplications by non-random factors.

So if for structural reasons you know that what you're looking at is the output of such a process, then you may want to use an alpha-stable distribution as prior. The value of alpha essentially controls how heavy the tails are. (And the normal distribution is a special case for alpha = 2.)

This is essentially a generalization of the central limit theorem to distributions with infinite variance / heavier tails.

In a context where multiple forecasts have already been made (by you or other people), use the geometric mean of the odds as a blind aggregate: convert each probability p_i to odds o_i = p_i / (1 - p_i), take the geometric mean

o = (o_1 · o_2 · ... · o_n)^(1/n),

and convert back to a probability p = o / (1 + o).

If you want to get fancy, use an extremized version of this pooling method, by scaling the mean of the log odds by a factor d:

o = (o_1 · o_2 · ... · o_n)^(d/n)

Satopaa et al. have found that in practice d ≈ 2.5 gives the best results.
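A minimal sketch of both the plain and extremized versions (the example forecasts are made up, and the function name is my own):

```python
import math

def pool(probs, d=1.0):
    # Geometric-mean-of-odds pooling: average the log odds, optionally
    # scaled ("extremized") by a factor d, then map back to a probability.
    log_odds = [math.log(p / (1 - p)) for p in probs]
    pooled_log_odds = d * sum(log_odds) / len(log_odds)
    return 1 / (1 + math.exp(-pooled_log_odds))

forecasts = [0.6, 0.7, 0.8]  # made-up individual forecasts
print(pool(forecasts))         # plain geometric mean of odds, about 0.71
print(pool(forecasts, d=2.5))  # extremized: pushed further from 0.5
```

Extremizing with d > 1 pushes the aggregate away from 50%, on the theory that each forecaster holds only part of the available evidence, so the pooled evidence should move the estimate further than any individual did.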

The latex isn't displaying well (for me at least!) which makes this really hard to read. You just need to press 'ctrl'/'cmd' and '4' for inline latex and 'ctrl'/'cmd' and 'M' for block :)

Bono et al. (2017), based on reviewing the abstracts of all articles published between 2010 and 2015 and listed on Web of Science, found (N = 262 abstracts reporting non-normal distributions):

In terms of their frequency of appearance, the most-common non-normal distributions can be ranked in descending order as follows: gamma, negative binomial, multinomial, binomial, lognormal, and exponential.

[I read only the abstract and can't comment on the review's quality.]

Regarding the proportion of Metaculus binary questions resolving "yes", it is worth noting that this has recently changed. Here is what Anthony Aguirre wrote on the Metaculus Discord channel on the 10th of July:

I'd like to note a meta trend that things are happening: since March 1, by my rough count 54% of binary questions on Metaculus and 62% of binary questions on Pandemic have resolved positively; this is markedly different from the historical average of ~30%.

Agreed, though I think most of the positive resolutions were closely related to covid-19?

Michał Dubrawski
You are probably right, but it would be great to have access to the data and check whether it is true. I have already spoken with Tamay about the idea of Metaculus data dumps, but if the Metaculus team decides to implement this in some form, it will take time. I will try to gather this data using some script.
I'd be really excited if you were to do this.

[I learned the following from Tom Davidson, as well as that this perspective goes back to at least Carnap. Note that all of this essentially is just an explanation of beta distributions.]

Laplace's Rule of Succession really is a special case of a more general family of rules. One useful way of describing the general family is as follows:

Recall that Laplace's Rule of Succession essentially describes a prior for Bernoulli experiments; i.e. a series of independent trials with a binary outcome of success or failure. E.g. every day we observe whether the sun rises ('success') or not ('failure') [and, perhaps wrongly, we assume that whether the sun rises on one day is independent from whether it rose on any other day].

The family of priors is as follows: We pretend that prior to any actual trials we've seen N_v "virtual trials", among which were M_v successes. Then at any point after having seen N_a actual trials with M_a successes, we adopt the maximum likelihood estimate for the success probability p of a single trial based on both virtual and actual observations. I.e.,

p = (M_v + M_a) / (N_v + N_a).

Laplace's Rule of Succession simply is the special case for N_v = 2 and M_v = 1. In particular, this means that before the first actual trial we expect it to succeed with probability 1/2. But Laplace's Rule isn't the only prior with that property! We'd also expect the first trial to succeed with probability 1/2 if we took, e.g., N_v = 42 and M_v = 21. The difference compared to Laplace's Rule would be that our estimate for p will move much slower in response to actual observations - intuitively we'll need 42 actual observations until they get the same weight as virtual observations, whereas for Laplace's Rule this happens after 2 actual observations.

And of course, we don't have to "start" with p = 1/2 either - by varying N_v and M_v we can set this to any value.
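The family of rules above can be sketched in a few lines of Python (the function and parameter names are my own):

```python
def succession_estimate(actual_successes, actual_trials, n_v=2, m_v=1):
    # Virtual observations encode the prior: n_v virtual trials containing
    # m_v virtual successes. Laplace's Rule of Succession is the special
    # case n_v = 2, m_v = 1.
    return (m_v + actual_successes) / (n_v + actual_trials)

# Before any data, both of these priors start at 1/2...
print(succession_estimate(0, 0))                    # 0.5
print(succession_estimate(0, 0, n_v=42, m_v=21))    # 0.5

# ...but after 10 successes in 10 trials, the heavier prior moves more slowly.
print(succession_estimate(10, 10))                  # 11/12, about 0.917
print(succession_estimate(10, 10, n_v=42, m_v=21))  # 31/52, about 0.596
```

In the beta-distribution language used elsewhere in this thread, this is just the posterior mean of a beta(m_v, n_v - m_v) prior updated on the actual observations.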
