[ Question ]

What are some low-information priors that you find practically useful for thinking about the world?

by Linch · 1 min read · 7th Aug 2020 · 25 comments


Rationality, Statistical Methods

crossposted on LessWrong

I'm interested in questions of the form, "I have a bit of metadata/structure to the question, but I know very little about the content of the question (or alternatively, I'm too worried about biases/hacks to how I think about the problem or what pieces of information to pay attention to). In those situations, what prior should I start with?"

I'm not sure if there is a more technical term than "low-information prior."

Some examples of what I found useful recently:

1. Laplace's Rule of Succession, for when the underlying mechanism is unknown.

2. Percentage of binary questions that resolve as "yes" on Metaculus. It turns out that of all binary (yes/no) questions asked on the prediction platform Metaculus, ~29% resolved yes. This means that even if you know nothing about the content of a Metaculus question, a reasonable starting point for answering a randomly selected binary Metaculus question is 29%.
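As a minimal sketch of the first example: Laplace's Rule of Succession estimates the chance that the next trial succeeds as (s + 1) / (n + 2) after s successes in n trials. The function name below is illustrative, not taken from any library.

```python
from fractions import Fraction


def rule_of_succession(successes: int, trials: int) -> Fraction:
    """Laplace's estimate of P(next trial succeeds): (s + 1) / (n + 2)."""
    if not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    return Fraction(successes + 1, trials + 2)


# With no data at all the estimate is 1/2.
print(rule_of_succession(0, 0))  # 1/2
# After 3 successes in 3 trials it is 4/5, not 1.
print(rule_of_succession(3, 3))  # 4/5
```

The estimate never reaches 0 or 1, however one-sided the record so far.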

In both cases, obviously there are reasons to override the prior in both practice and theory (for example, you can arbitrarily add a "not" to all questions on Metaculus such that your prior is now 71%). However (I claim), having a decent prior is nonetheless useful in practice, even if it's theoretically unprincipled.

I'd be interested in seeing something like 5-10 examples of low-information priors as useful as the rule of succession or the Metaculus binary prior.

9 Answers

Douglas Hubbard mentions a few rules of the sort you seem to be describing in his book How to Measure Anything. For example, his "Rule of Five" states that «There is a 93.75% chance that the median of a population is between the smallest and largest values in any random sample of five from that population.» One of the central themes in that book is in fact that you "have more data than you think", and that these simple rules can take you surprisingly far.
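The 93.75% figure can be checked directly: each draw falls above the median with probability 1/2, so the rule fails only when all five draws land on the same side, which has probability 2 × (1/2)^5. A small sketch, assuming independent draws from a continuous population; the lognormal population in the simulation is an arbitrary illustrative choice (its median is 1).

```python
import random


def prob_median_in_sample_range(n: int) -> float:
    """Exact P(min < population median < max) for n independent draws."""
    return 1 - 2 * 0.5 ** n


def simulate(n: int, trials: int = 50_000, seed: int = 0) -> float:
    """Monte Carlo check using a lognormal(0, 1) population, whose median is 1."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.lognormvariate(0, 1) for _ in range(n)]
        if min(sample) < 1.0 < max(sample):
            hits += 1
    return hits / trials


print(prob_median_in_sample_range(5))  # 0.9375
print(simulate(5))                     # close to 0.9375
```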

With 50% probability, things will last at least twice as long in total as they already have.

In 1969, just after graduating from Harvard, Gott was traveling in Europe. While touring Berlin, he wondered how long the Berlin Wall would remain there. He realized that there was nothing special about his being at the Wall at that time. Thus if the time from the construction of the Wall until its removal were divided into four equal parts, there was a 50% chance that he was in one of the middle two parts. If his visit was at the beginning of this middle 50%, then the Wall would be there three times as long as it had so far; if his visit was at the end of the middle 50%, then the Wall would last 1/3 as long as it had so far. Since the Wall was 8 years old when he visited, Gott estimated that there was a 50% chance that it would last between 2.67 and 24 years. As it turned out, it was 20 more years until the Wall came down in 1989. The success of this prediction spurred Gott to write up his method for publication. (It appeared in the journal Nature in 1993.)

Source; see also Gott.
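The argument in the quote generalizes to any confidence level. If the fraction of the total lifetime already elapsed is uniformly distributed, then with probability c it lies in the middle interval from (1 - c)/2 to (1 + c)/2, which bounds the remaining lifetime. A minimal sketch (the function name is illustrative):

```python
from typing import Tuple


def gott_interval(age: float, confidence: float = 0.5) -> Tuple[float, float]:
    """Bounds on the remaining lifetime, given the current age.

    Assumes the observation time is uniformly distributed over the total lifetime.
    """
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    low = age * (1 - confidence) / (1 + confidence)
    high = age * (1 + confidence) / (1 - confidence)
    return low, high


# The Berlin Wall was 8 years old in 1969.
low, high = gott_interval(8, 0.5)
print(f"{low:.2f} to {high:.0f} years")  # 2.67 to 24 years
```

At 95% confidence the same function gives a much wider range, from age/39 to 39 times the age.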

I have used this method with great success to estimate, among other things, the probability that friends will break up with their romantic partners.

I also carried out some experiments a while ago to find out what the prior probability was for me "being really sure about something", or the probability associated with "I would be highly surprised to learn that this was false." That is, for the feeling of being highly sure, how does that pan out?

In another direction, superforecasters have some meta-priors, such as "things will take longer than expected, and longer for larger organizations", or "things will stay mostly as they have."

I found the answers to this question on stats.stackexchange useful for thinking about and getting a rough overview of "uninformative" priors, though they are mostly a bit too technical to apply easily in practice. They are aimed at formal Bayesian inference rather than more general forecasting.

In information theory, entropy is a measure of (lack of) information - high entropy distributions have low information. That's why the principle of maximum entropy, as Max suggested, can be useful.

Another meta answer is to use the Jeffreys prior. This has the property that it is invariant under a change of coordinates. This isn't the case for maximum entropy priors in general and is a source of inconsistency (see e.g. the partition problem for the principle of indifference, which is just a special case of the principle of maximum entropy). Jeffreys priors are often unwieldy, but one important exception is for the interval [0, 1] (e.g. for a probability), for which the Jeffreys prior is the Beta(1/2, 1/2) distribution. See the red line in the graph at the top of the beta distribution Wikipedia page - the density is spread to the edges close to 0 and 1.

This relates to Max's comment about Laplace's Rule of Succession: taking N_v = 2, M_v = 1 corresponds to the uniform distribution on [0, 1] (which is just Beta(1, 1)). This is the maximum entropy distribution on [0, 1]. But as Max mentioned, we can vary N_v and M_v. Using the Jeffreys prior would be like setting N_v = 1 and M_v = 1/2, which doesn't have as nice an interpretation (1/2 a success?) but has nice theoretical features. It is especially useful if you want to put the density around 0 and 1 but still have mean 1/2.
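To make the comparison concrete, here is a small sketch that reads N_v as a number of virtual observations and M_v as the number of virtual successes among them. That reading is an assumption on my part, chosen because it is consistent with the two cases above; it corresponds to a Beta(M_v, N_v - M_v) prior, whose posterior mean after s successes in n trials is (s + M_v) / (n + N_v).

```python
def posterior_mean(successes: int, trials: int, m_v: float, n_v: float) -> float:
    """Posterior mean of a success probability under a Beta(m_v, n_v - m_v) prior."""
    if not 0 < m_v < n_v:
        raise ValueError("need 0 < m_v < n_v")
    return (successes + m_v) / (trials + n_v)


# Zero successes in ten trials under the two priors discussed above.
uniform = posterior_mean(0, 10, m_v=1.0, n_v=2.0)   # Laplace: (0 + 1) / (10 + 2)
jeffreys = posterior_mean(0, 10, m_v=0.5, n_v=1.0)  # (0 + 0.5) / (10 + 1)
print(uniform, jeffreys)
```

The Jeffreys estimate moves toward the observed frequency faster, because its virtual observations carry half the weight of the uniform prior's.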

There's a bit more discussion of Laplace's Rule of Succession and the Jeffreys prior in an EA context in Toby Ord's comment in response to Will MacAskill's Are we living at the most influential time in history?

Finally, a bit of a cop-out, but I think worth mentioning, is the suggestion of imprecise credences in one of the answers to the stats.stackexchange question linked above. If you select a range of priors and see how much they converge, you might find that prior choice doesn't matter that much, and when it does matter, I expect this could be useful for determining your largest uncertainties.

The principle of maximum entropy may be a useful meta answer. E.g.,

  • the normal distribution is the maximum entropy distribution over all real numbers with fixed mean and variance;
  • the exponential distribution is the maximum entropy distribution over non-negative numbers with fixed mean.

Many other commonly used distributions have similar maximum entropy properties.

So e.g. suppose that for structural reasons you know that some quantity is non-negative, and that it has a certain mean, but nothing else. The principle of maximum entropy then suggests using an exponential prior.
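As an illustrative check rather than a proof, the sketch below compares the closed-form differential entropy of the exponential distribution with two other non-negative distributions that have the same mean. The two comparison distributions (uniform on [0, 2·mean] and half-normal) are my own choice of examples.

```python
import math


def entropy_exponential(mean: float) -> float:
    """Differential entropy (nats) of an exponential distribution with this mean."""
    return 1 + math.log(mean)


def entropy_uniform(mean: float) -> float:
    """Differential entropy of the uniform distribution on [0, 2 * mean]."""
    return math.log(2 * mean)


def entropy_half_normal(mean: float) -> float:
    """Differential entropy of a half-normal distribution with this mean."""
    sigma = mean * math.sqrt(math.pi / 2)  # half-normal mean is sigma * sqrt(2 / pi)
    return 0.5 * math.log(math.pi * sigma ** 2 / 2) + 0.5


for m in (0.1, 1.0, 10.0):
    print(m, entropy_exponential(m), entropy_uniform(m), entropy_half_normal(m))
```

For every mean the exponential has the largest entropy of the three, as the principle predicts.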

Some Bayesian statisticians put together prior choice recommendations. I guess what they call a "weakly informative prior" is similar to your "low-information prior".

Another useful meta answer is stability: There are only a few families of distributions with the property that a linear combination of independent random variables from the family again has a distribution of the same type. This means they're "attractors" for processes that can be described as consecutive sums of independent quantities and multiplications by nonrandom factors.

So if for structural reasons you know that what you're looking at is the output of such a process, then you may want to use an alpha-stable distribution as prior. The value of alpha essentially controls how heavy the tails are. (And the normal distribution is a special case for alpha = 2.)

This is essentially a generalization of the central limit theorem to distributions with higher variance / heavier tails.
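A quick way to see stability at work is to compare two stable distributions: the normal (alpha = 2) and the Cauchy (alpha = 1). The sketch below, using only the standard library, averages n independent draws and measures the spread of that average by its interquartile range. The helper names are illustrative.

```python
import math
import random


def iqr(values):
    """Interquartile range from sorted order statistics."""
    s = sorted(values)
    n = len(s)
    return s[(3 * n) // 4] - s[n // 4]


def draw_cauchy(rng):
    """Standard Cauchy draw via the inverse CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))


def draw_normal(rng):
    """Standard normal draw."""
    return rng.gauss(0.0, 1.0)


def iqr_of_sample_means(draw, n_terms, trials=5_000, seed=0):
    """IQR of the mean of n_terms independent draws, estimated over many trials."""
    rng = random.Random(seed)
    means = [sum(draw(rng) for _ in range(n_terms)) / n_terms for _ in range(trials)]
    return iqr(means)


for n_terms in (1, 10, 50):
    print(n_terms,
          round(iqr_of_sample_means(draw_normal, n_terms), 2),
          round(iqr_of_sample_means(draw_cauchy, n_terms), 2))
```

The normal column shrinks roughly like 1/sqrt(n), while the Cauchy column stays near 2: the average of n standard Cauchy variables is again standard Cauchy, so averaging does not tame the tails.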

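In a context where multiple forecasts have already been made (by you or other people), use the geometric mean of the odds as a blind aggregate:

$$o = \left( \prod_{i=1}^{N} o_i \right)^{1/N}, \qquad o_i = \frac{p_i}{1 - p_i}$$

If you want to get fancy, use an extremized version of this pooling method, by scaling the log odds by a factor $d$:

$$\log o = d \cdot \frac{1}{N} \sum_{i=1}^{N} \log o_i$$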

Satopaa et al have found that in practice gives the best results.
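A minimal sketch of this pooling method, assuming the forecasts are probabilities strictly between 0 and 1. Taking the geometric mean of the odds is the same as averaging the log odds, and extremizing multiplies that average by a constant. The parameter name `d` and the value 2.0 used in the example are placeholders for illustration only, not the value reported by Satopaa et al.

```python
import math


def pool_forecasts(probabilities, d=1.0):
    """Pool probabilities by the geometric mean of odds, extremized by factor d.

    d = 1.0 gives the plain geometric mean of odds; d > 1 pushes the result
    away from 50%. The name and default of d are illustrative assumptions.
    """
    if not probabilities:
        raise ValueError("need at least one forecast")
    log_odds = [math.log(p / (1 - p)) for p in probabilities]
    pooled_log_odds = d * sum(log_odds) / len(log_odds)
    return 1 / (1 + math.exp(-pooled_log_odds))


forecasts = [0.6, 0.7, 0.9]
print(pool_forecasts(forecasts))         # plain geometric mean of odds
print(pool_forecasts(forecasts, d=2.0))  # extremized, further from 0.5
```

Two forecasts of 20% and 80% pool to exactly 50%, since their odds (1:4 and 4:1) multiply to 1.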

Bono et al. (2017), based on reviewing the abstracts of all articles published between 2010 and 2015 and listed on Web of Science, found (N = 262 abstracts reporting non-normal distributions):

In terms of their frequency of appearance, the most-common non-normal distributions can be ranked in descending order as follows: gamma, negative binomial, multinomial, binomial, lognormal, and exponential.

[I read only the abstract and can't comment on the review's quality.]

Regarding the proportion of Metaculus binary questions resolving “yes”, it is worth noting that this has recently changed. Here is what Anthony Aguirre wrote on the Metaculus discord channel on 10th of July:

I'd like to note a meta trend that things are happening: since March 1, by my rough count 54% of binary questions on Metaculus and 62% of binary questions on Pandemic have resolved positively; this is markedly different from the historical average of ~30%.