

Seems like a question where the answer has to be "it depends".

There are some questions which have a decomposition that helps with estimating them (e.g. Fermi questions like estimating the mass of the Earth), and there are some decompositions that don't help (for one thing, decompositions always stop somewhere, with components that aren't further decomposed).

Research could help add texture to "it depends", sketching out some generalizations about which sorts of decompositions are helpful, but it wouldn't show that decomposition is just generally good or just generally bad or useless.

However, an absolute reduction of cumulative risk by 10^-8 requires (by definition) driving cumulative risk at least below 1-10^-8. Again, you say, that must be easy. Not so. Driving cumulative risk this low requires driving per-century risk to about 1.6*10^-6, barely one in a million.

I'm unclear on what this means. I currently think that humanity has better than a 10^-8 chance of surviving the next billion years, so can I just say that "driving cumulative risk at least below 1-10^-8" is already done? Is the 1.6*10^-6 per-century risk some sort of average of 10 million different per-century numbers (such that my views on the cumulative risk imply that this risk is similarly already below that number), or is this trying to force our thinking into an implausible-to-me model where the per-century risk is the same in every century, or is this talking about the first future century in which risk drops below that level?
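For what it's worth, here's a back-of-the-envelope sketch (my own construction, not necessarily the post's derivation) of the constant-risk arithmetic that could produce a number in that ballpark:

```python
# Assumes a constant, independent per-century risk (which the rest of my
# comment argues against): the per-century risk r that yields a given
# cumulative survival probability over N centuries satisfies
# (1 - r)**N = survival, so r = 1 - survival**(1/N).

def constant_per_century_risk(survival, n_centuries):
    return 1 - survival ** (1 / n_centuries)

# One billion years is about 10 million centuries; survival target 10^-8.
r = constant_per_century_risk(1e-8, 10**7)
print(r)  # ~1.8e-6, the same ballpark as the quoted 1.6*10^-6
```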

On the whole, this decay-rate framing of the problem feels more confusing to me than something like a two-stage framing where there is some short-term risk of extinction (over the next 100 or 1000 years or similar) and then some probability of long-term survival conditional on surviving the first stage.

e.g., Suppose that someone thinks that humanity has a 10^-2 chance (1%) of surviving the next thousand years, and a 10^-4 chance (.01%) of surviving the next billion years conditional on surviving the next thousand years, and that our current actions can only affect the first of those two probabilities. Then increasing humanity's chances of surviving a billion years by 10^-8 (in absolute terms) requires adding 10^-4 to our 10^-2 chance of surviving the next thousand years (an absolute .01% increase), or, equivalently, multiplying our chances of surviving the next thousand years by x1.01 (a 1% relative increase).
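The arithmetic in that example, spelled out (same made-up numbers):

```python
# Two-stage survival model with the example's assumed probabilities.
p_short = 1e-2  # P(survive the next 1000 years)
p_long = 1e-4   # P(survive a billion years | survived the first 1000)

target_gain = 1e-8                  # desired absolute gain in overall survival
delta_short = target_gain / p_long  # required absolute change in p_short
relative = delta_short / p_short    # required relative change in p_short

print(delta_short)  # ~1e-4, i.e. an absolute 0.01% increase
print(relative)     # ~0.01, i.e. a 1% relative increase
```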

Constant per-century risk is implausible because these are conditional probabilities, conditional on surviving up to that century, which means that they're non-independent.

For example, the probability of surviving the 80th century from now is conditioned on having survived the next 79 centuries. And the worlds where human civilization survives the next 79 centuries are mostly not worlds where we face a 10% chance of extinction each century and keep managing to stumble along. Rather, they’re worlds where the per-century probabilities of extinction over the next 79 centuries are generally lower than that, for whatever reason. And worlds where the next 79 per-century extinction probabilities are generally lower than 10% are mostly worlds where the 80th-century extinction probability is also lower than that. So, structurally we should expect extinction probabilities to go down over time, as accumulating survival filters for less extinction-prone worlds.
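A toy simulation (my own construction, with made-up numbers) of this filtering effect: suppose the per-century risk is either 10% or 1%, with even prior odds. Conditioning on surviving 79 centuries leaves mostly low-risk worlds:

```python
import random

random.seed(0)

HIGH, LOW = 0.10, 0.01  # two candidate per-century extinction risks
trials = 100_000
survivors = {HIGH: 0, LOW: 0}

for _ in range(trials):
    risk = HIGH if random.random() < 0.5 else LOW  # 50/50 prior over worlds
    if all(random.random() > risk for _ in range(79)):  # survive 79 centuries
        survivors[risk] += 1

total = survivors[HIGH] + survivors[LOW]
# Expected risk in century 80, conditional on having survived 79 centuries:
risk_80 = (survivors[HIGH] * HIGH + survivors[LOW] * LOW) / total
print(risk_80)  # close to 0.01, far below the 0.055 prior average
```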

Dice-roll-style models which assume independence can be helpful for sketching out some of the broad contours of the problem, but in many cases they don't capture the structure of the situation well enough to be used for quantitative forecasts. This issue comes up in many domains, such as projecting how many games a sports team will win over the upcoming season. If your central estimate is that a team will win 30% of its games, worlds where the team winds up winning more than half of its games are mostly not worlds where the team kept getting lucky game after game. Rather, they're worlds where the team was better than expected, so they generally had more than a 30% chance of winning each game.

A model that doesn't account for this non-independence across games (like just using a binomial distribution based on your central estimate of what fraction of games a team will win) builds in the assumption that the only way to win more games is to have a bunch of 30% events go their way, and implicitly rules out the possibility that the team is better than your central estimate, so it will give inaccurate distributions. For example, 3 of the 30 NBA teams had less than a 1/1000 chance of winning as many games as they did this year according to the distributions you'd get from a simple binomial model using the numbers here.
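To illustrate with assumed numbers (an 82-game season and illustrative probabilities, not the actual data from the linked model): a pure binomial tail for a 30%-strength team winning half its games is tiny, while even a crude two-point mixture over team quality fattens it by orders of magnitude:

```python
import math

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, k = 82, 41  # 82-game season; winning at least half the games
pure = binom_tail(n, k, 0.30)  # fixed 30% win probability every game
# Crude mixture over team quality: 80% chance the central estimate (30%)
# is right, 20% chance the team is actually better (45% per game).
mixed = 0.8 * binom_tail(n, k, 0.30) + 0.2 * binom_tail(n, k, 0.45)
print(pure, mixed)  # the mixture's tail is orders of magnitude larger
```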

Similarly, Sam Wang’s 2016 election forecast gave Trump less than a 1% chance of winning the US presidency because it failed to correctly account for correlated uncertainty. By not sufficiently tracking non-independence, the model accidentally made extremely strong assumptions which led to a very overconfident prediction.

GiveWell has a 2021 post, "Why malnutrition treatment is one of our top research priorities", which includes a rough estimate of "a cost of about $2,000 to $18,000 per death averted" through treating "otherwise untreated episodes of malnutrition in sub-Saharan Africa." You can click through to the footnotes and the spreadsheets for more details on how they calculated that.

Is this just showing that the predictions were inaccurate before updating?

I think it's saying that predictions over the lifetime of the market are less accurate for questions where early forecasters disagreed a lot with later forecasters, compared to questions where early forecasters mostly agreed with later forecasters. Which sounds unsurprising.

That improvement of the Metaculus community prediction seems to be approximately logarithmic, meaning that doubling the number of forecasters seems to lead to a roughly constant (albeit probably diminishing) relative improvement in performance in terms of Brier Score: Going from 100 to 200 would give you a relative improvement in Brier score almost as large as when going from 10 to 20 (e.g. an improvement by x percent).

In some of the graphs it looks like the improvement diminishes more quickly than the logarithm, such that (e.g.) going from 100 to 200 gives a smaller improvement than going from 10 to 20. It seems like maybe you agree, given your "albeit probably diminishing" parenthetical. If so, could you rewrite this summary to better match that conclusion?

Maybe there's some math that you could do that would provide a more precise mathematical description? e.g., With your bootstrapping analysis, is there a limit for the Brier score as the number of hypothetical users increases?

I think the correct adjustment would involve multiplying the effect size by something like 1.1 or 1.2. But figuring out the best way to deal with it should involve looking into this issue in more depth and/or consulting someone with more expertise on this sort of statistical issue.

This sort of adjustment wouldn't change your bottom-line conclusions that this point estimate for deworming is smaller than the point estimate for StrongMinds, and that this estimate for deworming is not statistically significant, but it would shift some of the distributions & probabilities that you discuss (including the probability that StrongMinds has a larger well-being effect than deworming).

A low-reliability outcome measure attenuates the measured effect size. So if researchers measure the effect of one intervention with a high-quality outcome measure, and the effect of another intervention with a lower-quality outcome measure, the use of different measures will inflate the apparent relative impact of the intervention that got the higher-quality measurement. Converting different scales into number of SDs puts them all on the same scale, but doesn't adjust for this measurement issue.

For example, if you have a continuous outcome measure and you dichotomize it by taking a median split (so half get a score of zero and half get a score of one), that will shrink your effect size (number of SDs) to about 80% of what it would've been on the continuous measure. So if you would've gotten an effect size of 0.08 SDs on the continuous measure, you'll find an effect size of 0.064 SDs on this binary measure.

I think that using a three point scale to measure happiness should produce at least as much attenuation as taking a continuous measure and then carving it up into three groups. Here are some sample calculations to estimate how much that attenuates the effect size. I believe the best case scenario is if the responses are trichotomized into three equally sized groups, which would shrink the effect size to about 89% of what it would've been on the continuous measure, e.g. from 0.08 to 0.071. At a glance I don't see descriptive statistics for how many people selected each option on the happy123 measure in this study, so I can't do a calculation that directly corresponds to this study. (I also don't know how you did the measurement for the study of StrongMinds, which would be necessary for comparing them head-to-head.)
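Here's a quick Monte Carlo check (my own sketch, using simulated normal data rather than anything from these studies) of the ~80% and ~89% attenuation factors mentioned above:

```python
import math, random

random.seed(1)

def coarsen(x, cuts):
    """Map a continuous score to the ordinal category it falls into."""
    return sum(x > c for c in cuts)

def observed_effect(cuts, true_d=0.08, n=200_000):
    """Standardized effect after coarsening a true N(0,1) vs N(d,1) contrast."""
    control = [coarsen(random.gauss(0, 1), cuts) for _ in range(n)]
    treated = [coarsen(random.gauss(true_d, 1), cuts) for _ in range(n)]
    pooled = control + treated
    mean = sum(pooled) / len(pooled)
    sd = math.sqrt(sum((v - mean) ** 2 for v in pooled) / len(pooled))
    return (sum(treated) / n - sum(control) / n) / sd

d_binary = observed_effect(cuts=[0.0])               # median split
d_ternary = observed_effect(cuts=[-0.4307, 0.4307])  # three equal groups
print(d_binary, d_ternary)  # roughly 0.064 and 0.071, vs a true 0.08
```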

I don't see why you used a linear regression over time. It seems implausible that the true trend over time would be linear (unless it's flat), and the three data points have enough noise to make the estimated trend extremely noisy.

Intelligence 1: Individual cognitive abilities.

Intelligence 2: The ability to achieve a wide range of goals.

Eliezer Yudkowsky means Intelligence 2 when he talks about general intelligence. e.g., He proposed "efficient cross-domain optimization" as the definition in his post by that name. See the LW tag page for General Intelligence for more links & discussion.
