All of Dan_Keys's Comments + Replies

I've raised related points here, and also here with followup, about how exponential decay with a fixed decay rate is not a good model to use for estimating long-term survival probability.

Does your model without log(GNI per capita) basically just include a proxy for log(GNI per capita), by including other predictor variables that, in combination, are highly predictive of log(GNI per capita)?

With a pool of 1058 potential predictor variables, many of which have some relationship to economic development or material standards of living, it wouldn't be surprising if you could build a model to predict log(GNI per capita) with a very good fit. If that is possible with this pool of variables, and if log(GNI per capita) is linearly predictive of lif... (read more)

1
Alexander Loewi
2mo
I think the first thing to emphasize is that, even when you do include log(GDP/GNI), the measured effect still isn't all that big. It says that you'll get an increase of 1.5 points of satisfaction ... if you almost triple GDP! (I.e. multiply it by 2.7, because that's just mechanically what it means when you transform a linear predictor by the natural log.) Since that's either ludicrous or impossible for many countries, there are plenty of cases where it doesn't even make sense to consider.

My largest problem has nothing to do with the non-linearities of the log -- if it fit better, great! But 1) it just, simply, numerically, doesn't, and 2) the fact that you have to interpret the log in a fundamentally different way than all the non-transformed variables makes it extraordinarily misleading. You get a bigger bump on the graph -- but it's a bigger bump that means something fundamentally different than all of the other bumps (a multiplicative effect, not an additive one). Then when you include it in charts as if it doesn't mean something different, you're floating towards very nasty territory.

It is certainly the case that many of the variables are highly collinear, but there are clearly no obvious close proxies in the list. If I removed log(GDP) but introduced log(trading volume) or something, that would be suspicious -- but you can see all 14 of the variables that are actually in the model. I would have to be approximating log(GDP) with -- water and preschool? The 1,058 variables are searched over, yes -- but then 1,044 of them are rejected, and simply don't enter.

I'm sorry though, I just don't understand your last paragraph. If the true effect needs a log, then the log should account for that effect. And if the effect is properly transformed, I don't understand how a different variable would do a better job of accounting for the variance than the true variable. Happy to discuss if you can clarify though.

It looks like the 3 articles are in the appendix of the dissertation, on pages 65 (fear, Study A), 72 (hope, Study B), and 73 (mixed, Study C).

The effect of health insurance on health, such as the old RAND study, the Oregon Medicaid expansion, the India study from a couple years ago, or whatever else is out there.

Robin Hanson likes to cite these studies as showing that more medicine doesn't improve health, but I'm skeptical of the inference from 'not statistically significant' to 'no effect' (I'm in the comments there as "Unnamed"). I would like to see them re-analyzed based on effect size (e.g. a probability distribution or confidence interval for DALY per $).

I'd guess that this is because an x-risk intervention might have on the order of a 1/100,000 chance of averting extinction. So if you run 150k simulations, you might get 0 or 1 or 2 or 3 simulations in which the intervention does anything. Then there's another part of the model for estimating the value of averting extinction, but you're only taking 0 or 1 or 2 or 3 draws that matter from that part of the model because in the vast majority of the 150k simulations that part of the model is just multiplied by zero.

And if the intervention sometimes increases e... (read more)
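A minimal simulation sketch of the point above about rare events (all numbers here are hypothetical, just to show the mechanics):

```python
import numpy as np

rng = np.random.default_rng(0)

n_sims = 150_000
p_avert = 1e-5  # hypothetical ~1-in-100,000 chance the intervention averts extinction

# Hypothetical stand-in for the separate "value of averting extinction" part of the model.
value_if_averted = rng.lognormal(mean=10, sigma=3, size=n_sims)

averted = rng.random(n_sims) < p_avert
sim_values = np.where(averted, value_if_averted, 0.0)

# With p_avert = 1e-5, only ~1.5 of the 150k simulations are expected to be nonzero,
# so the estimate of the mean rests on a handful of draws from the value distribution.
print("simulations where the intervention mattered:", int(averted.sum()))
print("Monte Carlo estimate of expected value:", sim_values.mean())
```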

2
Vasco Grilo
5mo
That makes sense to me, Dan!

I believe the paper you're referring to is "Water Treatment And Child Mortality: A Meta-Analysis And Cost-effectiveness Analysis" by Kremer, Luby, Maertens, Tan, & Więcek (2023).

The abstract of this version of the paper (which I found online) says:

We estimated a mean cross-study reduction in the odds of all-cause under-5 mortality of about 30% (Peto odds ratio, OR, 0.72; 95% CI 0.55 to 0.92; Bayes OR 0.70; 95% CrI 0.49 to 0.93). The results were qualitatively similar under alternative modeling and data inclusion choices. Taking into account heterogenei

... (read more)
3
NickLaing
6mo
Hey yes, I somehow failed to reference the most important paper I was referring to, my bad! Thanks so much for the in-depth look here. I agree with all of your points. I was debating writing a list of these issues with the study, but decided not to for simplicity and instead just wrote "Kremer looks retrospectively at data not gathered for-purpose, which is in epidemiological speak a little dodgy." And yeah, potential p-hacking and noisiness are aspects of that dodginess.

A couple of small notes:

I think even the 8 percent mortality reduction lower bound wouldn't completely wipe out the question. Clean water reduces diarrhoea by 30 to 50 percent, leaving a highest plausible mortality reduction of about 5 percent (I think Kremer listed it as 4 in the study?), so even at the lower bound of mortality reduction and the higher bound of diarrhoea reduction, there is still a discrepancy.

On publication bias, the kind of big studies they are looking at are likely to get published even with negative results, and their funnel plot looking for the bias looked pretty good.

In general I think a huge RCT (potentially even multi-country) is still needed, which can look at mortality and can also explore potential reasons for the large overall mortality reduction.

Two thoughts on this paper:

  1. Does it make sense to pool the effect of chlorine interventions with filtration interventions, when these are two different types of interventions? I don't think it does, and notably the Cochrane review on this topic that looks at diarrhoea rather than mortality doesn't pool these effects - it doesn't even pool chlorination products and flocculation sachets together, or different types of filtration together - https://www.cochrane.org/CD004794/INFECTN_interventions-improve-water-quality-and-prevent-diarrhoea - it's hard not
... (read more)

More from Existential Risk Observatory (@XRobservatory) on Twitter:

It was a landmark speech by @RishiSunak: the first real recognition of existential risk by a world leader. But even better are the press questions at the end:

@itvnews: "If the risks are as big as you say, shouldn't we at the very least slow down AI development, at least long enough to understand and control the risks?"

@SkyNews: "Is it fair to say we know enough already to call for a moratorium on artificial general intelligence? Would you back a moratorium on AGI?" 

Sky again: "Given th

... (read more)

One way to build risk decay into a model is to assume that the risk is unknown within some range, and to update on survival.

A very simple version of this is to assume an unknown constant per-century extinction risk, and to start with a uniform distribution on the size of that risk. Then the probability of going extinct in the first century is 1/2 (by symmetry), and the probability of going extinct in the second century conditional on surviving the first is smaller than that (since the higher-risk worlds have disproportionately already gone extinct) - with ... (read more)
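A quick numerical check of that arithmetic (a uniform prior over a constant per-century risk; the conditional extinction probability in century n+1 works out to 1/(n+2)):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random(1_000_000)  # uniform prior over the unknown constant per-century risk

def p_survive(n):
    """Probability of surviving n centuries, averaged over the prior (analytically 1/(n+1))."""
    return ((1 - p) ** n).mean()

for n in range(5):
    hazard = 1 - p_survive(n + 1) / p_survive(n)  # P(extinct in century n+1 | survived first n)
    print(f"century {n + 1}: conditional risk ~ {hazard:.3f} (analytic {1 / (n + 2):.3f})")
```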

Why are these expected values finite even in the limit?

It looks like this model is assuming that there is some floor risk level that the risk never drops below, which creates an upper bound for survival probability through n time periods based on exponential decay at that floor risk level. With the time of perils model, there is a large jolt of extinction risk during the time of perils, and then exponential decay of survival probability from there at the rate given by this risk floor.

The Jupyter notebook has this value as r_low=0.0001 per time period. If a... (read more)
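A back-of-the-envelope reason the floor keeps things finite (my own sketch, not the notebook's code): with a constant per-period risk, the number of future periods is geometrically distributed, so a risk floor caps its expectation.

```python
r_low = 0.0001  # per-period risk floor used in the Jupyter notebook

# Expected number of future periods under a constant per-period extinction risk r is 1/r,
# so a floor of r_low caps the expected number of periods (and, with bounded value per
# period, the expected value of the future).
expected_periods_cap = 1 / r_low
print(expected_periods_cap)  # 10000.0
```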

1
MichaelStJules
6mo
Another way to get infinite EV in the time of perils model would be to have a nonzero lower bound on the per period risk rate across a rate sequence, but allow that lower bound to vary randomly and get arbitrarily close to 0 across rate sequences. You can basically get a St Petersburg game, with the right kind of distribution over the long-run lower bound per period risk rate. The outcome would have finite value with probability 1, but still infinite EV. EDIT: To illustrate, if f(r), the expected value of the future conditional on a per period risk rate r in the limit, goes to infinity as r goes to 0, then the expected value of f(r) will be infinite over at least some distributions for r in an interval (0, b], which excludes 0. Furthermore, if you assign any positive credence to subdistributions over the rates together that give infinite conditional EV, then the unconditional expected value will be infinite (or undefined). So, I think you need to be extremely confident (imo, overconfident) to avoid infinite or undefined expected values under risk neutral expectational total utilitarianism.

(Commenting on mobile, so excuse the link formatting.)

See also this comment and thread by Carl Shulman: https://forum.effectivealtruism.org/posts/zLZMsthcqfmv5J6Ev/the-discount-rate-is-not-zero?commentId=Nr35E6sTfn9cPxrwQ

Including his estimate (guess?) of 1 in a million risk per century in the long run:

https://forum.effectivealtruism.org/posts/zLZMsthcqfmv5J6Ev/the-discount-rate-is-not-zero?commentId=GzhapzRs7no3GAGF3

In general, even assigning a low but non-tiny probability to low long run risks can allow huge expected values.

See also Tarsney's The Epistem... (read more)

Thank you very much Dan for your comments and for looking into the ins and outs of the work and highlighting various threads that could improve it.

There are two quite separate issues that you brought up here: first, about infinite value, which can be recovered with new scenarios; and second, about the specific parameter defaults used. The parameters the report used could be reasonable, but might also seem over-optimistic or over-pessimistic, depending on your background views.

I totally agree that we should not anchor on any particular set of parameters, including ... (read more)

4
Linch
6mo
(speaking for myself) The conditional risk point seems like a very interesting crux between people; I've talked both to people who think the point is so obviously true that it's close to trivial and to people who think it's insane (I'm more in the "close to trivial" position myself).

You can get a sense for these sorts of numbers just by looking at a binomial distribution.

e.g., Suppose that there are n events which each independently have a 45% chance of happening, and a noisy/biased/inaccurate forecaster assigns 55% to each of them.

Then the noisy forecaster will look more accurate than an accurate forecaster (who always says 45%) if >50% of the events happen, and you can use the binomial distribution to see how likely that is to happen for different values of n. For example, according to this binomial calculator, with n=51 there is... (read more)
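A sketch of that calculation (assuming independent events and Brier scoring):

```python
from scipy.stats import binom

n, p_true = 51, 0.45

# Per event, the 55% forecaster gets the better Brier score exactly when the event happens,
# so they beat the accurate 45% forecaster overall iff more than half of the n events happen.
k_needed = n // 2 + 1  # 26 of 51
prob_noisy_looks_better = binom.sf(k_needed - 1, n, p_true)  # P(X >= 26) with X ~ Bin(51, 0.45)
print(f"P(noisy forecaster scores better) = {prob_noisy_looks_better:.3f}")
```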

1
nikos
6mo
Good comment, thank you!

Seems like a question where the answer has to be "it depends".

There are some questions which have a decomposition that helps with estimating them (e.g. Fermi questions like estimating the mass of the Earth), and there are some decompositions that don't help (for one thing, decompositions always stop somewhere, with components that aren't further decomposed).

Research could help add texture to "it depends", sketching out some generalizations about which sorts of decompositions are helpful, but it wouldn't show that decomposition is just generally good or just generally bad or useless.

However, an absolute reduction of cumulative risk by 10^-8 requires (by definition) driving cumulative risk at least below 1-10^-8. Again, you say, that must be easy. Not so. Driving cumulative risk this low requires driving per-century risk to about 1.6*10^-6, barely one in a million.

I'm unclear on what this means. I currently think that humanity has better than a 10^-8 chance of surviving the next billion years, so can I just say that "driving cumulative risk at least below 1-10^-8" is already done? Is the 1.6*10^-6 per-century risk some sort of average of 10 ... (read more)
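For reference, a rough version of the relationship between the two quantities (the billion-year horizon of ~10^7 centuries is my assumption here; the paper's exact horizon may differ, which would shift the figure somewhat):

```python
def per_century_risk(cumulative_risk, n_centuries):
    """Constant per-century risk r satisfying 1 - (1 - r)**n_centuries == cumulative_risk."""
    return 1 - (1 - cumulative_risk) ** (1 / n_centuries)

# Keeping cumulative extinction risk below 1 - 1e-8 over ~1e7 centuries (about a billion years)
print(per_century_risk(1 - 1e-8, 1e7))  # on the order of 1e-6 per century
```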

8
David Thorstad
10mo
Thanks Dan! As mentioned, to think that cumulative risk is below 1-(10^-8) is to make a fairly strong claim about per-century risk. If you think we're already there, that's great! Bostrom was actually considering something slightly stronger: the prospect of reducing cumulative risk by a further 10^(-8) from wherever it is at currently. That's going to be hard even if you think that cumulative risk is already lower than I do.

So for example, you can ask what changes you'd have to make to per-century risk to drop cumulative risk from r to r-(10^-8) for any r in [0,1). Honestly, that's a more general and interesting way to do the math here. The only reason I didn't do this is that (a) it's slightly harder, (b) most academic readers will already find per-century risk of ~one-in-a-million relatively implausible, and (c) my general aim was to illustrate the importance of carefully distinguishing between per-century risk and cumulative risk.

It might be a good idea, in rough terms, to think of a constant hazard rate as an average across all centuries. I suspect that if the variance of risk across centuries is low-ish, this is a good idea, whereas if the variance of risk across centuries is high-ish, it's a bad idea. In particular, on a time of perils view, focusing on average (mean) risk rather than explicit distributions of risk across centuries will strongly over-value the future, since a future in which much of the risk is faced early on is lower-value than a future in which risk is spread out.

Strong declining trends in hazard rates induce a time-of-perils-like structure, except that on some models they might make a bit weaker assumptions about risk than leading time of perils models do. At least one leading time of perils model (Aschenbrenner) has a declining hazard structure. In general, the question will be how to justify a declining hazard rate, given a standard story on which (a) technology drives risk, and (b) technology is increasing rapidly. I think tha

Constant per-century risk is implausible because these are conditional probabilities, conditional on surviving up to that century, which means that they're non-independent.

For example, the probability of surviving the 80th century from now is conditioned on having survived the next 79 centuries. And the worlds where human civilization survives the next 79 centuries are mostly not worlds where we face a 10% chance of extinction risk each century and keep managing to stumble along. Rather, they’re worlds where the per-century probabilities of extinction over... (read more)
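A toy version of that survivorship effect, with just two kinds of worlds and made-up numbers:

```python
# Prior: 50/50 between a "risky" world (10% extinction risk per century)
# and a "safe" world (1% per century).
p_risky, p_safe = 0.10, 0.01
prior_risky = prior_safe = 0.5

centuries = 79
like_risky = (1 - p_risky) ** centuries  # ~2.4e-4
like_safe = (1 - p_safe) ** centuries    # ~0.45

posterior_risky = (prior_risky * like_risky) / (prior_risky * like_risky + prior_safe * like_safe)
print(f"P(risky world | survived {centuries} centuries) ~ {posterior_risky:.4f}")
```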

GiveWell has a 2021 post Why malnutrition treatment is one of our top research priorities, which includes a rough estimate of "a cost of about $2,000 to $18,000 per death averted" through treating "otherwise untreated episodes of malnutrition in sub-Saharan Africa." You can click through to the footnotes and the spreadsheets for more details on how they calculated that.

Is this just showing that the predictions were inaccurate before updating?

I think it's saying that predictions over the lifetime of the market are less accurate for questions where early forecasters disagreed a lot with later forecasters, compared to questions where early forecasters mostly agreed with later forecasters. Which sounds unsurprising.

6
Vasco Grilo
1y
Hi Dan, I think that can be part of it. Just a note, I calculated the belief movement only for the 2nd half of the question lifetime to minimise the effect of inaccurate earlier predictions. Another plausible explanation to me is that questions with greater updating are harder.

That improvement of the Metaculus community prediction seems to be approximately logarithmic, meaning that doubling the number of forecasters seems to lead to a roughly constant (albeit probably diminishing) relative improvement in performance in terms of Brier Score: Going from 100 to 200 would give you a relative improvement in Brier score almost as large as when going from 10 to 20 (e.g. an improvement by x percent).

In some of the graphs it looks like the improvement diminishes more quickly than the logarithm, such that (e.g.) going from 100 to 200 give... (read more)

I think the correct adjustment would involve multiplying the effect size by something like 1.1 or 1.2. But figuring out the best way to deal with it should involve some combination of looking into this issue in more depth and/or consulting with someone with more expertise on this sort of statistical issue.

This sort of adjustment wouldn't change your bottom-line conclusions that this point estimate for deworming is smaller than the point estimate for StrongMinds, and that this estimate for deworming is not statistically significant, but it would shift some of the distributions & probabilities that you discuss (including the probability that StrongMinds has a larger well-being effect than deworming).

A low reliability outcome measure attenuates the measured effect size. So if researchers measure the effect of one intervention on a high-quality outcome measure, and they measure the effect of another intervention on a lower-quality outcome measure, the use of different measures will inflate the apparent relative impact of the intervention that got higher-quality measurement. Converting different scales into number of SDs puts them all on the same scale, but doesn't adjust for this measurement issue.

For example, if you have a continuous outcome measure an... (read more)
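For concreteness, the standard disattenuation correction looks like this (the reliability value below is a placeholder, not taken from either analysis):

```python
import math

def disattenuate(observed_effect_sd, outcome_reliability):
    """Classical correction for attenuation: d_true ~ d_observed / sqrt(outcome reliability)."""
    return observed_effect_sd / math.sqrt(outcome_reliability)

# Placeholder numbers: an observed effect of 0.10 SD on a measure with reliability 0.70
print(disattenuate(0.10, 0.70))  # ~0.12 SD, i.e. roughly a 1.2x multiplier
```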

9
JoelMcGuire
1y
Hi Dan, This is an interesting topic, but one we'd need more time to look into properly; we would like to do so when we have more time.

We agree that the 3-point measure is not optimal. However, we think our general conclusion still holds when we examine the effect using other measures of subjective wellbeing in the data (including a 1-10 scale and some 1-6 frequency scales). None of the other measures are significant, and we get a similar result (see Appendix A3.1).

Are you suggesting that this (1-.89 = .11) 11% shrinkage would justify increasing the cost-effectiveness of deworming by 11%? If so, even such an adjustment applied to our 'optimistic' model (see Appendix A1) would not change our conclusion that deworming is not more cost-effective than StrongMinds (and even if it did, it wouldn't change the larger problem that the evidence here is still very weak and noisy).

The StrongMinds analysis is based on a meta-analysis of psychotherapy in LMICs combined with some studies relevant to the StrongMinds method. This includes a lot of different types of measures with varying scale lengths.

I don't see why you used a linear regression over time. It seems implausible that the trend over time would be (non-flat) linear, and the three data points have enough noise to make the estimate of the trend extremely noisy. 

2
Samuel Dupret
1y
Hi Dan, Our main conclusion is that these data don't demonstrate there is an effect of deworming, as the point estimates are all non-significant (see further discussion in Section 2.3). We conducted the cost-benefit analysis as an exercise to see what the effects look like. We took the trend in the data at face value because the existing literature is so mixed and doesn't provide a strong prior.

Intelligence 1: Individual cognitive abilities.

Intelligence 2: The ability to achieve a wide range of goals.

Eliezer Yudkowsky means Intelligence 2 when he talks about general intelligence. e.g., He proposed "efficient cross-domain optimization" as the definition in his post by that name. See the LW tag page for General Intelligence for more links & discussion.

3
Magnus Vinding
1y
I do not claim otherwise in the post :) My claim is rather that proponents of Model 1 tend to see a much smaller distance between these respective definitions of intelligence, almost seeing Intelligence 1 as equivalent to Intelligence 2. In contrast, proponents of Model 2 see Intelligence 1 as an important yet still, in the bigger picture, relatively modest subset of Intelligence 2, alongside a vast set of other tools.
6
David Johnston
1y
Eliezer’s threat model is “a single superintelligent algorithm with at least a little bit of ability to influence the world”. In this sentence, the word “superintelligent” cannot mean intelligence in the sense of definition 2, or else it is nonsense - definition 2 precludes “small or no ability to influence the world”. Furthermore, in recent writing Eliezer has emphasised threat models that mostly leverage cognitive abilities (“intelligence 1”), such as a superintelligence that manipulates someone into building a nano factory using existing technology. Such scenarios illustrate that intelligence 2 is not necessary for AI to be risky, and I think Eliezer deliberately chose these scenarios to make just that point. One slightly awkward way to square this with the second definition you link is to say that Yudkowsky uses definition 2 to measure intelligence, but is also very confident that high cognitive abilities are sufficient for high intelligence and therefore doesn’t always see a need to draw a clear distinction between the two.

The model assumes gradually diminishing returns to spending within the next year, but the intuitions behind your third voice think that much higher spending would involve marginal returns that are a lot smaller OR ~zero OR negative?

2
kokotajlod
1y
Huh, now that you mention it, I think the third voice thinks that much higher spending would be negative, not just a lot smaller or zero. So maybe that's what's going on: The third voice intuits that there  are backfire risks along the lines of "EA gets a reputation for being ridiculously profligate" that the model doesn't model? Maybe another thing that's going on is that maybe we literally are funding all the opportunities that seem all-things-net-positive to us. The model assumes an infinite supply of opportunities, of diminishing quality, but in fact maybe there are literally only finitely many and we've exhausted them all.

Could you post something closer to the raw survey data, in addition to the analysis spreadsheet linked in the summary section? I'd like to see something that:

  • Has data organized by respondent  (a row of data for each respondent)
  • Shows the number given by the respondent, before researcher adjustments (e.g., answers of 0 are shown as "0" and not as ".01") (it's fine for it to show the numbers that you get after data cleaning which turns "50%" and "50" into "0.5")
  • Includes each person's 6 component estimates, along with a few other variables like their dire
... (read more)
1
Froolow
2y
Yes I will do, although some respondents asked to remain anonymous / not have their data publicly accessible and so I need to make some slight alterations before I share. I'd guess a couple of weeks for this

The numbers that you get from this sort of exercise will depend heavily on which people you get estimates from. My guess is that which people you include matters more than what you do with the numbers that they give you.

If the people who you survey are more like the general public, rather than people around our subcultural niche where misaligned AI is a prominent concern, then I expect you'll get smaller numbers.

Whereas, in Rob Bensinger's 2021 survey of "people working on long-term AI risk", every one of the 44 people who answered the survey gave an estim... (read more)

7
Froolow
2y
I completely agree that the survey demographic will make a big difference to the headline results figure. Since I surveyed people interested in existential risk (Astral Codex Ten, LessWrong, EA Forum) I would expect the results to bias upwards though. (Almost) every participant in my survey agreed the headline risk was greater than the 1.6% figure from this essay, and generally my results line up with the Bensinger survey.

However, this is structurally similar to the state of Fermi Paradox estimates prior to SDO 'dissolving' this - that is, almost everyone working on the Drake Equation put the probable number of alien civilisations in this universe very high, because they missed the extremely subtle statistical point about uncertainty analysis which SDO spotted, and which I have replicated in this essay.

In my opinion, Section 4.3 indicates that as long as you have any order-of-magnitude uncertainty you will likely get an asymmetric distribution of risk, and so in that sense I disagree that the mechanism depends on who you ask. The mechanism is the key part of the essay; the headline number is just one particular way to view that mechanism.

If the estimates for the different components were independent, then wouldn't the distribution of synthetic estimates be the same as the distribution of individual people's estimates?

Multiplying Alice's p1 x Bob's p2 x Carol's p3 x ... would draw from the same distribution as multiplying Alice's p1 x Alice's p2 x Alice's p3 ... , if estimates to the different questions are unrelated.

So you could see how much non-independence affects the bottom-line results just by comparing the synthetic distribution with the distribution of individual estimates (treating ... (read more)
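A toy simulation of that check (a fabricated correlation structure, not the survey data): when a person's six component estimates share a common offset, the within-person products and the synthetic products have roughly the same geometric mean but very different spreads and arithmetic means.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_components = 100, 6

# Hypothetical log-estimates with a shared per-person offset, which induces
# correlation between a given person's answers to the six components.
offset = rng.normal(0, 0.5, size=(n_people, 1))
logs = -2.0 + offset + rng.normal(0, 0.5, size=(n_people, n_components))

within = np.exp(logs.sum(axis=1))  # Alice p1 x Alice p2 x ... x Alice p6

idx = rng.integers(n_people, size=(100_000, n_components))
synthetic = np.exp(logs[idx, np.arange(n_components)].sum(axis=1))  # mix people across components

for name, x in [("within-person", within), ("synthetic", synthetic)]:
    print(f"{name:13s} geomean={np.exp(np.log(x).mean()):.2e}  mean={x.mean():.2e}")
```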

3
Froolow
2y
In practice these numbers wouldn't perfectly match even if there were no correlation, because there is some missing survey data that the SDO method ignores (naturally, you can't sample data that doesn't exist). In principle I don't see why we shouldn't use this as a good rule-of-thumb check for unacceptable correlation.

The synth distribution gives a geomean of 1.6% and a simple mean of around 9.6%, as per the essay. The distribution of all survey responses multiplied together (as per Alice p1 x Alice p2 x Alice p3) gives a geomean of approx 2.3% and a simple mean of approx 17.3%.

I'd suggest that this implies the SDO method's weakness to correlated results is potentially depressing the actual result by about 50%, give or take. I don't think that's either obviously small enough not to matter or obviously large enough to invalidate the whole approach, although my instinct is that when talking about order-of-magnitude uncertainty, a 50% error in the point estimate would not be a showstopper.

Does the table in section 3.2 take the geometric mean for each of the 6 components?

From footnote 7 it looks like it does, but if it does then I don't see how this gives such a different bottom line probability from the synthetic method geomean in section 4 (18.7% vs. 1.65% for all respondents). Unless some probabilities are very close to 1, and those have a big influence on the numbers in the section 3.2 table? Or my intuitions about these methods are just off.

1
Froolow
2y
That's correct - the table gives the geometric mean of odds for each individual line, but then the final line is a simple product of the preceding lines rather than the geometric mean of each individual final estimate. This is a tiny bit naughty of me, because it means I've changed my method of calculation halfway through the table - the reason I do this is because it is implicitly what everyone else has been doing up until now (e.g. it is what is done in Carlsmith 2021), and I want to highlight the discrepancy this leads to.

Have you looked at how sensitive this analysis is to outliers, or to (say) the most extreme 10% of responses on each component?

The recent Samotsvety nuclear risk estimate removed the largest and smallest forecast (out of 7) for each component before aggregating (the remaining 5 forecasts) with the geometric mean. Would a similar adjustment here change the bottom line much (for the single probability and/or the distribution over "worlds")?

The prima facie case for worrying about outliers actually seems significantly stronger for this survey than for an org l... (read more)
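A minimal version of that aggregation rule (my own sketch of the procedure described above, not Samotsvety's code):

```python
import numpy as np

def trimmed_geomean(forecasts):
    """Drop the single largest and smallest forecast, then take the geometric mean of the rest."""
    x = np.sort(np.asarray(forecasts, dtype=float))[1:-1]
    return float(np.exp(np.log(x).mean()))

# e.g. seven hypothetical component forecasts
print(trimmed_geomean([0.001, 0.01, 0.02, 0.05, 0.05, 0.1, 0.6]))
```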

I had not thought to do that, and it seems quite sensible (I agree with your point about prima facie worry about low outliers). The results are below.

To my eye, the general mechanism I wanted to defend is preserved (there is an asymmetric probability of finding yourself in a low-risk world), but the probability of finding yourself in an ultra-low-risk world has dropped significantly, with that probability mass roughly redistributing itself around the geometric mean (which itself has gone up to 7%-ish).

In some sense this isn't totally surprising - remo... (read more)

A passage from Superforecasting:

Flash back to early 2012. How likely is the Assad regime to fall? Arguments against a fall include (1) the regime has well-armed core supporters; (2) it has powerful regional allies. Arguments in favor of a fall include (1) the Syrian army is suffering massive defections; (2) the rebels have some momentum, with fighting reaching the capital. Suppose you weight the strength of t

... (read more)

Two empirical reasons not to take the extreme scope neglect in studies like the 2,000 vs 200,000 birds one as directly reflecting people's values.

First, the results of studies like this depend on how you ask the question. A simple variation which generally leads to more scope sensitivity is to present the two options side by side, so that the same people would be asked both about 2,000 birds and about the 200,000 birds (some call this "joint evaluation" in contrast to "separate evaluation"). Other variations also generally produce more scope sensitive resu... (read more)

2
Dan_Keys
2y
A passage from Superforecasting: Note: in the other examples studied by Mellers & colleagues (2015), regular forecasters were less sensitive to scope than they should've been, but they were not completely insensitive to scope, so the Assad example here (40% vs. 41%) is unusually extreme.

It would be interesting to know whether the forecasters with outlier numbers stand by those forecasts on reflection, and to hear their reasoning if so. In cases where outlier forecasts reflect insight, how do we capture that insight rather than brushing them aside with the noise? Checking in with those forecasters after their forecasts have been flagged as suspicious-to-others is a start.

The p(month|year) number is especially relevant, since that is not just an input into the bottom line estimate, but also has direct implications for individual planning. The plan ... (read more)

These numbers seem pretty all-over-the-place. On nearly every question, the odds given by the 7 forecasters span at least 2 orders of magnitude, and often substantially more. And the majority of forecasters (4/7) gave multiple answers which seem implausible (details below) in ways that suggest that their numbers aren't coming from a coherent picture of the situation.

I have collected the numbers in a spreadsheet and highlighted (in red) the ones that seem implausible to me.

Odds span at least 2 orders of magnitude:

Another commenter noted that the answers to ... (read more)

3
Misha_Yagudin
2y
Hey Dan, thanks for sanity-checking! I think you and feruell are correct to be suspicious of these estimates, we laid out reasoning and probabilities for people to adjust to their taste/confidence.

  • I agree outliers are concerning (and find some of them implausible), but I likewise have an experience of being at 10..20% when a crowd was at ~0% (for a national election resulting in a tie) and at 20..30% when a crowd was at ~0% (for a SCOTUS case) [likewise for me being ~1% while the crowd was much higher; I also on occasion was wrong updating x20 as a result, not sure if peers foresaw Biden-Putin summit but I was particularly wrong there].
  • I think the risk is front-loaded, and low month-to-year ratios are suspicious, but I don't find them that implausible (e.g., one might expect everyone to get on a negotiation table/emergency calls after nukes are used and for the battlefield to be "frozen/shocked" – so while there would be more uncertainty early on, there would be more effort and reasons not to escalate/use more nukes at least for a short while – these two might roughly offset each other).
  • Yeah, it was my prediction that conjunction vs. direct wouldn't match for people (really hard to have a good "sense" of such low probabilities if you are not doing a decomposition). I think we should have checked these beforehand and discussed them with folks.
5
NunoSempere
2y
Hey, thanks for the analysis; we might do something like that next time to improve the consistency of our estimates, either as a team or as individuals. Note that some of the issues you point out are the cost of speed, of working a bit in the style of an emergency response team rather than delaying a forecast for longer.

Still, I think that I'm more chill and less worried than you about these issues, because as you say the aggregation method picked this up, and it doesn't take the geometric mean of the forecasts that you colored in red, given that it excludes the minimum and maximum. I also appreciated the individual comparison between chained probabilities and directly elicited ones, and it makes me even more pessimistic about using the directly elicited ones, particularly for <1% probabilities.

The Less Wrong posts Politics as Charity from 2010 and Voting is like donating thousands of dollars to charity from November 2012 have similar analyses to the 2020 80k article.

Agreed that there are some contexts where there's more value in getting distributions, like with the Fermi paradox.

Or, before the grants are given out, you could ask people to give an ex ante distribution for "what will be your ex post point estimate of the value of this grant?" That feeds directly into VOI calculations, and it is clearly defined what the distribution represents. But note that it requires focusing on point estimates ex post.

9
NunoSempere
2y
> Or, before the grants are given out, you could ask people to give an ex ante distribution for "what will be your ex post point estimate of the value of this grant?" That feeds directly into VOI calculations, and it is clearly defined what the distribution represents. But note that it requires focusing on point estimates ex post.

Aha, but you can also do this when the final answer is also a distribution. In particular, you can look at the KL-divergence between the initial distribution and the answer, and this is also a proper scoring rule.

I think it would've been better to just elicit point estimates of the grants' expected value, rather than distributions. Using distributions adds complexity, for not much benefit, and it's somewhat unclear what the distributions even represent.

Added complexity: for researchers giving their elicitations, for the data analysis, for readers trying to interpret the results. This can make the process slower, lead to errors, and lead to different people interpreting things differently. e.g., For including both positive & negative numbers in the distributions... (read more)

2
NunoSempere
2y
Yeah, you can use a mixture distribution if you are thinking about the distribution of impact, like so, or you can take the mean of that mixture if you want to estimate the expected value, like so. Depends on what you are after.
2
NunoSempere
2y
My intuitions point the other way with regards to point estimates vs distributions. Distributions seem like the correct format here, and they could allow for value of information calculations, sensitivity analyses, highlighting disagreements which people wouldn't notice with point estimates, and better combination of estimates. The bottom line could also change when using estimates, e.g., as in here. That said, they do have a learning curve and I agree with you that they add additional complexity/upfront cost.

In the table with post-discussion distributions, how is the lower bound of the aggregate distribution for the Open Phil AI Fellowship -73, when the lowest lower bound for an individual researcher is -2.4? Also in that row, Researcher 3's distribution is given as "250 to 320", which doesn't include their median (35) and is too large for a scale that's normalized to 100.

2
NunoSempere
2y
Hey, thanks. Should have been -250, updated. This also explains the -73.

I haven't seen a rigorous analysis of this, but I like looking at the slope, and I expect that it's best to include each resolved prediction as a separate data point. So there would be 743 data points, each with a y value of either 0 or 1.

There are several different sorts of systematic errors that you could look for in this kind of data, although checking for them requires including more features of each prediction than the ones that are here.

For example, to check for optimism bias you'd want to code whether each prediction is of the form "good thing will happen", "bad thing will happen", or neither. Then you can check if probabilities were too high for "good thing will happen" predictions and too low for "bad thing will happen" predictions. (Most of the example predictions were "good thing... (read more)
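A sketch of what that check could look like, with a hypothetical schema (column names and values invented for illustration):

```python
import pandas as pd

# Hypothetical schema: one row per resolved prediction, with the stated probability,
# the 0/1 resolution, and a hand-coded framing ("good", "bad", or "neither").
df = pd.DataFrame({
    "probability": [0.8, 0.6, 0.9, 0.3, 0.7],
    "resolved":    [1,   0,   1,   0,   1],
    "framing":     ["good", "good", "good", "bad", "neither"],
})

# Optimism bias would show up as mean probability above the base rate for "good" predictions
# and below the base rate for "bad" predictions.
summary = df.groupby("framing").agg(mean_probability=("probability", "mean"),
                                    base_rate=("resolved", "mean"),
                                    n=("resolved", "size"))
print(summary)
```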

2
Javier Prieto
2y
We do track whether predictions have a positive ("good thing will happen") or negative ("bad thing will happen") framing, so testing for optimism/pessimism bias is definitely possible. However, only 2% of predictions have a negative framing, so our sample size is too low to say anything conclusive about this yet. Enriching our database with base rates and categories would be fantastic, but my hunch is that given the nature and phrasing of our questions this would be impossible to do at scale. I'm much more bullish on per-predictor analyses and that's more or less what we're doing with the individual dashboards.

Pardon my negativity, but I get the impression that you haven't thought through your impact model very carefully.

In particular, the structure where

Every week, an anonymous team of grantmakers rank all participants, and whoever accomplished the least morally impactful work that week will be kicked off the island. 

is selecting for mediocrity.

Given fat tails, I expect more impact to come from the single highest impact week than from 36 weeks of not-last-place impact.

Perhaps for the season finale you could bring back the contestant who had the highest imp... (read more)

3
Yonatan Cale
2y
Seems like it's more important to encourage discussions-about-impact than it is to encourage impact directly. But I'm not sure, I don't even own a TV.

How much overlap is there between this book & Singer's forthcoming What We Owe The Past?

I got 13/13.

q11 (endangered species) was basically a guess. I thought that an extreme answer was more likely given how the quiz was set up to be counterintuitive/surprising. Also relevant: my sense is that we've done pretty well at protecting charismatic megafauna; the fact that I've heard about a particular species being at risk doesn't provide much information either way about whether things have gotten worse for it (me hearing about it is related to things being bad for it, and it's also related to successful efforts to protect it).

On q6 (age distributi... (read more)

For example: If there are diminishing returns to campaign spending, then taking equal amounts of money away from both campaigns would help the side which has more money.
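A toy illustration, assuming for concreteness that persuasion scales with the square root of spending:

```python
import math

def persuasion(spend_millions):
    # Any concave (diminishing-returns) function makes the point; square root is a stand-in.
    return math.sqrt(spend_millions)

edge_before = persuasion(100) - persuasion(50)  # richer campaign's edge before
edge_after = persuasion(75) - persuasion(25)    # edge after removing $25M from each side
print(f"edge before: {edge_before:.2f}, edge after: {edge_after:.2f}")  # the edge grows
```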

If humanity goes extinct this century, that drastically reduces the likelihood that there are humans in our solar system 1000 years from now. So at least in some cases, looking at the effects 1000+ years in the future is pretty straightforward (conditional on the effects over the coming decades).

In order to act for the benefit of the far future (1000+ years away), you don't need to be able to track the far future effects of every possible action. You just need to find at least one course of action whose far future effects are sufficiently predictable to guide you (and good in expectation).

2
Michael_Wiebe
4y
The initial claim is that for any action, we can assess its normative status by looking at its long-run effects. This is a much stronger claim than yours.

The initial post by Eliezer on security mindset explicitly cites Bruce Schneier as the source of the term, and quotes extensively from this piece by Schneier.

In most of his piece, by “aiming to be mediocre”, Schwitzgebel means that people’s behavior regresses to the actual moral middle of a reference class, even though they believe the moral middle is even lower.

This skirts close to a tautology. People's average moral behavior equals people's average moral behavior. The output that people's moral processes actually produce is the observed distribution of moral behavior.

The "aiming" part of Schwitzgebel's hypothesis that people aim for moral mediocrity gives it empirical content. It gets harder to pick out the empirical content when interpreting aim in the objective sense.

1
JacobS
4y
This is fair. I was trying to salvage his argument without running into the problems mentioned in the above comment, but if he means "aim" objectively, then it's tautologically true that people aim to be morally average, and if he means "aim" subjectively, then it contradicts the claim that most people subjectively aim to be slightly above average (which is what he seems to say in the B+ section). The options are: (1) his central claim is uninteresting, (2) his central claim is wrong, (3) I'm misunderstanding his central claim. And I normally would feel like I should play it safe and default to (3), but it's probably (2).

Unless a study is done with participants who are selected heavily for numeracy and fluency in probabilities, I would not interpret stated probabilities literally as a numerical representation of their beliefs, especially near the extremes of the scale. People are giving an answer that vaguely feels like it matches the degree of unlikeliness that they feel, but they don't have that clear a sense of what (e.g.) a probability of 1/100 means. That's why studies can get such drastically different answers depending on the response format, and why (I predict) eff... (read more)

That can be tested on these data, just by looking at the first of the 3 questions that each participant got, since the post says that "Participants were asked about the likelihood of humans going extinct in 50, 100, and 500 years (presented in a random order)."

I expect that there was a fair amount of scope insensitivity. e.g., That people who got the "probability of extinction within 50 years" question first gave larger answers to the other questions than people who got the "probability of extinction within 500 years" question first.
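A sketch of that test, using a hypothetical layout for the survey data (column names and values invented):

```python
import pandas as pd

# Hypothetical schema: one row per participant, the horizon of the question they saw first,
# and their stated extinction probability for the 500-year horizon.
df = pd.DataFrame({
    "first_horizon": [50, 500, 100, 50, 500, 100],
    "p_extinct_500": [0.30, 0.05, 0.20, 0.25, 0.10, 0.15],
})

# Anchoring on the first question would show up as answers to the same horizon
# differing systematically by which question was presented first.
print(df.groupby("first_horizon")["p_extinct_500"].mean())
```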

1
RandomEA
6y
Do you know if this platform allows participants to go back? (I assumed it did, which is why I thought a separate study would be necessary.)

I agree that asking about 2016 donations in early 2017 is an improvement for this. If future surveys are just going to ask about one year of donations then that's pretty much all you can do with the timing of the survey.

In the meantime, it is pretty easy to filter the data accordingly -- if you look only at donations made by EAs who stated that they joined on 2014 or before, the median donation is $1280.20 for 2015 and $1500 for 2016.

This seems like a better way to do the analyses. I think that the post would be more informative & easier to inter... (read more)

2
Peter Wildeford
7y
The median 2016 reported donation total of people who joined on 2015 or before was $655. We'll talk amongst the team about if we want to update the post or not. Thanks!

It is also worth noting that the survey was asking people who identify as EA in 2017 how much they donated in 2015 and 2016. These people weren't necessarily EAs in 2015 or 2016.

Looking at the raw data of when respondents said that they first became involved in EA, I'm getting that:

7% became EAs in 2017
28% became EAs in 2016
24% became EAs in 2015
41% became EAs in 2014 or earlier

(assuming that everyone who took the "Donations Only" survey became an EA before 2015, and leaving out everyone else who didn't answer the question about when they bec... (read more)

2
Peter Wildeford
7y
You're right there's a long lag time between asking about donations and the time of the donations... for the most part this is unavoidable, though we're hoping to time the survey much better in the future (asking only about one year of donations and asking just a month or two after the year is over). This will come with better organization in our team. In the meantime, it is pretty easy to filter the data accordingly -- if you look only at donations made by EAs who stated that they joined on 2014 or before, the median donation is $1280.20 for 2015 and $1500 for 2016.

This year, a “Donations Only” version of the survey was created for respondents who had filled out the survey in prior years. This version was shorter and could be linked to responses from prior years if the respondent provided the same email address each year.

Are these data from prior surveys included in the raw data file, for people who did the Donations Only version this year? At the bottom of the raw data file I see a bunch of entries which appear not to have any data besides income & donations - my guess is that those are either all the people who took the Donations Only version, or maybe just the ones who didn't provide an email address that could link their responses.

1
Peter Wildeford
7y
All the raw data for all the surveys for all the years are here: https://github.com/peterhurford/ea-data/tree/master/data. We have not yet published a longitudinal linked dataset for 2017, but will publish that soon. Those are the people who took the Donations Only version.

https://delib.zendesk.com/hc/en-us/articles/205061169-Creating-footnotes-HTML-anchors

It might be possible to fix in a not-too-tedious way, by using find-replace in the source code to edit all of the broken links (and anchors?) at once.

It appears that this analysis did not account for when people became EAs. It looked at donations in 2014, among people who in November 2015 were nonstudent EAs on an earning to give path. But less than half of those people were nonstudent EAs on an earning to give path at the start of 2014.

In fact, less than half of the people who took the Nov 2015 survey were EAs at the start of 2014. I've taken a look at the dataset, and among the 1171 EAs who answered the question about 2014 donations:
40% first got involved in EA in 2013 or earlier
21% first got involved... (read more)

0
Ajeya
7y
Thanks Dan! I didn't know this, I'll look more closely at the data when I get the chance.