Vasco Grilo

Working (0-5 years experience)
1720 · Lisbon, Portugal · Joined Jul 2020

Participation
3

  • Completed the Precipice Reading Group
  • Completed the In-Depth EA Virtual Program
  • Attended more than three meetings with a local EA group

Comments
427

Hi Wil,

Thanks for sharing your thoughts. I am slightly confused that your comment is overall downvoted (-8 total karma now). I upvoted it, but disagreed.

Nice post, Stan!

Point estimates are fine for multiplication, lossy for division

I think one caveat here is that, if we want to obtain an expected value as output, the input point estimates should refer to the mean instead of the median. These are the same or similar for non-heavy-tailed distributions (like uniform or normal), but can differ a lot for heavy-tailed ones (like exponential or lognormal). When collapsing a lognormal to a point estimate, I think people often use the geometric mean between 2 percentiles (e.g. the 5th and 95th percentiles), which corresponds to the median, not the mean. Using the median in this case will underestimate the expected value, because the mean equals (see here):

  • E(X) = Median(X)*e^(sigma^2/2), where sigma^2 is the variance of log(X).
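As a quick sanity check of this relationship, here is a minimal sketch with hypothetical percentiles, assuming a lognormal parameterised by its 5th and 95th percentiles:

```python
import numpy as np

# Hypothetical 5th and 95th percentiles of a lognormally distributed quantity.
p5, p95 = 1.0, 100.0

# log(X) is normal with mean mu and standard deviation sigma.
z95 = 1.645                                     # standard normal 95th percentile
mu = (np.log(p5) + np.log(p95)) / 2             # log of the median (geometric mean of p5 and p95)
sigma = (np.log(p95) - np.log(p5)) / (2 * z95)

median = np.exp(mu)                  # 10, the geometric mean of the percentiles
mean = np.exp(mu + sigma**2 / 2)     # ~26.6, i.e. Median(X)*e^(sigma^2/2)
print(median, mean)
```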

Here you mention that this "lognormal mean" can lead to extreme results, but I think that is a feature as long as we think the lognormal is modelling the right tail correctly. If we do not think so, we can still use the mean of:

  • Truncated lognormal distribution.
  • Minimum between a lognormal distribution and a maximum value (after which we think the lognormal no longer models the right tail well).
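As a minimal sketch of the second option (the lognormal parameters and the cap are arbitrary assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = np.log(10), 1.4   # illustrative lognormal with median 10
cap = 1000.0                  # value beyond which we no longer trust the right tail

samples = rng.lognormal(mu, sigma, size=1_000_000)

mean_raw = samples.mean()                      # plain lognormal mean
mean_capped = np.minimum(samples, cap).mean()  # mean of min(lognormal, cap)
print(mean_raw, mean_capped)
```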

Interval estimates are prone to personal bias. It’s easy to create an interval estimate intuitively. When objectivity is important and the evidence base is sparse, point estimates are easier to form and are more transparent.

In my mind:

  • Being objective is about faithfully representing the information we have about reality, even if that means being more uncertain.
  • The evidence base being sparse suggests we are uncertain about what reality actually looks like, which means a faithful representation of it will more easily be achieved by intervals, not point estimates. For example, I think using interval estimates in the Drake equation is much more important than in the cost-effectiveness analyses of GiveWell's top charities.
  • One compromise to achieve transparency while maintaining the benefits of interval estimates is using pessimistic, realistic and optimistic point estimates. On the one hand, this may result in wider intervals, because the product of two 5th percentiles is rarer than a 5th percentile, so the pessimistic final estimate will be more pessimistic than its inputs (see the sketch after this list). On the other hand, we can think of the wider intervals as accounting for structural uncertainty of the model.
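As a minimal sketch of that last point, with two arbitrary independent lognormal inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent positive inputs, e.g. two factors of a cost-effectiveness estimate.
x = rng.lognormal(0, 1, size=1_000_000)
y = rng.lognormal(0, 1, size=1_000_000)

pessimistic_product = np.quantile(x, 0.05) * np.quantile(y, 0.05)  # product of the two 5th percentiles
p5_of_product = np.quantile(x * y, 0.05)                           # 5th percentile of the product

# The first is noticeably smaller, i.e. more pessimistic than a true 5th percentile.
print(pessimistic_product, p5_of_product)
```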

Great remark, Sanjay! Great piece, Stan!

Related to which type of distribution is better, in this episode of The 80,000 Hours Podcast (search for "So you mentioned, kind of, the fat tail-ness of the distribution."), David Roodman suggests using the generalised Pareto distribution (GPD) to model the right tail (which often drives the expected value). David mentions that the right tails of normal, lognormal and power law distributions are particular cases of the GPD:

Robert Wiblin: This kind of log-normal or normal curve, or power law, are they all special cases of this generalized family [GPD]?

David Roodman: Their tails are.

So, fitting the right-tail empirical data to a GPD is arguably better than assuming (or fitting the data to) one particular type of distribution:

[David Roodman:]

So what you can do is you can take a data set like, all geomagnetic disturbances since 1957, and then look at the [inaudible 00:59:09] say, 300 biggest ones. What’s the right tail of the distribution? And then ask which member of the generalized Pareto family fits that data the best? And then once you’ve got a curve that you know … you know for theoretical reasons is a good choice, you can extrapolate it farther to the right and say, “What’s a million year storm look like?”

And one also has to be careful about out of sample extrapolations. But I think it’s more grounded in theory, this is, to use the generalized Pareto family, because it is analogous to using the normal family when constructing usual standard errors. Than, to, for example, assume that geomagnetic storms follow a power law, which was done in one of the papers that reached the popular press. So there was a Washington Post story some years ago that said the chance of a Carrington-size storm was like 12% per decade. But that was assuming a power law, which has a very fat tail. When I looked at the data, I just felt that that … and allowed the data to choose within a larger and theoretically motivated family. It did not, the model fit did not gravitate towards the power law.
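As a minimal sketch of that peaks-over-threshold approach, using scipy's genpareto on synthetic data (the dataset, threshold and extreme level are all placeholders, not David's numbers):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for a heavy-tailed empirical dataset (e.g. storm magnitudes).
data = rng.lognormal(mean=0, sigma=1.5, size=10_000)

threshold = np.quantile(data, 0.97)   # keep roughly the 300 largest observations
exceedances = data[data > threshold] - threshold

# Fit the generalised Pareto distribution to the exceedances (location fixed at 0).
shape, loc, scale = stats.genpareto.fit(exceedances, floc=0)

# Probability of exceeding some extreme level, extrapolated from the fitted tail.
level = 10 * threshold
p_exceed = (1 - 0.97) * stats.genpareto.sf(level - threshold, shape, loc=0, scale=scale)
print(shape, scale, p_exceed)
```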

Thanks, JP Addison!

Note that, across all questions, Metaculus' predictions have a Brier score (evaluated at all times) of 0.122, which is only slightly lower than the 0.127 of Metaculus' community predictions. So overall there is not much difference, although there can be significant differences for certain categories (like "artificial intelligence").
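For reference, the Brier score is just the mean squared difference between the forecast probabilities and the binary outcomes (lower is better); a tiny sketch with made-up numbers:

```python
import numpy as np

def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes (lower is better)."""
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return np.mean((forecasts - outcomes) ** 2)

# Made-up example: 4 questions, forecast probabilities vs resolutions.
print(brier_score([0.9, 0.2, 0.7, 0.4], [1, 0, 1, 0]))  # 0.075
```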

I have now done a follow-up analysis here where I aggregate results across all levels of ASRS severity, taking their likelihood into account (e.g. 150 Tg is much less likely than 30 Tg). The top 3 countries in Latin America, and the respective increases in future potential due to nationally and fully mitigating the food shocks caused by ASRSs, are:

  • For Brazil, 4.08 bp.
  • For Mexico, 3.62 bp.
  • For Argentina, 2.62 bp.

The full results are here (see tab "TOC").
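For intuition, the aggregation amounts to a likelihood-weighted combination of the per-severity results, along these lines (the severities, probabilities and effects below are placeholders, not the values used in the analysis):

```python
# Placeholder annual probabilities and per-level effects (bp) by soot injection (Tg).
# These numbers are made up purely to illustrate the weighting.
probabilities = {5: 0.01, 16: 0.005, 30: 0.002, 150: 0.0001}
effects_bp = {5: 0.1, 16: 0.5, 30: 2.0, 150: 30.0}

# Likelihood-weighted aggregate effect across all severity levels.
aggregate_bp = sum(probabilities[s] * effects_bp[s] for s in probabilities)
print(aggregate_bp)
```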

Thanks for the great summary, Stan!

I'd suggest that the conclusion is out-of-sync with how most people feel about saving lives in poor, undemocratic countries. We typically don't hesitate to tackle neglected tropical diseases on the basis that doing so boosts the populations of dictatorships.

I agree. At the same time, I do not think we can take the status quo for granted, because reality is often quite complex. For example, most people do not hesitate to eat factory-farmed animals, but the scale of their (negative) welfare may well outweigh that of humans (see here).

This is not to say I am confident socioeconomic indices weighted by real GDP are a better proxy for longterm value than population:

Is it possible that nationally mitigating food shocks robustly increases real GDP in the nearterm, but accidentally leads to a worse future if it decreases global socioeconomic indices weighted by influence? Is this even a real trade-off? Would it be better to use socioeconomic indices multiplied, instead of weighted, by real GDP as a proxy for future potential, such that greater real GDP would always be good? I am uncertain about the answers.

However, do you think halving the population of the United States while doubling that of China would be good in the longterm (assuming constant socioeconomic indices)? I think it would be bad in expectation (although with high uncertainty) because the world would then have a major undemocratic superpower with 4 times as much GDP as the wealthiest democratic country.
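The factor of 4 just comes from assuming constant real GDP per capita (so GDP scales with population) and roughly comparable starting GDPs, which is a simplification:

```python
# Simplifying assumptions: roughly comparable starting GDPs, and constant real GDP
# per capita, so GDP scales directly with population.
gdp_us, gdp_china = 1.0, 1.0            # normalised, illustrative

gdp_us_after = gdp_us * 0.5             # halving the population of the United States
gdp_china_after = gdp_china * 2.0       # doubling that of China

print(gdp_china_after / gdp_us_after)   # 4.0
```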

I think saving lives would be more important in terms of longterm value if the population loss were higher, because then saving lives would reduce the chance of extinction. I think it is quite hard for a nuclear war to lead to extinction, so I preferred using socioeconomic indices to estimate future value.

We should also consider that saving lives in low-income countries can affect their socioeconomic indices, which seems to be a neglected topic. From Kono 2009 (emphasis mine):

Although many people have argued that foreign aid props up dictators, few have claimed that it props up democrats, and no one has systematically examined whether either assertion is empirically true. We argue, and find, that aid has both effects. Over the long run, sustained aid flows promote autocratic survival because autocrats can stockpile this aid for use in times of crisis. Each disbursement of aid, however, has a larger impact on democratic survival because democrats have fewer alternative resources to fall back on.

In my model, mitigating the food shock of any given country counterfactually increases its real GDP per capita, and therefore socioeconomic indices.

Great work, Marie!

X-risk from nuclear war this century: ~0.18%. (based on the numbers below and sanity checked against Toby Ord’s estimate in The Precipice (p. 287) of ~0.1%)

FWIW, I got an estimate twice as large here:

I estimate globally mitigating food shocks caused by abrupt sunlight reduction scenarios (ASRSs) between 2024 and 2100 leads to an increase in future potential of 37.8 bp (5th to 95th percentile, -73.7 to 201).

Regarding:

Overall, the downsides of this project seem relatively limited to me.

I estimated here:

nationally mitigating food shocks is harmful not only in pessimistic cases, but also in expectation in 40.7 % (= 59/145) of the countries I analysed. The reasons are (see discussion here):

  • I suppose (see here) real gross domestic product (real GDP) is a good proxy for (good or bad) influence in the future.
  • Mitigating the food shocks of a country counterfactually increases its real GDP in the worst year of the ASRS.
  • Consequently, mitigating the food shocks of a country with low[10] socioeconomic indices[11] leads to lower global socioeconomic indices weighted by influence. I assume this is bad, although with significant uncertainty (see 2nd paragraph of the previous section).

A positive Shapley value means that all players decide to contribute (if basing their decisions off Shapley values as advocated in this post), and you then end up with N=3

Since I was calculating the Shapley value relative to doing nothing, it being positive only means taking the action is better than doing nothing. In reality, there will be other options available, so I think agents will want to maximise their Shapley cost-effectiveness. For the previous situation, it would be:

For the previous values, this would be 7/6. Apparently not very high, considering that donating 1 $ to GWWC leads to 6 $ of counterfactual effective donations as a lower bound (see here). However, the Shapley cost-effectiveness of GWWC would be lower than their counterfactual cost-effectiveness... In general, since there are barely any impact assessments using Shapley values, it is a little hard to tell whether a given value is good or bad.
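For concreteness, a minimal sketch of computing Shapley values and dividing by costs (the value function and costs are made up, not the numbers behind the 7/6 figure):

```python
from itertools import permutations
from math import factorial

def shapley_values(players, value):
    """Shapley value of each player: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = set()
        for p in order:
            totals[p] += value(coalition | {p}) - value(coalition)
            coalition.add(p)
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}

# Made-up 3-player game: value is only created if at least 2 players contribute.
players = ["A", "B", "C"]
value = lambda coalition: 6.0 if len(coalition) >= 2 else 0.0
costs = {"A": 1.0, "B": 2.0, "C": 3.0}   # made-up costs of contributing

sv = shapley_values(players, value)
shapley_cost_effectiveness = {p: sv[p] / costs[p] for p in players}
print(sv)                          # {'A': 2.0, 'B': 2.0, 'C': 2.0}
print(shapley_cost_effectiveness)  # higher is better, like any cost-effectiveness ratio
```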

Thanks for the post, Quintin!

However, time and again, we've found that deep learning systems improve more through scaling, of either the data or the model.

Jaime Sevilla from Epoch mentioned here that the scaling of compute and algorithmic improvements have each been responsible for roughly half of the progress:

roughly historically it has turned out that the two main drivers of progress have been the scaling of compute and these algorithmic improvements. And I will say that they are like 50

Jaime also mentions that data has not been a bottleneck.

Hi Luisa,

I see the expected harm scale is logarithmic (as the overall score is the sum of the component scores), but what is the ratio between the badness of 2 consecutive scores (e.g. "badness of score 12"/"badness of score 11")?
