Rory Stewart discusses GiveDirectly on The Rest is Politics

Thank you - I will update accordingly.

Rory Stewart discusses GiveDirectly on The Rest is Politics

I didn't get the impression from this transcript that Rory Stewart has just heard of cash transfers - is there any part which implied that? It felt to me more like bringing-the-listener-with-him kind of speak to convey a weird but exciting idea.

Reading the transcript cold, maybe it doesn't give that impression. If you're willing to listen to the episodes (there are two of them, and the topic comes up a few times, interspersed throughout), I'd be interested in whether his joke changes your view. (He certainly gives off a tone of surprise.) I also think this:

... (read more)

11mo

Thanks! Yes this was just my impression from reading, not listening. I'll
hopefully get round to listening later and see if that updates my impression.

Rory Stewart discusses GiveDirectly on The Rest is Politics

I feel like I'm going to upvote both? There seem to be some significant (specific) errors, but the message is broadly correct.

Rory Stewart discusses GiveDirectly on The Rest is Politics

To be clear - I think that this is on net a good thing. This podcast will probably introduce both GiveDirectly and EA ideas to a wider audience. Having written up this transcript, I am also less disappointed about how this came across than I was when I first heard this at 2x-speed. That said, I still find two things fairly depressing:

- Someone who has worked in international development for 30 years and headed DfID(!) is only just now finding out about cash transfers, and thinks it's the most effective intervention you can do. (Although perhaps with his cave

61mo

Thanks v much for posting this transcript! I agree this is on net good, and I
think I took a more positive impression from Rory Stewart's points :)
I didn't get the impression from this transcript that Rory Stewart has just
heard of cash transfers - is there any part which implied that? It felt to me
more like bringing-the-listener-with-him kind of speak to convey a weird but
exciting idea.
I would argue his point that 'giving people cash is probably the most effective
single intervention that you can do for a very poor family' is pretty accurate
and I think it implies he understands it maybe isn't as effective as larger
scale interventions (larger than 'a single intervention for one family'). But
agree with you that the joke at the end "We should have kept DfID, but we should
have spent the money on cash transfers" is wrong!
Anecdotally, from my experience in DfID in 2019-20, people working on
cross-cutting development prioritisation often mentioned cash transfers in a way
implying familiarity. The main question wasn't whether this weird idea works,
but how it compares to bigger interventions like conflict-prevention or
aid-for-trade.
So I come out even more cheerful about this interview!

Leveraging finance to increase resilience to GCRs

Capital market investors would be attracted to these financial products because they are not correlated with developed world asset prices. As mentioned before, these investments can also hedge against climate risks and GCRs.

Lots of products aren't correlated to financial markets. (Betting on sports for example). That doesn't mean investors want to put money in.

Another point is that if they hedge against climate risk, and you think climate risk will materially affect the world, then you should expect these products to be correlated to the market. (But at least then they might have some excess return).
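To illustrate that second point, here is a toy simulation (all numbers are invented) in which a resilience product pays out in disaster years while the market takes a hit in exactly those years. The two end up strongly negatively correlated rather than uncorrelated:

```python
import math
import random

random.seed(1)

# Hypothetical setup: ~10% of years are disaster years; the market averages
# -20% in those years and +8% otherwise, while the product pays +50% in
# disaster years and a small negative carry (-2%) otherwise.
years = 1000
disaster = [random.random() < 0.1 for _ in range(years)]
market = [random.gauss(-0.2 if d else 0.08, 0.05) for d in disaster]
hedge = [0.5 if d else -0.02 for d in disaster]

def corr(xs, ys):
    # Pearson correlation coefficient
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

print(corr(market, hedge))  # strongly negative
```

Negative correlation is still correlation: that is what makes the product a hedge rather than a mere diversifier, and why "uncorrelated" and "hedges climate risk" are in tension.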

Leveraging finance to increase resilience to GCRs

Capital market investors would be attracted to these finance products due to high returns and a lack of correlation with developed world asset prices. As mentioned before, these investments can also hedge against climate risks and GCRs.

Why should we expect high returns? ILS / "Cat Bonds" don't seem to have especially high returns, and I'm not sure what the economic justification for them having high returns would be?

13mo

Good catch. Investors care about return vs risk. Uncorrelated products also let
them diversify, lowering risk / raising returns for their portfolio overall.

Forecasting Newsletter: Looking back at 2021.

My general take on this space is:

- There is (generally) a disconnect between decision makers and forecasting platforms
- Spot forecasts are not especially useful on their own
- There are some good examples of decision makers at least looking at markets

Re 1: the disconnect between decision makers and forecasting platforms. I think the problem comes in two directions.

- Decision makers don't value the forecasts as much as they would cost to create (even if the value they would provide would be huge)
- The incentives to make the forecasts are usually orthogonal to the peop

Bottlenecks to more impactful crowd forecasting

You might be interested in both the "Most Likes" and "h-Index" metrics on MetaculusExtras, which does have a visible upvote score. (Although I agree it would be nice to have it on Metaculus proper.)

A Quick Overview of Forecasting and Prediction Markets: Why they’re useful, why they aren’t, and what’s next

Some nitpicks:

Forecasts have been more accurate than random 94% of the time since 2015

This is a terrible metric since most people looking at most questions on Metaculus wouldn't think they are all 50/50 coin flips.

Augur’s solution to this issue is to provide predictors with a set of outcomes on which predictors stake their earnings on the true outcome. Presumably, the most staked-on outcome is what actually happened (such as Biden winning the popular vote being the true outcome). In turn, predictors are rewarded for staking on true outcomes.

This doesn't ac... (read more)

35mo

thank you for the feedback- it's very helpful! I'll make the edits/clarify my
thinking and get back to you.

When pooling forecasts, use the geometric mean of odds

Looking at the rolling performance of your method (optimise on the last 100 questions and use that factor to predict the next), the median, and the geometric mean of odds, I find they have been ~indistinguishable over the last ~200 questions. If I look at the exact numbers, extremized_last_100 does win marginally, but looking at that chart I'd have a hard time saying "there's a 70% chance it wins over the next 100 questions". If you're willing to bet at 70% odds, I'd be interested.

... (read more)

When pooling forecasts, use the geometric mean of odds

This has restored my faith in extremization

I think this is the wrong way to look at this.

Metaculus was way underconfident originally. (Prior to 2020, 22% using their metric). Recently it has been much better calibrated - (2020- now, 4% using their metric).

Of course if they are underconfident then extremizing will improve the forecast, but the question is what is most predictive going forward. Given that before 2020 they were 22% underconfident, more recently 4% underconfident, it seems foolhardy to expect them to be underconfident going forward.

I would NOT... (read more)

17mo

I get what you are saying, and I also harbor doubts about whether extremization
is just pure hindsight bias or if there is something else to it.
Overall I still think it's probably justified in cases like Metaculus to
extremize based on the extremization factor that would optimize the last 100
resolved questions, and I would expect the extremized geo mean with such a
factor to outperform the unextremized geo mean in the next 100 binary questions
to resolve (if pressed to put a number on it maybe ~70% confidence without
thinking too much).
My reasoning here is something like:
* There seems to be a long tradition of extremizing in the academic literature
(see the reference in the post above). Though on the other hand empirical
studies have been sparse, and eg Satopaa et al are cheating by choosing the
extremization factor with the benefit of hindsight.
* In this case I didn't try too hard to find an extremization factor that would
work, just two attempts. I didn't need to mine for a factor that would work.
But obviously we cannot generalize from just one example.
* Extremizing has an intuitive meaning as accounting for the different pieces
of information across experts that gives it weight (pun not intended). On the
other hand, every extra parameter in the aggregation is a chance to shoot
ourselves in the foot.
* Intuitively it seems like the overall confidence of a community should be
roughly continuous over time? So the level of underconfidence in recent
questions should be a good indicator of its confidence for the next few
questions.
So overall I am not super convinced, and a big part of my argument is an appeal
to authority.
Also, it seems to be the case that extremization by 1.5 also works when looking
at the last 330 questions.
I'd be curious about your thoughts here. Do you think that a 1.5-extremized geo
mean will outperform the unextremized geo mean in the next 100 questions? What
if we choose a finetuned extremization

My current best guess on how to aggregate forecasts

It's not clear to me that "fitting a Beta distribution and using one of its statistics" is different from just taking the mean of the probabilities.

I fitted a beta distribution to Metaculus forecasts and looked at:

- Median forecast
- Mean forecast
- Mean log-odds / Geometric mean of odds
- Fitted beta median
- Fitted beta mean

Scattering these 5 values against each other I get:

We can see fitted values are closely aligned with the mean and mean-log-odds, but not with the median. (Unsurprising when you consider the ~parametric formula for the mean / median).

The performan... (read more)
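To make the "fitting a Beta isn't really different from taking the mean" point concrete, here is a sketch with made-up forecasts, using a method-of-moments Beta fit (which matches the arithmetic mean by construction):

```python
import math

# Hypothetical forecasts on one question
probs = [0.2, 0.3, 0.35, 0.4, 0.6]
n = len(probs)

m = sum(probs) / n
v = sum((p - m) ** 2 for p in probs) / n

# Method-of-moments Beta fit: choose (a, b) to match the sample mean and variance
k = m * (1 - m) / v - 1
a, b = m * k, (1 - m) * k

beta_mean = a / (a + b)  # identical to the arithmetic mean by construction

# Geometric mean of odds, converted back to a probability, for comparison
mean_log_odds = sum(math.log(p / (1 - p)) for p in probs) / n
geo_mean_odds_p = 1 / (1 + math.exp(-mean_log_odds))
```

So the fitted Beta mean tracks the arithmetic mean exactly here, while the geometric mean of odds sits slightly below it, which matches the scatter pattern described above.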

How does forecast quantity impact forecast quality on Metaculus?

I investigated this, and it doesn’t look like there is much evidence for herding among Metaculus users to any noticeable extent, or if there is herding, it doesn’t seem to increase as the number of predictors rises.

1. People REALLY like predicting multiples of 5

2. People still like predicting the median after accounting for this (eg looking at questions where the median isn't a multiple of 5)

(Another way to see how much forecasters love those multiples of 5)
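A minimal sketch of the check in point 1 (the predictions here are hypothetical; on real data you'd pull them from the API): count what fraction of final predictions land on a multiple of 5, versus the ~19% you'd expect if last digits were uniform over 1-99:

```python
# Hypothetical final predictions on one question, in percent
preds = [10, 15, 17, 20, 25, 25, 30, 33, 35, 42, 50, 65]

mult5 = sum(1 for p in preds if p % 5 == 0)
frac = mult5 / len(preds)
print(frac)  # 9/12 here, far above the ~19% baseline
```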

How does forecast quantity impact forecast quality on Metaculus?

If one had access to the individual predictions, one could also try to take 1000 random bootstrap samples of size 1 of all the predictions, then 1000 random bootstrap samples of size 2, and so on and measure how accuracy changes with larger random samples. This might also be possible with data from other prediction sites.

I discussed this with Charles. It's not possible to do *exactly* this with the API, but we can approximate this by looking at the final predictions just before close.

We can see that:

- Questio
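A sketch of the bootstrap idea with synthetic forecasts (the true probability, noise level, and pool size are all invented), scoring a geometric-mean-of-odds pool at each sample size:

```python
import math
import random

random.seed(0)

# Toy stand-in for one question's predictor pool: noisy forecasts around 0.7
true_p = 0.7
forecasts = [min(max(random.gauss(true_p, 0.15), 0.01), 0.99) for _ in range(200)]
outcome = 1  # suppose the question resolved Yes

def geo_mean_odds(ps):
    mean_log_odds = sum(math.log(p / (1 - p)) for p in ps) / len(ps)
    return 1 / (1 + math.exp(-mean_log_odds))

# Bootstrap samples of size k, averaging the Brier score of the pooled forecast
avg_brier = {}
for k in [1, 2, 5, 10, 50]:
    scores = []
    for _ in range(1000):
        sample = [random.choice(forecasts) for _ in range(k)]
        scores.append((geo_mean_odds(sample) - outcome) ** 2)
    avg_brier[k] = sum(scores) / len(scores)

print(avg_brier)
```

In this synthetic setup, accuracy improves with sample size mostly because pooling shrinks the variance of the aggregate; on real data the curve's shape is the interesting part.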

When pooling forecasts, use the geometric mean of odds

I created a question series on Metaculus to see how big an effect this is and how the community might forecast this going forward.

When pooling forecasts, use the geometric mean of odds

If I was to summarise your post in another way, it would be this:

The biggest problem with pooling is that a point estimate isn't the end goal. In most applications you care about some transform of the estimate. In general, you're better off keeping all of the information (ie your new prior) rather than just a point estimate of said prior.

I disagree with you that the most natural prior is "mixture distribution over experts". (Although I wonder how much that actually ends up mattering in the real world).

I also think something "interesting" is being sai... (read more)

When pooling forecasts, use the geometric mean of odds

```
import requests, json
import numpy as np
import pandas as pd

def fetch_results_data():
    # Page through the Metaculus API until there are no more result pages
    response = {"next": "https://www.metaculus.com/api2/questions/?limit=100&status=resolved"}
    results = []
    while response["next"] is not None:
        print(response["next"])
        response = json.loads(requests.get(response["next"]).text)
        results.append(response["results"])
    return sum(results, [])

all_results = fetch_results_data()

# Keep only binary questions that resolved Yes (1) or No (0)
binary_qns = [q for q in all_results
              if q['possibilities']['type'] == 'binary'
              and q['resolution'] in [0, 1]]
```

... (read more)

When pooling forecasts, use the geometric mean of odds

"more questions resolve positively than users expect"

Users expect 50 to resolve positively, but actually 60 resolve positive.

"users expect more questions to resolve positive than actually resolve positive"

Users expect 50 to resolve positive, but actually 40 resolve positive.

I have now edited the original comment - is that clearer?

28mo

Cheers

When pooling forecasts, use the geometric mean of odds

but also the average predictor improving their ability also fixed that underconfidence

What do you mean by this?

I mean that in the past people were underconfident (so extremizing would make their predictions better). Since then they've stopped being underconfident. My assumption is that this is because the average predictor is now more skilled, or because more predictors improve the quality of the average.

Doesn't that mean that it should be

less accurate, given the bias towards questions resolving positively?

The bias isn't that more questions resolve pos... (read more)

38mo

I don't get what the difference between these is.

18mo

Gotcha!
Oh I see!

When pooling forecasts, use the geometric mean of odds

I find it very interesting that the extremized version was consistently below by a narrow margin. I wonder if this means that there is a subset of questions where it works well, and another where it underperforms.

I think it's actually that historically the Metaculus community was underconfident (see track record here before 2020 vs after 2020).

Extremizing fixes that underconfidence, but also the average predictor improving their ability also fixed that underconfidence.

One question / nitpick: what do you mean by geometric mean of the probabilities?

... (read more)

18mo

What do you mean by this?
Oh I see!
It is very cool that this works.
One thing that confuses me - when you take the geometric mean of probabilities
you end up with p_pooled + (1 − p)_pooled < 1. So the pooled probability gets
slightly nudged towards 0 in comparison to what you would get with the geometric
mean of odds. Doesn't that mean that it should be less accurate, given the bias
towards questions resolving positively?
What am I missing?

When pooling forecasts, use the geometric mean of odds

Yes - copy and paste fail - now corrected

When pooling forecasts, use the geometric mean of odds

| | brier | -log |
|---|---|---|
| metaculus_prediction | 0.110 | 0.360 |
| geo_mean_weighted | 0.115 | 0.369 |
| extr_geo_mean_odds_2.5_weighted | 0.116 | 0.387 |
| geo_mean_odds_weighted | 0.117 | 0.371 |
| median_weighted | 0.121 | 0.381 |
| mean_weighted | 0.122 | 0.393 |
| geo_mean_unweighted | 0.128 | 0.409 |
| geo_mean_odds_unweighted | 0.130 | 0.410 |
| extr_geo_mean_odds_2.5_unweighted | 0.131 | 0.431 |
| median_unweighted | 0.134 | 0.417 |
| mean_unweighted | 0.138 | 0.439 |

28mo

(I note these scores are very different than in the first table; I assume these
were meant to be the Brier scores instead?)

48mo

Thank you for the superb analysis!
This increases my confidence in the geo mean of the odds, and decreases my
confidence in the extremization bit.
I find it very interesting that the extremized version was consistently below by
a narrow margin. I wonder if this means that there is a subset of questions
where it works well, and another where it underperforms.
One question / nitpick: what do you mean by geometric mean of the probabilities?
If you just take the geometric mean of probabilities then you do not get a valid
probability - the sum of the pooled ps and (1-p)s does not equal 1. You need to
rescale them, at which point you end with the geometric mean of odds.
Unexpected values explains this better than me here
[https://www.lesswrong.com/posts/mpDGNJFYzyKkg7zc2/aggregating-forecasts?commentId=s3PDE6se6pYt2k7Cp]
.
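A quick numerical check of this: the raw geometric means of p and 1 − p sum to less than 1, and rescaling them recovers exactly the geometric mean of odds:

```python
import math

probs = [0.1, 0.5, 0.9]  # made-up forecasts

def gm(xs):
    # Geometric mean
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

gm_p = gm(probs)
gm_q = gm([1 - p for p in probs])

# The raw geometric means don't form a valid probability:
assert gm_p + gm_q < 1

# Rescaling recovers exactly the geometric mean of odds:
rescaled = gm_p / (gm_p + gm_q)
pooled_odds = gm([p / (1 - p) for p in probs])
assert abs(rescaled - pooled_odds / (1 + pooled_odds)) < 1e-12
```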

When pooling forecasts, use the geometric mean of odds

tl;dr The conclusions of this article hold up in an empirical test with Metaculus data

Looking at resolved binary Metaculus questions, I used 5 different methods to pool the community estimate:

- Geometric mean of probabilities
- Geometric mean of odds / Arithmetic mean of log-odds
- Median of odds (current Metaculus forecast)
- Arithmetic mean of odds
- Proprietary Metaculus forecast

Also looking at two different scoring rules (Brier and Log) I find rankings as (smaller is better in my table):

- Metaculus prediction is currently the best[2]
- Geometric mean of probabili
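For concreteness, the first four pooling methods can be sketched as follows (the proprietary Metaculus forecast isn't reproducible; the input probabilities are made up):

```python
import math
from statistics import median

probs = [0.1, 0.3, 0.5, 0.7]  # toy community forecasts

def geo_mean_probs(ps):
    # Raw geometric mean of probabilities; note that pooled this way,
    # p and 1 - p no longer sum to 1 without rescaling
    return math.exp(sum(math.log(p) for p in ps) / len(ps))

def geo_mean_odds(ps):
    # Arithmetic mean of log-odds, converted back to a probability
    lo = sum(math.log(p / (1 - p)) for p in ps) / len(ps)
    return 1 / (1 + math.exp(-lo))

pooled = {
    "median": median(probs),
    "arith_mean": sum(probs) / len(probs),
    "geo_mean_probs": geo_mean_probs(probs),
    "geo_mean_odds": geo_mean_odds(probs),
}
print(pooled)
```

On this toy input the geometric mean of probabilities lands below the geometric mean of odds, which in turn lands below the arithmetic mean, matching the ordering discussed elsewhere in this thread.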

17mo

I was curious about why the extremized geo mean of odds didn't seem to beat
other methods. Eric Neyman suggested trying a smaller extremization factor, so I
did that.
I tried an extremizing factor of 1.5, and reused your script to score the
performance on recent binary questions. The result is that the extremized
prediction comes on top.
This has restored my faith on extremization. On hindsight, recommending a fixed
extremization factor was silly, since the correct extremization factor is going
to depend on the predictors being aggregated and the topics they are talking
about.
Going forward I would recommend people who want to apply extremization to study
what extremization factors would have made sense in past questions from the same
community.
I talk more about this in my new post
[https://forum.effectivealtruism.org/posts/acREnv2Z5h4Fr5NWz/my-current-best-guess-on-how-to-aggregate-forecasts]
.
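For reference, extremizing by a factor k in odds space (k = 1.5 being the factor discussed here) is just:

```python
import math

def extremize(p, k):
    # Raise the odds to the power k, then convert back to a probability
    odds = (p / (1 - p)) ** k
    return odds / (1 + odds)

# k > 1 pushes forecasts away from 0.5; k = 1 leaves them unchanged
print(extremize(0.7, 1.5))
```

Fitting k on the last 100 resolved questions, as suggested above, then amounts to a one-dimensional search over this function's parameter.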

48mo

META: Do you think you could edit this comment to include...
1. The number of questions, and aggregated predictions per question?
2. The information on extremized geometric mean you computed below (I think it
is not receiving as much attention due to being buried in the replies)?
3. Possibly a code snippet to reproduce the results?
Thanks in advance!

38mo

Cool, that’s really useful to know. Can you also check how extremizing the odds
with different parameters performs?

You might be interested in my empirical look at this for Metaculus