All of Simon_M's Comments + Replies

Rory Stewart discusses GiveDirectly on The Rest is Politics

I didn't get the impression from this transcript that Rory Stewart has just heard of cash transfers - is there any part which implied that? It felt to me more like bringing-the-listener-with-him kind of speak to convey a weird but exciting idea.

Reading the transcript cold, maybe it doesn't give that impression. If you're willing to listen to the episodes (there are two of them, and the topic comes up a few times interspersed throughout) I'd be interested whether your view changes, given his joke. (He certainly gives off a tone of surprise.) I also think this:

 I

... (read more)
1PatrickL1mo
Thanks! Yes this was just my impression from reading, not listening. I'll hopefully get round to listening later and see if that updates my impression.
Rory Stewart discusses GiveDirectly on The Rest is Politics

I feel like I'm going to upvote both? There seem to be some significant (specific) errors, but the message is broadly correct.

Rory Stewart discusses GiveDirectly on The Rest is Politics

To be clear - I think that this is on net a good thing. This podcast will probably introduce both GiveDirectly and EA ideas to a wider audience. Having written up this transcript, I am also less disappointed about how this came across than I was when I first heard this at 2x-speed. That said, I still find two things fairly depressing:

  1. Someone who has worked in international development for 30 years and headed DfID(!) is only just now finding out about cash transfers, and thinks it's the most effective intervention you can do. (Although perhaps with his cave
... (read more)
6PatrickL1mo
Thanks v much for posting this transcript! I agree this is on net good and think I took a more positive impression from Rory Stewart's points :) I didn't get the impression from this transcript that Rory Stewart has just heard of cash transfers - is there any part which implied that? It felt to me more like bringing-the-listener-with-him kind of speak to convey a weird but exciting idea. I would argue his point that 'giving people cash is probably the most effective single intervention that you can do for a very poor family' is pretty accurate and I think it implies he understands it maybe isn't as effective as larger scale interventions (larger than 'a single intervention for one family'). But agree with you that the joke at the end "We should have kept DfID, but we should have spent the money on cash transfers" is wrong! Anecdotally, from my experience in DfID in 2019-20, people working on cross-cutting development prioritisation often mentioned cash transfers in a way implying familiarity. The main question wasn't whether this weird idea works, but how it compares to bigger interventions like conflict-prevention or aid-for-trade. So I come out even more cheerful about this interview!
Leveraging finance to increase resilience to GCRs

Capital market investors would be attracted to these financial products because they are not correlated with developed world asset prices. As mentioned before, these investments can also hedge against climate risks and GCRs.

Lots of products aren't correlated to financial markets. (Betting on sports for example). That doesn't mean investors want to put money in. 

Another point is that if they hedge against climate risk, and you think climate risk will materially affect the world, then you should expect these products to be correlated to the market. (But at least then they might have some excess return).
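To make that second point concrete, here is a toy simulation (all numbers invented, purely illustrative): if an instrument pays off precisely in the states of the world where a climate shock drags the market down, the two end up negatively correlated rather than uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical world: a climate shock hits with 5% probability and drags
# down the broad market. All parameters here are made up for illustration.
shock = rng.random(n) < 0.05
market = rng.normal(0.06, 0.15, n) - 0.30 * shock  # market return, hurt by shocks
hedge = np.where(shock, 0.50, -0.02)               # instrument paying off in shock states

# If the hedge pays off exactly when the market falls, it is (negatively)
# correlated with the market - not uncorrelated with it.
print(np.corrcoef(market, hedge)[0, 1])
```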

Leveraging finance to increase resilience to GCRs

Capital market investors would be attracted to these finance products due to high returns and a lack of correlation with developed world asset prices. As mentioned before, these investments can also hedge against climate risks and GCRs.

 

Why should we expect high returns? ILS / "Cat Bonds" don't seem to have especially high returns, and I'm not sure what the economic justification for them having high returns would be?

1PhilC3mo
Good catch. Investors care about return vs risk. It also allows them to diversify and have lower risk / higher returns for their portfolio.
Forecasting Newsletter: Looking back at 2021.

My general take on this space is:

  1. There is (generally) a disconnect between decision makers and forecasting platforms
  2. Spot forecasts are not especially useful on their own
  3. There are some good examples of decision makers at least looking at markets

Re 1: the disconnect between decision makers and forecasting platforms. I think the problem comes in two directions.

  • Decision makers don't value the forecasts as much as they would cost to create (even if the value they would provide would be huge)
  • The incentives to make the forecasts are usually orthogonal to the peop
... (read more)
Bottlenecks to more impactful crowd forecasting

You might be interested in both the "Most Likes" and "h-Index" metrics on MetaculusExtras, which does have a visible upvote score. (Although I agree it would be nice to have it on Metaculus proper.)

A Quick Overview of Forecasting and Prediction Markets: Why they’re useful, why they aren’t, and what’s next

Some nitpicks:

Forecasts have been more accurate than random 94% of the time since 2015

This is a terrible metric since most people looking at most questions on Metaculus wouldn't think they are all 50/50 coin flips.

Augur’s solution to this issue is to provide predictors with a set of outcomes on which predictors stake their earnings on the true outcome. Presumably, the most staked-on outcome is what actually happened (such as Biden winning the popular vote being the true outcome). In turn, predictors are rewarded for staking on true outcomes.

This doesn't ac... (read more)

3henry5mo
thank you for the feedback- it's very helpful! I'll make the edits/clarify my thinking and get back to you.
When pooling forecasts, use the geometric mean of odds

Looking at the rolling performance of your method (optimize on the last 100 and use that to predict), the median, and the geo mean of odds, I find they have been ~indistinguishable over the last ~200 questions. If I look at the exact numbers, extremized_last_100 does win marginally, but looking at that chart I'd have a hard time saying "there's a 70% chance it wins over the next 100 questions". If you're willing to bet at those 70% odds, I'd be interested in taking the other side.
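For reference, the operations being compared here can be sketched as follows (function names are mine; "extremizing" means raising the pooled odds to a fixed power, with a factor above 1 pushing the forecast away from 0.5):

```python
import numpy as np

def geo_mean_odds(probs):
    """Pool probabilities via the geometric mean of their odds."""
    probs = np.asarray(probs, dtype=float)
    pooled_odds = np.exp(np.mean(np.log(probs / (1 - probs))))
    return pooled_odds / (1 + pooled_odds)

def extremize(p, factor):
    """Raise the odds to `factor`; factor > 1 pushes the forecast away from 0.5."""
    odds = (p / (1 - p)) ** factor
    return odds / (1 + odds)

pooled = geo_mean_odds([0.6, 0.7, 0.8])
print(pooled, extremize(pooled, 1.5))
```

Note that extremizing leaves 0.5 fixed and is monotone in the factor, which is why the choice of factor only matters when the pooled forecast is already away from 0.5.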

There seems to be a long tradition of extremizing in the academic literature (see the reference in the post above). Th

... (read more)
When pooling forecasts, use the geometric mean of odds

This has restored my faith in extremization

I think this is the wrong way to look at this.

Metaculus was way underconfident originally. (Prior to 2020, 22% using their metric). Recently it has been much better calibrated - (2020- now, 4% using their metric).

Of course if they are underconfident then extremizing will improve the forecast, but the question is what is most predictive going forward. Given that before 2020 they were 22% underconfident, more recently 4% underconfident, it seems foolhardy to expect them to be underconfident going forward.

I would NOT... (read more)

1Jaime Sevilla7mo
I get what you are saying, and I also harbor doubts about whether extremization is just pure hindsight bias or if there is something else to it. Overall I still think its probably justified in cases like Metaculus to extremize based on the extremization factor that would optimize the last 100 resolved questions, and I would expect the extremized geo mean with such a factor to outperform the unextremized geo mean in the next 100 binary questions to resolve (if pressed to put a number on it maybe ~70% confidence without thinking too much). My reasoning here is something like: * There seems to be a long tradition of extremizing in the academic literature (see the reference in the post above). Though on the other hand empirical studies have been sparse, and eg Satopaa et al are cheating by choosing the extremization factor with the benefit of hindsight. * In this case I didn't try too hard to find an extremization factor that would work, just two attempts. I didn't need to mine for a factor that would work. But obviously we cannot generalize from just one example. * Extremizing has an intuitive meaning as accounting for the different pieces of information across experts that gives it weight (pun not intended). On the other hand, every extra parameter in the aggregation is a chance to shoot off our own foot. * Intuitively it seems like the overall confidence of a community should be roughly continuous over time? So the level of underconfidence in recent questions should be a good indicator of its confidence for the next few questions. So overall I am not super convinced, and a big part of my argument is an appeal to authority. Also, it seems to be the case that extremization by 1.5 also works when looking at the last 330 questions. I'd be curious about your thoughts here. Do you think that a 1.5-extremized geo mean will outperform the unextremized geo mean in the next 100 questions? What if we choose a finetuned extremization
My current best guess on how to aggregate forecasts

It's not clear to me that "fitting a Beta distribution and using one of its statistics" is different from just taking the mean of the probabilities.

I fitted a beta distribution to Metaculus forecasts and looked at:

  • Median forecast
  • Mean forecast
  • Mean log-odds / Geometric mean of odds
  • Fitted beta median
  • Fitted beta mean

Scattering these 5 values against each other I get:

We can see fitted values are closely aligned with the mean and mean-log-odds, but not with the median. (Unsurprising when you consider the ~parametric formula for the mean / median).

The performan... (read more)
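One way to see why the fitted statistics track the mean so closely: with a method-of-moments fit, the fitted beta's mean equals the sample mean by construction, and its median sits nearby unless the distribution is very skewed. A sketch with made-up forecasts (scipy is only used for the median):

```python
import numpy as np
from scipy import stats

forecasts = np.array([0.55, 0.6, 0.62, 0.7, 0.8])  # made-up forecasts for one question

# Method-of-moments Beta fit: solve for alpha, beta from sample mean and variance
m, v = forecasts.mean(), forecasts.var()
common = m * (1 - m) / v - 1
alpha, beta = m * common, (1 - m) * common

# The fitted-beta mean reproduces the mean forecast exactly by construction;
# the fitted-beta median lands close to it for non-extreme alpha, beta
print(alpha / (alpha + beta), forecasts.mean())
print(stats.beta.median(alpha, beta))
```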

How does forecast quantity impact forecast quality on Metaculus?

I investigated this, and it doesn’t look like there is much evidence for herding among Metaculus users to any noticeable extent, or if there is herding, it doesn’t seem to increase as the number of predictors rises.

 


1. People REALLY like predicting multiples of 5

2. People still like predicting the median after accounting for this (eg looking at questions where the median isn't a multiple of 5)

(Another way to see how much forecasters love those multiples of 5)

How does forecast quantity impact forecast quality on Metaculus?

If one had access to the individual predictions, one could also try to take 1000 random bootstrap samples of size 1 of all the predictions, then 1000 random bootstrap samples of size 2, and so on and measure how accuracy changes with larger random samples. This might also be possible with data from other prediction sites.

 

I discussed this with Charles. It's not possible to do exactly this with the API, but we can approximate it by looking at the final predictions just before close.

Brier score from bootstrapped predictors

We can see that:

  1. Questio
... (read more)
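The approximation above can be sketched like this (toy data only; in the real version `final_preds` would come from the API's last prediction per user before close):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_brier(final_preds, outcome, size, n_samples=1000):
    """Mean Brier score of the median of `size` randomly resampled predictors."""
    scores = []
    for _ in range(n_samples):
        sample = rng.choice(final_preds, size=size, replace=True)
        scores.append((np.median(sample) - outcome) ** 2)
    return float(np.mean(scores))

# Toy example: one question that resolved Yes, with 50 noisy predictors
preds = np.clip(rng.normal(0.7, 0.15, 50), 0.01, 0.99)
for k in (1, 5, 25):
    print(k, bootstrap_brier(preds, 1, k))
```

As expected, the score improves (falls) as the bootstrap sample size grows, since the median of a larger sample is less noisy.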
When pooling forecasts, use the geometric mean of odds

I created a question series on Metaculus to see how big an effect this is and how the community might forecast this going forward.

When pooling forecasts, use the geometric mean of odds

If I was to summarise your post in another way, it would be this:

The biggest problem with pooling is that a point estimate isn't the end goal. In most applications you care about some transform of the estimate. In general, you're better off keeping all of the information (ie your new prior) rather than just a point estimate of said prior.

 

I disagree with you that the most natural prior is "mixture distribution over experts". (Although I wonder how much that actually ends up mattering in the real world).


I also think something "interesting" is being sai... (read more)

When pooling forecasts, use the geometric mean of odds
import requests, json
import numpy as np
import pandas as pd

def fetch_results_data():
    # Page through the Metaculus API until there are no more pages of results
    response = {"next":"https://www.metaculus.com/api2/questions/?limit=100&status=resolved"}

    results = []
    while response["next"] is not None:
        print(response["next"])
        response = json.loads(requests.get(response["next"]).text)
        results.extend(response["results"])
    return results


all_results = fetch_results_data()
binary_qns = [q for q in all_results if q['possibilities']['type'] == 'binary' and q['resolution'] in [0,1]]
bi
... (read more)
When pooling forecasts, use the geometric mean of odds

"more questions resolve positively than users expect"

Users expect 50 to resolve positively, but actually 60 resolve positive.

"users expect more questions to resolve positive than actually resolve positive"

Users expect 50 to resolve positive, but actually 40 resolve positive.

I have now edited the original comment to be clearer.

2NunoSempere8mo
Cheers
When pooling forecasts, use the geometric mean of odds

but the average predictor improving their ability also fixed that underconfidence

What do you mean by this?

I mean that in the past people were underconfident (so extremizing would make their predictions better). Since then they've stopped being underconfident. My assumption is that this is because the average predictor is now more skilled, or because having more predictors improves the quality of the average.

Doesn't that mean that it should be less accurate, given the bias towards questions resolving positively?

The bias isn't that more questions resolve pos... (read more)

3NunoSempere8mo
I don't get what the difference between these is.
1Jaime Sevilla8mo
Gotcha! Oh I see!
When pooling forecasts, use the geometric mean of odds

I find it very interesting that the extremized version was consistently below by a narrow margin. I wonder if this means that there is a subset of questions where it works well, and another where it underperforms.

I think it's actually that historically the Metaculus community was underconfident (see track record here before 2020 vs after 2020).

Extremizing fixes that underconfidence, but the average predictor improving their ability also fixed that underconfidence.

One question / nitpick: what do you mean by geometric mean of the probabilities? 

Met... (read more)

1Jaime Sevilla8mo
What do you mean by this? Oh I see! It is very cool that this works. One thing that confuses me - when you take the geometric mean of probabilities you end up with p_pooled + (1 − p)_pooled < 1. So the pooled probability gets slightly nudged towards 0 in comparison to what you would get with the geometric mean of odds. Doesn't that mean that it should be less accurate, given the bias towards questions resolving positively? What am I missing?
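Numerically, the issue is easy to see: the geometric means of p and of (1 − p) don't sum to 1, and rescaling them recovers exactly the geometric mean of odds. A sketch:

```python
import numpy as np

def geo_mean(xs):
    """Geometric mean of an array of positive numbers."""
    return float(np.exp(np.mean(np.log(xs))))

probs = np.array([0.4, 0.5, 0.9])

p = geo_mean(probs)        # geometric mean of the probabilities
q = geo_mean(1 - probs)    # geometric mean of the complements
print(p + q)               # < 1: not a valid probability without rescaling

# Rescaling p / (p + q) recovers the geometric mean of odds exactly
odds = probs / (1 - probs)
pooled_odds = geo_mean(odds)
print(p / (p + q), pooled_odds / (1 + pooled_odds))
```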
When pooling forecasts, use the geometric mean of odds
                                   Brier    Log
metaculus_prediction               0.110    0.360
geo_mean_weighted                  0.115    0.369
extr_geo_mean_odds_2.5_weighted    0.116    0.387
geo_mean_odds_weighted             0.117    0.371
median_weighted                    0.121    0.381
mean_weighted                      0.122    0.393
geo_mean_unweighted                0.128    0.409
geo_mean_odds_unweighted           0.130    0.410
extr_geo_mean_odds_2.5_unweighted  0.131    0.431
median_unweighted                  0.134    0.417
mean_unweighted                    0.138    0.439
2Jaime Sevilla8mo
(I note these scores are very different than in the first table; I assume these were meant to be the Brier scores instead?)
4Jaime Sevilla8mo
Thank you for the superb analysis! This increases my confidence in the geo mean of the odds, and decreases my confidence in the extremization bit. I find it very interesting that the extremized version was consistently below by a narrow margin. I wonder if this means that there is a subset of questions where it works well, and another where it underperforms. One question / nitpick: what do you mean by geometric mean of the probabilities? If you just take the geometric mean of probabilities then you do not get a valid probability - the sum of the pooled ps and (1-p)s does not equal 1. You need to rescale them, at which point you end with the geometric mean of odds. Unexpected values explains this better than me here [https://www.lesswrong.com/posts/mpDGNJFYzyKkg7zc2/aggregating-forecasts?commentId=s3PDE6se6pYt2k7Cp] .
When pooling forecasts, use the geometric mean of odds

tl;dr The conclusions of this article hold up in an empirical test with Metaculus data

 

I looked at resolved binary Metaculus questions, using 5 different methods to pool the community estimate:

  • Geometric mean of probabilities
  • Geometric mean of odds / Arithmetic mean of log-odds
  • Median of odds (current Metaculus forecast)
  • Arithmetic mean of odds
  • Proprietary Metaculus forecast

Also looking at two different scoring rules (Brier and Log) I find rankings as (smaller is better in my table):

  1. Metaculus prediction is currently the best[2]
  2. Geometric mean of probabili
... (read more)
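For reference, the two scoring rules used throughout can be sketched as follows (smaller is better in both, matching the tables):

```python
import numpy as np

def brier(p, outcome):
    """Brier score: squared error between forecast and the 0/1 resolution."""
    return (p - outcome) ** 2

def log_score(p, outcome):
    """Negative natural log of the probability assigned to the actual outcome."""
    return float(-np.log(p if outcome == 1 else 1 - p))

# A confident correct forecast scores well under both rules
print(brier(0.8, 1), log_score(0.8, 1))
# A confident wrong forecast is punished much harder by the log score
print(brier(0.8, 0), log_score(0.8, 0))
```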
1Jaime Sevilla7mo
I was curious about why the extremized geo mean of odds didn't seem to beat other methods. Eric Neyman suggested trying a smaller extremization factor, so I did that. I tried an extremizing factor of 1.5, and reused your script to score the performance on recent binary questions. The result is that the extremized prediction comes out on top. This has restored my faith in extremization. In hindsight, recommending a fixed extremization factor was silly, since the correct extremization factor is going to depend on the predictors being aggregated and the topics they are talking about. Going forward I would recommend people who want to apply extremization to study what extremization factors would have made sense in past questions from the same community. I talk more about this in my new post [https://forum.effectivealtruism.org/posts/acREnv2Z5h4Fr5NWz/my-current-best-guess-on-how-to-aggregate-forecasts].
4Jaime Sevilla8mo
META: Do you think you could edit this comment to include... 1. The number of questions, and aggregated predictions per question? 2. The information on extremized geometric mean you computed below (I think it is not receiving as much attention due to being buried in the replies)? 3. Possibly a code snippet to reproduce the results? Thanks in advance!
3MaxRa8mo
Cool, that’s really useful to know. Can you also check how extremizing the odds with different parameters performs?