Takeaways from the Metaculus AI Progress Tournament

Javier Prieto🔸


Peter Mühlbacher

Disclaimer: I work for Metaculus.

Thanks for carefully looking into this @Javier Prieto, this looks very interesting! I'm particularly intrigued by the different biases you identify for different categories, and I wonder how much weight you'd put on this being a statistical artefact vs. a real, persistent bias that you would continue to worry about. Concretely, if we waited until a comparable number of new AI benchmark progress questions had resolved, what would your P(Metaculus is underconfident on AI benchmark progress again) be? (Looking only at the new questions.)

Some minor comments:

"About 70% of the predictions at question close had a positive log score, i.e. they were better than predicting a maximally uncertain uniform distribution over the relevant range (chance level)."

I think the author knows what's going on here, but it may invite misunderstanding. This notion of "being better than predicting a […] uniform distribution" implies that a perfect forecast on the sum of two independent dice is "better than predicting a uniform distribution" only 2 out of 3 times, i.e. less than 70% of the time! (The probabilities for D_1+D_2 = 2,3,4,10,11, or 12 are all smaller than 1/#{possible outcomes}.)
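The dice arithmetic can be checked directly. Below is a minimal sketch, assuming two fair six-sided dice and a uniform baseline over the 11 possible sums; the variable names are illustrative and not taken from the post.

```python
from fractions import Fraction
from itertools import product
from math import log2

# Exact distribution of the sum of two fair six-sided dice.
counts = {}
for a, b in product(range(1, 7), repeat=2):
    counts[a + b] = counts.get(a + b, 0) + 1
true_pmf = {total: Fraction(c, 36) for total, c in counts.items()}

# Uniform baseline over the 11 possible sums (2 through 12).
uniform_p = Fraction(1, len(true_pmf))

# Sums where the perfect forecast assigns less probability than the baseline.
worse_than_uniform = sorted(s for s, p in true_pmf.items() if p < uniform_p)

# Probability that the perfect forecast scores above the uniform baseline.
prob_beats_uniform = sum(p for p in true_pmf.values() if p > uniform_p)

# Expected log score in bits relative to the uniform baseline.
expected_score_bits = sum(
    float(p) * log2(float(p / uniform_p)) for p in true_pmf.values()
)

print(worse_than_uniform, prob_beats_uniform, expected_score_bits)
```

The perfect forecast still has a positive expected score, even though it loses to the uniform baseline on a third of the rolls. That is why the share of positive scores alone can understate forecast quality.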

"The average log score at question close was 0.701 (Median: 0.868, IQR: [-0.165, 1.502][7]) compared to an average of 2.17 for all resolved continuous questions on Metaculus."

Given that quite a lot of these AI questions closed over a year before resolution, which is rather atypical for Metaculus, comparing log scores at question close seems a bit unfair. I think time-averaged scores would be more informative. (I reckon they'd produce a quantitatively different, albeit qualitatively similar picture.)

This also goes back to "Metaculus narrowly beats chance": We tried to argue why we believe that this isn't as narrow as others made it out to be (for reasonable definitions of "narrow") here.

Javier Prieto🔸

Thanks, Peter!

To your questions:

I'm fairly confident (let's say 80%) that Metaculus has underestimated progress on benchmarks so far. This doesn't mean it will keep doing so in the future because (i) forecasters may have learned from this experience to be more bullish and/or (ii) AI progress might slow down. I wouldn't bet on (ii), but I expect (i) has already happened to some extent -- it has certainly happened to me!
The other categories have fewer questions and some have special circumstances that make the evidence of bias much weaker in my view. Specifically, the biggest misses in "compute" came from GPU price spikes that can probably be explained by post-COVID supply chain disruptions and increased demand from crypto miners. Both of these factors were transient.
I like your example with the two independent dice. My takeaway is that, if you have access to a prior that's more informative than a uniform distribution (in this case, "both dice are unbiased so their sum must follow a triangular distribution"), then you should compare your performance against that. My assumption when writing this was that a (log-)uniform prior over the relevant range was the best we could do for these questions. This is in line with the fact that Metaculus's log score on continuous questions is normalized using a (log-)uniform distribution.
That's a good point re: different time horizons. I didn't bother to check the average time between close and resolution for all questions on the platform, but, assuming it's <<1 year as you suggest, I agree it's an important caveat. If you know that number off the top of your head, I'll add it to the post.

Peter Mühlbacher

Disclaimer: I work for Metaculus.

You can now forecast on how much AI benchmark progress will continue to be underestimated by the Metaculus Community Prediction (CP) on this Metaculus question! Thanks @Javier Prieto for prompting us to think more about this and inspiring this question!

Predict a distribution with a mean of

≈0.5, if you expect the CP to be decently calibrated or just aren't sure about the direction of bias,
>0.5, if you think the CP will continue to underestimate AI benchmark progress,
<0.5, if you think the CP will overestimate AI benchmark progress, e.g. by overreacting to this post.

Here is a Colab Notebook to get you started with some simulations.

And don't forget to update your forecasts about the AI benchmark progress questions in question, if the CP on this one has a mean far away from 0.5!

Lukas_Gloor

For anyone else who didn't know whether a higher log score is good or bad, I think I may have figured it out by reading between the lines. It looks like higher log score = better. But please correct me if I got this wrong!

Javier Prieto🔸

That's right. When defined using a base 2 logarithm, the score can be interpreted as "bits of information over the maximally uncertain (uniform) distribution". Forecasts assigning less probability mass to the true outcome than the uniform distribution result in a negative score.
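As a rough illustration of the "bits over uniform" reading, here is a minimal sketch. It assumes a question with a linear range and a forecast described only by its probability density at the resolved value. The function name and the numbers are hypothetical, and this is not claimed to be Metaculus's exact scoring code.

```python
from math import log2

def log_score_bits(density_at_outcome, lower, upper):
    """Base-2 log score of a forecast relative to a uniform density on [lower, upper]."""
    uniform_density = 1.0 / (upper - lower)
    return log2(density_at_outcome / uniform_density)

# Hypothetical question with range [0, 100]; the uniform density is 0.01.
sharp_and_right = log_score_bits(0.04, 0.0, 100.0)   # 4x the uniform density
same_as_uniform = log_score_bits(0.01, 0.0, 100.0)   # equal to the uniform density
sharp_and_wrong = log_score_bits(0.005, 0.0, 100.0)  # half the uniform density

print(sharp_and_right, same_as_uniform, sharp_and_wrong)
```

In this sketch, a forecast that puts four times the uniform density on the outcome scores about 2 bits, and one that puts half the uniform density there scores about -1 bit.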

Muireall

That's right. (But lower is better for some other common scoring rules, including the Brier score.)



[1]

The questions were run in parallel on Hypermind, but this analysis will focus exclusively on the Metaculus forecasts.

[2]

I started writing this analysis in April 2023. June 2023 was the last time I updated it.

[3]

I reverse-coded the questions where lower numbers meant faster progress (e.g. the benchmarks measuring perplexity) so that a higher/lower CDF could be interpreted consistently as pessimism/optimism.

[4]

A Kolmogorov-Smirnov test couldn’t reject the null hypothesis “the data were sampled from a uniform distribution on $[0, 1]$” at $\alpha = 0.05$.
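For readers who want to see the mechanics of such a test, here is a minimal standard-library sketch. It assumes the tested values lie in [0, 1] (e.g. the forecast CDF evaluated at the resolved value) and uses the large-sample approximation 1.358/sqrt(n) for the 5% critical value. The samples are synthetic, not the tournament data, and this is not the author's code.

```python
import math

def ks_uniform_statistic(values):
    """Largest gap between the sample's empirical CDF and the Uniform(0, 1) CDF."""
    xs = sorted(values)
    n = len(xs)
    d_plus = max((i + 1) / n - x for i, x in enumerate(xs))
    d_minus = max(x - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)

def ks_critical_value(n, coefficient=1.358):
    """Large-sample approximation to the two-sided critical value at alpha = 0.05."""
    return coefficient / math.sqrt(n)

n = 50
# Synthetic sample spread evenly over [0, 1]: should look uniform.
evenly_spread = [(i + 0.5) / n for i in range(n)]
# Synthetic sample pushed towards 0: should not look uniform.
skewed_low = [x ** 3 for x in evenly_spread]

for name, sample in [("evenly_spread", evenly_spread), ("skewed_low", skewed_low)]:
    d = ks_uniform_statistic(sample)
    print(name, round(d, 3), "reject" if d > ks_critical_value(n) else "cannot reject")
```

With small samples an exact p-value (for instance from a statistics library) is preferable to the asymptotic cutoff used here.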

[5]

There were essentially four question categories in this tournament:

1. Economic indicators, e.g. market cap of certain tech companies, weight of IT in the S&P 500.

2. Bibliometric indicators, all of them of the form "How many papers of <ML subfield> will be published on the arXiv before <date>?"

3. Compute, e.g. top GPU performance in FLOP/$ or total FLOPs available in TOP500 computers.

4. State-of-the-art performance on several ML benchmarks.

[6]

I tested this in two ways: (i) the t-distributed 95% confidence interval for CDF(true value) of the benchmark category doesn’t overlap with bibliometrics or compute, and the overlap with economics is rather small (2 percentage points); and (ii) a categorical OLS regression with the benchmarks category as baseline returns negative coefficients for the other three, with all p-values <0.01.

[7]

This is the observed interquartile range of the data, not a confidence interval on the mean or median.

[8]

As indicated in their track record page as of Jun 5, 2023.

[9]

I fit the models `log_score ~ horizon` and `log_score ~ horizon + C(category)` using Python’s statsmodels OLS method. The 95% interval for the coefficient of `horizon` is [-0.002, 0.002] in both.
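A sketch of how the two model specifications could be fit with the statsmodels formula API. The data frame here is synthetic: the column values, the horizon units (days) and the category labels are assumptions made for illustration, so the printed intervals say nothing about the tournament itself.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; NOT the tournament's questions or scores.
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "horizon": rng.integers(30, 720, size=n),  # assumed: days from close to resolution
    "category": rng.choice(
        ["economic", "bibliometric", "compute", "benchmarks"], size=n
    ),
    "log_score": rng.normal(loc=0.7, scale=1.2, size=n),
})

# The two specifications named in the footnote.
simple = smf.ols("log_score ~ horizon", data=df).fit()
with_category = smf.ols("log_score ~ horizon + C(category)", data=df).fit()

# 95% confidence intervals for the horizon coefficient in each model.
ci_simple = simple.conf_int(alpha=0.05).loc["horizon"]
ci_with_category = with_category.conf_int(alpha=0.05).loc["horizon"]
print(ci_simple.tolist(), ci_with_category.tolist())
```

An interval that straddles zero, like the one reported in the footnote, means the fit cannot distinguish the horizon effect from no effect at this sample size.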

[10]

See how the curves in figures 1 and 3 here tap out at <30 – although the metric of success is not a log score, so I’m not sure how much this applies to our case. Another line of evidence comes from this claim by Manifold Markets; they say their calibration page only includes markets with at least 15 traders, “which is where we tend to find an increased number of traders doesn't significantly impact calibration”.

[11]

I fitted the logistic regression `outcome ~ Bernoulli(inverse_logit(log odds))`. The inverse of the slope in this model can be interpreted as a measure of overconfidence – if it’s >1, it means the forecasts are extremized with respect to the true probability. I found this number was 0.483 (95% bootstrap CI: [0.145, 1.001]), consistent with good calibration but suggestive of underconfidence since most of the interval is <1.
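To make the "inverse of the slope" reading concrete, here is a minimal standard-library sketch. It fits a logistic calibration line with an intercept `a` and slope `b` by Newton's method (the intercept is an assumption; the footnote does not spell out the full specification). The inputs are idealised outcome frequencies for hypothetical forecasts rather than real 0/1 resolutions, and no bootstrap interval is computed.

```python
from math import exp, log

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def logit(p):
    return log(p / (1.0 - p))

def fit_calibration_line(forecast_probs, outcomes, steps=25):
    """Fit outcome ~ Bernoulli(sigmoid(a + b * logit(forecast))) by Newton's method."""
    xs = [logit(p) for p in forecast_probs]
    a, b = 0.0, 1.0
    for _ in range(steps):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, outcomes):
            mu = sigmoid(a + b * x)
            w = mu * (1.0 - mu)
            g_a += y - mu          # gradient with respect to the intercept
            g_b += (y - mu) * x    # gradient with respect to the slope
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

# Hypothetical underconfident forecaster: the true log odds are twice the stated ones.
forecasts = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
outcome_freqs = [sigmoid(2.0 * logit(p)) for p in forecasts]

intercept, slope = fit_calibration_line(forecasts, outcome_freqs)
inverse_slope = 1.0 / slope
print(round(intercept, 6), round(inverse_slope, 6))
```

Here the inverse slope comes out at about 0.5, i.e. below 1, which matches the footnote's reading of underconfidence: the forecasts would need to be pushed further from 50% to match the outcome frequencies.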
