
Can GPT-3 predict real-world events? To answer this question, I had GPT-3 predict the likelihood of every binary question ever resolved on Metaculus.

Predicting whether an event is likely or unlikely to occur often boils down to using common sense. It doesn't take a genius to figure out that "Will the sun explode tomorrow?" should get a low probability. Not all questions are that easy, but for many questions, common sense can bring us surprisingly far.

Experimental setup

Through their API, I downloaded every binary question posed on Metaculus.
I then filtered them down to only the non-ambiguously resolved questions, resulting in this list of 788 questions.
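For reference, here is a rough sketch of what that download-and-filter step might look like against the public Metaculus API. The query parameters and the resolution encoding (1 = YES, 0 = NO) are assumptions on my part; the repository linked below contains the actual code.

import requests

# Rough sketch, not the exact code from the repo: page through the Metaculus API
# and keep binary questions that resolved unambiguously YES (1) or NO (0).
url = "https://www.metaculus.com/api2/questions/?forecast_type=binary&status=resolved"
questions = []
while url:
    page = requests.get(url).json()
    questions += [q for q in page["results"] if q.get("resolution") in (0, 1)]
    url = page.get("next")
print(f"Kept {len(questions)} unambiguously resolved binary questions")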

For these questions, the community's mean squared error (MSE) was 0.19, a good deal better than random!
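For intuition on the baselines: always guessing 0.5 gives an MSE of exactly 0.25, while guessing uniformly at random gives roughly 0.33. A quick sanity check, illustrative only and not the actual Metaculus data:

import numpy as np

# Illustrative baselines for the mean squared error (Brier score) on 0/1 outcomes.
rng = np.random.default_rng(0)
outcomes = rng.integers(0, 2, size=100_000).astype(float)  # made-up resolutions

def mse(predictions, outcomes):
    return float(np.mean((predictions - outcomes) ** 2))

print(mse(np.full_like(outcomes, 0.5), outcomes))  # always 0.5     -> 0.25
print(mse(rng.random(outcomes.size), outcomes))    # uniform draws  -> ~0.33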

Prompt engineering

GPT's performance is notoriously dependent on the prompt it is given.

  • I primarily measured the quality of prompts by the percentage of legible predictions they produced.
  • Predictions were made using the most powerful DaVinci engine.

The best-performing prompt was optimized for brevity and did not include the question's full description.

A very knowledgable and epistemically modest analyst gives the following events a likelihood of occuring:

Event: Will the cost of sequencing a human genome fall below $500 by mid 2016? 
Likelihood: 43%

Event: Will Russia invade Ukrainian territory in 2022?
Likelihood: 64%

Event: Will the US rejoin the Iran Nuclear Deal before 2023?
Likelihood: 55%

Event: <Question to be predicted>
Likelihood: <GPT-3 insertion>

I tried many variations: different introductions, different questions, different probabilities, including/excluding question descriptions, etc.

Of the 786 questions, the best-performing prompt made legible predictions for 770. For the remaining 16 questions, GPT mostly just wrote "\n".
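For concreteness, a minimal sketch of what counting a prediction as "legible" could look like; the regex and function name are mine, not necessarily what the repository uses.

import re

def parse_likelihood(completion: str):
    """Return a probability parsed from a completion like " 37%", or None if illegible."""
    match = re.search(r"(\d{1,3})\s*%", completion)
    if match is None:
        return None  # e.g. the model only returned "\n"
    return min(int(match.group(1)), 100) / 100

print(parse_likelihood(" 64%"))  # 0.64
print(parse_likelihood("\n"))    # None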

If you want to try your own prompt or reproduce the results, the code to do so can be found in this Github repository.

Results

GPT-3's MSE was 0.33, which is about what you'd expect if you were to guess completely at random. This was surprising to me!

Why isn't GPT better?

Going into this, I was confident GPT would do better than random. After all, many of the questions it was asked to predict resolved before GPT-3 was even trained. There are probably some questions it knows the answer to and still somehow gets wrong!

It seems to me that GPT-3 is struggling to translate beliefs into probabilities. Even if it understands that the sun exploding tomorrow is unlikely, it doesn't know how to express that as a numeric probability. I'm unsure whether this is an inherent limitation of GPT-3 or whether it's just the prompt that is confusing it.

I wonder if predicting using expressions such as "Likely" | "Uncertain" | "Unlikely", and interpreting these as 75% | 50% | 25% respectively, could produce results better than random, as GPT wouldn't have to struggle with translating its beliefs into numeric probabilities. Unfortunately, running GPT-3's best engine on 800 questions would be yet another hour and $20 I'm reluctant to spend, so for now that will remain a mystery.
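A sketch of how that variant could be scored; the mapping and parsing are hypothetical, since I haven't run this.

# Hypothetical variant: ask GPT-3 for a word instead of a number,
# then map the word to a fixed probability.
WORD_TO_PROB = {"likely": 0.75, "uncertain": 0.50, "unlikely": 0.25}

def interpret(completion: str):
    words = completion.strip().split()
    if not words:
        return None
    return WORD_TO_PROB.get(words[0].strip(".,\"'").lower())  # None for anything else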

 

It may be that even oracle AIs will be dangerous; fortunately, GPT-3 is far from an oracle!

 

(Crossposted from lesswrong: https://www.lesswrong.com/posts/c3cQgBN3v2Cxpe2kc/getting-gpt-3-to-predict-metaculus-questions)

Comments (6)



Suggested variation, which I'd expect to lead to better results: use raw "completion probabilities" for different answers.

E.g. with the prompt "Will Russia invade Ukrainian territory in 2022?", extract the completion likelihoods of the next tokens "Yes" and "No", then normalize.
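A minimal sketch of that idea with the legacy OpenAI completions API; the prompt framing and the answer tokens are assumptions, and answers missing from the returned top logprobs would need extra handling.

import math
import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

# Ask for a single token and read off the log-probabilities of the top candidates.
response = openai.Completion.create(
    engine="text-davinci-002",
    prompt="Will Russia invade Ukrainian territory in 2022?\nAnswer:",
    max_tokens=1,
    temperature=0,
    logprobs=5,
)
top_logprobs = response["choices"][0]["logprobs"]["top_logprobs"][0]
p_yes = math.exp(top_logprobs.get(" Yes", float("-inf")))
p_no = math.exp(top_logprobs.get(" No", float("-inf")))
print(p_yes / (p_yes + p_no))  # normalize so the two answers sum to 1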

man you just blew my mind, will give it a try next time I feel an urge to play around with GPT!

I don't know much about AI or machine learning, but as you say, I think some of the reason for your results is that the language model of GPT-3 doesn't have a great "connection" between the "real world" "latent information" in your questions, and the probabilities you want. This deficiency is sort of what you're suggesting in your post, and I think you're right.

 

I think another major reason is sort of the prompt design or "mindset of use" of GPT-3. 

I guess I would sort of say it's useful to see it as a "paid actor". This is a case where it's useful to see GPT-3 as "acting" or trying to generate text that rationalizes a certain framing. 

It sort of tries to figure out from your prompt if it was writing a blog post, writing a joke, or having starker, more dramatic framing.

Once you get it into this "mindset", you actually get meaningful completions. 

 

Examples:

Completion 1:

This is not a joke, the above is an actual completion.

Completion 2:

I am not joking, again the above is an actual completion. I am not sure how it is so accurate, since GPT-3's training info is cut off at 2019. 

See the parameters below:

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

response = openai.Completion.create(
    engine="text-davinci-002",
    prompt="Event: Will the cost of sequencing a human genome fall below $500 by mid 2016? \nLikelihood: 43%\n\nEvent: Will Russia invade Ukrainian territory in 2022?\nLikelihood: 64%\n\nEvent: Will the US rejoin the Iran Nuclear Deal before 2023?\nLikelihood: 55%\n\nEvent: Will Charles He get banned from the EA forum in the next 30 days?\nLikelihood:",
    temperature=0.7,
    max_tokens=256,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)

Also, there are several comments on your prompt that you might have thought of before:

  • As you noted, your first "long" prompt is long, and in this case that impedes GPT-3's performance. In my own words, I would say it makes it harder for GPT to "construct" the framing involved. https://github.com/MperorM/gpt3-metaculus/blob/main/gpt_prompt.py
  • For your shorter prompt you ended up using, I think you might get different results by changing the questions to be similar to the domain,  or to "cover more domain space", or "loosen up the probability space".
    • Your questions are sort of "dry" and 2 of the 3 questions cover geopolitical issues. If you expanded this to be a little more "dramatic", or had the prompt "express skill", I think you would see different results.

More tips on prompt design are available from Andrew Mayne: https://andrewmayneblog.wordpress.com/

 

Temperature and other parameters matter a lot too. Related to this, I think you sort of have an "N of 1"? I need to think about this, but that might not give much information about GPT-3's performance.

What do you think would occur if you added in the 1st or 2nd most upvoted, recent comments in the GPT-3 description, following the question? 

I think it might make the difference on some questions with high forecaster volume, but might detract from the accuracy on questions with  lower forecaster volume. 

If the comments include a prediction, my guess is that GPT would often make the same prediction and thus become much more accurate. Not because it learned to predict things, but because there's probably a strong correlation between the community prediction and the most upvoted comment's prediction.

If the goal is to give GPT more context than just the title of the question, then you could include the descriptions for each question as well, but when I tried this I got worse results (fewer legible predictions).
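For what it's worth, a sketch of how such extra context could be spliced into the short prompt; the helper and its arguments are hypothetical.

# Hypothetical helper: append a question's description or a top comment to the prompt.
def build_prompt(few_shot_prefix: str, question_title: str, extra_context: str = "") -> str:
    prompt = few_shot_prefix
    if extra_context:
        prompt += f"Context: {extra_context.strip()}\n"
    prompt += f"Event: {question_title}\nLikelihood:"
    return prompt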
