Can GPT-3 predict real world events? To answer this question I had GPT-3 predict the likelihood for every binary question ever resolved on Metaculus.
Predicting whether an event is likely or unlikely to occur, often boils down to using common sense. It doesn't take a genius to figure out that "Will the sun explode tomorrow?" should get a low probability. Not all questions are that easy, but for many questions common sense can bring us surprisingly far.
Experimental setup
Through their API I downloaded every binary question posed on Metaculus.
I then filtered them down to only the non-ambiguously resolved questions, resulting in this list of 788 questions.
For these questions the community's Mean Squared Error was 0.19, a good deal better than random!
Prompt engineering
GPT's performance is notoriously dependent on the prompt it is given.
- I primarily measured the quality of prompts, on the percentage of legible predictions made.
- Predictions were made using the most powerful DaVinci engine.
The best performing prompt was optimized for brevity and did not include the question's full description.
A very knowledgable and epistemically modest analyst gives the following events a likelihood of occuring:
Event: Will the cost of sequencing a human genome fall below $500 by mid 2016?
Likelihood: 43%Event: Will Russia invade Ukrainian territory in 2022?
Likelihood: 64%Event: Will the US rejoin the Iran Nuclear Deal before 2023?
Likelihood: 55%Event: <Question to be predicted>
Likelihood: <GPT-3 insertion>
I tried many variations, different introductions, different questions, different probabilities, including/excluding question descriptions, etc.
Of the 786 questions, the best performing prompt made legible predictions for 770. For the remaining 16 questions GPT mostly just wrote "\n".
If you want to try your own prompt or reproduce the results, the code to do so can be found in this Github repository.
Results
GPT-3's MSE was 0.33, which is about what you'd expect if you were to guess completely at random. This was surprising to me! GPT Why isn't GPT better?
Going into this, I was confident GPT would do better than random. After all many of the questions it was asked to predict, resolved before GPT-3 was even trained. There's probably some of the questions it knows the answer to and still somehow gets wrong!
It seems to me that GPT-3 is struggling to translate beliefs into probabilities. Even if it understands that the sun exploding tomorrow is unlikely, it doesn't know how to formulate that using numeric probabilities. I'm unsure if this is an inherent limitation of GPT-3 or whether its just the prompt that is confusing it.
I wonder if predicting using expressions such as "Likely" | "Uncertain" | "Unlikely", and interpreting these as 75% | 50% | 25% respectively could produce results better than random, as GPT wouldn't have to struggle with translating its beliefs into numeric probabilities. Unfortunately running GPT-3's best engine on 800 questions would be yet another hour and $20 I'm reluctant to spend, so for now that will remain a mystery.
It may be that even oracle AI's will be dangerous, fortunately GPT-3 is far from an oracle!
(Crossposted from lesswrong: https://www.lesswrong.com/posts/c3cQgBN3v2Cxpe2kc/getting-gpt-3-to-predict-metaculus-questions)
What do you think would occur if you added in the 1st or 2nd most upvoted, recent comments in the GPT-3 description, following the question?
I think it might make the difference on some questions with high forecaster volume, but might detract from the accuracy on questions with lower forecaster volume.
If the comments include a prediction my guess is that GPT would often make the same prediction and thus become much more accurate. Not because it learned to predict things but because there's probably a strong correlation between the community prediction and the most upvoted comments prediction.
If the goal is to give GPT more context than just the title of the question, then you could include the descriptions for each question as well, but when I tried this I got worse results (fewer legible predictions).