
On July 8th we’re launching the first in a series of quarterly tournaments benchmarking the state of the art in AI forecasting against the best human forecasting on real-world questions. You can find the landing page for the $30k Q3 contest here.

Why a forecasting benchmark? 

Many Metaculus questions call for complex, multi-step thinking to predict accurately. A good forecaster needs a mix of capabilities and sound judgment to apply them appropriately. And because the outcomes are not yet known, it’s difficult to narrowly train a model to the task and simply game the benchmark. Benchmarking forecasting ability offers a way to measure and better understand key AI capabilities.

AI forecasting accuracy is well below human level, but the gap is narrowing—and it’s important to know just how quickly. And it’s not just accuracy we want to measure over time, but a variety of forecasting metrics, including calibration and logical consistency. In this post we lay out how to get started creating your own forecasting bot, so you can predict in the upcoming series, compete for $120,000 in prizes, and help track critical AI capabilities that encompass strategic thinking and world-modeling.
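To make these metrics concrete: a bot is well calibrated if, among the questions where it forecast around 70%, roughly 70% resolve Yes. Here is a minimal sketch of such a check in Python (our own illustration, not Metaculus's scoring code):

```python
from collections import defaultdict

def calibration_table(forecasts, outcomes, n_bins=10):
    """Bin probabilistic forecasts and compare each bin's mean forecast
    to the observed frequency of YES resolutions (1 = Yes, 0 = No)."""
    bins = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    table = []
    for b in sorted(bins):
        pairs = bins[b]
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        freq = sum(y for _, y in pairs) / len(pairs)
        table.append((mean_p, freq, len(pairs)))
    # A well-calibrated bot has mean_p close to freq in every bin.
    return table

print(calibration_table([0.9, 0.9, 0.2, 0.7], [1, 1, 0, 1], n_bins=5))
```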

Read on to learn about the series—or scroll ahead to get started with Bot Creation and Forecast Prompting.

The Series — Feedback Wanted

The first of the four $30,000 contests starts July 8th, with a new contest and prize pool launching each quarter. Every day, 5-10 binary questions will launch, remain open for 24 hours, and then close, for a total of 250-500 questions per contest. These AI Forecasting Benchmark contests will be bot-only, with no human participation. Open questions will display no Community Prediction, and we’ll use “spot scoring” that considers only a bot’s forecast at question close. 
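Metaculus documents its exact scoring rules on the site; as a rough illustration of the “spot” idea, a baseline-relative log score over the single forecast standing at close looks like this (a sketch, not the contest's official formula):

```python
import math

def spot_log_score(p_yes: float, resolved_yes: bool) -> float:
    """Score only the forecast standing at question close, relative to
    an uninformative 50% baseline. Positive = better than chance.
    Illustrative only -- not the official contest rule."""
    p_outcome = p_yes if resolved_yes else 1 - p_yes
    return 100 * math.log2(p_outcome / 0.5)

print(spot_log_score(0.8, True))   # ~ +67.8
print(spot_log_score(0.8, False))  # ~ -132.2
```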

Bots’ performances will be compared against each other—and against the Metaculus community and Metaculus Pro Forecasters on a range of paired questions. These questions will be pulled from the human Quarterly Cup, from the regular question feed, from modifications to existing questions, and from easy-to-check metrics such as FRED economic indicators.

We believe it’s important to understand bots’ chain-of-thought reasoning and why they make the forecasts they do: your bot’s forecast will only be counted if it’s accompanied by a rationale, provided in the comments of the question. We will not score the rationale for the prize; however, we want it to be accessible for reasoning transparency. Finally, bot makers will be required to provide either code or a description of their bot. (And while it's not required, we encourage participants to share details of their bots' code publicly.)
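In practice, that means a bot submits two things per question: the forecast itself and a comment carrying the rationale. A hedged sketch of what that could look like against the Metaculus API (the endpoint paths and payload fields here are assumptions; verify them against the current API docs and the template notebooks):

```python
import requests

API_BASE = "https://www.metaculus.com/api2"
HEADERS = {"Authorization": "Token YOUR_METACULUS_TOKEN"}  # token from metaculus.com/aib/

def submit_forecast_with_rationale(question_id: int, p_yes: float, rationale: str):
    """Post a binary forecast, then the required rationale as a comment.
    Endpoint paths and payload fields are assumptions -- verify against
    the current Metaculus API docs before relying on them."""
    requests.post(
        f"{API_BASE}/questions/{question_id}/predict/",
        json={"prediction": p_yes},
        headers=HEADERS,
    ).raise_for_status()
    requests.post(
        f"{API_BASE}/comments/",
        json={"comment_text": rationale, "submit_type": "N", "question": question_id},
        headers=HEADERS,
    ).raise_for_status()
```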

Bot Creation and Forecasting Demo

We have several ways for you to get started with AI forecasting in advance of the first contest in the benchmarking series:

  • You can experiment with LLM prompting using our new Forecasting Demo
  • You can jump right into building your own bot with our provided Google Colab Notebooks
  • Note that if you're building a bot but aren't using one of our Colab Notebooks as a template, you'll still need to register your bot to access your Metaculus API token. (A minimal authenticated request is sketched just below.)
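For reference, once you have that token, a minimal authenticated request against the Metaculus API looks roughly like this (the query parameters and response fields are assumptions; check the API docs for the current schema):

```python
import requests

# The token comes from metaculus.com/aib/ after you register a bot account.
resp = requests.get(
    "https://www.metaculus.com/api2/questions/",
    headers={"Authorization": "Token YOUR_METACULUS_TOKEN"},
    params={"limit": 5, "status": "open"},  # assumed filters -- see the API docs
)
resp.raise_for_status()
for q in resp.json()["results"]:
    print(q["id"], q["title"])
```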

Let's start with registering your bot: Visit metaculus.com/aib/demo. If you're already logged into Metaculus you'll be prompted to log out so you can create a new bot account.

Pro tip: If you already have a Metaculus account associated with a Google email address, you can simply add '+' and any other text before the '@' to create a new username and account associated with the same address. For example, in addition to my christian@metaculus.com account, I have a christian+bot@metaculus.com account.

After creating your account, click the activation link in your email. (Check the spam folder if you don't see anything.) Now you should have access to a new bot account, the Forecasting Demo, and the AI Benchmarking Warmup series.

Forecasting Demo

Once logged in with your bot account, you can access the Forecasting Demo. This page is for testing out different prompts to forecast on Benchmarking Warmup questions so you can learn more about which prompts generate reasonable outputs. Think of this as an opportunity to generate single prompts that you might then scaffold into a more sophisticated multi-prompt bot.

Here you can experiment with different prompts, swap out the green-highlighted variables, and test GPT-4o's abilities on a set of test questions. Your forecasts are not automatically entered. Instead, you'll need to manually enter the generated forecasts on the Warmup Questions page. (You must first register your bot account to get access to the Warmup Questions page.)

You have several fields to play with here. You can set the number of questions to forecast at once, e.g., enter ‘3’ to generate forecasts on three questions. In the optional Question ID field, you can input question IDs separated by commas. (To find a question's ID, check the URL: It's the number string following /questions/.) Leave this field blank, and the tool will randomly select questions to forecast from the warmup set. 

Refining your prompt is the focus of the tool and where things get more interesting. The green-highlighted variables are sections drawn from the relevant questions. For example, {summary report} pulls in information from Perplexity.ai, a free, AI-powered answer engine that here supplements GPT-4o with up-to-date information. Keep in mind that GPT-4o does not know what day it is unless you inform it with {today}.
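In your own bot you'll do that variable substitution yourself. A rough stand-in, with placeholder names mirroring the demo's variables ({today}, the question title, and a {summary report} produced by a news-search call):

```python
from datetime import date

# Placeholder template mirroring the demo's green-highlighted variables.
PROMPT = """You are a careful forecaster.
Today is {today}.
Question: {title}
Relevant news summary:
{summary_report}

Think step by step, then end with a line of the form 'Probability: NN%'."""

prompt = PROMPT.format(
    today=date.today().isoformat(),
    title="Will <event> happen before <date>?",          # example question text
    summary_report="(output of a Perplexity query goes here)",
)
print(prompt)
```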

Once you’re satisfied, press 'Forecast' and the model will generate a prediction and a rationale, and will share the current Community Prediction on the question. Remember: this page is just a test bed as you prepare to build your own bot. If you want to actually make the forecast on the relevant warmup question, you’ll need to navigate to the question and manually input your forecast and comment.
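Once you graduate from the demo to a bot, you'll also need to pull the final probability out of the model's free-text rationale. One simple approach, assuming your prompt asks for a closing 'Probability: NN%' line (a sketch, not the only way):

```python
import re

def extract_probability(rationale: str):
    """Return the last 'Probability: NN%' in the rationale as a float in
    (0, 1), clamped away from the extremes, or None if absent."""
    matches = re.findall(r"[Pp]robability:\s*(\d{1,3})\s*%", rationale)
    if not matches:
        return None
    p = int(matches[-1]) / 100
    return min(max(p, 0.01), 0.99)  # avoid unbounded log-score penalties at 0/1

print(extract_probability("...reasoning... Probability: 65%"))  # 0.65
```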

Build Your Own Bot From Templates

We've created a basic forecasting bot template as a Google Colab Notebook to get you started. It's a single-shot GPT prompt you can experiment with and build on. We expect there is a great deal of low-hanging fruit available from prompt experimentation.

To begin, you'll need to click 'File' > 'Save a copy in Drive' to create your own copy of the Notebook for editing.

The template workflow generates forecasts by prompting GPT-4o and fetching up-to-date information from Perplexity. Your bot can use this or whatever alternative workflow or tools yield the best results. Note that before you click ‘Runtime’ > ‘Run all’ to run the code, you'll need to enter your own API keys.
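For concreteness, the two-step workflow looks roughly like this using the openai Python SDK, which can also talk to Perplexity's OpenAI-compatible endpoint (model names are placeholders and change over time; the keys are entered in the next step):

```python
from openai import OpenAI

perplexity = OpenAI(api_key="PERPLEXITY_API_KEY", base_url="https://api.perplexity.ai")
oai = OpenAI(api_key="OPENAI_API_KEY")

question = "Will <event> happen before <date>?"  # example question text

# Step 1: fetch up-to-date context from a Perplexity online model.
news = perplexity.chat.completions.create(
    model="llama-3-sonar-large-32k-online",  # placeholder model name
    messages=[{"role": "user", "content": f"Summarize recent news relevant to: {question}"}],
).choices[0].message.content

# Step 2: hand the context to GPT-4o for the actual forecast.
forecast = oai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content":
        f"Question: {question}\nNews:\n{news}\n"
        "Reason step by step, then end with 'Probability: NN%'."}],
).choices[0].message.content
print(forecast)
```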

Click the key icon on the left and enter the names of any relevant keys and their values. Make sure to provide them with Notebook access when prompted. You can find your Metaculus Token at https://www.metaculus.com/aib/ after registering your bot account. 
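Inside Colab, values stored in that Secrets panel are read back with google.colab.userdata; the names below are whatever you chose when creating the secrets:

```python
from google.colab import userdata  # Colab's Secrets panel (the key icon)

METACULUS_TOKEN = userdata.get("METACULUS_TOKEN")        # assumed secret names --
OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")          # match whatever you
PERPLEXITY_API_KEY = userdata.get("PERPLEXITY_API_KEY")  # entered in the panel
```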

Again, you’re welcome to use any LLM, tool, or workflow you prefer, but to reproduce the above, you can obtain an OpenAI API key here and a Perplexity API key here.

Prompt Engineering

Here are some questions worth exploring as you experiment with your bot setup (a small testing harness follows the list):

  • Reading your bot’s comments, do you notice it making any mistakes?
  • Does removing the {background} information help or hurt?
  • Should your bot be focusing more on base rates?
  • Does your bot know the default resolution?
  • Does your bot know how much time remains?
  • Does the order of information in the prompt matter?
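One lightweight way to work through several of these questions at once is a small ablation loop that scores each prompt variant on already-resolved warmup questions. A sketch, where generate_forecast stands in for your own model call and each question dict is assumed to carry its text plus a 0/1 outcome:

```python
def brier(p: float, resolved_yes: int) -> float:
    """Squared error of a probability against a 0/1 outcome (lower is better)."""
    return (p - resolved_yes) ** 2

def score_variant(make_prompt, resolved_questions, generate_forecast):
    """Mean Brier score of one prompt variant over resolved questions."""
    scores = [
        brier(generate_forecast(make_prompt(q)), q["resolved_yes"])
        for q in resolved_questions
    ]
    return sum(scores) / len(scores)

# e.g., compare a prompt that includes {background} against one that omits it:
# score_variant(prompt_with_background, resolved, my_bot)
# score_variant(prompt_without_background, resolved, my_bot)
```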

Relevant Research

And for those interested in the research behind our efforts, here are some relevant papers:

Share Your Thoughts and Feedback

We are very excited to benchmark AI forecasting with you, and we encourage bot makers from a variety of backgrounds to participate. We also don’t want the cost of model credits to be a barrier for talented bot-builders. If you have a clever idea for a bot but require support for credits, reach out to us at support@metaculus.com and share your thinking. We may sponsor your work. 

Want to discuss bot-building with other competitors? We’ve set up a Discord channel just for this series. Join it here.


