
Summary

  1. I have an intuition that crowd forecasting could be a useful tool for important decisions like cause prioritization and AI strategy.
  2. I’m not aware of many success stories of crowd forecasts impacting important decisions, so I define a simple framework for how crowd forecasts could be impactful:
    1. Forecasting questions get written such that their forecasts will affect important decisions.
    2. The forecasts are accurate and trustworthy enough.
    3. Organizations and individuals (stakeholders) making important decisions are willing to use crowd forecasts to inform decision making.
  3. I give illustrations of issues with achieving (1) and (2) above.
  4. I discuss three bottlenecks to such success stories and possible solutions:
    1. Creating the important questions
    2. Incentivizing time spent on important questions
    3. Incentivizing forecasters to collaborate

Background

I’ve been forecasting at least semi-actively on Metaculus since March 2020, and on GJO and Foretell since mid-2020. I have an intuition that crowd forecasting could be a useful tool for important decisions like cause prioritization and AI strategy, but I haven’t seen many success stories of it being used on these important questions. When I say crowd forecasting, I’m referring mostly to platforms like Metaculus, Foretell, and Good Judgment, but prediction markets like Kalshi and PredictIt share many similar features.

I feel very uncertain about many high-level questions like AI strategy and find it hard to understand how others seem comparatively confident. I’ve been reflecting recently on what would need to happen to bridge the gap in my mind between “crowd forecasting seems like it could be useful” and “crowd forecasting is useful for informing important decisions”. This post represents my best guesses on some ways to bridge that gap.

Success story

Framework

I’ll define the primary impact of forecasting based on how actionable the forecasts are: forecasts are impactful to the extent that they directly affect important decisions. This is the pathway discussed as part of the improving institutional decision making cause area. Note that there are other impact pathways, but I believe this is the one with the most expected impact.

When I say “important decisions”, I’m mostly thinking of important decisions from an EA perspective. Some examples:

  1. What causes should 80,000 Hours recommend as priority paths for people to work in?
  2. How should Open Philanthropy allocate grants within global health & development?
  3. Should AI alignment researchers be preparing more for a world with fast or slow takeoff?
  4. What can the US government do to minimize risk of a catastrophic pandemic?

Crowd forecasting’s success story can then be broken down into:

  1. Forecasting questions get written such that their forecasts will affect important decisions.
    1. Bottleneck here: Creating important questions.
  2. The forecasts are accurate and trustworthy enough.
    1. Bottlenecks here: Incentivizing time spent on important questions and Incentivizing forecasters to collaborate.
  3. Institutions and individuals (stakeholders) making important decisions are willing to use crowd forecasts to inform decision making.
    1. I’m skeptical that this is presently the primary bottleneck in the EA community, because I’ve observed cases where stakeholders seemed interested in using crowd forecasting but useful forecasts didn’t seem to be produced. Many reasonable people seem to disagree, though. See the illustration of issues for more.
    2. Either way, I’ll focus on the bottlenecks in (1) and (2), as these are where I have the most experience.

Examples of successes

Potential positive examples of the success story described above thus far include (but aren’t limited to):

  1. Metaculus El Paso COVID predictions being useful to the El Paso government (according to this Forbes article).
  2. Metaculus tournament supporting the decision-making of the Virginia Department of Health (h/t Charles Dillon). See video presentation.
  3. Metaculus + GJO COVID predictions being sent to CDC, see e.g. Forecast: the Impacts of Vaccines and Variants on the US COVID Trajectory.
    1. Though I’m not sure to what extent this actually affected the CDC’s decisions.

Illustration of issues

My impression is that even when organizations are enthusiastic about using crowd forecasting to inform important decisions, it’s difficult for them to get high-quality, action-relevant forecasts. Speaking as a forecaster, I personally wouldn’t have trusted the forecasts produced from the following questions as a substantial input into important decisions.

A few examples:

[Disclaimer: These are my personal impressions. Creating impactful questions and incentivizing forecaster effort are really hard, and I respect OP/RP/Metaculus a lot for giving it a shot. I would love to be proven wrong about the impact of current initiatives like these.]

  1. The Open Philanthropy/Metaculus Forecasting AI Progress Tournament is the most well-funded initiative I know of, potentially besides those contracting Good Judgment superforecasters, but my best guess is that the forecasts resulting from it will not be impactful. An example is the "deep learning" longest-time-horizon round: despite Metaculus’ best efforts, most questions have few or no comments, and at least to me it felt like the bulk of the forecasting skill lay in forming a continuous distribution from trend extrapolation. See also this question, where the community failed to update appropriately on record-breaking scores. Also note that each question attracted only 25-35 forecasters.
  2. Rethink Priorities’ animal welfare questions authored by Neil Dullaghan seem to have most of their comments written by Neil himself. I’m intuitively skeptical that most of the 25-45 forecasters per question are doing more than skimming and making minor adjustments to the current community forecast, and this feels like an area where getting up to speed on domain knowledge is important for accurate forecasts.

Bottlenecks

Note: Each bottleneck could warrant a post of its own; I’ll aim to give my rough thoughts on each one here.

Creating the important questions

In general, I think it’s just really hard to create questions on which crowd forecasting will be impactful.

Heuristics for impactful questions

Some heuristics that in my view correlate with more impactful questions:

  1. Targets an area of disagreement (“crux”) that, if resolved, would have a clear path to affecting actions.
  2. Targets an important decision from an EA lens.
    1. i.e. in an EA cause area, or cause prioritization.
  3. All else equal, near-term is better than long-term.
    1. It’s much harder to incentivize good predictions on long-term questions.
    2. Quicker resolution can help us determine more quickly which world models are correct (e.g. faster vs. slower AI timelines).
      1. Similarly, near-term questions provide quicker feedback to forecasters on how to improve and to stakeholders on which forecasters are more accurate.
  4. In a domain where generalists can do well relatively quickly.
    1. If a question requires someone to do anywhere close to the work of a report from scratch to make a coherent initial forecast, it’s unrealistic to expect this to happen.
  5. Requires open-ended/creative reasoning rather than formulaic reasoning.
    1. We already have useful automatic forecasts for questions for which we have lots of applicable past data. Crowd-based judgmental forecasts are likely more useful in cases requiring more qualitative reasoning, e.g. when we have little past data, good reason to expect trends to break, or the right reference class is unclear. (For contrast, a sketch of a purely formulaic forecast follows this list.)
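To make that contrast concrete, here is a minimal sketch (my own illustration, with invented numbers, not tied to any platform or question) of what a purely formulaic forecast looks like: fit a linear trend to past data and extrapolate with a crude uncertainty band.

```python
# A self-contained sketch of a "formulaic" forecast: linear trend
# extrapolation with a crude predictive interval. All numbers are invented.
import numpy as np

# Hypothetical yearly benchmark scores (illustrative only).
years = np.array([2016, 2017, 2018, 2019, 2020, 2021], dtype=float)
scores = np.array([60.1, 64.8, 69.5, 73.9, 78.2, 82.0])

# Ordinary least-squares linear trend.
slope, intercept = np.polyfit(years, scores, deg=1)

# Residual spread as a crude stand-in for forecast uncertainty;
# this ignores parameter uncertainty and possible trend breaks.
residuals = scores - (slope * years + intercept)
sigma = residuals.std(ddof=2)

# Extrapolate to a future resolution date and report a rough 90% interval.
target_year = 2023
point = slope * target_year + intercept
low, high = point - 1.645 * sigma, point + 1.645 * sigma
print(f"Point forecast for {target_year}: {point:.1f} "
      f"(rough 90% interval: {low:.1f} to {high:.1f})")
```

Judgmental crowd forecasting is most valuable precisely where a recipe like this breaks down: sparse data, likely trend breaks, or an unclear reference class.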

Failure modes

The following are some failure modes I’ve observed with some existing questions that haven’t felt impactful to me.

  1. Too much trend extrapolation resulting in questions that are hard to imagine being actionable.
    1. See e.g. this complaint regarding Forecasting AI Progress.
  2. Not targeting near-term areas of disagreement.
    1. It’s helpful to target near-term areas of disagreement to determine which camps’ world models are more accurate.
    2. When questions aren’t framed around a disagreement between people’s models, it’s often hard to know how to update on results.
  3. Too long-term / abstract to feel like you can trust without associated reasoning (e.g. AI timelines without associated reasoning).
  4. Requires too much domain expertise / time.
    1. I mostly conceptualize crowd forecasting as an aggregator of opinions given existing research on important questions, since ideally there is rigorous research going on in the area independently of crowd forecasting. (A minimal aggregation sketch is given below.)
      1. Example: People can read about Ajeya’s and Tom’s AI forecasts and aggregate these plus other considerations into an AI timeline prediction.
      2. But if a question requires someone to do anywhere close to the work of a report from scratch, it’s unrealistic to expect this to happen.
    2. My sense is that for a lot of questions on Metaculus, forecasters don’t really have enough expertise to know what they’re doing.

See this related article: An estimate of the value of Metaculus questions.
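To illustrate the "aggregator of opinions" framing in failure mode 4, here is a minimal sketch, with made-up numbers, of two common ways to combine individual probability forecasts: the median and the geometric mean of odds. Nothing here is specific to any platform.

```python
# A sketch of aggregating individual probability forecasts (made-up numbers).
from statistics import median
from math import prod

def geometric_mean_of_odds(probs: list[float]) -> float:
    """Combine probabilities by taking the geometric mean of their odds."""
    odds = [p / (1 - p) for p in probs]
    agg_odds = prod(odds) ** (1 / len(odds))
    return agg_odds / (1 + agg_odds)

# Hypothetical individual forecasts on some binary question.
forecasts = [0.10, 0.25, 0.30, 0.45, 0.70]

print(f"Median:                 {median(forecasts):.2f}")
print(f"Geometric mean of odds: {geometric_mean_of_odds(forecasts):.2f}")
```

The aggregation step itself is cheap; the expensive part is the underlying research each forecaster draws on.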

Solution ideas

  1. Experiment with best practices for creating impactful questions. Some ideas:
    1. Idea for question creation process: double crux creation.
      1. E.g. I get someone whose TAI median is 2030 and someone whose TAI median is >2100, and try to find double cruxes that are as near-term as possible for question creation.
      2. This gives signal on whose world model is more correct, and gets the crowd’s opinion on as concrete a thing as possible.
      3. Related: decomposing questions into multiple smaller questions to find cruxes.
        1. Foretell has done this with stakeholders to predict the future of the DoD-Silicon Valley relationship (see report).
    2. Early warning signs for e.g. pandemics, nukes.
    3. Orgs decompose their decision making, then release either all questions or some sub-questions to crowd forecasting.
      1. For big questions, some sub-questions will be a better fit for crowd forecasting than others; it’s good for orgs to have an explicit plan for how forecasts will be incorporated into decision-making.
        1. This may already be happening to some extent, but transparency seems good and motivating for forecasters.
      2. Metaculus Causes like Feeding Humanity, in collaboration with GFI, are a great step in the right direction in terms of explicitly aiding orgs’ decision-making.
        1. Ideally there’d be more transparency about which questions will affect GFI’s decision-making in which ways.
      3. Someone should translate the most important cruxes discovered by MTAIR into forecasting questions.
    4. Include text in the question explaining why the question matters and what disagreement exists, as in Countervalue Detonations by the US by 2050?
    5. More research into conditional questions like the Possible Worlds Series.
  2. Write up best practices for creating impactful questions.
    1. https://www.metaculus.com/question-writing/ is a good start, but I’d love to see a longer version with more focus on the “Explain why your question is interesting or important” section, specifically “Explain what decisions will be affected differently by your question”.
  3. Incentivize impactful question creation e.g. via leaderboards or money.

Incentivizing time spent on important questions

Failure modes

Some questions are orders of magnitude more important than others but usually don’t get nearly orders of magnitude more effort on crowd forecasting platforms. Effort doesn’t track impact well since the incentives aren’t aligned.

Some more details about the incentive issues, including my experience with them:

  1. Metaculus incentive issue: there’s an incentive to spend a little time on lots of questions, but really contributing to the conversation on important questions may require hours if not days of research.
    1. I spent several months predicting and updating on every Metaculus question resolving within the next 1.5 years to move up the sweet Metaculus leaderboard.
      1. On reflection, I realized I should do deep dives into fewer, more impactful questions rather than speed-forecast on many questions.
    2. Important questions like Neil’s (action-relevant for Rethink Priorities) often don’t get the extra attention they deserve.
  2. Incentive issue on most platforms: performance on all questions is weighted approximately equally, despite some questions seeming much more impactful than others.

Alignment Problems With Current Forecasting Platforms contains relevant content.

Solution ideas

Some unordered ideas to help with the incentives here:

  1. Give higher weight on leaderboards or in cash prizes to more important questions (a minimal weighted-scoring sketch follows this list).
    1. Note that there could be disagreement over which questions are important, which could be mitigated by (h/t Michael Aird for this point):
      1. Being transparent about the process for deciding question importance.
      2. Separate leaderboards for specific tournaments and/or cause areas.
      3. Within tournaments, could give higher weight to some questions.
    2. Important questions could be in part determined by people rating questions based on how cruxy they seem.
  2. Metaculus is already moving in the right direction with the prize-backed Forecasting Causes (e.g. Feeding Humanity), the AI progress tournament, and an EA question author.
    1. I’d like to see things continue to move in this direction and similar efforts on other platforms.
  3. Foretell itself has a narrower focus and goal than Metaculus and GJOpen, focusing the whole site on an important area.
  4. Good Judgment model: organizations pay for forecasts on particular questions.
    1. This has strengths and weaknesses: willingness to pay is a proxy for impact, but ideally there would also be a focus on questions for which good forecasts are a public good.
  5. Note that we could also just increase total forecaster time. I’m a little skeptical because:
    1. Shifting how forecasters allocate their effort feels more tractable on the margin, and feels like a bottleneck that should be addressed before scaling up.
    2. The pool of very good forecasters might not be that large.
    3. (but we should do some of both)

Ideally, we want to incentivize people to make detailed forecasts on important questions: for example, reading Ajeya’s draft timelines report, updating their AI timelines predictions based on it, and writing up their reasoning for others to see. Which brings me to…

Incentivizing forecasters to collaborate

Failure modes

Due to the competitive nature of forecasting platforms, forecasters are incentivized to not share reasoning that could help other forecasters do better. Some notes on this:

  1. My personal experience:
    1. I noticed that when I shared my reasoning, the community would trend toward my prediction, so it was better to stay silent.
    2. I like to think of myself as altruistic, but I’m also competitive; similar to how I got sucked in by the Metaculus leaderboard into predicting shallowly on many questions, I got sucked into not sharing my reasoning.
  2. An illustrative example: Metaculus AI Progress tournament:
    1. At first, few forecasters left any comments despite the importance of the topic; then, once 3 comments were required, forecasters generally left very brief, uninformative comments. See the questions for the final round here.
    2. See further discussion here: an important point is that this problem gets worse as monetary incentives are added, which may be needed as part of a solution to incentivizing time spent on important questions.
  3. Another example: in the Hypermind Arising Intelligence tournament, there was very little discussion of forecasters’ reasoning, or transparency about it, on the question pages.
  4. Note that incentivizing the sharing of reasoning is important both for improving collaboration (and thus forecast quality) and for getting stakeholders to trust forecasts, so solutions to this seem especially valuable.

See Alignment Problems With Current Forecasting Platforms for more on why collaboration isn’t incentivized currently.

Solution ideas

Some undeveloped solution ideas, again unordered:

  1. Collaborative scoring rules such as Shapley values (a rough sketch follows this list)
  2. Prizes for insightful comments
  3. More metrics taking into account comment upvotes
  4. Fortified essay prizes (which Metaculus is trying out)
  5. Prizes for comments which affect others’ forecasts
  6. Encourage forecasters to work in teams which compete against each other. Foretell has a teams feature.
  7. Reducing visibility of comments (h/t Michael Aird). Note that these options have the downsides of less legible expertise for forecasters and less quick collaboration.
    1. Only visible to non-forecasting stakeholders (question authors, tournament runners, etc.)
    2. Comments become visible X time after they’re made, e.g. 2 weeks later.
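As a rough illustration of the Shapley-value idea in solution 1, here is a toy sketch that credits each forecaster with their average marginal improvement to the crowd’s aggregate Brier score. This is my own formulation rather than anything a platform currently implements, and it brute-forces all coalitions, so it only scales to small groups.

```python
# A toy Shapley-style credit allocation for collaborative forecasting.
# Brute-forces all coalitions, so it is only practical for small groups.
from itertools import combinations
from math import factorial
from statistics import mean

def coalition_brier(probs: list[float], outcome: int) -> float:
    """Brier score of the simple mean of a coalition's probabilities."""
    return (mean(probs) - outcome) ** 2

def shapley_credit(forecasts: dict[str, float], outcome: int) -> dict[str, float]:
    """Average marginal reduction in aggregate Brier score credited to each forecaster."""
    players = list(forecasts)
    n = len(players)
    credit = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(len(others) + 1):
            for coalition in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # An empty coalition is treated as a maximally uncertain 0.5 forecast.
                without_p = [forecasts[q] for q in coalition] or [0.5]
                with_p = [forecasts[q] for q in coalition] + [forecasts[p]]
                improvement = coalition_brier(without_p, outcome) - coalition_brier(with_p, outcome)
                credit[p] += weight * improvement  # positive means p helped the aggregate
    return credit

# Made-up forecasts on a question that resolved "yes".
print(shapley_credit({"alice": 0.9, "bob": 0.6, "carol": 0.4}, outcome=1))
```

In a real deployment the marginal contribution would presumably be measured on forecasts made after seeing someone’s comment, which is what would reward sharing reasoning rather than just predicting well.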

Other impact pathways

In my current mental model, crowd forecasting will have something like a power-law distribution of impact, where most of its impact comes from affecting a relatively small number of important decisions. Identifying which questions have a large expected impact and producing such questions seems tractable. An alternative impact model is that crowd forecasting will “raise the sanity waterline”: its impact will be more diffuse and hard to measure, making many relatively low-impact decisions better.

Some reasons I’m skeptical of this alternative model:

  1. Some decisions seem so much more important than others.
  2. There’s a fairly large fixed cost to operationalizing a question well and getting a substantial number of reasonable forecasters to forecast on it. I’m not very optimistic about reducing this cost, which makes me skeptical of the diffuse, tons-of-questions model of impact.
  3. I have trouble thinking of plausible ways most Metaculus questions will impact decisions that are even low impact.

A few other impact models seem relevant but not the primary path to impact:

  1. Improving forecasters’ reasoning
  2. Vetting people for EA jobs/grants via forecasting performance

See also this forecasting impact pathways diagram.

Conclusion

I’m excited for further research to be done on using crowd forecasting to impact important decisions.

I’d love to see more work on both:

  1. Estimating the value of crowd forecasting.
  2. Testing approaches for improving the value of crowd forecasting. For example, it would be great if there were a field for studying the creation of impactful forecasting questions.

I’m especially excited about work testing out ways of integrating crowd forecasting with EA-aligned research/modeling, as Michael Aird has been doing with the Nuclear Risk Tournament.

Acknowledgements

This post is revised from an earlier draft outline which I linked to on my shortform. Thanks to Michael Aird, Misha Yagudin, Lizka, Jungwon Byun, Ozzie Gooen, Charles Dillon, Nuño Sempere, and Devin Kim for providing feedback.

Comments (2)



Thanks for writing this up; I think those are a bunch of good points.

Regarding “Incentivizing forecasters to collaborate”, I wonder how cheap a win it could be to make the comment upvoting on Metaculus more like our karma system. On Metaculus I don’t get notified about upvotes and I don’t have a visible overall upvote score.

As a side note, I personally already get more out of people upvoting my comments on Metaculus than out of my ranking; it feels much more like I directly contributed to the wisdom of the crowd with my hot-take analysis.

You might be interested in the "Most Likes" and "h-Index" metrics on MetaculusExtras, which does have a visible upvote score. (Although I agree it would be nice to have it on Metaculus proper.)
