
Summary

In this post I describe a hypothetical site that aggregates a given forecaster’s performance across multiple public platforms in one place. I detail what I expect the technical hurdles to be in implementing such a site. I justify these costs by arguing that the site would help foster an efficient culture of learning within the forecasting community, accelerating the improvement of forecasters and their techniques. I also detail the current performance metrics displayed on four popular platforms and find them to be vastly different.

What I’m Proposing

Imagine a minimum viable "LinkedIn" for forecasting. At a glance, a profile there could answer questions like…

  • What has an individual or team forecasted?
  • How accurate have their forecasts been? How about relative to the crowd?
  • What platforms do they forecast on?
  • What topics do they tend to forecast?
  • Have they won any prizes or ranked highly in any leaderboards or tournaments?

There are currently multiple popular forecasting platforms, and they all surface answers to some subset of the above questions in their own way. A user may not even have the same username across multiple platforms, so I believe the current state of the art is for a prolific forecaster to either link to all of their independent profiles from their own personal site/blog or to summarize their results themselves and manually update them.

A single site could allow a user to link their various accounts and concisely display their performance across platforms such that it can be continually updated, either via API access or web scraping.
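As a rough illustration of what such a linked profile might hold, here is a minimal Python sketch. Everything in it is hypothetical: the class names (`PlatformAccount`, `ForecasterProfile`), the fields, and the choice to keep platform-specific metrics in a free-form dictionary are assumptions for illustration, not a description of any existing site or API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PlatformAccount:
    """One linked account on one platform (hypothetical structure)."""
    platform: str                      # e.g. "Metaculus"
    username: str                      # may differ from platform to platform
    verified: bool = False             # True once ownership has been confirmed
    questions_forecasted: Optional[int] = None
    # Platforms expose very different metrics, so keep them as-is per platform.
    metrics: dict = field(default_factory=dict)


@dataclass
class ForecasterProfile:
    """An aggregated profile that links accounts across platforms."""
    display_name: str
    accounts: list = field(default_factory=list)

    def link(self, account: PlatformAccount) -> None:
        self.accounts.append(account)

    def summary(self) -> dict:
        """Summarize verified accounts only; unverified links are ignored."""
        verified = [a for a in self.accounts if a.verified]
        return {
            "platforms": sorted(a.platform for a in verified),
            "questions_forecasted": sum(a.questions_forecasted or 0 for a in verified),
            "metrics": {a.platform: a.metrics for a in verified},
        }
```

Keeping each platform's metrics separate, rather than forcing them into one combined score, reflects how differently the platforms currently report performance.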

Technical Challenges

I expect there to be some barriers to implementing this that I’m not qualified to fully assess:

  • Account verification. How do we know that the accounts to be aggregated actually belong to the same user? Can they post a particular verification code in their profile on those platforms, or link back to this specific overview profile, and this then gets scraped? Is easier verification possible, for example on sites that support Google credentials?
  • How feasible is the API access / web scraping? I believe Metaculus and Manifold have APIs, but do they surface all of the profile metrics? I don’t believe Good Judgment Open or INFER have APIs, so how fragile would a web scraping application be? That is, if either platform changes the way it displays metrics, it could break the link to this new site. Could partnerships be made with these sites to mitigate these risks?
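The verification-code idea from the first bullet could be sketched roughly as follows. This is a minimal illustration under stated assumptions: the function names, the `fv-` prefix, and the use of an HMAC over the profile and account identifiers are all hypothetical choices, and fetching the profile text (via an API or scraping) is deliberately left out.

```python
import hashlib
import hmac


def verification_code(site_secret: bytes, profile_id: str, platform: str, username: str) -> str:
    """Derive a short code tied to one (profile, platform, username) triple.

    Using an HMAC means the aggregator can recompute the code on demand
    instead of storing it. All names here are hypothetical.
    """
    message = f"{profile_id}|{platform}|{username}".encode("utf-8")
    digest = hmac.new(site_secret, message, hashlib.sha256).hexdigest()
    return f"fv-{digest[:16]}"


def bio_proves_ownership(bio_text: str, expected_code: str) -> bool:
    """Return True if the scraped profile text contains the expected code."""
    return expected_code in bio_text
```

A user would paste the code into their profile on the platform, and the aggregator would later fetch that profile and call `bio_proves_ownership` on its text.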

Why Do This?

Obviously such a site would cost resources to create and maintain. These metrics are already being collected, so why should we care about aggregating them?

I view rigorous judgmental forecasting as a recently developed art. We know, first from academic research and now from these online platforms, that usefully impressive human performance is possible. We even know some predictors of it and some common attributes of accurate forecasts and forecasters.

But how much better can forecasters get? What are the limits of human judgmental forecasting performance and what techniques get us there? Which domains or question types are most amenable to what techniques? These questions can be answered by academia rigorously, but at what feels to me like a glacial pace. Athletes and coaches don’t wait for double blind studies to confirm which strategies or equipment they should use. Instead, they live in a culture of continual experimentation where they’re surrounded by evidence to evaluate and infer causes from. Critically, they can evaluate which teams and players are top performing thanks to clear scoring and statistics.

I believe that lowering the barrier to accessing a forecaster’s performance metrics (along with other interventions that I will continue to describe on this blog) can help foster a similar culture within the forecasting world, where individuals and teams can better learn from each other. When someone shares a resource on a topic or advice on how to structure a forecast, being able to evaluate their track record at a glance minimizes the friction of weighing the value of that information as you consume it. Making performance more public also increases the incentive to perform well and to explore the limits of current techniques.

With the advent of open, online forecasting platforms, I strongly believe that the most powerful lever on advancing the art of forecasting is to foster communities of open experimentation and collaboration, and making performance as clear and accessible as possible seems fundamental to this. By aggregating these metrics in one place, you also open the door to have a new API that makes surfacing the most salient information in other locations (like Discord servers or on forums) much easier.

The Current State of Platform Metrics

All of this depends on what metrics are being surfaced by open forecasting platforms in the first place. In writing this post I was surprised by how massively varied the current state is. I expect this to be a subject of continual improvement for these platforms, but I wanted to capture the current state here for posterity. Even now, I believe this kind of aggregation would have strong benefits, and the diversity between the scoring metrics may even make this case stronger.

I only list the metrics/information relevant to forecasting directly, not question writing. I’ve also omitted visual badges/achievements as they represent things already captured in the other scores, but these are typical across the platforms. This list is accurate as of April 1st, 2023.

Metaculus

  • Level (I believe this is just a function of accumulated points)
  • Number of predictions, across how many questions, and how many of those are resolved
  • Number of comments across how many questions
  • List of tournaments and projects
  • Users also appear on an overall leaderboard of points (more forecasts and more correct forecasts than the crowd = more points), and individual tournament leaderboards
  • Notably, I don’t see anything like a Brier score or objective scoring anywhere.

Manifold Markets

  • Trading profits, balance, and portfolio value in Mana (Manifold’s play currency)
  • Calibration plot with grade ("C+") and "score" (numerical, but not Brier)

Good Judgment Open

  • Overall Brier score
  • Number of questions forecasted, and how many of those have been scored
  • Upvotes
  • Calendar of forecasting activity

INFER

  • Relative Brier score
    • This can also be displayed over time as a plot
    • This can also be filtered by question, topics, or year
  • Number of questions forecasted, and how many have been scored
  • Number of forecasts
  • Number of upvotes
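
Since several of the metrics above are Brier scores, here is a minimal sketch of the common binary form: the mean squared difference between the forecast probability and the outcome (1 if the event happened, 0 if not). The function name and input format are assumptions for illustration. Individual platforms may use other variants (for example, a multi-category form summed over all answer options), so this is not a claim about how any platform computes its displayed score.

```python
def brier_score(forecasts):
    """Mean squared error between forecast probabilities and binary outcomes.

    `forecasts` is a list of (probability, outcome) pairs, where outcome is
    1 if the event happened and 0 otherwise. Lower is better; 0.0 is perfect.
    """
    if not forecasts:
        raise ValueError("need at least one resolved forecast")
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)


# A 50% forecast always scores 0.25 on a binary question, whatever the outcome.
print(brier_score([(0.5, 1)]))  # 0.25
```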
Comments


I like this idea :-)

I think that there are some tricky questions about comparing across different forecasters and their predictions. If you simply take Brier score, this can be Goodharted: people can choose the "easiest" questions and get way better scores than the ones taking on difficult questions.

I can think of some attempts to go at this:

  • Ranking forecasters:
    • For two forecasters, they get ranked according to their Brier scores on questions they have both forecasted on. I fear that this will lead to cyclical rankings, which could be dealt with using the Smith set or Hodge decomposition.
    • Forecasters are ranked according to their performance relative to all other forecasters on each question (making easier questions less impactful on a forecaster's score).
  • I'd like to look into credibility theory to see whether it has some insights into ranking with different sample sizes since IMDb uses it for ranking movies.

I agree with your concerns on using a pure Brier score with open platforms. I expect that currently it makes the most sense within "tournaments" where participants are answering every question. Technically, I think some sort of objective, proper scoring rule is a prerequisite to a more advanced scoring system that conveys more useful information in open contexts.

I've seen some sort of "relative Brier score" referenced frequently in associated research (definitely in the Good Judgment Project papers, at a minimum) that scored forecasters based on the difficulty of each question, as determined by the performance of others who forecasted it. This seems promising, and I expect there are a lot of options in that direction.
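
One way to make the idea of scoring relative to question difficulty concrete is sketched below. The definition used here is an assumption for illustration only: on each question, take the forecaster's Brier score minus the median Brier score of everyone who forecasted that question, then average across the questions they answered. It is not necessarily the definition used in the research or by any platform, and the function name and input format are hypothetical.

```python
from statistics import median


def relative_brier(scores_by_question, forecaster):
    """Average of (own Brier score - median Brier score) over questions answered.

    `scores_by_question` maps question id -> {forecaster name: Brier score}.
    Negative values mean better than the typical forecaster on the same
    questions, so picking only easy questions earns no automatic advantage.
    """
    diffs = []
    for scores in scores_by_question.values():
        if forecaster not in scores:
            continue  # only compare on questions this forecaster answered
        diffs.append(scores[forecaster] - median(scores.values()))
    if not diffs:
        raise ValueError("forecaster has no scored questions")
    return sum(diffs) / len(diffs)
```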
