Hide table of contents

tl;dr^2: We could continuously forecast the value of small projects, choose the most valuable and informative ones, carry them out, and then make better forecasts and choose better projects with each round. I have a small experiment which does that. If this sounds like something you might want to participate in, sign up here

tl;dr: I conducted an experiment to forecast the impact of small altruistic projects. Forecasters attempted to predict upvotes, because I initially thought that they would correlate more robustly with impact, and because upvotes are appealingly simple as a metric, and thus easy to forecast. Forecasters proved somewhat able to produce predictions which discriminated between projects with more upvotes, and projects with less upvotes, but were generally too optimistic. And, in hindsight, upvotes aren't that great a metric for impact, so, going forward, I'd probably have forecasters predict a scoring rubric, i.e., an aggregate of different metrics, graded by more than one judge. Nonetheless, I think that the forecasting pipeline in this experiment is interesting, and might help the EA community more systematically identify valuable small projects to carry out, in contrast with today, where individuals or EA groups carry out projects more idiosyncratically. Because the experiment was very underpowered, I'm looking for volunteer participants to expand it. 

Index

  • Introduction
  • Experiment design
  • Projects carried out, predictions and their results
  • Observations about participants
  • Upvotes maybe not a good measure of impact
  • Going forward

Introduction

I report the outcomes of a forecasting experiment in which 10 forecasters predicted the value of 20 potential altruistic projects. I then carried out 8, of which I made 5 public–an additional project developed into my Summer Research Fellowship at FHI.

Overall, the experiment was underpowered (n=10 forecasters, n=5 published projects, n=135 predictions (246 including bots, more on that below)). Nonetheless, I think that the idea is promising, and I elaborate on some interesting properties of the prediction setup, the problems in the EA community it may solve, and why I'm personally excited about it. I'm considering scaling it up somewhat; thoughts and ideas are welcome.

If you'd be interested in participating in a scaled-up version of this experiment or in later forecasting experiments, fill out this form.

Previous literature on the EA forum:

Thanks to Andis Draguns, Jacob Lagerros, Ray, Ozzie Gooen, Misha Yagudin, Cadillion, Gavin Leech, Datscilly and Holomanga for taking part in this project as forecasters; congratulations to Gavin Leech for overall being the most accurate forecaster. Thanks to Ozzie Gooen et. al. for creating foretold, to Jaime Sevilla for providing feedback on ideas, and to Jacob Lagerros for encouragement and funding. Thanks to various people on the EA Editing and Review Facebook group for feedback. 

Experiment design.

The setup of the experiment was as follows:

  • I produced a list of potential altruistic projects, by distilling other lists of projects, lists of lists, research agendas, and by coming up with projects myself. This original list had 75 potential projects. Since then, I've kept accumulating projects and project lists.
  • From these original 75, I selected 26 on the basis of being a good personal fit, particularly promising, short in duration, and in general suitable for this experiment.
  • A more senior friend gave feedback on the original list of ideas, after which I pruned the list to 20.
  • Using the infrastructure built by foretold.io, I created a community to predict the value of each project, operationalized as how many upvotes it would gather on the Effective Altruism Forum or on LessWrong (more on that below).
  • Monetary rewards were scaled to how much forecasters beat a prior I created, individually and as a team.
  • After the value of the projects was predicted, I intended to choose the top two, two chosen randomly, and two chosen however I wished, and implement them.

This setup can be understood as having two steps, similar to those in Babble and Prune (a sequence on LW):

  • Babble. Use a weak and local filter to (perhaps even randomly) generate a lot of possibilities.
  • Prune. Use a strong and global filter to test for the best, or at least a satisfactory, choice.

The forecasts for this experiment were made in foretold, a prediction platform geared towards performing experiments, which made this experiment easier. In particular, it was more convenient than sharing thoughts using Google Docs, or Google Sheets. Besides reducing the hassle in carrying out the experiment, it is possible that foretold may have acted as a tool for thought, such that it existing may make it easier to come up with experiments like this, in the same way that Guesstimate makes the thought "I'll conduct a Monte Carlo simulation to quantify my uncertainty" easier to think.

Scoring rule

(Some familiarity with scoring rules is assumed in this section. See: Brier score and Proper scoring rule)

Forecasters labored under a collaborative scoring rule I created, inspired by Shapley Values, such that I first produced a prior by instantiating different perspectives (see below), and each forecaster was rewarded in proportion to:

1/2 * (The information they added over that prior, excluding other forecasters) + 1/2 * (The information other forecasters added over that prior).

Note that this scoring rule is both roughly proper and solves all incentive problems in Incentive Problems in Current Forecasting Tournaments.

In this particular case, "information added" is defined in terms of the Kullback–Leibler divergence between a prediction and the resolution. However, the spirit might be easier to understand (and compute) in terms of the Brier score. That is, the scoring rule could have been

1/2 * (The forecaster's Brier score - the Brier score of the prior) + 1/2 * (The aggregate without the forecaster's Brier score - the Brier score of the prior).

In practice, when converting the score into a monetary reward, a constant would have to be added so that the reward is never less than $0, but I didn't think about that beforehand.

I also had a payout for insightful comments, such that the nth most upvoted comment would get $36*(2/3)^(n-1). Total payouts were $259 (or an average payout of $25.9 per forecaster), of which around $100 were given out for comments, which was a larger proportion than I expected. One forecaster didn’t receive a payout because he didn’t make a prediction on any project which was then carried out. 

Projects carried out, predictions and their results

Note that the projects were not chosen so as to maximize impact, but rather as to maximize information about whether their value could be predicted.

We observe that:

  1. The experiment is underpowered; there are only five projects, hardly enough to be confident in conclusions
  2. Predictions do show some ability to discriminate between more valuable and not so valuable projects.
  3. If one looks at the distributions, they broadly have a similar shape. This can be explained by the use of bots to generate a prior (more of which in the next section), which made the aggregate sticky.
  4. Predictions turned out to be over-optimistic. This can also partially be explained by the choice of a sticky prior, which used an out of distribution historical prior (before the start of the experiment I’d only written up posts which were chosen because I thought they were likely to be particularly valuable, as opposed to random posts for this experiment.) Subjectively, however, forecasters didn’t correct enough for that.

Project 1: Write up further thoughts on Shapley values.

Output: Here are the further thoughts. I provide some further thoughts after my first post on Shapley values, including a procedure for allowing two philanthropic funders with slightly different values to share the burden of funding interventions they value differently, and an impossibility theorem for value attribution.

Outcome: Nothing much happened, but I do keep the idea of Shapley values in my head and occasionally use it (e.g., when thinking about designing better incentives for forecasters.)

Predictions:

  • Expected value: 56
  • Median: 51
  • Highest likelihood value: 38
  • Centered 50% confidence interval: 33 to 74
  • Centered 95% confidence interval: 11 to 111

Actual upvotes after one month: 31

Project 2: Identify previous examples of technological projects with clear long-term goals, and then produce estimates of the time required to achieve those goals to varying degrees.

Output: Here is a LW post about this. This then gave me intuitions about technological progress, but led me to realize that what I actually wanted was something more systematic, so I ended up writing a post on A prior for technological discontinuities, which I think was significantly more valuable.

Outcome: Better personal intuitions about technological progress, a rough prior for technological discontinuities.

Predictions:

  • Expected upvotes: 51
  • Median: 46
  • Highest likelihood value: 32
  • Centered 50% confidence interval: 28 to 68
  • Centered 95% confidence interval: 9 to 106

Actual upvotes after one month: 20 for the first post, 49 for the second one. I'm taking 49 as the resolution.

Project 3: Investigate international supply chain accountability as cause X.

Output: Here is an EA forum post.

Outcome: Unfortunately I posted this on April Fools' day together with other "New Top EA Cause Area" posts, and it didn't get taken too seriously. But I do still think that this cause could potentially use many millions of dollars per year from the EA community.

Predictions:

  • Expected upvotes: 40
  • Median: 34
  • Highest likelihood value: 5
  • Centered 50% confidence interval: 15 to 57
  • Centered 95% confidence interval: 4 to 91

Actual upvotes after one month: 22

Project 4: Look into EA literature.

Output: I looked into some past examples of literature which might have influenced the world in some way, available as a comment here. Originally this was part of a longer piece, which I ended up not posting.

Outcome: I became marginally more enlightened about the role of literature in history.

Predictions:

  • Expected upvotes: 42
  • Median: 30
  • Highest likelihood value: 15
  • Centered 50% confidence interval: 17 to 60
  • Centered 95% confidence interval: 6 to 102

Actual upvotes after one month: 20

Project 5: Review two books on survey making.

Output: Here is the review.

Outcome: It spread the knowledge among some people that I was available to ask questions about survey-making. In particular, some people later reached out to me for help with surveys for their projects, and my help might have been valuable.

Predictions:

  • Expected upvotes: 42
  • Median: 35, Highest likelihood value: 17
  • Centered 50% confidence interval: 19 to 59
  • Centered 95% confidence interval: 7 to 92

Actual upvotes after one month: 27

Other projects

There were three further projects which I carried out but which for various reasons didn't end up posting publicly: Some historical research was too sensitive, and I made two suggestions privately by email rather than by writing a post publicly. 

A further idea was my proposal for my Summer Research Fellowship at FHI, though it changed upon execution. The predictions for that project were:

  • Expected upvotes: 50
  • Median: 45, Highest likelihood value: 21
  • Centered 50% confidence interval: 24 to 70
  • Centered 95% confidence interval: 10 to 103

I take these data-points as further evidence that this setup is interesting or worth it; arguably a major take-away for this project is “a fairly simple forecasting system is able to produce a project which gets accepted to the FHI summer fellowship.” Because the program got ~300 applications, but only 27 participants were accepted, this puts this forecasting setup on the top 9% of applicants in terms of some fuzzy “optimization power” (though this is a simplification, because the project proposal was probably one of many factors.)

Observations about participants

The value which the forecasters provided was distributed like a power-law, with the top few forecasters providing most of the value. For example, the most upvoted forecaster received 43% of upvotes. Comments by forecasters were fairly valuable, for example, some pointed to previous similar efforts, to possible research directions, or gave caveats and warnings.

I also recruited some participants from the LessWrong Slack group. One of them, who is a regular there and had made valuable comments in the past, turned out to behave like a troll and made somewhat unpleasant or unproductive comments (e.g., writing “I was looking at a tree outside and this is what it said” as a comment rationale), which in the end I decided not to censor. They also blatantly attempted to manipulate the market by inputting high predictions for the projects they personally wanted to be carried out, rather than making honest predictions.

As well as giving my all-things-considered forecast, I also forecasted using different perspectives, using foretold's bot functionality, and this defined a prior against which forecasters were compared. I created bots for the following perspectives:

  • The Historical Extrapolator. Just taking the upvote distribution of my previous posts. In hindsight, this was too high, and I would have done better considering the distribution of all posts, or of all posts with authors with more than three posts, etc.
  • The Unrepentant Insider. A sometimes optimistic inside view.
  • The Bent Cynic. Is very cynical; channels depression.
  • The Unimpressed Augur. Foresees middling success.
  • The Equalizer. Uniform distribution between 1 and 100.
  • The All-Father. An aggregate which weights the above equally.

It was disappointing, but not surprising, to see that the Bent Cynic, the part of me which I associate with being depressed, had the best score not only among all perspectives, but also among all participants. My interpretation is that this perspective is able to see through social fictions and sympathetic lies, which improves its accuracy. But other explanations are possible, such as forecasters giving too much weight to an out of distribution historical base rate.

Additionally, the presence of a "historical extrapolator", "unrepentant insider" and "uniform distribution" bots made the aggregate predictions be overly optimistic, and the presence of many bots made the aggregate slightly sticky; i.e., each individual prediction couldn’t change the aggregate all that much. 

Upvotes maybe not a good measure of impact

Initially, I thought that forecasting popularity on the Forum or LW would be a good enough proxy for the projects’ value, if perhaps far from perfect. I'd ideally want something like an efficient market in impact certificates populated by trustworthy altruists instead.

Here are some factors that might be reducing the correlation between upvotes and value, based on my own judgement and some light data analysis. Due to those limitations, in addition to forecasts of popularity, I also ended up paying attention to the comments under the forecasts, to whether a project could cause harm, and to personal taste when deciding which projects to carry out.

However, despite those limitations, the number of upvotes does discriminate between more and less valuable projects, to a certain extent. For example, I took 25 randomly selected EA forum posts posted during July, and rated them on a 1-10 scale according to how valuable I thought they were, and the correlation between that and upvotes had an R^2 = 0.2715 (subjectively, a small/medium correlation).

From where I'm standing now, one could have forecasted a rubric of measures, possibly decided by a group of trusted judges after the project is completed, in a way similar to what Charity Entrepreneurship does, or what 80k used to do. Alternatively, one could have tried to compare the value of each project to e.g., the value of a QALY, or to a set of previously completed projects (and forecast said value beforehand). Further, in hindsight it would have been more informative to forecast the value of a project per unit of time, rather than the total value.

Going forward.

I think that a setup like the above could develop into something more widely useful, though this first proof of concept was very under-powered and maybe not that informative. One reason I'm excited about this is that successive prediction → implementation cycles could each bring their own improvement:

  • Lessons in forecasting gained from forecasting the value of projects can be used to better forecast the value of projects. This would be both by detecting better forecasters based on their track record, and by giving forecasters more data with which to work. For example, the right historical base-rate could be determined. This would correspond to better "pruning" of babble.
  • Lessons in how valuable projects are (after completing them and seeing how they turn out) may lead one to suggest better projects. This would correspond to better babble.
  • So overall, you can create a loop: Project suggestions → Forecasts → Projects are implemented → Feedback on the EA forum → Better forecasts & better project suggestions.

Note that the loop is recursive, but it doesn't necessarily have to keep increasing forever. For example, there might be a ceiling to how good forecasts can be.

Further:

  • Such a process could be left continuously running; there could be a community in foretold always taking suggestions for new projects, always forecasting their value, and always implementing them.
  • One might scale this pretty arbitrarily (though gradually). This experiment starts out pretty small, but I could imagine it generalizing it to one local group, and then to many EA local groups.
  • On the ambitious side, I could imagine this just becoming a mainstream way to allocate projects to idle altruists.
  • Eventually, with enough data in one place, one might more systematically learn what kinds of small projects are valuable, and have that knowledge in one place so that it can be made actionable, rather than distributed across the EA community as it is now. One might even be able to pose the problem as a multi-armed bandit problem, where different types of projects are the different arms of the bandit.

Overall, if it is the case that EA is vetting constrained, a prediction pipeline like the one in this experiment could be created to solve this problem by identifying both projects with high expected value and individuals who can carry them out. Note that many kinds of people could be used, not just forecasters:

  • People to suggest projects.
  • People to come up with a rubric which captures the factors which make a project valuable.
  • People to check that projects do no harm.
  • People to share intuitions about whether projects would be valuable or not, even if they can't put them into probabilities.
  • People to forecast the value of projects.
  • Finding people to carry out projects.
  • People to carry out projects.
  • People to evaluate the value of projects after they have been completed.
  • ...

In fact, the most time-consuming part of this project was not designing the experiment or doing the forecasting, but actually carrying the projects out. Nonetheless, it is not clear whether forecasting is currently cheap enough to be scalable.

If you would be interested in participating in a scaled-up version of this experiment (as a project-implementer, forecaster, etc.), or in later forecasting experiments, fill out this form. I'm unsure about whether to scale this experiment, but if I do, I expect both recruitment to be a bottleneck and this first proposal to be read by more people than any subsequent announcement, hence the link now.

 


 

Conflict of interest: I've worked in the past as a paid contractor for foretold/Ozzie Gooen. Jacob Lagerros provided 2/3rds of the payout funding.

Comments3
Sorted by Click to highlight new comments since: Today at 2:25 PM

This was interesting! 

I do think that the forecasters seem to have been starting from a bad base rate (rather than using "sympathetic lies"). Relatively few posts about original topics (rather than e.g. org updates) hit 50 karma, and comments do so even less often. But the posts and comments we are most likely to read are unusually likely to be the high-karma ones; I wouldn't be surprised if people tend to overestimate their frequency.

As one extra data point: When I look at the EA Forum profile of users whose Forum posts are relatively well-known, I'm often quite surprised:

  • That the typical karma on their posts seems notably lower than what I'd have guessed
  • That they wrote a bunch of posts I didn't know about (which tend to have notably less karma than the posts of theirs which I do know about)

This seems pretty consistent with the explanation Aaron provides.

(Also, I too found this an interesting post!)

Then, there is Buck.