elifland's Shortform

by elifland, 16th Jun 2020, 8 comments

I wrote a draft outline on bottlenecks to more impactful crowd forecasting that I decided to share in its current form rather than clean up into a post.

Link

Summary:

  1. I have some intuition that crowd forecasting could be a useful tool for important decisions like cause prioritization but feel uncertain
  2. I’m not aware of many example success stories of crowd forecasts impacting important decisions, so I define a simple framework for how crowd forecasts could be impactful:
    1. Organizations and individuals (stakeholders) making important decisions are willing to use crowd forecasting to help inform decision making
    2. Forecasting questions are written such that their forecasts will affect the important decisions of stakeholders
    3. The forecasts are good + well-reasoned enough that they are actually useful and trustworthy for stakeholders
  3. I discuss 3 bottlenecks to success stories and possible solutions:
    1. Creating the important questions
    2. Incentivizing time spent on important questions
    3. Incentivizing forecasters to collaborate

I really enjoyed your outline, thank you! I have a few questions/notes: 

  1. [Bottlenecks] You suggest "Organizations and individuals (stakeholders) making important decisions are willing to use crowd forecasting to help inform decision making" as a crucial step in the "story" of crowd forecasting’s success (the "pathway to impact"?) --- this seems very true to me. But then you write "I doubt this is the main bottleneck right now but it may be in the future" (and don't really return to this). 
    1. Could you explain your reasoning here? My intuition was that important decision-makers' willingness (and institutional ability) to use forecasting info would be a major bottleneck. (You listed Rethink Priorities and Open Phil as examples of institutions that "seem excited about using crowd forecasts to inform important decisions," but my understanding was that their behavior was the exception, not the rule.)
    2. If, say, the CDC (or important people there, etc.) were interested in using Metaculus to inform their decision-making, do you think they would be unable to do so due to a lack of interest (among forecasters) and/or a lack of relevant forecasting questions? (But then, could they not suggest questions they felt were relevant to their decisions?) Or do you think that the quality of answers they would get (or the amount of faith they would be able to put into those answers) wouldn't be sufficient? 
  2. [Separate, minor confusion] You say: "Forecasts are impactful to the extent that they affect important decisions," and then you suggest examples a-d ("from an EA perspective") that range from career decisions or what seem like personal donation choices to widely applicable questions like "Should AI alignment researchers be preparing more for a world with shorter or longer timelines?" and "What actions should we recommend the US government take to minimize pandemic risk?" This makes me confused about the space (or range) of decisions and decision-makers that you are considering here. 
    1. Are you viewing group forecasting initiatives as a solution to personal life choices? (Or is the "I" in a/b a very generalized "I" somehow?)
    2. I'd guess that an EA perspective on the possible impact of crowd forecasting should focus on decision-makers with large impacts whether or not they are EA-aligned (e.g. governmental institutions), but I may be very wrong. 
  3. [Side note] I loved the section "Idea for question creation process: double crux creation," and in general the range of possible solutions that you list, and really hope that people try these out or study them more. (I also think you identify other really important bottlenecks.)

Please note that I have no real relevant background (and am neither a forecast stakeholder nor a proper forecaster).

Hi Lizka, thanks for your feedback! I think it touched on some of the sections that I'm most unsure about / that could most use revision, which is great!

  1. [Bottlenecks] You suggest "Organizations and individuals (stakeholders) making important decisions are willing to use crowd forecasting to help inform decision making" as a crucial step in the "story" of crowd forecasting’s success (the "pathway to impact"?) --- this seems very true to me. But then you write "I doubt this is the main bottleneck right now but it may be in the future" (and don't really return to this).

I'll say up front it's possible I'm just wrong about the importance of the bottleneck here, and I think it also interacts with the other bottlenecks in a tricky way. E.g. if there were a clearer pipeline for creating important questions which get very high quality crowd forecasts which then affect decisions, more organizations would be interested. 

That being said, my intuition that this is not the bottleneck comes from some personal experiences I've had with forecasts solicited by orgs that already are interested in using crowd forecasts to inform decision making. Speaking from the perspective of a forecaster, I personally wouldn't have trusted the forecasts produced as an input into important decisions. 

Some examples: [Disclaimer: These are my personal impressions. Creating impactful questions and incentivizing forecaster effort is really hard, and I respect OP/RP/Metaculus a lot for giving it a shot, and would love to be proven wrong about the impact of current initiatives like these]

  1. The Open Philanthropy/Metaculus Forecasting AI Progress Tournament is the most well-funded initiative I know of [ETA: potentially besides those contracting Good Judgment superforecasters], but my best guess is that the forecasts resulting from it will not be impactful. An example is the "deep learning" longest time horizon round, where despite Metaculus' best efforts most questions have no or few comments, and at least to me it felt like the bulk of the forecasting skill was forming a continuous distribution from trend extrapolation. See also this question where the community failed to update appropriately on record-breaking scores. Also note that each question attracted only 25-35 forecasters.
  2. I feel less sure about this, but RP's animal welfare questions authored by Neil Dullaghan seem to have most of their comments written by Neil himself. I feel intuitively skeptical that most of the 25-45 forecasters per question are doing more than skimming and making minor adjustments to the current community forecast, and this feels like an area where getting up to speed on domain knowledge is important for accurate forecasts.

So my argument is: given that AFAIK we haven't had consistent success using crowd forecasts to help institutions make important decisions, the main bottleneck seems to be helping the interested institutions rather than getting more institutions interested. 

If, say, the CDC (or important people there, etc.) were interested in using Metaculus to inform their decision-making, do you think they would be unable to do so due to a lack of interest (among forecasters) and/or a lack of relevant forecasting questions? (But then, could they not tell suggest questions they felt were relevant to their decisions?) Or do you think that the quality of answers they would get (or the amount of faith they would be able to put into those answers) wouldn't be sufficient?

[Caveat: I don't feel too qualified to opine on this point since I'm not a stakeholder nor have I interviewed any, but I'll give my best guess.]

I think for the CDC example:

  1. Creating impactful questions seems relatively easier here than in e.g. the AI safety domain, though it still may be non-trivial to identify and operationalize cruxes for which predictions would actually lead to different decisions.
  2. On average, I'd expect the forecasts to be a bit better than CDC models / domain experts, and perhaps substantially better on tail risks. I don't think we have a lot of evidence here; we have some from Metaculus tournaments, but with a small sample size.
    1. I think with better incentives to allocate more forecaster effort to this project, it's possible the forecasts could be much better.

Overall, I'd expect decent-but-not-great forecasts on good-but-not-great questions, and I think this isn't really enough to move the needle, so to speak. I also think reasoning would need to be given behind the forecasts for stakeholders to understand them, and trust in crowd forecasts would need to be built up over time.

Part of the reason it seems tricky to have impactful forecasts is that often there are competing people/"camps" with different world models, and a person with whom the crowd forecast disagrees may be reluctant to change their mind unless (a) the question is well targeted at cruxes of the disagreement and (b) they have built up trust in the forecasters and their reasoning process. To the extent this is true within the CDC, the harder it seems for forecasting questions to be impactful.

2. [Separate, minor confusion] You say: "Forecasts are impactful to the extent that they affect important decisions," and then you suggest examples a-d ("from an EA perspective") that range from career decisions or what seem like personal donation choices to widely applicable questions like "Should AI alignment researchers be preparing more for a world with shorter or longer timelines?" and "What actions should we recommend the US government take to minimize pandemic risk?" This makes me confused about the space (or range) of decisions and decision-makers that you are considering here. 

Yeah I think this is basically right, I will edit the draft.

  1. [Side note] I loved the section "Idea for question creation process: double crux creation," and in general the range of possible solutions that you list, and really hope that people try these out or study them more. (I also think you identify other really important bottlenecks.)

I hope so too, appreciate it!

Speaking from the perspective of a forecaster, I personally wouldn't have trusted the forecasts produced as an input into important decisions. 

Fwiw, I expect to very often use forecasts as an input into important decisions, but I also usually see them as a somewhat/very crappy input. I just also think that, for many questions that are key to my decisions or to the decisions of stakeholders I seek to influence, most or all of the available inputs are (by themselves) somewhat/very crappy, and so often the best I can do is: 

  1. try to gather up a bunch of disparate crappy inputs with different weaknesses
  2. try to figure out how much weight to give each
  3. see how much that converges on a single coherent picture and if so what picture

(See also consilience.)
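The gather-weigh-converge process above can be sketched as a precision-weighted combination of probability estimates in log-odds space. This is only an illustrative sketch: the input values, weights, and the choice of log-odds pooling are my assumptions, not anything from the comment itself.

```python
import math

def logit(p: float) -> float:
    """Map a probability to log-odds."""
    return math.log(p / (1 - p))

def inv_logit(x: float) -> float:
    """Map log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))

def combine(estimates, weights):
    """Weighted average of probability estimates, pooled in log-odds space.

    `estimates` are disparate (crappy) probability inputs;
    `weights` encode how much trust to give each one.
    """
    total = sum(weights)
    mean = sum(w * logit(p) for p, w in zip(estimates, weights)) / total
    return inv_logit(mean)

# Three hypothetical inputs: a crowd forecast, a base rate, a domain model
estimates = [0.70, 0.55, 0.80]
weights = [2.0, 1.0, 1.0]   # trust the crowd forecast twice as much

combined = combine(estimates, weights)
spread = max(estimates) - min(estimates)  # rough "did the inputs converge?" check
```

A small `spread` suggests the inputs converge on one coherent picture; a large one suggests the combined number hides real disagreement and deserves less weight.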

(I really appreciated your draft outline and left a bunch of comments there. Just jumping in here with one small point.)

I liked this document quite a bit, and I think it would be a reasonable Forum post even without further cleanup — you could basically copy over this Shortform, minus the bit about not cleaning it up. This lets the post be tagged, be visible to more people, etc. (Though I understand if you'd rather leave it in a less-trafficked area.)

Appreciate the compliment. I am interested in making it a Forum post, but might want to do some more editing/cleanup or writing over the next few weeks/months (it got more interest than I was expecting, so that seems more likely to be worth it now). Might also post as is; will think about it more soon.

The efforts by https://1daysooner.org/ to use human challenge trials to speed up vaccine development make me think about the potential of advocacy for "human challenge" type experiments in other domains where consequentialists might conclude there hasn't been enough "ethically questionable" randomized experimentation on humans. 2 examples come to mind:

My impression of the nutrition field is that it's very hard to get causal evidence because people won't change their diet at random for an experiment.

Why We Sleep has been a very influential book, but the sleep science research it draws upon is usually observational and/or relies on short time-spans. Alexey Guzey's critique and self-experiment have both cast doubt on its conclusions to some extent.

Getting 1,000 people to sign up and randomly contracting 500 of them to do X for a year, where X is something like being vegan or sleeping for 6.5 hours per day, could be valuable.
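The assignment step described above is just simple randomization. A minimal sketch, assuming a list of 1,000 signed-up participants (the participant labels and fixed seed are illustrative):

```python
import random

def assign_groups(participants, treatment_size, seed=0):
    """Randomly split participants into treatment and control groups."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible/auditable
    shuffled = participants[:]  # copy so the sign-up list is left untouched
    rng.shuffle(shuffled)
    treatment = shuffled[:treatment_size]
    control = shuffled[treatment_size:]
    return treatment, control

# 1,000 sign-ups; 500 are contracted to do X (e.g. a vegan diet) for a year
participants = [f"p{i}" for i in range(1000)]
treatment, control = assign_groups(participants, 500)
```

In practice one would likely stratify (by age, baseline diet, etc.) rather than use a single unstratified shuffle, but the core idea is the same.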

Challenge trials face resistance for very valid historical reasons; this podcast has a good summary: https://80000hours.org/podcast/episodes/marc-lipsitch-winning-or-losing-against-covid19-and-epidemiology/