elifland

Research Engineer at Ought. Interested in all things EA but especially cause prioritization, forecasting, and AI safety. More at https://www.elilifland.com/. You can give me anonymous feedback here.

Unless stated otherwise, all views are my own rather than my employer's.

Comments

elifland's Shortform

Really appreciate hearing your perspective!

On causal evidence of RCTs vs. observational data: I'm intuitively skeptical of this but the sources you linked seem interesting and worthwhile to think about more before setting an org up for this. (Edited to add:) Hearing your view already substantially updates mine, but I'd be really curious to hear more perspectives from others with lots of experience working on this type of stuff; if they agreed, I'd update more. If you have impressions of how much consensus there is on this question, that would be valuable too.

On nudging scientific incentives to focus on important questions rather than working on them ourselves: this seems pretty reasonable to me. I think building an app to do this still seems plausibly very valuable and I'm not sure how much I trust others to do it, but maybe we combine the ideas and build an app then nudge other scientists to use this app to do important studies.

elifland's Shortform

This all makes sense to me overall. I'm still excited about this idea (slightly less so than before) but I agree there should be careful consideration of which interventions make the most sense to test.

I think it's really telling that Google and Amazon don't have internal testing teams to study productivity/management techniques in isolation. In practice, I just don't think you learn that much, for the cost of it.

What these companies do do is allow different managers to try things out, survey them, and promote the seemingly best practices throughout. This happens very quickly. I'm sure we could make tools to make this process go much faster. (Better elicitation, better data collection of what already happens, lots of small estimates of impact to see what to focus more on, etc.)

A few things come to mind here:

  1. The question of how much evidence it is that Google/Amazon don't do this feels related to the discussion around our corporate prediction market analysis. Note that I was probably the author who treated the fact that most corporations discontinued their prediction markets as the weakest evidence (see my conclusion), though I still think it's fairly substantial.
  2. I also agree with the point in your reply that setting up prediction markets and learning from them has positive externalities, and a similar thing should apply here.
  3. I agree that more data collection tools for what already happens and other innovations in that vein seem good as well!

elifland's Shortform

A variant I'd also be excited about (could imagine even more so; could go either way after more reflection) that could be contained within the same org or a separate one: the same thing but for companies (particularly startups). Edit to clarify: test policies/strategies across companies, not on people within companies.

elifland's Shortform

I think the obvious answer is that doing controlled trials in these areas is a whole lot of work/expense for the benefit.

Some things like health effects can take a long time to play out; maybe 10-50 years. And I wouldn't expect the difference to be particularly amazing. (I'd be surprised if the average person could increase their productivity by more than ~20% with any of those.)

I think our main disagreement is around the likely effect sizes; e.g. I think blocking out focused work could easily have an effect size of >50% (but I'm pretty uncertain, which is why I want the trial!). I agree about long-term effects being a concern, particularly depending on one's TAI timelines.

On "challenge trials"; I imagine the big question is how difficult it would be to convince people to accept a very different lifestyle for a long time. I'm not sure if it's called "challenge trial" in this case. 

Yeah, I'm most excited about challenges that last more like a few months to a year, though this isn't ideal in all domains (e.g. veganism), so maybe this wasn't best as the top example. I have no strong views on terminology.

elifland's Shortform

Votes/considerations on why this is a good or bad idea are also appreciated!

elifland's Shortform

Reflecting a little on my shortform from a few years ago, I think I wasn't ambitious enough in trying to actually move this forward.

I want there to be an org that does "human challenge"-style RCTs across lots of important questions that are extremely hard to get at otherwise, e.g. (top 2 are repeated from previous shortform; edited to clarify: these are some quick examples off the top of my head, and there should be more consideration of which are best for this org):

  1. Health effects of veganism
  2. Health effects of restricting sleep
  3. Productivity of remote vs. in-person work
  4. Productivity effects of blocking out focused/deep work

Edited to add: I no longer think "human challenge" is really the best way to refer to this idea (see comment that convinced me); I mean to say something like "large scale RCTs of important things on volunteers who sign up on an app to randomly try or not try an intervention." I'm open to suggestions on succinct ways to refer to this.
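As a rough illustration of the random assignment step such an app would need, here is a minimal Python sketch. The function name `assign_arms`, the volunteer IDs, and the fixed seed are all hypothetical and not taken from any existing tool; a real trial would likely need stratification, consent handling, and more.

```python
import random

def assign_arms(volunteer_ids, seed=0):
    """Randomly split volunteers into 'intervention' and 'control' arms.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)        # seeded so the assignment is reproducible
    shuffled = list(volunteer_ids)   # copy so the caller's list is not mutated
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"intervention": shuffled[:half], "control": shuffled[half:]}

# Made-up volunteer IDs, e.g. people who signed up via the app
volunteers = [f"volunteer_{i}" for i in range(10)]
arms = assign_arms(volunteers)
```

Each volunteer lands in exactly one arm, and which arm is decided by the shuffle rather than by the volunteer, which is the property that distinguishes an RCT from observational data.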

I'd be very excited about such an org existing. I think it could even grow to become an effective megaproject, pending further analysis on how much it could increase wisdom relative to power. But, I don't think it's a good personal fit for me to found given my current interests and skills. 

However, I think I could plausibly provide some useful advice/help to anyone who is interested in founding a many-domain human-challenge org. If you are interested in founding such an org or know someone who might be and want my advice, let me know. (I will also be linking this shortform to some people who might be able to help set this up.)

--

Some further inspiration I'm drawing on to be excited about this org:

  1. Freakonomics' RCT on measuring the effects of big life changes like quitting your job or breaking up with your partner. This makes me optimistic about the feasibility of getting lots of people to sign up.
  2. Holden's note on doing these types of experiments with digital people. He mentions some difficulties with running these types of RCTs today, but I think an org specializing in them could help. (Edited to add: in particular, a mobile/web app for matching experiments to volunteers and tracking effects seems like it should be created.)

Prediction Markets in The Corporate Setting

I really appreciate that you break down explanatory factors in the way you do.

I'm happy that this was useful for you!

I have a hard time making a mental model of their relative importance compared to each other. Do you think that such an exercise is feasible, and if so, do any of you have a conception of the relative explanatory strength of any factor when considered against the others?

Good question. We also had some trouble with this, as it's difficult to observe the reasons many corporate prediction markets have failed to catch on. That being said, my best guess is that it varies substantially based on the corporation:

  1. For an average company, the most important factor might be some combination of (2) and (4): many employees wouldn't be that interested in predicting and thus the cost of getting enough predictions might be high, and there also just isn't that much appetite to change things up.
  2. For an average EA org, the most important factors might be a combination of (1) and (2): the tech is too immature and writing + acting on good questions takes too much time such that it's hard to find the sweet spot where the benefit is worth the cost. In particular, many EA orgs are quite small so fixed costs of setting up and maintaining the market as well as writing impactful questions can be significant.

This Twitter poll by Ozzie and the discussion under it are also interesting data here; my read is that the mapping between Ozzie's options and our requirements is:

  1. They're undervalued: None of our requirements are substantial enough issues.
  2. They're mediocre: Some combination of our requirements (1), (2), and (3) makes prediction markets not worth the cost.
  3. Politically disruptive: Our requirement (4).
  4. Other

(3) won the poll by quite a bit, but note it was retweeted by Hanson, which could skew the voting pool (h/t Ozzie for mentioning this somewhere else).

Also, do you think that it is likely that the true explanation has nothing to do with any of these? In that case, how likely? 

The most likely possibility I can think of is the one Ozzie included in his poll: prediction markets are undervalued for a reason other than political fears, and all/most of the companies made a mistake by discontinuing them. I'd say 15% for this, given that the evidence is fairly strong but there could be correlated reasons companies are missing out on the benefits. In particular, they could be underestimating some of the positive effects Ozzie mentioned in his comment above.

As for an unlisted explanation being the main one, it feels like we covered most of the ground here and the main explanation is at least related to something we mentioned, but unknown unknowns are always a thing; I'd say 10% here.

So that gives me a quick gut estimate of 25%; would be curious to get others' takes.
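For transparency, the arithmetic behind that figure is sketched below. It assumes the two possibilities are mutually exclusive, so their probabilities simply add; the variable names are made up for illustration.

```python
# Gut probabilities from the two paragraphs above (assumed mutually exclusive)
p_undervalued_for_other_reason = 0.15  # companies were wrong to discontinue
p_unlisted_explanation = 0.10          # main explanation is something not covered

# Probability that the true explanation is not one of the listed requirements
p_not_listed_requirements = p_undervalued_for_other_reason + p_unlisted_explanation

print(f"{p_not_listed_requirements:.0%}")
```

If the two possibilities overlapped, the combined estimate would be somewhat below the plain sum.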

elifland's Shortform

Appreciate the compliment. I am interested in making it a Forum post, but might want to do some more editing/cleanup or writing over the next few weeks/months (it got more interest than I was expecting so seems more likely to be worth it now). Might also post as is, will think about it more soon.

elifland's Shortform

Hi Lizka, thanks for your feedback! I think it touched on some of the sections that I'm most unsure about / that could most use some revision, which is great!

  1. [Bottlenecks] You suggest "Organizations and individuals (stakeholders) making important decisions are willing to use crowd forecasting to help inform decision making" as a crucial step in the "story" of crowd forecasting’s success (the "pathway to impact"?) --- this seems very true to me. But then you write "I doubt this is the main bottleneck right now but it may be in the future" (and don't really return to this).

I'll say up front it's possible I'm just wrong about the importance of the bottleneck here, and I think it also interacts with the other bottlenecks in a tricky way. E.g. if there were a clearer pipeline for creating important questions which get very high quality crowd forecasts which then affect decisions, more organizations would be interested. 

That being said, my intuition that this is not the bottleneck comes from some personal experiences I've had with forecasts solicited by orgs that already are interested in using crowd forecasts to inform decision making. Speaking from the perspective of a forecaster, I personally wouldn't have trusted the forecasts produced as an input into important decisions. 

Some examples: [Disclaimer: These are my personal impressions. Creating impactful questions and incentivizing forecaster effort is really hard and I respect OP/RP/Metaculus a lot for giving it a shot, and would love to be proven wrong about the impact of current initiatives like these.]

  1. The Open Philanthropy/Metaculus Forecasting AI Progress Tournament is the most well-funded initiative I know of [ETA: potentially besides those contracting Good Judgment superforecasters], but my best guess is that the forecasts resulting from it will not be impactful. An example is the "deep learning" longest time horizon round, where despite Metaculus' best efforts most questions have few or no comments, and at least to me it felt like the bulk of the forecasting skill was forming a continuous distribution from trend extrapolation. See also this question where the community failed to update appropriately on record-breaking scores. Also note that each question attracted only 25-35 forecasters.
  2. I feel less sure about this, but RP's animal welfare questions authored by Neil Dullaghan seem to have the majority of their comments written by Neil himself. I feel intuitively skeptical that most of the 25-45 forecasters per question are doing more than skimming and making minor adjustments to the current community forecast, and this feels like an area where getting up to speed on domain knowledge is important for accurate forecasts.

So my argument is: given that AFAIK we haven't had consistent success using crowd forecasts to help institutions make important decisions, the main bottleneck seems to be helping the interested institutions rather than getting more institutions interested.

If, say, the CDC (or important people there, etc.) were interested in using Metaculus to inform their decision-making, do you think they would be unable to do so due to a lack of interest (among forecasters) and/or a lack of relevant forecasting questions? (But then, could they not suggest questions they felt were relevant to their decisions?) Or do you think that the quality of answers they would get (or the amount of faith they would be able to put into those answers) wouldn't be sufficient?

[Caveat: I don't feel too qualified to opine on this point since I'm not a stakeholder nor have I interviewed any, but I'll give my best guess.]

I think for the CDC example:

  1. Creating impactful questions seems relatively easier here than in e.g. the AI safety domain, though it still may be non-trivial to identify and operationalize cruxes for which predictions would actually lead to different decisions.
  2. I'd on average expect the forecasts to be a bit better than CDC models / domain experts, and perhaps substantially better on tail risks. I don't think we have a lot of evidence here; we have some from Metaculus tournaments with a small sample size.
    1. I think with better incentives to allocate more forecaster effort to this project, it's possible the forecasts could be much better.

Overall, I'd expect slightly decent forecasts on good but not great questions, and I think that this isn't really enough to move the needle, so to speak. I also think reasoning would need to be given behind the forecasts for stakeholders to understand, and trust in crowd forecasts would need to be built up over time.

Part of the reason it seems tricky to have impactful forecasts is that often there are competing people/"camps" with different world models, and a person with whom the crowd forecast disagrees may be reluctant to change their mind unless (a) the question is well targeted at cruxes of the disagreement and (b) they have built up trust in the forecasters and their reasoning process. The more this is true within the CDC, the harder it seems for forecasting questions to be impactful.

2. [Separate, minor confusion] You say: "Forecasts are impactful to the extent that they affect important decisions," and then you suggest examples a-d ("from an EA perspective") that range from career decisions or what seem like personal donation choices to widely applicable questions like "Should AI alignment researchers be preparing more for a world with shorter or longer timelines?" and "What actions should we recommend the US government take to minimize pandemic risk?" This makes me confused about the space (or range) of decisions and decision-makers that you are considering here. 

Yeah I think this is basically right, I will edit the draft.

  3. [Side note] I loved the section "Idea for question creation process: double crux creation," and in general the number of possible solutions that you list, and really hope that people try these out or study them more. (I also think you identify other really important bottlenecks).

I hope so too, appreciate it!

elifland's Shortform

I wrote a draft outline on bottlenecks to more impactful crowd forecasting that I decided to share in its current form rather than clean up into a post [edited to add: I ended up revising into a post here].

Link

Summary:

  1. I have some intuition that crowd forecasting could be a useful tool for important decisions like cause prioritization but feel uncertain
  2. I’m not aware of many example success stories of crowd forecasts impacting important decisions, so I define a simple framework for how crowd forecasts could be impactful:
    1. Organizations and individuals (stakeholders) making important decisions are willing to use crowd forecasting to help inform decision making
    2. Forecasting questions are written such that their forecasts will affect the important decisions of stakeholders
    3. The forecasts are good + well-reasoned enough that they are actually useful and trustworthy for stakeholders
  3. I discuss 3 bottlenecks to success stories and possible solutions:
    1. Creating the important questions
    2. Incentivizing time spent on important questions
    3. Incentivizing forecasters to collaborate