Dan_Keys

Joined Oct 2014

Comments (39)

Intelligence 1: Individual cognitive abilities.

Intelligence 2: The ability to achieve a wide range of goals.

Eliezer Yudkowsky means Intelligence 2 when he talks about general intelligence; for example, he proposed "efficient cross-domain optimization" as the definition in his post by that name. See the LW tag page for General Intelligence for more links & discussion.

The model assumes gradually diminishing returns to spending within the next year, but the intuition behind your third voice is that much higher spending would have marginal returns that are much smaller, roughly zero, or even negative?

Could you post something closer to the raw survey data, in addition to the analysis spreadsheet linked in the summary section? I'd like to see something that:

  • Has data organized by respondent (a row of data for each respondent)
  • Shows the number given by the respondent, before researcher adjustments (e.g., answers of 0 are shown as "0" and not as ".01") (it's fine for it to show the numbers that you get after data cleaning which turns "50%" and "50" into "0.5")
  • Includes each person's 6 component estimates, along with a few other variables like their directly elicited p(catastrophe), whether they identified as an expert, and (if you have the data) whether they came to the survey via ACX, LW, or the EA Forum
  • Has the exact text of every question

The numbers that you get from this sort of exercise will depend heavily on which people you get estimates from. My guess is that which people you include matters more than what you do with the numbers that they give you.

If the people who you survey are more like the general public, rather than people around our subcultural niche where misaligned AI is a prominent concern, then I expect you'll get smaller numbers.

Whereas, in Rob Bensinger's 2021 survey of "people working on long-term AI risk", every one of the 44 people who answered the survey gave an estimate larger than the 1.6% headline figure here. The smallest answer was 1.9%, and the central tendency was somewhere between 20% and 40% (depending on whether you look at the median, arithmetic mean, or geometric mean of the odds, and which of the two questions from that survey you look at).
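(For concreteness, this is the kind of spread those three summary choices can give. The probabilities below are made-up illustrations, not the survey's numbers, and the code is just a sketch.)

```python
import numpy as np

# Illustrative probabilities only (not the actual survey responses).
p = np.array([0.02, 0.10, 0.20, 0.35, 0.50, 0.80])

median = np.median(p)                        # ~0.275
arithmetic_mean = np.mean(p)                 # ~0.33

odds = p / (1 - p)                           # convert probabilities to odds
geo_odds = np.exp(np.mean(np.log(odds)))     # geometric mean of the odds
geo_mean_of_odds = geo_odds / (1 + geo_odds) # back to a probability, ~0.25

print(median, arithmetic_mean, geo_mean_of_odds)
```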

If the estimates for the different components were independent, then wouldn't the distribution of synthetic estimates be the same as the distribution of individual people's estimates?

Multiplying Alice's p1 x Bob's p2 x Carol's p3 x ... would draw from the same distribution as multiplying Alice's p1 x Alice's p2 x Alice's p3 x ..., if the estimates for the different questions are unrelated.

So you could see how much non-independence affects the bottom-line results just by comparing the synthetic distribution with the distribution of individual estimates (treating each individual as one data point and multiplying their 6 component probabilities together to get their p(existential catastrophe)).

Insofar as the 6 components are not independent, the question of whether to use synthetic estimates or just look at the distribution of individuals' estimates comes down to (1) how much value there is in increasing the effective sample size by using synthetic estimates, and (2) whether the non-independence that exists is something you want to erase by scrambling together different people's component estimates (because it mainly reflects reasoning errors) or something you want to preserve by looking at individual estimates (because it reflects the structure of the situation).
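A rough sketch of that comparison, assuming the responses sit in an n x 6 array with a row per respondent (the array name, shapes, and placeholder numbers are mine, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# p: hypothetical (n_respondents x 6) array of component probabilities,
# one row per respondent; filled with random placeholder values here.
n, k = 50, 6
p = rng.uniform(0.05, 0.95, size=(n, k))

# Distribution of individuals' estimates: multiply each respondent's own
# six components together.
individual_products = p.prod(axis=1)

# Synthetic estimates: for each synthetic "respondent", draw each component
# from an independently chosen respondent.
n_synthetic = 100_000
draws = rng.integers(0, n, size=(n_synthetic, k))
synthetic_products = p[draws, np.arange(k)].prod(axis=1)

# If the components were independent across respondents, these two
# distributions would match up to sampling noise; the gap between them
# shows how much the non-independence matters.
for name, x in [("individual", individual_products), ("synthetic", synthetic_products)]:
    print(name, np.percentile(x, [10, 50, 90]))
```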

Does the table in section 3.2 take the geometric mean for each of the 6 components?

From footnote 7 it looks like it does, but if so, I don't see how it gives such a different bottom-line probability from the synthetic method geomean in section 4 (18.7% vs. 1.65% for all respondents). Unless some probabilities are very close to 1, and those have a big influence on the numbers in the section 3.2 table? Or my intuitions about these methods are just off.
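Part of my confusion: if the section 3.2 number is the product of the six per-component geometric means, and the section 4 number is the geometric mean of synthetic products taken over all combinations of respondents (or a large uniform sample of them), then as far as I can tell the two should coincide exactly (writing $p_{i,k}$ for respondent $i$'s estimate of component $k$, with $n$ respondents):

$$\left(\prod_{i_1=1}^{n}\cdots\prod_{i_6=1}^{n}\;\prod_{k=1}^{6} p_{i_k,k}\right)^{1/n^6} \;=\; \prod_{k=1}^{6}\left(\prod_{i=1}^{n} p_{i,k}\right)^{1/n},$$

since each $p_{i,k}$ appears in exactly $n^5$ of the $n^6$ combinations. So a gap as large as 18.7% vs. 1.65% makes me think at least one of my assumptions about how the two numbers were computed is wrong.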

Have you looked at how sensitive this analysis is to outliers, or to (say) the most extreme 10% of responses on each component?

The recent Samotsvety nuclear risk estimate removed the largest and smallest forecast (out of 7) for each component before aggregating the remaining 5 forecasts with the geometric mean. Would a similar adjustment here change the bottom line much (for the single probability and/or the distribution over "worlds")?
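A minimal sketch of the kind of trimming I have in mind, again assuming an n x 6 array of responses (the placeholder data and trim levels are mine, not from the post):

```python
import numpy as np

def trimmed_geomean(x, n_trim=1):
    """Geometric mean after dropping the n_trim smallest and n_trim largest values."""
    x = np.sort(np.asarray(x, dtype=float))
    return np.exp(np.mean(np.log(x[n_trim:len(x) - n_trim])))

# Placeholder data: an (n_respondents x 6) array of component probabilities,
# standing in for however the survey responses are actually stored.
rng = np.random.default_rng(0)
p = rng.uniform(0.01, 0.99, size=(40, 6))

geomean_all     = np.prod([trimmed_geomean(p[:, k], n_trim=0) for k in range(6)])
geomean_trim1   = np.prod([trimmed_geomean(p[:, k], n_trim=1) for k in range(6)])  # Samotsvety-style
geomean_trim10p = np.prod([trimmed_geomean(p[:, k], n_trim=4) for k in range(6)])  # ~10% off each end

print(geomean_all, geomean_trim1, geomean_trim10p)
```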

The prima facie case for worrying about outliers actually seems significantly stronger for this survey than for an org like Samotsvety, which relies on skilled forecasters who treat each forecast professionally. This AI survey could have included people who haven't thought in much depth about AI existential risk, or who aren't comfortable with the particular decomposition you used, or who aren't good at giving probabilities, or who didn't put much time/effort/thought into answering these survey questions. 

And it seems like the synthetic point estimate method used here might magnify the impact of outlier respondents rather than attenuating it. An extreme response can move the geometric mean a lot, and a person who gives extreme answers on 3 of the components can have their extreme estimates show up in 3/n of the synthetic estimates, not just 1/n.
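As a toy illustration of that first point (with made-up numbers): a single near-zero response among ten otherwise identical ones pulls the geometric mean down by a factor of about 3.5, while barely moving the arithmetic mean.

```python
import numpy as np

def geomean(x):
    return np.exp(np.mean(np.log(x)))

baseline = [0.3] * 10
with_outlier = [0.3] * 9 + [1e-6]

print(geomean(baseline))       # 0.30
print(geomean(with_outlier))   # ~0.085 (one response moves it ~3.5x)
print(np.mean(with_outlier))   # ~0.27  (the arithmetic mean barely moves)
```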

A passage from Superforecasting:

Flash back to early 2012. How likely is the Assad regime to fall? Arguments against a fall include (1) the regime has well-armed core supporters; (2) it has powerful regional allies. Arguments in favor of a fall include (1) the Syrian army is suffering massive defections; (2) the rebels have some momentum, with fighting reaching the capital. Suppose you weight the strength of these arguments, they feel roughly equal, and you settle on a probability of roughly 50%.

But notice what’s missing? The time frame. It obviously matters. To use an extreme illustration, the probability of the regime falling in the next twenty-four hours must be less—likely a lot less—than the probability that it will fall in the next twenty-four months. To put this in Kahneman’s terms, the time frame is the “scope” of the forecast.

So we asked one randomly selected group of superforecasters, “How likely is it that the Assad regime will fall in the next three months?” Another group was asked how likely it was in the next six months. We did the same experiment with regular forecasters.

Kahneman predicted widespread “scope insensitivity.” Unconsciously, they would do a bait and switch, ducking the hard question that requires calibrating the probability to the time frame and tackling the easier question about the relative weight of the arguments for and against the regime’s downfall. The time frame would make no difference to the final answers, just as it made no difference whether 2,000, 20,000, or 200,000 migratory birds died. Mellers ran several studies and found that, exactly as Kahneman expected, the vast majority of forecasters were scope insensitive. Regular forecasters said there was a 40% chance Assad’s regime would fall over three months and a 41% chance it would fall over six months.

But the superforecasters did much better: They put the probability of Assad’s fall at 15% over three months and 24% over six months. That’s not perfect scope sensitivity (a tricky thing to define), but it was good enough to surprise Kahneman. If we bear in mind that no one was asked both the three- and six-month version of the question, that’s quite an accomplishment. It suggests that the superforecasters not only paid attention to the time frame in the question but also thought about other possible time frames—and thereby shook off a hard-to-shake bias.

Note: in the other examples studied by Mellers & colleagues (2015), regular forecasters were less sensitive to scope than they should've been, but they were not completely insensitive to scope, so the Assad example here (40% vs. 41%) is unusually extreme.

Two empirical reasons not to take the extreme scope neglect in studies like the 2,000 vs. 200,000 birds one as directly reflecting people's values:

First, the results of studies like this depend on how you ask the question. A simple variation which generally leads to more scope sensitivity is to present the two options side by side, so that the same people are asked both about the 2,000 birds and about the 200,000 birds (some call this "joint evaluation", in contrast to "separate evaluation"). Other variations also generally produce more scope-sensitive results (this Wikipedia article seems uneven in quality, but it gives a flavor for some of those variations). The fact that this variation exists means that just taking people's answers at face value does not work as a straightforward approach to understanding people's values, and I think the studies which find more scope sensitivity often have a strong case for being better designed.

Second, there are variants of scope insensitivity which involve things other than people's values. Christopher Hsee has done a number of studies in the context of consumer choice, where the quantity is something like the amount of ice cream that you get or the number of entries in a dictionary, which find scope insensitivity under separate evaluation (but not under joint evaluation), and there is good reason to think that people do prefer more ice cream and more comprehensive dictionaries. Daniel Kahneman has argued that several different kinds of extension neglect all reflect similar cognitive processes, including scope neglect in the bird study, base rate neglect in the Tom W problem, and duration neglect in studies of colonoscopies. And superforecasting researchers have found that ordinary forecasters neglect scope in questions like (in 2012) "How likely is it that the Assad regime will fall in the next three months?" vs. "How likely is it that the Assad regime will fall in the next six months?"; superforecasters' forecasts are more sensitive to the 3-month vs. 6-month quantity (there's a passage in Superforecasting about this which I'll leave as a reply, and a paper by Mellers & colleagues with more examples). These results suggest that people's answers to questions about values-at-scale have a lot to do with how people think about quantities, that "how people think about quantities" is a fairly messy empirical matter, and that it's fairly common for people's thinking about quantities to involve errors/biases/limitations which make their answers less sensitive to the size of the quantity.

This does not imply that the extreme scope sensitivity common in effective altruism matches people's values; I think that claim requires more of a philosophical argument than an empirical one. The point is just that the extreme scope insensitivity found in some studies probably doesn't match people's values.

It would be interesting to know whether the forecasters with outlier numbers stand by those forecasts on reflection, and to hear their reasoning if so. In cases where outlier forecasts reflect insight, how do we capture that insight rather than brushing those forecasts aside with the noise? Checking in with those forecasters after their forecasts have been flagged as suspicious-to-others is a start.

The p(month|year) number is especially relevant, since that is not just an input into the bottom line estimate, but also has direct implications for individual planning. The plan 'if Russia uses a nuclear weapon in Ukraine then I will leave my home to go someplace safer' looks pretty different depending on whether the period of heightened risk when you will be away from home is more like 2 weeks or 6 months.
