Ben Garfinkel

Ben Garfinkel - Researcher at Future of Humanity Institute

Wiki Contributions


Why AI alignment could be hard with modern deep learning

FWIW, I haven't had this impression.

Single data point: In the most recent survey on community opinion on AI risk, I was in at least the 75th percentile for pessimism (for roughly the same reasons Lukas suggests below). But I'm also seemingly unusually optimistic about alignment risk.

I haven't found that this is a really unusual combo: I think I know at least a few other people who are unusually pessimistic about 'AI going well,' but also at least moderately optimistic about alignment.

(Caveat that my apparently higher level of pessimism could also be explained by me having a more inclusive conception of "existential risk" than other survey participants.)

All Possible Views About Humanity's Future Are Wild

Thanks for the clarification! I still feel a bit fuzzy on this line of thought, but hopefully understand a bit better now.

At least on my read, the post seems to discuss a couple different forms of wildness: let’s call them “temporal wildness” (we currently live at an unusually notable time) and “structural wildness” (the world is intuitively wild; the human trajectory is intuitively wild).[1]

I think I still don’t see the relevance of “structural wildness,” for evaluating fishiness arguments. As a silly example: Quantum mechanics is pretty intuitively wild, but the fact that we live in a world where QM is true doesn’t seem to substantially undermine fishiness arguments.

I think I do see, though, how claims about temporal wildness might be relevant. I wonder if this kind of argument feels approximately right to you (or to Holden):

Step 1: A priori, it’s unlikely that we would live even within 10000 years of the most consequential century in human history. However, despite this low prior, we have obviously strong reasons to think it’s at least plausible that we live this close to the HoH. Therefore, let’s say, a reasonable person should assign at least a 20% credence to the (wild) hypothesis: “The HoH will happen within the next 10000 years.”

Step 2: If we suppose that the HoH will happen with the next 10000 years, then a reasonable conditional credence that this century is the HoH should probably be something like 1/100. Therefore, it seems, our ‘new prior’ that this century is the HoH should be at least .2*.01 = .002. This is substantially higher than (e.g.) the more non-informative prior that Will's paper starts with.

Fishiness arguments can obviously still be applied to the hypothesis presented in Step 1, in the usual way. But maybe the difference, here, is that the standard arguments/evidence that lend credibility to the more conservative hypothesis “The HoH will happen within the next 10000” are just pretty obviously robust — which makes it easier to overcome a low prior. Then, once we’ve established the plausibility of the more conservative hypothesis, we can sort of back-chain and use it to bump up our prior in the Strong HoH Hypothesis.

  1. I suppose it also evokes an epistemic notion of wildness, when it describes certain confidence levels as “wild,” but I take it that “wild” here is mostly just a way of saying “irrational”? ↩︎

All Possible Views About Humanity's Future Are Wild

To say a bit more here, on the epistemic relevance of wildness:

I take it that one of the main purposes of this post is to push back against “fishiness arguments,” like the argument that Will makes in “Are We Living at the Hinge of History?

The basic idea, of course, is that it’s a priori very unlikely that any given person would find themselves living at the hinge of history (and correctly recognise this). Due to the fallibility of human reasoning and due to various possible sources of bias, however, it’s not as unlikely that a given person would mistakenly conclude that they live at the HoH. Therefore, if someone comes to believe that they probably live at the HoH, we should think there’s a sizeable chance they’ve simply made a mistake.

As this line of argument is expressed in the post:

I know what you're thinking: "The odds that we could live in such a significant time seem infinitesimal; the odds that Holden is having delusions of grandeur (on behalf of all of Earth, but still) seem far higher."

The three critical probabilities here are:

  • Pr(Someone makes an epistemic mistake when thinking about their place in history)
  • Pr(Someone believes they live at the HoH|They haven’t made an epistemic mistake)
  • Pr(Someone believes they live at the HoH|They’ve made an epistemic mistake)

The first describes the robustness of our reasoning. The second describes the prior probability that we would live at the HoH (and be able to recognise this fact if reasoning well). The third describes the level of bias in our reasoning, toward the HoH hypothesis, when we make mistakes.

I agree that all possible futures are “wild,” in some sense, but I don’t think this point necessarily bears much on the magnitudes of any of these probabilities.

For example, it would be sort of “wild” if long-distance space travel turns out to be impossible and our solar system turns out to be the only solar system to ever harbour life. It would also be “wild” if long-distance space travel starts to happen 100,000 years from now. But — at least at a glance — I don’t see how this wildness should inform our estimates for the three key probabilities.

One possible argument here, focusing on the bias factor, is something like: “We shouldn’t expect intellectuals to be significantly biased toward the conclusion that they live at the HoH, because the HoH Hypothesis isn’t substantially more appealing, salient, etc., than other beliefs they could have about the future.”

But I don’t think this argument would be right. For example: I think the hypothesis “the HoH will happen within my lifetime” and the hypothesis “the HoH will happen between 100,000 and 200,000 years from now” are pretty psychologically different.

To sum up: At least on a first pass, I don't see why the point "all possible futures are wild" undermines the fishiness argument raised at the top of the post.

All Possible Views About Humanity's Future Are Wild

Some possible futures do feel relatively more "wild” to me, too, even if all of them are wild to a significant degree. If we suppose that wildness is actually pretty epistemically relevant (I’m not sure it is), then it could still matter a lot if some future is 10x wilder than another.

For example, take a prediction like this:

Humanity will build self-replicating robots and shoot them out into space at close to the speed of light; as they expand outward, they will construct giant spherical structures around all of the galaxy’s stars to extract tremendous volumes of energy; this energy will be used to power octillions of digital minds with unfathomable experiences; this process will start in the next thirty years, by which point we’ll already have transcended our bodies to reside on computers as brain emulation software.

A prediction like “none of the above happens; humanity hangs around and then dies out sometime in the next million years” definitely also feels wild in its own way. So does the prediction “all of the above happens, starting a few hundred years from now.” But both of these predictions still feel much less wild than the first one.

I suppose whether they actually are much less “wild” depends on one’s metric of wildness. I’m not sure how to think about that metric, though. If wildness is epistemically relevant, then presumably some forms of wildness are more epistemically relevant than others.

Taboo "Outside View"

I suspect you are more broadly underestimating the extent to which people used "insect-level intelligence" as a generic stand-in for "pretty dumb," though I haven't looked at the discussion in Mind Children and Moravec may be making a stronger claim.

I think that's good push-back and a fair suggestion: I'm not sure how seriously the statement in Nick's paper was meant to be taken. I hadn't considered that it might be almost entirely a quip. (I may ask him about this.)

Moravec's discussion in Mind Children is similarly brief: He presents a graph of the computing power of different animal's brains and states that "lab computers are roughly equal in power to the nervous systems of insects."He also characterizes current AI behaviors as "insectlike" and writes: "I believe that robots with human intelligence will be common within fifty years. By comparison, the best of today's machines have minds more like those of insects than humans. Yet this performance itself represents a giant leap forward in just a few decades." I don't think he's just being quippy, but there's also no suggestion that he means anything very rigorous/specific by his suggestion.

Rodney Brooks, I think, did mean for his comparisons to insect intelligence to be taken very seriously. The idea of his "nouvelle AI program" was to create AI systems that match insect intelligence, then use that as a jumping-off point for trying to produce human-like intelligence. I think walking and obstacle navigation, with several legs, was used as the main dimension of comparison. The Brooks case is a little different, though, since (IIRC) he only claimed that his robots exhibited important aspects of insect intelligence or fell just short insect intelligence, rather than directly claiming that they actually matched insect intelligence. On the other hand, he apparently felt he had gotten close enough to transition to the stage of the project that was meant to go from insect-level stuff to human-level stuff.

A plausible reaction to these cases, then, might be:

OK, Rodney Brooks did make a similar comparison, and was a major figure at the time, but his stuff was pretty transparently flawed. Moravec's and Bostrom's comments were at best fairly off-hand, suggesting casual impressions more than they suggest outcomes of rigorous analysis. The more recent "insect-level intelligence" claim is pretty different, since it's built on top of much more detailed analysis than anything Moravec/Bostrom did, and it's less obviously flawed than Brooks' analysis. The likelihood that it reflects an erroneous impression is, therefore, a lot lower. The previous cases shouldn't actually do much to raise our suspicion levels.

I think there's something to this reaction, particularly if there's now more rigorous work being done to operationalize and test the "insect-level intelligence" claim. I hadn't yet seen the recent post you linked to, which, at first glance, seems like a good and clear piece of work. The more rigorous work is done to flesh out the argument, the less I'm inclined to treat the Bostrom/Moravec/Brooks cases as part of an epistemically relevant reference class.

My impression a few years ago was that the claim wasn't yet backed by any really clear/careful analysis. At least, the version that filtered down to me seemed to be substantially based on fuzzy analogies between RL agent behavior and insect behavior, without anyone yet knowing much about insect behavior. (Although maybe this was a misimpression.) So I probably do stand by the reference class being relevant back then.

Overall, to sum up, my position here is something like: "The Bostrom/Moravec/Brooks cases do suggest that it might be easy to see roughly insect-level intelligence, if that's what you expect to see and you're relying on fuzzy impressions, paying special attention to stuff AI systems can already do, or not really operationalizing your claims. This should make us more suspicious of modern claims that we've recently achieved 'insect-level intelligence,' unless they're accompanied by transparent and pretty obviously robust reasoning. Insofar as this work is being done, though, the Bostrom/Moravec/Brooks cases become weaker grounds for suspicion."

Taboo "Outside View"

As a last thought here (no need to respond), I thought it might useful to give one example of a concrete case where: (a) Tetlock’s work seems relevant, and I find the terms “inside view” and “outside view” natural to use, even though the case is relatively different from the ones Tetlock has studied; and (b) I think many people in the community have tended to underweight an “outside view.”

A few years ago, I pretty frequently encountered the claim that recently developed AI systems exhibited roughly “insect-level intelligence.” This claim was typically used to support an argument for short timelines, since the claim was also made that we now had roughly insect-level compute. If insect-level intelligence has arrived around the same time as insect-level compute, then, it seems to follow, we shouldn’t be at all surprised if we get ‘human-level intelligence’ at roughly the point where we get human-level compute. And human-level compute might be achieved pretty soon.

For a couple of reasons, I think some people updated their timelines too strongly in response to this argument. First, it seemed like there are probably a lot of opportunities to make mistakes when constructing the argument: it’s not clear how “insect-level intelligence” or “human-level intelligence” should be conceptualised, it’s not clear how best to map AI behaviour onto insect behaviour, etc. The argument also hadn't yet been vetted closely or expressed very precisely, which seemed to increase the possibility of not-yet-appreciated issues.

Second, we know that there are previous of examples of smart people looking at AI behaviour and forming the impression that it suggests “insect-level intelligence.” For example, in Nick Bostrom’s paper “How Long Before Superintelligence?” (1998) he suggested that “approximately insect-level intelligence” was achieved sometime in the 70s, as a result of insect-level computing power being achieved in the 70s. In Moravec’s book Mind Children (1990), he also suggested that both insect-level intelligence and insect-level compute had both recently been achieved. Rodney Brooks also had this whole research program, in the 90s, that was based around going from “insect-level intelligence” to “human-level intelligence.”

I think many people didn’t give enough weight to the reference class “instances of smart people looking at AI systems and forming the impression that they exhibit insect-level intelligence” and gave too much weight to the more deductive/model-y argument that had been constructed.

This case is obviously pretty different than the sorts of cases that Tetlock’s studies focused on, but I do still feel like the studies have some relevance. I think Tetlock’s work should, in a pretty broad way, make people more suspicious of their own ability to perform to linear/model-heavy reasoning about complex phenomena, without getting tripped up or fooling themselves. It should also make people somewhat more inclined to take reference classes seriously, even when the reference classes are fairly different from the sorts of reference classes good forecasters used in Tetlock’s studies. I do also think that the terms “inside view” and “outside view” apply relatively neatly, in this case, and are nice bits of shorthand — although, admittedly, it’s far from necessary to use them.

This is the sort of case I have in the back of my mind.

(There are also, of course, cases that point in the opposite direction, where many people seemingly gave too much weight to something they classified as an "outside view." Early under-reaction to COVID is arguably one example.)

Taboo "Outside View"

Thank you (and sorry for my delayed response)!

I shudder at the prospect of having a discussion about "Outside view vs inside view: which is better? Which is overrated and which is underrated?" (and I've worried that this thread may be tending in that direction) but I would really look forward to having a discussion about "let's look at Daniel's list of techniques and talk about which ones are overrated and underrated and in what circumstances each is appropriate."

I also shudder a bit at that prospect.

I am sometimes happy making pretty broad and sloppy statements. For example: "People making political predictions typically don't make enough use of 'outside view' perspectives" feels fine to me, as a claim, despite some ambiguity around the edges. (Which perspectives should they use? How exactly should they use them? Etc.)

But if you want to dig in deep, for example when evaluating the rationality of a particular prediction, you should definitely shift toward making more specific and precise statements. For example, if someone has based their own AI timelines on Katja's expert survey, and they wanted to defend their view by simply evoking the principle "outside views are better than inside views," I think this would probably a horrible conversation. A good conversation would focus specifically on the conditions under which it makes sense to defer heavily to experts, whether those conditions apply in this particular case, etc. Some general Tetlock stuff might come into the conversation, like: "Tetlock's work suggests it's easy to trip yourself up if you try to use your own detailed/causal model of the world to make predictions, so you shouldn't be so confident that your own 'inside view' prediction will be very good either." But mostly you should be more specific.

Now I'll try to say what I think your position is:

  1. If people were using "outside view" without explaining more specifically what they mean, that would be bad and it should be tabood, but you don't see that in your experience
  2. If the things in the first Big List were indeed super diverse and disconnected from the evidence in Tetlock's studies etc., then there would indeed be no good reason to bundle them together under one term. But in fact this isn't the case; most of the things on the list are special cases of reference-class / statistical reasoning, which is what Tetlock's studies are about. So rather than taboo "outside view" we should continue to use the term but mildly prune the list.
  3. There may be a general bias in this community towards using the things on the first Big List, but (a) in your opinion the opposite seems more true, and (b) at any rate even if this is true the right response is to argue for that directly rather than advocating the tabooing of the term.

How does that sound?

I'd say that sounds basically right!

The only thing is that I don't necessarily agree with 3a.

I think some parts of the community lean too much on things in the bag (the example you give at the top of the post is an extreme example). I also think that some parts of the community lean too little on things in the bag, in part because (in my view) they're overconfident in their own abilities to reason causally/deductively in certain domains. I'm not sure which is overall more problematic, at the moment, in part because I'm not sure how people actually should be integrating different considerations in domains like AI forecasting.

There also seem to be biases that cut in both directions. I think the 'baseline bias' is pretty strongly toward causal/deductive reasoning, since it's more impressive-seeming, can suggest that you have something uniquely valuable to bring to the table (if you can draw on lots of specific knowledge or ideas that it's rare to possess), is probably typically more interesting and emotionally satisfying, and doesn't as strongly force you to confront or admit the limits of your predictive powers. The EA community has definitely introduced an (unusual?) bias in the opposite direction, by giving a lot of social credit to people who show certain signs of 'epistemic virtue.' I guess the pro-causal/deductive bias often feels more salient to me, but I don't really want to make any confident claim here that it actually is more powerful.

Ben Garfinkel's Shortform

I'm not sure if you think this is an interesting point to notice that's useful for building a world-model, and/or a reason to be skeptical of technical alignment work. I'd agree with the former but disagree with the latter.

Mostly the former!

I think the point may have implications for how much we should prioritize alignment research, relative to other kinds of work, but this depends on what the previous version of someone's world model was.

For example, if someone has assumed that solving the 'alignment problem' is close to sufficient to ensure that humanity has "control" of its future, then absorbing this point (if it's correct) might cause them to update downward on the expected impact of technical alignment research. Research focused on coordination-related issues (e.g. cooperative AI stuff) might increase in value, at least in relative terms.

Taboo "Outside View"

It’s definitely entirely plausible that I’ve misunderstood your views.

My interpretation of the post was something like this:

There is a bag of things that people in the EA community tend to describe as “outside views.” Many of the things in this bag are over-rated or mis-used by members of the EA community, leading to bad beliefs.

One reason for this over-use or mis-use is that the the term “outside view” has developed an extremely positive connotation within the community. People are applauded for saying that they’re relying on “outside views” — “outside view” has become “an applause light” — and so will rely on items in the bag to an extent that is epistemically unjustified.

The things in the bag are also pretty different from each other — and not everyone who uses the term “outside view” agrees about exactly what belongs in the bag. This conflation/ambiguity can lead to miscommunication.

More importantly, when it comes to the usefulness of the different items in the bag, some have more evidential support than others. Using the term “outside view” to refer to everything in the bag might therefore lead people to overrated certain items that actually have weak evidential support.

To sum up, tabooing the term “outside view” might solve two problems. First, it might reduce miscommunication. Second, more importantly, it might cause people to stop overrating some of the reasoning processes that they currently characterize as involving “outside views.” The mechanisms by which tabooing the term can help to solve the second problem are: (a) it takes away an “applause light,” whose existence incentivizes excessive use of these reasoning processes, and (b) it allows people to more easily recognize that some of these reasoning processes don't actually have much empirical support.

I’m curious if this feels roughly right, or feels pretty off.

Part of the reason I interpreted your post this way: The quote you kicked the post off suggested to me that your primary preoccupation was over-use or mis-use of the tools people called “outside views,” including more conventional reference-class forecasting. It seemed like the quote is giving an example of someone who’s refusing to engage in causal reasoning, evaluate object-level arguments, etc., based on the idea that outside views are just strictly dominant in the context of AI forecasting. It seemed like this would have been an issue even if the person was doing totally orthodox reference-class forecasting and there was no ambiguity about what they were doing.[1]

I don’t think that you’re generally opposed to the items in the “outside view” bag or anything like that. I also don’t assume that you disagree with most of the points I listed in my last comment, for why I think intellectuals probably on average underrated the items in the bag. I just listed all of them because you asked for an explanation for my view, I suppose with some implication that you might disagree with it.

You've also given two rough definitions of the term, which seem quite different to me, and also quite fuzzy. (e.g. if by "reference class forecasting" you mean the stuff Tetlock's studies are about, then it really shouldn't include the anti-weirdness heuristic, but it seems like you are saying it does?)

I think it’s probably not worth digging deeper on the definitions I gave, since I definitely don’t think they're close to perfect. But just a clarification here, on the anti-weirdness heuristic: I’m thinking of the reference class as “weird-sounding claims.”

Suppose someone approaches you not the street and hands you a flyer claiming: “The US government has figured out a way to use entangled particles to help treat cancer, but political elites are hoarding the particles.” You quickly form a belief that the flyer’s claim is almost certainly false, by thinking to yourself: “This is a really weird-sounding claim, and I figure that virtually all really weird-sounding claims that appear in random flyers are wrong.”

In this case, you’re not doing any deductive reasoning about the claim itself or relying on any causal models that directly bear on the claim. (Although you could.) For example, you’re not thinking to yourself: “Well, I know about quantum mechanics, and I know entangled particles couldn’t be useful for treating cancer for reason X.” Or: “I understand economic incentives, or understand social dynamics around secret-keeping, so I know it’s unlikely this information would be kept secret.” You’re just picking a reference class — weird-sounding claims made on random flyers — and justifying your belief that way.

I think it’s possible that Tetlock’s studies don’t bear very strongly on the usefulness of this reference class, since I imagine participants in his studies almost never used it. (“The claim ‘there will be a coup in Venezuela in the next five years’ sounds really weird to me, and most claims that sound weird to me aren't true, so it’s probably not true!”) But I think the anti-weirdness heuristic does fit with the definitions I gave, as well as the definition you give that characterizes the term's "original meaning." I also do think that Tetlock's studies remain at least somewhat relevant when judging the potential usefulness of the heuristic.

  1. I initially engaged on the miscommunication, point, though, since this is the concern that would mostly strongly make me want to taboo the term. I’d rather address the applause light problem, if it is a problem, but trying get people in the EA community stop applauding, and the evidence problem, if it is a problem, by trying to just directly make people in the EA community more aware of the limits of evidence. ↩︎

Taboo "Outside View"

On the contrary; tabooing the term is more helpful, I think. I've tried to explain why in the post. I'm not against the things "outside view" has come to mean; I'm just against them being conflated with / associated with each other, which is what the term does. If my point was simply that the first Big List was overrated and the second Big List was underrated, I would have written a very different post!

My initial comment was focused on your point about conflation, because I think this point bears on the linguistic question more strongly than the other points do. I haven’t personally found conflation to be a large issue. (Recognizing, again, that our experiences may differ.) If I agreed with the point about conflation, though, then I would think it might be worth tabooing the term "outside view."

By what definition of "outside view?"

By “taking an outside view on X” I basically mean “engaging in statistical or reference-class-based reasoning.” I think it might also be best defined negatively: "reasoning that doesn’t substantially involve logical deduction or causal models of the phenomenon in question. "[1]

I think most of the examples in your list fit these definitions.

Epistemic deference is a kind of statistical/reference-class-based reasoning, for example, which doesn't involve applying any sort of causal model of the phenomenon in question. The logic is “Ah, I should update downward on this claim, since experts in domain X disagree with it and I think that experts in domain X will typically be right.”

Same for anti-weirdness: The idea is that weird claims are typically wrong.

I’d say that trend extrapolation also fits: You’re not doing logical reasoning or relying on a causal model of the relevant phenomenon. You’re just extrapolating a trend forward, largely based on the assumption that long-running trends don’t typically end abruptly.

“Foxy aggregation,” admittedly, does seem like a different thing to me: It arguably fits the negative definition, depending on how you generate your weights, but doesn’t seem to fit statistical/reference-class one. It also feels like more of a meta-level thing. So I wouldn’t personally use the term “outside view” to talk about foxy aggregation. (I also don’t think I’ve personally heard people use the term “outside view” to talk about foxy aggegration, although I obviously believe you have.)

There is some evidence that in some circumstances people don't take reference class forecasting seriously enough; that's what the original term "outside view" meant. What evidence is there that the things on the Big List O' Things People Describe as Outside View are systematically underrated by the average intellectual?

A condensed answer is: (a) I think most public intellectuals barely use any of the items on this list (with the exception of the anti-weirdness heuristic); (b) I think some of the things on this list are often useful; (c) I think that intellectuals and people more generally are very often bad at reasoning causally/logically about complex social phenomena; (d) I expect intellectuals to often have a bias against outside-view-style reasoning, since it often feels somewhat unsatisfying/unnatural and doesn't allow them to display impressive-seeming domain-knowledge, interesting models of the world, or logical reasoning skills; and (e) I do still think Tetlock’s evidence is at least somewhat relevant to most things on the list, in part because I think they actually are somewhat related to each other, although questions of external validity obviously grow more serious the further you move from the precise sorts of questions asked in his tournaments and the precise styles of outside-view reasoning displayed by participants. [2]

There’s also, of course, a bit of symmetry here. One could also ask: “What evidence is there that the things on the Big List O' Things People Describe as Outside View are systematically overrated by the average intellectual?” :)

  1. These definitions of course aren’t perfect, and other people sometimes use the term more broadly than I do, but, again, some amount of fuzziness seems OK to me. Most concepts have fuzzy boundaries and are hard to define precisely. ↩︎

  2. On the Tetlock evidence: I think one thing his studies suggest, which I expect to generalize pretty well to many different contexts, is that people who are trying to make predictions about complex phenemona (especially complex social phenemona) often do very poorly when they don't incorporate outside views into their reasoning processes. (You can correct me if this seems wrong, since you've thought about Tetlock's work far more than I have.) So, on my understanding, Tetlock's work suggests that outside-view-heavy reasoning processes would often substitute for reasoning processes that lead to poor predictions anyways. At least for most people, then, outside-view-heavy reasoning processes don't actually need to be very reliable to constitute improvements -- and they need to be pretty bad to, on average, lead to worse predictions.

    Another small comment here: I think Tetlock's work also counts, in a somewhat broad way, against the "reference class tennis" objection to reference-class-based forecasting. On its face, the objection also applies to the use of reference classes in standard forecasting tournaments. There are always a ton of different reference classes someone could use to forecast any given political event. Forecasters need to rely on some sort of intuition, or some sort of fuzzy reasoning, to decide on which reference classes to take seriously; it's a priori plausible that people would be just consistently very bad at this, given the number of degrees of freedom here and the absence of clear principles for making one's selections. But this issue doesn't actually seem to be that huge in the context of the sorts of questions Tetlock asked his participants. (You can again correct me if I'm wrong.) The degrees-of-freedom problem might be far larger in other contexts, but the fact that the issue is manageable in Tetlockian contexts presumably counts as at least a little bit of positive evidence. ↩︎

Load More