Here's a dynamic that I've seen pop up more than once.
Person A says that an outcome they judge to be bad will occur with high probability, while making a claim of the form "but I don't want (e.g.) alignment to be doomed — it would be a huge relief if I'm wrong!"
It seems uncontroversial that Person A would like to be shown that they're wrong in a way that vindicates their initial forecast as ex ante reasonable.
It seems more controversial whether Person A would like to be shown that their prediction was wrong, in a way that also shows their initial prediction to have been ex ante unreasonable.
In my experience, it's much easier to acknowledge that you were wrong about some specific belief (or the probability of some outcome), than it is to step back and acknowledge that the reasoning process which led you to your initial statement was misfiring. Even pessimistic beliefs can be (in Ozzie’s language) "convenient beliefs" to hold.
If we identify ourselves with our ability to think carefully, coming to believe that there are errors in our reasoning process can hit us much more personally than updates about errors in our conclusions. Optimistic updates might be an update towards me thinking that my projects have been less worthwhile than I thought, that my local community is less effective than I thought, or that my background framework or worldview was in error. I think these updates can be especially painful for people who are more liable to identify with their ability to reason well, or identify with the unusual merits of their chosen community.
To clarify: I'm not claiming that people with more pessimistic conclusions are, in general, more likely to be making reasoning errors. Obviously there are plenty of incentives towards believing rosier conclusions. I'm simply claiming that: if someone arrives at a pessimistic conclusion based on faulty reasoning, then you shouldn't necessarily expect optimistic pushback to be uniformly welcomed— for all of the standard reasons that updates of the form "I could've done better on a task I care about" can be hard to accept.
Also, sometimes the people saying this have their identity, reputation, or their career tied to the pessimistic belief. They might consciously prefer the optimistic outcome, but I reckon emotionally/subconsciously the bias would clearly be in the other direction.
Some brief, off-the-cuff sociological reflections on the Bostrom email:
Each consideration is discussed below.
I think basically everyone here agrees that white people shouldn’t be using (or mentioning) racial slurs. I think you should also avoid generics. I think that you very rarely gain anything, even from a purely epistemic point of view, from saying (for example): “men are more stupid than women”.
EA skews young, and does a lot of outreach on university campuses. I also think that EA will continue to be attractive to people who like to engage in the world via a certain kind of communication, and I think many people interested in EA are likely to be drawn to controversial topics. I think this is unavoidable. Given that it’s unavoidable, it’s worth being conscious of this, and directly tackling what’s to be gained (and lost) from certain provocative modes of communication, in what contexts.
Lead poisoning affects IQ. The Flint water crisis affected an area that was majority African American, and we have strong evidence that lead poisoning affects IQ. Here’s one way of ‘provocatively communicating’ certain facts.
The US water system is poisoning black people.
I don’t like this statement and think it is true.
Deliberately provocative communication probably does have its uses, but it’s a mode of communication that can be in tension with nuanced epistemics, as well as kindness. If I’m to get back in touch with my own old edgelord streak for a moment, I’d say that one (though obviously not the major) benefit of EA is the way it can transform ‘edgelord energy’ into something that can actually make the world better.
I think, as with SBF, there’s a cognitive cluster which draws people towards both EA, and towards certain sorts of actions most of us wish to reflectively disavow. I think it’s reasonable to say: “EA messaging will appeal (though obviously will not only appeal) disproportionately to a certain kind of person. We recognize the downsides in this, and here’s what we’re doing in light of that.”
There are fewer women than men in computer science. This doesn’t give you any grounds — once a woman says “I’m interested in computer science”, to be like “oh, well, but maybe she’s lying, or maybe it’s a joke, or … ”
Why? You have evidence that means you don’t need to rely on such coarse data! You don’t need to rely on population-level averages! To the extent that you do, or continue to treat such population-averages as highly salient in the face of more relevant evidence, I think you should be criticized. I think you should be criticized because it’s a sign that your epistemics have been infected by pernicious stereotypes, which makes you worse at understanding the world, in addition to being more likely to cause harm when interacting in that world.
You should be careful to believe true things, even when they’re inconvenient.
On the epistemic level, I actually think that we’re not in an inconvenient possible world wrt the ‘genetic influence on IQ’, partially because I think certain conceptual discussions of ‘heritability’ are confused, and partially because I think that it’s obviously reasonable to look at the (historically quite recent!) effects of slavery, and conclude “yeaah, I’m not sure I’d expect the data we have to look all that different, conditioned on the effects of racism causing basically all of the effects we currently see”.
But, fine, suppose I’m in an inconvenient possible world. I could be faced with data that I’d hate to see, and I’d want to maintain epistemic integrity.
One reason I personally found Bostrom’s email sad was that I sensed a missing mood. To support this, here’s an intuition pump that might be helpful: suppose you’re back in the early days of EA, working for $15k in the basement of an estate agent. You’ve sacrificed a lot to do something weird, sometimes you feel a bit on the defensive, and you worry that people aren’t treating you with the seriousness you deserve. Then, someone comes along, says they’ve run some numbers, and told you that EA was more racist than other cosmopolitan groups, and — despite EA’s intention to do good — is actually far more harmful to the world than other comparable groups. Suppose further that we also ran surveys and IQ tests, and found that EA is also more stupid and unattractive than other groups. I wouldn’t say:
EA is harmful, racist, ugly, and stupid. I like this sentence and think it’s true.
Instead, I’d communicate the information, if I thought it was important, in a careful and nuanced way. If I saw someone make the unqualified statement quoted above, I wouldn’t personally wish to entrust that person with promoting my best interests, or with leading an institute directed towards the future of humanity.
I raise this example not because I wish to opine on contemporary Bostrom, based on his email twenty-six years ago. I bring this example up because, while (like 𝕮𝖎𝖓𝖊𝖗𝖆) I’m glad that Bostrom didn’t distort his epistemics in the face of social pressure, I think it’s reasonable to think (like Habiba, apologies if this is an unfair phrase) that Bostrom didn’t take ownership for his previously missing mood, and communicate why his subsequent development leads him to now repudiate what he said.
I don’t want to be unnecessarily punitive towards people who do shitty things. That’s not kindness. But I also want to be part of a community that promotes genuinely altruistic standards, including a fair sense of penance. With that in mind, I think it’s healthy for people to say: “we accept that you don’t endorse your earlier remark (Bostrom originally apologized within 24 hours, after all), but we still think your apology misses something important, and we’re a community that wants people who are currently involved to meet certain standards.”
A working attempt to sketch a simple three-premise argument for the claim: ‘TAI will result in human extinction’, and offer objections. Made mostly for my own benefit while working on another project, but I thought it might be useful to post here.
The structure of my preferred argument is similar to an earlier framing suggested by Katja Grace.
I’ll offer some rough probabilities, but the probabilities I’m offering shouldn’t be taken seriously. I don’t think probabilities are the best way to adjudicate disputes of this kind, but I thought offering a more quantitative sense of my uncertainty (based on my immediate impressions) might be helpful in this case. For the (respective) premises, I might go for 98%, 7%, 83%, resulting in a ~6% chance of human extinction given TAI.
Some more specific objections:
I also think the story by Katja Grace below is plausible, in which superhuman AI systems are “goal-directed”, but don’t lead to human extinction.
AI systems proliferate, and have various goals. Some AI systems try to make money in the stock market. Some make movies. Some try to direct traffic optimally. Some try to make the Democratic party win an election. Some try to make Walmart maximally profitable. These systems have no perceptible desire to optimize the universe for forwarding these goals because they aren’t maximizing a general utility function, they are more ‘behaving like someone who is trying to make Walmart profitable’. They make strategic plans and think about their comparative advantage and forecast business dynamics, but they don’t build nanotechnology to manipulate everybody’s brains, because that’s not the kind of behavior pattern they were designed to follow. The world looks kind of like the current world, in that it is fairly non-obvious what any entity’s ‘utility function’ is. It often looks like AI systems are ‘trying’ to do things, but there’s no reason to think that they are enacting a rational and consistent plan, and they rarely do anything shocking or galaxy-brained.
Perhaps the story above is unlikely because the AI systems in Grace’s story would (in the absence of strong preventative efforts) be dangerous maximizers. I think that this is most plausible on something like Eliezer’s model of agency, and if my views change my best bet is that I’ll have updated towards his view.
Finally, I sometimes feel confused by the concept of ‘capabilities’ as it’s used in discussions about AGI. From Jenner and Treutlein’s response to Grace’s counterarguments:
Assuming it is feasible, the question becomes: why will there be incentives to build increasingly capable AI systems? We think there is a straightforward argument that is essentially correct: some of the things we care about are very difficult to achieve, and we will want to build AI systems that can achieve them. At some point, the objectives we want AI systems to achieve will be more difficult than disempowering humanity, which is why we will build AI systems that are sufficiently capable to be dangerous if unaligned.”
Maybe one thing I’m thinking here is that “more difficult” is hard to parse. The AI systems might be able to achieve some narrower outcome that we desire, without being “capable” of destroying humanity. I think this is compatible with having systems which are superhumanly capable of pursuing some broadly-scoped goals, without being capable of pursuing all broadly-scoped goals.
(Also, I’m no doubt missing a bunch of relevant information here. But this is probably true for most people, and I think it’s good for people to share objections even if they’re missing important details)
Nice analysis! I think a main point of disagreement is that I don't think systems need to be "dangerous maximizers" in the sense you described in order to predictably disempower humanity and then kill everyone. Humans aren't dangerous maximizers yet we've killed many species of animals, neanderthals, and various other human groups (genocide, wars, oppression of populations by governments, etc.) Katja's scenario sounds plausible for me except for the part where somehow it all turns out OK in the end for humans. :)Another, related point of disagreement:
“look, LLMs distill human cognition, much of this cognition implicitly contains plans, human-like value judgements, etc.” I start from a place where I currently believe “future systems have human-like inductive biases” will be a better predictive abstraction than “randomly sample from the space of simplicity-weighted plans”. And … I just don’t currently see the argument for rejecting my current view?
I actually agree that current and future systems will have human-like concepts, human-like inductive biases, etc. -- relative to the space of all possible minds at least. But their values will be sufficiently alien that humanity will be in deep trouble. (Analogy: Suppose we bred some octopi to be smarter and smarter, in an environment where they were e.g. trained with pavlovian conditioning + artificial selection to be really good at reading internext text and predicting it, and then eventually writing it also.... They would indeed end up a lot more human-like than regular wild octopi. But boy would it be scary if they started getting generally smarter than humans and being integrated deeply into lots of important systems and humans started trusting them a lot etc.)
thnx! : )
Your analogy successfully motivates the “man, I’d really like more people to be thinking about the potentially looming Octopcracy” sentiment, and my intuitions here feel pretty similar to the AI case. I would expect the relevant systems (AIs, von-Neumann-Squidwards, etc) to inherit human-like properties wrt human cognition (including normative cognition, like plan search), and a small-but-non-negligible chance that we end up with extinction (or worse).
On maximizers: to me, the most plausible reason for believing that continued human survival would be unstable in Grace’s story either consists in the emergence of dangerous maximizers, or the emergence of related behaviors like rapacious influence-seeking (e.g., Part II of What Failure Looks Like). I agree that maximizers aren't necessary for human extinction, but it does seem like the most plausible route to ‘human extinction’ rather than ‘something else weird and potentially not great’.
Nice. Well, I guess we just have different intuitions then -- for me, the chance of extinction or worse in the Octopcracy case seems a lot bigger than "small but non-negligible" (though I also wouldn't put it as high as 99%).Human groups struggle against each other for influence/power/control constantly; why wouldn't these octopi (or AIs) also seek influence? You don't need to be an expected utility maximizer to instrumentally converge; humans instrumentally converge all the time.
Oh also you might be interested in Joe Carlsmith's report on power-seeking AI, it has a relatively thorough discussion of the overall argument for risk.
This is a good summary, thanks for writing it up!
I do agree with this, in principle:
A system being ‘cognitively efficient wrt humanity’ doesn’t automatically entail ‘whatever goals the system has – and whatever constraints the system might otherwise face – the cognitively efficient system gets what it wants’.
...though I don't think it buys us more than a couple points; I think people dramatically underestimate how high the ceiling is for humans and think that a reasonably smart human familiar with the right ideas would stand a decent chance at executing a takeover if placed into the position of an AI (assuming speedup of cognition, + whatever actuators current systems typically possess).
However, I think this is wrong:
LLMs distill human cognition
LLMs have whatever capabilities they have because those are the capabilities discovered by gradient descent which, given their architecture, improved their performance on the test task (next token prediction). This task is extremely unlike the tasks represented in the environment where human evolution occurred, and the kind of cognitive machinery which would make a system effective at next token prediction seems very different from whatever it is that humans do. (Humans are capable of next token prediction, but notably we are much worse at it than even GPT-3.)
Separately, the cognitive machinery that represents human intelligence seems to be substantially decoupled from the cognitive machinery that represents human values (and/or the cognitive machinery that causes humans to develop values after birth), so if it turned out that LLMs did, somehow, share the bulk of their cognitive algorithms with humans, that would be a slight positive update for me, but not an overwhelming one, since I wouldn't expect an LLM to want anything remotely relevant to humans. (Most of the things that humans want are lossy proxies for things that improved IGF in the ancestral environment, many of which generalized extremely poorly out of distribution. What are the lossy proxies for minimizing prediction loss that a sufficiently-intelligent LLM would end up with? I don't know, but I don't see why they'd have anything to do with the very specific things that humans value.)
Pushback appreciated! But I don’t think you show that “LLMs distill human cognition” is wrong. I agree that ‘next token prediction’ is very different to the tasks that humans faced in their ancestral environments, I just don’t see this as particularly strong evidence against the claim ‘LLMs distill human cognition’.
I initially stated that “LLMs distill human cognition” struck me as a more useful predictive abstraction than a view which claims that the trajectory of ML leads us to a scenario where future AIs, are “in the ways that matter”, doing something more like “randomly sampling from the space of simplicity-weighted plans”. My initial claim still seems right to me.
If you want to pursue the debate further, it might be worth talking about the degree to which you’re (un)convinced by Quintin Pope’s claims in this tweet thread. Admittedly, it sounds like you don’t view this issue as super cruxy for you:
“The cognitive machinery that represents human intelligence seems to be substantially decoupled from the cognitive machinery that represents human values”
I don’t know the literature on moral psychology, but that claim doesn’t feel intuitive to me (possibly I’m misunderstanding what you mean by ‘human values’; I’m also interested in any relevant sources). Some thoughts/questions: