I tend to disagree with most EAs about existential risk from AI. Unfortunately, my disagreements are all over the place. It's not that I disagree with one or two key points: there are many points in the standard argument where I diverge, and depending on the audience, I don't know which of my disagreements people will consider most important.
I want to write a post highlighting all the important areas where I disagree and offering my own counterarguments as an alternative. The post would benefit from responding to an existing piece, along the same lines as Quintin Pope's article "My Objections to "We're All Gonna Die with Eliezer Yudkowsky"". Unlike that piece, however, mine would be intended to address the EA community as a whole, since I'm aware many EAs already disagree with Yudkowsky even if they buy the basic arguments for AI x-risk.
My question is: what is the current best single article (or set of articles) that provides a well-reasoned and comprehensive case for believing that there is a substantial (>10%) probability of an AI catastrophe this century?
I was considering replying to Joseph Carlsmith's article, "Is Power-Seeking AI an Existential Risk?", since it seemed reasonably comprehensive and representative of the concerns EAs have about AI x-risk. However, I'm a bit worried that the article is not very representative of EAs who assign a high probability to doom, since Carlsmith originally estimated the total risk of catastrophe at only 5% before 2070. In May 2022, Carlsmith changed his mind and reported a higher probability, but I'm not sure whether this was because he had been exposed to new arguments or because he simply came to regard the stated arguments as stronger than he originally thought.
I suspect I have both significant moral disagreements and significant empirical disagreements with EAs, and I want to include both in such an article, while mainly focusing on the empirical points. For example, I have the feeling that I disagree with most EAs about:
- How bad human disempowerment would likely be from a utilitarian perspective, and what "human disempowerment" even means in the first place
- Whether there will be a treacherous turn event, during which AIs violently take over the world after previously having been behaviorally aligned with humans
- How likely AIs are to coordinate near-perfectly with each other as a unified front, leaving humans out of their coalition
- Whether we should expect AI values to be "alien" (like paperclip maximizers) in the absence of extraordinary efforts to align them with humans
- Whether the AIs themselves will be significant moral patients, on par with humans
- Whether there will be a qualitative moment when "the AGI" is created, rather than systems incrementally getting more advanced, with no clear finish line
- Whether we get only "one critical try" to align AGI
- Whether "AI lab leaks" are an important source of AI risk
- How likely AIs are to kill every single human if they are unaligned with humans
- Whether there will be a "value lock-in" event soon after we create powerful AI that causes values to cease their evolution over the coming billions of years
- How bad problems related to "specification gaming" will be in the future
- How society is likely to respond to AI risks, and whether they'll sleepwalk into a catastrophe
However, I also disagree with points made by many other EAs who have argued against the standard AI risk case. For example:
- I think AIs will eventually become vastly more powerful and smarter than humans, and so will eventually be able to "defeat all of us combined"
- I think a benign "AI takeover" event is very likely even if we align AIs successfully
- I think AIs will likely be goal-directed in the future. I don't think, for instance, that we can just "not give the AIs goals" and then everything will be OK
- I think it's highly plausible that AIs will end up with substantially different values from humans (although I don't think this will necessarily cause a catastrophe)
- I don't think we currently have strong evidence that deceptive alignment is an easy problem to solve
- I think it's plausible that AI takeoff will be relatively fast, and that the world will be dramatically transformed over a period of several months or a few years
- I think short timelines, meaning a dramatic transformation of the world within 10 years from now, are pretty plausible
I'd like to elaborate on as many of these points as possible, preferably by responding to direct quotes from the representative article arguing for the alternative, more standard EA perspective.
To be clear, my argument would be that the go-ahead point for longtermists likely looks much higher, something like a 10% total risk of catastrophe. Actually, that's not exactly how I'd frame it, since what matters more is how much we can reduce the risk of catastrophe by delaying, not just the total risk of a catastrophe. But I'd likely consider a world where we delay AI until the total risk falls below 0.1% to be intolerable from several perspectives.
I guess one way of putting my point here is that you probably think of "human disempowerment" as a terminal state that is astronomically bad, and probably far worse than "all currently existing humans die". But I don't really agree with this. Human disempowerment just means that the species Homo sapiens is disempowered, and I don't see why we should draw the relevant moral boundary around our species. We can imagine drawing other boundaries, such as around "our current cultural and moral values", which I think would drift dramatically over time even if the human species remained.
I'm just not really attached to the general frame here. I don't identify much with "human values" in the abstract, as opposed to other salient characteristics of intelligent beings. I think the standard EA framing around "humans" is flawed in a way that matters for these arguments (and this includes most attempts I've seen to broaden the standard arguments by removing references to humans). Even when an EA insists their concern isn't about the human species per se, I typically end up disagreeing on some other fundamental point that seems like roughly the same thing I'm pointing at. Unfortunately, I consistently have trouble conveying this point to people, so I'm not likely to be understood here unless I give a very thorough argument.
I suspect it's a bit like the arguments vegans have with non-vegans about whether animals are OK to eat because they're "not human". There's a conceptual leap from "I care a lot about humans" to "I don't necessarily care a lot about the human species boundary" that people don't reliably find intuitive, except perhaps after a lot of reflection. Most ordinary arguments between vegans and non-vegans won't lead people to successfully cross this conceptual gap. It's just a counterintuitive concept for most people.
Perhaps as a brief example to help illustrate my point: it seems very plausible to me that I would identify more strongly with a smart behavioral LLM clone of me, trained on my personal data, than with the human species as a whole. This holds even allowing for imperfections in the behavioral clone arising from failures to generalize perfectly from my data (though excluding extreme cases in which the entity fails to capture any significant behavioral properties at all). Even if this clone were not aligned with humanity in the strong sense often meant by EAs, it is not obvious to me that giving it power would be bad, even at the expense of empowering "real humans".
On top of all of this, I think I disagree with your argument about discount rates, since you seem to be ignoring the case for high discount rates based on epistemic uncertainty rather than on pure time preference.