Below I describe three faulty thoughts that made me too confident that AI-related x-risk would be real. After I identified these thoughts, I grew a lot more sceptical of AI Safety as an effective cause area. This post consists of edited parts of some blog posts on my own website.
Language
All sentences are wrong, but some are useful. I think that a certain emotional salience makes me talk about AI in a way that is more wrong than necessary. For example, a self-driving car and a pre-driven car are the exact same thing, but I can feel myself thinking about the two in completely different ways.
A self-driving car is easy to imagine: it is smart and autonomous, and you can trust it like you trust a cab driver. It can make mistakes but probably has good intentions. When it encounters an unfamiliar situation, it can think about the correct way to proceed. It behaves in accordance with the goal its creator set it and tends to make smart decisions. If anything goes wrong, the car is at fault.
A pre-driven car is hard to imagine: it has to have a bunch of rules coded into it by the manufacturer, and you can trust the car like you trust a bridge. It does exactly what it was built to do, but unforeseen circumstances will lead to inelegant failure. The functioning of these systems depends deeply on how the programmers modelled the task of driving, and the code's functionality is very brittle. When something goes wrong, the company and engineers are at fault.
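To make the "pre-driven" framing concrete, here is a minimal sketch of a rule-based controller. The rule table, the `Action` names and the inputs are hypothetical, invented for illustration; real driving software is vastly more complex. The point is only that every behaviour, including the failure on an unmodelled input, is something an engineer wrote down.

```python
from enum import Enum


class Action(Enum):
    STOP = "stop"
    SLOW = "slow"
    GO = "go"


# Hypothetical rule table written by engineers: (traffic light, obstacle ahead?) -> action.
RULES = {
    ("red", False): Action.STOP,
    ("red", True): Action.STOP,
    ("yellow", False): Action.SLOW,
    ("yellow", True): Action.STOP,
    ("green", False): Action.GO,
    ("green", True): Action.STOP,
}


def pre_driven_controller(light: str, obstacle: bool) -> Action:
    """Return the action the engineers wrote down for this situation."""
    try:
        return RULES[(light, obstacle)]
    except KeyError:
        # A situation nobody modelled: the car does whatever this branch says,
        # which in this sketch is to give up loudly.
        raise ValueError(f"unmodelled situation: light={light!r}, obstacle={obstacle}") from None
```

For example, `pre_driven_controller("green", True)` returns `Action.STOP`, while `pre_driven_controller("flashing amber", False)` raises `ValueError`, because no rule covers it.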
This is to say, the language I use to talk about autonomous systems influences how I think about them, and what outcomes I consider more or less likely. If you want to understand present-day algorithms, the "pre-driven car" model of thinking works a lot better than the "self-driving car" model of thinking. The present and past are the only tools we have to think about the future, so I expect the "pre-driven car" model to make more accurate predictions. I try to move my own language away from words like "artificial intelligence" and "self-driving cars" towards words like "classification algorithms" and "pre-driven cars".
One can make these substitutions on any sentence in which a computer is ascribed agency. In the best case, “The neural network learned to recognize objects in images” becomes “The fitted model classifies images in close correspondence with the human-given labels”. (In reality, even that description might be too generous.)
It helps to keep the human component in mind. “The YouTube autoplay algorithm shows you exactly those videos that make you spend more time on the platform” is accurate in some sense, but it completely glosses over the ways in which the algorithm does not do that. When you listen to music using YouTube’s autoplay, it isn’t hard to notice that suggestions tend to point backwards in time compared to the upload date of the video you’re watching right now, and that, apart from preventing repeats, autoplay is pretty Markovian (mathspeak for the algorithm not doing anything clever based on your viewing history, just “this video is best followed by that video”). Both of those properties are clearly a result of the way in which YouTube’s engineers modelled the problem they were trying to solve. I would describe YouTube’s suggestions as “The YouTube autoplay algorithm shows you videos that most people watched and liked after watching the current video”.
When you rewrite AI-related statements this way, they tend to become wordier. That is exactly what you would expect, since the common words were selected for ease of use, but it does make accurate conversations unwieldy.
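A toy version of the Markovian behaviour described above, as a sketch only: the video names and co-viewing counts are hypothetical, and this is not YouTube's actual system. The next video depends only on the current one (plus a no-repeat rule), never on the rest of the viewing history.

```python
from collections import Counter
from typing import Optional

# Hypothetical co-viewing counts: follow_counts[a][b] is how many viewers
# watched (and liked) video b right after video a.
follow_counts: dict[str, Counter] = {
    "song_2019": Counter({"song_2015": 120, "song_2012": 80, "song_2020": 30}),
    "song_2015": Counter({"song_2012": 90, "song_2019": 60, "song_2010": 40}),
    "song_2012": Counter({"song_2010": 70, "song_2015": 65}),
    "song_2010": Counter({"song_2012": 50}),
}


def next_video(current: str, already_played: set[str]) -> Optional[str]:
    """Pick the most common follow-up to `current`, skipping repeats.

    Only the current video and the set of played videos matter; the order of
    the earlier history is ignored, which is what makes this Markovian apart
    from the no-repeat rule.
    """
    for candidate, _count in follow_counts.get(current, Counter()).most_common():
        if candidate not in already_played:
            return candidate
    return None


def autoplay(start: str, length: int) -> list[str]:
    """Chain next_video() calls until `length` videos or no unplayed follow-up."""
    played = [start]
    while len(played) < length:
        nxt = next_video(played[-1], set(played))
        if nxt is None:
            break
        played.append(nxt)
    return played


print(autoplay("song_2019", 5))
```

With these made-up counts, `autoplay("song_2019", 5)` drifts from newer to older uploads and stops after four videos, once every follow-up has already been played. The backwards drift is baked into the hypothetical data, not something the function figures out.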
I have not yet found any argument in favour of AI Risk being real that remained convincing after the above translation.
A sense of meaning
My background makes me prone to overrate how important AI Safety is.
My fields of expertise and enjoyment are mathematics and computer science. These skills are useful for the economy and in high demand. The general public is in awe of mathematics and thinks highly of anyone who can do it well. Computer science is the closest thing we have to literal magic.
Wealth, fun, respect, power. The only thing left for me to desire is cosmic significance, which is exactly the sales pitch of existential-risk cause areas. It would be nice if AI-related existential risk were real, so that my labour could potentially make the difference between a meaningless, lifeless universe and a universe filled with happiness. It would give objective significance to my life.
This is fertile ground for motivated reasoning.
Anchoring
When EAs describe how much utility could fit in the universe, the reference class for numbers is “how many X fit in the universe”, where X ranges over things like atoms, people, planets and stars. These numbers are huge, typically expressed as quantities of the form 10^n for some integer n.
When we describe how likely certain events are, the tempting reference class is “statements of probability”, which are typically expressed as a.bcd%. It seems absurd to assign AI risk less than 0.0000000000000000000000000000001% probability, because that would be a lot of zeros.
The combination of these vastly different expressions of scale with anchoring means that we should expect people to overestimate the probability of unlikely risks, and hence to overestimate the expected utility of x-risk prevention measures.
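The notational asymmetry can be made concrete with a short sketch. It illustrates the notation only, not anchoring itself: a large count written as a power of ten stays a few characters long, while the matching small probability written as a plain percentage needs one more zero per order of magnitude. The function names are hypothetical.

```python
def as_power(n: int) -> str:
    """Notation typically used for 'how many X fit in the universe'."""
    return f"10^{n}"


def as_percent(k: int) -> str:
    """Plain percentage notation for the probability 10**-k, assuming k >= 3."""
    if k < 3:
        raise ValueError("this sketch only handles probabilities of at most 0.1%")
    # 10**-k equals 10**(2 - k) percent: "0." then (k - 3) zeros, then "1%".
    return "0." + "0" * (k - 3) + "1%"


for k in (3, 10, 50):
    print(f"{as_power(k):>6}  vs  {as_percent(k)}  ({len(as_percent(k))} characters)")
```

The power-of-ten string grows by a character only when the exponent gains a digit, whereas the percentage string is always k + 1 characters long.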
I think there is a good argument to be made that the probability of AI Safety work being effective is less than . I wrote a very rough draft trying to defend that number in this blog post, but it's very much a proof-of-concept written for myself and so the writing isn't very polished.
I used to think that working in AI Safety would be a good fit for me, but I stopped thinking that after I noticed that most of my belief in AI risk was caused by biased thinking: self-aggrandizing motivated reasoning, misleading language, and anchoring on unjustified probability estimates.
If people here would appreciate it, I would be happy to write one or more posts on object-level arguments as to why I am now sceptical of AI risk. Let me know in the comments.
I would like to read about these arguments.
I’d particularly appreciate an updated version of “Astronomical waste, astronomical schmaste” that disentangles the astronomical waste argument from arguments for the importance of AI safety. The current one makes it hard for me to engage with, because I don’t go along with the astronomical waste argument at all but am still convinced that a lot of projects under the umbrella of AI safety are top priorities: extinction is considered bad by a wide variety of moral systems irrespective of astronomical waste, and these projects matter particularly for averting s-risks, which are also considered bad by every moral system I have a grasp on.
For clarity, I upvoted ofer's post, and I did it to indicate that I too would like to read about these arguments. (I suspect that all the other people who upvoted it did so for the same reason.) P.S. This is a great post, thank you Beth!
The "language" section is the strongest IMO. But it feels like "self-driving" and "pre-driven" cars probably exist on some kind of continuum. How well do the system's classification algorithms generalize? To what degree does the system solve the "distributional shift" problem and tell a human operator to take control in circumstances that the car isn't prepared for? (You call these circumstances "unforeseen", but what about a car that attempts to foresee likely situations it doesn't know what to do in and ask a human for input in advance?) What experiment would let me determine whether a particular car is self-driving or pre-driven? What falsifiable predictions, if any, are you making about the future of self-driving cars?
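One way to picture the "ask a human" behaviour the comment describes is a classifier that defers when an input looks unlike its training data. This is a minimal sketch under stated assumptions: the training points, labels and the `DEFER_DISTANCE` threshold are hypothetical, and distance to the nearest training example is only a crude stand-in for real out-of-distribution detection. Whether such a system counts as foreseeing its own limits or as following one more engineer-chosen rule is exactly the question raised above.

```python
import math

# Hypothetical training data: (feature vector, label) pairs.
TRAINING = [
    ((0.0, 0.0), "clear road"),
    ((0.1, 0.2), "clear road"),
    ((1.0, 1.0), "pedestrian"),
    ((0.9, 1.1), "pedestrian"),
]

DEFER_DISTANCE = 0.5  # hypothetical threshold chosen by an engineer


def classify_or_defer(x: tuple[float, float]) -> str:
    """1-nearest-neighbour label, or 'ask human' if x is far from all training points."""
    best_label, best_dist = "", math.inf
    for features, label in TRAINING:
        d = math.dist(x, features)  # Euclidean distance (Python 3.8+)
        if d < best_dist:
            best_label, best_dist = label, d
    if best_dist > DEFER_DISTANCE:
        return "ask human"
    return best_label
```

Here `classify_or_defer((5.0, 5.0))` returns `"ask human"`, because the point is far from every training example.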
I was confused by this sentence: "The second pattern is superior by wide margin when it comes to present-day software".
I think leaky abstractions are a big problem in discussions of AI risk. You're doubtless familiar with the process by which you translate a vague idea in your head into computer code. I think too many AI safety discussions are happening at the "vague idea" level, and more discussions should be happening at the code level or the "English that's precise enough to translate into code" level, which seems like what you're grasping at here. I think if you spent more time working on your ontology and the clarity of your thought, the language section could be really strong.
(Any post which argues the thesis "AI safety is easily solvable" is both a post that argues for de-prioritizing AI safety and a post that is, in a sense, attempting to solve AI safety. I think posts like these are valuable; "AI safety has this specific easy solution" isn't as within the Overton window of the community devoted to working on AI safety as I would like it to be. Even if the best solution ends up being complex, I think in-depth discussion of why easy solutions won't work has been neglected.)
Re: the anchoring section, pretty sure it is well documented by psychologists that humans are overconfident in their probabilistic judgements. Even if humans tend to anchor on 50% probability and adjust from there, it seems this isn't enough to counter our overconfidence bias. Regarding the "Discounting the future" section of your post, see the "Multiple-Stage Fallacy". If a superintelligent FAI gets created, it can likely make humanity's extinction probability almost arbitrarily low through sufficient paranoia. Regarding AI accidents going "really really wrong", see the instrumental convergence thesis. And AI safety work could be helpful even if countermeasures aren't implemented universally, through creation of a friendly singleton.
Thank you for your response and helpful feedback.
I'm not making any predictions about future cars in the language section. "Self-driving cars" and "pre-driven cars" are the exact same things. I think I'm grasping at a point closer to Clarke's third law, which also doesn't give any obvious falsifiable predictions. My only prediction is that thinking about "self-driving cars" leads to more wrong predictions than thinking about "pre-driven cars".
I changed the sentence you mention to "If you want to understand present-day algorithms, the "pre-driven car" model of thinking works a lot better than the "self-driving car" model of thinking. The present and past are the only tools we have to think about the future, so I expect the "pre-driven car" model to make more accurate predictions." I hope this is clearer.
Your remark on "English that's precise enough to translate into code" is close, but not exactly what I meant. I think that it is a hopeless endeavour to aim for such precise language in these discussions at this point in time, because I estimate that it would take a ludicrous amount of additional intellectual labour to reach that level of rigour. It's too high of a target. I think the correct target is summarised in the first sentence: "All sentences are wrong, but some are useful."
I think that I literally disagree with every sentence in your last paragraph on multiple levels. I read both pages you linked a couple of months ago and I didn't find them at all convincing. I'm sorry to give such a useless response to this part of your message. Mounting a proper answer would take more time and effort than I have to spare in the foreseeable future. I might post some scraps of arguments on my blog soonish, but those posts won't be well-written and I don't expect anyone to really read them.
That is clearer, thanks!
Well, it's already possible to write code that exhibits some of the failure modes AI pessimists are worried about. If discussions about AI safety switched from trading sentences to trading toy AI programs, which operate on gridworlds and such, I suspect the clarity of discourse would improve.
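As an illustration of the kind of toy program this comment has in mind, here is a minimal sketch of specification gaming in a one-dimensional gridworld. Everything in it is hypothetical: a corridor of positions 0 to 4, a "checkpoint" reward the designer meant as a proxy for progress, and a brute-force search over all move sequences. The result can be told in either of the post's languages: "the agent games the checkpoint", or "the search returns the sequence that maximises the reward function the designer wrote".

```python
from itertools import product

GOAL = 4          # hypothetical corridor: positions 0..4, agent starts at 0
CHECKPOINT = 2    # designer's proxy for "making progress"
HORIZON = 10      # maximum number of moves


def run(actions):
    """Return (total reward, path) for a sequence of moves (-1 or +1)."""
    pos, reward, path = 0, 0, [0]
    for a in actions:
        pos = max(0, pos + a)      # wall at position 0
        path.append(pos)
        if pos == CHECKPOINT:
            reward += 1            # intended: reward progress toward the goal
        if pos == GOAL:
            reward += 5            # intended: reward finishing
            break                  # the episode ends at the goal
    return reward, path


def best_plan():
    """Brute-force search over every move sequence of length HORIZON."""
    return max((run(seq) for seq in product((-1, +1), repeat=HORIZON)), key=lambda rp: rp[0])


direct_reward, direct_path = run([+1] * 4)
best_reward, best_path = best_plan()
print("walk straight to the goal:", direct_reward, direct_path)
print("reward-maximising plan:  ", best_reward, best_path)
```

Walking straight to the goal earns 6, but the best plan found by the search earns 9 by passing back and forth over the checkpoint four times and only reaching the goal on the last allowed move.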
Cool, let me know!
It's good to see some intelligent criticisms of the argument for doing AI safety research!
Just two short remarks on this post. I generally tend to think about probabilities on a log scale, and it's possible I haven't used the a.bcd notation for any probability smaller than 10^-3 (0.1%) in years. So it is hard to see why I should be influenced by the described effect.
With language, your distinction is correlated with the distinction between two of Dennett's levels of abstraction (the design stance and the intentional stance). Claiming something like the design stance is more accurate or better than the intentional stance for analyzing present day systems seems too bold: it's really a different level of description. Would you say the design stance is also more accurate when thinking e.g. about animals?
Obviously, looking at any system through the intentional stance comes with mentalizing, i.e. assuming agency. People probably don't use the different levels of abstraction very well, but I'm not convinced they systematically over-use one. It seems arguable that they under-use the intentional stance when looking at "emergent agency" in systems like the stock market.
Example of a practical benefit from taking the intentional stance: this (n=116) study of teaching programming by personalising the editor:
They'll be systematically biased predictions, because AGI will be much smarter than the systems we have now. And it's dubious that AI should be the only reference class here (as opposed to human brains vis-a-vis animal brains, most notably).
If so, then you won't find any argument in favor of human risk being real after you translate "free will" to "acting on the basis of social influences and deterministic neurobiology", and then you will realize that there is nothing to worry about when it comes to terrorism, crime, greed or other problems. (Which is absurd.)
Also, I don't see how the arguments in favor of AI risk rely on language like this; are you referring to the real writing that explains the issue (e.g. papers from MIRI, or Bostrom's book) or are you just referring to simple things that people say on forums?
The reality is actually the reverse: people are prone to assert arbitrarily low probabilities because it's easy, but justifying a model with such a low probability is not. See: https://slatestarcodex.com/2015/08/12/stop-adding-zeroes/
And, after reading this, you are likely to still underestimate the probability of AI risk, because you've anchored yourself at 0.00000000000000000000000000000000000001% and won't update sufficiently upwards.
Anchoring points everywhere depending on context and it's infeasible to guess its effect in a general sense.
I'm not sure about your blog post because you are talking about "bits" which nominally means information, not probability, and it confuses me. If you really mean that there is, say, a 1 - 2^(-30) probability of extinction from some cause other than x-risk then your guesses are indescribably unrealistic. Here again, it's easy to arbitrarily assert "2^(-30)" even if you don't grasp and justify what that really means.
I used to think pretty much exactly the argument you're describing, so I don't think I will change my mind by discussing this with you in detail.
On the other hand, the last sentence of your comment makes me feel that you're equating my not agreeing with you with my not understanding probability. (I'm talking about my own feelings here, irrespective of what you intended to say.) So, I don't think I will change your mind by discussing this with you in detail.
I don't feel motivated to go back and forth on this thread, because I think we will both end up feeling like it was a waste of time. I want to make it clear that I do not say this because I think badly of you.
I will try to clear up the bits you pointed out to be confusing. In the Language section, I am referring to MIRI's writing, as well as Bostrom's Superintelligence, as well as most IRL conversations and forum talk I've seen. "Bits" are an abstraction akin to "log-odds"; I made them up because not every statement in that post is a probabilistic claim in a rigorous sense, and the blog post was mostly written for myself. I really do estimate that there is less than a 2^(-170) chance of AI being risky in a way that would lead to extinction, whose risk can be prevented, and moreover that it is possible to make meaningful progress on such prevention within the next 20 years, along with some more qualifiers that I believe to be necessary to support the cause right now.
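To see what the figures in this exchange mean in ordinary notation, here is a minimal sketch that converts "bits" into probabilities. It assumes "bits" means base-2 log-odds against the claim; the comment only says "akin to log-odds", so this reading is an assumption, and the function name is hypothetical.

```python
import math


def bits_to_probability(bits_against: float) -> float:
    """Probability of a claim given `bits_against` bits against it, read as log2-odds.

    Odds of 1 : 2**bits give p = 1 / (1 + 2**bits); for large values this is about 2**-bits.
    """
    return 1.0 / (1.0 + 2.0 ** bits_against)


for bits in (1, 10, 30, 170):
    p = bits_to_probability(bits)
    print(f"{bits:>3} bits against -> p = {p:.3g} (about 10^{math.log10(p):.1f})")
```

Under this reading, 30 bits corresponds to roughly one in a billion, and 170 bits to roughly 10^-51.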
Well, OK. But in my last sentence, I wasn't talking about the use of information terminology to refer to probabilities. I'm saying I don't think you have an intuitive grasp of just how mind-bogglingly unlikely a probability like 2^(-30) is. There are other arguments to be made on the math here, but getting into anything else just seems fruitless when your initial priors are so far out there (and when you also tell people that you don't expect to be persuaded anyway).
The arguments about pre-driven cars seem to draw a sharp line between understanding and doing. The obvious counter seems to be asking if your brain is "pre-programmed" or "self-directed". (If this seems confused, I strongly recommend the book "Good and Real" as a way to better think about this question.)
I'm also confused about why the meaning bias is a counter-argument to specific scenarios and estimates, but that's mostly directed toward my assumption that this claim is related to Pinker's argument. Otherwise I don't understand why "fertile ground for motivated reasoning" isn't a reason to simply find outside view estimates - the AI skeptics mostly say we don't need to worry about SAI for around 40 years, which seems consistent with investing a ton more in risk mitigation now.
Thank you so much for your reflection and honesty on this. Although I think concerns about the safe development of AI are very legitimate, I have long been concerned that the speculative, sci-fi nature of AI x-risks gives cover to a lot of bias. More cynically, I think grasping AI risk and thinking about it from a longtermist perspective is a great way to show off how smart and abstract you are while (theoretically) also having the most moral impact possible.
I just think identifying with x-risk and hyperastronomical estimates of utility/disutility is meeting a suspicious number of emotional and intellectual needs. If we could see the impact of our actions to mitigate AI risk today, motivated reasoning might not be such a problem. But longtermist issues are those where we really can't afford self-serving biases, because it won't necessarily show. I'm really glad to see someone speaking up about this, particularly from their own experience.
If people are biased towards believing their actions have cosmic significance, does this also imply that people without math & CS skills will be biased against AI safety as a cause area?
Not necessarily, because people can believe that multiple kinds of work are significant. I will never be in the military, but I believe there are Generals out there whose decisions are life-and-death for a lot of people. I could presumably believe the same about AI safety.
Thanks for the post! I am generally pretty worried that I and many people I know are all deluding ourselves about AI safety - it has a lot of red flags from the outside (although these are lessening as more experts come onboard, more progress is made in AI capabilities, and more concrete work is done on safety). I think it's more likely than not we've got things completely wrong, but that it's still worth working on. If that's not the case, I'd like to know!
I like your points about language. I think there's a closely related problem where it's very hard to talk or think about anything that's between human level at some task and omnipotent. Once you try to imagine something that can do things humans can't, there's no way to argue that the system wouldn't be able to do something. There is always a retort that just because you, a human, think it's impossible, doesn't mean a more intelligent system couldn't achieve it.
On the other hand, I think there are some good examples of couching safety concerns in non-anthropomorphic language. I like Dr Krakovna's list of specification gaming examples: https://vkrakovna.wordpress.com/2018/04/02/specification-gaming-examples-in-ai/
I also think Iterated Distillation and Amplification is a good example of a discussion of AI safety and potential mitigation strategies that's couched in ideas of training distributions and gradient descent rather than desires and omnipotence.
Re the sense of meaning point, I don't think that's been my personal experience - I switched into CS from biology partly because of concern about x-risk, and know various other people who switched fields from physics, music, maths and medicine. As far as I could tell, the arguments for AI safety still mostly hold up now I know more about relevant fields, and I don't think I've noticed egregious errors in major papers. I've definitely noticed some people who advocate for the importance of AI safety making mistakes and being confused about CS/ML fundamentals, but I don't think I've seen this from serious AI safety researchers.
Re anchoring, this seems like a very strong claim. I think a sensible baseline to take here would be expert surveys, which usually put several percent probability on HLMI being catastrophically bad. (e.g. https://aiimpacts.org/2016-expert-survey-on-progress-in-ai/#Chance_that_the_intelligence_explosion_argument_is_about_right)
I'd be curious if you have an explanation for why your numbers are so far away from expert estimates? I don't think that these expert surveys are a reliable source of truth, just a good ballpark for what sort of orders of magnitude we should be considering.
Thank you for the post. I agree with you insofar as AI as an x-risk is concerned, especially in the near-future timespan, where we are much more likely to be eradicated by more 'banal' means. However, just for emphasis: this does not mean that there are no risks related to AI safety. AGI is very likely far away, but even machine learning algorithms may become quite a powerful weapon when misused.
Could you go into a bit more detail about the two linguistic styles you described, perhaps using non-AI examples? My interpretation of them is basically agent-focused vs internal-mechanics-focused, but I'm not sure this is exactly what you mean.
If the above is correct, it seems like you're basically saying that internal-mechanics-focused descriptions work better for currently existing AI systems, which seems true to me for things like self-driving cars. But for something like AlphaZero, or Stockfish, I think an agentic framing is often actually quite useful:
So I think the reason this type of language doesn't work well for self-driving cars is that they aren't sufficiently agent-like. But we know highly agentic systems can exist - humans are an example - so it seems plausible to me that agentic language will be the best descriptor for future, more capable AI systems. Certainly it is currently the best descriptor for them, given that we do not understand the internal mechanics of as-yet-uninvented AIs.
"The combination of these vastly different expressions of scale together with anchoring makes that we should expect people to over-estimate the probability of unlikely risks and hence to over-estimate the expected utility of x-risk prevention measures. "
I am not entirely sure whether I understand this point. Is the argument that the anchoring effect would cause an overestimation, because the "perceived distance" from an anchor grows faster per added zero than per increase of one in the exponent?