I'm a postdoctoral researcher at Stanford HAI and CISAC. I recently gave a HAI Seminar Zoom talk, in which I lay out some of the basic arguments for existential risk from AI during the first 23m of the talk, after which I describe my research interviewing AI researchers and answer Q&A.
I recommend the first 23m as a resource to send to people who are new to these arguments (the talk was aimed at computer science researchers, but is also accessible to the public). This is a pretty detailed, current as of June 2022, public-facing overview that's updated with April-May 2022 papers, and includes readings, funding, additional resources at the bottom of the page.
An optional transcript of the first 23m is below; thanks to Jonathan Low for drafting it and Vaidehi Agarwalla for suggesting it.
[Link to post on LessWrong]
Dr. Vael Gates, a HAI-CISAC postdoc at Stanford University, describes their work interviewing researchers about their perceptions of risks from current and future AI. The transcript below runs over the first 23 minutes of the talk, in which they introduce some recent AI developments, researcher timelines for AGI, and the case for existential risk from non-aligned AGI. The latter part of the talk focuses on Gates’s preliminary research results, and audience Q&A.
My talk today is called “Researcher Perceptions of Current and Future AI”, though it could also be called “Researcher Perceptions of Risks from Advanced AI”, as my talk is actually focused on risk from advanced AI.
The structure of this talk is as follows: I'm going to give some context for the study I did, I'll talk about the development of AI, the concept of AGI, and the alignment problem and existential risk. [Then I'll go on to the research methods I used in this study, some of the research questions I asked researchers, and the interim results, finishing with some concluding thoughts. We should have about 10-15 minutes of Q&A, if my timing is right.]
Let's start with some context. Where are we in AI development? Here's some history from Wikipedia: we start with some precursors, then the birth of AI in 1952, symbolic AI, AI winter, a boom cycle, the second AI winter, and AI 1993-2011. Here we are in the deep learning paradigm, which is 2011 to the present, with AlexNet and the deep learning revolution.
We have some components of the current paradigm that we wouldn't have necessarily expected in the 1950s. We have black box systems. We're using machine learning and neural networks. Compute (computing power) is very important; computing power, data, algorithmic advances, and some of these algorithmic advances are kind of aimed at scaling. That means there are methods that are very general that you can throw more compute and data into to get better behavior. We see Sutton's Bitter Lesson here, which is the idea that general methods that leverage computation are ultimately the most effective - by a large margin compared to human knowledge approaches that were used earlier on.
Here's a quick comic to try and illustrate that lesson.
In the early days of AIs, we used something like statistical learning, where you would know a lot about the domain and you would be very careful to use methods specific to that domain. These days there's an idea of stacking more layers, throwing more compute and data in, and you'll get ever more sophisticated behaviour. It's worth noting that we've been working on AI for less than 100 years and the current paradigm is around 10 years old, and that we've gotten pretty far in that time. However, some people think that we should be much further.
Let's move on to where we are in the present. I think a useful distinction, in the current paradigm, is the difference between narrow AI or machine learning, and more general methods. Historically, it makes sense to start with narrow AI, dedicated to specific tasks. These tasks include things like self-driving cars, robotics, translation, image classification, AlphaGo, AlphaFold (protein folding), and Codex (coding). However, we've increasingly been seeing a move towards more general models, these large language models-- at Stanford known as Foundation Models. An example would be GPT-3, or more recently PaLM or Gato. In fact, in April and May 2022, we've seen a number of papers come out like Chinchilla and PaLM, which are big language models.
Here's an example of DALLE-2. You can write in text like “‘Teddy bears mixing sparkling chemicals as mad scientists’ in the style of steampunk” and you get images that are very beautiful like this and you can use many different prompts. There are more models like Imagen, which has come out shortly after DALLE-2 and is better than DALLE-2. Then there are even more examples like Socratic models, Flamingo, SayCan, and Gato.
Here are some things that Gato can do. Gato is described by DeepMind as: "The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more deciding based on its context, whether it output text, joint talks, button presses or other tokens."
So we were coming from a place in AI where AI and ML applications were very specific, and now we are going towards having models that can do more tasks and once more general. A question here is whether we're going to be able to continue scaling in the future.
We've seen, surprisingly, that scaling does continue to work in some sense, that more and more people are using these large language models. One can ask in the future, whether we'll get even more general. There are some trends in this direction.
Here's a figure showing time on the x-axis. We've got a logarithmic scale on the y-axis as a measure of compute. You can see that compute is definitely increasing.
We're going to run out of compute eventually, but we have models like Chinchilla that show that sometimes you can substitute data for compute in various ways to help correct this.
So maybe in the future, we will see something like artificial general intelligence, which as defined by Wikipedia is "AGI is the hypothetical ability of an intelligent agent to understand or learn any intellectual task that a human being can".
Whether or not we see AGI specifically, it seems like we are moving in a direction where we have AGI-like systems or systems like that in the future. Note again that we've only been working in the deep learning paradigm for 10 years and AI for less than 100.
If we do see these systems, when will we see them? There was a study done in 2018 that surveyed people whose papers were submitted at ICML or NeurIPS (two major machine learning conferences) in 2016. The study asked about high-level machine intelligence. They defined this as "high-level machine intelligence is achieved when unaided machines can accomplish every task better and more cheaply than human workers". Here are the results.
You can see that the median 50% probability of high-level machine intelligence was about 45 years from 2016 with 352 researchers responding to this survey. That's within many of our lifetimes and coming soon, which is interesting.
Here's another source that has been trying to aggregate these. There's a platform called Metaculus, which is a prediction solicitation and aggregation engine. So it’s sort of like a prediction market, where they have a bunch of forecasters who are trying to answer all sorts of different questions. They've had some success in predicting things like COVID and Russia's invasion of Ukraine.
Here's a question that they asked: the date that weakly general AI is publicly known. It's very hard to define what is weakly general AI. They had a whole bunch of conditions here. It needed to be able to reliably pass a Turing Test (of the type that would win the Loebner Silver Prize), needs to be able to score 90% or more on a robust version of the Winograd Schema Challenge, be able to score a 75% percentile on the full mathematics section of a circa 2015 to 2020 standard SAT exam, and be able to learn the classic Atari game Montezuma's Revenge.
On the forecast timeline, there's some guesswork here, and then we see this drop in April and May when all of the new papers came out. They're currently at a community prediction of 2030 - this is quite soon.
So, systems are getting more powerful using large language models. Eventually, we may get to things that are more AGI-like in the future. What are some of the risks of these models?
Turns out people are very concerned about risks. This is skipping a little bit ahead to some of my results from my work, but when I asked people what they were worried about in terms of risks from AI, people mentioned all sorts of things. One thing people mentioned is the idea of trustworthy AI or ethical AI, which includes things like fairness, algorithmic bias, privacy, surveillance, transparency, interpretability, explainability, and worries about manipulation (we can see this in social media). And military applications, algorithmic attacks-- where systems aren't very robust-- and misuse by industries and by nations.
However, I'm going to focus more on risks that I see arising specifically from very general AI. One of the problems that people most talk about is called the alignment problem. This could possibly, as some researchers think, lead to existential risk, and is relatively neglected compared to the amount of risks from narrow AI.
I’ve been talking about risks from general AI and existential risk, which is the death of all humanity. This seems pretty extreme. Are people even worried about this?
Here we reference again the study that I mentioned earlier from Grace et al., surveying the ICML and NeurIPS researchers in 2016. The researchers were asked the chance that high-level machine intelligence would have a positive or negative long-run impact on humanity. The researchers had some percentage on HLMI being “bad” or “extremely bad”, which includes that it could lead to human extinction. High-level machine intelligence was seen as likely to have positive outcomes, but catastrophic risks were seen as possible. Specifically, the researchers had a probability of 5% on an outcome described as extremely bad like human extinction. So that’s a dice roll.
It's pretty interesting that researchers think that there's a 5% median chance that the end result of their work will result in extremely bad outcomes like human extinction. Even though that probability is not the most likely— it's pretty small— 5% is still way higher than I'd like to gamble on. It might be worth putting attention on the possibility of extremely bad situations.
So, what is the problem exactly that people are concerned about? The challenge is often talked about is called the "alignment problem", which is essentially the challenge of building systems that are aligned with human values, that do what humans want, that are aligned with what people want.
This problem occurs in the context of current-day systems as well, and so I'm going to walk through those examples. Here is a boat racing game called Coast Runners. The goal here is to train this boat to learn to win the race.
There’s a course that the designers want the boat to go along, and you can see in the top left corner here, there's a bunch of other boats racing along. The goal was to train the boat to learn to race well and win the game. However, the designers were training on a reward function of points, which you can see at the bottom here. In the end, they got a boat that did exactly what they wanted, so the boat optimized for the number of points. What the boat did was it found this little corner of the course, where could go around in circles and wait for these little turbo-charge point things to develop, and then just go around and collect those. This was the strategy that earned the most points, even if it wasn't racing, or winning the game.
That's an example of the alignment problem, where the designers wanted something from the AI so they set a reward function to try to incentivize it to achieve that reward. But in fact, the thing that they got out wasn't what they wanted. You can imagine that this challenge is even harder as you get to more and more general AI, where we have AI that's acting in the real world that is dealing with increased complexity.
Note that human values are very hard to systematize and write down. They also differ between people, they differ across cultures, they differ over time, things have changed since the 1800s to the present, and so you can imagine that it would be very hard. It may be tricky to try to get an AI that performs exactly as humans intended given that we have to speak to it in the machine or mathematical language which is important in programming. So as we approach AGI and in general powerful systems, you may expect the alignment problem to become more difficult.
But, you know, we've solved everything so far. Humans haven't blown themselves up yet. And we have this thing called trial and error where we can try a system and if it doesn't work, then we'll just fix it and send it out again. Unfortunately, something happens that's very tricky with very general systems. And that is an idea called instrumental incentives.
In the words of Nick Bostrom, who's a philosopher: "Artificial intelligent agents, may have an enormous range of possible final goals. Nevertheless, according to what we may term the "instrumental convergence thesis", there are some instrumental goals likely to be pursued by almost any intelligent agent because there are some objectives that are useful intermediaries to the achievement of almost any final goal."
You can maybe try to guess what those would be. What are some of these instrumental incentives, that would arise as an agent trying to achieve any final goal? [pause]
Some of the ones that Bostrom outlines are self-preservation, acquisition of resources, and self-improvement, which are all different subgoals, that are incentives that arise as you’re an agent trying to achieve anything.
In the words of Professor Stuart Russell, who's a professor at UC Berkeley, "you can't fetch the coffee if you're dead." If you're a coffee-fetching robot, taking coffee-fetching as just some arbitrary goal, then you definitely can't fetch the coffee if you're dead. You have an incentive to make sure that you stay alive so that you can do the goal that you've been assigned. You also might have an incentive to acquire resources or make sure that you have self-improvement, becoming smarter and having better access to power than you would initially.
That's described in the book “Human Compatible” and also in the book “The Alignment Problem” by Brian Christian. This problem of instrumental incentives means that if you have an AI that is sufficiently smart, that is able to act in the world, that is able to plan ahead - it could have an incentive to make sure that it is self preserved, that is to say, it stays alive and that it doesn't get shut down.
If that's true, then we may only have one shot at developing an AI that is fully aligned with human values, because if it's not fully aligned with human values on that first shot, then it's going to have an instrumental incentive to not be shut down, and then we're stuck with whatever we've got.
So, I'm just going to lay out that argument structure again one time. This is the logic that underlies the idea that very powerful systems, not perfectly aligned with human values, could lead to existential risk. You can evaluate the arguments for yourself, whether you think that this makes sense.
AGI is general intelligence, it can by definition think outside the box unlike narrow systems, and it also has capabilities that can affect the real world, even if maybe only initially through text.
It has capabilities such it can duplicate itself, much easier than humans can.
It could use the internet to buy and sell goods. Then it could earn money, and it could consume information and data, and produce and send information over the Internet. We already know how powerful these models are through, for example, social media.
Furthermore, it could augment itself, so it could buy more compute, it could construct datasets. It could refine its code for higher efficiency and write new code. We already have the beginnings of some of these coding capabilities with things like Codex.
AGI will be likely to be able to design and implement any number of ways to kill humans if incentivized to do so. For example, it could use synthetic biology to create pandemics or it could otherwise take advantage of humans being biological organisms while it isn’t.
It also has instrumental incentives, which may arise through being an agent aiming towards any goal, and these include things like self-preservation, acquisition of resources, and self-improvement. Maybe this doesn't happen, or maybe it does, but there's a possibility that maybe instrumental incentives would arise if it is sufficiently agent-like. That would mean that humans consuming resources, or trying to shut off or modify the AI from its original goals, would then be obstacles to the AI achieving its original programmed goal.
So if not perfectly aligned with humans, AI is again incentivized against humans, which is a problem, given the AI is as good at reasoning, or better at reasoning than humans, or at human level. This is not a story about malicious AI, which is popular in the media - things like the Terminator— this is an AI that is indifferent with alien values, where humans are in the way, sort of like ants are often in the way of humans. We step on them and it's not like we hate ants, it's just that we're trying to accomplish things and they are in the way.
Even worse, maybe we only get one shot at an aligned AI, if it is incentivized to not be shut down.
So one could say, “Well, why don't we just make sure that AIs are aligned with humans, then we'd avoid all of this?” And I think that's right. I think if you get a perfectly aligned AI with humans, then we may have a very amazing future. AI could help us with many of the sorts of things that we'd hoped it would, like solving cancer and all sorts of things.
But there's the possibility that leading AI companies may not, by default, create powerful systems that are perfectly aligned with hard-to-specify, changing human values. I don't know that the economic incentives are in place such that AI companies would by default be trying to make sure that these systems are perfectly aligned with human values.
So this is the story of how this sort of thing could lead to existential risk.
Someone may ask who's working on this. If it's a problem, then surely there are some people working on it: and there are! This is called “AI alignment research” or “long-term AI safety research”, although perhaps not as long-term as we would necessarily hope, or “AI or ML safety that scales to advanced systems”. This is in contrast to more near-term safety areas which are also, of course, important.
The field has expanded since 2015. There are now books and conferences and research publications. In industry and non-profits, there is the DeepMind safety team, OpenAI’s safety team, Anthropic, Redwood Research, the Alignment Research Center, and the Machine Intelligence Research Institute.
There are a number of people in academia as well, for example, the Center for Human-Compatible AI (CHAI) at Berkeley which is quite near us. We also have the Cooperative AI Foundation and a bunch of individual researchers at various locations. There are academics at UC Berkeley, NYU, Oxford (correction: this should be Cambridge), and at Stanford—we have researchers at the Stanford Center for AI safety as well. So people quite near home.
In fact, skipping forward a little bit to my results, I asked the researchers in my sample: “Have you heard of AI alignment?” Approximately 39% of them said they had. They wouldn't necessarily be able to define it, but they'd heard the term before. So this is an idea that is sort of around. This is in contrast to “Have you heard about AI safety?”, where something like 76% of people said they'd heard of it.
However, the AI alignment community is growing slower than [AI] capabilities. What I mean by that is that the number of people trying to work on making sure that advanced AI is safe, is smaller than the number of people who are working on the very difficult and tricky problem of trying to make AI have more capabilities, make AI be able to work on all sort of different applications, work in many different contexts. It's this discrepancy in how fast both of these communities are growing that concerns me.
So that's the context for this study, which is understanding how AI researchers perceived these risks from advanced AI. Specifically, I presented AI researchers with some of the claims used to describe this possible existential risk, and explored people's responses to this aspect of safety. That's all the context for my work, and now at this point, let's talk about what I actually did, talk about research methods.
[Talk continued in video, 23:11 onwards.]
Further resources were attached on the Stanford HAI website and recreated below.
- “The case for taking AI seriously as a threat to humanity” by Kelsey Piper (Vox)
- Human-Compatible, by Stuart Russell
- The Alignment Problem, by Brian Christian
- The Precipice: Existential Risk and the Future of Humanity, by Toby Ord
- The Most Important Century, specifically "Forecasting Transformative AI", by Holden Karnofsky
- Empirical work by DeepMind's Safety team on alignment
- Empirical work by Anthropic on alignment
- Talk (and transcript) by Paul Christiano describing the AI alignment landscape in 2020
- Podcast (and transcript) by Rohin Shah, describing the state of AI value alignment in 2021
- Alignment Newsletter and ML Safety Newsletter
- Unsolved Problems in ML Safety by Hendrycks et al. (2022)
- Alignment Research Center
- Interpretability work aimed at alignment: Elhage et al. (2021) and Olah et al. (2020)
- AI Safety Resources by Victoria Krakovna (DeepMind) and Technical Alignment Curriculum
- Open Philanthropy Graduate Student Fellowship
- Open Philanthropy Faculty Fellowship (faculty and others can reach out to OpenPhil directly as well)
- FTX Future Fund
- Long-Term Future Fund
Contact Vael Gates at email@example.com for further questions or collaboration inquiries.