Even if we think the prior existence view is more plausible than the total view, we should recognize that we could be mistaken about this and therefore give some value to the life of a possible future. The number of human beings who will come into existence only if we can avoid extinction is so huge that even with that relatively low value, reducing the risk of human extinction will often be a highly cost-effective strategy for maximizing utility, as long as we have some understanding of what will reduce that risk.
— Katarzyna de Lazari-Radek & Peter Singer
Future Matters is a newsletter about longtermism. Each month we collect and summarize longtermism-relevant research, share news from the longtermism community, and feature a conversation with a prominent researcher. You can also subscribe on Substack, listen on your favorite podcast platform and follow on Twitter. Future Matters is also available in Spanish.
William MacAskill’s What We Owe the Future was published, reaching the New York Times Bestseller list in its first week and generating a deluge of media for longtermism. We strongly encourage readers to get a copy of the book, which is filled with new research, ideas, and framings, even for people already familiar with the terrain. In the next section, we provide an overview of the coverage the book has received so far.
In Samotsvety's AI risk forecasts, Eli Lifland summarizes the results of some recent predictions related to AI takeover, AI timelines, and transformative AI by a group of seasoned forecasters. In aggregate, the group places 38% on AI existential catastrophe, conditional on AGI being developed by 2070, and 25% on existential catastrophe via misaligned AI takeover by 2100. Roughly four fifths of their overall AI risk is from AI takeover. They put 32% on AGI being developed in the next 20 years.
John Halstead released a book-length report on climate change and longtermism and published a summary of it on the EA Forum. The report offers an up-to-date analysis of the existential risk posed by global warming. One of the most important takeaways is that extreme warming seems significantly less likely than previously thought: the probability of >6°C warming was thought to be 10% a few years ago, whereas it now looks <1% likely. (For much more on this topic, see our conversation with John that accompanied last month’s issue.)
In a similar vein, the Good Judgment Project asked superforecasters a series of questions on Long-term risks and climate change, the results of which are summarized by Luis Urtubey (full report here).
The importance of existential risk reduction is often motivated by two claims: that the value of humanity’s future is vast, and that the level of risk is high. David Thorstad’s Existential risk pessimism and the time of perils notes that these stand in some tension, since the higher the overall risk, the shorter humanity’s expected lifespan. This tension dissolves, however, if one holds that existential risk will decline to near-zero levels if humanity survives the next few centuries of high risk. This is precisely the view held by most prominent thinkers on existential risk, e.g. Toby Ord (see The Precipice) and Carl Shulman (see this comment).
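To make the tension and its resolution concrete, here is a minimal sketch of the arithmetic. It assumes a simple model in which extinction risk is a fixed probability per century. The function name and all numbers are illustrative assumptions, not figures from Thorstad's paper. With a constant risk r per century, the expected number of full centuries survived is the geometric sum (1 - r) / r, so high risk implies a short expected future. If the risk instead drops sharply after a few dangerous centuries, the expected future becomes long again.

```python
def expected_centuries_survived(perils_risk, perils_centuries, later_risk):
    """Expected number of full centuries humanity survives (toy model).

    Risk is `perils_risk` per century for the first `perils_centuries`
    centuries, then `later_risk` per century forever after (must be > 0).
    """
    # Expected centuries survived during the perilous period:
    # the sum of the probabilities of surviving to the end of each century.
    during = sum((1 - perils_risk) ** t for t in range(1, perils_centuries + 1))
    # Probability of getting through the perilous period at all.
    p_through = (1 - perils_risk) ** perils_centuries
    # After that the risk is constant, so the series sums to (1 - r) / r.
    after = p_through * (1 - later_risk) / later_risk
    return during + after


# Illustrative assumption: 20% risk per century, forever.
constant_high = expected_centuries_survived(0.2, 0, 0.2)

# Illustrative assumption: 20% risk for three centuries, then 0.01% per century.
time_of_perils = expected_centuries_survived(0.2, 3, 0.0001)

print(f"constant high risk: {constant_high:.1f} centuries")
print(f"time of perils:     {time_of_perils:.1f} centuries")
```

Under these assumed numbers, the constant-risk world has an expected future of only a few centuries, while the time-of-perils world has an expected future of thousands of centuries despite the same near-term risk.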
In Space and existential risk, the legal scholar Chase Hamilton argues that existential risk reduction should be a central consideration shaping space law and policy. He outlines a number of ways in which incautious space development might increase existential risk, pointing out that our current laissez-faire approach fails to protect humanity against these externalities and offering a number of constructive proposals. We are in a formative period for space governance, presenting an unusual opportunity to identify and advocate for laws and policies that safeguard humanity’s future.
Michael Cassidy and Lara Mani warn about the risk from huge volcanic eruptions. Humanity devotes significant resources to managing risk from asteroids, and yet very few to managing risk from supervolcanic eruptions, despite these being substantially more likely. The absolute numbers are nonetheless low; super-eruptions are expected roughly once every 14,000 years. Interventions proposed by the authors include better monitoring of eruptions, investments in preparedness, and research into geoengineering to mitigate the climatic impacts of large eruptions or (most speculatively) into ways of intervening on volcanoes directly to prevent eruptions.
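As a rough guide to what a recurrence interval of 14,000 years means over a human timescale, the sketch below converts it into a per-century probability. It assumes eruptions arrive as a Poisson process with a constant rate, which is a simplifying assumption of ours rather than a claim made by the authors. The variable names are illustrative.

```python
import math

# Recurrence interval quoted above, in years.
recurrence_interval_years = 14_000
window_years = 100

# Assumption: eruptions follow a Poisson process with constant rate,
# so P(at least one event in the window) = 1 - exp(-window / interval).
rate_per_year = 1 / recurrence_interval_years
p_per_century = 1 - math.exp(-rate_per_year * window_years)

print(f"Chance of at least one super-eruption per century: {p_per_century:.2%}")
```

On this assumption the chance of a super-eruption in any given century is a little under 1%.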
The risks posed by supervolcanic eruptions, asteroid impacts, and nuclear winter operate via the same mechanism: material being lofted into the stratosphere, blocking out the sun and causing abrupt and sustained global cooling, which severely limits food production. The places best protected from these impacts are thought to be remote islands, whose climate is moderated by the ocean. Matt Boyd and Nick Wilson’s Island refuges for surviving nuclear winter and other abrupt sun-reducing catastrophes analyzes how well different island nations might fare, considering factors like food and energy self-sufficiency. Australia, New Zealand, and Iceland score particularly well on most dimensions.
Benjamin Hilton's Preventing an AI-related catastrophe is 80,000 Hours' longest and most in-depth problem profile so far. It is structured around six separate reasons that jointly make artificial intelligence, in 80,000 Hours' assessment, perhaps the world's most pressing problem. The reasons are (1) that many AI experts believe that there is a non-negligible chance that advanced AI will result in an existential catastrophe; (2) that the recent extremely rapid progress in AI suggests that AI systems could soon become transformative; (3) that there are strong arguments that power-seeking AI poses an existential risk; (4) that even non-power-seeking AI poses serious risks; (5) that the risks are tractable; and (6) that the risks are extremely neglected.
In Most small probabilities aren't Pascalian, Gregory Lewis lists some examples of probabilities as small as one-in-a-million that society takes seriously, in areas such as aviation safety and asteroid defense. These and other examples suggest that Pascal's mugging, which may justify abandoning expected value theory when the probabilities are small enough, does not undermine the case for reducing the existential risks that longtermists worry about. In the comments, Richard Yetter Chappell argues that exceeding the one-in-a-million threshold is plausibly a sufficient condition for being non-Pascalian, but it may not be a necessary condition: probabilities robustly grounded in evidence—such as the probability of casting the decisive vote in an election with an arbitrarily large electorate—should always influence decisionmaking no matter how small they are.
In What's long-term about "longtermism"?, Matthew Yglesias argues that one doesn't need to make people care more about the long-term in order to persuade them to support longtermist causes. All one needs to do is persuade them that the risks are significant and that they threaten the present generation. Readers of this newsletter will recognize the similarity between Yglesias’s argument and those made previously by Neel Nanda and Scott Alexander (summarized in FM#0 and FM#1, respectively).
Eli Lifland's Prioritizing x-risks may require caring about future people notes that interventions aimed at reducing existential risks are, in fact, not clearly more cost-effective than standard global health and wellbeing interventions. On Lifland's rough cost-effectiveness estimates, AI risk interventions, for example, are expected to save approximately as many present-life-equivalents per dollar as animal welfare interventions. And as Ben Todd notes in the comments, the cost-effectiveness of the most promising longtermist interventions will likely go down substantially in the coming years and decades, as this cause area becomes increasingly crowded. Lifland also points out that many people interpret "longtermism" as a view focused on influencing events in the long-term future, whereas longtermism is actually concerned with the long-term impact of our actions. This makes "longtermism" a potentially confusing label in situations, such as the one in which we apparently find ourselves, where concern with long-term impact seems to require focusing on short-term events, like risks from advanced artificial intelligence.
Trying to ensure the development of transformative AI goes well is made difficult by how uncertain we are about how it will play out. Holden Karnofsky’s AI strategy nearcasting sets out an approach for dealing with this conundrum: trying to answer strategic questions about TAI, imagining that it is developed in a world roughly similar to today’s. In a series of posts, Karnofsky will do some nearcasting based on the scenario laid out in Ajeya Cotra’s Without specific countermeasures… (summarised in FM#4).
Karnofsky's How might we align transformative AI if it's developed very soon?, the next installment in the “AI strategy nearcasting” series, considers some alignment approaches with the potential to prevent the sort of takeover scenario described by Ajeya Cotra in a recent report. Karnofsky's post is over 13,000 words in length and contains many more ideas than we can summarize here. Readers may want to first read our conversation with Ajeya and then take a closer look at the post. Karnofsky's overall conclusion is that "the risk of misaligned AI is serious but not inevitable, and taking it more seriously is likely to reduce it."
In How effective altruism went from a niche movement to a billion-dollar force, Dylan Matthews chronicles the evolution of effective altruism over the past decade. In an informative, engaging, and at times moving article, Matthews discusses the movement’s growth in size and its shift in priorities. Matthews concludes: “My attitude toward EA is, of course, heavily personal. But even if you have no interest in the movement or its ideas, you should care about its destiny. It’s changed thousands of lives to date. Yours could be next. And if the movement is careful, it could be for the better.”
The level of media attention on What We Owe the Future has been astounding. Here is an incomplete summary:
- Parts of Will’s book were excerpted or adapted in What is longtermism and why does it matter? (BBC), How future generations will remember us (The Atlantic), We need to act now to give future generations a better world (New Scientist), The case for longtermism (The New York Times) and The beginning of history (Foreign Affairs).
- Will was profiled in Time, the Financial Times, and The New Yorker (see this Twitter thread for Will’s take on the latter).
- Will was interviewed by Ezra Klein, Tyler Cowen, Tim Ferriss, Dwarkesh Patel, Rob Wiblin, Sam Harris, Sean Carroll, Chris Williamson, Malaka Gharib, Ali Abdaal, Russ Roberts, Mark Goldberg, Max Roser, and Steven Levitt.
- What We Owe the Future was reviewed by Oliver Burkeman (The Guardian), Scott Alexander (Astral Codex Ten), Kieran Setiya (Boston Review), Caroline Sanderson (The Bookseller), Regina Rini (The Times Literary Supplement), Richard Yetter Chappell (Good Thoughts) and Eli Lifland (Foxy Scout).
- The book also inspired three impressive animations: How many people might ever exist calculated (Primer), Can we make the future a million years from now go better? (Rational Animations), and Is civilisation on the brink of collapse? (Kurzgesagt).
- And finally, Will participated in a Reddit 'ask me anything'.
The Forethought Foundation is hiring for several roles working closely with Will MacAskill.
Dan Hendrycks, Thomas Woodside and Oliver Zhang announced a new course designed to introduce students with a background in machine learning to the most relevant concepts in empirical ML-based AI safety.
The Center for AI Safety announced the CAIS Philosophy Fellowship, a program for philosophy PhD students and postdoctorates to work on conceptual problems in AI safety.
Longview Philanthropy and Giving What We Can announced the Longtermism Fund, a new fund for donors looking to support longtermist work. See also this EA Global London 2022 interview with Simran Dhaliwal, Longview Philanthropy's co-CEO.
Radio Bostrom released an audio introduction to Nick Bostrom.
Michaël Trazzi interviewed Robert Long about the recent LaMDA controversy, the sentience of large language models, the metaphysics and philosophy of consciousness, artificial sentience, and more. He also interviewed Alex Lawsen on the pitfalls of forecasting AI progress, why one can't just "update all the way bro", and how to develop inside views about AI alignment.
The materials for two new courses related to longtermism were published: Effective altruism and the future of humanity (Richard Yetter Chappell) and Existential risks introductory course (Cambridge Existential Risks Initiative).
Verfassungsblog, an academic forum of debate on events and developments in constitutional law and politics, hosted a symposium on Longtermism and the law, co-organized by the University of Hamburg and the Legal Priorities Project.
The 2022 Future of Life Award—a prize awarded every year to one or more individuals judged to have had an extraordinary but insufficiently appreciated long-lasting positive impact—was given to Jeannie Peterson, Paul Crutzen, John Birks, Richard Turco, Brian Toon, Carl Sagan, Georgiy Stenchikov and Alan Robock “for reducing the risk of nuclear war by developing and popularizing the science of nuclear winter.”
Conversation with Ajeya Cotra
Ajeya Cotra is a Senior Research Analyst at Open Philanthropy. She has done research on cause prioritization, worldview diversification, AI forecasting, and other topics. Ajeya graduated from UC Berkeley with a degree in Electrical Engineering and Computer Sciences. As a student, she worked as a teaching assistant for various computer science courses, ran the Effective Altruists of Berkeley student organization, and taught a course on effective altruism.
Future Matters: You recently published a rather worrying report, Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover. The report doesn't try to cover all the different possible paths toward transformative AI, but focuses specifically on an AI company training a scientist model using an approach you call "human feedback on diverse tasks" (HFDT). To begin, can you tell us what you mean by HFDT and what made you focus on it?
Ajeya Cotra: Basically, the idea is that you have a large neural network; you pretrain it in the way that people pretrain GPT, where it learns to predict its environment. And maybe its environment is just text, so it's learning to predict text. In my particular example (which is a bit narrower than HFDT overall, just for concretely imagining things) I'm imagining the goal is to train a system to interact with a computer in the way humans would: googling things, writing code, watching videos, sending emails, etc. So the first stage of this training would be just training the model to have a model of what will happen if it does various things. The predictive pretraining that I'm imagining is to give it images of computer screens, and then actions that are taken, which might be pressing the escape key or something, and then it gets rewarded based on predicting what happens next.
Now you do that for a long time, and the hope is that this has created a system that has a broad understanding of how computers work, what happens if it does various things, and then you can build on that by imitating humans doing particular things. For example, gathering data sets of a programmer writing docstrings, writing functions or running tests, and capturing all that with keystroke logging or screen captures, in order to feed it into the model so it learns to act more like that. And then the last stage of the training is where the human feedback comes in. Once we have a model that is dealing with the computer and doing useful things in roughly the way that the humans you trained it on do stuff, to refine its abilities and potentially take it beyond human ability, we now switch to a training regime where it tries things, and humans see how well that thing worked, and give it a reward based on that.
For example, humans could ask for some sort of app, or some sort of functionality, and the model would try to write code. Humans would ask: 'Did the code pass our tests? Did the ultimate product seem useful and free of bugs?' and, based on that, would give the system some sort of RL reward.
It's a pretty flexible paradigm. In some sense, it involves throwing the whole kitchen sink of modern techniques into one model. It's not even necessarily majority-based on human feedback, in the sense of reinforcement learning. But I still called it human feedback on diverse tasks, because that is the step where you take it beyond imitating humans—you have it try things in the world, see how they work and give it rewards—and therefore the step that introduces a lot of the danger, so that's what I framed it around.
Future Matters: So that's the paradigm in which this model is created. And then, the report makes three further assumptions about how this scenario plays out. Could you walk us through these?
Ajeya Cotra: Yes. The three assumptions are what I call the racing forward assumption, the naive safety effort assumption, and then the HFDT scales far assumption.
So taking the last one first, this assumption is basically just that the process I outlined works for producing a very smart creative system that can automate all the difficult, long-term, intellectually demanding things that human scientists do in the course of doing their job. It's not limited to something much less impactful. In this story, I'm postulating that this technique doesn't hit a wall, and basically that you can get transformative AI with it.
And then the other two assumptions, racing forward and naive safety effort, are related. As to the racing forward idea, the company I'm imagining (which I'm calling Magma) is training this system (which I call Alex) in the context of some sort of intense competitive race, either with other companies for commercially dominating a market, or with other countries, if you imagine Magma to be controlled by a government. So Magma’s default presumption is that it's good to make our systems smarter: that will make them more useful, and that will make us more likely to win whatever race we're racing. We don't have a default stance of an abundance of caution and a desire to go slow. We have a default stance that is typical of any startup developing any technology, which is just move fast, make your product and make it as good as you can make it.
And then the third assumption almost follows from the racing forward assumption: the naive safety assumption, which is that the company that's developing the system doesn't have it as a salient or super plausible outcome that the system could develop goals of its own and end up taking over the world, or harming its creators. They may have other potential safety issues in mind, like failures of robustness where the system can do wacky things and cause a lot of damage by accident, but they don't have this deliberate, deceptive failure enough at the top of their mind to make major sacrifices to specifically address that.
They're doing their safety effort in the same way that companies today do safety efforts for the systems that they release. For instance, they want to make sure this thing doesn't embarrass them by saying something toxic, or they want to make sure that this thing won't accidentally delete all your files, or things like that. And basically the way they go about achieving this safety is testing it in these scenarios and training it until it no longer displays these problematic behaviours, and that's about the main thing they do for safety.
Future Matters: You say that Alex—the model trained by Magma, the company— will have some key properties; and it is in virtue of having these properties that Alex poses the sort of threat that your report focuses on. What are these properties?
Ajeya Cotra: I included these properties as part of the assumptions, but I generally think that they’re very likely to fall out of the HFDT scales far assumption, where if you can really automate everything that humans are doing, I think that you'll have the following two properties.
The first one is having robust, broadly-applicable skills and understanding of the world. Alex's understanding of the world isn't in shallow, narrow patterns, which tend to break when it goes out of distribution: it has a commonsensically coherent understanding of the world, similar to humans, which allows us to not fall apart and say something stupid if we see a situation that we haven't exactly encountered before, or if we see something too weird. We act sensibly, maybe not like maximally intelligently, but sensibly.
And property number two is coming up with creative plans to achieve open-ended goals. And here, this leans on picturing the training like, 'Hey, accomplish this thing, synthesise this protein or build this web app, or whatever: we're going to see how well you did, and we're going to reward you based on our perception of how well you did'. So it's not particularly constraining the means in any specific way, and it's giving rewards based on end outcomes. And the tasks that it's being trained on are difficult tasks, and ultimately pretty long-term tasks.
The idea is that, because of the racing forward assumption, Magma is just trying to make Alex as useful as possible. And one of the components of being maximally useful in these intellectual roles, these knowledge work tasks, is being able to come up with plans that work, that sometimes work for unexpected reasons: just like how an employee who’s creative and figures out how to get the thing you want done is more useful than an employee who follows a certain procedure to the letter, and isn't looking out for ways to get more profit or finish something faster, or whatever.
Future Matters: Turning to the next section of the report, you claim that, in the lab setting, Alex will be rewarded for what you call "playing the training game". What do you mean by this expression, and why do you think that the training process will push Alex to behave in that way?
Ajeya Cotra: By playing the training game I mean that this whole setup is pushing Alex very hard to try to get as much reward as possible, where, based on the way its training is set up, as much reward as possible roughly means making humans believe it did as well as possible, or at least claim that it did as well as possible. This is just pointing out the gap between actually doing a good job and making your supervisors believe you did a good job. I claim that there are going to be many tradeoffs, both small and large, between these two goals, and that whenever they conflict, the training process pushes Alex to care about the latter goal of making its supervisors believe it did a good job, because that's what the reward signal in fact is.
This isn't necessarily extremely dangerous. I haven't argued for that yet. It's more an argument that you won't get a totally straightforward system that for some reason never deceives you, or for some reason is obedient in this kind of deontological way, because you're training it to find creative ways to attain reward, and sometimes creative ways to attain reward will involve deceptive behaviour. For example, making you think that its deployment of some product had no issues—when in fact it did—because it knows that if you found out about those issues, you would give it a lower reward; or playing to your personal or political biases, or emotional biases to get you to like it and rate it higher, and just a cluster of things in that vein.
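Editors' illustration: the gap between doing a good job and being rated as having done a good job can be shown with a toy example. This is a sketch of ours, not something from the report. The `Action` class, the action names, and all the numbers are hypothetical. The only point is that a learner optimizing the rater's reward selects whatever looks best to the rater, which need not be what is actually best.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    true_quality: float       # how good the outcome really is (hypothetical score)
    perceived_quality: float  # how good it looks to the human rater (hypothetical score)


def human_reward(action: Action) -> float:
    # The rater can only reward what they perceive, not the true outcome.
    return action.perceived_quality


# Hypothetical options available to the model on one task.
actions = [
    Action("fix the bug properly", true_quality=0.9, perceived_quality=0.8),
    Action("hide the bug from the rater", true_quality=0.3, perceived_quality=0.95),
    Action("report the bug honestly", true_quality=0.7, perceived_quality=0.5),
]

# What reward maximization selects versus what the humans actually wanted.
chosen = max(actions, key=human_reward)
best_for_humans = max(actions, key=lambda a: a.true_quality)

print("reward-maximizing action:", chosen.name)
print("actually best action:    ", best_for_humans.name)
```

In this toy setup the two choices come apart whenever some action scores higher on perceived quality than on true quality, which is the tradeoff Ajeya describes above.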
Future Matters: The next, and final, central claim in your analysis relates to the transition from the lab setting to the deployment setting. You argue that deploying Alex would lead to a rapid loss of human control. Can you describe the process that results in this loss of control and explain why you think it's the default outcome in the absence of specific countermeasures?
Ajeya Cotra: Yes. So far in the story, we have this system that has a good understanding of the world, is able to adapt well in novel situations, can come up with these creative long-term plans, and is trying very hard to get a lot of reward, as opposed to trying very hard to be helpful, or having a policy of being obedient, or having a policy of being honest. And so, when that system is deployed and used in all the places where it would be useful, a lot of things happen. For example, science and technology advance much more rapidly than they could if humans were the only scientists, because the many copies of Alex run a lot faster than a human brain, there are potentially a lot more of them than there are human scientists in the world, and they can improve themselves, make new versions of themselves, and reproduce much more quickly than humans can.
And so you have this world where increasingly it's the case that no human really knows why certain things are happening, and increasingly it's the case that the rewards are more and more removed from the narrow actions that these many copies of Alex are taking. Humans can still send in rewards into this crazy system, but it'll basically be based on, 'Oh, did this seem to be a good product?', 'Did we make money this quarter?' or 'Do things look good in a very broad way?', which increasingly loosens the leash that the systems are on, relative to the lab setting. In the lab, when they're taking these particular actions, humans are able to potentially scrutinise them more, and more importantly, the actions aren't affecting the outside world, and changing systems out there in the world.
That's one piece of it, and then you have to combine that with what we know about Alex, or what we have assumed about Alex in this story, namely that it's very creative, it understands the world well, it can make plans, and it's making plans to do something, which in the lab setting looked like trying very hard to get reward, and didn't look like being helpful or loyal to humans, at least not fully. And so, if you ask what is the psychology of a system that in the lab setting tries really hard to get reward, one thing you might believe is that it's a system that will try really hard to get reward in the deployment setting as well, and maybe you could call it a system that wants reward intrinsically. That doesn't seem good, and seems like it would lead to a takeover situation, roughly because if Alex can secure control of the computers that it's running on, then it will have maximal control over what rewards it gets, and it can never have as good a situation while letting humans continue to give it rewards, if only because humans will sometimes make mistakes and give it lower rewards than they should have or something.
But then you might say that you don't know if Alex really wants reward, that you don't know what it wants at all, if it wants anything. And that's true, that seems plausible to me. But whatever its psychology is or whatever it really wants, that thing led it to try really hard to get rewarded in the lab setting. And it was trained to be like that. If Alex just wanted to sit in a chair for five minutes, that wouldn't have been a very useful system. As soon as it got to sit in a chair for five minutes, it would stop doing anything helpful to humans, and then we would continue to train it until we found a system that, in fact, was doing helpful things to humans.
So the claim I make is that if Alex doesn't care about reward intrinsically, it still had some sort of psychological setup that caused it to try extremely hard to get rewarded in the lab setting. And the most plausible kind of thing that it feels like would lead to that behaviour is that Alex does have some sort of ambitious goals, or something that it wants, for which trying really hard to get reward in the lab setting was a useful intermediate step. If Alex just wanted to survive and reproduce, say, if it had some sort of genetic fitness based goal, then that would be sufficient to cause it to try really hard to get rewarded in the lab setting, because it has to get a lot of reward in order to be selected to be deployed and make a lot of copies of itself in the future.
And similarly, any goal, as long as it wasn't an extremely short-term and narrow goal, like 'I want to sit in a chair for five minutes', would motivate Alex to try and get reward in the training setting. And none of those goals are very good for humans either, because actually the whole array of those goals benefits from Alex gaining control over the computers that are running it, and gaining control over resources in the world. In this case, it's not because it wants to intrinsically wirehead or change its reward to be a really high number, but rather because it doesn't want humans continually coming in and intervening on what it's doing, and intervening on its psychology by changing its rewards. So maybe it doesn't care about reward at all, but it still wants to have the ability to pursue whatever it is it wants to pursue.
Future Matters: Supposing that we get into this scenario, can you talk about the sorts of specific countermeasures Magma could adopt to prevent takeover?
Ajeya Cotra: Yes. I think a big thing is simply having it in your hypothesis space and looking out for early signs of this. So a dynamic that I think can be very bad is, you observe early systems that are not super powerful do things that look like deception, and the way you respond to that is by giving them negative reward for doing those things, and then they stop doing those things. I think the way I'd rather people respond to that is, 'Well, this is a symptom of a larger problem in which the way we train the system causes it to tend toward psychologies or goals or motivation systems that motivate deception'. If we were to give a negative reward to the instances of deception we found, we should just expect to find fewer and fewer instances, not necessarily because we solved the root problem, but because we are teaching the system to be more careful. Instead, we should stop and examine the model with more subtle tools (for example, mechanistic interpretability or specific test environments) and we should have the discipline not to simply train away measurable indicators of a problem, and not to feel good if, upon seeing something bad and training it away, it goes away.
Interpretability seems like a big thing, trying to create feedback mechanisms that are more epistemically competitive with the model. In this case it's not a human who tries to discern whether the model's actions had a good effect: it's maybe some kind of amplified system, maybe it has help from models very similar to this model, etc. Holden has a whole post on how we might align transformative AI if it's developed soon, that goes over a bunch of these possibilities.
Future Matters: Thanks, Ajeya!
We thank Leonardo Picón for editorial assistance.
Disclosure: one of us is a member of Samotsvety.
Rob Wiblin made a similar point in If elections aren’t a Pascal’s mugging, existential risk shouldn’t be either, Overcoming Bias, 27 September 2012, and in Saying ‘AI safety research is a Pascal’s Mugging’ isn’t a strong response, Effective Altruism Forum, 15 December 2015.
We made a similar point in our summary of Alexander's article, referenced in the previous paragraph: "the 'existential risk' branding […] draws attention to the threats to [...] value, which are disproportionately (but not exclusively) located in the short-term, while the 'longtermism' branding emphasizes instead the determinants of value, which are in the far future."
See James Aitchison’s post for a comprehensive and regularly updated list of all podcast interviews, book reviews, and other media coverage.
A full list of such courses may be found here.