This is my masterlist of resources I send to AI researchers who are mildly interested in learning more about AI safety. I pick and choose which resources to send based on the researcher's interests. The resources at the top of the email draft are the ones I usually send, and I add in later sections as seems useful. I'll also sometimes send The Alignment Problem, Human Compatible, or The Precipice.
I've also included a list of resources that I had students read through for the Stanford first-year course "Preventing Human Extinction", though I'd most recommend that sufficiently motivated students read the AGISF technical curriculum.
These reading choices are drawn from various other reading lists; this isn't original in any way, just something to draw from if you're trying to send someone some of the more accessible resources.
There's a decent chance that I'll continue updating this post as time goes on, since my current use case is copy-pasting sections of this email to interested parties. Note that "I" and "Vael" are mentioned a few times, so you'll need to edit a bit if you're copy-pasting. Happy to make any edits and take suggestions.
[Crossposted to LessWrong]
List for AI researchers
Very nice to speak to you! As promised, some resources on AI alignment. I tried to include a bunch of stuff so you could look at whatever you found interesting. Happy to chat more about anything, and thanks again.
Introduction to the ideas
- "The case for taking AI seriously as a threat to humanity" by Kelsey Piper (Vox)
- The Most Important Century and specifically "Forecasting Transformative AI" by Holden Karnofsky, blog series and podcast. Most recommended for timelines.
- A short interview with Prof. Stuart Russell (UC Berkeley) about his book, Human Compatible (the other main book in the space is The Alignment Problem by Brian Christian, which is written in a style I particularly enjoyed)
Technical work on AI alignment
- Empirical work by DeepMind's Safety team on alignment
- Empirical work by Anthropic on alignment
- Talk (and transcript) by Paul Christiano describing the AI alignment landscape in 2020
- Podcast (and transcript) by Rohin Shah, describing the state of AI value alignment in 2021
- Alignment Newsletter and ML Safety Newsletter
- Unsolved Problems in ML Safety by Hendrycks et al. (2022)
- Alignment Research Center
- Interpretability work aimed at alignment: Elhage et al. (2021) and Olah et al. (2020)
- AI Safety Resources by Victoria Krakovna (DeepMind) and Technical Alignment Curriculum
Introduction to large-scale risks to humanity, including "existential risks" that could lead to the extinction of humanity
- The first third of this summary (copied below) of the book "The Precipice: Existential Risk and the Future of Humanity" by Toby Ord
Chapter 3 is on natural risks, including risks of asteroid and comet impacts, supervolcanic eruptions, and stellar explosions. Ord argues that we can appeal to the fact that we have already survived for 2,000 centuries as evidence that the total existential risk posed by these threats from nature is relatively low (less than one in 2,000 per century).
Chapter 4 is on anthropogenic risks, including risks from nuclear war, climate change, and environmental damage. Ord estimates these risks as significantly higher, each posing about a one in 1,000 chance of existential catastrophe within the next 100 years. However, the odds are much higher that climate change will result in non-existential catastrophes, which could in turn make us more vulnerable to other existential risks.
Chapter 5 is on future risks, including engineered pandemics and artificial intelligence. Worryingly, Ord puts the risk of engineered pandemics causing an existential catastrophe within the next 100 years at roughly one in thirty. With any luck the COVID-19 pandemic will serve as a "warning shot," making us better able to deal with future pandemics, whether engineered or not. Ord's discussion of artificial intelligence is more worrying still. The risk here stems from the possibility of developing an AI system that both exceeds every aspect of human intelligence and has goals that do not coincide with our flourishing. Drawing upon views held by many AI researchers, Ord estimates that the existential risk posed by AI over the next 100 years is an alarming one in ten.
Chapter 6 turns to questions of quantifying particular existential risks (some of the probabilities cited above do not appear until this chapter) and of combining these into a single estimate of the total existential risk we face over the next 100 years. Ord's estimate of the latter is one in six.
- "How to use your career to help reduce existential risk" by 80,000 Hours or "Our current list of pressing world problems" by 80,000 Hours
How AI could be an existential risk
- AI alignment researchers disagree a surprising amount about how AI could constitute an existential risk, so I don't think the question is settled. Some plausible scenarios people are considering (copied from the paper):
- A single AI system with goals that are hostile to humanity quickly becomes sufficiently capable for complete world domination, and causes the future to contain very little of what we value, as described in “Superintelligence". (Note from Vael: Where the AI has an instrumental incentive to destroy humans and uses its planning capabilities to do so, for example via synthetic biology or nanotechnology.)
- Part 2 of “What failure looks like”
- This involves multiple AIs accidentally being trained to seek influence, and then failing catastrophically once they are sufficiently capable, causing humans to become extinct or otherwise permanently lose all influence over the future. (Note from Vael: I think we might have to pair this with something like "and in the loss of control, the environment then becomes uninhabitable to humans through pollution or consumption of resources humans need to survive")
- Part 1 of “What failure looks like”
- This involves AIs pursuing easy-to-measure goals, rather than the goals humans actually care about, causing us to permanently lose some influence over the future. (Note from Vael: I think we might have to pair this with something like "and in the loss of control, the environment then becomes uninhabitable to humans through pollution or consumption of resources humans need to survive")
- Some kind of war between humans, exacerbated by developments in AI, causes an existential catastrophe. AI is a significant risk factor in the catastrophe, such that no catastrophe would have occurred without the developments in AI. The proximate cause of the catastrophe is the deliberate actions of humans, such as the use of AI-enabled, nuclear, or other weapons. See Dafoe (2018) for more detail. (Note from Vael: Though there's a recent argument that nuclear weapons may be unlikely to cause an extinction event, and would instead "just" be catastrophically bad. One could probably still cause extinction with synthetic biology, though, which could reach even the most remote populations.)
- Intentional misuse of AI by one or more actors causes an existential catastrophe (excluding cases where the catastrophe was caused by misuse in a war that would not have occurred without developments in AI). See Karnofsky (2016) for more detail.
There's also a growing community working on AI alignment
- The strongest academic center is probably UC Berkeley's Center for Human-Compatible AI; otherwise, researchers are mostly distributed across different institutions, e.g. Sam Bowman at NYU, Dan Hendrycks and Jacob Steinhardt at UC Berkeley, Dylan Hadfield-Menell at MIT, David Krueger at Cambridge, Alex Turner at Oregon State, etc. A lot of the work is also done by industry and nonprofits: Anthropic, Redwood Research, OpenAI's safety team, DeepMind's Safety team, the Alignment Research Center, the Machine Intelligence Research Institute, and independent researchers in various places. Consider also the Cooperative AI Foundation, and Andrew Critch's article on AI safety areas.
- There is money in the space! If you want to do AI alignment research, you can apply for funding from Open Philanthropy (students, faculty; one can also just email them directly), the LTFF, or FTX with your research proposal.
- If you want to rapidly learn more about the theoretical/technical AI alignment space, walking through this curriculum is one of the best options. A lot of the interesting theoretical work is happening online, at LessWrong / the Alignment Forum (Introductory Content), since this field is still pretty pre-paradigmatic and people are still working through a lot of the ideas.
- And if you're interested in what the career pathway looks like, check out Rohin Shah (DeepMind)'s FAQ here! An additional guide is here.
Off-switch game and corrigibility
- Off-switch game and corrigibility paper, about incentives for an AI to allow itself to be shut down (see the toy sketch below). This article from DeepMind about "specification gaming" isn't about off-switches, but also makes me feel like there's currently a tradeoff in task specification, where building more generalizability into a system results in more novel solutions but less control. Their follow-up paper, where they outline a possible research direction for this problem, makes me feel (along with much of the other discussion in AI alignment) that encoding human preferences is going to be quite hard, though we don't yet know how hard the alignment problem will be.
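To give a flavor of the incentive argument in that off-switch paper, here's a minimal toy sketch (my own illustrative numbers and simplifications, not code or results from the paper) of why a robot that is uncertain about the human's preferences can prefer to leave the off-switch in the human's hands:

```python
# Toy sketch of the off-switch game (Hadfield-Menell et al., 2017), with made-up numbers.
# A robot proposing action a compares: acting directly, switching itself off, or
# deferring to a human who can switch it off. Because a rational human only shuts
# the robot down when the action would have been net-negative, deferring is never
# worse in expectation when the robot is uncertain about the action's value.
import numpy as np

rng = np.random.default_rng(0)

# Robot's belief about the human's utility U(a) for the proposed action a.
# Assumption for illustration: probably good, but possibly bad.
belief_samples = rng.normal(loc=0.5, scale=1.0, size=100_000)

# Option 1: execute the action directly -> expected utility E[U(a)].
eu_act = belief_samples.mean()

# Option 2: switch itself off -> utility 0 by convention.
eu_off = 0.0

# Option 3: defer to the human, who keeps the off-switch and only lets the
# action proceed when U(a) > 0 -> expected utility E[max(U(a), 0)].
eu_defer = np.maximum(belief_samples, 0.0).mean()

print(f"act directly                    : {eu_act:.3f}")
print(f"switch itself off               : {eu_off:.3f}")
print(f"defer to human (keep off-switch): {eu_defer:.3f}")
# Since E[max(U, 0)] >= max(E[U], 0), deferring weakly dominates under uncertainty;
# the incentive to preserve the off-switch shrinks as the robot becomes certain
# about U, which is part of why corrigibility is hard to maintain.
```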
- "Forecasting Transformative AI" by Holden Karnofsky
- Metaculus prediction and "AI safety and timelines", by user Sergio (Apr 2022 post attempting to integrate recent predictions from Metaculus, a prediction solicitation and aggregation engine)
- April-May 2022 papers: Chinchilla and PaLM (language), DALL·E 2 and Imagen (image), Socratic Models (multimodal/language), Flamingo (visual/language), SayCan (robot/language), Gato (multimodal)
There are also two related communities that care about these issues, which you might find interesting
- The Effective Altruism community, which has a strong online presence on the EA Forum. If you're interested in pursuing an AI safety career, you can apply to schedule a one-on-one coaching call here.
- The Rationalist community. The most(?) popular blog from this community is Scott Alexander's (first blog, second blog), and the Rationalists' main online forum is LessWrong. Amusingly, they also write fantastic fanfiction (e.g. Harry Potter and the Methods of Rationality), and I think some of their nonfiction is excellent as well.
Governance, aimed at highly capable systems in addition to today's systems
It seemed like a lot of your thoughts about AI risk went through governance, so I wanted to mention what that space looks like (spoiler: it's pre-paradigmatic) in case you haven't seen it yet!
- AI governance curriculum (highly recommended)
- The longtermist AI governance landscape: a basic overview and more personal posts of how to get involved
- The case for building expertise to work on US AI policy, and how to do it by 80,000 Hours
- AI Governance: Opportunity and Theory of Impact / AI Governance: A Research Agenda by Allan Dafoe and GovAI generally
- See also: Legal Priorities Project, and Gillian Hadfield (U. Toronto)
AI Safety in China
- Tianxia 天下 and Concordia Consulting 安远咨询 are the main organizations in the space. If you're interested in getting involved in those communities, let me know and I can connect you!
- China-related AI safety and governance paths
- ChinAI Newsletter
AI Safety community building, student-focused (see academic efforts above)
- EA Cambridge’s AGISF program
- Stanford Existential Risk Initiative (SERI), Swiss Existential Risk Initiative (CHERI), Cambridge Existential Risks Initiative (CERI)
- An article about Stanford EA if you haven’t seen it yet
- Global Challenges Project
- (if they're interested in my work specifically) Transcripts of interviews with AI researchers, by Vael Gates
If they're curious about other existential / global catastrophic risks:
Large-scale risks from synthetic biology:
- Calma, J. (2022). "AI suggested 40,000 new possible chemical weapons in just six hours". The Verge.
- Kupferschmidt, K. (2017). "How Canadian researchers reconstituted an extinct poxvirus for $100,000 using mail-order DNA". Science, AAAS.
- "Reducing global catastrophic biological risks" by 80,000 Hours
- Email I sent to some Stanford students, with further resources
Large-scale risks from nuclear weapons
Why I don't think climate change is the risk to worry about most on these timescales:
- "Climate change" by 80,000 Hours
Happy to chat more about anything, and good to speak to you!
List for "Preventing Human Extinction" class
When might advanced AI be developed?
- Grace, K., Salvatier, J., Dafoe, A., Zhang, B., & Evans, O. (2018). When will AI exceed human performance? Evidence from AI experts. Journal of Artificial Intelligence Research, 62, 729-754.
Why might advanced AI be a risk?
- Cotra, A. (2021, Sep 21). Why AI alignment could be hard with modern deep learning. Cold Takes. https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/
- Krakovna, V., Uesato, J., Mikulik, V., Rahtz, M., Everitt, T., Kumar, R., Kenton, Z., Leike, J., & Legg, S. (2020) Specification gaming: the flip side of AI ingenuity. DeepMind Safety Research. https://medium.com/@deepmindsafetyresearch/specification-gaming-the-flip-side-of-ai-ingenuity-c85bdb0deeb4
Thinking about making advanced AI go well (technical)
- Christiano, P. (2019). Current work in AI alignment [Lecture]. YouTube. https://www.youtube.com/watch?v=-vsYtevJ2bc
- Choose one:
Thinking about making advanced AI go well (governance)
- Dafoe, A. (2020). AI Governance: Opportunity and Theory of Impact. Center for the Governance of AI. September 15, 2020.
Optional (large-scale risks from AI)
- Karnofsky, H. (n.d.) The "most important century" blog post series (few page summary). Cold Takes. https://www.cold-takes.com/most-important-century/
- Ngo, R. (2020). AGI Safety from First Principles. Alignment Forum. https://www.alignmentforum.org/s/mzgtmmTKKn5MuCzFJ
- Zwetsloot, R., & Dafoe, A. (2019). Thinking About Risks From AI: Accidents, Misuse and Structure. Lawfare. February 11, 2019.
- Miles, R. (2021, June 24). Intro to AI Safety, Remastered [Video]. YouTube. https://www.youtube.com/watch?v=pYXy-A4siMw
- Clarke, S., Carlier, A., & Schuett, J. (2021). Survey on AI existential risk scenarios. Alignment Forum. https://www.alignmentforum.org/posts/WiXePTj7KeEycbiwK/survey-on-ai-existential-risk-scenarios
- Carlsmith, J. (2021). Is power-seeking AI an existential risk?. Alignment Forum. https://www.alignmentforum.org/posts/HduCjmXTBD4xYTegv/draft-report-on-existential-risk-from-power-seeking-ai
Natural science sources
- Calma, J. (2022). AI suggested 40,000 new possible chemical weapons in just six hours. The Verge.
- Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., & Carter, S. (2020). Zoom in: An introduction to circuits. Distill, 5(3), e00024-001. (Important work in the field of AI interpretability, a subfield of AI safety)
In the interest of increasing your options, I wanted to reach out and say that I'd be particularly happy to help you explore synthetic biology pathways further, if you were so inclined. I think it's pretty plausible we'll get another, worse pandemic in our lifetimes, and it'd be worth investing a career, or part of one, in working on it. And since so few people will make that choice, a single person probably matters a lot more here than in other, more popular careers.
No worries if you're not interested though-- this is just one option out of many. I'm emailing you in a batch instead of individually so that hopefully you feel empowered to ignore this email and be done with this class :P. Regardless, thanks for a great quarter and hope you have great summers!
If you are interested:
- I'm happy to talk on Zoom, get you connected up with resources (reading list, 80K, job board) and researchers at Stanford (e.g. Megan Palmer's lab, Daniel Greene, Prof. Luby). [also mention 80K coaching if relevant]
- A lot of the students who are interested in this at Stanford are affiliated with Stanford EA (I'd sign up for a one-on-one), and there are some very cool people working on these issues at the community-building ("Tessa Alexanian: How Biology Has Changed", "Biosecurity as an EA cause area"), grantmaking, start-up (see next point), and governance levels (see job board).
- There's a lot of room for new (startup / nonprofit) projects to be started-- consider Alvea (website), this list, and other lists contained in these posts. Plus: job board!