Related: AI policy ideas: Reading list.
This document collects ideas for AI labs, mostly from an x-risk perspective. Its underlying organization black-boxes technical AI work, including technical AI safety.
Lists & discussion
- Towards best practices in AGI safety and governance: A survey of expert opinion (GovAI, Schuett et al. 2023) (LW)
- This excellent paper is the best collection of ideas for labs. See pp. 18–22 for 100 ideas.
- Frontier AI Regulation: Managing Emerging Risks to Public Safety (Anderljung et al. 2023)
- Mostly about government regulation, but recommendations on safety standards translate to recommendations on actions for labs
- Model evaluation for extreme risks (DeepMind, Shevlane et al. 2023)
- What AI companies can do today to help with the most important century (Karnofsky 2023) (LW)
- Karnofsky nearcasting: How might we align transformative AI if it’s developed very soon?, Nearcast-based "deployment problem" analysis, and Racing through a minefield: the AI deployment problem (LW) (Karnofsky 2022)
- Survey on intermediate goals in AI governance (Räuker and Aird 2023)
- Corporate Governance of Artificial Intelligence in the Public Interest (Cihon, Schuett, and Baum 2021) and The case for long-term corporate governance of AI (Baum and Schuett 2021)
- Three lines of defense against risks from AI (Schuett 2022)
- The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation (Brundage et al. 2018)
- Adapting cybersecurity frameworks to manage frontier AI risks: defense in depth (IAPS, Ee et al. 2023)
Levers
- AI developer levers and AI industry & academia levers in Advanced AI governance (LPP, Maas 2023)
- This report is excellent
- "Affordances" in "Framing AI strategy" (Stein-Perlman 2023)
- This list may be more desiderata-y than lever-y
Desiderata
Maybe I should make a separate post on desiderata for labs (for existential safety).
- Six Dimensions of Operational Adequacy in AGI Projects (Yudkowsky 2022)
- "Carefully Bootstrapped Alignment" is organizationally hard (Arnold 2023)
- Slowing AI: Foundations (Stein-Perlman 2023)
- [Lots of desiderata are implicit elsewhere in this document, like "help others act well" and "minimize diffusion of your capabilities research"]
Ideas
Coordination[1]
See generally The Role of Cooperation in Responsible AI Development (Askell et al. 2019).
- Coordinate to not train or deploy dangerous AI
- Model evaluations
- Model evaluation for extreme risks (DeepMind, Shevlane et al. 2023) (LW)
- ARC Evals
- Safety evaluations and standards for AI (Barnes 2023)
- Update on ARC's recent eval efforts (ARC 2023) (LW)
- Safety standards
- Model evaluations
Transparency
Transparency enables coordination (and some regulation).
- Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims (Brundage et al. 2020)
- Followed up by Filling gaps in trustworthy development of AI (Avin et al. 2021)
- Structured transparency
- Exploring the Relevance of Data Privacy-Enhancing Technologies for AI Governance Use Cases (Bluemke et al. 2023)
- Beyond Privacy Trade-offs with Structured Transparency (Trask and Bluemke et al. 2020)
- Honest organizations (Christiano 2018)
- Auditing & certification
- Theories of Change for AI Auditing (Apollo 2023) and other Apollo stuff
- What does it take to catch a Chinchilla? Verifying Rules on Large-Scale Neural Network Training via Compute Monitoring (Shavit 2023)
- Auditing large language models: a three-layered approach (Mökander et al. 2023)
- The first two authors have other relevant-sounding work on arXiv
- AGI labs need an internal audit function (Schuett 2023)
- AI Certification: Advancing Ethical Practice by Reducing Information Asymmetries (Cihon et al. 2021)
- Private literature review (2021)
- Model evaluations
- Model evaluation for extreme risks (DeepMind, Shevlane et al. 2023) (LW)
- Safety evaluations and standards for AI (Barnes 2023)
- Update on ARC's recent eval efforts (ARC 2023) (LW)
Publication practices
Labs should minimize/delay the diffusion of their capabilities research.
- Publication decisions for large language models, and their impacts (Cottier 2022)
- Shift AI publication norms toward "don't always publish everything right away" in Survey on intermediate goals in AI governance (Räuker and Aird 2023)
- "Publication norms for AI research" (Aird unpublished)
- Publication policies and model-sharing decisions (Wasil et al. 2023)
Structured access to AI models
- Sharing Powerful AI Models (Shevlane 2022)
- Structured access for third-party research on frontier AI models (GovAI, Bucknall and Trager 2023)
- Compute Funds and Pre-trained Models (Anderljung et al. 2022)
Governance structure
- How to Design an AI Ethics Board (Schuett et al. 2023)
- Ideal governance (for companies, countries and more) (Karnofsky 2022) (LW) has relevant discussion but not really recommendations
Miscellanea
- Do more/better safety research; share safety research and safety-relevant knowledge
- Do safety research as a common good
- Do and share alignment and interpretability research
- Help people who are trying to be safe be safe
- Make AI risk and safety more concrete and legible
- See Larsen et al.'s Instead of technical research, more people should focus on buying time and Ways to buy time (2022)
- Pay the alignment tax (if you develop a critical model)
- Improve your security (operational security, information security, and cybersecurity)
- There's a private reading list on infosec/cybersec, but it doesn't have much about what labs (or others) should actually do.
- Plan and prepare: ideally figure out what's good, publicly commit to doing what's good (e.g., perhaps monitoring for deceptive alignment or supporting external model evals), do it, and demonstrate that you're doing it
- For predicting and avoiding misuse
- For alignment
- For deployment (especially of critical models)
- For coordinating with other labs
- Sharing
- Stopping
- Merging
- More
- For engaging government
- For increasing time 'near the end' and using it well
- For ending risk from misaligned AI
- For how to get from powerful AI to a great long-term future
- Much more...
- Recommendation: Bug Bounties and Responsible Disclosure for Advanced ML Systems (Gray 2023)
- See also this comment
- See OpenAI's bug bounty program
- Report incidents
- The Windfall Clause: Distributing the Benefits of AI for the Common Good (O'Keefe et al. 2020)
- Also sounds relevant: Safe Transformative AI via a Windfall Clause (Bova et al. 2021)
- Watermarking[2]
- Make, share, and improve a safety plan
- OpenAI (LW) (but their more recent writing on "AI safety" is more prosaic)
- DeepMind (unofficial and incomplete)
- See also Shah on DeepMind alignment work
- Anthropic (LW)
- Make, share, and improve a plan for the long-term future
- Improve other labs' actions
- Inform, advise, advocate, facilitate, support, coordinate
- Differentially accelerate safer labs
- Improve non-lab actors' actions
- Government
- Support good policy
- See AI policy ideas: Reading list (Stein-Perlman 2023)
- Standards-setters
- Kinda the public
- Kinda the ML community
- Support miscellaneous other strategic desiderata
- E.g. prevent new leading labs from appearing
See also
- Best Practices for Deploying Language Models (Cohere, OpenAI, and AI21 Labs 2022)
- See also Lessons learned on language model safety and misuse (OpenAI 2022)
- Slowing AI (Stein-Perlman 2023)
- Survey on intermediate goals in AI governance (Räuker and Aird 2023)
Some sources are roughly sorted within sections by a combination of x-risk relevance, quality, and influence, but sometimes I didn't bother to try to sort them, and I haven't read all of them.
Please have a low bar to suggest additions, substitutions, rearrangements, etc.
Current as of: 9 July 2023.
- ^
At various levels of abstraction, coordination can look like:
- Avoiding a race to the bottom
- Internalizing some externalities
- Sharing some benefits and risks
- Differentially advancing more prosocial actors?
- More?
- ^
Policymaking in the Pause (FLI 2023) cites A Systematic Review on Model Watermarking for Neural Networks (Boenisch 2021); I don't know if that source is good. (Note: this disclaimer does not imply that I know that the other sources in this doc are good!)
I am not excited about watermarking. (Note: this disclaimer does not imply that I am excited about the other ideas in this doc! But I am excited about most of them.)
Zach - it can be helpful to develop reading lists. But in my experience, busy people are much more likely to dive into a list of 3-4 things that are each no more than 2,000 words, rather than a comprehensive list of all the great things they could possibly read if they have a few extra weeks of life.
So, the ideally targeted 'AI risk/AI alignment' reading list, IMHO, would involve no more than 8,000 words total (that could be read in about 40 minutes).
That would be good too! And it would fill a different niche. This list is mostly meant for AI strategy researchers rather than busy laymen, and it's certainly not meant to be read cover to cover.
(Note also that this list isn't really about AI risk and certainly isn't about AI alignment.)
(Note also that I'm not trying to make people "more likely" to read it; it's optimal for some people to engage with it and not optimal for others.)