
Rethink Priorities is working on a project called ‘Defense in Depth Against Catastrophic AI Failures’. “Defense in depth” refers to the use of multiple redundant layers of safety and/or security measures such that each layer reduces the chance of catastrophe. Our project is intended to (1) make the case for taking a defense in depth approach to ensuring safety when deploying near-term, high-stakes AI systems and (2) identify many defense layers/measures that may be useful for this purpose.

If you can think of any possible layers, please mention them below. We’re hoping to collect a very long list of such layers, either for inclusion in our main output or for potentially investigating further in future, so please err on the side of commenting even if the ideas are quite speculative, may not actually be useful, or may be things we’ve already thought of. Any relevant writing you can refer us to would also be useful.

If we end up including layers you suggest in our outputs, we’d be happy to either leave you anonymous or credit you, depending on your preference.

Some further info about the project: By “catastrophic AI failure”, we mean harmful accidents or harmful unintended use of computer systems that perform tasks typically associated with intelligent behavior (and especially of machine learning systems) that lead to at least 100 fatalities or $1 billion in economic loss. This could include failures in contexts like power grid management, autonomous weapons, or cyber offense (if you’re interested in more concrete examples, see here).

Defense layers can relate to any phase of a technology’s development and deployment, from early development to monitoring deployment to learning from failures, and can be about personnel, procedures, institutional set up, technical standards, etc.

Some examples of defense layers for AI include (find more here):

  • Procedures for vetting and deciding on institutional partners, investors, etc.
  • Methods for scaling human supervision and feedback during and after training high-stakes ML systems
  • Tools for blocking unauthorized use of developed/trained IP, akin to the PALs on nuclear weapons
  • Technical methods and process methods (e.g. certification; Cihon et al. 2021, benchmarks?) for gaining high confidence in certain properties of ML systems, and properties of the inputs to ML systems (e.g. datasets), at all stages of development (a la Ashmore et al. 2019)
  • Background checks & similar for people being hired or promoted to certain types of roles
  • Methods for avoiding or detecting supply chain attacks
  • Procedures for deciding when and how to engage one's host government to help with security/etc.






Four layers come to mind for me:

  • Have strong theoretical reasons to think your method of creating the system cannot result in something motivated to take dangerous actions
  • Inspect the system thoroughly after creation, before deployment, to make sure it looks as expected and appears incapable of making dangerous decisions
  • Deploy the system in an environment where it is physically incapable of doing anything dangerous
  • Monitor the internals of the system closely during deployment to ensure operation is as expected, and that no dangerous actions are attempted

In response to an earlier version of this question (since taken down), weeatquince responded with the following helpful comment:

Regulatory type interventions (pre-deployment):

  • Regulatory restriction (rules on what can be done)
  • Regulatory oversight (regulators)
  • Industry self-regulation
  • Industry (& regulator) peer review systems
  • Fiduciary duties
  • Senior management regimes
  • Information sharing regimes
  • Whistleblowing regimes
  • Staff security clearances
  • Cybersecurity of AI companies
  • Standardisation (to support ease of oversight etc)
  • Clarity about liability & legal responsibility
  • Internal government oversight (all of the above applied internally by government to itself, e.g. internal military safety best practice)

Technical type interventions (pre-deployment):

  • AI safety research

Defence in depth type interventions (post-deployment):

  • Windfall clauses etc
  • Shut-off switches for AI systems
  • AIs policing other AIs' behaviours
  • Internet / technology shut-off systems

The boring answers

Don't give your AI system excess compute, ideally enforced at the hardware level. Run it on a small isolated machine, not a 0.1% timeshare on a supercomputer.

Use the coding practices developed by NASA to minimize standard bugs.

Record all random seeds and input data to make everything reproducible. 
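As a rough illustration of what that record might contain, here is a minimal Python sketch. The function name, the record fields, and the stand-in "model" (which just samples one byte of the input) are all made up for the example. It stores a SHA-256 fingerprint of the input to keep the record small, on the assumption that the raw input itself is archived alongside it.

```python
import hashlib
import json
import random


def run_reproducibly(seed: int, input_data: bytes) -> dict:
    """Run a stand-in 'model' and return a record sufficient to replay the run."""
    # Private generator seeded explicitly, so no hidden global random state.
    rng = random.Random(seed)
    # Hypothetical placeholder for the real system: sample one byte of the input.
    output = input_data[rng.randrange(len(input_data))]
    return {
        "seed": seed,
        "input_sha256": hashlib.sha256(input_data).hexdigest(),
        "output": output,
    }


record = run_reproducibly(seed=1234, input_data=b"sensor readings")
print(json.dumps(record))
```

Replaying with the same seed and the same input bytes should give an identical record, which is the property you would check when investigating a failure.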

Put hard-coded sanity checks between the AI and its output. For example, a robot arm is prevented from moving beyond safe limits by a simple min(AI_OUTPUT, MAXIMUM_ARM_ANGLE) type code.
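A slightly fuller sketch of such a check, in Python, might look like the following. The limit values and the MINIMUM_ARM_ANGLE name are assumptions for illustration; real limits depend on the hardware.

```python
import math

# Hypothetical safe limits in degrees; real values depend on the hardware.
MINIMUM_ARM_ANGLE = -90.0
MAXIMUM_ARM_ANGLE = 90.0


def clamp_arm_angle(ai_output: float) -> float:
    """Force the commanded angle into [MINIMUM_ARM_ANGLE, MAXIMUM_ARM_ANGLE]."""
    if math.isnan(ai_output):
        # NaN compares false with everything, so min/max would not reliably catch it.
        raise ValueError("AI output is not a number")
    # min() enforces the upper limit, max() enforces the lower limit.
    return max(MINIMUM_ARM_ANGLE, min(ai_output, MAXIMUM_ARM_ANGLE))
```

The point of keeping this outside the AI is that it stays trivially auditable: a dozen lines with no learned components, sitting between the model and the motor.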

Humans checking in the loop.

Hardware minimization of unneeded action space. Suppose you are programming a social media bot, and the AI is quite smart. If a single computer both runs the AI and has a network connection, and the AI pulls off a rowhammer attack, it can send arbitrary internet packets. Instead, computer A runs the AI and sends raw ASCII down a wire to computer B. Computer B runs every simple check that you would expect to pass in normal operation, and performs a hard shutdown if it spots anything odd (e.g. funny characters, or a post that is too long or too short). Computer A, of course, has no other connections, has a power supply resistant to modulated energy demand attacks, has a radio and audio noise generator next to it, and is in a Faraday cage.
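The checks on computer B could be as simple as the Python sketch below. The length limits, the allowed character set, and the HardShutdown exception are assumptions for the example; in a real setup the "shutdown" would presumably cut power rather than raise an exception.

```python
import string

# Hypothetical limits for a social media post; tune to the real platform.
MIN_POST_LENGTH = 1
MAX_POST_LENGTH = 280
ALLOWED_CHARS = frozenset(
    string.ascii_letters + string.digits + string.punctuation + " \n"
)


class HardShutdown(Exception):
    """Raised when computer B should shut everything down instead of posting."""


def check_post(raw: bytes) -> str:
    """Return the post text if it passes every check, else raise HardShutdown."""
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError as err:
        raise HardShutdown("non-ASCII bytes on the wire") from err
    if not MIN_POST_LENGTH <= len(text) <= MAX_POST_LENGTH:
        raise HardShutdown("post length outside expected range")
    if not set(text) <= ALLOWED_CHARS:
        raise HardShutdown("unexpected control characters")
    return text
```

An allowlist of characters is used rather than a blocklist, so anything not explicitly expected in normal operation counts as odd.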

2 or 3 redundant computers (cosmic ray errors)
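One way to combine the redundant results is a majority vote, sketched below in Python. The function name is hypothetical and the results are assumed to be hashable values. Note that two replicas can only detect a disagreement, while three can also outvote a single corrupted result.

```python
from collections import Counter


def majority_vote(results: list) -> object:
    """Return the value reported by a strict majority of replicas, else raise."""
    if not results:
        raise RuntimeError("no replica results to compare")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        # No strict majority: safer to halt than to guess which replica is right.
        raise RuntimeError("replicas disagree with no majority; halt")
    return value
```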

Thermite on a timer. Slag the whole computer system. Stops the AI thinking too long. Makes it harder to hide info for future versions of itself.

The speculative answers

Homomorphic encryption.

Run it on a quantum computer using amplitude amplification techniques. Any breaking out of the box technique must be reversible to work, or use new physics.

Honeypots that look like good hacking targets. 

A fairly easy way for any moderately smart AI to wirehead, setting its own utility to maximum (and triggering some alarm or delayed shutdown).

The silly answers

Chuck the AI system into a black hole, set to only start computing after it crosses the event horizon.
