Brief thoughts on Data, Reporting, and Response for AI Risk Mitigation

Davidmanheim

An idea I’ve been thinking a lot about recently is risk mitigation for AI companies, how this is handled in other industries, and what can be learned. Problem identification is not trivial, and more investment in data and processes for identifying where failures occur and how to address them is critical. Approaches used in areas such as nuclear power and airline safety, where failures are understood to be dangerous, are useful as a starting point. Given that AI systems can fail in ways that are even more dangerous, I will outline some of the aspects that matter, and try to provide a brief overview of some initial thoughts about how failure identification can inform failure mitigation and response for AI systems.

Risk mitigation takes several forms. Across domains, the best mitigation strategy is prevention, followed by reduced impact and/or rapid recovery. In general, these can all be achieved better when systems are built or adapted to deal with the types of failure that can occur. This explains why analysis, identification, and categorization of failures are important - but analysis is just a prerequisite for what we care about.

Risk Analysis: A Prerequisite for Risk Mitigation

High-reliability industries like airlines are required to track their performance via flight data monitoring, to report incidents and near-misses, and to submit to both internal and external safety audits routinely. This triad of data sources gives different information about safety and is in addition to protection for other types of reporting and whistleblowing. Below, I consider a few analogs of each for AI, and note a few which seem most practical or useful.

Reporting, both for status and for incidents and near misses, is standard in many fields. Routine reporting of status and problems short of accidents could be relatively straightforward. For example, one form of routine reporting would include providers reporting on the number of various types of problematic questions or prompts submitted, and the number of prompts automatically rejected for safety reasons (“as an AI model, I cannot…”), along with the category of question (prompt jailbreaking, illegal or violent activities, overt racism/sexism, biased content, sexual content, etc.). Another would be reporting training of new models over a certain size or compute usage, along with information on the structure of the model, the dataset used and the cleaning or filtering applied, and so on.
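As a rough illustration of the first kind of routine report, here is a minimal Python sketch that tallies flagged and rejected prompts per category. The record format, the field names, and the "benign" label are assumptions made for the example; they are not any provider's actual logging schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptRecord:
    # Hypothetical log entry: a category label assigned by some classifier,
    # and whether the prompt was automatically rejected for safety reasons.
    category: str
    rejected: bool


def routine_report(records):
    """Count flagged and rejected prompts per problematic category."""
    flagged = Counter()
    rejected = Counter()
    for record in records:
        if record.category == "benign":
            continue  # only problematic categories are reported
        flagged[record.category] += 1
        if record.rejected:
            rejected[record.category] += 1
    return {
        category: {"flagged": flagged[category], "rejected": rejected[category]}
        for category in sorted(flagged)
    }


log = [
    PromptRecord("prompt jailbreaking", True),
    PromptRecord("prompt jailbreaking", False),
    PromptRecord("sexual content", True),
    PromptRecord("benign", False),
]
print(routine_report(log))
```

Reporting both counts matters because the gap between flagged and rejected prompts is itself a signal of how much problematic use is getting through.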

This type of status and routine reporting addresses only a narrow set of issues relevant to safety, and does not address accidents and risk directly. However, reporting incidents is difficult at present because what qualifies as an incident is unclear. There are currently two projects that attempt to catalog incidents: https://incidentdatabase.ai/, which is a general open database, and AVID-ML, which parallels what is done in cybersecurity. Unfortunately, despite a typology and ongoing work, at present it is still unclear what does or does not qualify as an issue that needs to be fixed, much less what would be a near miss. More work is needed to adapt reporting processes to AI failure modes.

Despite the challenge, this is critical for two reasons. First, it builds a culture where accidents are not hidden, and the risks that exist are addressed rather than disguised or ignored. Second, shared knowledge of accidents allows everyone in an industry to learn from others, rather than needing to blindly fall into issues others have already found and addressed. Whistleblowing protections are an obvious extension of the need to report issues, but are also critical for reinforcing all of the other safety measures discussed.

Lastly, audits for safety are a critical tool for creating and maintaining safety in critical systems. There is already at least an initial acceptance of such safety audits for individual AI systems, as ARC performed for GPT-4. This type of safety audit could be extended, and could be made required and standard for all such systems. A standard requirement does not, of course, imply that only a standardized rubric would be used - audits must be flexible and adapt to newly found issues and risks, including audits to address global catastrophic risks. However, the audits that were performed so far were seemingly limited to the safety of the systems themselves, rather than looking at the process more broadly. And as pointed out by Mökander et al., there are two other layers of auditing on opposite sides of the models themselves which are potentially useful: governance-level audits and application-level audits.

For more comprehensive application-level audits, auditors might be given access to any inputs and logs used in RLHF or fine-tuning. For governance audits, they need access to details about the process used for testing and the standards used for the decision to release. And for all three audit types, they should review the process for ongoing monitoring for safety. AI systems already presumably maintain logs of use, or can do so, and these may be used for internal quality control, but they should also be subject to monitoring by external agencies such as auditors. For application audits, there are additional challenges, and while log access is potentially difficult, even in those cases user privacy could be preserved to some extent, in that the logs would not be public, but providers would at least preserve the ability for external safety auditors, both government and private, to audit the data^[1].

Analysis

Once a problem is identified, companies need to fix it, but the default response is a narrow patch that is just sufficient to fix the specific problem. A comprehensive process obviously only starts with collecting data - and with reviewing the accident types others have reported. Additional processes that are needed are root cause analysis, comprehensive risk reduction, and drilling for response.

Root cause analysis is a family of techniques for moving beyond noticing failures, the focus of most of the above, and towards fixing them and preventing their recurrence. It starts, however, with a clear description of the failure, which enables the above processes as well. From that point, per Wikipedia, “RCA may include the techniques Five whys, Failure mode and effects analysis (FMEA), Fault tree analysis, Ishikawa diagram, and Pareto analysis.” Each of these is a method for identifying causes. Adapting them to AI failures is important and, as far as I have seen, still in its infancy.
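To make one of these techniques concrete, here is a minimal Python sketch of a Pareto analysis over incident reports: it ranks root causes by frequency and returns the smallest leading set that accounts for a chosen share of incidents. The cause labels and the 80% threshold are illustrative assumptions, not data from any real incident database.

```python
from collections import Counter


def pareto_causes(causes, threshold=0.8):
    """Return the most frequent causes that together cover `threshold` of incidents."""
    counts = Counter(causes)
    total = sum(counts.values())
    selected = []
    cumulative = 0
    for cause, count in counts.most_common():
        selected.append(cause)
        cumulative += count
        if cumulative / total >= threshold:
            break
    return selected


# Hypothetical root causes assigned to ten incident reports.
incident_causes = (
    ["filter bypass"] * 6
    + ["training data gap"] * 2
    + ["ambiguous policy"]
    + ["logging outage"]
)
print(pareto_causes(incident_causes))
```

The output tells an analyst where to look first. It does not replace the other techniques, since a rare cause can still be the most dangerous one.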

All of these techniques will focus on accidents that have been identified. That allows for incident reviews and learning from the past. In most domains, this is helpful - the types of accidents which have occurred are the same type that will likely recur. Unfortunately, in domains where change is very rapid, risks are not static, and looking at the past will find only a subset of risks.

Despite the objection, there are two reasons to think that these methods are still relevant for reducing future risks. First, we have theoretical models that predict failure modes that do not yet occur, and we can use those failure modes in risk mitigation. Second, some failure modes that occur in the future can be mitigated in ways that also mitigate current failure modes, so making models significantly more robust to detected failure modes can reduce the risk from newer failures.

Risk Reduction

In critical systems, it is not sufficient to consider only likely or frequently occurring failures. Instead, comprehensive risk reduction must look at and address failures of a wide variety of different types. This should include those failures which occurred, “near-miss” events, failures that are known to exist in other systems due to public reporting, and failures that are possible based on theoretical models.

In sociotechnical systems, it is also not enough to mitigate only technical failures. For example, misuse of a system is a failure, and if it is possible for a nuclear power plant operator to override alarms and controls to cause an accident, risk mitigations would include screening the operators to ensure they are not going to intentionally do so, and that they are well trained and alert enough not to accidentally allow failure. This points to a key problem for AI systems that are deployed to the public - we cannot assume that all users are benign, and certainly should not presume they are responsible enough to avoid accidents.

This requires an extension of an approach that is already needed in any high-reliability system, which is to comprehensively prevent failures at multiple points. When a failure is diagnosed, the analysis identifies not just a single point at which a vulnerability can be patched, but rather all the places where the failure occurred. Each point identified is also a place where the failure can be stopped. As a concrete parallel, in programming for a high-reliability system, when a bug is found or a failure occurs, the response would involve a variety of technical mitigations: patching the bug where it occurs, implementing error catching, doing validation to ensure the situation where the error occurs is prevented, rewriting code so that the given class of errors is prevented, and ensuring that the system recovers or fails gracefully even when now-impossible events occur. While any one of these would prevent the specific issue, performing all of them makes it likely that similar but unknown failures would also be caught.
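The programming parallel can be sketched in a few lines of Python. The scenario (averaging sensor readings, where the original bug was a division by zero on empty input) and all names are hypothetical, chosen only to show the layers side by side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NonEmptyReadings:
    """Rewrite: a type that cannot hold the empty input that caused the bug."""
    values: tuple

    def __post_init__(self):
        # Validation: reject the bad situation at the boundary.
        if len(self.values) == 0:
            raise ValueError("at least one reading is required")


def mean(readings: NonEmptyReadings) -> float:
    count = len(readings.values)
    # Patch at the point of failure, kept even though the type above
    # should already make an empty input impossible.
    if count == 0:
        raise ValueError("cannot average zero readings")
    return sum(readings.values) / count


def safe_mean(raw, fallback=0.0):
    """Error catching and graceful failure around the whole computation."""
    try:
        return mean(NonEmptyReadings(tuple(raw)))
    except (TypeError, ValueError, ZeroDivisionError):
        return fallback


print(safe_mean([1.0, 2.0, 3.0]))
print(safe_mean([]))
```

The layers are deliberately redundant. The outer handler also catches a `None` or non-numeric input, a failure nobody had reported, which is the point of doing all of the mitigations rather than one.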

On the socio-technical side of the failure, users can be trained to avoid scenarios that will cause failures, and programmers and UX designers can assist in making sure the UI makes correct usage easy. Further, the system can ensure that malicious or careless use is not only rejected, but also detected and flagged for management or external review. Using safe design patterns and enhancing verification and validation procedures are also helpful. Again, casting a wide net and building defense-in-depth are both helpful.

In the analogy, each of the proposed steps can reduce the risk of failure, so if a system needs to be safe, all would be pursued. The marginal benefit of each set of mitigations likely decreases, but because failures and their causes are varied, each can provide additional defense. For parallel reasons, a set of mitigation measures for AI systems would potentially be similarly valuable. Because of this, it seems useful to propose at least an initial list of such measures, with the understanding that more extensive research into the topic would be valuable.

Technical Measures to Mitigate or Prevent Failures

To start, we consider artificial intelligence systems as solely technical systems, and consider technical mitigations to both technical and social failures. A technical failure would be a language model failing to generate plausible output, for example, hallucinations, which are very low probability outputs in the true underlying distribution. A social failure would be generating misinformation or biased content, which might be a correct reflection of the distribution of text on which the model is trained, but is unacceptable as an output. Both types of failure can be mitigated with technical measures.

An initial list of mitigations for specific known failures for LLMs includes responses like removing training data that represents or leads to the failure mode, editing the LLM to remove problematic facts, concepts, or modes of reasoning, doing RLHF either to train better responses or to penalize the class of failure, and implementing APIs that perform detection and reject queries that might be dangerous. For other types of models, such as generative image synthesis models, the failure modes are less severe, but some of these techniques are also applicable - making it likely that similar families of techniques could be useful for future classes of models as well.

Of course, failures will still occur, and the above-mentioned types of reporting are important for when something occurs despite these measures, to learn from the failure - but so is proactive and immediate response.

Sociotechnical Responses

The set of measures that can reduce risk is not only technical. These span a range from measures that are primarily policy focused, to ones that address aspects of the systems themselves. While both more drastic measures^[2] and market-based or legal measures are reasonable, we will focus more narrowly on sociotechnical measures relevant to the models themselves.

For example, detection of high-risk users and enhanced monitoring of their usage is a measure that can help prevent some types of misuse. Going further in that direction, requiring screening and access control could prevent a wide variety of types of misuse. Instead of selling API access to their models widely, firms could be required to perform some level of “Know Your Customer” activity, and ban users that are likely to misuse the systems. This could be done directly, by mandating such measures, or indirectly and perhaps more effectively, such as by imposing legal liability for users’ actions.

Monitoring, discussed above, also allows for additional response activities. For an AI system, providers would document and inform users about what safety failures they might experience, and provide guidance on how to monitor and respond to them. On the provider side, processes for shutting down systems automatically in response to certain detected events seem like a reasonable initial step. In extreme cases, instead of planning remediation, such as post-release patches to perform further RLHF, the system would be stopped from production uses, at least until a failure analysis is performed, and new robustness checks are performed.
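A provider-side automatic shutdown could look something like the following Python sketch, which halts serving once too many safety events are detected within a recent window of requests and stays halted until a review resets it. The class name, the thresholds, and the notion of a boolean "safety event" are all assumptions for illustration.

```python
from collections import deque


class SafetyBreaker:
    """Halt production use after too many detected safety events."""

    def __init__(self, max_events=3, window=100):
        self.max_events = max_events
        # One boolean per recent request: True if a safety event was detected.
        self.recent = deque(maxlen=window)
        self.halted = False

    def record(self, safety_event):
        self.recent.append(bool(safety_event))
        if sum(self.recent) >= self.max_events:
            # Latches: later clean requests do not reopen the system.
            self.halted = True

    def allow_request(self):
        return not self.halted

    def reset_after_review(self):
        # Intended to be called only once a failure analysis is complete.
        self.recent.clear()
        self.halted = False


breaker = SafetyBreaker(max_events=2, window=5)
for event in (False, True, True, False):
    breaker.record(event)
print(breaker.allow_request())
```

The latch is the important design choice here: the system does not resume on its own, matching the idea that production use stops until a failure analysis and new robustness checks are done.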

High reliability organizations routinely drill everyone - both users and workers - on procedures for accidents and failures of various types, including such seemingly over-the-top procedures as airlines’ mandatory safety videos for passengers, which they are forced to watch not once, but every time they fly. Similar levels of warning for users of untrustworthy systems that are being applied in domains where safety is a concern would be at least a stop-gap that can be used to ensure that there is awareness of the lack of safety of such systems.

Tentative Conclusions

All of these measures are only useful if done, and are of only marginal value unless done well. For example, claiming that a process exists for reporting failures or responding to them is far easier than actually having a robust process. For this reason, process transparency is a critical enabler to these avenues for response and reporting to be useful. Governance audits are one useful potential avenue for this, and can also ensure there is a process that deals with anticipating and addressing known failure modes, and ensure that the systems are ready to respond to a failure. Safety culture more broadly is valuable, but challenging for other reasons^[3].

It seems that there are a number of useful avenues to explore in this area, and that there are others already discussing and working on this. Given that, I’m interested in people’s thoughts about this, collaborations to move this forward, or critiques.

^[1]
There are civil liberty concerns with certain parts of this process which are critical. At the same time, in applications where privacy is needed, some technological conveniences from the latest generation of AI systems, which are not known to be safe in many ways, can and should be unavailable. Of course, in cases where auditing for safety is needed, providers should make clear to users that there is no expectation of full privacy, to address concerns raised by, for example, ABC v Lenah Game Meats, which said that if people “reasonably expect” privacy, that their expectation has legal force. On the other hand, for better, or, mostly, for worse, extensive and invasive data collection from online services is already standard, and the expectation for privacy is arguably minimal.
^[2]
The most obvious of these policy risk reduction measures is the simple but drastic measure of discontinuing production of advanced chips that allow machine learning (and gaming, and digital filmography, etc.). This, along with several other less severe but similar approaches, is the equivalent of banning nuclear power, or regulating it to raise costs to accomplish the same.
^[3]
I'm working on writing these issues up more clearly, and would love to talk to others about this.

quinn (Jun 15 2023)

Re KYC: if liability is concentrated on an upstream LLM vendor, you would predict a higher bar of gatekeeping applied to downstream API consumers. So the median gpt app would be much closer to duolingo than to an edgy poetry collective. If liability falls only to end-users, LLM vendors would keep gatekeeping rather low.

This itself is a cost-benefit calculation, thinking through different definitions of "democratized AI" and "equitable access" and figuring out which class of mistakes is preferable.

"Who foots the bill for compliance?" is a fun incentive alignment puzzle!

Davidmanheim (Jun 16 2023)

Yeah, I'm a fan of joint and several liability for LLMs that includes any party which was not explicitly stopping a use of the LLM for something that causes harms, for exactly this reason.

Effective Altruism Forum