Hide table of contents

This work is supported by CEEALAR.

Many thanks to Vanessa Kosoy, Zac Hatfield-Dodds, Koen Holtman, Aaron Gertler, and Aman Patel

Software is difficult to do. Sometimes software can be made less wrong. 

Wikipedia: In the context of hardware and software systems, formal verification [FV] is the act of proving or disproving the correctness of intended algorithms underlying a system with respect to a certain formal specification or property, using formal methods of mathematics.

In June of 2020, Arden Koehler wrote in “Some promising career ideas beyond 80,000 Hours' priority paths“ that becoming an expert in formal verification is a candidate.

'Proof assistants' are programs used to formally verify that computer systems have various properties—for example that they are secure against certain cyberattacks—and to help develop programs that are formally verifiable in this way.

Currently, proof assistants are not very highly developed, but the ability to create programs that can be formally verified to have important properties seems like it could be helpful for addressing a variety of issues, perhaps including AI safety and cybersecurity. So improving proof assistants seems like it could be very high-value.

For example, it might be possible to use proof assistants to help solve the AI ‘alignment problem’ by creating AI systems that we can prove have certain properties we think are required for the AI system to reliably do what we want it to do. Alternatively, we may be able to use proof assistants to generate programs that we need to solve some sub-part of the problem. (Read our career review of researching risks from AI)

We haven't looked into formal verification yet much, but both further research in this area as well as applying existing techniques to important issues seem potentially promising to us. You can enter this path by studying formal verification at the undergraduate or graduate level, or learning about it independently if you have a background in computer science. Jobs in this area exist both in industry and in academia.

I want to scrutinize FV’s position in AI, mostly existential safety. We will not be covering information security careers within the scope of this post, but a broader treatment of high impact careers in formal verification ought to cover information security. Within FV, my emphasis will be on FV tooling, specifically our ability to work with machine-checked proofs, not necessarily manual proofs.

Some relevant details about me before I begin: I was E2G in a python-ish/security job for most of 2020, I’m currently at CEEALAR working on Cooperative AI (common pool resource problems in multiagent RL) in AI Safety Camp 5 and hosting the Technical AI Safety Podcast. I’ve read two volumes of Software Foundations (a major textbook in programming language theory (PLT)) and I’ve written small real analysis and linear algebra programs/libraries for personal use in Coq (a major proof assistant). Credentially, I've almost finished an associates degree in mathematics. I am writing this post because I enjoy working in proof assistants, and I want to create a resource for the community to point to if they meet someone who’s excited about FV (i.e. is considering it as a career path). Additionally, I hope the post is useful at a higher level for nontechnical audiences to help them reason about things they’re hearing in the space. 

Broadly, I will conclude that the case for applying FV to AI existential safety is mixed, but I’m pessimistic. I will begin by getting us on the same page on what it means to be “formal”. Then, I will discuss two cultures in computer science; formal verification and artificial intelligence. After that, I will review a chapter of Stuart Russell’s book Human Compatible concerning claims about “provably beneficial” AI. Penultimately, I will introduce a distinction between theories and products to talk about the role adoption and developer ecosystems play in existential safety. Finally, I will conclude by giving my probability mass on the belief that AI existential safety is a path to impact for an FV tooling expert

When is a career “high impact”?

If you’re coming to this post from outside the effective altruism (EA) community, welcome! I found a definition of high impact career on the homepage of 80000 Hours, a career advice nonprofit.

The highest-impact careers are those which allow you to make the biggest contribution to solving the world’s most pressing problems.

In this post, I’ll be leaning heavily on the assumption that shaping AI is important.

At the end, I will ask “is altruistic pressure needed?” because, as always, it is important to cleave a problem into areas that will be solved by default incentives and areas that require extra coordination or attention. Andrew Critch speaks at length about this process of cleaving in expectation here, and it comes up in the form of neglect forecasts here. In other words, sometimes roles are crucial but we don’t need altruists to fill them, because because being motivated by money or status could lead to someone doing an adequate job. 

What does “formal” mean, and why does it matter?

When something is “formal” that means it can be understood by writing and rewriting. When we “formalize” something we reduce it to rules for writing and rewriting things. An important distinction in a formal system is between statics (what you can write down) and dynamics (how you can rewrite it). Just for intuition, I want to provide a minimal formal system, and I choose the MU puzzle from Gödel Escher Bach. 1. “MI” is an axiom of the theory (it is a string that we know we can write down without any premises), 2. if a string of the theory ends in I then you can append U to the end of it to get another string of the theory (so from the axiom “MI” I can compute “MIU”, and admit “MIU” to the theory; the rule “1” is a “proof” of “MIU” (“string” is interchangeable with “theorem”)), 3. if you separate out anything that follows an M into a substring, you can repeat that substring to get a new string (so I can apply rule 3 to “MIU” to compute “MIUIU”, or I can apply it to “MI” to compute “MII”), and so on. In MU, the statics is just the alphabet {M, I, U}, and the dynamics are the rules 2, 3, and so on (the axiom “MI” is neither a static nor dynamic, i.e. games with the same rules but different initial conditions may compute different sets of strings/theorems).

As it turns out, computation is adequately modeled by formal systems. But it’s not just mathematical: formal systems imply techniques to write something called “correct software”. For example, if you know a priori that particular configurations (i.e. strings/theorems in the MU example) are bad, then you can sometimes synthesize a program that is guaranteed to avoid those states, other times the best you can do with that knowledge is certify a program that was written by ordinary programmer labor involving ordinary rubber ducks and ordinary coffee (in other words, to prove that a program is correct with respect to the property of not being found in the bad configuration, much like the guarantee but post-hoc). In ideal circumstances, we call the a priori knowledge a specification. This all rests on a few mighty “ifs”, so we should unpack the premises. Is desired behavior knowable a priori? Then, is it describable in an appropriate specification? These are crucial questions, if you don’t have these then FV is a nonstarter. But in real world software it’s worse than that even: state of the art research in program synthesis deals in what looks like toy problems from the perspective of real-world software, and the post-hoc certification approach is bounded by project size[1] and available labor hours. In other words, FV is hard. When I say FV is hard, I mean that both in the research sense of showing that things are, in principle, possible and in the industrial/economical sense of showing that things can reasonably be implemented by a competitive (i.e. fast) team. However, to formally verify an entire project end to end is often not the ask, in fact picking out subproblems or components that are candidates for FV can sometimes be very impactful on the overall functioning of a large software project. As an example, consider your bank password: the developers may formally verify that information can’t leak at the point of checking that your password is correct, but that does not mean the entire website is formally verified, i.e. they may not bother to formally check that the redirect after sign-in goes to the intended location. 

FV in Artificial Intelligence

FV and AI are two cultures of computer science. Their intersection is not exactly neglected, but there is still a gap. For reference, here is the alignment newsletter spreadsheet as of April 2021 filtered for the “Verification” category. The spreadsheet should not be considered an exhaustive overview of efforts to close the gap. I will assert, but not defend, that vehicular AI (including weaponry) is likely to be a site of this intersection taking hold in industry. The more interesting question lies in FV’s relationship to alignment problems and existential safety, which is what I hope to cover in this post. 

Some facts about AI institutions

Rohin Shah

Many people believe that ultimately we will need to prove theorems about the safety of our AIs.

We established that we can only apply FV if we know a priori what we consider desirable. This presents the first problem: what is the specification with respect to which we apply these techniques? As wikipedia playfully puts it, “who verifies the verifiers?”, a problem in FV before you even involve AI. As for the specification for a safe AI: we don’t know it - we don’t know what desirable behavior for an arbitrary goal-accomplishing system looks like, besides such vagueness as “not killing everybody” etc. - but there are three cases describing what lies ahead: 1. we’ll someday write it down mathematically or philosophically and then we’ll find a computational description/implementation, 2. we’ll someday write it down but there will be no computational interpretation possible, 3. we’ll never write it down at all. I will call the first case formally specifiable safety, the second case informally specifiable safety, and the third case nonspecifiable safety. Note these three cases correspond to different schools of thought about how to pursue alignment or existential safety research. 

The difference between formally specifiable safety and informally specifiable safety may be a matter of tooling, in fact, the claim that something can be made mathematically precise but in principle remain out of reach of FV tools (such as, but not limited to, proof assistants) is possibly quite strong and I don’t think I can justify it. However, if something takes an order of magnitude more engineering effort to accomplish it’s effectively a non-starter. So instead of informally specifiable safety a more accurate name would be intractably-formally specifiable safety, but it doesn’t roll off the tongue so much. 

A key danger of FV, in the words of researcher Greg Anderson in TASP episode 2, “...because you’ve done it with FV you have a lot of confidence, of course if your specs don’t take into account all the bad things that can happen then you may believe you have a safer system then you really do. This is a classic problem in FV, this comes up all the time.” In other words, ideal safety or ideal correctness in software is always with respect to some specification, so the key question does not become how best to use specifications to synthesize or certify software, but who’s writing the specification and how good are they at it, or what processes (social or computational) are generating specifications? Betting on FV’s guarantees can lure us into a false sense of security, even under ideal implementation circumstances (circumstances which, again, are technically difficult to accomplish). 

I think the specifiable safety crowd’s burdens lie partially in providing threat models and success stories in which specifications play a pivotal role. It would be especially powerful if they could provide a threat model in which success or failure hinged on whether the certification that a program obeyed a particular specification. If these threat models turn out to be plausible, then work to close the academic/cultural gap between FV and AI could prove to be high impact in expectation. Alternatively, advocates of specifiable safety could mine other threat models for the kinds of specifications that are likely to be needed and work ahead of time to make them legitimate and viable for FV tools. An example kind of specification that’s been mentioned is fuzzy specifications. 

The nonspecifiable safety crowd tends to be attracted toward threat models in which precision of implementation doesn’t play a pivotal role. If you think failure looks like the accumulated effects of the average behavior of systems that aren’t intent-aligned, and implementations with lower error rate might not ultimately avert disaster, you might fall into the nonspecifiable safety school of thought. 

What does Stuart Russell mean by “provably beneficial?”

TLDR, in spite of the claim that he’s working toward “provably beneficial” AI, there’s no path to impact for a FV expert found here.

In my reading of the book Human Compatible, I think Russell wants to go long on formally specifiable safety. After classifying goal-accomplishing systems as “the standard model” of AI, he proposes a new, nonstandard model that follows three principles. 1. The machine’s only objective must be satisfaction of human preferences, 2. the machine is initially uncertain about what those preferences are, and 3. the ultimate source of information about human preferences in human behavior (page 173). The nonstandard model falls under specifiable safety because the three principles are an instance of a proposal for the specification, and the only research left to do is figure out how precise and computational we can make it (and of course if it really succeeds at averting disasters). The Center for Human Compatible AI’s mission statement directly references the subsequent chapter, which is called “provably beneficial AI”. Unfortunately, Russell didn’t walk through the justification for the word “provable” to my satisfaction. He does a good job at explaining that the limits of proof are in how well the premises adhere to reality, but he doesn’t spare any ink toward the problems of finding implementations of mathematical specifications or of getting a competitive team of software developers to make correct software actionable. His example case for logic and engineering taking place in “imaginary worlds” (pg 186) is structural engineering, engineers need their mastery and experience to know when to strengthen or weaken assumptions in their mathematical models to create the appropriate approximation such that the building doesn’t fall down but you’re also not spending resources shoring things up at an irrelevant order of magnitude. He then invokes cybersecurity, citing side-channel attacks: when there’s a proof of security in the digital domain one may neglect, for example, the sound the keyboard makes or the delta in voltage detectable when the user is typing, either of which might lead to information leaks (recall the false sense of security from above). I guess he wouldn’t do this in a book for barnes and nobles shelves, but I wanted to see him speculate more about what a certification of his three principles could look like. It’s clear to me that the three principles of the nonstandard model are what Russell means by “beneficial”, if I’m being optimistic I think it could form an anchor for a formal description of “beneficial” someday. I just didn’t get a good sense of what he meant by “provable”. He quickly jumps off into a long section on the state of the art in inverse reinforcement learning and assistance games, am I missing the fact that that’s what he meant by provable? 

Before he opens the discussion up to inverse reinforcement learning and assistance games, he does provide an example sketch of a proof with appropriately glaring holes (pg 187). 

Let’s look at the kind of theorem we would like eventually to prove about machines that are beneficial to humans. One type might go something like this: 

Suppose a machine has components A, B, and C, connected to eachother like so and to the environment like so, with internal learning algorithms lA, lB, lC that optimize internal feedback rewards rA, rB, rC defined like so, and [a few more conditions] . . . then, with very high probability, the machine’s behavior will be very close in value (for humans) to the best possible behavior realizable on any machine with the same computational and physical capabilities.

I think the reason this proof sketch doesn’t strike me as satisfying because it’s effectively a family of proofs indexed on values of epsilon, a subjective threshold must be selected of acceptably high probabilities 1 – epsilon. I can’t help but recall “Security Mindset and Ordinary Paranoia” by Eliezer Yudkowsky, which suggests that if you have to select a threshold, you’ve already lost. 

If you write a system that will hurt you if a certain speed of self-improvement turns out to be possible, then you’ve written the wrong code. The code should just never hurt you regardless of the true value of that background parameter.

I view Eliezer’s choice of background parameter “speed of self-improvement” as just an example. To be fair, not all threat models are like Eliezer’s, ones in which a minor slip-up means instant death for everyone so a 1 – epsilon probability of success is unacceptable. Take a look at this threat model by Paul Christiano, which strikes me more as being about the cumulative effects of average behavior. The sense I got from the book was that Russell’s threat models are much closer to Paul’s, with the focus on enfeeblement (pg 254-5) and the right to mental security (pg 107) not having that instant-death vibe. However, on the margin I’d prefer to see us not having to select thresholds and epsilons in our solutions! 

Even though Russell would like to go long on formally specifiable safety, his principles are far from mathematically precise, and his threat model implies, according to the rough taxonomy I provided above, that he’s interested in nonspecifiable safety. TLDR, in spite of the claim that he’s working toward “provably beneficial” AI, there’s no path to impact for a FV expert found here.

Distinguishing theories and products

Let’s have a story

The year is 2045. AI capabilities has progressed linearly over the last 25 years, the inner and intent alignment problems were both completely understood around 2035 when a training paradigm called New Optimization was discovered/invented, and the commercial impacts are drastic. Swarms of middle-management products are optimizing the daily operations of major organizations with minimal oversight, and executive assistant products are increasingly influencing the direction and values of each organization. While products are mostly aligned to their promises to consumers, we failed to instill any broader notion of value alignment that would have prevented enfeeblement, prevented us from neglecting the things that are harder to measure than commercial fashion sensibilities. The software engineering culture that drove most of the products understood how to take New Optimization solutions from theory paper to production code, but it didn’t understand why the solutions were important, never really appreciated the problems. So, when faced with stakeholder pressure, it opted for a “silent failure” approach, in which errors would be intercepted and the module would defer to more primitive iterative descent instead of crashing the system. As dependence on these products grew, people grew worse at supervising and overriding behaviors, some of which were artifacts of an older time before the solution to these two alignment problems had arrived. Various experts and concerned citizens realized the problem, that there is this gap between theory and product. A few sympathetic voices in the IEEE wanted to help by publishing a standard, but had trouble populating a committee of people with the right mix of expertise. Two years ago (2043), a paper was pushed to the arxiv, detailing for the first time a complete formal specification of New Optimization. The author had to invent a new kind of proof assistant to do it and even included a proof-of-concept of how to certify a complex software project with it. By then, the unrigorous implementations were already roaming loose in the economy, and it would be a catchup game to implement a standard, but in principle it should be possible. However, we’ve so far failed to coordinate on this and no one’s done anything with that paper yet. 

The major variable in this story is the relative size of the time deltas between the mathematical description of New Optimization, the ability for commercial software developers to capitalize on it, and the ability for a FV specialist to make it certifiable. In the version you just read, New Optimization was published in 2035, products began spilling out within a few years, and it became certifiable (in principle) in 2043. You could easily tell a much nicer story where the paper that made it certifiable came out one year after it was originally published, before products had a chance to be implemented, and IEEE could get ahead of it by publishing a standard making it more accessible to implement certified New Optimization. Another major variable is the software culture’s ability to respond to the certifiability result, I wanted to illustrate a failure mode so I chose to set this variable low, but you could easily adjust the story into something where it’s higher. 

What’s important here are the premises of the story: 1. theorists will cough up transformative results, 2. those results won’t automatically be brought to the commercial world in a rigorous way, 3. a FV expert can have leverage over outcomes. 

The third premise seems to me the weakest- as we saw, the FV expert could push a paper, but they could not make software society adopt. It really doesn’t qualify as leverage over an outcome when no one reads your paper. In functional programming culture it’s popular to lament that the world is driven by fickle languages like C or eldritch horrors like JavaScript when instead “just imagine a world where all the properties I love about Haskell and Rust are driving the world,” and so on. The punchline is, of course, that the world goes on turning! Tony Hoare, an early pioneer of formal methods, wrote in 1995 “It has turned out that the world just does not suffer significantly from the kind of problem that our research was originally intended to solve”[2]. I wonder if Hoare in 2020 or 2030, seeing the scale of software that was not present in 1995, would update his words “does not suffer significantly” to something else. Also see [3]. 

Perhaps you could say this is precisely where someone with FV expertise may find an opportunity, provided they’re also an expert in developer ecosystems. Better yet, an organization like a startup or nonprofit that takes the paper on certified New Optimization and throws a ton of resources into advocating for the ecosystem around it, making the tooling easier to use, etc. Developer ecosystems move at a glacial pace though, so this whole thing may be a nonstarter in fast-takeoff scenarios, unless you had adequate foresight to start the change we’d need to see in the developer ecosystem in advance.

I think the second premise is optimistic in the sense that safety theories would get brought to the commercial world at all. At the margin, a bad implementation is probably better than no implementation, right? Finally, premise number one: that theorists will cough up transformative results. How much does the case for FV in AI existential safety rest on this? Again, from the “who verifies the verifiers” problem, we know that FV doesn’t exist in isolation from the properties and behaviors that we want, and I think the transformative results that the theorists will (hopefully) arrive at could give us a description of these properties and behaviors. Paul Christiano mentions this in footnote [4].

TLDR, is AI existential safety a path to impact for a FV tooling expert? I put between two thirds and three quarters of the mass on “no”, because my threat models are closer to the ones I identified as belonging to the nonspecifiable safety school of thought, and because there’s no upper bound to to the Tony Hoare quote. You could easily read this far and come to a different assignment of mass, and I’m curious to hear in the comments (numbers and all!) 


Software is hard to do correctly. There are techniques to make software less wrong. I’m skeptical that these techniques will prove pivotal in the overall path to existential safety of software (i.e. AI) systems. 

In the spirit of anticipating, legitimizing, and fulfilling governance demands, it may make sense in the near term (e.g. right now) to innovate ML-targeted FV capabilities to increase the probability that certified AI is viable when the time comes to implement advanced systems, but only if you’re convinced by the case for ML-targeted FV capabilities playing a pivotal role in existential safety. In the medium term (e.g. five-twenty years before TAI), the connection between FV capabilities and developer ecosystems must be assessed, and possibly attended to. 

Additionally, it could be great to build up a career that bridges technical FV expertise with policy expertise on the off chance that you get to be in the room where policies, standards, or regulations get drawn up. Public-interest-tech.com maintains a list of programs to put technologists in policy positions. 

Subproblems, not the whole problem

It’s important to remember that, like the current day’s implementations of FV’d software, FV demands on AI may never cover the entire problem or the entire implementation, but may target specific components. Properties of large systems, e.g. that a smart city doesn’t have every light green at the same time, may be worth verifying even if we can’t verify one big awesome existential safety guarantee. Verifying solutions to parts of the problem in implementations may be critically important. 

Is altruistic pressure needed?

FV tooling expertise that is deployed in setting policies, standards, or regulations might benefit from an altruist or someone thinking with worst-case threat models. If those policies, standards, or regulations are in place, and implemented with teeth, it doesn’t make sense for altruists to trouble themselves with implementation and innovation, but if they’re implemented without teeth (or not implemented) altruists could be pivotal at every step, building better tools where the market fails to, or making late stage pushes for more or more teethful regulation. In the long term (e.g. as TAI is being deployed), FV wouldn’t have any leverage in an low-regulation environment and altruistic FV experts would be drowned out by ordinary money and status seekers in an high-regulation environment.

Software is eating humanity, which could either go well or go poorly. Thank you for caring enough about the outcome to read this post. Let’s make it go well for all


[1] The largest project I’ve heard about getting formally verified is 8700 lines of C code, seL4

[2] “Ten years ago [1985], researchers into formal methods (and I was the most mistaken among them) predicted that the programming world would embrace with gratitude every assistance promised by formalization to solve the problems of reliability that arise when programs get large and more safety-critical. Programs have now gotten very large and very critical – well beyond the scale which can be comfortably tackled by formal methods. There have been many problems and failures, but these have nearly always been attributable to inadequate analysis of requirements or inadequate management control. It has turned out that the world just does not suffer significantly from the kind of problem that our research was originally intended to solve.” Tony Hoare, source.

[3] Towards Making Formal Methods More Normal: Meeting Developers Where They Are 

[4] Paul Christiano: “the technical issues really will probably occur first in a security context, and the best near-term analogies for control problems will probably be security problems.”, in the linked Medium post he says “Many security problems don’t have direct relevance to alignment” and lists three. Later in that post, bold added: “Formal verification is a tool for building secure systems. As described above, this may become more important in a world with AI, but does not seem directly relevant to alignment. Formal verification can also be used to design components of AI systems that definitely satisfy some formal contract. Figuring out what kind of contract we might want to satisfy seems relevant to alignment; but I’m less excited about improving our ability to formally verify that we’ve satisfied some particular contract.”

[5] This EA Forum exchange is instructive

[6] This stackexchange is instructive

[7] This tumblr exchange is instructive





More posts like this

Sorted by Click to highlight new comments since: Today at 4:06 PM

3 year update: I consider this 2 year update to be a truncated version of the post, but it's actually too punchy and even superficial. 

My opinion lately / these days is too confused and nuanced to write about. 

As someone who's worked both in ML for formal verification with security motivations in mind, and (now) directly on AGI alignment, I think most EA-aligned folk who would be good at formal verification will be close enough to being good at direct AGI alignment that it will be higher impact to work directly on AGI alignment.  It's possible this would change in the future if there are a lot more people working on theoretically-motivated prosaic AGI alignment, but I don't think we're there yet.

Planned summary for the Alignment Newsletter:

This post considers the applicability of formal verification techniques to AI alignment. Now in order to “verify” a property, you need a specification of that property against which to verify. The author considers three possibilities:

1. **Formally specifiable safety:** we can write down a specification for safe AI, _and_ we’ll be able to find a computational description or implementation

2. **Informally specifiable safety:** we can write down a specification for safe AI mathematically or philosophically, but we will not be able to produce a computational version

3. **Nonspecifiable safety:** we will never write down a specification for safe AI.

Formal verification techniques are applicable only to the first case. Unfortunately, it seems that no one expects the first case to hold in practice: even CHAI, with its mission of building provably beneficial AI systems, is talking about proofs in the informal specification case (which still includes math), on the basis of comments like [these](


) in Human Compatible. In addition, it currently seems particularly hard for experts in formal verification to impact actual practice, and there doesn’t seem to be much reason to expect that to change. As a result, the author is relatively pessimistic about formal verification as a route to reduce existential risk from failures of AI alignment.

I think this post makes some excellent points; for brevity, I'm just going to mention my (potential) disagreements.

The punchline is, of course, that the world goes on turning! Tony Hoare, an early pioneer of formal methods, wrote in 1995 “It has turned out that the world just does not suffer significantly from the kind of problem that our research was originally intended to solve”

I suspect (although it's hard to check) this is because most software is mostly used by people who aren't trying to break it. Weird edge cases, for a lot of software, usually don't have that bad of a consequence.

Throw ML into the mix and any divergence between your intended (possibly implicit) utility function and the one you actually programmed can be ruthlessly exploited. In other words: in the past, the world just didn't suffer significantly from the kind of problem that formal methods was intended to solve. But that was when human beings were responsible for the logic.

A particularly colorful example of this is story about the (RL?) agent that was being trained to land an aircraft on a carrier in simulation. Due to an overflow error, it learned a strategy of crashing into the deck at very high velocity. Most human programmers would not think to try that; it may be the case that formal methods were mostly unnecessary in the past, but become critical for guaranteeing safety of autonomy.

Your point about not knowing what the specification for safe AI looks like is still well taken.

formally verify an entire project end to end is often not the ask

Leslie Lamport has suggest that most of the value of formal methods is in getting programmers to actually write down specifications. I believe it's possible that:

  • Formal verification of AGI is not viable
  • Many critical components of AGI can be formally specified
  • Better tooling can help understand/explore/check specifications

An anecdote from industry: Amazon actually budgets engineer hours to write TLA+, which is not used to formally verify implementations of algorithms, but does let programmers model check their specifications, e.g. prove the absence of deadlocks. Speculatively, we might imagine improved formal methods tools and techniques that allow AI researcher to check that their specifications are sound, even while end-to-end verification remains intractable.

Overall, I'm still mildly pessimistic about formal verification being a high-impact career, largely for the reasons that you describe.

Is it easy to dig up a source for the RL agent that learned to crash into the deck?

I don't remember where I initially read about it, but I found a survey paper that mentions it (and several other, similar anecdotes), along with some citations


in which a minor slip-up means instant death for everyone so a 1 – epsilon probability of success is unacceptable.

Oh, does Eliezer still think (speak?) that way? I think that would be the first clear reasoning error (that can't just be written off as a sort of opinionated specialization) I've seen him make about AI strategy. In a situation where there's a certain yearly baseline risk of the deployment of misaligned AGI occurring (which is currently quite low and so this wouldn't be active yet!), it does actually become acceptable to deploy a system that has a well estimated risk of being misaligned. Techniques that only have a decent chance of working is actually useful and should be collected enthusiastically.

I don't know that he is still taking a zero risk policy, I've been seeing a lot more "no it will almost certainly be misaligned" recently, but it could have given rise to a lot of erroneous inferences.