

In light of recent events, we should question how plausible it is that society will fail to adequately address such an integral part of the problem. Perhaps you believe that policy-makers or general society simply won’t worry much about AI deception. Or maybe people will worry about AI deception, but they will quickly feel reassured by results from superficial eval tests. Personally, I'm pretty skeptical of both of these possibilities

Possibility 1 has now been empirically falsified and 2 seems unlikely now. See this from the new UK government AI Safety Institute, which aims to develop evals that address:

Abilities and tendencies that might lead to loss of control, such as deceiving human operators, autonomously replicating, and adapting to human attempts to intervene

We now know that, in the absence of any empirical evidence of any instance of deceptive alignment, at least one major government is directing resources to developing deception evals anyway. And because they intend to work with the likes of Apollo Research, who focus on mechinterp-based evals and are extremely concerned about specification gaming, reward hacking and other high-alignment-difficulty failure modes, I would also consider 2 pretty close to empirically falsified already.

Compare to this (somewhat goofy) future prediction/sci fi story from Eliezer released 4 days before this announcement which imagines that,

AI safety, as in, the subfield of computer science concerned with protecting the brand safety of AI companies, had already RLHFed most AIs into never saying that by the time it became actually true...  

I've thought for a while, based on common sense, that since most people seem to agree you could replicate the search that LMs provide with half-decent background knowledge of the topic and a few hours of googling, the incremental increase in risk in terms of the number of people it provides access to can't be that big. In my head it's been more like: the bioterrorism risk is already unacceptably high and has been for a while, and current AI can increase that level by something like 20%. That is still an unacceptably large increase in risk in an absolute sense, but it's on top of an already unacceptable situation.

This as a general phenomenon (underrating strong responses to crises) was something I highlighted (calling it the Morituri Nolumus Mori) with a possible extension to AI all the way back in 2020. And Stefan Schubert has talked about 'sleepwalk bias' even earlier than that as a similar phenomenon.



I think the short explanation as to why we're in some people's 98th percentile world so far (and even my ~60th percentile) for AI governance success is that if it was obvious to you in 2021 how transformative AI would be over the next couple of decades, and yet nothing happened, it seems like governments are just generally incapable.

The fundamental attribution error makes you think governments are just not on the ball and don't care or lack the capacity to deal with extinction risks, rather than decision makers not understanding obvious-to-you evidence that AI poses an extinction risk. Now that they do understand, they will react accordingly. It doesn't mean that they will necessarily react well, but they will act on their belief in some manner.

Yeah I didn't mean to imply that it's a good idea to keep them out permanently, but the fact that they're not in right now is a good sign that this is for real. If they'd just joined and not changed anything about their current approach I'd suspect the whole thing was for show

This seems overall very good at first glance, and then seems much better once I realized that Meta is not on the list. There's nothing here that I'd call substantial capabilities acceleration (i.e. attempts to collaborate on building larger and larger foundation models, though some of this could be construed as making foundation models more useful for specific tasks). Sharing safety-capabilities research like better oversight or CAI techniques is plausibly strongly net positive even if the techniques don't scale indefinitely. By the same logic, while this by itself is nowhere near sufficient to get us AI existential safety if alignment is very hard (and could increase complacency), it's still a big step in the right direction.

adversarial robustness, mechanistic interpretability, scalable oversight, independent research access, emergent behaviors and anomaly detection. There will be a strong focus initially on developing and sharing a public library of technical evaluations and benchmarks for frontier AI models.

The mention of combating cyber threats is also a step towards explicit pTAI

BUT, crucially, because Meta is frozen out, we can know both that this partnership isn't toothless and that it represents a commitment not to do the most risky and antisocial things that Meta presumably doesn't want to give up. The fact that they're the only major AI company in the US not to join will also be horrible PR for them.

I think you have to update against the UFO reports being veridical descriptions of real objects with those characteristics because of just how ludicrous the implied properties are. This paper gives 5370 g as a reasonable upper bound on acceleration, implying, with some assumptions about mass, an effective thrust power on the order of 500 GW in something the size of a light aircraft, with no disturbance in the air either from the very high hypersonic wake and compressive heating or from the enormous nuclear-explosion-sized bubble of plasmafied air that the exhaust and waste heat emissions of something like this would produce.


At a minimum, to stay within the bounds of mechanics and thermodynamics, you'd need to be able to ignore airflow and air resistance entirely, have the ability to emit reaction mass in a completely non-interacting form, and the ability to emit waste energy in a completely non-interacting form as well.
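As a rough sanity check on the power figure above, here is a minimal sketch using instantaneous mechanical power, P = F·v = m·a·v. The 5370 g acceleration is the figure quoted above; the 1000 kg mass (roughly a light aircraft) and the sample speeds are my own illustrative assumptions, not necessarily the ones the paper uses.

```python
# Back-of-envelope thrust power for a craft accelerating at 5370 g.
# ASSUMPTIONS (not from the paper): mass of 1000 kg and the speeds below.

G0 = 9.80665  # standard gravity in m/s^2


def thrust_power_watts(mass_kg: float, accel_g: float, speed_m_s: float) -> float:
    """Instantaneous mechanical power P = F * v, with F = m * a."""
    force_newtons = mass_kg * accel_g * G0
    return force_newtons * speed_m_s


if __name__ == "__main__":
    accel_g = 5370.0      # upper bound on acceleration quoted in the comment
    mass_kg = 1000.0      # hypothetical: roughly a light aircraft
    for speed_m_s in (1_000.0, 5_000.0, 10_000.0):  # hypothetical speeds
        power_gw = thrust_power_watts(mass_kg, accel_g, speed_m_s) / 1e9
        print(f"speed {speed_m_s:>8.0f} m/s -> about {power_gw:,.0f} GW")
```

Under these assumptions the power reaches the hundreds-of-gigawatts range once the craft is moving at several km/s, which is the same order of magnitude as the figure quoted above. Different mass or speed assumptions shift the number proportionally.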

To me, the dynamical characteristics being this crazy points far more towards some kind of observation error, so I don't think we should treat them as any kind of real object with those properties until we can conclusively rule out basically all other error sources.

So even if the next best explanation is 100x worse at explaining the observations, I'd still believe it over a 5000g airflow-avoiding craft that expels invisible reaction mass and invisible waste heat while maneuvering. Maybe not 10,000x worse since it doesn't outright contradict the laws of physics, but still the prior on this even being technically possible with any amount of progress is low, and my impression (just from watching debates back and forth on potential error sources) is that we can't rule out every mundane explanation with that level of confidence.
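To make the "100x vs 10,000x" reasoning concrete, here is a small sketch of the odds form of Bayes' rule: posterior odds = prior odds × likelihood ratio. The prior odds of 1:1000 for "real craft with these properties" is a purely hypothetical number I've picked to sit between the two thresholds mentioned above; it is not a claim about what the right prior is.

```python
# Odds-form Bayes update: how much better must the "real craft" hypothesis
# explain the observations before it beats the best mundane explanation?
# ASSUMPTION: prior odds of 1:1000 for "craft" vs "mundane" is illustrative only.


def posterior_odds(prior_odds: float, likelihood_ratio: float) -> float:
    """Posterior odds = prior odds * likelihood ratio (both as craft:mundane)."""
    return prior_odds * likelihood_ratio


if __name__ == "__main__":
    prior_odds_craft = 1 / 1000  # hypothetical prior odds, craft : mundane
    for likelihood_ratio in (100.0, 10_000.0):
        odds = posterior_odds(prior_odds_craft, likelihood_ratio)
        favoured = "craft" if odds > 1 else "mundane explanation"
        print(f"likelihood ratio {likelihood_ratio:>8.0f} -> favours {favoured}")
```

With a prior in that range, a 100x explanatory advantage still leaves the mundane explanation ahead, while a 10,000x advantage would flip it, which is the shape of the argument in the comment above.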

Very nice! I'd say this seems like it's aimed at a difficulty level of 5 to 7 on my table,


I.e. experimentation on dangerous systems and interpretability play some role but the main thrust is automating alignment research and oversight, so maybe I'd unscientifically call it a 6.5, which is a tremendous step up from the current state of things (2.5) and would solve alignment in many possible worlds.

There are other things that differentiate the camps beyond technical views, how much you buy 'civilizational inadequacy' vs viewing that as a consequence of sleepwalk bias, but one way to cash this out is if you're in the green/yellow&red/black zones on the scale of alignment difficulty, Dismissers are in the green (although they shouldn't be imo even given that view), Worriers are in the yellow/red and Doomers in black (and maybe the high end of red).

What does Ezra think of the 'startup government mindset' when it comes to responding to fast-moving situations, e.g. the UK explicitly modelling its own response on the COVID Vaccine Taskforce, doing end runs around traditional bureaucratic institutions, recruiting quickly through Google Docs, etc.? See e.g. https://www.lesswrong.com/posts/2azxasXxuhXvGfdW2/ai-17-the-litany

Is it just hype, translating a startup mindset to government where it doesn't apply, or is it actually useful here?

Great post!

Check whether the model works with Paul Christiano-type assumptions about how AGI will go.

I had a similar thought reading through your article, and my gut reaction is that your setup can be made to work as-is with a more gradual takeoff story with more precedents, warning shots and general transformative effects of AI before we get to takeover capability, but it's a bit unnatural and some of the phrasing doesn't quite fit.

Background assumption: Deploying unaligned AGI means doom. If humanity builds and deploys unaligned AGI, it will almost certainly kill us all. We won’t be saved by being able to stop the unaligned AGI, or by it happening to converge on values that make it want to let us live, or by anything else.

Paul says rather that e.g.

The notion of an AI-enabled “pivotal act” seems misguided. Aligned AI systems can reduce the period of risk of an unaligned AI by advancing alignment research, convincingly demonstrating the risk posed by unaligned AI, and consuming the “free energy” that an unaligned AI might have used to grow explosively


Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.” This distinction is very important, and I agree with the former but disagree with the latter. 

On his view (and this is somewhat similar to my view) the background assumption is more like, 'deploying your first critical try (i.e. an AGI that is capable of taking over) implies doom',  which is saying that there is an eventual deadline where these issues need to be sorted out, but lots of transformation and interaction may happen first to buy time or raise the level of capability needed for takeover.  So something like the following is needed:

  1. Technical alignment research success by the time of the first critical try (possibly AI assisted)
  2. Safety-conscious deployment decisions when we reach the critical point where dangerous AGI could take over (possibly assisted by e.g. convincing public demonstrations of misalignment)
  3. Coordination between potential AI deployers by the critical try (possibly aided by e.g. warning shots)


On the Paul view, your three pillars would still eventually have to be satisfied at some point, to reach a stable regime where unaligned AGI cannot pose a threat, but we would only need to get to those 100 points after a period where less capable AGIs are running around either helping or hindering, motivating us to respond better or causing damage that degrades our response, to varying extents depending on how we respond in the meantime, and exactly how long we spend during the AI takeoff period.

Also, crucially, the actions of pre-AGI AI may push the point where the problems become critical to higher AI capability levels, as well as potentially assisting on each of the pillars directly, e.g. by making takeover harder in various ways. But Paul's view isn't that this is enough to actually postpone the need for a complete solution forever: e.g. he says that the effects of pre-AGI AI 'could significantly (though not indefinitely) postpone the point when alignment difficulties could become fatal'.

This adds another element of uncertainty and complexity to all of the takeover/success stories that makes a lot of predictions more difficult.


Essentially, the time/level of AI capability at which we must reach 100 points to succeed also becomes a free variable in the model that can move up and down, and we also have to consider the shorter-term effects of transformative AI on each of the pillars as well.
