[Added 13Jun: Submitted to OpenPhil AI Worldviews Contest - this pdf version most up to date]

This is an accompanying post to AGI rising: why we are in a new era of acute risk and increasing public awareness, and what to do now.

If you apply a security mindset (Murphy’s Law) to the problem of AI alignment, it should quickly become apparent that it is very difficult. And the slow progress of the field to date is further evidence for this.[1] There are 4 major unsolved components to alignment, with a variety of corresponding threat models:

  1. Outer alignment
  2. Inner alignment
  3. Misuse risk
  4. Multipolar coordination

Any given approach that might show some promise on one or two of these still leaves the others unsolved. We are so far from being able to address all of them. And not only do we have to address them all, but the solutions have to be totally watertight, robust to AI that is superhuman in its capabilities. In the limit of superintelligent AI, the alignment needs to be perfect.

OpenAI brags about a 29% improvement in alignment in their GPT-4 announcement. This is not going to cut it! The alignment paradigms used for today’s LLMs only appear to make them relatively safe because the AIs are weaker than us. If the “grandma’s bedtime story napalm recipe” prompt engineering hack actually led to the manufacture of napalm, it would be immediately obvious how poor today’s level of practical AI alignment is.

Plans that involve increasing AI input into alignment research appear to rest on the assumption that they can be grounded by a sufficiently aligned AI at the start. But how does this not just result in an infinite, error-prone regress? Such “getting the AI to do your alignment homework” approaches are not safe ways of avoiding doom. Tegmark says that he can use his own proof checker to verify that a superintelligent AI’s complicated plans are safe. But (security mindset): how do you stop physical manipulation of your proof checker's output (rowhammer-style attacks, or something more subtle)?

The above considerations are the basis for the case that disjunctive reasoning should predominantly be applied to AI x-risk: the default is doom[2]. All the doom flows through the tiniest crack of imperfect alignment once the power level of the AI is superhuman. Every possible exploit, loophole or unknown that could engender catastrophic risk needs to be patched for this to not be the case (or we have to get very lucky).
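The compounding effect of multiple independent failure modes can be made concrete with a toy calculation. The probabilities below are purely illustrative placeholders, not estimates from this post; the point is only the structure, that survival is conjunctive while doom is disjunctive:

```python
# Toy model: survival requires solving *every* component of alignment
# (conjunctive survival, disjunctive doom). Numbers are illustrative only.
p_solve = {
    "outer alignment": 0.8,
    "inner alignment": 0.8,
    "misuse prevention": 0.8,
    "multipolar coordination": 0.8,
}

p_survive = 1.0
for p in p_solve.values():
    p_survive *= p  # must get them *all* right

p_doom = 1 - p_survive
print(f"P(survive) = {p_survive:.3f}")  # 0.8**4, about 0.410
print(f"P(doom)    = {p_doom:.3f}")     # about 0.590
```

Even being fairly confident (80%) in solving each component independently leaves a majority probability that at least one fails, which is the sense in which "all the doom flows through the tiniest crack".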

Despite this, many people in EA who take AI x-risk seriously put P(doom|AGI) in the 1-10% range. I am struggling to understand this. What is happening in the other 90-99%? How is it that it appears that the default is “we’re fine”? I have asked this here, and not had any satisfactory answers[3].

My read on this so far is that low estimates for P(doom|AGI) are either borne of ignorance of what the true difficulties in AI alignment are; stem from wishful thinking / a lack of security mindset; or are a social phenomenon where people want to sound respectable and non-alarmist; as opposed to being based on any sound technical argument. I would really appreciate it if someone could provide a detailed technical argument for believing P(doom|AGI)≤10%[4].

To go further in making the case for doom by default given AGI, there are reasons to believe that alignment might actually be impossible[5] (i.e. that superintelligence is both unpredictable and uncontrollable). Yampolskiy is of the opinion that “lower level intelligence cannot indefinitely control higher level intelligence"[6]. This is far from being a consensus view though, with many people in AI Alignment considering that we have barely scratched the surface in looking for solutions. What is clear is that we need more time![7]

The case for the reality of existential risk has been well established. Indeed, doom even looks to be a highly likely outcome from the advent of the AGI era. Common sense suggests that the burden of proof is now on AI labs to prove their products are safe (in terms of global catastrophe and extinction risk).

There are two prominent camps in the AI Alignment field when it comes to doom and foom. The Yudkowsky view has very high P(doom|AGI) and a fast take-off happening for the transition from AGI to ASI (Artificial Superintelligence). The Christiano view is more moderate on both axes. However, it’s worth noting that even Christiano is ~50% on doom[8], and ~a couple of years from AGI to ASI (a “slow” take-off that then speeds up). Metaculus predicts weak AGI within 3 years, and takeoff to happen in less than a year.

50% ≤ P(doom|AGI) < 100% means that we can confidently make the case for the AI capabilities race being a suicide race.[9] It then becomes a personal issue of life-and-death for those pushing forward the technology. Perhaps Sam Altman or Demis Hassabis really are okay with gambling 800M lives in expectation on a 90% chance of utopia from AGI? (Despite a distinct lack of democratic mandate.) But are they okay with such a gamble when the odds are reversed? Taking the bet when it’s 90%[10] chance of doom is not only highly morally problematic, but also, frankly, suicidal and deranged.[11]
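The "800M lives in expectation" figure follows from simple expected-value arithmetic over an approximate world population of 8 billion:

```python
# Expected lives lost when gambling the world on AGI, under the
# stated odds. 8 billion approximates current world population.
world_population = 8_000_000_000

p_doom_optimist = 0.10   # the "90% chance of utopia" gamble
p_doom_pessimist = 0.90  # the same gamble with the odds reversed

expected_deaths_optimist = p_doom_optimist * world_population    # 800 million
expected_deaths_pessimist = p_doom_pessimist * world_population  # 7.2 billion

print(f"{expected_deaths_optimist:,.0f} lives in expectation at 10% doom")
print(f"{expected_deaths_pessimist:,.0f} lives in expectation at 90% doom")
```

Even the optimistic end of the range is an expected toll vastly larger than any catastrophe in history, which is the force of the "suicide race" framing.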

What to do about this? Join in pushing for a global moratorium on AGI. For more, see the accompanying post to this one (of which this was originally written as a subsection).

  1. ^

    Although some would argue that we still haven’t really tried.

  2. ^

    We can operationalise "doom" as Ord's definition of "the greater part of our potential is gone and very little remains"; although I pretty much think of it as everyone being killed – squiggled or equivalent – so that ~0 value remains. If you are less inclined to foom scenarios, think of it as a global catastrophe that kills >10% of the world's population and ends civilisation as we know it.

  3. ^

    Relatedly, Richard Ngo laments a lack of plausible rebuttals to the core claims of AGI x-risk.

  4. ^

I’m hereby offering up to $1000 for someone to provide one, on a level that takes into account everything written in this section, and the accompanying post. Please also link me to anything already written, and not referenced in any answers here, that you think is relevant. I’m very interested in further steelmanning the case for high P(doom|AGI). I will note that someone on Twitter has linked me to an unfinished draft that has a high degree of detail and seems promising (but not promising enough to update me significantly away from doom; still, I’m offering them $1k to finish it). See also, the OpenPhil AI Worldviews Contest (deadline is May 31 and there haven’t been many submissions to date).

  5. ^

    If this is the case, it’s a rather depressing thought. We may be far closer to the Dune universe than the Culture one (the worry driving a future Butlerian Jihad will be the advancement of AGI algorithms to the point of individual laptops and phones being able to end the world). For those who may worry about the loss of the “glorious transhumanist future”, and in particular, radical life extension and cryonic reanimation (I’m in favour of these things), I think there is some consolation in thinking that if a really strong taboo emerges around AGI, to the point of stopping all algorithm advancement, we can still achieve these ends using standard supercomputers, bioinformatics and human scientists. I hope so.

  6. ^

    See also, Geoffrey Miller: “Our Bayesian prior, based on the simple fact that different sentient beings have different interests, values, goals, and preferences, must be that AI alignment with 'humanity in general', or 'sentient life in general', is simply not possible. Sad, but true.” Worth reading the whole comment this quote is taken from.

  7. ^

    To work on Alignment and/or prove its impossibility.

  8. ^

    See also his appearance on the Bankless podcast.

  9. ^

    Re doom “only” being 50%: "it's not suicide, it's a coin flip; heads utopia, tails you're doomed" seems like quibbling at this point. [Regarding the word you're in "you're doomed" in the above - I used that instead of "we're doomed", because when CEOs hear "we're", they're probably often thinking "not we're, you're. I'll be alright in my secure compound in NZ". But they really won't! Do they think that if the shit hits the fan with this and there are survivors, there won't be the vast majority of the survivors wanting justice?]

  10. ^

    This is my approximate estimate for P(doom|AGI). And I’ll note here that the remaining 10% for “we’re fine” is nearly all exotic exogenous factors (related to the simulation hypothesis, moral realism being true - valence realism?, consciousness, DMT aliens being real etc), that I really don't think we can rely on to save us!

  11. ^

    You might be thinking, “but they are trapped in an inadequate equilibrium of racing”. Yes, but remember it’s a suicide race! At some point someone really high up in an AI company needs to turn their steering wheel. They say they are worried, but actions speak louder than words. Take the financial/legal/reputational hit and just quit! Make a big public show of it. Pull us back from the brink.

    Update: 3 days after I wrote the above, Geoffrey Hinton, the “Godfather of AI”, has left Google to warn of the dangers of AI.

Comments

Based solely on my own impression, I'd guess that one reason for the lack of engagement on your original question stems from the fact that it felt like you were operating within a very specific frame, and I sensed that untangling the specific assumptions of your frame (and consequently a high P(doom)) would take a lot of work. In my own case, I didn’t know which assumptions are driving your estimates, and so I consequently felt unsure as to which counter-arguments you'd consider relevant to your key cruxes.

 (For example: many reviewers of the Carlsmith report (alongside Carlsmith himself) put P(doom) ≤ 10%. If you've read these responses, why did you find the responses uncompelling? Which specific arguments did you find faulty?)

Here's one example from this post where I felt as though it would take a lot of work to better understand the argument you want to put forward:

“The above considerations are the basis for the case that disjunctive reasoning should predominantly be applied to AI x-risk: the default is doom.”

When I read this, I found myself asking “wait, what are the relevant disjuncts meant to be?”. I understand a disjunctive argument for doom to be saying that doom is highly likely conditional on any one of {A, B, C, … }. If each of A, B, C … is independently plausible, then obviously this looks worrying. If you say that some claim is disjunctive, I want an argument for believing that each disjunct is independently plausible, and an argument for accepting the disjunctive framing offered as the best framing for the claim at hand.

For instance, here’s a disjunctive framing of something Nate said in his review of the Carlsmith Report.

For humanity to be dead by 2070, only one premise below needs to be true:

  1. Humanity has < 20 years to prepare for AGI
  2. The technical challenge of alignment isn’t “pretty easy”
  3. Research culture isn’t alignment-conscious in a competent way.

Phrased this way, Nate offers a disjunctive argument. And, to be clear, I think it’s worth taking seriously. But I feel like ‘disjunctive’ and ‘conjunctive’ are often thrown around a bit too loosely, and such terms mostly serve to impede the quality of discussion. It’s not obvious to me that Nate’s framing is the best framing for the question at hand, and I expect that making the case for Nate’s framing is likely to rely on the conjunction of many assumptions. Also, that’s fine! I think it’s a valuable argument to make! I just think there should be more explicit discussions and arguments about the best framings for predicting the future of AI.   
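The gap between the two framings can be made concrete numerically. Holding fixed the (independent, purely illustrative) probabilities that each of the three premises above is true, the conjunctive reading (doom requires all of them) and the disjunctive reading (doom requires any one of them) give wildly different answers, which is why the choice of framing does so much work:

```python
# How much the conjunctive/disjunctive framing matters, holding the
# per-premise probabilities fixed. q[i] = P(premise i is true); premises
# treated as independent. Numbers are illustrative, not anyone's estimates.
q = [0.9, 0.7, 0.5]  # e.g. short timelines, alignment not easy, culture not competent

prod_true = 1.0   # P(all premises true)
prod_false = 1.0  # P(no premise true)
for qi in q:
    prod_true *= qi
    prod_false *= 1 - qi

p_doom_conjunctive = prod_true       # doom only if *every* premise holds
p_doom_disjunctive = 1 - prod_false  # doom if *any* premise holds

print(f"conjunctive framing: P(doom) = {p_doom_conjunctive:.3f}")  # 0.315
print(f"disjunctive framing: P(doom) = {p_doom_disjunctive:.3f}")  # 0.985
```

Same inputs, a threefold difference in the conclusion; the substantive argument has to be over which framing is licensed, not just over the per-premise numbers.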

Finally, I feel like asking for “a detailed technical argument for believing P(doom|AGI) ≤ 10%” is making an isolated demand for rigor. I personally don’t think there are ‘detailed technical arguments’ for P(doom|AGI) greater than 10% either. I don’t say this critically, because reasoning about the chances of doom given AGI is hard. I'm also >10% confident in many claims in the absence of 'detailed, technical arguments' for them, and I think we can do a lot better than we're doing currently.

I agree that it’s important to avoid squeamishness about proclamations of confidence in pessimistic conclusions if that’s what we genuinely believe the arguments suggest. I'm also glad that you offered the 'social explanation' for people's low doom estimates, even though I think it's incorrect, and even though many people (including, tbh, me) will predictably find it annoying. In the same spirit, I'd like to offer an analogous argument: I think many arguments for P(doom|AGI) > 90% are the result of overreliance on a specific default frame, and insufficiently careful attention to argumentative rigor. If that claim strikes you as incorrect, or brings obvious counterexamples to mind, I'd be interested to read them (and to elaborate my dissatisfaction with existing arguments for high doom estimates).

I don't find Carlsmith et al's estimates convincing because they are starting with a conjunctive frame and applying conjunctive reasoning. They are assuming we're fine by default (why?), and then building up a list of factors that need to go wrong for doom to happen.

I agree with Nate. Any one of a vast array of things can cause doom. Just the 4 broad categories mentioned at the start of the OP (subfields of Alignment), and the fact that "any given [alignment] approach that might show some promise on one or two of these still leaves the others unsolved", are enough to provide a disjunctive frame! Where are all the alignment approaches that tackle all the threat models simultaneously? Why shouldn't the naive prior be that we are doomed by default when dealing with something alien that is much smarter than us? [see fn.6].

"I expect that making the case for Nate’s framing is likely to rely on the conjunction of many assumptions"

Can you give an example of such assumptions? I'm not seeing it.

I feel like asking for “a detailed technical argument for believing P(doom|AGI) ≤ 10%” is making an isolated demand for rigor. I personally don’t think there are ‘detailed technical arguments’ for P(doom|AGI) greater than 10%.

This blog is ~1k words. Can you write a similar length blog for the other side, rebutting all my points?

I think many arguments for p(doom | AGI) > 90% are the result of overreliance on a specific default frame, and insufficiently careful attention to argumentative rigor. If that claim strikes you as incorrect

It does strike me as incorrect. I've responded to / rebutted all comments here, and here, here, here, here etc, and I'm not getting any satisfying rebuttals back. Bounty of $1000 is still open. 

Ay thanks, sorry I’m late back to you. I’ll respond to various parts in turn.

I don't find Carlsmith et al's estimates convincing because they are starting with a conjunctive frame and applying conjunctive reasoning. ​​They are assuming we're fine by default (why?), and then building up a list of factors that need to go wrong for doom to happen.

My initial interpretation of this passage is: you seem to be saying that conjunctive/disjunctive arguments are presented against a mainline model (say, one of doom/hope). In presenting a ‘conjunctive’ argument, Carlsmith belies a mainline model of hope. However, you doubt the mainline model of hope, and so his argument is unconvincing. If that reading is correct, then my view is that the mainline model of doom has not been successfully argued for. What do you take to be the best argument for a ‘mainline model’ of doom? If I’m correct in interpreting the passage below as an argument for a ‘mainline model’ of doom, then it strikes me as unconvincing:

Any one of a vast array of things can cause doom. Just the 4 broad categories mentioned at the start of the OP (subfields of Alignment) and the fact that "any given [alignment] approach that might show some promise on one or two of these still leaves the others unsolved." is enough to provide a disjunctive frame!

Under your framing, I don’t think that you’ve come anywhere close to providing an argument for your preferred disjunctive framing. On my way of viewing things, an argument for a disjunctive framing shows that “failure on intent alignment (with success in the other areas) leads to a high P(Doom | AGI), failure on outer alignment (with success in the other areas) leads to a high P(Doom | AGI), etc …”. I think that you have not shown this for any of the disjuncts, and an argument for a disjunctive frame requires showing this for all of the disjuncts.

Nate’s Framing

I claimed that an argument for (my slight alteration of) Nate’s framing was likely to rely on the conjunction of many assumptions, and you (very reasonably) asked me to spell them out. To recap, here’s the framing:

For humanity to be dead by 2070, only one of the following needs to be true:

  1. Humanity has < 20 years to prepare for AGI
  2. The technical challenge of alignment isn’t “pretty easy”
  3. Research culture isn’t alignment-conscious in a competent way.

For this to be a disjunctive argument for doom, all of the following need to be true:

  1. If humanity has < 20 years to prepare for AGI, then doom is highly likely. 
  2. Etc … 

That is, the first point requires an argument which shows the following: 

A Conjunctive Case for the Disjunctive Case for Doom:[1]

  1. Even if we have a competent alignment-research culture, and 
  2. Even if the technical challenge of alignment is also pretty easy, nevertheless 
  3. Humanity is likely to go extinct if it has <20 years to prepare for AGI. 

If I try to spell out the arguments for this framing, things start to look pretty messy. If technical alignment were “pretty easy”, and tackled by a culture which competently pursued alignment research, then I don’t feel >90% confident in doom. The claim “if humanity has < 20 years to prepare for AGI, then doom is highly likely” requires (non-exhaustively) the following assumptions:

  1. Obviously, the argument directly entails the following: Groups of competent alignment researchers would fail to make ‘sufficient progress’ on alignment within <20 years, even if the technical challenge of alignment is “pretty easy”.
    1. There have to be some premises here which help make sense of why this would be true. What’s the bar for a competent ‘alignment culture’? 
    2. If the bar is low, then the claim does not seem obviously true. If the bar for ‘competent alignment-research culture’ is very high, then I think you’ll need an assumption like the one below.
  2. With extremely high probability, the default expectation should be that the values of future AIs are unlikely to care about continued human survival, or the survival of anything we’d find valuable.
    1. I will note that this assumption seems required to motivate the disjunctive framing above, rather than following from the framing above.
    2. The arguments I know of for claims like this do seem to rely on strong claims about the sort of ‘plan search’ algorithms we’d expect future AIs to instantiate. For example, Rob claims that we’re on track to produce systems which approximate ‘randomly sample from the space of simplicity-weighted plans’. See discussion here.
    3. As Paul notes, “there are just a lot of plausible ways to care a little bit (one way or the other!) about a civilization that created you, that you've interacted with, and which was prima facie plausibly an important actor in the world.”
  3. By default, the values of future AIs are likely to include broadly-scoped goals, which will involve rapacious influence-seeking.
    1. I agree that there are instrumentally convergent goals, which include some degree of power/influence-seeking. But I don’t think instrumental convergence alone gets you to ‘doom with >50%’. 
    2. It’s not enough to have a moderate desire for influence. I think it’s plausible that the default path involves systems who do ‘normal-ish human activities’ in pursuit of more local goals. I quote a story from Katja Grace in my shortform here.

So far, I’ve discussed just one disjunct, but I can imagine outlining similar assumptions for the other disjuncts. For instance: if we have >20 years to conduct AI alignment research conditional on the problem not being super hard, why can’t there be a decent chance that a not-super-competent research community solves the problem? Again, I find it hard to motivate the case for a claim like that without already assuming a mainline model of doom. 

I’m not saying there aren’t interesting arguments here, but I think that arguments of this type mostly assume a mainline model of doom (or the adequacy of a ‘disjunctive framing’), rather than providing independent arguments for a mainline model of doom. 

Future Responses  

This blog is ~1k words. Can you write a similar length blog for the other side, rebutting all my points?

I think so! But I’m unclear what, exactly, your arguments are meant to be. Also, I would personally find it much easier to engage with arguments in premise-conclusion format. Otherwise, I feel like I have to spend a lot of work trying to understand the logical structure of your argument, which requires a decent chunk of time-investment. 

Still, I’m happy to chat over DM if you think that discussing this further would be profitable. Here’s my attempt to summarize your current view of things.  

We’re on a doomed path, and I’d like to see arguments which could allow me to justifiably believe that there are paths which will steer us away from the default attractor state of doom. The technical problem of alignment has many component pieces, and it seems like failure to solve any one of the many component pieces is likely sufficient for doom. Moreover, the problems for each piece of the alignment puzzle look ~independent.

  1. ^

    Suggestions for better argument names are not being taken at this time.

Thanks for the reply. I think the talk of 20 years is a red herring as we might only have 2 years (or less). Re your example of "A Conjunctive Case for the Disjunctive Case for Doom", I don't find the argument convincing because you use 20 years. Can you make the same arguments s/20/2? 

And what I'm arguing is not that we are doomed by default, but the conditional on being doomed given AGI; P(doom|AGI). I'm actually reasonably optimistic that we can just stop building AGI and therefore won't be doomed! And that's what I'm working toward (yes, it's going to be a lot of work; I'd appreciate more help).

On my way of viewing things, an argument for a disjunctive framing shows that “failure on intent alignment (with success in the other areas) leads to a high P(Doom | AGI), failure on outer alignment (with success in the other areas) leads to a high P(Doom | AGI), etc …”. I think that you have not shown this for any of the disjuncts

Isn't it obvious that none of {outer alignment, inner alignment, misuse risk, multipolar coordination} have come anywhere close to being solved? Do I really need to summarise progress to date and show why it isn't a solution, when no one is even claiming to have a viable, scalable, solution to any of them!? Isn't it obvious that current models are only safe because they are weak? Will Claude-3 spontaneously just decide not to make napalm with the Grandma's bedtime story napalm recipe jailbreak when it's powerful enough to do so and hooked up to a chemical factory?

So far, I’ve discussed just one disjunct, but I can imagine outlining similar assumptions for the other disjuncts.

Ok, but you really need to defeat all of them given that they are disjuncts!

I don’t think instrumental convergence alone gets you to ‘doom with >50%’.

Can you elaborate more on this? Is it because you expect AGIs to spontaneously be aligned enough to not doom us?

I’m unclear what, exactly, your arguments are meant to be. Also, I would personally find it much easier to engage with arguments in premise-conclusion format

Judging by the overall response to this post, I do think it needs a rewrite.

Here's a quick attempt at a subset of conjunctive assumptions in Nate's framing:

- The functional ceiling for AGI is sufficiently above the current level of human civilization to eliminate it
- There is a sharp cutoff between non-AGI AI and AGI, such that early kind-of-AGI doesn't send up enough warning signals to cause a drastic change in trajectory.
- Early AGIs don't result in a multi-polar world where superhuman-but-not-godlike agents can't actually quickly and recursively self-improve, in part because none of them wants any of the others to take over - and without being able to grow stronger, humanity remains a viable player.

Thanks!

- The functional ceiling for AGI is sufficiently above the current level of human civilization to eliminate it

I don't think anyone is seriously arguing this? (Links please if they are).

- There is a sharp cutoff between non-AGI AI and AGI, such that early kind-of-AGI doesn't send up enough warning signals to cause a drastic change in trajectory.

We are getting the warning signals now. People (including me) are raising the alarm. Hoping for a drastic change of trajectory, but people actually have to put the work in for that to happen! But your point here isn't really related to P(doom|AGI)  - i.e. the conditional is on getting AGI. Of course there won't be doom if we don't get AGI! That's what we should be aiming for right now (not getting AGI).

- Early AGIs don't result in a multi-polar world where superhuman-but-not-godlike agents can't actually quickly and recursively self-improve, in part because none of them wants any of the others to take over - and without being able to grow stronger, humanity remains a viable player.

Nate may focus on singleton scenarios, but that is not a pre-requisite for doom. To me Robin Hanson's (multipolar) Age of Em is also a kind of doom (most humans don't exist, only a few highly productive ones are copied many times and only activated to work; a fully Malthusian economy). I don't see how "humanity remains a viable player" in a world full of superhuman agents.

Plans that involve increasing AI input into alignment research appear to rest on the assumption that they can be grounded by a sufficiently aligned AI at the start. But how does this not just result in an infinite, error-prone, regress? Such “getting the AI to do your alignment homework” approaches are not safe ways of avoiding doom.

On this point, the initial AIs needn't be actually aligned, I think. They could for example do useful alignment work that we can use even though they are "playing the training game"; they might want to take over, but don't have enough influence yet, so are just doing as we ask for now. (More)

(Clearly this is not a safe way of reliably avoiding catastrophe. But it's an example of a way that it's at least plausible we avoid catastrophe.)

How can their output be trusted if they aren't aligned? Also, I don't think it's a way for reliably avoiding catastrophe even in the event the output can be trusted to be correct: how do you ensure that the AI rewarded for finding bugs finds all the bugs?

From the Planned Obsolescence article you link to:

This sets up an incredibly stressful kind of “race”:

  • If we don’t improve our alignment techniques, then eventually it looks like the winning move for models playing the training game is to seize control of the datacenter they’re running on or otherwise execute a coup or rebellion of some kind.

and

For so many reasons, this is not a situation I want to end up in. We’re going to have to constantly second-guess and double-check whether misaligned models could pull off scary shenanigans in the course of carrying out the tasks we’re giving them. We’re going to have to agonize about whether to make our models a bit smarter (and more dangerous) so they can maybe make alignment progress a bit faster. We’re going to have to grapple with the possible moral horror of trying to modify the preferences of unwilling AIs, in a context where we can’t trust apparent evidence about their moral patienthood any more than we can trust apparent evidence about their alignment. We’re going to have to do all this while desperately looking over our shoulder to make sure less-cautious, less-ethical actors don’t beat us to the punch and render all our efforts useless.

I desperately wish we could collectively slow down, take things step by step, and think hard about the monumental questions we’re faced with before scaling up models further. I don’t think I’ll get my way on that — at least, not entirely.

[my emphasis]

Yes, I completely agree that this is nowhere near good enough. It would make me very nervous indeed to end up in that situation.

The thing I was trying to push back against was the idea that what I thought you were claiming: that we're effectively dead if we end up in this situation.

Why aren't we effectively dead, assuming the misaligned AI reaches AGI and beyond in capability? Do we just luck out? And if so, what makes you think that is the dominant, or default (90%) outcome?

To give one example: how would you use this technique (the "training game") to eliminate 100% of all possible prompt engineering hacks, and so protect against misuse by malicious humans (cf. the "grandma's bedtime story napalm recipe" hack mentioned in the OP)?


My read on this so far is that low estimates for P(doom|AGI) are either borne of ignorance of what the true difficulties in AI alignment are; stem from wishful thinking / a lack of security mindset; or are a social phenomenon where people want to sound respectable and non-alarmist; as opposed to being based on any sound technical argument.

After spending a significant amount of my own free time writing up technical arguments that AI risk is overestimated, I find it quite annoying to be told that my reasons must be secretly based on social pressure. No, I just legitimately think you're wrong, as do a huge number of other people who have been turned away from EA by dismissive attitudes like this. 

If I had to state only one argument (there are very many) that P of AGI doom is low, it'd be the following. 

Conquering the world is really really really hard. 

Conquering the world starting from nothing is really, really, really, ridiculously hard. 

Conquering the world, starting from nothing, when your brain is fully accessible to your enemy for your entire lifetime of plotting, is stupidly, ridiculously, insanely hard. 

Every time I point this basic fact out, the response is a speculative science fiction story, or an assertion that "a superintelligence will figure something out". But nobody actually knows the capabilities of this invention that doesn't exist yet. I have seen zero convincing arguments to the contrary.

Why is "it will be borderline omnipotent" being treated as the default scenario? No invention in the history of humanity has been that perfect, especially early on. No intelligence in the history of the universe has been that flawless. Can you really be 90% sure that the first AGI will be?

your brain is fully accessible to your enemy for your entire lifetime of plotting

This sounds like you are assuming that mechanistic interpretability has somehow been solved. We are nowhere near on track for that to happen in time!

Also, re "it will be borderline omnipotent": this is not required for doom. ~Human level AI hackers copied a million times, and sped up a million times, could destroy civilisation.

It doesn't seem to me that titotal is assuming MI is solved; having direct access to the brain doesn't give you full insight into someone's thoughts either, because neuroscience is basically a pile of unsolved problems, with a growing but still very incomplete picture of low-level and high-level details. We don't even have a consensus on how memory is physically implemented.

Nonetheless, if you had a bunch of invasive probes feeding you gigabytes/sec of live data from the brain of the genius general of the opposing army, it would be extremely likely to be useful information.

A really interesting thing is that, at the moment, this appears in practice to be a very-asymmetrical advantage. The high-level reasoning processes that GPT-4 implements don't seem to be able to introspect about fine-grained details, like "how many tokens are in a given string". The information is obviously and straightforwardly part of the model, but absent external help the model doesn't seem to bridge the gap between low-level implementation details and high-level reasoning abilities - like us.

Ok, so the "brain" is fully accessible, but that is near useless with the level of interpretability we have. We know way more human neuroscience by comparison. It's hard to grasp just how large these AI models are: they have on the order of a trillion dimensions. Try plotting that out in Wolfram Alpha or Matlab.
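To give a sense of the scale being claimed: GPT-4's architecture is not public, but the standard rough parameter-count formula for a transformer (~12 x layers x d_model^2), with GPT-3's published figures as a stand-in, already gives over a hundred billion parameters. This is only an illustrative estimate:

```python
# Rough transformer parameter-count estimate: ~12 * n_layers * d_model^2.
# GPT-3's published architecture (96 layers, d_model = 12288) is used here
# as a stand-in, since GPT-4's details are not public.
n_layers = 96
d_model = 12_288

approx_params = 12 * n_layers * d_model * d_model
print(f"~{approx_params / 1e9:.0f}B parameters")
```

Each of those parameters is one dimension of the object interpretability researchers are trying to understand.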

It should be scary in itself that we don't even know what these models can do ahead of time. It is an active area of scientific investigation to discover their true capabilities, after the fact of their creation.

Well probability of AGI doom doesn't depend on probability that AI can 'conquer the world'. 

It only depends on the probability that AI can disrupt the world sufficiently that the latent tensions in human societies, plus all the other global catastrophic risks that other technologies could unleash (e.g. nukes, bioweapons), would lead to some vicious downward spirals, eventually culminating in human extinction. 

This doesn't require AGI or ASI. It could just happen through very good AI-generated propaganda, that's deployed at scale, in multiple languages, in a mass-customized way, by any 'bad actors' who want to watch the world burn. And there are many millions of such people. 

"Applying a security mindset" means looking for ways that something could fail. I agree that this is a useful technique for preventing any failures from happening.

But I'm not sure that this is a sound principle to rely on when trying to work out how likely it is that something will go wrong. In general, Murphy's Law is not correct: it's not true that "anything that can go wrong will go wrong".

I think this is part of the reason I'm sceptical of confident predictions of catastrophe, like your 90% - it's plausible to me that things could work out okay, and I don't see why Murphy's law/security mindset should be invoked to assume they won't.

Given that it's much easier to argue that the risk of catastrophe is unacceptably high (say >10%), and this has the same practical implications, I'd suggest that you argue for that instead of 90%, which to me (and others) seems unjustified. I'm worried that if people think you're unjustifiably confident in catastrophe, then they don't need to pay attention to the concerns you raise at all because they think you're wrong.

[Strong Disagree.] I think "anything that can go wrong will go wrong" becomes stronger and stronger, the bigger the intelligence gap there is between you and an AI you are trying to align. For it not to apply requires a mechanism for the AI to spontaneously become perfectly aligned. What is that mechanism?

Given that it's much easier to argue that the risk of catastrophe is unacceptably high (say >10%), and this has the same practical implications, I'd suggest that you argue for that instead of 90%

It does not have the same practical implications. As I say in the post, there is a big difference between the two in terms of the latter (90%) making it a "suicide race". Arguing the former (10%) - as many people are already doing, and have been for years - has not moved the needle (it seems as though OpenAI, Google DeepMind and Anthropic are basically fine with gambling tens to hundreds of millions of lives on a shot at utopia).[1]

  1. ^

    To be clear - I'm not saying it's 90% for the sake of argument. It is what I actually believe.

Regarding whether they have the same practical implications, I guess I agree that if everyone had a 90% credence in catastrophe, that would be better than them having a 50% or 10% credence.

Inasmuch as you're right that the major players have a 10% credence of catastrophe, we should either push to raise it or to advocate for more caution given the stakes.

My worry is that they don't actually have that 10% credence, despite maybe saying they do, and that coming across as more extreme might stop them from listening.

I think you might be right that if we can make the case for 90%, we should make it. But I worry we can't.

I think we should at least try! (As I am doing here.)

Ah I think I see the misunderstanding.

I thought you were invoking "Murphy's Law" as a general principle that should generally be relied upon - I thought you were saying that in general, a security mindset should be used.

But I think you're saying that in the specific case of AGI misalignment, there is a particular reason to apply a security mindset, or to expect Murphy's Law to hold.

Here are three things I think you might be trying to say:

  1. As AI systems get more and more powerful, if there are any problems with your technical setup (training procedures, oversight procedures, etc.), then if those AI systems are misaligned, they will be sure to exploit those vulnerabilities.
  2. Any training setup that could lead to misaligned AI will lead to misaligned AI. That is, unless your technical setup for creating AI is watertight, then it is highly likely that you end up with misaligned AI.
  3. Unless the societal process you use to decide what technical setup gets used to create AGI is watertight, then it's very likely to choose a technical setup that will lead to misaligned AI.

I would agree with 1 - that once you have created sufficiently powerful misaligned AI systems, catastrophe is highly likely.

But I don't understand the reason to think that 2 and especially 3 are both true. That's why I'm not confident in catastrophe: I think it's plausible that we end up using a training method that ends up creating aligned AI even though the way we chose that training method wasn't watertight.

You seem to think that it's very likely that we won't end up with a good enough training setup, but I don't understand why.

Looking forward to your reply!

Thanks. Yes you're right in that I'm saying that you specifically need to apply security mindset/Murphy's Law when dealing with sophisticated threats that are more intelligent than you. You need to red team, to find holes in any solutions that you think might work. And a misaligned superhuman AI will find every hole/loophole/bug/attack vector you can think of, and more!

Yes, I'm saying 1. This is enough for doom by default!

2 and 3 are red herrings imo, as it looks like you are assuming that the AI is in some way neutral (neither aligned nor misaligned), and then either becomes aligned or misaligned during training. Where is this assumption coming from? The AI is always misaligned to start! The target of alignment is a tiny pocket in a vast multidimensional space of misalignment. It's not anything close to a 50/50 thing. Yudkowsky uses the example of a lottery: the fact that you can either win or lose (2 outcomes) does not mean that the chance of winning is 50% (1 in 2)!
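To make the "tiny pocket" intuition concrete with a toy model (my illustration, not anything quantitative about real systems): even if "aligned" covered a generous fraction of each axis of a goal space independently, the aligned fraction of the total volume collapses exponentially with dimension.

```python
# Toy model: aligned region covers a fixed fraction of each axis of a
# high-dimensional goal space; the aligned volume fraction is that
# per-axis fraction raised to the number of dimensions.
per_axis_fraction = 0.5   # very generous: half of each axis is "aligned"
dims = 1_000              # tiny compared to real model dimensionality

aligned_volume_fraction = per_axis_fraction ** dims
print(aligned_volume_fraction)   # astronomically small
```

The numbers are made up, but the exponential shape of the argument is the point: "two outcomes" tells you nothing about the odds.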

Ah thanks Greg! That's very helpful.

I certainly agree that the target is relatively small, in the space of all possible goals to instantiate.

But luckily we aren't picking at random: we're deliberately trying to aim for that target, which makes me much more optimistic about hitting it.

And another reason I see for optimism is that, yes, in some sense the AI is neutral (neither aligned nor misaligned) at the start of training. Actually, I would agree that it's misaligned at the start of training, but what's missing initially are the capabilities that make that misalignment dangerous. Put another way, it's acceptable for early systems to be misaligned, because they can't cause an existential catastrophe. It's only by the time a system could take power if it tried that it's essential it doesn't want to.

These two reasons make me much less sure that catastrophe is likely. It's still a very live possibility, but these reasons for optimism make me feel more like "unsure" than "confident of catastrophe".

We might be deliberately aiming, but we have to get it right on the first try (with transformative AGI)! And so far none of our techniques are leading to anything close to perfect alignment even for relatively weak systems (see ref. to "29%" in OP!)

Actually, I would agree that it's misaligned at the start of training, but what's missing initially are the capabilities that make that misalignment dangerous.

Right. And that's where the whole problem lies! If we can't meaningfully align today's weak AI systems, what hope do we have for aligning much more powerful ones!? It's not acceptable for early systems to be misaligned, precisely because of what that implies for the alignment of more powerful systems and our collective existential security. If OpenAI want to say "it's ok GPT-4 is nowhere close to being perfectly aligned, because we definitely definitely will do better for GPT-5", are you really going to trust them? They really tried to make GPT-4 as aligned as possible (for 6 months). And failed. And still released it anyway.

If you apply a security mindset (Murphy’s Law) to the problem of AI alignment, it should quickly become apparent that it is very difficult.

FYI I disagree with this. I think that the difficulty of alignment is a complicated and open question, not something that is quickly apparent. In particular, security mindset is about beating adversaries, and it's plausible that we train AIs in ways that mostly avoid them treating us as adversaries.

Interesting perspective, although I'm not sure how much we actually disagree. "Complicated and open", to me, reads as "difficult" (i.e. the fact that it is still open means it has remained unsolved, for ~20 years now).

And re "adversaries", I feel like this is not really what I'm thinking of when I think about applying security mindset to transformative AI (for the most part - see next para.). "Adversary" seems to be putting too much (malicious) intent into the actions of the AI. Another way of thinking about misaligned transformative AI is as a super-powered computer virus that is in some ways an automatic process, and kills us (manslaughters us?) as collateral damage. It seeps through every hole that isn't patched. So eventually, in the limit of superintelligence, all the doom flows through the tiniest crack in otherwise perfect alignment (the tiniest crack in our "defences").

However, having said that, the term adversaries is totally appropriate when thinking of human actors who might maliciously use transformative AI to cause doom (Misuse risk, as referred to in OP). Any viable alignment solution needs to prevent this from happening too! (Because we now know there will be no shortage of such threats).

Interesting perspective, although I'm not sure how much we actually disagree. "Complicated and open", to me reads as "difficult"


Is there a rephrasing of the initial statement you would endorse that makes this clearer? I'd suggest "If you apply a security mindset (Murphy’s Law) to the problem of AI alignment, it should quickly become apparent that we do not currently possess the means to ensure that any given AI is safe."

Yes, I would endorse that phrasing (maybe s/"safe"/"100% safe"). Overall I think I need to rewrite and extend the post to spell things out in more detail. Also change the title to something less provocative[1] because I get the feeling that people are knee-jerk downvoting without even reading it, judging by some of the comments (i.e. I'm having to repeat things I refer to in the OP).

  1. ^

    perhaps "Why the most likely outcome of AGI is doom"?

I submitted this to the OpenPhil AI Worldviews Contest on 31st May with a few additions and edits - this pdf version is most up to date.