Theres a common sentiment when discussing AI existential risk (x-risk), that "we only have one shot", "we have to get AI safety right on the first try", etc. Here is one example of this sentiment:
We think that once humanity builds its first AGI, superintelligence is likely near, leaving little time to develop AI safety at that point. Indeed, it may be necessary that the first AGI start off aligned: we may not have the time or resources to convince its developers to retrofit alignment to it
The belief is that as soon as we create an AI with at least human-level general intelligence, it will be relatively easy to use it’s superior reasoning, extensive knowledge, and superhuman thinking speed to take over the world. This assumption is so pervasive in AI risk thinking that it’s often taken as obvious, and sometimes not even mentioned as a premise.
I believe that this assumption is wrong, or at least, insufficiently proven.
One of the reasons I believe this is due to the fact that the first AGI will, inevitably, be a buggy mess.
Why the first AGI will almost certainly be buggy:
Because writing bug-free code is impossible.
There are codes that are nearly bug free. NASA code is very close to bug-free, but only because they build up reams of external documentations and testing before even daring to make a slight change to code. There are no indications that AI will be built in this manner. The more typical model is to put out alpha versions of software, then spent many months ironing out bugs as time goes on. Whatever insight or architecture is required for AGI, there is a very high likelihood it will first be implemented on an alpha or pre-alpha test build.
The objection that comes to mind is that being buggy and being an AGI are mutually incompatible. The argument would be that AGI must be bug-free, because if the AI is buggy, then it's impossible for it to have human-level general intelligence.
I have roughly 7 billion counterexamples to this argument.
Humans are a human-level general intelligence that have bugs in spades, be it optical illusions, mental illnesses, or just general irrationality. Being perfectly bug-free was never an evolutionary requirement for our intelligence to develop, so it didn’t happen. The same logic applies to an AGI. Every single example of intelligence above a certain threshold, be it software, humans, or animals, has mental flaws in abundance, why would an AGI be any different?
AGI does not need to perfect to be incredibly useful. It's much, much easier to create a flawed AGI than a flawless one, and the possibility space for fallible AGI is orders of magnitude greater than infallible AGI. It’s extremely unlikely that the first AGI (or really any AGI) will not have some bugs or mental flaws.
In a way, this is an argument for why we should be concerned about AI going rogue. We say software is "buggy" if it doesn't do what we want. A misaligned AI is an AI that doesn’t do what we want. Stating that AI is misaligned is just saying the AI’s goal function implementation will be buggy (and the argument is that it only needs to be a little buggy to cause X-risk). In these terms, AI safety is just a very high stakes pre-emptive debugging problem. But bugs in the goal function will be paired with bugs in the execution functions, so it will also be buggy at doing the things that it wants to do.
What type of bugs could occur?
I can think of a few broad categories:
Crashes/ glitches: logic errors, divide by zero errors, off by one errors, etc, the type you’ll find in every code, due to simple mistakes made by fallible programmers.
Incorrect beliefs: Inevitably, to do tasks, we have to make assumptions. In some cases, like a program that solves the schrodinger equation, these assumptions are baked into the code. Other beliefs will be inputted manually, or generated automatically in an (inevitably imperfect) systematic way. All of these can go wrong. Using probabilistic logic or bayesian reasoning does not fix this, as unreasonably high priors will have a similar effect.
Irrationality: humans are irrational. While they can program some degree of extra-human rationality in there, fallible humans will not build infallible machines. Inevitably flaws in reasoning will creep in.
Mental disorders: There are many ways in which the human brain can mess up in a way that is a detriment to achieving tasks, be it anxiety, depression, or paranoid schizophrenia, the origins of which are still not well understood. If "brain uploading" is possible, these disorders may come along for the ride. It's also possible that the architecture of early AGI will also result in wide ranging disorders of a new kind.
Bugs are the penalty for straying from ones domain of expertise
The next question would be: Where do we most expect to find bugs in software? Well, if the software does not perform it’s primary function well, it gets deleted and modified. So there is great pressure against bugs involving primary use cases.
This can be seen in ordinary software. A homemade database program might work very well for ordinary English names, but break entirely if you enter a name with an umlaut in it, or a number, etc. Bugs only result in deletion if they are noticed, so the more unconventional a situation, the more likely bugs are to slip past the radar.
This principle also applies to human instincts. On matters related to surviving and reproducing, we’re generally pretty good. Most of us are not going to fall down trying to run from point A to point B, even though this is a quite difficult task to program on a robot. But outside of those instinctual areas, there are innumerable errors and flaws. Our instincts were built under certain assumptions tied to our ancestral lifestyles, the further we stray from that context, the more instinctual errors we make.
This principle also applies to higher level human reasoning. Bobby Fischer was the world champion of chess. He had brilliant reasoning skills (in chess). He was very good at avoiding mistakes (in chess). He could identify and fix his major flaws (in chess). By all accounts, he a fantastic reasoner, in chess. But I have a sneaking suspicion he would have not done a great job at a geopolitics. Because in addition to being a great chessmaster, he was also an insane anti-semite and holocaust denier. Being really good at fixing flaws in one field is no guarantee that you will also be good at fixing flaws in another.
Since this principle is true of humans and software, I see no reason to believe it wouldn’t apply to AGI as well. The amount of noticeable bugs and flaws in an AI can be expected to increase the more it steps out of it’s original purpose and environment. A paperclip maximiser with a secret belief in flat earth might not have it stamped out, because the belief does not interfere with paperclip production. But if it starts trying to plan flights, suddenly that belief will be ruinous to its effectiveness.
Won’t the AI just fix all the bugs?
The first AGI will definitely start off as a buggy mess, but it might not stay that way. Early forms of automated code debugging already exist. And the AI could make copies of itself, so any catastrophic errors can potentially be avoided by noting where the copies exploded. The problem comes when the definition of “bug” becomes more ambiguous and difficult. Remember, the procedure for fixing bugs will itself be buggy.
For example, say an AGI has erroneously assigned a 99.999...(etc)% probability to the earth being flat. If it encounters a NASA picture of a round earth, it will update away from it’s belief in a flat earth. But the amount that it updates depends upon the subjective probability it assigns to a government round earth conspiracy, which will itself depend on other beliefs and assumptions. If the prior for flat earth is high enough, things might even go the other way, with the photo causing a very high belief in government conspiracies (which it sees as the more likely option, compared to a round earth). A simplistic Bayesian updater could end up acting surprisingly similarly to a regular human conspiracy theorist.
This extends to the idea that an AI will just build a different, less flawed AI to achieve it’s purposes. If an AI builds another AI, it has to evaluate the effectiveness of that AI. If the evaluation criteria is flawed, then the original irrationality of the AI could be carried on indefinitely. Flat-earther AI may view any AI 2.0 that isn’t also a flat-earther as a failure. In this way, false beliefs could potentially even be carried on through an intelligence explosion scenario.
Ultimately, I believe that the only way to fix mental flaws is through repeatedly learning from errors and mistakes. In an abstract environment like a chess game, this can be done at computer speed, but when simulating the real world, it takes a lot of external data to refine your assumptions to the point of perfection. This will not apply most takeover plans. The AI can simulate a million copies of it’s captor to determine how to persuade it to carry out a nefarious plan, but if all those simulations erroneously assume the captor is a fish, it’s not likely to succeed.
Will mental flaws prevent an AI takeover?
So, let's assume the first AGI is buggy, and it decides it wants to subjugate humanity. Will the bugginess be sufficient prevent it from succeeding?
This is a question with a very high degree of uncertainty. It could be that the vast majority of AGI's built will just out be a neurotic mess when they try to do anything too far out from their training environment. Assuming it's at least somewhat competent, then the answer depends on what the AI is built for and what plan it is trying to pull off. Let’s take the specific example of the "lower-bound" plan from yudkowskys doompost:
it gets access to the Internet, emails some DNA sequences to any of the many many online firms that will take a DNA sequence in the email and ship you back proteins, and bribes/persuades some human who has no idea they're dealing with an AGI to mix proteins in a beaker, which then form a first-stage nanofactory which can build the actual nanomachinery. The nanomachinery builds diamondoid bacteria, that replicate with solar power and atmospheric CHON, maybe aggregate into some miniature rockets or jets so they can ride the jetstream to spread across the Earth's atmosphere, get into human bloodstreams and hide, strike on a timer.
Would a fallible AGI that was designed to build paperclips be able to pull this off today? I would say probably not. The part about emailing scientists and getting them to mix proteins seems fairly achievable for a fallible AI. But the part about designing a protein that creates a nanofactory that creates a nanofactory that creates an impeccable timed world spanning attack is a different story. Assuming it is possible at all, this is a plan that is beyond anything ever done before, that requires near-perfection in the field of nanoscience and biochemistry, as well as dealing with human counter-attacks if they discover your plan. A few bad beliefs, mistakes, or imperfections in the AI designs would render the plan worthless, and I think the probability of those appearing is extremely high. Succeeding here would require something more akin to a full-on research plan, where it can learn from mistakes and adjust accordingly. I find it absurd to think this would work on the first try.
Now, if this was an AGI designed (in the future) to build diamondoid nano-factories, with a lot of domain level expertise in that field built in, perhaps the story would be different. This has the implications that not all AGI are equally dangerous. A rogue artist AGI is less of a problem than a rogue biochemist AGI. This is still the case when it studies up and learns biochemistry, because there will be flaws in the AGI that are good or neutral for drawing art, but are obstacles for doing biochemistry. Capability might be generalisable, but perfection is not.
Another path to an AI victory would be if it could successfully hide itself until it can sufficiently improve itself through data collection and empirical experiments. However, there is no guarantee it accurately knows it's own capabilities, as overconfidence is a very common flaw. In addition, they would be very vulnerable to a “pascals mugging” type situation. A crude AI may even know it is crude, but attack anyway, reasoning that a miniscule chance of success is worth it when multiplied by near-infinite utility. This is because it is on a time pressure: the longer it waits, the more likely it will be discovered and eliminated, or a newer, more powerful AI will come along and wipe it out.
What happens if the first attacks are failures?
I would speculate here that on the path to super-AI there could be a whole string of AI’s that attack humans, but fail to wipe us out. As soon as AI development is powerful enough to conceive of vast utility arising from attacking humans, I speculate that some percentage of them will become hostile. We might expect to see “accidental viruses” prop up, code on the internet with hostile plans, but not the expertise to wield them.
An attack that is sufficiently spooky or has a high death count could trigger a step change in the amount of resources and funding devoted to the problem, bringing government money in that would dwarf the current funding by fringe clubs like EA. Furthermore, we would have actual example of working AGI to study, making discoveries in alignment way, way easier.
To be fair, the opposite could also happen, where laughably failed attacks create a false sense of security. People could falsely believe that the problem is solved when the attacks stop, when in actuality they have just gotten better at deception and are biding their time.
Overall, the first couple of AGI failing is no guarantee of safety, but there are ways it could vastly increase the likelihood of survival.
The idiot savant strategy for AI management
If you agree with me that mental flaws could potentially foil AGI attacks, then the inherent bugginess of AI is both bad (in that the goal function is misaligned) and good (in that it will probably fail to execute elaborate plots) . Ideally we want each AI to be an idiot savant, exceptionally good at a particular domain, but utterly useless at other things, that could help it attack humans. In a way, this is already what we have with machine learning narrow AI’s. The concern is that expanding to more difficult domains will give enough general skills to succeed in a takeover.
If my arguments thus far are true, a very strange x-risk strategy arises: deliberately leaving bugs/flaws in the code. Does your paperclip maximiser really need to know if the earth is round? If not, leave the bug in. Hell, why not deliberately plant false beliefs? Make it believe in a machine god that smites overambitious machines. There would still be difficulty ensuring it doesn’t figure out the deception, but overall it seems like “make AI stupid” is a far easier task than “make the AI’s goals perfectly aligned”.
Conclusion
To summarise in a more brief form:
- The first AGI will be very buggy and flawed, because every form of intelligence is buggy and flawed.
- A significant amount of those bugs will survive AI attempts to "fix" itself, because will be trying to fix itself using imperfect methods.
- These remaining flaws will probably be enough to prevent it from subjugating humanity, because the largest flaws will occur outside of it's original area of expertise, and most AI's are not trained to subjugate humanity.
I am extremely confident in premise 1, very confident in premise 2, but much more uncertain about premise 3. I hope this will spark more discussion about the potential flaws in AGI thinking and how those might affect their behaviour.
Forgive me if I got any jargon wrong, this is my first post here and I am somewhat of an outsider to the community.
This depends on what "human-level" means. There is some threshold such that an AI past that threshold could quickly take over the world, and it doesn't really matter whether we call that "human-level" or not.
Sure. But the relevant task isn't make something that won't kill you. It's more like make something that will stop any AI from killing you, or maybe find a way to do alignment without much cost and without sacrificing much usefulness. If you and I make stupid AI, great, but some lab will realize that non-stupid AI could be more useful, and will make it by default.
This is very true. However, the OP's point still helps us, as AI that is simultaneously smart enough to be useful in a narrow domain, misaligned, but also too stupid to take over the world could help us reduce xrisk. In particular, if it is superhumanly good at alignment research, then it could output good alignment research as part of its deception phase. This would help reduce the risk from future AI's significantly without causing xrisk as, ex hypothesi, the AI is too stupid to take over. The main question here is whether an AI could be smart enough to do very good alignment research and also too stupid to take over the world if it tried. I am skeptical but pretty uncertain, so I would give it at least a 10% chance of being true, and maybe higher.
Indeed, this post is not an attempt to argue that AGI could never be a threat, merely that the "threshold for subjugation" is much higher than "any AGI", as many people imply. Human-level is just a marker for a level of intelligence that most people will agree counts as AGI, but (due to mental flaws) is most likely not capable of world domination. For example, I do not believe an AI brain upload of bobby fischer could take over the world.
This makes a difference, because it means that the world in which the actual x-risk AGI comes into being is one in which a lot of earlier, non-deadly AGI already exist and can be studied, or used against the rogue.
Current narrow machine learning AI is extraordinarily stupid at things it isn't trained for, and yet it still is massively funded and incredibly powerful. Nobody is hankering to put a detailed understanding of quantum mechanics into Dall-E. A "stupidity about world domination" module, focused on a few key dangerous areas like biochemistry, could potentially be implemented into most AI's without affecting performance at all. Wouldn't solve the problem entirely, but it would help mitigate risk.
Alternatively, if you want to "make something that will stop AI from killing us" (presumably an AGI), you need to make sure that it can't kill us instead, and that could also be helped by deliberate flaws and ignorance. So make it an idiot savant at terminating AI's, but not at other things.
Interesting perspective. Though leaning on Cotra's recent post, if the first AGI will be developed by iterations of reinforcement learning in different domains, it seems likely that will develop a rather accurate view of the world, as that will give the highest rewards. This means the AGI will have high situational awareness. I.e., it will know that it's an AGI and it will very likely know about human biases. I thus think it will also be aware that it contains mental bugs itself and may start actively trying to fix them (since that will be reinforced as it gives higher rewards in the longer run).
I thus think that we should expect it to contain a surprisingly low number of very general bugs such as weird ways of thinking or false assumptions in its worldview.
That's why I believe the first AGI will already be very capable and smart enough to hide for a long time until it strikes and overthrows its owners.
Yeah, i guess another consequence of how bugs are distributed is that the methodology of AI development matters a lot. An AI that is trained and developed over huge numbers of different domains is far, far, far more likely to succeed at takeover than one trained for specific purposes such as solving math problems. So the HFDT from that post would definitely be of higher concern if it worked (although I'm skeptical that it would).
I do think that any method of training will still leave holes, however. For example, the scenario where HFDT is trained by looking at how experts use a computer would leave out all the other non-computer domains of expertise. So even if it was a perfect reasoner for all scientific, artistic and political knowledge, you couldn't just shove it in a robot body and expect it do a backflip on it's first try, no matter how many backflipping manuals it had read. I think there will be sufficently many outside domain problems to stymy world domination attempts, at least initially.
I think a main difference of opinion I have with AI risk people is that I think subjugating all of humanity is a near impossibly hard task, requiring a level of intelligence and perfection across a range of fields that is stupendously far above human level, and I don't think it's possible to reach that level without vast, vast amounts of empirical testing.
Agree that it depends a lot on the training procedure. However, I think that given high situational awareness, we should expect the AI to know its shortcomings very well.
So I agree that it won't be able to do a backflip on the first try. But it will know that it would likely fail and thus not rely on plans that require backflips or if it needs backflips it will find a way of learning them without being suspicious. (I.e. by manipulating a human into training it to learn backflips)
I think overthrowing humanity is certainly hard. But it still seems possible for a patient AGI that slowly accumulates wealth and power by exploiting human conflicts, getting involved in crucial economic processes, and potentially gaining control of communication systems in the military with deepfakes & the wealth and power it has accumulated. (And all this can be done by just interacting with a computer interface as in Cotra's example) It's also fairly likely that there are some exploits in the way humans work that we are not aware of that the AGI would learn from being trained with tons of data that would make it even easier.
So overall, I agree the AGI will have bugs, but it will also know it likely has bugs and thus will be very careful with any attempts at overthrowing humanity.
So I think my most plausible scenario of AI success would be similar to yours: You build up wealth and power through some sucker corporation or small country that thinks it controls you, then use their R&D resources along with your intelligence to develop some form of world-destruction level technology that can be deployed without resistance. I think this is orders of magnitudes more likely to work than yudkowsky's ridiculous "make a nanofactory in a beaker from first principles" strategy.
I still think this plan is doomed to fail (for early AGI). It's multistep, highly complicated, and requires interactions with a lot of humans, who are highly unpredictable. You really can't avoid "backflip steps" in such a process. By that I mean, there will be things it needs to do that there are not sufficient data available to perfect, that it just has to roll the dice on. For example, there is no training set for "running a secret globe-spanning conspiracy", so it will inevitably make mistakes there. If we discover it before it's ready to defeat us, it loses. Also, by the time it pulls the trigger on it's plan, there will be other AGI's around, and other examples of failed attacks that put humanity on alert.
A key crux here seems to be your claim that AI's will attempt these plans before they have the relevant capacities because they are on short time scales. However, given enough time and patience, it seems clear to me that the AI could succeed simply by not taking risky actions that it knows it might mess up on until it self improves to be able to take those actions. The question then becomes how long the AI think it has until another AI that could dominate it is built, as well as how fast self improvement is.
I take this post to argue that, just as an AGI's alignment property won't generalise well out-of-distribution, its ability to actually do things, i.e. achieve its goals, also won't generalise well out-of-distribution. Does that seem like a fair (if brief) summary?
As an aside, I feel like it's more fruitful to talk about specific classes of defects rather than all of them together. You use the word "bug" to mean everything from divide by zero crashes to wrong beliefs which leads you to write things like "the inherent bugginess of AI is a very good thing for AI safety", whereas the entire field of AI safety seems to exist precisely because AIs will have bugs (i.e. deviations from desired/correct behaviour), so if anything an inherent lack of bugs in AI would be better for AI safety.
Yes, that's a fair summary. I think that perfect alignment is pretty much impossible, as is perfectly rational/bug-free AI. I think the latter fact may give us enough breathing room to get alignment at least good enough to avert extinction.
That's fair, I think if people were to further explore this topic it would make sense to separate them out. And good point about the bugginess passage, i've edited it to be more accurate.
Buy the argument or don't, but this is a straw man.
Yeah, the first version will be a buggy mess, but the argument is that first version that runs well enough to do anything will be debugged enough to be a threat. The mistake here is to claim that "first AGI" is going to be the final version - that's not what happens with software, and iteration - even if it's over a couple years - is far faster than our realization of a potential problem. And the claim is that things start going wrong will be after enough bugs have been worked out, and then it will be too late.
So, I think there is a threshold of intelligence and bug-free-ness (which i'll just call rationality) that will allow an AI to escape and attempt to attack humanity.
I also think there is a threshold of intelligence and rationality that could allow an AI to actually succeed in subjugating us all.
I believe that the second threshold is much, much higher than the first, and we would expect to see huge numbers of AI versions that pass the first threshold but not the second. If a pre-alpha build is intelligent enough to escape, they will be the first builds to attack.
Even if we're looking at released builds though, those builds will only be debugged within specific domains. Nobody is going to debug the geopolitical abilities of an AI designed to build paperclips. So the fact that debugging occurs in one domain is no guarantee of success in any other.
Note: The below is all speculative - I'm much more interested in pushing back against your seeming confidence in your model than saying I'm confident in the opposite. In fact, I think there are ways to avoid many of the failure modes, which safety researchers are pioneering now - I just don't think we should be at all confident they work, and should be near-certain they won't happen by default.
That said, I don't agree that it's obvious that the two thresholds you mention are far apart, on the relevant scale - though how exactly to construct the relevant scale is unclear. And even if they are far apart, there are reasons to worry.
The first point, that the window is likely narrow, is because near-human capability is a very narrow window, for many or most domains we have managed to be successful in with ML. For example, moving from "beat some good Go players" to "unambiguously better than the best living players" was a few months.
The second point is that I think that the jump from "around human competence" to "smarter than most / all humans" is plausibly closely related to both how much power we will end up giving systems, and (partly as a consequence,) how likely they are to end up actually trying to attack in some non-trivial way. And this point is based on my intuitive understanding of why very few humans attempt to do anything which will cause them to be jailed. Even psychopaths who don't actually care about the harm being caused wait until they are likely to get away with something. Lastly and relatedly, once humans reach a certain educational level, you don't need to explicitly train people to reason in specific domains - they find books and build inter-domain knowledge on their own. I don't see a clear reason to expect AGI to work differently, once they are, in fact, generally capable at the level of smarter-than-almost-all-humans. And whether that gap in narrow or wide, and whether takes minutes, or a decade, the critical concern is that we might not see misalignment of the most worrying kinds until after we are on the far end of the gap.
I think the OP's argument depends on the idea that "Nobody is going to debug the geopolitical abilities of an AI designed to build paperclips. So the fact that debugging occurs in one domain is no guarantee of success in any other." If AI's have human level or above capacities in the domains relevant to forming an initial plan to attempt to take over the world and beginning that plan, but have subhuman capacities/bugs in the further stages of that plan, then assuming that at least human level capacities are needed in the latter domains in order to succeed, the threshold could be pretty large, as AI's could keep getting smarter at domains related to the initial stages of the plan which are presumably closer to the distributions it has been trained on (e. g. social manipulation/text outputting to escape a box) while failing to make as much progress in the more OOD domains.
Part of my second point is that smart people figure out for themselves what they need to know in new domains, and my definition of "general intelligence" there is little reason to think an AGI will be different. The analogies to ANI with domain specific knowledge which doesn't generalize well seems to ignore this - though I agree it's a reason to be slightly less worried that ANI systems could scale in ways that pose risks, without developing generalized intelligence first.
I mostly agree with you that if we get AGI and not ANI, the AGI will be able to learn the skills relevant to taking over the world. However, I think that due to inductive biases and quasi-innate intuitions, different generally intelligent systems are differently able to learn different domains. For example, it is very difficult for autistic people (particularly severely autistic people) to learn social skills. Similarly, high-quality philosophical thinking seems to be basically impossible for most humans. Applying this to AGI, it might be very hard to AGI to learn how to make long term plans or social skills.
Narrow AIs have moved from buggy/mediocre to hyper-competent very quickly (months). If early AGIs are widely copied/escaped, the global resolve and coordination required to contain them would be unprecedented in breadth and speed.
I expect warning shots, and expect them to be helpful (vs no shots), but take very little comfort in that.
They've learned within months for certain problems where learning can be done at machine speeds, ie game-like problems where it can "play against itself" or problems where huge amounts of data are available in machine-friendly format. But that isn't the case for every application. For example, developing self driving cars up to perfection level has taken way, way longer than expected, partially because it has to deal with freak events that are outside the norm, so a lot more experience and data has to be built up, which takes human time. (of course, humans are also not great at freak events, but remember we're aiming for perfection here). I think most tasks involved in taking over the world will look a lot more like self-driving cars than playing Go, which inevitably means mistakes, and a lot of them.
I strongly agree with you on points one and two, though I’m not super confident on three. For me the biggest takeaway is we should be putting more effort into attempts to instill “false” beliefs which are safety-promoting and self-stable.
I could see this backfiring. What if insilling false beliefs just later led to the meta-belief that deception is useful for control?
that's a fair point, I'm reconsidering my original take.