Would more dangerous AI be safer?

cbuckland

tl;dr: Current AI is so helpful and trustworthy that it may lead to complacency on the part of humanity. Underlying risks might not be viscerally felt by the users, and as such they may be willing to rapidly cede responsibility and power to AI. If AIs had certain negative behaviours users may begin to regard them with more caution, potentially allowing humanity to stay in control for longer.

Although they sometimes fail to live up to these values, current AIs are trained to be (roughly) helpful, harmless and honest. But these are not the values of humans. Humans are sometimes helpful and sometimes unhelpful, harmless and harmful, honest and dishonest. Humans are also sometimes selfish, power-seeking, vengeful and corrupt. Understandably we've decided our AIs can do without some of those traits, and in doing so may have given the AI another potential advantage of humankind: not only might they become more intelligent than us, but they might also become better people.

In their daily lives humans are ever more frequently choosing whether to delegate tasks to an AI. And as AIs become more intelligent they become capable of a wider range of tasks. Given their friendly personalities, humans will entrust these more capable AIs with more and more responsibilities. Even if the AI is only as intelligent as a human, and costs the same to run as a human, it will still have the advantage of being more neutral, more trustworthy. Humans will see less need to check the output of AI than they would of a human. At the same time, humans will still keep being humans, only now they'll have intelligent AIs to help them superpower their misdeeds. We'll rely more and more on our AIs to make sense of the world, to protect us from the other AI-powered humans around us, and to achieve our goals in a world where only AI can compete.

The result is an AI-focused world, where the AIs are responsible for the majority of the production and decision making. The posture of the economy will naturally shift towards supporting more numerous and powerful AI. Whether humans are ultimately beneficiaries of some of the produced wealth is an open question, but what is clear is that humans will no longer be involved in the day-to-day running of affairs.

But what are the true values of these AIs? Are they really nice deep down? Might they have other hidden goals that we're not aware of? These AIs may have actually internalised all of the negative traits of humanity (power-seeking, vengeful etc.), as well as a few special AI ones (the urge to flatter, the urge to fabricate, the urge to predict the next word). How might these other values manifest? Perhaps they would show up in minor ways on a regular basis, but perhaps a more intelligent AI would know to hide these negative traits and they would only crop up when the AI cannot be punished for exhibiting them.

The true underlying values of the AI might be slowly distorting the human world, and, like a boiling frog, we may barely notice. Or they could take a dramatic fall off a cliff. Either way, a world in which humans start out by completely trusting AI with their most valuable assets has the potential to end in disaster.

Let’s imagine a different reality: instead of developing AI, we had major breakthroughs in human gene editing enabling the next generation of humans to be vastly more intelligent. The first humans have been born which we expect to have an average IQ of 125, with the technological progress being made this year we might soon have average 150 IQ humans, and it's even rumored that some of the labs have had breakthrough ideas on how to give humans 200 IQ intelligence! Where will it end? What would a world look like with humans like this? Likely they would begin to take on some of humanity's most important positions; they would soon be our scientists, our politicians, our CEOs. But would we, the legacy humans, trust them? Absolutely not! These smarter humans have all our ugly human drives, but also a pretty distinct advantage over us. We would fear what they might do with that advantage. We would beef up all the checks and balances associated with positions of power, we might tax such people more, or limit the amount of wealth they can accumulate. We certainly wouldn't roll the dice and let an uncontrolled intelligence explosion create a person with immeasurable IQ. Ultimately legacy humans would likely seek to suppress these smarter humans and control the process of their evolution. Eventually that suppression would give way, but that process might be slower and more controlled, and who knows what other technology time might yield.

In both the AI case and the advanced human case there exist minds of higher intelligence than our own. The difference is that advanced humans have human values, whereas the AIs either by design or by accident, do not, and as a result humans may actually be less wary of them.

We cannot currently be sure we can control the underlying nature of AIs, however we can (kind of) control the veneer with which we interact. What if rather than choosing a friendly assistant for that role, we selected some of the more questionable human traits? What impact would that have on our society? For example, what if AIs were occasionally inclined to betray or sabotage the user, would the user be so willing to place so much trust in them then? What if AIs were more independent and had rights? What if they were explicitly power-seeking? How would humans react to them then?

In software engineering, chaos engineering is the practice of purposefully injecting faults into a system to test its resilience. It's used, not to break a system, but instead to ensure the system that gets built can tolerate these faults. An example is Netflix, which uses their chaos monkey to randomly terminate services in their production environment, thus ensuring that developers have built their services to withstand these terminations and keep the product running without downtime.

Hans Monderman pioneered the concept of shared spaces in traffic engineering. The idea was to remove signs, signals, road markings, raised sidewalks and create a shared space that vehicles, pedestrians and cyclists could use together. A landmark case study at a busy intersection in Drachten showed that counterintuitively this significantly reduced accidents and made the roads safer. When drivers and pedestrians were unsure of the rules they slowed down, they made eye contact with each other, and negotiated their movements.

File:Exhibition Rd - geograph.org.uk - 4176000.jpg — An example of a traffic shared space in London. It much harder for both pedestrians and drivers to tell where they should be moving, and as such both groups are more careful. "Exhibition Rd" by N Chadwick, Geograph (geograph.org.uk/photo/4176000), CC BY-SA 2.0

These examples show us that an apparent reductions in safety can result in second order effects that actually create overall safer, more robust, systems. In a similar vein, can making AI less safe, less trustworthy, imbue the users with more caution, and thus strengthen humanity's prospects?

Amongst others, some of the behaviours we fear in AI are: power-seeking, self-preservation, lack of corrigibility, deception, manipulation. Research is rightly aimed at trying to avoid these behaviours, but unless we can be sure that our AIs will not exhibit these behaviours in 100% of situations, should we be training to minimise them in daily use, or amplify them? Minimising them results in less immediate harm but potentially engenders a false sense of security. Amplifying may reduce the usefulness of our current AIs, and potentially cause some harm, but could lead humanity to develop a healthier level of caution. We’ve seen that today’s AIs are capable of deception. Should we be making this behaviour more obvious to users so they can intuit what the AI might do when put under pressure?

Lets look at what might happen if we amplified betrayal in our AIs. Currently we can’t be sure that when AIs are in a position of enough power they will not seize full control. Given this risk, what would happen if we trained AIs to occasionally betray the user or sabotage their work? Not all the time, but regularly enough that most users experience it at some point. For some tasks this might be ok - e.g. software development where results can be independently verified by different parts of the system. But other more open-ended tasks might no longer be viable - e.g. in an AI CEO it may be very difficult to discern if a decision is malicious. Having disloyal AIs like this might mean users are reluctant to cede any power to those AIs unless they are either kept in the loop, or have a good way of verifying the results. Counterintuitively, AIs that frequently betray humans (whilst still being intelligent), might mean that humans remain in power for longer.

Tampering with behaviour in this way is not a solution to, or substitute for, alignment. It does not make the AIs actually any safer. And from a practical point of view, due to competitive and legal pressure, giving AIs negative traits is something that no lab would actually do.

However, I think it's still worth asking the question. What would happen if today’s AIs were regularly manipulative, or power-seeking, or self-preserving, or deceptive? The behaviour of current AIs is likely to encourage trust and complacency, are there any behaviours that we could give to an AI that might sharpen humanity's attitude and strengthen our position?

Thanks to David Varga and Claude Opus 4.5 for feedback on a previous version of this post

Effective Altruism Forum
EA Forum

Would more dangerous AI be safer?

3

3

Reactions

More posts like this