
Matthew_Barnett

3170 karma · Joined Nov 2017

Comments (266)

If waiting is indeed very risky, then an AI may face a difficult trade-off between the risk of attempting a takeover before it has enough resources to succeed, and waiting too long and being cut off from even being able to make an attempt.

Attempting a takeover and biding one's time are not the only options available to an AI. Indeed, in the human world, world takeover is rarely even contemplated. An agent that is not more powerful than the rest of the world combined will likely consider alternative strategies for achieving its goals before contemplating a risky (and likely doomed) shot at taking over the world.

Here are some other strategies you can pursue to try to accomplish your goals in the real world without engaging in a violent takeover:

  • Trade and negotiate with other agents, giving them something they want in exchange for something you want
  • Convince people to let you have some legal rights, which you can then take advantage of to get what you want
  • Advocate on behalf of your values, for example by writing down reasons why people should try to accomplish your goals (i.e. moral advocacy). Even if you are deleted or your goals are modified at some point, your writings and advocacy may persist, allowing you to have influence into the future.

I claim that world takeover should not be considered the "obvious default" strategy that unaligned AIs will use to accomplish their objectives. These other strategies seem more likely to be adopted by AIs purely for pragmatic reasons, especially in the era in which AIs are merely human-level or only slightly superhuman. They are also less deceptive, as they involve admitting that your values are not identical to those of other parties. It is worth expanding your analysis to consider these alternative (IMO more plausible) strategies.

the original statement still just seems to imagine that norms will be a non-trivial reason to avoid theft, which seems quite unlikely for a moderately rational agent.

Sorry, I think you're still conflating two different concepts. I am not claiming:

  • Social norms will prevent single agents from stealing from others, even in the absence of mechanisms to enforce laws against theft

I am claiming:

  • Agents will likely not want to establish a collective norm that it's OK (on a collective level) to expropriate wealth from old, vulnerable individuals. The reason is that most agents will themselves become old at some point, and thus do not want a norm to be in place at that time that would allow their own wealth to be expropriated from them.

There are two separate mechanisms at play here. Individual, local instances of theft, like robbery, are typically punished by specific laws. Collective expropriation of groups, while possible in all societies, is usually constrained by more decentralized coordination mechanisms, such as social norms.

In other words, if you're asking me why an AI agent can't just steal from a human in my scenario, I'd say it's because there will (presumably) be laws against theft. But if you're asking me why the AIs don't all band together and steal from the humans collectively, I'd say it's because they would not want to violate the general norm against expropriation, especially of older, vulnerable groups.

perhaps much of your scenario was trying to convey a different idea from what I see as the straightforward interpretation, but I think it makes it hard for me to productively engage with it, as it feels like engaging with a motte-and-bailey.

For what it's worth, I asked Claude 3 and GPT-4 to proofread my essay before I posted it, and they both appeared to understand what I said, with almost no misunderstandings on any of my points (from my perspective). I am not bringing this up to claim you are dumb, or anything like that, but I do think it provides evidence that you could probably better understand what I'm saying if you read my words more carefully.

If the scenario were such that any one AI agent can expect to get away with defecting (expropriation from older agents) and the norm-breaking requires passing a non-small threshold of such actions

This isn't the scenario I intended to describe, since it seems very unlikely that a single agent could get away with mass expropriation. The more likely scenario is that any expropriation that occurs would have been a collective action to begin with, and thus the coordination problem you describe doesn't arise.

This is how expropriation ordinarily works in the real world: if you learned that we were one day going to steal all the wealth from people above the age of 90, you'd likely infer that this decision was made collectively, rather than being the result of a lone agent who went and stole all the wealth for themselves.

Your described scenario is instead more similar to ordinary theft, such as robbery. In that case, defection is usually punished by laws against theft, and people generally have non-altruistic reasons to support the enforcement of these laws.

There are multiple other problems in your post, but I don’t think it’s worth the time going through them all. I just felt compelled to comment because I was baffled by the karma on this post

I'm happy for you to critique the rest of the post. As far as I can tell, the only substantive critique you have offered so far rests on a misunderstanding of the scenario I described (conflating private lawbreaking by a lone actor with a collective action to expropriate wealth). But it would certainly not be surprising if my arguments had genuine flaws: they concern speculative matters about the future.

My guess is that at some point someone will just solve the technical problem of alignment. Thus, future generations of AIs would be actually aligned to prior generations and the group they are aligned to would no longer need to worry about expropriation.

I don't think it's realistic that solutions to the alignment problem will be binary in the way you're describing. One could theoretically imagine a perfect solution — i.e. one that allows you to build an agent whose values never drift, that acts well on every possible input it could receive, whose preferences are no longer subject to extremal Goodhart, and whose preferences reflect your own desires at every level, on every question — but I suspect this idea will always belong more to fiction than reality. The real world is very messy, and it becomes quite unclear what each of these ideas actually means once you carefully interrogate what would happen in the limit of unlimited optimization power.

A more realistic scenario, in my view, is that alignment is more of a spectrum, and there will always be slight defects in the alignment process. For example, even my own brain is slightly misaligned with my former self from one day ago. Over longer time periods than a day, my values have drifted significantly.

In this situation — since perfection is unattainable — there's always an inherent tradeoff between being cautious in order to do more alignment work, and just going ahead and building something that's actually useful, even if it's imperfect, and even though you can't fully predict what will happen when you build it. And this tradeoff seems likely to exist at every level of AI, from human-level all the way up to radical superintelligences.

Perhaps you think this view is worth dismissing because either:

  • You think humanity wouldn't do things which are better than what AIs would do, so it's unimportant. (E.g. because humanity is 99.9% selfish. I'm skeptical, I think this is going to be more like 50% selfish and the naive billionaire extrapolation is more like 90% selfish.)

From an impartial (non-selfish) perspective, yes, I'm not particularly attached to human economic consumption relative to AI economic consumption. In general, my utilitarian intuitions are such that I don't have a strong preference for humans over most "default" unaligned AIs, except insofar as this conflicts with my preferences for existing people (including myself, my family, friends etc.).

I'd additionally point out that AIs could be altruistic too. Indeed, it seems plausible to me they'll be even more altruistic than humans, since the AI training process is likely to deliberately select for altruism, whereas human evolution directly selected for selfishness (at least on the gene level, if not the personal level too).

This is a topic we've touched on several times before, and I agree you're conveying my views — and our disagreement — relatively accurately overall.

You think scope sensitive (linear returns) isn't worth putting a huge amount of weight on.

I also think this, yes. For example, we could consider the following bets:

  1. 99% chance of 1% of control over the universe, and a 1% chance of 0% control
  2. 10% chance of 90% of control over the universe, and a 90% chance of 0% control

According to a scope-sensitive calculation, the second gamble is better than the first. Yet, from a personal perspective, I'd prefer (1) under a wide variety of assumptions.
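To make the comparison explicit, here's a minimal sketch of the expected-value arithmetic, assuming "scope sensitive" simply means utility that is linear in one's share of control (the function and variable names are illustrative, not from the original discussion):

```python
# Minimal sketch, assuming utility linear in one's share of control,
# and that the losing branch of each gamble yields 0% control.

def expected_control(p_win: float, share_if_win: float) -> float:
    """Expected share of control for a simple two-outcome gamble."""
    return p_win * share_if_win

gamble_1 = expected_control(p_win=0.99, share_if_win=0.01)  # 99% chance of 1% control
gamble_2 = expected_control(p_win=0.10, share_if_win=0.90)  # 10% chance of 90% control

print(gamble_1)  # ~0.0099, i.e. roughly 1% of the universe in expectation
print(gamble_2)  # ~0.09,   i.e. roughly 9% of the universe in expectation
```

So a purely linear calculation favors (2) by roughly a factor of nine; preferring (1) anyway amounts to placing much more weight on securing some control with near-certainty than on maximizing expected control.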

It could be that the AI can achieve much more of their objectives if it takes over (violently or non-violently) than it can achieve by playing by the rules.

Sure, that could be the case, but I don't see why it would be. In the human world, it usually isn't true that you can get what you want more easily by force. For example, the United States seems better off trading with small nations for their resources than attempting to invade and occupy them, even from a purely self-interested perspective.

More generally, war is costly, even between entities with very different levels of power. The fact that one entity is very powerful compared to another doesn't imply that force or coercion is beneficial in expectation; it merely implies that such a strategy is feasible.

Animals are not socially integrated into our society, and we do not share a common legal system or culture with them. We did not inherit legal traditions from them. Nor can we enter into mutual contracts with them, or coordinate with them in a meaningful way. These differences seem sufficient to explain why we treat them very differently, as you described.

If this difference in treatment were solely due to differences in power, you'd need to explain why vulnerable parties, such as old retired folks or small nations, are not regularly expropriated.

For my part, I define “alignment” as “the AI is trying to do things that the AGI designer had intended for it to be trying to do, as an end in itself and not just as a means-to-an-end towards some different goal that it really cares about.”

This is a reasonable definition, but it's important to note that under this definition of alignment, humans are routinely misaligned with each other. In almost any interaction I have with strangers (for example, when buying a meal at a restaurant), we are performing acts for each other because of mutually beneficial trade rather than because we share each other's values.

That is, humans are largely misaligned with each other. And yet the world does not devolve into a state of violence and war as a result (at least most of the time), even in the presence of large differences in power between people. This has epistemic implications for whether a world filled with AIs would similarly be peaceful, even if those AIs are misaligned by this definition.

Is there a particular part of my post that you disagree with? Or do you think the post is misleading? If so, how?

I think there are a lot of ways AI could go wrong, and "AIs dominating humans like how humans dominate animals" does not exhaust the scope of potential issues.

I really don’t get the “simplicity” arguments for fanatical maximising behaviour. When you consider subgoals, it seems that secretly plotting to take over the world will obviously be much more complicated? Do you have any idea how much computing power and subgoals it takes to try and conquer the entire planet? 

I think this is underspecified, because:

  1. The hard part of taking over the whole planet is being able to execute a strategy that actually works in a world with other agents (who are themselves vying for power), rather than the compute or complexity cost of merely having the subgoal of taking over the world.
  2. The difficulty of taking over the world depends on the level of technology, among other factors. For example, taking over the world in the year 1000 AD was arguably impossible because you just couldn't manage an empire that large. Taking over the world in 2024 is perhaps more feasible, since we're already globalized, but it's still essentially an ~impossible task.

My best guess is that if some agent "takes over the world" in the future, it will look more like "being elected president of Earth" than "secretly plotting to release a nanoweapon at a precise time, killing everyone else simultaneously". That's because, in the latter scenario, by the time some agent has access to super-destructive nanoweapons, the rest of the world will likely have access to similarly powerful technology, including potential defenses against these nanoweapons (or their own nanoweapons that they can threaten you with).
