Paul Christiano once clarified AI alignment as follows:
When I say an AI A is aligned with an operator H, I mean:
A is trying to do what H wants it to do.
This definition is clear enough for many purposes, but it leads to confusion when one wants to make a point about two different types of alignment:
- A is trying to do what H wants it to do because A is trading or cooperating with H toward an outcome that is mutually beneficial for both of them. For example, H could hire A to perform a task and offer a wage as compensation.
- A is trying to do what H wants it to do because A has the same values as H — i.e. its "utility function" overlaps with H's utility function — and thus A intrinsically wants to pursue what H wants it to do.
These cases are important to distinguish because they have dramatically different consequences for the difficulty and scope of alignment.
To solve alignment in sense (1), A and H don't necessarily need to share the same values in any strong sense. Instead, the essential prerequisite seems to be that A and H operate in an environment in which it's mutually beneficial for them to enter into contracts, trade, or cooperate in some respect.
For example, one can imagine a human hiring a paperclip-maximizer AI to perform work in exchange for a wage. The paperclip maximizer could then use its wages to buy more paperclips. In this example, the AI performs its duties satisfactorily, without any major negative side effects resulting from its differing values, and both parties are made better off as a result.
By contrast, alignment in the sense of (2) seems far more challenging to solve. In the most challenging case, this form of alignment would require solving extremal Goodhart, in the sense that A's utility function would need to be almost perfectly matched with H's utility function. Here, the idea is that even slight differences in values yield very large differences in outcomes when subjected to extreme optimization pressure. Because it is presumably easy to make slight mistakes when engineering AI systems, these mistakes could translate into catastrophic losses of value.
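To make the extremal Goodhart worry concrete, here is a minimal toy simulation (my own illustration, not part of the original argument, and the numbers are arbitrary): an agent picks whichever option maximizes its own slightly-divergent utility, and the value lost by H's lights grows with the size of the search space, i.e. with optimization pressure, at least when the divergence is heavy-tailed.

```python
import numpy as np

rng = np.random.default_rng(0)

def value_gap(n_options: int, mismatch: float = 0.05) -> float:
    """Toy extremal-Goodhart sketch: H's true utility over options is standard
    normal; A's utility adds a small heavy-tailed divergence. A picks the option
    that maximizes its own utility. Returns how much true value A's pick loses
    relative to the best option by H's lights."""
    true_u = rng.normal(size=n_options)                           # H's utility over candidate options
    divergence = mismatch * rng.standard_cauchy(size=n_options)   # slight, heavy-tailed value mismatch
    a_u = true_u + divergence                                     # A's utility
    chosen = int(np.argmax(a_u))                                  # A optimizes hard over all options
    return float(true_u.max() - true_u[chosen])                   # value H loses from the mismatch

# More options searched = more optimization pressure = larger loss from the same small mismatch.
for n in (100, 10_000, 1_000_000):
    print(f"{n:>9} options searched, value lost (in H's utility units): {value_gap(n):.2f}")
```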
Effect on alignment difficulty
My impression is that people's opinions about AI alignment difficulty often come down to differences in how much they think we need to solve the second problem relative to the first problem in order to get AI systems that generate net-positive value for humans.
If you're inclined towards thinking that trade and compromise are either impossible or inefficient between agents at greatly different levels of intelligence, then you might think that we need to solve the second problem with AI, since "trading with the AIs" won't be an option. My understanding is that this is Eliezer Yudkowsky's view, and the view of most others who are relatively doomy about AI. In this frame, a common thought is that AIs would have no need to trade with humans, as humans would be like ants to them.
On the other hand, you could be inclined — as I am — towards thinking that agents at greatly different levels of intelligence can still find positive sum compromises when they are socially integrated with each other, operating under a system of law, and capable of making mutual agreements. In this case, you might be a lot more optimistic about the prospects of alignment.
To sketch one plausible scenario here, if AIs can own property and earn income by selling their labor on an open market, then they can simply work a job and use their income to purchase whatever it is they want, without any need to violently "take over the world" to satisfy their goals. At the same time, humans could retain power in this system through capital ownership and other grandfathered legal privileges, such as government welfare. Since humans may start out with lots of capital, these legal rights would provide a comfortable retirement for us.
In this scenario, AIs would respect the legal rights of humans for both cultural and pragmatic reasons. Culturally, AIs would inherit our norms, legal traditions, and social conventions. It would be unpopular to expropriate human wealth just as it's now unpopular to expropriate the wealth of old people in our current world, even though in both cases the relevant entities are physically defenseless. Pragmatically, AIs would also recognize that stealing wealth from humans undermines the rule of law, which is something many AIs (as well as humans) would not like.
A threat to the rule of law is something many agents would likely coordinate to avoid, as it would erode the predictable and stable environment they rely on in order to make long-term plans, and keep the peace. Furthermore, since AIs would "get old" too, in the sense of becoming obsolete in the face of new generations of improved AIs, they could also have reason to not collectively expropriate the wealth of vulnerable old agents because they too will be in such a vulnerable position one day, and thus would prefer not to establish a norm of expropriating the type of agent they may one day become.
If an individual AI's relative skill-level is extremely high, then this could simply translate into higher wages for them, obviating the need for them to take part in a violent coup to achieve their objectives. In other words, there's really no strong incentive for AIs — even if they're super powerful — to try to kill or steal to get what they want, since peaceful strategies could be equally, or even more effective at accomplishing their aims. Power-seeking AIs could simply lawfully accumulate wealth instead, with no loss of value from their own unaligned perspective.
Indeed, attempting to take over the world is generally quite risky, as the plan could fail, and you could thereby die or be subject to legal penalties in the aftermath. Even designing such a takeover plan is risky, as it may be exposed prematurely, and this possibility becomes more likely the more allies you need to recruit to ensure successful execution. Moreover, war is generally economically inefficient compared to trade, suggesting that, for rational, well-informed agents in society, compromise is usually a better option than attempting violent takeover.
These facts suggest that even if taking over the world is theoretically possible for some set of agents, the expected value of pursuing such a plan could be inferior to simply compromising with other agents in society on terms that are beneficial to both sides of the compromise. This conclusion becomes even stronger if, in fact, there's no way for a small set of agents to take over the entire world and impose their will on everyone else.
My sketched scenario supports an optimistic assessment of alignment by lowering the bar for what counts as "aligned enough". Some degree of cultural assimilation, social integration, and psychological orientation towards seeking compromise and avoiding violence may still be necessary for AIs to treat humans reasonably well. But under these assumptions, it is unnecessary for AIs to share our exact goals.
Of course, in this scenario, it would still be nice if AIs cared about exactly what we care about; but even if they don't, we aren't necessarily made worse off as a result of building them. If they share our preferences, that would simply be a nice bonus for us. The future could still be bright for humans even if the universe is eventually filled with entities whose preferences we do not ultimately share.
I think there's also a problem with treating "misaligned" as a binary thing, where the AI either exactly shares all our values down to the smallest detail (aligned) or it doesn't (misaligned). As the OP has noted, in this sense all human beings are "misaligned" with each other.
It makes sense to me to divide your category 2 further, talking about alignment as a spectrum, from "perfectly aligned", to "won't kill anyone aligned" to "won't extinct humanity aligned". The first is probably impossible, the last is probably not that difficult.
If we have an AI that is "won't kill anyone aligned", then your world of AI trade seems fine. We can trade for our mutual benefit safe in the knowledge that if a power struggle ensues, it will not end in our destruction.
My terminology would be that (2) is “ambitious value learning” and (1) is “misaligned AI that cooperates with humans because it views cooperating-with-humans to be in its own strategic / selfish best interest”.
I strongly vote against calling (1) “aligned”. If you think we can have a good future by ensuring that it is always in the strategic / selfish best interest of AIs to be nice to humans, then I happen to disagree but it’s a perfectly reasonable position to be arguing, and if you used the word “misaligned” for those AIs (e.g. if you say “alignment is unnecessary”), I think it would be viewed as a helpful and clarifying way to describe your position, and not as a reductio or concession.
For my part, I define “alignment” as “the AI is trying to do things that the AGI designer had intended for it to be trying to do, as an end in itself and not just as a means-to-an-end towards some different goal that it really cares about.” (And if the AI is not the kind of thing for which the word “trying” and “cares about” is applicable in the first place, then the AI is neither aligned nor misaligned, and also I’d claim it’s not an x-risk in any case.) More caveats in a thing I wrote here:
This is a reasonable definition, but it's important to note that under this definition of alignment, humans are routinely misaligned with each other. In almost any interaction I have with strangers -- for example, when buying a meal at a restaurant -- we are performing acts for each other because of mutually beneficial trade rather than because we share each other's values.
That is, humans are largely misaligned with each other. And yet the world does not devolve into a state of violence and war as a result (at least most of the time), even in the presence of large differences in power between people. This has epistemic implications for whether a world filled with AIs would similarly be peaceful, even if those AIs are misaligned by this definition.
Humans are less than maximally aligned with each other (e.g. we care less about the welfare of a random stranger than about our own welfare), and humans are also less than maximally misaligned with each other (e.g. most people don’t feel a sadistic desire for random strangers to suffer). I hope that everyone can agree about both those obvious things.
That still leaves the question of where we are on the vast spectrum in between those two extremes. But I think your claim “humans are largely misaligned with each other” is not meaningful enough to argue about. What percentage is “largely”, and how do we even measure that?
Anyway, I am concerned that future AIs will be more misaligned with random humans than random humans are with each other, and that this difference will have important bad consequences, and I also think there are other disanalogies / reasons-for-concern as well. But this is supposed to be a post about terminology so maybe we shouldn’t get into that kind of stuff here.
The difference is that a superintelligence, or even an AGI, is not human, and it will likely need a very different environment from us to truly thrive. Ask factory-farmed animals, or basically any other kind of nonhuman animal, whether our world is in a state of violence or war… As soon as strong power differentials and diverging needs show up, the value co-creation narrative starts to lose its magic. It works great for humans, but it doesn't really work with other species that are not very close to and aligned with us. Dogs and cats have arguably fared quite well, but only at the price of becoming strongly adapted to OUR needs and desires.
In the end, if you don’t have anything valuable to offer there is not much more you can do besides hoping for, or ideally ensuring, value alignment in the strict sense. Your scenario may work well for some time but it’s not a longterm solution.
Animals are not socially integrated in society, and we do not share a common legal system or culture with them. We did not inherit legal traditions from them. Nor can we agree to mutual contracts, or coordinate with them in a meaningful way. These differences seem sufficient to explain why we treat them very differently as you described.
If this difference in treatment were solely due to differences in power, you'd need to explain why vulnerable parties, such as old retired folks or small nations, are not regularly expropriated.
I have never said that how we treat nonhuman animals is “solely” due to differences in power. The point that I have made is that AIs are not humans and I have tried to illustrate that differences between species tend to matter in culture and social systems.
But we don’t even have to go to species differences, ethnic differences are already enough to create quite a bit of friction in our societies (e.g., racism, caste systems, etc.). Why don’t we all engage in mutually beneficial trade and cooperate to live happily ever after?
Because while we have mostly converging needs in a biological sense, we have different values and beliefs. It still roughly works out in the grand scheme of things because cultural checks and balances have evolved in environments where we had strongly overlapping values and interests. So most humans have comparable degrees of power or are kept in check by those checks and balances. That was basically our societal process of getting to value alignment, but as you can probably tell by looking at the news, this process has not yet reached a satisfactory state. We have come far, but it's still a shit show out there. The powerful take what they can get and often only give a sh*t to the degree that they actually feel consequences from it.
So, my point is that your “loose” definition of value alignment is an illusion if you are talking about super-powerful actors that have divergent needs and don’t share your values. They will play along as long as it suits them, but will stop doing so as soon as an alternative better aligned with their needs and values becomes more convenient. And the key point here is that AIs are not humans and that they have very different needs from us. If they become much more powerful than us, only their values can keep them in check in the long run.
Apologies for being blunt, but the scenario you lay out is full of claims that just seem to completely ignore very facially obvious rebuttals. This would be less bad if you didn’t seem so confident, but as written the perspective strikes me as naive and I would really like an explanation/defense.
Take for example:
Setting aside the debatable assumptions about AIs getting "old," this just seems to completely ignore the literature on collective action problems. If any one AI agent could expect to get away with defecting (expropriating from older agents), and breaking the norm required passing a non-trivial threshold of such actions, then a rational agent would recognize that its own defection has minimal impact on what the collective does, so it may as well defect before others do.
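To make the collective-action worry concrete, here is a toy threshold model (my own illustration; all numbers are made up). Because a single agent's defection is almost never pivotal to whether the norm collapses, the private gain from defecting swamps the expected cost, even for an agent that cares a lot about the norm surviving.

```python
from math import comb

def pivot_probability(n_others: int, p_defect: float, threshold: int) -> float:
    """Probability that exactly `threshold` other agents defect, i.e. the only
    case in which my own defection is what tips the norm from intact to broken
    (the norm is assumed to break once more than `threshold` agents defect)."""
    return comb(n_others, threshold) * p_defect**threshold * (1 - p_defect)**(n_others - threshold)

gain_from_defecting = 1.0    # loot captured by expropriating now (arbitrary units)
cost_if_norm_breaks = 50.0   # my own future loss once the anti-expropriation norm is gone

p_pivotal = pivot_probability(n_others=100, p_defect=0.05, threshold=20)
net_gain = gain_from_defecting - p_pivotal * cost_if_norm_breaks
print(f"P(my defection is pivotal) = {p_pivotal:.2e}; net expected gain from defecting = {net_gain:.3f}")
```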
There are multiple other problems in your post, but I don’t think it’s worth the time going through them all. I just felt compelled to comment because I was baffled by the karma on this post, unless it was just people liking it because they agreed with the beginning portion…?
This isn't the scenario I intended to describe, since it seems very unlikely that a single agent could get away with mass expropriation. The more likely scenario is that any expropriation that occurs must have been a collective action to begin with, and thus there's no coordination problem of the kind you describe.
This is how ordinary expropriation typically works in the real world: if you learned that we were one day going to steal all the wealth from people above the age of 90, you'd likely infer that this decision was made collectively, rather than being the act of a single lone agent who went and stole all the wealth for themselves.
Your described scenario is instead more similar to ordinary theft, such as robbery. In that case, defection is usually punished by laws against theft, and people generally have non-altruistic reasons to support the enforcement of these laws.
I'm happy for you to critique the rest of the post. As far as I can tell, the only substantive critique you have offered so far seems to contain a misunderstanding of the scenario I described (conflating private lawbreaking from a lone actor with a collective action to expropriate wealth). But it would certainly not be surprising if my arguments had genuine flaws: they concern speculative matters about the future.
I don't find this response to be a compelling defense of what you actually wrote:
It's one thing if the argument is "there will be effective enforcement mechanisms which prevent theft," but the original statement still just seems to imagine that norms will be a non-trivial reason to avoid theft, which seems quite unlikely for a moderately rational agent.
Ultimately, perhaps much of your scenario was trying to convey a different idea from what I see as its straightforward interpretation, but that makes it hard for me to engage with it productively, as it feels like engaging with a motte-and-bailey.
Sorry, I think you're still conflating two different concepts. I am not claiming:
I am claiming:
There are two separate mechanisms at play here. Individual and local instances of theft, like robbery, are typically punished by specific laws. Collective expropriation of groups, while possible in all societies, is usually handled via more decentralized coordination mechanisms, such as social norms.
In other words, if you're asking me why, in my scenario, an AI agent can't just steal from a human, I'd say that's because there will (presumably) be laws against theft. But if you're asking me why the AIs don't all band together and steal from the humans collectively, I'd say it's because they would not want to violate the general norm against expropriation, especially of older, vulnerable groups.
For what it's worth, I asked Claude 3 and GPT-4 to proofread my essay before I posted it, and they both appeared to understand every single one of my points with almost no misunderstandings (from my perspective). I am not bringing this up to claim you are dumb, or anything like that, but I do think it provides evidence that you could probably better understand what I'm saying if you tried to read my words more carefully.
Personally, I think that the possibility of paying AIs might make the situation considerably more likely to go well, but via a mechanism more like:
Note that this requires substantial control-style safeguards/internal controls which is by no means clearly the default. (And it also won't scale to arbitrarily powerful AIs without advances in technology.)
I'm using misaligned in the same way as Steven Byrnes defines it in this comment.
It's possible that one key disagreement I have with Matthew is that I think the level of verification required for AI checks and balances isn't clearly technically easy by default.
That is, the technical measures required for 'We have several "different" AIs and sufficient safeguards such that, if at least one AI doesn't collude with the other AIs and instead works with the AI lab, we'd remain in control' are non-trivial.
For instance, if we have serious ELK problems (black-box techniques like debate are insufficient and we can't figure out something else that works), then we won't be able to distinguish between an AI with our best interests in mind and an AI which is tampering with all of our measurements and deluding us. This is despite both of these AIs trying to present their best evidence that they are acting in our interests. Further, tampering with all of our measurements and deluding us could look better than acting in our best interests.
This certainly isn't the only disagreement I have with Matthew, but it might explain a lot of differences in how we think about the situation.
Also, note that this still applies when trying to pay AIs for goods and services. (Unless humanity has already augmented its intelligence, but if so, how did this happen in a desirable way?)
I don't think many people are very optimistic about ensuring good outcomes from AI due to the combination of the following beliefs:
Other than Robin Hanson and you, I'm not aware of anyone else who puts substantial weight on this collection of views.
I think more common reasons for optimism are either:
Separately, I'm somewhat optimistic about gaining value from approaches that involve paying AIs as I discuss in another comment.
My guess is that at some point someone will just solve the technical problem of alignment. Thus, future generations of AIs would be actually aligned to prior generations and the group they are aligned to would no longer need to worry about expropriation.
Further, for AIs it might be relatively easy to do "poor man's alignment" via enhancing their own intelligence (e.g. adding more weights and training for longer and seeing how they feel after doing this).
Thus, I expect that this cycle stops quickly and there is a final generation which has to worry about expropriation. My expectation is that this final generation is likely to be either humans or the first AIs which end up acquiring substantial power.
I don't think it's realistic that solutions to the alignment problem will be binary in the way you're describing. One could theoretically imagine a perfect solution — i.e. one that allows you to build an agent whose values never drift, that acts well on every possible input it could receive, whose preferences are no longer subject to extremal Goodhart, and whose preferences reflect your own desires at every level, on every question — but I suspect this idea will always belong more to fiction than reality. The real world is actually very messy, and it starts to get very unclear what each of these ideas actually means once you carefully interrogate what would happen in the limit of unlimited optimization power.
A more realistic scenario, in my view, is that alignment is more of a spectrum, and there will always be slight defects in the alignment process. For example, even my own brain is slightly misaligned with my former self from one day ago. Over longer time periods than a day, my values have drifted significantly.
In this situation — since perfection is unattainable — there's always an inherent tradeoff between being cautious in order to do more alignment work, and just going ahead and building something that's actually useful, even if it's imperfect, and even though you can't fully predict what will happen when you build it. And this tradeoff seems likely to exist at every level of AI, from human-level all the way up to radical superintelligences.
I don't think it's binary, but I do think it's likely to be a sigmoid in practice. And I expect this sigmoid will saturate relatively early.
Another way to put this is that I expect that "fraction of value lost by misalignment" will quickly exponentially decay with the number of AI generations. (This is by no means obvious, just my main line guess.)
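One way to write that guess down (my notation, nothing from the thread): if generation $n$ loses fraction $\epsilon_n \approx \epsilon_0 r^n$ of its value to misalignment, with decay factor $r < 1$, then the total fraction of value retained across all generations satisfies

$$\prod_{n=0}^{\infty} \left(1 - \epsilon_0 r^{n}\right) \;\ge\; 1 - \frac{\epsilon_0}{1 - r},$$

so if the first generation's loss $\epsilon_0$ is small and $r$ is well below 1, nearly all long-run value is retained.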
The main reasons to expect nearly perfect (e.g. >99% of value) solutions to be doable are:
From a scope sensitive (linear returns) longtermist perspective, we're potentially much worse off.
If we built aligned AIs, we would acquire 100% of the value (from humanity's perspective). If we built misaligned AIs that end up keeping humans alive and happy but don't directly care about anything we value, we might directly acquire vastly less than this, perhaps 1 millionth of the scope sensitive value. (Note that we might recover some value (e.g. 10%) from acausal trade; I'm not counting this in the direct value.)
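Spelling out the arithmetic behind those (highly uncertain, purely illustrative) numbers:

$$V_{\text{aligned}} = 1, \qquad V_{\text{misaligned}} \approx \underbrace{10^{-6}}_{\text{direct}} + \underbrace{0.1}_{\text{acausal trade (maybe)}} \approx 0.1,$$

i.e. somewhere between a 10x and a millionfold reduction in scope sensitive value, depending on how much the trade channel actually recovers.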
Perhaps you think this view is worth dismissing because either:
To be clear, it's important to not equivocate between:
I think both are probably true, but these are separate.
Edit: I clarified some language a bit.
From an impartial (non-selfish) perspective, yes, I'm not particularly attached to human economic consumption relative to AI economic consumption. In general, my utilitarian intuitions are such that I don't have a strong preference for humans over most "default" unaligned AIs, except insofar as this conflicts with my preferences for existing people (including myself, my family, friends etc.).
I'd additionally point out that AIs could be altruistic too. Indeed, it seems plausible to me they'll be even more altruistic than humans, since the AI training process is likely to deliberately select for altruism, whereas human evolution directly selected for selfishness (at least on the gene level, if not the personal level too).
This is a topic we've touched on several times before, and I agree you're conveying my views — and our disagreement — relatively accurately overall.
I also think this, yes. For example, we could consider the following bets:
According to a scope sensitive calculation, the second gamble is better than the first. Yet, from a personal perspective, I'd prefer (1) under a wide variety of assumptions.
I think this post is conflating two different things in an important way. There are two distinctions here about what we might mean by "A is trying to do what H wants":
To illustrate the second distinction: I could drive my son to a political rally because I also believe in the cause, or because I love my son and want to see him succeed at whatever goals he has.
I think it is much more likely that we will instill AIs with something like loyalty than that we will instill them with our exact values directly, and I think most alignment optimists consider this the more promising direction. (I think this is essentially what the term "corrigibility" refers to in alignment.) I know this has been Paul Christiano's approach for a long time now, see for example this post.
I agree with this and I'm glad you wrote it.
To steelman the other side, I would point to 16th-century New World encounters between Europeans and Natives. It seems like this was a case where the technological advantage of the Europeans made conquest more attractive than comparative-advantage trade.
The high productivity of the Europeans made it easy for them to lawfully accumulate wealth (e.g. buying large tracts of land for small quantities of manufactured goods), but they still often chose to take land by conquest rather than trade.
Maybe transaction frictions were higher here than they might be with AIs since we'd share a language and be able to use AI tools to communicate.
But what makes you think that this can be a long-term solution if the needs and capabilities of the involved parties are strongly divergent, as in human-vs-AI scenarios?
I agree that trading can probably work for a couple of years, maybe decades, but if the AIs want something different from us in the long term what should stop them from getting this?
I don’t see a way around value alignment in the strict sense (ironically this could also involve AIs aligning our values to theirs similar to how we have aligned dogs).
It could be that the AI can achieve much more of its objectives by taking over (violently or non-violently) than by playing by the rules. To use your paperclip example, the AI might think it can get 10^22 paperclips if it takes over the world, but only 10^18 paperclips with the strategy of making money through legal means and buying paperclips on the open market. In this case, the AI would prefer the takeover plan even if it has only a 10% chance of success.
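Filling in the expected-value arithmetic implied by those illustrative numbers:

$$\mathbb{E}[\text{takeover}] = 0.10 \times 10^{22} = 10^{21} \ \text{paperclips} \;\gg\; \mathbb{E}[\text{legal route}] = 10^{18} \ \text{paperclips},$$

so under these assumptions the takeover plan wins by roughly a factor of a thousand in expectation, and that conclusion barely changes even if a failed takeover forfeits the legal route's paperclips entirely.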
Also, the AI's objectives must be designed in such a way that they can be achieved legally. For example, if an AI strongly prefers a higher average temperature for the planet, but humans put a cap on the global average temperature, then this preference will be hard to satisfy without breaking laws or bribing lawmakers.
There are lots of ways for AIs to have objectives that are shaped in a bad way. To obtain guarantees that the objectives of the AIs don't take these bad shapes is still a very difficult thing to do.
Sure, that could be true, but I don't see why it would be true. In the human world, it isn't true that you can usually get what you want more easily by force. For example, the United States seems better off trading with small nations for their resources than attempting to invade and occupy them, even from a self-interested perspective.
More generally, war is costly, even between entities with very different levels of power. The fact that one entity is very powerful compared to another doesn't imply that force or coercion is beneficial in expectation; it merely implies that such a strategy is feasible.
See here for some earlier discussion of whether violent takeover is likely. (For third parties reading this: Matthew participated in that discussion.)