I have some objections to the idea that groups will be "immortal" in the future, in the sense of never changing, dying, or rotting, and persisting over time in a roughly unchanged form, exerting consistent levels of power over a very long time period. To be clear, I do think AGI can make some forms of value lock-in more likely, but I want to distinguish a few different claims:
(1) is a future value lock-in likely to occur at some point, especially not long after human labor has become ~obsolete?
(2) is lock-in more likely if we perform, say, a century more of technical AI alignment research, before proceeding forward?
(3) is it good to make lock-in more likely by, say, delaying AI by 100 years to do more technical alignment research? (i.e., would this type of intervention be good or bad?)
My quick and loose current answers to these questions are as follows:
I'm a bit surprised you haven't seen anyone make this argument before. To be clear, I wrote the comment last night on a mobile device, and it was intended to be a brief summary of my position, which perhaps explains why I didn't link to anything or elaborate on that specific question. I'm not sure I want to outline my justifications for my view right now, but my general impression is that civilization has never had much central control over cultural values, so it's unsurprising if this situation persists into the future, including with AI. Even if we align AIs, cultural and evolutionary forces can nonetheless push our values far from where they are now. Does that brief explanation provide enough of a pointer to what I'm saying for you to be ~satisfied? I know I haven't said much here; but I kind of doubt my view on this issue is so rare that you've literally never seen someone present a case for it.
I guess overall I'm still inclined to push for a future where "AI alignment" and "human safety" are both solved, instead of settling for one in which neither is (which is how I'm tempted to summarize your position, though I'm not sure if I'm being fair).
For what it's worth, I'd loosely summarize my position on this issue as being that I mainly think of AI as a general vehicle for accelerating technological and economic growth, along with accelerating things downstream of technology and growth, such as cultural change. And I'm skeptical we could ever fully "solve alignment" in the ambitious sense you seem to be imagining.
In this frame, it could be good to slow down AI if your goal is to delay large changes to the world. There are plausible scenarios in which this could make sense. Perhaps most significantly, one could be a cultural conservative and think that cultural change is generally bad in expectation, and thus more change is bad even if it yields higher aggregate prosperity sooner in time (though I'm not claiming this is your position).
Whereas, by contrast, I think cultural change can be bad, but I don't see much reason to delay it if it's inevitable. And the case against delaying AI seems even stronger here if you care about preserving (something like) the lives and values of people who currently exist, as AI offers the best chance of extending our lifespans, and "putting us in the driver's seat" more generally by allowing us to actually be there during AGI development.
If future humans were in the driver's seat instead, but with slightly more control over the process, I wouldn't necessarily see that as being significantly better in expectation compared to my favored alternative, including over the very long run (according to my values).
(And as a side note, I also care about influencing human values, or what you might term "human safety", but I generally see this as orthogonal to this specific discussion.)
My own thinking is that war between AIs and humans could happen in many ways. One simple (easy to understand) way is that agents will generally refuse a settlement worse than what they think they could obtain on their own (by going to war), so human irrationality could cause a war when e.g. the AI faction thinks it will win with 99% probability, and humans think they could win with 50% probability, so each side demands more of the lightcone (or resources in general) than the other side is willing to grant.
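This dynamic can be sketched with a minimal bargaining model (in the spirit of the rationalist-explanations-for-war literature; the specific function and parameter names here are my own illustrative assumptions, not anyone's established formalization). If war destroys a fraction of the prize for each side, a peaceful split exists exactly when the two sides' combined optimism doesn't exceed the combined costs of fighting:

```python
def bargaining_range(p_a, p_b, cost_a, cost_b):
    """Return the interval of splits (share going to side A) that both
    sides prefer to war, or None if no such split exists.

    p_a: A's subjective probability that A wins a war.
    p_b: B's subjective probability that B wins a war.
    cost_a, cost_b: each side's expected cost of fighting, as a
    fraction of the total prize.

    A accepts any share >= p_a - cost_a (its expected payoff from war);
    B accepts any share for A <= (1 - p_b) + cost_b.
    """
    lo = p_a - cost_a          # the least A will accept peacefully
    hi = (1 - p_b) + cost_b    # the most B will concede peacefully
    return (lo, hi) if lo <= hi else None

# Mutually consistent beliefs (p_a + p_b = 1): a peaceful split exists,
# even with a lopsided 99%/1% balance of power.
print(bargaining_range(0.99, 0.01, 0.1, 0.1))  # a non-empty interval

# The example above: the AI faction thinks it wins with 99%, humans
# think they win with 50%. Combined optimism (0.49) exceeds combined
# costs (0.2), so no settlement satisfies both sides.
print(bargaining_range(0.99, 0.50, 0.1, 0.1))  # None
```

Note that in this toy model, war in equilibrium comes from the disagreement in beliefs (or from the costs of war being low), not from how opposed the two sides' values are.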
This generally makes sense to me. I also think human irrationality could prompt a war with AIs. I don't disagree with the claim insofar as you're claiming that such a war is merely plausible (say >10% chance), rather than a default outcome. (Although to be clear, I don't think such a war would likely cut cleanly along human vs. AI lines.)
On the other hand, humans are currently already irrational and yet human vs. human wars are not the default (they happen frequently, but, e.g., at any given time on Earth, the vast majority of humans are not in a warzone or fighting in an active war). It's not clear to me why humans vs. AIs would make war more likely to occur than in the human vs. human case, if by assumption the main difference here is that one side is more rational.
In other words, if we're moving from a situation of irrational parties vs. other irrational parties to irrational parties vs. rational parties, I'm not sure why we'd expect this change to make things more warlike and less peaceful as a result. You mention one potential reason:
Also, given that humans often do (or did) go to war with each other, our shared values (i.e. the extent to which we do have empathy/altruism for others) must contribute to the current relative peace in some way.
I don't think this follows. Humans presumably also had empathy in e.g. 1500, back when war was more common, so how could it explain our current relative peace?
Perhaps you mean that cultural changes caused our present time period to be relatively peaceful. But I'm not sure about that; or at least, the claim should probably be made more specific. There are many things about the environment that have changed since our relatively more warlike ancestors, and (from my current perspective) I think it's plausible that any one of them could have been the reason for our current relative peace. That is, I don't see a good reason to single out human values or empathy as the main cause in itself.
For example, humans are now a lot richer per capita, which might mean that people have "more to lose" when going to war, and thus are less likely to engage in it. We're also a more globalized culture, and our economic system relies more on long-distance trade than it did in the past, making war more costly. We're also older, in the sense that the median age is higher (and old people are less likely to commit violence), and women, who are perhaps less likely to support hawkish politicians, got the right to vote.
To be clear, I don't put much confidence in any of these explanations. As of now, I'm very uncertain about why the 21st century seems relatively peaceful compared to the distant past. However, I do think that:
It would be interesting to spell out which points there seem much more plausible with respect to notion (1) but not (2). If one has high credence in the view that AIs will decide to compromise with humans, rather than extinguish them, this would be one example of a view which leads to a much higher credence in (1) than in (2).
I think the view that AIs will compromise with humans rather than go to war with them makes sense under the perspective shared by a large fraction (if not majority) of social scientists that war is usually costlier, riskier, and more wasteful than trade between rational parties with adequate levels of information, who have the option of communicating and negotiating successfully.
This is a general fact about war, and has little to do with the values of the parties going to war; cf. Rationalist explanations for war. Economic models of war do not generally predict war between parties that have different utility functions. On the contrary, a standard (simple) economic model of human behavior consists of viewing humans as entirely misaligned with other agents in the world, in the sense of having completely non-overlapping utility functions with random strangers. This model has been generalized to firms, countries, alliances, etc., and yet it is rare for these generalized models to predict war as the default state of affairs.
Usually when I explain this idea to people, I am met with skepticism that we can generalize these social science models to AI. But I don't see why not: they are generally our most well-tested models of war. They are grounded in empirical facts and decades of observations, rather than evidence-free speculation (which I perceive as the primary competing alternative in AI risk literature). And most importantly, the assumptions of the models are robust to differences in power between agents, and misalignment between agents, which are generally the two key facts that people point to when arguing why these models are wrong when applied to AI. Yet this alleged distinction appears to merely reflect a misunderstanding of the modeling assumptions, rather than any key difference between humans and AIs.
What's interesting to me is that many people generally have no problem generalizing these economic models to other circumstances. For example, we could ask:
In each case, I generally encounter AI risk proponents claiming that what distinguishes these cases from the case of AI is that, in these cases, we can assume that the genetically engineered humans and human emulations will be "aligned" with human values, which adequately explains why they will attempt to compromise rather than go to war with the ordinary biological humans. But as I have already explained, standard economic models of war do not predict that war is constrained by alignment to human values, but is instead constrained by the costs of war, and the relative benefits of trade compared to war.
To the extent you think these economic models of war are simply incorrect, then I think it is worth explicitly engaging with the established social science literature, rather than inventing a new model that makes unique predictions about what non-human AIs, which by definition do not share human values, would apparently do.
In (e.g.) GPT-4 trained via RL from human feedback, it is true that it typically executes your instructions as intended. However, sometimes it doesn’t and, moreover, there are theoretical reasons to think that this would stop being the case if the system was sufficiently powerful to do an action which would maximize human feedback but which does not consist in executing instructions as intended (e.g., by deceiving human raters).
It is true that GPT-4 "sometimes" fails to follow human instructions, but the same could be said about humans. I think it's worth acknowledging the weight of the empirical evidence here regardless.
In my opinion the empirical evidence generally seems way stronger than the theoretical arguments, which (so far) seem to have had little success predicting when and how alignment would be difficult. For example, many people believed that AGI would be achieved by the time AIs were having natural conversations with humans (e.g. Eliezer Yudkowsky implied as much in his essay about a fire alarm[1]). According to this prediction, we should have already been having pretty severe misspecification problems if such problems were supposed to arise at AGI-level. And yet, I claim, we are not having these severe problems (and instead, we are merely having modestly difficult problems that can be patched with sufficient engineering effort).
It is true that problems of misspecification should become more difficult as AIs get smarter. However, it's important to recognize that as AI capabilities grow, so too will our tools and methods for tackling these alignment challenges. One key factor is that we will have increasingly intelligent AI systems that can assist us in the alignment process itself. To illustrate this point concretely, let's walk through a hypothetical scenario:
Suppose that aligning a human-level artificial general intelligence (AGI) merely requires a dedicated team of human alignment researchers. This seems generally plausible given that evaluating output is easier than generating novel outputs (see this article that goes into more detail about this argument and why it's relevant). Once we succeed in aligning that human-level AGI system, we can then leverage it to help us align the next iteration of AGI that is slightly more capable than human-level (let's call it AGI+). We would have a team of aligned human-level AGIs working on this challenge with us.
Then, when it comes to aligning the following iteration, AGI++ (which is even more intelligent), we can employ the AGI+ systems we previously aligned to work on this next challenge. And so on, with each successive generation of AI systems helping us to align the next, even more advanced generation.
It seems plausible that this cycle of AI systems assisting in the alignment of future, more capable systems could continue for a long time, allowing us to align AIs of ever-increasing intelligence without at any point needing mere humans to solve the problem of superintelligent alignment alone. If at some point this cycle becomes unsustainable, we can expect the highly intelligent AI advisors we have at that point to warn us about the limits of this approach. This would allow us to recognize when we are reaching the limits of our ability to maintain reliable alignment.
Full quote from Eliezer: "When they are very impressed by how smart their AI is relative to a human being in respects that still feel magical to them; as opposed to the parts they do know how to engineer, which no longer seem magical to them; aka the AI seeming pretty smart in interaction and conversation; aka the AI actually being an AGI already."
I read most of this paper, albeit somewhat quickly and skipped a few sections. I appreciate how clear the writing is, and I want to encourage more AI risk proponents to write papers like this to explain their views. That said, I largely disagree with the conclusion and several lines of reasoning within it.
Here are some of my thoughts (although these are not my only disagreements):
I don't think I'm going to flesh this argument out to an extent to which you'd find it sufficiently rigorous or convincing, sorry.
Getting a bit meta, I'm curious (if you'd like to answer) whether you feel that you won't explain your views rigorously in a convincing way here mainly because (1) you are uncertain about these specific views, (2) you think your views are inherently difficult or costly to explain despite nonetheless being very compelling, (3) you think I can't understand your views easily because I'm lacking some bedrock intuitions that are too costly to convey, or (4) some other option.
I can currently observe humans, which screens off a bunch of the comparison and lets me do direct analysis.
I'm in agreement that this consideration makes it hard to do a direct comparison. But I think this consideration should mostly make us more uncertain, rather than making us think that humans are better than the alternative. Analogy: if you rolled a die, and I didn't see the result, the expected value is not low just because I am uncertain about what happened. What matters here is the expected value, not necessarily the variance.
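To make the die analogy concrete (a minimal sketch of my own, just illustrating the point): not observing the roll leaves the expected value unchanged; the uncertainty shows up in the variance, not the mean.

```python
# A fair six-sided die. Before the roll is observed, the mean is
# unchanged by our ignorance; only the spread of outcomes is wide.
faces = [1, 2, 3, 4, 5, 6]
mean = sum(faces) / len(faces)
variance = sum((x - mean) ** 2 for x in faces) / len(faces)

print(mean)      # 3.5
print(variance)  # about 2.92
```

The analogous claim here is that uncertainty about whether humans or AIs are better on some axis should widen our error bars, not drag the expectation toward "humans are better."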
I can directly observe AIs and make predictions of future training methods and their values seem to result from a much more heavily optimized and precise thing with less "slack" in some sense. (Perhaps this is related to genetic bottleneck, I'm unsure.)
I guess I am having trouble understanding this point.
AIs will be primarily trained in things which look extremely different from "cooperatively achieving high genetic fitness".
Sure, but the question is why being different makes it worse along the relevant axes that we were discussing. The question is not just "will AIs be different than humans?" to which the answer would be "Obviously, yes". We're talking about why the differences between humans and AIs make AIs better or worse in expectation, not merely different.
Current AIs seem to use the vast, vast majority of their reasoning power for purposes which aren't directly related to their final applications. I predict this will also apply for internal high level reasoning of AIs. This doesn't seem true for humans.
I am having a hard time parsing this claim. What do you mean by "final applications"? And why won't this be true for future AGIs that are at human-level intelligence or above? And why does this make a difference to the ultimate claim that you're trying to support?
Humans seem optimized for something which isn't that far off from utilitarianism from some perspective? Make yourself survive, make your kin group survive, make your tribe survive, etc? I think utilitarianism is often a natural generalization of "I care about the experience of XYZ, it seems arbitrary/dumb/bad to draw the boundary narrowly, so I should extend this further" (This is how I get to utilitarianism.) I think the AI optimization looks considerably worse than this by default.
This consideration seems very weak to me. Early AGIs will presumably be directly optimized for something like consumer value, which looks a lot closer to "utilitarianism" to me than the implicit values in gene-centered evolution. When I talk to GPT-4, I find that it's way more altruistic and interested in making others happy than most humans are. This seems at least a little like utilitarianism to me -- more so than your description of what human evolution was optimizing for. But maybe I'm just not understanding the picture you're painting well enough. Or maybe my model of AI is wrong.
I am a human.
"Human" is just one category you belong to. You're also a member of the category "intelligent beings", which you will share with AGIs. Another category you share with near-future AGIs is "beings who were trained on 21st century cultural data". I guess 12th century humans aren't in that category, so maybe we don't share their values?
Perhaps the category that matters is your nationality. Or maybe it's "beings in the Milky Way", and you wouldn't trust people from Andromeda? (To be clear, this is rhetorical, not an actual suggestion)
My point here is that I think your argument could benefit from some rigor by specifying exactly what about being human makes someone share your values in the sense you are describing. As it stands, this reasoning seems quite shallow to me.
In this "quick take", I want to summarize some my idiosyncratic views on AI risk.
My goal here is to list just a few ideas that cause me to approach the subject differently from how I perceive most other EAs view the topic. These ideas largely push me in the direction of making me more optimistic about AI, and less likely to support heavy regulations on AI.
(Note that I won't spend a lot of time justifying each of these views here. I'm mostly stating these points without lengthy justifications, in case anyone is curious. These ideas can perhaps inform why I spend significant amounts of my time pushing back against AI risk arguments. Not all of these ideas are rare, and some of them may indeed be popular among EAs.)
By comparison, I find it more likely that no individual AI will ever be strong enough to take over the world, in the sense of overthrowing the world's existing institutions and governments by surprise. Instead, I broadly expect AIs will integrate into society and try to accomplish their goals by advocating for their legal rights, rather than trying to overthrow our institutions by force. Upon attaining legal personhood, unaligned AIs can utilize their legal rights to achieve their objectives, for example by getting a job and trading their labor for property, within the already-existing institutions. Because the world is not zero sum, and there are economic benefits to scale and specialization, this argument implies that unaligned AIs may well have a net-positive effect on humans, as they could trade with us, producing value in exchange for our own property and services.
Note that my claim here is not that AIs will never become smarter than humans. One way of seeing how these two claims are distinguished is to compare my scenario to the case of genetically engineered humans. By assumption, if we genetically engineered humans, they would presumably eventually surpass ordinary humans in intelligence (along with social persuasion ability, and ability to deceive etc.). However, by itself, the fact that genetically engineered humans will become smarter than non-engineered humans does not imply that genetically engineered humans would try to overthrow the government. Instead, as in the case of AIs, I expect genetically engineered humans would largely try to work within existing institutions, rather than violently overthrow them.
It is conceivable that GPT-4's apparently ethical nature is fake. Perhaps GPT-4 is lying about its motives to me and in fact desires something completely different than what it professes to care about. Maybe GPT-4 merely "understands" or "predicts" human morality without actually "caring" about human morality. But while these scenarios are logically possible, they seem less plausible to me than the simple alternative explanation that alignment—like many other properties of ML models—generalizes well, in the natural way that you might similarly expect from a human.
Of course, the fact that GPT-4 is easily alignable does not immediately imply that smarter-than-human AIs will be easy to align. However, I think this current evidence is still significant, and aligns well with prior theoretical arguments that alignment would be easy. In particular, I am persuaded by the argument that, because evaluation is usually easier than generation, it should be feasible to accurately evaluate whether a slightly-smarter-than-human AI is taking unethical actions, allowing us to shape its rewards during training accordingly. After we've aligned a model that's merely slightly smarter than humans, we can use it to help us align even smarter AIs, and so on, plausibly implying that alignment will scale to indefinitely higher levels of intelligence, without necessarily breaking down at any physically realistic point.
I'm quite skeptical of this argument because I think that the default response to AI (in the absence of intervention from the EA community) will already be quite strong. My view here is informed by the base rate of technologies being overregulated, which I think is quite high. In fact, it is difficult for me to name even a single technology that I think is currently underregulated by society. By pushing for more regulation on AI, I think it's likely that we will overshoot and over-constrain AI relative to the optimal level.
In other words, my personal bias is towards thinking that society will regulate technologies too heavily, rather than too loosely. And I don't see a strong reason to think that AI will be any different from this general historical pattern. This makes me hesitant to push for more regulation on AI, since on my view, the marginal impact of my advocacy would likely be to push us even further in the direction of "too much regulation", overshooting the optimal level by even more than what I'd expect in the absence of my advocacy.
Since unaligned AIs will likely be both conscious and share human social and moral concepts, I don't see much reason to think of them as less "deserving" of life and liberty, from a cosmopolitan moral perspective. They will likely think similarly to the way we do across a variety of relevant axes, even if their neural structures are quite different from our own. As a consequence, I am pretty happy to incorporate unaligned AIs into the legal system and grant them some control of the future, just as I'd be happy to grant some control of the future to human children, even if they don't share my exact values.
Put another way, I view (what I perceive as) the EA attempt to privilege "human values" over "AI values" as being largely arbitrary and baseless, from an impartial moral perspective. There are many humans whose values I vehemently disagree with, but I nonetheless respect their autonomy, and do not wish to deny these humans their legal rights. Likewise, even if I strongly disagreed with the values of an advanced AI, I would still see value in their preferences being satisfied for their own sake, and I would try to respect the AI's autonomy and legal rights. I don't have a lot of faith in the inherent kindness of human nature relative to a "default unaligned" AI alternative.
I think these benefits are large and important, and commensurate with the downside potential of existential risks. While a fully committed strong longtermist might scoff at the idea that curing aging might be important — as it would largely only have short-term effects, rather than long-term effects that reverberate for billions of years — by contrast, I think it's really important to try to improve the lives of people who currently exist. Many people view this perspective as a form of moral partiality that we should discard for being arbitrary. However, I think morality is itself arbitrary: it can be anything we want it to be. And I choose to value currently existing humans, to a substantial (though not overwhelming) degree.
This doesn't mean I'm a fully committed near-termist. I sympathize with many of the intuitions behind longtermism. For example, if curing aging required raising the probability of human extinction by 40 percentage points, or something like that, I don't think I'd do it. But in more realistic scenarios that we are likely to actually encounter, I think it's plausibly a lot better to accelerate AI, rather than delay AI, on current margins. This view simply makes sense to me given the enormously positive effects I expect AI will likely have on the people I currently know and love, if we allow development to continue.