3507 karmaJoined


In my latest post I talked about whether unaligned AIs would produce more or less utilitarian value than aligned AIs. To be honest, I'm still quite confused about why many people seem to disagree with the view I expressed, and I'm interested in engaging more to get a better understanding of their perspective.

At the least, I thought I'd write a bit more about my thoughts here, and clarify my own views on the matter, in case anyone is interested in trying to understand my perspective.

The core thesis that was trying to defend is the following view:

My view: It is likely that by default, unaligned AIs—AIs that humans are likely to actually build if we do not completely solve key technical alignment problems—will produce comparable utilitarian value compared to humans, both directly (by being conscious themselves) and indirectly (via their impacts on the world). This is because unaligned AIs will likely both be conscious in a morally relevant sense, and they will likely share human moral concepts, since they will be trained on human data.

Some people seem to merely disagree with my view that unaligned AIs are likely to be conscious in a morally relevant sense. And a few others have a semantic disagreement with me in which they define AI alignment in moral terms, rather than the ability to make an AI share the preferences of the AI's operator. 

But beyond these two objections, which I feel I understand fairly well, there's also significant disagreement about other questions. Based on my discussions, I've attempted to distill the following counterargument to my thesis, which I fully acknowledge does not capture everyone's views on this subject:

Perceived counter-argument: The vast majority of utilitarian value in the future will come from agents with explicitly utilitarian preferences, rather than those who incidentally achieve utilitarian objectives. At present, only a small proportion of humanity holds partly utilitarian views. However, as unaligned AIs will differ from humans across numerous dimensions, it is plausible that they will possess negligible utilitarian impulses, in stark contrast to humanity's modest (but non-negligible) utilitarian tendencies. As a result, it is plausible that almost all value would be lost, from a utilitarian perspective, if AIs were unaligned with human preferences.

Again, I'm not sure if this summary accurately represents what people believe. However, it's what some seem to be saying. I personally think this argument is weak. But I feel I've had trouble making my views very clear on this subject, so I thought I'd try one more time to explain where I'm coming from here. Let me respond to the two main parts of the argument in some amount of detail:

(i) "The vast majority of utilitarian value in the future will come from agents with explicitly utilitarian preferences, rather than those who incidentally achieve utilitarian objectives."

My response:

I am skeptical of the notion that the bulk of future utilitarian value will originate from agents with explicitly utilitarian preferences. This clearly does not reflect our current world, where the primary sources of happiness and suffering are not the result of deliberate utilitarian planning. Moreover, I do not see compelling theoretical grounds to anticipate a major shift in this regard.

I think the intuition behind the argument here is something like this:

In the future, it will become possible to create "hedonium"—matter that is optimized to generate the maximum amount of utility or well-being. If hedonium can be created, it would likely be vastly more important than anything else in the universe in terms of its capacity to generate positive utilitarian value.

The key assumption is that hedonium would primarily be created by agents who have at least some explicit utilitarian goals, even if those goals are fairly weak. Given the astronomical value that hedonium could potentially generate, even a tiny fraction of the universe's resources being dedicated to hedonium production could outweigh all other sources of happiness and suffering.

Therefore, if unaligned AIs would be less likely to produce hedonium than aligned AIs (due to not having explicitly utilitarian goals), this would be a major reason to prefer aligned AI, even if unaligned AIs would otherwise generate comparable levels of value to aligned AIs in all other respects.

If this is indeed the intuition driving the argument, I think it falls short for a straightforward reason. The creation of matter-optimized-for-happiness is more likely to be driven by the far more common motives of self-interest and concern for one's inner circle (friends, family, tribe, etc.) than by explicit utilitarian goals. If unaligned AIs are conscious, they would presumably have ample motives to optimize for positive states of consciousness, even if not for explicitly utilitarian reasons.

In other words, agents optimizing for their own happiness, or the happiness of those they care about, seem likely to be the primary force behind the creation of hedonium-like structures. They may not frame it in utilitarian terms, but they will still be striving to maximize happiness and well-being for themselves and others they care about regardless. And it seems natural to assume that, with advanced technology, they would optimize pretty hard for their own happiness and well-being, just as a utilitarian might optimize hard for happiness when creating hedonium.

In contrast to the number of agents optimizing for their own happiness, the number of agents explicitly motivated by utilitarian concerns is likely to be much smaller. Yet both forms of happiness will presumably be heavily optimized. So even if explicit utilitarians are more likely to pursue hedonium per se, their impact would likely be dwarfed by the efforts of the much larger group of agents driven by more personal motives for happiness-optimization. Since both groups would be optimizing for happiness, the fact that hedonium is similarly optimized for happiness doesn't seem to provide much reason to think that it would outweigh the utilitarian value of more mundane, and far more common, forms of utility-optimization.

To be clear, I think it's totally possible that there's something about this argument that I'm missing here. And there are a lot of potential objections I'm skipping over here. But on a basic level, I mostly just lack the intuition that the thing we should care about, from a utilitarian perspective, is the existence of explicit utilitarians in the future, for the aforementioned reasons. The fact that our current world isn't well described by the idea that what matters most is the number of explicit utilitarians, strengthens my point here.

(ii) "At present, only a small proportion of humanity holds partly utilitarian views. However, as unaligned AIs will differ from humans across numerous dimensions, it is plausible that they will possess negligible utilitarian impulses, in stark contrast to humanity's modest (but non-negligible) utilitarian tendencies."

My response:

Since only a small portion of humanity is explicitly utilitarian, the argument's own logic suggests that there is significant potential for AIs to be even more utilitarian than humans, given the relatively low bar set by humanity's limited utilitarian impulses. While I agree we shouldn't assume AIs will be more utilitarian than humans without specific reasons to believe so, it seems entirely plausible that factors like selection pressures for altruism could lead to this outcome. Indeed, commercial AIs seem to be selected to be nice and helpful to users, which (at least superficially) seems "more utilitarian" than the default (primarily selfish-oriented) impulses of most humans. The fact that humans are only slightly utilitarian should mean that even small forces could cause AIs to exceed human levels of utilitarianism.

Moreover, as I've said previously, it's probable that unaligned AIs will possess morally relevant consciousness, at least in part due to the sophistication of their cognitive processes. They are also likely to absorb and reflect human moral concepts as a result of being trained on human-generated data. Crucially, I expect these traits to emerge even if the AIs do not share human preferences. 

To see where I'm coming from, consider how humans routinely are "misaligned" with each other, in the sense of not sharing each other's preferences, and yet still share moral concepts and a common culture. For example, an employee can share moral concepts with their employer while having very different consumption preferences from them. This picture is pretty much how I think we should primarily think about unaligned AIs that are trained on human data, and shaped heavily by techniques like RLHF or DPO.

Given these considerations, I find it unlikely that unaligned AIs would completely lack any utilitarian impulses whatsoever. However, I do agree that even a small risk of this outcome is worth taking seriously. I'm simply skeptical that such low-probability scenarios should be the primary factor in assessing the value of AI alignment research.

Intuitively, I would expect the arguments for prioritizing alignment to be more clear-cut and compelling than "if we fail to align AIs, then there's a small chance that these unaligned AIs might have zero utilitarian value, so we should make sure AIs are aligned instead". If low probability scenarios are the strongest considerations in favor of alignment, that seems to undermine the robustness of the case for prioritizing this work.

While it's appropriate to consider even low-probability risks when the stakes are high, I'm doubtful that small probabilities should be the dominant consideration in this context. I think the core reasons for focusing on alignment should probably be more straightforward and less reliant on complicated chains of logic than this type of argument suggests. In particular, as I've said before, I think it's quite reasonable to think that we should align AIs to humans for the sake of humans. In other words, I think it's perfectly reasonable to admit that solving AI alignment might be a great thing to ensure human flourishing in particular.

But if you're a utilitarian, and not particularly attached to human preferences per se (i.e., you're non-speciesist), I don't think you should be highly confident that an unaligned AI-driven future would be much worse than an aligned one, from that perspective.

I disagree with the implied theses in statements like "I'm not very sympathetic to pausing or slowing down AI as a policy proposal."

This overlooks my arguments in section 3, which were absolutely critical to forming my opinion here. My argument here can be summarized as follows:

  • The utilitarian arguments for technical alignment research seem weak, because AIs are likely to be conscious like us, and also share human moral concepts.
  • By contrast, technical alignment research seems clearly valuable if you care about humans who currently exist, since AIs will presumably be directly aligned to them.
  • However, pausing AI for alignment reasons seems pretty bad for humans who currently exist (under plausible models of the tradeoff).
  • I have sympathies to both utilitarianism and the view that current humans matter. The weak considerations favoring pausing AI on the utilitarian side don't outweigh the relatively much stronger and clearer arguments against pausing for currently existing humans.

The last bullet point is a statement about my values. It is not a thesis independently of my values. I feel this was pretty explicit in the post.

If you wrote a post that just said "look, we're super uncertain about things, here's your reminder that there are worlds in which alignment work is negative", I'd be on board with it. But it feels like a motte-and-bailey to write a post that is clearly trying to cause the reader to feel a particular way about some policy, and then retreat to "well my main thesis was very weak and unobjectionable".

I'm not just saying "there are worlds in which alignment work is negative". I'm saying that it's fairly plausible. I'd say greater than 30% probability. Maybe higher than 40%. This seems perfectly sufficient to establish the position, which I argued explicitly, that the alternative position is "fairly weak". 

It would be different if I was saying "look out, there's a 10% chance you could be wrong". I'd agree that claim would be way less interesting.

I don't think what I said resembles a motte-and-bailey, and I suspect you just misunderstood me.


Well, I can believe it's weak in some absolute sense. My claim is that it's much stronger than all of the arguments you make put together.

Part of me feels like this statement is an acknowledgement that you fundamentally agree with me. You think the argument in favor of unaligned AIs being less utilitarian than humans is weak? Wasn't that my thesis? If you started at a prior of 50%, and then moved to 65% because of a weak argument, and then moved back to 60% because of my argument, then isn't that completely consistent with essentially every single thing I said? OK, you felt I was saying the probability is like 50%. But 60% really isn't far off, and it's consistent with what I wrote (I mentioned "weak reasons" in the post). Perhaps like 80% of the reason why you disagree here is because you think my thesis was something else.

More generally I get the sense that you keep misinterpreting me as saying things that are different or stronger than what I intended. That's reasonable given that this is a complicated and extremely nuanced topic. I've tried to express areas of agreement when possible, both in the post and in reply to you. But maybe you have background reasons to expect me to argue a very strong thesis about utilitarianism. As a personal statement, I'd encourage you to try to read me as saying something closer to the literal meaning of what I'm saying, rather than trying to infer what I actually believe underneath the surface.]

I have lots of other disagreements with the rest of what you wrote, although I probably won't get around to addressing them. I mostly think we just disagree on some basic intuitions about how alien-like default unaligned AIs will actually be in the relevant senses. I also disagree with your reversal tests, because I think they're not actually symmetric, and I think you're omitting the best arguments for thinking that they're asymmetric.

This, in addition to the comment I previously wrote, will have to suffice as my reply.

Just a quick reply (I might reply more in-depth later but this is possibly the most important point):

I agree that the fact we are aligning AI should make one more optimistic. Could you define what you mean by "unaligned AI"? It seems quite plausible that I will agree with your position, and think it amounts to something like "we were pretty successful with alignment".

In my post I talked about the "default" alternative to doing lots of alignment research. Do you think that if AI alignment researchers quit tomorrow, engineers would stop doing RLHF etc. to their models? That they wouldn't train their AIs to exhibit human-like behaviors, or to be human-compatible?

It's possible my language was misleading by giving an image of what unaligned AI looks like that isn't actually a realistic "default" in any scenario. But when I talk about unaligned AI, I'm simply talking about AI that doesn't share the preferences of humans (either its creator or the user). Crucially, humans are routinely misaligned in this sense. For example, employees don't share the exact preferences of their employer (otherwise they'd have no need for a significant wage). Yet employees are still typically docile, human-compatible, and assimilated to the overall culture.

This is largely the picture I think we should imagine when we think about the "default" unaligned alternative, rather than imaging that humans will create something far more alien, far less docile, and therefore something with far less economic value.

(As an aside, I thought this distinction wasn't worth making because I thought most readers would have already strongly internalized the idea that RLHF isn't "real alignment work". I suspect I was mistaken, and probably confused a ton of people.)

Here are a few (long, but high-level) comments I have before responding to a few specific points that I still disagree with:

  • I agree there are some weak reasons to think that humans are likely to be more utilitarian on average than unaligned AIs, for basically the reasons you talk about in your comment (I won't express individual agreement with all the points you gave that I agree with, but you should know that I agree with many of them). 

    However, I do not yet see any strong reasons supporting your view. (The main argument seems to be: AIs will be different than us. You label this argument as strong but I think it is weak.) More generally, I think that if you're making hugely consequential decisions on the basis of relatively weak intuitions (which is what I believe many effective altruists do in this context), you should be very cautious. The lack of robust evidence for your position seems sufficient, in my opinion, for the main thesis of my original post to hold. (I think I was pretty careful in my language not to overstate the main claims.)
  • I suspect you may have an intuition that unaligned AIs will be very alien-like in certain crucial respects, but I predict this intuition will ultimately prove to be mistaken. In contrast, I think the fact that these AIs will be trained on human-generated data and deliberately shaped by humans to fulfill human-like functions and to be human-compatible should be given substantial weight. These factors make it quite likely, in my view, that the resulting AI systems will exhibit utilitarian tendencies to a significant degree, even if they do not share the preferences of either their users or their creators (for instance, I would guess that GPT-4 is already more utilitarian than the average human, in a meaningful sense).

    There is a strong selection pressure for AIs to display outward behaviors that are not overly alien-like. Indeed, the pressure seems to be for AIs to be inhumanly altruistic and kind in their actions. I am not persuaded by the idea that it's probable for AIs to be entirely human-compatible on the surface while being completely alien underneath, even if we assume they do not share human preferences (e.g., the "shoggoth" meme).
  • I disagree with the characterization that my argument relies primarily on the notion that "you can't rule out" the possibility of AIs being even more utilitarian than humans. In my previous comment, I pointed out that AIs could potentially have a higher density of moral value per unit of matter, and I believe there are straightforward reasons to expect this to be the case, as AIs could be optimized very efficiently in terms of physical space. This is not merely a "you can't rule it out" type of argument, in my view.

    Similarly, in the post, I pointed out that humans have many anti-utilitarian intuitions and it seems very plausible that AIs would not share (or share fewer of) these intuitions. To give another example (although it was not prominent in the post), in a footnote I alluded to the idea that AIs might care more about reproduction than humans (who by comparison, seem to want to have small population sizes with high per-capita incomes, rather than large population sizes with low per capita incomes as utilitarianism would recommend). This too does not seem like a mere "you cannot rule it out" argument to me, although I agree it is not the type of knockdown argument you'd expect if my thesis were stated way stronger than it actually was.
  • I think you may be giving humans too much credit for being slightly utilitarian. To the extent that there are indeed many humans who are genuinely obsessed with actively furthering utilitarian objectives, I agree that your argument would have more force. However, I think that this is not really what we actually observe in the real world to a large degree. I think it's exaggerated at least; even within EA I think that's somewhat rare.
  • I suspect there is a broader phenomenon at play here, whereby people (often those in the EA community) attribute a wide range of positive qualities to humans (such as the idea that our values converge upon reflection, or the idea that humans will get inherently kinder as they get wealthier) which, in my opinion, do not actually reflect the realities of the world we live in. These ideas seem (to me) to be routinely almost entirely disconnected from any empirical analysis of actual human behavior, and they sometimes appear to be more closely related to what the person making the claim wishes to be true in some kind of idealized, abstract sense (though I admit this sounds highly uncharitable).

    My hypothesis is that this tendency can maybe perhaps be explained by a deeply ingrained intuition that identifies the species boundary of "humans" as being very special, in the sense that virtually all moral value is seen as originating from within this boundary, sharply distinguishing it from anything outside this boundary, and leading to an inherent suspicion of non-human entities. This would explain, for example, why there is so much focus on "human values" (and comparatively little on drawing the relevant "X values" boundary along different lines), and why many people seem to believe that human emulations would be clearly preferable to de novo AI. I do not really share this intuition myself.

I can believe that if the population you are trying to predict for is just humans, almost all of whom have at least some affective empathy. But I'd feel pretty surprised if this were true in whatever distribution over unaligned AIs we're imagining.

My basic thoughts here are: on the one hand we have real world data points which can perhaps relevantly inform the degree to which affective empathy actually predicts utilitarianism, and on the other hand we have an intuition that it should be predictive across beings of very different types. I think the real world data points should epistemically count for more than the intuitions? More generally, I think it is hard to argue about what might be true if real world data counts for less than intuitions.

Maybe the argument is that if they are more conscious-in-the-sense-of-feeling-pleasure-and-pain they are more likely to be utilitarians? If so I might buy that but feel like it's a weak effect.

Isn't this the effect you alluded to, when you named reasons why some humans are utilitarians?

I agree, but I think very few people want to acquire e.g. 10 T$ of resources without broad consent of others.

I think I simply disagree with the claim here. I think it's not true. I think many people would want to acquire $10T without the broad consent of others, if they had the ability to obtain such wealth (and they could actually spend it; here I'm assuming they actually control this quantity of resources and don't get penalized because of the fact it was acquired without the broad consent of others, because that would change the scenario). It may be that fewer than 50% of people have such a desire. I'd be very surprised if it were <1% and, I'd even be surprised if it was <10%.

I agree biological humans will likely become an increasingly small fraction of the world, but it does not follow that AI carries a great risk to humas[1]. I would not say people born after 1960 carry a great risk risk to people born before 1960, even though the fraction of the global resources controlled by the latter is becoming increasingly small.

I think humans born after 1960 do pose a risk to humans born before 1960 in some ordinary senses. For example, the younger humans could vote to decrease medical spending, which could lead to early death for the older humans. They could also vote to increase taxes on people who have accumulated a lot of wealth, which very disproportionately hurts old people. This is not an implausible risk either; I think these things have broadly happened many times in the past.

That said, I suspect part of the disagreement here is about time scales. In the short and medium term, I agree: I'm not so much worried about AI posing a risk to humanity. I was really only talking about long-term scenarios in my above comment.

In my mind, very few humans would want to pursue capabilities which are conducive to gaining control over humanity.

This seems false. Plenty of people want wealth and power, which are "conducive to gaining control over [parts of] humanity". It is true that no single person has ever gotten enough power to actually get control over ALL of humanity, but that's presumably because of the difficulty of obtaining such a high level of power, rather than because few humans have ever pursued the capabilities that would be conducive towards that goal. Again, this distinction is quite important.

There are diminishing returns to having more resources. For example, if you give 10 M$ (0.01 % of global resources) to a random human, they will not have much of a desire to take risks to increase their wealth to 10 T$ (10 % of global resources), which would be helpful to gain control over humanity. To increase their own happiness and that of their close family and friends, they would do well by investing their newly acquired wealth in exchange-traded funds (ETFs). A good imitator AI would share our disposition of not gaining capabilities beyhond a certain point, and therefore (like humans) never get close to having a chance of gaining control over humanity.

I agree that a good imitator AI would likely share our disposition towards diminishing marginal returns to resource accumulation. This makes it likely that such AIs would not take very large risks. However, I still think the main reason why no human has ever taken control over humanity is because there was no feasible strategy that any human in the past could have taken to obtain such a high degree of control, rather than because all humans in the past have voluntarily refrained from taking the risks necessary to obtain that degree of control.

In fact, risk-neutral agents that don't experience diminishing returns to resource consumption will asymptotically eventually lose all their wealth in high-risk bets. Therefore, even without this human imitation argument, we shouldn't be much concerned about risk-neutral agents in most scenarios (including risks from reinforcement learners) since they're very likely to go bankrupt before they ever get to the point at which they can take over the world. Such agents are only importantly relevant in a very small fraction of worlds.

I think humans usually aquire power fairly gradually. A good imitator AI would be mindful that acquiring power too fast (suddenly fooming) would go very much against what humans usually do.

Again, the fact that humans acquire power gradually is more of a function of our abilities than it is a function of our desires. I repeat myself but this is important: these are critical facts to distinguish from each other. "Ability to" and "desire to" are very different features of the situation.

It is very plausible to me that some existing humans would "foom" if they had the ability. But in fact, no human has such an ability, so we don't see anyone fooming in the real world. This is mainly a reflection of the fact that humans cannot foom, not that they don't want to foom.

No human has ever had control over all humanity, so I agree there is a sense in which we have "zero data" about what humans would do under such conditions. Yet, I am still pretty confident that the vast majority of humans would not want to cause human extinction.

I am also "pretty confident" about that, but "pretty confident" is a relatively weak statement here. When evaluating this scenario, we are extrapolating into a regime in which we have no direct experience. It is one thing to say that we can be "pretty confident" in our extrapolations (and I agree with that); it is another thing entirely to imply that we have tons of data points directly backing up our prediction, based on thousands of years of historical evidence. We simply do not have that type of (strong) evidence.

I do not think this is the best comparison. There would arguably be many imitator AIs, and these would not gain near-omnipotent abilities overnight. I would say both of these greatly constrain the level of subjugation. Historically, innovations and new investions have spread out across the broader economy, so I think there should be a strong prior against a single imitator AI suddenly gaining control over all the other AIs and humans.

I agree, but this supports my point: I think imitator AIs are safe precisely because they will not have godlike powers. I am simply making the point that this is different from saying they are safe because they have human-like motives. Plenty of things in the world are safe because they are not very powerful. It is completely different if something is safe because its motives are benevolent and pure (even if it's extremely powerful).

How long-run are you talking about here? Humans 500 years ago arguably had little control over current humans, but this alone does not imply a high existential risk 500 years ago.

I agree with Robin Hanson on this question. However, I think humans will likely become an increasingly small fraction of the world over time, as AIs become a larger part of it. Just as hunter-gatherers are threatened by industrial societies, so too may biological humans one day become threatened by future AIs. Such a situation may not be very morally bad (or deserving the title "existential risk"), because humans are not the only morally important beings in the world. Yet, it is still true that AI carries a great risk to humanity.

What is the risk level above which you'd be OK with pausing AI?

My loose off-the-cuff response to this question is that I'd be OK with pausing if there was a greater than 1/3 chance of doom from AI, with the caveats that:

  • I don't think p(doom) is necessarily the relevant quantity. What matters is the relative benefit of pausing vs. unpausing, rather than the absolute level of risk.
  • "doom" lumps together a bunch of different types of risks, some of which I'm much more OK with compared to others. For example, if humans become a gradually weaker force in the world over time, and then eventually die off in some crazy accident in the far future, that might count as "humans died because of AI" but it's a lot different than a scenario in which some early AIs overthrow our institutions in a coup and then commit genocide against humans.
  • I think it would likely be more valuable to pause later in time during AI takeoff, rather than before AI takeoff

Under what conditions would you be happy to attend a protest? (LMK if you have already attended one!)

I attended the protest against Meta because I thought their approach to AI safety wasn't very thoughtful, although I'm still not sure it was a good decision to attend. I'm not sure what would make me happy to attend a protest, but these scenarios might qualify:

  • A company or government is being extremely careless about deploying systems that pose great risks to the world. (This doesn't count situations in which the system poses negligible risks but some future system could pose a greater risk.)
  • The protesters have clear, reasonable demands that I broadly agree with (e.g. they don't complain much about AI taking people's jobs, or AI being trained on copyrighted data, but are instead focused on real catastrophic risks that are directly addressed by the protest).

So e.g. if I thought humans were utilitarians primarily because it is simple to express in concepts that humans and AIs share, then I would agree with you. But in fact I feel like it is pretty important that humans feel pleasure and pain, and have empathy, to explain why some humans are utilitarians. (Mostly I think the "true explanation" will have to appeal to more than simplicity, and the additional features this "true explanation" will appeal to are very likely to differ between humans and AIs.)

Thanks for trying to better understand my views. I appreciate you clearly stating your reasoning in this comment, as it makes it easier for me to directly address your points and explain where I disagree.

You argued that feeling pleasure and pain, as well as having empathy, are important factors in explaining why some humans are utilitarians. You suggest that to the extent these reasons for being utilitarian don't apply to unaligned AIs, we should expect it to be less likely for them to be utilitarians compared to humans.

However, a key part of the first section of my original post was about whether unaligned AIs are likely to be conscious—which for the purpose of this discussion, seems roughly equivalent to whether they will feel pleasure and pain. I concluded that unaligned AIs are likely to be conscious for several reasons:

  1. Consciousness seems to be a fairly convergent function of intelligence, as evidenced by the fact that octopuses are widely accepted to be conscious despite sharing almost no homologous neural structures with humans. This suggests consciousness arises somewhat robustly in sufficiently sophisticated cognitive systems.
  2. Leading theories of consciousness from philosophy and cognitive science don't appear to predict that consciousness will be rare or unique to biological organisms. Instead, they tend to define consciousness in terms of information processing properties that AIs could plausibly share.
  3. Unaligned AIs will likely be trained in environments quite similar to those that gave rise to human and animal consciousness—for instance, they will be trained on human cultural data and, in the case of robots, will interact with physical environments. The evolutionary and developmental pressures that gave rise to consciousness in biological organisms would thus plausibly apply to AIs as well.

So in short, I believe unaligned AIs are likely to feel pleasure and pain, for roughly the reasons I think humans and animals do. Their consciousness would not be an improbable or fragile outcome, but more likely a robust product of being a highly sophisticated intelligent agent trained in environments similar to our own.

I did not directly address whether unaligned AIs would have empathy, though I find this fairly likely as well. At the very least, I expect they would have cognitive empathy—the ability to model and predict the experiences of others—as this is clearly instrumentally useful. They may lack affective empathy, i.e. the ability to share the emotions of others, which I agree could be important here. But it's notable that explicit utilitarianism seems, anecdotally, to be more common among people on the autism spectrum, who are characterized as having reduced affective empathy. This suggests affective empathy may not be strongly predictive of utilitarian motivations.

Let's say you concede the above points and say: "OK I concede that unaligned AIs might be conscious. But that's not at all assured. Unaligned AIs might only be 70% likely to be conscious, whereas I'm 100% certain that humans are conscious. So there's still a huge gap between the expected value of unaligned AIs vs. humans under total utilitarianism, in a way that overwhelmingly favors humans."

However, this line of argument would overlook the real possibility that unaligned AIs could be more conscious than humans, or have an even stronger tendency towards utilitarian motivations. This could be the case if, for instance, AIs are more cognitively sophisticated than humans or are more efficiently designed in a morally relevant sense. Given that the vast majority of humans do not seem to be highly motivated by utilitarian considerations, it doesn't seem like an unlikely possibility that AIs could exceed our utilitarian inclinations. Nor does it seem particularly unlikely that their minds could have a higher density of moral value per unit of energy, or matter.

We could similarly examine this argument in the context of considering other potential large changes to the world, such as creating human emulations, genetically engineered humans, or bringing back Neanderthals from extinction. In each case, I do not think the (presumably small) probability that the entities we are adding to the world are not conscious constitutes a knockdown argument against the idea that they would add comparable utilitarian value to the world compared to humans. The main reason is because these entities could be even better by utilitarian lights than humans are.

Indeed I feel like AIs probably build fewer pyramids in expectation, for basically the same reason. (The concrete hypothesis I generated for why humans build pyramids was "maybe pyramids were especially easy to build historically".)

This seems minor, but I think the relevant claim is whether AIs would build more pyramids going forward, compared to humans, rather than comparing to historical levels of pyramid construction among humans. If pyramids were easy to build historically, but this fact is no longer relevant, then that seems true now for both humans and AIs, into the foreseeable future. As a consequence it's hard for me to see a strong reason for preferring humans over AIs if you cared about pyramid-maximization. By essentially the same arguments I gave above about utilitarianism, I don't think there's a strong argument for thinking that aligning AIs is good from the perspective of pyramid maximization.

General note: I want to note that my focus on AI alignment is not necessarily coming from a utilitarian perspective. I work on AI alignment because in expectation I think a world with aligned AI will better reflect "my values"

This makes sense to me, but it's hard to say much about what's good from the perspective of your values if I don't know what those values are. I focused on total utilitarianism in the post because it's probably the most influential moral theory in EA, and it's the explicit theory used in Nick Bostrom's influential article Astronomical Waste, and this post was partly intended as a reply to that article (see the last few paragraphs of the post).

Fwiw, I reread the post again and still failed to find this idea in it

I'm baffled by your statement here. What did you think I was arguing when discussed whether "aligned AIs are more likely to have a preference for creating new conscious entities, furthering utilitarian objectives"? The conclusion of that section was that aligned AIs are plausibly not more likely to have such a preference, and therefore, human utilitarian preferences here are not "unusually high compared to other possibilities" (the relevant alternative possibility here being unaligned AI).

This was a central part of my post that I discussed at length. The idea that unaligned AIs might be similarly utilitarian or even more so, compared to humans, was a crucial part of my argument. If indeed unaligned AIs are very likely to be less utilitarian than humans, then much of my argument in the first section collapses, which I explicitly acknowledged. 

I consider your statement here to be a valuable data point about how clear my writing was and how likely I am to get my ideas across to others who read the post. That said, I believe I discussed this point more-or-less thoroughly.

ETA: Claude 3's summary of this argument in my post:

The post argued that the level of utilitarian values exhibited by humans is likely not unusually high compared to other possibilities, such as those of unaligned AIs. This argument was made in the context of discussing whether aligned AIs are more likely to have a preference for creating new conscious entities, thereby furthering utilitarian objectives.

The author presented several points to support this argument:

  1. Only a small fraction of humans are total utilitarians, and most humans do not regularly express strong preferences for adding new conscious entities to the universe.
  2. Some human moral intuitions directly conflict with utilitarian recommendations, such as the preference for habitat preservation over intervention to improve wild animal welfare.
  3. Unaligned AI preferences are unlikely to be completely alien or random compared to human preferences if the AIs are trained on human data. By sharing moral concepts with humans, unaligned AIs could potentially be more utilitarian than humans, given that human moral preferences are a mix of utilitarian and anti-utilitarian intuitions.
  4. Even in an aligned AI scenario, the consciousness of AIs will likely be determined mainly by economic efficiency factors during production, rather than by moral considerations.

The author concluded that these points undermine the idea that unaligned AI moral preferences will be clearly less utilitarian than the moral preferences of most humans, which are already not very utilitarian. This suggests that the level of utilitarian values exhibited by humans is likely not unusually high compared to other possibilities, such as those of unaligned AIs.

I agree with the title and basic thesis of this article but I find its argumentation weak.

First, we’ll offer a simple argument that a sufficiently advanced supervised learning algorithm, trained to imitate humans, would very likely not gain total control over humanity (to the point of making everyone defenseless) and then cause or allow human extinction from that position.

No human has ever gained total control over humanity. It would be a very basic mistake to think anyone ever has. Moreover, if they did so, very few humans would accept human extinction. An imitation learner that successfully gained total control over humanity and then allowed human extinction would, on both counts, be an extremely poor imitation of any human, and easily distinguishable from one, whereas an advanced imitation learner will likely imitate humans well.

This basic observation should establish that any conclusion to the contrary should be very surprising, and so a high degree of rigor should be expected from arguments to that effect.

The obvious reason why no human has ever gained total control over humanity is because no human has ever possessed the capability to do so, not because no human would make the choice to do so if given the opportunity. This distinction is absolutely critical, because if humans have historically lacked total control due to insufficient ability rather than unwillingness, then the quoted argument essentially collapses. That's because we have zero data on what a human would do if they suddenly acquired the power to exert total dominion over the rest of humanity. As a result, it is highly uncertain and speculative to claim that an AI imitating human behavior would refrain from seizing total control if it had that capability.

The authors seem to have overlooked this key distinction in their argument.

It takes no great leap of imagination to envision scenarios where, if a human was granted near-omnipotent abilities, some individuals would absolutely choose to subjugate the rest of humanity and rule over them in an unconstrained fashion. The primary reason I believe imitation learning is likely safe is that I am skeptical it will imbue AIs with godlike powers in the first place, not because I naively assume humans would nobly refrain from tyranny and oppression if they suddenly acquired such immense capabilities.

Note: Had the authors considered this point and argued that an imitation learner emulating humans would be safe precisely because it would not be very powerful, their argument would have been stronger. However, even if they had made this point, it likely would have provided only relatively weak support for the (perhaps implicit) thesis that building imitation learners is a promising and safe approach to building AIs. There are essentially countless proposals one can make for ensuring AI safety simply by limiting its capabilities. Relying solely on the weakness of an AI system as a safety guarantee seems like an unsound strategy to me in the long-run.

Load more