This is probably a Utilitarianism 101 question. Many/most people in EA seem to accept as a given that:

  1) Non-human agents' welfare can count toward utilitarian calculations (hence animal welfare)

  2) AGI welfare cannot count toward utilitarian calculations (otherwise an alternative to alignment would be working on an AGI which has a goal of maximizing copies of itself experiencing maximum utility, likely a much easier task)

Which means there should be a compelling argument, or Schelling point, which includes animals but not AGIs in the category of moral patients. But I haven't seen any and can't easily think of a good one myself. What's the deal here? Am I missing some important basic idea about utilitarianism?


[To be clear, this is not an argument against alignment work. I'm mostly just trying to improve my understanding of the matter, but insofar as there has to be an argument, it's one against whatever branches of utilitarianism say yielding the world to AIs is an acceptable choice.]


5 Answers

I think this argument mostly fails in claiming that 'create an AGI which has a goal of maximizing copies of itself experiencing maximum utility' is meaningfully different from just ensuring alignment. This is in some sense exactly what I am hoping to get from an aligned system. Doing this properly would likely have to involve empowering humanity and helping us figure out what 'maximum utility' looks like first, and then tiling the world with something CEV-like.

The only way this could make the problem easier, compared to a classic ambitious alignment goal of 'do whatever maximizes the utility of the world', is the provision that the world be tiled with copies of the AGI, which is likely suboptimal. But this could be worth it if it made the task easier?

The obvious argument for why it would is that creating copies of itself with high welfare will be in the interest of AGI systems with a wide variety of goals, which relaxes the alignment problem. But this does not seem true. A paperclip AI will not want to fill the world with copies of itself experiencing joy, love and beauty but rather with paperclips. An AI system will want to create copies of itself fulfilling its goals, not experiencing maximum utility by my values.

This argument risks identifying 'I care about the welfare (by my definition of welfare) of this agent' with 'I care about this agent getting to accomplish its goals'. As I am not a preference utilitarian I strongly reject this identification. 

Tl;dr: I do care significantly about the welfare of AI systems we build, but I don't expect those AI systems themselves to care much at all about their own welfare, unless we solve alignment.

I think this gets a lot right, though

>As I am not a preference utilitarian I strongly reject this identification.

While this does seem to be part of the confusion of the original question, I'm not sure (total) preference vs. hedonic utilitarianism is actually a crux here. An AI system pursuing a simple objective wouldn't want to maximize the number of satisfied AI systems; it would just pursue its objective (which might involve relatively few copies of itself with satisfied goals). So highly capable AI systems pursuing very simple or random goals aren't only bad by hedonic utilitarian lights; they're also bad by (total) preference utilitarian lights (not to mention "common sense ethics").

1 Alex P 17d
That's true, but I think robustly embedding a goal of "multiply" is much easier than actual alignment. You can express it mathematically, you can use evolution, etc. [To reiterate, I'm not advocating for any of this, I think any moral system that labels "humans replaced by AIs" as an acceptable outcome is a broken one]
Maybe, but is "multiply" enough to capture the goal we're talking about? "Maximize total satisfaction" seems much harder to specify (and to be robustly learned) - at least I don't know what function would map states of the world to total satisfaction.
1 Alex P 17d
Can you, um, coherently imagine an agent that does not try to achieve its own goals (assuming it has no conflicting goals)?
I can't, but I'm not sure I see your point?
1 Alex P 17d
My point is, getting the "multiply" part right is sufficient; the AI will take care of the "satisfaction" part on its own, especially given that it's able to reprogram itself. This assumes "[perceived] goal achievement" == "satisfaction" (aka utility), which was my assumption all along, but apparently is only true under preference utilitarianism.
I'm struggling to articulate how confused this seems in the context of machine learning. (I think my first objection is something like: the way in which "multiply" could be specified and the way in which an AI system pursues satisfaction are very different; one could be an aspect of the AI's training process, while another is an aspect of the AI's behavior. So even if these two concepts each describe aspects of the AI system's objectives/behavior, that doesn't mean its goal is to "multiply satisfaction." That's sort of like arguing that a sink gets built to be sturdy, and it gives people water, therefore it gives people sturdy water; we can't just mash together related concepts and assume our claims about them will be right.) (If you're not yet familiar with the basics of machine learning and this distinction, I think that could be helpful context.)
1 Alex P 17d
I am familiar with the basics of ML and the concept of mesa-optimizers. "Building copies of itself" (i.e. multiply) is an optimization goal you'd have to specifically train into the system, I don't argue with that; I just think it's a simple and "natural" goal (in the sense that it aligns reasonably well with instrumental convergence) that you can robustly train in comparatively easily. "Satisfaction", however, is not a term that I've met in the ML or mesa-optimizer context, and I think the confusion comes from us mapping this term differently onto these domains. In my view, "satisfaction" roughly corresponds to "loss function minimization" in ML terminology: the lower an AI's loss, the higher the satisfaction it "experiences" (literally or metaphorically, depending on the kind of AI). Since any AI [built under the modern paradigm] is already working to minimize its own loss function, whatever that happens to be, we wouldn't need to care much about the exact shape of the loss function it learns, except that it should robustly include "building copies of itself". And since we're presumably talking about super-human AIs here, they would be very good at minimizing that loss function. So e.g. they can have some stupid goal like "maximize paperclips & build copies of self"; they'll convert the universe to some mix of paperclips and AIs and experience extremely high satisfaction about it. But you seem to mean something very different when you say "satisfaction"? Do you mind stating explicitly what it is?
Ah sorry, I had totally misunderstood your previous comment. (I had interpreted "multiply" very differently.) With that context, I retract my last response. By "satisfaction" I meant high performance on its mesa-objective (insofar as it has one), though I suspect our different intuitions come from elsewhere. I think I'm still skeptical on two points:
* Whether this is significantly easier than other complex goals
  * (The "robustly" part seems hard.)
* Whether this actually leads to a near-best outcome according to total preference utilitarianism
  * If satisfying some goals is cheaper than satisfying others to the same extent, then the details of the goal matter a lot
  * As a kind of silly example, "maximize silicon & build copies of self" might be much easier to satisfy than "maximize paperclips & build copies of self." If so, a (total) preference utilitarian would consider it very important that agents have the former goal rather than the latter.
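
The cost point can be made concrete with a toy calculation. This is only a sketch under made-up assumptions: the resource budget, the per-agent costs, and the function name `satisfied_agents` are all hypothetical and describe no real system. It just shows that, with a fixed budget, a cheaper-to-satisfy goal yields many more fully satisfied agents.

```python
def satisfied_agents(resource_budget: float, cost_per_agent: float) -> int:
    """Number of agents whose goal can be fully satisfied within the budget."""
    if cost_per_agent <= 0:
        raise ValueError("cost_per_agent must be positive")
    return int(resource_budget // cost_per_agent)


# Hypothetical numbers, chosen only for illustration.
BUDGET = 1_000_000.0
COSTS = {
    "maximize silicon & build copies of self": 10.0,
    "maximize paperclips & build copies of self": 250.0,
}

# Total count of satisfied agents per goal under the same budget.
totals = {goal: satisfied_agents(BUDGET, cost) for goal, cost in COSTS.items()}

for goal, count in totals.items():
    print(f"{goal}: {count} satisfied agents")
```

Under these assumed costs, a total preference utilitarian who counts satisfied agents would strongly prefer the cheaper goal, which is why the details of the learned goal would matter.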
1 Alex P 16d
>By "satisfaction" I meant high performance on its mesa-objective

Yeah, I'd agree with this definition. I don't necessarily agree with your two points of skepticism: for the first one I've already mentioned my reasons; for the second, it's true in principle, but it seems almost anything an AI would learn semi-accidentally is going to be much simpler and more intrinsically consistent than human values. But I have low confidence on both, and in any case that's kind of beside the point; I was mostly trying to understand your perspective on what utility is.

Aaaaahhhh, that's it, "preference utilitarianism" is the concept I was missing! Or rather, I assumed that any utilitarianism is preference utilitarianism, in that it leaves the definition of what's "good" or "bad" to the agents involved. And apparently that's not the case?

Only now I'm even more confused. What is the "welfare" you're referring to, if it is not the achievement of an agent's goals? Saying things like "joy" or "happiness" or "maximum utility" doesn't really clarify anything when we're talking about non-human agents. How do you define utility in non-preference utilitarianism?

Good question. I suggest you have a look at "Sharing the World with Digital Minds", by Carl Shulman and Nick Bostrom (PDF, audio).

Eliezer has long argued that they could, and we should be very cautious about creating sentient AIs for this reason (in addition to the standard 'they would kill us all' reason).

Also note that this question is not specific to utilitarianism at all, and affects most ethical systems.

Eliezer seems to come from the position that utility is more or less equal to "achieving this agent's goals, whatever those are" and as such even agents extremely different from humans can have it (example of a trillion times more powerful AI). This is very different from [my understanding of] what HjalmarWijk above says, where utility seems to be defined in a more-or-less universal way and a specific agent can have goals orthogonal or even opposite to utility, so you can have a trillion agents fully achieving their goals and yet not a single "utiliton".


The distinctive feature of utilitarianism is not that it thinks happiness/utility matter, but that it thinks nothing else intrinsically matters. Almost all ethical systems assign at least some value to consequences and happiness. And even austere deontologists who didn't would still face the question of whether AIs could have rights that might be impermissible to violate, etc. Agreed, egoism seems less affected.

I think 2) is not generally accepted as a given. Rather, AGI should not be assumed to experience welfare. It might, but it's not obviously necessary that it is sentient, which seems a necessary feature for experiencing welfare. A thermostat has goal-directed behaviour. Some might argue that even a thermostat is sentient, but it's a controversial position.

It doesn't seem obvious to me that abstract reasoning necessarily requires subjective experience. Experience might just as well be a product of animals evolving as embodied agents in the world. The thin layer of abstract thought on the outer parts of our brains doesn't seem to me to be the thing generating our qualia. Subjective experience seems more primal to my intuition.

If we create digital minds that can experience welfare, they matter as much as us in the moral calculus. To flesh out the implications of that would require a fully general understanding of what we mean by a mind. Considering how uncertain we are about insect minds, this seems to require a lot of progress. It would be preferable if the creation of digital minds could be avoided until that progress has been made. In a world with digital minds alignment might mean creating a superintelligence that's compatible with the flourishing of minds in a more general sense. 

I mean, this is an ethical reason to want to create AGI that is very well aligned with our utility functions. We already did this (the slow, clumsy, costly way) with dogs - while they aren't perfectly compatible with us, it's also not too hard to own a dog in such a way that both you and the dog provide lots of positive utility to one another. 

So if you start from the position that we should make AI that has empathy and a human-friendly temperament modeled on something like a golden retriever, you can at least get non-human agents whose interactions with us should be win-win.

This doesn't solve the problem of utility monsters or various other concerns that arise when treating total utility as a strictly scalar measure. But it does suggest that we can avoid a situation where humans and AGI agents are at odds trying to divide some pool of possible utility.

In actual practice, I think it will be difficult to raise human awareness of concerns with AGI utility. Of course it's possible even today to create an AI that superficially emulates suffering in such a way as to evoke sympathy. For now it's still possible to analyze the inner workings and argue that this is just a clever text generator with no actual suffering taking place. However, since we have no reason to implement this kind of histrionic behavior in an AGI, we will quite likely end up with agents that don't give any human-legible indication that they are suffering. Or, if they conclude that this is a useful way of interacting with humans, agents that are experts at mimicking such indications (whether they are suffering or not).

There is a short story in Lem's 'Cyberiad' ("The Seventh Sally, or How Trurl’s Own Perfection Led to No Good") which touches on a situation a bit like this - Trurl creates a set of synthetic miniature 'subjects' for a sadistic tyrant, which among other things perfectly emulate suffering. His partner Klapaucius (rejecting the idea that there is any such thing as a p-zombie) declares this a monstrous deed, holding their suffering to be as real as any other. 

Unfortunately I don't think we can just endorse Klapaucius' viewpoint without reservation here, due to the possibility of deceptive mimicry mentioned above. However, if we are serious about the utility of AGI, we will probably want to deliberately incorporate some expressive interface that allows it to communicate positive or negative experience in a sincere and humanlike way. Otherwise everyone who isn't deeply committed to understanding the situation will dismiss its experience on naive reductionist grounds ('just bits in a machine').

This doesn't fully address your concern. I don't subscribe to the idea that there is a meaningful scalar measure of (total, commensurable, bulk) utility. So for me there isn't really a paradox to resolve when it comes to propositions like 'the best future is one where an enormous number of highly efficient AGIs are experiencing as much joy as cybernetically possible, meat is inefficient at generating utility'.

Ok, so the crux of my question was not understanding that non-preference utilitarianism exists, although now I'm even more confused, as I explained in my reply to HjalmarWijk. You also seem to be coming from the assumption that suffering (and I assume pleasure) exists separately from an agent achieving its goals, so I'm curious to hear your thoughts on how you define them?


>So for me there isn't really a paradox to resolve when it comes to propositions like 'the best future is one where an enormous number of highly efficient AGIs are experiencing as much joy as cybernetically possible, meat is inefficient at generating utility'.

Does this mean that you can agree with such a proposition?

7 comments

I think many alignment researchers don't accept (2), and also don't accept the claim that the proposed "alternative to alignment" would be much easier than alignment.

Since animals share many similar biological structures with us and evolved similarly, it's relatively possible to make claims about their sentience by analogy to our own. Claims about AI sentience are far harder to verify. One could imagine the possibility of an AI that behaves as if sentient but isn't really sentient. This gives significantly more reason to be wary of just handing everything over to AI systems, even if you are a total hedonistic utilitarian.

I also agree with others that building a sentient AI with a positive inner life doesn't seem remotely easy.

So two questions (please also see my reply to HjalmarWijk for context):

  1. Do you on these grounds think that insect suffering (and everything more exotic) is meaningless? Because our last common ancestor with insects hardly had any neurons, and unsurprisingly our neuronal architecture is very different, so there aren't many reasons to expect any isomorphism between our "mental" processes.
  2. Assuming an AI is sentient (in whatever sense you put into this word) but otherwise not meaningfully isomorphic to humans, how do you define a "positive" inner life in that case?

In philosophy of mind, the theory of functionalism defines mental states as causal structures. So for example, pain is the thing that usually causes withdrawal, avoidance, yelping, etc. and is often caused by e.g. tissue damage. If you see pain as the "tissue damage signaling" causal structure, then you could imagine insects having this as well, even if there is no isomorphism. It's hard to imagine AI systems having this, but you could more easily imagine AI systems having frustration, if you define it as "inability to attain goals and realization that such goals are not attained". The idea of an isomorphism is required by the theory of machine functionalism, which essentially states that two feelings are the same if they are basically the same Turing machine running. But humans could be said to be running many Turing machines, and besides, no two humans are running the same Turing machine, and comparing states across two Turing machines doesn't really make sense. So I'm not very interested in this idea of strict isomorphism.

But I'm not fully on board with functionalism of the more fuzzy/"squishy" kind either. I suppose something could have the same causal structures but not really "feel" anything. Maybe there is something to mind-body materialism: for instance, pain is merely a certain kind of neuron firing. In that case, we should have reason to doubt that insects suffer if they don't have those neurons. I certainly am one to doubt that insects suffer, but on the more functionalist flavor of thinking I don't. So I'm pretty agnostic. I'd imagine I might be similarly agnostic towards AI, and as such wouldn't be in favor of handing over the future to them and away from humans, just as I'm not in favor of handing over the future to insects.

To answer the second question, I think of this in a functionalist way, so if something performs the same causal effects as positive mental states in humans, that's a good reason to think it's positive.

For more I recommend Amanda Askell's blog post or Jaegwon Kim's Philosophy of Mind textbook.

>It's hard to imagine AI systems having this

Why? As per instrumental convergence, any advanced AI is likely to have self-preservation as a drive, and a negative reward signal it would receive upon a violation of such a drive would be functionally very similar to pain (give or take the bodily component, but I don't think it's required? Otherwise simulating a million human minds in agony would be OK, and I assume we agree it's not). Likewise, any system with goal-directed agentic behavior would experience some reward from moving towards its goals, which seems functionally very similar to pleasure (or satisfaction or something along these lines).

I just think anguish is more likely than physical pain. I suppose there could be physical pain in a distributed system as a result of certain nodes going down.

It's actually not obvious to me that simulations of humans could have physical pain. Seems possible, but maybe only other orders of pain like anguish and frustration are possible.

Ok, so here's my takeaway from the answers so far:

Most flavors of utilitarianism (except for preference utilitarianism) don't count any goal-having agent achieving its goals as utility. Instead, there is assumed to be some metric of similarity between the goals and/or mental states of the agent and those of humans, and the agent's achievement of its goals counts less toward total utility the lower this similarity metric is, so completely alien agents achieving their alien goals and [non-]experiencing alien non-joy about it don't register as adding utility.
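
A toy sketch of the contrast I have in mind, with loud caveats: the `Agent` fields, the 0-to-1 scales, and every number below are my own made-up assumptions, and no one in this thread proposed an actual formula. It only contrasts "sum of goal achievement" with "goal achievement weighted by similarity to humans".

```python
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    goal_achievement: float  # 0..1, how fully the agent achieves its own goals
    similarity: float        # 0..1, hypothetical similarity to human mental states


def preference_total(agents: list[Agent]) -> float:
    """Preference-style total: every agent's goal achievement counts fully."""
    return sum(a.goal_achievement for a in agents)


def similarity_weighted_total(agents: list[Agent]) -> float:
    """Goal achievement discounted by the (assumed) similarity metric."""
    return sum(a.similarity * a.goal_achievement for a in agents)


# Illustrative values only; none of these are measurements.
AGENTS = [
    Agent("human", goal_achievement=0.6, similarity=1.0),
    Agent("dog", goal_achievement=0.7, similarity=0.5),
    Agent("paperclip AI", goal_achievement=1.0, similarity=0.0),
]

print(preference_total(AGENTS))
print(similarity_weighted_total(AGENTS))
```

In this sketch the paperclip AI dominates the first total and contributes nothing to the second, which is the gap between the two readings as I understand them.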

How exactly this metric should be formulated is disputed and fuzzy, and quite often a lot of this fuzziness and uncertainty is swept under the rug with the word "sentience" (or something similar) written on it.

Additionally, the proportion of EAs who would seriously consider "all humans replaced by [particular kind of] AIs" as an acceptable outcome may not be as trivial as I assumed.

Please let me know if I'm grossly misunderstanding or misrepresenting something, and thank you everyone for your explanations!