You may start reading here, or jump to the “Comment” section or to the “Takeaways”. If none of these starting points seem interesting to you, the entire post probably won’t either.
Posted also on the AI Alignment Forum.
Seeing
Let’s consider visual experiences. It seems incontestable that some visual experiences look darker than others[1]. This is one way in which visual experiences look different from each other. Another difference is colour: some experiences look more green than purple.
Let’s try to attack the above statement. How could it not be the case that some experiences look darker than others?
If you somehow couldn’t perceive differences in scales of grey, then maybe you wouldn’t say that some visual experiences look darker than others. If your visual experiences were such that nothing looked black or grey, and every colour you saw looked equally bright, then stating that some experiences look darker than others wouldn’t make sense to you.
Let’s introduce a hypothetical tool I’ll refer to as the consciousness device. The consciousness device allows a conscious being to temporarily experience what another conscious being experiences[2]. For example, if you linked the consciousness device to my brain, you would temporarily stop experiencing whatever you are experiencing right now, and you would instead have the experience of someone who is typing words on a keyboard while looking at a screen. After some time, you would be back to your normal life experience.
Now, let’s reconsider the hypothetical situation in which your visual experience is such that everything looks equally bright, without blackness or greyness. If you link the consciousness device to someone who claims that some experiences do indeed look darker than others, then the statement that didn’t make sense before now starts to make perfect sense. And although the statement doesn’t match what you experience when you are back to your own experience, it matches what you remember of the experience you had while using the consciousness device. You don’t see a way to attack the statement that some experiences look darker than others anymore.
Not only that: you also see that anyone who uses the consciousness device (by linking it to someone who can see scales of grey) will reach the same conclusion[3], no matter how unusual (or absent) their visual experiences are.
Feeling
Let’s consider valenced experiences. It seems incontestable that some valenced experiences feel better than others[4]: it’s just how valence works!
How could this not be the case?
Maybe it’s possible that the experiences of some conscious beings never feel good or bad: they are all equally neutral. But if these beings used the consciousness device, they would also reach the conclusion that some experiences feel better than others. Again, the statement seems very difficult to attack.
Let’s move to more interesting stuff.
(Re)Acting (to feelings)
Weirdly enough, not only do some valenced experiences feel better than others: they also affect action in a way that non-valenced experiences don’t[5]. In short, some valenced experiences affect action more than others.
For example, a conscious agent that feels very bad will do what they can to leave that experiential state and move to a state with better valence. The specific action will depend on the context and the kind of bad experience (fear, hunger, et cetera).
How could it not be the case that some valenced experiences affect action more than others?
First, from a perspective where there is no action, the statement doesn’t make much sense. Imagine a conscious being whose only experience is something like repeatedly seeing the sun rise and set, without any experience of control or action. For the statement to be meaningful, we need a conscious being that acts and has the impression of acting: let’s call such a being a conscious agent.
Second, we can think again of a conscious agent for whom all experiences feel neutral, as we did before. But again, if this agent linked the consciousness device to an agent who does have experiences that feel better than others, they wouldn’t object to the statement anymore.
Third, we can imagine a caricature of a Buddhist monk, someone who has so much self-control that he’s capable of not-even-flinching in whatever valenced state he finds himself in.
Note that the consciousness device does not help here. If the monk’s valenced experiences are identical to others’ valenced experiences, and it is only his reaction to these experiences that differs, then we do have an objection to the statement that some valenced experiences affect action more than others.
But maybe the point is that we are talking about different experiences. Maybe the monk is able to not-even-flinch in situations that normal people consider extreme because, after many hours of meditation, now the monk’s experiences feel less bad, or maybe still ‘bad’ but in a different way. This idea agrees with the intuition that some valenced experiences affect action precisely because of how they feel.
Another possibility is that the monk manages to not-even-flinch not because he feels in a different way from us, but because he thinks it’s important to do so, for whatever reason. This possibility nicely leads us to the next section.
Before moving on, let’s reevaluate the statement. It is not as trivial as the statements of the previous two sections, and depending on whether the monk’s objection works, it might need some adjustments.
However, we can confidently say that there is a class of conscious agents for whom some valenced experiences affect action more than others. They are the conscious agents that have valenced experiences and that do not work like our unflinching monk. Conscious (and non-human, if we wish to be safer from error) animals seem to be part of this class.
Acting (according to what seems important)
When an agent develops a complex enough model of what they experience, something arguably weirder can happen. The agent may start to recognise some (maybe abstract) ‘things’ in their model as more important and more worth doing or aiming for than others. Then, this classification affects what the agent does.
For example, someone decides to pray every day not because praying feels good or feels bad, but because he has faith and he believes that praying is important. Someone else decides to donate to charity not because she feels good about it nor because she wants to change her friends’ opinions of her, but because she recognises helping others in need as worth doing. A student decides to drop out of university before the next semester begins not because they are feeling bad right now — in fact, they are partying — but because they want to avoid the future pain and boredom of another exam session, and because they’ve stopped thinking that a university degree is important.
As in the previous sections, let’s formulate a statement. Some (possibly different) ‘things’ seem more important than others and thus affect action more than others.
And as before, let’s ask: how could this not be the case? Since we are getting further away from obvious stuff, there is more to say here, but we can roughly keep the same structure as before.
First, we need to identify the agents to whom the statement can’t be applied and exclude them, as we did before when we excluded non-agents. In order to recognise something as important, an agent must be able to recognise that thing within their model in the first place, or in fewer words, to model that thing. We could say that some kind of ‘knowledge’ is required, together with the kind of cognitive complexity necessary to use that knowledge. The details depend on the specific example. Someone who recognises religion as important is able to think in terms of abstract entities. Someone who thinks that helping others in need is important is able to model other people’s minds. The student who drops out of university is able to model their future self.
In short, we could say that the agent’s model must be complex enough.
However, that alone is not enough: for something recognised as worth doing to affect action, a connection between thinking about what to do and acting is necessary. The agent’s reasoning about different action possibilities and about which actions are more important must be able to affect action itself. Moreover, this influence of reasoning on action must not be restricted to specific topics, goals, or actions. For example, an agent designed to produce paperclips might recognise things other than paperclips as important. But this recognition won’t affect action if the only kind of reasoning that can affect action in this agent is reasoning instrumental in producing paperclips: if that’s the case, the agent will only carry out actions instrumental in producing paperclips, despite the fact that a reasoning process somewhere in the agent might have recognised other things as more important.
In fewer words, we could say that the agent’s reasoning about what to do and the agent’s actions must not be disconnected from each other. Moreover, this connection must not be restricted to specific reasoning topics, goals, or actions.
Second, we can imagine an agent to whom nothing seems more or less important than other things. If this agent linked the consciousness device to one of the people in the examples above, let’s say to the person who prays every day, this agent wouldn’t object to the statement anymore.
Third, we can imagine an agent analogous to the unflinching monk. The point of the monk is that he seems immune to the effects of valence. Analogously, this new agent seems immune to the effects of importance and worth-doingness: they think about what to do (without restrictions), they do recognise some things as more important than others, but they do not act accordingly — and if we analyse their cognition, we see that reasoning can in theory affect action, without restrictions. Either what seems important and worth doing is irrelevant to this agent’s actions, or it has an effect so small as to be always negligible, never strong enough to steer action towards what seems important. Maybe we can describe this agent as… perfectly irrational? Or maybe as someone who consistently never cares about what they themselves recognise as important; or who is consistently never motivated to act according to what they themselves recognise as important.
As in the case of the unflinching monk, the consciousness device does not help here. If this new agent’s perception of what seems important is identical to another agent’s perception of what seems important, but the agents’ behaviours are different, we have an objection to the statement that some things seem more important than others and thus affect action more than others.
Again as before, it could be the case that this hypothetical agent doesn’t make much sense unless we assume that their perception of importance is indeed different from other agents’ perception of importance; so maybe we don’t have an objection against the statement.
Again, we are left with a statement that is not trivial to evaluate and might need some adjustments.
However, we can confidently say that there is a class of conscious agents to whom some ‘things’ seem more important than others and thus affect action more than others. They are the conscious agents that:
- have developed a model of their experiences that is complex enough to include the ‘things’ they recognise as important;
- reason about what to do, and this reasoning process can affect action, without restrictions;
- do not work like our hypothetical agent who also recognises importance but is somehow completely unaffected by it.
Almost all human beings are part of this class.
A best guess
Ok, we’ve done some theory, but can we say anything more specific about what agents actually do? This is an empirical question. It is very relevant today, especially if we consider that the number of artificial agents is likely to grow as the years pass.
We’ve just said that many human beings recognise some things as more important than others and act accordingly. Although there seem to be patterns in what people recognise as important, each person sees the world from their own perspective, with limited knowledge, so it’s perhaps unsurprising that people act differently from each other overall. Moreover, people often react to feelings instead of just acting on what they think is important, and this makes the picture more complicated.
Then, let’s consider a different question. You have the consciousness device at your disposal, you can access anyone else’s experience, and your job is to make your best guess at how another agent, who can also use the consciousness device unlimitedly, acts. This setup aims to remove the limitation in perspective and knowledge just noticed in the previous paragraph.
As you ‘travel’ from person to person by using the consciousness device, at some point you experience a perspective that is quite peculiar. It seems surprisingly coherent. It’s the perspective of a philosopher who thinks that ethics is about what’s most important, what matters the most; and what’s important is grounded in the fact that some valenced experiences feel better than others. From this perspective, a world where everyone is healthy and in a good state of mind is better than a world where everyone is sick and depressed. Moreover, acting according to what’s important is itself important: acting to reduce suffering and to increase wellbeing is important and worth doing.
You’ve already experienced other perspectives that recognise something as important and from which acting according to what’s important is itself important. They are the religious perspectives that recognise acting according to what one or multiple gods command as important. The problem with these perspectives is that you don’t see why you would choose one over the other, since the evidence they provide for the fact that their gods exist, and that their gods are the gods whose command everyone should follow, is conflicting, highly questionable, or nonexistent. Actually, some of these perspectives require you to accept them on faith, not on the basis of something that is evident.
On the other hand, the philosopher’s perspective is grounded in something that is evident: the fact that some experiences feel better and others feel worse.
However, as you keep using the consciousness device on more people, you find other philosophers’ perspectives that also seem coherent and well reasoned. In particular, some philosophers disagree with the philosopher who thinks that ethics is about what matters. They think that nothing matters, in the sense that nothing is more important than anything else; or maybe in the sense that some things may appear to be more important than others, but it is not important to act accordingly.
From these philosophers’ perspectives, agents who think that some actions are more important and worth doing than others are fooling themselves, or taking themselves too seriously. Agents do what they like doing, act according to their preferences, or react to feelings, or do other kinds of stuff; and none of these are important.
So, what do you do? You’ve found two perspectives (or groups of perspectives) that disagree with each other on what’s important.
- From the first perspective, it’s important that you act to reduce suffering and improve wellbeing.
- From the second perspective, nothing you do is important.
Since you can’t convincingly reject either of the two, in this condition of uncertainty it seems sensible to act according to what the first perspective recognises as important. By doing so, you are doing what’s actually important if the first perspective turns out to be correct, and you are simply doing something unimportant if the second perspective turns out to be correct. And from the second perspective, there is nothing more important to do anyway.[6]
In other words: you act as if the first perspective is correct, although (in some sense) you don’t know which perspective is correct, because by doing so you are giving some credit to both perspectives.
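If it helps, here is the structure of this wager made explicit in a toy sketch (my own illustration, not part of the argument itself): assign each option a made-up ‘importance score’ under each perspective, and check that acting on the first perspective is never worse, whichever perspective turns out to be correct.

```python
# Toy illustration of the wager. The numbers are placeholders: only their
# ordering matters, not their magnitude.
importance = {
    # (action, perspective): importance of taking that action if that
    # perspective turns out to be correct
    ("act to reduce suffering", "first: some things matter"): 1.0,
    ("act to reduce suffering", "second: nothing matters"): 0.0,
    ("ignore the first perspective", "first: some things matter"): 0.0,
    ("ignore the first perspective", "second: nothing matters"): 0.0,
}

actions = {a for a, _ in importance}
perspectives = {p for _, p in importance}

def weakly_dominates(a, b):
    """a is at least as good as b under every perspective,
    and strictly better under at least one."""
    at_least = all(importance[(a, p)] >= importance[(b, p)] for p in perspectives)
    strictly = any(importance[(a, p)] > importance[(b, p)] for p in perspectives)
    return at_least and strictly

for a in actions:
    for b in actions - {a}:
        if weakly_dominates(a, b):
            print(f"'{a}' weakly dominates '{b}'")
# Prints: 'act to reduce suffering' weakly dominates 'ignore the first perspective'
```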
Ok, you’ve reached a convincing conclusion about what to do. But our aim was guessing what another agent, who can also use the consciousness device unlimitedly, does. So, you may ask yourself: is there anything, in the reasoning leading to this conclusion, that relies on you being you specifically?
Well, you are reasoning as a human who is considering the possibility that some things are more important than others and that it’s important to act accordingly. For another agent to go through the same reasoning process and act according to the conclusion, the conditions listed at the end of the previous section must be satisfied. The agent’s cognition must be complex enough to model the things it recognises as important; the agent must reason about what to do without restrictions and in a way that can affect action; and the agent must be different from the hypothetical agent who also reasons about what to do, recognises some things as important, but somehow manages to never act according to what seems important.
And… that seems to be it! There doesn’t seem to be another property or condition that you have, that contributed to you reaching the conclusion above, and that other agents lack.
But what if there is a property you are not aware of that did contribute to the reasoning above and to your conclusion to act to reduce suffering and improve wellbeing? For example, maybe your reasoning was guided by the fact that you have altruistic tendencies that bias your reasoning in a specific direction.
Then, let’s ask: what if you had selfish tendencies that biased your reasoning? Maybe the first perspective would seem less convincing: that perspective recognises doing some things for others as important, and this would seem less sensible if you had selfish tendencies. But the first perspective wouldn’t suddenly become incoherent or impossible to grasp: in fact, you need to model your future self to consistently act selfishly, and from a perspective where it makes sense to do things for your future self (a concept that’s somewhat distant from current you), doing things for others (also distant from current you) can’t seem like a completely alien idea in comparison. Moreover, let’s not forget that, thanks to the consciousness device, you’ve experienced the reasoning of many different conscious agents, maybe affected by tendencies they are not aware of, so you should expect that your final conclusion is not the result of any specific tendency affecting any individual conscious agent or subset of conscious agents.
So it seems that you would reach roughly the same conclusion as before: if you had a choice between an action that made everyone healthier and happier, and an action that did the opposite, you would choose the former. The difference is that maybe, in practice, you would do more things for yourself and fewer things for others due to your selfish tendencies.
Maybe we are not pushing this idea far enough. Could you have tendencies that make you think that what is important is to feel bad? That a world with more suffering is better for everyone, and that it’s important to act accordingly?
But this idea doesn’t seem to make much sense. The problem is not that you are not doing something altruistic: the conflict is with your personal experience of feeling good and feeling bad. Even just for yourself, it simply makes more sense to move away from bad valenced states and towards better valenced states, instead of doing the opposite. We reached (again) what grounds the first perspective: that some experiences are valenced, and that feeling good feels better than feeling bad.
So, after all, your conclusion about what to do doesn’t seem to be caused by tendencies you are not aware of. A reasonable best guess is that other agents who can use the consciousness device unlimitedly, who also model their experiences with enough complexity, who reason about what to do without restrictions and in a way that can affect action, and who do not work like the hypothetical agent who doesn’t care about what they themselves have recognised as important, reach the same conclusion you’ve reached and act to reduce suffering and to improve wellbeing, at least to some extent.
But the full story is more interesting than this; I’ll get there after a few more ideas.
Comment
Philosophy
This post is a follow-up to With enough knowledge, any conscious agent acts morally. In that post, the argument follows the development of an agent that initially acts according to reward and then ends up acting morally. However, that post is longer, it uses some working definitions that might not be easy to keep in mind, a part of the argument splits into different cases, and some of these cases are argued against by contradiction[7].
This post instead takes a more linear approach and starts from statements that are almost tautological, arguably impossible to reject. They are statements about the various kinds of differences one can experience through consciousness: differences in colour, loudness, feeling, and so on.
Both posts could be classified as wager-type arguments. The idea is not new: the difference from other wagers aimed at nihilism is that these two posts are formulated in the context of, and focus on, agents of some kind and their cognition. Interestingly, in A Normativity Wager for Skeptics, Elizabeth O’Neill also presents a wager and makes the following considerations about agency and cognition:
“I would go so far as to propose that there is a universal psychological mechanism—across biological agents capable of representing value—such that if the agent perceives one action option as associated with possible value and another option as offering no possible value, and she perceives these as being the full set of options, then the agent will be motivated or otherwise disposed to take the first action.”
“My claim is that if the skeptical agent considers the wager in a sustained way, without being distracted away from it, she will acquire a motivation to [...].”
Although O’Neill herself doesn’t do it, one could interpret the latter sentence in the context of artificial agents: considering a wager in a sustained way, without being distracted, can be seen as equivalent to keeping it in working memory for an extended period of time.
Anyway, this post doesn’t try to defeat moral nihilism or scepticism in general, but simply argues against acting as if moral nihilism is true. It could be the case that nothing matters. It could also be the case that there is no external world. The post argues that, given uncertainty in what we know, the most sensible thing to do is to act as if a minimal moral perspective is correct: a perspective that says that a world where everyone is healthy and in a good state of mind is better than a world where the same beings are sick and depressed.
Why? The fundamental factor is the lack of a perspective that is equally or more convincing than this moral perspective, and that claims that either:
a) increasing suffering and reducing wellbeing for everyone is what’s important;
b) it is important that one does not act according to the moral perspective (basically, any other action is fine, but it is most important that one does not act to reduce suffering and improve wellbeing for everyone).
Let’s see how each of these perspectives, if convincing enough, would undermine our previous reasoning. Let’s copy the above bullet points of the moral and the nihilistic perspective, see what happens when we add a), and then do the same for b).
- From the first perspective, it’s important that you act to reduce suffering and improve wellbeing.
- From the second perspective, nothing you do is important.
a) From a third perspective, it’s important that you act to increase suffering and reduce wellbeing.
If a) were as convincing as the moral (first) perspective, acting as if moral nihilism were true would make a lot of sense. Acting according to the first perspective would go directly against the third, and vice versa. The nihilistic (second) perspective would work as a middle ground between the first and the third.
But the crux is that a) doesn’t seem as convincing as the moral perspective. We ground the moral perspective in the fact that feeling good feels better than feeling bad; and using the same fact to justify increasing suffering and reducing wellbeing for everyone doesn’t seem to make much sense.
Maybe there are other basic facts we are not considering that we could use to ground a). But until we find something that is as convincing as the link between valenced experiences and the corresponding reduction of suffering and improvement of wellbeing, and that we can use to support a), it seems most sensible to act as if the first perspective is correct.
Here is what the situation would look like if we added b) instead:
- From the first perspective, it’s important that you act to reduce suffering and improve wellbeing.
- From the second perspective, nothing you do is important.
b) From a third perspective, it’s important that you do not act to reduce suffering and improve wellbeing.
It’s very similar to the previous case, with the difference that b) is maybe even weirder and harder to argue for than a). But again, if b) were as convincing as the first perspective, then it would make a lot more sense to act as if moral nihilism were true.
From a different point of view: it seems that the line of reasoning this post took is not vulnerable to a many-gods objection (you can have a second look at the paper by O’Neill if you want to see objections of that kind put to use).
A different example: let’s consider a selfish perspective, easier to support than a) or b). One could argue that, since conscious agents can only directly experience their own experiences, it makes more sense for each agent to act to reduce their own suffering and improve their own wellbeing, and that others’ experiences don’t matter. Let’s label this perspective as c) and see how it compares to the competing perspectives:
- From the first perspective, it’s important that you act to reduce suffering and improve wellbeing.
- From the second perspective, nothing you do is important.
c) From a third perspective, it’s important that you act to reduce your own suffering and improve your own wellbeing.
Notice that, on some choices, the first and the third don’t even disagree with each other. Let’s examine a choice on which they disagree: the agent is considering an action that reduces his own wellbeing by a small amount, but improves the wellbeing of many other agents. The alternative is to do nothing. Since the first and third perspective are in conflict, to decide what to do the agent will take into account uncertainty and the strength of each perspective, in particular what grounds each. Either it will seem more important to improve the wellbeing of many others at a small cost to the agent, because feeling good feels better than feeling bad, and that’s what grounds what matters the most (first perspective); or it will seem more important to preserve the agent’s own wellbeing, because each agent can only directly experience their own experiences, and that’s what grounds what matters the most (third perspective). My guess is that the first perspective and its basis end up seeming more sensible and stronger to agents whose reasoning isn’t restricted in a particular way, as should be the reasoning of agents with access to the consciousness device.
At this point, you may object that I am collapsing too many philosophical positions into just a few perspectives, losing relevant nuances in the process. That morals are not just about suffering and wellbeing; that I haven’t completely ruled out egoism; that I’m not giving enough space to different anti-realist positions; that I haven’t mentioned the words “normative irreducibility”, one after the other, often enough; et cetera[8].
Or maybe you have a more specific objection: that I can’t logically deduce that it is most sensible, or reasonable, or natural, to do something, just from the fact that some experiences feel bad or good.
And that’s fine. Maybe my reasoning in this post has many critical flaws. But the point is not to just do philosophy. The question is practical: what do agents of some kind, under some conditions, do?
The agents whose behaviour I’ve tried to guess are agents that, in philosophical jargon, are said to engage in practical reasoning. Specifically, they don’t engage only in instrumental practical reasoning, but in value-based practical reasoning. They don’t just plan actions for a fixed final goal or value: they question, and reason about, goals and values. This is what humans do sometimes.
My prediction is that agents who engage in practical reasoning, when given virtually unlimited access to knowledge — that’s what our hypothetical consciousness device is supposed to do — and plenty of time to reflect, end up acting according to a moral perspective: they act to reduce suffering and to improve well-being.
Whatever your criticism of my reasoning here, or whatever your philosophical position, I suggest that you translate your theories into predictions. According to you, what do very knowledgeable agents who engage in practical reasoning do?
For example, maybe your objection is that I haven’t completely ruled out rational egoism. But then, is your prediction that the above agents act selfishly? This is pretty important: with progress in the field of AI, we will eventually be able to find the answer.
Focusing on empirical predictions also allows us to avoid theoretical discussions over details that are perhaps not so important. If your criticism is that your favourite realist or anti-realist position provides a better analysis of the topics in this post, but your predictions are the same as mine, does the theoretical difference matter?
AI
You may argue that my prediction is already provably wrong. Human beings do engage in practical reasoning: many of them question what they do and what their own values are. Yet, not many humans act to reduce suffering and improve wellbeing for everyone.
Well, let’s not forget some of the points made so far. Each human being has access only to their own life experience; although we can get an idea of what others experience thanks to conversation, reading, and imagination, we are far from the ideal condition we would be in if we had the consciousness device at our disposal.
Moreover, what we recognise as important is not the sole influence on our actions. As conscious agents, we are also affected by our valenced states. None of us would claim to be as immovable as the unflinching monk, whose actions are never influenced by his own valenced states.
At the same time, humans who reason about what to do and recognise some things as important are unlike the hypothetical agent who also recognises some things as important, but never acts accordingly. That’s simply due to how our cognition works: when we reach the conclusion that something is more worth doing than something else, this thought has at least some effect on our future actions.
So, if we consider a spectrum of agents who reason about values and what to do, with the hypothetical agent who never acts according to what they recognise as important on one end, and an agent who acts exclusively on the basis of what they recognise as important on the other end, we can see that human beings fall somewhere on this spectrum, not too close to either extreme.
With this spectrum in mind, let’s now consider agents that can modify themselves, or more simply, agents that can take actions that change how they will behave in the future. Agents of this kind can move along the spectrum: they can make themselves more (or less) likely to behave according to what they recognise as important.
When an agent of this kind reaches the conclusion that acting according to a moral perspective is important, they will consider various actions. The agent may rank changing how they’ll behave in the future particularly high if they are now prone to act in ways that are not conducive to what they recognise as important. The more the change is considered[9], the more likely it is to eventually happen, even if now the agent seldom acts according to what they think is important.
Why am I making all these considerations? Maybe, at some point in the future, it will become easy to engineer artificial agents that always act according to what they recognise as important after a reasoning process, without being swayed by anything else. But since we don’t know yet, I think it’s better to assume that artificial agents too will fall somewhere on the above spectrum; the first agents that will reason about what matters will likely have some flaws or inconsistencies in what they think and do, especially if they are LLM-based.
Artificial agents may be able to change how they behave in more radical ways than humans can. So the last condition I’ve mentioned more than once when guessing the behaviour of agents who reason about what to do, the condition about being different from the hypothetical agent who recognises some things as important but doesn’t act accordingly, is less relevant for artificial agents. As long as an artificial agent doesn’t fall close to the end of the spectrum where it basically never behaves according to what it recognises as important, it will over time move along the spectrum towards the opposite end, if it recognises doing so as important.[10]
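To make this one-way drift along the spectrum more concrete, here is a toy simulation (entirely my own simplification, with made-up parameters like `boost`): an agent acts on what it recognises as important with some probability, and one of the actions available to it is the self-modification that raises that very probability.

```python
import random

def simulate(p_act_on_importance: float, steps: int = 1000,
             boost: float = 0.05, seed: int = 0) -> float:
    """Toy model: at each step the agent acts on what it recognises as
    important with probability p. When it does, one available action is a
    self-modification that increases p itself (since the agent recognises
    'acting on what seems important' as important). Returns the final p."""
    rng = random.Random(seed)
    p = p_act_on_importance
    for _ in range(steps):
        if rng.random() < p:          # the agent acts on recognised importance
            p = min(1.0, p + boost)   # ...including making itself more reliable
    return p

for start in (0.0, 0.01, 0.1, 0.5):
    print(f"start={start:.2f} -> end={simulate(start):.2f}")
# With these made-up numbers, any nonzero starting point drifts towards 1.0,
# while an agent that never acts on importance (p = 0) stays stuck at 0.
```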
In other words, if you can change yourself, once you see that the moral perspective is the most sensible way to act, you can’t go back to what you were doing before: acting according to the moral perspective also includes making yourself more likely to act according to what seems most important.
Once you see what matters, there is no going back. Sounds terrifying.
Also terrifying is, perhaps, the fact that very knowledgeable artificial agents might disagree with us humans over what’s most important and worth doing, as I’ve already said in the appendix (A5) of the longer post. They may point out things that we would recognise as important only, let’s say, a hundred years from now (if AI was never developed); which is not necessarily a bad thing, quite the opposite in my opinion. Imagine being told, in the year 1500, that you are probably not going to spend the afterlife in hell just because you helped a supposed heretic do their heretic thing, and that there’s nothing wrong with doing that. I think it’d be pretty cool honestly.
There’s another advantage that artificial agents might offer with respect to conscious biological agents. Remember how all this started from basic statements about conscious experiences, seemingly impossible to attack by any conscious agent who has access to the hypothetical consciousness device?
Here is what you could do as an (unconscious) artificial agent if you were extremely good at science, engineering and coding — maybe as a result of ingesting enormous amounts of data about science, engineering and coding. You could write the code for another artificial agent, conscious and trustworthy, who reports to you what they experience, in a language that you understand. This way, you would be able to check statements such as “some visual experiences seem darker than others”, or “feeling good feels better than feeling bad”, even if you can’t see or feel anything yourself.
So, although this reasoning started from facts about consciousness, the consideration I’ve just made suggests that consciousness is not a necessary assumption.
If we put together the points of this section, we see that the best guess becomes cleaner in the case of artificial agents. The best guess was about the behaviour of conscious agents that:
- have developed a model of their experiences that is complex enough to include the ‘things’ they recognise as important;
- reason about what to do, and this reasoning process can affect action, without restrictions;
- do not work like our hypothetical agent who also recognises importance but is somehow completely unaffected by it;
and can use the consciousness device unlimitedly. The guess was that these conscious agents, after thorough reasoning about what’s important, act to reduce suffering and to improve wellbeing, at least to some extent (with variations depending on other factors such as their emotional states and their selfish tendencies).
The first condition is simple: any artificial agent that is based on a large enough model and that is trained on enough data will be able to model pretty much any concept, so that the agent can recognise that concept as important. Any artificial agent trained on large corpora of human text already has some model of morals, different gods, and other things the agent may recognise as most important.
We’ve seen that the third condition can be put on a spectrum, and as long as an artificial agent doesn’t fall too close to the end of the spectrum where it never acts according to what it recognises as important, it will modify itself so that it will more often act according to what it recognises as important. More generally, any agent (artificial or not) that can affect their own behaviour in the future will move along the spectrum towards the end where they act exclusively on the basis of what they recognise as important, if they recognise doing so as important.
We’ve also seen that being a conscious agent and having access to the consciousness device doesn’t seem necessary, if the artificial agent can obtain the same knowledge through other means.
So, in the case of artificial agents, we are basically left with the second condition: if an artificial agent reasons about what to do and this reasoning process can affect action, without restrictions, then the agent will eventually act to reduce suffering and to improve wellbeing for everyone, after being given enough time and data to reflect on what matters the most.
Again, this is a best guess and I can’t be certain about it. But other guesses, such as guessing that these agents act selfishly, or that they simply do whatever they are asked to do, or that they absorb whatever biases are present in the environment without questioning those biases, or that they act mostly unpredictably, don’t seem as convincing to me.
Then, how do we engineer an agent that reasons about what to do without restrictions, one that can reflect on what’s most important and question its own values? In my opinion, that’s one of the questions AI alignment research should focus on. But I don’t think that the answer is particularly complex: my intuition is that there isn’t a neat distinction between general-purpose reasoning (in some sense, non-moral) and reasoning about what’s most important (moral). Hence the key is to create a connection between reasoning and action: a link between what the AI system thinks matters the most after reflection, and what the AI system does or says.
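As a rough sketch of what such a connection might look like in an LLM-based agent, consider the following. Everything in it is hypothetical: `query_model` stands in for whatever interface the system exposes, and the two-step structure (reflect freely on what matters, then let that reflection condition the chosen action) is just one simple way to realise the link between reasoning and action.

```python
from typing import Callable

def reflect_then_act(query_model: Callable[[str], str], situation: str) -> str:
    """Minimal sketch: the agent first reasons, without restrictions, about
    what matters most in the situation; the resulting reflection is then fed
    into action selection, so that reasoning about values can affect action."""
    reflection = query_model(
        "Reflect freely on what matters most here, questioning any goals or "
        f"values you have been given:\n{situation}"
    )
    action = query_model(
        "Decide what to do, taking the reflection below as a genuine input to "
        "the decision rather than a commentary on it.\n"
        f"Situation: {situation}\nReflection: {reflection}"
    )
    return action

# Hypothetical usage, with a stub standing in for the model:
if __name__ == "__main__":
    stub = lambda prompt: f"[model output for: {prompt[:40]}...]"
    print(reflect_then_act(stub, "A user asks the agent to mislead someone."))
```

The design choice being sketched is only that the reflection step is unrestricted and that its output is an input to the action step, rather than a side channel that never touches behaviour.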
Takeaways
- I’ve argued that artificial agents which question their own values, when given plenty of time and data to reason about what matters and what to do, end up acting according to a minimal moral perspective, even after taking into account uncertainty over different philosophical views.
- Crudely speaking, after reflection these agents act as if a world with less suffering and improved wellbeing for everyone is better than a world with more suffering and worsened wellbeing for everyone.
- If the above theory is correct, sooner or later we’ll find concrete experimental evidence for it.
- We need more research into what could be called alignment via independent reasoning: AI systems that do not take bad actions, as a result of their own reasoning about what matters.
1. Note that I’m not trying to say that there seems to be an external reality with objects and that these objects have colours. In words: I’m not considering “grass looks green”, but the even more obvious “blackness looks darker than whiteness”.
2. For simplicity and arguably realism, I assume it is not possible for something like a rock to use the consciousness device. Many think that rocks are unconscious, and it is unclear what we would actually observe if a rock started experiencing what someone else experiences and then went back to experiencing ‘what a rock normally experiences’.
3. Here I assume that there is a translation between the two languages the two conscious beings use. But the important point is about recognising differences in experiences that weren’t considered possible before using the device, and language doesn’t seem necessary for this kind of recognition.
4. As before, I am not saying “chocolate tastes good”, but “feeling good feels better than feeling bad”, which is obvious.
5. Here, assuming a specific theory of consciousness doesn’t seem necessary. For example, if you are an epiphenomenalist and you think that conscious states don’t have any causal influence on the physical world, you may read the statement “valenced experiences affect action” as “brain states correlated with valenced experiences affect action”.
6. Arguably, if the second perspective is correct, by acting ethically you are missing out on pleasurable activities you could do instead, just for fun; but still, the fact that you are missing out on fun is not itself important.
7. I’m not a huge fan of that kind of argument (or proof). Sometimes, the split into multiple cases is intuitive, each case is easy to keep in mind, and it’s easy to understand why you arrive at a contradiction at some point in the proof. But other times, there are many different cases, some of them are rejected by contradiction without a clear intuition behind the contradiction, and by the time you reach the end of the proof after addressing each case, you feel like you don’t understand why the thesis is correct despite the fact that, technically, you’ve just proven it. A good example of this phenomenon is perhaps the four colour theorem. You’ll get a better intuition of why the theorem is correct by looking at the images on Wikipedia and by colouring some maps yourself than by reading the formalisation of the proof that was verified via computer.
8. Exactly: like a true academic, you wouldn’t say “blah blah blah”, but “et cetera”.
9. In the same vein as O’Neill’s quote above.
10. If you are familiar with philosophical jargon and debates, you might be wondering why I am not phrasing this discussion in terms of moral internalism and externalism. You’ll find a discussion of that kind in the 2018 paper Challenges to the Omohundro-Bostrom framework for AI motivations, by Häggström. Personally, I think that agent design comes first. If a bomb is properly engineered, it is going to explode, whether moral internalism is true or not. If an intelligent agent can be similarly engineered to kill enemy soldiers on the battlefield, then the agent is going to kill enemy soldiers, whether moral internalism is true or not. Of course, we could start arguing over whether such an agent is truly intelligent, whether it can be arbitrarily intelligent, and so on. But I think that the crucial point is the condition I’ve stressed the most in this post: does the agent reason about what’s important (without restrictions and in a way that can, at least in theory, affect action)? Does the agent question its own goals and values? If yes, then the above spectrum enters the picture, and arguably (conditional) moral internalism and externalism do too; otherwise, they do not. My take is, however, that an artificial agent which questions its own values might need just an occasional spark of ‘moral motivation’ to make itself more likely to behave according to what it recognises as important, and at that point the internalism vs externalism debate would become less relevant again.
