A mesa-optimization perspective on AI valence and moral patienthood

jacobpfau

Edit: I no longer endorse this view as written.

I now see this view as a necessary (but not sufficient) condition for valence.

1. Introduction

Summary: I argue that mesa-optimization is a prerequisite for valenced experience. In a mesa-optimizing agent, some heuristics are learned at the base-optimizer level; human instincts, for example, are of this kind. Valenced experience is then just how the base-optimizer presents evaluative judgements to mesa-optimizing processes. I provide a timeline estimating the emergence of valenced experience in AI, arriving at a ~12% credence in AI valence by 2030, and a ~2% credence that existing systems already experience valence. Allowing for alternative definitions of valence, other than my own, I estimate a 25% chance of valenced experience by 2030.

Motivation: What conditions incentivize the development of valence? Can we identify conditions which incentivize particular aspects of valence and use these to inform AI design? This post aims to show that these questions are more tractable and neglected than the hard problem of consciousness. A secondary goal is to provide an explanation of AI moral patienthood which starts from the philosophical mainstream, since previous EA writings on AI valence have started from idiosyncratic sets of assumptions [ref,ref].

Clarifying whether near-term AI have moral status is important, particularly if such a judgement is fine-grained enough to distinguish which AI have valenced experience. Compare to farm animal welfare: It is philosophically uncontroversial that extreme suffering in farm animals is morally heinous, but there’s little alternative given the demand for meat. If all we know is that advanced AI will experience valence, then the economic pressure to build such AI will render this information unlikely to be actionable. But, if we can distinguish between two different AI designs of equal capabilities, one supporting valence and the other philosophical-zombie-like, then we might be in a position to avoid large amounts of negative valence.

2. Valence functionalism and its implications

I take functionalism as my starting point. Roughly speaking, this is the idea that conscious experience must be explained by the causal structure of the physical mind. I’m agnostic about the precise definition of functionalism, but as a working approximation Chalmers’ definition of functional explanation will do:

'A physical system implements a given computation when there exists a grouping of physical states of the system into state-types and a one-to-one mapping from formal states of the computation to physical state-types, such that formal states related by an abstract state-transition relation are mapped onto physical state-types related by a corresponding causal state-transition relation.'^[1] [ref]

Chalmers goes on to explain how such functional characterizations of a system can be neatly formalized into combinatorial-state automata (CSA) [ref]. To make the definition of functionalism more concrete, consider a hypothetical explanation of human consciousness: Let’s suppose that conscious states correlate with groups of neurons^[2]. These neuron groups would correspond to states^[3]. Transitions between states would correspond to how neural firings of these groups influence each other. Under these assumptions, valence^[4] must be made up of^[5] a certain subset of the computations constituting consciousness. Let me take this opportunity to emphasize two points:

The philosophy of valence has yet to reach the functionalist level of precision and formalization. No one (to my knowledge) has proposed a characterization of valence in formal state-transition terms.
For the functionalist, valence is internal to the mind, and valence is substrate invariant. Hence, any neuroscientific theory of valence should not be interpreted as excluding the possibility of AI valence^[6].
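
To make Chalmers' definition quoted above more concrete, here is a minimal Python sketch of the implementation relation. It is only an illustration under simplifying assumptions: CSAs are reduced to plain finite-state automata with a single state per time step (Chalmers' CSAs have vector-valued states), and every name (`implements`, `grouping`, the toy toggle and the two toy substrates) is hypothetical rather than taken from Chalmers.

```python
from typing import Dict, Hashable

State = Hashable


def implements(physical_step: Dict[State, State],
               grouping: Dict[State, State],
               formal_step: Dict[State, State]) -> bool:
    """Check the implementation relation for one proposed grouping.

    physical_step: physical state -> next physical state (causal dynamics)
    grouping:      physical state -> formal state labelling its state-type
    formal_step:   formal state -> next formal state (abstract transitions)
    """
    # Every formal state must label some physical state-type, and no others.
    if set(grouping.values()) != set(formal_step):
        return False
    # Each causal transition must mirror the corresponding formal transition.
    return all(
        grouping[physical_step[p]] == formal_step[grouping[p]]
        for p in physical_step
    )


# Formal computation (hypothetical): a two-state toggle.
TOGGLE = {"A": "B", "B": "A"}

# Hypothetical substrate 1: four voltage levels, grouped in pairs.
VOLTAGE_STEP = {0: 3, 1: 2, 2: 0, 3: 1}
VOLTAGE_GROUPING = {0: "A", 1: "A", 2: "B", 3: "B"}

# Hypothetical substrate 2: a mechanical switch.
SWITCH_STEP = {"up": "down", "down": "up"}
SWITCH_GROUPING = {"up": "A", "down": "B"}

# A grouping of substrate 1 that does not respect the causal structure.
BAD_GROUPING = {0: "A", 2: "A", 1: "B", 3: "B"}

print(implements(VOLTAGE_STEP, VOLTAGE_GROUPING, TOGGLE))
print(implements(SWITCH_STEP, SWITCH_GROUPING, TOGGLE))
print(implements(VOLTAGE_STEP, BAD_GROUPING, TOGGLE))
```

Both toy substrates implement the same toggle under a suitable grouping, which is the sense in which the second point above calls valence substrate invariant. The third call shows that an arbitrary grouping does not count as an implementation.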

The most well-known arguments for functionalism are Chalmers’ fading and dancing qualia thought experiments [ref]. Functionalism, as stated, is a fine-grained theory: when Chalmers talks about the causal structure of a state, he is referring to all of its causal properties. This fine-grained causal structure differs from the informal notion of mental causation. For example, consider what happens when you introspect on the causes of your mental states, e.g. by asking ‘What makes me angry?’ Perhaps you think of your reaction to hearing an angrily yelled curse word. On reflection, it is natural to think that the experience of hearing this curse word is not uniquely specified by its causal properties. However, you are not able to bring to mind the enormous number of causal properties triggered by such a curse word. What comes to mind is merely a few of the most salient, high-level causal effects, e.g. getting angry, being disposed to yell back, recoiling, etc.

Given that phenomenology only has this coarse-grained level of evidence to work with, this distinction between the coarse and fine-grained levels seems like bad news for understanding valence. However, by bringing in ideas from optimization, we can explain much of our coarse-grained phenomenal data without needing a fine-grained functionalist understanding.

3. Phenomenological perspective — An overview of contemporary theories of valence via pain

Most philosophical work on valence has focused on understanding pain^[7]. Here I quote from Corns’ overview of the literature introducing the most common views [ref]:

Evaluativism for pain is a representationalist theory according to which pain is a perception and unpleasant pain is an evaluative perception, that is, a perception that essentially involves an evaluation of its object. [E.g.] ‘A subject’s being in unpleasant pain consists in his (i) undergoing an experience (the pain) that represents a disturbance of a certain sort, and (ii) that same experience additionally representing the disturbance as bad for him in the bodily sense’.
Imperativism is a strongly representationalist theory according to which the qualitative character of pain consists in intentional content… [E.g.] The intentional content of pain is a negative imperative to stop doing something.
Psycho-functionalism: Affective character – both unpleasant, as in pain, and pleasant, as in orgasm – is instead to be explained by the way in which sensory, intentional, indicative content is [m-]processed… we find in introspection a felt attraction (pleasant) or aversion (unpleasant) to the sensory object about which information is being m-processed; this is dubbed phen-desiring. Second, we find ‘registration’ of the sensory information about that object; this is dubbed phen-believing.
Cf. Schukraft on the evolutionary biology of pain: ‘To motivate fitness-improving actions, valenced experience is thought to play three roles. First, valenced experience represents, in at least a loose sense, fitness-relevant information. Second, valenced experience serves as a common currency for decision-making. Third, valenced experience facilitates learning.’ [ref]
I address some prominent EA views in Appendix A, e.g. Tomasik’s view that RL operations may have moral weight. Other views in philosophy include [ref, ref].

Seen from a functionalist perspective, these theories of valence describe the role of the valence-relevant subset of the ultimate CSA model. Much more detail is needed to reach a formal definition of valence. For example, how is the evaluative/imperative signal learned and generated? What part of the mind receives this signal? In the next section, I suggest a certain kind of interaction between optimizers provides answers to these questions.

There’s something obviously evaluative and imperative about pain, so it might seem like these phenomenological observations don’t help constrain a functionalist theory of valence. To highlight their significance, let’s look at what an alternative phenomenological picture might be for a different sort of conscious being. Carruthers describes such a ‘Phenumb’ agent as follows:

Let us imagine, then, an example of a conscious, language-using, agent — I call him ‘Phenumb’ who is unusual only in that satisfactions and frustrations of his conscious desires take place without the normal sorts of distinctive phenomenology. So when he achieves a goal he does not experience any warm glow of success, or any feelings of satisfaction. And when he believes that he has failed to achieve a goal, he does not experience any pangs of regret or feelings of depression. Nevertheless, Phenumb has the full range of attitudes characteristic of conscious desire-achievement and desire-frustration… Notice that Phenumb is not (or need not be) a zombie. That is, he need not be entirely lacking in phenomenal consciousness. On the contrary, his visual, auditory, and other experiences can have just the same phenomenological richness as our own. [ref]

Figure 1, Ways of processing harmful stimuli. Panels: Normal Human; Phenumb.

Imagine now a version of Phenumb (a lieutenant Data, if you will) wherein the agent is missing the usual distinctive phenomenology for pain^[8]. The difference between Phenumb and a normal human is shown above. Phenumb still realizes when a stimulus poses a threat to his body (noxiousness), but this conscious awareness of damaging stimuli occurs without the intermediate perception-like phenomenology. As a caricatured example, imagine Phenumb has the word DANGER pop into his head when encountering noxious stimuli. To explain why our processing of pain must involve perception-like experience of valence rather than Phenumb-like experience, let’s reinterpret what’s going on in pain perception in terms of optimization.

4. Optimization perspective — Is mesa-optimization a necessary condition for valence?

Figure 2, A schematic showing the relationship between a base and mesa-optimizer. Taken from [ref].

Let me begin by defining a key concept. Mesa-optimization refers to 'when a learned model (such as a neural network) is itself an optimizer—a situation we refer to as mesa-optimization, a neologism' [ref]^[9]. Here are two examples:

Humans. From the mesa-optimization paper, 'To a first approximation, evolution selects organisms according to the objective function of their inclusive genetic fitness in some environment. Most of these biological organisms—plants, for example—are not “trying” to achieve anything, but instead merely implement heuristics that have been pre-selected by evolution. However, some organisms, such as humans, have behavior that does not merely consist of such heuristics but is instead also the result of goal-directed optimization algorithms implemented in the brains of these organisms' [ref]. In this case, evolution is the base optimizer and our conscious, goal-directed behavior is a mesa-optimizer.
A language model neural network could implement (beam) search for some maximally likely sentence completion. This network would then be doing mesa-optimization.
As a tentative example (full details of this model’s behavior are not public), DeepMind’s recent Open-Ended Learning paper [ref] describes the activity of their reinforcement learning agent as follows:

‘When it lacks the ability to 0-shot generalise through understanding, it plays with the objects, experiments, and visually verifies if it solved the task – all of this as an emergent behaviour, a potential consequence of an open-ended learning process. Note, that agent does not perceive the reward, it has to infer it purely based on the observations.’
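
To illustrate the beam search example above, here is a minimal sketch of beam search as an explicit optimization over sentence completions. The bigram table and its probabilities are invented for illustration, and nothing here is meant as a claim about how any real language model works internally. The sketch only shows what it would mean for a learned system to run a search for a maximally likely completion.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy bigram model: P(next token | previous token).
# All probabilities are made up for illustration.
BIGRAM: Dict[str, Dict[str, float]] = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(start: str, steps: int, beam_width: int) -> Tuple[List[str], float]:
    """Return the best (tokens, log-probability) found with the given beam width."""
    beams: List[Tuple[List[str], float]] = [([start], 0.0)]
    for _ in range(steps):
        candidates: List[Tuple[List[str], float]] = []
        for tokens, score in beams:
            next_tokens = BIGRAM.get(tokens[-1])
            if not next_tokens:
                # Finished sequence: carry it forward unchanged.
                candidates.append((tokens, score))
                continue
            for tok, p in next_tokens.items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Keep only the highest-scoring partial completions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


greedy_tokens, greedy_score = beam_search("<s>", steps=3, beam_width=1)
beam_tokens, beam_score = beam_search("<s>", steps=3, beam_width=2)
print(greedy_tokens, math.exp(greedy_score))
print(beam_tokens, math.exp(beam_score))
```

In this toy setting the wider beam finds a more likely completion than greedy decoding. The objective being searched over (sequence likelihood) is internal to the procedure, which is the feature that makes the example a candidate for mesa-optimization.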

Analogies between interacting optimizers and valence

In a mesa-optimizing system, heuristic evaluations of noxiousness can be found by either the base optimizer or the mesa-optimizer. In the general case, we should expect both to happen. Consider noxious stimuli which occur regularly at the base optimizer level (evolutionary time scale) but infrequently within an individual agent's lifetime e.g. smelling fatally poisonous food. If the agent is to avoid such sparse stimuli, the relevant heuristics must be learned by the base optimizer — not the mesa-optimizer. Now, consider what happens if we have a visceral fear of seeing an armed gangster. This reaction must have been learned by our mesa-optimizing processes, since armed gangsters haven’t been around for most of human evolution (the base optimizer’s time scale). For a visualization of these distinctions, see Figure 4.

Evaluations of noxiousness involve another axis of variance: planning complexity. The reaction to certain stimuli is very straightforward, e.g. if you touch a hot surface, get away from that surface. In the extreme, there’s the knee-jerk reaction, which does not even involve the brain. However, most of the noxiousness heuristics learned by the base optimizer need to be dealt with by the mesa-optimizer. For example, if you see a threatening-looking bear, the base optimizer (evolution) cannot tell you which direction offers the best cover. Hence, the base optimizer must communicate its evaluation of noxiousness to the mesa-optimizer, i.e. the abstract planning faculties.

What form can this communication of noxiousness take? In the Phenumb thought experiment, we imagined that the word DANGER pops up in Phenumb’s mind informing Phenumb’s mesa-optimizing processes. Now we’re in a position to explain why this is impossible. Language is not encoded evolutionarily, so it isn’t possible for the base optimizer to communicate via words. But, perception-like processes are primarily developed evolutionarily. Hence, the way in which the base optimizer communicates evaluations of noxiousness could naturally be perception-like and thus match the phenomenological characteristics of valenced experience.

Figure 3, Examples of nociceptive heuristics learned by base optimizer and mesa-optimizer.

Figure 4, How evaluations of noxiousness are processed in terms of optimization levels. The base optimizer can learn simple plans like hide-behind-objects. Mesa-optimization is required for situation-specific abstract planning^[10].

Analogies between AI mesa-optimization and valence

Turning away from the human example, let’s briefly look again at the AI examples of mesa-optimization. In Appendix A, I argue that VPG updates cannot constitute valenced experience, because the RL reward and its gradients are both external to the agent. Compare this to mesa-optimizers, in which the reward signal is definitionally internal to the system. The DeepMind (DM) agent may have pain-like processing. The DM agent’s mesa-optimizer evaluates objects perceived externally. Of course, mesa-optimization and valence are not the same thing; some mesa-optimizers do not satisfy all criteria set out by the above theories of valence. For example, a mesa-optimizer implementing a Q-learning algorithm using random exploration likely fails the imperativist criterion for valence, because such a mesa-optimizer’s learning signal does not result in aversion towards low-valuation states. It’s also worth exploring how mesa-optimization and valence can come apart in humans: the neurological condition auto-activation deficit (AAD) may offer such a case [ref].

The beam search example shows another way in which mesa-optimization and valence can come apart. Recall the evaluativists’ definition of pain as a ‘perception that essentially involves an evaluation of its object’. In beam search, the object in question, the phrase being evaluated, is not usually perceived; rather, it is generated by the optimizer itself. Instead, perhaps beam-search mesa-optimization bears greater similarity to what happens when people feel a moment of satisfaction when looking for the right move in chess, or the right word for a rhyme.

Let me head off a potential objection. One should not imagine mesa-optimization as a silo-ed process with well-defined boundaries within the agent^[11]. For example, in humans, when we try to solve a task, like playing chess, valenced evaluations affect our mood and can activate our memory system [ref, ref]. In the previous section, I suggested that valence could act as a channel of information from base optimizer to mesa-optimizer. In the chess/rhyme examples, the channel flows in the opposite direction; conscious mesa-optimization unconsciously activates the valence system.

My view on valence

Putting together these ideas, here’s a sketch of my best-guess theory of valence^[12]: To experience valence, the agent must be conscious and mesa-optimize at this conscious level. Valence is then how the base optimizer presents its hard-coded evaluation of noxiousness to the agent’s mesa-optimizing processes. I’ll explore some approaches to distinguishing conscious mesa-optimization from non-conscious mesa-optimization in section 5.

As an example, take the case of pain-like valence. Here, the valence evaluation tags^[13] an external perception e.g. a needle prick, presenting this evaluative tag to the mesa-optimizer. At this loose level of formality, it’s unclear what sort of experiments could falsify my view. It is, however, possible to derive heuristic predictions. For example, consider an advanced (i.e. mesa-optimizing, abstract-workspace-using) RL agent. Valenced experience is likely to occur when encountering stimuli which meet four criteria:

Stimuli are common inter-episode; in other words, many agents will encounter the stimulus. (So learnable by the base optimizer.)
Stimuli’s effects cannot be learned within the episode. (So not learnable directly by the mesa-optimizer.)
Optimal behavior when encountering these stimuli requires complex/abstract planning. (Hence must be communicated to the mesa-optimizer.)
Failure to correctly plan consistently results in lower reward. (So a valence tag is helpful.)
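
The four criteria above can be read as a simple conjunctive checklist. The sketch below encodes them as a predicate. The field names and the two example stimuli (borrowed from the bear and hot-surface cases in section 4) are illustrative assumptions, and the function expresses only the heuristic prediction of this view, not a test for valence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stimulus:
    """Hypothetical description of a stimulus faced by an advanced RL agent."""
    name: str
    common_across_episodes: bool    # criterion 1: base optimizer can learn it
    learnable_within_episode: bool  # criterion 2 requires this to be False
    needs_abstract_planning: bool   # criterion 3: must reach the mesa-optimizer
    misplanning_costs_reward: bool  # criterion 4: a valence tag is helpful


def predicts_valence(s: Stimulus) -> bool:
    """True when all four heuristic criteria hold for the stimulus."""
    return (
        s.common_across_episodes
        and not s.learnable_within_episode
        and s.needs_abstract_planning
        and s.misplanning_costs_reward
    )


# Illustrative examples; the boolean assignments are assumptions.
predator = Stimulus("threatening predator", True, False, True, True)
hot_surface = Stimulus("hot surface", True, False, False, True)

print(predicts_valence(predator))
print(predicts_valence(hot_surface))
```

The hot-surface case fails only the third criterion: a reflex-like response suffices, so nothing needs to be communicated to the mesa-optimizer.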

This view also suggests that artificial/non-human agents which have the behavioral markers of valence most likely also have valenced phenomenology. The previously imagined valence alternatives (e.g. Phenumb) are, by default, unlikely to be found by optimization:

Valence mechanisms are unlikely to be replaced by a language-based, experientially-neutral signal of noxiousness (cf. Phenumb).
Valence mechanisms are unlikely to be replaced by low-level, reflex-like mechanisms.

Some more fine-grained details of the phenomenology of valence remain unspecified. Perhaps some sentient beings experience valence as involving throbbing, periodic sensations while others experience valence as more uniform and continuous. For ethical purposes, these details seem irrelevant. We’re concerned about whether artificial agents process noxiousness in a way more akin to our feelings of valence than to our reflexes and abstract reasoning.

5. How an easier version of the problem of consciousness could be relevant to understanding valence

In the previous sections, I eschewed discussion of the consciousness problem. However, valenced experience is probably morally relevant only when conscious^[14]. Hence, consciousness is an important criterion for determining AI moral patiency^[15]. In AI engineering, it may be more tractable to prevent valenced processing from reaching consciousness than to avoid all valenced processing. With that possibility in mind, let me distinguish two levels of research goals in the field of AI consciousness studies:

Formal Consciousness: Identify a formal computational criterion for a given state to be conscious. Build interpretability tools which can verify this criterion in an artificial neural network.
Heuristic Consciousness: Identify necessary computational conditions for a computational state to be conscious. Identify what incentives cause an optimizer to find systems which fulfill these conditions.

Two examples of conditions which seem useful and tractable in the Heuristic Consciousness context are (1) Abstraction and (2) Integrated Information^[16]. There is likely more to consciousness than these two criteria — e.g. perhaps higher-order content, or recurrent processing — but if we can formalize one or two necessary conditions for consciousness, designers of ML systems can avoid conscious valence by falsifying these conditions. Here’s a caricatured example of how falsifying the integrated information criterion could work. Assume the RL agent has a noxiousness detection module, and almost all of the computation happens before the noxiousness module feeds into the computation. Then, the informational integration between the noxiousness evaluation and the AI’s other processes may be low enough to avoid conscious valenced experience. In graphical form:

Figure 5, An RL architecture in which noxiousness evaluations are non-integrated.

I discuss why recent progress in ML shows promise for clarifying the abstraction condition in Appendix B.

6. My timeline

To make my view more concrete and to encourage productive disagreement, here's a quantification of my beliefs. In the table, conditional probabilities condition on all events above that line in the same section. The table links to a corresponding Google sheet which can be duplicated for those interested in plugging in their own probabilities. Page 2 of the sheet includes my 2% estimate for valence in 2021 AI. These estimates are dependent on my background beliefs on AI timelines which are broadly in line with Metaculus’ estimate of median ~2050 strong AGI.

Section	Probability	Conditional Prob	Description

View theoretically correct	0.75	0.75	Current research on valence is broadly correct (i.e. imperativism and evaluativism capture aspects of a final CSA of valence)
View theoretically correct	0.6	0.8	Mesa-optimization is necessary for valence
View theoretically correct	0.8	0.8	A version of the abstract workspace hypothesis unifying multi-modal and knowledge neurons is found
View theoretically correct	0.33	0.33	A formalization of relevant aspects of nociception and imperativism (perhaps with some simple extensions) along with mesa-optimization are sufficient for valenced experience
View theoretically correct		0.16	All of the above

Empirically realized 2030	0.85	0.85	Large DNNs mesa-optimize by 2030
Empirically realized 2030	0.5	0.5	The formalization of nociception, imperativism etc. occurs in DNNs by 2030
Empirically realized 2030	0.7	0.8	Mesa-optimization interacts with the abstract workspace in DNNs by 2030
Empirically realized 2030		0.34	All of the above

View correct and realized by 2030	0.16	0.16	view correct
View correct and realized by 2030	0.34	0.67	view empirically realized by 2030
		0.11	Final estimate of how likely my view is to materialize

Other	0.65		Mesa-optimization in an abstract workspace is not sufficient for valence
Other	0.15		Valenced experience in AI by 2030 conditional on my view being significantly off (i.e. allowing alternative functionalist theories and other theories of consciousness, but not panpsychist style theories/QRI's theory)
Other	0.236		Valenced experience of any kind in AI by 2030
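
For readers who want to plug in their own numbers without opening the sheet, the arithmetic behind the table can be sketched in a few lines of Python. The inputs are the "Conditional Prob" entries from the table. The way the final "any kind" figure is combined, and the rounding of intermediate values to two decimals, are my reconstruction from the displayed numbers and may not match the sheet's exact formulas.

```python
from math import prod

# "Conditional Prob" column, section "View theoretically correct".
theory_conditionals = [0.75, 0.8, 0.8, 0.33]
# "Conditional Prob" column, section "Empirically realized 2030".
realized_conditionals = [0.85, 0.5, 0.8]

# Intermediate values are rounded to two decimals, as displayed in the table.
p_view_correct = round(prod(theory_conditionals), 2)
p_realized_2030 = round(prod(realized_conditionals), 2)

# Probability the view is realized by 2030, conditional on it being correct.
p_realized_given_correct = 0.67
p_view_materializes = round(p_view_correct * p_realized_given_correct, 2)

# Valenced experience by 2030 conditional on the view being significantly off.
p_valence_given_view_off = 0.15
# Assumed combination: total probability over "view correct" vs "view off".
p_any_valence_2030 = round(
    p_view_materializes + (1 - p_view_correct) * p_valence_given_view_off, 3
)

print(p_view_correct, p_realized_2030, p_view_materializes, p_any_valence_2030)
```

Under these assumptions the script reproduces the 0.16, 0.34, 0.11 and 0.236 entries of the table.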

If we are in the 12% world where my conjectures hold and are realized by 2030, this may be of great moral significance. Quantifying how great would require its own EAF post, which would involve estimating how intense the valence experienced by the AI would be. In terms of aggregate duration of valenced experience, I did a back-of-the-envelope calculation extrapolating from AlphaZero’s experience playing 44 million games in nine hours^[17]. The duration of phenomenal time experienced by sentient AI circa 2030 could be ~10% of that experienced by the human race in a year, and would then surpass human experience by 2035.

7. Future directions

What success looks like

Assuming a version of my view is correct, we will likely be able to distinguish which neural architectures support valence and which do not. For instance, the DeepMind agent discussed in section 4 pre-processes an evaluation of its goal achievement. This evaluative signal is factored/non-integrated in a sense, and so it may not be interacting in the right way with the downstream, abstract processes to reach conscious processing. If the interaction between base-optimizer and mesa-optimizer is important for valence, as I hypothesize^[18], factoring out these processes may allow us to prevent such valence-relevant interactions from happening at the conscious level of abstraction. Making these arguments precise requires progress on many of the questions in section 7, but seems feasible by the late 2020s. The broader implication of my view is that we must control the interactions between models’ sub-modules to avoid valenced experience in AI. Conversely, continuing to scale up monolithic language models poses greater risk.

The view sketched in this post suggests a number of research questions. I will enumerate them by discipline, and within discipline by importance. I have omitted research questions which are already well known in the EA/AI safety community e.g. many questions about mesa-optimization are relevant to this research proposal. If you are interested in working on any of them feel free to reach out!

Further elaboration on the general view

Assuming this view is empirically realized, calculate an expected value for averting negative valenced experience in that AI. Should we expect this system to have symmetric positive/negative valenced experiences? See e.g. [ref].
Under this characterization of valence, given an AI with valence-like behavior, should we assume by default that it experiences valence?
Do a more careful version of the timeline calculation using distributions for timelines rather than point-estimates.
Solicit expert opinions on the likelihood of mesa-optimization by 2030, cf. [ref]. Similarly refine all of the other estimates in the table.

Applied philosophy

Sketch potential parallels between negatively valenced nociception and processes in LMs and RL systems
Sketch potential parallels to imperativist-style aversion in LMs and RL AI
Come up with thought experiments imagining agents which are conscious but do not mesa-optimize
Come up with examples clarifying what integrated information looks like in mesa-optimization. In what sense is there a continuum between external optimization and integrated mesa-optimization?
Look into the research on AAD to check for empirical correlations between optimization and other valence correlates.
Look into how this view fits in with work on animal sentience
Compare the psycho-functionalist view with a hypothetical mesa-optimizer
Sketch how other views on consciousness may influence the timeline estimate above e.g. higher-order processing conditions

Machine learning

Experiment on the likelihood of mesa-optimization in near-term AI systems. Can we train explicitly for mesa-optimization [ref]?
Study the phenomenon of abstractions in computation more generally [ref]. Also see Appendix B.

Appendix A. How my view relates to other EA Writing

Tomasik’s View on Moral Patiency

I take Tomasik’s view as characterized in his paper ‘Do Artificial Reinforcement-Learning Agents Matter Morally?’ [ref]. His third assumption claims “RL operations are relevant to an agent’s welfare”. What does this mean in functionalist terms? For simplicity’s sake, let’s focus on an RL agent trained via vanilla policy gradient (VPG). By the functionalist assumption, the conscious mind of this agent is equivalent to a certain CSA. Sections 4 and 5 suggest that this CSA must necessarily be rather complex, but for the purposes of this example, suppose the conscious mind of the agent were described by the CSA below:

Figure 6, The functional form of the VPG agent’s mind. The effects of gradient application are shown in red. Dashed box indicates the valence system. (Modified from [ref]).

The reward function is not calculated by the agent, and so is not encoded anywhere in the CSA. Instead, the gradient update calculated from the reward signal modifies the agent as shown in red. Under this functionalist understanding of the mind, the reward gradient is something that can modify the valence system; it is not part of the valence system, as e.g. the physical states corresponding to CSA states 2511 and 1611 are. Hence, from the functionalist perspective, neither the RL agent’s gradient updates nor its reward function can constitute valenced experience. In psychological terms, RL updates are much like operant conditioning: Suppose you drink coffee only while doing your morning stretches, in an effort to make stretching more pleasant for yourself. Eventually you find stretching pleasant even without the coffee. You have modified your valence system using coffee, but the coffee itself was not part of your valence system. Returning to Tomasik’s assumption, “RL operations are relevant to an agent’s welfare”, the functionalist must disagree. At best we can say that RL operations can be instrumentally valuable by modifying the valence system. RL operations are not intrinsically part of the valence system.
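
The structural point can be seen in a minimal policy-gradient sketch. This is a toy two-armed bandit written for illustration: the class and function names, the reward rule, and the hyperparameters are all assumptions, and the update is the textbook REINFORCE step rather than any particular library's implementation. Note where the reward appears: in the environment and in the external update routine, but never as an input to the agent's own forward computation.

```python
import math
import random
from typing import List


class Agent:
    """Softmax policy over two actions. Its forward pass never sees a reward."""

    def __init__(self) -> None:
        self.logits: List[float] = [0.0, 0.0]

    def probs(self) -> List[float]:
        exps = [math.exp(x) for x in self.logits]
        total = sum(exps)
        return [e / total for e in exps]

    def act(self, rng: random.Random) -> int:
        return 0 if rng.random() < self.probs()[0] else 1


def environment_reward(action: int) -> float:
    """Hypothetical bandit reward, computed outside the agent."""
    return 1.0 if action == 1 else 0.0


def vpg_update(agent: Agent, action: int, reward: float, lr: float = 0.1) -> None:
    """External trainer step: logits += lr * reward * grad log pi(action)."""
    p = agent.probs()
    for k in range(len(agent.logits)):
        # For a softmax policy, d log pi(a) / d logit_k = 1[k == a] - pi_k.
        grad_log_pi = (1.0 if k == action else 0.0) - p[k]
        agent.logits[k] += lr * reward * grad_log_pi


def train(steps: int = 500, seed: int = 0) -> Agent:
    rng = random.Random(seed)
    agent = Agent()
    for _ in range(steps):
        action = agent.act(rng)
        reward = environment_reward(action)  # never passed to agent.act
        vpg_update(agent, action, reward)    # modifies the agent from outside
    return agent


trained = train()
print(trained.probs())
```

After training, the agent prefers the rewarded action, yet at no point did its own computation represent the reward. On the functionalist reading above, the update has modified the agent's dispositions without being part of them.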

Schukraft’s Moral Weights Series

Schukraft [ref] focuses on evolutionary and biological approaches to understanding pain. The arguments made in this post broadly agree with his observations, and are complementary. I see this post as extending his arguments by looking at how the optimization/evolution perspective on valence interacts with the phenomenological perspective. I hope that this post also helped explain how Schukraft’s position sits in relation to the ML perspective on optimization, and to some degree our understanding of consciousness.

QRI’s Symmetry Theory of Valence

(Disclaimer: I have only a superficial understanding of QRI’s work). QRI seems to believe that it is necessary to take a physicalist rather than a functionalist perspective on valence. They probably disagree with the first premise of this post. However, their symmetry theory [ref] may be compatible with my view if we frame the differences in terms of Marr’s levels of analysis. Perhaps the symmetry theory offers an implementation level explanation while this post’s analysis offers a computational level explanation.

Appendix B. Is abstraction a necessary condition for consciousness?

To estimate AI valence timelines, we need an estimate of when AIs will satisfy various necessary conditions for consciousness. I’ll first briefly provide philosophical context for the abstraction condition. Then I’ll give an overview of recent results in machine learning and how they might lead to progress on this.

Abstraction plays a role in Global Workspace-style theories. For example, Whyte and Smith argue that conscious experiences occur at ‘the point in which lower-level representations update the content of higher-level states’ [ref]. Their theory focuses on a predictive processing picture of visual consciousness, but this emphasis on the importance of higher-level (abstract) processes for consciousness is not specific to their theory. Cf. Tye’s PANIC theory, in which conscious experiences ‘are poised to impact beliefs and desires. E.g. conscious perception of an apple can change your belief about whether there are apples in your house’ [ref]. These theories point at an abstract workspace which must be affected if a perception is to be conscious. As Muehlhauser notes, such theories do not offer definitions of abstraction. For example, it is unclear ‘What concept of “belief” [do the theories] refer to, such that the guesses of blindsight subjects do not count as “beliefs”’ [ref]. A definition of abstract workspace needs to formalize what it means to process concepts, objects and sense-aggregates. The abstract workspace processes at the level of people and objects rather than the low-level sensory inputs, e.g. colors or sounds, used to identify a given person or object.

Recent results from ML may hold promise for formalizing these notions of abstraction. Work in the interpretability literature has identified layers of neurons in which constituent neurons encode multi-modal concepts [ref] and propositional knowledge [ref, ref]^[19]. A similar previous result observed in humans [ref] has not, to my knowledge, informed the scientific understanding of neural abstraction. Nevertheless, the artificial deep neural network (DNN) context may provide a far more tractable ground for theoretical progress, for a few reasons. First, DNNs are manipulable: experimenters can counterfactually edit the relevant neurons, which allows causal study of their functional roles [ref]. Second, when taken together with the aforementioned biological neuron result, these results in DNNs suggest that abstraction, suitably formalized, may be general in the functionalist sense. As an example of the sort of theoretical result which would bridge the human and AI settings, I imagine a theorem proving that a local optimization method for CSA which finds an accurate model for a sufficiently broad set of tasks must instantiate a form of abstraction. This level of generality sounds beyond the reach of short-term research, but the above cited papers provide evidence for a special case of this conjecture for neural networks. Aside from the empirical results mentioned above, there are also some theoretical works suggesting that neural networks can recover factors of variation for (i.e. abstractions of) corresponding raw image data [ref, ref, ref].

Acknowledgements

Thanks to Aaron Gertler for feedback and formatting help.

Footnotes

1. My own understanding of functionalism is closest to Sloman’s [ref], but Chalmers’ functionalism has the advantage of having a more rigorous and succinct description.
2. This may be incorrect; perhaps most conscious states depend on individual neurons.
3. Substates, to be precise, in the CSA formalism.
4. Or at least conscious valence; I will abbreviate conscious valence as valence. Even if non-conscious valence is possible, I am interested in morally relevant forms of valence, and so in conscious valence.
5. By ‘made up of’ I mean ‘supervenient on’, for those comfortable with philosophical terminology.
6. And vice-versa: the existence of a computationalist theory of valence should not be interpreted as invalidating a neurological theory. In Chalmers’ words, ‘In general, there is no canonical mapping from a physical object to "the" computation it is performing. We might say that within every physical system, there are numerous computational systems.’ [ref].
7. In the philosophical literature, which experiences count as pain is a point of contention. In particular, Grahek distinguishes between pain with and without painfulness [ref] (other authors sometimes prefer to talk about pain with and without suffering). As I am interested in valence, and so in painfulness, I will use ‘pain’ to refer exclusively to painful/suffering-laden pains.
8. This Pain-Phenumb agent is rather similar to a pain asymbolic [ref], but there is ongoing debate regarding whether asymbolics are abnormal in their experience of painfulness or instead perhaps abnormal in their attitude towards their body. To avoid getting bogged down by this debate, I stick with Phenumb.
9. How to formalize this definition is far from obvious, cf. [ref], but for my purposes this heuristic description should suffice.
10. The mesa-optimization paper goes into detail about the conditions under which we expect mesa-optimization to be necessary.
11. In more detail, such a (mistaken) view might involve e.g. optimization carried out by 500 neurons across 5 layers of a neural network, with that 500-neuron set having zero connectivity with other neurons in those 5 layers. Though unlikely, this is possible. Siloed mesa-optimization may even be desirable because it does not constitute valence, as I will mention in section 7.
12. I see this not as a final theory but rather as a set of necessary conditions for conscious valence. As mentioned above, clarifying how nociception/higher-order content/... fit into the picture will also be necessary.
13. By ‘tags’, I mean that an evaluative signal is sent along with the perception to the mesa-optimizing/conscious processes of the agent. This tagging process may or may not identify that the tag was caused by one particular external perception; I’m uncertain here.
14. I see it as an open question whether non-conscious valenced experience is morally relevant. Tentatively, I’d suggest at least some criterion for consciousness must be satisfied for valence to be morally relevant.
15. I use the term consciousness, but I am agnostic as to whether or not illusionists are correct. For example, perhaps phenomenal reports are best explained by a Graziano-style attention schema. I expect that an elaboration of the attention schema will likely involve abstraction and information integration in an important way, meaning the arguments of this section still apply.
16. Tononi’s IIT need not be true; I just mean any similar theory. Again, such a theory need not fully characterize consciousness, it need only provide a necessary condition.
17. I assume phenomenal time in an AI would be similar to humans when processing a game of Go (this may be off by orders of magnitude). Then I used Cotra’s AI timeline forecasts to estimate that compute in 2030 will be 5 OOM greater than AlphaZero’s [ref].
18. Or if nociception is important to valence.
19. In these works, the concepts interact linearly and non-recurrently, which seems unlikely to be the case in the human mind. The extent to which concepts and knowledge emerge in earlier layers of neural networks (thereby supporting non-linear, and possibly recurrent, processing) remains open.

Steven Byrnes, Sep 10 2021

Let's say a human writes code more-or-less equivalent to the evolved "code" in the human genome. Presumably the resulting human-brain-like algorithm would have valence, right? But it's not a mesa-optimizer, it's just an optimizer. Unless you want to say that the human programmers are the base optimizer? But if you say that, well, every optimization algorithm known to humanity would become a "mesa-optimizer", since they tend to be implemented by human programmers, right? So that would entail the term "mesa-optimizer" kinda losing all meaning, I think. Sorry if I'm misunderstanding.

jacobpfau, Sep 11 2021

Certainly valenced processing could emerge outside of this mesa-optimization context. I agree that for "hand-crafted" (i.e. no base-optimizer) systems this terminology isn't helpful. To try to make sure I understand your point, let me try to describe such a scenario in more detail: Imagine a human programmer who is working with a bunch of DL modules and interpretability tools and programming heuristics which feed into these modules in different ways -- in a sense the opposite end of the spectrum from monolithic language models. This person might program some noxiousness heuristics that input into a language module. Those might correspond to a Phenumb-like phenomenology. This person might program some other noxiousness heuristics that input into all modules as scalars. Those might end up being valenced or might not, hard to say. Without having thought about this in detail, my mesa-optimization framing doesn't seem very helpful for understanding this scenario.

Ideally we'd want a method for identifying valence which is more mechanistic than mine, in the sense that it lets you identify valence in a system just by looking inside the system without looking at how it was made. All that said, most contemporary progress on AI happens by running base-optimizers which could support mesa-optimization, so I think it's quite useful to develop criteria which apply to this context.

Hopefully this answers your question and the broader concern, but if I'm misunderstanding let me know.

Steven Byrnes, Sep 13 2021

"most contemporary progress on AI happens by running base-optimizers which could support mesa-optimization"

GPT-3 is of that form, but AlphaGo/MuZero isn't (I would argue).

I'm not sure how to settle whether your statement about "most contemporary progress" is right or wrong. I guess we could count how many papers use model-free RL vs model-based RL, or something? Well anyway, given that I haven't done anything like that, I wouldn't feel comfortable making any confident statement here. Of course you may know more than me! :-)

If we forget about "contemporary progress" and focus on "path to AGI", I have a post arguing against what (I think) you're implying, "Against evolution as an analogy for how humans will create AGI", for what it's worth.

"Ideally we'd want a method for identifying valence which is more mechanistic than mine, in the sense that it lets you identify valence in a system just by looking inside the system without looking at how it was made."

Yeah I dunno, I have some general thoughts about what valence looks like in the vertebrate brain (e.g. this is related, and this) but I'm still fuzzy in places and am not ready to offer any nice buttoned-up theory. "Valence in arbitrary algorithms" is obviously even harder by far. :-)

jacobpfau, Sep 14 2021

Thanks for the link. I’ll have to do a thorough read through your post in the future. From scanning it, I do disagree with much of it; many of those points of disagreement were laid out by previous commenters. One point I didn’t see brought up: IIRC the biological anchors paper suggests we will have enough compute to do evolution-type optimization before the end of the century. So even if we grant your claim that learning to learn is much harder to directly optimize for, I think it’s still a feasible path to AGI. Or perhaps you think evolution-like optimization takes more compute than the biological anchors paper claims?

Steven Byrnes, Sep 14 2021

Nah, I'm pretty sure the difference there is "Steve thinks that Jacob is way overestimating the difficulty of humans building AGI-capable learning algorithms by writing source code", rather than "Steve thinks that Jacob is way underestimating the difficulty of computationally recapitulating the process of human brain evolution".

For example, for the situation that you're talking about (I called it "Case 2" in my post) I wrote "It seems highly implausible that the programmers would just sit around for months and years and decades on end, waiting patiently for the outer algorithm to edit the inner algorithm, one excruciatingly-slow step at a time. I think the programmers would inspect the results of each episode, generate hypotheses for how to improve the algorithm, run small tests, etc." If the programmers did just sit around for years not looking at the intermediate training results, yes I expect the project would still succeed sooner or later. I just very strongly expect that they wouldn't sit around doing nothing.

jacobpfau, Sep 15 2021

Ok, interesting. I suspect the programmers will not be able to easily inspect the inner algorithm, because the inner/outer distinction will not be as clear cut as in the human case. The programmers may avoid sitting around by fiddling with more observable inefficiencies e.g. coming up with batch-norm v10.

Steven Byrnes, Sep 15 2021

Oh, you said "evolution-type optimization", so I figured you were thinking of the case where the inner/outer distinction is clear cut. If you don't think the inner/outer distinction will be clear cut, then I'd question whether you actually disagree with the post :) See the section defining what I'm arguing against, in particular the "inner as AGI" discussion.

jacobpfau, Sep 16 2021

Ok, seems like this might have been more a terminological misunderstanding on my end. I think I agree with what you say here, 'What if the “Inner As AGI” criterion does not apply? Then the outer algorithm is an essential part of the AGI’s operating algorithm'.

Ofer, Sep 13 2021

I don't see why. The NNs in AlphaGo and MuZero were trained using some SGD variant (right?), and SGD variants can theoretically yield mesa-optimizers.

Steven Byrnes, Sep 14 2021

AlphaGo has a human-created optimizer, namely MCTS. Normally people don't use the term "mesa-optimizer" for human-created optimizers.

Then maybe you'll say "OK there's a human-created search-based consequentialist planner, but the inner loop of that planner is a trained ResNet, and how do you know that there isn't also a search-based consequentialist planner inside each single run through the ResNet?"

Admittedly, I can't prove that there isn't. I suspect that there isn't, because there seems to be no incentive for that (there's already a search-based consequentialist planner!), and also because I don't think ResNets are up to such a complicated task.

Ofer, Sep 16 2021

(I don't know/remember the details of AlphaGo, but if the setup involves a value network that is trained to predict the outcome of an MCTS-guided gameplay, that seems to make it more likely that the value network is doing some sort of search during inference.)

Steven Byrnes, Sep 16 2021

Hmm, yeah, I guess you're right about that.

Ofer, Sep 10 2021

This topic seems to me both extremely important and neglected. (Maybe it's neglected because it ~requires some combination of ML and philosophy backgrounds that people rarely have).

My interpretation of the core hypothesis in this post is something like the following: A mesa optimizer may receive evaluative signals that are computed by some subnetwork within the model (a subnetwork that was optimized by the base optimizer to give "useful" evaluative signals w.r.t. the base objective). Those evaluative signals can constitute morally relevant valenced experience. This hypothesis seems to me plausible.

Some further comments:

Re:

For instance, the DeepMind agent discussed in section 4 pre-processes an evaluation of its goal achievement. This evaluative signal is factored/non-integrated in a sense, and so it may not be interacting in the right way with the downstream, abstract processes to reach conscious processing.

I don't follow. I'm not closely familiar with the Open-Ended Learning paper, but from a quick look my impression is that it's basically standard RL in multi-agent environments, with more diversity in the training environments than most works. I don't understand what you mean when you say that the agent "pre-processes an evaluation of its goal achievement" (and why the analogy to humans & evolution is less salient here, if you think that).

Re:

Returning to Tomasik’s assumption, “RL operations are relevant to an agent’s welfare”, the functionalist must disagree. At best we can say that RL operations can be instrumentally valuable by (positively) modifying the valence system.

(I assume that an "RL operation" refers to things like an update of the weights of a policy network.) I'm not sure what you mean by "positively" here. An update to the weights of the policy network can negatively affect an evaluative signal.

[EDIT: Also, re: "Compare now to mesa-optimizers in which the reward signal is definitionally internal to the system". I don't think that the definition of mesa-optimizers involves a reward signal. (It's possible that a mesa-optimizer will never receive any evidence about "how well it's doing".)]

jacobpfau, Sep 11 2021

Your interpretation is a good summary!

Re comment 1: Yes, sorry, this was just meant to point at a potential parallel, not to work out the parallel in detail. I think it'd be valuable to work out the potential parallel between the DM agent's predicate predictor module (Fig12/pg14) and my factored-noxiousness-object-detector idea. I just took a brief look at the paper to refresh my memory, but if I'm understanding this correctly, it seems to me that this module predicts which parts of the state prevent goal realization.

Re comment 2: Yes, this should read "(positively/negatively)". Thanks for pointing this out.

Re EDIT: Mesa-optimizers may or may not represent a reward signal -- perhaps there's a connection here with Demski's distinction between search and control. But for the purposes of my point in the text, I don't think this much matters. All I'm trying to say is that VPG-type-optimizers have external reward signals, whereas mesa-optimizers can have internal reward signals.

Ofer, Sep 11 2021

I guess what I don't understand is how the "predicate predictor" thing can make it so that the setup is less likely to yield models that support morally relevant valence (if you indeed think that). Suppose the environment is modified such that the observation that the agent gets in each time step includes the value of every predicate in the reward specification. That would make the "predicate predictor" useless (I think; just from a quick look at the paper). Would that new setup be more likely than the original to yield models that have morally relevant valence?

jacobpfau, Sep 14 2021

Your new setup seems less likely to have morally relevant valence. Essentially the more the setup factors out valence-relevant computation (e.g. by separating out a module, or by accessing an oracle as in your example) the less likely it is for valenced processing to happen within the agent.

Just to be explicit here, I'm assuming estimates of goal achievement are valence-relevant. How generally this is true is not clear to me.

Ofer, Sep 14 2021

"Essentially the more the setup factors out valence-relevant computation (e.g. by separating out a module, or by accessing an oracle as in your example) the less likely it is for valenced processing to happen within the agent."

I think the analogy to humans suggests otherwise. Suppose a human feels pain in their hand due to touching something hot. We can regard all the relevant mechanisms in their body outside the brain—those that cause the brain to receive the relevant signal—as mechanisms that have been "factored out from the brain". And yet those mechanisms are involved in morally relevant pain. In contrast, suppose a human touches a radioactive material until they realize it's dangerous. Here there are no relevant mechanisms that have been "factored out from the brain" (the brain needs to use ~general reasoning); and there is no morally relevant pain in this scenario.

Though generally if "factoring out stuff" means that smaller/less-capable neural networks are used, then maybe it can reduce morally relevant valence risks.

jacobpfau, Sep 15 2021

Good clarification. Determining which kinds of factoring are the ones which reduce valence is more subtle than I had thought. I agree with you that the DeepMind set-up seems more analogous to neural nociception (e.g. high heat detection). My proposed set-up (Figure 5) seems significantly different from the DM/nociception case, because it factors the step where nociceptive signals affect decision making and motivation. I'll edit my post to clarify.

Effective Altruism Forum