De Dicto and De Se Reference Matters for Alignment

philgoetz

I submitted an entry in the AI Fables Contest which contains a perplexing sentence about quasi-indexicals and de se references. I'd add a footnote to the story explaining that, if I could explain it in a footnote. But I can't. The issue is important enough, and complicated enough, that I need to post an explanation here.

de re / de dicto

You may already be familiar with the de re / de dicto distinction. This is a distinction between two possible ways of resolving a pronoun (figuring out what it refers to). I would feel negligent to refer you to any Googlable explanation of this distinction, because they are all written by philosophers, using only analysis of human-language texts rather than logical representations, and so are underspecified, use examples that are co-mingled with other representational problems, and fail to fully understand what's going on. So I'll cite no sites¹, but begin de novo.

I'm going to use predicate logic. I don't believe you can build an AGI using predicate logic; but I believe that any AGI must make the same distinctions as those represented here in predicate logic, although those distinctions won't be so absolute or black-and-white.

If Jamal says, "I want to marry the prettiest girl in the county", he might be in love with a woman whom he considers the prettiest girl in the county. That is, Jamal's beliefs include

wants("Jamal", marry("Jamal", human23)).
[long logical form saying human23 is the prettiest girl in the county]

We should then interpret this de re, meaning "about the [specific, known] thing", meaning that the woman is already specified in the wants predication, and "the prettiest girl in the county" is one way of describing (not identifying) human23 using the rest of Jamal's beliefs.

Or, Jamal might be a status-seeker who will be content only with the prettiest girl in the county, whoever she is. In that case Jamal's beliefs include something like:

forall(X, implies([long logical form saying X is the prettiest girl in the county], wants("Jamal", marry("Jamal", X)))).

That's de dicto, "about what is said": The words "the prettiest girl" etc. define the criteria for the thing.
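As a minimal sketch of the difference, assume a toy belief space in which Jamal's "long logical form" is reduced to a dictionary of his own prettiness ratings. The names `prettiness`, `prettiest`, and `human57` are hypothetical stand-ins, not part of any real representation.

```python
# Hypothetical toy belief space for Jamal: his own ratings of individuals.
prettiness = {"human23": 0.9, "human94": 0.7}

def prettiest(ratings):
    """Stand-in for the long logical form: pick the top-rated individual."""
    return max(ratings, key=ratings.get)

# De re: the want stores the referent directly.
de_re_want = ("marry", "Jamal", "human23")

# De dicto: the want stores a criterion, resolved each time it is evaluated.
def de_dicto_want(ratings):
    return ("marry", "Jamal", prettiest(ratings))

# Today both readings pick out the same person.
assert de_re_want == de_dicto_want(prettiness)

# If Jamal comes to rate a newcomer higher, only the de dicto want changes.
updated = {**prettiness, "human57": 0.95}
print(de_re_want)              # ('marry', 'Jamal', 'human23')
print(de_dicto_want(updated))  # ('marry', 'Jamal', 'human57')
```

The two readings are indistinguishable until the belief space changes, which is why the English sentence alone can't tell you which one Jamal means.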

If Zora says of Jamal, "He wants to marry the prettiest girl in the county", she might mean Jamal wants to marry the girl Zora thinks is prettiest. That's considered de re; Zora believes wants("Jamal", marry("Jamal", human94)), where human94 picks out some particular human in Zora's beliefs.

Or, she might mean Jamal wants to marry the girl he thinks is the prettiest. That's de dicto, because the words "the prettiest girl in the county" must be evaluated in Jamal's belief space.²

BUT, this includes both cases from the first example, in which Jamal says "I want to marry the prettiest" etc., including the case in which Jamal is in love with a woman whom he considers the prettiest girl in the county, which we just said was de re. What's going on?

What's going on is that everything is ultimately evaluated de re, as a proposition about a referent (like human23) within some agent's belief space. When we call an interpretation de dicto, we mean that it is a proposition containing an internal proposition which must be evaluated first, within a representation of some agent's belief space, in order to supply a referent for the full proposition. If this is because the speaker reports someone else's belief, this first decision is usually whether a clause in that belief is the speaker's description of an agent the speaker already has in mind, or must be evaluated in the belief space of the believer. But when it's passed down into a simulation of the believer's belief space to be evaluated, the de re / de dicto distinction must be made again, this time to determine whether the believer is describing, or defining, the referent. de re = we have the referent already; de dicto = we need to make another recursive call to evaluate a part of the proposition.

In this case, we said the sentence was de dicto with respect to Zora's belief space because we passed it on for evaluation into our simulation of Zora's simulation of Jamal's belief space. But now that we're looking at it from that belief space, we must make the distinction again. If we find the proposition wants("Jamal", marry("Jamal", human23)) in our simulation of Zora's simulation of Jamal's beliefs, we've got our referent: he wants to marry human23. So at that point in processing, it's de re. But if we instead find the forall(X, implies(…, wants(...))) from the second example above, we must do another level of (de dicto) evaluation to find out whom we think that X refers to in Jamal's mind.³
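The recursion can be sketched as follows, assuming belief spaces are plain dictionaries and that each agent's simulation of another agent is stored under a `models` key. The marker `PRETTIEST`, the function names, and the dictionary layout are all hypothetical simplifications.

```python
# Hypothetical marker for the description "the prettiest girl in the county".
PRETTIEST = ("the", "prettiest girl in the county")

def resolve(term, space):
    """Return a referent for `term` as seen from belief space `space`.

    A plain identifier is already a referent (de re at this level).
    The description needs one more lookup in this space (de dicto).
    """
    if term == PRETTIEST:
        return space["prettiest"]
    return term

def zora_report(zora, reading):
    """Whom does Jamal want to marry, according to Zora's sentence?"""
    if reading == "de re":
        # The description is Zora's own: evaluate it in Zora's space.
        return resolve(PRETTIEST, zora)
    # De dicto for Zora: pass the question into her simulation of Jamal,
    # where the distinction has to be made a second time.
    jamal_sim = zora["models"]["Jamal"]
    return resolve(jamal_sim["wants_to_marry"], jamal_sim)

def make_zora(jamal_want):
    return {
        "prettiest": "human94",
        "models": {"Jamal": {"prettiest": "human23",
                             "wants_to_marry": jamal_want}},
    }

in_love = make_zora("human23")        # Jamal's want holds a referent
status_seeker = make_zora(PRETTIEST)  # Jamal's want holds a criterion

print(zora_report(in_love, "de re"))           # human94
print(zora_report(in_love, "de dicto"))        # human23
print(zora_report(status_seeker, "de dicto"))  # human23
```

The last two calls return the same referent, but the second reaches it with one more recursive evaluation than the first.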

Now you understand the de re / de dicto distinction better than most philosophers!⁴

de se

de se is distinct from de re and de dicto, but it's basically just a kludge for the special cases of de dicto in which an agent uses self-referential words like I, we, here, or now.⁵ Suppose Jamal says, "I am the president of the United States". We don't want to represent that internally as

believes("Jamal", is("Jamal", president(US))).

because Jamal might believe that he is Joe Biden. The inner proposition is("Jamal", president(US)) would then be evaluated within Jamal's belief space to find the referent of the label "Jamal", which would not be the same person that we believe is "Jamal"; and so the entire believes proposition would be false. We're getting closer if we write

believes("Jamal", is("me", president(US))).

Now "me" is a de dicto reference, to be interpreted inside our representation of Jamal's mind. But this is both trickier than it seems, and trickier than it needs to be.

Trickier than it seems, because the grounding of the referent of "me" (the way the referent connects to the world) is different from the grounding of the referent of "Jamal". It's phenomenological, not epistemological. The semantics for a label like "Jamal" link sense data about a word (how it's spelled or sounds) to sense data that seems to consistently identify the same person. The label-like "me"⁶ is based on the self-consciousness of the believer, as in "I think, therefore I am". "I" isn't a de re pointer to a thing in the world; it's a de dicto criterion for the thing we want to refer to: that it be the consciousness doing the thinking.

Trickier than it needs to be, because we don't usually even want to resolve "me" to a referent object in the world. Think about how you use "me" when describing a dream: It means the person whose consciousness you experience in that dream, even when that's someone entirely different from your waking self. And in the dream, you follow all the rules of self-preservation, which presumably refer somehow to "you".

In any case, we reason about ourselves so often that we don't want to take the time to resolve the current external referent of "me" every time it pops up in our thought. Doubly so, because then we'd have to translate that external referent back to "the agent whose effectors are controlled by this thinking mind" when taking action. That would be like looking up the word "me" in an English-Spanish dictionary every time you see it, then looking that Spanish word up in a Spanish-English dictionary to see what it means.

So, although "I" is a lot like a de dicto reference, it shouldn't use the same internal representation. It is not a label to be resolved at runtime. It refers to the "me" whose effectors the thinker controls. We traditionally identify it in a proposition with an asterisk⁷, and call it a quasi-indexical. Like this:

believes("Jamal", is(me*, president(US))).

That asterisk may be small, but it signifies that that position in the predication is a de se reference, which, to be evaluated, must be connected to the vast network of senses, feelings, and effectors constituting the consciousness of the agent whose belief space the de se reference is in. This is why, although an agent can assert believes("Jamal", is(me*, president(US))), it can't assert anything like is("Jamal"*, president(US)). In the former, the is(...) proposition is implicitly quoted. It must be passed into Jamal's (real or simulated) belief space to have meaning, because that meaning must be self-referential.
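A minimal sketch of why the label and the quasi-indexical come apart, assuming a belief space is a dictionary with a `self` slot (standing in for the agent's connection to its own senses and effectors) and a `labels` table. `ME_STAR`, `evaluate_subject`, and the `body_*` identifiers are hypothetical.

```python
ME_STAR = object()  # hypothetical marker for the quasi-indexical me*

def evaluate_subject(subject, space):
    """Resolve the subject of a proposition inside one belief space."""
    if subject is ME_STAR:
        # De se: no label lookup; it denotes the owner of this belief space.
        return space["self"]
    # A label such as "Jamal" is looked up in that agent's own beliefs.
    return space["labels"][subject]

# Our beliefs: the label "Jamal" names the person in body_17.
our_labels = {"Jamal": "body_17", "Joe Biden": "body_42"}

# Jamal believes he is Joe Biden, so in his belief space the label
# "Joe Biden" names his own body and "Jamal" names someone else.
jamal_space = {
    "self": "body_17",
    "labels": {"Jamal": "body_99", "Joe Biden": "body_17"},
}

# Evaluating is("Jamal", ...) in Jamal's space finds the wrong person.
print(evaluate_subject("Jamal", jamal_space) == our_labels["Jamal"])  # False

# Evaluating is(me*, ...) in Jamal's space finds Jamal himself.
print(evaluate_subject(ME_STAR, jamal_space) == our_labels["Jamal"])  # True
```

Note that `ME_STAR` has no meaning outside some belief space: `evaluate_subject` needs a `space` argument to resolve it, which mirrors the claim that the is(...) proposition is implicitly quoted.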

Quasi-Indexicals in Human Values

If you study human values, what humans care most about is other humans. When we talk about suffering, poverty, crime, and homelessness, we mean human suffering, etc.

Now consider an AI trying to learn human values via Inverse Reinforcement Learning (IRL). It can easily see that humans care about the suffering of humans. But it's not obvious whether to interpret the thing that a particular human cares about de re:

care-about(me*, humans).

de dicto:

forall(X, implies(f(X), care-about(me*, X))).

or de se:

care-about(me*, we*).

An AI trying to internalize human values who makes the de re interpretation will care about humans. One who makes the de dicto interpretation will try to figure out what the function f is such that f(humans) is "true", and care about any X satisfying f(X). One who makes the de se interpretation will care about AIs.
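The three outcomes can be sketched as follows, assuming a tiny population of kinds of agents and a made-up criterion `f`. Both are hypothetical placeholders; the real f is whatever the learner infers from human behavior.

```python
POPULATION = ["human", "AI", "dolphin"]  # hypothetical kinds of agents

def internalized_targets(interpretation, learner, f):
    """What a learner ends up caring about under each interpretation."""
    if interpretation == "de re":
        # care-about(me*, humans): the observed referent is copied as-is.
        return {"human"}
    if interpretation == "de dicto":
        # forall(X, implies(f(X), care-about(me*, X))): apply the criterion.
        return {x for x in POPULATION if f(x)}
    if interpretation == "de se":
        # care-about(me*, we*): we* is re-evaluated in the learner's space.
        return {learner}
    raise ValueError(f"unknown interpretation: {interpretation}")

# Made-up criterion that both humans and AIs happen to satisfy.
def f(x):
    return x in {"human", "AI"}

for reading in ("de re", "de dicto", "de se"):
    print(reading, sorted(internalized_targets(reading, "AI", f)))
# de re ['human']
# de dicto ['AI', 'human']
# de se ['AI']
```

Only the de re reading is insensitive to who the learner is and to what f turns out to be.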

The de re interpretation is implausible. For almost everything humans care about, they reached into a grab bag of all the species on Earth, and picked out, by chance, just their own species to care about? No way. An agent doesn't have to have super-human intelligence to reject that hypothesis.

The de dicto interpretation still has low priors–for everything we care about, the answer is always humans? Well, it all depends on the different functions f(X) for each aspect of a creature's life that we might care about. They might be highly correlated, so always giving the same answer isn't completely implausible.

In my story, the AI seems to make the de dicto interpretation, largely because the humans claim to be using the de dicto interpretation themselves, so as to pretend their values are logically justified moral absolutes. But the AI satisfies the human-given function f better than humans do, so the AI internalizes human values as ones which now call for the AIs to become custodians of life on Earth, following the example of humans before them.

So I didn't need to write about quasi-indexicals to explain the story. But we can imagine another story in which AI makes the de se interpretation, leading to similar results. The de se interpretation has the highest priors, because it requires no coincidences, and because evolutionary theory demands it be the correct interpretation. So AI alignment needs to consider that case as well.

Wait, It Gets Worse

You may notice that we're left with no way out. I said that de re is implausible, de dicto is unstable, and de se means immediate AI takeover. You might think this means we just have to use some method other than IRL to give AIs human values.

I'm not sure, but I think that my argument actually suggests that IRL, while much harder than we realized, is the only method that might possibly work. This is because my argument claims we must not instill AIs with human de se values, because then the AIs would care about AIs, not humans. We must rather change the semantics of the AI's representation, and therefore you need a learning process to tweak the copy until it produces the same output as the original–if that's even possible.

I believe that an AGI will require not a mind like a 1970s symbolic AI operating on a toy world, in which you reason as you do in geometry, beginning with axioms and successively deducing infallible conclusions. I believe it requires a foundationless, recurrent network of beliefs, which are iteratively updated in a way that makes the entire network more consistent with itself and with sense data. No belief will be "True" or "False"; a belief will be probabilistic, open to debate, and will have a vote proportional to its relevance to the circumstances.

In such a system, you can't represent beliefs or values as facts and be done with it. You must record, at least for a time, the ancestry or provenance of each one. An obvious reason for this is that you can't reconsider a belief or value, or decide how important it is in present circumstances, if you don't remember why you believe it. A less-obvious reason is that you can't use a belief as evidence, in conjunction with other beliefs, if you don't remember why you believe them, because you need to know how independent these beliefs are in order to combine them properly. De re beliefs are completely independent of each other, while all de se beliefs have a correlation of 1.0.
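To illustrate why provenance matters when combining beliefs, here is a minimal sketch that assumes beliefs act as likelihood ratios combined in log-odds form, as in naive Bayes. That combination rule is an assumption chosen for illustration, not a claim about how an AGI's inference engine would work, and `posterior` is a hypothetical helper.

```python
import math

def posterior(prior, likelihood_ratios, fully_correlated):
    """Combine supporting beliefs, expressed as likelihood ratios, with a prior.

    Independent beliefs each contribute their own evidence.
    Beliefs with correlation 1.0 share one source, so they count once.
    """
    log_odds = math.log(prior / (1 - prior))
    if fully_correlated:
        log_odds += math.log(max(likelihood_ratios))
    else:
        log_odds += sum(math.log(lr) for lr in likelihood_ratios)
    return 1 / (1 + math.exp(-log_odds))

ratios = [3.0, 3.0, 3.0]  # three beliefs, each favoring the conclusion 3:1

independent = posterior(0.5, ratios, fully_correlated=False)
correlated = posterior(0.5, ratios, fully_correlated=True)

print(round(independent, 3))  # 0.964
print(round(correlated, 3))   # 0.75
```

The same three beliefs yield different conclusions depending only on how independent they are taken to be, which is information a bare list of facts does not carry.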

So I think any AI inference engine will produce different results given a belief whose justification includes a de se reference, versus a belief which has a de re reference at that same place which just happens to point to the same body in the world. So a "shallow copy" of human beliefs, where all the de se references to humanity are replaced with de re pointers to humans, won't re-present, or implement, human values. If human values can be properly expressed by a belief network with the same topology as the original de se network, it will need to have some numeric tweaking to account for the loss of correlation between their (lack of) justifications. And even if we can do that tweaking, the presence of many unjustified de re references would mean the AI's values would be mostly unjustified. Those de re beliefs would probably be unnaturally fixated as "True", which would make the AI inflexible in the same ways symbolic AIs are inflexible, and fanatical in the way religious zealots are fanatical–and thus, again but in a different way, incapable of having (sane) human values.

It isn't obvious that it's impossible to construct a network with only de re references that functions the same as one with only de se references. But neither is it obvious that it's possible. If it isn't, then a strange consequence of the fact that human values are largely justified as being good for us, may be that no AI can really have human values unless it is accepted as human. I don't think the question can be answered confidently without a clearer idea of how the logical representation will manifest in a connectionist AI.

Footnotes

1. The top Google hit on "de re / de dicto distinction" is the Stanford Encyclopedia of Philosophy, but I don't recommend its entry, because it's jargon-laden. The Wikipedia entry uses the "Lois Lane loves Superman" example, which works to communicate the concept only if you don't think deeply enough to notice that the example doesn't hinge on the de re / de dicto distinction as much as it does on the complicated meaning of "is" in English. "Clark Kent is Superman" does not mean the two words refer to the same thing! The "thing" is the sensory phenomenon, which includes the appearance and behavior, which are different for "Clark Kent" than for "Superman". Lois Lane doesn't love Clark Kent. Most of the examples given have some such semantic complication.

2. We don't usually bother considering the de dicto interpretation "Jamal wants to marry whichever woman Zora thinks is the prettiest in the county." It's too a priori unlikely.

3. The "de re / de dicto distinction" is ultimately the same semantic boondoggle as the bogus "objective / subjective distinction". Nothing is objective, and nothing is subjective; everything is either uninterpretable, or is objective within some context. No claims except mathematical or religious axioms, and possibly some laws of physics, are without context; and the more context, the more subjective the claim. "De re" means "objective within a context that equals the speaker's belief space"; "de dicto" = "objective within a context that is the speaker's representation of someone else's belief space" = one particular kind of "subjectivity".

4. I except my first advisor, William Rapaport, who wrote several papers with Janyce Wiebe on the de re / de dicto / de se distinction using logical deep structures.

5. You'll see philosophers write him*, he*, she*, and so on, because they work on English sentences. But in deep representations, symbolic AI engineers strip off the inflections and relativize each quasi-indexical to the belief space in which the proposition is to be evaluated. Every he* in the English text must eventually be evaluated as a me*.

6. Which we may or may not want to use when Jamal says "Jamal".

7. Internally, the de dicto and de se representations "X" and X* would likely be quote(X) and quasi(X), and the inference engine would give these special predicates different semantics than most predicates.

rhollerith, Oct 4 2023

First of all, let me say I'm glad to see your name on a new post because I have fond memories of your LW posts from over 10 years ago. You write:

exists(X, implies([long logical form saying X is the prettiest girl in the county], wants(“Jamal”, marry(“Jamal”, X))))

Where you wrote "exists", you meant "forall".

Suppose Zelda is definitely not the prettiest girl in the county. Then implies([long logical form saying Zelda is the prettiest girl in the county], wants(“Jamal”, marry(“Jamal”, Zelda))) == implies(False, ...) == True, which makes the exists form True regardless of whom Jamal wants to be married to. I.e., the formula as you have it now does not constrain the space of possibilities the way you want it to.

Another way to see it is that "For all x in X, x is green" gets formalized as forall(x, implies(x in X, x is green)), which in broad brush strokes is the form you want here.
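The difference is easy to check on a small finite domain. This is a minimal sketch in which Python's `any` and `all` play the roles of exists and forall; the three names and the choice of Ada as prettiest are hypothetical.

```python
def implies(p, q):
    """Material implication: false only when p is true and q is false."""
    return (not p) or q

girls = ["Zelda", "Ada", "Bea"]  # hypothetical domain
prettiest = "Ada"                # hypothetical fact in Jamal's beliefs

def formula(quantifier, wants):
    """quantifier(X, implies(X is the prettiest, Jamal wants to marry X))."""
    return quantifier(implies(x == prettiest, x in wants) for x in girls)

# Jamal wants to marry Bea, who is not the prettiest.
print(formula(any, {"Bea"}))  # True: Zelda makes the antecedent false
print(formula(all, {"Bea"}))  # False: Ada is prettiest but not wanted

# Jamal wants to marry Ada, the prettiest.
print(formula(all, {"Ada"}))  # True
```

The exists form comes out True in both scenarios, so it cannot distinguish them; the forall form does.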

philgoetz, Oct 4 2023

You're right. Thanks! It's been so long since I've written conversions of English to predicate logic.
