Varieties of fake alignment (Section 1.1 of “Scheming AIs”)


See Park et al. (2023) for a more in-depth look at AI deception. ↩︎
For readers unfamiliar with this story, see section 4.2 of Carlsmith (2021). ↩︎
Bostrom's original definition of the treacherous turn is: "While weak, an AI behaves cooperatively (increasingly so, as it gets smarter). When the AI gets sufficiently strong – without warning or provocation – it strikes, forms a singleton, and begins directly to optimize the world according to the criteria implied by its final values." Note that treacherous turns, as defined here, don't necessarily require that the early, nice-seeming behavior is part of an explicit strategy for getting power later (and Bostrom explicitly includes examples that don't involve such explicit pretense). Other definitions, though – for example, Arbital's here – define treacherous turning such that it implies strategic betrayal. And my sense is that this is how the term is often used colloquially. ↩︎
Cotra's definition of "playing the training game" is: "Rather than being straightforwardly 'honest' or 'obedient,' baseline HFDT would push Alex to make its behavior look as desirable as possible to Magma researchers (including in safety properties), while intentionally and knowingly disregarding their intent whenever that conflicts with maximizing reward. I'll refer to this as 'playing the training game.' " Note that there is some ambiguity here about whether it counts as playing the training game if, in fact, maximizing reward does not end up conflicting with human intent. I'll assume that this still counts: what matters is that the model is intentionally trying to perform well according to the training process. ↩︎
Note, though, that sometimes the term "episode" is used differently. For example, you might talk about a game of chess as an "episode" for a chess-playing AI, even if it doesn't satisfy the definition I've given. I discuss this difference in much more depth in section 2.2.1.2. ↩︎
See e.g. Gao (2022) here for a breakdown. On my ontology, the reward process starts with what Gao calls the "sensors." ↩︎
Bostrom's description is: "If an agent retains its present goals into the future, then its present goals will be more likely to be achieved by its future self. This gives the agent a present instrumental reason to prevent alterations of its final goals. (The argument applies only to final goals. In order to attain its final goals, an intelligent agent will of course routinely want to change its subgoals in light of new information and insight.)" ↩︎
Though my use of this term might differ from other usages in the literature. ↩︎
The persistent applicability of analogies like prison and re-education camps to AIs is one of the reasons I think we should be alarmed about the AI moral patienthood issues here. ↩︎
"If the mesa-optimizer has an objective that extends across parameter updates, then it will be incentivized to avoid being modified, as it might not pursue the same objective after modification (with the result that its current objective will not be achieved in future iterations). This means that the mesa-optimizer will be instrumentally incentivized to act as if it is optimizing the base objective function, even if its actual mesa-objective is something else entirely. We will refer to this hypothetical phenomenon as deceptive alignment. Deceptive alignment is a form of instrumental proxy alignment, as fulfilling the base objective is an instrumental goal of the mesa-optimizer." ↩︎
And I think it encourages confusion with nearby concepts as well: e.g., training-gaming, instrumental training-gaming, power-motivated instrumental training-gaming, etc. ↩︎
My sense is that the "alignment" at stake in Hubinger et al.'s (2019) definition is "alignment with the 'outer' optimization objective," which needn't itself be aligned with human interests/values/intentions. ↩︎
To be clear, though: I think it's OK if people keep using "deceptive alignment," too. Indeed, I have some concern that the world has just started to learn what the term "deceptive alignment" is supposed to mean, and that now is not the time to push for different terminology. (And doing so risks a proliferation of active terms, analogous to the dynamic in this cartoon -- this is one of the reasons I stuck with Cotra's "schemers.") ↩︎

Scheming and its significance

Varieties of fake alignment

Alignment fakers

Training-gamers

Power-motivated instrumental training-gamers, or "schemers"

Goal-guarding schemers