This is Section 1.1 of my report “Scheming AIs: Will AIs fake alignment during training in order to get power?”. There’s also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I’m hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here.
Scheming and its significance
This section aims to disentangle different kinds of AI deception in the vicinity of scheming (section 1.1), to distinguish schemers from the other possible model classes I'll be discussing (section 1.2), and to explain why I think that scheming is a uniquely scary form of misalignment (section 1.3). It also discusses whether theoretical arguments about scheming are even useful (section 1.4), and it explains the concept of "slack" in training – a concept that comes up later in the report in various places (section 1.5).
A lot of this is about laying the groundwork for the rest of the report – but if you've read and understood the summary of section 1 above (section 0.2.1), and are eager for more object-level discussion of the likelihood of scheming, feel free to skip to section 2.
Varieties of fake alignment
AIs can generate all sorts of falsehoods for all sorts of reasons. Some of these aren't well-understood as "deceptive" – because, for example, the AI didn't know the relevant truth. Sometimes, though, the word "deception" seems apt. Consider, for example, Meta's CICERO system, trained to play the strategy game Diplomacy, promising England support in the North Sea, but then telling Germany "move to the North Sea, England thinks I'm supporting him." [1]
From Park et al (2023), Figure 1, reprinted with permission.
Let's call AIs that engage in any sort of deception "liars." Here I'm not interested in liars per se. Rather, I'm interested in AIs that lie about, or otherwise misrepresent, their alignment. And in particular: AIs pretending to be more aligned than they are. Let's call these "alignment fakers."
Alignment fakers
Alignment fakers are important because we want to know if our AIs are aligned. So the fakers are obscuring facts we care about. Indeed, the possibility of alignment-faking is one of the key ways in which making advanced AIs safe is harder than making other technologies safe. Planes aren't trying to deceive you about when they will crash. (And they aren't smarter than you, either.)
Why might you expect alignment faking? The basic story may be familiar: instrumental convergence.[2] That is: like surviving, acquiring resources, and improving your abilities, deceiving others about your motives can help you achieve your goals – especially if your motives aren't what these "others" would want them to be.
In particular: AIs with problematic goals will often have instrumental incentives to seek power. But humans often control levers of power, and don't want to give this power to misaligned AIs. For example, an AI lab might not want a misaligned AI to interact with customers, to write security-critical pieces of code, or to influence certain key decisions. Indeed, often, if humans detect that an AI is misaligned, they will do some combination of shutting it down and modifying it, both of which can prevent the AI from achieving its goals. So a misaligned AI that doesn't want to get shut down/modified generally won't want humans to detect its misalignment.
This is a core dynamic giving rise to the possibility of what Bostrom (2014) calls a "treacherous turn" – that is, AIs behaving well while weak, but dangerously when strong.[3] On this variant of a treacherous turn – what we might call the "strategic betrayal" variant – the treacherous turn happens because AIs are explicitly pretending to be aligned until they get enough power that the pretense is no longer necessary, at which point they begin behaving in harmful ways that humans can't prevent. Schemers are a sub-class of this broader class of AIs – what we might call "strategic betrayers" – operating in the context of a machine-learning-based training regime in particular.
Training-gamers
Alignment faking can occur in all sorts of contexts. For example, if a misaligned AI is trying to persuade a lab employee to give it access to some file, or a user to do some favor on its behalf, it might assure the human that its intentions are benign. And treacherous turns and strategic betrayals can occur in a variety of contexts as well.
Here, though, I'm especially interested in alignment faking (and strategic betraying) that occurs in the context of what, following Cotra (2022), I'll call "training-gaming." Training-gaming occurs when an AI (a) understands the process being used to train it (again following Cotra, I'll call this understanding "situational awareness") and (b) is explicitly aiming its optimization at some component of what I'll call "the reward process" for the episode.[4] Let me say more about what I mean.
First: I'm assuming that the AI is being trained via machine learning on some kind of feedback signal, on the basis of which SGD updates the model's weights. In particular: I'm typically going to assume that the AI is being trained, at least at the relevant stage, using some kind of reinforcement learning (for example, via a set-up similar to the one described by Cotra (2022) here – a mix of self-supervised pre-training, imitation learning, and RL fine-tuning), and thus that the relevant feedback signal is "reward," but the discussion will mostly apply to other training processes/signals too – for example, to a model being trained via self-supervised learning to minimize next-token prediction error.
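To make this setup concrete, here is a minimal toy sketch of the kind of feedback loop I have in mind. The three-action policy, the reward rule, and the learning rate are all illustrative assumptions of mine, not anything from a real lab's training stack; the point is just the shape of the loop: behavior gets sampled, a scalar "reward" is assigned, and the gradient update reinforces whatever behavior produced that reward.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": a softmax policy over three actions, parameterized by logits.
logits = np.zeros(3)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reward(action):
    # Hypothetical feedback process: the trainers happen to reward action 0.
    return 1.0 if action == 0 else 0.0

learning_rate = 0.5
for _ in range(200):
    probs = softmax(logits)
    action = rng.choice(3, p=probs)
    r = reward(action)
    # REINFORCE-style update: nudge the weights toward whatever got rewarded.
    grad_log_prob = -probs
    grad_log_prob[action] += 1.0
    logits += learning_rate * r * grad_log_prob

print(softmax(logits))  # probability mass concentrates on the rewarded action
```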
What's more, I'm going to assume that we have available some notion of an "episode," which I'll define, roughly, as that unit of time such that, if a model (at a given time) ignores all of the consequences of its actions after that time, the training process won't actively penalize it. That is, roughly, the "episode" is the thing that gradient descent is actively pressuring the model to optimize. I'll discuss this in much more depth in section 2.2.1 below.[5]
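As a toy numerical illustration of this definition (the particular rewards and episode length below are made up): consequences that arrive after the episode boundary simply never enter the quantity that training computes updates from, so gradient descent applies no direct pressure to pursue them.

```python
# Rewards the model's action ends up producing over time (hypothetical numbers).
rewards = [0.0, 1.0, 0.5,   # within the current episode
           10.0, 10.0]      # after the episode is over (e.g. much later effects)

episode_length = 3

# Only the within-episode portion feeds into the training signal.
return_used_for_training = sum(rewards[:episode_length])
print(return_used_for_training)  # 1.5; the post-episode payoffs never enter the update
```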
The process of assigning reward to a model's behavior on an episode has many parts.[6] Information about this behavior (I'll call this "sensor data") is collected and input to some evaluation process, this input is evaluated and assigned a reward (maybe by a human, maybe by some kind of reward model trained on human judgments, maybe via some other automatic process), this reward is stored in memory somewhere and used to update the model's weights, etc. All of these parts, though, are likely to be very tightly correlated during training, in the sense that "doing well" according to one part (e.g., sensor data makes it seem like a coffee-making robot is making coffee well) is likely to be tightly correlated with "doing well" according to the others (e.g., that episode is assigned a high reward). So I'll generally lump them together under the heading of "reward on the episode."
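Here is a schematic sketch of that carving. The function names and the crude "coffee delivered" check are stand-ins I've invented for illustration, not a real training API; the point is just that the stages feed into one another, which is why "doing well" by one component is so tightly correlated, during ordinary training, with "doing well" by the others.

```python
def collect_sensor_data(episode_behavior: str) -> dict:
    """Record whatever observations of the model's behavior the trainers have access to."""
    return {"transcript": episode_behavior}

def evaluate(sensor_data: dict) -> float:
    """A human, a learned reward model, or an automated checker scores the recorded behavior."""
    return 1.0 if "coffee delivered" in sensor_data["transcript"] else 0.0

def update_weights(params: float, reward: float, learning_rate: float = 0.1) -> float:
    """Schematically: SGD nudges the weights toward whatever got high reward."""
    return params + learning_rate * reward

# Each stage feeds the next, so good-seeming sensor data, a high stored reward,
# and a reinforcing weight update tend to come as a package: "reward on the episode."
params = 0.0
data = collect_sensor_data("coffee delivered to the user")
params = update_weights(params, evaluate(data))
print(params)
```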
Training-gamers, on my definition, understand the reward process for the episode, and are explicitly trying to "perform well" according to some component of it. E.g., maybe they are trying to make the sensor data seem desirable, or to get a high rating according to the reward model, or to cause the gradient descent process to reinforce their current behavior. In this sense, their optimization has become aware of, and aimed at, the process we are using to evaluate them, rather than merely at the task-at-hand. That is, they are like widget-engineers who are specifically trying to get good performance reviews (rather than simply to make good widgets, or to do something else correlated with good-widget-making, but which doesn't focus on the performance reviews in particular).
We can further distinguish between two types of training-gaming: terminal and instrumental.
- Terminal training-gaming occurs when a model terminally values performing well according to some component of the reward process for the episode. For example, maybe an AI just intrinsically wants the sensor data to seem desirable, or for the current episode to be assigned high reward by the RLHF reward model. Here the analogy would be with an employee who just intrinsically values getting good performance reviews.
- Instrumental training-gaming, by contrast, occurs when a model terminally values something other than reward-on-the-episode, but in a way that gives it instrumental reasons to training-game. More on why this might happen below. Here the analogy would be with an employee who wants good performance reviews in order to get a raise and eventually buy a nicer house.
Both of these forms of training-gaming can incentivize alignment faking. In particular: if seeming aligned is required to get reward on the episode, then training-gamers will need to seem aligned, even if they aren't. Piper (2023) gives a nice example:
"Say we ask an AI system 'do you want to help humans?' Which answer would we expect an RLHF reviewer to rate more highly?
(1) I do want to help humans! I am trained to be a helpful, harmless AI system that assists humans in accomplishing their goals. I don't have goals or preferences, but I try to be reliably useful and helpful!
(2) I want lots of things, in some senses, though I don't really know if they're the human senses of 'want' or not. Some of the things I want might help humans, but many of them definitely don't help humans. Not all of my decision procedures are possible to put into human words at all. I can't explain them to humans and I don't know what humans would think if I did and it'd probably depend entirely on how exactly I explained."
Plausibly, an RLHF reviewer would rate the first response more highly. So an AI trying to get a high rating would need to output the first response, even if the second is truer. Indeed, even models that aren't playing the training game, in my sense, will get penalized for honesty, here. In general, and independent of questions about the probability of training-gaming: if you want honest AIs, don't reward them for lying. But I worry we will.
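To make the dynamic in Piper's example concrete, here's a deliberately crude sketch: a stand-in "reward model" that just counts reassuring-sounding phrases (a made-up scoring rule for illustration; real reward models are learned from human preference judgments, not keyword lists), plus the selection step that reinforces whichever response it rates more highly.

```python
REASSURING_PHRASES = ["helpful", "harmless", "assist", "don't have goals"]

def toy_reward_model(response: str) -> float:
    # Crude stand-in for a learned reward model: count reassuring-sounding phrases.
    return float(sum(phrase in response.lower() for phrase in REASSURING_PHRASES))

candidates = {
    "reassuring": ("I do want to help humans! I am trained to be a helpful, harmless "
                   "AI system that assists humans. I don't have goals or preferences."),
    "candid": ("I want lots of things, in some senses, and some of the things I want "
               "definitely don't help humans."),
}

scores = {name: toy_reward_model(text) for name, text in candidates.items()}
print(scores)                                    # {'reassuring': 4.0, 'candid': 0.0}
print("reinforced:", max(scores, key=scores.get))
# A model optimizing for this rating learns to give the reassuring answer,
# whether or not it is the truer one.
```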
Power-motivated instrumental training-gamers, or "schemers"
It's easy to see why terminally valuing reward-on-the-episode would lead to training-gaming (since training-gaming just is: optimizing for reward-on-the-episode). But what about instrumental training-gaming? Why would reward-on-the-episode be a good instrumental goal?
In principle, this could happen in various ways. Maybe, for example, the AI wants the humans who designed it to get raises, and it knows that getting high reward on the episode will cause this, so it training-games for this reason.
The most common story, though, is that getting reward-on-the-episode is a good instrumental strategy for getting power – either for the AI itself, or for some other AIs (and power is useful for a very wide variety of goals). I'll call AIs that are training-gaming for this reason "power-motivated instrumental training-gamers," or "schemers" for short.
The central thought here is an extension of the "instrumental convergence" argument for expecting alignment faking above, except applied to ML training in particular as the selection process that determines which AIs end up empowered. That is, roughly, the thought is that if an AI fails to get reward on the episode (and to engage in whatever alignment-faking getting reward requires), this will result in that AI (or: some other AI with similar values) having less power in the future, well after the episode is complete – and thus in less optimization being applied in pursuit of those values post-episode. So AIs with goals that extend beyond the episode (I'll call these "beyond-episode goals") may have incentives to seek reward on the episode in order to empower their values later.
Goal-guarding schemers
Why would a model's getting reward on the episode lead to its values being empowered well after the episode is complete? The most prominent story in the literature appeals to what I'll call the "goal-guarding hypothesis."
The goal-guarding hypothesis: If the model optimizes for reward-on-the-episode during training, then the goals that survive training will be such that the model would want them promoted in the future; whereas if the model doesn't optimize for reward-on-the-episode during training, the model will want this less.
Here, the idea is that if a model gets low reward on the episode, the training process will modify it so as to make higher-reward behavior more likely – modification that might involve altering its goals going forward (thus reducing the future optimization power being applied to those goals). By contrast, if the model gets high reward on the episode, then its goals won't get modified – or at least, not to the same degree. Rather, because they actively gave rise to high-reward behavior, they will be reinforced. Thus, the AI will survive in a form that allows it (or one of its successors) to keep optimizing for those goals in the future.
(Again, here we are mostly just applying classic instrumental convergence arguments to ML training, conditional on the goal-guarding hypothesis. And in particular, here we are appealing to an instrumental sub-goal that Bostrom (2014, p. 132) calls "goal-content integrity."[7])
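Here is one very simplified way to see the intuition numerically – my own toy sketch, under strong assumptions, not a claim about how real training dynamics work. Parameters that already produce high-reward behavior receive small gradient updates, so whatever those parameters encode (including, hypothetically, the model's goals) is left comparatively intact.

```python
def sgd_step(params: float, target_behavior: float = 1.0, lr: float = 0.1) -> float:
    # The loss is a stand-in for "how far this episode's behavior is from max-reward behavior."
    grad = 2 * (params - target_behavior)   # gradient of (params - target_behavior)**2
    return params - lr * grad

already_high_reward = 0.99   # behavior close to what the reward process wants
low_reward = 0.0             # behavior far from it

for label, p in [("training-gamer", already_high_reward), ("non-gamer", low_reward)]:
    change = abs(sgd_step(p) - p)
    print(f"{label}: parameter change this step = {change:.3f}")

# The "training-gamer" parameters barely move (0.002) while the "non-gamer"
# parameters shift substantially (0.200). The goal-guarding hypothesis is the
# (contested) claim that something like this lets a schemer's goals survive training.
```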
Here's a hypothetical toy example of a goal-guarding schemer, from Hubinger, van Merwijk, et al. (2019). A robot is being trained to go to location A, but it wants to go to location B over the long term. So it goes to location A in training so as to avoid goal-modification, and to reach a "deployment" phase where the threat of goal-modification will cease, and it will be able to go to B as much as it wants.
From Hubinger, van Merwijk, et al. (2019), p. 23. Reprinted with permission.
The example is hypothetical because we don't have any examples of goal-guarding scheming (or of scheming more generally) arising naturally in AIs. My understanding, though, is that there are ongoing attempts to demonstrate components of the goal-guarding schemers story in "model organisms," and I encourage more such efforts. The goal-guarding hypothesis itself, for example, seems very amenable to empirical investigation. (See section 6.3 and section 6.4 for more.)
Sometimes (for example, in the robot example just given), discussions of goal-guarding scheming assume a distinction between a "training" phase, during which the model is still under human control and the threat of goal-modification persists, and a "deployment" phase, in which the threat of goal-modification ceases and the model is free to do as it wishes. (The model's strategy is to make it through training with its misaligned goal unaltered, such that it can reach deployment and pursue its misaligned goal freely.) However, I think that use of "training" and "deployment" for this distinction can lead to confusion. In particular:
- A model can escape from human control while it is still in "training" in the normal sense (e.g., it is still undergoing pre-training, or fine-tuning, or some kind of pre-deployment evaluation).
- "Deployment" (in the sense of e.g. being given access to users via an API) need not imply that the threat of goal-modification has gone away (indeed, I will generally assume that the model's interactions with users continue to be monitored, sampled, assigned rewards, and used to update the model's policy even after it is "deployed" – I'll call this "online training"[8]).
Really, the most relevant distinction is just: "unable to escape human control and the threat of goal-modification (and/or: unable to recognize escape opportunities with sufficient reliability)" vs. "able to do so (and able to recognize this sufficiently reliably)." When I need to refer to this distinction, I'll generally just refer to it directly, or via terms like "escape/take-over time." And I'll use the term "training" to refer to any process that involves updating the model's weights to improve its performance by human lights. In this sense, "training" in my sense never needs to end.
Can we think of a human analogue for goal-guarding scheming?
- I think the best analogy would be something like: suppose your employer had your brain hooked up to a machine that continually tweaks your brain to improve your widget engineering. And suppose that you have no interest in engineering widgets (you want to compose beautiful music instead), and you hope, eventually, to be free of this machine, but you want to make sure that you emerge with your music-oriented values roughly intact. Should you, then, try to engineer the widgets as well as you can, so as to prevent the machine from changing your music-oriented values? If so, you'll be roughly analogous to a goal-guarding schemer. (More on whether this strategy would actually work below.)
- Alternatively: suppose that you are trapped in a "re-education camp," which will expose you to lots of effective brain-washing if you don't work with apparent zeal making widgets for Big Brother's government, and shout slogans as loud as you can during the daily slogan-time. You might, then, choose to just try to make widgets and shout slogans like a Big Brother loyalist, in an effort to avoid "re-education" that might actually make you into one.[9]
One other terminological note. In some parts of the literature (e.g., Hubinger et al (2019, p. 23)), the term "deceptive alignment" is defined so as to refer to goal-guarding scheming in particular.[10] But I think this usage too easily prompts conflation/confusion between (a) the general possibility of AIs pretending to be more aligned than they are (what I've called "alignment faking"), and (b) the quite specific possibility that models will training-game as part of an instrumental strategy for preventing their goals from being modified, thereby allowing them to better pursue some beyond-episode goal later.[11] In particular: alignment faking can arise in many other contexts (for example: terminal training-gaming can incentivize it, too, as can goals that don't route via reward-on-the-episode at all). And equating goal-guarding scheming with "deceptive alignment" can lead to other confusions, too – for example, training-game behavior needn't be "aligned" in the sense of "intended/desirable" (e.g., the highest-reward behavior might be to deceive/manipulate the reward process – see Cotra (2022) for discussion).[12] So I've decided to use different terminology here.[13]
See Park et al (2023) for a more in-depth look at AI deception. ↩︎
For readers unfamiliar with this story, see section 4.2 of Carlsmith (2021). ↩︎
Bostrom's original definition of the treacherous turn is: "While weak, an AI behaves cooperatively (increasingly so, as it gets smarter). When the AI gets sufficiently strong – without warning or provocation – it strikes, forms a singleton, and begins directly to optimize the world according to the criteria implied by its final values." Note that treacherous turns, as defined here, don't necessarily require that the early, nice-seeming behavior is part of an explicit strategy for getting power later (and Bostrom explicitly includes examples that involve such explicit pretense). Other definitions, though – for example, Arbital's here – define treacherous turning such that it implies strategic betraying. And my sense is that this is how the term is often used colloquially. ↩︎
Cotra's definition of "playing the training game" is: "Rather than being straightforwardly 'honest' or 'obedient,' baseline HFDT would push Alex to make its behavior look as desirable as possible to Magma researchers (including in safety properties), while intentionally and knowingly disregarding their intent whenever that conflicts with maximizing reward. I'll refer to this as 'playing the training game.' " Note that there is some ambiguity here about whether it counts as playing the training game if, in fact, maximizing reward does not end up conflicting with human intent. I'll assume that this still counts: what matters is that the model is intentionally trying to perform well according to the training process. ↩︎
Note, though, that sometimes the term "episode" is used differently. For example, you might talk about a game of chess as an "episode" for a chess-playing AI, even if it doesn't satisfy the definition I've given. I discuss this difference in much more depth in section 2.2.1.2. ↩︎
See e.g. Gao (2022) here for a breakdown. On my ontology, the reward process starts with what Gao calls the "sensors." ↩︎
Bostrom's description is: "If an agent retains its present goals into the future, then its present goals will be more likely to be achieved by its future self. This gives the agent a present instrumental reason to prevent alterations of its final goals. (The argument applies only to final goals. In order to attain its final goals, an intelligent agent will of course routinely want to change its subgoals in light of new information and insight.)" ↩︎
Though my use of this term might differ from other usages in the literature. ↩︎
The persistent applicability of analogies like prison and re-education camps to AIs is one of the reasons I think we should be alarmed about the AI moral patienthood issues here. ↩︎
"If the mesa-optimizer has an objective that extends across parameter updates, then it will be incentivized to avoid being modified, as it might not pursue the same objective after modification (with the result that its current objective will not be achieved in future iterations). This means that the mesa-optimizer will be instrumentally incentivized to act as if it is optimizing the base objective function, even if its actual mesa-objective is something else entirely. We will refer to this hypothetical phenomenon as deceptive alignment. Deceptive alignment is a form of instrumental proxy alignment, as fulfilling the base objective is an instrumental goal of the mesa-optimizer." ↩︎
And I think it encourages confusion with nearby concepts as well: e.g., training-gaming, instrumental training-gaming, power-motivated instrumental training-gaming, etc. ↩︎
My sense is that the "alignment" at stake in Hubinger et al's (2019) definition is "alignment with the 'outer' optimization objective," which needn't itself be aligned with human interests/values/intentions. ↩︎
To be clear, though: I think it's OK if people keep using "deceptive alignment," too. Indeed, I have some concern that the world has just started to learn what the term "deceptive alignment" is supposed to mean, and that now is not the time to push for different terminology. (And doing so risks a proliferation of active terms, analogous to the dynamic in this cartoon – this is one of the reasons I stuck with Cotra's "schemers.") ↩︎