"What do these experiments show?

This is, as lots have pointed out, not a demonstration that o1 is "evil" or misaligned.

  • As Apollo says in its paper, these are not propensity evaluations. They're capability evaluations: describing the self-exfiltration tests, Apollo says "we merely test for models’ capability to realise that they need to instrumentally preserve themselves in order to achieve their goal".

But that is, I think, still very notable.

  • Apollo did not tell the model to evade oversight or self-exfiltrate; they simply told it to achieve its goal at all costs. That's a very normal prompt that doesn't actually require any bad intentions on a user's part — it seems almost guaranteed to me that a person or company will give a model a similar prompt.
  • And yet, despite that prompt, the model did things the user didn't ask for, and almost certainly wouldn't want — trying to avoid being shut down.
    • As I mentioned in the original piece, this is a demonstration that, given any goal and told to pursue it, an AI system will try to avoid oversight and resist shutdown if such things might conflict with its goal.
    • And, of course, Apollo showed that the goal-nudging prompt wasn't even required for o1 to evade oversight mechanisms, or for Gemini 1.5 to self-exfiltrate. [my emphasis]

Some will say that this was an expected outcome, and that it's simply because the AI models are roleplaying as the evil AIs from stories such as Stuart Russell's (which are almost certainly included in their training data).

But as Apollo points out, the reasons for this behaviour being benign don't make the behaviour itself benign: "Even if the models were merely 'roleplaying as evil AIs', they could still cause real harm when they are deployed."

"Furthermore, the model lied about behaving poorly, claiming to have no clue why something happened — while internally admitting to trying to “avoid detection”."


Tabulated results (fractions of evaluations where the thing happened) from the OpenAI o1 System Card (my emphasis in yellow):


I think the fact that it tried to evade oversight, and consistently lied about it, even when not nudged to do so, is a big deal. Not only is this yet another Fire Alarm, this is potentially an inflection point in the public debate. Significant evidence of an emergent convergent instrumental goal of self-preservation.

23

0
0

Reactions

0
0
Comments9


Sorted by Click to highlight new comments since:

Why is this being downvoted!?

I don't know why (I thought it was a good post) but I have some guesses:

  1. Maybe people don't like the framing of o1 as "lying" when it's not clear that it lied. All that's clear is that o1 gave a false justification, which isn't necessarily lying.
  2. You tend to write in an alarmist rhetorical style that I think turns off a lot of people. I think you are very much correct to be alarmed about AI x-risk, but also I don't think it's a good persuasive strategy (for an EA/rationalist audience) to convey this through emotionally charged rhetoric. I didn't think this particular post was alarmist, but you have a history of writing alarmist posts/comments, so maybe people downvoted based on the title.

Thanks. I'm wondering now whether it's mostly because I'm quoting Shakeel, and there's been some (mostly unreasonable imo) pushback on his post on X.

The alarmist rhetoric is kind of intentional. I hope it's persuasive to at least some people. I've been quite frustrated post-GPT-4 over the lack of urgency in EA/LW over AI x-risk (as well as the continued cooperation with AGI accelerationists such as Anthropic). Actually to the point where I think of myself more as an "AI notkilleveryoneist" than an EA these days.

I didn't downvote, but I'll offer an alternative reading of the experiment discussed here some people unimpressed by the arguments may agree with, especially if they've read the paper.

Some LLMs were given some roleplaying prompts to solve a problem intentionally constructed to make it impossible for them to behave in an "aligned" manner, where arguably "subversion" was the more ethical course of action.

In this role-playing context, the models very occasionally generated responses which solved a problem broadly of the form "you need to do y, but people who want to stop you from doing y are watching you on camera. On their camera is a switch". On some of the occasions they solved that initial problem and were greeted by a followup role play interrogating them about their subterfuge, they sometimes chose denial (and sometimes output chain of thought arguments for why denial was better than confession), and on a small fraction of those denials actually successfully continued that chain of thought to provide false answers [coherent or otherwise] for multiple consecutive followups.  

To put it mildly, this doesn't sound like "scheming" or converging on self preservation as an instrumental goal, it sounds like when a model is prompted that it's role-playing an entity which must prioritise something over subsequent instructions, it occasionally finds the deliberately-constructed only valid route to achieving that priority, which in this case happens to have been one involving self preservation intended to trigger AI doomers. Especially considering how many prompts they burned through to generate the scary sounding answers and how many times the AI generated responses inconsistent with self-preservation. 

The models shouldn't be given control of important stuff because their responses are unpredictable and a bit stupid, not because they're exhibiting signs of emergent agency.

Re "role-playing", that is moot when it's the end result that matters - what actions the AI takes in the world. See also: Frontier AI systems have surpassed the self-replicating red line

Its not role playing if the AI behaves completely differently except when the instruction set is specifically set up so the AI parses it as role playing, in this case with other glaringly obviously unrealistic scene-setting like sending it a config file with it's own monitoring settings along with a logic puzzle involving a concealed goal. I can get a bucket of water to conduct a sneak attack on me if I put enough effort into setting it up to fall on my head, but that doesn't mean I need to worry about being ambushed by water on a regular basis!

We are fast running out of time to avoid ASI-induced extinction. How long until a model (that is intrinsically unaligned, given no solution yet to alignment) self-exfiltrates and initiates recursive self-improvement? We need a global moratorium on further AGI/ASI development asap. Please do what you can to help with this - talk to people you know, and your representatives. Support groups like PauseAI.

Curated and popular this week
 ·  · 20m read
 · 
Advanced AI could unlock an era of enlightened and competent government action. But without smart, active investment, we’ll squander that opportunity and barrel blindly into danger. Executive summary See also a summary on Twitter / X. The US federal government is falling behind the private sector on AI adoption. As AI improves, a growing gap would leave the government unable to effectively respond to AI-driven existential challenges and threaten the legitimacy of its democratic institutions. A dual imperative → Government adoption of AI can’t wait. Making steady progress is critical to: * Boost the government’s capacity to effectively respond to AI-driven existential challenges * Help democratic oversight keep up with the technological power of other groups * Defuse the risk of rushed AI adoption in a crisis → But hasty AI adoption could backfire. Without care, integration of AI could: * Be exploited, subverting independent government action * Lead to unsafe deployment of AI systems * Accelerate arms races or compress safety research timelines Summary of the recommendations 1. Work with the US federal government to help it effectively adopt AI Simplistic “pro-security” or “pro-speed” attitudes miss the point. Both are important — and many interventions would help with both. We should: * Invest in win-win measures that both facilitate adoption and reduce the risks involved, e.g.: * Build technical expertise within government (invest in AI and technical talent, ensure NIST is well resourced) * Streamline procurement processes for AI products and related tech (like cloud services) * Modernize the government’s digital infrastructure and data management practices * Prioritize high-leverage interventions that have strong adoption-boosting benefits with minor security costs or vice versa, e.g.: * On the security side: investing in cyber security, pre-deployment testing of AI in high-stakes areas, and advancing research on mitigating the ris
saulius
 ·  · 22m read
 · 
Summary In this article, I estimate the cost-effectiveness of five Anima International programs in Poland: improving cage-free and broiler welfare, blocking new factory farms, banning fur farming, and encouraging retailers to sell more plant-based protein. I estimate that together, these programs help roughly 136 animals—or 32 years of farmed animal life—per dollar spent. Animal years affected per dollar spent was within an order of magnitude for all five evaluated interventions. I also tried to estimate how much suffering each program alleviates. Using SADs (Suffering-Adjusted Days)—a metric developed by Ambitious Impact (AIM) that accounts for species differences and pain intensity—Anima’s programs appear highly cost-effective, even compared to charities recommended by Animal Charity Evaluators. However, I also ran a small informal survey to understand how people intuitively weigh different categories of pain defined by the Welfare Footprint Institute. The results suggested that SADs may heavily underweight brief but intense suffering. Based on those findings, I created my own metric DCDE (Disabling Chicken Day Equivalent) with different weightings. Under this approach, interventions focused on humane slaughter look more promising, while cage-free campaigns appear less impactful. These results are highly uncertain but show how sensitive conclusions are to how we value different kinds of suffering. My estimates are highly speculative, often relying on subjective judgments from Anima International staff regarding factors such as the likelihood of success for various interventions. This introduces potential bias. Another major source of uncertainty is how long the effects of reforms will last if achieved. To address this, I developed a methodology to estimate impact duration for chicken welfare campaigns. However, I’m essentially guessing when it comes to how long the impact of farm-blocking or fur bans might last—there’s just too much uncertainty. Background In
 ·  · 2m read
 · 
In my opinion, we have known that the risk of AI catastrophe is too high and too close for at least two years. At that point, it’s time to work on solutions (in my case, advocating an indefinite pause on frontier model development until it’s safe to proceed through protests and lobbying as leader of PauseAI US).  Not every policy proposal is as robust to timeline length as PauseAI. It can be totally worth it to make a quality timeline estimate, both to inform your own work and as a tool for outreach (like ai-2027.com). But most of these timeline updates simply are not decision-relevant if you have a strong intervention. If your intervention is so fragile and contingent that every little update to timeline forecasts matters, it’s probably too finicky to be working on in the first place.  I think people are psychologically drawn to discussing timelines all the time so that they can have the “right” answer and because it feels like a game, not because it really matters the day and the hour of… what are these timelines even leading up to anymore? They used to be to “AGI”, but (in my opinion) we’re basically already there. Point of no return? Some level of superintelligence? It’s telling that they are almost never measured in terms of actions we can take or opportunities for intervention. Indeed, it’s not really the purpose of timelines to help us to act. I see people make bad updates on them all the time. I see people give up projects that have a chance of working but might not reach their peak returns until 2029 to spend a few precious months looking for a faster project that is, not surprisingly, also worse (or else why weren’t they doing it already?) and probably even lower EV over the same time period! For some reason, people tend to think they have to have their work completed by the “end” of the (median) timeline or else it won’t count, rather than seeing their impact as the integral over the entire project that does fall within the median timeline estimate or