Thoughts on the OpenAI alignment plan: will AI research assistants be net-positive for AI existential risk?

What I like about OpenAI’s alignment plan

This post started with my high-level take, which included a lot of criticisms. There are also a bunch of things about the alignment plan post that I quite liked. So in no particular order:

I liked that the plan describes the high-level approach to aligning AGI (not just current models). Namely, “building and aligning a system that can make faster and better alignment research progress than humans can.” Even though I don’t think this is a good plan, I like that they spelled out what their high-level approach is!

In general, I think the plan did a good job pointing at a bunch of true things. I like that they explicitly named existential risks from AI systems as a concern. Any alignment plan needs to start here, and they started here. In the same vein, the plan did a good job recognizing that current AI systems like InstructGPT are far from aligned. They also recognized that OpenAI is currently lacking in interpretability researchers, and that this might be important for their plan.

As part of their core alignment approach, they mention that “evaluating alignment research is substantially easier than producing it”, which seems true, even if evaluation is still quite difficult (as I argue above). It also seems true that if you could produce an AI system that could output many alignment plans and do nothing else, some of which were truly good, that would be a huge step forward in alignment. And assuming it wasn’t already vastly smarter than humans (if it was, it could probably just convince you to release it right away), this would probably be largely safety positive.

They also acknowledged limitations of their plan and anticipated ways that people would disagree with them. For example, they acknowledged that there are fundamental limitations in how well humans can provide training signals for training aligned AI systems, that major discontinuities between aligning current systems and AGI systems would make the approach substantially less likely to work, and that “the least capable models that can help with alignment research might already be too dangerous if not properly aligned”.

I also want to acknowledge that I didn’t address every counterpoint to the common arguments against the OpenAI approach that Jan includes here. I appreciated that Jan wrote these out and I would like to respond to them, but I ran out of time this week. I might add comments on these as follow-ups in the comments later.

The OpenAI alignment plan is outlined in this post, and Jan Leike adds more context to his thinking about this plan and related ideas on his blog. If you want to dive deep into the plan, I recommend reading not just the original post but also the rest of these. Jan addresses a bunch of points not present in the alignment plan post and also responds to common criticisms:

Thanks AW for helping me review this post! Mistakes are all his.


An additional and likely even worse problem is that there are likely many necessary alignment insights / discoveries, ones you would need to find before building aligned AGI, that would also speed up or make easier your general AGI engineering research. Thus, some kinds of alignment research can make us less safe if these insights spread. Both capabilities-insights-that-help-with-alignment and alignment-insights-that-help-us-with-capabilities would therefore differentially help create unsafe AGI compared to safe AGI.


Jan acknowledges that this is true; see this quote from this post: “One of the problems that conceptual alignment work has is that it’s unclear when progress is being made and by how much. The best proxy is ‘do other researchers think progress is being made’ and that’s pretty flawed: the alignment research community largely disagrees about whether any conceptual piece constitutes real progress.”

But his answer, “iterate on real systems”, fails to address many of the core concerns that conceptual alignment is trying to address! You can’t just say “well conceptual research is hard to evaluate so we’ll just skip that part and do the easy thing” if the easy thing won’t actually solve the problems you need to solve!


Relevant piece from Yudkowsky’s List of Lethalities: “#3 We can gather all sorts of information beforehand from less powerful systems that will not kill us if we screw up operating them; but once we are running more powerful systems, we can no longer update on sufficiently catastrophic errors. This is where practically all of the real lethality comes from, that we have to get things right on the first sufficiently-critical try.”


In particular, List of Lethalities points 10, 17, and 20 seem especially lethal for OpenAI’s current approach:

“10. You can't train alignment by running lethally dangerous cognitions, observing whether the outputs kill or deceive or corrupt the operators, assigning a loss, and doing supervised learning. On anything like the standard ML paradigm, you would need to somehow generalize optimization-for-alignment you did in safe conditions, across a big distributional shift to dangerous conditions… Powerful AGIs doing dangerous things that will kill you if misaligned, must have an alignment property that generalized far out-of-distribution from safer building/training operations that didn't kill you. This is where a huge amount of lethality comes from on anything remotely resembling the present paradigm. Unaligned operation at a dangerous level of intelligence*capability will kill you; so, if you're starting with an unaligned system and labeling outputs in order to get it to learn alignment, the training regime or building regime must be operating at some lower level of intelligence*capability that is passively safe, where its currently-unaligned operation does not pose any threat. (Note that anything substantially smarter than you poses a threat given any realistic level of capability. Eg, "being able to produce outputs that humans look at" is probably sufficient for a generally much-smarter-than-human AGI to navigate its way out of the causal systems that are humans, especially in the real world where somebody trained the system on terabytes of Internet text, rather than somehow keeping it ignorant of the latent causes of its source code and training environments.)”

“17. More generally, a superproblem of 'outer optimization doesn't produce inner alignment' is that on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they're there, rather than just observable outer ones you can run a loss function over. This is a problem when you're trying to generalize out of the original training distribution, because, eg, the outer behaviors you see could have been produced by an inner-misaligned system that is deliberately producing outer behaviors that will fool you. We don't know how to get any bits of information into the inner system rather than the outer behaviors, in any systematic or general way, on the current optimization paradigm.”

“20. Human operators are fallible, breakable, and manipulable. Human raters make systematic errors - regular, compactly describable, predictable errors. To faithfully learn a function from 'human feedback' is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we'd hoped to transfer). If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them. It's a fact about the territory, not the map - about the environment, not the optimizer - that the best predictive explanation for human answers is one that predicts the systematic errors in our responses, and therefore is a psychological concept that correctly predicts the higher scores that would be assigned to human-error-producing cases.”
