On how various plans miss the hard bits of the alignment challenge

So8res

35 min read · Jul 12, 2022

Comments 13

Owen Cotton-Barratt

Thanks for this, and especially for your last post (I'm viewing this as kind of an appendix-of-examples to the last post, which was one of my favourite pieces from the MIRI-sphere or indeed on AI alignment from anywhere). General themes I want to pick out:

- My impression is that there is a surprising dearth of discussion of what the hard parts of alignment actually are, and that this is one of the most important discussions to have given that we don't have clean agreed articulations of the issues
  - I thought your last post was one of the most direct attempts to discuss this that I've seen, and I'm super into that
- I am interested in further understanding "what exactly would constitute a sharp left turn, and will there be one?"
- I'm in strong agreement that the field would be healthier if more people were aiming at the central problems, and I think it's super healthy for you to complain about how it seems to you like they're missing them.
  - I don't think everyone should be aiming directly at the central problems because I think it may be that we don't yet know enough to articulate and make progress there, and it can be helpful as a complement to build up knowledge that could later help with central problems; I would at least like it though if lots of people spent a little bit of time trying to understand the central problems, even if they then give up and say "seems like we can't articulate them yet" or "I don't know how to make progress on that" and go back to more limited things that they know how to get traction on, while keeping half an eye on the eventual goal and how it's not being directly attacked.

I also wanted to clarify that Truthful AI was not trying to solve the hard bit of alignment (I think my coauthors would all agree with this). I basically think it could be good for two reasons:

- As a social institution it could put society in a better place to tackle hard challenges (like alignment; if we get long enough between building this institution and having to tackle alignment proper).
- It could get talented people who wouldn't otherwise be thinking about alignment to work on truthfulness. And I think that some of the hard bits of truthfulness will overlap with the hard bits of alignment, so it might produce knowledge which is helpful for alignment.

(There was also an exploration of "OK but if we had fully truthful AI maybe that would help with alignment", but I think that's more a hypothetical sideshow than a real plan.)

So I think you could berate me for choosing not to work on the hard part of the problem, but I don't want to accept the charge of missing the point. So why don't I work on the hard part of the problem? I think:

- I don't actually perceive the hard part of the problem clearly
  - It feels slippery, and trying to tackle it head-on prematurely seems too liable to result in doing work that I will later think completely misses the point
  - But I can perceive the shape of something there (I may or may not end up agreeing with you about its rough contours), so I prefer to think about a variety of things with some bearing on alignment, and periodically check back in to see how much enlightenment I now have about the central things
    - You could think of me as betting on something like Grothendieck's rising sea approach to alignment (although of course it's quite likely I'll never actually get the shell open)
  - This is part of what made my taste sensors fire very happily on your posts!
- I think there is a web of things which can put us in a position of "more likely well-equipped to make it through", and when I see I have traction on some of those it feels like there's a real substantive opportunity cost to just ignoring them

(Laying this out so that you know the basic shape of my thinking, such that if you want to make a case that I should devote time to tackling things more head-on, you'll know what I need to be moved on.)

Michael_Cohen

I constructed an agent where you can literally prove that if you set a parameter high enough, it won't try to kill everyone, while still eventually at least matching human-level intelligence. Sure it uses a realizability assumption, sure it's intractable in its current form, sure it might require an enormously long training period, but these are computer science problems, not philosophy problems, and they clearly suggest paths forward. The underlying concept is sound. It struck me as undignified to say this in the past, but maybe dignity rightly construed should compel me to: it absolutely boggles me that ~no one in the EA community talks about this. It's not in this blog post; it's not in Richard's curriculum; it wasn't in Evan's list of promising AGI safety ideas.

I agree with your perspective on all of these approaches, except my initial reaction is to be more pessimistic about natural abstractions. It seems to me that a good understanding of natural abstractions is not good enough for putting a handle on a part of an agent's mind. We'd also need to understand "natural types", the type signatures that agents' brains use to represent those abstractions. And I think that there is a long, long list of types, in which each is as natural as the rest.

There's an interpretability benchmark that occurred to me recently, which I may as well mention here, because I agree approximately none of the interpretability research I see strikes me as progress toward strategically relevant interpretation of AGI. Try to understand what corvids are saying to each other.

Devin Kalish

This is pretty unrelated to the substantial content here to the point where I'm unsure about writing about it, especially as it's looking like it's going to be the first comment on this post. Still, I wanted to offer some feedback on messaging in case it helps. Whenever I see you use the word "dignity" in this piece, I sort of recoil and feel more alienated from this post overall. In particular, there are two semi-related reasons for this:

It references a post that almost single-handedly gave me, and I think lots of others, pretty bad mental health issues for a few months, and to an extent still today. On its own I don't know that this disqualifies the piece as worthwhile, but it does make me recoil from references to it like a hand on a hot stove somewhat automatically. On a more substantial level, I think the post itself was awkward and probably a misstep, given the weird April Fools but not April Fools framing and the fact that it didn't contribute much to the substantial discussion, though it had some valuable things to say about being a consequentialist rather than a cartoon supervillain (which I think he has said elsewhere less prominently if I remember).
Dignity is the wrong thing to aim for. This is maybe the more substantial problem I have. First and foremost, I don't want to aim for "dying with dignity". I would rather just, exclusively, aim for not dying. It's true there's an awfully funny coincidence that Yudkowsky's version of "dying with dignity" lines up so perfectly with aiming for not dying, but that still doesn't justify aiming for it instead. Dignity, in this situation, is just not that motivating to me. If aiming for "dying with dignity" diverges one nano-degree from aiming for not dying, then it is the wrong thing to aim for. This is especially striking to find in writing from Yudkowsky, for those of us who have read his other stuff, because it is not how he ever advocates people think elsewhere. Whenever you parry, hit, spring, strike or touch the cutting sword of AGI safety, you must cut the actual solution in the same movement. This almost is enough to convince me that this peculiar dignity framing was the only part of the piece that was at least sort of facetious in the whole, not-very-April-Foolsy post. At the very least I can't relate to ever getting to a point of being so hopeless that what I am aiming for is "dying with dignity" rather than just "not dying".

I guess my steelman of this is that, if you aim for not dying, you will probably be disappointed, so to keep motivated, you should aim for something achievable, like dignity. I am not nearly as pessimistic as you or Yudkowsky about this matter, and maybe if I were this framing would seem better to me, but even then I find it unlikely. What differs between a world where a few people put in a good deal of effort, this effort proves mostly counterproductive or just misguided, and they all die, and a world where a few people put in a good deal of effort in the right direction, aren't accidentally counterproductive, and they almost don't die? I am just as sympathetic to both, and I find that they have similar dignity in my eyes, but the latter world, most crucially, almost survived. More dignified than both is a world in which everyone coordinates a great deal and society really buckles down and gets serious about the issue and they still die. Less dignified is a world where no one but one lonely weirdo cares at all, everyone else laughs about it, and the one weirdo gives up, and they all die. Both of those worlds are unrealistic at this point. I judge that if our option set for solving this problem is narrow enough that you can expect us to probably fail with any reliability, then it is narrow enough that we can't change how much dignity we die with very much either.

All of this is a rambling way of saying that this "dignity" stuff has come up in a number of serious writings from MIRI now, and I'm worried that it is going to become a standard fixture of your messaging. I just want to register some concerns I have about this happening; I would rather you just say things increase our odds of succeeding than say that things increase our dignity.

JakubK

Maybe I'm missing something, but it seems that "dignity" only appears once in the OP? Namely, here:

On my model, solutions to how capabilities generalize further than alignment are necessary but not sufficient. There is dignity in attacking a variety of other real problems, and I endorse that practice.

This usage appears to have nothing to do with the April Fool's Day post.

Perhaps Soares made a subsequent edit to the OP?

RobBensinger

"Dignity" indeed only occurs once, and I assume it's calling back to the same "death with dignity" concept from the April Fool's post (which I agree shouldn't have been framed as an April Fool's thing).

I assume EY didn't expect the post to have such a large impact, in part because he'd already said more or less the same thing, with the same terminology, in a widely-read post back in November 2021:

Anonymous
At a high level one thing I want to ask about is research directions and prioritization. For example, if you were dictator for what researchers here (or within our influence) were working on, how would you reallocate them?
Eliezer Yudkowsky
The first reply that came to mind is "I don't know." I consider the present gameboard to look incredibly grim, and I don't actually see a way out through hard work alone. We can hope there's a miracle that violates some aspect of my background model, and we can try to prepare for that unknown miracle; preparing for an unknown miracle probably looks like "Trying to die with more dignity on the mainline" (because if you can die with more dignity on the mainline, you are better positioned to take advantage of a miracle if it occurs).

The term also shows up a ton in the Late 2021 MIRI Conversations, e.g., here and here.

I appreciate the data point about the term being one you find upsetting to run into; thanks for sharing about that, Devin. And, for whatever it's worth, I'm sorry. I don't like sharing info (or framings) that cause people distress like that.

I don't know whether data points like this will update Nate and/or Eliezer all the way to thinking the term is net-negative to use. If not, and this is a competing access needs issue ('one group finds it much more motivating to use the phrase X; another group finds that exact same phrase extremely demotivating'), then I think somebody should make a post walking folks through a browser text-replacement method that can swap out words like 'dignity' and 'dignified' (on LW, the EA Forum, the MIRI website, etc.) for something more innocuous/silly.

Devin Kalish

The word dignity only appears once, but variations appear as well:

"And it sure would be undignified for our world to die of antitrust law at the final extremity."

"It's as dignified as any of the other attempts to walk around this hard problem"

Some version of this reference appears mostly when Soares is endorsing efforts to solve a problem in a way that won't work if the standard MIRI model of doom is correct, but which is still worthwhile in case it isn't. To be clear, I respect you, Soares, and Yudkowsky a great deal; my impression is that MIRI is a great bunch of folks whose approach is worthwhile, even if I lean somewhat more Christiano/Critch on some of these issues. It is also possible that dignity is a good framing overall and I'm just weird, in which case I fully endorse using it. I just personally don't like it for the reasons I mentioned, and I think there are many others with similar reactions.

RobBensinger

Oops, thanks! I checked for those variants elsewhere but forgot to do so here. :)

It is also possible that dignity is a good framing overall and I'm just weird, in which case I fully endorse using it.

I think it's a good framing for some people and not for others. I'm confident that many people shouldn't use this framing regularly in their own thinking. I'm less sure about whether the people who do find it valuable should steer clear of mentioning it; that's a bit more extreme.

Devin Kalish

That's fair, I think it depends how it's intended. If the point is to talk about how you think about or relate to the issue, talking about the framing that works best for you makes sense. If the purpose is outreach, there are framings that make more or less sense to use.

Ben_West🔸

You maintain this pretty well as it walks up through to primate, and then suddenly it takes a sharp left turn and invents its own internal language and a bunch of abstract concepts, and suddenly you find your visualization tools to be quite lacking for interpreting its abstract mathematical reasoning about topology or whatever.

Empirically speaking, scientists who are trying to understand human brains do spend a lot (most?) of their time looking at nonhuman brains, no?

Is Nate's objection here something like "human neuroscience is not at the level where we deal with 'sharp left turn' stuff, and I expect that once neuroscientists can understand chimpanzee brains very well they will discover that there is in fact a whole other set of problems they need to solve to understand human brains, and that this other set of problems is actually the harder one?"

RobBensinger

scientists who are trying to understand human brains do spend a lot (most?) of their time looking at nonhuman brains, no?

My sense is that this is mostly for ethics reasons, rather than representing a strong stance that animal models are the fastest way to make progress on understanding human cognition.

Ben_West🔸

Thanks! That sounds right to me, but I had thought that Nate was making a stronger objection, something like "looking at nonhuman brains is useless because you could have a perfect understanding of a chimpanzee brain but still completely fail to predict human behavior (after a 'sharp left turn')."

Is that wrong? Or is he just saying something like "looking at nonhuman brains is 90% less effective and given long enough timelines these research projects will pan out - I just don't expect us to have long enough timelines?"

RobBensinger

"looking at nonhuman brains is useless because you could have a perfect understanding of a chimpanzee brain but still completely fail to predict human behavior (after a 'sharp left turn')."

Sounds too strong to me. If Nate or Eliezer thought that it would be totally useless to have a perfect understanding of how GPT-3, AlphaZero, and Minerva do their reasoning, then I expect that they'd just say that.

My Nate-model instead says things like:

Current transparency work mostly isn't trying to gain deep mastery of how GPT-3 etc. do their reasoning; and to the extent it's trying, it isn't making meaningful progress.

('Deep mastery of how this system does its reasoning' is the sort of thing that would let us roughly understand what thoughts a chimpanzee is internally thinking at a given time, verify that it's pursuing the right kinds of goals and thinking about all (and only) the right kinds of topics, etc.)
A lot of other alignment research isn't even trying to understand chimpanzee brains, or future human brains, or generalizations that might hold for both chimps and humans; it's just assuming there's no important future chimp-to-human transition it has to worry about.
Once we build the equivalent of 'humans', we won't have much time to align them before the tech proliferates and someone accidentally destroys the world. So even if the 'understand human cognition' problem turns out to be easier than the 'understand chimpanzee cognition' problem in a vacuum, the fact that it's a new problem and we have a lot less time to solve it makes it a lot harder in practice.

Jon P

This is a nice writeup and summary.

I personally think that this is yet more evidence that formal control is a path which is more promising than others. If you can formally prove that your code, when properly executed, has certain properties then that gives you some hope that those properties will be durable during and after a hard left turn.

Things like, if you had a magic wand, formally proving that any AI designed by a formally controlled AI will also be formally controlled. That way even if it whooshes and completely redesigns itself there is still some hope.

I would love to see the amount of resources going into formal methods be multiplied by 10x or 100x. I think if we built a really solid field, where all of modern mathematics and computer science is formalised and people write formally verified code by default because it's safer and there are good libraries to do that, then in that environment the control problem becomes easier, if still extremely hard.

^{^}

I ran a few of the dialogs past the relevant people, but that has empirically dragged out the amount of time it takes this post to publish, and I have a handful of other posts to publish afterwards, so I neglected to get feedback from most of the people mentioned. Sorry.

^{^}

Much of Vanessa, Scott, etc.'s work does look to me like it is grappling with confusions related to the problem of aiming minds in theory, and if their research succeeds according to their own lights then I would expect to have a better understanding of how to aim minds in general, even ones that had undergone some sort of "sharp left turn".

Which is not to say that I’m optimistic about whether any of these plans will succeed by their own lights. Regardless, they get points for taking a swing, and the thing I’m mostly advocating for is that more people take swings at this problem at all, not that we filter strongly on my optimism about specific angles of attack.

I tried to solve the problem myself for a few years, and failed. Turns out I wasn't all that good at it.

Maybe I'll be able to do better next time, and I poke at it every so often. (Even though in my mainline prediction, we won’t have the time to complete the sort of research paths that I can see and that I think have any chance of working.)

MIRI funds or offers-to-fund most every researcher who I see as having this "their work would help with the generalization problem if they succeeded" property and as doing novel, nontrivial work, so it's no coincidence that I feel more positive about Vanessa, etc.'s work. But I'd like to see far more attempts to solve this problem than the field is currently marshaling.

^{^}

Again, to be clear, it's nice to have some people trying to route around the hard problems wholesale. But I don't count such attempts as attacks on the problem itself. (I'm also not optimistic about any attempts I have yet seen to dodge the problem, but that's a digression from today's topic.)

^{^}

I couldn't understand Stuart's views from what he's written publicly, so I ran this section by Stuart and Rebecca, who requested that I use actual quotes instead of my attempted paraphrasings. If I'd had more time, I'd like to have run all the dialogs by the researchers I mentioned in this post, and iterated until I could pass everyone's ideological Turing Test, as opposed to the current awkward set-up where the people that I thought I understood didn't get as much chance for feedback. But the time delay from editing this one section is evidence that this wouldn't be worth the time burnt. Instead, I hope the comments can correct any mischaracterizations on my part.

^{^}

Note also that while having the AI ask for clarification in the face of ambiguity is nice and helpful, it is of course far from autonomous-AGI-grade.

^{^}

I specifically see:

~3 MIRI-supported research approaches that are trying to attack a chunk of the hard problem (with a caveat that I think the relevant chunks are too small and progress is too slow for this to increase humanity's odds of success by much).
~1 other research approach that could maybe help address the core difficulty if it succeeds wildly more than I currently expect it to succeed (albeit no one is currently spending much time on this research approach): Natural Abstractions. Maybe 2, if you count sufficiently ambitious interpretability work.
~2 research approaches that mostly don't help address the core difficulty (unless perhaps more ambitious versions of those proposals are developed, and the ambitious versions wildly succeed), but might provide small safety boosts on the mainline if other research addresses the core difficulty: Concept Extrapolation, and current interpretability work (with a caveat that sufficiently ambitious interpretability work would seem more promising to me than this).
9+ approaches that appear to me to be either assuming away what look to me like the key problems, or hoping that we can do other things that allow us to avoid facing the problem: Truthful AI, ELK, AI Services, Evan's approach, the Richard/Rohin meta-approach, Vivek's approach, Critch's approach, superbabies, and the "maybe there is a pretty wide attractor basin around my own values" idea.

^{^}

I rate "interpretability succeeds so wildly that we can understand and aim one of the first AGIs" as probably a bit more plausible than "natural abstractions are so natural that, by understanding them, we can practically find concepts-worth-optimizing-for in an AGI". Both seem very unlikely to me, though they meet my bar for “deserving of a serious effort by humanity” in case they work out.

Reactions to specific plans

Owen Cotton-Barratt & Truthful AI

Ryan Greenblatt & Eliciting Latent Knowledge

Eric Drexler & AI Services

Evan Hubinger, in a recent personal conversation

A fairly straw version of someone with technical intuitions like Richard Ngo’s or Rohin Shah’s

Another recent proposal

Vivek Hebbar, summarized (perhaps poorly) from last time we spoke of this in person

John Wentworth & Natural Abstractions

Neel Nanda & Theories of Impact for Interpretability

Stuart Armstrong & Concept Extrapolation

Andrew Critch & political solutions

What about superbabies?

What about other MIRI people?

High-level view