(As usual, this post was written by Nate Soares with some help and editing from Rob Bensinger.)

 

In my last post, I described a “hard bit” of the challenge of aligning AGI—the sharp left turn that comes when your system slides into the “AGI” capabilities well, the fact that alignment doesn’t generalize similarly well at this turn, and the fact that this turn seems likely to break a bunch of your existing alignment properties.

Here, I want to briefly discuss a variety of current research proposals in the field, to explain why I think this problem is currently neglected.

I also want to mention research proposals that do strike me as having some promise, or that strike me as adjacent to promising approaches.

Before getting into that, let me be very explicit about three points:

  1. On my model, solutions to how capabilities generalize further than alignment are necessary but not sufficient. There is dignity in attacking a variety of other real problems, and I endorse that practice.
  2. The imaginary versions of people in the dialogs below are not the same as the people themselves. I'm probably misunderstanding the various proposals in important ways, and/or rounding them to stupider versions of themselves along some important dimensions.[1] If I've misrepresented your view, I apologize.
  3. I do not subscribe to the Copenhagen interpretation of ethics wherein someone who takes a bad swing at the problem (or takes a swing at a different problem) is more culpable for civilization's failure than someone who never takes a swing at all. Everyone whose plans I discuss below is highly commendable, laudable, and virtuous by my accounting.

Also, many of the plans I touch upon below are not being given the depth of response that I'd ideally be able to give them, and I apologize for not engaging with their authors in significantly more depth first. I’ll be especially cursory in my discussion of some MIRI researchers and research associates like Vanessa Kosoy and Scott Garrabrant.[2]

In this document I'm attempting to summarize my high-level view of the approaches I know about; I'm not attempting to provide full arguments for why I think particular approaches are more or less promising.

Think of the below as a window into my thought process, rather than an attempt to state or justify my entire background view. And obviously, if you disagree with my thoughts, I welcome objections.

So, without further ado, I’ll explain why I think that the larger field is basically not working on this particular hard problem:

 

Reactions to specific plans

 

Owen Cotton-Barratt & Truthful AI

Imaginary, possibly-mischaracterized-by-Nate version of Owen: What if we train our AGIs to be truthful? If our AGIs were generally truthful, we could just ask them if they're plotting to be deceptive, and if so how to fix it, and we could do these things early in ways that help us nip the problems in the bud before they fester, and so on and so forth. 

Even if that particular idea doesn't work, it seems like our lives are a lot easier insofar as the AGI is truthful.

Nate: "Truthfulness" sure does sound like a nice property for our AGIs to have. But how do you get it in there? And how do you keep it in there, after that sharp left turn? If this idea is to make any progress on the hard problem we're discussing, it would have to come from some property of "truthfulness" that makes it more likely than other desirable properties to survive the great generalization of capabilities.

Like, even simpler than the problem of an AGI that puts two identical strawberries on a plate and does nothing else, is the problem of an AGI that turns as much of the universe as possible into diamonds. This is easier because, while it still requires that we have some way to direct the system towards a concept of our choosing, we no longer require corrigibility. (Also, "diamond" is a significantly simpler concept than "strawberry" and "cellularly identical".)

It seems to me that we have basically no idea how to do this. We can train the AGI to be pretty good at building diamond-like things across a lot of training environments, but once it takes that sharp left turn, by default, it will wander off and do some other thing, like how humans wandered off and invented birth control.

In my book, solving this hard problem so well that we could feasibly get an AGI that predictably maximizes diamond (after its capabilities start generalizing hard), would constitute an enormous advance.

Solving the hard problem so well that we could feasibly get an AGI that predictably answers operator questions truthfully, would constitute a similarly enormous advance. Because we would have figured out how to keep a highly capable system directed at any one thing of our choosing.

Now, in real life, building a truthful AGI is much harder than building a diamond optimizer, because 'truth' is a concept that's much more fraught than 'diamond'. (To see this, observe that the definition of "truth" routes through tricky concepts like "ways the AI communicated with the operators" and "the mental state of the operators", and involves grappling with tricky questions like "what ways of translating the AI's foreign concepts into human concepts count as manipulative?" and "what can be honestly elided?", and so on, whereas diamond is just carbon atoms bound covalently in tetrahedral lattices.)

So as far as I can tell, from the perspective of this hard problem, Owen's proposal boils down to "Wouldn't it be nice if the tricky problems were solved, and we managed to successfully direct our AGIs to be truthful?" Well, sure, that would be nice, but it's not helping solve our problem. In fact, this problem subsumes the whole diamond maximizer problem, but replaces the concept of "diamond" (that we obviously can't yet direct an AGI to optimize, diamond more clearly being a physical phenomenon far removed from the AGI's raw sensory inputs) with the concept of "truth" (which is abstract enough that we can easily forget that it's a much more difficult-to-describe physical phenomenon far removed from the AGI's raw sensory inputs).

(And this hasn't even touched on how if you did manage to get an AGI that keeps optimizing for truth-telling after it falls into the capabilities well, then you still have to make it corrigible, on pain of extinction. But I digress.)

Maybe Owen does think that the goal of "tell the truth" generalizes more readily than "optimize diamond", for some reason? But if so, I haven't seen him mention it, except when I explicitly prompted him when having this argument in person. Which doesn't seem like a very promising sign to me. My read is that he's just failing to engage with this hard problem.

 

Ryan Greenblatt & Eliciting Latent Knowledge

Imaginary Ryan, because Paul was on vacation last time I tried to understand ELK: As we train the nascent AI up to an AGI, we're going to simultaneously train a second system (the "ELK head") that can elicit its latent knowledge. The latter system will be able to tell us the moment that the former system is starting to do things like contemplate deception. We're then going to combine this with some third adversarial system, which can use that latent knowledge to expose the bad behavior to gradients, so that we can nip it in the bud.

Basically all this requires is that, whatever intelligence gradient descent (or some other optimizing procedure) can create, it can also decode, which doesn't seem all that big an ask: whatever optimization method you were using that found this intelligent reasoner, there should be a dual optimization process that finds its decoder, if you're a little clever about it.

Nate: So, set aside questions of whether that sort of thing (with the separate knowledge-elicitor and adversary) will turn out to be practical, because we're talking about whether the plan attacks the key problem.

Suppose we have an ELK head, and an adversary, that we're using to train a nascent AGI. The AGI takes a sharp left turn. Now it's dangerously capable, and suddenly a bunch of misalignments which it previously didn't notice within itself are relevant. And you've got to get them out.

Your first problem is that the recent capabilities gains made by the AGI might not have come from gradient descent (much like how humans’ sudden explosion of technological knowledge accumulated in our culture rather than our genes, once we turned the corner). You might not be able to just "expose the bad behavior" to gradients that you can hit to correct the thing, at least not easily and quickly.

Your second problem is that the AGI's concepts might rapidly get totally uninterpretable to your ELK head. Like, you could imagine doing neuroimaging on your mammals all the way through the evolution process. They've got some hunger instincts in there, but it's not like they’re smart enough yet to represent the concept of "inclusive genetic fitness" correctly, so you figure you'll just fix it when they get capable enough to understand the alternative (of eating because it's instrumentally useful for procreation). And so far you're doing great: you've basically decoded the visual cortex, and have a pretty decent understanding of what it's visualizing. 

You maintain this pretty well as it walks up through to primate, and then suddenly it takes a sharp left turn and invents its own internal language and a bunch of abstract concepts, and suddenly you find your visualization tools to be quite lacking for interpreting its abstract mathematical reasoning about topology or whatever.

Analogously, your ELK head's abilities are liable to fall off a cliff right as the AGI's capabilities start generalizing way outside of its training distribution.

And if they don't, then this ELK head is (in this hypothetical) able to decode and understand the workings of an alien mind. Likely a kludgey behemoth of an alien mind. This itself is liable to require quite a lot of capability, quite plausibly of the sort that humanity gets first from the systems that took sharp left-turns, rather than systems that ground along today's scaling curves until they scaled that far. 

Or in other words, if your ELK head does keep pace with your AGI, and takes a sharp left turn at the same time as it... then, well, now you're basically back to the "Truthful AI" proposal. How do you keep your ELK head reporting accurately (and doing so corrigibly), as it undergoes that sharp left turn?

This proposal seems to me like it's implicitly assuming that most of the capabilities gains come from the slow grind of gradient descent, in a world where the systems don't take sharp left turns and rapidly become highly capable in a wide variety of new (out-of-distribution) domains.

Which seems to me like it's mostly just assuming its way out from under the hard problem, and thus, on my models, assuming its way clean out of reality.

And if I imagine attempting to apply this plan inside of the reality I think I live in, I don't see how it plans to address the hard part of the problem, beyond saying "try training it against places where it knows it's diverging from the goal before the sharp turn, and then hope that it generalizes well or won't fight back", which doesn't instill a bunch of confidence in me (and which I don't expect to work).

 

Eric Drexler & AI Services

Imaginary Eric: Well, sure, AGI could get real dangerous if you let one system do everything under one umbrella. But that's not how good engineers engineer things. You can and should split your AI systems into siloed services, each of which can usefully help humanity with some fragment of whichever difficult sociopolitical or physical challenge you're hoping to tackle, but none of which constitutes an adversarial optimizer (with goals over the future) in its own right.

Nate: So mostly I expect that, if you try to split these systems into services, then you either fail to capture the heart of intelligence and your siloed AIs are irrelevant, or you wind up with enough AGI in one of your siloes that you have a whole alignment problem (hard parts and all) in there.

Like, I see this plan as basically saying "yep, that hard problem is in fact too hard, let's try to dodge it, by having humans + narrow AI services perform the pivotal act". Setting aside how I don't particularly expect this to work, we can at least hopefully agree that it's attempting to route around the problems that seem to me to be central, rather than attempting to solve them.

 

Evan Hubinger, in a recent personal conversation

Imaginary Evan: It's hard, in the modern paradigm, to separate the system's values from its capabilities and from the way it was trained. All we need to do is find a training regimen that leads to AIs that are both capable and aligned. At which point we can just make it publicly available, because it's not like people will be trying to disalign their AIs.

Nate: So, first of all, you haven't exactly made the problem easier.

As best I can tell, this plan amounts to "find a training method that not only can keep a system aligned through the sharp left turn, but must, and then popularize it". Which has, like, bolted two additional steps atop an assumed solution to some hard problems. So this proposal does not seem, to me, to make any progress towards solving those hard problems.

(Also, the observation "capabilities and alignment are fairly tightly coupled in the modern paradigm" doesn't seem to me like much of an argument that they're going to stay coupled after the ol' left turn. Indeed, I expect they won't stay coupled in the ways you want them to. Assuming that this modern desirable property will hold indefinitely seems dangerously close to just assuming this hard problem away, and thus assuming your way clean out of what-I-believe-to-be-reality.)

But maybe I just don't understand this proposal yet (and I have had some trouble distilling things I recognize as plans out of Evan's writing, so far).

 

A fairly straw version of someone with technical intuitions like Richard Ngo’s or Rohin Shah’s

Imaginary Richard/Rohin: You seem awfully confident in this sharp left turn thing. And that the goals it was trained for won't just generalize. This seems characteristically overconfident. For instance, observe that natural selection didn't try to get the inner optimizer to be aligned with inclusive genetic fitness at all. For all we know, a small amount of cleverness in exposing inner-misaligned behavior to the gradients will just be enough to fix the problem. And even if not that-exact-thing, then there are all sorts of ways that some other thing could come out of left field and just render the problem easy. So I don't see why you're worried.

Nate: My model says that the hard problem rears its ugly head by default, in a pretty robust way. Clever ideas might suffice to subvert the hard problem (though my guess is that we need something more like understanding and mastery, rather than just a few clever ideas). I have considered an array of clever ideas that look to me like they would predictably-to-me fail to solve the problems, and I admit that my guess is that you're putting most of your hope on small clever ideas that I can already see would fail. But perhaps you have ideas that I do not. Do you yourself have any specific ideas for tackling the hard problem?

Imaginary Richard/Rohin: Train it, while being aware of inner alignment issues, and hope for the best.

Nate: That doesn't seem to me to even start to engage with the issue where the capabilities fall into an attractor and the alignment doesn't.

Perhaps sometime we can both make a list of ways to train with inner alignment issues in mind, and then share them with each other, so that you can see whether you think I'm lacking awareness of some important tool you expect to be at our disposal, and so that I can go down your list and rattle off the reasons why the proposed training tools don't look to me like they result in alignment that is robust to sharp left turns. (Or find one that surprises me, and update.) But I don't want to delay this post any longer, so, some other time, maybe.

 

Another recent proposal

Imaginary Anonymous-Person-Whose-Name-I’ve-Misplaced: Okay, but maybe there is a pretty wide attractor basin around my own values, though. Like, maybe not my true values, but around a bunch of stuff like being low-impact and deferring to the operators about what to do and so on. You don't need to be all that smart, nor have a particularly detailed understanding of the subtleties of ethics, to figure out that it's bad (according to me) to kill all humans.

Nate: Yeah, that's basically the idea behind corrigibility, and is one reason why corrigibility is plausibly a lot easier to get than a full-fledged CEV sovereign. But this observation doesn't really engage with the question of how to point the AGI towards that concept, and how to cause its behavior to be governed by that concept in a fashion that's robust to the sharp left turn where capabilities start to really generalize.

Like, yes, some directions are easier to point an AI in, on account of the direction itself being simpler to conceptualize, but that observation alone doesn't say anything about how to determine which direction an AI is pointing after it falls into the capabilities well.

More generally, saying "maybe it's easy" is not the same as solving the problem. Maybe it is easy! But it's not going to get solved unless we have people trying to solve it.

 

Vivek Hebbar, summarized (perhaps poorly) from last time we spoke of this in person

Imaginary Vivek: Hold on, the AGI is being taught about what I value every time it tries something and gets a gradient about how well that promotes the thing I value. At least, assuming for the moment that we have a good ability to evaluate the goodness of the consequences of a given action (which seems fair, because it sounds like you're arguing for a way that we'd be screwed even if we had the One True Objective Function).

Like, you said that all aspects of reality are whispering to the nascent AGI of what it means to optimize, but few parts of reality are whispering of what to optimize for—whereas it looks to me like every gradient the AGI gets is whispering a little bit of both. So in particular, it seems to me like if you did have the one true objective function, you could just train good and hard until the system was both capable and aligned.

Nate: This seems to me like it's implicitly assuming that all of the system's cognitive gains come from the training. Like, with every gradient step, we are dragging the system one iota closer to being capable, and also one iota closer to being good, or something like that.

To which I say: I expect many of the cognitive gains to come from elsewhere, much as a huge number of the modern capabilities of humans are encoded in their culture and their textbooks rather than in their genomes. Because there are slopes in capabilities-space that an intelligence can snowball down, picking up lots of cognitive gains, but not alignment, along the way.

Assuming that this is not so seems to me like simply assuming this hard problem away.

And maybe you simply don't believe that it's a real problem; that's fine, and I’d be interested to hear why you think that. But I have not yet heard a proposed solution, as opposed to an objection to the existence of the problem in the first place.

 

John Wentworth & Natural Abstractions

Imaginary John: I suspect there's a common format to concepts, that is a fairly objective fact about the math of the territory, and that—if mastered—could be used to understand an AGI's concepts. And perhaps select the ones we wish it would optimize for. Which isn't the whole problem, but sure is a big chunk of the problem. (And other chunks might well be easier to address given mastery of the fairly-objective concepts of "agent" and "optimizer" and so on.)

Nate: This does seem to me like it's trying to attack the actual problem! I have my doubts about this particular line of research (and those doubts are on my list of things to write up), but hooray for a proposal that, if it succeeded by its own lights, would address this hard problem!

Imaginary John: Well, uh, these days I'm mostly focusing on using my flimsy non-mastered grasp of the common-concept format to try to give a descriptive account of human values, because for some reason that's where I think the hope is. So I'm not actually working too much on this thing that you think takes a swing at the real problem (although I do flirt with it occasionally).

Nate: :'(

Imaginary John: Look, I didn't want to break the streak, OK.

Rob Bensinger, reading this draft: Wait, why do you see John’s proposal as attacking the central problem but not, for example, Eric Drexler’s Language for Intelligent Machines (summarized here)?

Nate: I understand Eric to be saying "maybe humans deploying narrow AIs will be capable enough to end the acute risk period before an AGI can (in which case we can avoid ever using AIs that have taken sharp left turns)", whereas John is saying "maybe a lot of objective facts about the territory determine which concepts are useful, and by understanding the objectivity of concepts we can become able to understand even an alien mind's concepts".

I think John’s guess is wrong (at least in the second clause), but it seems aimed at taking an AI system that has snowballed down a capabilities slope in the way that humans snowballed, and identifying its concepts in a way that’s stable to changes in the AI’s ontology—which is step one in the larger challenge of figuring out how to robustly direct an AGI’s motivations at the content of a particular concept it has.

My understanding of Eric’s idea, in contrast, is "I think there's a language these siloed components could use that's not so expressive as to allow them to be dangerous, but is expressive enough to allow them to help humans." To which my basic reply is roughly "The problem is that the non-siloed systems are going to start snowballing and end the world before the human+silo systems can save the world." As far as I can tell, Eric's attempting to route around the problem, whereas John's attempting to solve it.[3]

 

Neel Nanda & Theories of Impact for Interpretability

Imaginary Neel: What if we get a lot of interpretability?

Nate: That would be great, and I endorse developing such tools.

I think this will only solve the hard problems if the field succeeds at interpretability so wildly that (a) our interpretability tools continue to work on fairly difficult concepts in a post-left-turn AGI; (b) that AGI has an architecture that turns out to be especially amenable to being aimed at some concept of our choosing; and (c) the interpretability tools grant us such a deep understanding of this alien mind that we can aim it using that understanding.

I admit I'm skeptical of all three. To be clear, better interpretability tools help put us in a better position even if they don't clear these lofty bars. In real life, though, I expect interpretability to play a smaller role, as a force-multiplier that awaits some other plan for addressing the hard problems.

Force-multipliers like that are great to have and worth building, to be clear. I full-throatedly endorse humanity putting more effort into interpretability.

It simultaneously doesn't look to me like people are seriously aiming for "develop such a good ability to understand minds that we can reshape/rebuild them to be aimable in whatever time we have after we get one". It looks to me like the sights are currently set at much lower and more achievable targets, and that current progress is consistent with never hitting the more ambitious targets, the ones that would let us understand and reshape the first artificial minds into something aligned (fast enough to be relevant).

But if some ambitious interpretability researchers do set their sights on the sharp left turn and the generalization problem, then I would indeed count this as a real effort by humanity to solve its central technical challenge. I don't need a lot of hope in a specific research program in order to be satisfied with the field's allocation of resources; I just want to grow the space of attempts to solve the generalization problem at all.

 

Stuart Armstrong & Concept Extrapolation

Nate: (Note: This section consists of actual quotes and dialog, unlike the others.)[4]

Stuart, in a blog post:

[...] It is easy to point at current examples of agents with low (or high) impact, at safe (or dangerous) suggestions, at low (or high) powered behaviours. So we have in a sense the 'training sets' for defining low-impact/Oracles/low-powered AIs.

It's extending these examples to the general situation that fails: definitions which cleanly divide the training set (whether produced by algorithms or humans) fail to extend to the general situation. Call this the 'value extrapolation problem', with 'value' interpreted broadly as a categorisation of situations into desirable and undesirable.

[...] Value extrapolation is thus necessary for AI alignment.

[...] We think that once humanity builds its first AGI, superintelligence is likely near, leaving little time to develop AI safety at that point. Indeed, it may be necessary that the first AGI start off aligned: we may not have the time or resources to convince its developers to retrofit alignment to it. So we need a way to have alignment deployed throughout the algorithmic world before anyone develops AGI.

To do this, we'll start by offering alignment as a service for more limited AIs. Value extrapolation scales down as well as up: companies value algorithms that won't immediately misbehave in new situations, algorithms that will become conservative and ask for guidance when facing ambiguity.

We will get this service into widespread use (a process that may take some time), and gradually upgrade it to a full alignment process. [...]

Rob Bensinger, replying on Twitter: The basic idea in that post seems to be: let's make it an industry standard for AI systems to "become conservative and ask for guidance when facing ambiguity", and gradually improve the standard from there as we figure out more alignment stuff.

The reasoning being something like: once we have AGI, we need to have deployment-ready aligned AGI extremely soon; and this will be more possible if the non-AGI preceding it is largely aligned.

(I at least agree with the "once we have AGI, we’ll need deployment-ready aligned AGI extremely soon" part of this.)

The other aspect of your plan seems to be 'focus on improving value extrapolation methods'. Both aspects of this plan seem very bad to me, speaking from my inside view:

  • 1a.  I don't expect that much overlap between what's needed to make, e.g., a present-day image classifier more conservative, and what's needed to make an AGI reliable and safe. So redirecting resources from the latter problem to the former seems wasteful to me.
  • 1b.  Relatedly, I don't think it's helpful for the field to absorb the message "oh, yeah, our image classifiers and Go players and so on are aligned, we're knocking that problem out of the park". If 1a is right, then making your image classifier conservative doesn't represent much progress toward being able to align AGI. They're different problems, like building a safe bridge vs. building a safe elevator.

'Alignment' is currently a word that's about the AGI problem in particular, which overlaps with a lot of narrow-AI robustness problems, but isn't just a scaled-up version of those; the difficulty of AGI alignment mostly comes from qualitatively new risks. So 'aligning' the field as a whole doesn't necessarily help much, and (less importantly) using the term 'alignment' for the broader, fuzzier goal is liable to distract from the core difficulties, and liable to engender a false sense of progress on the original problem.

  • 2.  We need to do value extrapolation eventually, but I don't think this is the field's current big bottleneck, and I don't think it helps address the bottleneck. Rather, I think the big bottleneck is understandability / interpretability.

Nate: I like Rob’s response. I’ll add that I’m not sure I understand your proposal. Your previous name for the value extrapolation problem was the “model splintering” problem, and iirc you endorsed Rohin’s summary of model splintering:

[Model splintering] is one way of more formally looking at the out-of-distribution problem in machine learning: instead of simply saying that we are out of distribution, we look at the model that the AI previously had, and see what model it transitions to in the new distribution, and analyze this transition.

Model splintering in particular refers to the phenomenon where a coarse-grained model is “splintered” into a more fine-grained model, with a one-to-many mapping between the environments that the coarse-grained model can distinguish between and the environments that the fine-grained model can distinguish between (this is what it means to be more fine-grained).
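(To make the "one-to-many mapping" in that summary concrete, here is a minimal toy sketch. It is not taken from Stuart's or Rohin's writing: the state names and the helper functions `invert` and `splintered` are hypothetical, chosen only to illustrate a coarse-grained state that corresponds to several fine-grained ones.)

```python
from collections import defaultdict

# Hypothetical toy data: each fine-grained environment, tagged with the
# coarse-grained state that an older, simpler model would have assigned it.
FINE_TO_COARSE = {
    "office chair": "chair",
    "wheelchair": "chair",
    "electric chair": "chair",
    "dining table": "table",
}

def invert(fine_to_coarse):
    """Group fine-grained environments under their coarse-grained state."""
    coarse_to_fine = defaultdict(list)
    for fine, coarse in fine_to_coarse.items():
        coarse_to_fine[coarse].append(fine)
    return dict(coarse_to_fine)

def splintered(coarse_to_fine):
    """Return the coarse states that map to more than one fine state."""
    return sorted(c for c, fines in coarse_to_fine.items() if len(fines) > 1)

mapping = invert(FINE_TO_COARSE)
print(splintered(mapping))
```

In this toy, "chair" splinters and "table" does not. Anything the coarse model said about "chair" now has to be re-examined for each of the finer states, which is roughly the transition the summary proposes to analyze.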

On the surface, work aimed at understanding and addressing "model splintering" sounds potentially promising to me—like, I might want to classify some version of "concept extrapolation" alongside Natural Abstractions, certain approaches to interpretability, Vanessa’s work, Scott’s work, etc. as "an angle of attack that might genuinely help with the core problem, if it succeeded wildly more than I expect it to succeed". Which is about as positive a situation as I’m expecting right now, and would be high praise in my books.

But in the past, I’ve often heard you use words and phrases in ways that I find promising at a glance, to mean things that I end up finding much less promising when I dig in on the specifics of what you’re talking about. So I’m initially skeptical, especially insofar as I don’t understand your proposal well.

I’d be interested in hearing how you think your proposal addresses the sharp left turn, if you think it does; or maybe you can give me pointers toward particular paragraphs/sections you’ve written up that you think already speak to this problem.

Regarding work on image-classifier conservatism: at a first glance, I don't have much confidence that the types of generalization you’re shooting for are tracking the possibility of sharp left turns. "We want our solutions to generalize" is cheap to say; things that engage with the sharp left turn are more expensive. What’s an example of a kind of present-day research on image classifier conservatism that you’d expect to help with the sharp left turn (if you do think any would help)?

Rebecca Gorman, in an email thread: We're working towards something that achieves interpretability objectives, and does so better than current approaches.

Agreed that AGI alignment isn't just a scaled-up version of narrow-AI robustness problems. But if we need to establish the foundations of alignment before we reach AGI and build it into every AI being built today (since we don't know when and where superintelligence will arise), then we need to try to scale down the alignment problem to something we can start to research today.

As for the article [A central AI alignment problem: capabilities generalization, and the sharp left turn]: I think it's an excellent article, but I'll give an insufficient response. I agree that capabilities form an attractor well. And that we don't get a strong understanding of human values as easily. That's why we think it's important to invest energy and resources into giving AI a strong understanding of human values; it's probably a harder problem. But - at a high level, some of the methods for getting there may generalize. That, at least, is a hopeful statement.

Nate: That sounds like a laudable goal. I have not yet managed to understand what sort of foundations of alignment you're trying to scale down and build into modern systems. What are you hoping to build into modern systems, and how do you expect it to relate to the problem of aligning systems with capabilities that generalize far outside of training?

So far, from parts of the aforementioned email thread that have been elided in this dialog, I have not yet managed to extract a plan beyond "generate training data that helps things like modern image classifiers distinguish intended features (such as ‘pre-treatment collapsed lung’ from ‘post-treatment collapsed lung with chest drains installed’, despite the chest-drains being easier to detect than the collapse itself)", and I don't yet see how generating this sort of training data and training modern image-classifiers thereon addresses the tricky alignment challenges I worry about.

Stuart, in an email thread: In simple or typical environments, simple proxies can achieve desired goals. Thus AIs tend to learn simple proxies, either directly (programmers write down what they currently think the goal is, leaving important pieces out) or indirectly (a simple proxy fits the training data they receive - eg image classifiers focusing on spurious correlations).

Then the AI develops a more complicated world model, either because the AI is becoming smarter or because the environment changes by itself. At this point, by the usual Goodhart arguments, the simple proxy no longer encodes desired goals, and can be actively pernicious.

What we're trying to do is to ensure that, when the AI transitions to a different world model, this updates its reward function at the same time. Capability increases should lead immediately to alignment increases (or at least alignment changes); this is the whole model splintering/value extrapolation approach.

The benchmark we published is a much-simplified example of this: the "typical environment" is the labeled datasets where facial expression and text are fully correlated. The "simple proxy/simple reward function" is the labeling of these images. The "more complicated world model" is the unlabeled data that the algorithm encounters, which includes images where the expression feature and the text feature are uncorrelated. The "alignment increase" (or, at least, the first step of this) is the algorithm realising that there are multiple distinct features in its "world model" (the unlabeled images) that could explain the labels, and thus generating multiple candidates for its "reward function".

One valid question worth asking is why we focused on image classification in a rather narrow toy example. The answer is that, after many years of work in this area, we've concluded that the key insights in extending reward functions do not lie in high-level philosophy, mathematics, or modelling. These have been useful, but have (temporarily?) run their course. Instead, practical experiments in value extrapolation seem necessary - and these will ultimately generate theoretical insights. Indeed, this has already happened; we now have, I believe, a much better understanding of model splintering than before we started working on this.

As a minor example, this approach seems to generate a new form of interpretability. When the algorithm asks the human to label a "smiling face with SAD written on it", it doesn't have a deep understanding of either expression or text; nor do humans have an understanding of what features it is really using. Nevertheless, seeing the ambiguous image gives us direct insight into the "reward functions" it is comparing, a potential new form of interpretability. There are other novel theoretical insights which we've been discussing in the company, but they're not yet written up for public presentation.

We're planning to generalise the approach and insights from image classifiers to other agent designs (RL agents, recommender systems, language models...); this will generate more insights and understanding on how value extrapolation works in general.

Nate: In Nate-speak, the main thing I took away from what you've said is "I want alignment to generalize when capabilities generalize. Also, we're hoping to get modern image classifiers to ask for labels on ambiguous data."

"Get the AI to ask for labels on ambiguous data" is one of many ideas I'd put on a list of shallow alignment ideas that are worth implementing. To my eye, it doesn't seem particularly related to the problem of pointing an AGI at something in a way that's robust to capabilities-start-generalizing.

It's a fine simple tool to use to help point at the concept you were hoping to point at, if you can get an AGI to do the thing you're pointing toward at all, and it would be embarrassing if we didn't try it. And I'm happy to have people trying early versions of such things as soon as possible. But I don't see these sorts of things as shedding much light on how you get a post-left-turn AGI to optimize for some concept of your choosing in the first place. If you could do that, then sure, getting it to ask for clarification when the training data is ambiguous is a nice extra saving throw (if it wasn't already doing that automatically because of some deeper corrigibility success), but I don't currently see this sort of thing as attacking one of the core issues.[5]
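As a minimal sketch of the "ask for labels on ambiguous data" idea discussed in this section: the toy below is not the published benchmark or anyone's actual method. It assumes each image has already been reduced to two hypothetical binary features, `(expression, text)`, and that the only candidate "reward functions" are "label by expression" and "label by text". Both fit the labeled data, where the features are fully correlated, and the examples worth querying are the unlabeled ones on which they disagree.

```python
from typing import Callable, List, Tuple

# Hypothetical encoding: (expression, text), with 1 = HAPPY and 0 = SAD.
Example = Tuple[int, int]
Hypothesis = Callable[[Example], int]


def by_expression(x: Example) -> int:
    """Candidate labeling rule that reads only the facial expression."""
    return x[0]


def by_text(x: Example) -> int:
    """Candidate labeling rule that reads only the written text."""
    return x[1]


def consistent(h: Hypothesis, labeled: List[Tuple[Example, int]]) -> bool:
    """True if the hypothesis reproduces every known label."""
    return all(h(x) == y for x, y in labeled)


def ambiguous(hypotheses: List[Hypothesis], pool: List[Example]) -> List[Example]:
    """Unlabeled examples on which the surviving hypotheses disagree."""
    return [x for x in pool if len({h(x) for h in hypotheses}) > 1]


# Labeled data: expression and text always agree, so both rules fit.
labeled = [((1, 1), 1), ((0, 0), 0)]
# Unlabeled data: includes images where the two features come apart.
unlabeled = [(1, 1), (1, 0), (0, 1), (0, 0)]

surviving = [h for h in (by_expression, by_text) if consistent(h, labeled)]
queries = ambiguous(surviving, unlabeled)

# Suppose a human labels the "smiling face with SAD written on it" as HAPPY.
labeled_after_query = labeled + [((1, 0), 1)]
remaining = [h for h in surviving if consistent(h, labeled_after_query)]
```

Here `queries` contains exactly the two mixed examples, and a single human answer eliminates the text-based rule. The sketch only illustrates disambiguating between already-enumerated candidate concepts; it says nothing about how a system would come to optimize for the chosen concept in the first place.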

 

Andrew Critch & political solutions

Imaginary Andrew Critch: Just politick between the AGI teams and get them all to agree to take the problem seriously, not race, not cut corners on safety, etc.

Nate: Uh, that ship sailed in, like, late 2015. My fairly-strong impression, from my proximity to the current politics between the current orgs, is "nope".

Also, even if this wasn't a straight-up "nope", you have the question of what you do with your cooperation. Somehow you've still got to leverage this cooperation into the end of the acute risk period, before the people outside your alliance end the world. And this involves having a leadership structure that can distinguish bad plans from good ones.

The alliance helps, for sure. It takes a bunch of the time pressure off (assuming your management is legibly capable of distinguishing good deployment ideas from bad ones). I endorse attempts to form such an alliance. (And it sure would be undignified for our world to die of antitrust law at the final extremity.) But it's not an attempt to solve this hard technical problem, and it doesn't alleviate enough pressure to cause me to think that the problem would eventually be solved, in this field where ~nobody manages to strike for the heart of the problem before them.

Imaginary Andrew Critch: So get global coordination going! Or have some major nation-state regulate global use of AI, in some legitimate way!

Nate: Here I basically have the same response: First, can't be done (though I endorse attempts to prove me wrong, and recommend practicing by trying to effect important political change on smaller-stakes challenges ASAP (The time is ripe for sweeping global coordination in pandemic preparedness! We just had our warning shot! If we'll be able to do something about AGI later, presumably we can do something analogous about pandemics now!)). 

Second, it doesn't alleviate enough pressure; the bureaucrats can't tell real solutions from bad ones; the cost to build an unaligned AGI drops each year; etc., etc. Sufficiently good global coordination is a win condition, but we're not anywhere close to on track for that, and in real life we're still going to need technical solutions.

Which, apparently, only a handful of people in the world are trying to provide.

 

What about superbabies?

Nate: I doubt we have the time, but sure, go for superbabies. It's as dignified as any of the other attempts to walk around this hard problem.

 

What about other MIRI people?

There are a few people supported at least in part by MIRI (such as Scott and Vanessa) who seem to me to have identified confusing and poorly-understood aspects of cognition. And their targets strike me as the sort of things where if we got less confused about what the heck was going on, then we might thereby achieve a somewhat better understanding of minds/optimization/etc., in a way that sheds some light on the hard problems. So yeah, I'd chalk a few other MIRI-supported folk up in the "trying to tackle the hard problems" column.

We still wouldn’t have anything close to a full understanding, and at the progress rate of the last decade, I’d expect it to take a century for research directions like these to actually get us to an understanding of minds sufficient to align them.

Maybe early breakthroughs chain into follow-up breakthroughs that shorten that time? Or maybe if you have fifty people trying that sort of thing, instead of 3–6, one of them ends up tugging on a thread that unravels the whole knot if they manage to succeed in time. It seems good to me that researchers are trying approaches like these, but the existence of a handful of people making such an attempt doesn’t seem to me to represent much of an update about humanity’s odds of survival.

 

High-level view

I again stress that all the people whose plans I am pessimistic about are people that I consider virtuous, and whose efforts I applaud. (And that my characterizations of people above are probably not endorsed by those people, and that I'm putting less effort into passing their ideological Turing Tests than would be virtuous of me, etc. etc.)

Nevertheless, my overall impression is that most of the new people coming into alignment research end up pursuing research that seems doomed to me, not just because they're unlikely to succeed at their stated research goals, but because their stated research goals have little overlap with what seem to me to be the tricky bits. Or, well, that's what happens at best; what happens at worst is they wind up doing capabilities work with a thin veneer of alignment research.

Perhaps unfairly, my subjective experience of people entering the alignment research field is that there are:

  • a bunch of plans like Owen's (that seem to me to just completely miss the problem),
  • and a bunch of people who study some local phenomenon of modern systems that seems to me to have little relationship to the difficult problems that I expect to arise once things start getting serious, while calling that "alignment" (thus watering down the term, and allowing them to convince themselves that alignment is actually easy because it's just as easy to train a language model to answer "morality" questions as it is to train it to explain jokes or whatever),
  • and a few people who do capabilities work so that they can "stay near the action",
  • and very few who are taking stabs at the hard problems.

An exception is interpretability work, which I endorse, and which I think is getting rightful efforts (though I will caveat that some grim part of me expects that somehow interpretability work will be used to boost capabilities long before it gets to the high level required to face down the tricky problems I expect in the late game). And there are definitely a handful of folk plugging away at research proposals that seem to me to have non-trivial inner product with the tricky problems.

In fact, when writing this list, I was slightly pleasantly surprised by how many of the research directions seem to me to have non-trivial inner product with the tricky problems.[6]

This isn't as much of a positive update as it might first seem, on account of how it looks to me like the total effort in the field is not distributed evenly across all the above proposals, and I still have a general sense that most researchers aren't really asking questions whose answers would really help us out. But it is something of a positive update nevertheless.

Returning to one of the better-by-my-lights proposals from above, Natural Abstractions: If this agenda succeeded and was correct in a key hypothesis, this would directly solve a big chunk of the problem.

I don't buy the key hypothesis (in the relevant way), and I don't expect that agenda to succeed.[7] But if I was saying that about a hundred pretty-uncorrelated agendas being pursued by two hundred people, I'd start to think that maybe the odds are in our favor.

My overall impression is still that when I actually look at the particular community we have, weighted by person-hours, the large majority of the field isn't trying to solve the problem(s) I expect to kill us. They're just wandering off in some other direction.

It could turn out that I’m wrong about one of these other directions. But "turns out the hard/deep problem I thought I could see, did not in fact exist" feels a lot less likely, on my own models, than "one of these 100 people, whose research would clearly solve the problem if it achieved its self-professed goals, might in fact be able to achieve their goals (despite me not sharing their research intuitions)".

So the status quo looks grim to me.

I in fact think it's nice to have some people saying "we can totally route around that problem", and then pursuing research paths that they think route around the problem!

But currently, we have only a few fractions of plans that look to me to be trying to solve the problem that I expect to actually kill us. Like a field of contingency plans with no work going into a Plan A; or like a field of pandemic preparedness that immediately turned its gaze away from the true disaster scenarios and focused the vast majority of its effort on ideas like “get people to eat healthier so that their immune systems will be better-prepared”. (Not a perfect analogy; sorry.)

Hence: I'm not highly-pessimistic about our prospects because I think this problem is extraordinarily hard. I think this problem is normally hard, and very little effort is being deployed toward solving it.

Like, you know how some people out there (who I'm reluctant to name for fear that reminding them of their old stances will contribute to fixing them in their old ways) are like, "Your mistake was attempting to put a goal into the AGI; what you actually need to do is keep your hands off it and raise it compassionately!"? And from our perspective, they're just walking blindly into the razor blades?

And then other people are like, "The problem is giving the AGI a bad goal, or letting bad people control it", and... well, that's probably still where some of you get off the train, but to the rest of us, these people also look like they're walking willfully into the razor blades?

Well, from my perspective, the people who are like, "Just keep training it on your objective while being somewhat clever about the training, maybe that empirically works", are also walking directly into the razor blades.

(And it doesn't help that a bunch of folks are like "Well, if you're right, then we'll be able to update later, when we observe that getting language models to answer ethical questions is mysteriously trickier than getting them to answer other sorts of questions", apparently impervious to my cries of "No, my model does not predict that, my model does not predict that we get all that much more advance evidence than we've got already". If the evidence we have isn't enough to get people focused on the central problems, then we seem to me to be in rather a lot of trouble.)

My current prophecy is not so much "death by problem too hard" as "death by problem not assailed".

Which is absolutely a challenge. I'd love to see more people attacking the things that seem to me like they're at the core.

 

  1. ^

    I ran a few of the dialogs past the relevant people, but that has empirically dragged out the amount of time it takes this post to publish, and I have a handful of other posts to publish afterwards, so I neglected to get feedback from most of the people mentioned. Sorry.

  2. ^

    Much of Vanessa, Scott, etc.'s work does look to me like it is grappling with confusions related to the problem of aiming minds in theory, and if their research succeeds according to their own lights then I would expect to have a better understanding of how to aim minds in general, even ones that had undergone some sort of "sharp left turn".

    Which is not to say that I’m optimistic about whether any of these plans will succeed by their own lights. Regardless, they get points for taking a swing, and the thing I’m mostly advocating for is that more people take swings at this problem at all, not that we filter strongly on my optimism about specific angles of attack.

    I tried to solve the problem myself for a few years, and failed. Turns out I wasn't all that good at it.

    Maybe I'll be able to do better next time, and I poke at it every so often. (Even though in my mainline prediction, we won’t have the time to complete the sort of research paths that I can see and that I think have any chance of working.)

    MIRI funds or offers-to-fund most every researcher who I see as having this "their work would help with the generalization problem if they succeeded" property and as doing novel, nontrivial work, so it's no coincidence that I feel more positive about Vanessa, etc.'s work. But I'd like to see far more attempts to solve this problem than the field is currently marshaling.

  3. ^

    Again, to be clear, it's nice to have some people trying to route around the hard problems wholesale. But I don't count such attempts as attacks on the problem itself. (I'm also not optimistic about any attempts I have yet seen to dodge the problem, but that's a digression from today's topic.)

  4. ^

    I couldn't understand Stuart's views from what he's written publicly, so I ran this section by Stuart and Rebecca, who requested that I use actual quotes instead of my attempted paraphrasings. If I'd had more time, I'd like to have run all the dialogs by the researchers I mentioned in this post, and iterated until I could pass everyone's ideological Turing Test, as opposed to the current awkward set-up where the people that I thought I understood didn't get as much chance for feedback. But the time delay from editing this one section is evidence that this wouldn't be worth the time burnt. Instead, I hope the comments can correct any mischaracterizations on my part.

  5. ^

    Note also that while having the AI ask for clarification in the face of ambiguity is nice and helpful, it is of course far from autonomous-AGI-grade.

  6. ^

    I specifically see:

    • ~3 MIRI-supported research approaches that are trying to attack a chunk of the hard problem (with a caveat that I think the relevant chunks are too small and progress is too slow for this to increase humanity's odds of success by much).
    • ~1 other research approach that could maybe help address the core difficulty if it succeeds wildly more than I currently expect it to succeed (albeit no one is currently spending much time on this research approach): Natural Abstractions. Maybe 2, if you count sufficiently ambitious interpretability work.
    • ~2 research approaches that mostly don't help address the core difficulty (unless perhaps more ambitious versions of those proposals are developed, and the ambitious versions wildly succeed), but might provide small safety boosts on the mainline if other research addresses the core difficulty: Concept Extrapolation, and current interpretability work (with a caveat that sufficiently ambitious interpretability work would seem more promising to me than this).
    • 9+ approaches that appear to me to be either assuming away what look to me like the key problems, or hoping that we can do other things that allow us to avoid facing the problem: Truthful AI, ELK, AI Services, Evan's approach, the Richard/Rohin meta-approach, Vivek's approach, Critch's approach, superbabies, and the "maybe there is a pretty wide attractor basin around my own values" idea.
  7. ^

    I rate "interpretability succeeds so wildly that we can understand and aim one of the first AGIs" as probably a bit more plausible than "natural abstractions are so natural that, by understanding them, we can practically find concepts-worth-optimizing-for in an AGI". Both seem very unlikely to me, though they meet my bar for “deserving of a serious effort by humanity” in case they work out.

13 comments

Thanks for this, and especially for your last post (I'm viewing this as kind of an appendix-of-examples to the last post, which was one of my favourite pieces from the MIRI-sphere or indeed on AI alignment from anywhere). General themes I want to pick out:

  • My impression is that there is a surprising dearth of discussion of what the hard parts of alignment actually are, and that this is one of the most important discussions to have given that we don't have clean agreed articulations of the issues
    • I thought your last post was one of the most direct attempts to discuss this that I've seen, and I'm super into that
  • I am interested in further understanding "what exactly would constitute a sharp left turn, and will there be one?"
  • I'm in strong agreement that the field would be healthier if more people were aiming at the central problems, and I think it's super healthy for you to complain about how it seems to you like they're missing them. 
    • I don't think everyone should be aiming directly at the central problems because I think it may be that we don't yet know enough to articulate and make progress there, and it can be helpful as a complement to build up knowledge that could later help with central problems; I would at least like it though if lots of people spent a little bit of time trying to understand the central problems, even if they then give up and say "seems like we can't articulate them yet" or "I don't know how to make progress on that" and go back to more limited things that they know how to get traction on, while keeping half an eye on the eventual goal and how it's not being directly attacked.

I also wanted to clarify that Truthful AI was not trying to solve the hard bit of alignment (I think my coauthors would all agree with this). I basically think it could be good for two reasons:

  1. As a social institution it could put society in a better place to tackle hard challenges (like alignment; if we get long enough between building this institution and having to tackle alignment proper).
  2. It could get talented people who wouldn't otherwise be thinking about alignment to work on truthfulness. And I think that some of the hard bits of truthfulness will overlap with the hard bits of alignment, so it might produce knowledge which is helpful for alignment.

(There was also an exploration of "OK but if we had fully truthful AI maybe that would help with alignment", but I think that's more a hypothetical sideshow than a real plan.)

So I think you could berate me for choosing not to work on the hard part of the problem, but I don't want to accept the charge of missing the point. So why don't I work on the hard part of the problem? I think:

  • I don't actually perceive the hard part of the problem clearly
    • It feels slippery, and trying to tackle it head-on prematurely seems too liable to result in doing work that I will later think completely misses the point
    • But I can perceive the shape of something there (I may or may not end up agreeing with you about its rough contours), so I prefer to think about a variety of things with some bearing on alignment, and periodically check back in to see how much enlightenment I now have about the central things
      • You could think of me as betting on something like Grothendieck's rising sea approach to alignment (although of course it's quite likely I'll never actually get the shell open)
    • This is part of what made my taste sensors fire very happily on your posts!
  • I think there is a web of things which can put us in a position of "more likely well-equipped to make it through", and when I see I have traction on some of those it feels like there's a real substantive opportunity cost to just ignoring them

(Laying this out so that you know the basic shape of my thinking, such that if you want to make a case that I should devote time to tackling things more head-on, you'll know what I need to be moved on.)

This is pretty unrelated to the substantial content here to the point where I'm unsure about writing about it, especially as it's looking like it's going to be the first comment on this post. Still, I wanted to offer some feedback on messaging in case it helps. Whenever I see you use the word "dignity" in this piece, I sort of recoil and feel more alienated from this post overall. In particular, there are two semi-related reasons for this:

  1. It references a post that almost single-handedly gave me, and I think lots of others, pretty bad mental health issues for a few months, and to an extent still today. On its own I don't know that this disqualifies the piece as worthwhile, but it does make me recoil from references to it like a hand on a hot stove somewhat automatically. On a more substantial level, I think the post itself was awkward and probably a misstep, given the weird April Fools but not April Fools framing and the fact that it didn't contribute much to the substantial discussion, though it had some valuable things to say about being a consequentialist rather than a cartoon supervillain (which I think he has said elsewhere less prominently if I remember).

  2. Dignity is the wrong thing to aim for. This is maybe the more substantial problem I have. First and foremost, I don't want to aim for "dying with dignity". I would rather just, exclusively, aim for not dying. It's true there's an awfully funny coincidence that Yudkowsky's version of "dying with dignity" lines up so perfectly with aiming for not dying, but that still doesn't justify aiming for it instead. Dignity, in this situation, is just not that motivating to me. If aiming for "dying with dignity" diverges one nano-degree from aiming for not dying, then it is the wrong thing to aim for. This is especially striking to find in writing from Yudkowsky, for those of us who have read his other stuff, because it is not how he ever advocates people think elsewhere. Whenever you parry, hit, spring, strike or touch the cutting sword of AGI safety, you must cut the actual solution in the same movement. This almost is enough to convince me that this peculiar dignity framing was the only part of the piece that was at least sort of facetious in the whole, not-very-April-Foolsy post. At the very least I can't relate to ever getting to a point of being so hopeless that what I am aiming for is "dying with dignity" rather than just "not dying".

I guess my steelman of this is that, if you aim for not dying, you will probably be disappointed, so to keep motivated, you should aim for something achievable, like dignity. I am not nearly as pessimistic as you or Yudkowsky about this matter, and maybe if I was this framing would seem better to me, but even then I find it unlikely. What differs between a world where a few people put in a good deal of effort, this effort proves mostly counterproductive or just misguided, and they all die, and a world where a few people put in a good deal of effort in the right direction, aren't accidentally counterproductive, and they almost don't die? I am just as sympathetic to both, and I find that they have similar dignity in my eyes, but the latter world, most crucially, almost survived. More dignified than both is a world in which everyone coordinates a great deal and society really buckles down and gets serious about the issue and they still die. Less dignified is a world where no one but one lonely weirdo cares at all, everyone else laughs about it, and the one weirdo gives up, and they all die. Both of those worlds are unrealistic at this point. I judge that if our option set for solving this problem is narrow enough that you can expect us to probably fail with any reliability, then it is narrow enough that we can't change how much dignity we die with very much either.

All of this is a rambling way of saying that this "dignity" stuff has come up in a number of serious writings from MIRI now, and I'm worried that it is going to become a standard fixture of your messaging. I just want to register some concerns I have about this happening; I would rather you just say that things increase our odds of succeeding than say that things increase our dignity.

Maybe I'm missing something, but it seems that "dignity" only appears once in the OP? Namely, here:

On my model, solutions to how capabilities generalize further than alignment are necessary but not sufficient. There is dignity in attacking a variety of other real problems, and I endorse that practice.

This usage appears to have nothing to do with the April Fool's Day post.

Perhaps Soares made a subsequent edit to the OP?

"Dignity" indeed only occurs once, and I assume it's calling back to the same "death with dignity" concept from the April Fool's post (which I agree shouldn't have been framed as an April Fool's thing).

I assume EY didn't expect the post to have such a large impact, in part because he'd already said more or less the same thing, with the same terminology, in a widely-read post back in November 2021:

Anonymous 

At a high level one thing I want to ask about is research directions and prioritization. For example, if you were dictator for what researchers here (or within our influence) were working on, how would you reallocate them?

Eliezer Yudkowsky 

The first reply that came to mind is "I don't know." I consider the present gameboard to look incredibly grim, and I don't actually see a way out through hard work alone. We can hope there's a miracle that violates some aspect of my background model, and we can try to prepare for that unknown miracle; preparing for an unknown miracle probably looks like "Trying to die with more dignity on the mainline" (because if you can die with more dignity on the mainline, you are better positioned to take advantage of a miracle if it occurs).

The term also shows up a ton in the Late 2021 MIRI Conversations, e.g., here and here

I appreciate the data point about the term being one you find upsetting to run into; thanks for sharing about that, Devin. And, for whatever it's worth, I'm sorry. I don't like sharing info (or framings) that cause people distress like that.

I don't know whether data points like this will update Nate and/or Eliezer all the way to thinking the term is net-negative to use. If not, and this is a competing access needs issue ('one group finds it much more motivating to use the phrase X; another group finds that exact same phrase extremely demotivating'), then I think somebody should make a post walking folks through a browser text-replacement method that can swap out words like 'dignity' and 'dignified' (on LW, the EA Forum, the MIRI website, etc.) for something more innocuous/silly.

The word dignity only appears once, but variations appear as well:

"And it sure would be undignified for our world to die of antitrust law at the final extremity."

"It's as dignified as any of the other attempts to walk around this hard problem"

Some version of this reference appears mostly when Soares is endorsing efforts to solve a problem in a way that won't work if the standard MIRI model of doom is correct, but which is still worthwhile in case it isn't. To be clear, I respect you, Soares, and Yudkowsky a great deal; my impression is that MIRI is a great bunch of folks whose approach is worthwhile, even if I lean somewhat more Christiano/Critch on some of these issues. It is also possible that dignity is a good framing overall and I'm just weird, in which case I fully endorse using it. I just personally don't like it for the reasons I mentioned, and I think there are many others with similar reactions.

Oops, thanks! I checked for those variants elsewhere but forgot to do so here. :)

It is also possible that dignity is a good framing overall and I'm just weird, in which case I fully endorse using it.

I think it's a good framing for some people and not for others. I'm confident that many people shouldn't use this framing regularly in their own thinking. I'm less sure about whether the people who do find it valuable should steer clear of mentioning it; that's a bit more extreme.

That's fair; I think it depends on how it's intended. If the point is to talk about how you think about or relate to the issue, talking about the framing that works best for you makes sense. If the purpose is outreach, there are framings that make more or less sense to use.

I constructed an agent where you can literally prove that if you set a parameter high enough, it won't try to kill everyone, while still eventually at least matching human-level intelligence. Sure it uses a realizability assumption, sure it's intractable in its current form, sure it might require an enormously long training period, but these are computer science problems, not philosophy problems, and they clearly suggest paths forward. The underlying concept is sound. It struck me as undignified to say this in the past, but maybe dignity rightly construed should compel me to: it absolutely boggles me that ~no one in the EA community talks about this. It's not in this blog post; it's not in Richard's curriculum; it wasn't in Evan's list of promising AGI safety ideas.

I agree with your perspective on all of these approaches, except my initial reaction is to be more pessimistic about natural abstractions. It seems to me that a good understanding of natural abstractions is not good enough for putting a handle on a part of an agent's mind. We'd also need to understand "natural types", the type signatures that agents' brains use to represent those abstractions. And I think that there is a long, long list of types, in which each is as natural as the rest.

There's an interpretability benchmark that occurred to me recently, which I may as well mention here, because I agree approximately none of the interpretability research I see strikes me as progress toward strategically relevant interpretation of AGI. Try to understand what corvids are saying to each other.

You maintain this pretty well as it walks up through to primate, and then suddenly it takes a sharp left turn and invents its own internal language and a bunch of abstract concepts, and suddenly you find your visualization tools to be quite lacking for interpreting its abstract mathematical reasoning about topology or whatever.

Empirically speaking, scientists who are trying to understand human brains do spend a lot (most?) of their time looking at nonhuman brains, no?

Is Nate's objection here something like "human neuroscience is not at the level where we deal with 'sharp left turn' stuff, and I expect that once neuroscientists can understand chimpanzee brains very well they will discover that there is in fact a whole other set of problems they need to solve to understand human brains, and that this other set of problems is actually the harder one?"

scientists who are trying to understand human brains do spend a lot (most?) of their time looking at nonhuman brains, no?

My sense is that this is mostly for ethics reasons, rather than representing a strong stance that animal models are the fastest way to make progress on understanding human cognition.

Thanks! That sounds right to me, but I had thought that Nate was making a stronger objection, something like "looking at nonhuman brains is useless because you could have a perfect understanding of a chimpanzee brain but still completely fail to predict human behavior (after a 'sharp left turn')."

Is that wrong? Or is he just saying something like "looking at nonhuman brains is 90% less effective and given long enough timelines these research projects will pan out - I just don't expect us to have long enough timelines?"

"looking at nonhuman brains is useless because you could have a perfect understanding of a chimpanzee brain but still completely fail to predict human behavior (after a 'sharp left turn')."

Sounds too strong to me. If Nate or Eliezer thought that it would be totally useless to have a perfect understanding of how GPT-3, AlphaZero, and Minerva do their reasoning, then I expect that they'd just say that.

My Nate-model instead says things like:

  • Current transparency work mostly isn't trying to gain deep mastery of how GPT-3 etc. do their reasoning; and to the extent it's trying, it isn't making meaningful progress.

    ('Deep mastery of how this system does its reasoning' is the sort of thing that would let us roughly understand what thoughts a chimpanzee is internally thinking at a given time, verify that it's pursuing the right kinds of goals and thinking about all (and only) the right kinds of topics, etc.)
     
  • A lot of other alignment research isn't even trying to understand chimpanzee brains, or future human brains, or generalizations that might hold for both chimps and humans; it's just assuming there's no important future chimp-to-human transition it has to worry about.
     
  • Once we build the equivalent of 'humans', we won't have much time to align them before the tech proliferates and someone accidentally destroys the world. So even if the 'understand human cognition' problem turns out to be easier than the 'understand chimpanzee cognition' problem in a vacuum, the fact that it's a new problem and we have a lot less time to solve it makes it a lot harder in practice.

This is a nice writeup and summary.

I personally think that this is yet more evidence that formal control is a more promising path than the others. If you can formally prove that your code, when properly executed, has certain properties, then that gives you some hope that those properties will be durable during and after a sharp left turn.

For example, if you had a magic wand, you could formally prove that any AI designed by a formally controlled AI will also be formally controlled. That way, even if it whooshes and completely redesigns itself, there is still some hope.

I would love to see the resources going into formal methods multiplied by 10 or 100. I think that if we built a really solid field, where all of modern mathematics and computer science is formalised and people write formally verified code by default because it's safer and there are good libraries for doing so, then the control problem would become easier in that environment, if still extremely hard.