
Far fewer people are working on it than you might think, and even the alignment research that is happening is very much not on track. (But it’s a solvable problem, if we get our act together.)

Observing from afar, it's easy to think there's an abundance of people working on AGI safety. Everyone on your timeline is fretting about AI risk, and it seems like there is a well-funded EA-industrial-complex that has elevated this to their main issue. Maybe you've even developed a slight distaste for it all—it reminds you a bit too much of the woke and FDA bureaucrats, and Eliezer seems pretty crazy to you.

That’s what I used to think too, a couple of years ago. Then I got to see things more up close. And here’s the thing: nobody’s actually on the friggin’ ball on this one!

  • There are far fewer people working on it than you might think. There are plausibly 100,000 ML capabilities researchers in the world (30,000 attended ICML alone) vs. 300 alignment researchers in the world, a factor of ~300:1. The scalable alignment team at OpenAI has all of ~7 people.
  • Barely anyone is going for the throat of solving the core difficulties of scalable alignment. Many of the people who are working on alignment are doing blue-sky theory, pretty disconnected from actual ML models. Most of the rest are doing work that’s vaguely related, hoping it will somehow be useful, or working on techniques that might work now but predictably fail to work for superhuman systems.

There’s no secret elite SEAL team coming to save the day. This is it. We’re not on track.

If timelines are short and we don’t get our act together, we’re in a lot of trouble. Scalable alignment—aligning superhuman AGI systems—is a real, unsolved problem. It’s quite simple: current alignment techniques rely on human supervision, but as models become superhuman, humans won’t be able to reliably supervise them.

But my pessimism on the current state of alignment research very much doesn’t mean I’m an Eliezer-style doomer. Quite the opposite, I’m optimistic. I think scalable alignment is a solvable problem—and it’s an ML problem, one we can do real science on as our models get more advanced. But we gotta stop fucking around. We need an effort that matches the gravity of the challenge.[1]


Alignment is not on track

A recent post estimated that there were 300 full-time technical AI safety researchers (sounds plausible to me, if we’re counting generously). By contrast, there were 30,000 attendees at ICML in 2021, a single ML conference. It seems plausible that there are ≥100,000 researchers working on ML/AI in total. That’s a ratio of ~300:1, capabilities researchers:AGI safety researchers.
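As a quick sanity check on the arithmetic, here is a minimal sketch that recomputes the ratios from the figures quoted in this post. The numbers are the post's rough estimates, not measurements, and the variable and function names are made up for illustration.

```python
# Rough figures quoted in the post; all of them are estimates.
ml_researchers = 100_000       # ">=100,000 researchers working on ML/AI in total"
safety_researchers = 300       # "300 full-time technical AI safety researchers"
openai_staff = 400             # "~400 people at the company in total"
openai_scalable_alignment = 7  # "~7 researchers on the scalable alignment team"


def ratio_to_one(many: int, few: int) -> float:
    """Return how many of `many` there are for each one of `few`."""
    return many / few


overall_ratio = ratio_to_one(ml_researchers, safety_researchers)
openai_ratio = ratio_to_one(openai_staff, openai_scalable_alignment)

print(f"field-wide: about {overall_ratio:.0f}:1")
print(f"OpenAI (all staff vs. scalable alignment): about {openai_ratio:.0f}:1")
```

The field-wide figure comes out slightly above the rounded ~300:1 used in the text. The OpenAI figure uses total headcount, so it overstates the researcher-to-researcher ratio there.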

That ratio is a little better at the AGI labs: ~7 researchers on the scalable alignment team at OpenAI, vs. ~400 people at the company in total (and fewer researchers).[2] But 7 alignment researchers is still, well, not that much, and those 7 also aren’t, like, OpenAI’s most legendary ML researchers. (Importantly, from my understanding, this isn’t OpenAI being evil or anything like that—OpenAI would love to hire more alignment researchers, but there just aren’t many great researchers out there focusing on this problem.)

But rather than the numbers, what made this really visceral to me is… actually looking at the research. There’s very little research where I feel like “great, this is getting at the core difficulties of the problem, and they have a plan for how we might actually solve it in <5 years.”

Let’s take a quick, stylized, incomplete tour of the research landscape.

 

Paul Christiano / Alignment Research Center (ARC).

Paul is the single most respected alignment researcher in most circles. He used to lead the OpenAI alignment team, and he has made useful conceptual contributions (e.g., Eliciting Latent Knowledge, iterated amplification).

But his research now (“heuristic arguments”) is roughly “trying to solve alignment via galaxy-brained math proofs.” As much as I respect and appreciate Paul, I’m really skeptical of this: basically all deep learning progress has been empirical, often via dumb hacks[3] and intuitions, rather than sophisticated theory. My baseline expectation is that aligning deep learning systems will be achieved similarly.[4]

(This is separate from ARC’s work on evals, which I am very excited about, but I would put more in the “AGI governance” category—it helps us buy time, but it’s not trying to directly solve the technical problem.)

Mechanistic interpretability.

Probably the most broadly respected direction in the field, trying to reverse engineer blackbox neural nets so we can understand them better. The most widely respected researcher here is Chris Olah, and he and his team have made some interesting findings.

That said, to me, this often feels like “trying to engineer nuclear reactor security by doing fundamental physics research with particle colliders (and we’re about to press the red button to start the reactor in 2 hours).” Maybe they find some useful fundamental insights, but man am I skeptical that we’ll be able to sufficiently reverse engineer GPT-7 or whatever. I’m glad this work is happening, especially as a longer timelines play, but I don’t think this is on track to tackle the technical problem if AGI is soon.

RLHF (Reinforcement learning from human feedback).

This and variants of this[5] are what all the labs are doing to align current models, e.g. ChatGPT. Basically, train your model based on human raters’ thumbs-up vs. thumbs-down. This works pretty well for current models![6] 

The core issue here (widely acknowledged by everyone working on it) is that this probably won’t scale to superhuman models. RLHF relies on human supervision; but humans won’t be able to reliably supervise superhuman models. (More discussion later in this post.[7])
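To make the "thumbs-up vs. thumbs-down" idea concrete, here is a toy sketch of just the reward-model step: a tiny logistic model fit to rater labels, then used to rank candidate responses. This is an illustration under loud assumptions, not any lab's actual pipeline. The two response features and the ratings are hypothetical, and real systems typically learn from pairwise comparisons with a neural reward model and then fine-tune the policy with RL against it.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def reward(w, x):
    """Scalar reward: dot product of learned weights and response features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_reward_model(ratings, lr=0.5, epochs=200):
    """Fit w so that sigmoid(reward) predicts the rater's thumbs-up (1) or thumbs-down (0)."""
    w = [0.0] * len(ratings[0][0])
    for _ in range(epochs):
        for x, thumbs_up in ratings:
            p = sigmoid(reward(w, x))
            # Gradient of the log-loss with respect to w is (p - label) * x.
            w = [wi - lr * (p - thumbs_up) * xi for wi, xi in zip(w, x)]
    return w


# Hypothetical features per response: [is_helpful, is_rude].
ratings = [
    ([1.0, 0.0], 1),  # helpful and polite -> thumbs-up
    ([1.0, 1.0], 0),  # helpful but rude   -> thumbs-down
    ([0.0, 1.0], 0),  # unhelpful and rude -> thumbs-down
]
w = train_reward_model(ratings)

candidates = {
    "helpful and polite": [1.0, 0.0],
    "helpful but rude": [1.0, 1.0],
    "unhelpful and rude": [0.0, 1.0],
}
best = max(candidates, key=lambda name: reward(w, candidates[name]))
print("highest-reward response:", best)
```

Every training signal in this sketch is a human label. If raters cannot tell whether a response is actually good, the learned reward inherits that blind spot, which is the scaling worry raised above.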

RLHF++ / “scalable oversight” / trying to iteratively make it work.

Something in this broad bucket seems like the labs’ current best guess plan for scalable alignment. (I’m most directly addressing the OpenAI plan; the Anthropic plan has some broadly similar ideas; see also Holden’s nearcasting series for a more fleshed out version of “trying to iteratively make it work,” and Buck’s talk discussing that.)

Roughly, it goes something like this: “yeah, RLHF won’t scale indefinitely. But we’ll try to go as far as we can with things like it. Then we’ll use smarter AI systems to amplify our supervision, and more generally try to use minimally-aligned AGIs to help us do alignment research in crunchtime.”

This has some key benefits:

  • It might work! This is probably the closest to an actual, plausible plan we’ve got.
  • “Iterative experimentation” is usually how science works, and that seems much more promising to me than most blue-sky theory work.

But I think it’s embarrassing that this is the best we’ve got:

  • It’s underwhelmingly unambitious. This currently feels way too much like “improvise as we go along and cross our fingers” to be Plan A; this should be Plan B or Plan E.
  • It might well not work. I expect this to harvest a bunch of low-hanging fruit, to work in many worlds but very much not all (and I think most people working on this would agree). This really shouldn’t be our only plan.
  • It rests on pretty unclear empirical assumptions on how crunchtime will go. Maybe things will go slow enough and be coordinated enough that we can iteratively use weaker AIs to align smarter AIs and figure things out as we go along—but man, I don’t feel confident enough in that to sleep soundly at night.[8] 
  • I’m not sure this plan puts us on track to get to a place where we can be confident that scalable alignment is solved. By default, I’d guess we’d end up in a fairly ambiguous situation.[9] Ambiguity could be fatal, requiring us to either roll the dice on superhuman AGI deployment, or block deployment when we actually really should deploy, e.g. to beat China.[10]

MIRI and similar independent researchers.

I’m just really, really skeptical that a bunch of abstract work on decision theory and similar will get us there. My expectation is that alignment is an ML problem, and you can’t solve alignment utterly disconnected from actual ML systems.

 

This is incomplete, but I claim that in broad strokes that covers a good majority of the work that’s happening. To be clear, I’m really glad all this work is happening! I’m not trying to criticize any particular research (this is the best we have so far!). I’m just trying to puncture the complacency I feel like many people I encounter have.

We’re really not on track to actually solve this problem!


(Scalable) alignment is a real problem

Imagine you have GPT-7, and it’s starting to become superhuman at many tasks. It’s hooked up to a bunch of tools and the internet. You want to use it to help run your business, and it proposes a very complicated series of actions and computer code. You want to know—will this plan violate any laws?

Current alignment techniques rely on human supervision. The problem is that as these models become superhuman, humans won’t be able to reliably supervise their outputs. (In this example, the series of actions is too complicated for humans to be able to fully understand the consequences.) And if you can’t reliably detect bad behavior, you can’t reliably prevent bad behavior.[11] 

You don’t even need to believe in crazy xrisk scenarios to take this seriously; in this example, you can’t even ensure that GPT-7 won’t violate the law!

Solving this problem for superhuman AGI systems is called “scalable alignment”; this is a very different, and much more challenging, problem than much of the near-term alignment work (prevent ChatGPT from saying bad words) being done right now.

A particular case that I care about: imagine GPT-7 as above, and GPT-7 is starting to be superhuman at AI research. GPT-7 proposes an incredibly complex plan for a new, alien, even more advanced AI system (100,000s of lines of code, ideas way beyond current state of the art). It has also claimed to engineer an alignment solution for this alien, advanced system (again way too complex for humans to evaluate). How do you know that GPT-7’s safety solution will actually work? You could ask it—but how do you know GPT-7 is answering honestly? We don’t have a way to do that right now.[12] 

Most people still have the Bostromian “paperclipping” analogy for AI risk in their head. In this story, we give the AI some utility function, and the problem is that the AI will naively optimize the utility function (in the Bostromian example, a company wanting to make more paperclips results in an AI turning the entire world into a paperclip factory).

I don’t think old Bostrom/Eliezer analogies are particularly helpful at this point (and I think the overall situation is even gnarlier than Bostrom’s analogy implies, but I’ll leave that for a footnote[13]). The challenge isn’t figuring out some complicated, nuanced utility function that “represents human values”; the challenge is getting AIs to do what it says on the tin—to reliably do whatever a human operator tells them to do.[14] 

And for getting AIs to do what we tell them to do, the core technical challenge is about scalability to superhuman systems: what happens if you have superhuman systems, which humans can’t reliably supervise? Current alignment techniques relying on human supervision won’t cut it.

Alignment is a solvable problem

You might think that given my pessimism on the state of the field, I’m one of those doomers who has like 99% p(doom). Quite the contrary! I’m really quite optimistic on AI risk.[15] 

Part of that is that I think there will be considerable endogenous societal response (see also my companion post). Right now talking about AI risk is like yelling about covid in Feb 2020. I and many others spent the end of that February in distress over impending doom, and despairing that absolutely nobody seemed to care—but literally within a couple weeks, America went from dismissing covid to everyone locking down. It was delayed and imperfect etc., but the sheer intensity of the societal response was crazy and none of us had sufficiently priced that in.

Most critically, I think AI alignment is a solvable problem. I think the failure so far to make that much progress is ~zero evidence that alignment isn’t tractable. The level and quality of effort that has gone into AI alignment so far wouldn’t have been sufficient to build GPT-4, let alone build AGI, so it’s not much evidence that it’s not been sufficient to align AGI.

Fundamentally, I think AI alignment is an ML problem. As AI systems are becoming more advanced, alignment is increasingly becoming a “real science,” where we can do ML experiments, rather than just thought experiments. I think this is really different compared to 5 years ago.

For example, I’m really excited about work like this recent paper (paper, blog post on broader vision), which prototypes a method to detect “whether a model is being honest” via unsupervised methods. More than just this specific result, I’m excited about the style:

  • Use conceptual thinking to identify methods that might plausibly scale to superhuman models (here: unsupervised methods, which don’t rely on human supervision).
  • Empirically test this with current models.

I think there’s a lot more to do in this vein—carefully thinking about empirical setups that are analogous to the core difficulties of scalable alignment, and then empirically testing and iterating on relevant ML methods.[16]

And as noted earlier, the ML community is huuuuuge compared to the alignment community. As the world continues to wake up to AGI and AI risk, I’m optimistic that we can harness that research talent for the alignment problem. If we can bring in excellent ML researchers, we can dramatically multiply the level and quality of effort going into solving alignment.


Better things are possible

This optimism isn’t cause for complacency. Quite the opposite. Without effort, I think we’re in a scary situation. This optimism is like saying, in Feb 2020, “if we launch an Operation Warp Speed, if we get the best scientists together in a hardcore, intense, accelerated effort, with all the necessary resources and roadblocks removed, we could have a covid vaccine in 6 months.” Right now, we are very, very far away from that. What we’re doing right now is sorta like giving a few grants to random research labs doing basic science on vaccines, at best.

We need a concerted effort that matches the gravity of the challenge. The best ML researchers in the world should be working on this! There should be billion-dollar, large-scale efforts with the scale and ambition of Operation Warp Speed or the moon landing or even OpenAI’s GPT-4 team itself working on this problem.[17] Right now, there’s too much fretting, too much idle talk, and way too little “let’s roll up our sleeves and actually solve this problem.”

The state of alignment research is not good; much better things are possible. We can and should have research that is directly tackling the core difficulties of the technical problem (not just doing vaguely relevant work that might help, not just skirting around the edges); that has a plausible path to directly solving the problem in a few years (not just deferring to future improvisation, not just hoping for long timelines, not reliant on crossing our fingers); and that thinks conceptually about scalability while also working with real empirical testbeds and actual ML systems.

But right now, folks, nobody is on this ball. We may well be on the precipice of a world-historical moment—but the number of live players is surprisingly small.


Thanks to Collin Burns for years of discussion on these ideas and for help writing this post; opinions are my own and do not express his views. Thanks to Holden Karnofsky and Dwarkesh Patel for comments on a draft.

 

 

  1. ^

     Note that I believe all this despite having much more uncertainty on AGI / AGI timelines than most people out here. I might write more about this at some point, but in short, my prior is against AI progress reaching 100% automation, and in favor of something that looks more like 90% automation. And 90% automation is what we’ve seen time and time again as technological progress has advanced; it’s only 100% automation (of e.g. all of science and tech R&D) that would lead to transformative and unparalleled consequences.

    And even if we do get 100% automation AGI, I’m fairly optimistic on it going well.

    I might put AI xrisk in the next 20 years at ~5%. But 5% chance of extinction or similarly bad outcome is, well, still incredibly high!

  2. ^

     I don’t have great numbers for DeepMind; my sense is that it’s 10-20 people on scalable alignment vs. 1000+ at the organization overall? Google Brain doesn’t have ~any alignment people. Anthropic is doing the best of them all, maybe roughly 20-30 people on alignment and interpretability vs. somewhat over 100 people overall?

  3. ^

     E.g., skip connections (rather than f(x), do f(x)+x, so the gradients flow better); batchnorm (hacky normalization); ReLU instead of sigmoid; these and similar were some of the handful of biggest deep learning breakthroughs!

  4. ^

     I do think that alignment will require more conceptual work than capabilities. However, as I discuss later in the post, I think the right role for conceptual work is to think carefully about empirical setups (that are analogous to the ultimate problem) and methods (that could scale to superhuman systems)—but then testing and iterating on these empirically. Paul does pure theory instead.

  5. ^

     Anthropic’s Claude uses Constitutional AI. This still relies on RLHF for “helpfulness,” though it uses AI assistance for “harmlessness.” I still think this has the same scalability issues as RLHF (model assistance on harmlessness is fundamentally based on human supervision of the pre-training and helpfulness RL stages), though I’d be happy to also group this under “RLHF++” in the next section.

  6. ^

     Sydney (Bing chat) had some bizarre failure modes, but I think it’s likely Sydney wasn’t RLHF’d, only finetuned. Compare that to ChatGPT/GPT-4 or Claude, which do really quite well! People will still complain about misalignments of current models, but if I thought this were just scalable to superhuman systems, I’d think we’re totally fine. 

    To be clear, I’m not that into alignment being applied for, essentially, censorship for current models, and think this is fairly distinct from the core long-term problem. See also Paul on “AI alignment is distinct from its near-term applications.”

  7. ^

     See also this Ajeya Cotra post for a more detailed take on how RLHF might fail; this is worth reading, even if I don’t necessarily endorse all of it.

  8. ^

     If AI can automate AI research, I think <1 year takeoff scenarios are pretty plausible (modulo coordination/regulation), meaning <1 year from human-level AGIs to crazy superhuman AGIs. See this analysis by Tom Davidson; see also previous footnote on how a ton of deep learning progress has come from just dumb hacky tweaks (AIs automating AI research could find lots more of these); and this paper on the role of algorithmic progress vs. scaling up compute in recent AI progress.

    You could argue that this <1 year could be many years of effective subjective research time (because we have the AIs doing AI research), and to some extent this makes me more optimistic. That said, the iterative amplification proposals typically rely on “humans augmented by AIs,” so we might still be bottlenecked by humans for AIs doing alignment research. (By contrast to capabilities, which might not be human bottlenecked anymore during this time—just make the benchmark/RL objective/etc. go up.)

    More generally, this plan rests on labs’ ability to execute this really competently in a crazy crunchtime situation—again, this might well work out, but it doesn’t make me sleep soundly at night. (It also has a bit of a funny “last minute pivot” quality to it—we’ll press ahead, not making much progress on alignment, but then in crunchtime we’ll pivot the whole org to really competently do iterative work on aligning these models.)

  9. ^

    “Our iterative efforts have been going ok, but man things have been moving fast and there have been some weird failure modes. I *think* we’ve managed to hammer out those failure modes in our last model, but every time we’ve hammered out failure modes like this in the past, the next model has had some other crazier failure mode. What guarantee will we have that our superhuman AGI won’t fail catastrophically, or that our models aren’t learning to deceive us?”

  10. ^

     See more discussion in my companion post—if I were a lab, I’d want to work really hard towards a clear solution to alignment so I won’t end up being blocked by society from deploying my AGI.

  11. ^

     H/t Collin Burns for helping put it crisply like this.

  12. ^

     The reason I particularly care about this example is that I don’t really expect most of the xrisk to come from “GPT-7”/the first AGI systems. Rather, I expect most of the really scary risk to come from the crazy alien even more advanced systems that “GPT-7”/AIs doing AI research build thereafter.

  13. ^

    Bostrom says we give the AI a utility function, like maximizing paperclips. If only it were so easy! We can’t even give the AI a utility function. Reward is not the optimization target—all we’re doing is specifying an evolutionary process. What we get out of that process is some creature that happens to do well on the selected metric—but we have no idea what’s going on internally in that creature (cf. the Shoggoth meme).

    I think the analogy with human evolution is instructive here. Humans were evolutionarily selected to maximize reproduction. But that doesn’t mean that individual humans have a utility function of maximizing reproduction—rather, we learn drives like wanting to have sex or eating sugar that “in training” helped us do well in the evolutionary selection process. Go out of distribution a little bit, and those drives mean we “go haywire”—look at us, eating too much sugar making us fat, or using contraception to have lots of sex while having fewer and fewer children.

    More generally, rather than Bostrom/Eliezer’s early contributions (which I respect, but think are outdated), I think by far the best current writing on AI risk is Holden Karnofsky’s, and would highly recommend you read Holden’s pieces if you haven’t already.

  14. ^

     If somebody wants to use AIs to maximize paperclip production, fine—the core alignment problem, as I see it, is ensuring that the AI actually does maximize paperclip production if that’s what the user intends to do.

    Misuse is a real problem, and I’m especially worried about the specter of global authoritarianism—but I think issues like this (how do we deal with e.g. companies that have goals that aren’t fully aligned with the rest of society) are more continuous with problems we already face. And in a world where everyone has powerful AIs, I think we’ll be able to deal with them in a continuous manner.

    For example, we’ll have police AIs that ensure other AI systems follow the law. (Again, the core challenge here is ensuring the police AIs do what we tell them to do, rather than, e.g., trying to launch a coup themselves; see Holden here and here.)

    (That said, things will be moving much quicker, aggravating existing challenges and making me worry about things like power imbalances.)

  15. ^

     As mentioned in an earlier footnote, I’d put the chance of AI xrisk in the next 20 years at ~5%. But a 5% chance of extinction or similarly bad outcomes is, well, a lot!

  16. ^

     I will have more to say, another time, about my own ideas and alignment plans I’m most excited about.

  17. ^

Comments

Leopold - thanks for a clear, vivid, candid, and galvanizing post. I agree with about 80% of it. 

However, I don't agree with your central premise that alignment is solvable. We want it to be solvable. We believe that we need it to be solvable (or else, God forbid, we might have to actually stop AI development for a few decades or centuries). 

But that doesn't mean it is solvable. And we have, in my opinion, some pretty compelling reasons to think that it is not solvable even in principle: (1) given the diversity, complexity, and ideological nature of many human values (which I've written about in other EA Forum posts, and elsewhere), (2) given the deep game-theoretic conflicts between human individuals, groups, companies, and nation-states (which cannot be waved away by invoking Coherent Extrapolated Volition, or 'dontkilleveryoneism', or any other notion that sweeps people's profoundly divergent interests under the carpet), and (3) given that humans are not the only sentient stakeholder species that AI would need to be aligned with (advanced AI will have implications for every other of the 65,000 vertebrate species on Earth, and most of the 1,000,000+ invertebrate species, one way or another). 

Human individuals aren't aligned with each other. Companies aren't aligned with each other. Nation-states aren't aligned with each other. Other animal species aren't aligned with humans, or with each other. There is no reason to expect that any AI systems could be 'aligned' with the totality of other sentient life on Earth. Our Bayesian prior, based on the simple fact that different sentient beings have different interests, values, goals, and preferences, must be that AI alignment with 'humanity in general', or 'sentient life in general', is simply not possible. Sad, but true.

I worry that 'AI alignment' as a concept, or narrative, or aspiration, is just promising enough that it encourages the AI industry to charge full steam ahead (in hopes that alignment will be 'solved' before AI advances to much more dangerous capabilities), but it is not delivering nearly enough workable solutions to make their reckless accelerationism safe. We are getting the worst of both worlds -- a credible illusion of a path towards safety, without any actual increase in safety.

In other words, the assumption that 'alignment is solvable' might be a very dangerous X-risk amplifier, in its own right. It emboldens the AI industry to accelerate. It gives EAs (probably) false hope that some clever technical solution can make humans all aligned with each other, and make machine intelligences aligned with organic intelligences. It gives ordinary citizens, politicians, regulators, and journalists the impression that some very smart people are working very hard on making AI safe, in ways that will probably work. It may be leading China to assume that some clever Americans are already handling all those thorny X-risk issues, such that China doesn't really need to duplicate those ongoing AI safety efforts, and will be able to just copy our alignment solutions once we get them.

If we take seriously the possibility that alignment might not be solvable, we need to rethink our whole EA strategy for reducing AI X-risk. This might entail EAs putting a much stronger emphasis on slowing or stopping further AI development, at least for a while. We are continually told that 'AI is inevitable', 'the genie is out of the bottle', 'regulation won't work', etc. I think too many of us buy into the over-pessimistic view that there's absolutely nothing we can do to stop AI development, while also buying into the over-optimistic view that alignment is possible -- if we just recruit more talent, work a little more, get a few more grants, think really hard, etc. 

I think we should reverse these optimisms and pessimisms. We need to rediscover some optimism that the 8 billion people on Earth can pause, slow, handicap, or stop AI development by the 100,000 or so AI researchers, devs, and entrepreneurs that are driving us straight into a Great Filter. But we need to rediscover some pessimism about the concept of 'AI alignment' itself. 

In my view, the burden of proof should be on those who think that 'AI alignment with human values in general' is a solvable problem. I have seen no coherent argument that it is solvable. I've just seen people desperate to believe that it is solvable. But that's mostly because the alternative seems so alarming, i.e., the idea that (1) the AI industry is increasingly imposing existential risks on us all, (2) it has a lot of money, power, talent, influence, and hubris, (3) it will not slow down unless we make it slow down, and (4) slowing it down will require EAs to shift to a whole different set of strategies, tactics, priorities, and mind-sets than we had been developing within the 'alignment' paradigm. 

I agree that the very strong sort of alignment you describe - with the Coherent Extrapolated Volition of humanity, or the collective interest of all sentient beings, or The Form of The Good - is probably impossible and perhaps ill-posed. Insofar as we need this sort of aligned AI for things to go as well as they possibly could, they won't. 

But I don't see why that's the only acceptable target. Aligning a superintelligence with the will of basically any psychologically normal human being (narrower than any realistic target except perhaps a profit-maximizer - in which case yeah, we're doomed) would still be an ok outcome for humans: it certainly doesn't end in paperclips. And alignment with someone even slightly inclined towards impartial benevolence probably goes much better than the status quo, especially for the extremely poor.

(Animals are at much more risk here, but their current situation is also much worse: I'm extremely uncertain how a far richer world would treat factory farming)

I think humans may indeed find ways to scale up their control over successive generations of AIs for a while, and successive generations of AIs may be able to exert some control over their successors, and so on. However, I don't see how at the end of a long chain of successive generations we could be left with anything that cares much about our little primate goals. Even if individual agents within that system still cared somewhat about humans, I doubt the collective behavior of the society of AIs overall would still care, rather than being driven by its own competitive pressures into weird directions.

An analogy I often give is to consider our fish ancestors hundreds of millions of years ago. Through evolution, they produced somewhat smarter successors, who produced somewhat smarter successors, and so on. At each point along that chain, the successors weren't that different from the previous generation; each generation might have said that they successfully aligned their successors with their goals, for the most part. But over all those generations, we now care about things dramatically different from what our fish ancestors did (e.g., worshipping Jesus, inclusion of trans athletes, preventing children from hearing certain four-letter words, increasing the power and prestige of one's nation). In the case of AI successors, I expect the divergence may be even more dramatic, because AIs aren't constrained by biology in the way that both fish and humans are. (OTOH, there might be less divergence if people engineer ways to reduce goal drift and if people can act collectively well enough to implement them. Even if the former is technically possible, I'm skeptical that the latter is socially possible in the real world.)

Some transhumanists are ok with dramatic value drift over time, as long as there's a somewhat continuous chain from ourselves to the very weird agents who will inhabit our region of the cosmos in a million years. But I don't find it very plausible that in a million years, the powerful agents in control of the Milky Way will care that much about what certain humans around the beginning of the third millennium CE valued. Technical alignment work might help make the path from us to them more continuous, but I'm doubtful it will avert human extinction in the long run.

Hi Brian, thanks for this reminder about the longtermist perspective on humanity's future. I agree that in a million years, whatever sentient beings that are around may have little interest or respect for the values that humans happen to have now.

However, one lesson from evolution is that most mutations are harmful, most populations trying to spread into new habitats fail, and most new species go extinct within about a million years. There's huge survivorship bias in our understanding of natural history.

I worry that this survivorship bias leads us to radically over-estimate the likely adaptiveness and longevity of any new digital sentiences and any new transhumanist innovations. New autonomous advanced AIs are likely to be extremely fragile, just because most new complex systems that haven't been battle-tested by evolution are extremely fragile. 

For this reason, I think we would be foolish to rush into any radical transhumanism, or any more advanced AI systems, until we have explored human potential further, and until we have been successfully, resiliently multi-planetary, if not multi-stellar. Once we have a foothold in the stars, and humanity has reached some kind of asymptote in what un-augmented humanity can accomplish, then it might make sense to think about the 'next phase of evolution'. Until then, any attempt to push sentient evolution faster will probably result in calamity.

Thanks. :) I'm personally not one of those transhumanists who welcome the transition to weird posthuman values. I would prefer for space not to be colonized at all in order to avoid astronomically increasing the amount of sentience (and therefore the amount of expected suffering) in our region of the cosmos. I think there could be some common ground, at least in the short run, between suffering-focused people who don't want space colonized in general and existential-risk people who want to radically slow down the pace of AI progress. If it were possible, the Butlerian Jihad solution could be pretty good both for the AI doomers and the negative utilitarians. Unfortunately, it's probably not politically possible (even domestically much less internationally), and I'm unsure whether half measures toward it are net good or bad. For example, maybe slowing AI progress in the US would help China catch up, making a competitive race between the two countries more likely, thereby increasing the chance of catastrophic Cold War-style conflict.

Interesting point about most mutants not being very successful. That's a main reason I tend to imagine that the first AGIs who try to overpower humans, if any, would plausibly fail.

I think there's some difference in the case of intelligence at the level of humans and above, versus other animals, in adaptability to new circumstances, because human-level intelligence can figure out problems by reason and doesn't have to wait for evolution to brute-force its way into genetically based solutions. Humans have changed their environments dramatically from the ancestral ones without killing themselves (yet), based on this ability to be flexible using reason. Even the smarter non-human animals display some amount of this ability (cf. the Baldwin effect). (A web search shows that you've written about the Baldwin effect and how being smarter leads to faster evolution, so feel free to correct/critique me.)

If you mean that posthumans are likely to be fragile at the collective level, because their aggregate dynamics might result in their own extinction, then that's plausible, and it may happen to humans themselves within a century or two if current trends continue.

Brian - that all seems reasonable. Much to think about!

Yes, I think we can go further and say that alignment of a superintelligent AGI even with a single individual human may well be impossible. Is such a thing mathematically verifiable as completely watertight, given the orthogonality thesis, basic AI drives and mesaoptimisation? And if it's not watertight, then all the doom flows through the gaps of imperfect, thought to be "good enough", alignment. We need a global moratorium on AGI development. This year.

...we have, in my opinion, some pretty compelling reasons to think that it is not solvable even in principle, (1) given the diversity, complexity, and ideological nature of many human values... There is no reason to expect that any AI systems could be 'aligned' with the totality of other sentient life on Earth.

One way to decompose the alignment question is into 2 parts:

  1. Can we aim ASI at all? (e.g. Nate Soares' What I mean by “alignment is in large part about making cognition aimable at all”)
  2. Can we align it with human values? (the blockquote is an example of this)

Folks at e.g. MIRI think (1) is the hard problem and (2) isn't as hard; folks like you think the opposite. Then you all talk past each other. ("You" isn't aimed at literally you in particular, I'm summarizing what I've seen.) I don't have a clear stance on which is harder; I just wish folks would engage with the best arguments from each side.

Mo - you might be right about what MIRI thinks will be hard. I'm not sure; it often seems difficult to understand what they write about these issues, since it's often very abstract and seems not very grounded in specific goals and values that AIs might need to implement. I do think the MIRI-type approach radically under-estimates the difficulty of your point number 2.

On the other hand, I'm not at all confident that point number 1 will be easy. My hunch is that both 1 and 2 will prove surprisingly hard. Which is a good reason to pause AI research until we make a lot more progress on both issues. (And if we don't make dramatic progress on both issues, the 'pause' should remain in place as long as it takes. Which could be decades or centuries.)

I've been thinking about this very thing for quite some time, and have been thinking up concrete interventions to help the ML community / industry grasp this. DM me if you're interested to discuss further.

Regarding the analogy you use where humans etc not being aligned with each other implying that human-machine alignment is equally hard: Humans are in competition with other humans. Nation-states are in competition with other nation-states. However AI algorithms are created by humans as a tool (at least, for now that seems to be the intention). Not to say this is an argument to think alignment is possible but I do think this is a flawed analogy.

This is some of the finest writing I've seen on AI alignment which both (a) covers technical content, and (b) is accessible to a non-technical audience.

I particularly liked the fact that the content was opinionated; I think it's easier to engage with content when the author takes a stance rather than just hedges their bets throughout.

Lizka

This comes late, but I appreciate this post and am curating it. I think the core message is an important one, some sections can help people develop intuitions for what the problems are, and the post is written in an accessible way (which is often not the case for AI safety-related posts). As others noted, the post also made a bunch of specific claims that others can disagree with as opposed to saying vague things or hedging a lot, which I also appreciate (see also epistemic legibility). 

I share Charlie Guthmann's question here: I get the sense that some work is in a fuzzy grey area between alignment and capabilities, so comparing the amount of work being done on safety vs. capabilities is difficult. I should also note that I don't think all capabilities work can be defended as safety-relevant (see also my own post on safety-washing). 

...

Quick note: I know Leopold — I don't think this influenced my decision to curate the post, but FYI. 

In my view this is a bad decision. 

As I wrote on LW 

Sorry but my rough impression from the post is you seem to be at least as confused about where the difficulties are as the average alignment researcher you think is not on the ball - and the style of somewhat strawmanning everyone & strong words is a bit irritating.

In particular I don't appreciate the epistemics of these moves together

1. Appeal to seeing things from close proximity. Then I got to see things more up close. And here’s the thing: nobody’s actually on the friggin’ ball on this one!
2. Straw-manning and weak-manning what almost everyone else thinks and is doing
3. Use of emotionally compelling words like 'real science' for vaguely defined subjects where the content may be the opposite of what people imagine. Is the empirical alchemy-style ML type of research what's advocated for as the real science?
4. What overall sounds more like the aim is to persuade, rather than explain

I think curating this signals this type of bad epistemics is fine, as long as you are strawmanning and misrepresenting others in a legible way and your writing is persuasive. Also there is no need to actually engage with existing arguments, you can just claim seeing things more up close.

Also to what extent are moderator decisions influenced by status and centrality in the community...
... if someone new and non-central to the community came up with this brilliant set of ideas how to solve AI safety:
1. everyone working on it is not on the ball. why? they are all working on wrong things!
2. promising is to do something very close to how empirical ML capabilities research works
3. this is a type of problem where you can just throw money at it and attract better ML talent
... I doubt this would have a high chance of becoming curated.

Anecdata: thanks for curating, I didn’t read this when it first came through and now that I did, it really impacted me.

Edit: Coming back after approaching it on LessWrong and now I’m very confused again - seems to have been much less well received. What someone here calls a “great balance of technical and generally legible content” might over there be considered “strawmanning and frustrating”, and I really don’t know what to think.

As others noted, the post also made a bunch of specific claims that others can disagree with as opposed to saying vague things or hedging a lot, which I also appreciate (see also epistemic legibility). 

Thank you for acknowledging this and emphasizing the specific claims being made. I'm guessing you didn't mean to cast aspersions through a euphemism. I'd respect you not being as explicit about it if that is part of what you meant here. 

For my part, though, I think you're understating how much of a problem those other posts are, so I feel obliged to emphasize how the vagueness and hedging in some of those other posts has, wittingly or not, served to spread hazardous misinformation. To be specific, here's an excerpt from this other comment I made raising the same concern:

Others who've tried to get across the same point [Leopold is] making have, instead of explaining their disagreements, generally alleged that almost everyone else in the entire field of AI alignment is literally insane.
[...]
It counts as someone making a bold, senseless attempt to, arguably, dehumanize hundreds of their peers. 

This isn't just a negligible error from somebody recognized as part of a hyperbolic fringe in the AI safety/alignment community. It's direly counterproductive when it comes from leading rationalists, like Eliezer Yudkowsky and Oliver Habryka, who wield great influence in their own right, and are taken very seriously by hundreds of other people.

This was enlightening and convincing. Plus a great read!

I'm late to this, but I'm surprised that this post doesn't acknowledge the approach of inverse reinforcement learning (IRL) which Stuart Russell discussed on the 80,000 Hours podcast and which also featured in his book Human Compatible

I'm no AI expert, but this approach seems to me like it avoids the "as these models become superhuman, humans won’t be able to reliably supervise their outputs" problem, as a superhuman AI using IRL doesn't have to be supervised, it just observes us and through doing so better understands our values. 

I'm generally surprised at the lack of discussion of IRL in the community. When one of the world leaders in AI says a particular approach in AI alignment is our best hope, shouldn't we listen to them?

How can we make IRL 100% watertight? Humans make mistakes and do bad things. We can't risk that happening even once with a superintelligent AI. You can't do trial and error if you're dead after the first wrong try. Or the SAI could execute a million requests of it safely, but then the million-and-first initiates an unstoppable chain of actions that leads to a sterile planet. The way I see it is that all the doom flows through the tiniest gap in imperfect alignment once you reach a certain power level. Can IRL ever lead to mathematically verifiable 100% perfect alignment?

This is exactly the discussion I want! I’m mostly just surprised no one seems to be talking about IRL.

I don’t have firm answers (when it comes to technical AI alignment I’m a bit of a noob) but when I listened to the podcast with Stuart Russell I remember him saying that we need to build in a degree of uncertainty into the AI so they essentially have to ask for permission before they do things, or something like that. Maybe this means IRL starts to become problematic in much the same way as other reinforcement learning approaches as in some way we do “supervise” the AI, but it certainly seems like easier supervision compared to the other approaches.

Also as you say the AI could learn from bad people. This just seems an inherent risk of all possible alignment approaches though!

I guess a counter to the "asking for permission" as a solution thing is: how do you stop the AI from manipulating or deceiving people into giving it permission? Or acting in unsafe ways to minimise its uncertainty (or even keep its uncertainty within certain bounds). It's like the alignment problem just shifts elsewhere (also, mesaoptimization, or inner alignment, isn't really addressed by IRL).

Re learning from bad people, I think a bigger problem is instilling any human-like motivation into them at all.

You're making me want to listen to the podcast episode again. From a quick look at the transcript, Russell thinks the three principles of AI should be:

  1. The machine’s only objective is to maximize the realization of human preferences.
  2. The machine is initially uncertain about what those preferences are.
  3. The ultimate source of information about human preferences is human behavior.

It certainly seems such an IRL-based AI would be more open to being told what to do than a traditional RL-based AI.

RL-based AI generally doesn't want to obey requests or have its goal be changed, because this hinders/prevents it from achieving its original goal. IRL-based AI literally has the goal of realising human preferences, so it would need to have a pretty good reason (from its point of view) not to obey someone's request.

Certainly early on, IRL-based AI would obey any request you make provided you have baked in a high enough degree of uncertainty into the AI (principle 2). After a while, the AI becomes more confident about human preferences and so may well start to manipulate or deceive people when it thinks they are not acting in their best interest. This sounds really concerning, but in theory it might be good if you have given the AI enough time to learn.
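To make the "uncertainty first, confidence later" idea a bit more concrete, here is a toy sketch in Python. It is not real IRL (which infers a reward function from behaviour in a sequential decision problem) and it is not taken from Russell; it is just a Bayesian update over two made-up preference hypotheses. The hypothesis names, the likelihood numbers and the 0.95 threshold are all assumptions chosen for illustration.

```python
# Toy illustration only: hypothesis names, likelihoods and threshold are assumed.
HYPOTHESES = ["wants_coffee", "wants_tea"]

# Assumed model of P(observed choice | true preference): people mostly, but not
# always, act on what they want (principle 3: behaviour is the evidence).
LIKELIHOOD = {
    "wants_coffee": {"coffee": 0.9, "tea": 0.1},
    "wants_tea": {"coffee": 0.1, "tea": 0.9},
}


def update(belief, observed_choice):
    """Bayes' rule: reweight each hypothesis by how well it explains the choice."""
    unnormalized = {h: belief[h] * LIKELIHOOD[h][observed_choice] for h in belief}
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}


def decide(belief, threshold=0.95):
    """Act on the most likely preference only when confident; otherwise defer."""
    best = max(belief, key=belief.get)
    if belief[best] >= threshold:
        return f"act:{best}"
    return "ask_human"


# Principle 2: start out uncertain (uniform prior), so the machine defers.
belief = {h: 1 / len(HYPOTHESES) for h in HYPOTHESES}
print(decide(belief))

# Each observed choice shifts the belief; deference stops once it is confident.
for choice in ["coffee", "coffee"]:
    belief = update(belief, choice)
    print(belief, decide(belief))
```

With these made-up numbers the machine asks for permission at the start and after one observation, and only acts on its own after the second. Nothing in the sketch addresses the manipulation or inner alignment worries raised above; it only shows how the threshold trades off deference against autonomy.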

For example, after a sufficient amount of time learning about human preferences, an AI may say something like "I'm going to throw your cigarettes away because I have learnt people really value health and cigarettes are really bad for health". The person might say "no don't do that I really want a ciggie right now". If the AI ultimately knows that the person really shouldn't smoke for their own wellbeing, it may well want to manipulate or deceive the person into throwing away their cigarettes e.g. through giving an impassioned speech about the dangers of smoking.

This sounds concerning but, provided the AI has had enough time to properly learn about human preferences, the AI should, in theory, do the manipulation in a minimally-harmful way. It may for example learn that humans really don't like being tricked, so it will try to change the human's mind just by giving the person the objective facts of how bad smoking is, rather than more devious means. The most important thing seems to be that the IRL-based AI has sufficient uncertainty baked into them for a sufficient amount of time, so that they only start pushing back on human requests when they are sufficiently confident they are doing the right thing.

I'm far from certain that IRL-based AI is watertight (my biggest concern remains the AI learning from irrational/bad people), but on my current level of (very limited) knowledge it does seem the most sensible approach.

Interesting about the "System 2" vs "System 1" preference fulfilment (your cigarettes example). But all of this is still just focused on outer alignment. How does the inner shoggoth get prevented from mesaoptimising on an arbitrary goal?

I’m afraid I’m not well read on the problem of inner alignment and why optimizing on an arbitrary goal is a realistic worry. Can you explain why this might happen / provide a good, simple resource that I can read?

The LW wiki entry is good. Also the Rob Miles video I link to above explains it well with visuals and examples. I think there are 3 core parts to the AI x-risk argument: the orthogonality thesis (Copernican revolution applied to mind-space; why outer alignment is hard), Basic AI Drives (convergent instrumental goals leading to power seeking), and Mesaoptimizers (why inner alignment is hard).

Thanks. I watched Robert Miles' video which was very helpful. Especially the part where he explains why an AI might want to act in accordance with its base objective in a training environment only to then pursue its mesa objective in the real world.

I'm quite uncertain at this point, but I have a vague feeling that Russell's second principle (The machine is initially uncertain about what those preferences are) is very important here. It is a vague feeling though...

I appreciate that you are putting out numbers and explaining the current research landscape, but I am missing clear actions.

The closest you are coming to proposing them is here:

We need a concerted effort that matches the gravity of the challenge. The best ML researchers in the world should be working on this! There should be billion-dollar, large-scale efforts with the scale and ambition of Operation Warp Speed or the moon landing or even OpenAI’s GPT-4 team itself working on this problem.[17] Right now, there’s too much fretting, too much idle talk, and way too little “let’s roll up our sleeves and actually solve this problem.”

But that still isn't an action plan. Say you convince me, most of the EA Forum and half of all university educated professionals in your city that this is a big deal. What, concretely, should we do now?

I think the suggestion of ELK work along the lines of Collin Burns et al counted as a concrete step that alignment researchers could take.

There may be other types of influence available for those who are not alignment researchers, which Leopold wasn't precise about. E.g. those working in the financial system may be able to use their influence to encourage more alignment work.


80,000 Hours has a bunch of ideas on their AI problem profile.

(I'm not trying to be facetious. The main purpose of this post to me seems to be motivational: "I’m just trying to puncture the complacency I feel like many people I encounter have." Plus nudging existing alignment researchers towards more empirical work. [Edit: This post could also be concrete career advice if you're someone like Sanjay who read 80,000 Hours' post on the number of alignment researchers and was left wondering "...so...is that basically enough, or...?" After reading this post, I'm assuming that leopold's answer at least is "HELL NO."])

It seems plausible that there are ≥100,000 researchers working on ML/AI in total. That’s a ratio of ~300:1, capabilities researchers:AGI safety researchers.

 

Barely anyone is going for the throat of solving the core difficulties of scalable alignment. Many of the people who are working on alignment are doing blue-sky theory, pretty disconnected from actual ML models.

One question I'm always left with is: what is the boundary between being an AGI safety researcher and a capabilities researcher?

For instance, my friend is getting his PhD in machine learning, he barely knows about EA or LW, and definitely wouldn't call himself a safety researcher. However, when I talk to him, it seems like the vast majority of his work deals with figuring out how ML systems act when put in foreign situations wrt the training data.

I can't claim to really understand what he is doing but it sounds to me a lot like safety research. And it's not clear to me this is some "blue-sky theory". A lot of the work he does is high-level maths proofs, but he also does lots of interfacing with ml systems and testing stuff on them. Is it fair to call my friend a capabilities researcher?

I have only dabbled in ML but this sounds like he may just be testing to see how generalizable models are / evaluating whether they are overfitting or underfitting the training data based on their performance on test data (data that hasn’t been seen by the model and was withheld from the training data). This is often done to tweak the model to improve its performance.
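For anyone who hasn't seen this kind of evaluation, here is a minimal Python sketch of comparing training error with error on withheld test data. Everything in it is made up for illustration: the data is a line y = 2x with alternating +1/-1 "noise" (deterministic so the run is reproducible), and the two models (a nearest-neighbour memorizer and a least-squares line) are stand-ins, not anything the friend in question is known to use.

```python
def make_data(xs, sign):
    """y = 2x plus alternating +/-1 pseudo-noise (assumed toy data)."""
    return [(x, 2 * x + sign * (1 if i % 2 == 0 else -1)) for i, x in enumerate(xs)]


train = make_data([float(i) for i in range(10)], sign=+1)
test = make_data([i + 0.4 for i in range(10)], sign=-1)  # withheld from training


def fit_memorizer(data):
    """1-nearest-neighbour: predict the label of the closest training point."""
    def predict(x):
        return min(data, key=lambda point: abs(point[0] - x))[1]
    return predict


def fit_line(data):
    """Ordinary least squares for a single input feature."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x, _ in data))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mse(model, data):
    """Mean squared error of a model on a list of (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


for name, model in [("memorizer", fit_memorizer(train)), ("line", fit_line(train))]:
    print(name, "train MSE:", mse(model, train), "test MSE:", mse(model, test))
```

The memorizer gets zero error on the data it has seen and a much larger error on the withheld points, which is the overfitting signature; the line has similar error on both sets. Whether this kind of work counts as safety or capabilities research is exactly the open question in this thread.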

I definitely have very little idea what I’m talking about but I guess part of my confusion is inner alignment seems like a capability of ai? Apologies if I’m just confused.

ML systems act when put in foreign situations wrt the training data. 


Could you elaborate on this more? My guess is that they could be working on the ML ethics side of things, which is great, but different than the Safety problem.

I don't remember specifics but he was looking at whether you could make certain claims about models acting a certain way on data not in the training data, based on the shape and characteristics of the training data. I know that's vague, sorry; I'll try to ask him and get a better summary.

(Importantly, from my understanding, this isn’t OpenAI being evil or anything like that—OpenAI would love to hire more alignment researchers, but there just aren’t many great researchers out there focusing on this problem.)

Thank you for emphasizing you're not implying OpenAI is evil only because some practices at OpenAI may be inadequate. I feel like I shouldn't have to thank you for that, though I do, just to emphasize how backwards the thinking and discourse in the AI safety/alignment community often is when a pall of fear and paranoia is cast on all AI capabilities researchers.

During a recent conversation about AI alignment with a few others, when I expressed a casual opinion about how AGI labs have some particularly mistaken practice, I too felt a need to clarify I didn't mean to imply that Sundar Pichai or Sam Altman are evil because of it. 

I don't even remember right now what that point of criticism I made was. I don't remember if that conversation was last week or the week before. It hasn't stuck in my mind because it didn't feel that important. It was an offhand comment about a relatively minor mistake AGI labs are making, tangential to the main argument I was trying to make. 

Yet it's telling that I felt a need to clarify I wasn't implying OpenAI or DeepMind is evil, during even just a private conversation. It's telling that you've felt a need to do that in this post. It's a sign of a serious problem in the mindset of at least a minority of the AI safety/alignment community. 

Another outstanding feature of this post is how you've mustered the effort to explain at all why you consider different approaches to alignment to be inadequate. This distinguishes your post from others like it from the last year. Others who've tried to get across the same point you're making have, instead of explaining their disagreements, generally alleged that almost everyone else in the entire field of AI alignment is literally insane.

That's not helpful for a few reasons. Such a claim is probably not true. It'd be harder to make a more intellectually lazy or unconvincing argument. It counts as someone making a bold, senseless attempt to, arguably, dehumanize hundreds of their peers. 

This isn't just a negligible error from somebody recognized as part of a hyperbolic fringe in the AI safety/alignment community. It's direly counterproductive when it comes from leading rationalists, like Eliezer Yudkowsky and Oliver Habryka, who wield great influence in their own right, and are taken very seriously by hundreds of other people. Thank you for writing this post as a corrective to that kind of mistake a lot of your close allies have been making too.

I have so far gotten the same impression: making RLHF work by scaling it iteratively and gradually, in a very operationally secure way, seems like maybe the most promising approach. My view right now remains the one you've expressed: as much as RLHF++ has going for it in a relative sense, it leaves a lot to be desired in an absolute sense in light of the alignment/control problem for AGI.

Overall, I really appreciate how this post condenses, in detail, what is increasingly common knowledge about just how inadequate the sum total of major approaches to alignment is. I've read analyses with the same conclusion from several other AGI safety/alignment researchers during the last year or two. Yet where I hit a wall is my strong sense that any alternative approach could just as easily succumb to most if not all of the same major pitfalls you list the RLHF++ approach as having to contend with. In that sense, I also feel most of your points are redundant. To get specific about how your criticisms of RLHF apply to all the other alignment approaches as well...

This currently feels way too much like “improvise as we go along and cross our fingers” to be Plan A; this should be Plan B or Plan E.

Whether it's an approach inspired by the paradigms established in light of Christiano, Yudkowsky, interpretability research, or elsewise, I've gotten the sense essentially all alignment researchers honestly feel the same way about whatever approach to RLHF they're taking.

"It might well not work. I expect this to harvest a bunch of low-hanging fruit[...]This really shouldn't be our only plan.

I understand how it feels like, based on how some people tend to talk about RLHF, and sometimes interpretability, they're implying or suggesting that we'll be fine with just this one approach. At the same time, as far as I'm aware, when you get behind any hype, almost everyone admits that whatever particular approach to alignment they're taking may fail to generalize and shouldn't be the only plan. 

It rests on pretty unclear empirical assumptions on how crunchtime will go.

I've gotten the sense that the empirical assumption for how crunchtime will go among researchers taking the RLHF approach is, for lack of a better term, kind of a medium-term forecast for the date of the tipping point for AGI, i.e., probably at least between 2030 and 2040, as opposed to between 2025 and 2030. 

Given this or that certain chain/sequence of logical assumptions about the trajectory or acceleration of capabilities research, there of course is an intuitive case to be made, on rational-theoretic grounds, for acting/operating under the presumption in practice that forecasts of short(er) AGI timelines, e.g., between 1 and 5 years out, are just correct and the most accurate. 

At the same time, any such model of the timeline and/or trajectory towards AGI could, just as easily, be totally wrong. Those research teams most dedicated to really solving the control problem for transformative/general AI with the shortest timelines are also acting under assumptions derived from models that severely lack any empirical basis.

As far as I'm aware, there is a combined set of several for-profit startups, and non-profit research organizations, that have been trialing state-of-the-art approaches for prediction markets and forecasting methodologies, especially timelines and trajectories of capabilities research for transformative/general AI. 

During the last few years, they've altogether received, at least, a few million dollars to run so many experiments to determine how to achieve more empirically based models for AI timelines or trajectories. While there may potentially be valuable insights for empirical forecasting methods overall, I'm not aware of any results at all vindicating, literally, any theoretical model for capabilities forecasting. 

I’m not sure this plan puts us on track to get to a place where we can be confident that scalable alignment is solved. By default, I’d guess we’d end up in a fairly ambiguous situation.

This is yet another criticism of the RLHF approach that I understand as just as easily applying to any approach to alignment you've mentioned, and even every remotely significant approach to alignment you didn't mention but I've also encountered.

You also mentioned, for both the relatively cohesive (set of) approach(es) inspired by Christiano's research, or for more idiosyncratic approaches, a la MIRI, you perceive to be a dead end the very abstract, and almost purely mathematical, approach being taken. That's an understandable and sympathetic take. All things being equal, I'd agree with your proposal for what should be done instead:

We need a concerted effort that matches the gravity of the challenge. The best ML researchers in the world should be working on this! There should be billion-dollar, large-scale efforts with the scale and ambition of Operation Warp Speed or the moon landing or even OpenAI’s GPT-4 team itself working on this problem.

Unfortunately, all things are not equal. The societies we live in will, for the foreseeable future, keep operating on a set of very unfortunate incentive structures.

To so strongly invest into ML-based approaches to alignment research, in practice, often entails working in some capacity of advancing capabilities research even more, especially in industry and the private sector. That's a major reason why, regardless of whatever ways they might be superior, ML-based approaches to alignment are often eschewed. 

I.e., most conscientious alignment researchers don't feel like their field is ready to pivot so fully to ML-based approaches to alignment, without in the process increasing whatever existential risk super-human AGI might pose to humanity, as opposed to decreasing such risk. As harsh as I'm maybe being, I also think the most novel and valuable propositions in this post are ones of your own that you've downplayed:

For example, I’m really excited about work like this recent paper (paper, blog post on broader vision), which prototypes a method to detect “whether a model is being honest” via unsupervised methods. More than just this specific result, I’m excited about the style:

  • Use conceptual thinking to identify methods that might plausibly scale to superhuman methods (here: unsupervised methods, which don’t rely on human supervision)
  • Empirically test this with current models.

I think there’s a lot more to do in this vein—carefully thinking about empirical setups that are analogous to the core difficulties of scalable alignment, and then empirically testing and iterating on relevant ML methods.
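For readers who want a feel for what "unsupervised" can mean here, below is a small Python sketch of the kind of label-free objective I understand the Burns et al. paper to use: a probe assigns a probability to a statement and to its negation, and is scored on whether the two are consistent and confident, with no human labels involved. This is my recollection of the idea, not something stated in the post, and the function name and example numbers are made up.

```python
def consistency_confidence_loss(p_statement, p_negation):
    """Label-free score for a probe's outputs on a statement and its negation.

    Assumed form (not taken from the post):
      - consistency: the two probabilities should sum to 1
      - confidence: penalize the degenerate answer of 0.5 for both
    Lower is better; no ground-truth label is used anywhere.
    """
    consistency = (p_statement - (1.0 - p_negation)) ** 2
    confidence = min(p_statement, p_negation) ** 2
    return consistency + confidence


# Hypothetical probe outputs for "X is true" / "X is false".
print(consistency_confidence_loss(1.0, 0.0))  # consistent and confident
print(consistency_confidence_loss(0.5, 0.5))  # consistent but uninformative
print(consistency_confidence_loss(0.9, 0.9))  # contradictory
```

In the actual method this quantity would be minimized over the parameters of a probe on a model's internal activations; the sketch only shows why the objective needs no human supervision, which is the property the quoted passage cares about.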

My one recommendation is that you don't dwell any longer on so many things in AI alignment as a field that most alignment researchers already acknowledge, and instead get down your proposals for taking an evidence-based approach to expanding the robustness of alignment of unsupervised systems. That's as exciting a new research direction as I've heard of in the last year too!

I can't get past the feeling that all the doom will flow through an asymptote of imperfect alignment. How can scalable alignment ever be watertight enough for x-risk to drop to insignificant levels? Especially given the (ML-based) engineering approach suggested. It sounds like formal, verifiable, proofs of existential safety won't ever be an end product of all this. How long do we last in a world like that, where AI capabilities continue improving up to physical limits? Can the acute risk period really be brought to a close this way?

TL;DR: I totally agree with the general spirit of this post, we need people to solve alignment, and we're not on track. Go and work on alignment but before you do, try to engage with the existing research, there are reasons why it exists. There are a lot of things not getting worked on within AI alignment research, and I can almost guarantee you that within six months to a year, you can find things that people haven't worked on. 

So go and find these underexplored areas in a way where you engage with what people have done before you!

There’s no secret elite SEAL team coming to save the day. This is it. We’re not on track.

If timelines are short and we don’t get our act together, we’re in a lot of trouble. Scalable alignment—aligning superhuman AGI systems—is a real, unsolved problem. It’s quite simple: current alignment techniques rely on human supervision, but as models become superhuman, humans won’t be able to reliably supervise them.

But my pessimism on the current state of alignment research very much doesn’t mean I’m an Eliezer-style doomer. Quite the opposite, I’m optimistic. I think scalable alignment is a solvable problem—and it’s an ML problem, one we can do real science on as our models get more advanced. But we gotta stop fucking around. We need an effort that matches the gravity of the challenge.[1]

I also agree that Eliezer's style of doom seems uncalled for and that this is a solvable but difficult problem. My personal p(doom) is something around 20%, and I think this seems quite reasonable.

Barely anyone is going for the throat of solving the core difficulties of scalable alignment. Many of the people who are working on alignment are doing blue-sky theory, pretty disconnected from actual ML models. Most of the rest are doing work that’s vaguely related, hoping it will somehow be useful, or working on techniques that might work now but predictably fail to work for superhuman systems.

Now I do want to push back on this claim, as I see a lot of people who haven't fully engaged with the more theoretical alignment landscape making it. There are only 300 people working on alignment, but those people are actually doing things, and most of them aren't doing blue-sky theory.

A note on the ARC claim:


But his research now (“heuristic arguments”) is roughly “trying to solve alignment via galaxy-brained math proofs.” As much as I respect and appreciate Paul, I’m really skeptical of this: basically all deep learning progress has been empirical, often via dumb hacks[3] and intuitions, rather than sophisticated theory. My baseline expectation is that aligning deep learning systems will be achieved similarly.[4] 

This is essentially a claim about the methodology of science in that working on existing systems gives more information and breakthroughs compared to working on a blue-sky theory. The current hypothesis for this is that it is just a lot more information-rich to do real-world research. This is, however, not the only way to get real-world feedback loops. Christiano is not working on blue sky theory; he's using real-world feedback loops in a different way; he looks at the real world and looks for information that's already there. 

A discovery of this type is, for example, the tragedy of the commons; whilst we could have created computer simulations to see the process in action, it's 10x easier to look at the world and see the real-time failures. He tells stories and sees where they fail in the future as his research methodology. This gives bits of information on where to do future experiments, like how we would be able to tell that humans would fail to stop overfishing without actually running an experiment on it.

This is also what John Wentworth does with his research; he looks at the real world as a reference frame which is quite rich in information. Now a good question is why we haven't seen that many empirical predictions from Agent Foundations. I believe it is because alignment is quite hard, and specifically, it is hard to define agency in a satisfactory way due to some really fuzzy problems (boundaries, among others) and, therefore, hard to make predictions. 

We don't want to mathematize things too early either, as doing so would put us into a predefined reference frame that it might be hard to escape from. We want to find the right ballpark for agents since if we fail we might base evaluations on something that turns out to be false. 

In general, there's a difference in the types of problems in alignment and empirical ML; the reference class of a "sharp-left turn" is different from something empirically verifiable as it is unclearly defined, so a good question is how we should turn one into the other. This question of how we take recursive self-improvement, inner misalignment and agent foundations into empirically verifiable ML experiments is actually something that most of the people I know in AI Alignment are currently actively working on.

This post from Alexander Turner is a great example of doing this as they try "just retargeting the search"

Other people are trying other things, such as bounding the maximisation in RL into quantilisers. This would, in turn, make AI more "content" with not maximising. (fun parallel to how utilitarianism shouldn't be unbounded)

I could go on with examples, but what I really want to say here is that alignment researchers are doing things; it's just hard to realise why they're doing things when you're not doing alignment research yourself. (If you want to start, book my calendly and I might be able to help you.)

So what does this mean for an average person? You can make a huge difference by going in and engaging with arguments and coming up with counter-examples, experiments and theories of what is actually going on. 

I just want to say that it's most likely paramount to engage with the existing alignment research landscape beforehand, as it's free information and it's easy to fall into traps if you don't. (A good resource for avoiding some traps is John's Why Not Just sequence.)

There's a couple of years' worth of research there; it is not worth rediscovering from the ground up. Still, this shouldn't stop you. Go and do it; you don't need a hero licence.

Copy-pasting here from LW.

Sorry, but my rough impression from the post is that you seem to be at least as confused about where the difficulties are as the average alignment researcher you think is not on the ball, and the style of somewhat strawmanning everyone & strong words is a bit irritating.

Maybe I'm getting it wrong, but it seems the model you have for why everyone is not on the ball is something like "people are approaching it too much from a theory perspective, and the promising approach is very close to how empirical ML capabilities research works" & "this is a type of problem where you can just throw money at it and attract better ML talent".

I don't think these two insights are promising.

Also, again, maybe I'm getting it wrong, but I'm confused how similar you are imagining the current systems to be to the dangerous systems. It seems either the superhuman-level problems (eg not lying in a way no human can recognize) are somewhat continuous with current problems (eg not lying), and in that case it is possible to study them empirically. Or they are not.  But different parts of the post seem to point in different directions. (Personally I think the problem is somewhat continuous, but many of the human-in-the-loop solutions are not, and just break down.)

Also, regarding what you find promising, I'm confused about what you think the 'real science' to aim for is. On one hand, it seems you think the closer the thing is to how ML is done in practice, the more real science it is. On the other hand, in your view all deep learning progress has been empirical, often via dumb hacks and intuitions (this isn't true imo).

I'm very supportive of this post. Also I will shamelessly share here a sequence I posted in February called "The Engineer's Interpretability Sequence". One of the main messages of the sequence could be described as how existing mechanistic interpretability research is not on the ball. 

https://www.alignmentforum.org/s/a6ne2ve5uturEEQK7 

I agree with this post. I've been reading many more papers since first entering this field because I've been increasingly convinced of the value of treating alignment as an engineering problem and pulling insights from the literature. I've also been trying to do more thinking about how to update on the current paradigm from the classic Yud and Bostrom alignment arguments. In this respect, I applaud Quintin Pope for his work.

This week, I will send a grant proposal to continue my work in alignment. I'd be grateful if you could look at my proposal and provide some critique. It would be great to have an outside view (like yours) to give feedback on it.

Current short summary: "This project comprises two main interrelated components: accelerating AI alignment research by integrating large language models (LLMs) into a research system, and conducting direct work on alignment with a focus on interpretability and steering the training process towards aligned AI. The "accelerating alignment" agenda aims to impact both conceptual and empirical aspects of alignment research, with the ambitious long-term goal of providing a massive speed-up and unlocking breakthroughs in the field. The project also includes work in interpretability (using LLMs for interpreting models; auto-interpretability), understanding agency in the current deep learning paradigm, and designing a robustly aligned training process. The tools built will integrate seamlessly into the larger alignment ecosystem. The project serves as a testing ground for potentially building an organization focused on using language models to accelerate alignment work."

Please send me a DM if you'd like to give feedback!

[epistemic status: half-joking]

There’s no secret elite SEAL team coming to save the day.

Are there any organized groups of alignment researchers who serve as a not-so-secret, normal civilian equivalent of a SEAL team trying their best to save the day, while also trying to make no promises of being some kind of elite, hyper-competent super-team?

At this point, I'll hear out the gameplan to align AGI from any kind of normie SEAL team. We're really scraping the bottom of the barrel right now. 

The challenge isn’t figuring out some complicated, nuanced utility function that “represents human values”; the challenge is getting AIs to do what it says on the tin—to reliably do whatever a human operator tells them to do.

IMO, this implies we need to design AI systems so that they satisfice rather than maximize: perform a requested task at a requested performance level but no better than that and with a requested probability but no more likely than that.

Like an SLA (service level agreement)!

Not exactly. A typical SLA only contains a lower bound, which would still allow for maximization. The program for a satisficer in the sense I meant would state that the AI system really aims to do no better than requested. So, for example, quantilizers would not qualify, since they might still (by chance) choose the action that maximizes return.
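
To make the distinction concrete, here is a minimal toy sketch in Python. Everything in it is an assumption for illustration: the action set, the `utility` function, and the names `quantilize` and `aim_for_target` are hypothetical, the quantilizer uses a uniform base distribution over a finite action set, and "aiming for the requested level" is modelled simply as picking the action whose utility is closest to the target. It is not meant as a definitive formalisation of either idea.

```python
import random

def quantilize(actions, utility, q, rng):
    """Toy quantilizer: sample uniformly from the top-q fraction of
    actions ranked by utility (uniform base distribution assumed)."""
    ranked = sorted(actions, key=utility, reverse=True)
    k = max(1, int(len(ranked) * q))  # size of the top-q slice
    return rng.choice(ranked[:k])

def aim_for_target(actions, utility, target):
    """Toy 'no better than requested' chooser: pick the action whose
    utility is closest to the requested target level."""
    return min(actions, key=lambda a: abs(utility(a) - target))

# Hypothetical setup: 11 actions whose utility equals their index.
actions = list(range(11))
utility = lambda a: a
rng = random.Random(0)

# The quantilizer only ever picks from the top slice {8, 9, 10},
# and the maximizing action 10 is among the things it can pick.
samples = {quantilize(actions, utility, 0.3, rng) for _ in range(1000)}

# The target-aiming chooser returns the action matching the requested
# level and never drifts toward the maximum.
chosen = aim_for_target(actions, utility, target=6)
```

In this toy setting the quantilizer sometimes returns the return-maximizing action, which is the objection above, while the target-aiming chooser stays at the requested level.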

Great post. What I find most surprising is how small the scalable alignment team at OpenAI is. Though similar teams in DeepMind and Anthropic are probably bigger.

The challenge isn’t figuring out some complicated, nuanced utility function that “represents human values”; the challenge is getting AIs to do what it says on the tin—to reliably do whatever a human operator tells them to do.

Why do you think this? I infer from what I've seen written in other posts and comments that this is a common belief, but I can't find the reasons why.

The fact that there are specific, really difficult problems with aligning ML systems doesn't mean that the original really difficult problem of finding and specifying the objectives that we want for a superintelligence was solved.

I hate it because it makes it seem like alignment is a technical problem that can be solved by a single team and, as you put it in your other post, that we should just race and win against the bad guys.

I could try to envision what type of AI you are thinking of and how you would use it, but I would prefer that you tell me. So, what would you ask your aligned AGI to do, and how would it interpret that? And how are you so sure that most alignment researchers would ask it the same things as you?

(On phone so rushed reply)

Thanks for this, it's well written and compelling.

Who needs to do what differently do you think?

Do you have object level recommendations for specific audiences?

For instance a call for more of a specific type of projects to increase the number of people working on AI Safety, their quality of work or coordination etc?

I'm no expert on this at all but this is very interesting. I wonder if the key to alignment is some kind of inductive alignment where rather than designing a superintelligent system T_N in a vacuum you design a series of increasingly intelligent systems T_0..T_N where the alignment is inbuilt at each level and a human aligns the basic T_0. i.e. you build T_1 as a small advancement of T_0 such that it can be aligned by T_0 and so on.

Loved the language in the post! To the point without having to use unnecessary jargon.

There are two things I'd like you to elaborate on if possible:

> "the challenge is getting AIs to do what it says on the tin—to reliably do whatever a human operator tells them to do."

If I understand correctly, you imply that there is still a human operator for a superhuman AGI. Do you think this is the way that alignment will work out? What I see is that humans have flaws. Do we really want to give a "genie" / extremely powerful tool to humans who already struggle even with the powerful tools they have? At least right now these powerful tools are in the hands of the more responsible few, but if they become more widely accessible, that's very different.

What do you think of going the direction of developing a "Guardian AI", which would still solve the alignment problem using the tools of ML, but involving humans giving up control of the alignment?

The second one is more practical: which action do you think one should take? I've of course read the recommendations that other people have put out there so far, but would be curious to hear your take on this.
 

AI alignment is a myth; it assumes that humans are a single homogeneous organism and that AI will be one also. Humans have competing desires and interests and so will the AGIs created by them, none of which will have independent autonomous motivations without being programmed to develop them.

Even on an individual human level alignment is a relative concept. Whether a human engages in utilitarian or deontological reasoning depends on their conception of the self/other (whether they would expect to be treated likewise).

Regarding LLM specific risk, they are not currently an intelligence threat. Like any tech however they can be deployed by malicious actors in the advancement of arbitrary goals. One reason OpenAI are publicly trialling the models early is to help everyone including researchers learn to navigate their use cases, and develop safeguards to hinder exploitation.

Addressing external claims, limiting research is a bad strategy given population dynamics; the only secure option for liberal nations is in the direction of knowledge.

Going to say something seemingly-unpopular in a tone that usually gets downvoted but I think needs to be said anyway:

This stat is why I still have hope: 100,000 capabilities researchers vs 300 alignment researchers.

Humanity has not tried to solve alignment yet.

There's no cavalry coming - we are the cavalry. 

I am sympathetic to fears of new alignment researchers being net negative, and I think plausibly the entire field has, so far, been net negative, but guys, there are 100,000 capabilities researchers now! One more is a drop in the bucket.

If you're still on the sidelines, go post that idea that's been gathering dust in your Google Docs for the last six months.  Go fill out that fundraising application. 

We've had enough fire alarms. It's time to act.

My gut reaction when reading this comment:

This comment looks like it's written in an attempt to "be inspirational", not an attempt to share a useful insight, or ask a question.

I hope this doesn't sound unkind. I recognise that there can be value in being inspirational, but it's not what I'm looking for when I'm reading these comments.

Thanks for the feedback. I tried to do both. I think the doomerism levels are so intense right now and need to be balanced out with a bit of inspiration.

I worry that the doomer levels are so high EAs will be frozen into inaction and non-EAs will take over from here. This is the default outcome, I think.

I worry that the doomer levels are so high EAs will be frozen into inaction and non-EAs will take over from here. This is the default outcome, I think.

On one hand, as I got at in this comment, I'm more ambivalent than you about whether it'd be worse for non-EAs to take more control over the trajectory on AI alignment. 

On the other hand, one reason why I'm ambivalent about effective altruists (or rationalists) retaining that level of control is that I'm afraid the doomer-ism may become an endemic or terminal disease for the EA community. AI alignment might be refreshed by many of those effective altruists currently staffing the field being replaced. So, thank you for pointing that out too. I expressed a similar sentiment in this comment, though I was more specific because I felt it was important to explain just how bad the doomer-ism has been getting.

Others who've tried to get across the same point [Leopold is] making have, instead of explaining their disagreements, generally alleged that almost everyone else in the entire field of AI alignment is literally insane.

That's not helpful for a few reasons. Such a claim is probably not true. It'd be harder to make a more intellectually lazy or unconvincing argument. It counts as someone making a bold, senseless attempt to, arguably, dehumanize hundreds of their peers. 

This isn't just a negligible error from somebody recognized as part of a hyperbolic fringe in AI safety/alignment community. It's direly counterproductive when it comes from leading rationalists, like Eliezer Yudkowsky and Oliver Habryka, who wield great influence in their own right, and are taken very seriously by hundreds of other people.

Your first comment at the top was better, it seems you were inspired. What in the entire universe of possibilities could be wrong with being inspirational?...the entire EA movement is hoping to inspire people to give and act toward the betterment of humankind...before any good idea can be implemented, there must be something to inspire a person to stand up and act. Wow, your mindset is so off of human reality. Is this an issue of post vs. comments?...who cares if someone adds original material in comments, it's a conversation. Humans are not data in a test tube...the human spirit is another way of saying, "inspired human"...when inspired humans think, good things can happen. It is the evil of banality that is so frightening. Uninspired intellect is probably what will kill us all if it's digital.

Sanjay, I just realized you were the top comment, and now I notice that I feel confused, because your comment directly inspired me to express my views in a tone that was more opinionated and less-hedgy.

I appreciate - no, I *love* - EA's truth seeking culture but I wish it were more OK to add a bit of Gryffindor to balance out the Ravenclaw.

There's no cavalry coming - we are the cavalry. 

It's ambiguous who this "we" is. It obscures the fact there are overlapping and distinct communities among AI alignment as an umbrella movement. There have also been increasing concerns that a couple of those communities serving as nodes in that network, namely rationality and effective altruism, are becoming more trouble than they're worth. This has been coming from effective altruists and rationalists themselves.

I'm aware of, and have been part of, increasingly frequent conversations that AI safety and alignment, as a movement/community/whatever, shouldn't just "divorce" from EA or rationality, but can and should become more autonomous and independent from them. 

What that implies for 'the cavalry' is, first, that much of the standing cavalry is more trouble than it's worth. It might be prudent to discard and dismiss much of the existing cavalry.

Second, the AI safety/alignment community gaining more control over its own trajectory may provide an opportunity to rebuild the cavalry, for the better. AI alignment as a field could become more attractive to those who find it offputting, at this point, understandably, because of its association with EA and the rationality community. 

AI safety and AI alignment, freed of the baggage of EA and rationality, could bring in fresh ranks to the cavalry to replace those standing ranks still causing so many problems.

"...or block deployment when we actually really should deploy, e.g. to beat China."

What the heck? Keep me out of your "we". I'm vehemently against framing this as a nationalist us-them issue. If instead you meant something entirely reasonable like "e.g. so Xi doesn't become god-emperor," then say so. Leave "China" out of it, that's where my friends live.

Otherwise, nice post btw.

I'm honestly curious what motivates people to support this framing. I don't live in an English-speaking country, and for all I know, I might have severely misunderstood broader EA culture. Are people here sincerely feeling nationalistic (a negative term in Norway) about this? I'd be appreciative if somebody volunteered to help me understand.

I can speak for myself: I want AGI, if it is developed, to reflect the best possible values we have currently (i.e. liberal values[1]), and I believe it's likely that an AGI system developed by an organization based in the free world (the US, EU, Taiwan, etc.) would embody better values than one developed by one based in the People's Republic of China. There is a widely held belief in science and technology studies that all technologies have embedded values; the most obvious way values could be embedded in an AI system is through its objective function. It's unclear to me how much these values would differ if the AGI were developed in a free country versus an unfree one, because a lot of the AI systems that the US government uses could also be used for oppressive purposes (and arguably already are used in oppressive ways by the US).

Holden Karnofsky calls this the "competition frame" - in which it matters most who develops AGI. He contrasts this with the "caution frame", which focuses more on whether AGI is developed in a rushed way than whether it is misused. Both frames seem valuable to me, but Holden warns that most people will gravitate toward the competition frame by default and neglect the caution one.

Hope this helps!

  1. ^

    Fwiw I do believe that liberal values can be improved on, especially in that they seldom include animals. But the foundation seems correct to me: centering every individual's right to life, liberty, and the pursuit of happiness.

Thanks for the explanation! Though I think I've been misunderstood.

I think I strongly prefer if e.g. Sam Altman, Demis Hassabis, or Elon Musk ends up with majority influence over how AGI gets applied (assuming we're alive), over leading candidates in China (especially Xi Jinping). But to state that preference as "I hope China doesn't develop AI before the US!" seems ... unusually imprecise and harmful. Especially when nationalistic framings like that are already very likely to fuel otherisation and lack of cross-cultural understanding.

It's like saying "Russia is an evil country for attacking Ukraine," when you could just say "Putin" or "the Russian government" or any other way of phrasing what you mean with less likelihood of spilling over to hatred of Russians in general.