
My p(doom) is pretty high, and I've found myself repeating the same words to explain some of the intuitions behind it. I think there are hard parts of the alignment problem that we're not on track to solve in time.[1] The alignment plans I've heard[2] fail for reasons connected to these hard parts, so I decided to write up my thoughts in a short post.

(Thanks to Theresa, Owen, Jonathan, and David for comments on a draft.) 

Modern machine learning uses a powerful search process to look for neural network parameters such that the network performs well according to some objective function.

There exist algorithms for general and powerful agents. At some point in the near future, there will be a training procedure in which the gradient of the loss function(s) with respect to the parameters points towards neural networks implementing these algorithms.

Increasingly context-aware and capable agents achieve better scores than their neighbors in parameter space on a wide range of scoring functions, and will, by default, attract gradient descent.
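To make "powerful search process" concrete, here is a minimal sketch of gradient-descent training (the toy model, data, and objective are placeholders I made up, not anything specific to any lab's setup): the update rule just follows the gradient of the score, and nothing in it checks what kind of algorithm the resulting parameters implement.

```python
# Toy illustration of the search process: gradient descent over parameters.
# The network, data, and loss are arbitrary stand-ins; the point is only that
# the update follows whatever direction in parameter space lowers the loss.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(1000):
    x = torch.randn(32, 16)               # stand-in training batch
    target = x.sum(dim=1, keepdim=True)   # stand-in objective
    loss = loss_fn(model(x), target)

    optimizer.zero_grad()
    loss.backward()    # gradient of the loss w.r.t. the parameters
    optimizer.step()   # move towards whatever scores better
```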

Unfortunately, we haven't solved agent foundations: we have these powerful search processes, and if you imagine the space of all possible AGIs (or possible neural networks, or possible minds), there are some areas that contain aligned AGIs, but we have no idea how to define those areas and no idea how to search for them. We understand how all the search-process designs people have come up with so far end up somewhere that's not in an area of aligned AGIs[3], and we also understand that some areas with aligned AGIs actively repel many sorts of search processes.

We can compare an area of aligned AGIs to the Moon. Imagine we're trying to launch a rocket there, and if, after the first take-off, it ends up somewhere that's not the Moon (maybe after a rapid unplanned disassembly), we die. We have a bunch of explosives, but we don't have equations for gravity, only maybe some initial understanding of acceleration. Also, we don't actually know where the Moon is in space: we don't know how to specify it, we don't know what kind of light we could look for that many other things wouldn't also emit, and so on. We imagine that the Moon must be nice, but we don't have a notion of its niceness that we can use to design our rocket. We know that some specific designs definitely fail and end up somewhere that's not the Moon, but that doesn't really help us get to the Moon.

If you launch anything capable without good reasons to think it's an aligned mind, it will not be an aligned mind. If you try to prevent specific failure modes (you identify optimization towards something different from what you want, or you see exactly how gradient descent diverges somewhere that's certainly not aligned), you're probably iterating towards training setups whose failure modes you don't understand, rather than setups that actually produce something aligned. If you don't know where you're going, it's not enough to avoid the places that are definitely not where you want to end up; you have to differentiate paths towards the destination from all other paths, or you fail.

When you get to a system capable enough to meaningfully help you[4], you need to have already solved this problem. I think not enough people understand what this problem is, and I think that if it is not solved in time, we die.

I've heard many attempts to hide the hard problem in something outside of where our attention is directed: e.g., designing a system out of many models overseeing each other and getting useful work out of the whole system while preventing specific models from staging a coup.

I have intuitions for why these kinds of approaches fail, mostly along the lines of: unless you already have something sufficiently smart and aligned, you can't build an aligned system out of what you do have, and you don't get something sufficiently smart and aligned without figuring out how to make smart aligned minds.


(A sidenote: in a conversation with an advocate of this approach, they mentioned Google as an example of a system whose internal incentive structures keep the parts pretty aligned with the goals of the whole system, so that they don't stage a coup. It's a weak analogy, but the reason Google wouldn't kill competitors' employees even if that would increase its profit is not only the outside legal structures; it's also that smart people whose values include human lives work at Google, don't want the entire structure to kill people, and can prevent the entire structure from killing people, and so the entire structure optimizes for something slightly different from profit.

I've had some experience with getting useful work out of somewhat unaligned systems made of somewhat unaligned parts by placing additional constraints on them: back when I lived in Russia, I was an independent election observer, a member of an election commission (where I basically had to tell everyone what to do, because they didn't really know the law and didn't want to follow the procedures), and later coordinated opposition election observers in a district of Moscow. You had to carefully place incentives so that it was much easier for the members of election commissions to follow the lawful procedures (which were surprisingly democratic and would've worked pretty well if only all the election commissions had consisted of representatives of competing political parties interested in fair results, which was never the case in the history of the country). There are many details there, but two things I want to say are:

  • the primary reason the system hasn't physically hurt many more independent election observers than it did is that it consisted of many at least somewhat aligned humans who preferred other people not to be hurt; without that, no law and no incentive structure independent observers could set up would've protected their health or helped them achieve their goals;
  • the system looked good to the people at the top, but without the feedback loops that democracies have and that countries without real elections usually lack, people at every level of the hierarchy slightly misrepresented the situation they were responsible for, because that meant a higher reward, and that led to some specific people having a map so distorted that they decided to start a war, thinking they'd win it in three days. They designed a system that achieved some goals, but what the system was actually directed at wasn't exactly what they wanted. No one controls what exactly the whole system optimizes for, even when they can shape it however they want.)

To me, approaches like "let's scale oversight, prevent parts of the system from staging a coup, and use the whole thing to solve the alignment problem" are like trying to prevent the whole system from undergoing the Sharp Left Turn in some specific ways while not paying attention to where you direct the whole system. If a capable mind is made out of many parts kept in balance, that doesn't help with the problem that the mind itself needs to be something from the area of aligned minds. If you don't have good reasons to expect the system to be outer-aligned, it won't be. And unless you solve agent foundations (or otherwise understand minds sufficiently well), I don't think it's possible to have good reasons for expecting outer alignment in systems intelligent and goal-oriented enough to solve alignment or turn all the GPUs on Earth into Rubik's cubes.

There are ongoing attempts at solving this: for example, Infra-Bayesianism tries to attack this problem by coming up with a realistic model of an agent, understanding what it can mean for it to be aligned with us, and producing some desiderata for an AGI training setup such that it points at coherent AGIs similar to the model of an aligned agent.[5] Some people try to understand minds and agents from various angles, but everyone seems pretty confused, and I'm not aware of a research direction that seems to be on track to solve the problem.

If we don't solve this problem, we'll get a goal-oriented intelligence powerful enough to produce 100k hours of useful alignment research in a couple of months, or to turn all the GPUs into Rubik's cubes, and there will be no way for it to be aligned enough to do these nice things instead of killing us.

Are you sure that if your research goes as planned, all the pieces are there, and you get to powerful agents, you understand why exactly the system you're pointing at is aligned? Do you know what exactly it coherently optimizes for, and why the optimization target is good enough? Have you figured out and prevented all possible internal optimization loops? Have you actually solved outer alignment?

I would like to see people thinking more about this problem, or at least being aware of it even if they work on something else.

And I urge you to think about the hardest problem that you think has to be solved, and attack that problem.

  1. ^

    I don't expect that we have much time until AGI; progress in capabilities is happening much faster than progress on this problem; the problem needs to be solved before we have AI systems capable enough to meaningfully help with it; and alignment researchers don't seem to stack.

  2. ^

    E.g., scalable oversight along the lines of what people from Redwood, Anthropic, and DeepMind are thinking about.

  3. ^

    To be clear, I haven't seen many designs that people I respect believed to have a chance of actually working. If you work on the alignment problem or at an AI lab and haven't read Nate Soares' On how various plans miss the hard bits of the alignment challenge, I'd suggest reading it.

    For people who are relatively new to the problem, here's an example of finding a failure in a design (I don't expect people from AI labs to think it can possibly work): think about what happens if you have two systems: one is trained to predict how a human would evaluate a behavior, and the other is trained to produce behavior that the first system predicts would be evaluated highly. Imagine that you successfully train them to a superintelligent level: the AI is really good at predicting which buttons humans click and at producing behaviors that lead to humans clicking buttons that mean the behavior is great. If it's not obvious at first glance why an AI trained this way won't be aligned, it might be really helpful to stop and think for 1-2 minutes about what happens if a system understands humans really well and does everything in its power to make a predicted human click on a button. Is the answer "a nice and fully aligned helpful AGI assistant"?
    See Eliezer Yudkowsky's AGI Ruin: A List of Lethalities for more (the problem you have just discovered might be described under reason #20).
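    A minimal toy sketch of this two-system setup (the model sizes, variable names, and the stand-in "human rating" below are all invented for illustration; this is not any lab's actual pipeline): system 1 learns to predict a proxy for the human's judgment, and system 2 then optimizes a behavior directly against system 1's predictions, so it exploits whatever the predictor rewards, including everything the predictor gets wrong.

```python
# Hedged toy sketch of the two-system setup from this footnote.
import torch
import torch.nn as nn

def true_human_rating(behavior):
    # Stand-in for "which button the human clicks": a crude proxy feature
    # the human can judge, not the true value of the behavior.
    return torch.tanh(behavior[:, :2].sum(dim=1, keepdim=True))

# System 1: trained to predict the human's rating of a behavior.
evaluator = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
opt_e = torch.optim.Adam(evaluator.parameters(), lr=1e-2)
for _ in range(500):
    behavior = torch.randn(64, 8)
    loss = ((evaluator(behavior) - true_human_rating(behavior)) ** 2).mean()
    opt_e.zero_grad()
    loss.backward()
    opt_e.step()

# System 2: a behavior optimized directly against system 1's *predicted* rating.
for p in evaluator.parameters():
    p.requires_grad_(False)
behavior = torch.zeros(1, 8, requires_grad=True)
opt_b = torch.optim.Adam([behavior], lr=1e-1)
for _ in range(500):
    predicted_rating = evaluator(behavior)
    opt_b.zero_grad()
    (-predicted_rating).mean().backward()  # maximize the predicted click, not "goodness"
    opt_b.step()

# The resulting behavior maximizes whatever the predictor (and the human's proxy
# judgment) rewards, including anything it gets wrong, rather than being a
# "nice and fully aligned helpful AGI assistant".
```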

  4. ^

    There are some problems you might solve to prevent someone else from launching an unaligned AI. Solving them is not something that's easy for humans to do or oversee, and it probably requires a system whose power is comparable to being able to produce alignment research like Paul Christiano's but 1,000 times faster, or to turn all the GPUs on the planet into Rubik's cubes. Building systems past a certain threshold of power is something systems below that threshold can't meaningfully help with; it's a chicken-and-egg situation. Systems below the threshold can speed up our work, but the hard bits of building the first system past this threshold safely are on us; systems powerful enough to meaningfully help you are already dangerous enough.

  5. ^

    With these goals, the research starts by solving some problems with traditional RL theory: for example, traditional RL agents, being part of the universe, can't even consider the actual universe in their set of hypotheses, since they're smaller than the universe. A traditional Bayesian agent would have a hypothesis in the form of a probability distribution over all possible worlds, but it's impossible for an agent made out of blocks in one part of a Minecraft world to assign probabilities to every possible state of the whole Minecraft world.

    IB solves this problem of non-realizability by considering hypotheses in the form of convex sets of probability distributions; in practice, this means, for example, that a hypothesis can be "every odd bit in the string of bits is 1". (This is the set of probability distributions over all possible bit strings that assign positive probability only to strings with 1s in odd positions; a mixture of any two such distributions also assigns no probability to strings with a 0 in an odd position, so it's also in the set, and the set is convex.)
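    As a sketch of that example, in notation I'm introducing here (not the post's or the IB papers'): over bit strings of length $n$, with positions indexed from 1, the hypothesis "every odd bit is 1" is the set $\Xi = \{\mu \in \Delta(\{0,1\}^n) : \mu(x) = 0 \text{ whenever } x_k = 0 \text{ for some odd } k\}$, and convexity is a one-line check: if $\mu, \nu \in \Xi$ and $\lambda \in [0,1]$, then for any $x$ with a 0 in an odd position, $(\lambda\mu + (1-\lambda)\nu)(x) = \lambda\mu(x) + (1-\lambda)\nu(x) = 0$, so the mixture is also in $\Xi$.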
