Interesting post!
I like your points against the value of public advocacy. I'm not convinced overall, but that's probably mostly because I'm in a more pessimistic place overall, one where I don't mind trying more desperate measures.
One small comment on incentives for alignment researchers:
Goodhart’s Law (labs): Safety teams face large pressure to do things that would look good to the public, which is different from doing work that is actually good[9].
I feel like there are already a bunch of misaligned incentives for alignment researchers (and all groups of people in general), so if the people on the safety team aren't selected to be great at caring about the best object-level work, then we're in trouble either way. My model might be too "black and white," but I basically think that people fall into categories where they are either lost causes or they have trained themselves in cognitive habits for steering clear of distorting influences. Sure, the incentive environment still makes a difference, but I find it unlikely that it's among the most important considerations for evaluating the value of masses-facing AI pause advocacy.
Iterative deployment. We treat AGI like we would treat many other new technologies: something that could pose risks, which we should think about and mitigate, but ultimately something we should learn about through iterative deployment. The default is to deploy new AI systems, see what happens with a particular eye towards noticing harms, and then design appropriate mitigations. In addition, AI systems are deployed with a rollback mechanism, so that if a deployment causes significant harms
[...]
Conditional pause. We institute regulations that say that capability improvement must pause once the AI system hits a particular threshold of riskiness, as determined by some relatively standardized evaluations, with some room for error built in. AI development can only continue once the developer has exhibited sufficient evidence that the risk will not arise.
Compared to you, I'm more pessimistic about these two measures. On iterative deployment, I'm skeptical about the safety of rollback mechanisms. On conditional pause, I agree it makes total sense to pause at the latest point possible as long as things are still pretty likely to be safe. However, I don't see why we aren't already at that point.
I suspect that our main crux might be a disagreement over takeoff speeds, with AI timelines perhaps being another (more minor) crux?
On takeoff speeds, I place 70% on 0.5-2.5 orders of magnitude for what Tom Davidson calls the FLOP gap. (Relatively high robustness because I thought about this a lot.) I also worry less about metrics of economic growth/economic change, because I believe it won't take long to go from "AI makes a noticeable dent in human macroeconomic productivity" to "it's now possible to run millions of AIs that are ~better/faster at all tasks human experts can do on a computer." The latter scenario is one from which it is easy to imagine how AIs might disempower humans. I basically don't see how one can safely roll things back at the point where generally smarter-than-human AIs exist that can copy themselves millionfold on the internet.
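To make concrete what that range would mean, here's a minimal Python sketch. The 1e28 FLOP baseline for a "20%-automation" AI and the ~1 OOM/year of effective-compute growth are purely hypothetical placeholders of my own, not figures from Tom Davidson's report; the point is only the shape of the arithmetic.

```python
# Rough arithmetic sketch of what a 0.5-2.5 OOM FLOP gap would mean.
# The baseline and growth rate below are hypothetical placeholders,
# not figures from Davidson's report.

baseline_flop_20pct = 1e28  # hypothetical training compute for a "20%-automation" AI

for gap_oom in (0.5, 1.5, 2.5):
    flop_100pct = baseline_flop_20pct * 10 ** gap_oom
    print(f"FLOP gap of {gap_oom} OOM -> ~{flop_100pct:.1e} FLOP for full automation")

# If effective training compute (spending + hardware + algorithms) grew by,
# hypothetically, ~1 OOM per year, a 0.5-2.5 OOM gap would be crossed in
# roughly half a year to two and a half years.
growth_oom_per_year = 1.0  # hypothetical
for gap_oom in (0.5, 2.5):
    print(f"{gap_oom} OOM gap / {growth_oom_per_year} OOM per year = {gap_oom / growth_oom_per_year} years")
```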
On timelines, I have maybe 34% on 3 years or less, and 11% on 1 year.
Why do I have these views?
A large part of the story is that if you had described to me back in 2018 all the AI capabilities we have today, without mentioning the specific year by which we'd have those capabilities, I'd have said "once we're there, we're probably very close to transformative AI." And now that we are at this stage, even though it's much sooner than I'd have expected, I feel like the right direction of update is "AI timelines are sooner than expected" rather than "(cognitive) takeoff speeds must be slower than I'd have expected."
Maybe this again comes down to my specific view on takeoff speeds. I felt more confident that takeoff won't be super slow than I felt confident about anything timelines-related.
So, why the confident view on takeoff speeds? Just looking at humans vs chimpanzees, I'm amazed by the comparatively small difference in brain size. We can also speculate, based on the way evolution operates, that there's probably not much room for secret-sauce machinery in the human brain (that chimpanzees don't already have).
The main counterargument from the slow(er)-takeoff crowd on the chimpanzee vs. human comparison is that humans faced much stronger selection pressure for intelligence, which must have tweaked a lot of other things besides brain size. Since chimpanzees didn't face that same selection pressure, the evolutionary comparison underestimates how smart an animal with a chimpanzee-sized brain would be if it had also undergone strong selection for the sort of niche humans inhabit (the "intelligence niche"). I find that counterargument slightly convincing, but not convincing enough to narrow my FLOP gap estimates much. Compared to ML progress, where we often 10x compute between models, evolution operated far more slowly.
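To make the comparison concrete, here's a back-of-the-envelope sketch. The brain-mass figures are rough, commonly cited averages, and the "10x between models" number is the same loose generalization as above, so treat all of it as illustrative rather than precise.

```python
# Back-of-the-envelope comparison (all figures approximate, commonly cited
# averages; illustrative only).

human_brain_g = 1350   # approximate average human brain mass in grams
chimp_brain_g = 390    # approximate average chimpanzee brain mass in grams

size_ratio = human_brain_g / chimp_brain_g
print(f"Human vs. chimp brain mass ratio: ~{size_ratio:.1f}x")  # roughly 3-4x

# A single between-model jump in ML training compute is often around 10x,
# i.e., larger than the whole human-chimp brain-size gap, and it happens
# on a timescale of years rather than millions of years.
ml_compute_jump = 10
print(f"Typical between-model compute jump: ~{ml_compute_jump}x")
```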
As I've written in an (unpublished) review of (a draft of) the FLOP gap report:
I’ll now reply to sections in the report where Tom discusses evidence for the FLOP gap and point out where I’d draw different conclusions. I’ll start with the argument that I consider the biggest crux between Tom and me.
Tom seems to think chimpanzees – or even rats, with much smaller probability – could plausibly automate a significant percent of cognitive tasks “if they had been thoroughly evolutionarily prepared for this” (I’m paraphrasing). A related argument in this spirit is Paul Christiano’s argument that humans far outclass chimpanzees not because of a discontinuity in intelligence, but because chimpanzees hadn’t been naturally selected to be good at a sufficiently general (or generalizable) range of skills.
I think this argument is in tension with the observation that people at the lower range of human intelligence (as measured by IQ, for instance) tend to struggle to find and hold jobs. If natural selection had straightforwardly turned all humans into specialists for “something closely correlated with economically useful tasks,” then how come the range for human intelligence is still so wide? I suspect that the reason humans were selected for (general) intelligence more than chimpanzees is because of some “discontinuity” in the first place. This comment by Carl Shulman explains it as follows (my emphasis in bold): “Hominid culture took off enabled by human capabilities [so we are not incredibly far from the minimum need for strongly accumulating culture, the selection effect you reference in the post], and kept rising over hundreds of thousands and millions of years, at accelerating pace as the population grew with new tech, expediting further technical advance.”
Admittedly, Carl also writes the following:
“Different regions advanced at different rates (generally larger connected regions grew faster, with more innovators to accumulate innovations), but all but the smallest advanced. So if humans overall had lower cognitive abilities there would be slack for technological advance to have happened anyway, just at slower rates (perhaps manyfold), accumulating more by trial and error.”
So, perhaps chimpanzees (or bonobos), if they had been evolutionarily prepared for social learning or for performing well in the economy, could indeed perform about 20% of today’s tasks. But that’s also my cutoff point: I think we can somewhat confidently rule out that smaller-brained primates, let alone rodents such as rats, could do the same thing in this hypothetical. (Or, in case I’m wrong, it would have to be because it’s possible to turn all of the rodent’s cognition into a highly specialized “tool” that exploits various advantages it can reach over humans – enabling partial automation of specific workflows, but not full automation of sectors.)
b) Same for "the capability program is an easier technical problem than the alignment program". You don't know that; nobody knows that; Lord Kelvin/Einstein/Ehrlich/etc would all have said "X is an easier technical problem than flight/nuclear energy/feeding the world/etc" for a wide range of X, a few years before each of those actually happened.
Even if we should be undecided here, there's an asymmetry where, if you get alignment too early, that's okay, but getting capabilities before alignment is bad. Unless we know that alignment is going to be easier, pushing forward on capabilities without an outsized alignment benefit seems needlessly risky.
On the object level: if we think the scaling hypothesis is roughly correct (or "close enough"), or if we consider it telling that evolution probably didn't have the sophistication to install much specialized brain circuitry between humans and other great apes, then it seems like getting capabilities past some threshold of universality and self-improvement/self-rearrangement ("learning how to become better at learning/thinking") can't be that difficult? Especially considering that we arguably already have "weak AGI." (But maybe you have an inside view that says we still have huge capability obstacles to overcome?)
At the same time, alignment research seems to be in a fairly underdeveloped state (at least that's my impression as a curious outsider), so I'd say "alignment is harder than capabilities" seems almost certainly true. Factoring in lots of caveats about how they aren't always cleanly separable, and so on, doesn't seem to change that.
Interesting and insightful framing! I think the main concern I have is that your scenario 1 doesn't engage much with the idea of capability info hazards, or with the point that some of the people who nerd out about technical research lack the moral seriousness or big-picture awareness to refrain from always pushing ahead.
That makes sense. But like Greg_Colbourn says, it seems like a non-trivial assumption that alignment research will become significantly more productive with newer systems.
Also, different researchers may expect very different degrees of "more productive." It seems plausible to me that we could learn more about the motivations of AI models once we move to a paradigm that isn't just "training next-token prediction on everything on the internet." At the same time, it seems outlandish to me that there'd ever come a point where new systems could help us with the harder parts of alignment (due to the expert delegation problem: delegating well in an environment where the assistants may not all be competent and well-intentioned becomes impossible if you don't already have the expertise yourself).
AI takeover is harder in the current world than many future worlds
This seems true. By extension, AI takeover before the internet became widely integrated into the economy would also have seemed harder.
Presumably in 2023 if you start doing research on biological weapons, buying uranium, or launching cyberattacks, federal government agencies learn about this relatively soon and try to shut you down. Conditional on somewhat slow AI takeoff (there is widespread deployment of AI systems that we see happening and we’re still alive), I expect that whatever institutions currently do global security like this just get totally overwhelmed. The problem is not just that there may be many AIs doing malicious stuff, it’s that there are tons of AIs doing things at all, and it’s just quite hard to keep track of what’s happening.
This line of argument suggests that slow takeoff is inherently harder to steer, because pretty much any version of slow takeoff means that the world will change a ton before we get strongly superhuman AI.
This also relates to the benefits of solving alignment – how much that helps with coordination problems. Even if your team solves alignment, there's still the concern that anyone could build and release a massive fleet of misaligned AIs. As I wrote elsewhere:
At least, a slow-takeoff world seems more likely to contain aligned AIs. However, coordination problems could (arguably) be more difficult to solve under slow takeoff because aligned AI would have less of a technological lead over the rest of the world. By contrast, in a hard takeoff world, with aligned, superintelligent AI, we could solve coordination challenges with a "pivotal act." Even if that strategy seems unrealistic/implausible, we have to note that, in a slow takeoff world, this option isn’t available and we’ll definitely have to solve coordination “the hard way” – heavily relying on people and institutions (while still getting some help from early-stage TAI).
Argument summary: Existential risk may hinge on whether AGI development is centralized to a single major project, where centralization is good because it gives this project more time for safety work and securing the world. I look at some arguments about whether AI development sooner or later is better for centralization, and overall I think the answer is unclear.
Yeah, I agree that it seems unclear. But I feel like the current state of things is clearly suboptimal, and if we need something extraordinary to happen to get the AI transition right, that only has a chance of happening with more time. I'm envisioning something like "a globally coordinated ban on large training runs + a CERN-like alignment project with well-integrated safety evals, to ensure the focus remains on alignment research rather than accidentally creating AI agents with dangerous capabilities." (Maybe we don't need this degree of coordination, but it's the sort of thing we definitely can't achieve under very short timelines.)
Josh’s post covers some arguments for why acceleration may be good: Avoid/delay a race with China, Smooth out takeoff (reduce overhangs), Keep the good guys in the lead.
I'm not convinced by those.
Regarding smoothing out takeoff: I think we're still in the ramp-up period, where companies are allocating increasingly large portions of their budgets to compute (incl. new data centers). In this sense, there's a lot of (compute) "overhang" available – compute the world could use if companies increased their willingness to spend, but aren't currently using. In a few years, if AI takeoff hasn't happened yet, resource allocation will likely be closer to the competitive frontier, which reduces the hardware overhang, so takeoff then would likely be smoother. So, at least if "AI soon" means "soon enough that we're still in the ramp-up period" (3 years or less?), takeoff looks unlikely to be smooth.
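To illustrate the ramp-up logic, here's a toy Python sketch with entirely made-up numbers; the `spending_overhang` helper and the spending figures are hypothetical, and the point is just how the available jump shrinks as spending approaches the frontier.

```python
# Toy illustration of the "spending ramp-up" overhang argument.
# All numbers are invented for illustration; only the shape of the
# reasoning matters, not the magnitudes.

def spending_overhang(current_spend, max_willingness_to_spend):
    """Factor by which compute use could jump if spending moved to the frontier."""
    return max_willingness_to_spend / current_spend

# Today (hypothetically): labs spend a small fraction of what they could,
# so a large discontinuous jump in compute use is still available.
print(spending_overhang(current_spend=1.0, max_willingness_to_spend=10.0))   # 10x jump available

# A few years from now (hypothetically): spending is near the frontier,
# so further scaling has to come from hardware/algorithmic progress alone.
print(spending_overhang(current_spend=8.0, max_willingness_to_spend=10.0))   # only ~1.25x left
```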
(Not to mention that there's a plausible worldview on which "smooth takeoff" was never particularly likely to begin with – the Christiano-style takeoff model many EAs are operating under isn't obviously correct. On the alternative view, AI research is perhaps not all that economically efficient even now, and algorithmic improvements could unearth enough of an "overhang" to blow us past the human range – an extremely narrow target once you also allow compute to 5x alongside the algorithmic improvements. That combination is arguably a bigger jump than the one from chimpanzees to humans, and it's plausibly what happens whenever someone develops a new foundational model.)
Regarding China: I think it's at least worth being explicitly quantitative about the likelihood of China catching up in the next x years. The compute export controls make a big dent if they hold up, and I've seen news suggesting that China's AI research policies (particularly around LLMs) hinder innovation. It doesn't seem like China releasing misaligned AI is a huge concern over the next couple of years.
On keeping the good guys in the lead: I don't have a strong opinion here, but I'm not entirely convinced that current lab leadership is sufficiently high on integrity/not narcissistic. More time might let us improve governance at leading labs or at new projects. Admittedly, Meta's stance on AI seems insane, so there's maybe a point there. Still, who cares about "catching up" if alignment remains unsolved by the time it needs to be solved? I feel like some EAs, in discussions like this one, are double-counting the reasons for optimism of some vocal optimists (like Christiano), not factoring in that part of that optimism comes explicitly from not having very short timelines. It's important to emphasize that no one is particularly optimistic conditional on very short AI timelines.
I don't concede, because people having incorrect maps is expected and tells me little about the territory.
I'm clearly talking about expert convergence under ideal reasoning conditions, as discussed earlier. Weird that this wasn't apparent. In physics or any other scientific domain, there's no question whether experts would eventually converge if they had ideal reasoning conditions. That's what makes these domains scientifically valid (i.e., they study "real things"). Why is morality different? (No need to reply; it feels like we're talking in circles.)
FWIW, I think it's probably consistent to have a position that includes (1) a wager for moral realism ("if it's not true, then nothing matters" – your wager is about the importance of qualia, but I've also seen similar reasoning with normativity or free will as the bedrock), and (2) a simplicity/"lack of plausible alternatives" argument for hedonism. This sort of argument for hedonism only works if you take realism for granted, but that's where the wager comes in handy. (Still, one could argue that tranquilism is 'simpler' than hedonism and therefore more likely to be the one true morality, but okay.) Note that this combination of views isn't quite "being confident in moral realism," though. It's only "confidence in acting as though moral realism is true."
I talk about wagering on moral realism in this dialogue and the preceding post. In short, it seems fanatical to me if taken to its conclusions, and I don't believe that many people really believe this stuff deep down without any doubt whatsoever. Like, if push comes to shove, do you really have more confidence in your understanding of illusionism vs. other views in philosophy of mind, or do you have more confidence in wanting to reduce the thing that Brian Tomasik calls suffering when you see it in front of you (regardless of whether illusionism turns out to be true)? (Of course, far be it from me to discourage people from taking weird ideas seriously; I'm an EA, after all. I'm just saying that it's worth reflecting on whether you really buy into that wager wholeheartedly or whether you have some meta-uncertainty.)
I also talk a bit about consciousness realism in endnote 18 of my post "Why Realists and Anti-Realists Disagree." I want to flag that I personally don't understand why consciousness realism would necessarily imply moral realism. I guess I can see that it gets you closer, but I think there's more to argue for even granting consciousness realism.

In any case, I think illusionism is being strawmanned in that debate. Illusionists aren't denying anything worth wanting; they're only denying something that never made sense in the first place. It's the same as compatibilism in the free will debate: you never wanted "true free will," whatever that is. Just like one can be mistaken about one's visual field having lots of detail even at the edges, or how some people with a brain condition can be mistaken about seeing stuff when they have blindsight, illusionists claim that people can be mistaken about some of the properties they ascribe to consciousness.

They're not mistaken about a non-technical interpretation of "it feels like something to be me," because that's just how we describe the fact that there's something both illusionists and qualia realists are debating. However, illusionists claim that qualia realists are mistaken about a philosophically loaded interpretation of "it feels like something to be me," where the hidden assumption is something like "feeling like something is a property that is either on or off for a given thing, and there's always a fact of the matter." See the dialogue in endnote 18 of that post on why this isn't correct (or at least why we cannot infer it from our experience of consciousness).

(This debate is, by the way, very similar to the moral realism vs. anti-realism debate. There's a sense in which anti-realists aren't denying that "torture is wrong" in a loose, not-too-philosophically-loaded sense. They're just denying that, from "torture is wrong," we can infer that there's a fact of the matter about all courses of action – whether they're right or wrong.)

Basically, the point I'm trying to make is that illusionists aren't disagreeing with you if you say you're conscious. They're only disagreeing with you when, based on introspecting about your consciousness, you claim to know that an omniscient being could tell of every animal/thing/system/process whether it's conscious or not – that there must be a fact of the matter. Just because it feels to you like there's a fact of the matter doesn't mean there aren't myriad edge cases where we (or experts under ideal reasoning conditions) can't draw crisp boundaries around what is or isn't 'conscious.' That's why illusionists like Brian Tomasik end up saying that consciousness is about what kinds of algorithms you care about.
I'm not sure if I'm claiming quite that, but maybe I am. It depends on operationalizations.
Most importantly, I want to flag that even people who are optimistic that "alignment might turn out to be easy" would probably lose their optimism if we assume that timelines are sufficiently short. Like, would you/they still be optimistic if we for sure had <2 years? It seems to me that more people are confident that AI timelines are very short than are confident that we'll solve alignment really soon. In fact, no one seems confident that we'll solve alignment really soon. So the situation already feels asymmetric.
On assessing alignment difficulty, I sympathize most with Eliezer's claims that it's important to get things right on the first try and that engineering progress among humans has almost never turned out to be smoother than initially expected (which, combined with the "we need to get it right on the first try" argument, is a reason for pessimism). I'm less sure how much I buy Eliezer's confidence that "niceness/helpfulness" isn't easy to train / isn't a basin of attraction. He has some story about how prosocial instincts evolved in humans for highly contingent reasons, so that they're unlikely to re-emerge in ML training, and there I'm more like "Hm, hard to know." So, I'm not pessimistic for inherently technical reasons; it's more that I think we'll fumble the ball even if we're in the lucky world where the technical stuff is surprisingly easy.
That said, I still think "how difficult is alignment?" isn't the sort of question where the ignorance prior is 50-50. It feels like there are more ways for it to be hard than easy.