LessWrong dev & admin as of July 5th, 2022.
I do agree with this, in principle:
A system being ‘cognitively efficient wrt humanity’ doesn’t automatically entail ‘whatever goals the system has – and whatever constraints the system might otherwise face – the cognitively efficient system gets what it wants’.
...though I don't think it buys us more than a couple of points. I think people dramatically underestimate how high the ceiling is for humans; a reasonably smart human familiar with the right ideas would stand a decent chance of executing a takeover if placed in the position of an AI (assuming sped-up cognition, plus whatever actuators current systems typically possess).
However, I think this is wrong:
LLMs distill human cognition
LLMs have whatever capabilities they have because those are the capabilities discovered by gradient descent which, given their architecture, improved their performance on the training task (next token prediction). This task is extremely unlike the tasks represented in the environment where human evolution occurred, and the kind of cognitive machinery which would make a system effective at next token prediction seems very different from whatever it is that humans do. (Humans are capable of next token prediction, but notably we are much worse at it than even GPT-3.)
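(To make concrete what that task actually is, here's a toy sketch of the next-token-prediction objective that gradient descent is pushing down. The vocabulary, sequence, and `toy_model` probabilities below are made up purely for illustration; a real LLM optimizes the same cross-entropy quantity at vastly larger scale.)

```python
import numpy as np

# Toy illustration of the next-token-prediction objective (cross-entropy).
# Everything here is made up for illustration, not a description of any real model.

vocab = ["the", "cat", "sat", "on", "mat"]
sequence = ["the", "cat", "sat", "on", "the", "mat"]

def toy_model(context):
    """Stand-in for a trained model: returns a probability distribution over
    the vocabulary for the next token. (Ignores the context entirely -- a real
    model obviously would not.)"""
    probs = np.full(len(vocab), 0.15)
    probs[vocab.index("the")] = 0.4  # arbitrary slight preference for "the"
    return probs / probs.sum()

# Training loss = average negative log-probability assigned to the token that
# actually came next. Gradient descent adjusts parameters to push this number
# down; whatever internal machinery achieves that is what the model ends up with.
losses = []
for i in range(1, len(sequence)):
    context, actual_next = sequence[:i], sequence[i]
    p = toy_model(context)
    losses.append(-np.log(p[vocab.index(actual_next)]))

print(f"mean next-token cross-entropy: {np.mean(losses):.3f} nats")
```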
Separately, the cognitive machinery that represents human intelligence seems to be substantially decoupled from the cognitive machinery that represents human values (and/or the cognitive machinery that causes humans to develop values after birth), so if it turned out that LLMs did, somehow, share the bulk of their cognitive algorithms with humans, that would be a slight positive update for me, but not an overwhelming one, since I wouldn't expect an LLM to want anything remotely relevant to humans. (Most of the things that humans want are lossy proxies for things that improved inclusive genetic fitness (IGF) in the ancestral environment, many of which generalized extremely poorly out of distribution. What are the lossy proxies for minimizing prediction loss that a sufficiently-intelligent LLM would end up with? I don't know, but I don't see why they'd have anything to do with the very specific things that humans value.)
None of those obviously mean the same thing ("runaway AI" might sort of gesture at it, but it's still pretty ambiguous). Intelligence explosion is the thing it's pointing at, though I think there are still a bunch of conflated connotations that don't necessarily make sense as a single package.
I think "hard takeoff" is better if you're talking about the high-level "thing that might happen", and "recursive self improvement" is much clearer if you're talking about the usually-implied mechanism by which you expect hard takeoff.
I think people should take a step back and take a bird's-eye view of the situation:
I don't doubt that the author cares about preventing sexual assault, and mitigating the harms that come from it. They do also seem to care about something that requires dropping dark hints of potential legal remedies they might pursue, with scary-sounding numbers and mentions of venue-shopping attached to them.
Relevant, I think, is Gwern's later writing on Tool AIs:
There are similar general issues with Tool AIs as with Oracle AIs:
- a human checking each result is no guarantee of safety; even Homer nods. An extremely dangerous or subtly dangerous answer might slip through; Stuart Armstrong notes that the summary may simply not mention the important (to humans) downside to a suggestion, or frame it in the most attractive light possible. The more a Tool AI is used, or trusted by users, the less checking will be done of its answers before the user mindlessly implements it.
- an intelligent, never mind superintelligent Tool AI, will have built-in search processes and planners which may be quite intelligent themselves, and in ‘planning how to plan’, discover dangerous instrumental drives and have the sub-planning process execute them. (This struck me as mostly theoretical until I saw how well GPT-3 could roleplay & imitate agents purely by offline self-supervised prediction on large text databases—imitation learning is (batch) reinforcement learning too! See Decision Transformer for an explicit use of this.)
- developing a Tool AI in the first place might require another AI, which itself is dangerous
Personally, I think the distinction is basically irrelevant in terms of safety concerns, mostly for reasons outlined by the second bullet-point above. The danger is in the fact that "useful answers" you might get out of a Tool AI are those answers which let you steer the future to hit narrow targets (approximately described as "apply optimization power" by Eliezer & such).
If you manage to construct a training regime for something that we'd call a Tool AI, which nevertheless gives us something smart enough that it does better than humans in terms of creating plans which affect reality in specific ways[1], then it approximately doesn't matter whether or not we give it actuators to act in the world[2]. It has to be aiming at something; whether or not that something is friendly to human interests won't depend on what name we give the AI.
I'm not sure how to evaluate the predictions themselves. I continue to think that the distinction is basically confused and doesn't carve reality at the relevant joints, and I think progress to date supports this view.
I think you are somewhat missing the point. The point of a treaty with an enforcement mechanism which includes bombing data centers is not to engage in implicit nuclear blackmail, which would indeed be dumb (from a game theory perspective). It is to actually stop AI training runs. You are not issuing a "threat" which you will escalate into greater and greater forms of blackmail if the first one is acceded to; the point is not to extract resources in non-cooperative ways. It is to ensure that the state of the world is one where there is no data center capable of performing AI training runs of a certain size.
The question of whether this would be correctly understood by the relevant actors is important but separate. I agree that in the world we currently live in, it doesn't seem likely. But if you in fact lived in a world which had successfully passed a multilateral treaty like this, it seems much more possible that people in the relevant positions had updated far enough to understand that whatever was happening was at least not the typical realpolitik.
2. If the world takes AI risk seriously, do we need threats?
Obviously if you live in a world where you've passed such a treaty, the first step in response to a potential violation is not going to be "bombs away!", and nothing Eliezer wrote suggests otherwise. But the availability of those intermediate options ultimately bottoms out in the fact that your BATNA (best alternative to a negotiated agreement) is still to bomb the data center.
3. Don't do morally wrong things
I think conducting cutting-edge AI capabilities research is pretty immoral, and in this counterfactual world that is a much more normalized position, even if the consensus is that the chance of x-risk, absent a very strong plan for alignment, is something like 10%. You can construct the least convenient possible world such that some poor country has decided, for perfectly innocent reasons, to build data centers that will predictably get bombed, but unless you think the probability mass on something like that happening is noticeable, I don't think it should be a meaningful factor in your reasoning. Like, we do not let people involuntarily subject others to Russian roulette (which is roughly analogous to the epistemic state of a world where 10% x-risk is the consensus position), and our response to someone actively preparing to play it, while declaring their intention to do so in order to get some unrelated real benefit out of it, would be to stop them.
4. Nuclear exchanges could be part of a rogue AI plan
I mean, no: in this world you're already dead. Also, a nuclear exchange would in fact cost the AI quite a lot, so I'd expect many fewer nuclear wars in worlds where we've accidentally created an unaligned ASI.
This advocates for risking nuclear war for the sake of preventing mere "AI training runs". I find it highly unlikely that this risk-reward payoff is logical at a 10% x-risk estimate.
All else equal, this depends on what increase in risk of nuclear war you're trading off against what decrease in x-risk from AI. We may have "increased" risk of nuclear war by providing aid to Ukraine in its war against Russia, but if it was indeed an increase it was probably small and worth the trade-off[1] against our other goals (such as disincentivizing the beginning of wars which might lead to nuclear escalation in the first place). I think approximately the only unusual part of Eliezer's argument is the fact that he doesn't beat around the bush in spelling out the implications.
Asserted for the sake of argument; I haven't actually demonstrated that this is true but my point is more that there are many situations where we behave as if it is obviously a worthwhile trade-off to marginally increase the risk of nuclear war.
Many things about this comment seem wrong to me.
Yudkowsky's suggestions seem entirely appropriate if you truly believe, like him, that AI x-risk is probability ~100%.
These proposals would plausibly be correct (to within an order of magnitude) as a degree of response even at much lower probabilities of doom (e.g. 10-20%). I think you need to actually run the math to say that this doesn't make sense.
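As a gesture at what "running the math" might look like, here's a deliberately crude expected-value comparison. Every number in it is a placeholder made up for illustration, not an estimate anyone in this thread has endorsed; the point is only that the conclusion turns on the numbers, so they have to actually be run rather than assumed away.

```python
# Deliberately crude expected-value comparison; every number here is a
# hypothetical placeholder, not an estimate anyone has endorsed.

world_population = 8e9

# Hypothetical: an enforced treaty reduces AI x-risk by some increment...
p_doom_without_treaty = 0.10   # the "much lower" 10% figure from above
p_doom_with_treaty = 0.05      # assumed halving -- pure illustration

# ...at the cost of some marginal increase in the chance of a nuclear exchange.
p_nuclear_increase = 0.01      # assumed marginal increase -- pure illustration
nuclear_deaths = 1e9           # assumed fatalities in a large exchange -- pure illustration

expected_lives_saved = (p_doom_without_treaty - p_doom_with_treaty) * world_population
expected_nuclear_cost = p_nuclear_increase * nuclear_deaths

print(f"expected lives saved by enforcement: {expected_lives_saved:.2e}")
print(f"expected cost of added war risk:     {expected_nuclear_cost:.2e}")
# With these placeholder numbers the trade-off favors enforcement by ~40x,
# and that's before counting anything about lost future value. Different
# inputs can flip the sign, which is exactly why the math has to be run.
```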
unproven and unlikely assumptions, like that an AI could build nanofactories by ordering proteins to be mixed over email
This is a deeply distorted understanding of Eliezer's threat model, which is not any specific story that he can tell, but the brute fact that something smarter than you (and him, and everyone else) will come up with something better than that.
In the actual world, where the probability of extinction is significantly less than 100%, are these proposals valuable?
I do not think it is ever particularly useful to ask "is someone else's conclusion valid given my premises, which are importantly different from theirs?" if you are attempting to argue against someone's premises. Obviously "A => B" & "C" does not imply "B", and it especially does not imply "~A" (where A is Eliezer's premise of near-certain doom, B his proposed response, and C your own much lower probability estimate).
It seems like they will just get everyone else labelled Luddites and fearmongers, especially if years and decades go by with no apocalypse in sight.
This is an empirical claim about PR, which:
Separate from object-level disagreements, my crux is that people can have inside-view models which "rule out" other people's models (as well as outside-view considerations) in a way that leads to assigning very high likelihoods (i.e. 99%+) to certain outcomes.
The fact that they haven't successfully communicated their models to you is certainly a reason for you to not update strongly in their direction, but it doesn't mean much for their internal epistemic stance.
I think that's strongly contra Eliezer's model, which is shaped something like "succeeding at solving the alignment problem eliminates most sources of existential risk, because aligned AGI will in fact be competent to solve for them in a robust way". This does obviously imply something about the ability of random humans to
~~spin up unmonitored nanofactories~~ push a bad yaml file. Maybe there'll be some much more clever solution(s) for various possible problems? /shrug