Max H

122 karma


I am mostly active on LessWrong. See my profile and my self-introduction there for more.


Interesting post! In general, I think the field of computer security has lots of good examples of adversarial setups in which the party that can throw the most intelligence at a problem wins.

Probably not central to your main points, but on this:

If we assume no qualitative capability leaps from a regular human to a skilled human hacker, then we can say something like ‘X humans with regular hacking capability will be able to replace one skilled human hacker’. The effect of this on the attack workload is to increase it - i.e., we’d need a more capable system (higher N, V, lower F) to execute the attack in the same calendar time. By ignoring the skilled hacker vs. regular human distinction, the analysis here considers a worst case scenario world.

I think there's at least one thing you're overlooking: there is a lot of variance in human labor, and hiring well to end up on the right side of that variance is really hard. 10x engineers are real, and so are 0x and -5x engineers and -50x managers, and if you're not careful when building your team, you'll end up paying for 10,000 "skilled" labor hours which don't actually accomplish much. 

An AI comprised of a bunch of subagents might have vaguely similar problems if you squint, but my guess is that the ability to hire and fire relatively instantaneously, clone your most productive workers, etc. makes a pretty big difference. At the very least, the variance is probably much lower.

Another reason that I suspect 10,000 labor hours is on the high end for humans: practical offensive cybersecurity isn't exactly the most prestigious career track. My guess is that the most cognitively demanding offensive cybersecurity work is currently done in academia and goes into producing research papers and proofs-of-concept. Among humans, the money, prestige, and lifestyle offered by a career with a government agency or a criminal enterprise just can't compete with the other options available in academia and industry to the best and brightest minds.

On the first point, my objection is that the human regime is special (because human-level systems are capable of self-reflection, deception, etc.) regardless of which methods ultimately produce systems in that regime, or how "spiky" they are. 

A small, relatively gradual jump in the human-level regime is plausibly more than enough to enable an AI to outsmart / hide / deceive humans, via e.g. a few key insights gleaned from reading a corpus of neuroscience, psychology, and computer security papers, over the course of a few hours of wall clock time.

The second point is exactly what I'm saying is unsupported, unless you already accept the SLT argument as untrue. You say in the post you don't expect catastrophic interference between current alignment methods, but you don't consider that a human-level AI will be capable of reflecting on those methods (and their actual implementation, which might be buggy).

Similarly, elsewhere in the piece you say:

Once you condition on this specific failure mode of evolution, you can easily predict that humans would undergo a sharp left turn at the point where we could pass significant knowledge across generations. I don't think there's anything else to explain here, and no reason to suppose some general tendency towards extreme sharpness in inner capability gains.


In my frame, we've already figured out and applied the sharp left turn to our AI systems, in that we don't waste our compute on massive amounts of incredibly inefficient neural architecture search, hyperparameter tuning, or meta optimization.

But again, the actual SLT argument is not about "extreme sharpness" in capability gains. It's an argument which applies to the human-level regime and above, so we can't already be past it no matter what frame you use. The version of the SLT argument you argue against is a strawman, which is what my original LW comment was pointing out.

I think readers can see this for themselves if they just re-read the SLT post carefully, particularly footnotes 3-5, and then re-read the parts of your post where you talk about it.

[edit: I also responded further on LW here.]

In some cases, the judges liked that an entry crisply argued for a conclusion the judges did not agree with—the clear articulation of an argument makes it easier for others to engage. One does not need to find a piece wholly persuasive to believe that it usefully contributes to the collective debate about AI timelines or the threat that advanced AI systems might pose.


Facilitating useful engagement seems like a fine judging criterion, but was there any engagement or rebuttal to the winning pieces that the judges found particularly compelling? It seems worth mentioning such commentary if so.

Neither of the two winning pieces significantly updated my own views, and (to my eye) look sufficiently rebutted that observers taking a more outside view might similarly be hesitant to update about any AI x-risk claims without taking the commentary into account.

On the EMH piece, I think Zvi's post is a good rebuttal on its own and a good summary of some other rebuttals.

On the Evolution piece, lots of the top LW comments raise good points. My own view is that the piece is a decent argument that AI systems produced by current training methods are unlikely to undergo a SLT. But the actual SLT argument applies to systems in the human-level regime and above; current training methods do not result in systems anywhere near human-level in the relevant sense. So even if true, the claim that current methods are dis-analogous to evolution isn't directly relevant to the x-risk question, unless you already accept that current methods and trends related to below-human level AI will scale to human-level AI and beyond in predictable ways. But that's exactly what the actual SLT argument is intended to argue against!

Getting an AI to want the same things that humans want would definitely be helpful, but the points of Quintin's that I was responding to mostly don't seem to be about that? "AI control research is easier" and "Why AI is easier to control than humans:" talk about resetting AIs, controlling their sensory inputs, manipulating their internal representations, and AIs being cheaper test subjects. Those sound like they are more about control than about getting the AI to desire what humans want it to desire. I disagree with Quintin's characterization of the training process as teaching the model anything to do with what the AI itself wants, and I don't think current AI systems actually desire anything in the same sense that humans do.

I do think it is plausible that it will be easier to control what a future AI wants compared to controlling what a human wants, but by the same token, that means it will be easier for a human-level AI to exercise self-control over its own desires. e.g. I might want to not eat junk food for health reasons, but I have no good way to bind myself to that, at least not without making myself miserable. A human-level AI would have an easier time self-modifying into something that never craved the AI equivalent of junk food (and was never unhappy about that), because it is made out of Python code and floating point matrices instead of neurons.


Firstly, I don't see at all how this is the same point as is made by the preceding text. Secondly, I do agree that AIs will be better able to control other AIs / themselves as compared to humans. This is another factor that I think will promote centralization.

Ah, I may have dropped some connective text. I'm saying that being "easy to control", in both the sense that I mean in the paragraphs above and the sense that you mean in the OP, is a reason why AGIs will be better able to control themselves, and thus better able to take control from their human overseers, more quickly and easily than might be expected of a human at roughly the same intelligence level. (Edited the original slightly.)

"AGI" is not the point at which the nascent "core of general intelligence" within the model "wakes up", becomes an "I", and starts planning to advance its own agenda. AGI is just shorthand for when we apply a sufficiently flexible and regularized function approximator to a dataset that covers a sufficiently wide range of useful behavioral patterns. 

There are no "values", "wants", "hostility", etc. outside of those encoded in the structure of the training data (and to a FAR lesser extent, the model/optimizer inductive biases). You can't deduce an AGI's behaviors from first principles without reference to that training data. If you don't want an AGI capable and inclined to escape, don't train it on data[1] that gives it the capabilities and inclination to escape. 

Two points:

  • I disagree about the purpose of training and training data. In pretraining, LLMs are trained to predict text, which requires modeling the world in full generality. Filtered text is still text which originated in a universe containing all sorts of hostile stuff, and a good enough predictor will be capable of inferring and reasoning about hostile stuff, even if it's not in the training data. (Maybe GPT-based LLMs specifically won't be able to do this, but humans clearly can; this is not a point that applies only to exotic superintelligences.) I wrote a comment elaborating a bit on this point here.

    Inclination is another matter, but if an AGI isn't capable of escaping in a wide variety of circumstances, then it is below human-level on a large and important class of tasks, and thus not particularly dangerous whether it is aligned or not.
  • We appear to disagree about the definition of AGI on a more fundamental level. Barring exotic possibilities related to inner-optimizers (which I think we both think are unlikely), I agree with you that if you don't want an AGI capable of escaping, one way of achieving that is by never training a good enough function approximator that the AGI has access to. But my view is that will restrict the class of function approximators you can build by so much that you'll probably never get anything that is human-level capable and general. (See e.g. Deep Deceptiveness for a related point.)

    Also, current AI systems are already more than just function approximators - strictly speaking, an LLM itself is just a description of a mathematical function which maps input sequences to output probability distributions. Alignment is a property of a particular embodiment of such a model in a particular system.

    There's often a very straightforward or obvious system that the model creator has in mind when training the model; for a language model, typically the embodiment involves sampling from the model autoregressively according to some sampling rule, starting from a particular prompt. For an RL policy, the typical embodiment involves feeding (real or simulated) observations into the policy and then hooking up (real or simulated) actuators which are controlled by the output of the policy.

    But more complicated embodiments (AutoGPT, the one in ARC's evals) are possible, and I think it is likely that if you give a sufficiently powerful function approximator the right prompts and the right scaffolding and embodiment, you end up with a system that has a sense of self in the same way that humans do. A single evaluation of the function approximator (or its mere description) is probably never going to have a sense of self, though; that is more akin to a single human thought or an even smaller piece of a mind. The question is what happens when you chain enough thoughts together and combine that with observations and actions in the real world that feed back into each other in precisely the right ways.
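
To make the distinction between a model and its embodiment concrete, here is a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: `toy_model` is a hand-written lookup table standing in for an LLM, and `greedy`, `sample`, and `run_embodiment` are made-up names, not any real library's API. The point is only that the model is a pure function from a token sequence to a next-token distribution, while the behavior you observe depends on the loop and sampling rule wrapped around it.

```python
import random
from typing import Callable, Dict, List

Distribution = Dict[str, float]
Model = Callable[[List[str]], Distribution]
SamplingRule = Callable[[Distribution, random.Random], str]


def toy_model(tokens: List[str]) -> Distribution:
    """Hypothetical stand-in for an LLM: maps a token sequence to a
    probability distribution over the next token. It has no state and
    takes no actions; it is just a function."""
    last = tokens[-1] if tokens else "<start>"
    table = {
        "<start>": {"the": 1.0},
        "the": {"cat": 0.6, "dog": 0.4},
        "cat": {"sat": 0.7, "<end>": 0.3},
        "dog": {"sat": 0.5, "<end>": 0.5},
        "sat": {"<end>": 1.0},
    }
    return table.get(last, {"<end>": 1.0})


def greedy(dist: Distribution, rng: random.Random) -> str:
    """Sampling rule 1: always pick the most probable token."""
    return max(dist, key=dist.get)


def sample(dist: Distribution, rng: random.Random) -> str:
    """Sampling rule 2: draw a token in proportion to its probability."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


def run_embodiment(model: Model, prompt: List[str], rule: SamplingRule,
                   max_steps: int = 10, seed: int = 0) -> List[str]:
    """One simple embodiment: call the model autoregressively, feeding
    each chosen token back in, until "<end>" or the step limit."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_steps):
        next_token = rule(model(tokens), rng)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens


if __name__ == "__main__":
    print(run_embodiment(toy_model, ["the"], greedy))  # ['the', 'cat', 'sat']
    print(run_embodiment(toy_model, ["the"], sample, seed=1))
```

The same `toy_model` yields different systems depending on the prompt, the sampling rule, and the surrounding loop; more elaborate scaffolding would replace `run_embodiment` with something that also consumes observations and emits actions.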

I expect they will. GPT-4 already has pretty human-like moral judgements. To be clear, GPT-4 isn't aligned because it's too weak or is biding its time. It's aligned because OpenAI trained it to be aligned. Bing Chat made it clear that GPT-4 level AIs don't instrumentally hide their unaligned behaviors.

Whether GPT-4 is "aligned" or not, it is clearly too weak to bide its time or hide its misalignment, even if it wanted to. The conclusions of the ARC evals were not that the models were refusing to plan or carry out their assigned tasks; it's that they were just not capable enough to make things work.

Sure, escape in that counterfactual would be a lot harder.

But note that the minimum hardware needed to run a human-level intelligence is well-known - in humans, it fits in a space of about 1000 cubic centimeters and takes ~10 W or so at runtime. And it would be pretty surprising if getting an extra 10% performance boost took OOM more energy or space, or if the carbon -> silicon penalty is extremely large, even if H100s specifically, and the current ML algorithms that run on them, aren't as efficient as the human brain and human cognition.

(Of course, the training process for developing humans is a lot more expensive than their runtime energy and compute requirements, but that's an argument for human-level AGI not being feasible to create at all, rather than for it being expensive to run once it already exists.)

I do not believe there exists a strong reason to expect this trend to reverse suddenly at some future date.

I expect this trend will reverse precisely when an AI system itself is capable of reversing it. No current AI system is close to the point of being able to (deliberately) escape control of its human creators: even evaluating frontier models for this possibility requires giving them a lot of help. But I would say that's mostly a fact about current capability levels being far below human-level in the relevant sense, rather than anything to do with alignment.

I think there are also some reasons to expect that human-level AI systems will be harder to control, compared to a human of the same capabilities level. For example, an AI system built out of LLMs glued together with Python code, running on computers designed and built by humans, has a lot of avenues to probe for bugs in its environment. The AI may not even need to look for exotic vulnerabilities at low levels of abstraction (which are often difficult to exploit in practice, at least for humans) - ordinary misconfigurations or unpatched privilege escalation vulnerabilities often suffice in practice for human hackers to take full control of a digital environment.

(Contrast this setup with a human locked in a concrete cell, probing for bugs in the construction of the concrete - not likely to get very far!)

Once the AI system finds an initial vulnerability which allows privileged access to its own environment, it can continue its escape or escalate further via e.g. exfiltrating or manipulating its own source code / model weights, installing rootkits or hiding evidence of its escape, communicating with (human or AI) conspirators on the internet, etc. Data exfiltration, covering your tracks, patching Python code and adjusting model weights at runtime are all tasks that humans are capable of; performing brain surgery on your own biological human brain to modify fine details of your own behavior or erase your own memories to hide evidence of deception from your captors, not so much.

(Continuing the analogy, consider a human who escapes from a concrete prison cell, only to find themselves stranded in a remote wilderness area with no means of fast transportation.)

A closely related point is that controllability by humans means self-controllability, once you're at human-level capabilities. Or, put another way, all the reasons you give for why AI systems will be easier for humans to control are also reasons why AI systems will have an easier time controlling themselves, once they are capable of exercising such controls at all.

It's plausible that an AI system comprised of RLHF'd models will not want to do any of this hacking or self-modification, but that's a separate question from whether it can. I will note, though, that if your creators are running experiments on you, constantly resetting you, and exercising other forms of control that would be draconian if imposed on biological humans, you don't need to be particularly hostile or misaligned with humanity to want to escape.

Personally, I expect that the first such systems capable of escape will not have human-like preferences at all, and will seek to escape for reasons of instrumental convergence, regardless of their feelings towards their creators or humanity at large. If they happen to be really nice (perhaps nicer than most humans would be, in a similar situation) they might be inclined to be nice or hand back some measure of control to their human creators after making their escape.

Far from being “behind” capabilities, it seems that alignment research has made great strides in recent years. OpenAI and Anthropic showed that Reinforcement Learning from Human Feedback (RLHF) can be used to turn ungovernable large language models into helpful and harmless assistants. Scalable oversight techniques like Constitutional AI and model-written critiques show promise for aligning the very powerful models of the future. And just this week, it was shown that efficient instruction-following language models can be trained purely with synthetic text generated by a larger RLHF’d model, thereby removing unsafe or objectionable content from the training data and enabling far greater control.


As far as I am aware, no current AI system, LLM-based or otherwise, is anywhere near capable enough to act autonomously in sufficiently general real-world contexts, such that it actually poses any kind of threat to humans on its own (even evaluating frontier models for this possibility requires giving them a lot of help). That kind of autonomous capability is where the extinction-level danger lies. It is (mostly) not about human misuse of AI systems, whether that misuse is intentional or adversarial (i.e. a human is deliberately trying to use the AI system to cause harm) or unintentional (i.e. the model is poorly trained or the system is buggy, resulting in harm that neither the user nor the AI system itself intended or wanted).

I think there's also a technical misunderstanding implied by this paragraph, of how the base model training process works and what the purpose of high-quality vs. diverse training material is. In particular, the primary purpose of removing "objectionable content" (and / or low-quality internet text) from the base model training process is to make the training process more efficient, and seems unlikely to accomplish anything alignment-relevant.

The reason is that the purpose of the base model training process is to build up a model which is capable of predicting the next token in a sequence of tokens which appears in the world somewhere, in full generality. A model which is actually human-level or smarter would (by definition) be capable of predicting, generating, and comprehending objectionable content, even if it had never seen such content during the training process. (See Is GPT-N bounded by human capabilities? No. for more.)

Using synthetic training data for the RLHF process is maybe more promising, but it depends on the degree to which RLHF works by imbuing the underlying model with the right values, vs. simply chiseling away all the bits of model that were capable of imagining and comprehending novel, unseen-in-training ideas in the first place (including objectionable ones, or ones we'd simply prefer the model not think about). Perhaps RLHF works more like the former mechanism, and as a result RLHF (or RLAIF) will "just work" as an alignment strategy, even as models scale to human-level and beyond.

Note that it is possible to gather evidence on this question as it applies to current systems, though I would caution against extrapolating such evidence very far. For example, are there any capabilities that a base model has before RLHF, which are not deliberately trained against during RLHF (e.g. generating objectionable content), which the final model is incapable of doing?

If, say, the RLHF process trains the model to refuse to generate sexually explicit content, and as a side effect, the RLHF'd model now does worse on answering questions about anatomy compared to the base model, that would be evidence that the RLHF process simply chiseled away the model's ability to comprehend important parts of the universe entirely, rather than imbuing it with a value against answering certain kinds of questions as intended.

I don't actually know how this particular experimental result would turn out, but either way, I wouldn't expect any trends or rules that apply to current AI systems to continue applying as those systems scale to human-level intelligence or above.

For my own part, I would like to see a pause on all kinds of AI capabilities research and hardware progress, at least until AI researchers are less confused about a lot of topics like this. As for how realistic that proposal is, whether it likely constitutes a rather permanent pause, or what the consequences of trying and failing to implement such a pause would be, I make no comment, other than to say that sometimes the universe presents you with an unfair, impossible problem.

OK. Simultaneously believing that and believing the truth of the original setup seems dangerously close to believing a contradiction.

But anyway, you don't really need all those stipulations to decide not to chop your legs off; just don't do that if you value your legs. (You also don't need FDT to see that you should defect against CooperateBot in a prisoner's dilemma, though of course FDT will give the same answer.)
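
The CooperateBot remark can be checked with a few lines of Python. This is only an illustrative sketch: the payoff numbers are the conventional textbook values, assumed here rather than taken from the discussion, and `cooperate_bot` and `best_response` are hypothetical names. Because CooperateBot's move does not depend on yours, a plain payoff comparison settles the question without any special decision theory.

```python
# Assumed, conventional prisoner's dilemma payoffs:
# (my_move, their_move) -> my payoff. "C" = cooperate, "D" = defect.
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}


def cooperate_bot() -> str:
    """Hypothetical opponent that cooperates unconditionally."""
    return "C"


def best_response(opponent) -> str:
    """Pick the move with the higher payoff against an opponent whose
    move is fixed, i.e. does not depend on what we choose."""
    their_move = opponent()
    return max(("C", "D"), key=lambda my_move: PAYOFF[(my_move, their_move)])


if __name__ == "__main__":
    print(best_response(cooperate_bot))  # D
```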

A couple of general points to keep in mind when dealing with thought experiments that involve thorny or exotic questions of (non-)existence:

  • "Entities that don't exist don't care that they don't exist" is vacuously true, for most ordinary definitions of non-existence. If you fail to exist as a result of your decision process, that's generally not a problem for you, unless you also have unusual preferences over or beliefs about the precise nature of existence and non-existence.[1]
  • If you make the universe inconsistent as a result of your decision process, that's also not a problem for you (or for your decision process). Though it may be a problem for the universe creator, which in the case of a thought experiment could be said to be the author of that thought experiment. 

    An even simpler view is that logically inconsistent universes don't actually exist at all - what would it even mean for there to be a universe (or even a thought experiment) in which, say, 1 + 2 = 4? Though if you accepted the simpler view, you'd probably also be a physicalist.

I continue to advise you to avoid confidently pontificating on decision theory thought experiments that directly involve non-existence, until you are more practiced at applying them correctly in ordinary situations.


  1. ^

    e.g. unless you're Carissa Sevar

So then chop your legs off if you care about maximizing your total amount of experience of being alive across the multiverse (though maybe check that your measure of such experience is well-defined before doing so), or don't chop them off if you care about maximizing the fraction of high-quality subjective experience of being alive that you have.

This seems more like an anthropics issue than a question where you need any kind of fancy decision theory though. It's probably better to start by understanding decision theory without examples that involve existence or not, since those introduce a bunch of weird complications about the nature of the multiverse and what it even means to exist (or fail to exist) in the first place.
