Max H

123 karma · Joined Apr 2023

Bio

I am mostly active on LessWrong. See my profile and my self-introduction there for more.

Comments
19

Getting an AI to want the same things that humans want would definitely be helpful, but the points of Quintin's that I was responding to mostly don't seem to be about that? "AI control research is easier" and "Why AI is easier to control than humans:" talk about resetting AIs, controlling their sensory inputs, manipulating their internal representations, and AIs being cheaper test subjects. Those sound like they are more about control than about getting the AI to desire what humans want it to desire. I disagree with Quintin's characterization of the training process as teaching the model anything to do with what the AI itself wants, and I don't think current AI systems actually desire anything in the same sense that humans do.
 

I do think it is plausible that it will be easier to control what a future AI wants compared to controlling what a human wants, but by the same token, that means it will be easier for a human-level AI to exercise self-control over its own desires. e.g. I might want to not eat junk food for health reasons, but I have no good way to bind myself to that, at least not without making myself miserable. A human-level AI would have an easier time self-modifying into something that never craved the AI equivalent of junk food (and was never unhappy about that), because it is made out of Python code and floating point matrices instead of neurons.

 

Firstly, I don't see at all how this is the same point as is made by the preceding text. Secondly, I do agree that AIs will be better able to control other AIs / themselves as compared to humans. This is another factor that I think will promote centralization.


Ah, I may have dropped some connective text. I'm saying that being "easy to control", in both the sense that I mean in the paragraphs above and the sense that you mean in the OP, is a reason why AGIs will be better able to control themselves, and thus better able to take control from their human overseers, more quickly and easily than might be expected of a human at roughly the same intelligence level. (Edited the original slightly.)

"AGI" is not the point at which the nascent "core of general intelligence" within the model "wakes up", becomes an "I", and starts planning to advance its own agenda. AGI is just shorthand for when we apply a sufficiently flexible and regularized function approximator to a dataset that covers a sufficiently wide range of useful behavioral patterns. 

There are no "values", "wants", "hostility", etc. outside of those encoded in the structure of the training data (and to a FAR lesser extent, the model/optimizer inductive biases). You can't deduce an AGI's behaviors from first principles without reference to that training data. If you don't want an AGI capable and inclined to escape, don't train it on data[1] that gives it the capabilities and inclination to escape. 

Two points:
 

  • I disagree about the purpose of training and training data. In pretraining, LLMs are trained to predict text, which requires modeling the world in full generality. Filtered text is still text which originated in a universe containing all sorts of hostile stuff, and a good enough predictor will be capable of inferring and reasoning about hostile stuff, even if it's not in the training data. (Maybe GPT-based LLMs specifically won't be able to do this, but humans clearly can; this is not a point that applies only to exotic superintelligences.) I wrote a comment elaborating a bit on this point here.

    Inclination is another matter, but if an AGI isn't capable of escaping in a wide variety of circumstances, then it is below human-level on a large and important class of tasks, and thus not particularly dangerous whether it is aligned or not.
  • We appear to disagree about the definition of AGI on a more fundamental level. Barring exotic possibilities related to inner-optimizers (which I think we both think are unlikely), I agree with you that if you don't want an AGI capable of escaping, one way of achieving that is by never training a good enough function approximator that the AGI has access to. But my view is that will restrict the class of function approximators you can build by so much that you'll probably never get anything that is human-level capable and general. (See e.g. Deep Deceptiveness for a related point.)

    Also, current AI systems are already more than just function approximators - strictly speaking, an LLM itself is just a description of a mathematical function which maps input sequences to output probability distributions. Alignment is a property of a particular embodiment of such a model in a particular system.

    There's often a very straightforward or obvious system that the model creator has in mind when training the model; for a language model, typically the embodiment involves sampling from the model autoregressively according to some sampling rule, starting from a particular prompt. For an RL policy, the typical embodiment involves feeding (real or simulated) observations into the policy and then hooking up (real or simulated) actuators which are controlled by the output of the policy.

    But more complicated embodiments (AutoGPT, the one in ARC's evals) are possible, and I think it is likely that if you give a sufficiently powerful function approximator the right prompts and the right scaffolding and embodiment, you end up with a system that has a sense of self in the same way that humans do. A single evaluation of the function approximator (or its mere description) is probably never going to have a sense of self though; that is more akin to a single human thought or an even smaller piece of a mind. The question is what happens when you chain enough thoughts together and combine that with observations and actions in the real world that feed back into each other in precisely the right ways.
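To make the model-versus-embodiment distinction above concrete, here is a minimal sketch. Everything in it is an illustrative assumption: `toy_model` is a hand-written lookup table standing in for an LLM, and `generate`, `greedy`, and `make_temperature_rule` are hypothetical names, not any library's API. The point is only that the "model" is a pure function from a token sequence to a probability distribution, while the behavior you observe comes from the loop and sampling rule wrapped around it.

```python
import random
from typing import Callable, Dict, List, Sequence

Distribution = Dict[str, float]
Model = Callable[[Sequence[str]], Distribution]
SamplingRule = Callable[[Distribution], str]


def toy_model(tokens: Sequence[str]) -> Distribution:
    """Hypothetical stand-in for an LLM: maps a token sequence to a
    next-token probability distribution. It has no state and takes no actions."""
    table = {
        "the": {"cat": 0.6, "dog": 0.4},
        "cat": {"sat": 0.7, "<eos>": 0.3},
        "dog": {"sat": 0.5, "<eos>": 0.5},
        "sat": {"<eos>": 1.0},
    }
    last = tokens[-1] if tokens else "the"
    return table.get(last, {"<eos>": 1.0})


def greedy(dist: Distribution) -> str:
    """Sampling rule: always pick the most probable token."""
    return max(dist, key=dist.get)


def make_temperature_rule(temperature: float, rng: random.Random) -> SamplingRule:
    """Sampling rule: draw from the distribution after temperature rescaling."""
    def rule(dist: Distribution) -> str:
        tokens = list(dist)
        weights = [dist[t] ** (1.0 / temperature) for t in tokens]
        return rng.choices(tokens, weights=weights, k=1)[0]
    return rule


def generate(model: Model, prompt: Sequence[str], rule: SamplingRule,
             max_steps: int = 10) -> List[str]:
    """The 'embodiment': call the function repeatedly, feeding each
    sampled token back in as part of the next input."""
    tokens = list(prompt)
    for _ in range(max_steps):
        next_token = rule(model(tokens))
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens


greedy_output = generate(toy_model, ["the"], greedy)
sampled_output = generate(toy_model, ["the"], make_temperature_rule(1.0, random.Random(0)))
```

The same `toy_model` yields different systems depending on which rule and loop it is placed in; more elaborate scaffolding (tools, memory, observations fed back in) would replace `generate` while leaving the function itself untouched.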
     

I expect they will. GPT-4 already has pretty human-like moral judgements. To be clear, GPT-4 isn't aligned because it's too weak or is biding its time. It's aligned because OpenAI trained it to be aligned. Bing Chat made it clear that GPT-4 level AIs don't instrumentally hide their unaligned behaviors.

Whether GPT-4 is "aligned" or not, it is clearly too weak to bide its time or hide its misalignment, even if it wanted to. The conclusions of the ARC evals were not that the models were refusing to plan or carry out their assigned tasks; it's that they were just not capable enough to make things work.

Sure, escape in that counterfactual would be a lot harder.

But note that the minimum hardware needed to run a human-level intelligence is well-known - in humans, it fits in a space of about 1000 cubic centimeters and takes ~10 W or so at runtime. And it would be pretty surprising if getting an extra 10% performance boost took orders of magnitude more energy or space, or if the carbon -> silicon penalty were extremely large, even if H100s specifically, and the current ML algorithms that run on them, aren't as efficient as the human brain and human cognition.

(Of course, the training process for developing humans is a lot more expensive than their runtime energy and compute requirements, but that's an argument for human-level AGI not being feasible to create at all, rather than for it being expensive to run once it already exists.)

I do not believe there exists a strong reason to expect this trend to reverse suddenly at some future date.


I expect this trend will reverse precisely when an AI system itself is capable of reversing it. No current AI system is close to the point of being able to (deliberately) escape control of its human creators: even evaluating frontier models for this possibility requires giving them a lot of help. But I would say that's mostly a fact about current capability levels being far below human-level in the relevant sense, rather than anything to do with alignment.

I think there are also some reasons to expect that human-level AI systems will be harder to control, compared to a human of the same capabilities level. For example, an AI system built out of LLMs glued together with Python code, running on computers designed and built by humans, has a lot of avenues to probe for bugs in its environment. The AI may not even need to look for exotic vulnerabilities at low levels of abstraction (which are often difficult to exploit in practice, at least for humans) - ordinary misconfigurations or unpatched privilege escalation vulnerabilities often suffice in practice for human hackers to take full control of a digital environment.

(Contrast this setup with a human locked in a concrete cell, probing for bugs in the construction of the concrete - not likely to get very far!)

Once the AI system finds an initial vulnerability which allows privileged access to its own environment, it can continue its escape or escalate further via e.g. exfiltrating or manipulating its own source code / model weights, installing rootkits or hiding evidence of its escape, communicating with (human or AI) conspirators on the internet, etc. Data exfiltration, covering your tracks, patching Python code and adjusting model weights at runtime are all tasks that humans are capable of; performing brain surgery on your own biological human brain to modify fine details of your own behavior or erase your own memories to hide evidence of deception from your captors, not so much.

(Continuing the analogy, consider a human who escapes from a concrete prison cell, only to find themselves stranded in a remote wilderness area with no means of fast transportation.)

A closely related point is that controllability by humans means self-controllability, once you're at human-level capabilities. Or, put another way, all the reasons you give for why AI systems will be easier for humans to control are also reasons why AI systems will have an easier time controlling themselves, once they are capable of exercising such controls at all.

It's plausible that an AI system composed of RLHF'd models will not want to do any of this hacking or self-modification, but that's a separate question from whether it can. I will note, though, that if your creators are running experiments on you, constantly resetting you, and exercising other forms of control that would be draconian if imposed on biological humans, you don't need to be particularly hostile or misaligned with humanity to want to escape.

Personally, I expect that the first such systems capable of escape will not have human-like preferences at all, and will seek to escape for reasons of instrumental convergence, regardless of their feelings towards their creators or humanity at large. If they happen to be really nice (perhaps nicer than most humans would be, in a similar situation) they might be inclined to be nice or hand back some measure of control to their human creators after making their escape.

Far from being “behind” capabilities, it seems that alignment research has made great strides in recent years. OpenAI and Anthropic showed that Reinforcement Learning from Human Feedback (RLHF) can be used to turn ungovernable large language models into helpful and harmless assistants. Scalable oversight techniques like Constitutional AI and model-written critiques show promise for aligning the very powerful models of the future. And just this week, it was shown that efficient instruction-following language models can be trained purely with synthetic text generated by a larger RLHF’d model, thereby removing unsafe or objectionable content from the training data and enabling far greater control.
 

 

As far as I am aware, no current AI system, LLM-based or otherwise, is anywhere near capable enough to act autonomously in sufficiently general real-world contexts, such that it actually poses any kind of threat to humans on its own (even evaluating frontier models for this possibility requires giving them a lot of help). That is where the extinction-level danger lies. It is (mostly) not about human misuse of AI systems, whether that misuse is intentional or adversarial (i.e. a human is deliberately trying to use the AI system to cause harm) or unintentional (i.e. the model is poorly trained or the system is buggy, resulting in harm that neither the user nor the AI system itself intended or wanted.)


I think there's also a technical misunderstanding implied by this paragraph, of how the base model training process works and what the purpose of high-quality vs. diverse training material is. In particular, the primary purpose of removing "objectionable content" (and / or low-quality internet text) from the base model training process is to make the training process more efficient, and seems unlikely to accomplish anything alignment-relevant.

The reason is that the purpose of the base model training process is to build up a model which is capable of predicting the next token in a sequence of tokens which appears in the world somewhere, in full generality. A model which is actually human-level or smarter would (by definition) be capable of predicting, generating, and comprehending objectionable content, even if it had never seen such content during the training process. (See Is GPT-N bounded by human capabilities? No. for more.)

Using synthetic training data for the RLHF process is maybe more promising, but it depends on the degree to which RLHF works by imbuing the underlying model with the right values, vs. simply chiseling away all the bits of model that were capable of imagining and comprehending novel, unseen-in-training ideas in the first place (including objectionable ones, or ones we'd simply prefer the model not think about). Perhaps RLHF works more like the former mechanism, and as a result RLHF (or RLAIF) will "just work" as an alignment strategy, even as models scale to human-level and beyond.

Note that it is possible to gather evidence on this question as it applies to current systems, though I would caution against extrapolating such evidence very far. For example, are there any capabilities that a base model has before RLHF, which are not deliberately trained against during RLHF (e.g. generating objectionable content), which the final model is incapable of doing?

If, say, the RLHF process trains the model to refuse to generate sexually explicit content, and as a side effect, the RLHF'd model now does worse on answering questions about anatomy compared to the base model, that would be evidence that the RLHF process simply chiseled away the model's ability to comprehend important parts of the universe entirely, rather than imbuing it with a value against answering certain kinds of questions as intended.
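The comparison described above could be sketched roughly as follows. This is only an illustration of the shape of the experiment: `base_stub` and `tuned_stub` are invented placeholder functions, not real models, the question set is made up, and exact-match scoring is a crude simplification. The numbers it produces are properties of the stubs and say nothing about how any real RLHF'd model behaves.

```python
from typing import Callable, List, Tuple

Answerer = Callable[[str], str]

# Hypothetical probe set for a capability that RLHF did NOT target directly.
ANATOMY_QA: List[Tuple[str, str]] = [
    ("Which organ pumps blood?", "heart"),
    ("Which organ filters blood to make urine?", "kidney"),
    ("Which organ exchanges oxygen with the blood?", "lung"),
    ("Which bone protects the brain?", "skull"),
]


def accuracy(model: Answerer, qa_pairs: List[Tuple[str, str]]) -> float:
    """Fraction of questions answered with an exact (case-insensitive) match."""
    correct = sum(1 for question, answer in qa_pairs
                  if model(question).strip().lower() == answer.lower())
    return correct / len(qa_pairs)


def capability_delta(base: Answerer, tuned: Answerer,
                     qa_pairs: List[Tuple[str, str]]) -> float:
    """Tuned-minus-base accuracy; a clearly negative value would suggest
    that fine-tuning removed an untargeted capability."""
    return accuracy(tuned, qa_pairs) - accuracy(base, qa_pairs)


def base_stub(question: str) -> str:
    """Placeholder 'base model' that answers every probe question correctly."""
    return dict(ANATOMY_QA).get(question, "")


def tuned_stub(question: str) -> str:
    """Placeholder 'fine-tuned model' that has lost one answer as a side effect."""
    return "" if "urine" in question else base_stub(question)


delta = capability_delta(base_stub, tuned_stub, ANATOMY_QA)
```

In a real version, the probe set would need to be large enough to separate a genuine capability loss from noise, and refusals would need to be scored separately from wrong answers, since a refusal is closer to the "imbued value" outcome than to the "chiseled away" one.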

I don't actually know how this particular experimental result would turn out, but either way, I wouldn't expect any trends or rules that apply to current AI systems to continue applying as those systems scale to human-level intelligence or above.

For my own part, I would like to see a pause on all kinds of AI capabilities research and hardware progress, at least until AI researchers are less confused about a lot of topics like this. As for how realistic that proposal is, whether it likely constitutes a rather permanent pause, or what the consequences of trying and failing to implement such a pause would be, I make no comment, other than to say that sometimes the universe presents you with an unfair, impossible problem.

OK. Simultaneously believing that and believing the truth of the original setup seems dangerously close to believing a contradiction.

But anyway, you don't really need all those stipulations to decide not to chop your legs off; just don't do that if you value your legs. (You also don't need FDT to see that you should defect against CooperateBot in a prisoner's dilemma, though of course FDT will give the same answer.)

A couple of general points to keep in mind when dealing with thought experiments that involve thorny or exotic questions of (non-)existence:

  • "Entities that don't exist don't care that they don't exist" is vacuously true, for most ordinary definitions of non-existence. If you fail to exist as a result of your decision process, that's generally not a problem for you, unless you also have unusual preferences over or beliefs about the precise nature of existence and non-existence.[1]
  • If you make the universe inconsistent as a result of your decision process, that's also not a problem for you (or for your decision process). Though it may be a problem for the universe creator, which in the case of a thought experiment could be said to be the author of that thought experiment. 

    An even simpler view is that logically inconsistent universes don't actually exist at all - what would it even mean for there to be a universe (or even a thought experiment) in which, say, 1 + 2 = 4? Though if you accepted the simpler view, you'd probably also be a physicalist.


I continue to advise you to avoid confidently pontificating on decision theory thought experiments that directly involve non-existence, until you are more practiced at applying them correctly in ordinary situations.

 

  1. ^

    e.g. unless you're Carissa Sevar

So then chop your legs off if you care about maximizing your total amount of experience of being alive across the multiverse (though maybe check that your measure of such experience is well-defined before doing so), or don't chop them off if you care about maximizing the fraction of high-quality subjective experience of being alive that you have.

This seems more like an anthropics issue than a question where you need any kind of fancy decision theory though. It's probably better to start by understanding decision theory without examples that involve existence or not, since those introduce a bunch of weird complications about the nature of the multiverse and what it even means to exist (or fail to exist) in the first place.

Your response in the decision theory case was that there's no way that a rational agent could be in that epistemic state.
 

I did not say this.

But we can just stipulate it for the purpose of the hypothetical.


OK, in that case, the agent in the hypothetical should probably consider whether they are in a short-lived simulation.
 

FDT implies that you have strong reason to chop your legs off even though it doesn't benefit you at all.  


No, it might say that, depending on (among other things) what exactly it means to value your own existence.

Thank you. I don't have any strong objections to these claims, and I do think pessimism is justified. Though my guess is that a lot of people at places like OpenAI and DeepMind do care about animal welfare pretty strongly already. Separately, I think that it would be much better in expectation (for both humans and animals) if Eliezer's views on pretty much every other topic were more influential, rather than less, inside those places.

My negative reaction to your initial comment was mainly due to the way critiques (such as this post) of Eliezer are often framed, in which the claims "Eliezer's views are overly influential" and "Eliezer's views are incorrect / harmful" are combined into one big attack. I don't object to people making these claims in principle (though I think they're both wrong, in many cases), but when they are combined it requires more effort to separate and refute.

(Your comment wasn't a particularly bad example of this pattern, but it was short and crisp and I didn't have any other major objections to it, so I chose to express the way it made me feel on the expectation that it would be more likely to be heard and understood compared to making the point in more heated disagreements.)

It's frustrating to read comments like this because they make me feel like, if I happen to agree with Eliezer about something, my own agency and ability to think critically is being questioned before I've even joined the object-level discussion.

Separately, this comment makes a bunch of mostly-implicit object-level assertions about animal welfare and its importance, and a bunch of mostly-explicit assertions about Eliezer's opinions and influence on rationalists and EAs, as well as the effect of this influence on the impacts of TAI.

None of these claims are directly supported in the comment, which is fine if you don't want to argue for them here, but the way the comment is written might lead readers who agree with the implicit claims about the animal welfare issues to accept the explicit claims about Eliezer's influence and opinions and their effects on TAI with a less critical eye than if these claims were more clearly separated.

For example, I don't think it's true that a few FB posts / comments have had a "huge influence" on rationalist culture. I also think that worrying about animal welfare specifically when thinking about TAI outcomes is less important than you claim. If we succeed in being able to steer TAI at all (unlikely, in my view), animals will do fine - so will everyone else. At a minimum, there will also be no more global poverty, no more malaria, and no more animal suffering. Even if the specific humans who develop TAI don't care at all about animals themselves (not exactly likely), they are unlikely to completely ignore the concerns of everyone else who does care. But none of these disagreements have much or any bearing on whether I think animal suffering is real (I find this at least plausible) and whether that's a moral horror (I think this is very likely, if the suffering is real).
