

The better reference class is adversarially mined examples for text models. Meta and other researchers were working on similar projects before Redwood started doing that line of research. is an example

I agree that's a good reference class. I don't think Redwood's project had identical goals, and would strongly disagree with someone saying it's duplicative. But other work is certainly also relevant, and ex post I would agree that other work in the reference class is comparably helpful for alignment.

Reader: evaluate your model's consistency for what counts as alignment research--does this mean non-x-risk-pilled Meta researchers do some alignment research, if we believe RR project constituted exciting alignment research too?

Of course! I'm a bit unusual amongst the EA crowd in how enthusiastic I am about "normal" robustness research, but I'm similarly unusual amongst the EA crowd in how enthusiastic I am about this proposed research direction for Redwood, and I suspect those things will typically go together.

Separately, I haven't seen empirical demonstrations that pursuing this line of research can have limited capabilities externalities or result in differential technological progress.

I'm still not convinced by this perspective. I would frame the situation as:

  • There's a task we really want future people to be good at---finding places where models behave in obviously-undesirable ways, and understanding the limitations of such evaluations and the consequences of training on adversarial inputs.
  • That task isn't obviously improving automatically with model capabilities; it seems like something that requires knowledge and individual+institutional expertise.
  • So maybe we should practice a lot to get better at that task, sharing what we learn and building a larger community of researchers and engineers with relevant experience.

Your objection sounds like: "That may be true but there's not a lot of evidence that this doesn't also make models more capable, which would be bad." And I don't find that very persuasive---I don't think there is such a strong default presumption that generic research accelerates capabilities enough to be a meaningful cost.

On the question of what generates differential technological progress, I think I'm comparably skeptical of all of the evidence on offer for claims of the form "doing research on X leads to differential progress on Y," and the best guide we have (both in alignment and in normal academic research!) is basically common-sense arguments along the lines of "investigating and practicing doing X tends to make you better at doing X."

I've argued about this point with Evan a few times but still don't quite understand his take. I'd be interested in more back and forth. My most basic objection is that the fine-tuning objective is also extremely simple---produce actions that will be rated highly, or even just produce outputs that get a low loss. If you have a picture of the training process, then all of these are just very simple things to specify, trivial compared to other differences in complexity between deceptive alignment and proxy alignment. (And if you don't yet have such a picture, then deceptive alignment also won't yield good performance.)

Yes, I think that's how people have used the terms historically. I think it's also generally good usage---the specific thing you talk about in the post is important and needs its own name.

Unfortunately, I think it is extremely often misinterpreted, and there is some chance we should switch to a term like "instrumental alignment" instead to avoid the general confusion with deception more broadly.

Because they lead to good performance on the pre-training objective (via deceptive alignment). I think a similarly big leap is needed to develop deceptive alignment during fine-tuning (rather than optimization directly for the loss). In both cases the deceptively aligned behavior is not cognitively similar to the intended behavior, but is plausibly simpler (with similar simplicity gaps in each case).

I don't know how common each view is. My guess would be that in the old days this was the more common view, but there's been a lot more discussion of deceptive alignment recently on LW.

I don't find the argument about "take actions with effects in the real world" --> "deceptive alignment" persuasive, and my current guess is that most people would also back off from that style of argument if they thought about the issues more thoroughly. Mostly though it seems like this will just get settled by the empirics.

Models that are only pre-trained almost certainly don’t have consequentialist goals beyond the trivial next token prediction.

If a model is deceptively aligned after fine-tuning, it seems most likely to me that it's because it was deceptively aligned during pre-training.

"Predict tokens well" and "Predict fine-tuning tokens well" seem like very similar inner objectives, so if you get the first one it seems like it will move quickly to the second one. Moving to the instrumental reasoning to do well at fine-tuning time seems radically harder. And generally it's quite hard for me to see real stories about why deceptive alignment would be significantly more likely at the second step than the first. 

(I haven't read your whole post yet, but I may share many of your objections to deceptive alignment first emerging during fine-tuning.)

I've gotten the vague vibe that people expect deceptive alignment to emerge during fine-tuning (and perhaps especially RL fine-tuning?) but I don't fully understand the alternative view. I think that "deceptively aligned during pre-training" is closer to e.g. Eliezer's historical views.

There is another form of deceptive alignment in which agents become more manipulative over time due to problems with training data and eventually optimize for reward, or something similar, directly.

I think "deceptive alignment" refers only to situations where the model gets a high reward at training for instrumental reasons. This is a source of a lot of confusion (and should perhaps be called "instrumental alignment") but worth trying to be clear about.

I might be misunderstanding what you are saying here. I think the post you link doesn't use the term "deceptive alignment" at all so am a bit confused about the cite. (It uses the term "playing the training game" for all models that understand what is happening in training and are deliberately trying to get a low loss, which does include both deceptively aligned models and models that intrinsically value reward or something sufficiently robustly correlated.)

Rather, I don't think that GPUs performing parallel searches through a probabilistic word space by themselves are likely to support consciousness.

This seems like the crux. It feels like a big neural network run on a GPU, trained to predict the next word, could definitely be conscious. So to me this is just a question about the particular weights of large language models, not something that can be established a priori based on architecture.

It seems reasonable to guess that modern language models aren't conscious in any morally relevant sense. But it seems odd to use that as the basis for a reductio of arguments about consciousness, given that we know nothing about the consciousness of language models.

Put differently: if a line of reasoning would suggest that language models are conscious, then I feel like the main update should be about consciousness of language models rather than about the validity of the line of reasoning. If you think that e.g. fish are conscious based on analysis of their behavior rather than evolutionary analogies with humans, then I think you should apply the same reasoning to ML systems.

I don't think that biological brains are plausibly necessary for consciousness. It seems extremely likely to me that a big neural network can in principle be conscious without adding any of these bells or whistles, and it seems clear that SGD could find conscious models. 

I don't think the fact that language models say untrue things shows they have no representation of the world (in fact for a pre-trained model that would be a clearly absurd inference---they are trained to predict what someone else would say and then sample from that distribution, which will of course lead to confidently saying false things when the predicted-speaker can know things the model does not!)

That all said, I think it's worth noting and emphasizing that existing language models' statements about their own consciousness are not evidence that they are conscious, and that more generally the relationship between a language model's inner life and its utterances is completely unlike the relationship between a human's inner life and their utterances (because they are trained to produce these utterances by mimicking humans, and they would make similar utterances regardless of whether they are conscious). A careful analysis of how models generalize out of distribution, or of surprisingly high accuracy on some kinds of prediction tasks, could provide evidence of consciousness, but we don't have that kind of evidence right now.

I don't think those objections to offsetting really apply to demand offsetting. If I paid someone for a high-welfare egg, I shouldn't think about my action as bringing an unhappy hen into existence and then "offsetting" that by making it better off. And that would be true even if I paid someone for a high-welfare egg, but then swapped my egg with someone else's normal egg. And by the same token if I pay someone to sell a high-welfare egg on the market labeled as a normal egg, and then buy a normal egg from the market, I haven't increased the net demand for normal eggs at all and so am not causally responsible for any additional factory-farmed hens.
