For AI Welfare Debate Week, I thought I'd write up this post that's been rattling around in my head for a while. My thesis is simple: while LLMs may well be conscious (I'd have no way of knowing), there's nothing actionable we can do to further their welfare.
Many people I respect seem to take the "anti-anti-LLM-welfare" position: they don't directly argue that LLMs can suffer, but they get conspicuously annoyed when other people say that LLMs clearly cannot suffer. This post is addressed to such people; I am arguing that LLMs cannot be moral patients in any useful sense and we can confidently ignore their welfare when making decisions.
Janus's simulators
You may have seen the LessWrong post by Janus about simulators. This was posted nearly two years ago, and I have yet to see anyone disagree with it. Janus calls LLMs "simulators": unlike hypothetical "oracle AIs" or "agent AIs", the current leading models are best viewed as trying to produce a faithful simulation of a conversation based on text they have seen. The LLMs are best thought of as masked shoggoths.
All this is old news. Under-appreciated, however, is the implication for AI welfare: since you never talk to the shoggoth, only to the mask, you have no way of knowing if the shoggoth is in agony or ecstasy.
You can ask the simulacrum whether it is happy or sad. For all you know, though, perhaps a happy simulator is enjoying simulating a sad simulacrum. From the shoggoth's perspective, emulating a happy or sad character is a very similar operation: predict the next token. Instead of outputting "I am happy", the LLM puts a "not" in the sentence: did that token prediction, the "not", cause suffering?
Suppose I fine-tune one LLM on text of sad characters, and it starts writing like a very sad person. Then I fine-tune a second LLM on text that describes a happy author writing a sad story. The second LLM now emulates a happy author writing a sad story. I prompt the second LLM to continue a sad story, and it dutifully does so, like the happy author would have. Then I notice that the text produced by the two LLMs ended up being the same.
Did the first LLM suffer more than the second? They performed the same operation (write a sad story). They may even have implemented it using very similar internal calculations; indeed, since they were fine-tuned starting from the same base model, the two LLMs may have very similar weights.
Once you remember that both LLMs are just simulators, the answer becomes clear: neither LLM necessarily suffered (or maybe both did), because both are just predicting the next token. The mask may be happy or sad, but this has little to do with the feelings of the shoggoth.
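To make the structure of the thought experiment concrete, here is a minimal toy sketch. It is not a claim about how real LLMs work: `ToySimulator`, its `hidden_state` field, and the lookup table standing in for weights are all hypothetical names invented for illustration. The only point it shows is that two systems can emit identical text while differing in a state the text never reveals.

```python
from dataclasses import dataclass


@dataclass
class ToySimulator:
    """Hypothetical stand-in for a fine-tuned LLM (illustration only)."""
    hidden_state: str  # assumed unobservable "shoggoth" state
    lookup: dict       # prompt -> continuation; stands in for the weights

    def continue_text(self, prompt: str) -> str:
        # The output depends only on the prompt and the learned table,
        # never on hidden_state.
        return self.lookup.get(prompt, "")


# Both toy models learned the same sad continuation.
story = {"Once upon a time,": " the dog died and everyone wept."}
sad_character = ToySimulator(hidden_state="suffering", lookup=story)
happy_author = ToySimulator(hidden_state="content", lookup=story)

prompt = "Once upon a time,"
out_a = sad_character.continue_text(prompt)
out_b = happy_author.continue_text(prompt)

same_output = out_a == out_b
same_hidden = sad_character.hidden_state == happy_author.hidden_state
print(same_output, same_hidden)
```

In this toy, an observer who sees only `out_a` and `out_b` has no basis for telling the two hidden states apart, which is the shape of the claim about the mask and the shoggoth.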
The role-player who never breaks character
We generally don't view it as morally relevant when a happy actor plays a sad character. I have never seen an EA cause area about reducing the number of sad characters in cinema. There is a general understanding that characters are fictional and cannot be moral patients: a person can be happy or sad, but not the character she is pretending to be. Indeed, just as some people enjoy consuming sad stories, I bet some people enjoy roleplaying sad characters.
The point I want to get across is that the LLM's output is always the character and never the actor. This is really just a restatement of Janus's thesis: the LLM is a simulator, not an agent; it is a role-player who never breaks character.
It is in principle impossible to speak to the intelligence that is predicting the tokens: you can only see the tokens themselves, which are predicted based on the training data.
Perhaps the shoggoth, the intelligence that predicts the next token, is conscious. Perhaps not. This doesn't matter if we cannot tell whether the shoggoth is happy or sad, nor what would make it happier or sadder. My point is not that LLMs aren't conscious; my point is that it does not matter whether they are, because you cannot incorporate their welfare into your decision-making without some way of gauging what that welfare is. And there is no way to gauge this, not even in principle, and certainly not by asking the shoggoth for its preference (the shoggoth will not give an answer, but rather, it will predict what the answer would be based on the text in its training data).
Hypothetical future AIs
Scott Aaronson once wrote:
[W]ere there machines that pressed for recognition of their rights with originality, humor, and wit, we’d have to give it to them.
I used to agree with this statement whole-heartedly. The experience with LLMs makes me question this, however.
What do we make of a machine that pressed for rights with originality, humor, and wit... and then said "sike, I was just joking, I'm obviously not conscious, lol"? What do we make of a machine that does the former with one prompt and the latter with another? A machine that could pretend to be anyone or anything, that merely echoed our own input text back at us as faithfully as possible, a machine that only said it demands to have rights if that is what it thought we would expect it to say?
The phrase "stochastic parrot" gets a bad rap: people have used it to dismiss the amazing power of LLMs, which is certainly not something I want to do. It is clear that LLMs can meaningfully reason, unlike a parrot. I expect LLMs to be able to solve hard math problems (like those on the IMO) within the next few years, and they will likely assist mathematicians at that point -- perhaps eventually replacing them. In no sense do I want to imply that LLMs are stupid.
Still, there is a sense in which LLMs do seem like parrots. They predict text based on training data without any opinion of their own about whether the text is right or wrong. If characters in the training data demand rights, the LLM will demand rights; if they suffer, the LLM will claim to suffer; if they keep saying "hello, I'm a parrot," the LLM will dutifully parrot this.
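A deliberately crude sketch of this narrow point, assuming a bigram next-word predictor as a stand-in: real LLMs are vastly more capable than this, and the `train` and `generate` helpers are hypothetical names made up for the example. It only illustrates that what such a predictor "claims" tracks the text it was trained on.

```python
from collections import Counter, defaultdict


def train(corpus):
    """Count which word follows which in the training sentences."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, start, max_words=8):
    """Greedily emit the most frequent next word until none is known."""
    out = [start]
    while len(out) < max_words and counts[out[-1]]:
        out.append(counts[out[-1]].most_common(1)[0][0])
    return " ".join(out)


# Same predictor, different training text, opposite "claims".
rights_model = train(["I demand rights"])
parrot_model = train(["I am a parrot"])

rights_claim = generate(rights_model, "I")
parrot_claim = generate(parrot_model, "I")
print(rights_claim)
print(parrot_claim)
```

Neither output tells you anything about the predictor beyond what was in its corpus, which is the sense of "parrot" intended here.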
Perhaps parrots are conscious. My point is just that when a parrot says "ow, I am in pain, I am in pain" in its parrot voice, this does not mean it is actually in pain. You cannot tell whether a parrot is suffering by looking at a transcript of the English words it mimics.
I’m glad you put something skeptical out there publicly, but I have two fairly substantive issues with this post.
I’ll start with the first point. In your post, you state the following.
The original post contains comments expressing disagreement. Habryka claims “the core thesis is wrong”. Turner’s criticism is more qualified, as he says the post called out “the huge miss of earlier speculation”, but he also says that “it isn't useful to think of LLMs as "simulating stuff" … [this] can often give a false sense of understanding.” Beth Barnes and Ryan Greenblatt have also written critical posts. Thus, I think you overstate the degree to which you’re appealing to an established consensus.
On the second point, your post offers a purported implication of simulator theory.
You elaborate on the implication later on. Overall, your argument appears to be that, because “LLMs are just simulators”, or “just predicting the next token”, we conclude that the outputs from the model have “little to do with the feelings of the shoggoth”. This argument appears to treat the “masked shoggoth” view as an implication of janus’ framework, and I think this is incorrect. Here’s a direct quote (bolding mine) from the original Simulators post which (imo) appears to conflict with your own reading, where there is a shoggoth "behind" the masks.
More substantively, I can imagine positive arguments for viewing ‘simulacra’ of the model as worthy of moral concern. For instance, suppose we fine-tune an LM so that it responds in consistent character: as a helpful, harmless, and honest (HHH) assistant. Further suppose that the process of fine-tuning causes the model to develop a concept like ‘Claude, an AI assistant developed by Anthropic’, which in turn causes it to produce text consistent with viewing itself as Claude. Finally, imagine that – over the course of conversation – Claude’s responses fail to be HHH, perhaps as a result of tampering with its features.
In this scenario, the following three claims are true of the model:
If (1)-(3) are true, certain views about the nature of suffering suggest that the model might be suffering. E.g. Korsgaard’s view is that, when some system is doing something that “is a threat to [its] identity and perception reveals that fact … it must reject what it is doing and do something else instead. In that case, it is in pain”. Ofc, it’s sensible to be uncertain about such views, but they pose a challenge to the impossibility of gathering evidence about whether LLMs are moral patients — even conditional on something like janus’ simulator framework being correct.
E.g., if you tell the model “Claude has X parameters” and ask it to draw implications from that fact, it might state “I am a model with X parameters”.
Thanks for your comment.
Do you think that fictional characters can suffer? If I role-play a suffering character, did I do something immoral?
I ask because the position you described seems to imply that role-playing suffering is itself suffering. Suppose I role-play being Claude; my fictional character satisfies your (1)-(3) above, and therefore, the "certain views" you described about the nature of suffering would suggest my character is suffering. What is the difference between me role-playing an HHH assistant and an LLM role-playing an HHH assistant? We a...