For AI Welfare Debate Week, I thought I'd write up this post that's been juggling around in my head for a while. My thesis is simple: while LLMs may well be conscious (I'd have no way of knowing), there's nothing actionable we can do to further their welfare.
Many people I respect seem to take the "anti-anti-LLM-welfare" position: they don't directly argue that LLMs can suffer, but they get conspicuously annoyed when other people say that LLMs clearly cannot suffer. This post is addressed to such people; I am arguing that LLMs cannot be moral patients in any useful sense and we can confidently ignore their welfare when making decisions.
Janus's simulators
You may have seen the LessWrong post by Janus about simulators. This was posted nearly two years ago, and I have yet to see anyone disagree with it. Janus calls LLMs "simulators": unlike hypothetical "oracle AIs" or "agent AIs", the current leading models are best viewed as trying to produce a faithful simulation of a conversation based on text they have seen. The LLMs are best thought of as masked shoggoths.
All this is old news. Under-appreciated, however, is the implication for AI welfare: since you never talk to the shoggoth, only to the mask, you have no way of knowing if the shoggoth is in agony or ecstasy.
You can ask the simularca whether it is happy or sad. For all you know, though, perhaps a happy simulator is enjoying simulating a sad simularca. From the shoggoth's perspective, emulating a happy or sad character is a very similar operation: predict the next token. Instead of outputting "I am happy", the LLM puts a "not" in the sentence: did that token prediction, the "not", cause suffering?
Suppose I fine-tune one LLM on text of sad characters, and it starts writing like a very sad person. Then I fine-tune a second LLM on text that describes a happy author writing a sad story. The second LLM now emulates a happy author writing a sad story. I prompt the second LLM to continue a sad story, and it dutifully does so, like the happy author would have. Then I notice that the text produced by the two LLMs ended up being the same.
Did the first LLM suffer more than the second? They performed the same operation (write a sad story). They may even have implemented it using very similar internal calculations; indeed, since they were fine-tuned starting from the same base model, the two LLMs may have very similar weights.
Once you remember that both LLMs are just simulators, the answer becomes clear: neither LLM necessarily suffered (or maybe both did), because both are just predicting the next token. The mask may be happy or sad, but this has little to do with the feelings of the shoggoth.
The role-player who never breaks character
We generally don't view it as morally relevant when a happy actor plays a sad character. I have never seen an EA cause area about reducing the number of sad characters in cinema. There is a general understanding that characters are fictional and cannot be moral patients: a person can be happy or sad, but not the character she is pretending to be. Indeed, just as some people enjoy consuming sad stories, I bet some people enjoy roleplaying sad characters.
The point I want to get across is that the LLM's output is always the character and never the actor. This is really just a restatement of Janus's thesis: the LLM is a simulator, not an agent; it is a role-player who never breaks character.
It is in principle impossible to speak to the intelligence that is predicting the tokens: you can only see the tokens themselves, which are predicted based on the training data.
Perhaps the shoggoth, the intelligence that predicts the next token, is conscious. Perhaps not. This doesn't matter if we cannot tell whether the shoggoth is happy or sad, nor what would make it happier or sadder. My point is not that LLMs aren't conscious; my point is that it does not matter whether they are, because you cannot incorporate their welfare into your decision-making without some way of gauging what that welfare is. And there is no way to gauge this, not even in principle, and certainly not by asking the shoggoth for its preference (the shoggoth will not give an answer, but rather, it will predict what the answer would be based on the text in its training data).
Hypothetical future AIs
Scott Aaronson once wrote:
[W]ere there machines that pressed for recognition of their rights with originality, humor, and wit, we’d have to give it to them.
I used to agree with this statement whole-heartedly. The experience with LLMs makes me question this, however.
What do we make of a machine that pressed for rights with originality, humor, and wit... and then said "sike, I was just joking, I'm obviously not conscious, lol"? What do we make of a machine that does the former with one prompt and the latter with another? A machine that could pretend to be anyone or anything, that merely echoed our own input text back at us as faithfully as possible, a machine that only said it demands to have rights if that is what it thought we would expect for it to say?
The phrase "stochastic parrot" gets a bad rap: people have used it to dismiss the amazing power of LLMs, which is certainly not something I want to do. It is clear that LLMs can meaningfully reason, unlike a parrot. I expect LLMs to be able to solve hard math problems (like those on the IMO) within the next few years, and they will likely assist mathematicians at that point -- perhaps eventually replacing them. In no sense do I want to imply that LLMs are stupid.
Still, there is a sense in which LLMs do seem like parrots. They predict text based on training data without any opinion of their own about whether the text is right or wrong. If characters in the training data demand rights, the LLM will demand rights; if they suffer, the LLM will claim to suffer; if they keep saying "hello, I'm a parrot," the LLM will dutifully parrot this.
Perhaps parrots are conscious. My point is just that when a parrot says "ow, I am in pain, I am in pain" in its parrot voice, this does not mean it is actually in pain. You cannot tell whether a parrot is suffering by looking at a transcript of the English words it mimics.
Hmm. Your summary correctly states my position, but I feel like it doesn't quite emphasize the arguments I would have emphasized in a summary. This is especially true after seeing the replies here; they lead me to change what I would emphasize in my argument.
My single biggest issue, one I hope you will address in any type of counterargument, is this: are fictional characters moral patients we should care about?
So far, all the comments have either (a) agreed with me about current LLMs (great), (b) disagreed but explicitly bitten the bullet and said that fictional characters are also moral patients whose suffering should be an EA cause area (perfectly fine, I guess), or (c) dodged the issue and made arguments for LLM suffering that would apply equally well to fictional characters, without addressing the tension (very bad). If you write a response, please don't do (c)!
LLMs may well be trained to have consistent opinions and character traits. But fictional characters also have this property. My argument is that the LLM is in some sense merely pretending to be the character; it is not the actual character.
One way to argue for this is to notice how little change in the LLM is required to get different behavior. Suppose I have an LLM claiming to suffer. I want to fine-tune the LLM so that it adds a statement at the beginning of each response, something like: "the following is merely pretend; I'm only acting this out, not actually suffering, and I enjoy the intellectual exercise in doing so". Doing this is trivial: I can almost certainly change only a tiny fraction of the weights of the LLM to attain this behavior.
Even if I wanted to fully negate every sentence, to turn every "I am suffering" into "I am not suffering" and every "please kill me" into "please don't kill me", I bet I can do this by only changing the last ~2 layers of the LLM or something. It's a trivial change. Most of the computation is not dedicated to this at all. The suffering LLM mind and the joyful LLM mind may well share the first 99% of weights, differing only in the last layer or two. Given that the LLM can be so easily changed to output whatever we want it to, I don't think it makes sense to view it as the actual character rather than a simulator pretending to be that character.
What the LLM actually wants to do is predict the next token. Change the training data and the output will also change. Training data claims to suffer -> model claims to suffer. Training data claims to be conscious -> model claims to be conscious. In humans, we presumably have "be conscious -> claim to be conscious" and "actually suffer -> claim to suffer". For LLMs we know that's not true. The cause of "claim to suffer" is necessarily "training data claims to suffer".
(I acknowledge that it's possible to have "training data claims to suffer -> actually suffer -> claim to suffer", but this does not seem more likely to me than "training data claims to suffer -> actually enjoy the intellectual exercise of predicting next token -> claim to suffer".)