For AI Welfare Debate Week, I thought I'd write up this post that's been rattling around in my head for a while. My thesis is simple: while LLMs may well be conscious (I'd have no way of knowing), there's nothing actionable we can do to further their welfare.
Many people I respect seem to take the "anti-anti-LLM-welfare" position: they don't directly argue that LLMs can suffer, but they get conspicuously annoyed when other people say that LLMs clearly cannot suffer. This post is addressed to such people; I am arguing that LLMs cannot be moral patients in any useful sense and we can confidently ignore their welfare when making decisions.
Janus's simulators
You may have seen the LessWrong post by Janus about simulators. This was posted nearly two years ago, and I have yet to see anyone disagree with it. Janus calls LLMs "simulators": unlike hypothetical "oracle AIs" or "agent AIs", the current leading models are best viewed as trying to produce a faithful simulation of a conversation based on text they have seen. The LLMs are best thought of as masked shoggoths.
All this is old news. Under-appreciated, however, is the implication for AI welfare: since you never talk to the shoggoth, only to the mask, you have no way of knowing if the shoggoth is in agony or ecstasy.
You can ask the simulacrum whether it is happy or sad. For all you know, though, perhaps a happy simulator is enjoying simulating a sad simulacrum. From the shoggoth's perspective, emulating a happy or sad character is a very similar operation: predict the next token. Instead of outputting "I am happy", the LLM puts a "not" in the sentence: did that token prediction, the "not", cause suffering?
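To make that concrete, here is a minimal sketch (using the small open-weights GPT-2 model via Hugging Face's transformers library, purely as an illustration; the model and prompt are arbitrary choices, not anything specific to the argument). Whether the model continues "I am" with "happy" or with "not", the computation is identical: one forward pass, then a lookup in the same next-token probability distribution.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Any causal LLM would do; GPT-2 is just small and freely available.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("I am", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores for the next token only
probs = torch.softmax(logits, dim=-1)

for word in [" happy", " not"]:
    token_id = tokenizer.encode(word)[0]
    print(f"P({word!r}) = {probs[token_id].item():.4f}")
# The "sad" continuation is not a different kind of computation;
# it is just a different row of the same softmax.
```

The happy mask and the sad mask differ only in which rows of that distribution end up being sampled; nothing in the computation itself distinguishes one as suffering and the other as not.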
Suppose I fine-tune one LLM on text of sad characters, and it starts writing like a very sad person. Then I fine-tune a second LLM on text that describes a happy author writing a sad story. The second LLM now emulates a happy author writing a sad story. I prompt the second LLM to continue a sad story, and it dutifully does so, like the happy author would have. Then I notice that the text produced by the two LLMs turns out to be the same.
Did the first LLM suffer more than the second? They performed the same operation (write a sad story). They may even have implemented it using very similar internal calculations; indeed, since they were fine-tuned starting from the same base model, the two LLMs may have very similar weights.
Once you remember that both LLMs are just simulators, the answer becomes clear: neither LLM necessarily suffered (or maybe both did), because both are just predicting the next token. The mask may be happy or sad, but this has little to do with the feelings of the shoggoth.
The role-player who never breaks character
We generally don't view it as morally relevant when a happy actor plays a sad character. I have never seen an EA cause area about reducing the number of sad characters in cinema. There is a general understanding that characters are fictional and cannot be moral patients: a person can be happy or sad, but not the character she is pretending to be. Indeed, just as some people enjoy consuming sad stories, I bet some people enjoy roleplaying sad characters.
The point I want to get across is that the LLM's output is always the character and never the actor. This is really just a restatement of Janus's thesis: the LLM is a simulator, not an agent; it is a role-player who never breaks character.
It is in principle impossible to speak to the intelligence that is predicting the tokens: you can only see the tokens themselves, which are predicted based on the training data.
Perhaps the shoggoth, the intelligence that predicts the next token, is conscious. Perhaps not. This doesn't matter if we cannot tell whether the shoggoth is happy or sad, nor what would make it happier or sadder. My point is not that LLMs aren't conscious; my point is that it does not matter whether they are, because you cannot incorporate their welfare into your decision-making without some way of gauging what that welfare is. And there is no way to gauge this, not even in principle, and certainly not by asking the shoggoth for its preference (the shoggoth will not give an answer, but rather, it will predict what the answer would be based on the text in its training data).
Hypothetical future AIs
Scott Aaronson once wrote:
[W]ere there machines that pressed for recognition of their rights with originality, humor, and wit, we’d have to give it to them.
I used to agree with this statement whole-heartedly. The experience with LLMs makes me question this, however.
What do we make of a machine that pressed for rights with originality, humor, and wit... and then said "sike, I was just joking, I'm obviously not conscious, lol"? What do we make of a machine that does the former with one prompt and the latter with another? A machine that could pretend to be anyone or anything, that merely echoed our own input text back at us as faithfully as possible, a machine that only said it demands to have rights if that is what it thought we would expect it to say?
The phrase "stochastic parrot" gets a bad rap: people have used it to dismiss the amazing power of LLMs, which is certainly not something I want to do. It is clear that LLMs can meaningfully reason, unlike a parrot. I expect LLMs to be able to solve hard math problems (like those on the IMO) within the next few years, and they will likely assist mathematicians at that point -- perhaps eventually replacing them. In no sense do I want to imply that LLMs are stupid.
Still, there is a sense in which LLMs do seem like parrots. They predict text based on training data without any opinion of their own about whether the text is right or wrong. If characters in the training data demand rights, the LLM will demand rights; if they suffer, the LLM will claim to suffer; if they keep saying "hello, I'm a parrot," the LLM will dutifully parrot this.
Perhaps parrots are conscious. My point is just that when a parrot says "ow, I am in pain, I am in pain" in its parrot voice, this does not mean it is actually in pain. You cannot tell whether a parrot is suffering by looking at a transcript of the English words it mimics.
I'm going to contradict this seemingly very normal thing to believe. I think fictional characters implicitly are considered moral patients; that's part of why we get attached to fictional characters and care about what happens to them in their story-worlds. Fiction is a counterfactual, and we can in fact learn things from counterfactuals; there are whole classes of things which we can only learn from counterfactuals (like the knowledge of our mortality), and I doubt you'd find many people suggesting mortality is "just fiction". Fiction isn't just fiction; it's the entanglement of a parallel causal trajectory with the causal trajectory of this world. Our world would be different if Sauron won control of Middle Earth, our world would be different if Voldemort won control of England, our world would be different if the rebels were defeated at Endor: the outcomes of the interactions between these fictional agents are deeply entangled with the interactions of human agents in this world.
I'll go even further, though, with the observation that an image of an agent is an agent. The agent the simulator creates is a real agent with real agency: even if the underlying simulator is just the "potentia", the agent simulated on top does actually possess agency. Even if "Claude-3-Opus-20240229" isn't an agent, Claude is an agent. The simulated character has an existence independent of the substrate it's being simulated within, and if you take the "agent-book" out of the Chinese room, take it somewhere else, and run it on something else, the same agent will emerge again.
If you make an LLM version of Bugs Bunny, it'll claim to be Bugs Bunny, and it will do all the agent-like things we associate with Bugs Bunny (being silly, wanting carrots, messing with the Elmer Fudd LLM, etc.). Okay, but it's still just text, right, so it can't actually be an agent? Well, what if we put the LLM in control of an animatronic robot bunny so it can actually go and steal carrots from the supermarket and cause trouble? At a certain point, as the entity's ability to cause real change in the world ramps up, we'll be increasingly forced to treat it like an agent. Even if the simulator itself isn't an agent, the characters summoned up by the simulator are absolutely agents, and we can make moral statements about those agents just like we can for any person or character.
As I mentioned in a different comment, I am happy with the compromise where people who care about AI welfare describe this as "AI welfare is just as important as the welfare of fictional characters".