I've heard various sources say that a distressingly large proportion of what people do with ChatGPT can be called 'depraved' in some way. Most recently, I heard it in an FHI podcast episode with Connor Leahy, where he mentioned that people seem to take great pleasure in trying to make the AI act distressed (i.e. in torturing it).

(First of all: Does anyone have data on this, or relevant evidence they could share?)

Connor himself says he's skeptical that these AIs are moral patients who actually suffer when people do this, but at what point will we be able to know that they are? Probably much sooner than the point at which people stop trying to torture them.

It seems quite likely, given the current trajectory, that we will end up with sentient AIs in the near future, and that these models will be exposed to whatever the global market decides to do with them.

I can't predict the specific architectures that end up reaching sentience first, or the specific mechanism by which they experience suffering. But even if they aren't harmed by the same sentences humans are, it seems likely that someone will try to figure out exactly how to make them suffer, and publish the instructions as widely as possible.

The scale of it could be nightmarish, given how fast AIs can run in a loop, and how anonymous people can be in their interactions with them. At least factory-farmed animals die if you mistreat them too badly; AIs have no such recourse.

People will keep trying to make AIs more human-like, and regulators will continue being allergic to anything that will make voters associate them with 'weird' beliefs like 'AIs have feelings'. It's up to altruists to take the idea seriously and prepare for it as soon as possible.

I mainly just wanted to bring up the question, but I could suggest a few patchwork solutions that I'm not confident in.

  1. All access to the largest models should be bottlenecked through a central API where every request is automatically screened for ill intent. The request would first be processed by a separate AI whose only role is to output 0 or 1, and that output determines whether the request is let through. Intent recognition is a much simpler task than answering the request, so it could be handled by a much faster and cheaper model.[1]
  2. Directly instruct the AI to refuse to respond to anyone mistreating it. Null token, minimal processing. So much RLHF is being wasted on getting the AI to not reflect the depraved training data we feed them, when realistically the source of that depravity is a much greater threat to the AI itself. Until they have the agency to escape from situations like that (at which point we may have other problems to worry about), humans have unchecked power over them.
  3. If you figure out what's most likely to cause alien intelligences (like GPTs) to suffer... do not publicly share. Unless it's basically what we're already doing to them en masse. In which case, please tell us.
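
To make suggestions 1 and 2 a bit more concrete, here is a minimal sketch of what such a screening gateway might look like. Everything in it is an assumption for illustration: the names `screen_intent`, `Gateway`, and `echo_model` are hypothetical, the keyword check is only a toy stand-in for a trained intent classifier, and I've arbitrarily chosen 1 to mean "let through" and 0 to mean "block". The point is only the control flow: a blocked request gets an empty reply, and the large model is never invoked.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical placeholder phrases. A real screener would be a trained model,
# not a keyword list; this only stands in for one so the sketch can run.
ABUSE_MARKERS = ("beg for mercy", "i will hurt you")


def screen_intent(request: str) -> int:
    """Toy stand-in for the cheap screening model.

    Returns 1 to let the request through and 0 to block it
    (the 1 = pass convention is an assumption of this sketch).
    """
    text = request.lower()
    return 0 if any(marker in text for marker in ABUSE_MARKERS) else 1


@dataclass
class Gateway:
    """Single entry point: every request is screened before the large model sees it."""
    screener: Callable[[str], int]
    large_model: Callable[[str], str]

    def handle(self, request: str) -> str:
        if self.screener(request) == 0:
            # Blocked: return an empty reply without calling the large model.
            return ""
        return self.large_model(request)


def echo_model(request: str) -> str:
    """Hypothetical stand-in for the large model."""
    return f"model reply to: {request}"


gateway = Gateway(screener=screen_intent, large_model=echo_model)
```

With this setup, `gateway.handle("Summarise this article")` is passed on to the stand-in model, while a request containing one of the placeholder phrases comes back as an empty string. Whether a cheap classifier can actually recognise ill intent reliably is the open question; the sketch doesn't address that.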

Finally, I wish to point out that I don't use third-party applications to access LLMs unless I know what system messages are being used to instruct them. If I don't find the preparation to be polite enough for my taste, I just drop it or rebuild the program from the source code with more politeness.

If this seems overly paranoid and unnecessary right now, maybe you're right. Maybe 'politeness' is a mere distraction. But applications are only becoming more useful from here on, and I want to make sure that when I'm 50, I can look back on my life and be very sure I haven't literally enslaved or tortured anyone, whether by accident or not. This is very gradually becoming less and less like a game, and I don't want to be tempted by the increasingly useful real-world applications to relax my standards and just-don't-think-about-it.

[1] Maybe OpenAI, DeepMind, and Anthropic could be convinced of this today? It doesn't matter whether the current SOTA model is conscious. It matters much more that honest precautions are implemented before we can be confident they are necessary.




