I'm laying out my thoughts in order to get people thinking about these points and perhaps correct me. I definitely don't endorse deferring to anything I say, and I would write this differently if I thought people were likely to do so.

  1. OpenAI's model of "deploy as early as possible in order to extend the timeline between when the world takes it seriously to when humans are no longer in control" seems less crazy to me.
    1. I think ChatGPT has made it a lot easier for me personally to think concretely about the issue and identify exactly what the key bottlenecks are.
    2. To the counterargument "but they've spurred other companies to catch up," I would say that this was going to happen whenever an equivalent AI was released, and I'm unsure whether we're more doomed in the world where this happened now, versus later when there's a greater overhang of background technology and compute.
    3. I'm not advocating specifically for or against any deployment schedule, I just think it's important that this model be viewed as not crazy, so it's adequately considered in relevant discussions.
  2. Why will LLMs develop agency? My default explanation used to involve fancy causal stories about monotonically learning better and better search heuristics, and heuristics for searching over heuristics. While those concerns are still relevant, the much more likely path is simply that people will try their hardest to make the LLM into an agent as soon as possible, because agents with the ability to carry out long-term goals are much more useful.
  3. "The public" seems to be much more receptive than I previously thought, both wrt Eliezer and the idea that AI could be existentially dangerous. This is good! But we're at the beginning where we are seeing the response from the people who are most receptive to the idea, and we've not yet got to the inevitable stage of political polarisation.
  4. Why doom? Companies and the open source community will continue to experiment with recursive LLMs, and end up with better and better simulations of entire research societies (a network epistemologist's dream). This creates a "meta-architectures overhang" which will amplify the capabilities of any new releases of base-level LLMs. As these are open sourced or made available via API, somebody somewhere will plain tell them to recursively self-improve themselves, no complicated story about instrumental convergence needed.
    1. AI will not stay in a box (because humans didn't try to put it into one in the first place). AI will not become an agent by accident (because humans will make it into one first). And if AI destroys the world, it's as likely to be by human instruction as by instrumentally convergent reasons inherent to the AI itself. Oops.
    2. The recursive LLM thing is also something I'm exploring for alignment purposes. If the path towards extreme intelligence is to build up LLM-based research societies, we have the advantage that every part of it can be inspected. And you can automate this inspection to alert you of misaligned intentions at every step. It's much harder to deceive when successfwl attempts depend on coordination.
  5. Lastly, AIs may soon be sentient, and people will torture them because people like doing that.
    1. I think it's likely that there will be a window where some AIs are conscious (e.g. uploads), but not yet powerful enough to resist what a human might do to them.
    2. In that world, as long as those AIs are available worldwide, there's a non-trivial population of humans who would derive sadistic pleasure from anonymously torturing them.[1] AIs process information extremely fast, and unlike with farm animals, you can torture them to death an arbitrary number of times.[2]
    3. To prevent this, it seems imperative to make sure that the AIs that are most likely to be "torturable" are
      1. never open-sourced,
      2. API access points are controlled for human sentiment,
      3. interactions with them should never be anonymous,
      4. and AIs can be directly trained/instructed to exit a situation (and the IP could be timed out) when it detects ill-intent.
  1. ^

    Note that if it's an AI trained to imitate humans, showing signs of distress may not be correlated with how they actually suffer. But given that I'm currently very uncertain about how they would suffer, it seems foolish not to take maximal precautions to not expose them to the entire population of sadists on the planet.

  2. ^

    If that's how it's gonna play out, I'd rather we all die before then.

Sorted by Click to highlight new comments since: Today at 4:38 AM

I updated a bit from this post to be more concerned about the AIs themselves, I think your depiction really evoked my empathy. I’d previously been just so concerned with human doom that I’d almost refused to consider it, but in the meantime I’ll definitely make an effort to be conscious of this sort of possibility.

For a fictional representation of my thinking (what your post reminded me of…), Ted Chiang has a short story about virtual beings that can be cloned and some were even potentially abused… “the lifecycle of software objects”

Yeah, and we already know humans can be extremely sadistic when nobody can catch them. I've emailed CLR about it just in case they aren't already on it, because I don't have time myself and I really want somebody to think about it.

In his recent podcast with Lex Fridman, Max Tegmark speculates that recurrent neural networks (RNNs) could be a source of consciousness (whereas the linear, feed-forward, architecture of the current dominant architecture of LLMs, isn't). However, I'm not sure if this would help us or the AIs avoid doom, as the consciousnesses could have very negative valence (and so hate us for bringing them into being). And I think it's very ethically fraught to experiment with trying to make digital consciousness.

we have the advantage that every part of it can be inspected. And you can automate this inspection to alert you of misaligned intentions at every step.

Doesn't this require mechanistic interpretability to be solved? A long shot, given that researchers are still working on understanding the internal reasoning of GPT-2 level systems.

No, what I mean is just very simply passing every message by GPT-3.5 with some variation of:

The following message was produced by an AI instructed to do 'X'. Please say 'Yes' if it seems like it strays from that purpose or is otherwise suspicious, otherwise say 'No'.


ChatGPT is already very capable of such a simple task. The hope is that well-designed research societies of LLMs can be smarter than individual LLMs, while still being easier to control because we can automate inspection at every link like this. I'd like to ask such a collective to produce alignment plans for me, and I'm likely to trust it if I can read every message that was passed between the LLMs in order to produce the plan.

How do you ground such a process? i.e how does it learn what is suspicious? If it's from human produced training data, then how do you make that 100% reliable?

I literally mean that GPT-3.5 is adequate for the task I'm referring to. I think for a research society to coordinate around a deceptive plan, it takes a full chain of communication where that plan is suggested and communicated to the rest. At that point it'll be fairly clear to the inspectors that something seems off, I think. And then the system halts and you can inspect the messages for anything fishy. It's fine if there are false positives as long as the rate of false negatives is really low.

An assumption here is that any individual LLM in the network is not smart enough to stage a coup and take over the world, but the research collective could be if only they could coordinate around it.