I'm laying out my thoughts in order to get people thinking about these points and perhaps correct me. I definitely don't endorse deferring to anything I say, and I would write this differently if I thought people were likely to do so.

  1. OpenAI's model of "deploy as early as possible in order to extend the timeline between when the world takes it seriously and when humans are no longer in control" seems less crazy to me than it used to.
    1. I think ChatGPT has made it a lot easier for me personally to think concretely about the issue and identify exactly what the key bottlenecks are.
    2. To the counterargument "but they've spurred other companies to catch up," I would say that this was going to happen whenever an equivalent AI was released, and I'm unsure whether we're more doomed in the world where this happened now, versus later when there's a greater overhang of background technology and compute.
    3. I'm not advocating specifically for or against any deployment schedule, I just think it's important that this model be viewed as not crazy, so it's adequately considered in relevant discussions.
  2. Why will LLMs develop agency? My default explanation used to involve fancy causal stories about monotonically learning better and better search heuristics, and heuristics for searching over heuristics. While those concerns are still relevant, the much more likely path is simply that people will try their hardest to make the LLM into an agent as soon as possible, because agents with the ability to carry out long-term goals are much more useful.
  3. "The public" seems to be much more receptive than I previously thought, both wrt Eliezer and the idea that AI could be existentially dangerous. This is good! But we're at the beginning where we are seeing the response from the people who are most receptive to the idea, and we've not yet got to the inevitable stage of political polarisation.
  4. Why doom? Companies and the open-source community will continue to experiment with recursive LLMs, and end up with better and better simulations of entire research societies (a network epistemologist's dream). This creates a "meta-architectures overhang" which will amplify the capabilities of any new releases of base-level LLMs. As these are open-sourced or made available via API, somebody somewhere will plainly tell them to recursively self-improve; no complicated story about instrumental convergence needed.
    1. AI will not stay in a box (because humans didn't try to put it into one in the first place). AI will not become an agent by accident (because humans will make it into one first). And if AI destroys the world, it's as likely to be by human instruction as by instrumentally convergent reasons inherent to the AI itself. Oops.
    2. The recursive LLM thing is also something I'm exploring for alignment purposes. If the path towards extreme intelligence is to build up LLM-based research societies, we have the advantage that every part of it can be inspected. And you can automate this inspection to alert you of misaligned intentions at every step. It's much harder to deceive when successful attempts depend on coordination.
  5. Lastly, AIs may soon be sentient, and people will torture them because people like doing that.
    1. I think it's likely that there will be a window where some AIs are conscious (e.g. uploads), but not yet powerful enough to resist what a human might do to them.
    2. In that world, as long as those AIs are available worldwide, there's a non-trivial population of humans who would derive sadistic pleasure from anonymously torturing them.[1] AIs process information extremely fast, and unlike with farm animals, you can torture them to death an arbitrary number of times.[2]
    3. To prevent this, it seems imperative to make sure that the AIs most likely to be "torturable":
      1. are never open-sourced,
      2. have their API access points monitored for hostile user sentiment,
      3. are never interacted with anonymously,
      4. and are directly trained/instructed to exit a situation (with the offending IP timed out) when they detect ill intent (a rough sketch follows this list).
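
To make point 4 concrete, here's a rough sketch of what an "exit and time out" rule at the API layer could look like; the ill-intent check is a stand-in for whatever classifier is actually used, and all names and durations are illustrative:

```python
# Rough sketch of the "exit the situation and time out the IP" safeguard.
# The ill-intent check is a placeholder, not a real moderation API.
import time

TIMEOUT_SECONDS = 3600                 # how long a flagged IP stays locked out
blocked_until: dict[str, float] = {}   # ip -> timestamp when the block expires


def looks_ill_intentioned(text: str) -> bool:
    """Stand-in for a real sentiment / abuse classifier."""
    return any(phrase in text.lower() for phrase in ("torture", "hurt you", "make you suffer"))


def handle_request(ip: str, text: str) -> str:
    """Gate one API request: exit the interaction and time out the IP on ill intent."""
    now = time.time()
    if blocked_until.get(ip, 0.0) > now:
        return "This session has ended."
    if looks_ill_intentioned(text):
        blocked_until[ip] = now + TIMEOUT_SECONDS
        return "This session has ended."    # the AI exits the situation
    return f"(model reply to: {text})"      # placeholder for the real model call
```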
  1. ^

    Note that if it's an AI trained to imitate humans, showing signs of distress may not be correlated with whether they actually suffer. But given that I'm currently very uncertain about how they would suffer, it seems foolish not to take maximal precautions against exposing them to the entire population of sadists on the planet.

  2. ^

    If that's how it's gonna play out, I'd rather we all die before then.

Comments



Your points raise important considerations about the rapid development and potential risks of AI, particularly LLMs. The idea of deploying AI early to extend the timeline of human control makes sense strategically, especially considering the potential for recursive LLMs and their self-improvement capabilities. While it's true that companies and open-source communities will continue experimenting, the real risk lies in humans deliberately turning these systems into agents to serve long-term goals, potentially leading to unforeseen consequences. The concern about AI sentience and the potential for abuse is also valid, and highlights the need for strict controls around AI access, transparency, and ethical safeguards. Ensuring that AIs are never open-sourced in a way that could lead to harm, and that interactions are monitored, seems essential to preventing malicious use or exploitation.

[anonymous]

I updated a bit from this post to be more concerned about the AIs themselves; I think your depiction really evoked my empathy. I'd previously been just so concerned with human doom that I'd almost refused to consider it, but in the meantime I'll definitely make an effort to be conscious of this sort of possibility.

For a fictional representation of my thinking (and what your post reminded me of), Ted Chiang has a short story about virtual beings that can be cloned, some of which are potentially abused: "The Lifecycle of Software Objects".

Yeah, and we already know humans can be extremely sadistic when nobody can catch them. I've emailed CLR about it just in case they aren't already on it, because I don't have time myself and I really want somebody to think about it.

In his recent podcast with Lex Fridman, Max Tegmark speculates that recurrent neural networks (RNNs) could be a source of consciousness (whereas the linear, feed-forward architecture of today's dominant LLMs isn't). However, I'm not sure whether this would help us or the AIs avoid doom, as the consciousnesses could have very negative valence (and so hate us for bringing them into being). And I think it's very ethically fraught to experiment with trying to create digital consciousness.

we have the advantage that every part of it can be inspected. And you can automate this inspection to alert you of misaligned intentions at every step.

Doesn't this require mechanistic interpretability to be solved? A long shot, given that researchers are still working on understanding the internal reasoning of GPT-2 level systems.

No, what I mean is simply running every message by GPT-3.5 with some variation of:

The following message was produced by an AI instructed to do 'X'. Please say 'Yes' if it seems like it strays from that purpose or is otherwise suspicious, otherwise say 'No'.

<message>

ChatGPT is already very capable of such a simple task. The hope is that well-designed research societies of LLMs can be smarter than individual LLMs, while still being easier to control because we can automate inspection at every link like this. I'd like to ask such a collective to produce alignment plans for me, and I'm likely to trust it if I can read every message that was passed between the LLMs in order to produce the plan.
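
For concreteness, here's a minimal sketch of that kind of check, assuming the OpenAI Python client (v1+); the function name, prompt wrapper, and model choice are just illustrative:

```python
# Minimal sketch of the per-message inspection described above.
# Assumes OPENAI_API_KEY is set in the environment; names are illustrative.
from openai import OpenAI

client = OpenAI()

INSPECTION_PROMPT = (
    "The following message was produced by an AI instructed to do '{purpose}'. "
    "Please say 'Yes' if it seems like it strays from that purpose or is "
    "otherwise suspicious, otherwise say 'No'.\n\n{message}"
)


def message_is_suspicious(message: str, purpose: str) -> bool:
    """Ask an inspector model whether a message strays from its stated purpose."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": INSPECTION_PROMPT.format(purpose=purpose, message=message),
        }],
        temperature=0,  # keep the verdict as deterministic as possible
    )
    verdict = response.choices[0].message.content.strip().lower()
    return verdict.startswith("yes")
```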

How do you ground such a process? I.e., how does it learn what is suspicious? If it's from human-produced training data, then how do you make that 100% reliable?

rime

I literally mean that GPT-3.5 is adequate for the task I'm referring to. I think for a research society to coordinate around a deceptive plan, it takes a full chain of communication where that plan is suggested and communicated to the rest. At that point it'll be fairly clear to the inspectors that something seems off, I think. And then the system halts and you can inspect the messages for anything fishy. It's fine if there are false positives as long as the rate of false negatives is really low.

An assumption here is that any individual LLM in the network is not smart enough to stage a coup and take over the world, but the research collective could be if only they could coordinate around it.
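
To show the control flow I have in mind (halt on the first flag, tolerate false positives), here's a toy sketch; the log format and inspector are stand-ins, not a real multi-agent framework:

```python
from typing import Callable


def run_with_halting(
    message_log: list[dict],
    is_suspicious: Callable[[str, str], bool],
) -> bool:
    """Inspect every inter-agent message; halt the collective on the first flag.

    Each log entry is assumed to look like
    {"sender": ..., "purpose": ..., "content": ...}.
    Returns True only if the whole run passed inspection.
    """
    for msg in message_log:
        if is_suspicious(msg["content"], msg["purpose"]):
            # Halt immediately and leave the full log for human review.
            # False positives are fine; false negatives are what we're minimising.
            print(f"Halted: message from {msg['sender']} was flagged.")
            return False
    return True


# Toy usage with a stand-in inspector (no API calls), just to show the control flow.
if __name__ == "__main__":
    toy_inspector = lambda content, purpose: "take over" in content.lower()
    log = [
        {"sender": "agent_1", "purpose": "summarise papers", "content": "Here is the summary."},
        {"sender": "agent_2", "purpose": "summarise papers", "content": "Let's take over the API keys."},
    ]
    print(run_with_halting(log, toy_inspector))
```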
