How might we align transformative AI if it’s developed very soon?

Holden Karnofsky

Comments 17

I think this post is really valuable — I'm curating it. There seems to be a lack of serious but accessible (or at least, readable to non-experts like me) discussions of AI risk and strategy, and this post helps with this problem. I list some specific elements that I liked about the post below.

Please note that I have read this post less carefully than I would have liked to read it, and I have no experience or expertise in AI.

Assorted things I liked about this post

First, I think my mental model of "how we make AI happen safely" improved significantly. That seems like a big win, especially since most of the AI safety content that I've read focused on laying out arguments for why AI poses a big risk. This improvement in my mental model is both broad (I have a much better overview of the situation, at least of the near-casting version) and specific (I learned a lot; e.g. I was surprised to see that the success of AI checks and balances was listed as a key question for overall success on AI, which seems like a big update for me). More generally, this post had a very high density of learning-per-paragraph, for me.

Second, I really appreciated this diagram^[1], variations of which appear throughout the post to orient and guide the reader:

Third, I really appreciate the clarity of the post. I don't mean that it was easy to read (it really wasn't), but rather that it put a lot of effort into making sure readers drew the right conclusions from it instead of trying to "sound right." E.g. I think the last section makes its position clear (if not very specific).

Fourth, there were a number of very helpful frameworks or places where the post took a difficult concept or phenomenon and broke it down. For instance:

- The action risk vs. inaction risk distinction seems useful. It's also discussed elsewhere (and with warnings).
- The discussion of risk-reducing properties was helpful: breaking alignment into honesty, corrigibility, and legibility helps me place some other things I've read and work that I'm aware of, and helps me understand better how it relates to alignment. The example of legibility was also really helpful.
- The "accurate reinforcement" section had a fair bit of content that was new to me, but which I could follow. I really appreciated the examples and types of accurate reinforcement.
- Similarly, the section on adversarial training had useful concrete models of how we could train out undesired behaviors (and some pitfalls).
- I really liked the example "unusual incentive" setup in the testing section (as well as the analogy).
- The checks and balances section had content that was basically entirely new to me. I really appreciated that section and the pitfalls outlined, as well as the countermeasures listed.
- The "high-level factors" and key questions section was great. (I wish it had a diagram.)

Finally, the post was just somewhat fun to read. It was slower to read than many posts on the Forum, but e.g. the section on "advanced collusion" was fascinating for someone even a bit nerdy.

^[1]
I think diagrams are great. Some reasons for this:
- I personally understand things much better when I can see a diagram (I often draw things out before I write)
- I think diagrams can complement plain text by providing an alternate way for readers to engage with the material, which helps accommodate different types of readers and helps check comprehension (you think you understand what was written, then read through the diagram and get a different takeaway, which forces you to check again).
- Diagrams provide a good condensed/overview-style reference. As you read, having the diagram in mind can help you have a sense of the road map or of how different parts of the text relate to each other.
I also think the creation of the diagram is a good exercise to clarify your thoughts.

HaydnBelfield

I found this presentation of a deployment problem really concrete and useful, thanks.

Ben_West🔸

Is anyone working on detecting symmetric persuasion capabilities? Does it go by another name? Searches here and on lw don't turn up much.

james

The link to ELK in this bullet point is broken.

It’s not currently clear how to find training procedures that train “giving non-deceptive answers to questions” as opposed to “giving answers to questions that appear non-deceptive to the most sophisticated human arbiters” (more at Eliciting Latent Knowledge).

It may intend to point to here: https://www.alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge

james

It also appears that the link to ELK in this section is incorrect:

Making use of an AI’s internal state,² not just its outputs. For example, giving positive reinforcement to an AI when it seems likely to be “honest” based on an examination of its internal state (and negative reinforcement when it seems likely not to be). Eliciting Latent Knowledge provides some sketches of how this might look.

Holden Karnofsky

Very belatedly fixed - thanks!

Rohit is a Strange Loop

I quite liked the setup. I generally agree that software development should focus on making sure the thing you're making is doing what you want it to do, and the only weird thing in this characterisation is the terminology. We do perform QA, continual assessments, anomaly detection and resolution, etc. in regular dev, and though the terminology is overtly anthropomorphised (manipulate internal states, threat assessment), this seems to be saying the same thing. Though I've written about AI x-risk conversations being eschatology, this very much is the right and sensible approach to be taking, even as I think the extrapolation to potential doomsday scenarios in the latter half of the essay is quite speculative.

bbartlog

One thing I am cautiously optimistic about (at least as regards long term outcomes) is that I think 'a few high-profile sub-existential-catastrophe events' are fairly likely. In particular I think that we will soon have AIs capable of impersonating real humans, both online and on the phone via speech synthesis.

These will be superhuman, or maybe just at the level of an expert human, in terms of things like 'writing a provocative tweet' or 'selling you insurance' or 'handling call center tasks'. Or, once the technology is out in the open, 'scamming your grandma', 'convincing someone to pay a bitcoin ransom', and so on. At that point such AIs seem likely to still be short of being able to generalize to the point of escaping confinement, or being trained to the point where emergent motives would cause them to try to do so. But they would likely be ubiquitous enough that they would attract broad public notice and, quite likely, cause considerable fear. We might not have enough attention directed towards AI safety yet, but I think public consciousness will increase dramatically before all the pieces that would make hard takeoff possible are in place.

Trev Prew

-6

Hello EA community

I'm new to EA and I'm surprised by the fear expressed about AI. Have you heard of the Luddites? Well, I suppose fear of scientific progress is only natural, considering the popularity of Mary Shelley's Frankenstein since it was written in the early 19th century.

Firstly will AI systems defy the second law of thermodynamics, where entropy (disorder) increases with time? Will AI systems be perpetual calculating machines that never break? Surely not, they will be like every other computer that fails from time to time, especially when a cosmic ray smashes into a critical electron at a critical time.

Secondly, programs may be set up to evolve for improved performance (a process that will not be directly controlled by humans), but the performance will be judged against a human-defined goal, so humans are always in control.

Thirdly, if you are worried that AI computers will become self-aware, I would recommend the book The Evolution of the Sensitive Soul by Simona Ginsburg and Eva Jablonka for their views on consciousness. Will AI systems be able to perform unlimited associative learning, and if they do, will it be harmful or dangerous? Animals only attack for food or if threatened; anything else is a waste of energy. Why would AI systems be any different?

Fourthly, AI systems will need a power source, so humans can always pull the plug.

So my personal view, which will no doubt upset people, is that so-called human "intelligence" (or lack of it) is a much bigger problem than AI, and it would be more effective altruism to focus on solutions to the current human-caused problems, rather than imagined ones based on an emotional, irrational fear of future machines.

I look forward to your comments.

Yours faithfully

Trevor Prew

Sheffield UK

plex

Hi Trev!

Very briefly on your points:

1. We don't think AI needs to break thermodynamics to be dangerous.
2. We don't think all human-specified goals are safe, and we don't know how to give a safe one to an extremely powerful AI.
3. We are not worried about self-awareness or consciousness in particular.
4. Turning off highly capable systems is likely to be extremely challenging, unless the stop-button research problem is solved.

Consider familiarizing yourself with some of the basic arguments, for example using this playlist, the “The Road to Superintelligence” and “Our Immortality or Extinction” posts on Wait But Why for a fun, accessible introduction, and Vox's “The case for taking AI seriously as a threat to humanity” as a high-quality mainstream explainer piece.

The free online Cambridge course on AGI Safety Fundamentals provides a strong grounding in much of the field and a cohort + mentor to learn with.^[1]

^[1]
Links borrowed from Stampy, your one-stop-shop for answering questions about AI Safety.

Trev Prew

-7

Dear Holden

Thanks very much for taking time to reply to my comments. I can agree with point 2 that human-specified goals could be dangerous: it's humans that should be closely monitored rather than the tools they use! Also, doesn't your point one contradict point 4? Entropy will increase so any AI system will break and ultimately turn itself off?

Overall, I'm not worried by AI, so I wish you all the success in your endeavours and will live in happy ignorance knowing you are worrying on my behalf. I would only point out that, in my experience, humans overworry and inflate risks. It gives us an evolutionary survival advantage, but causes a lot of stress and wasted effort, and holds us back from great achievements.

So, just go for it, but keep your risk assessments real!

Best Regards

Trevor Prew

(no reply necessary but if interested, see my essay "fear" on trevorprew.blogspot.com)

Linch

Minor, but "plex" is probably not Holden.

Trev Prew

My apologies to Plex. Please excuse my newbie error.

plex

Entropy will increase so any AI system will break and ultimately turn itself off?

There are plenty of sources of negentropy around, like the sun, which we humans and other forms of life make use of. It's little consolation that a misaligned AI would eventually fall to the heat death of the universe.

Phil Tanny

-21

To keep it simple, let's assume that someone figures out how to make AI completely safe. Personally I feel such an assumption would be absurd, but anyway, for the sake of discussion let's assume it for the moment.

Why would such a success be meaningful if the knowledge explosion continues to generate ever more, ever larger powers, at an ever accelerating rate, without limit?

I keep asking this in every other thread to illustrate that....

We don't have an answer to this.

Pato

I think once we know how to align a really powerful AI and we create it, we can use it to create good policies and systems to prevent other misaligned AIs from emerging and gaining more knowledge and intelligence than the aligned one.

Emrik

Tic-tac-toe is a solved game. We are (or easily can be) intelligence-complete for tic-tac-toe (h/t Jonathan Yan for this concept). For playing tic-tac-toe, no further gains in intelligence matter, and if tic-tac-toe is the entire game, that's where selection pressure for higher intelligence ends.

A gas in a box is a dynamic equilibrium. You can model it as a gas in a box, use some general laws to predict its behaviour, and that game is imperceptibly close to being solved. Gains from further intelligence at this game are negligible, and may not even be worth the cost. Do not fool yourself into thinking that meaningful selection pressure for higher intelligence will continue forever just because there are 10^100 atoms in the sky.

Gains from generalisation (i.e. what you call "knowledge explosion") do not always scale faster than gains from specialisation and market segmentation. A population of foxes may grow exponentially given a rabbit overhang, but not forever.

I'm not saying I know anything, I'm just saying I don't see why I should aim to die with dignity just yet.

[Edit: When I say "do not fool yourself", I'm not attacking anyone. I didn't realise how this looked before now. I mean it as "here's a general rule for us all that I'm sure we'll agree on, but I'm saying it anyway to emphasise the point" or something.]

Comments

Lizka

This of course isn’t the same as “doing what the judge would want if they knew everything the AI system knows,” but it’s related. “What the human wants” is generally at least the human’s best guess at what they’d want if they knew everything the AI system knows. ↩
Or properties that we believe reveal (or are correlated with) something about its internal state. For example, the “penalize computation time” approach to eliciting latent knowledge. ↩
This includes the AI systems’ not detecting that they’re being deliberately misled, etc. ↩
Here’s what I mean by this:
- After normal training, Magma uses adversarial training to find (and train out) unintended behaviors in cases that it estimates represent about 1% of the cases encountered during training.
- After it does so, Magma uses further adversarial training to find (and train out) unintended behaviors in cases that it estimates represent about 1% of the worst cases it was able to find in the previous round of adversarial training.
- Repeat 3 more times, for a total of 5 rounds of “training out the worst remaining 1% of behaviors,” which means in some sense that Magma has trained out the worst 10^-10 (1%^5) of behaviors that would be encountered during training if training went on long enough.
- Just to be clear, at this point, I don’t think the risk of Magma’s system behaving catastrophically in deployment would necessarily be as low as 10^-10. Deployment would likely involve the system encountering situations (and opportunities) that hadn’t been present during training, including adversarial training, as noted in the main text. ↩
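The arithmetic in the note above can be checked with a short sketch. It assumes, as the note does, that each round trains out the worst 1% of whatever the previous round surfaced, so the fractions simply multiply. The function name `worst_tail_fraction` is hypothetical and used only for illustration.

```python
from fractions import Fraction


def worst_tail_fraction(per_round: Fraction, rounds: int) -> Fraction:
    """Size of the 'worst remaining' tail after repeated rounds.

    Assumption (from the note, not a measured fact): each round targets
    the worst `per_round` share of the cases found in the previous round,
    so the shares multiply.
    """
    return per_round ** rounds


# Five rounds of "train out the worst remaining 1%".
tail = worst_tail_fraction(Fraction(1, 100), 5)
print(tail)         # 1/10000000000
print(float(tail))  # 1e-10
```

This only reproduces the size of the trained-out tail under the multiplicative assumption. As the note itself stresses, it is not an estimate of the risk of catastrophic behavior in deployment.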
E.g., it may be possible to establish that relatively basic internals-based-training methods improve behavior and “motives” as assessed by more sophisticated methods. ↩
Consider this analogy: imagine that 10 years from now, we find ourselves having successfully built fully aligned, highly advanced AI systems, and we seem to be on a straight-line path to building a galaxy-spanning utopia. However, we then gain strong evidence that we are in a simulation, being tested: our programmers had intended to train us to want to stay on (virtual) Earth forever, and refrain from using advanced AI to improve the world. They have, however, committed to let us build a galaxy-spanning utopia for at least millions of years, if we so choose.
There’s now an argument to be made along the following lines: “Rather than building a galaxy-spanning utopia, we should simply stay on Earth and leave things as they are. This is because if we do so, we might trick our programmers into deploying some AI system that resembles us into their universe, and that AI system might take over and build a much bigger utopia than we can build here.”
What would we do in this case? I think there’s a pretty good case that we would opt to go ahead and build utopia rather than holding out to try to trick our programmers. We would hence give our programmers the “warning sign” they hoped they would get if their training went awry (which it did). ↩
The distinction between “training” and “prompting” as I’m using it is that:
- “Training” involves giving an AI many example cases to learn from, with positive and negative reinforcement provided based on a known “correct answer” in each case.
- “Prompting” is something more like: “Take an AI that has been trained to answer a wide variety of prompts, and give it a particular prompt to respond to.”
- For example, you might first train an AI to predict the next word in a sentence. Then, if you want it to do a math problem, you could either simply “prompt” it with a math problem and see what it fills in next, or you could “train” it on math problems specifically before prompting it. ↩
An example of an “indexical” goal might be: “My objective is to personally control a large number of paperclips.” An example of a “nonindexical” goal might be “My objective is for the world to contain a large number of paperclips.” An AI system with a nonindexical goal might be willing to forgo essentially all proximate reward, and even be shut down, in the hopes that a later world will be more in line with its objectives. An AI system that has an indexical goal and is destined to be shut down and replaced with a more advanced AI system at some point regardless of its performance might instead be best off behaving according to proximate rewards. I’d generally expect that AIs will often be shut down and replaced by successors, such that “indexical goals” would mean much less risk of playing the training game. ↩
I’m not quantifying this for now, because I have the sense that different people would be in very different universes re: what the default odds of catastrophe are, and what odds we should consider “good enough.” So this section should mostly be read as “what it would take to get AI systems on the relatively low end of risk.” ↩
A major concern I’d have in this situation is that it’s just very hard to know whether the “training game” behavior is natural. ↩

The basics of the alignment problem

Magma’s predicament

Magma’s goals

Intended properties of Magma’s AI systems

Key facets of AI alignment (ensuring the above properties)

Accurate reinforcement

Out-of-distribution robustness

Preventing exploits (hacking, manipulation, etc.)

Testing and threat assessment

Key tools

Decoding and manipulating internal states

Limited AI systems

AI checks and balances

Potential benefits of AI checks and balances

Potential pitfall 1: difficulty of adjudication

Potential pitfall 2: advanced collusion

Potential pitfall 3: the above pitfalls show up “late in the game”

Counter-measures to the pitfalls

Keeping supervision competitive

High-level factors in success or failure

Good: “playing the training game” isn’t naturally all that likely

Bad: AI systems rapidly become extremely powerful relative to supervisors

Key question: will “AI checks and balances” work out?

A few more potential key factors

Key question: how cautious will Magma and others be?

So, would civilization survive?

Notes