785 karma, joined Feb 2022



Are you saying AIs trained this way won’t be agents?

Not especially. Stated simply: a massive space for instrumental goals isn't useful for capabilities today, and plausibly won't be in the future, so we have at least some reason to worry less about AI misalignment risk than we currently do.

In particular, it means that we shouldn't assume instrumental goals will appear by default, and that we should avoid over-relying on non-empirical approaches like intuition or imagination. We have to take things on a case-by-case basis, rather than using broad judgements.

Note that instrumental convergence isn't a binary but a spectrum: the more room there is for instrumental goals to be useful for capabilities, the worse things get, and this happens continuously rather than instrumental goals being sharply active or inactive.

My claim is that the evidence we have counts against there being much space for instrumental convergence to be useful for capabilities, and I expect this trend to continue, at least partially, as AI progresses.

Yet I suspect that this isn't getting at your true worry, and I want to address that today. I suspect that your true worry is captured by the quote below:

And regardless of whatever else you’re saying, how can you feel safe that the next training regime won’t lead to instrumental convergence?

And while I can't answer that question totally, I'd like to suggest going on a walk, drinking water, or in the worst case getting mental help from a professional. But try to stop the loop of never feeling safe around something.

I'm suggesting this because acting on a need to feel safe runs into two problems:

  1. If adopted, this would leave us vulnerable to arbitrarily high demands for safety, possibly crippling AI use cases. As a general policy, I'm not a fan of actions that would result in arbitrarily high demands for something, at least without scrutinizing them very heavily, and I would require way, way more evidence than just a feeling.

  2. We have no reason to assume that people's feelings of safety or unsafety are actually connected to the real evidence on whether AI is safe, or on whether AI misalignment risk is a big problem. Your feelings are real, but I don't trust that your feeling of unsafety about AI tells me anything other than how you feel. This is fine, to the extent that it isn't harming you materially, but it's an important thing to note here.

Kaj Sotala made a similar post, which talks about why you should mostly feel safe. It's a different discussion than my comment, but the post below may be useful:


EDIT 1: I deeply hope you can feel better, no matter what happens in the AI space.

EDIT 2: One thing to keep in mind in general is that when someone claims, based on some evidence, that something is more or less X, this typically means smoothly more or less, rather than dropping to zero or jumping to the maximum. So in this case I'm claiming that AI is less dangerous, probably a lot less dangerous, but that doesn't mean we totally erase the danger. It just means that things are safer and have gotten smoothly better based on our evidence to date.

The basic reasoning is that SGD is an extremely powerful optimizer, and even the real-life imperfections of SGD that mesa-optimizers could exploit are detectable without much interpretability progress at all. Also, capabilities groups have an incentive to improve SGD, so we have good reason to expect these flaws to become less worrisome over time.

In particular, SGD is basically immune to acausal trade setups or blackmail setups by mesa-optimizers.

Some choice quotes from Beren's post below:

The key intuition is that gradient descent optimizes the entire causal graph leading to the loss (property 1) and leaves no slack (properties 2 and 3). Everything that can be optimized will be, and in the locally maximal way. If a parameter is contributing to the loss, it is optimized by its exact sensitivity to decrease the loss the most. If a circuit results in bad behaviour, every element of the circuit is optimized away independently. It is easy to construct gradient hacking examples in cases where these conditions do not hold. For instance, if the gradient hacker's internal circuitry and external connections to other parts of the network are shielded from GD, then it is easy to construct a gradient hacker

In general, if a gradient hacker H is on part of a differentiable path through the network, and tries to manipulate some other part of the network X to output some incorrect output, then gradient descent will simultaneously and independently do all of: a.) destroy the internal representations of the gradient hacker H to stop it doing this behaviour. b.) modify the input connections to H such that the gradient hacker becomes confused about how to do its negative behaviour. c.) modify the output connections from H to X to reduce the impact of its behaviour and causally cut off the gradient hacker, d.) modify the internal structure of X so that it is less affected by whatever H outputs, and e.) modify the outputs of X (i.e. to the loss) to reduce its influence, potentially trying to cut off the entire causal network path the gradient hacker is on.

This leads to an important point. Better optimization algorithms lead to both better capabilities and greater safety from mesaoptimizers. They do this because they reduce the slack available in the optimization process that a mesaoptimizer could exploit for its own purposes. For instance, gradient descent is very sensitive to the conditioning of the loss landscape around a parameter. A gradient hacker could exist for a very long time in an extremely flat region of poor conditioning before being optimized away, potentially persisting over an entire training run. However, this threat can be removed with better preconditioners or second order optimizations which make gradient descent much less sensitive to local conditioning and so remove the slack caused by poor conditioning.
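
To make the conditioning point in that last quote concrete, here is a minimal toy sketch of my own (it is not from Beren's post, and the function name and numbers are made up for illustration). It assumes a two-parameter quadratic loss where one direction is steep and the other is nearly flat, and compares plain gradient descent against a diagonally preconditioned version that divides each gradient by its curvature. A real network's loss is nothing like this simple, so treat it as intuition only.

```python
def gradient_descent(curvatures, start, lr, steps, precondition=False):
    """Minimise 0.5 * sum(c_i * p_i**2) starting from `start`.

    Hypothetical toy setup: each parameter has its own curvature c_i.
    With precondition=True, each gradient is divided by its curvature
    (a diagonal Newton-style step), so flat directions move as fast
    as steep ones.
    """
    params = list(start)
    for _ in range(steps):
        for i, c in enumerate(curvatures):
            grad = c * params[i]      # d/dp of 0.5 * c * p**2
            if precondition:
                grad /= c             # rescale by local curvature
            params[i] -= lr * grad
    return params


# One steep direction (curvature 1.0) and one nearly flat one (1e-4).
curvatures = [1.0, 1e-4]
start = [1.0, 1.0]

plain = gradient_descent(curvatures, start, lr=0.5, steps=100)
precond = gradient_descent(curvatures, start, lr=0.5, steps=100,
                           precondition=True)

print("plain GD:        ", plain)
print("preconditioned:  ", precond)
```

In this toy, plain gradient descent drives the steep parameter to essentially zero while the flat parameter barely moves over the whole run, which is the kind of slack the quote describes. The preconditioned version shrinks both parameters at the same rate, so the flat direction no longer offers a place to hide.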

The full post is linked below:


Even though I don't think EA needs to totally replicate outside norms, I do agree that there are good reasons why quite a few norms exist.

I'd say the biggest norms from outside that EA needs to adopt are less porous boundaries on work/dating, and importantly actually having normalish pay structures/work environments.

I agree, but that implies pretty different things than what is currently being done, and still implies that the danger from AI is overestimated, which bleeds into other things.

Basically, kind of. The basic issue is that instrumental convergence, and especially effectively unbounded instrumental convergence, is a central assumption behind the claim that AI is uniquely dangerous compared to other technologies like biotechnology. In particular, losing the instrumental convergence assumption, or at least finding it too weak to make the case for doom (unlike what Superintelligence told you), matters a lot: it undermines many of our inferences about why AI would likely doom us, like deception, unbounded power-seeking, and many other inferences that EAs/rationalists made, unless one already had a high prior probability of doom.

This means we have to fall back on our priors for general technology being good or bad, and unless one already has high prior probability that AI is doomy, then we should update quite a lot downwards on our probability of doom by AI. It's probably not as small as our original prior, but I suspect it's enough to change at least some of our priorities.

Remember, the instrumental convergence assumption was essentially used as an implicit axiom behind many of our subsequent beliefs. Its turning out to be neither as universal as we thought nor as much evidence of AI doom as we thought is pretty drastic, since a lot of beliefs about the dangerousness of AI rely on essentially unbounded instrumental convergence and unbounded power-seeking/deception assumptions. (Indeed, even with instrumental convergence, we don't actually get enough evidence to move us towards a high probability of doom without more assumptions.)

So to answer your question, the answer depends on what your original priors were on technology and AI being safe.

It does not mean that we can't go extinct, but it does mean that we were probably way overestimating the probability of going extinct.

I mean, regardless of how much better their papers are in the meantime, does it seem likely to you that those labs will solve alignment in time if they are racing to build bigger and bigger models?

Basically, yes. This isn't to say that there is no chance we're doomed, but contrary to popular belief among EAs/rationalists, the situation you gave actually has a very good chance of working, like >50%, for a few short reasons:

  1. A vast space for instrumental convergence/instrumental goals isn't incentivized in current AI, and in particular, an AI with essentially unbounded instrumental goals is very bad for capabilities relative to one with more bounded instrumental goals. In general, the vaster the space for instrumental convergence, the worse the AI performs.

This is explained well by porby's post below, and I'll copy the most important footnote here:


This also means that minimal-instrumentality training objectives may suffer from reduced capability compared to an optimization process where you had more open, but still correctly specified, bounds. This seems like a necessary tradeoff in a context where we don't know how to correctly specify bounds.

Fortunately, this seems to still apply to capabilities at the moment- the expected result for using RL in a sufficiently unconstrained environment often ranges from "complete failure" to "insane useless crap." It's notable that some of the strongest RL agents are built off of a foundation of noninstrumental world models.

  2. Instrumental convergence appears to be too weak on its own to support substantial conclusions that AI will doom us, barring more assumptions or more empirical evidence, contrary to Nick Bostrom's Superintelligence.

This is discussed more in a post below, but in essence the things you can conclude from it are quite weak, and they don't imply anything near the inferences that EAs/rationalists made about AI risk.


Given that instrumental convergence/instrumental goals have arguably been a foundational assumption that EAs/rationalists make about AI risk, this has very, very big consequences, almost on par with discovering a huge new cause like AI safety.

I want to be clear for the record here that this is enormously wrong, and Greg Colbourn's advice should not be heeded unless someone else checks the facts/epistemics of his announcements, due to past issues with his calls for alarm.

I definitely agree with this, and I'm not very happy with the way Omega solely focuses on criticism, at the very least without any balanced assessment.

And given the nature of the problem, some poor initial results should be expected, by default.

I actually would find this at least somewhat concerning, because selection bias/selection effects are my biggest worry with smart people working in an area. If a study area is selected based upon any non-truthseeking motivations, or if people are pressured to go along with a view for non-truthseeking reasons, then it's very easy to land in nonsense, where the consensus is based totally on selection effects, making it useless to us.

There's a link to the comment by lukeprog below on the worst case scenario for smart people being dominated by selection effects:

One marker to watch out for is a kind of selection effect.

In some fields, only 'true believers' have any motivation to spend their entire careers studying the subject in the first place, and so the 'mainstream' in that field is absolutely nutty.

Case examples include philosophy of religion, New Testament studies, Historical Jesus studies, and Quranic studies. These fields differ from, say, cryptozoology in that the biggest names in the field, and the biggest papers, are published by very smart people in leading journals and look all very normal and impressive but those entire fields are so incredibly screwed by the selection effect that it's only "radicals" who say things like, "Um, you realize that the 'gospel of Mark' is written in the genre of fiction, right?"


This. I generally also agree with your 3 observations, and the reason I was focusing on truth seeking is because my epistemic environment tends to reward worrying AI claims more than it probably should due to negativity bias, as well as looking at AI Twitter hype.
