
Anthony DiGiovanni

2174 karma

Bio

Researcher at the Center on Long-Term Risk. All opinions my own.

Sequences (1)

The challenge of unawareness for impartial altruist action guidance

Comments (258)

Quotes: Recent discussions of backfire risks in AI safety

Some thinkers in AI safety have recently pointed out various backfire effects that attempts to reduce AI x-risk can have. I think pretty much all of these effects were known before,[1] but it's helpful to have them front of mind. In particular, I'm skeptical that we can weigh these effects against the upsides precisely enough to say an AI x-risk intervention is positive or negative in expectation, without making an arbitrary call. (Even if our favorite intervention doesn't have these specific downsides, we should ask if we're pricing in the downsides (and upsides) we haven't yet discovered.)

(Emphasis mine, in all the quotes below.)

Holden's Oct 2025 80K interview:

Holden: I mean, take any project. Let’s just take something that seems really nice, like alignment research. You’re trying to detect if the AI is scheming against you and make it not scheme against you. Maybe that’ll be good. But maybe the thing you’re doing is something that is going to get people excited, and then they’re going to try it instead of doing some other approach. And then it doesn’t work, and the other approach would have worked. Well, now you’ve done tremendous harm. Maybe it will work fine, but it will give people a false sense of security, make them think the problem is solved more than it is, make them move on to other things, and then you’ll have a tremendous negative impact that way.

Rob Wiblin: Maybe it’ll be used by a human group to get more control, to more reliably be able to direct an AI to do something and then do a power grab.

Holden: [...] Maybe it would have been great if the AIs took over the world. Maybe we’ll build AIs that are not exactly aligned with humans; they’re actually just much better — they’re kind of like our bright side, they’re the side we wish we were. [...]

[... M]aybe alignment is just a really… What it means is that you’re helping make sure that someone who’s intellectually unsophisticated — that’s us, that’s humans — remains forever in control of the rest of the universe and imposes whatever dumb ideas we have on it forevermore, instead of having our future evolve according to things that are much more sophisticated and better reasoners following their own values.

[...]

Holden: I just think AI is too multidimensional, and there’s too many considerations pointing in opposite directions. I’m worried about AIs taking over the world, but I’m also worried about the wrong humans taking over the world. And a lot of those things tend to offset each other, and making one better can make the other worse. [...]

[T]here’s also a lot of micro ways in which you could do harm. Just literally working in safety and being annoying, you might do net harm. You might just talk to the wrong person at the wrong time, get on their nerves. I’ve heard lots of stories of this. Just like, this person does great safety work, but they really annoyed this one person, and that might be the reason we all go extinct. 

[...]

Option value in the policy world is kind of a bad concept anyway. A lot of times when you’re at a nonprofit or a company and you don’t know what to do, you try and preserve option value. But giving the government the option to go one way or the other, that’s not a neutral intervention — it’s just like you don’t know what they’re going to do with that option. Giving them the option could have been bad. … you don’t know who’s going to be in power when, and whether they’re going to have anything like the goals that you had when you put in some power that they had. I know people have been excited at various points about giving government more power and then at other points giving government less power.

And all this stuff, I mean, this one axis you’re talking about: centralisation of power versus decentralisation. Most things that touch policy at all in any way will move us along that spectrum in one direction or another, so therefore have a high chance of being negative [...] 

And then most things that you can do in AI at all will have some impact on policy. Even just alignment research: policy will be shaped by what we’re seeing from alignment research, how tractable it looks, what the interventions look like.

[... I]n AI, it’s easier to annoy someone and polarise them against you, because whatever it is you’re trying to do, there’s some coalition that’s trying to do the exact opposite. In certain parts of global health and farm animal welfare, there’s certainly people who want to prioritise it less, but it doesn’t have the same directional ambiguity.


Helen Toner’s Nov 2025 80K interview:

Helen: And I think there’s a natural tension here as well among some people who are very concerned about existential risk from AI, really bad outcomes, and AI safety: there’s this sense that it’s actually helpful if there’s only a smaller number of players. Because, one, they can coordinate better — so maybe if racing leads to riskier outcomes, if you just have two top players, they can coordinate more directly than if you have three or four or 10 — and also a smaller number of players is going to be easier for an outside body to regulate, so if you just have a small number of companies, that’s going to be easier to regulate.

[...] But the problem is then the “Then what?” question of, if you do manage to avoid some of those worst-case outcomes, and then you have this incredibly powerful technology in the hands of a very small number of people, I think just historically that’s been really bad. It’s really bad when you have small groups that are very powerful, and typically it doesn’t result in good outcomes for the rest of the world and the rest of humanity.

[...]

Rob: I feel like we’re in a very difficult spot, because so many of the obvious solutions that you might have, or approaches you might take to dealing with loss of control do make the concentration of power problem worse and vice versa. So what policies you favour and disfavour depends quite sensitively on the relative risk of these two things, the relative likelihood of things going negatively in one way versus the other way.

And at least on the loss of control thing, people disagree so much on the likelihood. People who are similarly informed, know about everything there is to know about this, go all the way from thinking it’s a 1-in-1,000 chance to it’s a 1-in-2 chance — a 0.1% likelihood to 50% chance that we have some sort of catastrophic loss of control. And discussing it leads sometimes to some convergence, but people just have not converged on a common sense of how likely this outcome is.

So the people who think it’s 50% likely that we have some catastrophic loss-of-control event, it’s understandable that they think, “Well, we just have to make the best of it. Unfortunately, we have to concentrate. It’s the only way. And the concentration of power stuff is very sad and going to be a difficult issue to deal with, but we have to bear that cost.” And people who think it’s one in 1,000 are going to say, “This is a terrible move that you’re making, because we’re accepting much more risk, we’re creating much more risk than we’re actually eliminating.”


Wei Dai, “Legible vs. Illegible AI Safety Problems”:

Some AI safety problems are legible (obvious or understandable) to company leaders and government policymakers, implying they are unlikely to deploy or allow deployment of an AI while those problems remain open (i.e., appear unsolved according to the information they have access to). But some problems are illegible (obscure or hard to understand, or in a common cognitive blind spot), meaning there is a high risk that leaders and policymakers will decide to deploy or allow deployment even if they are not solved. (Of course, this is a spectrum, but I am simplifying it to a binary for ease of exposition.)

From an x-risk perspective, working on highly legible safety problems has low or even negative expected value. Similar to working on AI capabilities, it brings forward the date by which AGI/ASI will be deployed, leaving less time to solve the illegible x-safety problems.

  1. ^

    Among other sources, see my compilations of backfire effects here and here, and discussion of downside risks of capacity-building / aiming for "option value" or "wiser futures" here.

So then, the difference between (a) and (b) is purely empirical, and MNB does not allow me to compare (a) and (b), right? This is what I'd find a bit arbitrary, at first glance.

Gotcha, thanks! Yeah, I think it's fair to be somewhat suspicious of giving special status to "normative views". I'm still sympathetic to doing so for the reasons I mention in the post (here). But it would be great to dig into this more.

What would the justification standards in wild animal welfare say about uncertainty-laden decisions that involve neither AI nor animals: e.g. as a government, deciding which policies to enact, or as a US citizen, deciding who to vote for President?

Yeah, I think this is a feeling that the folks working on bracketing are trying to capture: that in quotidian decision-making contexts, we generally use the factors we aren't clueless about (@Anthony DiGiovanni -- I think I recall a bracketing piece explicitly making a comparison to day-to-day decision making, but now can't find it... so correct me if I'm wrong!). So I'm interested to see how that progresses.

I think the vast majority of people making decisions about public policy or who to vote for either aren't ethically impartial, or they're "spotlighting", as you put it. I expect the kind of bracketing I'd endorse upon reflection to look pretty different from such decision-making.

That said, maybe you're thinking of this point I mentioned to you on a call: I think even if someone is purely self-interested (say), they plausibly should be clueless about their actions' impact on their expected lifetime welfare, because of strange post-AGI scenarios (or possible afterlives, simulation hypotheses, etc.).[1] See this paper. So it seems like the justification for basic prudential decision-making might have to rely on something like bracketing, as far as I can tell. Even if it's not the formal theory of bracketing given here. (I have a draft about this on the backburner, happy to share if interested.)

  1. ^

    I used to be skeptical of this claim, for the reasons argued in this comment. I like the "impartial goodness is freaking weird" intuition pump for cluelessness given in the comment. But I've come around to thinking "time-impartial goodness, even for a single moral patient who might live into the singularity, is freaking weird".

Would you say that what dictates my view on (a) vs (b) is my uncertainty between different epistemic principles?

It seems pretty implausible to me that there are distinct normative principles that, combined with the principle of non-arbitrariness I mention in the "Problem 1" section, imply (b). Instead I suspect Vasco is reasoning about the implications of epistemic principles (applied to our evidence) in a way I'd find uncompelling even if I endorsed precise Bayesianism. So I think I'd answer "no" to your question. But I don't understand Vasco's view well enough to be confident.

Can you explain more why answering "no" makes metanormatively bracketing in consequentialist bracketing (a bit) arbitrary? My thinking is: Let E be epistemic principles that, among other things, require non-arbitrariness. (So, normative views that involve E might provide strong reasons for choice, all else equal.) If it's sufficiently implausible that E would imply Vasco's view, then E will still leave us clueless, because of insensitivity to mild sweetening.
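To make the "insensitivity to mild sweetening" point concrete, here is a minimal toy sketch (my own illustration, with made-up numbers, payoffs, and function names, nothing from the original comment): if the representor's expected values for an intervention straddle zero, adding a small sure sweetener leaves them straddling zero, so the comparison stays indeterminate.

```python
# Toy illustration of "insensitivity to mild sweetening" with imprecise credences.
# Suppose the representor contains every credence p in [0.4, 0.6] that an
# intervention helps (payoff +1) rather than backfires (payoff -1). The expected
# value straddles zero across the representor, so the intervention is
# incomparable with doing nothing, and a small sure sweetener doesn't change that.

def ev_interval(p_low, p_high, payoff_good=1.0, payoff_bad=-1.0, sweetener=0.0):
    """Range of expected values over the representor [p_low, p_high]."""
    ev = lambda p: p * payoff_good + (1 - p) * payoff_bad + sweetener
    return round(ev(p_low), 2), round(ev(p_high), 2)

print(ev_interval(0.4, 0.6))                  # (-0.2, 0.2): straddles 0 -> clueless
print(ev_interval(0.4, 0.6, sweetener=0.01))  # (-0.19, 0.21): still straddles 0
```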

But lots of the interventions in 2. seem to also be helpful for getting things to go better for current farmed and wild animals, e.g. because they are aimed at avoiding a takeover of society by forces which don't care at all about morals

Presumably misaligned AIs are much less likely than humans to want to keep factory farming around, no? (I'd agree the case of wild animals is more complicated, if you're very uncertain or clueless whether their lives are good or bad.)

Thanks Jo! Yeah, the perspective I defend in that post in a nutshell is:

  • The "reasons" given by different normative views are qualitatively different.
  • So, when choosing between A and B, we should look at whether each normative view gives us reason to prefer A over B (or B over A).
  • If consequentialist views say A and B are incomparable, these views don't give me a reason to prefer A over B (or B over A).
  • Therefore, if the other normative views in aggregate say A is preferable, I have more reason to choose A.

(Similarly, the decision theory of "bracketing" might also resolve incomparability within consequentialism, but see here for some challenges.)
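As a rough illustration of the bullets above, here is a minimal sketch (my own toy formalization, not the post's; the names are made up, and the simple vote count stands in for whatever aggregation of reasons one actually endorses): views that deem A and B incomparable give no reason either way and so drop out, and the remaining reasons decide.

```python
# Minimal sketch of the rule in the bullets above: each normative view reports a
# reason for A over B, for B over A, or no reason at all (incomparability).
# Views that deem A and B incomparable drop out; the remaining reasons decide.

from typing import Literal, Optional

Verdict = Optional[Literal["A", "B"]]  # None = incomparable / no reason given

def choose(verdicts: dict[str, Verdict]) -> str:
    """Pick whichever option more views give a reason to prefer; tie -> either."""
    a_reasons = sum(v == "A" for v in verdicts.values())
    b_reasons = sum(v == "B" for v in verdicts.values())
    if a_reasons > b_reasons:
        return "A"
    if b_reasons > a_reasons:
        return "B"
    return "either (no net reason)"

# Consequentialism is silent because it deems A and B incomparable, but two
# other views both favor A, so on balance there is more reason to choose A.
print(choose({"consequentialism": None, "deontology": "A", "virtue": "A"}))  # "A"
```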

There are other good reasons to reject incomparability.

Re: the first link, what do you think of Dynamic Strong Maximality, which avoids money pumps while allowing for incomparability?

this happens to break at least the craziest Pascalian wagers, assuming plausible imprecise credences (see DiGiovanni 2024).

FWIW, since writing that post, I've come to think it's still pretty dang intuitively strange if taking the Pascalian wager is permissible on consequentialist grounds, even if not obligatory. Which is what maximality implies. I think you need something like bracketing in particular to avoid that conclusion, if you don't go with (IMO really ad hoc) bounded value functions or small-probability discounting.

(This section of the bracketing post is apropos.)
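For concreteness, here is a minimal numerical sketch (my own toy example with made-up payoffs, not the author's) of why maximality with imprecise credences merely permits, rather than rules out, taking a Pascalian wager:

```python
# Toy illustration of why maximality with imprecise credences *permits* the
# Pascalian wager. Option W: probability q of an astronomical payoff V, else a
# small cost c. Option N: do nothing (payoff 0). If the representor contains
# both a credence function with q = 0 and one with q = 1e-10, neither option is
# better under every admissible credence, so maximality permits both.

V, c = 1e15, 1.0

def ev_wager(q):
    return q * V - (1 - q) * c

for q in (0.0, 1e-10):
    better = "W" if ev_wager(q) > 0 else "N"
    print(f"q = {q}: EV(wager) = {ev_wager(q):.2f} -> this credence favors {better}")
# q = 0.0 favors N; q = 1e-10 favors W. Neither option is dominated, so both
# are maximality-permissible, which is the "intuitively strange" upshot above.
```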

This particular claim isn't empirical; it's about what follows from compelling epistemic principles.

(As for empirical evidence that would change my mind about imprecision being so severe that we're clueless, see our earlier exchange. I guess we hit a crux there.)

Hi Vasco —

one will still be better than the other in expectation

My posts argue that this is fundamentally the wrong framework. We don't have precise "expectations".
