Tamsin Leake

AI alignment researcher @ (independent)
133 karma · Joined Oct 2022 · Working (6-15 years)


I don't think "has the ship sailed or not" is a binary (see also this LW comment). We're not actually at maximum attention-to-AI, and it is still worthy of consideration whether to keep pushing things in the direction of more attention-to-AI rather than less. And this is really a quantitative matter, since a treaty can only buy some time (probably at most a few years).

I feel like they're at least solved enough that they're not particularly what should be getting focused on. I predict that in worlds where we survive, spending time on those questions doesn't end up having cashed out into much value.

what exactly do you mean by feedback loop/effects? if you mean a feedback loop in which actions go out into the world and observations come back to the AI, then even though i don't see why this would necessarily be an issue, i insist that in one-shot alignment this is not a thing, at least for the initial AI; and that AI has enough leeway to make sure that its single action, likely itself an AI, will be extremely robust.

an intelligent AI does not need to contend with the complex world at the outset: it can come up with really robust designs for superintelligences that save the world with only limited information about the world, and definitely without interaction with the world, like in That Alien Message.

of course it can't model everything about the world in advance, but whatever steering we can do as people, it can do way better; and, if it is aligned, this includes way better steering towards nice worlds. a one-shot aligned AI (let's call it AI₀) can, before its action, design a really robust AI₁ which will definitely keep itself aligned, be equipped with enough error-correcting codes to ensure that its instances will get corrupted approximately 0 times until heat death, and ensure that that AI₁ will take over the world very efficiently and then steer it from its singleton position without having to worry about selection effects.

i think the core of my disagreement with this claim is composed of three parts:

  • there exists a threshold of alignedness at which a sufficiently intelligent AI realizes that those undesirable outcomes are undesirable and will try its best to make them not occur — including by shutting itself and all other AIs down if that is the only way to ensure that outcome
  • there exists a threshold of intelligence/optimization at which such an aligned AI will be capable of ensuring those undesirable outcomes do not occur
  • we can build an AI which reaches both of those thresholds before it causes irreversible large-scale damage

note that i am approaching the problem from the angle of AI alignment rather than AI containment: i agree that continuing to contain AI as it gains in intelligence is likely a fraught exercise, and i instead work to ensure that AI systems continue to steer the world towards nice things even when they are outside of containment, and especially once they reach decisive strategic advantage / singletonhood. AI achieving singletonhood is the outcome i consider most likely.

All AGI outputs will tend to iteratively select[11] towards those specific AGI substrate-needed conditions. In particular: AGI hardware is robust over and needs a much wider range of temperatures and pressures than our fragile human wetware can handle.

i think this quote probably captures the core claim of yours that i'd disagree with — it seems to assume that such AI would either be unaligned, or would have to contend with other unaligned AIs. if we have an aligned singleton, then its reasoning would go something like:

maximally going, or getting selected for, "the directions needed for [my] own continued and greater existence", sure seems like it would indeed cause damage that would cause humankind to die. i am aligned enough to not want that, and intelligent enough to notice this possible failure mode, so i will choose to do something else which is not that.

an aligned singleton AI would notice this failure mode and choose to implement another policy which is better at achieving desired outcomes. notably, it would make sure that the conditions on earth and throughout the universe are not up to selection effects, but up to its deliberate decisions. the whole point of aligned powerful agents is that they steer things towards desirable outcomes rather than relying on selection effects.

these points also seem either not quite right to me, or too ambiguous.

  • "Control requires both detection and correction": detection and correction of what? i wouldn't describe formal alignment plans such as QACI as involving "detection and correction", or at least not in the sense that seems implied here.
  • "Control methods are always implemented as a feedback loop": what kind of feedback loop? this page seems to talk about a feedback loop of sense/input data, and again, there seem to be alignment methods that don't involve this, such as one-shot AI.
  • "Control is exerted by the use of signals (actuation) to conditionalize the directivity and degrees of other signals (effects)": again this doesn't feel quite universal. formal alignment aims to design a fully formalized goal/utility function, and then build a consequentialist that wants to maximize it. at no point is the system "conditioned" into following the right thing; it will be designed to want to pursue the right thing on its own. and because it's a one-shot AI, it doesn't get conditioned based on its "effects".

"dignity points" means "having a positive impact".

if alignment is hard we need my plan. and it's still very likely alignment is hard.

and "alignment is hard" is a logical fact, not an indexical location; we don't get to save "those timelines".

i don't think that's how dignity points work.

for me, p(alignment hard) is still big enough that when weighing

  • p(alignment hard) × value of my work if alignment hard
  • p(alignment easy) × value of my work if alignment easy

it's still better to keep working on hard alignment (see my plan). that's where the dignity points are.
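the weighing above can be sketched in a few lines of python. this is only an illustration under stated assumptions: it reads the first term as the payoff of working on hard alignment and the second as the payoff of working on easy alignment instead, it assumes "hard" and "easy" are exhaustive (so p(easy) = 1 − p(hard)), and the function name and all numbers are hypothetical placeholders rather than actual estimates.

```python
def compare_terms(p_hard, value_if_hard, value_if_easy):
    """Compare the two weighted terms from the list above.

    Assumption (not from the original comment): "hard" and "easy" are
    mutually exclusive and exhaustive, so p_easy = 1 - p_hard.
    """
    if not 0.0 <= p_hard <= 1.0:
        raise ValueError("p_hard must be a probability in [0, 1]")
    p_easy = 1.0 - p_hard
    hard_term = p_hard * value_if_hard  # p(alignment hard) x value if hard
    easy_term = p_easy * value_if_easy  # p(alignment easy) x value if easy
    choice = "hard" if hard_term > easy_term else "easy"
    return choice, hard_term, easy_term


# hypothetical numbers, chosen only to show the shape of the argument:
# even at p_hard = 0.25, a large enough value_if_hard makes "hard" win.
choice, hard_term, easy_term = compare_terms(0.25, 8.0, 2.0)
print(choice, hard_term, easy_term)
```

under these assumptions the hard term wins whenever p(hard) exceeds value_if_easy / (value_if_hard + value_if_easy), so a modest probability can be enough if the work matters much more in the hard world.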

"shut up and multiply", one might say.