Pursuing an undergraduate degree


I'm studying at Oxford, where I do some AI safety community building. Currently a Summer Research Fellow at CERI, working on applications of AI to state surveillance. Blogging at



Disagreements about Alignment: Why, and how, we should try to solve them

Thanks for posting this! I agree that AI alignment is currently pre-paradigmatic, but I disagree (at least partially) with your conclusions.

You mention two kinds of steering in your post: one that’s focused on evaluating the assumptions, theories of change, and likelihood of meaningful contribution to aligning AI of specific research agendas[1], and another that investigates the extent to which we can get experimental evidence about alignment strategies from sub-AGI systems. I think the latter question is a crux for assessing the goodness of different research agendas, and that we should spend much more time working on it. I’m unconvinced that we should spend much more time working on the former.

The reason I think the question of whether sub-AGI alignment evidence generalises to AGI is a crux is that it heavily informs which research agendas we should pursue. If evidence does generalise, we can treat alignment like a science in the normal way: making theories, testing their predictions, and seeing which theories best fit the evidence. In that case, we don't need to spend more time evaluating theoretical assumptions and theories of change - we just need to keep gathering evidence and discarding poorly-fitting theories. However, if a) evidence does not generalise well and b) we want to be very, very confident that we can align AGI, then we should spend far less time on research agendas with weak theoretical justifications (even if they have strong evidential justifications), and more time on research agendas with strong theoretical justifications.

So if we’re in the ‘sub-AGI evidence of alignment doesn’t tell us about AGI alignment’ world, I pretty much agree with you, but with caveats:

  • I agree with Oscar that steering (in the ‘evaluating assumptions and theories of change of different research agendas’ sense) seems really difficult unless you have a lot of context and experience, and even then it sounds like it would take a long time to do a good job. I’m pessimistic that anyone other than established alignment researchers could do a good job, and unfortunately the opportunity cost of alignment researchers working on steering is really high.
  • Oscar argues that the comparative advantage of senior researchers is to steer, and therefore they should spend more time steering. This conclusion doesn’t follow if you think steering and rowing have sufficiently different value. I think the value of senior researchers doing direct research is sufficiently high that even though they are comparatively better suited to steer, they should still spend their time on direct research.
  • As we get closer to AGI (using your definition of ‘AI systems which are at least as capable as humans across a range of domains’), we should be more and more surprised if evidence of alignment doesn’t generalise. I would guess that AGI and not-quite-as-capable-as-humans-but-extremely-close AI aren’t qualitatively different, so it would be surprising if there was this discontinuous jump in how useful evidence is for assessing competing theories.
  • This means that as we get closer to AGI, we should be more confident that evidence of alignment on current AI systems is helpful, and so spend more time rowing (doing research which produces evidence) vs steering (thinking about the assumptions and theories of change of different research agendas).

It might be really hard to figure out whether sub-AGI evidence of alignment tells us about AGI alignment. In that case, given our uncertainty it makes sense to spend some time steering as you describe it (i.e. evaluating the assumptions and theories of change of different research agendas). But this is time-consuming and has a high opportunity cost, and our answer to the evidence question is crucial to figuring out the amount of time we should spend on this. Given this, I think the steering we do should be focused on figuring out the overarching question of whether sub-AGI evidence of alignment tells us about AGI alignment, and not on the narrower task of evaluating different research agendas. Plausible research agendas should just be pursued.

  1. I think in your post you alternate between referring to epistemic strategies and research agendas. I found it easier to understand when I took you to mean research agendas, so I'll use that term in my comment. ↩︎

What if AI development goes well?

I really liked this post, thanks for writing it! I'm much more sympathetic to ideal governance now.

Two fairly off-the-cuff reactions:

First, I would guess that some (most?) of the appeal of utopias is conditional on a lack of scarcity. I'm not sure how to interpret Karnofsky's results here: he notes that his freedom-focused utopia, which made no mention of happiness, wealth or fulfilment, was still the third most popular utopia amongst respondents. However, the other top-five utopias highlight a lack of scarcity, and even the freedom-focused utopia implicitly assumes it ("If you aren't interfering with someone else's life, you can do whatever you want"). Naively, I'd guess that at least some people value freedom so highly only if they can do what they like with that freedom.

I think this is a problem for creating utopias that respect pluralism - it's still possible, but I think it's harder than it appears. I would expect people to have strong opinions about resource allocation, which makes it hard to get to post-scarcity for everyone, which makes it hard to reach a utopia that many people like. The counterargument here is that if we had resource abundance, people would be much happier sharing resources, and this would make getting to a post-scarcity state for everyone much easier, but I'm a little sceptical about this. (Billionaires are quite close to resource abundance, and most of them seem quite keen to hold onto their personal wealth/don't seem very interested in redistributing their resources?)

It might still be motivating to construct pluralistic utopias which assume post-scarcity for everyone, even if this condition is unlikely to be met in practice, but I'm less confident that utopias which require a fairly unlikely condition will be action-guiding.

Second, I agree that AI ideal governance theories are useful for action. But shouldn't we care more about how useful for action they are? I'm not sure how valuable working on AI ideal governance is compared to working on e.g. more applied AI policy, and it seems like you need a stronger claim than "AI ideal governance is useful" to motivate working on it. (Probably >0 people should work on ideal governance? But without a compelling argument that it's more valuable on the margin than other AI governance work, I'm not sure many people should.)

On the Vulnerable World Hypothesis

On 1: I agree it's not clear that having surveillance would make us less likely to implement other defence mechanisms because of a false sense of security. I think it's more plausible that having surveillance makes us less likely to implement other defence mechanisms because implementing new policies takes time, political energy, and money. I think it makes sense to think about policymaking as a prioritisation question, and controversial, expensive policies are probably less likely to be implemented if the issue they address is perceived to have been dealt with. So I'd expect implementing perceived-effective surveillance to decrease the likelihood that other defence mechanisms aimed at reducing GCRs are implemented. (Although this isn't necessarily the case - maybe increasing surveillance makes other extreme defence mechanisms less politically costly?) This isn't an argument I make in my post, so thanks for pushing back!

I like your point on supposedly effective surveillance as a kind of bluff. I think this imposes a lower bound on the effectiveness of global surveillance, as even ineffective surveillance will have this deterrence effect. However, I'd guess that over time, malicious actors will realise that the system is less effective than they initially thought, and so the risk from malicious actors creeps back up again. (This is speculative: I'm guessing that some actors will still try things and realise that they don't get caught, and that there's some communication between malicious actors. My immediate reaction was "man, it'd be hard for a surveillance system that wasn't that effective to be considered effective for a really long time - won't people find out?")

On 2 and 3: yeah, I agree with you here that totalitarianism risk is the main problem, and I should have been clearer about that in my post. I can imagine (like you say) that in a world where trusted global surveillance has always been the norm, we remain weird and free-thinking.

On the Vulnerable World Hypothesis

The anxiety point sounds plausible to me, but it depends on how the surveillance is implemented and who implements it (as do all my concerns, to be fair). I expect if surveillance was gradually introduced and generally implemented by trusted actors, then people would be much less likely to feel anxious about being watched. (Maybe a relevant analogy is CCTV - people now seem basically fine with being on camera in public, at least in Britain, but I expect we'd be much less happy about it if we'd gone from 0 CCTV cameras to current levels.)

I agree that if surveillance is stopping most bad acts currently, the case for expanding it is stronger! I probably should have been clearer about this in my post. I think my main worry is that harm doesn't increase linearly with the scale of surveillance - I think some harms, like totalitarianism risk and effects on free speech and weirdness, only occur when surveillance is very widespread (if not universal). So even if limited forms of surveillance are doing a good job at stopping bad stuff, we should think carefully about massively expanding it.

I agree with your last point too, and I don't think my suggestions were particularly good. Ideally we could find an effective response which, if it involves surveillance, is limited in scope - i.e. surveilling only people in certain roles or contexts. I think this would be significantly less harmful than ubiquitous surveillance, for the reasons described in the previous paragraph. And I also don't think we should implement all of these methods, for the same reasons :)

On the Vulnerable World Hypothesis

Yeah, thanks for flagging this! I didn't cover the other kinds of risks because I think the case for surveillance is strongest for mitigating type-1 risks, and Bostrom's suggestions for mitigating other risks looked less contentious.

On the Vulnerable World Hypothesis

Hey, thanks for commenting!

I think this is a good criticism, and despite most of my post arguing that surveillance would probably be bad, I agree that in some cases it could still be worth it. I think my crux is whether the decrease of risk from malicious actors due to surveillance is greater than the increase in totalitarianism and misuse risk (plus general harms to free speech and so on).

It seems like surveillance must be global and very effective to greatly decrease the risk from malicious actors, and furthermore that it's really hard to reduce the misuse risk of global, effective surveillance. I'm sceptical that we could make the risks associated with surveillance small enough for it to be the overall less risky option, even supposing the risks surveillance decreases are worse than the ones it increases. (I don't think I share this intuition, but it definitely seems right from a utilitarian perspective.) I agree, though, that in principle, despite increasing other risks, it might sometimes be better to surveil.