Background in philosophy, international development, statistics. Doing a technical AI PhD at Bristol.
Financial conflict of interest: technically the British government through the funding council.
Follow-up post to ARCHES with ranking of existing fields, lots more bibliographies.
Some more prior art, on Earth vs off-world "lifeboats". See also 4.2 here for a model of mining Mercury (for solar panels, not habitats).
This makes sense. I don't mean to imply that we don't need direct work.
AI strategy people have thought a lot about the capabilities : safety ratio, but it'd be interesting to think about the ratio of complementary parts of safety you mention. Ben Garfinkel notes that e.g. reward engineering work (by alignment researchers) is dual-use; it's not hard to imagine scenarios where lots of progress in reward engineering without corresponding progress in inner alignment could hurt us.
research done by people who are trying to do something else will probably end up not being very helpful for some of the core problems.
Yeah, it'd be good to break AGI control down more, to see if there are classes of problem where we should expect indirect work to be much less useful. But this particular model already has enough degrees of freedom to make me nervous.
I think that it might be easier to assign a value to the discount factor by assessing the total contributions of EA safety and non-EA safety.
That would be great! I used headcount because it's relatively easy, but value weights are clearly better. Do you know any reviews of alignment contributions?
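For concreteness, here is a minimal sketch of the difference between the two weightings. It assumes (my reading, not something the Guesstimate model states) that the discount factor is the per-head value of indirect work relative to direct work. The function names and every number in the usage lines are hypothetical placeholders, not estimates.

```python
def effective_headcount(direct, indirect, discount):
    """Direct researchers plus indirect researchers scaled by a discount factor."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be in [0, 1]")
    return direct + discount * indirect


def implied_discount(direct_value, indirect_value, direct, indirect):
    """Discount implied by total contributions (assumed definition):
    per-head value of indirect work divided by per-head value of direct work."""
    return (indirect_value / indirect) / (direct_value / direct)


# Hypothetical placeholder inputs, purely for illustration.
d = implied_discount(direct_value=10, indirect_value=20, direct=100, indirect=1000)
total = effective_headcount(direct=100, indirect=1000, discount=d)
print(d, total)
```

The point is only that a review of contributions would let the discount be derived from value totals rather than guessed from headcount.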
... This doesn't seem to mesh with your claim about their relative productivity.
Yeah, I don't claim to be systematic. The nine are just notable things I happened across, rather than an exhaustive list of academic contributions. Besides the weak evidence from the model, my optimism about there being many other academic contributions is based on my own shallow knowledge of AI: "if even I could come up with 9..."
Something like the Median insights collection, but for alignment, would be amazing, but I didn't have time.
those senior researchers won't necessarily have useful things to say about how to do safety research
This might be another crux: "how much do general AI research skills transfer to alignment research?" (Tacitly I was assuming medium-high transfer.)
I think the link is to the wrong model?
No, that's the one; I mean the 2x2 of factors which lead to '% work that is alignment relevant'. (Annoyingly, Guesstimate hides the dependencies by default; try View > Visible)
An important source of capabilities / safety overlap, via Ben Garfinkel:
Let’s say you’re trying to develop a robotic system that can clean a house as well as a human house-cleaner can... Basically, you’ll find that if you try to do this today, it’s really hard to do that. A lot of traditional techniques that people use to train these sorts of systems involve reinforcement learning with essentially a hand-specified reward function...
One issue you’ll find is that the robot is probably doing totally horrible things because you care about a lot of other stuff besides just minimizing dust. If you just do this, the robot won’t care about, let’s say throwing out valuable objects that happened to be dusty. It won’t care about, let’s say, ripping apart a couch cushion to find dust on the inside... You’ll probably find any simple line of code you write isn’t going to capture all the nuances. Probably the system will end up doing stuff that you’re not happy with.
This is essentially an alignment problem. This is a problem of giving the system the right goals. You don’t really have a way using the standard techniques of making the system even really act like it’s trying to do the thing that you want it to be doing. There are some techniques that are being worked on actually by people in the AI safety and the AI alignment community to try and basically figure out a way of getting the system to do what you want it to be doing without needing to hand-specify this reward function...
These are all things that are being developed by basically the AI safety community. I think the interesting thing about them is that it seems like until we actually develop these techniques, probably we’re not in a position to develop anything that even really looks like it’s trying to clean a house, or anything that anyone would ever really want to deploy in the real world. It seems like there’s this interesting sense in which we have the sort of system we’d like to create, but until we can work out the sorts of techniques that people in the alignment community are working on, we can’t give it anything even approaching the right goals. And if we can’t give anything approaching the right goals, we probably aren’t going to go out and, let’s say, deploy systems in the world that just mess up people’s houses in order to minimize dust.
I think this is interesting, in the sense in which the processes to give things the right goals bottleneck the process of creating systems that we would regard as highly capable and that we want to put out there.
He sees this as positive: it implies massive economic incentives to do some alignment, and a block on capabilities until it's done. But it could be a liability as well, if the alignment of weak systems is correspondingly weak, and if mid-term safety work fed into a capabilities feedback loop with greater amplification.
Thanks for this, I've flagged this in the main text. Should've paid more attention to my confusion on reading their old announcement!
If the above strikes you as wrong (and not just vague), you could copy the Guesstimate, edit the parameters, and comment below.
It's a common view. Some GiveWell staff hold this view, and indeed most of their work involves short-term effects, probably for epistemic reasons. Michael Plant has written about the EA implications of person-affecting views, and emphasises improvements to world mental health.
Here's a back-of-the-envelope estimate for why person-affecting views might still be bound to prioritise existential risk though (for the reason you give, but with some numbers for easier comparison).
Dominic Roser and I have also puzzled over Christian longtermism a bit.
Great comment. I count only 65 percentage points - is the other third "something else happened"?
Or were you not conditioning on long-termist failure? (That would be scary.)
IKEA is an interesting case: it was bequeathed entirely to a nonprofit foundation with a very loose mission and no owner(?)
Not a silly question IMO. I thought about Satoshi Nakamoto's bitcoin - but if they're dead, then it's owned by their heirs, or failing that by the government of whatever jurisdiction they were in. In places like Britain I think a combination of "bona vacantia" (unclaimed estates go to the government) and "treasure trove" (old treasure does too) covers the edge cases. And if all else fails there's "finders keepers".