Koen Holtman

6 karma · Joined Nov 2022


There is some AGI safety work that specifically targets deep RL, under the assumption that deep RL might scale to AGI. But there is also a lot of other work, both on failure modes and on solutions, that is much more independent of the method being used to create the AGI.

I do not have percentages on how it breaks down. Things are in flux. A lot of the new technical alignment startups seem to be mostly working in a deep RL context. But a significant part of the more theoretical work, and even some of the experimental work, involves reasoning about a very broad class of hypothetical future AGI systems, not just those that might be produced by deep RL.

Are people in the AI safety community thinking about this?

Yes. They think about this more on the policy side than on the technical side, but there is technical/policy cross-over work too.

Should we be concerned that an aligned AI's values will be set by (for example) the small team that created it, who might have idiosyncratic and/or bad values?


There is a significant amount of talk about 'aligned with whom exactly'. But many of the more technical papers and blog posts on x-risk style alignment tend to ignore this part of the problem, or mention it only in one or two sentences and then move on. This does not necessarily mean that the authors are unconcerned about this question; more often it means that they feel they have little new to say about it.

If you want to see an example of a vigorous and occasionally politically sophisticated debate on solving the 'aligned with whom' question, instead of the moral philosophy 101/201 debate which is still the dominant form of discourse in the x-risk community, you can dip into the literature on AI fairness.

Thus we EAs should vigorously investigate whether this concern is well-founded

I am not an EA, but I am an alignment researcher. I see only a small sliver of alignment research for which this concern would be well-founded.

To give an example that is less complicated than my own research: suppose I design a reward function component, say some penalty term, that can be added to an AI/AGI reward function to make the AI/AGI more aligned. Why not publish this? I want to publish it widely so that more people will actually design/train their AI/AGI using this penalty term.
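To make the penalty-term idea concrete, here is a minimal sketch in Python. Everything in it is a hypothetical illustration rather than a proposal from any published paper: the `with_penalty` helper, the `task_reward` and `side_effect_penalty` functions, the state fields, and the `weight` parameter are all assumptions chosen only to show the shape of the construction.

```python
from typing import Callable, Dict

# A state is modelled here as a simple dict of named numeric features.
# This representation is an assumption made for the sake of the sketch.
State = Dict[str, float]
RewardFn = Callable[[State], float]


def with_penalty(base_reward: RewardFn, penalty: RewardFn, weight: float = 1.0) -> RewardFn:
    """Return a new reward function: base_reward(s) - weight * penalty(s)."""
    def shaped(state: State) -> float:
        return base_reward(state) - weight * penalty(state)
    return shaped


def task_reward(state: State) -> float:
    # Hypothetical task reward: progress on whatever the system is built to do.
    return state["task_progress"]


def side_effect_penalty(state: State) -> float:
    # Hypothetical penalty term: the magnitude of some measured unwanted side effect.
    return abs(state["side_effects"])


# Anyone who trains a system can bolt the published penalty term onto
# their own reward function.
reward = with_penalty(task_reward, side_effect_penalty, weight=0.5)
```

The point of the sketch is only that such a component composes with an arbitrary existing reward function, which is why wide publication makes it more likely to be used.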

Your argument has a built-in assumption that it will be hard to build AGIs that will lack the instrumental drive to protect themselves from being switched off, or protect themselves from having their goals changed. I do not agree with this built-in assumption, but even if I did agree, I would see no downside to publishing alignment research about writing better reward functions.