Koen Holtman



Thanks for sharing! Speaking as a European I think this is a pretty good summary of the latest state of events.

I currently expect the full text of the Act as agreed on in the trilogue to be published by the EU some time in January or February.

Hi Holden! I may be able to get some people in my network interested in submitting a funding request to you for writing a case study.

There are two important questions they would have, which I could not find answers to in your post or form:

  1. Are you inviting case studies that you will be able to post on the web when you get them, or that the authors are also allowed to publish themselves as a blog post or academic journal article?

  2. Are you inviting case studies for which the authors can request that they are kept confidential?

One of the biggest challenges with AI safety standards will be the fact that no one really knows how to verify that a (sufficiently-powerful) system is safe. And a lot of experts disagree on the type of evidence that would be sufficient.

While overcoming expert disagreement is a challenge, it is not one that is as big as you think. TL;DR: Deciding not to agree is always an option.

To expand on this: the fallback option in a safety standards creation process, for standards that aim to define a certain level of safe-enough, is as follows. If the experts involved cannot agree on any evidence-based method for verifying that a system X is safe enough according to the level of safety required by the standard, then the standard being created will simply, and usually implicitly, declare that there is no route by which system X can comply with the safety standard. If you are required by law, say by EU law, to comply with the safety standard before shipping a system into the EU market, then your only legal option will be to never ship that system X into the EU market.

For AI systems you interact with over the Internet, this 'never ship' translates to 'never allow it to interact over the Internet with EU residents'.

I am currently on the JTC21 committee, which is running the above standards creation process to write the AI safety standards in support of the EU AI Act, the Act that will regulate certain parts of the AI industry if they want to ship legally into the EU market. (Legal detail: if you cannot comply with the standards, the Act will give you several other options that may still allow you to ship legally, but I won't get into explaining all those here. These other options will not give you a loophole to evade all expert scrutiny.)

Back to the mechanics of a standards committee: if a certain AI technology, when applied in a system X, is well known to make that system radioactively unpredictable, it will not usually take long for the technical experts in a standards committee to come to an agreement that there is no way that they can define any method in the standard for verifying that X will be safe according to the standard. The radioactively unsafe cases are the easiest cases to handle.

That being said, in all but the most trivial of safety engineering fields, there are complicated epistemics involved in deciding when something is safe enough to ship, and this is complicated whether you use standards or not. I have written about this topic, in the context of AGI, in section 14 of this paper.

There is some AGI safety work that specifically targets deep RL, under the assumption that deep RL might scale to AGI. But there is also a lot of other work, both on failure modes and on solutions, that is much more independent of the method being used to create the AGI.

I do not have percentages on how it breaks down. Things are in flux. A lot of the new technical alignment startups seem to be mostly working in a deep RL context. But a significant part of the more theoretical work, and even some of the experimental work, involves reasoning about a very broad class of hypothetical future AGI systems, not just those that might be produced by deep RL.

Are people in the AI safety community thinking about this?

Yes. They think about this more on the policy side than on the technical side, but there is technical/policy cross-over work too.

Should we be concerned that an aligned AI's values will be set by (for example) the small team that created it, who might have idiosyncratic and/or bad values?


There is a significant amount of talk about 'aligned with whom exactly'. But many of the more technical papers and blog posts on x-risk style alignment tend to ignore this part of the problem, or mention it only in one or two sentences and then move on. This does not necessarily mean that the authors are unconcerned about this question; more often it means that they feel they have little new to say about it.

If you want to see an example of a vigorous and occasionally politically sophisticated debate on solving the 'aligned with whom' question, instead of the moral philosophy 101/201 debate which is still the dominant form of discourse in the x-risk community, you can dip into the literature on AI fairness.

Thus we EAs should vigorously investigate whether this concern is well-founded

I am not an EA, but I am an alignment researcher. I see only a small sliver of alignment research for which this concern would be well-founded.

To give an example that is less complicated than my own research: suppose I design a reward function component, say some penalty term, that can be added to an AI/AGI reward function to make the AI/AGI more aligned. Why not publish this? I want to publish it widely so that more people will actually design/train their AI/AGI using this penalty term.
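To make the penalty-term idea concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the names `with_penalty`, `task_reward`, and `side_effect_penalty`, the toy state format, and the weight value are all hypothetical and do not come from any specific paper or library.

```python
from typing import Callable, Dict

# Hypothetical toy types: a state is a dict of counters, an action is a string.
State = Dict[str, int]
RewardFn = Callable[[State, str], float]


def with_penalty(base_reward: RewardFn, penalty: RewardFn, weight: float = 1.0) -> RewardFn:
    """Return a new reward function: base reward minus a weighted penalty term."""
    def shaped(state: State, action: str) -> float:
        return base_reward(state, action) - weight * penalty(state, action)
    return shaped


def task_reward(state: State, action: str) -> float:
    # Hypothetical task reward: 1 for completing the task, 0 otherwise.
    return 1.0 if action == "deliver" else 0.0


def side_effect_penalty(state: State, action: str) -> float:
    # Hypothetical penalty: grows with the number of unwanted side effects.
    return float(state.get("vases_broken", 0))


# The published component is the penalty term; anyone can attach it
# to their own base reward function.
aligned_reward = with_penalty(task_reward, side_effect_penalty, weight=0.5)
```

The point of the sketch is that the penalty term is a separable component: publishing it lets others add it to whatever reward function they already use.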

Your argument has a built-in assumption that it will be hard to build AGIs that will lack the instrumental drive to protect themselves from being switched off, or protect themselves from having their goals changed. I do not agree with this built-in assumption, but even if I did agree, I would see no downside to publishing alignment research about writing better reward functions.