I'm a PhD student at the Center for Human-Compatible AI (CHAI) at UC Berkeley. I edit and publish the Alignment Newsletter, a weekly publication with recent content relevant to AI alignment. In the past, I ran the EA UC Berkeley and EA at the University of Washington groups.

rohinmshah's Comments

My personal cruxes for working on AI safety

My interpretation was that the crux was

We can do good by thinking ahead

One thing this leaves implicit is the counterfactual: in particular, I thought the point of the "Problems solve themselves" section was that if problems would be solved by default, then you can't do good by thinking ahead. I wanted to make that clearer, which led to

we both **can** and **need to** think ahead in order to solve [the alignment problem].

Where "can" talks about feasibility, and "need to" talks about the counterfactual.

I can remove the "and **need to**" if you think this is wrong.

What are the challenges and problems with programming law-breaking constraints into AGI?
What if they also have access to nukes or other weapons that could prevent them or their owners from being held accountable if they're used?

I'm going to interpret this as:

  • Assume that the owners are misaligned w.r.t. the rest of humanity (controversial, to me at least).
  • Assume that enforcement is impossible.

Under these assumptions, I feel better about 1 than 2, in the sense that case 1 feels like a ~5% chance of success while case 2 feels like a ~0% chance of success. (Numbers made up of course.)

But this seems like a pretty low-probability way the world could be (I would bet against both assumptions), and the increase in EV from work on it seems pretty low (since you only get 5% chance of success), so it doesn't seem like a strong argument to focus on case 1.

What are the challenges and problems with programming law-breaking constraints into AGI?
Part of the reason that enforcement works, though, is that human agents have an independent incentive not to break the law (or, e.g., report legal violations) since they are legally accountable for their actions.

Certainly you still need legal accountability -- why wouldn't we have that? If we solve alignment, then we can just have the AI's owner be accountable for any law-breaking actions the AI takes.

This seems to require the same type of fundamental ML research that I am proposing: mapping AI actions onto laws.

Imagine trying to make teenagers law-abiding. You could have two strategies:

1. Rewire the neurons or learning algorithm in their brain such that you can say "the computation done to produce the output of neuron X reliably tracks whether a law has been violated, and because of its connection via neuron Y to neuron Z, if an action is predicted to violate a law, the teenager won't take it".

2. Explain to them what the laws are (relying on their existing ability to understand English, albeit fuzzily), and give them incentives to follow them.

I feel much better about 2 than 1.

When you say "programming AI to follow law" I imagine case 1 above (but for AI systems instead of humans). Certainly the OP seemed to be arguing for this case. This is the thing I think is extremely difficult.

I am much happier about AI systems learning about the law via case 2 above, which would enable the AI police applications I mentioned above.

However, some ML people I have talked about this with have given positive feedback, so I think you might be overestimating the difficulty.

I suspect they are thinking about case 2 above? Or they might be thinking of self-driving car type applications where you have an in-code representation of the world? Idk, I feel confident enough of this that I'd predict that there is a miscommunication somewhere, rather than an actual strong difference of opinion between me and them.

What are the challenges and problems with programming law-breaking constraints into AGI?
My intuition is that more formal systems will be easier for AI to understand earlier in the "evolution" of SOTA AI intelligence than less-formal systems.

I agree for fully formal systems (e.g. solving SAT problems), but don't agree for "more formal" systems like law.

Mostly I'm thinking that understanding law would require you to understand language, but once you've understood language you also understand "what humans want". You could imagine a world in which AI systems understand the literal meaning of language but don't grasp the figurative / pedagogic / Gricean aspects of language, and in that world I think AI systems will understand law earlier than normal English, but that doesn't seem to be the world we live in:

  • GPT-2 and other language models don't seem particularly literal.
  • We have way more training data about natural language as it is normally used (most of the Internet), relative to natural language meant to be interpreted mostly literally.
  • Humans find it easier / more "native" to interpret language in the figurative / pedagogic way than to interpret it in the literal way.

My point was that I think that making a law-following AI that can follow (A) all enumerated laws is not much harder than one that can be made to follow (B) any given law.

Makes sense, that seems true to me.

My personal cruxes for working on AI safety

Planned summary for the Alignment Newsletter:

This post describes how Buck's cause prioritization within an effective altruism framework leads him to work on AI risk. The case can be broken down into a conjunction of five cruxes. Specifically, the story for impact is that 1) AGI would be a big deal if it were created, 2) has a decent chance of being created soon, before any other "big deal" technology is created, and 3) poses an alignment problem that we both **can** and **need to** think ahead in order to solve. His research 4) would be put into practice if it solved the problem and 5) makes progress on solving the problem.

Planned opinion:

I enjoyed this post, and recommend reading it in full if you are interested in AI risk because of effective altruism. (I've kept the summary relatively short because not all of my readers care about effective altruism.) My personal cruxes and story of impact are actually fairly different: in particular, while this post sees the impact of research as coming from solving the technical alignment problem, I care about other sources of impact as well. See this comment for details.

My personal cruxes for working on AI safety

I enjoyed this post; it was good to see this all laid out in a single essay, rather than floating around as a bunch of separate ideas.

That said, my personal cruxes and story of impact are actually fairly different: in particular, while this post sees the impact of research as coming from solving the technical alignment problem, I care about other sources of impact as well, including:

1. Field building: Research done now can help train people who will be able to analyze problems and find solutions in the future, when we have more evidence about what powerful AI systems will look like.

2. Credibility building: It does you no good to know how to align AI systems if the people who build AI systems don't use your solutions. Research done now helps establish the AI safety field as the people to talk to in order to keep advanced AI systems safe.

3. Influencing AI strategy: This is a catch-all category meant to include the ways that technical research influences the probability that we deploy unsafe AI systems in the future. For example, if technical research provides more clarity on exactly which systems are risky and which ones are fine, it becomes less likely that people build the risky systems (nobody _wants_ an unsafe AI system), even though this research doesn't solve the alignment problem.

As a result, cruxes 3-5 in this post would not actually be cruxes for me (though 1 and 2 would be).

What are the best arguments that AGI is on the horizon?

Just wanted to note that while I am quoted as being optimistic, I am still working on it specifically to cover the x-risk case and not the value lock-in case. (But certainly some people are working on the value lock-in case.)

(Also I think several people would disagree that I am optimistic, and would instead think I'm too pessimistic, e.g. I get the sense that I would be on the pessimistic side at FHI.)

What are the challenges and problems with programming law-breaking constraints into AGI?

Cullen's argument was "alignment may not be enough, even if you solve alignment you might still want to program your AI to follow the law because <reasons>." So in my responses I've been assuming that we have solved alignment; I'm arguing that after solving alignment, AI-powered enforcement will probably be enough to handle the problems Cullen is talking about. Some quotes from Cullen's comment (emphasis mine):

Reasons other than directly getting value alignment from law that you might want to program AI to follow the law

We will presumably want organizations with AI to be bound by law.

We don't want to rely on the incentives of human principals to ensure their agents advance their goals in purely legal ways

Some responses to your comments:

if we want to automate "detect bad behavior", wouldn't that require AI alignment, too?

Yes, I'm assuming we've solved alignment here.

Isn't most of this after a crime has already been committed?

Good enforcement is also a deterrent against crime (someone without any qualms about murder will still usually not murder because of the harsh penalties and chance of being caught).

Furthermore, AIs may be able to learn new ways of hiding things from the police, so there could be gaps where the police are trying to catch up.

Remember that the police are also AI-enabled, and can find new ways of detecting things. Even so, this is possible, but it's also possible today without AI: criminals presumably constantly find new ways of hiding things from the police.

What are the challenges and problems with programming law-breaking constraints into AGI?
(Most) real laws have huge bodies of interpretative text surrounding them and examples of real-world applications of them to real-world facts.

Right, I was trying to factor this part out, because it seemed to me that the hope was "the law is explicit and therefore can be programmed in". But if you want to include all of the interpretative text and examples of real-world application, it starts looking more like "here is a crap ton of data about this law, please understand what this law means and then act in accordance with it", as opposed to directly hardcoding in the law.

Under this interpretation (which may not be what you meant) this becomes a claim that laws have a lot more data that pinpoints what exactly they mean, relative to something like "what humans want", and so an AI system will more easily pinpoint it. I'm somewhat sympathetic to this claim, though I think there is a lot of data about "what humans want" in everyday life that the AI can learn from. But my real reason for not caring too much about this is that in this story we rely on the AI's "intelligence" to "understand" laws, as opposed to "programming it in"; given that we're worried about superintelligent AI it should be "intelligent" enough to "understand" what humans want as well (given that humans seem to be able to do that).

Lawyers approximate generalists: they can take arbitrary written laws and give advice on how to conform behavior to those laws. So a lawyerlike AI might be able to learn general interpretative principles and research skills and be able to simulate legal adjudications of proposed actions.

I'm not sure what you're trying to imply with this -- does this make the AI's task easier? Harder? Does the generality somehow imply that the AI is safer?

Like, I don't get why this point has any bearing on whether it is better to train "lawyerlike AI" or "AI that tries to do what humans want". If anything, I think it pushes in the "do what humans want" direction, since historically it has been very difficult to create generalist AIs, and easier to create specialist AIs.

(Though I'm not sure I think "AI that tries to do what humans want" is less "general" than lawyerlike AI.)

What are the challenges and problems with programming law-breaking constraints into AGI?

I agree that getting a guarantee of following the law is (probably) better than trying to ensure it through enforcement, all else equal. I also agree that in principle programming the AI to follow the law could give such a guarantee. So in some normative sense, I agree that it would be better if it were programmed to follow the law.

My main argument here is that it is not worth the effort. This factors into two claims:

First, it would be hard to do. I am a programmer / ML researcher and I have no idea how to program an AI to follow the law in some guaranteed way. I also have an intuitive sense that it would be very difficult. I think the vast majority of programmers / ML researchers would agree with me on this.

Second, it doesn't provide much value, because you can get most of the benefits via enforcement, which has the virtue of being the solution we currently use.

It will also probably be able to hide its actions, obscure its motives, and/or evade detection better than humans could.

But AI-enabled police would be able to probe actions, infer motives, and detect bad behavior better than humans could. In addition, AI systems could have fewer rights than humans, and could be designed to be more transparent than humans, making the police's job easier.
