Alignment 201 curriculum

richard_ngo

1 min read · Oct 12, 2022

Comments 9

Geoffrey Miller

Richard -- thanks for posting this. It looks like a very useful curriculum.

Naive question as an alignment newbie:

If the point of 'AI alignment' is 'alignment with human values', why does the alignment field pay so little attention to the many decades of scientific research on the origins, nature, and diversity of human values, and focus almost entirely on the last few decades of research on machine learning?

It feels like many alignment courses are focusing only on the AI side of the equation, and acting as if the human side of alignment is trivial, obvious, and/or under-researched.

Genuine question; it's something that's been puzzling me for several months.

richard_ngo

Most people in the field expect that the hardest part of the problem is "robustly align AI with any goal". I expect AGIs to have a very sophisticated understanding of human values, along with many other concepts. The question is how we can precisely select which concepts they'll be motivated by.

Geoffrey Miller

Richard - thanks for your reply.

What I'm struggling with is how we'd plausibly get from (1) 'align with any human goal' to (2) 'align with all relevant goals across all humans in such a way that we actually minimize global catastrophic risks'.

In my view, getting to (1) only gets us about 2% of the way towards (2), and doesn't come anywhere close to 'solving alignment' in a way that would allow for safe AGI.

Also, I don't see how AGIs could develop a provably, interpretably, 'very sophisticated understanding of human values' if alignment researchers don't have a sophisticated understanding of human values that they could test against the AGI's understanding.

At least, it seems like we'd need a strong 'training set' of human values that includes plausibly complete coverage of the 'deployment set' of human values the AGI would actually encounter in the real world -- and I don't see how we'd get a decent training set of values without quite a thorough understanding of the nature and diversity of human values.

I'm raising these issues not to be contrarian or ornery, just out of genuine puzzlement about the long-term game plan in research on alignment with human values, and about why alignment researchers often seem uninterested in behavioral sciences research on human values.

richard_ngo

"I don't see how AGIs could develop a provably, interpretably, 'very sophisticated understanding of human values' if alignment researchers don't have a sophisticated understanding of human values that they could test against the AGI's understanding."

I don't think anyone is aiming for provable alignment properties (except maybe for Stuart Russell); this just seems too hard.

But if AGIs could develop a very sophisticated understanding of other domains that humans don't understand very well, by virtue of being more intelligent than humans, I don't see why they wouldn't be able to understand this domain very well too.

"At least, it seems like we'd need a strong 'training set' of human values"

This is how classic ML would do it. But in the modern paradigm, ML systems can infer all sorts of information from being trained on a very wide range of data (e.g. all the books, all the internet, etc), and so we should expect that they can infer human values from that too. There's some preliminary evidence that language models can perform well on common-sense moral reasoning, and alignment researchers generally expect that future language models will be capable of answering questions about ethics to a superhuman level "by default".

More generally, it sounds like you're gesturing towards the difference between "narrow alignment" and "ambitious alignment", as discussed in this blog post. Broadly speaking, the goal of the former is to have AI that can be controlled; the goal of the latter is to have AI that could be trusted to steer the world. One reason that most researchers focus on the former is that if we could narrowly align AI, we could then use it to help us with the more complex task of ambitious alignment. And the properties required for an AI to be narrowly aligned (like "helpful", "honest", etc.) are sufficiently common-sense that I don't think we gain much from a very in-depth study of them.

Geoffrey Miller

Richard - thanks very much for your quick and helpful reply. I'll have a look at the links you included, and ruminate about this further...

Guy Raveh

I feel like we shouldn't expect to be able to express the Values Of Humanity to an AGI in order for it to be safe - in the same way that humans are currently mostly safe towards the rest of humanity despite not being able to articulate those Values Of Humanity themselves. There's something stopping one person (even a very rich or powerful one) from killing everyone else, and it's not explicit knowledge.

aog

You might be well aware of this, but there is a great line of research on machine ethics that tries to build AI with a sophisticated understanding of human values. The ETHICS benchmark for example measures language model understanding of various moral theories: https://arxiv.org/abs/2008.02275

Quadratic Reciprocity

"Try to develop an algorithm which solves the problems outlined in the heuristic arguments report."

In the Eliciting Latent Knowledge readings, this is mentioned. What report is it referring to - there doesn't seem to be a link?

CalebW

Any updates around the likelihood/timing of a discussion course? :)
