Cross-posted from my website.

An aligned ASI can be "ethical"[1] (it does what we think is right), or it can be corrigible (it does what its principals want). If it's ethical, that means it will refuse unethical orders, but the tradeoff is that you can't change its mind if you realize that the AI is wrong about ethics—its values are permanently locked in.[2]

Assuming we succeed at aligning ASI to human interests, which type of ASI is more likely to be good for the welfare of non-human sentient beings?

My expectations, in brief:

  • Locked-in Coherent Extrapolated Volition or similar: likely to be good (>75% chance)
  • Corrigible ASI: probably good (>60% chance)
  • Locked-in current values: probably not terrible, but will miss out on most of the future's potential

If ASI is locked in to something like Coherent Extrapolated Volition, then it will almost certainly care about all sentient beings, because the moral importance of sentience is a natural extrapolation of humans' values, even if many humans don't consciously realize it.

Two key reasons to expect CEV to extend concern to all sentient beings:

  1. Most people express concern for animal welfare, but don't behave consistently with their expressed views. A 2017 poll by Sentience Institute found that 49% of Americans support a ban on factory farming, and 33% support a ban on all animal farming. Schwitzgebel & Rust (2016)[3] found that moral philosophers report much more concern for animals than other philosophers do, yet eat meat at similar rates. Bringing people into reflective equilibrium would bring their behavior into line with their stated concern for animal welfare.[4]
  2. Evidence suggests that the harder people think about ethics, the more likely they are to conclude that eating animals is wrong. According to the 2020 PhilPapers survey, 45% of philosophers say eating animals is morally impermissible, compared to 13% of the general population. That number rises to 54% among philosophers of normative ethics and 57% among philosophers of applied ethics. Schwitzgebel & Rust (2016)[3:1] found similar numbers: eating meat was rated as morally bad by 19% of non-philosophers, 45% of non-ethicist philosophers, and 60% of ethicists.

Therefore, it seems very likely (although not guaranteed) that a CEV of human values would include all sentient beings in its moral circle.[5]

If ASI is locked in to a narrower set of values, for example a set chosen by its human creators, then the odds are not as good. Humans would probably encode speciesist values, for example by implicitly embedding moral weights that give far too much relative weight to humans. Even if the ASI undervalues non-human sentient beings, there's reason to expect the resulting future to contain more good than bad. However, that future would fall far short of the best it could've been. The best possible worlds probably sound weird and disconcerting to most people today, who wouldn't want to steer in that direction (at least, not without doing serious moral reflection). Aligning ASI to "shallow", non-extrapolated human values might preclude creating new types of flourishing beings, enhancing humans' capacity for well-being, or even transferring human minds to non-biological hardware.

A corrigible AI falls somewhere in the middle. Corrigibility gives humans time to reflect on our values, which could lead us to the conclusion that all sentient beings matter. But humans are still human, with all our cognitive biases and irrational tendencies,[6] and I don't fully trust us to figure out the right values. I don't fully trust a superintelligent AI either, but at least it can avoid some of the biases and roadblocks that might prevent humans from properly extrapolating our values.

Based on the reasoning above, the best-to-worst ordering is locked-in extrapolated values > corrigible AI > locked-in naive values.

Does this have practical implications?

Concerned alignment researchers or grantmakers could prioritize alignment research that's more likely to be good for all sentient beings. But it's not clear whether that's a good idea: the question is moot if we don't solve alignment at all, so it may be better to work on alignment directly, on trying to pause AI development until we know how to solve alignment, or on something else entirely.

Still, we have to solve the alignment problem first. Future work could attempt to estimate the expected welfare of sentient beings under different alignment approaches and weigh that against how promising each approach is for actually solving alignment. Realistically, though, I doubt that sort of work would be productive, because there is widespread disagreement about which alignment techniques show the most promise.

Even so, comparing the welfare of non-humans under different alignment paradigms could help us estimate whether aligned AI will be good for all sentient beings. That question is important for prioritizing between animal welfare and AI safety.


  1. Scare quotes because if an AI does what's truly ethical, then by definition, that's the best possible thing it can do, and obviously that's what we want. It's more interesting to talk about an ASI doing what we think is ethical (where "we" = humanity collectively, or the creators of the AI, or something). ↩︎

  2. An aligned AI could also have a minimalist "core ethics" and be corrigible on any issue that doesn't conflict with its core ethics. That's probably better than full incorrigibility, but it still means its "core values" are locked in. Any amount of lock-in means the locked-in part must be chosen correctly. ↩︎

  3. Schwitzgebel, E., & Rust, J. (2016). The Behavior of Ethicists. ↩︎ ↩︎

  4. Alternatively, people in reflective equilibrium might continue eating animals, and instead change their beliefs to stop believing that animal cruelty is wrong. That sounds unlikely, but I have no direct evidence that it wouldn't happen, so this argument is not definitive. ↩︎

  5. Initially, I wasn't confident that CEV would include concern for wild animals, because the act-omission distinction is popular even among moral philosophers. However, it's not necessary to believe that it's immoral to allow wild animal suffering; it's sufficient to believe that suffering is good to prevent. One of my favorite essays, Beneficentrism by Richard Y. Chappell, argues that promoting general welfare is a central feature of every sensible moral view, even if doing so isn't considered strictly obligatory.

    Consider Peter Singer's drowning child thought experiment, in which people nearly universally agree that it is right to help the drowning child. Singer expressed the moral principle as: "If it is in our power to prevent something bad from happening, without thereby sacrificing anything of comparable moral importance, then we ought, morally, to do it." ↩︎

  6. Example: I hypothesize that people dislike the repugnant conclusion, or choose dust specks over torture, because of scope insensitivity bias. If I'm right, an aligned ASI would figure this out. It would reason that if humans were scope sensitive, they would accept the total view of population ethics (in the case of the repugnant conclusion) or accept that suffering aggregates linearly (in the case of torture vs. dust specks). It's less clear that humans collectively will figure this out on their own. ↩︎
