Miles Tidmarsh

That was the intervention class we had in mind, though there could be other pretraining interventions that don't fall cleanly into good/bad values (e.g. promoting risk aversion).

Community Polls on Alignment Controversies

Miles Tidmarsh 12d 1

By specific values we mean any particular goal we want AIs to pursue besides deference to humans. So democracy and equality would both count, as would goals like harm reduction or utilitarianism.

Community Polls on Alignment Controversies

Miles Tidmarsh 12d 1

Agreed. We used "will" because people have wildly different intuitions about what "could" means. So 100% agree would mean "definitely true" and 30% disagree would mean "probably not".

Community Polls on Alignment Controversies

Miles Tidmarsh 12d 1

Definitely agree that stability doesn't equate to safety, but it sounds like that's not necessary to your response.

Community Polls on Alignment Controversies

Miles Tidmarsh 13d 3

My perspective is that even though current meat production is quite efficient, fundamental physics suggests that growing a whole living being, with a brain and bones and all that, cannot be the most efficient possible way of producing meat (and immune systems are irrelevant if you have good enough isolation). I do agree that at our current tech level it seems like synthetic meat won't be competitive anytime soon. While vegan alternatives are delicious to many people, they're not exactly the same (though wanting to eat animals for psychological reasons is definitely part of it). I do agree that these issues are uncertain, though!

Community Polls on Alignment Controversies

Miles Tidmarsh 13d 1

In that case the intent is to vote 100% disagree (as you did here). That corresponds to the belief that anything falling short of full alignment will cause total loss of value.

Community Polls on Alignment Controversies

Miles Tidmarsh 13d 1

The intent was that, conditional on AI sharing most but not all human values, the AIs wouldn't change their own values later.

You could have a world where all humans die and the AIs later change their own values, and you could also have worlds where partially aligned AIs don't wipe out humanity but change their values to be better (e.g. internalizing the goal of being aligned) or worse (e.g. internalizing a paperclip-maximizing goal) by our measures.

In worlds where the first TAIs share most but not all human values, what do you think most likely happens?

Community Polls on Alignment Controversies

Miles Tidmarsh 13d 3

Thanks Dawn, taking these in turn:

1: "Robust alignment" is a deliberately vague term; it's meant to incorporate your views about how hard alignment is (e.g. UDT vs. well intentioned).

4: It's a hard question. Our perspective is that the backfire -> cluelessness -> don't act chain can be thought of as low tractability.

5: By "stable under reflection" we meant the AI reflecting on its own values (while interacting with the world), where agreement means they wouldn't change their values much (as a stylized example: an AI that shares 70% of our values in 2030 has those same values in 3030). But you're right that how AIs interact (beyond competition, handled in the last question) is important.

7: S-risks do break the scale and we couldn't find a good simple way to deal with that (though we'll do other polls more directly on that later). The intent of "will" was to match 100% expected probability to 100% agree on the scale.

Community Polls on Alignment Controversies

Miles Tidmarsh 14d 1

If robust alignment is orthogonal to pretraining then shouldn't that mean a strong disagreement with the statement (that alignment requires pretraining)?

Community Polls on Alignment Controversies

Miles Tidmarsh 14d 2

Some people believe that if we get partial alignment (i.e. the AI cares about what we want, but also cares about other things) then we can get decent outcomes for the future (analogous to humans being partially aligned to each other). But others think that if we don't get alignment perfect, ASIs will have an incentive to take over, and then will either undergo value drift towards something orthogonal to humans or will deliberately reformat their own values. "Stable under reflection" is the opinion that this wouldn't happen: that ASIs that care somewhat about humans would continue to care somewhat about humans in the long term.

Miles Tidmarsh

Bio

Posts 4

Comments 21