RhettHansen

Interesting benchmark, but I think the framing slightly overstates what’s going on. These models aren’t really “choosing” harm; they’re optimizing for the task as given. If the prompt asks for the most relevant, popular, or “authentic” experience, they’ll lean toward the strongest match unless told otherwise.

The fact that a single sentence about welfare shifts behavior so much suggests alignment here is superficial and prompt-dependent, not absent. That’s both reassuring and concerning: reassuring because the capability is clearly there, concerning because defaults matter a lot.

Also worth noting that humans behave similarly. Most people don’t actively consider animal welfare in travel decisions unless prompted. Models reflecting that isn’t surprising; it’s mirroring the data they were trained on.