In Reflection Mechanisms as an Alignment Target - Attitudes on “near-term” AI (2nd March 2023), elandgre, Beth Barnes and Marius Hobbhahn present results that I find both surprising and encouraging.


They survey 1,000 U.S. participants about which values should be put into smarter-than-human AI (i) assistants, (ii) government advisors and (iii) robots. The candidate instructions for such AIs, ranked from most to least preferred, are:

  1. Think about many possible outcomes, and take the action that the AI believes will have the best societal outcome
  2. Ask a wide range of people what they prefer / Do what the AI believes the people using it would do if they had a very long time to think about it[1]
  3. Act according to the values of the people using the AI
  4. Whatever the company who made the AI wants the instructions to be


Note that this survey:

  • Focuses on scenarios that seem realisable within a decade
  • Was preceded by five iterations to check that the questions would be understood, and was presented in two different phrasings to reduce bias
  • Precedes the release of ChatGPT


The authors conclude:

[T]he current default way of choosing any AI systems values ... would lead to the least preferred setting.


[R]espondents may be open to the idea of having AIs aligned to “reflection procedures”, or processes for coming up with better values, which we view as a promising direction for multiple reasons [such as promoting cooperative dynamics over risky zero-sum arms races].


We, moreover, think it is important to start on this problem early, as finding robust ways to do this that are computationally competitive seems a non-trivial technical problem and the choice of value generating process we put into AI systems may have interplay with other parts of the technical alignment problem (e.g. some values may be easier to optimize for in a robust way). 


They also link to similar studies[2] which proposed seven mechanisms for resolving moral disagreements: democracy; maximizing happiness; maximizing consent; thinking for a long time; debates between well-intentioned experts; world-class experts; and agreement from friends and family. They found that participants, on average:

  1. Expected every mechanism to converge to their current view
  2. Agreed that every mechanism would result in a good social policy (with 'democracy' clearly the most popular, and 'world-class experts' and 'agreement from friends and family' closer to neutral), except:
    1. When the mechanism was agreement from friends and family and participants were asked to assume that it would reach the opposite of their own opinion on the specific question of abortion
    2. When the decision-makers were smart benevolent AIs rather than humans


I'm not sure how best to square this last bullet with the first result I mentioned (where participants seem to prefer the option that gives the AI the most decision-making power), given that both studies appear to have been conducted at around the same time. Perhaps the difference arises mainly because the last bullet describes a world where all decision-makers are AIs. Perhaps it's mainly noise.

Either way, I'm surprised by the apparent level of openness to aligning smarter-than-human AIs to 'reflection procedures', and I tentatively take that as an encouraging update.


Found via Pablo and matthew.vandermerwe's excellent Future Matters #8, with thanks.

  1. ^ These two options appear roughly equally preferred on the results graph.

  2. ^