*Up to $500 for alignment contest ideas*

Olivia Jimenez and I are composing questions for an AI alignment talent search contest. We want to use (or come up with) a frame of the alignment problem that is accessible to smart high schoolers/college students and people without ML backgrounds.

$20 for links to existing framings of the alignment problem (or subproblems) that we find helpful.

$500 for coming up with a new framing that meets our criteria or that we use (see below for details; also feel free to send us a FB message if you want to work on this and have questions).

We’ll also consider up to $500 for anything else we find helpful.

Feel free to submit via comments or share Google Docs with oliviajimenez01@gmail.com and akashwasil133@gmail.com. Awards are at our discretion.

-- More context --

We like Eliezer’s strawberry problem: How can you get an AI to place two identical (down to the cellular but not molecular level) strawberries on a plate, and then do nothing else?

Nate Soares noted that the strawberry problem has the quality of capturing two core alignment challenges: (1) Directing a capable AGI towards an objective of your choosing and (2) Ensuring that the AGI is low-impact, conservative, shutdownable, and otherwise corrigible.

We also imagine that if we ask someone this question and they *notice* that these challenges are what make the problem difficult, and maybe come at the problem from an interesting angle as a result, that’s a really good signal about their thinking.

However, we worry that if we ask exactly this question in a contest, people will get lost thinking about AI capabilities, molecular biology, etc. We also don’t like that there aren’t many impressive answers besides full answers to the alignment problem. So, we want to come up with a similar question/frame that is more contest-friendly.

Ideal criteria for the question/frame (though we can imagine great questions not meeting all of these):

  • It can be explained in a few sentences or pictures.
  • It implicitly gets at one or more core challenges of the alignment problem.
  • It is comprehensible to smart high schoolers/college students and not easily misunderstood. (Ideally the question can be visualized.)
  • People don’t need an ML background to understand or answer the question.
  • There are good answers besides solving the entire alignment problem.
  • Answers might reveal people’s abilities to notice the hard parts of the alignment problem, avoid assuming these hard parts away, reason clearly, rule out bad/incomplete solutions, think independently, and think creatively.
  • People could write a response in under a few hours or several hundred words.

More examples we like:

  • ARC’s Eliciting Latent Knowledge Problem, because it has clear visuals, is approachable to people without ML backgrounds, doesn’t bog people down in thinking about capabilities, and encourages people to demonstrate their thought process (with builder/breaker moves). Limitations: It’s long, it usually takes a long time to develop proposals, and it focuses on how ARC approaches alignment.
  • The Sorcerer’s Apprentice Problem from Disney’s Fantasia, because it has clear visuals, is accessible to quite young people and can be understood quickly, and might get people out of the headspace of ML solutions. Limitations: The connection to alignment is not obvious without a lot of context, and the magical/animated context might give people an impression of childishness.

-- 1 comment --

Here's my proposal for a contest description. Contest problems #1 and 2 are inspired by Richard Ngo's Alignment research exercises.

AI alignment is the problem of ensuring that advanced AI systems take actions which are aligned with human values. As AI systems become more capable and approach or exceed human-level intelligence, it becomes harder to ensure that they remain within human control instead of posing unacceptable risks.

One solution to AI alignment proposed by Stuart Russell, a leading AI researcher, is the assistance game, also called a cooperative inverse reinforcement learning (CIRL) game, following these principles:

  1. "The machine’s only objective is to maximize the realization of human preferences.
  2. The machine is initially uncertain about what those preferences are.
  3. The ultimate source of information about human preferences is human behavior."

For a more formal specification of this proposal, please see Stuart Russell's new book on why we need to replace the standard model of AI, Cooperatively Learning Human Values, and Cooperative Inverse Reinforcement Learning.
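To make the three principles concrete, here is a minimal toy sketch in Python. It is not the formal CIRL specification: the two-hypothesis preference space, the softmax ("noisily rational") human model, and all names such as `HYPOTHESES` and `rationality` are assumptions made purely for illustration.

```python
import math

# Hypothetical preference profiles: the reward the human gets from each drink.
HYPOTHESES = {
    "likes_coffee": {"coffee": 1.0, "tea": 0.0},
    "likes_tea": {"coffee": 0.0, "tea": 1.0},
}


def choice_likelihood(choice, rewards, rationality=2.0):
    """P(human picks `choice`) under an assumed softmax (noisily rational) model."""
    weights = {option: math.exp(rationality * r) for option, r in rewards.items()}
    return weights[choice] / sum(weights.values())


def update_belief(belief, observed_choice):
    """Bayes' rule: reweight each hypothesis by how well it explains the choice."""
    unnormalized = {
        name: prob * choice_likelihood(observed_choice, HYPOTHESES[name])
        for name, prob in belief.items()
    }
    total = sum(unnormalized.values())
    return {name: value / total for name, value in unnormalized.items()}


def best_action(belief):
    """Pick the option with the highest expected human reward under the belief."""
    def expected_reward(option):
        return sum(prob * HYPOTHESES[name][option] for name, prob in belief.items())
    return max(["coffee", "tea"], key=expected_reward)


# Principle 2: the machine starts out uncertain about the human's preferences.
belief = {"likes_coffee": 0.5, "likes_tea": 0.5}

# Principle 3: observed human behavior is the evidence about those preferences.
for choice in ["tea", "tea", "coffee", "tea"]:
    belief = update_belief(belief, choice)

# Principle 1: the machine acts to maximize (expected) human preference satisfaction.
action = best_action(belief)
print(belief, action)
```

In this toy setting one stray "coffee" choice does not flip the machine's belief, because the assumed human model allows for occasional mistakes. How well such a model matches real human behavior is exactly the kind of assumption a submission might probe.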

Contest problem #1: Why are assistance games not an adequate solution to AI alignment?

  • The first link describes a few critiques; you're free to restate them in your own words and elaborate on them. However, we'd be most excited to see a detailed, original exposition of one or a few issues, which engages with the technical specification of an assistance game.

Another proposed solution to AI alignment is iterated distillation and amplification (IDA), proposed by Paul Christiano. Paul runs the Alignment Research Center and previously ran the language model alignment team at OpenAI. In IDA, a human H wants to train an AI agent X by repeating two steps: amplification and distillation. In the amplification step, the human uses multiple copies of X to help solve a problem. In the distillation step, the agent X learns to reproduce the same output as the amplified system of the human + multiple copies of X. Then we go through another amplification step, then another distillation step, and so on.

You can learn more about this at Iterated Distillation and Amplification and see a simplified application of IDA in action at Summarizing Books with Human Feedback.
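The amplify/distill loop described above can be sketched with a deliberately tiny toy task: summing a tuple of integers. Everything here is an illustrative assumption rather than Paul's actual setup: the human H can only split a problem in half and add two numbers, the agent X is a lookup table, and "distillation" is exact memorization standing in for supervised learning.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Problem = Tuple[int, ...]
Agent = Callable[[Problem], Optional[int]]


def make_agent(table: Dict[Problem, int]) -> Agent:
    """Toy agent X: answers one-number problems and anything it has memorized."""
    def agent(problem: Problem) -> Optional[int]:
        if len(problem) == 1:
            return problem[0]
        return table.get(problem)  # None means "I don't know"
    return agent


def amplify(agent: Agent) -> Agent:
    """Amplification: H splits the problem, asks two copies of X, adds the answers."""
    def amplified(problem: Problem) -> Optional[int]:
        if len(problem) == 1:
            return problem[0]
        mid = len(problem) // 2
        left, right = agent(problem[:mid]), agent(problem[mid:])
        if left is None or right is None:
            return None  # the copies of X could not help on this problem yet
        return left + right
    return amplified


def distill(amplified: Agent, training_problems: Iterable[Problem]) -> Agent:
    """Distillation: train a new X to reproduce the amplified system's outputs."""
    table: Dict[Problem, int] = {}
    for problem in training_problems:
        answer = amplified(problem)
        if answer is not None:
            table[problem] = answer
    return make_agent(table)


def run_ida(training_problems: Iterable[Problem], rounds: int) -> Agent:
    """Alternate amplification and distillation for a fixed number of rounds."""
    problems = list(training_problems)
    agent = make_agent({})
    for _ in range(rounds):
        agent = distill(amplify(agent), problems)
    return agent


problems = [(1, 2), (3, 4), (1, 2, 3, 4)]
after_one_round = run_ida(problems, rounds=1)
after_two_rounds = run_ida(problems, rounds=2)
print(after_one_round((1, 2, 3, 4)), after_two_rounds((1, 2, 3, 4)))
```

After one round the distilled agent can only handle the two-number problems; after the second round it can also answer the four-number problem, because the amplified system built on the first round's agent. Note that this sketch assumes distillation is perfect and that H's decomposition is always correct, neither of which is guaranteed in a real system.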

Contest problem #2: Why might an AI system trained through IDA be misaligned with human values? What assumptions would be needed to prevent that?

Contest problem #3: Why is AI alignment an important problem? What are some research directions and key open problems? How can you or other students contribute to solving it through your career?

You're free to submit to one or more of these contest problems. You can write as much or as little as you feel is necessary to express your ideas concisely; as a rough guideline, feel free to write between 300 and 2000 words. For the first two contest problems, we'll be evaluating submissions based on the level of technical insight and research aptitude that you demonstrate, not necessarily on the quality of writing.

I like how contest problems #1 and 2:

  • provide concrete proposals for solutions to AI alignment, so it's not an impossibly abstract problem
  • ask participants to engage with prior research and think about issues, which seems to be an important aspect of doing research
  • are approachable

Contest problem #3 here isn't a technical problem, but I think it can be helpful so that participants actually end up caring about AI alignment rather than just engaging with it on a one-time basis as part of this contest. I think it would be exciting if participants learned on their own about why AI alignment matters, formed a plan for how they could work on it as part of their career, and ended up motivated to continue thinking about AI alignment or to support AI safety field-building efforts in India.