My impression (though I haven't researched this well) is that such prizes have historically inspired many people to try to solve the problem, and they bring a ton of publicity both to the problem itself and to why it is difficult.
I'm not sure whether the money would need to be held somewhere in the meantime, but if not, this seems like an extremely easy offer: if some person or group solves the problem, great, they get the money and it's money well spent; if not, the money gets spent on something else. If the money would need to be reserved and couldn't be spent in the meantime, this becomes a much more nuanced cost-benefit analysis, but I still think it might be worth considering.
Has this idea been discussed already? What are the counterarguments?
The main challenge seems to be formulating the goal in a sufficiently specific way. We don't currently have a benchmark that would serve as a clear indicator that the alignment problem has been solved; right now, any proposed solution ends up being debated by many people who often disagree about its merits.
The FTX Future Fund listed AI Alignment Prizes on its ideas page and would be interested in funding them. Given that, coming up with clear targets for AI safety research seems like it would be very impactful.
My colleagues have often been way too nice about reading group papers, rather than the opposite. (I’ll bet this varies a ton lab-to-lab.)