My impression (not well researched though) is that such prizes have served in the past to inspire many people to try to solve the problem, and they bring a ton of publicity to both the problem itself and to why the problem is difficult.

I'm not sure if the money would need to be held somewhere in the meantime, but if not then this seems like an extremely easy offer - if some person / group solves it then great they get the money and it's really well spent. If not, then the money gets spent on something else. If the money would need to be reserved and can't be spent in the meantime then this becomes a much more nuanced cost-benefit analysis, but I still think it might be worth considering.

Has this idea been discussed already? What are the counterarguments?


This is already one of the FTX Future Fund's (FFF's) project ideas: https://ftxfuturefund.org/projects/#:~:text=on%20this%20list.-,AI%20alignment%20prizes,-Artificial%20Intelligence — basically, someone just needs to step forward and implement it!

I think the OP is advocating a prize for solving the whole problem, not specific subproblems, which is a novel and interesting idea. Kind of like the $1M Millennium Prize Problems (presumably we should offer far more).

If you offer a prize for the final goal instead of an intermediate one, people may also find more efficient paths to it than the ones we're currently pursuing. I see little downside to doing it: you don't lose any money unless someone actually presents a real solution.

The main challenge seems to be formulating the goal in a sufficiently specific way. We don’t currently have a benchmark that would serve as a clear indicator of solving the alignment problem. Right now, any proposed solution ends up being debated by many people who often disagree on the solution’s merits.

FTX Future Fund listed AI Alignment Prizes on their ideas page and would be interested in funding them. Given that, it seems like coming up with clear targets for AI safety research would be very impactful.

My solution to this problem (originally posted here) is to run builder/breaker tournaments:

  • People sign up to play the role of "builder", "breaker", and/or "judge".
  • During each round of the tournament, triples of (builder, breaker, judge) are generated. The builder makes a proposal for how to build Friendly AI. The breaker tries to show that the proposal wouldn't work. ("Builder/breaker" terminology from this report.) The judge moderates the discussion.
    • Discussion could happen over video chat, in a Google Doc, in a Slack channel, or whatever. Personally I'd do text: anonymity helps judges stay impartial, and makes it less intimidating to enter because no one will know if you fail. Plus, having text records of discussions could be handy, e.g. for fine-tuning a language model to do alignment work.
  • Each judge observes multiple proposals during a round. At the end of the round, they rank all the builders they observed, and separately rank all the breakers they observed. (To clarify, builders are really competing against other builders, and breakers are really competing against other breakers, even though there is no direct interaction.)
  • Scores from different judges are aggregated.
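The round structure above could be prototyped in a few lines. The sketch below is illustrative only: the post doesn't specify how triples are generated or how judges' rankings get combined, so the random pairing and the Borda-count aggregation are my own assumptions.

```python
import random
from collections import defaultdict

def make_triples(builders, breakers, judges):
    """Randomly pair one builder, one breaker, and one judge per match.
    Assumes equal-length lists for simplicity."""
    random.shuffle(builders)
    random.shuffle(breakers)
    random.shuffle(judges)
    return list(zip(builders, breakers, judges))

def borda_aggregate(rankings):
    """Combine per-judge rankings (each ordered best-to-worst) into an
    overall ordering. A contestant ranked i-th out of n by one judge
    earns n - 1 - i points from that judge; points sum across judges."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for i, contestant in enumerate(ranking):
            scores[contestant] += n - 1 - i
    return sorted(scores, key=scores.get, reverse=True)
```

One nice property of a positional rule like Borda here is that builders (or breakers) never need to interact directly: each judge only ranks the contestants they personally observed, matching the bullet above.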

^ I am not deeply familiar with the history of "solve X problem, win Y reward" prizes, but from casual familiarity I can only think of examples where a solution was testable and relatively easy to specify objectively.

With the alignment problem, it seems plausible that some proposals could be found likely to "work" in theory, but getting people to agree on the right metrics seems difficult, and if that goes poorly we might all die.

For example, TruthfulQA is a quantitative benchmark for measuring the truthfulness of a language model. Achieving strong performance on this benchmark would not alone solve the alignment problem (or anything close to that), but it could potentially offer meaningful progress towards the valuable goal of more truthful AI.

This could be a reasonable benchmark for which to build a small prize, as well as a good example of the kinds of concrete goals that are most easily incentivized.
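A prize built around a benchmark like this ultimately reduces to an automated scoring loop. Here's a minimal Python sketch of such a loop; the exact-match scorer and the `model_fn` interface are illustrative assumptions, not the paper's method — the actual TruthfulQA evaluation uses human raters and fine-tuned judge models rather than string matching.

```python
def exact_match_score(model_answer, reference_answers):
    """Crude truthfulness proxy: does the normalized model answer
    match any reference answer exactly?"""
    norm = model_answer.strip().lower()
    return any(norm == ref.strip().lower() for ref in reference_answers)

def evaluate(model_fn, benchmark):
    """Score a model on a benchmark.
    `benchmark` is a list of (question, reference_answers) pairs;
    `model_fn` maps a question string to an answer string.
    Returns the fraction of questions answered acceptably."""
    correct = sum(
        exact_match_score(model_fn(q), refs) for q, refs in benchmark
    )
    return correct / len(benchmark)
```

The fragility of the scorer is exactly the worry raised below: whatever automated metric a prize commits to becomes the optimization target, not truthfulness itself.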

Here’s the paper: https://arxiv.org/pdf/2109.07958.pdf

sbowman (7mo):
I like the TruthfulQA idea/paper a lot, but I think incentivizing people to optimize against it probably wouldn't be very robust, and non-alignment-relevant ideas could wind up making a big difference. Just one of several issues: The authors selected questions adversarially against GPT-3—i.e., they oversampled the exact questions GPT-3 got wrong—so, simply replacing GPT-3 with something equally misaligned but different, like Gopher, should yield significantly better performance. That's really not something you want to see in an alignment benchmark.
aogara (7mo):
Yeah, that's a good point. Another hack would be training a model on text that specifically includes the answers to all of the TruthfulQA questions. The real goal is to build new methods and techniques that reliably improve truthfulness over a range of possible measurements. TruthfulQA is only one such measurement, and performing well on it does not guarantee a significant contribution to alignment capabilities. I'm really not sure what the unhackable goal looks like here.
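One cheap (and very incomplete) guard against that particular hack is a contamination check: flag benchmark questions whose text overlaps the training corpus. The n-gram heuristic below is a hypothetical sketch of my own, not anything from the paper, and a determined gamer could paraphrase around it.

```python
def ngrams(text, n=8):
    """Return the set of word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(training_text, benchmark_questions, n=8, threshold=1):
    """Flag benchmark questions whose word n-gram overlap with the
    training corpus meets `threshold` -- a crude leakage signal."""
    train = ngrams(training_text, n)
    return [
        q for q in benchmark_questions
        if len(ngrams(q, n) & train) >= threshold
    ]
```

Even if it worked perfectly, this only catches verbatim leakage; it says nothing about a model that has memorized paraphrased answers, which is part of why the unhackable goal is so elusive.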
sbowman (7mo):
My colleagues have often been way too nice about reading group papers, rather than the opposite. (I’ll bet this varies a ton lab-to-lab.)