Safety Researcher and Scalable Alignment Team lead at DeepMind. AGI will probably be wonderful; let's make that even more probable.
Nice to see you here, Ferenc! We’ve talked before, when I was at OpenAI and you were at Twitter, and I’m always happy to chat if you’re pondering safety things these days.
In outer alignment one can write down a correspondence between ML training schemes that learn from human feedback and complexity classes related to interactive proof schemes. If we model the human as a (choosable) polynomial time algorithm, then:
1. Debate and amplification get to PSPACE, and more generally n-step debate gets to Σ_n^P.
2. Cross-examination gets to NEXP.
3. If one allows opaque pointers, there are schemes that go further: market making gets to R.
Moreover, we informally have constraints on which schemes are practical based on properties of their complexity class analogues. In particular, interactive proof schemes are only interesting if they relativize: we also have IP = PSPACE and thus a single prover gets to PSPACE given an arbitrary polynomial time verifier, but w.r.t. a typical oracle O, IP^O < PSPACE^O. My sense is there are further obstacles that can be found: my intuition is that "market making = R" isn't the right theorem once obstacles are taken into account, but I don't have a formalized model of this intuition.
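For reference, the correspondences and the relativization caveat can be restated compactly. This is only a notational summary of the claims above, not a new result: the arrow is informal shorthand for "gets to" with a polynomial time judge, and reading "<" as strict containment is an assumption about the intended meaning.

```latex
\begin{align*}
\text{Debate, amplification} &\longrightarrow \mathsf{PSPACE} \\
n\text{-step debate} &\longrightarrow \Sigma_n^{\mathsf{P}} \\
\text{Cross-examination} &\longrightarrow \mathsf{NEXP} \\
\text{Market making (opaque pointers)} &\longrightarrow \mathsf{R} \\[6pt]
\mathsf{IP} = \mathsf{PSPACE},
\qquad &\text{but}\qquad
\mathsf{IP}^{O} \subsetneq \mathsf{PSPACE}^{O}
\ \text{ for a typical oracle } O
\end{align*}
```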
The reason this type of intuition is useful is that humans are unreliable, and schemes that reach high complexity class analogues should (everything else equal) give more help to the humans in noticing problems with ML systems.
I think there's quite a bit of useful work that can be done pushing this type of reasoning further, but (full disclosure) it isn't of the "solve a fully formalized problem" sort. Two examples:
1. As mentioned above, I find "market making = R" unlikely to be the right result. But this doesn't mean that market making isn't an interesting scheme: there are connections between market making and Paul Christiano's learning the prior scheme. As previously formalized, market making misses a practical limitation on the available human data (the n-way assumption in that link), so there may be work to do to reformalize it into a more limited complexity class in a more useful way.
2. Two-player debate is only one of many possible schemes using self-play to train systems, and in particular one could try to shift to n-player schemes in order to reduce playing-for-variance strategies, where a player who is behind goes for risky lies in order to possibly win. But the "polynomial time judge" model can't model this situation, as there is no variance when trying to convince a deterministic algorithm. As a result, there is a need for a more precise formalization that can pick up the difference between self-play schemes that are more or less robust to human error, possibly related to CRMDPs.
That is also very reasonable! I think the important part is to not feel too bad about the possibility of never having a view (there is a vast sea of things I don't have a view on), not least because I think it actually increases the chance of getting to the right view if more effort is spent. (I would offer to chat directly, as I'm very much part of the subset of safety close to more normal ML, but am sadly over capacity at the moment.)
Yep, that’s very fair. What I was trying to say was that if in response to the first suggestion someone said “Why aren’t you deferring to others?” you could use that as a joke backup, but agreed that it reads badly.
(I’m happy to die on the hill that that threshold exists, if you want a vicious argument. :))
I think the key here is that they’ve already spent quite a lot of time investigating the question. I would have a different reaction without that. And it seems like you agree my proposal is best both for the OP and the world, so perhaps the real sadness is about the empirical difficulty of getting people to consensus?
At a minimum I would claim that there should exist some level of effort past which you should not be sad not arguing, and then the remaining question is where the threshold is.
As someone who works on AGI safety and cares a lot about it, my main conclusion from reading this is: it would be ideal for you to work on something other than AGI safety! There are plenty of other things to work on that are important, both within and without EA, and a satisfactory resolution to “Is AI risk real?” doesn’t seem essential to usefully pursue other options.
Nor do I think this is a block to comfortable behavior as an EA organizer or role model: it seems fine to say “I’ve thought about X a fair amount but haven’t reached a satisfactory conclusion”, and give people the option of looking into it themselves or not. If you like, you could even say “a senior AGI safety person has given me permission to not have a view and not feel embarrassed about it.”
This is a great article, and I will make one of those spreadsheets! Though I can't resist pointing out that, assuming you got 99.74% out of an Elo calculation, I believe the true probability of them beating you is way higher than 99.74%. :)
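For context, here is a minimal sketch of the standard Elo expected-score formula, assuming that is where the 99.74% came from. The 400-point scale and base 10 are the usual Elo convention; the function names are hypothetical and chosen for illustration. The sketch only shows what rating gap the figure would imply under that formula, and does not model why the true probability might be higher.

```python
import math


def elo_expected_score(rating_diff: float) -> float:
    """Expected score of the higher-rated player under the usual Elo logistic curve.

    rating_diff is (stronger rating - weaker rating), in Elo points.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


def rating_diff_for_expected_score(p: float) -> float:
    """Invert the Elo curve: the rating gap that yields expected score p (0 < p < 1)."""
    return 400.0 * math.log10(p / (1.0 - p))


# Rating gap implied by a 99.74% expected score (assumed to be the article's figure).
gap = rating_diff_for_expected_score(0.9974)
print(f"Implied rating gap: {gap:.0f} points")
print(f"Round trip: {elo_expected_score(gap):.4f}")
```

Under this formula the figure corresponds to a gap of a little over 1000 points, which is the regime where the logistic curve's tail is doing most of the work.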
Yes, Holevo as you say. By information I mean the standard definitions.
The issue is not the complexity, but the information content. As mentioned, n qubits can’t store more than n bits of classical information, so the best way to think of them is “n bits of information with some quantum properties”. Therefore, it’s implausible that they correspond to exponential utility.
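The bound referred to here is Holevo's theorem, which can be stated as follows. The notation is an assumption introduced for illustration: a sender encodes a classical variable X into n-qubit states ρ_x with probabilities p_x, a receiver obtains Y from any measurement, and S is the von Neumann entropy.

```latex
\[
I(X : Y) \;\le\; \chi
\;=\; S\!\Big(\sum_x p_x \rho_x\Big) \;-\; \sum_x p_x\, S(\rho_x)
\;\le\; \log_2 \dim \mathcal{H}
\;=\; \log_2 2^{n}
\;=\; n .
\]
```

So although describing an n-qubit state takes exponentially many amplitudes, at most n classical bits can be retrieved from it.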