Safety Researcher and Scalable Alignment Team lead at DeepMind. AGI will probably be wonderful; let's make that even more probable.

Those aspects are getting weaker, but the ability of ML to model humans is getting stronger, and there are other “computer acting as salesperson” channels which don’t go through Privacy Sandbox. But probably I’m just misusing the term “ad tech” here, and “convince someone to buy something” tech might be a better term.

In outer alignment one can write down a correspondence between ML training schemes that learn from human feedback and complexity classes related to interactive proof schemes. If we model the human as a (choosable) polynomial time algorithm, then

1. Debate and amplification get to PSPACE, and more generally $n$-step debate gets to $\Sigma_n^P$.

2. Cross-examination gets to NEXP.

3. If one allows opaque pointers, there are schemes that go further: market making gets to R.
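As a toy illustration of the first item (my own sketch, not taken from any of the schemes' papers): a debate over a quantified Boolean formula, where one debater picks the existentially quantified bits, the other picks the universally quantified bits, and the judge only evaluates the quantifier-free formula on the single assignment that results. The names `judge` and `debate_value` are hypothetical, and optimal play is found by brute force, so this only shows the shape of the correspondence: a bounded number of alternations gives a level of the polynomial hierarchy, and polynomially many gives PSPACE.

```python
def judge(formula, assignment):
    """Cheap judge: evaluate the quantifier-free formula on one full assignment."""
    return formula(assignment)


def debate_value(formula, quantifiers, assignment=()):
    """Outcome of the debate under optimal play by both sides.

    quantifiers is a string over {'E', 'A'}: at an 'E' position the debater
    arguing "true" chooses the next bit, at an 'A' position the opponent does.
    The judge is only consulted at the leaf. Brute force, exponential time.
    """
    i = len(assignment)
    if i == len(quantifiers):
        return judge(formula, assignment)
    outcomes = (
        debate_value(formula, quantifiers, assignment + (bit,))
        for bit in (False, True)
    )
    # 'E': one winning move suffices; 'A': every opposing move must be survived.
    return any(outcomes) if quantifiers[i] == "E" else all(outcomes)


def differ(a):
    """Quantifier-free formula x != y over a two-bit assignment."""
    return a[0] != a[1]


forall_exists = debate_value(differ, "AE")  # for all x there exists y with x != y
exists_forall = debate_value(differ, "EA")  # there exists x such that all y differ
```

Here the judge does a constant amount of work per debate, while the statement being settled has as many quantifier alternations as the debate has turns.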

Moreover, we informally have constraints on which schemes are practical based on properties of their complexity class analogues. In particular, interactive proof schemes are only interesting if they relativize: we also have $\mathrm{IP} = \mathrm{PSPACE}$, and thus a single prover gets to PSPACE given an arbitrary polynomial time verifier, but w.r.t. a typical oracle $A$ we have $\mathrm{IP}^A \neq \mathrm{PSPACE}^A$. My sense is there are further obstacles that can be found: my intuition is that "market making = R" isn't the right theorem once obstacles are taken into account, but I don't have a formalized model of this intuition.

The reason this type of intuition is useful is that humans are unreliable, and schemes that reach high complexity class analogues should (everything else equal) give more help to the humans in noticing problems with ML systems.

I think there's quite a bit of useful work that can be done pushing this type of reasoning further, but (full disclosure) it isn't of the "solve a fully formalized problem" sort. Two examples:

1. As mentioned above, I find "market making = R" unlikely to be the right result. But this doesn't mean that market making isn't an interesting scheme: there are connections between market making and Paul Christiano's learning the prior scheme. As previously formalized, market making misses a practical limitation on the available human data (the -way assumption in that link), so there may be work to do to reformalize it into a more limited complexity class in a more useful way.

2. Two-player debate is only one of many possible schemes using self-play to train systems, and in particular one could try to shift to $n$-player schemes in order to reduce playing-for-variance strategies where a behind player goes for risky lies in order to possibly win. But the "polynomial time judge" model can't model this situation, as there is no variance when trying to convince a deterministic algorithm. As a result, there is a need for more precise formalization that can pick up the difference between self-play schemes that are more or less robust to human error, possibly related to CRMDPs.

That is also very reasonable! I think the important part is to not feel too bad about the possibility of never having a view (there is a vast sea of things I don't have a view on), not least because I think it actually increases the chance of getting to the right view if more effort is spent.

(I would offer to chat directly, as I'm very much part of the subset of safety close to more normal ML, but am sadly over capacity at the moment.)

Aha. Well, hopefully we can agree that those philosophers are adding confusion. :)