How is the super-alignment team going to interface with the rest of the AI alignment community, and specifically what kind of work from others would be helpful to them (e.g., evaluations they would want to exist in 2 years, specific problems in interpretability that seem important to solve early, curricula for AIs to learn about the alignment problem while avoiding content we may not want them reading)?
To provide more context on the thinking that leads to this question: I'm pretty worried that OpenAI is making themselves a single point of failure in existential security. Their plan seems to be a less-disingenuous version of "we are going to build superintelligence in the next 10 years, and we're optimistic that our alignment team will solve catastrophic safety problems, but if they can't then humanity is screwed anyway, because as mentioned, we're going to build the god machine. We might try to pause if we can't solve alignment, but we don't expect that to help much." Insofar as a unilateralist is taking existentially risky actions like this and they can't be stopped, other folks might want to support their work to increase the chance of the super-alignment team's success. Insofar as I want to support their work, I currently don't know what they need.
Another framing behind this question is just: "Many people in the AI alignment community are also interested in solving this problem. How can they indirectly collaborate with you?" (Some people will want to collaborate directly, but that option is limited by corporate closedness.)
We have three questions about the plan of using AI systems to align more capable AI systems.
What form should your research output take if things go right? Specifically, what type of output would you want your automated alignment researcher to produce in the search for a solution to the alignment problem? Is the plan to generate formal proofs, sets of heuristics, algorithms with explanations in natural language, or something else?
How would you verify that your automated alignment researcher is sufficiently aligned? What counts as evidence and what doesn't? Related to the question above, how can one evaluate the output of this automated alignment researcher? This could range from a proof with formal guarantees to a natural language description of a technique together with a convincing explanation. As an example, the Underhanded C Contest is a setup in which malicious outputs can be produced without being detected, or, if they are detected, can plausibly be passed off as honest mistakes.
Are you planning to find a way of aligning arbitrarily powerful superintelligences, or are you planning to align AI systems that are slightly more powerful than the automated alignment researcher? In the second case, what degree of alignment do you think is sufficient? Would you expect alignment that is not very close to 100% to become a problem when this approach is iterated, similar to how small errors compound into instability in numerical analysis?
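To make the numerical-analysis analogy concrete, here is a toy sketch of the worry. It assumes, purely for illustration, that "degree of alignment" can be summarized as a single fraction and that each generation passes on a fixed fraction of it to the next, so that losses compound multiplicatively. Neither assumption comes from the super-alignment plan itself, and the function name `compounded_fidelity` is hypothetical.

```python
def compounded_fidelity(per_step: float, generations: int) -> float:
    """Fraction of alignment left after repeated handoffs.

    Toy model (an assumption, not anyone's stated plan): every generation
    preserves the same fraction `per_step` of its predecessor's alignment,
    and the losses multiply across generations.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return per_step ** generations


if __name__ == "__main__":
    # Compare a few per-step fidelities over increasingly long chains.
    for per_step in (0.999, 0.99, 0.95):
        for generations in (10, 50):
            remaining = compounded_fidelity(per_step, generations)
            print(f"per-step {per_step}, {generations} generations: {remaining:.3f}")
```

Under this toy model, 99% fidelity per step leaves roughly 0.90 after 10 generations and roughly 0.61 after 50, which is why the question asks how close to 100% each step needs to be. If errors can be detected and corrected between generations, the multiplicative picture would not apply.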