The post is ~900 words, so I would recommend reading it in full, but the key takeaways are:
- OpenAI are starting a new “superintelligence alignment” team, led by Ilya Sutskever (Chief Scientist at OpenAI) and Jan Leike (Alignment team lead at OpenAI). Sutskever will be making this his core research focus.
- OpenAI are dedicating 20% of the compute they’ve secured to date, over the next four years, to solving superintelligence alignment.
- The superalignment team is hiring for a research engineer, a research scientist, and a research manager.
[Edited to add:]
Here's the introduction (with footnotes removed):
Superintelligence will be the most impactful technology humanity has ever invented, and could help us solve many of the world’s most important problems. But the vast power of superintelligence could also be very dangerous, and could lead to the disempowerment of humanity or even human extinction.
While superintelligence seems far off now, we believe it could arrive this decade. Managing these risks will require, among other things, new institutions for governance and solving the problem of superintelligence alignment:
How do we ensure AI systems much smarter than humans follow human intent?
And later on:
Our approach
Our goal is to build a roughly human-level automated alignment researcher. We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence.
To align the first automated alignment researcher, we will need to 1) develop a scalable training method, 2) validate the resulting model, and 3) stress test our entire alignment pipeline:
- To provide a training signal on tasks that are difficult for humans to evaluate, we can leverage AI systems to assist evaluation of other AI systems (scalable oversight). In addition, we want to understand and control how our models generalize our oversight to tasks we can’t supervise (generalization).
- To validate the alignment of our systems, we automate search for problematic behavior (robustness) and problematic internals (automated interpretability).
- Finally, we can test our entire pipeline by deliberately training misaligned models, and confirming that our techniques detect the worst kinds of misalignments (adversarial testing).
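To make the “scalable oversight” item above a bit more concrete, here is a minimal toy sketch of AI-assisted evaluation: one model critiques another model's answer so that a judge can score tasks it would struggle to evaluate unaided. This is my own illustration, not OpenAI's code or pipeline; the `critic`/`judge` callables, prompts, and ACCEPT/REJECT protocol are assumptions for illustration only.

```python
# Toy sketch of AI-assisted evaluation ("scalable oversight"):
# a critic model reviews an assistant's answer, and a judge model
# then scores the answer with the critique as extra context.
# The LLM callables here are hypothetical stand-ins for real model calls.
from typing import Callable

LLM = Callable[[str], str]  # prompt in, completion out

def assisted_evaluation(task: str, answer: str, critic: LLM, judge: LLM) -> bool:
    """Return True if the answer is accepted after an AI-written critique."""
    critique = critic(
        f"Task: {task}\nAnswer: {answer}\n"
        "List any errors or unsupported claims in the answer."
    )
    verdict = judge(
        f"Task: {task}\nAnswer: {answer}\nCritique: {critique}\n"
        "Given the critique, reply ACCEPT or REJECT."
    )
    return verdict.strip().upper().startswith("ACCEPT")

# Usage with trivial stubs standing in for real models:
if __name__ == "__main__":
    critic = lambda prompt: "No errors found."
    judge = lambda prompt: "ACCEPT"
    print(assisted_evaluation("Summarise the post.", "OpenAI announced...", critic, judge))
```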
Paraphrased from OpenAI's Twitter thread:
In addition to members from our existing alignment team, joining are Harri Edwards, Yuri Burda, Adrien Ecoffet, Nat McAleese, Collin Burns, Bowen Baker, Pavel Izmailov, and Leopold Aschenbrenner
And paraphrased from a Nat McAleese tweet:
Yes, this is the notkilleveryoneism team
I disagree with multiple things in this sentence. First, you take a deontological stance, whereas OpenAI clearly acts from a consequentialist stance, assuming that if they don't create 'safe' AGI, reckless open-source hackers will (given the continuing exponential decrease in the cost of effective training compute, and/or the next breakthrough in DNN architecture or training that makes it much more efficient and/or enables effective online training). Second, I largely agree with OpenAI, as well as Anthropic, that iteration is important for building an alignment solution. One probably cannot design a robust, safe AI without empirical iteration, including at increasing capability levels.
I agree with your assessment that the strategy they are taking will probably fail, but mainly because I think we lack the human intelligence, human psychology, and coordination mechanisms needed to execute it. That is, I would support Yudkowsky's proposal: halt all AGI R&D, develop narrow AI and technology for improving the human genome, make humans much smarter (von Neumann-level intelligence should be just the average) and give them a much more peaceful psychology, like bonobos, reform coordination and collective decision-making, and only then revisit the AGI project with roughly the same methodology as OpenAI proposes, albeit a more diversified one: I agree with your criticism that OpenAI is too narrowly focused on a sort of computationalism, to the detriment of perspectives from psychology, neuroscience, biology, etc. BTW, it seems that DeepMind is more diversified in this regard.