The post is ~900 words, so I would recommend reading it in full, but the key takeaways are:
- OpenAI are starting a new “superintelligence alignment” team, led by Ilya Sutskever (Chief Scientist at OpenAI) and Jan Leike (Alignment team lead at OpenAI). Ilya Sutskever will be making this his core research focus.
- OpenAI are dedicating 20% of the compute they’ve secured to date over the next four years to solving the problem of superintelligence alignment.
- The superalignment team is hiring for a research engineer, research scientist, and research manager.
[Edited to add]:
Here's the introduction (with footnotes removed):
Superintelligence will be the most impactful technology humanity has ever invented, and could help us solve many of the world’s most important problems. But the vast power of superintelligence could also be very dangerous, and could lead to the disempowerment of humanity or even human extinction.
While superintelligence seems far off now, we believe it could arrive this decade. Managing these risks will require, among other things, new institutions for governance and solving the problem of superintelligence alignment:
How do we ensure AI systems much smarter than humans follow human intent?
And later on:
Our approach
Our goal is to build a roughly human-level automated alignment researcher. We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence.
To align the first automated alignment researcher, we will need to 1) develop a scalable training method, 2) validate the resulting model, and 3) stress test our entire alignment pipeline:
- To provide a training signal on tasks that are difficult for humans to evaluate, we can leverage AI systems to assist evaluation of other AI systems (scalable oversight). In addition, we want to understand and control how our models generalize our oversight to tasks we can’t supervise (generalization).
- To validate the alignment of our systems, we automate search for problematic behavior (robustness) and problematic internals (automated interpretability).
- Finally, we can test our entire pipeline by deliberately training misaligned models, and confirming that our techniques detect the worst kinds of misalignments (adversarial testing).
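To make the "scalable oversight" bullet above a bit more concrete, here is a minimal toy sketch (mine, not OpenAI's) of what AI-assisted evaluation could look like: a critique model helps an overseer score another model's answer on a task that is hard to judge unaided, and that score becomes the training signal. All function names and behaviors are hypothetical placeholders, not any real API.

```python
# Illustrative sketch only: a toy "scalable oversight" loop in which a critique
# model assists the evaluation of another model's answer, producing a scalar
# training signal for a task that is hard for the overseer to judge directly.
# Every model function below is a hypothetical stand-in for a real LLM call.

from dataclasses import dataclass
from typing import List


@dataclass
class Judgement:
    answer: str
    critiques: List[str]
    reward: float  # scalar signal fed back to train the policy model


def assistant_answer(task: str) -> str:
    """Stand-in for the model being trained (the 'policy')."""
    return f"Proposed answer to: {task}"


def critique_model(task: str, answer: str) -> List[str]:
    """Stand-in for a helper model that surfaces flaws the overseer might miss."""
    return [f"Check whether '{answer}' actually addresses '{task}'."]


def overseer_score(task: str, answer: str, critiques: List[str]) -> float:
    """Stand-in for a (human or AI) overseer scoring the answer *given* the critiques."""
    return 0.0 if any("flaw" in c.lower() for c in critiques) else 1.0


def scalable_oversight_step(task: str) -> Judgement:
    answer = assistant_answer(task)
    critiques = critique_model(task, answer)          # AI assists the evaluation
    reward = overseer_score(task, answer, critiques)  # signal on a hard-to-evaluate task
    return Judgement(answer, critiques, reward)


if __name__ == "__main__":
    print(scalable_oversight_step("Summarize the safety implications of a 500-file code change"))
```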
Paraphrased from OpenAI's Twitter thread:
In addition to members from our existing alignment team, joining are Harri Edwards, Yuri Burda, Adrien Ecoffet, Nat McAleese, Collin Burns, Bowen Baker, Pavel Izmailov, and Leopold Aschenbrenner
And paraphrased from a Nat McAleese tweet:
Yes, this is the notkilleveryoneism team
So, OpenAI believes that superintelligence 'could arrive this decade', and 'could lead to the disempowerment of humanity or even human extinction'.
Those sentences should strike EAs as among the most alarming ones ever written.
Personally, I'm deeply concerned that OpenAI seems to have become even more caught up in their runaway hubris spiral, such that they're aiming not just for AGI, but for ASI, as soon as possible -- whether or not they get anywhere close to achieving viable alignment solutions.
The worst part of this initiative is framing alignment as something that will require a vast new increase in AI capabilities -- the AGI-level 'automated alignment researcher'. This gives them a get-out-of-jail-free card: they can claim they must push ahead with AGI, so they can build this automated alignment researcher, so they can keep us all safe from... the AGI-level systems they've just built. In other words, instead of treating AI alignment with human values as a problem at the intersection of moral philosophy, moral psychology, and other behavioral sciences, they're treating it as just another AI capabilities issue, amenable to clever technical solutions plus a whole lot of compute. Which gives them carte blanche to push ahead with capabilities research, under the guise of safety research.
So, I see this 'superintelligence alignment' effort as cynical PR window-dressing, intended to reassure naive and gullible observers that OpenAI is still among 'the good guys', even as they accelerate their imposition of extinction risks on humanity.
Let's be honest with ourselves about that issue. If OpenAI still had a moral compass, and were still among the good guys, they would pause AGI (and ASI) capabilities research until they had achieved a viable, scalable, robust set of alignment methods with the full support and confidence of AI researchers, AI safety experts, regulators, and the general public. They are nowhere close to that, and they probably won't get close to it in four years. Many AI-watchers (including me) are extremely skeptical that 'AI alignment' is even possible, in any meaningfully safe way, given the diversity, complexity, flexibility, and richness of the human (and animal) values and preferences that AIs are trying to 'align' with.
In summary: If OpenAI were an ethical company, they would stop AI capabilities research until they solved alignment. Period. They're not doing that, and they have shown no intention of doing so. Therefore, I infer that they are not an ethical company, and that they do not have humanity's best interests at heart.
In what sense are these efforts 'premature'? AGI capabilities research is already far outpacing AI alignment research.