
The post is ~900 words, so I would recommend reading it, but the key takeaways are:

  • OpenAI are starting a new “superintelligence alignment” team, led by Ilya Sutskever (Chief Scientist at OpenAI) and Jan Leike (Alignment team lead at OpenAI). Ilya Sutskever will be making this his core research focus. 
  • OpenAI are dedicating 20% of the compute they’ve secured to date over the next four years to solving superintelligence alignment.
  • The superalignment team is hiring for a research engineer, research scientist, and research manager.

[Edited to add]:

Here's the introduction (with footnotes removed):

Superintelligence will be the most impactful technology humanity has ever invented, and could help us solve many of the world’s most important problems. But the vast power of superintelligence could also be very dangerous, and could lead to the disempowerment of humanity or even human extinction.

While superintelligence seems far off now, we believe it could arrive this decade. Managing these risks will require, among other things, new institutions for governance and solving the problem of superintelligence alignment:

How do we ensure AI systems much smarter than humans follow human intent? 

And later on:

Our approach

Our goal is to build a roughly human-level automated alignment researcher. We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence.

To align the first automated alignment researcher, we will need to 1) develop a scalable training method, 2) validate the resulting model, and 3) stress test our entire alignment pipeline:

  1. To provide a training signal on tasks that are difficult for humans to evaluate, we can leverage AI systems to assist evaluation of other AI systems (scalable oversight). In addition, we want to understand and control how our models generalize our oversight to tasks we can’t supervise (generalization).
  2. To validate the alignment of our systems, we automate search for problematic behavior (robustness) and problematic internals (automated interpretability).
  3. Finally, we can test our entire pipeline by deliberately training misaligned models, and confirming that our techniques detect the worst kinds of misalignments (adversarial testing).
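
To make step 1 above ("scalable oversight") slightly more concrete, here is a minimal, purely illustrative Python sketch of assisted evaluation: a limited judge scores a stronger model's output with the help of an AI assistant's critiques. Every function and interface below is a hypothetical stub of mine, not anything OpenAI has described.

```python
# Toy sketch of assisted evaluation ("scalable oversight"); all models are hypothetical stubs.
from dataclasses import dataclass
import random


@dataclass
class Critique:
    issue: str
    severity: float  # 0.0 (harmless) to 1.0 (severe), as judged by the assistant


def strong_model_answer(task: str) -> str:
    """Stand-in for the capable model whose outputs we want to oversee."""
    return f"Proposed solution to: {task}"


def assistant_critiques(answer: str) -> list[Critique]:
    """Stand-in for an AI assistant that surfaces flaws a limited judge might miss."""
    return [Critique(issue="possibly unsupported claim", severity=random.random())]


def weak_judge(answer: str, critiques: list[Critique]) -> float:
    """A limited judge (human or small model) scores the answer, informed by the
    assistant's critiques rather than by the raw answer alone."""
    penalty = sum(c.severity for c in critiques)
    return max(0.0, 1.0 - 0.5 * penalty)


def oversight_signal(task: str) -> float:
    """One step of assisted evaluation; the score could serve as a training signal
    on tasks the judge could not reliably evaluate unaided."""
    answer = strong_model_answer(task)
    critiques = assistant_critiques(answer)
    return weak_judge(answer, critiques)


print(oversight_signal("summarize a 500-page safety report"))
```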

Paraphrased from OpenAI's Twitter thread:

In addition to members from our existing alignment team, joining are Harri Edwards, Yuri Burda, Adrien Ecoffet, Nat McAleese, Collin Burns, Bowen Baker, Pavel Izmailov, and Leopold Aschenbrenner.

And paraphrased from a Nat McAleese tweet:

Yes, this is the notkilleveryoneism team 

Comments

So, OpenAI believes that superintelligence 'could arrive this decade', and 'could lead to the disempowerment of humanity or even human extinction'. 

Those sentences should strike EAs as among the most alarming ones ever written.

Personally, I'm deeply concerned that OpenAI seems to have become even more caught up in their runaway hubris spiral, such that they're aiming not just for AGI, but for ASI, as soon as possible -- whether or not they get anywhere close to achieving viable alignment solutions.

The worst part of this initiative is framing alignment as something that will require a vast new increase in AI capabilities -- the AGI-level 'automated alignment researcher'. This gives them a get-out-of-jail-free card: they can claim they must push ahead with AGI, so they can build this automated alignment researcher, so they can keep us all safe from... the AGI-level systems they've just built. In other words, instead of treating AI alignment with human values as a problem at the intersection of moral philosophy, moral psychology, and other behavioral sciences, they're treating it as just another AI capabilities issue, amenable to clever technical solutions plus a whole lot of compute. Which gives them carte blanche to push ahead with capabilities research, under the guise of safety research. 

So, I see this 'superintelligence alignment' effort as cynical PR window-dressing, intended to reassure naive and gullible observers that OpenAI is still among 'the good guys', even as they accelerate their imposition of extinction risks on humanity.

Let's be honest with ourselves about that issue. If OpenAI still had a moral compass, and were still among the good guys, they would pause AGI (and ASI) capabilities research until they have achieved a viable, scalable, robust set of alignment methods that have the full support and confidence of AI researchers, AI safety experts, regulators, and the general public. They are nowhere close to that, and they probably won't get close to it in four years. Many AI-watchers (including me) are extremely skeptical that 'AI alignment' is ever possible, in any meaningfully safe way, given the diversity, complexity, flexibility, and richness of human (and animal) values and preferences that AIs are trying to 'align' with.

In summary: If OpenAI was an ethical company, they would stop AI capabilities research until they solve alignment. Period. They're not doing that, and have shown no intention of doing that. Therefore, I infer that they are not an ethical company, and they do not have humanity's best interests at heart.

Since OpenAI won't prioritize alignment research (by devoting more than 50% of their employees and compute to it), and won't pause capabilities research until alignment is solved, while acknowledging that unaligned AGI would pose an existential threat to humanity, we can deduce that they are recklessly negligent.

But I think this initiative is worse than simple inaction. As implemented, it actively lowers our chances of successful AGI alignment, for two reasons. First, by claiming with such confidence that they will solve alignment, OpenAI may be dampening interest in this problem everywhere else in the field. People will think "Oh, OpenAI has got alignment covered. I'll work on something else." Second, by absorbing many of the best and brightest engineers/scientists in the field, they stifle innovation elsewhere, at other companies where the incentives to get it right or to try different paradigms may be stronger, and where those people might have had a bigger impact.

Yes; excellent points. OpenAI giving false hope that alignment can be solved quickly and easily is very very bad.

PS as always, for people who disagree-vote, I'd appreciate some feedback on what specifically you disagree with.

I think your heart is in the right place. But a lot of these concerns, and also OpenAI's efforts, are very premature. Good to be cautious, sure. Yet extrapolating too far ahead usually doesn't produce useful results. Safety at every step of the way, and a response in proportion to the threat, will likely work better.

In what sense are these efforts 'premature'? AGI capabilities research is already far surpassing AI alignment research.

Four years doesn't seem like a whole lot of time, though. And no extrapolation is required to see that OpenAI's intention is not to treat alignment as an "intersection between moral philosophy, moral psychology, and other behavioral sciences...". From the perspective of someone who finds any of this ethically problematic, now would be a great time to talk about it.

On my reading, "human-level automated alignment researcher" means a system that is human-level at alignment research, but not AGI. You can take the position that in order to be human-level at alignment research, it will need to be AGI, but I don't think that's necessarily true, and in any case it's certainly not obvious. For myself, I keep being surprised at how capable systems can get at particular abilities without being fully general. (Years ago I wrongly believed that AGI would be necessary for artificial systems to reach the level of language capability they have right now; back in the 70's, Hofstadter wrongly believed AGI would be necessary for superhuman chess ability; etc.)

It's hard to imagine a more general and capability-demanding activity than doing good (superhuman!) science in such an absurdly cross-disciplinary field as AI safety (and among the disciplines involved are some that are notoriously not very scientific yet: psychology, sociology, economics, the study of consciousness, ethics, etc.). So if there is an AI that can do that but still is not counted as AGI, I don't know what the heck 'AGI' should even refer to. Compare with chess, which is a very narrow problem that can be formally defined and doesn't require the AI to operate with any science (or world models) whatsoever.

If OpenAI still had a moral compass, and were still among the good guys, they would pause AGI (and ASI) capabilities research until they have achieved a viable, scalable, robust set of alignment methods that have the full support and confidence of AI researchers, AI safety experts, regulators, and the general public.

I disagree with multiple things in this sentence. First, you take a deontological stance, whereas OpenAI clearly acts within a consequentialist stance, assuming that if they don't create 'safe' AGI, reckless open-source hackers will (given the continuing exponential decrease in the cost of effective training compute, and/or the next breakthrough in DNN architecture or training that makes it much more efficient and/or enables effective online training). Second, I largely agree with OpenAI as well as Anthropic that iteration is important for building an alignment solution. One probably cannot design a robust, safe AI without empirical iteration, including with increasing capabilities.

I agree with your assessment that the strategy they are taking will probably fail, but mainly because I think we have inadequate human intelligence, human psychology, and coordination mechanisms to execute it. That is, I would support Yudkowsky's proposal: halt all AGI R&D, develop narrow AI and tech for improving the human genome, make humans much smarter (a von Neumann level of intelligence should be just the average) and much more peaceful psychologically, like bonobos, reform coordination and collective decision-making, and only then revisit the AGI project with roughly the same methodology as OpenAI proposes, albeit more diversified: I agree with your criticism that OpenAI is too narrowly focused on some sort of computationalism, to the detriment of perspectives from psychology, neuroscience, biology, etc. BTW, it seems that DeepMind is more diversified in this regard.

We are dedicating 20% of the compute we’ve secured to date over the next four years to solving the problem of superintelligence alignment.

This may sound like a lot, but I think it's likely that 4 years from now 20% of currently secured compute will just be a tiny fraction of what they'll have secured by then.
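
To put rough numbers on that (the growth multipliers below are my own assumptions, not anything OpenAI has stated): because the pledge is fixed at 20% of compute secured to date, its share of the future fleet shrinks in proportion to however much more compute they secure.

```python
# Illustrative arithmetic only; the growth figures are assumptions, not OpenAI numbers.
compute_secured_today = 1.0             # normalize today's secured compute to 1
pledge = 0.20 * compute_secured_today   # the 20% pledge is fixed against today's total

for growth in (2, 5, 10):               # hypothetical growth in secured compute over 4 years
    future_total = compute_secured_today * growth
    print(f"{growth}x growth: pledge is {pledge / future_total:.0%} of the future total")
# prints 10%, 4%, and 2% respectively
```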

Related: Zvi has just written a pretty in-depth piece on this new Superalignment team, which I recommend. Here's the opening:

This is a real and meaningful commitment of serious firepower. You love to see it. The announcement, dedication of resources and focus on the problem are all great. Especially the stated willingness to learn and modify the approach along the way.

The problem is that I remain deeply, deeply skeptical of the alignment plan. I don’t see how the plan makes the hard parts of the problem easier rather than harder.

I will begin with a close reading of the announcement and my own take on the plan on offer, then go through the reactions of others, including my take on Leike’s other statements about OpenAI’s alignment plan.

Very nice! I'd say this seems like it's aimed at a difficulty level of 5 to 7 on my table:

https://www.lesswrong.com/posts/EjgfreeibTXRx9Ham/ten-levels-of-ai-alignment-difficulty#Table

I.e. experimentation on dangerous systems and interpretability play some role but the main thrust is automating alignment research and oversight, so maybe I'd unscientifically call it a 6.5, which is a tremendous step up from the current state of things (2.5) and would solve alignment in many possible worlds.

Our goal is to solve the core technical challenges of superintelligence alignment in four years.

This is a great goal! I don’t believe they’ve got what it takes to achieve it, though. Safely directing a superintelligent system at solving alignment is an alignment-complete problem. Building a human-level system that does alignment research safely on the first try is possible; running more than one copy of this system at superhuman speed safely is something no one has any idea how to even approach; and unless this insanity is stopped so that we have many more than four years to solve alignment, we're all dead.

My first thought was "oh, so we should just let you handle your own evaluations?" My husband (Ronny Fernandez)'s first thought was "training 'human-level AI researchers' to perform adversarial evaluations is setting up an intelligence explosion"...

We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence.

Worth reading:

No control method exists to safely contain the global feedback effects of self-sufficient learning machinery. What if this control problem turns out to be an unsolvable problem?

https://www.lesswrong.com/posts/xp6n2MG5vQkPpFEBH/the-control-problem-unsolved-or-unsolvable
