
Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required.

Subscribe here to receive future versions.


OpenAI announces a ‘superalignment’ team

On July 5th, OpenAI announced the ‘Superalignment’ team: a new research team given the goal of aligning superintelligence, and armed with 20% of OpenAI’s compute. In this story, we’ll explain and discuss the team’s strategy.

What is superintelligence? In their announcement, OpenAI distinguishes between ‘artificial general intelligence’ and ‘superintelligence.’ Briefly, ‘artificial general intelligence’ (AGI) is about breadth of performance. Generally intelligent systems perform well on a wide range of cognitive tasks. For example, humans are in many senses generally intelligent: we can learn how to drive a car, take a derivative, or play piano, even though evolution didn’t train us for those tasks. A superintelligent system would not only be generally intelligent, but also much more intelligent than humans. Conservatively, a superintelligence might be to humanity as humanity is to chimps.

‘Solving’ ‘superalignment’ in four years. OpenAI believes that superintelligence could arrive this decade. They also believe that it could cause human extinction if it isn’t aligned. By ‘alignment,’ OpenAI means making sure AI systems act according to human intent. So, ‘superalignment’ means making sure superintelligent AI systems act according to human intent (as opposed to doing alignment super well). The Superalignment team’s stated goal is to “solve the core technical challenges of superintelligence alignment in four years.” 

There are two important caveats to this goal. The first is that “human intent” isn’t monolithic. AI safety will have to involve compromise between different human intents. OpenAI knows that their technical work will need to be complemented by AI governance. The second is that alignment may not be a problem that can be solved conclusively, once and for all. It might instead be a wicked problem that must be met with varied interventions and ongoing vigilance.

Current alignment techniques don’t scale to superintelligence. OpenAI’s current alignment techniques rely on humans to supervise AI. In one technique, “reinforcement learning from human feedback” (RLHF), humans train AI systems to act well by giving them feedback. RLHF is how OpenAI trained ChatGPT to (usually) avoid generating harmful content. 

Humans can generally tell when a less intelligent system is misbehaving. The problem is that humans won’t be able to tell when a superintelligent AI system misbehaves. For example, a superintelligent system might deceive or manipulate human supervisors into giving it positive feedback.

OpenAI’s approach to alignment is to build and scale an automated alignment researcher. OpenAI proposes to avoid the problem of human supervision by automating supervision. Once they have built a “roughly human-level” alignment researcher, OpenAI plans to “iteratively align superintelligence” using vast amounts of compute. By “iterative,” OpenAI means that their first automated alignment researcher could align a relatively more capable system, which could then align an even more capable system, and so on. 

OpenAI dedicated 20% of their compute to alignment. OpenAI’s Superalignment team represents the single largest commitment a leading AI lab — or government, for that matter — has made to AI safety research. Still, it may not be enough. For example, Geoffrey Hinton has suggested that AI labs should contribute about 50% of their resources to safety research.

Musk launches xAI

Elon Musk has launched xAI, a new AI company that aims to compete with OpenAI and DeepMind. In this story, we discuss the implications of the launch.

What are xAI’s prospects? Given Musk’s resources, xAI has the potential to challenge OpenAI and DeepMind for a position as a top AI lab. In particular, xAI might be able to draw on the AI infrastructure at Tesla. Tesla is building what it projects to be one of the largest supercomputers in the world by early 2024, and Musk has said that Tesla might offer a cloud computing service.

How will xAI affect AI risk? It’s unclear how the entrance of xAI will affect AI risk. On one hand, the entrance of another top AI lab might exacerbate competitive pressures. On the other hand, Musk has been one of the earliest public proponents of AI safety. xAI has also listed Dan Hendrycks, the director of CAIS, as an advisor. (Note: Hendrycks does not have any financial stake in xAI and chose to receive a token $1 salary for his consulting.) xAI has the potential to direct Musk’s resources towards mitigating AI risk. More information about the organization will come out during this Friday’s Twitter Spaces with the xAI team.

Developments in Military AI Use

According to a recent Bloomberg article, the Pentagon is testing five large language model (LLM) platforms in military applications. One of these platforms is Scale AI’s Donovan. Also, defense companies are advertising AI-powered drones that can autonomously identify and attack targets.

AI and defense companies are developing LLMs for military use. Several companies, including Palantir Technologies, Anduril Industries, and Scale AI, are developing LLM-based military decision platforms. The Pentagon is currently testing five of these platforms. Scale AI says its new product, Donovan, is one of them.

What are the military applications of LLMs? The Pentagon is testing the LLM platforms for their ability to analyze and present data in natural language. Military decision-makers could make information requests directly through LLM platforms with access to confidential data. Currently, the military relies on much slower processes. Bloomberg reports that one platform took 10 minutes to complete an information request that would have otherwise taken several days.

The Pentagon is also testing the platforms for their ability to propose their own courses of action. Bloomberg was allowed to ask Donovan about a US response to a Chinese invasion of Taiwan. It responded: “Direct US intervention with ground, air and naval forces would probably be necessary.”

Scale AI advertises Donovan’s ability to generate novel courses of action.

The use of LLMs follows recent developments in AI-powered drones. AI systems have already been tested and deployed in autonomous flight and targeting. In 2020, DARPA’s AlphaDogfight program produced an AI pilot capable of consistently beating human pilots in simulations. A UN report suggests that the first fully autonomous drone attack occurred in Libya the same year. The company Elbit Systems is now advertising a similar “search and attack” drone that approaches humans and then explodes, and the US may be evaluating AI targeting systems.

Should we be concerned? If LLMs or AI drones give militaries a competitive advantage over their adversaries, then their use might lead to an arms race dynamic. Competing nations might increasingly invest in and deploy frontier AI models. Such a dynamic has the potential to exacerbate AI risk. For example, militaries might lose control over increasingly complex AI systems.

Links:

See also: CAIS website, CAIS twitter, A technical safety research newsletter, and An Overview of Catastrophic AI Risks

