What success looks like

mariushobbhahn; MaxRa; Yannick_Muehlhaeuser; JasperGo; slg

Note: This article was written by Marius Hobbhahn, Max Räuker, Yannick Mühlhäuser, Jasper Götting and Simon Grimm. We are grateful for feedback from and discussions with Lennart Heim, Shaun Ee and AO.

Summary

Thinking through scenarios where TAI goes well informs our goals regarding AI safety and leads to concrete action plans. Thus, in this post,

We sketch stories where the development and deployment of transformative AI go well. We broadly cluster them as follows:
1. Alignment won’t be a problem, …
  - Because alignment is easy: Scenario 1
  - We get lucky with the first AI: Scenario 4
2. Alignment is hard, but …
  - We can solve it together, because …
    - We can effectively deploy governance and technical strategies in combination: Scenario 2
    - Humanity will wake up due to an accident: Scenario 3
    - The US and China will realize their shared interests: Scenario 5
  - One player can win the race, by …
    - Launching an Apollo Project for AI: Scenario 6
We categorize central points of influence that seem relevant for causing the success of our sketches. The categories with some examples are:
1. Governance: domestic laws, international treaties, safety regulations, whistleblower protection, auditing firms, compute governance and contingency plans
2. Technical: Red teaming, benchmarks, fire alarms, forecasting and information security
3. Societal: Norms in AI, publicity and field-building
We lay out some central causal variables for our stories in the third chapter. They include the level of cooperation, AI timelines, take-off speeds, the size of the alignment tax, and the type and number of actors.

Introduction

There are many posts in AI alignment on sketching out failure scenarios. However, there seems to be less (public) work that talks about possible pathways to success. Holden Karnofsky writes in the appendix of Important, actionable research questions for the most important century:

Quote (Holden Karnofsky): I think there’s a big vacuum when it comes to well-thought-through visions of what [a realistic best-case transition to transformative AI] could look like, and such a vision could quickly receive wide endorsement from AI labs (and, potentially, from key people in government). I think such an outcome would be easily worth billions of dollars of longtermist capital.

We want to sketch out such best-case scenarios for the transition to transformative AI. This post is inspired by Paul Christiano’s What Failure Looks Like. Our goals are

To better understand what positive scenarios could look like
To better understand what levers are available to make positive outcomes more likely
To better understand subgoals to work towards
To make our reasoning transparent and receive feedback on our misconceptions

Scenarios

In the following, we will sketch out what some of the success stories for relevant subcomponents of TAI could look like. We know that these scenarios often lack important detailed counterarguments. We are also aware that every single point here could be its own article but we want this post to merely give a broad overview.

Scenario 1: Alignment is much easier than expected

Note: this is a hypothetical scenario. This is NOT a description of the findings in AI safety over the last years.

The alignment problem turns out to be much easier than expected. Increasingly capable AI models have a better understanding of human values, and they do not naturally develop strong influence-seeking tendencies. Moreover, in cases of malfunctions and for preventative measures, interpretability tools now allow us to understand important parts of large models on the most basic level, and ELK-like tools allow us to honestly communicate with AI systems.

We know of many systems where a more powerful actor is relatively well aligned to one or many powerless actors. Parents usually protect their children who couldn’t survive without them, democratic governments often act in ways that benefit their citizens, doctors intend to act in ways that improve their patients’ health and companies often act to the benefit of their consumers.

While the bond between parents and children is the result of millions of years of evolution, democracies took hundreds of years to build and beneficial market economies are the results of decades of fine-tuning regulation, the underlying alignment mechanisms are not complicated in themselves (Note: markets can also be unaligned in many ways, e.g. by externalizing cost). Human parents often genuinely care about the interests of their children because it was a beneficial evolutionary adaptation, democracies are roughly aligned with their citizens because of electoral pressure, and markets are mostly aligned to their consumers due to profit incentives. Of course, all of these come with complex details and exceptions, but the main driver for alignment is not complicated. Maybe, it’s similar in the case of AI alignment. Maybe, we just find a simple mechanism that works most of the time and whenever it doesn’t we can understand why and intervene.

We would like this scenario to be true, but we don’t think it’s likely. The analogies are relatively weak, and there are many reasons to assume they don’t transfer to highly advanced AI systems. Furthermore, for these analogies, a small number of exceptions (e.g. market failures) often don't lead to major harm but one unaligned AI could have disastrous consequences.

Scenario 2: The combination of many technical and governance strategies works

Within ten years after the first development of TAI, multiple international companies have built powerful AI systems and deployed them in the real world. As the societal impact becomes clear, governments make safety and alignment a priority. Due to fine-tuned governance and targeted incentives, all of these companies prioritize alignment and have established protocols for safety and security. They extensively test and monitor their models on large alignment benchmarks before deployment, they have fire alarms in place and can immediately respond to problems when they show up. Most of them have entire departments solely focused on alignment and employ red teaming units that analyze models and their possible failure modes in great detail. Additionally, regulators use AI systems to monitor and forecast the societal impact of AI. There are off switches in both software and hardware, and contingency plans are ready for many different failure scenarios.

From time to time, a model fails one or multiple tests but it never evades all of them, e.g. a model is able to fool a number of benchmarks but its deception is spotted by a red team investigating the AI system. All in all, the vast majority of AI uptime doesn’t create any problems and transformative AI has already led to drastic increases in GDP, increased human leisure time and enabled multiple breakthroughs in the life sciences. Whenever unintended behavior arises, it is caught by one of many established failure-safe mechanisms that were put in place because society has made it a priority. Failure-safe mechanisms grow in number and improve faster than the ability of individual AI systems to fool them, and the loss of control over a misaligned AI is avoided indefinitely.

We think this scenario is more plausible than the previous one but it still has many failure modes, e.g. a failure-safe mechanism can’t keep up with increasingly intelligent AI systems, multiple AIs cooperate to fool safety mechanisms, or an AI is intentionally used for bad purposes by a bad actor.
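The "never evades all of them" intuition can be illustrated with a small calculation. This is only a sketch: the miss rates below are hypothetical numbers chosen for illustration, and the independence assumption is ours, not a claim of the scenario. If each safety layer misses a given failure independently, the chance that the failure slips past every layer is the product of the individual miss rates. If the layers fail together (for example because one deception strategy fools all of them), the chance can be as high as the smallest single miss rate.

```python
import math

# Hypothetical per-layer probabilities of missing a given failure.
MISS_RATES = {"benchmarks": 0.3, "red team": 0.2, "fire alarms": 0.1}


def p_evades_all_independent(miss_rates):
    """Probability of evading every layer if the layers fail independently."""
    return math.prod(miss_rates)


def p_evades_all_upper_bound(miss_rates):
    """Upper bound when layer failures may be correlated.

    Evading all layers implies evading the strongest one, so the
    probability can never exceed the smallest individual miss rate.
    """
    return min(miss_rates)


independent = p_evades_all_independent(MISS_RATES.values())
upper_bound = p_evades_all_upper_bound(MISS_RATES.values())
print(f"independent layers: {independent:.3f}")
print(f"correlated layers, at most: {upper_bound:.3f}")
```

The gap between the two numbers is one way to see why the failure modes listed above matter: layering only helps to the extent that the layers do not share a common weakness.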

Scenario 3: Accident and regulation

By 2040 many advanced AI systems are being commonly used, such as highly advanced voice assistants, largely automated supply chains, and driverless cars. AIs are integrated into numerous economic sectors and applied to more and more problems. AI safety is of little concern to the general public, and due to the continuing technical and conceptual difficulty of alignment, many companies do not invest resources into problems that don’t affect the direct usability of their AI systems.

At some point, a particularly advanced AI system from a leading quantitative finance firm autonomously exploits a loophole and executes trades that quickly cause an unprecedented stock market crash while amassing extraordinary amounts of wealth. The resulting recession costs hundreds of millions of jobs and leads to humanitarian crises in developing countries. A public outcry over loose AI regulations is used as an opportunity by AI safety advocates to get governments to pass new, more restrictive regulations for automated systems, thus severely slowing down the development and deployment of AI systems. Private companies are now only allowed to use AIs above a certain capability level within a global AI framework that requires intensive vetting before usage. Advanced AI systems may not be used for certain tasks, like the control and development of weapon systems or financial trading. The AI must pass a long list of tests and possibly be individually vetted by alignment researchers. This leaves more time for alignment research to progress. The accident also causes an influx of AI alignment funding and talent, and the alignment community is able to develop much better alignment solutions.

This scenario is modeled after the trajectory of nuclear power, which today is highly regulated and used less than it might have been due to several accidents. We believe this scenario is more likely than the above scenarios but still somewhat unlikely, as we find it implausible that governments will jointly and effectively regulate advanced AI, given its enormous utility and the difficulty of regulating a relatively broad class of technologies.

Scenario 4: Alignment by chance

In the years leading up to TAI, AI safety research does not see any breakthroughs or yield applicable results which demonstrate the safe deployment of powerful AI systems. The field grows considerably but not as much as the overall field of machine learning. AI systems grow increasingly powerful until TAI is eventually achieved. Instead of alignment being particularly easy, the first developed TAI by sheer luck ends up fairly intent-aligned, i.e. it fairly accurately infers the intentions of its operators and predominantly considers action plans that would be endorsed by them.

The AI safety community convinces the operators to use their TAI to develop algorithms for the robust alignment of future AI systems. These algorithms are then deployed and their usage is enforced through cooperation with governments and private industry actors. Furthermore, the alignment tax turns out to be very small which leads to the widespread voluntary adoption of alignment procedures among most actors.

This scenario seems unlikely as it requires both a) being lucky about alignment and b) coming up with and implementing a successful strategy to prevent all deployments of unaligned TAI.

Scenario 5: US-China-driven global cooperation

During the 2020s, international awareness of the risks from advanced AI and of the alignment problem increases strongly. Reasons for the increase are lethal AI failures, dystopian advances in lethal autonomous weapons and the increasingly complex problems that AI systems are applied to, for which it is hard to satisfactorily specify goals and constraints. Additionally, field-building efforts for technical AI safety contribute to a widespread consensus within computer science that an alignment problem exists. This leads to an increased appreciation of the risks posed by advancing AI, both in Western countries and in China.

Through discussions and agreements between a small number of great powers, it is agreed that hastened or non-public development of potentially transformative AI must be prevented, even at significant cost. This results in a) international agreements that strongly restrict the development of advanced AI systems and establish standards on AI development and deployment, b) exhaustive institutionalized auditing and monitoring mechanisms (similar to the IAEA today) and c) the establishment of international academic collaborations on value-aligned and cooperative AI.

This regime is sustained until international research projects successfully develop advanced AI systems that are robustly aligned with human values.

This scenario only works if all relevant international powers are highly willing to cooperate and are able to keep non-cooperative actors in check. Without those powers buying into reliable and likely costly mechanisms that prevent defection from the agreements, it seems likely that those agreements will not be able to prevent countries from developing TAI with their individual interest in mind.

Scenario 6: Apollo Project for AI

As AI development advances, it appears that TAI necessitates levels of compute beyond the reach of private actors. Due to national strategic interest in leading on AI, the largest development programs are taken over by nation states. They acquire a large fraction of compute, talent and data. With a massive investment of money, more and more supply chains are captured and integrated with TAI systems.

The largest government-led AI project has a sufficiently large head start that it feels comfortable investing a lot of resources in long-term safety measures such as technical alignment research. They use their economic power to gain access to the most advanced AI technology in exchange for stopping their own TAI development and agreeing to a compromise on political goals.

In this scenario, it wasn’t even necessary for the leading actor to have developed TAI in order to prevent AI existential catastrophes. Due to a shared perception of a clear head start in terms of developing TAI, other actors have a significant interest in cooperating with the leading power. If only a few countries don’t want to engage in the effective regulatory regime that is set up, there is likely sufficient economic power available to make non-compliance extremely costly.

This scenario is only plausible if states have sufficient interest and resources to take on such a large project. Furthermore, getting sufficiently far ahead of the competition seems somewhat unlikely as there is ample international interest in leading on AI, and technological leadership seems difficult to judge accurately. Lastly, even if someone gets ahead this does not mean that they will use their advantage well.

Catalysts for success

In this section, we flesh out possible catalysts for success, i.e. factors that have a plausible causal influence on increasing the probability of good TAI transition scenarios. In the following table, we show which catalysts stand out for the different scenarios. The rest of the section explains these factors in more detail.

Scenario	Catalysts
Scenario 1: Alignment is much easier than expected	fast progress in technical solutions
Scenario 2: The combination of many technical and governance strategies works	benchmarks, fire alarms, auditing firms, red teams, contingency plans, norms in AI, bounties, security clearances, whistleblower protection, transparency rules
Scenario 3: Accident and regulation	lobbying, publicity, field building, auditing firms, red teams, benchmarks
Scenario 4: Alignment by chance	international treaties, domestic laws, differential technological development
Scenario 5: US-China-driven global cooperation	international treaties, monitoring and control of compute resources, facilitation of international diplomacy
Scenario 6: Apollo Project for AI	facilitation of international diplomacy, security clearances, red teams, domestic laws

Governance

Domestic laws

Domestic laws are the most straightforward way for governments to influence AI safety. Laws can, for example, mandate certain safety standards that are enforced by existing state agencies. Some upsides of domestic laws are

the ability to flexibly design and adapt laws based on local feedback
the relative increase in variability of legal approaches, which in turn can inspire laws in other countries

Some downsides of this approach are

the risk of a regulatory race to the bottom, in which states relax safety standards in the hope of giving national companies a leg up in international competition
the fact that the downside risks of unaligned AI are not locally confined and would spill over internationally

International treaties

International treaties suggest themselves as a solution to the downsides of domestic laws. Many dangerous weapon types (missiles, ABC weapons) and dual-use technologies (nuclear power) are currently subject to global governance mechanisms. The mechanisms include international treaties (e.g. the Non-Proliferation Treaty for nuclear weapons), sometimes combined with semi-independent organizations with formal and informal authority to promote the peaceful and safe use of certain technologies (e.g. the International Atomic Energy Agency, IAEA). Another example is treaties like the Treaty on Open Skies, which allow nations to collect information about other actors’ technological capabilities.

We have recently seen policymakers take some first steps into focused regulation of AI, such as in the EU AI Act and others (Regulation of artificial intelligence - Wikipedia), but these say little about potentially transformative AI systems. As AI development becomes a more mature field, there might be room for an international agency, staffed with subject experts who can provide guidance on regulation to nation states. There is a plausible world in which these international organisations are toothless tigers, e.g. underfunded and understaffed without executive power, thus requiring additional safety mechanisms.

Bans

While a ban on AI with transformative potential currently seems extraordinarily unlikely to succeed, there are use-cases of AI that the international community might be able to ban, such as including AI systems in the decision-making chain for dangerous technologies and weapons or the non-transparent use of advanced AI systems in social media campaigns. Narrow bans in the near future could also set a precedent and thereby increase the probability and tractability of bans on more general AI systems in the future.

Safety regulations

These seem like a more realistic global governance mechanism for advanced AI in comparison to bans. This might include mandating a certain level of interpretability and explainability, regular auditing as well as a high level of cyber security.

Risk- and benefit-sharing mechanisms

Cooperation among nation states could be improved through innovative governance mechanisms that share the upsides and potential downsides of AI. One could for instance imagine an international, multi-state fund, containing considerable amounts of stocks of the most important AI companies. This would ensure that all fund-owners have a financial incentive for a positive transition to TAI. Furthermore, there could be financial compensation schemes similar to gun buyback programs to make it easier for companies to retreat from risky trajectories. Benefit-sharing mechanisms could backfire, e.g. countries with a financial stake in the development of TAI might become less focused on costly safety regulations.

Transparency Rules

Strong transparency norms and rules would decrease the risk of repeat accidents. Being made aware of concrete cases of safety failures is important to continuously improve future safety architecture. In the absence of such norms, we can expect organizations to almost exclusively report positive outcomes.

Whistleblower Protection

We can expect safety regulations to be quite nuanced and therefore likely difficult to enforce in practice. Similar to tax and business fraud, we should expect whistleblowers to play a relevant role. To incentivize whistleblowers, they could be protected from legal liability if their whistleblowing breaks a contract. Reporting infractions of high importance could also be financially rewarded if the reports are genuine. Furthermore, one could establish clear mechanisms to report AI failures in private companies, e.g. a specific government agency that promises legal protection.

Auditing firms

These are independent organizations that assess whether private firms and governments follow agreed-upon safety standards. Financial auditing is a big industry that almost every investor relies on to some extent. Governments tend to defer to the standards of the accounting industry represented by various industry associations. A similar model could work for AI. Auditing firms could check AI systems for reliability and safety as well as ethical issues. Similar to financial auditing there is a risk of a conflict of interest here, but there are reasons to believe that governments deferring to private actors instead of making and enforcing their own standards might be better in a field that is changing as rapidly as AI.

Note: we are not claiming that private or state-run auditing firms are clearly better. We think there are trade-offs and want to point out that speed is one of the key factors in AI and that this might be an argument in favor of private auditing actors.

Monitoring and Control of Compute Resources

If current trends continue we expect the training of TAI candidates to entail a large-scale concentration of compute resources. The technical capacity to produce the most advanced semiconductors is currently resting in the hands of a very small set of companies. Computation is energy-intensive and will therefore be hard to hide (at least for now). Those facts combined suggest it may be possible to institute an effective control and monitoring system for the hardware necessary for TAI. This could be done by nation states, but ideally, an international organization tasked with monitoring and regulating compute would be set up.

Information Sharing and Cooperation

Information sharing and cooperation between regulators, academia and engineers is likely helpful to enable more transfer of knowledge or the introduction of industry-wide standards. For a policy to be effective, it seems plausible that we need the knowledge of people at all steps of the production chain. When we find good technical solutions to alignment, there should be a way for them to become industry-wide standards as fast as possible.

This includes things like conferences and workshops that bring together people from policy and industry as well as private and public dialogues.

Facilitation of International Diplomacy

Facilitating international diplomacy can help policymakers of hostile nation states see common interests and build trust. Confidence-building measures for policymakers can reduce their fear of other actors and stop them from taking radical actions due to excessive mistrust or panic. This can also be achieved via Track 1.5 and Track 2 diplomacy.

Contingency plans

There are many possible scenarios in which powerful AI could go wrong. In a military context, it is common to be prepared for multiple failure scenarios with so-called war plans. These provide detailed plans on multiple ways to react to a given scenario and include information such as

“Who needs to be informed?”,
“Which units need to be activated?”,
“How high is the priority of this scenario?”,
“What is the best-, median- and worst-case outcome of this scenario?”.

Those plans are ideally made at all relevant levels (from individual teams and companies to governments and the international community).

The main reason to have these plans is that these questions need to be answered as fast as possible in case something happens and it might be too late to react once the scenario has already started.

Security clearances

AI safety work will necessarily use and generate sensitive information and infohazards such as information on contingency plans, agreements between actors, exploits or untapped capabilities in commonly used methods. Security clearances, which grant access to authorized, vetted individuals, are a necessary tool for keeping that information safe. Security clearances are standard operating procedures in militaries, governments, and secure facilities like nuclear power plants and are often also required for e.g. janitorial positions. Awarding of clearances usually considers criminal background, travel history, character assessments, and more.

Differential Technological Development

With more work on technical AI safety it might become clear that certain approaches to developing TAI are much easier to align than others. Governance could then be used to steer the advancement of AI in the most advantageous direction. For instance, governments could fund good approaches or disincentivize bad ones.

Bounties

Beyond tasking company internal safety teams, delegated red teams or external auditing firms with finding safety risks, one could mobilize the capabilities of many more external actors by creating bounties to reward people that detect safety risks and sources of misalignment. There is however a limit to how far we can get with bounties since a lot of safety gaps are hard to discover without access to the source code and exact architecture.

Technical

Red teaming

One possible way to mitigate and prevent failures is to actively look out for them, potentially rewarding the identification of possible failure modes. This could be done by red teams that are either part of the company deploying the AI or part of an independent organization similar to consulting firms. These red teams would actively seek failure modes in the AI (in a secure setting - if that is possible) to prevent more harm in the future. For example, they could use interpretability tools or ELK to investigate its true beliefs or test the AI’s behavior in extreme cases. Red teaming is already routinely performed in, e.g., intelligence agencies or defense firms. Similarly, auditing is a routine procedure for most major corporations.

Benchmarks

While it is hard to measure alignment, we should still attempt to do it whenever possible and reasonable. In practice, this could look like a very large benchmark, similar to Google’s BIG-bench, that provides a large range of tasks, all of which could capture some component of alignment. With a sufficiently large and complex benchmark, overfitting and the effect of Goodhart's law could be reduced. Such a benchmark could also be used to hold companies working with LLMs accountable and provide them with cheap options to evaluate their models. We think good examples of smaller projects going in the right direction are RobustBench, a benchmark for measuring adversarial robustness, and BASALT, a benchmark for learning from human feedback.

Forecasting

Improving our ability to foresee the most likely TAI scenarios will aid us in deciding between different relevant developments. Possible forecasts include

research advances (e.g. progress in and types of AI capabilities, progress in safety-relevant aspects such as interpretability and value learning)
geopolitical developments (e.g. tension between superpowers, motivation to prevent AI-related conflicts)
economic applications (e.g. what applications will most prominently affect the attention of regulators, which applications will be most profitable and thereby affect the most relevant actors)

Professional forecasting could be used by regulators to fine-tune novel legislation. It might be a good idea to obligate the actors involved in TAI development to aid forecasters by supplying data or answering specific technical questions.

Fire Alarms

It is hard to measure how capable and aligned a new AI system is. Therefore, we should have fire alarms that go off once a previously defined threshold is exceeded. An example of a vaguely defined fire alarm could be “ring the bell when the AI shows specific trading patterns such as buying lots of resources at once”. Optimally, such fire alarms would be widespread and seen as necessary and important by all actors. Furthermore, there should be many people in private industry, governance and academia working on fire alarms, similar to how there are many people working on safety in cars, planes and food today.
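To make the idea of a previously defined threshold concrete, here is a minimal sketch of a fire alarm check. The metric names and threshold values are hypothetical and chosen only for illustration; a real fire alarm would need carefully chosen metrics and would be much harder to specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FireAlarm:
    """A threshold fixed before deployment (all values here are hypothetical)."""
    metric: str       # name of the monitored quantity
    threshold: float  # the alarm rings when the observed value exceeds this


def ringing_alarms(alarms, observations):
    """Return the metrics whose observed value exceeds the predefined threshold.

    Metrics without an observation are treated as not ringing.
    """
    return [
        alarm.metric
        for alarm in alarms
        if observations.get(alarm.metric, float("-inf")) > alarm.threshold
    ]


ALARMS = [
    FireAlarm(metric="resource_purchases_per_hour", threshold=1_000.0),
    FireAlarm(metric="share_of_daily_trading_volume", threshold=0.05),
]

observed = {
    "resource_purchases_per_hour": 4_200.0,
    "share_of_daily_trading_volume": 0.01,
}
print(ringing_alarms(ALARMS, observed))
```

The key design choice is that the thresholds are written down before the system is observed, so that nobody can move the goalposts after the fact.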

Information security

We think that more powerful models will attract the attention of insufficiently careful or even malicious actors. For example, we think it is plausible that someone will try to steal the parameters/weights of powerful model architectures. Furthermore, some of the strategic questions that AI safety professionals are working on might be of high relevance to other actors and could thus be subject to information theft. Possible countermeasures range from stronger personal information security for AI safety staff and working on air-gapped computers for confidential projects to strong security measures for powerful AI model weights.

Societal

Norms in AI

Establishing a norm of seriously assessing the safety of AI systems before design and deployment would likely contribute to an increased understanding and prevention of possible failures related to highly advanced AI. Similar to the nuclear industry, people should be aware that they are involved in a field where risk minimization and safety are at the core of the work, rather than just a thing to think about at the end. Ideally, systems are checked for safety multiple times by independent people and safety is viewed as the greatest good. A model for such a mindset is found in the aviation industry, where the publicity of accidents and thus fear of lost revenue led to extremely rigorous safety procedures, including multiple levels of redundancy, standard operating procedures and checklists for processes, and designing safe systems that account for the natural fallibility of humans.

Widespread publicity

We see two good potential scenarios in the public perception and conversation on TAI.

The first would be widespread, roughly informed conversations around the use and misuse of strong AI. Increased attention on AI will incentivize more investigations into possible and ongoing failure modes of deployed AI systems, similar to how environmental concerns lead private groups to uncover unlawful pollution. This would also put pressure on policymakers to pay more attention to AI safety.

Expert publicity

An alternative scenario would be artificial intelligence primarily remaining a topic among experts and law-makers. The governance of TAI would not be coded as political, thus allowing level-headed, technocratic governance of TAI. Most experts avoid drawing a lot of attention to their work and focus on communicating with other experts directly instead of speaking to the public. A useful analogy here would be financial regulation.

We think both scenarios have clear failure modes. Widespread public conversation can end up relying on disinformation, as in the case of nuclear power. When a topic is primarily of concern to experts and policymakers, it can receive insufficient attention, as with pandemic preparedness prior to Covid.

Field building

Technical AI safety and AI governance as academic fields are in their infancy and growing them could help in many ways. Possible actions include

Establishing a small subfield, e.g. with its own journal and academic conferences. One goal could be to have at least as many non-EA academics as EA academics doing research on AI safety.
Persuading researchers in existing fields, e.g. >5% of all computer scientists, to switch to working on technical AI safety.
Making safety a basic building block of all teaching, e.g. all studies related to ML have one or multiple courses on AGI safety.

Scenario variables

We think there are key variables around AI scenarios that influence multiple scenarios. We list them separately to make it easier to think about them.

Level of Cooperation across nations and labs

This refers to the willingness and intensity of different actors to cooperate on AI alignment. Examples include different actors’ willingness to exchange information, compromise with others, or their willingness to stop their own efforts toward TAI in case they have credible information that their path is too risky.

Timelines

This refers to the time at which we cross the threshold to TAI. For example, if someone says they have 2040 TAI timelines, they mean their median estimate for TAI is in 2040. Usually, timelines are expressed as a probability distribution over possible dates and not as a point estimate. For example, someone could think that there is a 25% chance that TAI is developed by 2030, a 50% chance by 2050 and a 75% chance by 2060.
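
To make the link between a timeline distribution and a "median timeline" concrete, here is a minimal sketch that reads intermediate probabilities and the median off the three quantiles from the example above. The function names and the choice of linear interpolation between the stated points are our own assumptions for illustration; real forecasts need not be piecewise linear.

```python
# Illustrative sketch only: the helper names and the linear interpolation
# between stated quantiles are assumptions, not part of any forecast.

def probability_by(year, quantiles):
    """Interpolated cumulative probability that TAI is developed by `year`."""
    points = sorted(quantiles.items())
    if year < points[0][0] or year > points[-1][0]:
        # Refuse to extrapolate beyond what the forecaster actually stated.
        raise ValueError("year lies outside the stated forecast range")
    for (y0, p0), (y1, p1) in zip(points, points[1:]):
        if y0 <= year <= y1:
            return p0 + (p1 - p0) * (year - y0) / (y1 - y0)


def median_year(quantiles):
    """Year at which the interpolated cumulative probability reaches 50%."""
    points = sorted(quantiles.items())
    for (y0, p0), (y1, p1) in zip(points, points[1:]):
        if p0 <= 0.5 <= p1:
            if p1 == p0:
                return float(y0)
            return y0 + (0.5 - p0) * (y1 - y0) / (p1 - p0)
    return None  # the stated quantiles do not bracket 50%


# The example forecast from the text: 25% by 2030, 50% by 2050, 75% by 2060.
forecast = {2030: 0.25, 2050: 0.50, 2060: 0.75}

print(probability_by(2040, forecast))  # 0.375
print(median_year(forecast))           # 2050.0
```

Under these assumptions, the person in the example would be described as having "2050 timelines", since 2050 is the year by which their cumulative probability reaches 50%.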

Takeoff speed

This refers to the speed of improvement after crossing the TAI threshold. It is often operationalized as the timespan between TAI and superintelligent AI (SAI), where SAI is an AI that clearly outperforms humans in approximately all tasks. A fast takeoff might take one year or less.

Size of the Alignment Tax

This refers to how much more costly building an aligned TAI is compared with building an unaligned one. For example, it is possible that aligning an AI leads to worse performance or building an aligned AGI costs more time and money. If we assume that most actors care mostly about profits/performance and a little bit about alignment, then a smaller alignment tax leads to more adoption of safety measures. There is also the possibility of a negative alignment tax, i.e. that better-aligned algorithms also lead to better performance.

Type of Actors

There are multiple types of actors who could plausibly develop TAI. They can roughly be grouped into private actors (such as companies or non-profits), state actors and academia. The type of actor that develops TAI is relevant because different actors work under different incentives and thus require different strategies to ensure the development of aligned TAI. For example, private actors will mostly be incentivized by profits, while governments care more about tipping the balance of international power.

Number of Actors

This refers to the number of actors who develop TAI. However, depending on the context it can refer to slightly different quantities such as the number of actors who have the capacity to develop TAI, the number of actors who have already developed TAI or the number of actors who have already deployed TAI. Different numbers of actors require different alignment strategies. For example, it is harder to create cooperation between 100 different actors than between five or fewer.

Power Asymmetries between Actors

This refers to the difference in power between the actors involved in developing or deploying TAI. For example, a large private company is more powerful than a small start-up and the US government is likely more powerful than any private company. This power asymmetry is relevant before TAI is developed because influencing actors with different levels of power requires different strategies. Power asymmetries might be even more important after TAI is developed because the actor who develops TAI might suddenly be much more powerful than many or all other actors.

Type of Solution to the technical Alignment Problem

There are multiple plausible paths to technical AI alignment. These range from very specific training techniques through interpretability work to solutions where human operators can communicate truthfully with an AI in natural language. Depending on which type of solution is effective, different general strategies follow. If, for example, interpretability is a requirement for alignment, this could have implications for norms and laws around storing the parameters of different important models.

Perceived Importance of AI in general

AI can plausibly have a large effect on the economic and military power of a country or could lead to major scientific advances. The more important AI is perceived to be by countries and the general public, the greater the willingness of countries and private actors to engage in its development. Furthermore, a higher perceived importance of AI could lead to riskier behavior by different actors due to the increased stakes.

Prominence of AI safety

Laws and norms shape how TAI is developed and deployed. If more relevant actors know about AI safety or the possible negative consequences of unaligned AI, it might be easier to implement solutions to alignment in practice or cooperate with governments on international solutions.

Feel free to add further scenarios, catalysts and variables in the comments and provide feedback on our proposals.

Anonymous_EA | Jun 28 2022 | 14

Great post!

From Scenario 1, in which alignment is easy:

Here you seem to be imagining that technical AI alignment turns out to be easy, but you don't discuss the political/governance problem of making sure the AI (or AIs) are aligned with the right goals.

E.g. what if the first aligned transformative AI systems are built by bad actors? What if they're built by well-intentioned actors who nevertheless have no idea what to do with the aligned TAI(s) they've developed? (My impression is that we don't currently have much idea of what a lab should be looking to do in the case where they succeed in technical alignment. Maybe the aligned system could help them decide what to do, but I'm pretty nervous about counting on that.)

From my perspective a full success story should include answers to these questions.

mariushobbhahn | Jun 28 2022 | 5

Yes, that is true. We made the decision to not address all possible problems with every approach because it would have made the post much longer. It's a fair point of criticism though.

[anonymous] | Jun 29 2022 | 11

Nice post!

One question: within your "Catalysts for success" subsections (i.e., "Governance," "Technical," "Societal"), have you listed things in a rough rank order of what you think is most important?

(E.g., do you think domestic laws and international treaties are roughly the most important success catalysts within the governance category?)

mariushobbhahn | Jun 29 2022 | 4

No, it's a random order and does not have an implied ranking.

Zach Stein-Perlman | Jun 28 2022 | 10

Scenario 3 depends on broad attention to and concern about AI. A major accident would suffice to cause this, but it is not necessary. I expect that even before a catastrophic accident occurs (if one ever does), the public and governments will pay much more attention to AI, just due to its greater and more legibly powerful capabilities in the future. Of course, such appreciation of AI doesn't automatically lead to sane policy responses. But neither does an accident -- do you think that if one state causes a global catastrophe, the main response from other AI-relevant states will be "AI is really risky and we should slow down" rather than just some combination of being angry at the responsible state and patching the revealed vulnerability to a particular kind of deployment? Note also that even strong regulation catalyzed by an accident is likely to

target AI deployments, not development, which is not directly helpful in terms of classic Yudkowsky-style risk
be domain-specific; an accident in an unrelated domain doesn't by default lead governments to stop companies from making ever-bigger language models

[anonymous] | Jun 28 2022 | 9

Thank you for the post, it's interesting and, I think, neglected work on realistic scenarios to take a look at positive goals.

I have a relatively easy time imagining what a stable failure mode looks like - if everyone's dead, for instance, it seems like they're likely to stay dead. I'm somewhat less certain about how to model a truly stable success mode. What you describe seems to be in essence one successfully aligned AGI and how we might get there. Do you think that is sufficient to be a stable good state regarding AI safety - i.e. a state in which the collective field of AI safety can take a breath and say 'we did it, let's pack it up'? I ask this because it seems important to not be hasty about defining success modes, a false sense of security seems generally dangerous.

I would imagine you might think that one aligned AGI can prevent less well-aligned AGIs from coming into existence; but that of course might come with a potentially concerningly powerful influence on the world. Or that there is a general baseline interest in not building unaligned AGI, so that once the alignment problem is solved, there's just no reason for unaligned AGI coming into existence? Especially in a very slow, multipolar take-off scenario, an isolated success in aligning one AGI doesn't necessarily seem to translate to a global success story. (Even less so if you're worried about how the unaligned and aligned AI might interact).

Another failure mode for an ostensibly stable good state is of course that you just think the AI is aligned and the actions suggested by its value function only come apart from what we think it should do (doesn't even need to be a particularly treacherous or a particularly big turn). Accordingly, some success modes might be more stable than others - i.e. in how certain we can be that the AI is actually correctly aligned and not just seemingly.

This is a bit of a random collection of thoughts - the TL;DR question version might be: How stable do you think the success in your success stories is?

mariushobbhahn | Jun 28 2022 | 3

I think this is a very important question that should probably get its own post.

I'm currently very uncertain about it but I imagine the most realistic scenario is a mix of a lot of different approaches that never feels fully stable. I guess it might be similar to nuclear weapons today but on steroids, i.e. different actors have control over the technology, there are some norms and rules that most actors abide by, there are some organizations that care about non-proliferation, etc. But overall, a small perturbation could still blow up the system.

A really stable scenario probably requires either some very tough governance, e.g. preventing all but one actor from getting to AGI, or high-trust cooperation between actors, e.g. by working on the same AGI jointly.

Overall, I currently don't see a realistic scenario that feels more stable than nuclear weapons seem today which is not very reassuring.

Sharmake | Jun 28 2022 | 1

I'd argue it's even less stable than nukes, but one reassuring point: there will ultimately be a very weird future with thousands, millions or billions of AIs, posthumans and genetically engineered beings, and the borders will be very porous and dissolvable; that is ultimately important to keep in mind. Also, we don't need arbitrarily long alignment; just aligning it for 50-100 years is enough. Ultimately, nothing needs to be stable in the long term, just short-term chaos and stability.

mariushobbhahn | Jun 28 2022 | 4

I'm not sure why this should be reassuring. It doesn't sound clearly good to me. In fact, it sounds pretty controversial.

Because it's possible that even in unstable, diverse futures, catastrophe can be avoided. As to the long-term future after the Singularity, that's a question we will deal with when we get there.

I don't think "dealing with it when we get there" is a good approach to AI safety. I agree that bad outcomes could be averted in unstable futures but I'd prefer to reduce the risk as much as possible nonetheless.

Zach Stein-Perlman | Jun 28 2022 | 8

Aw, I was going to write something with this title! At least this is an excellent post. We should be doing more of this kind of macrostrategy -- thinking about (partially quoting you) what positive scenarios could look like, what levers are available to make positive outcomes more likely, and what subgoals (or just desiderata) are good.

Zach Stein-Perlman | Jun 28 2022 | 6

Another kind of scenario that comes to mind: there exists a pivotal act that is possible before AGI, which one actor performs.

Another catalyst for success that comes to mind: reducing existential risk from misuse & conflict caused by AI.

mariushobbhahn | Jun 28 2022 | 2

We thought about including such a scenario but decided against it. We think it might give the EA community a bad rep even if some people have already publicly talked about it.

Zach Stein-Perlman | Jun 28 2022 | 4

"Pivotal act" includes [scary sounding stuff]; if you don't want to discuss that, fine. But I think it's tragic how under-discussed very different kinds of pivotal acts or pivotal processes or just things that would be very good are. Don't assume it has to look like [scary sounding stuff].

Jackson Wagner | Aug 22 2022 | 4

This is some good work. You folks might enjoy perusing this similar project run by the Future of Life Institute (which was also about envisioning positive futures featuring advanced AI, although less focused on the core problem of alignment); I summarize some of the winning entries here.

MaxRa | Aug 22 2022 | 2

Glad you liked it, and thanks for the pointer, it has been on my reading list for quite some time now. :)

MMMaas | Aug 8 2022 | 3

Thanks for this post, I found it very interesting.

There's more that I'd like to write after reflection, but briefly -- on further possible scenario variables, on either the technical or governance side, I'm working out a number of these here https://docs.google.com/document/d/1Mlt3rHcxJCBCGjSqrNJool0xB33GwmyH0bHjcveI7oc/edit#, and would be interested to discuss.

MaxRa | Aug 8 2022 | 3

Hey Matthijs :) Glad you found it interesting!

Oh cool, just quickly skimmed the doc, that looks super useful. I'll hopefully find time to take a deeper look later this week.

My mainline best case or median-optimistic scenario is basically partially number 1, where aligning AI is somewhat easier than today, plus acceleration of transhumanism and a multipolar world that both dissolve boundaries between species and the human-AI divide; thus, by the end of the Singularity, things are extremely weird and deaths are in the millions or tens of millions due to wars.