Before we start: AGI, TAI, HLMI and related concepts would ideally all fall within this conceptual framework. The nuances (whether it’s transformative or machine-based etc.) are irrelevant in this framework. Same for “safety”. I use AGI throughout for my own convenience.
TL;DR: this post aims to develop a conceptual framework general enough to fit with every reader's understanding of how to achieve AGI safety: i.e. AGI alignment as a series of actions. The hope is to gradually make it more concrete to aid EA community communication and predictions in this cause area.
I have found many of my conversations about AGI safety as a cause area to be impeded by misunderstandings, even within the EA and long-termist community and even with people with whom I tend to reach similar broad conclusions. For example, even with EAs who agree with the broad conclusion that “AI governance matters for the catastrophic and existential riskiness of AI”, there will be confusion about a possible implication that “international standard-setting organisations matter for catastrophic and existential riskiness of AI.” This confusion “downstream” in the reasoning makes me think that one or both of us do not really understand the broad conclusion upstream. Most EAs with whom I have talked about AGI safety also have vastly different beliefs about the same statements (even though they use the same definitions). There also seems to be no common understanding of which statements are important to have beliefs about, and no common framework from which to even begin collectively identifying cruxes.
Beyond the notions of “AGI”, “timeline”, “comprehensive AI services”, “takeoff”, “likelihood of AGI occurring/being developed”, etc., which have been talked about a bit more, there aren’t widely agreed-upon dissolutions into more tangible concepts, let alone subsequent analysis that could help anticipation of future events and course-correcting interventions. We also seem to have common notions of “US-China race dynamics”, “lone wolf trajectories”, … yet, even though I have a clear idea of what I mean by them, I am unsure others understand them the same way. These notions function well enough for informal communication, but for the sake of anticipating future developments, I want to ask: what do you mean by “China and the US are racing for AI supremacy”? What does a world where “China” and “the US” are not “racing” for “AI supremacy” look like? As there are dozens of these statements, I think there might be economies of scale in developing a common framework through which to parse them.
Common understanding should hopefully aid a convergence in anticipation. I don’t imagine we will be able to confidently predict the exact trajectory of AGI any time soon, but convergence in anticipation helps AGI safety-concerned people coordinate. There is work that no one would have bothered doing without common understanding that it is useful and why it is useful. For example, GCRI’s survey of AGI projects would not have taken place without some common understanding that, ceteris paribus, the intention to develop AGI-like technologies correlates positively with assigned likelihood of developing AGI and that, therefore, to a certain extent, we should try to know of existing AGI projects.
Many EAs' decisions (career choice, research questions, or philanthropic funding) hinge on this common understanding and anticipation. I too think that it is hard to plan to affect something that you cannot anticipate properly, and that good anticipation relies on a genuine understanding of the underlying mechanisms and parameters driving the results. Finally, I would like to subject my understanding to others’ views. I now have more time to do that: I can interact productively with the occasional reader and don’t fear as much that it would waste anyone’s time.
Before reaching more directly useful predictions, I hope we can reach a framework general enough to encompass every reader’s existing ideas about how AGI safety might happen, i.e. one in which no one feels that a key concept is missing from the framework below. If we can achieve this, we would then be better able to communicate about, compare and coordinate existing approaches or ideas for interventions (e.g. “technical AGI safety”, “hardware control” and “AI Governance”), and scrutinize new ones.
I start with 10 concepts that I have found useful across conversations. I attempt to explain, rather than define, the concepts. I spare few words for the concepts I believe are relatively straightforward. You can help by answering these questions:
- Do these concepts suffice to capture (at a high level of abstraction) your understanding of AGI safety as a cause area? I.e. do we all agree on that abstract description?
- How should the concepts be explained to better capture what you mean?
- If the concepts do not suffice, could you comment and explain the missing concept(s)? How do they relate to these 10 concepts?
- If they do suffice, feel free to contribute: explain your own understanding, hone these concepts' explanation and perhaps identify sub-concepts that could help make this framework more granular. For example, I haven’t included the notion of “relevance” of an actor, because it brings us distractingly close to questions where there is no agreement (e.g. whether governments, militaries and academia matter, whether the UK matters more than the EU, etc.). I will do so in comments later if these first 10 concepts are uncontroversial.
1. Series of actions
Regardless of its alignment, technological trajectory, and speed of takeoff, AGI will occur following a series of numerous relevant actions by one or more human being(s), whether deliberate or not. For the AGI safety cause area, a series of actions is relevant if it leads to AGI. Even “lone wolf hacker”/“AGI in my parents’ garage” pathways will require some actions by more than one human being, be it for procurement of specific hardware, access to education, development of software capabilities, access to data, and so on. Some series of actions are long, meandering, and involve many human actors (e.g., for-profit research & development of technologies for synchronisation of distributed comprehensive services over several decades); some are more straightforward (e.g., a skunkworks project with “test-ready AGI code” as a deliverable); and all are hypothetical. Some result in gradual achievement of AGI-like capabilities embedded in a system; others result in achievement of AGI-like capabilities at the push of a button. Importantly, many actions along the way might be entirely disconnected from the intent to create AGI. Individuals concerned about whether AGI is safe would prefer a series of actions that outputs safe AGI.
Each action along these pathways is carried out by humans. Even if some automated intermediaries are involved, the intermediaries are built by humans. These humans are therefore actors in the series of actions. We can consider a simplified model of humans as having objective functions factoring in rewards and costs; sensors, memory and actuators; and a significant amount of noise and design flaws in sensing, input processing and actuation.
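The simplified model of a human actor described above can be sketched as a toy program. This is only an illustration under stated assumptions: the class name `Actor`, the Gaussian noise, the option names and all numbers are hypothetical and do not come from the framework itself.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Actor:
    """Toy actor: noisy sensing, a reward-minus-cost objective, and a memory."""
    noise: float = 0.0  # standard deviation of perception noise (assumed Gaussian)
    memory: list = field(default_factory=list)
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def sense(self, true_value: float) -> float:
        # Sensors are imperfect: the actor perceives the true value plus noise.
        return true_value + self.rng.gauss(0.0, self.noise)

    def act(self, options: dict) -> str:
        # options maps an action name to its (reward, cost) pair.
        # The objective function is perceived reward minus perceived cost.
        perceived = {
            name: self.sense(reward) - self.sense(cost)
            for name, (reward, cost) in options.items()
        }
        choice = max(perceived, key=perceived.get)
        self.memory.append(choice)  # actuation is recorded in memory
        return choice


# Hypothetical options facing "Jane"; the numbers are made up for illustration.
options = {
    "run_now": (1.0, 0.8),
    "fix_regularizer_first": (1.0, 0.5),
}
jane = Actor(noise=0.0)
print(jane.act(options))
```

With `noise=0.0` the actor always picks the option with the highest reward minus cost; raising `noise` lets the same actor sometimes choose differently, which is the role that noise and design flaws play in the description above.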
Factors can be action-specific (affecting a single action) or actor-specific (affecting all that actor’s actions). Many factors come into play for any action - from one’s physiological state and direct environmental stimuli to moral psychological configuration and more sophisticated systems shaping one’s beliefs, attitudes, and skills, such as education, (geo)political situation, and philosophical stance. Concretely, whether Jane presses “Cmd+r” or hesitates half a second longer and remembers to fix an error on the impact regularizer - or whatever the relevant intermediary action may be - is based on these many other factors. These factors are therefore important determinants of actions undertaken by actors.
I also argue these various factors can be altered by the actions of other humans with different objectives, belief systems, etc. (e.g., a colleague working on the impact regularizer, a product manager urging that development proceeds more quickly, an actor lauding software perfectionism in the latest blockbuster). To make the series of actions converge towards the safe AGI outcome, interventions to alter these factors are needed along the series. Some interventions alter only one action’s determining factors (e.g., a colleague talking to Jane just before she presses “Cmd+r”), while other interventions alter many actions’, through their influence on actors’ factors (e.g., a PR campaign promoting Stuart Russell’s Human Compatible book). Similarly, some interventions alter factors significantly (e.g., manager bilaterally pressuring Jane for performance improvement and delivery 1 hour before the action at hand), while other interventions alter factors only marginally or at a distance (e.g., a conversation with Jane on alignment, robustness and assurance at a conference 2 years before the action).
Moreover, we are uncertain about the series of actions that could output safe vs. unsafe AGI and which are the most influential factors for shaping the relevant series of actions. There are many factors that pertain to many actions in many different possible series, and there are many interventions that may cut in both directions.
6. Strategic efficiency
A more efficient strategy is a portfolio of interventions that, collectively, more significantly affects a greater number of more relevant actions towards a safer outcome, using fewer resources than a given comparable strategy (add “weakly” before every comparative if you prefer ≤ to <; it is enough for any subset of these comparatives to hold strictly for the strategy to gain efficiency).
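One way to read the definition above is as a Pareto-dominance comparison between two strategies: at least as good on every dimension, and strictly better on at least one. The sketch below illustrates that reading only. The class `StrategyProfile`, its four fields and all scores are hypothetical, and folding "towards a safer outcome" into the `significance` score is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrategyProfile:
    """Hypothetical scores for a portfolio of interventions."""
    significance: float    # how strongly actions are shifted towards safety (higher is better)
    actions_reached: int   # number of actions affected (higher is better)
    relevance: float       # relevance of the affected actions (higher is better)
    resources_used: float  # resources consumed (lower is better)


def is_more_efficient(a: StrategyProfile, b: StrategyProfile) -> bool:
    """True if a is weakly better than b everywhere and strictly better somewhere."""
    # Negate resources so that "higher is better" holds for every pair.
    pairs = [
        (a.significance, b.significance),
        (a.actions_reached, b.actions_reached),
        (a.relevance, b.relevance),
        (-a.resources_used, -b.resources_used),
    ]
    weakly_better = all(x >= y for x, y in pairs)
    strictly_better = any(x > y for x, y in pairs)
    return weakly_better and strictly_better


# Made-up numbers, purely for illustration.
baseline = StrategyProfile(significance=0.6, actions_reached=8, relevance=0.8, resources_used=100.0)
broader = StrategyProfile(significance=0.6, actions_reached=10, relevance=0.8, resources_used=100.0)
costlier = StrategyProfile(significance=0.9, actions_reached=10, relevance=0.8, resources_used=200.0)

print(is_more_efficient(broader, baseline))
print(is_more_efficient(costlier, broader))
```

Under this reading, two strategies can be incomparable: `costlier` is more significant than `broader` but uses more resources, so neither counts as more efficient than the other without a further judgment about trade-offs.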
7. Strategic effectiveness
An effective strategy converts available resources (including through collecting extra resources if needed) into confidence that AGI will be safe.
8. Strategic research
Under this framework, research activities pursuing safe AGI are useful when they reduce uncertainty and enable the development of more efficient strategies to positively influence actions (e.g., answering what series of actions are more or less likely, what factors are most important, and what interventions are most significant).
9. Direct work
Direct work activities are the implementation of strategies and associated interventions. They are useful when they increase confidence that the outcome of the current series of actions is safe. For example, developing elements of solutions to the AGI alignment problem would increase confidence that many series of actions result in safe AGI.
For individuals solely concerned about whether AGI is safe, if strategic research doesn’t reduce uncertainty or identify efficient strategies, or direct work output doesn’t increase confidence in the outcome being safe, it is wasteful.
: Please do share if you know of any. If there are, it seems they are not being referred to by AGI safety-concerned individuals in the EA community, so I’ll edit this post to further disseminate them.
: While this framework has been developed with AI governance in mind, it can be applied to a broad range of cause areas and subareas where the objective is to intervene to influence “target actions” (e.g. the decision of whether to donate 10% of one’s income, the decision to impose a different tax rate on meat products, the decision to make a grant for foresight research, …)
: This example of action immediately preceding takeoff is chosen for the sake of simplicity. I don’t want to make AGI development (or alignment) sound “easy”, but any other more sophisticated AGI-generating action would require further assumptions. I am very interested in understanding better the likely ultimate actions to counter them, but this exploration is beyond the scope of this post.
11. Actor relevance
In this conceptual framework, various actors will have more or less influence, through their own actions, at different points in the series of actions. Therefore, there is a subset of actors with a disproportionate influence on the development, deployment and safety of AGI/CAIS/TAI (e.g. the biggest tech investors, heads of government, the OECD.AI executive director, the DeepMind safety team, regulators, government AI procurement officials, representatives of standard-setting bodies, "Jane" in the explanations above, …). Depending on the costs involved, identifying the most relevant actors and altering their actions to ensure that the outcome is safe could be an effective strategy.