This post is the first in a sequence of posts about AI strategy co-authored by Thomas Larsen, Akash Wasil, and Olivia Jimenez (TAO). In the next post, we’ll provide more examples of “buying time” interventions that we’re excited about. 

We’re grateful to Ajeya Cotra, Daniel Kokotajlo, Ashwin Acharya, and Andrea Miotti for feedback on this post. 

If anyone is interested in working on "buying time" interventions, feel free to reach out. (Note that Thomas has a list of technical projects with specifics about how to implement them. We also have a list of non-technical projects). 

Summary

A few months ago, when we met technical people interested in reducing AI x-risk, we were nearly always encouraging them to try to solve what we see as the core challenges of the alignment problem (e.g., inner misalignment, corrigibility, interpretability that generalizes to advanced systems). 

But we’ve changed our mind.

On the margin, we think more alignment researchers should work on “buying time” interventions instead of technical alignment research (or whatever else they were doing). 

To state the claim another way: on the margin, more researchers should backchain from “how do I make AGI timelines longer, make AI labs more concerned about x-risk, and present AI labs with clear things to do to reduce x-risk?” instead of “how do I solve the technical alignment problem?” 

Some “buying time” interventions involve performing research that makes AI safety arguments more concrete or grounds them in ML (e.g., writing papers like Goal misgeneralization in deep reinforcement learning and Alignment from a deep learning perspective & discussing these with members of labs). 

Some “buying time” interventions involve outreach and engagement with capabilities researchers, leaders in AI labs, and (to a lesser extent) the broader ML community (e.g., giving a presentation to a leading AI lab about power-seeking and deception). We expect that successful outreach efforts will also involve understanding the cruxes/counterarguments of the relevant stakeholders, identifying limitations of existing arguments, and openly acknowledging when the AI safety community is wrong/confused/uncertain about certain points.

We are excited about “buying time” interventions for four main reasons: 

  1. Multiplier effects: Delaying timelines by 1 year gives the entire alignment community an extra year to solve the problem. 
  2. End time: Some buying time interventions give the community a year at the end, where we have the most knowledge about the problem, access to near-AGI systems, the largest community size, the broadest network across other influential actors, and the most credibility at labs. Buying end time also increases the amount of serial alignment research, which some believe to be the bottleneck. We discuss this more below.   
  3. Comparative advantage: Many people would be better-suited for buying time than technical alignment work (see figure 1).
    1. Buying time for people at the tails: We expect that alignment research is extremely heavy-tailed. A median researcher who decides to buy time is buying time for people at the tails, which is (much) more valuable than the median researcher’s time.
  4. Externalities/extra benefits: Many buying time interventions have additional benefits (e.g., improving coordination, reducing race dynamics, increasing the willingness to pay a high alignment tax, getting more people working on alignment research). 

Figure 1: Impact by percentile for technical alignment research and buying time interventions.

Caption: We believe that impact in technical alignment research is more heavy-tailed than impact for buying time interventions. Figure 1 illustrates this belief. Note that this is a rough approximation. Note also that both curves should go below the 0-point of the y-axis, as both kinds of interventions can be net negative. 

Concretely, we recommend that ~40-60% of alignment researchers should focus on "buying time" interventions rather than technical alignment research (whereas we currently think that only ~20% are focusing on buying time). We also recommend that ~20-40% of community-builders focus on “buying time” interventions rather than typical community-building (whereas we currently think that ~10% are focusing on buying time). 

In the rest of the post, we: 

  1. Offer some disclaimers and caveats (here)
  2. Elaborate on the reasons why we're excited about "buying time" interventions (here)
  3. Describe some examples of "buying time" interventions (here)
  4. Explain our theory of change in greater detail (here)
  5. Describe some potential objections & our responses (here)
  6. Describe some changes we recommend (here)

Disclaimers

Disclaimer #1: We’re not claiming that “buying time” is the only way to categorize the kinds of interventions we describe, and we encourage readers to see if they can come up with alternative frames/labels. Many “buying time” interventions also have other benefits (e.g., improving coordination, getting more people to work on AI safety research, and making it less likely that labs deploy dangerous systems). We chose to go with the “buying time” frame for two main reasons:

  1. For nearly all of the interventions we describe, we think that most of the benefit comes from buying time, and these other benefits are side benefits. One important exception to this is that much of the impact of evals/demos may come from their ability to prevent labs from deploying dangerous systems. This buys time, but it’s plausible that the main benefit is “the world didn’t end”.
  2. We have found backchaining from “buying time” more useful than other frames we brainstormed. Some alternative frames have felt too limiting (e.g., “outreach to the ML community” doesn’t cover some governance interventions).

Disclaimer #2: Many of these interventions have serious downside risks. We also think many of them are difficult, and they only have a shot at working if they are executed extremely well.

Disclaimer #3: We have several “background assumptions” that inform our thinking. Some examples include (a) somewhat short AI timelines (AGI likely developed in 5-15 years), (b) high alignment difficulty (alignment by default is unlikely, and current approaches seem unlikely to work), and (c) there has been some work done in each of these areas, but we are far behind what we would expect in winning worlds, and there are opportunities to do things that are much more targeted & ambitious than previous/existing projects. 

Disclaimer #4: Much of our thinking is informed by conversations with technical AI safety researchers. We have less experience interacting with the governance community and even less experience thinking about interventions that involve the government. It’s possible that some of these ideas are already widespread among EAs who focus on governance interventions, and a lot of our arguments are directed at the thinking we see in the technical AI safety community.

Why are we so excited about "buying time" interventions?

Large upsides of buying time 

Some time-buying interventions buy a year at the end. If capabilities growth continues as normal until someone is about to deploy an AI model that would improve into a TAI, but an evaluation triggers and reveals misaligned behavior, causing this lab to slow down and warn the other labs, the time from this event until when AGI is deployed is very valuable for the following reasons: 

  1. You have bought one year for the entire AI safety community[1].
  2. More researchers: The number of alignment researchers is growing each year, so we expect to have the most alignment researchers at the end.
  3. Better understanding of alignment: We understand more about the alignment problem each year, and the field becomes less pre-paradigmatic. This makes it easier to make progress each year.
  4. Serial time: buying time increases the amount of serial alignment research, which could be the bottleneck
  5. AGI assisted alignment: Some alignment agendas involve using AI assistants to boost alignment research. It’s plausible that a year of alignment research with AI assistants is 5-10X more valuable than a year of alignment research right now. If we’re able to implement interventions that buy time once we have powerful AI assistants, this intervention would be especially valuable (assuming that these assistants can make differential alignment progress or that we can buy time once we have the assistants).
  6. Better understanding of architecture: when we are close to AGI, we have a better understanding of the architectures and training paradigms that will be used to actually build AGI, allowing for alignment solutions to be much more concrete and informed.  

Other interventions have a different shape, and do not buy as valuable time. If you simply slow the rate of capabilities progress through publication policies or by convincing some capabilities researchers to transition, such that on net AGI takes one more year to arrive, this has benefits 1-4, but not 5 and 6. However, we are proposing to reallocate alignment researchers to buying time interventions, which means that less alignment research gets done this year, so reason 3 might be weaker. 

Some interventions that buy time involve coordinating with members of major AI labs (e.g., OpenAI, DeepMind, Anthropic). As a result, these interventions often have the additional benefit of increasing communication, coordination, trust, and shared understanding between major AI labs and members of the AI safety community who are not part of the AI labs. (Note that this is not true of all “buying time” interventions, and several of them could also lead to less coordination or less trust).

Tractability and comparative advantage: lots of people can have a solid positive impact by buying time, while fewer can do great alignment work

  1. Buying a year buys a year for researchers at the tails. If you buy a year of time, you buy a year of time for some of the best alignment researchers. It’s plausible to us that researchers at the tails are >50-100X more valuable than median researchers.
  2. Many projects that are designed to buy time require different skills than technical AI safety research. 
    1. Skills that seem uniquely valuable for buying time interventions: general researcher aptitudes, ability to take existing ideas and strengthen them, experimental design skills, ability to iterate in response to feedback, ability to build on the ideas of others, ability to draw connections between ideas, experience conducting “typical ML research,” strong models of ML/capabilities researchers, strong communication skills
    2. Skills that seem uniquely valuable for technical AI safety research: abstract thinking, ability to work well with very little structure or guidance, ability to generate and formalize novel ideas, focus on “the hard parts of the problem”, ability to be comfortable being confused for long periods of time.
    3. Skills that seem roughly as useful in both: Strong understanding of AI safety material, machine learning knowledge.
  3. Alignment research seems heavy-tailed. It’s often easy to identify whether or not someone has a reasonable chance of being at the tail (e.g., after 6-12 months of trying to solve alignment). People who are somewhat likely to be at the tail should keep doing alignment research; other people should buy time.
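
To make the "buying time for the tails" argument concrete, here is a minimal back-of-the-envelope sketch in Python. The 50x tail multiplier is the low end of the range given above; the field size (300) and the number of tail researchers (10) are purely illustrative assumptions, not estimates from this post.

```python
# Hypothetical inputs, chosen only for illustration.
FIELD_SIZE = 300        # assumed number of alignment researchers
TAIL_RESEARCHERS = 10   # assumed number of researchers at the tail
TAIL_MULTIPLIER = 50    # low end of the ">50-100X" range above


def field_output_per_year(field_size, tail_count, tail_multiplier):
    """Total yearly output, in units of one median researcher-year."""
    median_count = field_size - tail_count
    return median_count * 1 + tail_count * tail_multiplier


def break_even_days_bought(field_output):
    """Days of delay a median researcher must buy per year of effort
    to match the 1 unit they would have produced doing direct work."""
    return 365 / field_output


total = field_output_per_year(FIELD_SIZE, TAIL_RESEARCHERS, TAIL_MULTIPLIER)
days = break_even_days_bought(total)
print(f"Field output: {total} median researcher-years per year")
print(f"Break-even delay: {days:.2f} days per year of time-buying work")
```

Under these made-up numbers, a median researcher would break even by delaying timelines by less than half a day per year of work. The sketch ignores downside risk, diminishing returns, and the difference between time bought now and time bought at the end, so it should be read as an illustration of the shape of the argument rather than as an estimate.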

A reasonable counterpoint is that “buying time” might also be heavy-tailed. However, we currently expect it to be less heavy-tailed than alignment research. It seems plausible to us that many “median SERI-MATS scholars” could write papers like the goal misgeneralization paper, explain alignment difficulties in clearer and more compelling ways, conduct (or organize) high-quality outreach and coordination activities, and perform many other interventions we’re excited about. On the other hand, we don’t expect that “median SERI-MATS scholars” would be able to make progress on heuristic arguments, create their own alignment agendas, or come up with other major conceptual advancements.

Nonetheless, a lot of the argument depends on the specific time-buying and the specific alignment research. It seems plausible to us that some of the most difficult time-buying interventions are more heavy-tailed than some of the more straightforward alignment research projects (e.g., coming up with good eval tools and demos might be more heavy-tailed than performing interpretability experiments). 

What are some examples of "buying time" interventions?

The next post in this sequence (rough draft here) outlines more concrete interventions that we are excited about in this space, but we highlight three interventions here that are especially exciting to us.

Outreach efforts that involve interactions between the AI safety community and (a) members of AI labs + (b) members of the ML community.

Some specific examples:

  1. More conferences that bring together researchers from different groups who are working on similar topics (e.g., Anthropic recently organized an interpretability retreat with members from various different AI labs and AI alignment organizations). 
  2. More conferences that bring together strategy/governance thinkers from different groups (e.g., Olivia and Akash recently ran a small 1-day strategy retreat with a few members from AI labs and members). 
  3. Discussions like the MIRI 2021 conversations, except with a greater emphasis on engaging with researchers and decision-makers at major AI labs by directly touching on their cruxes. 
  4. Collaborations on interventions that involve coordinating with AI labs (e.g., figuring out if there are ways to collaborate on research projects, efforts to implement publication policies and information-sharing agreements, efforts to monitor new actors that are developing AGI, etc.)
  5. More ML community outreach. Examples include projects by the Center for AI Safety (led by Dan Hendrycks) and AIS field-building hub (led by Vael Gates). 

The Evaluations Project (led by Beth Barnes)

Beth’s team is trying to develop evaluations that help us understand when AI models might be dangerous. The path to impact is that an AI company will likely use the eval tool on advanced AI models that they train, and this eval could then lead them to delay deployment of a model for which the eval unveiled scary behavior. In an ideal world, this would be so compelling that multiple AI labs slow down, potentially extending timelines by multiple years.

Papers that take theoretical/conceptual safety ideas and ground them in empirical research.

Specific examples of this type of work include Lauro Langosco’s goal misgeneralization paper (which shows how an RL agent can appear to learn goal X but actually learn goal Y) and Alex Turner’s optimal policies tend to seek power paper. Theoretical alignment researchers had already proposed that agents could learn unintended goals and that agents would have incentives to seek power. The papers by Lauro and Alex take these theoretical ideas (which are often perceived as fuzzy and lacking concreteness), formalize them more crisply, and offer examples of how they affect modern ML systems. 

We think that this buys time primarily by convincing labs and academics of alignment difficulty. In the next section, we give more detail on the theory of change.  

Theory of Change

In the previous section, we talked about why we were so excited about buying time interventions. However, the interventions we have in mind often have a number of other positive impacts. In this section, we provide more detail about these impacts as well as why we think these impacts end up buying time. 

We summarize our theory of change in the following diagram: 

Labs take AGI x-risk seriously + Labs have concrete things they can do → More Time

We think that timelines are largely a function of (a) the extent to which leaders and researchers at AI labs are concerned about AI x-risk and (b) the extent to which they have solutions that can be (feasibly) implemented. 

If conditions (a) and (b) are met, we expect the following benefits:

  1. Less capabilities research. There is less scaling and less algorithmic progress.  
  2. More coordination between labs. There are more explicit and trusted agreements to help each other with safety research, avoid deploying AGI prematurely, and avoid racing. 
  3. Less publishing capabilities advances. Capabilities knowledge is siloed, so when one lab discovers something, it doesn't get used by the rest of the world. For example, PaLM claims a 15% speed up from parallelizing layers. If this insight hadn't been published, PaLM would likely have been 15% slower. 
  4. Labs being less likely to deploy AGI and scale existing models. 

These all lead to relative slowdowns of AGI timelines, giving everyone more time to solve the alignment problem.

Benefits other than buying time

Many of the interventions we describe also have benefits other than buying time. We think the most important ones are:

  1. Willingness to pay a higher alignment tax: Concern about alignment going poorly means that the labs invest more resources into safety. One obvious consequence is that a computationally expensive solution to alignment becomes a lot more likely to be implemented, as the labs recognize how important it is. Concretely, we are substantially more excited about worlds in which the lab building AGI actually implements all of the interventions described in Holden's How might we align transformative AI if it’s developed very soon?. By default, if AGI were developed in the next few years at OpenAI or DeepMind, we would put ~20% on each of these solutions actually being used. 
  2. Less likely to deploy dangerous systems: If evaluations are used and safety standards are implemented, labs could catch misaligned behavior and decide not to deploy systems that would have ended the world. This lengthens timelines, but it also has the more direct impact of literally saving the world (at least temporarily). We expect that this causes the probability of a naive accident to go down substantially. 
  3. More alignment research: If labs are convinced by AI safety arguments, they may shift more of their (capabilities) researchers toward alignment issues.

Some objections and our responses

1. There are downside risks from low-quality outreach and coordination efforts with AI labs

Response: We agree. Members of AI labs have their own opinions about AI safety; efforts to come in and proselytize are likely to fail. We think the best efforts will be conducted by people who have (a) strong understandings of technical AI safety arguments, (b) strong interpersonal skills and ability to understand different perspectives, (c) caution and good judgment, and (d) collaborators or advisors who can help them understand the space. However, this depends on the intervention. Caution is especially warranted when doing direct outreach that involves interaction with capabilities researchers, but more technical work such as empirically grounding alignment arguments pretty much only requires technical skill.  

2. Labs perceive themselves to be in a race, so they won’t slow down. 

Response #1: We think that some of the concrete interventions we have in mind contribute to coordination and reduce race dynamics. In particular, efforts to buy time by conveying the difficulty of alignment could lead multiple players to become more concerned about x-risk (causing all of the leading labs to slow down). 

Furthermore, we’re optimistic that sufficiently well-executed coordination events could lead to increased trust and potentially concrete agreements between labs. We think that differences in values (company A is worried that company B would not use AI responsibly) and worries about misalignment risk (company A is worried that company B’s AI is likely to be unaligned) are two primary drivers of race dynamics.

However, to the extent that A and B are value-aligned, both are aware that each of them is taking reasonable safety precautions, and leaders at both companies trust each other, they are less incentivized to race each other. Coordination events could help with each of these factors.

Response #2: Some interventions don’t reduce race dynamics (e.g., slowing down the leading lab). These are high EV in worlds where the safety-conscious lab (or labs) has a sizable advantage. On the margin, we think more people should be investing into these interventions, but they should be deployed more carefully (ideally after some research has been conducted to compare the upside of buying time to the downside of increasing race dynamics). 

3. Labs being more concerned about safety isn’t that helpful. They already care; they just lack solutions.

Response: Our current impression is that many leaders at major AGI labs are concerned about safety. However, we don’t think everyone is safety-conscious, and we think there are some policies that labs could adopt to buy time (e.g., adopting publication policies that reduce the rate of capabilities papers).

4. Slowing down ODA+[2] could increase the chance that a new (and less safety-conscious actor) develops AGI. 

Response #1: Our current best guess is that ODA+ has a >6 month lead over less safety-conscious competitors. However, this is fairly sensitive to timelines. If scale is critical, then one would expect a small number of very large projects to be in the lead for AGI, and differentially slowing the most receptive / safety oriented / cautious of those labs seems on net negative. However, interventions that slow the whole field (such as a blanket slowdown in publishing) or that increase the extent to which all labs are safety oriented are robustly good.    

Response #2: Even if ODA+ does not have a major lead, many of the interventions (like third-party audits) could scale to new AGI developers too. For example, if there's a culture of doing audits in the field, and pressure to do so, talented researchers are unlikely to want to work for you unless you participate. The same applies if a regulatory regime is later attached to all that.

Response #3: Some interventions to buy time increase lead time of labs and slow research overall (e.g., making it more difficult for new players to enter the space; compelling evals or concretizations of alignment difficulties could cause many labs to slow down). 

Response #4: Under our current model, most P(doom) comes from not having a solution to the alignment problem. So we’re willing to trade some P(solution gets implemented) and some P(AGI is aligned to my values) in exchange for a higher P(we find a solution). However, we acknowledge that there is a genuine tradeoff here, and given the uncertainty of the situation, Thomas thinks that this is the strongest argument against buying time interventions. 

5. Buying time is not tractable.

Response: This is possible, but we currently doubt it. There seems to be a bunch of stuff that no one has tried (we will describe this further in a follow-up post). 

6. In general, problems get solved by people actually trying to solve them. Not by avoiding the hard problems and hoping that people solve them in the future.

Response #1: Getting mainstream ML on board with alignment concerns is solving one of the hardest problems for the alignment field. 

Response #2: Although some of the benefit from “buying time” involves hoping that new researchers show up with new ideas, we’re also buying time for existing researchers who are tackling the hard problems. 

Response #3: People who have promising agendas that are attacking the core of the problem should continue doing technical alignment research. There are a lot of people who have been pushed to do technical alignment work who don’t have promising ideas (even after trying for years), or feel like they have gotten substantial signal that they are worse at thinking about alignment than others. These are the people we would be most excited to reallocate. (Note though that we think feedback loops in alignment are poor. Our current guess is that among the top 10% of junior researchers, it seems extremely hard to tell who will be in the tail. But it’s relatively easy to tell who is in the top 10-20% of junior researchers).

7. There’s a risk of overcorrection: maybe too many people will go into “buying time” interventions and too few people will go into technical alignment. 

Response: Currently, we think that the AIS community heavily emphasizes the value of technical alignment relative to “buying time.” We think it’s unlikely that the culture shifts too far in the other direction.

8. This doesn't seem truth-seeky or epistemically virtuous— trying to convince ML people of some specific claims feels wrong, especially given how confused we are and how much disagreement there is between alignment researchers. 

Response #1: There is a core set of claims that are pretty well supported and that the majority of the ML community has not substantially engaged with (e.g. goodharting, convergent instrumental goals, reward misspecification, goal misgeneralization, risks from power-seeking, risks from deception). 

Response #2: We’re most excited about outreach efforts and coordination efforts that actually allow us to figure out how we’re wrong about things. If the alignment community is wrong about something, these interventions make it more likely that we find out (compared to a world in which we engage rather little with capabilities researchers + ML experts and only talk to people in our community). If others are able to refute or deconfuse points that are made in this outreach effort, this seems robustly good, and it seems possible that sufficiently good arguments would convince us (or the alignment community) about potentially cruxy issues. 

What changes do we want to see?

  1. Allocation of talent: On the margin, we think more people should be going into “buying time” interventions (and fewer people should be going into traditional safety research or traditional community-building).
    1. Concretely, we think that roughly 80% of alignment researchers are working on directly solving alignment. We think that roughly 50% should be working on alignment, while 50% should be reallocated toward buying time.
    2. We also think that roughly 90% of (competent) community-builders are focused on “typical community-building” (designed to get more alignment researchers). We think that roughly 70% should do typical community-building, and 30% should be buying time.
  2. Culture: We think that "buying time" interventions should be thought of as a comparable or better path than (traditional) technical AI safety research and (traditional) community-building.
  3. Funding: We’d be excited for funders to encourage more projects that buy time via ML outreach, improved coordination, taking conceptual ideas and grounding them empirically, etc. 
  4. Strategy: We’d be excited for more strategists to think concretely about what buying time looks like, how it could go wrong, and if/when it would make sense to accelerate capabilities (e.g., how would we know if OpenAI is about to lose to a less safety-conscious AI lab, and what would we want to do in this world?)
  1. ^

    We did a BOTEC that suggested that 1 hour of alignment researcher time would buy, in expectation, 1.5-5 quality-adjusted research hours. The BOTEC made several conservative assumptions (e.g., it did not account for the fact that we expect alignment research to be more heavy-tailed than buying time interventions). We are in the process of revising our BOTEC, and we hope to post it once we have revised it.

  2. ^

     ODA+ = OpenAI, DeepMind, Anthropic, and a small number of other actors.


32 comments

I think I have one intuition that strongly agrees with you. I have another (more quantitative) intuition that strongly disagrees, which roughly goes:

1. There aren't that many alignment researchers. Last estimate I heard was maybe 300 total?
2. Many people are trying to advance AI capabilities. Maybe 30k total?

3. Naively, buying time interventions is 100x less efficient on average. So your comparative advantage for buying time must be really strong, to the tune of thinking you're 100x better at helping to buy time than doing technical AGI safety research, for the math to work out.

I'm probably missing something, but I notice myself being confused. How do I reconcile these two intuitions?

Probably the number of people actually pushing the frontier of alignment is more like 30, and for capabilities maybe 3000. If the 270 remaining alignment people can influence those 3000 (biiiig if, but) then the odds aren't that bad
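
The headcount arithmetic in the two comments above can be sketched in a few lines of Python. All figures are the commenters' rough guesses; the function name is a hypothetical helper introduced only for this illustration.

```python
def capabilities_per_advocate(capabilities_researchers, advocates):
    """How many capabilities researchers each time-buying person
    would need to influence, on average."""
    return capabilities_researchers / advocates


# First comment: ~300 alignment researchers vs. ~30k capabilities researchers.
naive_ratio = capabilities_per_advocate(30_000, 300)

# Reply: ~30 frontier alignment researchers keep doing direct work,
# and the rest try to influence ~3000 frontier capabilities researchers.
remaining_advocates = 300 - 30
frontier_ratio = capabilities_per_advocate(3_000, remaining_advocates)

print(f"Naive ratio: {naive_ratio:.0f} to 1")
print(f"Frontier-only ratio: {frontier_ratio:.1f} to 1")
```

On the reply's numbers the ratio drops from about 100 to 1 to about 11 to 1, which is the sense in which "the odds aren't that bad". Whether influence actually scales with headcount is the big "if" the reply already flags.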

This is confused, afaict? When comparing the impact of time-buying vs direct work, the probability of success for both activities is negated by the number of people pushing capabilities. So it cancels out, and you don't need to think about the number of people in opposition.

The unique thing about time-buying is that its marginal impact increases with the number of alignment workers,[1] whereas the marginal impact of direct work plausibly decreases with the number of people already working on it (due to fruit depletion and coordination issues).[2]

If there are 300 people doing direct alignment and you're an average worker, you can expect to contribute 0.3% of the direct work that happens per unit time. On the other hand, if you spend your time on time-buying instead, you only need to expect to save 0.3% units of time per unit of time you spend in order to break even.[3]

  1. ^

    Although the marginal impact of every additional unit of time scales with the number of workers, there are probably still diminishing returns to more people working on time-buying.

  2. ^

    Probably direct work scales according to some weird curve idk, but I'm guessing we're past the peak. Two people doing direct work collaboratively do more good per person than one person. But there are probably steep diminishing marginal returns from economy-of-scale/specialisation, coordination, and motivation in this case.

  3. ^

    Impact is a multiplication of the number of workers $n$, their average rate of work $r$, and the time they have left to work $t$. And because multiplication is commutative, if you increase one of the variables by a proportion $p$, that is equivalent to increasing any of the other variables by the same proportion: $n(1+p) \cdot r \cdot t = n \cdot r \cdot t(1+p)$.
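
A minimal Python sketch of the break-even claim in this comment, assuming impact is simply workers times rate times time. The 300-worker figure comes from the thread; the rate and the time horizon are hypothetical placeholders that cancel out of the comparison.

```python
import math


def impact(n_workers, rate, time_left):
    """Total work done: number of workers x average rate x time remaining."""
    return n_workers * rate * time_left


n = 300    # alignment workers (figure used in the thread)
r = 1.0    # hypothetical average rate of work per worker
t = 10.0   # hypothetical years left before AGI

baseline = impact(n, r, t)

# Option A: one extra average worker does direct alignment work.
direct_gain = impact(n + 1, r, t) - baseline

# Option B: that person instead extends the remaining time by a fraction 1/n.
break_even_fraction = 1 / n
time_gain = impact(n, r, t * (1 + break_even_fraction)) - baseline

print(f"Break-even fraction: {break_even_fraction:.2%}")
print(f"Gains match: {math.isclose(direct_gain, time_gain)}")
```

Both options add the same amount of total work, so a time-buyer who extends the remaining time by more than 1/n (about 0.3% here) per unit of their own time comes out ahead under this simple model. The model treats every worker as average and ignores diminishing returns, which the footnotes above already note.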

When comparing the impact of time-buying vs direct work, the probability of success for both activities is negated by the number of people pushing capabilities. So it cancels out, and you don't need to think about the number of people in opposition.

Time-buying  (slowing down AGI development) seems more directly opposed to the interests of those pushing capabilities than working on AGI safety. 

If the alignment tax is low (to the tune of an open-source Python package that just lets you do "pip install alignment") I expect all the major AGI labs to be willing to pay it. Maybe they'll even thank you. 

On the other hand, asking people to hold off on building AGI (though I agree there's more and less clever ways to do it in practice) seems to scale poorly especially with the number of people wanting to do AGI research, and to a lesser degree the number of people doing AI/ML research in general. Or even non-researchers whose livelihoods depend on such advancements. At the very least, I do not expect effort needed to persuade people to be constant with respect to the number of people with a stake in AGI development.

Fair points. On the third hand, the more AGI researchers there are, the more "targets" there are for important arguments to reach, and the higher impact systematic AI governance interventions will have.

At this point, I seem to have lost track of my probabilities somewhere in the branches, let me try to go back and find it...

Good discussion, ty. ^^

Crossposting from LW

Here is a sceptical take: anyone who is prone to getting convinced by this post to switch from attempts at technical AI safety to attempts at “buying time” interventions is pretty likely not a good fit to try any high-powered buying-time interventions.

The whole thing reads a bit like "AI governance" and "AI strategy" reinvented under a different name, seemingly without bothering to learn the current state of understanding.

Figuring out that AI strategy and governance are maybe important, in late 2022, after spending substantial time on AI safety, does not seem to be what I would expect from promising AI strategists. Apparent lack of coordination with people already working in this area does not seem like a promising sign from people who would like to engage with hard coordination problems.

Also, I'm worried about suggestions like 

Concretely, we think that roughly 80% of alignment researchers are working on directly solving alignment. We think that roughly 50% should be working on alignment, while 50% should be reallocated toward buying time.

We also think that roughly 90% of (competent) community-builders are focused on “typical community-building” (designed to get more alignment researchers). We think that roughly 70% should do typical community-building, and 30% should be buying time.

...could be easily counterproductive.

What is and would be really valuable are people who understand both the so-called "technical problem" and the so-called "strategy problem" (secretly, they have more in common than people think).

What is not only not valuable, but easily harmful, would be an influx of people who understand neither, but engage with the strategy domain instead of the technical domain.

Why call it "buying time" instead of "persuading AI researchers"? That seems to be the direct target of efforts here, and the primary benefit seems better conceptualised as "AI researchers act in a way more aligned with what AI safety people think is appropriate" rather than "buying time" which is just one of the possible consequences.

Multiplier effects: Delaying timelines by 1 year gives the entire alignment community an extra year to solve the problem. 

This is the most and fastest I've updated on a single sentence as far back as I can remember. Probably the most important thing I've ever read on the EA forum. I am deeply grateful for learning this, and it's definitely worth Taking Seriously. Hoping to look into it in January unless stuff gets in the way.

(Update: I'm substantially less optimistic about time-buying than I was when I wrote this comment, but I still think it's high priority to look into.)

I have one objection to claim 3a, however: Buying-time interventions are plausibly more heavy-tailed than alignment research in some cases because 1) the bottleneck for buying time is social influence and 2) social influence follows a power law due to preferential attachment. Luckily, the traits that make for top alignment researchers have limited (but not insignificant) overlap with the traits that make for top social influencers. So I think top alignment researchers should still not switch in most cases on the margin.

Can you be more specific about "the bottleneck for buying time is social influence"?

I basically agree with this breakdown from the post:

How do you account for the fact that the impact of a particular contribution to object-level alignment research can compound over time?

  1. Let's say I have a technical alignment idea now that is both hard to learn and very useful, such that every recipient of it does alignment research a little more efficiently. But it takes time before that idea disseminates across the community.
    1. At first, only a few people bother to learn it sufficiently to understand that it's valuable. But every person that does so adds to the total strength of the signal that tells the rest of the community that they should prioritise learning this.
    2. Not sure if this is the right framework, but let's say that researchers will only bother learning it if the strength of the signal hits their person-specific threshold for prioritising it.
    3. The number of researchers is normally distributed (or something) over threshold height, and the strength of the signal starts out below the peak of the distribution.
    4. Then (under some assumptions about the strength of individual signals and the distribution of threshold height), every learner that adds to the signal will, at first, attract more than one learner that adds to the signal, until the signal passes the peak of the distribution and the idea reaches satiation/fixation in the community.
  2. If something like the above model is correct, then the impact of alignment research plausibly goes down over time.
    1. But the same is true of a lot of time-buying work (like outreach). I don't know how to balance this, but I am now a little more skeptical of the relative value of buying time.
  3. Importantly, this is not the same as "outreach". Strong technical alignment ideas are most likely incompatible with almost everyone outside the community, so the idea doesn't increase the number of people working on alignment.
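The threshold model in point 1 can be sketched as a small simulation. This is only an illustration under assumptions the comment does not state: each learner adds exactly one unit of signal, thresholds are drawn from a normal distribution, and the first few learners are simply the researchers with the lowest thresholds. The function name `simulate_adoption` and all parameter values are hypothetical.

```python
import bisect
import random


def simulate_adoption(num_researchers=1000, mean_threshold=250.0,
                      sd_threshold=200.0, seed_learners=20, rng_seed=0):
    """Return the cumulative number of learners after each round.

    Assumption: signal strength equals the current number of learners,
    and a researcher learns the idea once the signal reaches their
    personal threshold.
    """
    rng = random.Random(rng_seed)
    # Person-specific thresholds, sorted so we can count them with bisect.
    thresholds = sorted(
        rng.gauss(mean_threshold, sd_threshold) for _ in range(num_researchers)
    )
    learners = seed_learners  # early adopters who learn the idea regardless
    history = [learners]
    while True:
        # Everyone whose threshold is at or below the signal has learned it.
        reached = bisect.bisect_right(thresholds, learners)
        new_total = max(learners, reached)
        if new_total == learners:
            break  # signal no longer recruits anyone new
        learners = new_total
        history.append(learners)
    return history


if __name__ == "__main__":
    history = simulate_adoption()
    # New learners per round: under the model this grows while the signal
    # is below the peak of the threshold distribution, then shrinks.
    print([later - earlier for earlier, later in zip(history, history[1:])])
```

Whether the idea spreads at all depends on the assumed distribution: if thresholds are high relative to the initial signal, the loop stops immediately and the idea never takes off.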

I would push back a little. The main thing is that buying-time interventions obviously have significant sign uncertainty. E.g., in your graph of median researcher "buying time" vs technical alignment, I think there should be a very wide error bar at the low end of "buying time", going significantly below 0 within the 95% confidence interval. Technical alignment is a lot less risky in that respect.

To clarify, you think that "buying time" might have a negative impact [on timelines/safety]?

Even if you think that, I think I'm pretty uncertain of the impact of technical alignment, if we're talking about all work that is deemed 'technical alignment.' e.g., I'm not sure that on the margin I would prefer an additional alignment researcher (without knowing what they were researching or anything else about them), though I think it's very unlikely that they would have net-negative impact.

So, I think I disagree that (a) "buying time" (excluding weird pivotal acts like trying to shut down labs) might have net negative impact, and thus also that (b) "buying time" has more variance than technical alignment.

edit: Thought about it more and I disagree with my original formulation of the disagreement. I think "buying time" is more likely to be net negative than alignment research, but also that alignment research is usually not very helpful.

Rob puts it well in his comment as "social coordination". If someone tries "buying time" interventions and fails, I think that because of largely social effects, poorly done "buying time" interventions have potential to both fail at buying time and preclude further coordination with mainstream ML. So net negative effect.

On the other hand, technical alignment does not have this risk.

I agree that technical alignment has the risk of accelerating timelines though.

But if someone tries technical alignment and fails to produce results, that has no impact compared to a counterfactual where they just did web dev or something.

My reference point here is the anecdotal disdain (from Twitter, YouTube, can dm if you want) some in the ML community have for anyone who they perceive to be slowing them down.

I see!  Yes, I agree that more public "buying time" interventions (e.g. outreach) could be net negative. However, for the average person entering AI safety, I think there are less risky "buying time" interventions that are more useful than technical alignment.

I think this post should probably be edited so that "focus on low risk interventions first" is in bold in the first sentence and right next to the pictures, because the most careless people (possibly like me...) are the ones who will read that and not read the current caveats.

You'd be well able to compute the risk on your own, however, if you seriously considered doing any big outreach efforts. I think people should still have a large prior on action for anything that looks promising to them. : )

An addendum is then:

  1. If buying-time interventions are conjunctive (i.e. one can cancel out the effect of the others), but technical alignment is disjunctive

  2. If the distribution of people performing both kinds of intervention is mostly towards the lower end of thoughtfulness/competence, (which we should imo expect)

Then technical alignment is a better recommendation for most people.

In fact it suggests that the graph in the post should be reversed (but the axis at the bottom should be social competence rather than technical competence)

It's not clear to me that the variance of "being a technical researcher" is actually lower than "being a social coordinator".  Historically, quite a lot of capabilities advancements have come out of efforts that were initially intended to be alignment-focused. 

Edited to add: I do think it's probably harder to have a justified inside-view model of whether one's efforts are directionally positive or negative when attempting to "buy time", as opposed to "doing technical research", if one actually makes a real effort in both cases.

Would you be able to give tangible examples where alignment research has advanced capabilities? I've no doubt it's happened due to alignment-focused researchers being chatty about their capabilities-related findings, but idk examples.

There's obviously substantial disagreement here, but the most recent salient example (and arguably the entire surrounding context of OpenAI as an institution).

Not sure what Rob is referring to but there are a fair few examples of org/people's purposes slipping from alignment to capabilities, eg. OpenAI

I myself find it surprisingly difficult to focus on ideas that are robustly beneficial to alignment but not to capabilities.

(E.g. I have a bunch of interpretability ideas. But interpretability can only have no impact on timelines or accelerate them.)

Do you know if any of the alignment orgs have some kind of alignment research NDA, with a panel to allow any alignment-only ideas to be public, but keep the maybe-capabilities ideas private?

Do you mean you find it hard to avoid thinking about capabilities research or hard to avoid sharing it?

It seems reasonable to me that you'd actually want to try to advance the capabilities frontier, to yourself, privately, so you're better able to understand the system you're trying to align, and also you can better predict what's likely to be dangerous.

Thinking about

That's a reasonable point. The way this would be reflected in the above graph is wider uncertainty around technical alignment at the high end of researcher ability.

Thank you so much for your excellent post on the strategy of buying time, Thomas, Akash, and Olivia! I strongly agree that this strategy is necessary, neglected, and tractable.

For practical ideas of how to achieve this (and a productive debate in the comments section of the risks from low-quality outreach efforts), please see my related earlier forum post: https://forum.effectivealtruism.org/posts/juhMehg89FrLX9pTj/a-grand-strategy-to-recruit-ai-capabilities-researchers-into

Interesting. Have not read; only skimmed and bookmarked. But one of my initial naive reactions to the framing is that I have more or less always viewed technical safety work that is ‘aimed’ relatively near term as a form of buying time (I’m thinking of pretty much all work that is aimed at AI via the ML/DL paradigm, e.g. alignment work that is explicitly to do with big transformers etc.).


Have made this point previously, glad to see the upvotes on this!

https://www.lesswrong.com/posts/oWKgzpB6sscnaxbJX/alignment-research-for-meta-purposes

  1. What does it mean to buy "end time"? If an action results in a world with longer timelines, then what does it mean to say that the additional time comes "later"?
  2. What is "serial alignment research"? I'm struggling to distinguish it from "alignment research."
  3. Can you clarify the culture change you want to see? We should think of buying time as "better" than "(traditional) technical AI safety research and (traditional) community-building"?

Less-important comments:

  • Can you be more specific about "ODA+"? Does it include Meta and Google Brain, or only the most safety-conscious labs?
  • I'm confused why field building isn't listed as one of the key benefits other than buying time. Publishing more work like the goal misgeneralization paper would make field building more successful.
  • In the graph ("Figure 1"), I'm parsing "technical alignment" as "technical research on the core challenges of alignment that avoids shortening timelines," because the impact of talented researchers depends heavily on the projects they pursue.
  • PaLM would have been just as fast if they chose not to publish the PaLM paper. (The sentence about PaLM is confusing.)

Naively, the main argument (imo) can be summed up as:

If I am capable of doing an average amount of alignment work $\bar{x}$ per unit time, and I have $n$ units of time available before the development of transformative AI, I will have contributed $\bar{x} \cdot n$ work. But if I expect to delay transformative AI by $m$ units of time if I focus on it, everyone will have that additional time to do alignment work, which means my impact is $p \cdot m \cdot \bar{x}$, where $p$ is the number of people doing work. If $m \cdot p > n$, I should be focusing on buying time.[1]

This analysis further favours time-buying if the total amount of work per unit time accelerates, which is plausibly the case if e.g. the alignment community increases over time.

  1. ^

    This assumes time-buying and direct alignment-work is independent, whereas I expect doing either will help with the other to some extent.
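The heuristic above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function names are hypothetical, and the numbers (120 months left, a 1-month delay, 300 people doing alignment work) are made-up inputs rather than estimates from the post. It also inherits the independence assumption from the footnote.

```python
def direct_work_impact(rate, time_left):
    """Work contributed by one average researcher doing direct work."""
    return rate * time_left


def time_buying_impact(rate, delay, num_workers):
    """Extra work done by everyone during the time that was bought."""
    return num_workers * delay * rate


def should_buy_time(time_left, delay, num_workers):
    """The heuristic: buy time when delay * num_workers exceeds time_left."""
    return delay * num_workers > time_left


if __name__ == "__main__":
    # Hypothetical inputs, measured in months and units of work per month.
    rate, time_left, delay, num_workers = 1, 120, 1, 300
    print(direct_work_impact(rate, time_left))              # 120
    print(time_buying_impact(rate, delay, num_workers))     # 300
    print(should_buy_time(time_left, delay, num_workers))   # True
```

The rate cancels out of the comparison, which is why the heuristic only needs the delay, the number of workers, and the time left.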

Your impact is $p \cdot m \cdot \bar{x}$ if each of the $p$ alignment researchers contributes exactly $\bar{x}$ alignment work per unit of time.

The first sentence points out that I am doing an average amount of alignment work, and that amount is $\bar{x}$. I realise this is a little silly, but it makes the heuristic smaller. Updated the comment to $\bar{x}$ instead. Thanks.