
TL;DR: Without regulation, adopting safety research is optional. Cautious labs adopt it and slow down, while reckless labs don't and accelerate. The median lead lab is more reckless in the scenario with strong technical AI safety research and weak governance than in any other scenario.

This post seems longer than it is: if the four premises below seem plausible to you, you can read just the short scenarios and get the gist.

This post rests on the following premises:

Premise 1: Labs with greater AI capabilities will develop their capabilities more quickly than those with lesser capabilities.

Premise 2: Cautious labs that adhere to higher safety standards will develop capabilities more slowly than reckless labs that adhere to lower safety standards. 

Premise 3: Reckless labs are more likely to kill us all than cautious ones.

Premise 4: Absent a strong advocacy and policy branch of the AI safety movement, AI safety research will lead to uneven adoption of additional safety standards. Cautious labs will adopt more external standards than reckless labs.

If we accept all these, we quickly get a plausible narrative for how AI safety technical research backfires. It goes something like this:

At the moment, we have a handful of frontier labs, each with its own internal attitude towards safety. Those that value safety more put models through a longer series of tests and evaluations of their own design before releasing them and, importantly, before using them internally to make algorithmic decisions. Those that are more lax release earlier, with fewer safety checks, and eagerly adopt their own coding tools.

These differences in attitudes have resulted in significant safety differences. They have also created financial risks for more cautious labs: take, for example, Anthropic's recent designation as a supply chain risk, which prevents federal contractors from using its products. Although capabilities are currently similar across the top labs despite these safety differences, we should not expect that to continue without strong evidence that it will.

Bad Scenario: Strong AI safety research, weak policy arm (status quo)

Suppose technical AI safety researchers continue to demonstrate that additional safeguards are necessary. Without any governmental enforcement mechanism, only cautious labs agree to institute them, slowing down iteration. Reckless labs don't, and continue advancing at breakneck speeds. This magnifies the existing safety asymmetries, further disadvantaging cautious labs.

As differences in capabilities emerge, two forces converge to magnify differences: investment and recursive self-improvement. Before long, reckless labs are vastly outpacing cautious ones, making them virtually obsolete. 
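
To make the compounding concrete, here is a minimal toy simulation; all parameters are invented for illustration and are not estimates of anything. It simply assumes capability grows multiplicatively each cycle (Premise 1) and that the cautious lab pays a small per-cycle "safety tax" (Premise 2).

```python
# Toy model, illustrative only: every number here is made up for this sketch.
# Capability compounds multiplicatively each cycle (Premise 1); the cautious
# lab's growth rate is reduced by a per-cycle "safety tax" (Premise 2).
def simulate(cycles: int = 20, growth: float = 0.30, safety_tax: float = 0.10):
    cautious, reckless = 1.0, 1.0
    for _ in range(cycles):
        cautious *= 1 + growth * (1 - safety_tax)  # slowed by safety overhead
        reckless *= 1 + growth                     # full-speed iteration
    return cautious, reckless

cautious, reckless = simulate()
print(f"cautious: {cautious:.1f}x, reckless: {reckless:.1f}x, "
      f"reckless lead: {reckless / cautious:.2f}x")
```

With these made-up numbers, a 10% per-cycle safety tax leaves the cautious lab roughly 60% behind after 20 cycles, and the gap only widens if investment and recursive self-improvement feed back into the growth rates themselves.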

Mediocre Scenario: Weak AI safety research, weak policy arm

Without a strong push for additional safety standards, cautious labs continue with moderately more safety checking than their reckless counterparts. The additional internal safeguards they employ cost little on the compute margin and reduce risk without significantly slowing them down. Although they remain disadvantaged in the capabilities race, the difference is small enough that random variation and individual breakthroughs might make it up, allowing cautious labs to fight for the lead.

Good Scenario: Strong AI safety research, strong policy arm 

Safety researchers and policymakers collaborate to roll out and enforce safety protocols, raising standards across the board. Labs are slowed down as they fail checks, giving safety researchers more time to solve unsolved problems. Labs that try to circumvent safety checks are publicly fined, costing them market share and forcing them out of the race.

Suggested Intervention

Spend more on lobbying, advocacy, and policy work. Ideally, do this without cutting research budgets, but if necessary, redirect funding. Individually, consider switching to policy work and donating to AI governance organizations. Encourage your friends to do the same. Plenty of people smarter than I am have made this case, so I won't go into much detail here.

Extra stuff

Defense of Premises

Premise 1 seems intuitive to me. More capable models draw more investment, (probably) have better research taste, and make better algorithmic decisions. When tasked with improving themselves, more capable models will perform better. 

Premise 2 is perhaps the weakest, but I am still inclined to believe it. Safety requires compute, and failed safety checks mean difficult, time-consuming setbacks: tracing mistakes and retraining models back to the same capabilities. Investors may not be convinced that safety spending is necessary, so they may prioritize more reckless labs that spend more of their investment on capabilities-advancing compute, which, in their view, provides better returns. Cautious labs may also refuse highly profitable contracts on safety grounds, which reduces revenue and investment (see the Anthropic DOD fight).

Premise 3 seems intuitive. Reckless labs release more capable models more readily, with fewer jailbreak protections and more potential for misuse, and they deploy those models internally without sufficient proof of alignment, with potentially catastrophic implications.

Premise 4 is probably true, but the degree to which it is true matters for my argument. If my defense of Premise 2 holds, then financial incentives may lead reckless labs to reject even convincing safety research. Cautious labs face the same incentives, but are insulated from them by the values and tendencies of their leaders.

Caveats

There are a number of reasons that I could be wrong, and I would love to hear them in the comments. A couple that come to mind:

  • It is plausible that AI safety research is adopted at least a little across the board, such that even the worst labs are better than the median lab would be absent external AI safety research. I think the financial incentives explained in my defense of Premise 2 make this less likely.
  • It is plausible that investors are drawn to safe labs, since significant, public safety failures could cause other investors to go elsewhere. If this were true, though, we would expect more non-philanthropic AI safety funding, and investors would be pushing for stronger internal and governmental AI safety protocols. They do not seem serious about any of those things.
  • It is plausible that safe labs develop more quickly, since sufficiently unsafe labs may eventually run into catastrophic safety failures that force them to rewind more than their safe counterparts. However, advanced misaligned AI may not reveal itself until it is too late to avoid these sorts of setbacks.
  • It is plausible that the safety standards adopted by the most cautious labs do not significantly slow them down, even absent failures in unsafe labs. This may be the case if safety research makes safety testing more compute-friendly. If, however, investors choose to avoid safe labs anyway, for fear that their money is not being well-used, this caveat would not matter.
  • It is plausible that businesses will purchase from safer labs to avoid risks like data leakage, providing cautious labs with additional revenue to scale compute, and drawing investors to cautious labs. However, business-relevant safety concerns may not be relevant to existential and suffering risks.


Comments

(Caveat - I read the premises and skimmed the rest)

Yes - AI research is useful and does help highlight specific advancements or potential risks. However, I fear many focus on it out of personal interest in the topic, rather than because it is the best route to reduce catastrophic and existential risks.

For better or worse, advocacy, policy, and communications are the most likely routes to reduce p(doom) - unless you believe alignment is a plausible and concrete thing. 
