TLDR: We may capture much or most of the available AI safety benefit by reserving expensive, specialized agents for the <1% of tasks that carry catastrophic risk. This would mean that AI safety work on high-cost but highly safe systems could be very useful.

The standard objection to compute-heavy AI safety measures is competitive: any lab paying a large alignment tax gets undercut by one that doesn't, so expensive safety doesn't survive the market.

This objection typically assumes the tax is paid uniformly - levied on every action regardless of what that action is doing. Drop that assumption and the objection loses most of its force. If the expensive treatment can be applied selectively, to the small fraction of actions where catastrophic consequences live, the blended overhead is small even when the per-action multiplier is enormous. A 100× tax on 0.1% of actions is roughly a 10% tax on the system.
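To make the arithmetic concrete, here is a minimal Python sketch of the blended-overhead calculation. It assumes every action has the same baseline cost and that the expensive treatment simply multiplies that cost. The function name `blended_overhead` and the example scenarios are illustrative, not taken from any existing tool.

```python
def blended_overhead(fraction_expensive: float, cost_multiplier: float) -> float:
    """Extra system cost, as a fraction of the all-baseline cost.

    Assumption (illustrative): every action costs 1 unit at baseline, and the
    expensive treatment costs `cost_multiplier` units for the treated share.
    """
    if not 0.0 <= fraction_expensive <= 1.0:
        raise ValueError("fraction_expensive must be between 0 and 1")
    if cost_multiplier < 1.0:
        raise ValueError("cost_multiplier must be at least 1")
    # Total cost = (1 - f) * 1 + f * m, so the overhead over baseline is f * (m - 1).
    return fraction_expensive * (cost_multiplier - 1.0)


# Hypothetical scenarios: (share of actions treated, per-action cost multiplier).
scenarios = [(0.001, 100.0), (0.01, 100.0), (0.01, 10.0)]
for fraction, multiplier in scenarios:
    overhead = blended_overhead(fraction, multiplier)
    print(f"{multiplier:.0f}x on {fraction:.1%} of actions: {overhead:.1%} overhead")
```

Under the same assumptions, a 100× multiplier on a full 1% of actions comes to roughly 99% overhead, about double the baseline cost, so the size of the treated fraction matters as much as the multiplier.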

How would one spend 100× on additional safety? The easiest answer is inference-time safety solutions. Redwood’s control research has studied several. Certain situations could also call for different LLMs, perhaps ones optimized for transparency, robustness, or neutrality, or ones that have simply been more heavily vetted.

Could we identify and isolate the top n% of tasks? There are two clear solutions. The first is to get good at understanding which AI tasks are most critical. I'd assume there are some fairly obvious moves to start with. Frontier LLM development and deployment tasks seem critical to get right; a major corporate database migration is clearly more critical than a local one. If you simply charged 100× for more robust LLM agents today, I'd expect them to get used for some of the most important cases by default.

The second is to actively design processes so the critical work is contained to a narrow amount of computation. Companies already do this in many settings: it's risky to hand direct bank-account access to every employee, so access is restricted to a few trusted ones and everyone else goes through a request process. The same shape applies to high-consequence AI actions.
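The request-process pattern above can be sketched as a simple routing gate. This is a toy illustration under stated assumptions: the action names, the `SENSITIVE_ACTIONS` set, and the two tier labels are all hypothetical, and a real system would need a far more careful way of deciding what counts as high-consequence.

```python
from dataclasses import dataclass

# Hypothetical list of action names treated as high-consequence.
SENSITIVE_ACTIONS = frozenset(
    {"transfer_funds", "deploy_model", "migrate_production_db"}
)


@dataclass(frozen=True)
class Action:
    name: str
    payload: str


def route(action: Action) -> str:
    """Return the (hypothetical) agent tier that should handle the action."""
    if action.name in SENSITIVE_ACTIONS:
        # Only the expensive, vetted tier may carry out high-consequence actions.
        return "vetted_agent"
    # Everything else stays on the cheap default tier.
    return "standard_agent"


requests = [
    Action("summarize_email", "inbox digest"),
    Action("transfer_funds", "invoice payment"),
]
for request in requests:
    print(f"{request.name} -> {route(request)}")
```

The design choice mirrors the bank-account example: the cheap tier never holds the sensitive capability, so the expensive tier only has to cover the small set of actions that pass through the gate.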

Of course, a different solution is to use the ~100× costlier agents to provide training signals in post-training. This probably calls for agents with very different properties than ones optimized for general direct consumer use, but the big-picture economic justifications might be similar.

In principle, the use of highly costly but reliable systems could be formalized. Certain AI development decisions might be deemed sensitive enough that they can only be carried out by a specific set of expensive, vetted (perhaps government-approved) AI agents. There are ways this goes poorly, but also versions that look like a reasonable extension of current practice.

Given all of this, I think that:

  1. Instead of aiming for one set of LLMs and agents for everyone, it's realistic to aim for a variety of solutions with different safety trade-offs. The obvious implication is a market for highly safe/robust/reliable AI agents.
  2. If we can ensure that fairly safe solutions get used for the top fraction of tasks, we may capture most of the safety benefits available.
  3. If (2) holds, AI safety work on improving systems with large safety taxes might be highly valuable. A 10× or 100× safety tax might be well within spec.

Objections

  1. Some readers will find this obvious. I think the general intuition - spend more on safety where the stakes are higher - is common, close to folk wisdom. LLMs and AI agents already come in different models and thinking-effort levels, and these are already being used for different tasks. It seems simple to imagine further extensions of this.
  2. There are worlds where risk is spread so widely across tasks that it is impossible to concentrate safe AI use. I don’t expect this to be the case (with decent efforts and mitigations), but I could be convinced otherwise.

 

This post was improved with Claude Opus. Opus provided high-level feedback, helped find the links, and made a bunch of wording adjustments.
