Note: I wrote the entire article and then put it into Claude for reformatting. The final format is quite different (and I think more readable) than what I wrote, but the ideas are 99% from my original draft. My opinions do not reflect those of anyone I work for or with.
There is growing momentum behind data center moratoriums as AI policy. I think this is broadly good. Moratoriums are simple, politically legible, enforceable, and — because of the sheer physical scale of data centers and the energy they consume — potentially verifiable enough to anchor international non-proliferation agreements. These are rare and valuable properties for any regulation to have.
But I want to argue that even a well-executed moratorium addresses only part of the AI capability stack. To see why, and to think clearly about what moratoriums can and cannot do, it helps to have a framework. In some ways this framework isn't compatible with the bitter lesson, but because of the specifics of pauses and of current technology, I still think it's relevant.
The Capability Stack
When we talk about "how capable an AI system is," we're not talking about one thing. We're talking about a stack of interacting layers, each of which contributes to what the system can actually do in the world. I've written about this framework previously. Roughly:
Level 1 — Pre-training. The raw model, trained on internet-scale data.
Level 2 — Post-training. RLHF, instruction tuning, safety training — everything that turns a base model into something you can talk to.
Level 3 — Inference scaling. How much compute you throw at the model at runtime. Thinking tokens, chain-of-thought, best-of-N sampling.
Level 4 — Agentic harnesses. The scaffolding around the model: Claude Code, Codex, SWE-Agent, Pi, Devin. The digital robot armor for the AI brain.
Level 5 — Context engineering. The prompt, the skill files, the retrieved context, the evolutionary algorithms that search prompt space. Everything that determines what the model sees when it starts working.
Level 6 — The built environment. APIs designed for agent consumption, verification infrastructure, data markets, workflows rewritten to be machine-readable. The world reshaping itself around AI.
This is reductive — innovations don't cleanly fall into one level — but it's a useful mental model. And the key observation is this: moratoriums target levels 1–3. Levels 4–6 are essentially untouched.
A data center moratorium constrains how much compute is available for training new foundation models and, to some extent, how much inference can be provisioned at scale. It does almost nothing to constrain what people do with a fixed model and a fixed amount of inference. And that's where an increasing share of capability gains are coming from.
Where the Low-Hanging Fruit Is
Over the past year, through a combination of building with these tools, reading the literature, and following the broader ecosystem, I've become more convinced that there is substantial room to improve AI capabilities at levels 4–6 without training a single new model. I don't feel strongly about any single approach, but enough different approaches seem to be working that it's hard to believe it's all a mirage.
Let me walk through the evidence, grouped roughly by how strong I think it is.
Strong evidence of headroom
Agentic harnesses are simple wrappers around a model API. At their core, they're something like: if the model says "write," then write to a file. But the details matter enormously. Benchmarks like SWE-bench and TerminalBench show heavy differentiation from harness to harness, using the same underlying model. Anthropic and OpenAI are both pouring resources into improving Claude Code and Codex respectively. These are not mature, optimized systems — they're early and rapidly improving.
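To make "simple wrapper" concrete, here is a toy sketch of such a loop. The message format, tool names, and the stubbed-out model are all invented for illustration; a real harness would call a model API where `fake_model` sits:

```python
import json
import pathlib
import tempfile

def fake_model(messages):
    """Stand-in for a real model API call. This stub asks to write one
    file, then reports it is done once it sees a tool result."""
    if not any(m["role"] == "tool" for m in messages):
        return json.dumps({"tool": "write", "path": "hello.txt", "content": "hi"})
    return json.dumps({"tool": "done"})

def run_harness(model, workdir, task, max_steps=5):
    """Minimal agent loop: ask the model for an action, execute it,
    feed the result back into the conversation, repeat until 'done'."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = json.loads(model(messages))
        if action["tool"] == "done":
            return "done"
        if action["tool"] == "write":
            # "If the model says 'write,' then write to a file."
            path = pathlib.Path(workdir) / action["path"]
            path.write_text(action["content"])
            messages.append({"role": "tool", "content": f"wrote {path.name}"})
    return "step limit reached"
```

The core really is this small; the differentiation between harnesses lives in the details this sketch omits (tool design, context management, error recovery, permissioning).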
Verification technologies exploit what's sometimes called the generation-verification gap: it's often much cheaper to check if an output is correct than to produce the correct output in the first place. Good verifiers transform capability questions into economics questions. If a model has 50% accuracy on some workflow and you have a reliable verifier, you can get 75% accuracy for roughly double the cost, and so on. The hard verification tools — compilers, proof assistants like Lean, unit test suites, labeled datasets — were all originally built for humans and are nowhere near optimized for agentic use. Meanwhile, softer verification methods (LLM-as-judge, synthetic data selection) are brand new, with many unexplored angles. There's an explosion of interest here from the labs for good reason.
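The arithmetic behind the "50% to 75% for double the cost" claim: with a reliable verifier and independent samples, n attempts at per-sample accuracy p succeed with probability 1 - (1 - p)^n. A minimal sketch (independent samples and a perfect verifier are idealizing assumptions):

```python
def success_rate(p, n):
    """Probability that at least one of n independent samples passes,
    assuming a perfect verifier that accepts exactly the correct outputs."""
    return 1 - (1 - p) ** n

# 50% single-shot accuracy: doubling cost (n=2) gives 75%,
# quadrupling (n=4) gives 93.75%.
```

In practice verifiers are imperfect and samples are correlated, so real gains are smaller, but the shape of the trade is the same: verification converts inference spend into accuracy.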
Skills and context engineering represent a vast, complicated search space. For a fixed model, harness, and problem, there is a space of possible prompts and context configurations with wildly differing success rates. Early evidence suggests that human-written "skill files" (markdown documents describing a workflow) improve agent pass rates by roughly 16 percentage points. On the research side, evolutionary and ML-based approaches to traversing prompt space have shown promising results — most notably J. Berman's ARC run, which achieved the highest score on the ARC benchmark using what is fundamentally a prompt search method. ACE and GEPA claim results as well, though it's harder to parse their real-world value.
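To show the shape of this kind of search, here is a toy hill-climbing loop over prompt variants. This is not Berman's method, ACE, or GEPA; the mutation scheme and scoring function are placeholders, and real systems are far richer:

```python
import random

def evolve_prompt(base, mutations, score, generations=20, seed=0):
    """Toy evolutionary search over prompt space: propose a mutated
    variant each generation and keep it only if it scores higher."""
    rng = random.Random(seed)
    best, best_score = base, score(base)
    for _ in range(generations):
        candidate = best + " " + rng.choice(mutations)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score
```

In a real system, `score` would run the agent on held-out tasks and measure pass rate, which is exactly why this kind of search burns so many inference tokens.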
Moderate evidence, more uncertainty
Multi-agent coordination — systems that enable teams of agents to work together with management structures, version control, and task routing — is theoretically very promising. Much of it has already been invented by human society: division of labor, code review, parallel work streams. In practice, I'm more skeptical. I've put real effort into setting up multi-agent "code factories" like Gastown with limited success and considerable chaos. There are a lot of people on Twitter claiming they run swarms of agents that make autonomous git decisions, but the gap between demos and reliable production systems remains wide. It may be that models need to cross some capability threshold before coordination architectures really work. If they do cross that threshold, though, the upside is very large.
The agent-native world — the gradual reshaping of digital infrastructure around AI workflows — is more speculative but worth flagging. Right now, when you ask a chatbot a question requiring data, it typically does a web search, which is a remarkably inefficient way to gather structured information. We're probably moving toward metadata warehouses, API-first data access, AI librarian services, and standardized machine-readable formats that will make the whole process of data retrieval dramatically more efficient. Some of this will require markets and coordination, which means governments could plausibly regulate parts of it, but the core innovations are still just APIs, standards, and information architecture.
Why None of This Is Regulable
The pattern across all of levels 4–6 is the same: the innovations are text files, markdown, simple code, vanilla ML, and ideas. An agentic harness is a few hundred lines of Python. A skill file is a markdown document. A verification pipeline is a test suite. A prompt search algorithm is a standard optimization loop. None of this requires large-scale compute infrastructure. None of it is visible from a satellite. None of it can be meaningfully embargoed.
It's true that all of this requires inference to A/B test and iterate on, so a moratorium does create some drag. But compared to how much it slows down pre-training and inference scaling (levels 1–3), the effect on levels 4–6 is much smaller. You need orders of magnitude less compute to test a new harness configuration than to train a new frontier model.
Counterarguments and Honest Uncertainties
I want to flag several reasons I might be wrong or overstating this:
Sigmoid asymptotes may be near. For many of these techniques, the biggest gains probably come first. Prompt engineering has been around for a few years now and the improvements, while real, may be flattening. The same could be true for harness design, verification methods, and so on. Even if all of this works, we might be looking at something like a 2–3x improvement in effective capability from a fixed model, not a 10x.
Inference constraints do slow levels 4–6 down. I said this above but want to emphasize it. Multi-agent systems and large-scale context search both burn a lot of tokens. If inference is expensive or scarce, the iteration loop slows considerably, especially for approaches that rely on massive parallelism or extensive trial-and-error.
Coordination may require smarter models. Several of the most promising approaches (multi-agent systems, autonomous git workflows, complex orchestration) may simply not work well until models are meaningfully more capable than they are today. If that's true, then a training moratorium might indirectly block these advances too.
What This Means
Even if a moratorium works exactly as intended, we should expect to still have the compute to train and deploy the next generation of frontier models (Opus 5, Gemini 4, etc.). And we should expect continued, meaningful capability improvements at levels 4–6 on top of whatever those models can do.
To make this concrete: imagine a near-future system that combines a model roughly one generation ahead of today's best, wrapped in a mature agentic harness, equipped with an efficiently searchable library of optimized skill files, backed by strong verification infrastructure, and operating in a digital environment increasingly designed for AI workflows.
I don't think this requires any single breakthrough. It requires incremental progress on many fronts simultaneously, most of which is already underway, and the evidence keeps coming in that many different things yield incremental improvements.
The implication for policymakers is that compute regulation, while necessary and valuable, is not obviously sufficient. A moratorium will probably greatly reduce the probability of some sort of explosive recursive self-improvement that ends with a silica god, but we may still soon be looking at agents that are more economically productive than large swaths of the population, along with all the downstream effects that might cause.
What I think is important is that people working on AI policy have at least a loose understanding of this paradigm: AI capability is a stack, not a single number; freezing the bottom of the stack leaves the top free to grow; and there is probably quite a bit of room to grow in ways that are nearly impossible to regulate or stop in a liberal democracy.
