Ben Cottier

Staff Researcher @ Epoch
352 karmaJoined Feb 2020Working (0-5 years)


At Epoch, helping to clarify when and how transformative AI capabilities will be developed.

Previously a Research Fellow on the AI Governance & Strategy team at Rethink Priorities.


Understanding the diffusion of large language models


Hi Charlotte - as you can imagine, estimating the latter is much more difficult due to reasoning about counterfactuals. But I do have some thoughts on it in this section of a post in the sequence.

I think the key claim you'd be looking for there is:

My best guess is that the knowledge of GPT-3’s existence sped up both DeepMind and Google’s work scaling up language models by six months (90% CI: 1–18 months). But I have not been able to distinguish whether this acceleration was driven by insider knowledge, or the publication of GPT-3, or the hype generated after publication, or some combination of those factors.

As my 90% confidence interval shows, I'm very uncertain, but I hope this helps.

I have made a big update regarding this claim:

What about for a very large-scale application of a GPT-3-like model—for example, generating text equivalent to 1% of global Twitter activity for one year, or assisting one million software developers with coding for one year? I estimate that deploying a model like BLOOM in these ways would be 20% of the cost of developing the model (90% CI: 10 to 68%), in terms of the dollar cost of compute alone. This means that deployment is most likely much less prohibitive than development. But it means I give a 5% chance that for the largest-scale applications, the cost of deploying the model is at least 68% of the cost of developing the model, which would make deployment similarly prohibitive.

The claims about the cost of the specific deployment scenarios (which were  oversimplified to begin with) may still be fairly accurate. But in terms of the intent behind the estimates I made, I think I greatly underestimated the largest scale of deployment for LLMs, a scale which is becoming more common and which I understand a little better. I now think that for the largest, most commercially successful LLMs, the total compute spent on deployment is much larger than in development.

My update was mostly influenced by several more sources (and more credible sources than the ones I reviewed in the post) suggesting that the total compute that major AI companies spend on inference is significantly larger then the total compute spent on training and experimentation: 

  1. https://arxiv.org/pdf/2111.00364.pdf, p.3, Fig. 3 caption: "At Facebook, we observe a rough power capacity breakdown of 10:20:70 for AI infrastructures devoted to the three key phases — Experimentation, Training, and Inference". Also, "Considering the primary stages of the ML pipeline end-to-end, the energy footprint of RM1 is roughly 31:29:40 over Data, Experimentation/Training, and Inference".[1][2]
  2. https://arxiv.org/abs/2204.05149, p.7: "Across all three years, about ⅗ of ML energy use is for inference and ⅖ for training. These measurements include all ML energy usage: research, development, testing, and production."
  3. https://www.semianalysis.com/p/the-inference-cost-of-search-disruption: "inference costs far exceed training costs when deploying a model at any reasonable scale. In fact, the costs to inference ChatGPT exceed the training costs on a weekly basis."

However, this doesn't significantly update my conclusion about the importance of focusing on development rather than deployment as a target of intervention (point 2c in the Key Takeaways). This is because of theother strong reasons to focus on development that I mention. I would revise point 2c to say that, even if the amount of compute is smaller in total, the compute you have to spend on training tends to be more up-front and all-or-nothing than deployment which can be scaled quite smoothly. This creates a greater barrier.

I have edited the post to point out this comment, but for the sake of posterity and prioritizing other projects, I won't be updating the rest of the post.

  1. ^

    Power and energy usage are not 1-1 with compute usage, especially over time as new hardware improves energy efficiency. But there is a clear relationship: computation requires running GPUs for some time, which consumes a fairly consistent amount of average power. I don't expect that improvements in energy efficiency have a big impact on the ratio of development and deployment compute.

  2. ^

    RM1 denotes one of Facebook's six models that "account for a vast majority of compute resources for the overall inference predictions at Facebook, serving billions of users world wide" (see footnote 4 on p.4). RM1 is the single most carbon-intensive model out of these six models (see Fig 4 on p.4).

Hi! I'm in Lisbon until Sunday. Excited to meet you!

Also, I'd like to be clear about what it means to "keep up". I expect those lower-resourced types of actors won't keep up in the sense that they won't be the first to advance state-of-the-art on the most important AI capabilities. But the cost of a given ML system falls over time and that is a big driver of how AI capabilities diffuse. 

Thanks Haydn!

I just want to add caution on taking the extrapolations too seriously. The linear extrapolation is not my all-things-considered view of what is going to happen, and the shaded region is just the uncertainty in the linear regression trendline rather than my subjective uncertainty  in the estimates.

I agree with you inasmuch as I expect the initial costs of state-of-the-art models to get well out of reach for actors other than big tech (if we include labs with massive investment like OpenAI), and states, by 2030. I still have significant uncertainty about this though. Plausibly, the biggest players in AI won't be willing to spend $100M just on the computation for a final training run as soon as 2030. We still don't have a great understanding of what hardware and software progress will be like in future (though Epoch has worked on this). Maybe efficiency improves faster than expected and/or there just won't be worthwhile gains from spending so much in order to compete. 

how sensitive do you think your conclusions are to the choice of using GPT-3 as your point of reference?

I tried to qualify claims to account for using a single point of reference, e.g. just talk about pre-trained language models rather than all ML models. However, as I note in the final section of this post, my claims about the broader implications of this research have the lowest confident and resilience. It feels really hard to quantify the sensitivity overall (I'm not sure if you have a way to measure this in mind). But my off-the-cuff intuition is that if my language model case studies turn out to not at all generalise in the way that I assumed, my % likelihoods for the generalised claims throughout the sequence would change by 20 percentage points on average.

I'm curious also if you think diffusion has differed between GPT-2 and GPT-3 and what factors you think are relevant for explaining that difference, if any? I kinda forget my history but I have a rough recollection that GPT-2 was successfully replicated faster.

I think Shevlane (2022) is currently the best source on this topic. Unfortunately it is not very accessible due to the style of an academic thesis. But the Abstract of Chapter 2 (p.63 of the PDF) gives an idea. 

I didn't explicitly compare to GPT-2 but I'd say that this section ("Diffusion can be significantly limited if (a) training compute cost is high and (b) developers don’t release their model weights; otherwise, developers need to rely more on keeping methods secret") is implicitly explaining why GPT-3's release strategy succeeded more than GPT-2's release strategy: (a) there was the opportunistic fact that GPT-3 required 2 orders of magnitude more compute to train, and (b) no (smaller) versions of the GPT-3 model were open-sourced; only an API to GPT-3 was provided.

For example, with DALLE-2 my understanding is that similar capabilities were obtained by much lower resource actors (Midjourney, Stable Diffusion) and I'm curious what the relevant differences are to explain the much more rapid diffusion there. (The irony in the name "Stable Diffusion" being a model resulting from diffusion is funny.)

I think the training compute requirement and hardware improvements are two key differences here. Epoch's database currently estimates the training compute of Stable Diffusion as 5E+22 FLOP (link to the spreadsheet cell). That is about 6 times smaller than the estimated FLOP for GPT-3, at 3.14E+23 FLOP.

As I said in another comment, the leap from NVIDIA V100 (used to train GPT-3) to NVIDIA A100 (used to train Stable Diffusion) seems to enable a ~6x improvement in efficiency (in turn a 6x reduction in $ cost). So as a back-of-the-envelope calculation that would put Stable Diffusion at ~36x cheaper to train than the original GPT-3 training run.

There could also be algorithmic/engineering reasons why a model like Stable Diffusion is easier to produce, but I haven't looked into that.

You can find my take on that in this section, but I'll put an excerpt of that here:

The main driver of this is improved GPU price performance. The actual GPT-3 training run used NVIDIA V100 GPUs, but OPT-175B and other more recent GPT-3-like models were trained on A100 GPUs. A100 and V100 GPUs currently have a similar price on Google Cloud. However, A100 can be up to six times more efficient than V100, since

  1. V100 has about three times slower peak throughput (125 teraflop/s vs. 312 teraflop/s)
  2. V100 has less than half the memory capacity of the 80GB A100 chip, at 32 GB, therefore requiring over two times the number of chips to fit a model in memory.

OPT also seems to have been trained with a higher hardware utilization rate than GPT-3 (the actual FLOP/s achieved divided by the theoretical peak FLOP/s for the hardware), if reported numbers are to be believed (only 21% for GPT-3 compared to 47% for OPT-175B). This is a smaller factor of difference compared to the hardware specs, but I think I ought to have mentioned it in the report.

As an aside, it's still pretty unclear to me how different practitioners are measuring their reported utilization rates. For example is it a single measurement at a random time during training, or an average of multiple measurements, or the maximum of multiple measurements?

this might be goalpost shifting, since GPT3!2022 is a very different thing from GPT3!2020

That's a good point, but I think goalpost shifting is likely not significant in this case, which supports your original point. The OPT paper compares to "GPT-3" (or "GPT" in the plots, as shorthand I guess) for the prompting and few-shot evaluations (section 3). It says on p.3:

We follow GPT-3 (Brown
et al., 2020) by using their prompts and overall ex-
perimental setup. We compare primarily to GPT-3,
having aimed to re-implement their evaluation set-
tings, but include reported performance of other
LLMs on a per-task basis when available (Lieber
et al., 2021; Rae et al., 2021; Hoffmann et al., 2022;
Black et al., 2022)

Also on p.3 they refer to "numbers reported by Brown et al. (2020)"

In WIC, we see that the OPT models always out-
perform the GPT-3 models, though the numbers
reported by Brown et al. (2020) also seem question-
able, given WIC being a binary classification task.

But p.3 also mentions

For MultiRC, we are unable to replicate the GPT-3
results using the Davinci API within our evalua-
tion setup [...]

It sounds to me like they used the original results from Brown et al. (2020) where available, but evaluated using the Davinci API as a cross-check or fallback.

In contrast, the paper talks about "Davinci" for the evaluations in subsequent sections, so this is presumably the API version of GPT-3 that was available at the time. It says on p.5 that "We compare primarily against GPT-3 Davinci, as these benchmarks were not yet available to be included in Brown et al. (2020)." I didn't include these other evaluations (e.g. Bias and Toxicity) in my analysis; I'm just pointing this out to support my guess that the evaluations in section 3 are comparing to the original GPT-3.

Load more