Largest AI model in 2 years from $10B

Peter Drotos 🔸

I created this short report as a work test for an ML Hardware researcher role and am sharing it with the organization's approval in case someone finds it useful or is happy to give me feedback on it. The full title is: Largest AI model (in terms of compute) in 2 years from a $10B compute infrastructure purchase now

Epistemic status:

I work as a Hardware engineer but have no ML/AI hardware specific experience. I also only recently started to follow the developments in this area. I don’t have significant research experience (neither quantitative nor qualitative) of the kind that I think would be really useful for creating a proper, high-quality report. Time was also limited, and I was afraid of getting distracted, and of distracting the readers too much, by doing things in a way I’m not currently comfortable with, e.g. numerical probability values, confidence intervals and proper aggregation methods. I also want to note that I have not done any significant calibration training on estimations, so please treat my estimates accordingly. I spent slightly more than 10 hours altogether looking into the questions and writing up the report.

As requested by the task, I focus on the training in terms of computation and ignore things like how much training data and parameters would be ideal for a training run of such scale.

Summary

A single training run at a scale that utilizes $10B worth of AI training compute infrastructure is definitely ambitious and plausibly unprecedented as of now. Whether the money is used to set up the buyer’s own datacenter or paid as a multi-year commitment to a cloud provider, I estimate this amount to be sufficient for acquiring even SotA AI chips on the order of some hundreds of thousands (> 1e5). Since this scale is unprecedented, and memory and interconnect bandwidth of the chips are already key factors even at the existing scales (e.g. see memory bandwidth highlighted here and interconnect bandwidth included in the US export control), going with the most advanced chips (e.g. highest memory and interconnect bandwidths and individual performance) seems to be the way to go.

That's still roughly an order of magnitude (~10x) more chips than the largest estimated final training run so far (GPT-4 using ~10-25 thousand NVIDIA GPUs). I think this implies the following important factors:

This is comparable in magnitude with the number of such chips delivered in a year so I expect it takes months to get access to all of them.
Significant improvements are likely needed in the parallelization of the workload to get a utilization rate even comparable to the rate of the current large scale runs.
Given the unprecedented scale of the infrastructure and potential significant changes in the training (e.g. algorithms, model parameters), decent testing work is also likely needed before the final training run.

Combining the above I estimate that a final training run utilizing the whole acquired compute capacity may only start about a year from now.

Finally, combining the estimates of the chips used, the performance of the individual chips, the time available for the final training and the utilization rate, I estimate that the resulting final training could use roughly 1e27 FLOP (with my rough 90% CI being something like 3e26 to 4e27).

What is $10B good for?

Most large scale AI training labs use NVIDIA data center GPUs (see why). Another chip family worth mentioning is the TPUs from Google, who are one of the big players in AI themselves and use their own chips for training in addition to making them available through the cloud. These two families represent the current large-scale training runs pretty well so I’ll stick to these two.

The latest NVIDIA card of this family is the H100 (~$30,000 each), released last year. It clearly beats its predecessor, the A100 (~$10,000 each), both in performance (raw FLOP/s + new transformer engine) and in bandwidth. (See H100 website and H100 datasheet).

NVIDIA also sells 8-card servers as the DGX (standard) and HGX (customizable) products. The price of a DGX H100 is around $400,000-$500,000. The price of the HGX version does not seem to be publicly available, probably due to the greater room for customization, but we can assume it is of about the same order of magnitude.

Price calculations can also be made through cloud provider hourly costs. This article collects AWS prices showing the 3-year reserved hourly cost, which is a plausible discount to get if we invest $10B. The 8-GPU H100 machine’s hourly cost is ~$40, so ~$5 per GPU per hour.

Google just released the new version of its TPU, the v5e, but it is currently unclear if it’s better for large scale training runs than the previous v4 version.

Google TPUs are only available through the cloud (unless you are Google) but I can imagine they might consider a $10B offer or set up the same infrastructure to be accessed through the cloud. The prices are lower than those of the NVIDIA GPUs, with the 3-year commitment price being ~$1.5 per chip-hour for v4 and ~$0.5 for v5e. But details about the v5e and about v4 also suggest that these have only about ⅕ - ¼ of the performance (BF16 FLOP/s) of the H100, respectively.

Obviously there are additional operating costs, like facility, power (15-25% energy cost), etc., when we simply buy the hardware, but it’s also likely that the cloud prices contain a significant profit margin that the vendors would be happy to cut for a $10B commitment.

So based on the cloud and DGX prices, we could get access to ~1-2e5 (100,000-200,000) H100s or ~4e5 TPUv4 or ~1e6 TPUv5e. ~1-2e5 H100s is an order of magnitude more than the number of chips used for GPT4.
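A rough sketch of the arithmetic behind these chip counts follows. It is only an illustration under stated assumptions: the whole $10B is assumed to be spent on two years of continuous use at the quoted 3-year reserved hourly rates, the DGX price range is taken at face value, and operating costs and bulk discounts are ignored. The function names are hypothetical helpers, not from any library.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours

def chips_from_cloud(budget_usd, usd_per_chip_hour, years):
    """Chips that can be rented continuously for `years` with the budget."""
    return budget_usd / (usd_per_chip_hour * years * HOURS_PER_YEAR)

def chips_from_servers(budget_usd, usd_per_server, chips_per_server=8):
    """Chips obtained by buying whole 8-chip servers outright."""
    return budget_usd / usd_per_server * chips_per_server

BUDGET = 10e9  # $10B
YEARS = 2      # assumption: budget covers the 2-year window of the question

h100_cloud = chips_from_cloud(BUDGET, 40 / 8, YEARS)  # ~$5 per GPU-hour
h100_dgx_low = chips_from_servers(BUDGET, 500_000)    # DGX at $500k
h100_dgx_high = chips_from_servers(BUDGET, 400_000)   # DGX at $400k
tpu_v4 = chips_from_cloud(BUDGET, 1.5, YEARS)         # ~$1.5 per chip-hour
tpu_v5e = chips_from_cloud(BUDGET, 0.5, YEARS)        # ~$0.5 per chip-hour

for name, n in [
    ("H100 (cloud)", h100_cloud),
    ("H100 (DGX, low)", h100_dgx_low),
    ("H100 (DGX, high)", h100_dgx_high),
    ("TPU v4 (cloud)", tpu_v4),
    ("TPU v5e (cloud)", tpu_v5e),
]:
    print(f"{name}: {n:.1e} chips")
```

Under these assumptions all H100 routes land in the 1e5 to 2e5 range, and the TPU counts come out near 4e5 (v4) and 1e6 (v5e). Spreading the budget over the full 3-year commitment instead of 2 years would lower the cloud counts by a third.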

Is 100,000 chips a lot?

It’s estimated that the largest final training run as of now (GPT4) cost about $50M, based on calculations from Epoch, using 10,000-25,000 A100 GPUs for 2-6 months. Historical trends show that there has been a rapid scale-up of 4x/year in compute and 3x/year in costs for large scale training runs. Applying this to the GPT4 run, we get close to the billion-dollar and 100,000-chip era, but the investment is quite ambitious even if this very rapid growth trend is maintained, which obviously cannot go on forever. According to estimates this trend is expected to slow down around 2025.
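To make the trend extrapolation concrete, here is a minimal sketch that asks how many years of 3x/year cost growth it would take to go from the ~$50M GPT4 run to a $1B and a $10B run. It assumes the trend holds unchanged and counts years from the GPT4 run; the helper name is illustrative only.

```python
import math

def years_to_reach(start, target, growth_per_year):
    """Years of constant exponential growth needed to go from start to target."""
    return math.log(target / start) / math.log(growth_per_year)

GPT4_COST_USD = 50e6  # estimated cost of the GPT4 final training run
COST_GROWTH = 3.0     # historical trend: training costs grow ~3x per year

years_to_1b = years_to_reach(GPT4_COST_USD, 1e9, COST_GROWTH)
years_to_10b = years_to_reach(GPT4_COST_USD, 10e9, COST_GROWTH)

print(f"Years until a $1B run:  {years_to_1b:.1f}")
print(f"Years until a $10B run: {years_to_10b:.1f}")
```

On this simple extrapolation a $1B run is a little under 3 years after GPT4 and a $10B run a little under 5 years, which is one way to see why a $10B purchase now is ambitious even relative to the fast historical trend.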

But 100,000s of chips instead of 10,000s is still a large increase even for SotA training runs. E.g. memory bandwidth (highlighted here) and interconnect bandwidth (included in the US export control) had already been making it difficult to get a high utilization rate out of the chips. There were obviously some technical improvements (e.g. bandwidths of the H100/HGX H100 are somewhat increased vs A100/HGX A100, and Google also announced Multislice, which is designed to support large scaling). Still, these large scale generative AI training runs have not been going on for long, so without a deeper understanding of the underlying difficulties I’d estimate it’s unlikely that training runs of such scale would be completely unviable in e.g. a year from now. So I give a decent chance of reaching a utilization similar in magnitude to the GPT4 final run (10%+) after some significant work is spent on this. I’ll use 20%, which lies between the ~40% estimated for GPT4 and the 10% that is my rough lower bound estimate, and is slightly on the pessimistic side.

A scale of 100,000 accelerators is larger than that of the top non-AI supercomputers, although the top one gets close in magnitude. Also, NVIDIA is estimated to ship ~550,000 H100s altogether in 2023. So the challenge does not seem trivial at all.

When could the final training start?

When the final training run can start seems to be an important question given that utilizing an infrastructure of such scale well does not seem to be a trivial task. I’m assuming hardware companies and cloud providers don’t just have $10B worth of AI chips in stock that one could immediately get access to. I think you’d need to wait for at least multiple months or even more than a year to get access to all of them.

This seems like a lot, but given the large increase in scale both in the compute infrastructure and in the training size I’d expect that decent preparation work would be needed from the different research and engineering teams anyway before the final training can be started. It seems plausible that the preparation work can be started with only a fraction of the infrastructure and that the majority of the preparation work can be finished even without the full final infrastructure being available (e.g. with 50%).

I’m quite uncertain about this, but I’d estimate that it’d take somewhere around a year for the full infrastructure and all the necessary algorithmic improvements to be in place and for an adequate amount of testing to be completed before a final training run can start. Note that a year-long large scale training run is likely unprecedented, but I’d assume a project at such scale would aim at something of this length to get the most out of the investment. I think they would aim for a run of around a year, with a minimum threshold of 6 months.

How much compute might we squeeze out?

Without creating proper probabilistic estimates myself I’ll heavily rely on this calculation from Epoch that estimates the details of the GPT4 training. I’ll slightly adjust the calculations for the scenario I’m estimating and “scale it” up based on the factors I identified as relevant earlier. I’ll also try to estimate an “artificial” confidence interval, also heavily relying on the GPT4 calculation to express my certainty in the final estimate.

The calculation estimates that the GPT4 training ran for 2-6 months on 10,000-25,000 A100 GPUs with a utilization of ~40%, and that it used 2e25 FLOP (90% CI 8e24 to 4e25). I expect that we have ~2.5-3x more time (~6-15 months) and ~10x more chips (100,000-200,000), that the pure performance of the individual chips is ~3x higher (FP16 dense FLOP/s of H100 vs A100), and, as estimated above, that utilization might be lower (~0.5x).

One more potentially important factor I don’t know much about is the FP8 representation and transformer engine feature support of the H100. NVIDIA claims that the transformer engine can “dramatically” accelerate AI calculations for transformers by using an FP8 representation of the numbers instead of FP16 when possible. The H100’s FP8 performance is 2x the FP16 performance, and it’s not clear what “dramatically” exactly means in this context. I estimate that the actual increase is unlikely to reach ~2x, given that this is the pure FP8 peak performance, so I’ll give it a ~1.5x factor.

Altogether this is a ~50x scale-up of the calculation, resulting in a final estimate of ~1e27 FLOP.
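The scale-up can be written out as a simple product of the factors listed above. This sketch uses the lower end of the time factor (2.5x) and treats the factors as independent multipliers, which is a simplifying assumption; the dictionary keys are just labels.

```python
import math

# Multipliers relative to the GPT4 final training run.
factors = {
    "training time": 2.5,             # ~2.5-3x more time; lower end used here
    "number of chips": 10,            # ~10x more chips
    "per-chip FP16 performance": 3,   # H100 vs A100, dense FP16
    "utilization": 0.5,               # ~20% instead of ~40%
    "FP8 / transformer engine": 1.5,  # rough guess for the FP8 benefit
}

GPT4_FLOP = 2e25  # central estimate for the GPT4 training run

scale = math.prod(factors.values())
estimate = GPT4_FLOP * scale

print(f"Combined scale factor: {scale:.2f}x")
print(f"Scaled estimate: {estimate:.2e} FLOP")
```

The exact product is slightly above 50x, so rounding to ~50x and ~1e27 FLOP is consistent with the precision of the inputs.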

To get a lower bound of my “artificial” confidence interval I re-ran the calculation of the GPT4 training with the training duration fixed to 2 years. That is how much compute we could expect to utilize if we don’t change anything, just buy/keep the same infrastructure used for GPT4 and run that for 2 years. This results in 1e26 FLOP (90% CI: 7e25 to 2e26).
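As a deterministic sanity check on that range (not a reproduction of the probabilistic Epoch calculation), one can multiply out the two ends of the chip count. The A100 peak of 312 TFLOP/s for dense FP16 is taken from NVIDIA's published specification and is an input that does not appear in the text above; the function name is illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

A100_PEAK_FLOPS = 312e12  # assumption: dense FP16 tensor peak per A100
UTILIZATION = 0.4         # ~40% estimated for the GPT4 run
YEARS = 2

def training_flop(n_chips, peak_flops, utilization, years):
    """Total FLOP delivered by n_chips running continuously for `years`."""
    return n_chips * peak_flops * utilization * years * SECONDS_PER_YEAR

low = training_flop(10_000, A100_PEAK_FLOPS, UTILIZATION, YEARS)
high = training_flop(25_000, A100_PEAK_FLOPS, UTILIZATION, YEARS)

print(f"10,000 A100s for 2 years: {low:.1e} FLOP")
print(f"25,000 A100s for 2 years: {high:.1e} FLOP")
```

The two ends come out close to the 7e25 to 2e26 interval quoted above, which suggests the interval is driven mostly by the uncertainty in the number of chips.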

The performance of the GPT4 infrastructure over 2 years is definitely a very good starting point. But given that we have $10B to invest I think it’s fair to be slightly more optimistic than that. Scaling up the GPT4 estimate’s lower bound (8e24) by 50x results in 4e26, which seems about right but is still a 5-6x increase compared to the lower bound of the 2-year GPT4 run estimate (7e25), which seems quite optimistic. I’m more comfortable going slightly lower, to ~3e26 (a ~4x increase over the lower bound of the 2-year GPT4 run estimate).

I also feel quite uncertain about the upper bound. The anchor point I have is scaling up the GPT4 estimate’s upper bound (4e25) by 50x, which results in ~2e27. However, I’m definitely less confident about the upper limit than that, although I’m not comfortable going as high as 1e28 either. A value roughly around 4e27 seems more plausible and also still proportionate to the adjustment of the lower bound.

This makes my full estimate 1e27 FLOP (90% CI 3e26 to 4e27).
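The interval adjustments can be checked with the same kind of arithmetic. This is a small sketch using only the numbers quoted above; the variable names are illustrative, and the last line simply expresses the width of the final interval in orders of magnitude.

```python
import math

SCALE = 50                 # rounded scale-up factor
GPT4_CI = (8e24, 4e25)     # 90% CI of the GPT4 training compute estimate
GPT4_2Y_LOWER = 7e25       # lower bound of the 2-year GPT4 re-run
FINAL_CI = (3e26, 4e27)    # interval chosen for the final estimate

scaled_lower = GPT4_CI[0] * SCALE  # anchor for the lower bound
scaled_upper = GPT4_CI[1] * SCALE  # anchor for the upper bound

ratio_scaled = scaled_lower / GPT4_2Y_LOWER  # the "5-6x increase"
ratio_chosen = FINAL_CI[0] / GPT4_2Y_LOWER   # the "~4x increase"
ci_width_oom = math.log10(FINAL_CI[1] / FINAL_CI[0])

print(f"Scaled GPT4 CI: {scaled_lower:.0e} to {scaled_upper:.0e} FLOP")
print(f"Scaled lower bound vs 2-year run: {ratio_scaled:.1f}x")
print(f"Chosen lower bound vs 2-year run: {ratio_chosen:.1f}x")
print(f"Final CI width: {ci_width_oom:.2f} orders of magnitude")
```

The final interval spans a bit more than one order of magnitude, wider than the scaled GPT4 interval, which reflects the extra uncertainty about utilization and timing at the larger scale.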

Effective Altruism Forum