
This post is one part of the sequence Understanding the diffusion of large language models.  As context for this post, I strongly recommend reading at least the 5-minute summary of the sequence.

EDIT 25-Feb-2023: I have made a big update from the claims in this post about deployment of large language models costing less than development in total. I now think that for the largest, most commercially successful LLMs, the total compute spent on deployment is much larger than in development. However, this doesn't significantly update my conclusion about the importance of focusing on development rather than deployment as a target of intervention (point 2c in the Key Takeaways). More information in this comment.

# Key takeaways

1. GPT-3 itself can be used and fine-tuned via an API. Despite this, there’s still demand for direct access to the model’s weights, and multiple similar models now exist that provide access to weights. OPT-175B is a GPT-3 replica; I estimate that its model weights can be downloaded by hundreds to thousands of ML researchers in academia, government, and industry labs, subject to approval. BLOOM is similar to GPT-3, but not a replica, and is publicly available for anyone to download. (more)
2. What resources are required to actually use these models? (more)
    1. One relevant question is how much money and talent it would take to generate a large volume of text with about the same average usefulness as GPT-3’s outputs. (more)
        1. Based on publicly listed pricing, running GPT-3 via the OpenAI API would generate approximately 150 million English words for $4000.[1]
        2. I estimate that a user could generate 150 million English words per day of similar usefulness to GPT-3’s outputs for as little as $240 per day by running one instance of BLOOM independently (i.e., downloading the model weights and running the model on a server that they either directly rent or own).
        3. I estimate there are 5000 people (90% CI: 100 to 45,000) that are capable of running BLOOM independently.[2]
    2. What about a very large-scale application of a GPT-3-like model, for example generating text equivalent to 1% of global Twitter activity for one year, or assisting one million software developers with coding for one year? I estimate that deploying a model like BLOOM in these ways would be 20% of the cost of developing the model (90% CI: 10 to 68%), in terms of the dollar cost of compute alone. This means that deployment is most likely much less prohibitive than development. But it means I give a 5% chance that for the largest-scale applications, the cost of deploying the model is at least 68% of the cost of developing the model, which would make deployment similarly prohibitive. (more)

## Even running a GPT-3-like model independently seems feasible for thousands of individuals

I have just argued that most actors that use models will do so via other actors’ APIs. However, I still expect there to be cases of diffusion where some actors want to run a model independently. By running “independently,” I mean that an actor downloads the model weights and runs the model on a server that they either directly rent or own. The incentive for independent deployment could arise because

1. There is no existing API for a model of interest, or
2. Existing APIs are not sufficient for some actors’ aims.

For (1), there might be no existing API because the model developer wants to limit diffusion of capabilities, or simply has no interest in providing wider access to the model. For (2), the actors in question could be startups that want to scale up to train and deploy their own models, or malicious actors that don’t want their malicious usage to be tracked (even if permitted) by an API.

For these reasons, I think it’s useful to analyze who can run a model like BLOOM independently, even though there is in fact an openly accessible API for BLOOM. One can then apply a similar analysis to future models where there is a stronger incentive to run the model independently.

### Combining the scenarios and accounting for other evidence to estimate the compute cost of the “largest viable deployment”

Combining the Twitter and coding assistant compute cost estimates in this Guesstimate model, I get an overall estimate of $1.3M (90% CI: $230K to $5.2M). As a percentage of the GPT-3 replication cost estimate, this is 12% (90% CI: 1.3 to 56%). As another line of evidence, I tried to find information on what percentage of the cost of machine learning applications as a whole is accounted for by inference rather than training. I found two sources estimating that 80-90% of the cost is for inference.[38] However, those sources don’t provide clear evidence or reasoning for those estimates, and they appear to be incentivized to give high estimates. Updating slightly for this evidence, my overall estimate of the cost of the largest viable deployments of GPT-3-like models is 20% of the development cost (90% CI: 10 to 68%). Converting this back to dollars, I get $2.6M (90% CI: $1M to $6.6M).[39]

Putting this all together, in my median scenario the development of a GPT-3-like model costs about 5 times more than the largest viable deployment of the model. But my confidence interval means there is a 5% chance that there are deployment scenarios which (a) cost more than 68% as much as developing the model, and (b) have a significant impact, such as improving one million software developers’ productivity by a few percent. So plausibly, for the highest-impact applications, the cost of deployment is almost as prohibitive as the cost of development.
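The ratios in this paragraph follow from the percentages quoted above. Here is a minimal sketch of that arithmetic, using only the post's own figures (20% median, 10% to 68% interval, $2.6M median deployment cost). The variable names are illustrative, and dividing medians is only a rough consistency check, since the Guesstimate model works with full distributions.

```python
# Deployment cost as a fraction of development cost (figures from the post).
median_fraction = 0.20
ci_low_fraction, ci_high_fraction = 0.10, 0.68

# Median cost of the largest viable deployment, in USD (from the post).
median_deployment_cost = 2.6e6

# How many times more expensive development is than deployment.
median_ratio = 1 / median_fraction   # median scenario
low_ratio = 1 / ci_high_fraction     # deployment at the top of the interval
high_ratio = 1 / ci_low_fraction     # deployment at the bottom of the interval

# Development cost implied by the median deployment cost and median fraction.
implied_development_cost = median_deployment_cost / median_fraction

print(f"Development / deployment: {median_ratio:.1f}x "
      f"(range {low_ratio:.1f}x to {high_ratio:.1f}x)")
print(f"Implied development cost: ${implied_development_cost / 1e6:.0f}M")
```

At the upper end of the interval, development costs only about 1.5 times as much as deployment, which is the sense in which deployment could be "almost as prohibitive".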

One consideration I haven’t taken into account in the above analysis is the ability for actors to scale up via commercial revenue. Actors could deploy a model at a small but profitable scale, then use the resulting revenue to scale up, then deploy at a larger and more profitable scale, and so on in an amplifying feedback loop. This feedback loop can also have discontinuous jumps: if an actor has a moderately successful and promising application of AI, they might suddenly receive much more funding from investors. AI21 Labs is an example, reportedly raising $64M in funding in July 2022 and thereby almost doubling their total capital (Wiggers, 2022). Having said that, the current leading AI developers can also set up this amplifying feedback loop, and have the biggest head start. So I think that leading developers are likely to maintain a steady (and perhaps an accelerating) lead this way. Because of this maintained lead, I think the number of actors that can afford to independently deploy future state-of-the-art models will most likely not increase significantly over time, even as smaller actors scale up.

## Upshot: focus more on shaping development than deployment

Above, I have argued that the development of GPT-3-like models is a much larger constraint than the deployment of models. I think there is more opportunity for the AI governance community to take advantage of the larger constraint on development, than to make deployment more difficult. For example, diffusion can be limited by taking advantage of the large compute and talent requirements to train GPT-3-like models. Meanwhile, deployment seems much easier to do and more difficult to control. This is because the cost of even the largest viable deployments seems to be much smaller (about four times smaller, at my best guess).

Furthermore, the developers of models seem to be in the most convenient position to deploy those same models. This is because

1. There is significant overlap in the expertise required to develop and deploy.
2. The compute used for training the model is probably much more than is needed to deploy (again, based on my above analysis), so this compute can be reused.
3. The developer has the most control over the model initially, because they are the first to possess the model and can decide to keep it private.

For these reasons, I think the AI governance community should prioritize limiting which actors can develop models over limiting which actors can deploy models.

# Appendix: Who can download and run BLOOM independently?

## Cost: $240 for 150 million words

For the number of words, see the “Actual throughput (tokens per day per GPU node)” cell in this Guesstimate model, which estimates 200 million tokens per day. Average words per token is 0.75 (see https://perma.cc/T6M8-Q9BJ), so 200M tokens corresponds to roughly 150M words. The cost per hour comes from the "Reserved pricing" for 8x NVIDIA A100 80GB GPUs from Lambda, listed here: https://perma.cc/TTB9-B8TF.

Most CS graduates could in principle afford the financial cost of $240 to run BLOOM for one day, but running BLOOM for a year (say) would then cost ~$90K, which would only be affordable for perhaps tens to hundreds of individuals.
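The cost figures in this section can be reproduced with a few lines of arithmetic. The sketch below uses only numbers quoted in this post (200 million tokens per day, 0.75 words per token, $240 per day, and the listed Davinci price of $0.02 per 1000 tokens). The constant names are illustrative, and the hourly rate is simply the daily cost divided by 24, not a quoted price.

```python
# Figures taken from the post.
TOKENS_PER_DAY = 200e6           # estimated throughput of one 8x A100 80GB node
WORDS_PER_TOKEN = 0.75           # average English words per token
SELF_HOSTED_COST_PER_DAY = 240   # USD per day for the rented node
API_PRICE_PER_1K_TOKENS = 0.02   # USD, listed Davinci price

# Daily output in English words.
words_per_day = TOKENS_PER_DAY * WORDS_PER_TOKEN

# Cost of generating the same number of tokens through the API.
api_cost = TOKENS_PER_DAY / 1000 * API_PRICE_PER_1K_TOKENS

# Hourly rate implied by the daily cost, and cost of a full year of running.
implied_hourly_rate = SELF_HOSTED_COST_PER_DAY / 24
annual_cost = SELF_HOSTED_COST_PER_DAY * 365

print(f"Words per day:       {words_per_day:,.0f}")
print(f"API cost:            ${api_cost:,.0f}")
print(f"Self-hosted per day: ${SELF_HOSTED_COST_PER_DAY:,}")
print(f"Implied hourly rate: ${implied_hourly_rate:.2f}")
print(f"Self-hosted per year: ${annual_cost:,}")
```

On these assumptions, self-hosting is roughly 17 times cheaper per word than the API, provided the node is kept busy at the assumed utilization.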

## Talent pool: Thousands of people

Let’s consider the minimum talent required to download BLOOM and run the model on a separate cloud compute server.[40] I think this requirement is equivalent to a single top-one-percentile Computer Science graduate who has passed at least one course on natural language processing with deep learning, and who can spend three months full-time figuring out how to run the model. This is because a lot of the know-how to run the model is available on the internet, such that a strong machine learning background is not required to start with. For example, EleutherAI’s Discord server would have a lot of relevant information and people willing to help. Tools such as HuggingFace accelerate make it easier to use machine learning models with multiple GPUs (which seems to be required for models as big as BLOOM).
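To give a sense of how much of the work these tools absorb, here is a minimal sketch of loading BLOOM across several GPUs. It assumes the `torch`, `transformers`, and `accelerate` packages are installed and that the machine has enough GPU memory for the weights. The function name `generate_text` is hypothetical; this is one plausible approach, not a tested deployment recipe.

```python
MODEL_NAME = "bigscience/bloom"  # Hugging Face Hub ID of the full BLOOM model


def generate_text(prompt: str, max_new_tokens: int = 50) -> str:
    """Load BLOOM across the visible GPUs and generate a continuation."""
    # Imported lazily so this file can be read without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    # device_map="auto" relies on accelerate to spread layers over available
    # devices; bfloat16 matches the format of the released weights.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        device_map="auto",
        torch_dtype=torch.bfloat16,
    )
    # Move the tokenized prompt to the device holding the first layers.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_text("The diffusion of large language models")` would first download the full checkpoint, so the remaining difficulty is mostly in provisioning the server, storage, and throughput tuning rather than in the loading code itself.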

Besides that, I don’t have special reasons to specify the requirement as a single top-one-percentile CS graduate with introductory machine learning experience spending three months trying. It is just a concrete-enough requirement that is intuitively plausible to me. I think that the people in this set are a useful indication of the actual set, because it seems to overlap significantly with the actual set. For instance, I’m confident that high-percentile CS graduates make up more than 20% of the actual set.

Reasoning for the calculation:

1. According to datausa.io, there are currently about two million Computer Science graduates in the US workforce.
2. I’ll assume that 1% of these top 1% graduates have the requisite machine learning knowledge. Although machine learning seems to be a popular subject nowadays, many existing graduates would have graduated before machine learning was very popular, and fewer still would retain the knowledge through relevant work or continued study. My intuitive guess is that only 10% of the existing top graduates studied it, and only 10% of those have retained the requisite knowledge, hence 1%.
3. The US population is 335 million according to Worldometer, while the world population is 8 billion.
4. So a crude estimate of people meeting the above talent requirement is: 2 million / 335 million * 8 billion * 1% * 1% ~= 5000.
5. I think it’s likely that any of these 5000 people would have good internet access and be able to rent at least one 8x A100 80 GB GPU server from a cloud provider such as Lambda. One of these servers seems sufficient to run BLOOM because the server’s total memory of 640 GB is much larger than the amount of memory taken up by the model weights of BLOOM, which is 329 GB.[41]
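The steps above amount to a short Fermi calculation, sketched here with the figures from the list. The constant names are illustrative, and the scaling step assumes that the rest of the world has the same number of CS graduates per capita as the US, which is a strong simplification.

```python
# Inputs quoted in the list above.
US_CS_GRADUATES = 2_000_000
US_POPULATION = 335_000_000
WORLD_POPULATION = 8_000_000_000
TOP_PERCENTILE_SHARE = 0.01        # top-one-percentile graduates
ML_KNOWLEDGE_SHARE = 0.10 * 0.10   # studied ML, and retained the knowledge

# Scale US graduates to the world at the same per-capita rate.
world_cs_graduates = US_CS_GRADUATES / US_POPULATION * WORLD_POPULATION
capable_people = world_cs_graduates * TOP_PERCENTILE_SHARE * ML_KNOWLEDGE_SHARE

# Memory check for a single 8x A100 80 GB server.
SERVER_MEMORY_GB = 8 * 80
BLOOM_WEIGHTS_GB = 329
fits_on_one_server = BLOOM_WEIGHTS_GB < SERVER_MEMORY_GB
headroom_gb = SERVER_MEMORY_GB - BLOOM_WEIGHTS_GB

print(f"Estimated capable people: {capable_people:,.0f}")
print(f"Fits on one server: {fits_on_one_server} ({headroom_gb} GB headroom)")
```

The result is a little under 4,800, which the estimate rounds to 5000. The memory headroom matters in practice because activations and cached attention state also need space beyond the weights.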

As a lower bound, it seems implausible that the number could be any lower than the total number of “infrastructure engineers” I counted in my case studies, which was 73 (see this cell in the diffusion database). So I set an approximated lower bound at 100.

So my overall estimate is 5000 with a 90% CI of 100 to 45,000.

# Appendix: Cost of producing 1% of Twitter activity with BLOOM

See this Guesstimate model for calculations and reasoning.

Buchanan et al. (2021, p. 58) provide a point of comparison: "...creating enough content to equal in size to one percent of global Twitter activity would require hundreds of GPT-3s running 24/7 and would cost tens of millions of dollars per year." So my cloud-compute cost estimate ($160K) is about two orders of magnitude lower than theirs (~$10M). Their reasoning is not entirely clear, especially the calculation behind “hundreds of GPT-3s.” However, they seem to make the following different assumptions:

1. The actor buys the hardware rather than renting the hardware from a cloud vendor.
2. A ~2x larger memory footprint for the model than in my estimate. This is likely based on using an FP32 number representation rather than the FP16 number representation which is now more common for large language models, including BLOOM (it says “Bf16 weights” at https://huggingface.co/bigscience/bloom#speeds-sizes-times, which refers to the bfloat16 number representation).
3. Using V100 GPUs rather than the newer A100 GPUs:
    1. V100 has ~3x lower peak throughput (125 teraflop/s vs. 312 teraflop/s).
    2. V100 has less than half the memory capacity (32 GB vs. 80 GB), therefore requiring ~2x the number of chips to fit the model in memory.

Based on the rough factors of difference in (2) and (3), I get 2 * 3 * 2 = 12x overall. So the two orders of magnitude difference seems mostly, but perhaps not entirely, explained by the difference in assumptions that I came up with.
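To make "mostly, but perhaps not entirely" concrete, the sketch below compares the product of the rough factors with the observed gap between the two estimates. It uses only figures from this section. The ~$10M figure is itself a loose reading of "tens of millions of dollars", so the residual is indicative at best, and the buy-versus-rent difference in (1) is not quantified here.

```python
import math

# Cost estimates compared in this section (USD per year).
my_estimate = 160e3
their_estimate = 10e6
observed_ratio = their_estimate / my_estimate

# Rough factors from assumptions (2) and (3).
memory_footprint_factor = 2   # FP32 vs. FP16 weights
throughput_factor = 3         # V100 vs. A100 peak throughput (rounded)
chip_count_factor = 2         # extra V100s needed to fit the model
explained_ratio = (memory_footprint_factor
                   * throughput_factor
                   * chip_count_factor)

# What is left unexplained, as a plain ratio and on a log scale.
residual_ratio = observed_ratio / explained_ratio
explained_log_share = math.log(explained_ratio) / math.log(observed_ratio)

# The unrounded throughput ratio from the quoted specs, for comparison.
exact_throughput_factor = 312 / 125

print(f"Observed gap:  {observed_ratio:.1f}x")
print(f"Explained gap: {explained_ratio}x")
print(f"Residual:      {residual_ratio:.1f}x")
print(f"Share of the gap explained (log scale): {explained_log_share:.0%}")
```

A residual of roughly 5x remains, which the unquantified hardware-purchase assumption and any difference in assumed utilization could plausibly account for.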

# Appendix: Cost to run a GPT-3-size coding language model that is very commercially successful

See this Guesstimate model for calculations and reasoning.

# Acknowledgements

This research is a project of Rethink Priorities. It was written by Ben Cottier. Thanks to Alexis Carlier, Amanda El-Dakhakhni, Ashwin Acharya, Ben Snodin, Bill Anderson-Samways, Erich Grunewald, Jack Clark, Jaime Sevilla, Jenny Xiao, Lennart Heim, Lewis Ho, Lucy Lim, Luke Muehlhauser, Markus Anderljung, Max Räuker, Micah Musser, Michael Aird, Miles Brundage, Oliver Guest, Onni Arne, Patrick Levermore, Peter Wildeford, Remco Zwetsloot, Renan Araújo, Shaun Ee, Tamay Besiroglu, and Toby Shevlane for helpful feedback. If you like our work, please consider subscribing to our newsletter. You can explore our completed public work here.

1. ^

The 150 million words is somewhat arbitrary. The number came about in my estimate of how many tokens the BLOOM model could generate when running continuously on an 8x 80GB A100 GPU instance for 24 hours, at a typical hardware utilization rate.

2. ^

That said, my intuition is that the number of people who will actually learn how to run and then use BLOOM independently for some research or application, at any point in time since BLOOM was released, is much lower. My 90% CI for that number is 10 to 1000. I expect that most people who use BLOOM will use an API rather than run it themselves.

3. ^

Note that there are other (perhaps stronger) reasons to focus on the model development stage.

Firstly, the forms of diffusion that help actors develop models push AI progress forward more than the forms of diffusion that help actors deploy models. Pushing AI progress forward is what shortens AI timelines and thereby increases AI existential risk.

Secondly, a lot of AI existential risk comes from misaligned power-seeking AI rather than misuse by humans. I expect that reducing diffusion of deployment would have a smaller effect on this source of risk.

4. ^

The BLOOM announcement blog post states “We're finalizing an inference API for large-scale use even without dedicated hardware or engineering. In the meantime, for quick tests, prototyping, and lower-scale use, you can already play with an early version on the HF hub” (BigScience, 2022).

5. ^

My confidence is based on (a) skimming the papers and/or blog posts for all GPT-3-like models in the diffusion database for mention of model access; (b) the 20-billion parameter GPT-NeoX-20B model being “to the best of our knowledge, the largest dense autoregressive model that has publicly available weights at the time of submission” as of February 2022 (Black et al., 2022); (c) none of the experts that I consulted with, nor papers that I looked at, mentioned other models that are both GPT-3-like and widely available for download. (I did not ask any experts about this directly, but several experts mentioned BLOOM and OPT, so it’s likely that they would have also mentioned other widely-accessible models if they existed.) YaLM from the Russian tech company Yandex is a possible exception (which was in fact known to me), but given that it has only 100 billion parameters, my guess is that it does not have comparable performance to GPT-3.

6. ^

Throughout this sequence, “GPT-3” refers to the original 175-billion-parameter model that was first described in Brown et al. (2020) unless it is mentioned in the context of using the OpenAI API, which provides an updated version of the model.

7. ^

See Shevlane (2022, p. 105): a member of the OpenAI policy team told the author that “[researchers] can't make any changes to the underlying weights [of GPT-3]. They can't fine-tune it arbitrarily. They can't remove layers, they can't inspect the activations; they can't do all sorts of things.”

8. ^

See Usage Guidelines which describe the procedure for application review, and the content policy.

9. ^

See Shevlane (2022, p. 84): “the [OpenAI] API is designed to prevent users from stealing GPT-3 [...] the API comes with usage quotas, which users must apply to increase.”

10. ^

The cost of millions of dollars is based on my training compute cost estimates for OPT-175B and BLOOM. See this column in the diffusion database.

11. ^

An example of this is AI21 Labs with Jurassic-1-Jumbo, provided via AI21 Studio (AI21 Labs, 2022).

12. ^

GPT-NeoX-20B does not meet my definition of a GPT-3-like model, but it still serves as an informative case study.

13. ^

See Zhang et al. (2022, p. 8): "Given our primary goal as a replication of GPT-3..."

14. ^

See the note on this cell in the diffusion database. I have not investigated whether the lower performance is significant in terms of how useful the model is, and I lack the intuition to judge this at face value. Zhang et al. (2022, p. 8) claim a “parity in performance for standard evaluation datasets used in the GPT-3 models,” but I didn’t find a clear statistical basis for this claim in the paper.

15. ^

See Zhang et al. (2022), Introduction, p. 1: "We are releasing all of our models between 125M and 66B parameters, and will provide full research access to OPT-175B upon request. Access will be granted to academic researchers; those affiliated with organizations in government, civil society, and academia; and those in industry research laboratories." The form to request model access includes fields for “Organization / Affiliation,” “Intended Use,” and “Previous related publications.”

16. ^

Based on my estimated number of natural language processing researchers at top universities. I also estimate this number is less than the estimated number of applications that can be processed in one year. See this Guesstimate model for further details.

17. ^

See Zhang et al. (2022), Introduction, p. 1: “will provide full research access to OPT-175B upon request.” I interpret this as making the OPT-175B trained model weight file(s) available for download to the requester.

18. ^

See access request form: “Subject to your compliance with the Documentation and Sections 2, 3, and 5, Meta grants you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty free and limited license under Meta’s copyright interests to reproduce, distribute, and create derivative works of the Software solely for your non-commercial research purposes.” Section 2 places restrictions on copying for certain purposes or copying without including the copyright, but not total restriction.

19. ^

The search can be roughly replicated at this link, but I failed to obtain a working archived copy of the search.

20. ^

I have not figured out when the API was released, but I only became aware of it in October 2022.

21. ^

For the influence of Megatron-LM on BLOOM, see https://huggingface.co/bigscience/bloom#model-architecture-and-objective: the BLOOM model architecture is "Modified from Megatron-LM GPT2 (see paper, BLOOM Megatron code)". The BLOOM Megatron code (https://github.com/bigscience-workshop/Megatron-DeepSpeed) is "a detached fork of https://github.com/microsoft/Megatron-DeepSpeed, which itself is a fork of https://github.com/NVIDIA/Megatron-LM." The original Megatron-LM code was open-sourced to accompany Shoeybi et al. (2019).

22. ^

For the influence of GPT-2 on Megatron-LM, see Shoeybi et al. (2019), Abstract, p.1: “...we train an 8.3 billion parameter transformer language model similar to GPT-2.”

23. ^

For GPT-3 see the paper, p.14: “Although GPT-3’s training data is still primarily English (93% by word count), it also includes 7% of text in other languages.” For BLOOM, see model card: English is only 30.04% of the training data (presumably also measured by word count).

24. ^

This is based on the following evidence. When I averaged the normalized accuracy on tasks that BigScience has evaluated for both BLOOM and OPT-175B, both models achieved approximately 47% accuracy. OPT-175B, in turn, had 2% less accuracy on average compared to GPT-3, on the tasks that OPT-175B was evaluated on in Zhang et al. (2022, p. 17). So this suggests that BLOOM is similarly worse than GPT-3 on those tasks. A big caveat to this is that the set of tasks that BigScience has evaluated for both BLOOM and OPT-175B seems far from comprehensive. See this Colab notebook for the calculations and further explanation.

25. ^

The 329 GB size was listed under “Checkpoint size” at https://huggingface.co/bigscience/bloom#speeds-sizes-times

26. ^

Compute cost estimates are just based on cloud compute prices, and exclude the cost of other hardware such as a laptop to set up the cloud computing instance.

27. ^

By “direct” I mean the people and skills that are required to set up the model and keep the model running in the deployment setting, excluding people that maintain software dependencies (e.g. PyTorch), or people that give advice on how to do deployment.

28. ^

This means the deployment with the highest volume of model outputs that (a) would be possible for at least one actor to do by now if they tried; (b) is worth the cost—not necessarily in terms of financial revenue, but in achieving the actor's goal. See this Guesstimate model for calculations (the method is also explained in the main text).

29. ^

Inference means passing data into the model and obtaining an output. This is also known as a “forward pass” of the model.

30. ^

By “hosted” I mean that the organization stores the model on a server, and runs the model on hardware that is owned or rented by the organization.

31. ^

See for example the OpenAI API Usage Guidelines which describe the procedure for application review, and the content policy.

32. ^

The listed price for Davinci (which is presumably some version of the 175-billion parameter GPT-3 model) is $0.02 per 1000 tokens. 1000 tokens is roughly 750 English words based on this page. Therefore 150,000,000 words requires 150e6 * 0.02 / 750 ~= $4000.

33. ^

Credit to Buchanan et al. (2021) section 4 (starting p. 55) for the inspiration for this scenario.

34. ^

Note that I am glossing over the actual capability of BLOOM to automate disinformation effectively. On this point (but substituting GPT-3 for BLOOM), Buchanan et al. (2021) concluded that “although GPT-3 will not replace all humans in disinformation operations, it is a tool that can help them to create moderate- to high-quality messages at a scale much greater than what has come before.” As I explained earlier, BLOOM seems less capable overall than GPT-3, so the quality of messages would generally be lower, or a human operator would need to spend more time ensuring the messages are high enough quality.

35. ^

Again, I am not accounting for the likelihood of a GPT-3-size coding language model being able to improve ~1 million software developers’ productivity by 1-10%. However, I think this is plausible given that OpenAI Codex is an existing 20-billion-parameter model that is already being marketed as a tool to improve developer productivity (OpenAI, 2022). Intuitively, I think that users wouldn’t be willing to adopt Codex (or tools building on Codex) in the long-term if they didn’t expect to get an overall productivity improvement of 1% or more.

36. ^

After I made these estimates, I obtained a reference class estimate. The reference class was the team working on GitHub Copilot, GitHub’s code suggestion tool powered by OpenAI Codex, which is a 20-billion parameter language model trained on code. I searched the term "GitHub copilot" on LinkedIn, filtered by "People", and then reviewed the first 4 pages of results for people that appeared to be currently working as engineers or developers for GitHub Copilot (after the 4th page, the results did not seem relevant enough to be worth continuing). I found 4 ML or Research Engineers, and 8 Software or Data Engineers, making 12 people in total. I think it's most likely that this LinkedIn search underestimates the true number of contributors, due to false negatives. This estimate is close to my intuitive estimate, but it should be taken as weak evidence due to being one case with a limited methodology. See this document for more details on the method. Due to time constraints, I did not use this evidence to update my final estimate.

37. ^

The three months is just an intuitive estimate based on project durations in my 1.5 years of experience in software engineering at a company that deployed ML models.

38. ^

See Leopold (2019) (reports 80-90%) and Barr (2019) (reports “up to 90%”).

39. ^

See Guesstimate model for calculations.

40. ^

Note: there is already an API to run inference with BLOOM here, but I think it’s useful to consider the general case where an actor deploys independently on a separate server, with less limit on usage.

41. ^

See the BLOOM model card—“Speeds, Sizes, Times” section.


This is likely based on using a FP32 number representation rather than the FP16 number representation which is now more common for large language models, including BLOOM

BLOOM and Galactica-120B already support INT8. GLM-130B supports INT4, and the developers of LLM.int8() are working on INT4.

However, I am 80% confident that before July 2022, no other GPT-3-like models had their trained weights widely available for download.

I estimate there are 5000 people (90% CI: 100 to 45,000) that are capable of running BLOOM independently.[2]

I ran Galactica-120B before via HuggingFace; it only took me about 5 hours and $10. Considering that BLOOM is also a HuggingFace model (which almost always run easily), this seems like a serious underestimation. The number of people who feel capable of running such a model and are interested in doing so, however, is much smaller.

Most CS graduates could in principle afford the financial cost of $240 to run BLOOM for one day, but running BLOOM for a year (say) would then cost ~$90K which would only be affordable for perhaps tens to hundreds of individuals.

Many software engineers make post-tax $190K or $300K a year.

I have made a big update regarding this claim:

What about for a very large-scale application of a GPT-3-like model—for example, generating text equivalent to 1% of global Twitter activity for one year, or assisting one million software developers with coding for one year? I estimate that deploying a model like BLOOM in these ways would be 20% of the cost of developing the model (90% CI: 10 to 68%), in terms of the dollar cost of compute alone. This means that deployment is most likely much less prohibitive than development. But it means I give a 5% chance that for the largest-scale applications, the cost of deploying the model is at least 68% of the cost of developing the model, which would make deployment similarly prohibitive.

The claims about the cost of the specific deployment scenarios (which were oversimplified to begin with) may still be fairly accurate. But in terms of the intent behind the estimates I made, I think I greatly underestimated the largest scale of deployment for LLMs, a scale which is becoming more common and which I understand a little better. I now think that for the largest, most commercially successful LLMs, the total compute spent on deployment is much larger than in development.

My update was mostly influenced by several more sources (and more credible sources than the ones I reviewed in the post) suggesting that the total compute that major AI companies spend on inference is significantly larger than the total compute spent on training and experimentation:

1. https://arxiv.org/pdf/2111.00364.pdf, p.3, Fig. 3 caption: "At Facebook, we observe a rough power capacity breakdown of 10:20:70 for AI infrastructures devoted to the three key phases — Experimentation, Training, and Inference". Also, "Considering the primary stages of the ML pipeline end-to-end, the energy footprint of RM1 is roughly 31:29:40 over Data, Experimentation/Training, and Inference".[1][2]
2. https://arxiv.org/abs/2204.05149, p.7: "Across all three years, about ⅗ of ML energy use is for inference and ⅖ for training. These measurements include all ML energy usage: research, development, testing, and production."
3. https://www.semianalysis.com/p/the-inference-cost-of-search-disruption: "inference costs far exceed training costs when deploying a model at any reasonable scale. In fact, the costs to inference ChatGPT exceed the training costs on a weekly basis."

However, this doesn't significantly update my conclusion about the importance of focusing on development rather than deployment as a target of intervention (point 2c in the Key Takeaways). This is because of the other strong reasons to focus on development that I mention. I would revise point 2c to say that, even if the amount of compute is smaller in total, the compute you have to spend on training tends to be more up-front and all-or-nothing than deployment, which can be scaled quite smoothly. This creates a greater barrier.

I have edited the post to point out this comment, but for the sake of posterity and prioritizing other projects, I won't be updating the rest of the post.

1. ^

Power and energy usage are not 1-1 with compute usage, especially over time as new hardware improves energy efficiency. But there is a clear relationship: computation requires running GPUs for some time, which consumes a fairly consistent amount of average power. I don't expect that improvements in energy efficiency have a big impact on the ratio of development and deployment compute.

2. ^

RM1 denotes one of Facebook's six models that "account for a vast majority of compute resources for the overall inference predictions at Facebook, serving billions of users world wide" (see footnote 4 on p.4). RM1 is the single most carbon-intensive model out of these six models (see Fig 4 on p.4).

What do you think are the main reasons behind wanting to deploy your own model instead of training an API? Some reasons I can think of: