Diminishing Returns in Machine Learning Part 1: Hardware Development and the Physical Frontier

Brian Chau

This is a linkpost for https://www.fromthenew.world/p/diminishing-returns-in-machine-learning

This is a crosspost of part 1 of my article on diminishing returns on machine learning. I hope it is informative and relevant to all of you. ~Brian

The release of ChatGPT sparked a sensational reaction among media and ordinary people alike. It reached 100 million users faster than any web app in history, including TikTok and Instagram. It is set to drastically change the economy by automating repetitive cognitive work in many walks of life. GPT-4, OpenAI’s latest release, scores within the 80th percentile of humans on many academic tests. A broader suite of machine learning technologies can, as Samuel Hammond puts it, give people “more capabilities than a CIA agent has today”.

This has led both optimists and pessimists to speculate about the likelihood of AGI - artificial general intelligence. Open Philanthropy defines AGI as “AI that can quickly and affordably be trained to perform nearly all economically and strategically valuable tasks at roughly human cost or less.” According to Sam Altman, CEO of OpenAI, “Our mission is to ensure that artificial general intelligence—AI systems that are generally smarter than humans—benefits all of humanity.” Critics of OpenAI’s ambitions often cite the risk of Unaligned Artificial Superintelligence, an AGI which is more intelligent than humanity to the point of being able to overthrow world governments and inflict catastrophic damage, for example by using nuclear weapons or bioweapons.

In my view, the severity of these concerns is based on incorrect assumptions. The implicit (if not explicit) attitude of Silicon Valley is one of indefinite growth. For good or for ill, its members consistently believe that the trends of new technological research will continue at their current pace and in their current direction. If anything, proponents of fast AGI believe that the rate of innovation is likely to be faster in the future than in the present. The purpose of this three-part series is to argue the opposite. I collect both empirical data about past progress in machine learning and an object-level description of the methods which have produced this progress. Using both, I argue that the varying methods used to achieve the current pace of machine learning progress will follow a peaking S-curve; they have largely either stagnated or will likely stagnate within the next ten years.

While exact predictions about the year AGI is invented, or whether it is invented at all, also depend on other factors such as the complexity of human thinking, the homogeneity of tasks, and the difficulty of information aggregation in institutions, what is certain is that the speed of AI development is an important factor. I believe that Effective Altruists generally overestimate the speed of future AI development and should consequently significantly reduce their estimates on the speed and likelihood of AGI development after this essay.

Parallelization

To understand the rate of machine learning progress, we have to understand how it is achieved. The single most important mechanism used to enhance AI across hardware, software, and data layers has been parallelization, the simultaneous processing of different actions. The underlying principle is simple: doing multiple things at once is faster than doing them one at a time. Examples of practical ideas to speed up machine learning using this principle are creating hardware which processes entire arrays of operations simultaneously, splitting processes within a model into independently functioning components, or combining multiple models. A convenient feature of machine learning algorithms is that they can be parallelized in several ways. For example:

Large matrix and vector multiplications can be parallelized at the hardware level
Different components of a machine learning network, which can be run separately without dependencies on each other, can be run on different pieces of hardware
The substance that is actually being processed can be subdivided. For example, a self-driving vehicle may need to avoid both moving vehicles and static obstacles, which may be more efficient to process separately.

Source: https://openai.com/research/techniques-for-training-large-neural-networks

This article will focus on optimizations to machine learning hardware, including parallelization. Optimization of software and data usage will be discussed in later parts. You will definitely see the theme of parallelization appear frequently, so it’s good to start building an intuition about it.
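To start building that intuition, here is a minimal sketch of the "subdivide the work" idea in Python. The function names and the choice of summing squares are illustrative assumptions, not anything from a real ML system. The point is only the structure: each chunk can be processed without knowing anything about the others, and the partial results are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(chunk):
    """Process one chunk; it needs nothing from any other chunk."""
    return sum(x * x for x in chunk)


def split(data, n_chunks):
    """Split data into at most n_chunks contiguous pieces of similar size."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum_of_squares(data, n_workers=4):
    """Hand each chunk to a separate worker, then combine the partial sums."""
    chunks = split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


data = list(range(1_000))
sequential = sum_of_squares(data)
parallel = parallel_sum_of_squares(data)
```

In CPython, threads generally will not make pure-Python arithmetic like this faster, so treat this as a picture of how the work is divided rather than a benchmark. Real speedups come from hardware that executes the independent pieces at the same time.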

When it comes to hardware, parallelization matters primarily because of low-level vector and matrix operations. You can think of a vector as a list of numbers. To find the sum of two vectors of the same length, you add the first elements of the two lists together, then the second elements, and so on. The same is true for the element-wise multiplications inside a (dot) product. You can process each of these operations separately, as is intuitive.

https://www.wikihow.com/Add-or-Subtract-Vectors
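As a small illustration of the vector operations described above, here is a sketch in plain Python (the function names are my own). Each element of the sum depends on only one pair of inputs, so every position could in principle be computed at the same time. In the dot product, the multiplications are independent and only the final summation combines them.

```python
def vector_add(a, b):
    """Element-wise sum; position i depends only on a[i] and b[i]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x + y for x, y in zip(a, b)]


def dot(a, b):
    """Dot product: independent multiplications, then one combining sum."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    products = [x * y for x, y in zip(a, b)]  # each product is independent
    return sum(products)                      # the only step that combines them


total = vector_add([1, 2, 3], [4, 5, 6])
product = dot([1, 2, 3], [4, 5, 6])
```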

Matrix operations work the same way, except in two dimensions. Matrix multiplications essentially apply vector multiplication to pairs of rows in the first matrix and columns in the second matrix (or vice versa). The details are not too important for this article.
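For readers who do want the details, a minimal sketch of matrix multiplication as described above: every entry of the output is the dot product of one row of the first matrix with one column of the second. The implementation is a plain-Python illustration, not how any ML library actually does it. Since no output entry depends on any other, all of them could be computed simultaneously.

```python
def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    inner = len(B)
    if any(len(row) != inner for row in A):
        raise ValueError("columns of A must match rows of B")
    columns = list(zip(*B))  # transpose B so its columns are easy to iterate
    return [
        [sum(a * b for a, b in zip(row, col)) for col in columns]
        for row in A
    ]


C = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```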

In the context of this article, the key takeaway from this section is that machine learning involves doing large quantities of repetitive math on rows or grids of numbers, which can be processed simultaneously.

A Paradigm Shift in Hardware

Specialized hardware for machine learning is a relatively new phenomenon. For much of history, machine learning has relied on the same hardware as everything else: CPUs (Central Processing Units), which aren’t capable of much parallelization. The first upgrade came from GPUs (Graphics Processing Units), which, as the name suggests, were designed for rendering detailed graphics more efficiently. Finally, the last few years marked the development of more specialized ways to use GPUs, as well as completely specialized hardware such as Google’s TPUs (Tensor Processing Units). Each of these developments is an increasingly efficient solution to the problem of multiplying many numbers in parallel.

Source: https://epochai.org/blog/compute-trends. You’ll see this again in the quantifying progress section.

There were two key points in machine learning hardware development: the introduction of graphical hardware and, later, specialized libraries for machine learning. When GPUs were adopted for machine learning, they produced an immediate severalfold increase in both training and inference (answering) speeds relative to the CPUs of most computers, a gap which only widened with further development. In 2007, NVIDIA introduced the first version of CUDA (Compute Unified Device Architecture), a programming tool for GPUs. CUDA was a crucial innovation for the development of machine learning because it allowed easy modification of the hardware-level instructions of GPUs.

[ Aside: One pet peeve for me is when people strongly emphasize Moore’s law in their arguments about machine learning hardware. Moore’s law, which measures transistor density, mostly measures materials science developments in how small and compact transistors can be manufactured. On the other hand, the majority of hardware improvement in machine learning has been due to completely unrelated developments in the organization of specialized hardware used for parallelization.]

Here comes the second crucial factor of hardware optimization. When thinking about the effectiveness of computer hardware, it’s important to understand that hardware does not understand complicated programming languages such as Python or JavaScript, languages which many humans know how to write. Instead, intermediate programs (interpreters, compilers, assemblers) are used to translate human programming languages into increasingly repetitive and specific languages until they become hardware-readable machine code. This translation is typically done through strict, unambiguous rules, which is good from an organizational and cleanliness perspective, but often results in code which consumes orders of magnitude more low-level instructions (and consequently, time) than if they were hand-translated by a human. This problem is amplified when those compilers do not understand that they are optimizing for machine learning: compilation protocols optimized to render graphics, or worse for CPUs, are far slower.
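To make the translation step a little more concrete, here is a small sketch using Python's standard `dis` module, which lists the bytecode instructions the interpreter produces for a function. Note the assumption: Python bytecode is an intermediate representation, not hardware machine code, and the exact instruction names vary between Python versions. It is only meant to show one readable line of source expanding into several lower-level steps.

```python
import dis


def add(a, b):
    c = a + b
    return c


# Collect the lower-level instructions generated for the two-line function.
instructions = list(dis.get_instructions(add))

for ins in instructions:
    print(ins.opname, ins.argrepr)
```

The listing typically shows separate steps for loading the operands, performing the addition, storing the result, and returning it.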

An analogy is as follows: imagine taking the world’s most powerful nuclear power plant and funneling all of its electricity to running the world’s largest flashlight. Then, the flashlight is shined on a gigantic array of solar panels in order to finally deliver energy to the public. This process would obviously be highly inefficient and lose energy in all of its intermediate processes. The same can be said for intermediate steps in compilation.

These inefficiencies were removed as people began developing more specific practices for machine learning, including both detailed instruction sets for well-known machine learning network operations (CUDA kernels), as well as hardware specialized solely for machine learning, such as Google’s TPUs (Tensor Processing Units). There is no precise public timeline for when this development began to be taken seriously, but there are approximate markers. NVIDIA’s CUDA library for machine learning, cuDNN, was released in 2014. Google’s TPUs were introduced to the public in 2016. Most researchers and engineers tend to agree that the early-mid 2010s were when ML hardware began to be taken seriously.

[Note: Some researchers would not consider CUDA optimization to be hardware optimization. It is technically a change in what software is being run. However, for the purposes of categorizing / pacing this series of articles, it makes more sense to put it here than to put it with the changes in algorithms covered in the next article in this series.]

Quantifying Hardware Improvement

As with any quantitative argument, the first questions you should ask are “what” and “how”. Metrics for machine learning performance are typically separated into two categories: training and inference. Training refers to the initial step in which a machine learning model receives feedback from data and improves its accuracy. Inference is the step in which the user actually interacts with the network, for example when you ask ChatGPT a question. There are several related metrics often used to quantify machine learning speed:

Throughput: the amount of data processed per second
Efficiency: the amount of data processed per watt
Cost-effectiveness: the amount of data processed per dollar

Sometimes the amount of data is replaced with the number of operations, but this is typically unimportant. Moreover, while differences between these metrics matter for technical tradeoffs and use cases, it is extremely unlikely that one of these metrics will drastically improve far beyond the other two. Consequently, I use “speed” as a vague umbrella term for all of these measures. However, the difference in metrics does mean it will be somewhat difficult to precisely compare different data sources.
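As a rough sketch of how the three metrics listed above relate, the snippet below computes each from a single benchmark run. The `BenchmarkRun` class and all of its numbers are hypothetical, chosen only to make the arithmetic easy to follow. I read "per watt" as throughput divided by average power draw, which is an assumption about the intended definition.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRun:
    samples: int      # items of data processed during the run
    seconds: float    # wall-clock duration of the run
    avg_watts: float  # average power draw during the run
    dollars: float    # total cost attributed to the run


def throughput(run):
    """Data processed per second."""
    return run.samples / run.seconds


def efficiency(run):
    """Throughput per watt (equivalently, samples per joule)."""
    return throughput(run) / run.avg_watts


def cost_effectiveness(run):
    """Data processed per dollar."""
    return run.samples / run.dollars


# Made-up numbers for illustration only.
run = BenchmarkRun(samples=1_200_000, seconds=600.0, avg_watts=250.0, dollars=0.5)
```

All three are ratios with the same numerator, which is one reason they tend to move together when hardware improves.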

What makes compiling these data sources far more annoying, though, is the “how”. When referring to the earlier benchmarks, I used the term “model” instead of hardware. This is because different models, or machine learning algorithm structures, differ in speed. Machine learning hardware comparisons are made using a fixed set of models and training data. However, different studies use different models and data. So, keep in mind that you can’t simply list the numbers from each of the studies in one big ranking; you can only use the relative performance of different hardware in the same study as an approximate metric. Another reason these comparisons may vary based on methodology is that hardware is often optimized for different goals: it may be optimized for training or for inference (responding to queries from the user). All of this is to say that while it is possible to look at broad trends, looking at the precise tradeoffs along the efficient frontier, as on-the-ground ML companies do when making purchases, is beyond the scope of this article.
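A minimal sketch of the "relative performance within the same study" idea follows. The study names, hardware labels, and figures are all made up for illustration. Raw numbers from two studies are in different units and on different models, so they cannot be ranked together, but each study's results can be normalized against a baseline from that same study.

```python
def relative_speedups(study, baseline):
    """Express each result as a multiple of the baseline from the same study."""
    base = study[baseline]
    return {hardware: value / base for hardware, value in study.items()}


# Hypothetical results: different models, data, and units in each study.
study_a = {"gpu_old": 100.0, "gpu_new": 250.0}  # images/second on model X
study_b = {"gpu_old": 40.0, "gpu_new": 90.0}    # sequences/second on model Y

rel_a = relative_speedups(study_a, baseline="gpu_old")
rel_b = relative_speedups(study_b, baseline="gpu_old")
```

Here the two hypothetical studies disagree on the size of the improvement (2.5x versus 2.25x), which is the kind of spread that makes the ratios an approximate metric rather than an exact one.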

Instead, the main point of this section is to ground expectations and future claims in some empirical data. The existing data is far too incomplete to support an unambiguous conclusion; instead we will need to use a combination of the data and a technical understanding of the underlying causes of machine learning progress. I don’t expect anyone to draw a definite conclusion from what is presented here, but to at least narrow the range of possibilities slightly.

So, onto the data. This is arguably the highest quality paper on hardware comparison over time:

https://arxiv.org/pdf/2106.04979.pdf

Pay particular attention to the bottom row. For products targeted towards individual users, improvement is highest in the most recent hardware, RTX 2060 SUPER (2019). However, in industry products, improvement is higher in hardware released in the middle years, P100 (2016) and V100 (2017).

In my view, this represents a lag in consumer-end products hitting the efficient frontier of hardware tradeoffs that matches what I hear from machine learning researchers and engineers.

Here are some comparisons of newer releases, both consumer and industry:

https://lambdalabs.com/blog/best-gpu-2022-sofar

https://lambdalabs.com/gpu-benchmarks

The amount of intra-generation (i.e. between H100s) variation may point to significant variation due to optimization for the tasks chosen for the benchmark.

Here is a blog comparing performance per dollar:

https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/

As expected by economies of scale, newer hardware is somewhat less efficient per dollar. Consumer hardware is also more efficient per dollar, though that’s likely also due to lacking some of the more costly optimizations of the higher-end hardware.

Here is some very outdated preliminary data on Google’s TPUs. I’m not sure what to make of this.

https://techcrunch.com/2017/04/05/google-says-its-custom-machine-learning-chips-are-often-15-30x-faster-than-gpus-and-cpus/

the TPUs are on average 15x to 30x faster in executing Google’s regular machine learning workloads than a standard GPU/CPU combination (in this case, Intel Haswell processors and Nvidia K80 GPUs). And because power consumption counts in a data center, the TPUs also offer 30x to 80x higher TeraOps/Watt

The results are likely somewhat exaggerated or cherry-picked as I don’t believe Google would have such an enormous advantage over NVIDIA and not be drastically scaling up its production/distribution of hardware. This view is shared by Semianalysis:

We believe Google has a performance/total cost of ownership (perf/TCO) advantage in AI workloads versus Microsoft and Amazon due to their holistic approach from microarchitecture to system architecture. The ability to commercialize generative AI to enterprises and consumers is a different discussion.

They provide a good example of this problem in practice:

The point is especially clear with Google’s own TPUv4i chip, which was designed for inference, yet cannot run inference on Google’s best models such as PaLM. The last-generation Google TPUv4 and Nvidia A100 could not have possibly been designed with large language models in mind.

OpenAI’s Triton instruction set has more modest improvements:

https://openai.com/research/triton

A particularly high quality source that doesn’t fit exactly into this part or the next is EpochAI’s dataset of the total compute used by models.

Compute usage is a distinct metric from hardware effectiveness, since it also includes the gains from parallelizing the algorithm (running an ML algorithm across multiple copies of the same hardware), or from simply allowing the ML algorithm to run for a longer period of time. However, this data point is higher quality than most and is still useful in identifying trends. We can observe roughly similar changes in development trends as in the hardware trends.

At this point I’d like to remind everyone of the basic principle that scientific evidence is a negative test, not a positive test. It’s fairly common for someone to gather a bunch of data points, give a narrative that aligns with them, and then pretend that the evidence they’ve provided shows that their narrative is the only possible one. The purpose of this section is only to show that the limited evidence up to the current day is consistent with the narratives I tell in this article and later articles.

The Kernel Curse

This section covers the optimization of CUDA kernels, or machine learning instruction sets written specifically for smaller components used in an ML model. While CUDA does refer to the specific set of instructions released by NVIDIA, the same principles for optimization apply for alternative instruction sets, such as those from Google or OpenAI.

Recall the metaphor of the nuclear power plant and the flashlight. You can only remove the flashlight and deliver the plant’s electricity directly once. This intuition mostly holds for low-level instruction optimization. It may be the case that a senior software engineer can write a slightly faster instruction set than a junior engineer for the same machine learning operation. But the gap between the two is minuscule compared to the gap between the junior engineer and an auto-generated instruction set.

A feature of CUDA optimization is that it is typically hyper-specific to individual machine learning kernels. This is a double-edged sword: research and development efforts concentrate on experimenting with functions that have optimized instruction sets and (relatively) neglect those that do not. This incentivizes an iterative model of optimization, and means that if a model requiring new functions is adopted, there will be some time until it reaches the level of CUDA optimization of incumbent models.

Parallelization and Resource Constraints

The first obvious consequence of parallelization is that in many cases*, the same amount of computation occurs, but is simply distributed over more hardware. Moreover, data transfer represents a nontrivial amount of time and electricity consumption, so more aggregate energy and device-time is being spent. Historically, resource constraints have not posed a significant limitation on the scaling that can be done through parallelization, at least for top research organizations. Instead, talent, or more specifically the ability to separate processes within machine learning networks, was the constraint. This is about to change.

[*It is true that there are cases where parallelization is done in a way which makes the overall algorithm more efficient, such as sparse matrix multiplication. However, this is primarily done at the algorithmic rather than hardware level. I will likely discuss this in more detail in the second article. ]

Semianalysis once again has an estimate for the costs of further scaling up models:

Regarding parameter counts growth, the industry is already reaching the limits for current hardware with dense models—a 1 trillion parameter model costs ~$300 million to train. With 100,000 A100s across 12,500 HGX / DGX systems, this would take about ~3 months to train. This is certainly within the realm of feasibility with current hardware for the largest tech companies. The cluster hardware costs would be a few billion dollars, which fits within the datacenter Capex budgets of titans like Meta, Microsoft, Amazon, Oracle, Google, Baidu, Tencent, and Alibaba.

Another order of magnitude scaling would take us to 10 trillion parameters. The training costs using hourly rates would scale to ~$30 billion. Even with 1 million A100s across 125,000 HGX / DGX systems, training this model would take over two years.

They later correctly point out that improvements to model architecture can make training more efficient and push out this boundary. However, this is nonetheless an example of an avenue for unconstrained growth turning into a tradeoff.
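The arithmetic behind the quoted estimate can be roughly reconstructed. This is a sketch under stated assumptions, not Semianalysis's actual method: I take "~3 months" to mean 90 days and assume the same implied hourly rate applies at both scales. The GPU counts and dollar figures come from the quote; everything else is derived.

```python
HOURS_PER_DAY = 24


def gpu_hours(n_gpus, days):
    """Total device-hours for n_gpus running continuously for the given days."""
    return n_gpus * days * HOURS_PER_DAY


# 1 trillion parameters: 100,000 A100s for ~3 months (assumed 90 days), ~$300M.
hours_1t = gpu_hours(100_000, 90)
implied_rate = 300e6 / hours_1t  # implied dollars per GPU-hour

# 10 trillion parameters: ~$30B at the same implied rate, on 1,000,000 A100s.
hours_10t = 30e9 / implied_rate
days_10t = hours_10t / (1_000_000 * HOURS_PER_DAY)
years_10t = days_10t / 365

print(f"implied rate: ${implied_rate:.2f} per GPU-hour")
print(f"10T training time: {days_10t:.0f} days (~{years_10t:.1f} years)")
```

Under these assumptions, a hundredfold increase in cost spread over ten times as many GPUs still requires ten times the wall-clock time, which is consistent with the quoted "over two years".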

Parallelization sits as the unifying goal of many ML engineers. It is an obvious, direct way to generate improvements. As the constraint moves from technical talent to physical cost of the large arrays of hardware, it is likely that a change in approach is necessary.

Meta-Trends

To summarize the key points from the article:

Machine learning involves repetitive operations which can be processed simultaneously (parallelization)
The goal of hardware optimization is often parallization
The widespread development of machine learning hardware started in mid-early 2010s and a significant advance in investment and progress occurred in the late 2010s
CUDA optimization, or optimization of low-level instruction sets for machine learning operations (kernels), generated significant improvements but has exhausted its low-hanging fruit
The development of specialized hardware and instruction sets for certain kernels leads to fracturing and incentivizes incremental development, since newer kernels will be unoptimized and consequently slower
The tradeoff between hardware cost and labor cost is only beginning to be reached and will lead to an additional constraint on AI development.

I believe similar patterns of deceleration are at play across all of machine learning. I’ll make the case for these patterns in the context of architecture (algorithms) in part two and data in part three. The implications are significant:

AI progress in general is slowing down or close to slowing down.
AGI is unlikely to be reached in the near future (in my view <5% by 2043).
Economic forecasts of AI impacts should assume that AI capabilities are relatively close to the current day capabilities
Overregulation, particularly restriction on access rather than development, risks stomping out AI progress altogether. AI progress is neither inevitable nor infinite.

It is true that in some cases, a slowdown in growth in one area is counteracted by newly discovered growth in other areas. But this is not guaranteed: technological innovation in manufacturing, transport, chemistry, medicine, and many other fields has drastically slowed, if not stagnated completely, in the present day. Even the famous Moore’s law, describing transistor density, has had to repeatedly lengthen its doubling period due to slowing technological growth.

None of this is to downplay the economic impacts of machine learning technology. Even the widespread application and adoption of current ML models with zero further improvement would be an economic revolution. Instead, it is to oppose the confident assertion that machine learning progress will only further accelerate in the future. That is one possible scenario, though it is far from guaranteed and in my view, for the reasons stated in this part and future parts, an improbable one. When the general narrative and esoteric data conflict, it is often difficult to determine which one is correct. However, when they both align, they should be taken as a baseline which requires truly exceptional circumstances to overcome.

The stakes are high, but they are bidirectional. There are real consequences for overcorrection and undercorrection. I couldn’t have put it better than Jon Askonas and Samuel Hammond did in American Affairs:

It is our hope that [AI Safety] eschews an over-focus on extreme tail risk and instead develops measures that improve the safety and reliability of the kinds of systems we are likely to deploy, and does so in a manner respectful of the governance traditions that have underpinned scientific progress in the West.

aog | May 27 2023 | 12

Very interesting article. Some forecasts of AI timelines (like BioAnchors) are premised on compute efficiency continuing to progress as it has for the last several decades. Perhaps these arguments are less forceful against 5-10 year timelines to AGI, but they're still worth exploring.

I'm skeptical of some of the headwinds you've identified. Let me go through my understanding of the various drivers of performance, and I'd be curious to hear how you think of each of these.

Parallelization has driven much of the recent progress in effective compute budgets. Three factors enable parallelization:

Hardware
- GPUs are more easily parallelized than CPUs, as they have more cores and higher memory bandwidth. Will hardware continue its current pace of improvement?
- You cite an interesting paper on Nvidia GPU progress over time; it seems that the greatest speedups in consumer hardware came in the most recent generation, but improvements in industry-grade hardware peaked earlier, with the P100 in 2016.
- This doesn't strike me as strong evidence in any direction. Industrial progress has slowed, consumer progress has accelerated, and there are wide error bars on both of those statements because they're drawn from only four data points.
- Stronger evidence seems to come from Epoch's Trends in GPU Price Performance, showing that FLOP/s per dollar has doubled every two or three years for nearly two decades. Do you expect this trend to continue, and if not, why?
Kernels
- Software like CUDA allows developers to specify the ordering of computations and memory transfers, which reduces idling time and improves performance. You say that "CUDA optimization...generated significant improvements but has exhausted its low-hanging fruit," but I'm not sure what the argument is for that.
- You do argue that the importance of kernel optimization reduces experimentation with new algorithms. I agree, but I see a different upshot. One of the biggest reasons to be bullish on ML performance is the rise of AI programming assistants. If AI programming assistants learn kernel optimization, they'll reduce the cost and runtime of experiments. New algorithms will be on a level playing field with incumbents, and we'll be more likely to see algorithmic progress that was previously bottlenecked by writing CUDA kernels.
Algorithms
- Some algorithms are easy to parallelize; others, not so much. For example, a key benefit of transformers is they're more easily parallelized than RNNs, allowing them to scale.
- Neil Thompson has some interesting work on algorithmic progress, showing that many fundamental algorithms are provably optimal or close to it. I'm not sure if this is a relevant reference class for ML algorithms though, as runtime guarantees are far less important than measured performance.
- Overall, will future algorithms be easier to parallelize? It seems likely. We've done it before, and I don't have any particular reason to expect that it won't happen again.

Overall, I don't see strong evidence that any of these factors are hitting strong barriers. Instead, the most relevant trend I see in the next 5 years is the rise of AI programming assistants, which could significantly accelerate progress in kernel optimization and algorithms.

I'd highlight two other factors affecting effective compute budgets:

Spending. Maybe nobody will spend more than $10B on a training run, and the current trend will slow. But if we're in a very short timelines world, then AI could be massively profitable in the next few years, and OpenAI might get the $100B investment they've been talking about.
Better ML models. Some models learn more efficiently than others. Right now, algorithmic progress halves the compute necessary to reach a fixed level of performance every 16 months or every 9 months, depending on how you look at it. (This research focuses on efficiently reaching an existing level of performance -- I'm not sure how we should expect it to generalize to improvements in SOTA performance.) Again, AI coders could accelerate this.
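To put the two halving times in perspective, here is a small sketch of the compounding involved. It assumes a clean exponential trend and a four-year horizon, both chosen for illustration. The 16-month and 9-month figures are the ones cited above.

```python
def compute_fraction(months_elapsed, halving_months):
    """Fraction of the original compute needed for the same performance,
    assuming the requirement halves every `halving_months` months."""
    return 0.5 ** (months_elapsed / halving_months)


# After four years, under each of the two cited halving times.
slow = compute_fraction(48, 16)  # three halvings
fast = compute_fraction(48, 9)   # a little over five halvings

print(f"16-month halving: {slow:.3f} of the original compute")
print(f"9-month halving:  {fast:.3f} of the original compute")
```

The gap between the two estimates widens the further out the trend is extrapolated, so which figure is closer to the truth matters a great deal for multi-year forecasts.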

Overall, I used to argue that AI progress will soon slow. But I've lost a lot of Bayes points to folks like Elon, Sam Altman, and Daniel Kokotajlo. A slowdown is entirely possible, perhaps even likely. But it's a live possibility that the world could be transformed in a span of only a few years by human-level AI. Safety efforts should address the full range of possible outcomes, but short timelines scenarios are the most dangerous and most neglected, so that's where I'm focusing most of my attention right now.

SteveZ | May 28 2023 | 3

intermediate programs (interpreters, compilers, assemblers) are used to translate human programming languages into increasingly repetitive and specific languages until they become hardware-readable machine code. This translation is typically done through strict, unambiguous rules, which is good from an organizational and cleanliness perspective, but often results in code which consumes orders of magnitude more low-level instructions (and consequently, time) than if they were hand-translated by a human. This problem is amplified when those compilers do not understand that they are optimizing for machine learning: compilation protocols optimized to render graphics, or worse for CPUs, are far slower.

This is at best an imperfect description of how compilers work. I'm not sure what you mean by "repetitive", but yeah, the purpose is to translate high-level languages to machine code. However:

Hardware does not care about code organization and cleanliness, nor does the compiler. When designing a compiler/hardware stack the principal metrics are correctness and performance. (Performance is very important, but in relative terms is a distant second to correctness.)
The number of instructions in a program, assembly or otherwise, is not equivalent to runtime. As a trivial example, "while(1)" is a short program with infinite runtime. Some optimizations such as loop unrolling increase instruction count while reducing runtime.
Such optimizations are trivial for a compiler, and tricky but possible for a human to get right.
"often results in code which consumes orders of magnitude more low-level instructions": not sure what this means. Compilers are pretty efficient, and you can play around with source code and see the actual assembly pretty easily (e.g. Godbolt is good for this). There's no significant section of dead code being produced in the common case.

(Of course the raw number of instructions increases from C or whatever language, this is simply how RISC-like assembly works. "int C = A + B;" turns into "Load A. Load B. Add A and B. Allocate C on the stack. Write the computed value to C's memory location.")
Humans can sometimes beat the compiler (particularly for tight loops), but compilers in 2023 are really good. I think the senior/junior engineer vs compiler example is wrong. I would say (for a modest loop or critical function): the senior engineer (who has much more experience and knowledge of which tools, metrics, and techniques to use) can gain modest improvement by spending significant time. The junior engineer would probably spend even more time for only a slight improvement.
"This problem is amplified when those compilers do not understand that they are optimizing for machine learning": Compilers never know the purpose of the code they are optimizing; as you say they are following rule-based optimizations based on various forms of analysis. In LLVM this is basically analysis passes which produce data for optimization passes. For something like PyTorch, "compilation" means PyTorch is analyzing the operation graph you created and mapping it to kernel operations which can be performed on your GPU.
"compilation protocols optimized to render graphics, or worse for CPUs, are far slower": I don't understand what you mean by this. What is a compilation protocol for graphics? Can you explain in terms of common compiler/ML tools? (E.g. LLVM MLIR, PyTorch, CUDA?)
I honestly don't understand how the power plant/flashlight analogy corresponds to compilers. Are you saying this maps to something like LLVM analysis and optimization passes? If so this is wrong; running multiple passes with different optimizations increases performance. Running multiple optimization passes was historically (i.e. circa early 2000s) hard for compilers to do, but (LLVM author) Chris Lattner's key idea was to perform all the optimizations on a simple intermediate layer of code (IR) before lowering to machine code.

Machine learning involves repetitive operations which can be processed simultaneously (parallelization)

I agree, but of course Amdahl's Law remains in effect.
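For readers unfamiliar with Amdahl's Law, the standard formula says that if a fraction p of a workload can be parallelized across n processors, the overall speedup is 1 / ((1 - p) + p / n). A minimal sketch follows; the function name and the example fractions are illustrative choices.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Overall speedup when only `parallel_fraction` of the work
    benefits from running on `n_processors` processors."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


# With 90% of the work parallelizable, adding processors gives diminishing
# returns: the speedup can never reach 1 / (1 - 0.9) = 10.
for n in (1, 10, 100, 1_000_000):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

The serial remainder sets a hard ceiling no matter how much hardware is added, which is why the law is relevant to any argument about scaling through parallelization.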

The goal of hardware optimization is often parallization (sic)

Generally when designing hardware increased throughput or reduced latency (for some representative set of workloads) are the main goals. Parallelization is one particular technique that can help achieve those goals, but there are many ideas/techniques/optimizations that one can apply.

The widespread development of machine learning hardware started in mid-early 2010s and a significant advance in investment and progress occurred in the late 2010s

Sure... I mean deep learning wasn't even a thing until 2012. I think the important concept here is that hardware designs have a long time horizon (generally 2-3 years) because it takes that long to do a clean-sheet design and also because if you're spending millions of dollars to design/tapeout/manufacture a new chip, you need to be convinced that the workload is real and people will still be using it years from now when you're trying to sell your new chip.

CUDA optimization, or optimization of low-level instruction sets for machine learning operations (kernels), generated significant improvements but has exhausted its low-hanging fruit

Like the other commenter, I think this could be true, but I'm not sure what the argument for it is. And again, it depends on the workload. My recollection is that even early versions of cuDNN (circa 2015) were good enough that you got >90% of the max floating point performance on at least some of the CNN workloads common at that time (of course transformers weren't invented yet).

The development of specialized hardware and instruction sets for certain kernels leads to fracturing and incentivizes incremental development, since newer kernels will be unoptimized and consequently slower

This could be true, I suppose. But I'm doubtful because those hardware designs are being produced by companies that have studied the workloads and are convinced they can do better. If anything competition may incentivize all hardware manufacturers to spend more time optimizing kernel performance than they otherwise would.

Effective Altruism Forum