Transformative AI and Compute - A holistic approach - Part 4 out of 4
This appendix attempts to:
- Provide a list of connected research questions (Appendix A).
- Present common compute metrics and discuss their caveats (Appendix B).
- Provide a list of startups in the AI hardware domain (Appendix C).
You can find the previous post "Compute Governance and Conclusions [3/4]" here.
This appendix lists research questions, some thoughts on metrics related to compute, and a list of AI hardware startups.
These paragraphs ended up in the appendix because I have spent less time on them and they are more of a draft. Nonetheless, I still think they are potentially useful.
A. Research Questions
This is not a research agenda or anything close to it. This is me doing research and thinking “Oh, I’d be curious to learn more about this *scribbling it down*”.
The questions range from specific to nonspecific, from important to “I’m just curious”. I’ve marked a couple of questions which I think are more important with a ★.
This list is also published as a Google Doc, so one can add comments to individual items.
In general, this domain of research requires more:
- Data Acquisition: Collect data in a machine-readable format on compute trends, hardware performance, algorithmic efficiency, etc.
- Parameter, Compute and Data Trends in Machine Learning (Sevilla et al. 2021) is such an example. Work with us!
- Data Analysis: Data analysis is a crucial ingredient: from trend graphs and doubling times up to new measurements of AI systems’ capabilities and AI hardware.
- Data Interpretation: Once we have available data and analyzed general trends, we require interpretation for informing forecasts and potential policies.
Figure A.1: Sketch of research domains for AI and Compute.
- For a list of research questions, see “Some AI Governance Research Ideas” (Anderljung and Carlier 2021).
- What is the current budget of public research organizations for compute?
- Should we lobby for an increase in compute budget?
- If yes, how?
- How does this compare to corporate AI research labs?
- GPT-3 is a leading language model in terms of compute (≈3.64E+23) and parameters (≈1.75E+11). However, PanGu-α has more parameters (≈2.07E+11) but used less compute for training (≈5.83E+22) (estimates from here).
- Would we expect better performance from additional training of PanGu-α? Why did they stop there?
- What is the parameter-to-compute ratio for language models? What can we learn from it?
- New efficient architectures, such as EfficientNet, achieve similar performance while requiring less compute for training.
- Are we expecting better performance if we scale up this architecture and increase the compute?
- How can we break down the AI and Compute trend? And what are the proportions of the relevant components, such as increased spending and more powerful hardware? ★
- I think the current breakdowns of the compute trend into (a) more spending and (b) more performant hardware could be improved. My initial guess is that the performance of hardware is developing faster than Moore’s Law. This is important because the limit of the trend is often assumed to be reached when spending hits 1% of GDP. If (b) more performant hardware makes up a more significant proportion, this trend could be maintained for longer (at least from the spending side).
- How can we break down the price for compute? And what are the proportions of the relevant components?
- I discuss the operational costs in Appendix B. What is the ratio of the purchase price to energy, operations, engineering, etc.?
- I have heard from researchers that the funding for compute is not the immediate problem, rather the required knowledge to use it and deploy the training run efficiently.
- Does spending money on compute always entail hiring ML engineers? What is the fraction?
- Is there a “Learning Curve” for compute hardware (similar to solar power)?
- Should we expect a strong effect from increased spending in the future?
- Which emerging computing paradigms should we monitor? How can we prioritize among them? ★
- In-memory computing, optical computing, hybrid approaches, 3D stacked chips, …
- Which metrics are the most informative for compute hardware performance? ★
- Can we use benchmarks, such as MLCommons or Lambda Labs, for measuring hardware progress?
- Based on those application-specific benchmarks, can we create a metric such as benchmark seconds per $?
- Should we use metrics only in an application-specific context?
- E.g., we could use the MLCommons benchmarks on language models to measure the performance of hardware for AI systems on language models.
- How can we consider utilization in our metrics?
- Quantization and number representation: We have seen that neural networks can be quantized (going from 32bit float to 8bit integer representation) for inference without significant accuracy loss. This leads to substantial memory savings and inference speedups.
- What is the limit of bit-width for training?
- Which kind of speedup should we expect once we design dedicated hardware for the designed number representation? Integer representations require less energy and space on the die of the chip.
- Specialized architectures and ASICs
- Why haven’t we seen ASICs for AI systems yet? I outline some reasons in Section 4.2 but would like to dig deeper and do some forecasts. GPUs and TPUs are still fairly general in regards to the computation they can execute.
- Are there certain fundamental building blocks that we could significantly accelerate by designing specialized hardware for them?
- E.g., could we design hardware that is significantly faster and more efficient for executing transformers?
- We have seen technologies like bitcoin mining, or encryption, switching towards specialized chips or dedicated subprocessors.
- What can we learn from those developments?
- Does narrow AI enable the potential for feedback loops for the design process of AI chips? E.g., (Mirhoseini et al. 2021).★
- There are a couple of problems in the field of electronic design automation, such as place and route, where AI could be highly beneficial and speed up the development process.
- Discontinuous progress
- Which categories of AI hardware innovation could lead to discontinuities? How feasible are they?
- Are there examples of discontinuous progress in computing capabilities for other applications, such as encryption, video processing, Bitcoin, or others? What happened there?
- Are there some potential overhangs within AI hardware domains (processing, memory, architectural, etc) which could be unlocked?
- I think an innovation within the interconnect could unlock lots of hidden processing performance (discussed in Appendix B).
- Existing bottlenecks
- What are existing bottlenecks in the computing hardware domain?
- Can we distribute AI (training) workloads across multiple datacenters? What is the minimum bandwidth we need to maintain (in regards to the memory hierarchy: from on-chip, on-board, on-system, in-datacenter, internet, …)?
- Which kind of speedup could we expect if we unlock the processor-memory performance gap?
- NVIDIA charges a significant premium for their commercial AI hardware. What is the price difference between consumer-grade GPUs and commercial ones?
- Can we expect this to change in the near future when other competitors enter the market?
- Are metrics based on consumer-grade GPUs a good metric for compute prices?
- Which companies will dominate the AI hardware market?
- What can we learn from other technologies (e.g. encryption, Bitcoin, video accelerators) which resulted in their own chips?
- What is the time-to-market for compute hardware? Could this allow us to foresee upcoming developments?
- Assuming we acquire (e.g., by stealing) the blueprints of a transformative new computing technology or paradigm, can we just rebuild it?
- For used compute trends: Which models should we include in the dataset? What are our inclusion criteria?
- How do we deal with the more frequent emergence of efficient ML systems?
- Can ranked lists, such as the Top 500 list of supercomputers, provide insights on available compute?
- How can we measure their compute performance for AI tasks?
CSET is already doing most of the work within this domain. I expect my questions could be answered by reading through all of their material.
- What role does the semiconductor industry (SCI) play?
- Could AI capabilities be slowed down by manufacturing limitations (current semiconductor crisis)? Is it already slowed down? Should we expect more progress in the next decade?
- What is the role of individual actors, such as TSMC or Nvidia in the compute hardware space?
- Should we focus on regulating specific actors?
- How will the semiconductor industry develop in the next years?
- What will be the impact of nations, such as the EU and the US, focusing on building a semiconductor industry?
- Should we expect such development to decrease the price and increase research speed?
- Updating the AI and Efficiency trend with the newest data.
- Creating a similar dataset as AI and Efficiency for a different benchmark.
- New efficient AI systems are not optimized for hitting intermediate accuracy goals. We could tweak the efficient networks to achieve the accuracy benchmarks earlier and get better algorithmic improvements estimates.
- Learn more about the sample efficiency of AI systems.
- What is the role of compute hardware for AI safety?
- Is there more to it than just enabling more compute? How can we use it in regards to AI safety?
- How can we limit access to potentially harmful AI systems via controlling hardware?
- The metric FLOPs/s is commonly used for discussing compute trends. However, as outlined in Section 1.1, the memory and interconnect are also of significance.
- FLOPs/s — which is often listed on specification sheets of hardware — does not give insights into real-world computing performance. We need to adjust this idealized factor by a utilization which is determined by various factors, such as the interconnect, memory capacity, and the computing setup.
- Consequently, it is important to base trends not purely on peak FLOPs/s but rather on effective FLOPs/s, as measured by real-world performance.
- I suggest the investigation of benchmark metrics, such as the MLCommons or Lambda Labs benchmarks. These benchmarks provide insights into the performance of hardware with real-world workloads of different AI domains.
I have discussed various trends related to compute, and those trends often relied on metrics which we have then investigated over time. While metrics related to compute might initially seem more quantifiable than other AI inputs (such as talent, algorithms, data, etc.), I have encountered various caveats during my research. Nonetheless, I still think that compute has the most quantifiable metrics compared to the others.
In this Appendix B, I briefly want to present commonly used metrics, discuss some caveats, and propose some ideas to address those shortcomings.
B.1 Commonly used metrics for measuring hardware performance
The presented forecasting model for AI timelines is informed by one hardware-related metric: FLOPs/s per $. Algorithmic efficiency and money spent on compute are hardware independent. They are multiplicative with the FLOPs per $ (see Section 4.1).
We have the following options to make progress or acquire more compute. Assuming all the other metrics are constant, we can either:
- Increase the computing performance [usually measured in FLOPs/s]
- By improving one of the listed components: logic, memory, or interconnect (see Section 1.1)
- Decrease the price [$]
- Decrease the energy usage [W] which decreases the price over time [$]
While metrics, such as the price and energy usage, are more clearly defined and measurable, the computing performance is more than just FLOPs/s and depends on various factors which I will discuss.
I will present metrics connected to the three basic components: logic, memory, and interconnect.
FLOP (with its plural FLOPs)
Presented in Section 1, our atomic entity is the operation. FLOP refers to a floating point operation, which is a type of operation on a specific number representation: the floating point number. Nowadays, the term is commonly used for operations in general, independent of the number representation.
However, the number representation matters, as it determines the performance of the hardware. Commonly used in machine learning are float16, float32, bfloat16, int8, and int16. For an introduction to number representations see here or here.
Also commonly used is the petaflop/s-day, which is likewise a quantity of operations. A petaflop/s-day is \(10^{15}\) floating point operations per second sustained for one day. A day has \(86{,}400 \approx 10^{5}\) seconds. That makes \(\approx 10^{20}\) FLOPs.
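The conversion between petaflop/s-days and plain FLOPs is simple arithmetic; a minimal sketch, with function names of my own choosing. The GPT-3 figure is the training compute estimate quoted in Appendix A.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds
PETA = 10**15                    # one petaflop/s is 1e15 FLOPs per second


def petaflop_s_days_to_flops(pfs_days):
    """Total number of operations contained in `pfs_days` petaflop/s-days."""
    return pfs_days * PETA * SECONDS_PER_DAY


def flops_to_petaflop_s_days(flops):
    """Inverse conversion: a quantity of operations expressed in petaflop/s-days."""
    return flops / (PETA * SECONDS_PER_DAY)


# One petaflop/s-day is 8.64e19 FLOPs, often rounded to 1e20.
one_pfs_day = petaflop_s_days_to_flops(1)

# GPT-3 training compute estimate from Appendix A (about 3.64e23 FLOPs).
gpt3_pfs_days = flops_to_petaflop_s_days(3.64e23)

print(f"1 petaflop/s-day = {one_pfs_day:.3e} FLOPs")
print(f"GPT-3 estimate   = {gpt3_pfs_days:,.0f} petaflop/s-days")
```

Note that using the exact 86,400 seconds instead of the rounded \(10^{5}\) shifts the result by roughly 16%, which is small compared to the uncertainty of most training compute estimates.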
FLOPS or FLOPs/s
A quantity of operations does not tell us how long it will take to process them. For a given quantity of operations (X FLOPs) executed in a given time, we have the metric: FLOPs per second (FLOPS or FLOPs/s). This is the most common metric used for the processing power of hardware or whole datacenters (e.g., see the Top 500 List).
FLOPs/s per $
If we now take the purchase price of the hardware into consideration, we can divide the computing performance in FLOPs/s by the purchase price to get FLOPs/s per $. This makes different hardware comparable and tells us how many operations specific computing hardware can execute per second for a single dollar.
Note that this metric commonly ignores the operations costs of the hardware, which are partially determined by FLOPs/s per Watt.
FLOPs/s per Watt
FLOPs/s per Watt is similar to FLOPs/s per $. However, instead of the monetary costs it considers the energy costs. It describes the efficiency of the hardware.
Energy efficiency is a key metric in computer engineering, as hardware is often limited by its heat dissipation (see power wall). A higher energy efficiency often allows us to increase the clock frequency or other components which then allows us to compute more FLOPs/s.
However, I think that considering all the other operational costs, such as cooling, system setup, system administration, etc., might change this picture, and I would be interested in estimates.
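To make the two ratio metrics concrete, here is a minimal sketch. The two devices and all of their numbers are hypothetical and only chosen to show that the ranking by FLOPs/s per $ can differ from the ranking by FLOPs/s per Watt.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    name: str
    peak_flops_per_s: float  # datasheet peak for one number representation
    price_usd: float         # purchase price only, no operations costs
    power_watt: float        # power draw at peak load

    def flops_per_s_per_dollar(self):
        return self.peak_flops_per_s / self.price_usd

    def flops_per_s_per_watt(self):
        return self.peak_flops_per_s / self.power_watt


# Hypothetical devices, not real products.
chip_a = Accelerator("chip_a", peak_flops_per_s=100e12, price_usd=10_000, power_watt=400)
chip_b = Accelerator("chip_b", peak_flops_per_s=50e12, price_usd=2_500, power_watt=250)

for chip in (chip_a, chip_b):
    print(
        f"{chip.name}: {chip.flops_per_s_per_dollar():.2e} FLOPs/s per $, "
        f"{chip.flops_per_s_per_watt():.2e} FLOPs/s per Watt"
    )
```

In this made-up comparison, chip_b is cheaper per unit of peak performance while chip_a is more energy-efficient, which is why both metrics are worth tracking.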
The memory capacity is described in bytes or bits and is familiar to most people. Most of the confusion is usually around the different types of memory: cache, on-board memory, RAM, HDD, SSD, etc.
One of the most important concepts to understand is the memory hierarchy. It separates different storage types in regards to their response time, bandwidth, and capacity. The heuristic: memory with a low response time [in seconds] and high bandwidth [in bytes per second] tends to have a lower capacity.
Figure B.1: Memory hierarchy in computing systems (Source).
Why is the memory capacity of importance for the performance of the system? As already discussed in Section 1.1, your logic can be bottlenecked if it does not get supplied with enough data. Therefore, data locality is important. If you are training an AI system, you would like your weights and parameters to be as local as possible to enable fast read and write access. Therefore, the on-board memory of your GPU is often important (as a minimum, you should be able to fit all of your parameters on it). While it is theoretically possible to offload them to other memory systems, this comes with a significant increase in training time and is usually infeasible.
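A rough sketch of the "do my parameters fit on the device" check. The parameter count is the GPT-3 figure from Appendix A; the 80 GB on-board capacity is an assumed, illustrative value, and the calculation deliberately ignores activations, gradients, and optimizer state, which need additional memory during training.

```python
# Bytes needed to store a single value in each number representation.
BYTES_PER_VALUE = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def parameter_memory_bytes(n_parameters, representation):
    """Memory needed to hold the parameters alone."""
    return n_parameters * BYTES_PER_VALUE[representation]


def fits_on_board(n_parameters, representation, on_board_bytes):
    """True if all parameters fit into the given on-board memory capacity."""
    return parameter_memory_bytes(n_parameters, representation) <= on_board_bytes


GPT3_PARAMETERS = 1.75e11   # estimate quoted in Appendix A
ON_BOARD_BYTES = 80e9       # assumed capacity of a single accelerator

for representation in BYTES_PER_VALUE:
    size_gb = parameter_memory_bytes(GPT3_PARAMETERS, representation) / 1e9
    fits = fits_on_board(GPT3_PARAMETERS, representation, ON_BOARD_BYTES)
    print(f"{representation:>8}: {size_gb:6.0f} GB, fits on one device: {fits}")
```

Under these assumptions, the parameters do not fit on a single device in any of the listed representations, so the model has to be split across several devices and the interconnect starts to matter.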
We have introduced metrics related to the logic and the memory. To now move data from the memory to be processed in the logic element, we require an interconnect.
The memory bandwidth is described in bytes per second, B/s for short. As previously described, the memory bandwidth depends highly on the locality. Moving data from the on-board memory (e.g., GPU memory) to the logic is orders of magnitude faster than, for example, communicating data from local on-board memory to a different board via various interconnects.
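A small sketch of how locality translates into transfer time. The bandwidth values are rounded, assumed figures for an NVIDIA A100 setup (about 2 TB/s on-board, up to 600 GB/s over NVLink, 64 GB/s over PCIe Gen4); treat them as orders of magnitude rather than exact specifications. Latency and protocol overhead are ignored.

```python
# Assumed, rounded bandwidths in bytes per second.
BANDWIDTH_BYTES_PER_S = {
    "on-board memory": 2e12,
    "NVLink": 600e9,
    "PCIe Gen4": 64e9,
}


def transfer_time_s(n_bytes, bandwidth_bytes_per_s):
    """Idealized time to move `n_bytes` over a link with the given bandwidth."""
    return n_bytes / bandwidth_bytes_per_s


payload_bytes = 10e9  # 10 GB of weights or activations, chosen for illustration

for link, bandwidth in BANDWIDTH_BYTES_PER_S.items():
    milliseconds = transfer_time_s(payload_bytes, bandwidth) * 1e3
    print(f"{link:>16}: {milliseconds:8.1f} ms")
```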
Traversed edges per second (TEPS)
To address the shortcomings of only measuring the computing capabilities of the logic elements (more in B.2), the measure “traversed edges per second (TEPS)” was introduced. It is a measure of the interconnect capabilities and computational performance.
This is an important measure, especially for high-performance computing, as those datacenters consist of hundreds of individual computers, and data needs to be moved from computer to computer, passing through various levels of the memory hierarchy.
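To give an intuition for what TEPS counts, here is a toy sketch: a breadth-first search over a small in-memory graph that counts how many edges it scans, divided by the elapsed time. This is a simplified illustration on a single machine, not the official benchmark procedure, and the helper names are my own.

```python
import time
from collections import deque


def bfs_traversed_edges(adjacency, root):
    """Run a breadth-first search and count every edge scanned on the way."""
    visited = {root}
    queue = deque([root])
    traversed = 0
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            traversed += 1
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return traversed


def measure_teps(adjacency, root):
    """Traversed edges per second for one BFS run (toy measurement)."""
    start = time.perf_counter()
    edges = bfs_traversed_edges(adjacency, root)
    elapsed = time.perf_counter() - start
    return edges / elapsed if elapsed > 0 else float("inf")


def ring_graph(n_nodes):
    """Undirected ring: every node is linked to its two neighbours."""
    return {i: [(i - 1) % n_nodes, (i + 1) % n_nodes] for i in range(n_nodes)}


graph = ring_graph(100_000)
print(f"{measure_teps(graph, root=0):.3e} traversed edges per second")
```

On a real cluster, the graph is spread across many machines, so the result is dominated by how fast data moves between them rather than by the arithmetic performance of the logic.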
This is just a brief selection of relevant metrics. I’ve focused on those that I have seen commonly used in my references.
B.2 Some caveats of commonly used metrics
This section discusses some shortcomings of the presented metrics. I list the caveats in order of perceived importance (from more important to less important).
Utilization: what happens in reality
The FLOPs/s on datasheets present the peak performance of the hardware: what the hardware is able to process, assuming a perfectly balanced workload. However, this is rarely the case in the real world. Our hardware often only processes X FLOPs/s of the theoretically achievable Y FLOPs/s. We refer to this ratio as the utilization: \( \text{utilization} = X / Y \).
The achieved utilization depends on various factors. I briefly discuss communication, parallelism, and software.
Also, we discuss some ideas around utilization in an upcoming piece on estimating compute (here it is).
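A minimal sketch of how utilization, understood as the ratio of achieved to peak FLOPs/s, changes a training time estimate. The workload, peak performance, and utilization rate below are hypothetical values for illustration.

```python
def utilization(achieved_flops_per_s, peak_flops_per_s):
    """Fraction of the datasheet peak that a workload actually achieves."""
    if peak_flops_per_s <= 0:
        raise ValueError("peak_flops_per_s must be positive")
    return achieved_flops_per_s / peak_flops_per_s


def training_time_s(total_flops, peak_flops_per_s, utilization_rate=1.0):
    """Time to process `total_flops` at a given peak and utilization rate."""
    return total_flops / (peak_flops_per_s * utilization_rate)


# Hypothetical example: 1e21 FLOPs of training on hardware with 100 TFLOPs/s peak.
total_flops = 1e21
peak = 100e12
rate = utilization(achieved_flops_per_s=30e12, peak_flops_per_s=peak)

ideal_days = training_time_s(total_flops, peak) / 86_400
real_days = training_time_s(total_flops, peak, rate) / 86_400
print(f"utilization: {rate:.0%}, ideal: {ideal_days:.0f} days, adjusted: {real_days:.0f} days")
```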
As discussed in B.1 and in Section 1, the interconnect is crucial to the operation of the logic. If the logic is not supplied with enough data, or the interconnect cannot write the results fast enough, the logic is bottlenecked by the interconnect. This behavior is not captured by peak FLOPs/s.
An example is the change in specifications from NVIDIA’s V100 to its successor, the A100. Besides doubling the tensor performance, this generational step focused on doubling the on-board memory bandwidth and the off-board interconnect bandwidth, while the FP32 and FP64 performance increased only marginally.
AI hardware builds on parallelism at various levels: from parallelism in the chip architecture up to multiple computing systems across which we distribute the workload.
However, for this we require a workload which is parallelizable. Most AI workloads are highly parallelizable, and this is the reason why we switched from CPUs to GPUs/TPUs. However, not every workload is perfectly adapted to the underlying hardware. For this reason, the peak performance is rarely achieved, as all the components (such as network architecture, hyperparameters, etc.) would need to be matched to the underlying hardware.
This also means that 10 GPUs with 10 TFLOPs/s each do not add up to 100 TFLOPs/s. Limited parallelizability and communication overhead reduce the achieved performance, so tweaking the workload and its distribution is necessary.
OpenAI discusses some ideas around parallelism in the piece: “An Empirical Model of Large-Batch Training”.
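One simple way to see why 10 devices do not deliver 10 times the performance is Amdahl's law, which I use here purely as an illustration; it is not the model from the OpenAI piece. The parallel fraction of 95% is an assumed value, and communication overhead would reduce the result further.

```python
def amdahl_speedup(n_devices, parallel_fraction):
    """Speedup over one device when only `parallel_fraction` of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_devices)


def aggregate_flops_per_s(n_devices, per_device_flops_per_s, parallel_fraction):
    """Effective throughput of the whole system under Amdahl's law."""
    return per_device_flops_per_s * amdahl_speedup(n_devices, parallel_fraction)


# 10 GPUs with 10 TFLOPs/s each, assuming 95% of the workload parallelizes.
effective = aggregate_flops_per_s(10, 10e12, parallel_fraction=0.95)
print(f"naive sum: 100.0 TFLOPs/s, effective: {effective / 1e12:.1f} TFLOPs/s")
```

With these assumptions, the system reaches roughly 69 TFLOPs/s instead of the naive 100 TFLOPs/s.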
Software, such as compilers and libraries, helps us to translate our high-level languages into actual machine-readable code. This machine-readable code should then leverage the features of the hardware architecture, such as the parallel architecture. Examples of this are CUDA, Triton, or XLA. Ideally, this software helps with the above-outlined problem of parallelism and communication. However, this is hard and some problems are NP-hard or NP-complete. Consequently, the peak performance is rarely achieved.
The number representation
Talking about the number of FLOPs or FLOPs/s without mentioning the number representation gives limited insight. The bit width (e.g., float16 versus float32) is especially important. Shorter bit widths can be processed faster and also moved around faster, as the corresponding inputs and outputs are smaller. Already now, we see that integer processing comes with significant speedups and memory savings (but is mostly used for inference). I expect more specialized number representations, such as Google’s bfloat16, to emerge in the future.
Therefore, I would also criticize the term FLOP, as non-floating-point operations are already common. However, the term now seems to be used interchangeably with OPs (even though this is technically incorrect).
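To illustrate what moving from a 32-bit float to an 8-bit integer representation involves, here is a minimal sketch of symmetric linear quantization with a single scale factor. This is one simple scheme picked for illustration; production frameworks use more elaborate variants, and the example weights are made up.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer representation."""
    return [q * scale for q in quantized]


weights = [0.82, -1.0, 0.031, 0.0, -0.47]   # made-up example weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 values:", quantized)
print(f"largest rounding error: {max_error:.4f} (scale step: {scale:.4f})")
```

Each value now needs one byte instead of four, and the rounding error stays within half a scale step, which is why inference often tolerates this representation well.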
Consequently, I think it is fair to say that we currently have a processing overhang in the logic: in theory, we can process more FLOPs per second than our interconnect and memory can read and write. Therefore, improving the memory bandwidth, without increasing the processing performance, will lead to more computing capabilities.
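The processing overhang can be illustrated with the roofline model, a standard way of reasoning about compute-bound versus memory-bound workloads that I bring in here as an illustration. All hardware numbers and the arithmetic intensity (FLOPs performed per byte moved) are assumed values.

```python
def attainable_flops_per_s(peak_flops_per_s, memory_bandwidth_bytes_per_s, flops_per_byte):
    """Roofline model: performance is capped by either the logic or the memory."""
    memory_bound = memory_bandwidth_bytes_per_s * flops_per_byte
    return min(peak_flops_per_s, memory_bound)


PEAK = 100e12        # assumed peak of the logic: 100 TFLOPs/s
BANDWIDTH = 1e12     # assumed memory bandwidth: 1 TB/s
INTENSITY = 20       # assumed workload: 20 FLOPs per byte moved

before = attainable_flops_per_s(PEAK, BANDWIDTH, INTENSITY)
after = attainable_flops_per_s(PEAK, 2 * BANDWIDTH, INTENSITY)

print(f"attainable: {before / 1e12:.0f} TFLOPs/s")
print(f"with doubled bandwidth, same logic: {after / 1e12:.0f} TFLOPs/s")
```

In this memory-bound example, doubling the bandwidth doubles the attainable performance even though the peak FLOPs/s of the logic is unchanged.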
Operations costs (energy, infrastructure, etc.)
So far, most estimates have neglected the operations costs, such as energy and infrastructure. I think this is fair, at least in regards to energy, given some back-of-the-envelope calculations. Nonetheless, we should continue monitoring this. With potentially longer innovation cycles and an economy of scale (discussed in Section 4.2), the purchase price might decrease and the operations costs might become a more significant proportion.
Additionally, I think engineering costs might not be negligible (running a computing farm, setting it up, etc.). I’d be interested in estimates. To get a more complete picture, one could get some insights into those costs by estimating FLOPs/s per $ from the prices of cloud computing providers, such as AWS or Google Cloud (minus their premium for profit).
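The back-of-the-envelope energy calculation from the footnote can be reproduced in a few lines. The inputs are the footnote's figures (6.5 kW at peak, $0.12 per kWh, a purchase price of $200,000 to $300,000); running at peak load all year is a simplifying assumption, and cooling and other infrastructure are not included.

```python
HOURS_PER_YEAR = 24 * 365


def yearly_energy_cost_usd(power_kw, price_usd_per_kwh, load_factor=1.0):
    """Energy cost of running hardware for one year at a given average load."""
    return power_kw * HOURS_PER_YEAR * load_factor * price_usd_per_kwh


energy_cost = yearly_energy_cost_usd(power_kw=6.5, price_usd_per_kwh=0.12)
print(f"energy cost per year: ${energy_cost:,.0f}")

for purchase_price in (200_000, 300_000):
    share = energy_cost / purchase_price
    print(f"share of a ${purchase_price:,} purchase price: {share:.1%}")
```

One year of energy comes out at a few percent of the purchase price, which is the basis for treating it as a minor cost component for now.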
What is an operation?
We describe a FLOP or OP as the atomic instruction which we count. However, on the computer architecture level this usually gets broken into multiple instructions. How many there are and what they look like depends on the instruction set architecture. This is the assembly code. This type of code is closer to the hardware and could provide us with more reliable insights into the performance, with metrics such as instructions per cycle. However, this would require more work and gets into the weeds of computer engineering.
I think for this reason, FLOPs and multiply-accumulate operations (MACCs) often get conflated. Some use them interchangeably and assume one FLOP equals one MACC, whereas others assume a MACC equals two FLOPs.
In the end, a MACC is a hardware-specific operation. If the underlying hardware has a fused multiply-add unit, then a MACC is nearly as costly (in terms of latency) as a single FLOP. However, if it does not, or if the operations (a multiplication followed by an addition) are not done consecutively, a MACC equals two FLOPs.
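The factor-of-two ambiguity is easy to see for a fully connected layer. A minimal sketch, with layer sizes chosen arbitrarily and bias terms ignored:

```python
def dense_layer_maccs(n_inputs, n_outputs):
    """Multiply-accumulate operations for one forward pass of a dense layer."""
    return n_inputs * n_outputs


def maccs_to_flops(maccs, flops_per_macc=2):
    """Convert MACCs to FLOPs under one of the two common counting conventions."""
    if flops_per_macc not in (1, 2):
        raise ValueError("flops_per_macc must be 1 (fused) or 2 (multiply plus add)")
    return maccs * flops_per_macc


maccs = dense_layer_maccs(n_inputs=1024, n_outputs=4096)
print(f"MACCs: {maccs:,}")
print(f"FLOPs if one MACC counts as one FLOP:  {maccs_to_flops(maccs, 1):,}")
print(f"FLOPs if one MACC counts as two FLOPs: {maccs_to_flops(maccs, 2):,}")
```

Two estimates of the same network can therefore differ by a factor of two purely because of the counting convention, so it helps to state which one is used.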
Nonetheless, I don’t want to go into the details here. Even assuming we got all of the above-mentioned caveats right, our metrics still have some limitations. I do not think that this is of importance given our current trend line, where we see a doubling in training compute every six months. However, in my previous research on ML on the edge (optimizing AI systems for deployment on small, resource-constrained embedded devices), where our error bars needed to be smaller than one order of magnitude, this played an important role. Additionally, people often get confused when they make their own estimates and the numbers do not add up.
B.3 Concluding thoughts
I have listed various caveats of commonly used metrics and the circumstances under which they are more or less useful. No single metric gives you all the information about AI hardware, but that is rarely the case in any domain. Overall, I recommend adjusting hardware performance by utilization rates and monitoring developments in memory bandwidth more closely.
We are working on a piece with more insights on the utilizations and some advice on how to estimate training compute and the connected utilization of the system (link to be added by the end of 2021; ping me if not).
I’m also expecting that measuring those metrics will get more complicated over time due to the emergence of more heterogeneous architectures, and we might see more specialized architectures for different AI workloads.
Consequently, I am interested in metrics that are workload-dependent and can give us some insights into the performance of AI hardware.
MLCommons is a project which lets hardware manufacturers publish their training times for different AI domains. There we find training times for different hardware, hardware setups, and AI workloads, e.g., image recognition and NLP.
Analyzing this data would allow us to study trends based on the training time, which already encapsulates the listed caveats such as utilization, workload-dependent parallelizability, and memory access. For example, we could calculate performance gains for real-world workloads of hardware over time (unfortunately, the data only goes back to 2017). Additionally, we get more insights into closed systems such as Google’s TPU.
I have made this data available in this sheet and would be interested in some analysis.
The GPU data from Lambda Labs is another promising dataset.
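As a starting point for such an analysis, benchmark training times can be turned into a speedup and an implied doubling time. A minimal sketch; the two training times below are hypothetical placeholders, not actual MLCommons or Lambda Labs results.

```python
import math


def speedup(old_training_time_min, new_training_time_min):
    """How many times faster the newer hardware finishes the same benchmark."""
    return old_training_time_min / new_training_time_min


def doubling_time_years(speedup_factor, years_elapsed):
    """Years needed for performance to double, assuming exponential growth."""
    if speedup_factor <= 1:
        raise ValueError("speedup_factor must be greater than 1")
    return years_elapsed * math.log(2) / math.log(speedup_factor)


# Hypothetical results for the same benchmark, three years apart.
factor = speedup(old_training_time_min=120.0, new_training_time_min=15.0)
print(f"speedup: {factor:.1f}x")
print(f"implied doubling time: {doubling_time_years(factor, years_elapsed=3):.2f} years")
```

Because the inputs are measured training times, the resulting trend already reflects utilization, parallelizability, and memory effects for that specific workload.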
C. AI Hardware Startups
This is a list of AI hardware startups that are working on new types of hardware paradigms, often optical computing (as a hybrid approach between digital and optical).
Figure C.1: Overview from June 2020 by @draecomino (Tweet Source).
List of AI hardware startups I stumbled upon during my research:
- Fathom Radiant
- Luminous Computing
- Rain Neuromorphics
- Cambricon Technologies
- Mythic AI
- Quantum Computing
You can find the acknowledgments in the summary.
The references are listed in the summary.
As already discussed in Section 4.2, even when hardware does not significantly increase in computing performance, the price can still decrease significantly due to longer R&D cycles and an economy of scale. ↩︎
For example in our discussed timeline forecast, this Wikipedia List, or this post by AI Impacts. ↩︎
A quick back-of-the-envelope calculation: An NVIDIA DGX A100 system consumes around 6.5 kW at peak usage. Assuming $0.12 per kWh, it costs around $6,800 per year to run this hardware. This is rather negligible compared to the purchase price, given that an NVIDIA DGX A100 system costs around $200,000 to $300,000. ↩︎
For an NVIDIA A100, the on-board memory bandwidth is around 2 TB/s, whereas the interconnect to additional A100s via NVIDIA's specialized NVLink achieves up to 600 GB/s, and the standard PCIe Gen4 interface only 64 GB/s (see this datasheet). ↩︎
Integer representation (instead of floating point), saves energy and requires less space on the chip die. See Computing’s Energy Problem - Slides and Computing's energy problem (and what we can do about it) ↩︎