
Transformative AI and Compute - A holistic approach - Part 4 out of 4

This is the Appendix (part four) of the series Transformative AI and Compute - A holistic approach. You can find the sequence here and the summary here.

This appendix attempts to:

1. Provide a list of connected research questions (Appendix A).
2. Present common compute metrics and discuss their caveats (Appendix B).
3. Provide a list of startups in the AI hardware domain (Appendix C).

# Previous Post: Compute Governance and Conclusions

You can find the previous post "Compute Governance and Conclusions [3/4]" here.

# Appendix

This appendix lists research questions, some thoughts on metrics related to compute, and a list of AI hardware startups.

These paragraphs ended up in the Appendix because I have spent less time on them and they are more of a draft. Nonetheless, I still think they can be useful.

## A. Research Questions

This is not a research agenda or anything close to it. This is me doing research and thinking “Oh, I’d be curious to learn more about this *scribbling it down*”.

The questions range from specific to nonspecific, and from important to “I’m just curious”. There are a couple of questions which I think are more important; I’ve marked them with a ★.

In general, this domain of research requires more:

• Data Acquisition: Collect data in a machine-readable format on compute trends, hardware performance, algorithmic efficiency, etc.
• Data Analysis: Data analysis is a crucial ingredient: from trend graphs and doubling times up to new measurements of AI systems’ capabilities and AI hardware.
• Data Interpretation: Once we have the data available and have analyzed general trends, we need to interpret them to inform forecasts and potential policies.

Figure A.1: Sketch of research domains for AI and Compute.

### Scaling Hypothesis

• GPT-3 is a leading language model in terms of compute (≈3.64E+23) and parameters (≈1.75E+11). However, PanGu-α has more parameters (≈2.07E+11) but used less compute for training (≈5.83E+22) (estimates from here).
• Would we expect better performance from additional training of PanGu-α? Why did they stop there?
• What is the parameter-to-compute ratio for language models? What can we learn from it?
• New efficient architectures, such as EfficientNet, achieve similar performance while requiring less compute for training.
• Are we expecting better performance if we scale up this architecture and increase the compute?

### Compute Price

• How can we break down the AI and Compute trend? And what are the proportions of the relevant components, such as increased spending and more powerful hardware? ★
• I think the current breakdowns of the compute trend into (a) more spending and (b) more performant hardware could be improved. My initial guess is that hardware performance is improving faster than Moore’s Law. This is important because the limit of the trend is often assumed to be reached when spending hits 1% of GDP. If (b) more performant hardware makes up a more significant proportion, this trend could be maintained for longer (at least from the spending side).
• How can we break down the price of compute? And what are the proportions of the relevant components?
• I am discussing the operational costs in Appendix B. What is the ratio of the purchase price to energy, operations, engineering, etc?
• I have heard from researchers that the funding for compute is not the immediate problem, rather the required knowledge to use it and deploy the training run efficiently.
• Does spending money on compute always entail hiring ML engineers? What is the fraction?
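
To make the decomposition question above concrete, here is a minimal sketch in Python. The 6-month doubling of training compute is the trend discussed in this series; the hardware doubling time, starting budget, and spending cap are assumptions for illustration only, not estimates.

```python
# Sketch: decomposing the "AI and Compute" trend into spending growth and
# hardware price-performance growth. All numbers are illustrative assumptions.
import math

# Training compute of notable systems doubles roughly every six months,
# i.e. it grows by a factor of about 4 per year.
total_growth_per_year = 4.0

# Assumption: hardware price-performance (FLOPs/s per $) doubles every
# ~2.5 years.
hardware_growth_per_year = 2 ** (1 / 2.5)

# The residual growth must then come from increased spending (this toy
# model lumps utilization improvements into spending as well).
spending_growth_per_year = total_growth_per_year / hardware_growth_per_year

# How long could spending growth continue if it started at $10M per training
# run and were capped at $200B (roughly 1% of US GDP)? Both endpoints are
# rough assumptions.
years = math.log(200e9 / 10e6) / math.log(spending_growth_per_year)

print(f"hardware contribution: {hardware_growth_per_year:.2f}x per year")
print(f"implied spending growth: {spending_growth_per_year:.2f}x per year")
print(f"years until the spending cap: {years:.1f}")
```

The point of the sketch: the faster hardware price-performance improves, the smaller the implied spending growth, and the longer the overall trend can be sustained before hitting the spending cap.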

#### Compute Hardware

• Is there a “Learning Curve” for compute hardware (similar to solar power)?
• Should we expect a strong effect with an increased spending in the future?
• Which emerging computing paradigms should we monitor? How can we prioritize among them? ★
• In-memory computing, optical computing, hybrid approaches, 3D stacked chips, …
• Which metrics are the most informative for compute hardware performance? ★
• Can we use benchmarks, such as MLCommons or Lambda Labs, for measuring hardware progress?
• Can we, based on those application-specific benchmarks, create a metric like benchmark seconds per $?
• Should we use metrics only in an application-specific context?
• E.g., we could use the MLCommons benchmarks on language models to measure the performance of hardware for AI systems on language models.
• How can we consider utilization in our metrics?
• Quantization and number representation: We have seen that neural networks can be quantized (going from 32-bit float to 8-bit integer representation) for inference without significant accuracy loss. This leads to substantial memory savings and inference speedups.
• What is the limit of bit-width for training?
• What kind of speedup should we expect once we design dedicated hardware for the chosen number representation? Integer representations require less energy and space on the die of the chip.
• Specialized architectures and ASICs
• Why haven’t we seen ASICs for AI systems yet? I outline some reasons in Section 4.2 but would like to dig deeper and do some forecasts. GPUs and TPUs are still fairly general in regard to the computation they can execute.
• Are there certain fundamental building blocks that we could significantly accelerate by designing specialized hardware for them?
• E.g., could we design hardware that is significantly faster and more efficient at executing transformers?
• We have seen technologies like Bitcoin mining or encryption switching towards specialized chips or dedicated subprocessors.
• What can we learn from those developments?
• Does narrow AI enable the potential for feedback loops in the design process of AI chips? E.g., (Mirhoseini et al. 2021). ★
• There are a couple of problems in the field of electronic design automation, such as place and route, where AI could be highly beneficial and speed up the development process.
• Discontinuous progress
• Which categories of AI hardware innovation could lead to discontinuities? How feasible are they?
• Are there examples of discontinuous progress in computing capabilities for other applications, such as encryption, video processing, Bitcoin, or others? What happened there?
• Are there some potential overhangs[1] within AI hardware domains (processing, memory, architectural, etc.) which could be unlocked?
• I think an innovation within the interconnect could unlock lots of hidden processing performance (discussed in Appendix B).
• Existing bottlenecks
• What are existing bottlenecks in the computing hardware domain?
• Can we distribute AI (training) workloads across multiple datacenters? What is the minimum bandwidth we need to maintain (in regards to the memory hierarchy: from on-chip, on-board, on-system, in-datacenter, internet, …)?
• What kind of speedup could we expect if we closed the processor-memory performance gap?
• NVIDIA charges a significant premium for their commercial AI hardware. What is the price difference between consumer-grade GPUs and commercial ones?
• Can we expect this to change in the near future when other competitors enter the market?
• Are metrics based on consumer-grade GPUs a good proxy for compute prices?
• Which companies will dominate the AI hardware market?
• What can we learn from other technologies (e.g., encryption, Bitcoin, video accelerators) which resulted in their own chips?
• What is the time-to-market for compute hardware? Could this allow us to foresee upcoming developments?
• Assuming we acquire (e.g., by stealing) the blueprints of a transformative new computing technology or paradigm, can we just rebuild it?

### Compute Forecast

• For used-compute trends: Which models should we include in the dataset? What are our inclusion criteria?
• How do we deal with the more frequent emergence of efficient ML systems?
• Can rank lists, such as the TOP500 for supercomputers, provide insights on available compute?
• How can we measure their compute performance for AI tasks?
#### Semiconductor Industry

CSET is already doing most of the work within this domain. I expect my questions could be answered by reading through all of their material.

• What role does the semiconductor industry (SCI) play?
• Could AI capabilities be slowed down by manufacturing limitations (the current semiconductor crisis)? Are they already slowed down? Should we expect more progress in the next decade?
• What is the role of individual actors, such as TSMC or NVIDIA, in the compute hardware space?
• Should we focus on regulating specific actors?
• How will the semiconductor industry develop in the next years?
• What will be the impact of nations, such as the EU and the US, focusing on building a domestic semiconductor industry?
• Should we expect such developments to decrease the price and increase research speed?

### Algorithmic Efficiency

• Updating the AI and Efficiency trend with the newest data.
• Creating a dataset similar to AI and Efficiency for a different benchmark.
• New efficient AI systems are not optimized for hitting intermediate accuracy goals. We could tweak the efficient networks to achieve the accuracy benchmarks earlier and get better estimates of algorithmic improvements.
• Learn more about the sample efficiency of AI systems.

### AI Safety

• What is the role of compute hardware for AI safety?
• Is there more to it than just enabling more compute? How can we use it in regards to AI safety?
• How can we limit access to potentially harmful AI systems via controlling hardware?
• Can we remotely turn off hardware?
• Can we integrate a self-destruct mechanism?
• What is the role of low-level security exploits (such as Spectre and Meltdown)?
• Could they be exploited to acquire compute from other actors? Or to make hardware unusable?

## B. Metrics

Highlights:

• The metric FLOPs/s is commonly used for discussing compute trends. However, as outlined in Section 1.1, the memory and interconnect are also of significance.
• FLOPs/s — which is often listed on specification sheets of hardware — does not give insights into real-world computing performance. We need to adjust this idealized figure by a utilization rate, which is determined by various factors, such as the interconnect, memory capacity, and the computing setup.
• Consequently, it is important to base trends not purely on an increase in peak FLOPs/s but rather on effective FLOPs/s — measured by real-world performance.
• I suggest investigating benchmark metrics, such as the MLCommons or Lambda Labs benchmarks. These benchmarks provide insights into the performance of hardware with real-world workloads of different AI domains.

I have discussed various trends related to compute, and those trends often relied on metrics which we then investigated over time. While metrics related to compute might initially seem more quantifiable than other AI inputs (such as talent, algorithms, data, etc.), I have encountered various caveats during my research. Nonetheless, I still think that compute has the most quantifiable metrics compared to the others. In this Appendix B, I briefly want to present commonly used metrics, discuss some caveats, and propose some ideas to address those shortcomings.

### B.1 Commonly used metrics for measuring hardware performance

The presented forecasting model for AI timelines is informed by one hardware-related metric: FLOPs/s per $. Algorithmic efficiency and money spent on compute are hardware-independent; they are multiplicative with the FLOPs per $ (see Section 4.1).

We have the following options to make progress or acquire more compute. Assuming all the other metrics are constant, we can either:

• Increase the computing performance [usually measured in FLOPs/s]
• By improving one of the listed components: logic, memory, or interconnect (see Section 1.1)
• Decrease the price [$][2]
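
The peak-versus-effective distinction highlighted above can be sketched in a few lines of Python. The peak throughput and the utilization figure below are assumed placeholders; real utilization has to be measured per workload.

```python
# Sketch: peak vs. effective FLOPs/s. The spec and utilization numbers
# below are assumed placeholders for illustration.

peak_flops_per_s = 312e12      # peak throughput from a spec sheet (assumed)
utilization = 0.30             # fraction of peak achieved on a real workload

effective_flops_per_s = peak_flops_per_s * utilization
print(f"effective throughput: {effective_flops_per_s:.3g} FLOPs/s")

# Training compute is then estimated from wall-clock time at the
# *effective* rate, not the peak rate:
training_days = 10
total_flops = effective_flops_per_s * training_days * 24 * 3600
print(f"total training compute: {total_flops:.3g} FLOPs")
```

With these assumed numbers, estimating training compute from the peak rate instead of the effective rate would overstate it by a factor of 1/0.30, i.e. more than 3x.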

##### FLOPs/s per $

If we now take the purchase price of the hardware into consideration, we can divide the computing performance in FLOPs/s by the purchase price to learn about FLOPs/s per $. This puts different hardware into comparison and tells us how many operations specific computing hardware can execute per second for a single dollar.[3] Note that this metric commonly ignores the operating costs of hardware, which are partially captured by FLOPs/s per Watt.

##### FLOPs/s per Watt

FLOPs/s per Watt is similar to FLOPs/s per $. However, instead of the monetary costs, it considers the energy costs. It describes the efficiency of the hardware.

Energy efficiency is a key metric in computer engineering, as hardware is often limited by its heat dissipation (see power wall). A higher energy efficiency often allows us to increase the clock frequency or improve other components, which then allows us to compute more FLOPs/s.
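
A minimal sketch of how these two metrics are computed and compared; the spec numbers below are invented for illustration, not real product figures.

```python
# Sketch: comparing two hypothetical accelerators by FLOPs/s per $ and
# FLOPs/s per Watt. All spec numbers are made up for illustration.

accelerators = {
    # name: (peak FLOPs/s, purchase price in $, power draw in W)
    "chip_a": (312e12, 10_000, 400),   # hypothetical datacenter accelerator
    "chip_b": (35e12, 1_500, 350),     # hypothetical consumer GPU
}

for name, (flops, price, watts) in accelerators.items():
    per_dollar = flops / price
    per_watt = flops / watts
    print(f"{name}: {per_dollar:.3g} FLOPs/s per $, {per_watt:.3g} FLOPs/s per W")
```

Note how the two rankings can disagree: in this made-up example the consumer chip is competitive per dollar but clearly worse per Watt, which is exactly why neither metric alone characterizes the hardware.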

#### What is an operation?

We describe a FLOP or OP as the atomic instruction which we count. However, at the computer-architecture level this usually gets broken into multiple instructions — how many and what they look like depends on the instruction set architecture. This is the assembly code. This type of code is closer to the hardware and could provide more reliable insights into performance via metrics such as instructions per cycle. However, this would require more work and gets into the weeds of computer engineering.

I think for this reason, FLOPs and multiply-accumulate operations (MACCs) often get conflated. Some use them interchangeably and assume one FLOP equals one MACC, whereas others assume a MACC equals two FLOPs.

In the end, a MACC is a hardware-specific operation. If the underlying hardware contains a fused multiply-add unit, then a MACC is nearly as costly (in terms of latency) as a single FLOP. However, if it does not, or the operations (multiply, then add) are not executed as one, it equals two FLOPs.
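
The two counting conventions can be made concrete for a single dense (fully connected) layer; a sketch with arbitrary layer sizes:

```python
# Sketch: the MACC-vs-FLOP ambiguity for a dense layer with n_in inputs and
# n_out outputs. Each output needs n_in multiply-accumulates (the bias add
# is ignored here for simplicity).

n_in, n_out = 1024, 512

maccs = n_in * n_out            # one multiply-accumulate per weight
flops_if_fused = maccs          # convention: 1 MACC = 1 FLOP (fused FMA unit)
flops_if_separate = 2 * maccs   # convention: 1 MACC = 2 FLOPs (multiply + add)

print(f"MACCs: {maccs}")
print(f"FLOPs (fused convention): {flops_if_fused}")
print(f"FLOPs (separate convention): {flops_if_separate}")
```

The factor-of-two gap between the two conventions is exactly the conflation described above; it is harmless for order-of-magnitude trend work but matters when error bars need to be tighter than that.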

Nonetheless, I don’t want to go into the details here. But assuming we got all of the above-mentioned caveats right, our metrics still have some limitations. I do not think this is of great importance given our current trend line, where we see a doubling in training compute every six months. However, for my previous research on ML on the edge (optimizing AI systems for deployment on small, resource-constrained embedded devices), where our error bars need to be smaller than one order of magnitude, this played an important role. Additionally, people often get confused when they make their own estimates and the numbers do not add up.

### B.3 Concluding thoughts

I have listed various caveats on commonly used metrics and under which circumstances they are of more or less use. Measuring AI hardware does not rely on a single metric that gives you all the information, but that is rarely the case in any domain. Overall, I recommend adjusting hardware performance by utilization rates and monitoring developments in memory bandwidth more closely.

We are working on a piece with more insights on the utilizations and some advice on how to estimate training compute and the connected utilization of the system (link to be added by the end of 2021; ping me if not).

I’m also expecting that measuring those metrics will get more complicated over time due to the emergence of more heterogeneous architectures, and we might see more specialized architectures for different AI workloads.

Consequently, I am interested in metrics that are workload-dependent and can give us some insights into the performance of AI hardware.
MLCommons is a project which lets hardware manufacturers publish their training times for different AI domains. There we find training times for different hardware, hardware setups, and different AI workloads: e.g., image recognition and NLP.
Analyzing this data would allow us to derive trends based on the training time, which already encapsulates the listed caveats such as utilization, workload-dependent parallelizability, and memory access. For example, we could calculate performance gains of hardware on real-world workloads over time (unfortunately, the data only goes back to 2017). Additionally, we get more insights into closed systems such as Google’s TPU.
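
As a sketch of such an analysis, the snippet below fits an exponential trend to training times and reads off a halving time. The (year, minutes) pairs are invented placeholders; real data would come from the MLCommons results.

```python
# Sketch: estimating a performance trend from benchmark training times.
# The data points below are invented placeholders, not MLCommons results.
import math

results = [(2018, 400.0), (2019, 120.0), (2020, 45.0), (2021, 16.0)]

# Fit an exponential, log(time) = a + b * year, via simple least squares.
n = len(results)
xs = [year for year, _ in results]
ys = [math.log(minutes) for _, minutes in results]
xbar, ybar = sum(xs) / n, sum(ys) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum(
    (x - xbar) ** 2 for x in xs
)

# A negative slope means training times shrink; convert to a halving time.
halving_time_years = -math.log(2) / b
print(f"training time halves every {halving_time_years:.2f} years")
```

Because the training time is an end-to-end measurement, the fitted trend automatically folds in utilization, parallelizability, and memory effects, which is precisely the advantage over peak-FLOPs/s trends argued for above.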

I have made this data available in this sheet and would be interested in some analysis.

The GPU data from Lambda Labs is another promising dataset.

## C. AI Hardware Startups

This is a list of AI hardware startups that are working on new types of hardware paradigms, often optical computing (as a hybrid approach between digital and optical).

Figure C.1: Overview from June 2020 by @draecomino (Tweet Source).

List of AI hardware startups I stumbled upon during my research:

# Acknowledgments

You can find the acknowledgments in the summary.

# References

The references are listed in the summary.

1. An overhang refers to a situation where large amounts of resources that are already available cannot be used yet. If it is resolved, large amounts are unlocked immediately (Dafoe 2018). ↩︎

2. As already discussed in Section 4.2, even when hardware does not significantly increase in computing performance, the price can still decrease significantly due to longer R&D cycles and an economy of scale. ↩︎

3. See this Wikipedia List or this post by AI Impacts for historic trends of FLOPs/s per $. ↩︎

4. A quick back-of-the-envelope calculation: an NVIDIA DGX A100 system consumes around 6.5 kW at peak usage. Assuming $0.12 per kWh, it costs around $6,800 per year to run this hardware. This is rather negligible compared to the purchase price, given that an NVIDIA DGX A100 costs around $200,000 to $300,000. ↩︎

5. For an NVIDIA A100, the on-board memory bandwidth is around 2 TB/s, whereas when interconnecting additional A100s using NVIDIA’s specialized NVLink, one achieves up to 600 GB/s, and only 64 GB/s using the standard PCIe Gen4 interface (see this datasheet). ↩︎

6. Integer representation (instead of floating point) saves energy and requires less space on the chip die. See Computing’s Energy Problem - Slides and Computing’s energy problem (and what we can do about it) ↩︎
