What is Compute? - Transformative AI and Compute [1/4]

lennart

I really appreciated the extension of "AI and Compute". Do you have a sense of how much of the difference between your doubling-time estimate and the one in "AI and Compute" stems from differences in selection criteria versus new data published since 2018? Have you analyzed what the trend looks like if you only include data points that fulfil their inclusion criteria?

For reference, it seems like their criteria are "... results that are relatively well known, used a lot of compute for their time, and gave enough information to estimate the compute used", whereas yours are "important publication within the field of AI OR lots of citations OR performance record on common benchmark". The "... used a lot of compute for their time" criterion would probably do a lot of work in selecting data points that show a faster doubling time.

lennart

I have been wondering the same. However, given that OpenAI's "AI and Compute" inclusion criteria are also a bit vague, I'm having a hard time determining which of our data points would fulfill their criteria.

In general, I would describe our dataset as matching the same criteria because:

- "relatively well known" corresponds to our "lots of citations".
- "used a lot of compute for their time" corresponds to our dataset if we exclude outliers from efficient ML models.
  - There's a recent trend of efficient ML models that achieve similar performance while using less compute for inference and training (those models are then used for, e.g., deployment on embedded systems or smartphones).
- "gave enough information to estimate the compute": we also rely on estimates, made by us or the community, based on the information available in the paper. For the source of each estimate, see the note on the cell in our dataset.
  - We're working on gathering more compute data by directly asking researchers (next target n=100).

I'd be interested in discussing more precise inclusion criteria. As I say in the post:

Also, it is unclear on which models we should base this trend. The piece AI and Compute also quickly discusses this in the appendix. Given the recent trend of efficient ML models due to emerging fields such as Machine Learning on the Edge, I think it might be worthwhile discussing how to integrate and interpret such models in analyses like this — ignoring them cannot be the answer.

MarkusAnderljung

Thanks! What happens to your doubling times if you exclude the outliers from efficient ML models?

lennart

The reported doubling time of 6.2 months is the result with the outliers excluded. If one includes all our models, the doubling time is ≈7 months. However, only one or two of the models were efficient ML models.
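For anyone who wants to try this on their own selection of models, here is a minimal sketch of how a doubling time can be estimated and how a few low-compute outliers shift it. It assumes a plain least-squares fit of log2(compute) against time, which is not necessarily the exact method behind the numbers above. The data points and the names `doubling_time_months`, `frontier`, and `efficient` are made up for illustration and are not taken from the dataset.

```python
import math


def doubling_time_months(points):
    """Estimate the doubling time (in months) from (month, compute) pairs.

    Fits log2(compute) = a + b * month by ordinary least squares;
    b is then "doublings per month", so the doubling time is 1 / b.
    """
    xs = [month for month, _ in points]
    ys = [math.log2(compute) for _, compute in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # doublings per month
    return 1.0 / slope


# Synthetic, hypothetical data: models lying exactly on a 6-month doubling trend.
frontier = [(month, 1e17 * 2 ** (month / 6.0)) for month in range(0, 97, 12)]

# Two hypothetical late "efficient ML" models that use far less compute.
efficient = [(90, 1e19), (96, 2e19)]

without_outliers = doubling_time_months(frontier)
with_outliers = doubling_time_months(frontier + efficient)

print(f"outliers excluded: {without_outliers:.1f} months")
print(f"outliers included: {with_outliers:.1f} months")
```

With this toy data, the fit without the efficient models recovers the 6-month trend it was built from, and adding the two low-compute models lengthens the estimated doubling time. That is the same direction as the 6.2 vs. ≈7 months difference, although the size of the shift here is an artifact of the invented numbers.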

SolenoidEntity

@lennart apologies if this is a silly question, but either there's an error in footnote 4, or I misunderstand something fundamental:

A petaflop/s is floating point operations per second for one day. A day has $86{,}400\ \text{seconds} \approx 10^{5}\ \text{seconds}$. Therefore, $10^{20}$ floating point operations

Shouldn't this read something like (in verbatim spoken words)

"A petaflop per second is ten to the power of fifteen floating point operations per second. A day has [...] 10 to the power of five seconds. Therefore, a 'petaflop-per-second' DAY is 10 to the power of twenty floating point operations."

You've said a petaflop/s is x flop/s for one day, which seems like a typo maybe?

Would you say "petaflop-per-second" days if reading out loud?
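The unit arithmetic in question can be checked with a short sketch. It assumes the standard SI meaning of peta ($10^{15}$), as stated in footnote 4 of the post, and an 86,400-second day; the helper name `petaflop_s_days_to_flop` is hypothetical.

```python
import math

PETA = 10**15                    # SI prefix: 1 petaflop/s = 10^15 FLOP per second
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds


def petaflop_s_days_to_flop(pfs_days):
    """Convert petaflop/s-days (a rate sustained over time) to a total FLOP count."""
    return pfs_days * PETA * SECONDS_PER_DAY


one_pfs_day = petaflop_s_days_to_flop(1)
order_of_magnitude = round(math.log10(one_pfs_day))

print(f"1 petaflop/s-day = {one_pfs_day:.3e} FLOP (about 10^{order_of_magnitude})")
```

So a petaflop/s is a rate, and a petaflop/s-day is that rate sustained for one day: exactly $8.64 \times 10^{19}$ operations, which rounds to the $10^{20}$ in the footnote.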

One could argue the universe is a computer as well: pancomputationalism. ↩︎
You can read some thoughts on quantum computing in the series “Forecasting Quantum Computing” by Jaime Sevilla. ↩︎
Compute produces the data as an interactive environment for reinforcement learning. Therefore, more compute leads to more available training data. ↩︎
A petaflop/s equals $10^{15}$ floating point operations per second. A day has $86{,}400\ \text{seconds} \approx 10^{5}\ \text{seconds}$. Therefore, a petaflop/s-day equals $10^{20}$ floating point operations. ↩︎
Nonetheless, according to estimates, most compute overall is probably used for inference, i.e., for running deployed AI systems. Although, as outlined, the training process is computationally more complex, the repeated inference of a deployed system adds up to more compute used overall (compute for training >> compute for inference, but number of inferences >> number of training runs). In the future, those resources could be repurposed for training, provided we do not see different hardware for training and inference (discussed in Section 4.2) (Amodei and Hernandez 2018). ↩︎
The final training run refers to the last training of an AI system before stopping updating the learned weights and biases and deploying the network for inference. There are usually dozens to hundreds of training runs of AI systems to tweak the architecture and hyper-parameters optimally. While this metric is relevant for the development costs, it is not an optimal proxy for the systems’ capabilities. ↩︎
“We think it’d be a mistake to be confident this trend won’t continue in the short term.” (Amodei and Hernandez 2018). ↩︎
The data used in this section comes from a project by Jaime Sevilla, Pablo Villalobos, Matthew Burtell and Juan Felipe Cerón. We collaborated to add more compute estimates to the public database. I can recommend their first analysis: “Parameter counts in Machine Learning”. ↩︎
Transformative AI, as defined by Open Philanthropy in this blogpost: “Roughly and conceptually, transformative AI is AI that precipitates a transition comparable to (or more significant than) the agricultural or industrial revolution.” ↩︎
For more thoughts and a discussion on this, I can recommend “The Scaling Hypothesis” by Gwern (or the summary in the AI Alignment Newsletter #156). ↩︎
I would also describe the purple part as an open research question. How can we decompose this — differentiating between parallelization, an engineering effort, and spending, where it is easier to find upper limits? ↩︎
I would be interested in an update on this. However, I also did not spend time looking for an update on this in the recent AI experts surveys. ↩︎
I initially made the claim that there are reasons to believe that the available memory capacity of compute systems might match the human brain or at least be sufficient (at least the information we can consciously recall and access). However, while thinking more about this claim, I became uncertain. I started wondering if the brain also has something similar to a memory hierarchy as it is the default for compute systems (different levels of memory capacities which can be accessed at different speeds). I would be interested in research on this. ↩︎
In general, computational power is key to our modern society, and might also be the foundation of life in the future: digital minds. The future of humanity could be computed on digital computers — see “Digital People Would Be An Even Bigger Deal” by Holden Karnofsky or “Sharing the World with Digital Minds” by Bostrom. ↩︎

What is Compute? - Transformative AI and Compute [1/4]

Epistemic Status

1. Compute

1.1 Logic, Memory and Interconnect

1.2 Chips or Integrated Circuits

2. Compute in AI Systems

2.1 Computing in AI Systems

Training

Inference

2.2 Compute Trends: 2012 to 2018

2.3 Compute Trends: An Update[8]

3. Compute and AI Alignment

3.1 The Bitter Lesson

3.2 Scaling Hypothesis

3.3 AI and Efficiency

3.4 Qualitative Assessment

3.5 Compute Milestones

3.6 Conclusion

Next Post: Forecasting Compute

Acknowledgments

References
