Nvidia H100 GPUs: Supply and Demand

This post is an exploration of the supply and demand of GPUs, particularly Nvidia H100s. We’re also releasing a song and music video on the same day as this post.

This post went mega viral. It was on the front page of HN and Techmeme, appeared in many email newsletters, got tweets from Andrej Karpathy and others, and drew comments from Mustafa of Inflection (who will have $1B of GPUs online soon) and Emad of Stability. The song was mentioned in the NY Times, and various asset managers and AI founders reached out. If you haven’t read it yet, I hope you enjoy!

Introduction #
As of August 2023, it seems AI might be bottlenecked by the supply of GPUs.

“One reason the AI boom is being underestimated is the GPU/TPU shortage. This shortage is causing all kinds of limits on product rollouts and model training but these are not visible. Instead all we see is Nvidia spiking in price. Things will accelerate once supply meets demand.”

— Adam D’Angelo, CEO of Quora, Poe.com, former Facebook CTO

These Are The CEOs And Companies That Are Most Important to GPU Supply and Demand - And To AI.

Is There Really A Bottleneck? #
Elon Musk says that “GPUs are at this point considerably harder to get than drugs.”

Sam Altman says that OpenAI is GPU-limited and it’s delaying their short term plans (fine-tuning, dedicated capacity, 32k context windows, multimodality).

Capacity of large scale H100 clusters at small and large cloud providers is running out.

“Rn everybody wishes Nvidia could produce more A/H100”

— Message from an exec at a cloud provider

“We’re so short on GPUs the less people use our products the better”

“We’d love it if they use it less because we don’t have enough GPUs”

– Sam Altman, CEO at OpenAI

It’s a good soundbite to remind the world how much users love your product, but it’s also true that OpenAI needs more GPUs.

For Azure/Microsoft:

They are rate limiting employees on GPUs internally. They have to queue up like it was a university mainframe in the 1970s. I think OpenAI is sucking up all of it right now.
The Coreweave deal is all about pasting on their GPU infrastructure.
— Anonymous

In short: Yes, there’s a supply shortage of H100 GPUs. I’m told that for companies seeking 100s or 1000s of H100s, Azure and GCP are effectively out of capacity, and AWS is close to being out.

This “out of capacity” is based on the allocations that Nvidia gave them.

What do we want to know about the bottleneck?

What’s causing it (how much demand, how much supply)?
How long will it last?
What’s going to help resolve it?
The GPU Song #
Uh… We’re also releasing a song on the same day as we’re releasing this post. It’s fire.

If you haven’t heard The GPU Song yet, do yourself a favor and play it.

[Video: The GPU Song]

i just watched the video. very funny. nice work.

– Mustafa Suleyman, CEO at Inflection AI

It’s on Spotify, Apple Music and YouTube.

See more info on the song here.

Demand For H100 GPUs #
What’s causing the bottleneck - Demand

Specifically, what do people want to buy that they can’t?
How many of those GPUs do they need?
Why can’t they use a different GPU?
What are the different product names?
Where do companies buy them and how much do they cost?
Who Needs H100s? #

“It seems like everyone and their dog is buying GPUs at this point”

– Elon

Who Needs/Has 1,000+ H100 Or A100s #
Startups training LLMs
OpenAI (through Azure), Anthropic, Inflection (through Azure and CoreWeave), Mistral AI

CSPs (Cloud Service Providers)
The big 3: Azure, GCP, AWS
The other public cloud: Oracle
Larger private clouds like CoreWeave, Lambda

Other large companies

Who Needs/Has 100+ H100 Or A100s #
Startups doing significant fine-tuning of large open source models.

What Are Most Of The High End GPUs Being Used For? #
Among companies using private clouds (CoreWeave, Lambda) with hundreds or thousands of H100s, the work is almost all LLMs, plus some diffusion model work. Some of it is fine-tuning of existing models, but mostly it’s new startups that you may not yet know about that are building new models from scratch. They’re signing $10mm-50mm contracts over 3 years, for a few hundred to a few thousand GPUs.
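As a rough sanity check on those contract figures, the sketch below converts a contract size and GPU count into an implied average price per GPU-hour. The specific pairings of contract size and GPU count are hypothetical (the post only gives ranges), and the function assumes every GPU is reserved for the full 3-year term and that the contract covers nothing but GPU time, which may not hold in practice.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760; leap days ignored


def implied_rate_per_gpu_hour(contract_usd, num_gpus, years=3):
    """Average dollars per GPU-hour if every GPU is reserved for the full term."""
    return contract_usd / (num_gpus * years * HOURS_PER_YEAR)


# Hypothetical pairings of contract size and GPU count, for illustration only.
examples = [
    (10_000_000, 300),
    (50_000_000, 1_000),
]

for contract_usd, num_gpus in examples:
    rate = implied_rate_per_gpu_hour(contract_usd, num_gpus)
    print(f"${contract_usd:,} for {num_gpus:,} GPUs over 3 years: ${rate:.2f} per GPU-hour")
```

Changing the pairings shifts the implied rate considerably, so treat the output as an order-of-magnitude check rather than a price quote.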

For companies using on-demand H100s with a handful of GPUs, it’s still probably >50% LLM related usage.

Private clouds are now starting to see inbound demand from enterprises who would normally be going with their default big cloud provider, but everyone is out.

Are The Big AI Labs More Constrained On Inference Or Training? #
Depends on how much product traction they have! Sam Altman says OpenAI would rather have more inference capacity if forced to choose, but OpenAI is still constrained on both.

Which GPUs Do People Need? #
Mostly H100s. Why? It’s the fastest for both inference and training of LLMs. (The H100 is often also the best price-performance option for inference.)

Specifically: 8-GPU HGX H100 SXM servers.

My analysis is it’s cheaper to run for the same work as well. The V100 [is] a great deal if you could find them used, which you can’t

– Anonymous

honestly not sure about [it being the best price-performance ratio]? price/performance for training looks about the same for A100 as for H100. for inference, we find that A10Gs are more than enough and much cheaper.

– Private cloud exec

this [A10G’s being more than enough] was true for a while. but in the world of falcon 40b and llama2 70b, which we’re seeing a lot of usage for, it’s not true anymore. we need A100s for these

2xA100s to be exact. so the interconnect speed matters for inference.

– (Different) Private cloud exec
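One way to see why A10Gs stop being enough for these models is to look at weight memory alone. The sketch below is a lower bound under stated assumptions: 16-bit weights (2 bytes per parameter), nominal parameter counts taken from the model names, and published memory sizes of 24 GB for the A10G and 80 GB for the larger A100 variant. It ignores the KV cache, activations, and framework overhead, and it does not model quantization, which would reduce the requirement.

```python
BYTES_PER_PARAM_FP16 = 2

# Nominal parameter counts implied by the model names.
MODEL_PARAMS = {
    "Falcon 40B": 40_000_000_000,
    "Llama 2 70B": 70_000_000_000,
}

# GPU memory in GB, using 1 GB = 1e9 bytes for simplicity.
GPU_MEMORY_GB = {
    "A10G": 24,
    "A100 80GB": 80,
}


def min_gpus_for_weights(num_params, gpu_memory_gb, bytes_per_param=BYTES_PER_PARAM_FP16):
    """Fewest GPUs whose combined memory can hold the weights alone (a lower bound)."""
    weight_bytes = num_params * bytes_per_param
    gpu_bytes = gpu_memory_gb * 1_000_000_000
    return -(-weight_bytes // gpu_bytes)  # integer ceiling division


for model, params in MODEL_PARAMS.items():
    for gpu, mem_gb in GPU_MEMORY_GB.items():
        n = min_gpus_for_weights(params, mem_gb)
        print(f"{model} on {gpu}: at least {n} GPU(s) for fp16 weights")
```

Under these assumptions the 70B weights need at least two 80 GB A100s, which lines up with the "2xA100s" remark. The 40B weights come to exactly 80 GB, so a single 80 GB card leaves no room for anything else in practice. Once a model is split across GPUs, every forward pass crosses the interconnect, which is why its speed starts to matter for inference.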

What’s The Most Common Need From LLM Startups? #
For training LLMs: H100s with 3.2Tb/s InfiniBand.
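For a sense of scale, the sketch below breaks that figure down per GPU. It assumes the 3.2Tb/s is the aggregate InfiniBand bandwidth of one 8-GPU server; the post does not state this explicitly, so treat the per-node interpretation as an assumption.

```python
def per_gpu_bandwidth(node_gbps, gpus_per_node=8):
    """Split a node's aggregate network bandwidth evenly across its GPUs.

    Returns (gigabits per second, gigabytes per second) per GPU.
    """
    gbps_per_gpu = node_gbps / gpus_per_node
    gigabytes_per_s = gbps_per_gpu / 8  # 8 bits per byte
    return gbps_per_gpu, gigabytes_per_s


# 3.2 Tb/s expressed as 3,200 Gb/s; assumed to be per 8-GPU node.
gbps, gbytes = per_gpu_bandwidth(3200, gpus_per_node=8)
print(f"{gbps:.0f} Gb/s per GPU, about {gbytes:.0f} GB/s per GPU")
```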

What Do Companies Want For LLM Training And Inference? #
For training, they tend to want H100s; for inference, it’s much more about performance per dollar.

It’s still a performance per dollar question with H100s vs A100s, but H100s are generally favored because they scale better to higher numbers of GPUs and give faster training times, and speed (compressing the time to launch, train, or improve models) is critical for startups.

“For multi-node training, all of them are asking for A100 or H100 with InfiniBand networking. Only non A/H100 request we see are for inference where workloads are single GPU or single node”

– Private cloud exec

What Is Important For LLM Training? #
Memory bandwidth
FLOPS (tensor cores or equivalent matrix multiplication units)
Caches and cache latencies
Additional features like FP8 compute
Compute performance (related to number of CUDA cores)
Interconnect speed (e.g. InfiniBand)
The H100 is preferred over the A100 partly because of things like lower cache latencies and FP8 compute.

H100 is preferred because it is up to 3x more efficient, but the costs are only (1.5 - 2x). Combined with the overall system cost, H100 yields much more performance per dollar (if you look at system performance, probably 4-5x more performance per dollar).

— Deep learning researcher
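The GPU-level part of that claim is simple division, sketched below using only the numbers in the quote: a speedup of up to 3x at 1.5-2x the cost. The 4-5x system-level figure is the researcher's own estimate and is not reproduced here, since it depends on system costs the post does not break down.

```python
def perf_per_dollar_gain(speedup, cost_multiple):
    """Relative performance per dollar of the new GPU versus the old one."""
    return speedup / cost_multiple


SPEEDUP = 3.0  # "up to 3x more efficient" (an upper bound, per the quote)

for cost_multiple in (1.5, 2.0):
    gain = perf_per_dollar_gain(SPEEDUP, cost_multiple)
    print(f"{SPEEDUP:.0f}x faster at {cost_multiple}x the cost: {gain:.2f}x performance per dollar")
```

Because the GPU is only one part of the system bill (RAM, storage, networking, and hosting cost roughly the same either way), a faster GPU spreads those fixed costs over more work, which is the direction of the system-level argument in the quote.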

What Are The Other Costs Of Training And Running LLMs? #
GPUs are the most expensive individual component, but there are other costs.

System RAM and NVMe SSDs are expensive.

InfiniBand networking is costly.

Power and hosting (electricity, the datacenter building, land, staff) might account for 10-15% of the total cost of running a cluster. That is roughly split between the two: perhaps 5-8% for power and 5-10% for the other elements of hosting (land, building, staff).
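To make those percentages concrete, the sketch below applies them to a total cluster cost. The $10M total is a hypothetical figure chosen for illustration, not a number from the post.

```python
def cost_range(total_usd, low_pct, high_pct):
    """Dollar range corresponding to a percentage range of the total cost."""
    return total_usd * low_pct / 100, total_usd * high_pct / 100


TOTAL_CLUSTER_COST_USD = 10_000_000  # hypothetical total, for illustration only

# Percentage ranges as given in the text.
SHARES = {
    "power": (5, 8),
    "other hosting (land, building, staff)": (5, 10),
    "power + hosting combined": (10, 15),
}

for label, (low_pct, high_pct) in SHARES.items():
    low, high = cost_range(TOTAL_CLUSTER_COST_USD, low_pct, high_pct)
    print(f"{label}: ${low:,.0f} to ${high:,.0f}")
```

The two component ranges add up to 10-18%, slightly wider than the combined 10-15%, so the figures are best read as rough estimates rather than a precise budget.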

It’s mostly networking and reliable datacenters. AWS is difficult to work with because of network limitations and unreliable hardware

— Deep learning researcher

What About GPUDirect? #
GPUDirect is not a critical requirement, but can be helpful.

I would not say it is supercritical, but it makes a difference in performance. I guess it depends on where your bottleneck is. For some architectures / software implementations, the bottleneck is not necessarily networking, but if it is, GPUDirect can make a difference of 10-20%, and those are some pretty significant numbers for expensive training runs.

https://gpus.llm-utils.org/nvidia-h100-gpus-supply-and-demand/