
25 Sep

NVIDIA H100 GPU: The World's Most Advanced AI Inference Accelerator

The groundbreaking NVIDIA H100 GPU and Grace Hopper superchip blaze new trails in AI performance, achieving record results on MLPerf inference benchmarks. Let's review MLPerf benchmark results and how to deploy your model on NVIDIA's most advanced data center GPU.

Jack Dwyer

Platform Engineering + DevOps


GPUs and their Significance

NVIDIA's new H100 GPU is a beast for AI inference. Built on NVIDIA's Hopper architecture, and paired with a Grace CPU in the Grace Hopper superchip, the H100 smashed records on MLPerf inference benchmarks, achieving up to 2x higher throughput for large language models. How'd they pull that off? With specialized software like TensorRT-LLM that's optimized for the H100, of course. Now companies can deploy advanced AI at massive scale and turbocharge applications like virtual assistants, recommendation engines, and more. Read on to learn about the H100's performance and software.

The Pinnacle of Performance: NVIDIA H100 GPU Sets New Records for AI Inference

The NVIDIA H100 GPU is the world's most advanced data center GPU for AI and high performance computing (HPC). This chip redefines what's possible for AI workloads, delivering breakthrough performance on MLPerf inference benchmarks and accelerating large language models to unprecedented speeds.

Built on NVIDIA's next-gen Hopper architecture, the H100 packs some serious heat under the hood. We're talking a Transformer Engine that accelerates transformer models with FP8 precision, third-generation High Bandwidth Memory (HBM3) for maximum memory bandwidth, NVLink interconnect for fast GPU-to-GPU communication, and more. With these specs, the H100 processed MLPerf inference workloads at record speeds, achieving up to 2x higher throughput for large language models compared to previous GPUs. Talk about a need for speed!

To achieve these results, NVIDIA also developed specialized software optimized for the H100's architecture. TensorRT-LLM, for example, is a library for optimizing large language model inference that NVIDIA says can double LLM inference performance on H100 GPUs. By optimizing models to fully utilize the H100's processing power, TensorRT-LLM accelerates applications like virtual assistants, recommendation engines, and generative AI.

With the H100 and supporting software, NVIDIA is enabling companies to deploy advanced AI at scale. This means virtual assistants can understand natural language better than ever before. Recommendation engines can provide highly personalized product suggestions in real time. And generative AI has the potential to automate creative work like writing articles, composing music, and more. Overall, the NVIDIA H100 GPU is the pinnacle of performance for enterprise AI inference. 

The Hopper Architecture and MLPerf Inference Performance

NVIDIA built the H100 GPU on their new Hopper architecture, designed specifically for high-performance AI and HPC. Hopper packs some serious muscle under the hood with features like Transformer Engine for accelerating transformer-based models, HBM3 memory with more than 3 TB/s of bandwidth on the SXM version of the card, and NVLink interconnect for fast GPU-to-GPU communication.

With Hopper and the H100, NVIDIA dominated the latest MLPerf inference benchmarks, setting records across the board. They submitted results for the server and offline scenarios in the data center category using systems with from 1 to 128 H100 GPUs. We're talking performance up to 9x higher than previous MLPerf records for some workloads. No wonder NVIDIA's calling the H100 "the world's most advanced AI inference accelerator".

For image classification, the H100 hit over 1.5 million inferences per second on ResNet-50, a 50-layer deep neural network. For object detection using YOLOv3, it achieved nearly 1.2 million inferences per second. On the natural language processing front, the H100 delivered over 73,000 inferences per second on BERT, a transformer model for understanding language.

When it comes to recommendation engines, the H100 scored over 6 million inferences per second on DLRM, a deep learning recommendation model. For speech recognition using Wav2Letter++, the H100 achieved over 2.1 million inferences per second. That's some seriously fast AI, so no wonder companies are lining up to get their hands on the H100!

With performance like this, the H100 GPU and Hopper architecture redefine what's possible for enterprise AI. Virtual assistants can respond in milliseconds, recommendation engines become turbocharged, and natural language understanding models process information at lightning speed. 

The Grace Hopper Superchip and MLPerf Inference

NVIDIA's Grace Hopper superchip is a high-performer in its own right. This innovative component combines an Arm-based Grace CPU and a Hopper GPU in a single module, linked by the NVLink-C2C interconnect for maximum bandwidth and performance. The superchip allowed NVIDIA to smash MLPerf inference records for key workloads like image classification, object detection, and translation.

For image classification, the Grace Hopper superchip achieved over 95,000 images per second on the ResNet-50 benchmark, more than 2x faster than the next closest system. On the SSD-MobileNet benchmark for object detection, the superchip blew past competitors at over 66,000 inferences per second. And for neural machine translation, the superchip translated over 136,000 sentences per second between English and Chinese, over 2.3x faster than other systems.

To get these crazy results, the Grace Hopper superchip pairs the Hopper GPU's HBM3 memory with the Grace CPU's LPDDR5X memory. The NVLink-C2C interconnect provides up to 900 GB/s of bandwidth between the CPU and the GPU so they can work together seamlessly, and the GPU can reach into CPU memory when a model outgrows its own. NVIDIA also optimized their inference software for the unique architecture of the Grace Hopper superchip to fully utilize its power.

The performance of the Grace Hopper superchip is truly mind-blowing and demonstrates why NVIDIA is the leader in accelerated computing. For enterprises looking to deploy AI workloads, this superchip provides a turnkey solution to accelerate your most demanding workloads. 

TensorRT LLM and Large Language Model Inference

TensorRT-LLM is NVIDIA's inference software optimized for large language models on their H100 GPUs. It works by taking a trained LLM and compiling it into an optimized inference engine that runs blazingly fast on the H100's Tensor Cores. According to NVIDIA, TensorRT-LLM can provide up to 2x higher throughput for models like GPT-J, allowing companies to deploy advanced natural language generation seamlessly.

How exactly does TensorRT-LLM achieve such speedups? Well, it applies specialized optimizations for large language models, like layer fusion, kernel auto-tuning, and sparsity optimizations. Layer fusion combines multiple layers into a single kernel, reducing memory traffic and kernel launches. Kernel auto-tuning finds the fastest kernel configuration for your specific model and GPU. And sparsity optimizations skip computation on zero-valued weights when a model has been pruned to a sparsity pattern the hardware supports.
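To make layer fusion a bit more concrete, here's a minimal sketch in plain Python. It is not TensorRT-LLM code: the function names (`linear`, `add_bias`, `relu`, `fused_linear_bias_relu`) are made up for illustration, and Python lists stand in for GPU memory. The unfused path makes three passes and writes two intermediate results; the fused path produces each output value in a single pass.

```python
# Illustrative only: plain-Python stand-ins for GPU kernels, not TensorRT-LLM code.

def linear(x, weights):
    """'Kernel' 1: one dot product per output; writes an intermediate list."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in weights]

def add_bias(y, bias):
    """'Kernel' 2: reads the intermediate back and writes a second one."""
    return [yi + bi for yi, bi in zip(y, bias)]

def relu(y):
    """'Kernel' 3: reads the second intermediate and writes the final output."""
    return [max(0.0, yi) for yi in y]

def unfused(x, weights, bias):
    # Three separate passes, two intermediate results.
    return relu(add_bias(linear(x, weights), bias))

def fused_linear_bias_relu(x, weights, bias):
    """One 'kernel': dot product, bias, and activation without intermediates."""
    out = []
    for row, b in zip(weights, bias):
        acc = b
        for xi, wi in zip(x, row):
            acc += xi * wi
        out.append(max(0.0, acc))
    return out

x = [1.0, 2.0, 3.0]
weights = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]  # one row of weights per output
bias = [0.5, 1.0]

print(unfused(x, weights, bias))                 # [0.0, 4.0]
print(fused_linear_bias_relu(x, weights, bias))  # [0.0, 4.0]
```

Both paths give the same answer. On a real GPU the win comes from skipping the round trips to memory between steps and from launching one kernel instead of three.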

The results with TensorRT-LLM are pretty mind-blowing. NVIDIA benchmarked several popular large language models on their H100 systems, and TensorRT-LLM improved performance across the board. For example, inference throughput increased from 950 to over 1,800 samples per second for GPT-3, nearly a 2x speedup. BERT-Large throughput jumped from 2,700 to 5,200 samples per second. And T5-3B, a model with about 3 billion parameters, achieved a 50% speedup from 260 to 390 samples per second.
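If you want to sanity-check speedup claims like these yourself, the arithmetic is just "after" divided by "before". Here's a quick sketch using the throughput figures quoted in the paragraph above; the `speedup` helper is a name chosen for this example.

```python
# Throughput figures (samples per second) as quoted in the paragraph above.
quoted = {
    "GPT-3": (950, 1800),
    "BERT-Large": (2700, 5200),
    "T5-3B": (260, 390),
}

def speedup(before, after):
    """Ratio of new throughput to old throughput."""
    return after / before

for model, (before, after) in quoted.items():
    s = speedup(before, after)
    print(f"{model}: {s:.2f}x ({(s - 1) * 100:.0f}% more throughput)")

# GPT-3: 1.89x (89% more throughput)
# BERT-Large: 1.93x (93% more throughput)
# T5-3B: 1.50x (50% more throughput)
```

Since the GPT-3 figure is quoted as "over 1,800", the 1.89x here is a lower bound for that model.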

With specialized software like TensorRT-LLM, the NVIDIA H100 GPU provides a highly optimized platform for deploying and running advanced natural language processing models at scale.

New NVIDIA Software Doubles LLM Inference Speed

NVIDIA's software engineers have been busy optimizing the H100 GPU for large language model inference, and their efforts paid off big time. New software developed specifically for the H100 provides up to 2x higher performance for LLM inferencing compared to previous generation GPUs. How's it work? The software leverages the H100's Hopper architecture and HBM3 memory to load more of the massive language models into the GPU at once. It also takes advantage of the Transformer Engine to accelerate the self-attention layers in models like GPT-3.

The result? GPT-3 inference throughput skyrockets to over 5,000 samples per second on H100 systems, more than double the speed of the A100 GPU. (At 175 billion parameters, GPT-3 is too large to fit in a single H100's memory, so it has to be split across several GPUs.) Performance for other LLMs like BERT and T5 also got a major boost. I don't know about you, but anything that can handle GPT-3 twice as fast sounds like a win to me. With this kind of performance, companies can deploy ever larger language models to enable new capabilities like hyper-personalized chatbots, advanced search features, and automated content creation at scale.

NVIDIA's really flexing their AI muscles with the H100 and its software stack. By optimizing the full hardware and software platform for demanding workloads like large language model inference, they're paving the way for the next generation of intelligent applications. The specialized software unlocks the H100's potential and pushes the envelope on what's possible for natural language processing and other AI domains. If there's one thing I've learned following NVIDIA's tech, it's never bet against their ability to smash performance records. The H100 is no exception, and with software updates over its lifetime, it will only get faster. The future of AI is looking very bright thanks to NVIDIA's vision and engineering prowess.

MLPerf 3.0 Inference Performance and Results

To show off the H100's chops, NVIDIA submitted benchmark results to MLPerf Inference 3.0, the industry standard for measuring AI performance. They tested the H100 on a bunch of different systems, from a single server with 8 GPUs to a massive cluster with 640 GPUs, to see how much the chip could handle.

For image classification, the H100 scored over 95% accuracy in just 13 milliseconds per image. On BERT natural language processing, it processed over 53,000 sentences per second with 99% accuracy. When running a recommendation model on 640 GPUs, the H100 made over 4.2 million predictions per second. 

The real star of the show was the H100's performance on generative models, though. On a language model with 175 billion parameters, the chip achieved throughput of over 2,800 samples per second. That's nearly double the performance of NVIDIA's previous generation A100 GPU on the same model. 
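A model with 175 billion parameters doesn't fit on one GPU, which is part of why multi-GPU servers matter here. The back-of-the-envelope sketch below estimates the memory needed for the weights alone. It assumes 80 GB of memory per H100 and ignores the KV cache, activations, and runtime overhead, so real deployments need more headroom. The helper names are made up for this example.

```python
import math

def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

def min_gpus_for_weights(num_params, bytes_per_param, gpu_memory_gb=80):
    """Smallest GPU count whose combined memory holds the weights.

    Assumption: 80 GB per GPU; KV cache and activations are not counted.
    """
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_memory_gb)

params = 175e9  # the 175-billion-parameter model mentioned above
for precision, nbytes in [("FP16", 2), ("FP8", 1)]:
    gb = weight_memory_gb(params, nbytes)
    gpus = min_gpus_for_weights(params, nbytes)
    print(f"{precision}: {gb:.0f} GB of weights, at least {gpus} GPUs")

# FP16: 350 GB of weights, at least 5 GPUs
# FP8: 175 GB of weights, at least 3 GPUs
```

Dropping from 16-bit to 8-bit weights halves the footprint, which is one reason the H100's FP8 support is a big deal for large language models.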

With the H100 and their optimized software, NVIDIA achieved the highest performance on all 8 MLPerf Inference 3.0 benchmarks. The results prove that the H100 and Grace Hopper superchip are in a league of their own for running AI at scale. NVIDIA's set the bar high for the competition, and I can't wait to see what they come up with next to push the boundaries of AI even further. 

Software for the NVIDIA H100 GPU

To enable high performance AI on the H100, NVIDIA provides a full stack of optimized software. Their CUDA toolkit includes libraries for core functions like linear algebra (cuBLAS), fast Fourier transforms (cuFFT), and random number generation (cuRAND). TensorRT is NVIDIA's inference optimizer and runtime engine, accelerating models built in frameworks like TensorFlow, PyTorch, and MXNet.

For natural language processing, NVIDIA offers TensorRT-LLM, which NVIDIA says can double the throughput of large language models like GPT-3 on H100 GPUs. They also provide libraries for recommender systems, computer vision, speech recognition, and more. With NVIDIA's software, data scientists can focus on building models rather than optimizing for performance; the software handles that under the hood.

NVIDIA's constantly improving their software to achieve maximum performance from the H100 GPUs. Recent updates doubled the speed of BERT natural language inference and boosted GPT-3 performance by up to 50% on the H100 compared to the A100. As new AI models and neural network architectures emerge, NVIDIA's software team will be hard at work ensuring the H100 can run them as fast and efficiently as possible.

Having a robust software stack is key to getting the most from a chip like the H100. Without optimizations that fully utilize the hardware, performance suffers and models can't reach their potential. NVIDIA understands this well and invests heavily in software along with silicon. By creating a holistic platform for accelerated computing with the H100 GPU and tailored software, NVIDIA streamlines the process of building and deploying AI at scale. For any company looking to leverage advanced AI, NVIDIA's technologies provide a turnkey solution to make it happen.

With the H100 GPU and supporting software, the power of AI is at your fingertips. NVIDIA's vision is to enable breakthroughs that change the world, and the H100 is a giant leap down that path.

Getting Started with the NVIDIA H100 GPU

So you want to get your hands on NVIDIA H100 GPUs and turbocharge your AI workloads? Here's how to get started:

Find a GPU Provider

First, get your hardware. You'll want a cloud provider that specializes in GPU compute and has a selection of NVIDIA GPUs. Try out to find available GPUs and providers.

Install NVIDIA GPU software

Next, install NVIDIA's software stack including CUDA, TensorRT, and any libraries you need for your models like TensorRT-LLM or the recommender system toolkit. The software will optimize your models to run as fast as possible on the H100 GPUs. Without it, you'd be leaving a lot of performance on the table.

Note, if you are using Zeet to deploy GPU workloads, Zeet will install these NVIDIA drivers for you. To verify in Zeet, run the 'nvidia-smi' command in the Zeet terminal.
nvidia-smi command output in the Zeet Terminal.
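If you'd rather do that check from a script than eyeball the terminal, here's a small sketch that calls `nvidia-smi` with its CSV query flags and parses the result. The helper names (`parse_gpu_rows`, `list_gpus`) are made up for this example, and the script simply reports nothing if `nvidia-smi` isn't installed or exits with an error.

```python
import shutil
import subprocess

# Ask nvidia-smi for one CSV line per GPU: name, total memory, driver version.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,driver_version",
    "--format=csv,noheader",
]

def parse_gpu_rows(csv_text):
    """Turn nvidia-smi's CSV output into a list of dicts, one per GPU."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, memory, driver = [field.strip() for field in line.split(",")]
        gpus.append({"name": name, "memory_total": memory, "driver_version": driver})
    return gpus

def list_gpus():
    """Return the visible GPUs, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return parse_gpu_rows(result.stdout)

if __name__ == "__main__":
    gpus = list_gpus()
    if not gpus:
        print("No NVIDIA GPUs found (is the driver installed?)")
    for gpu in gpus:
        print(f"{gpu['name']}: {gpu['memory_total']}, driver {gpu['driver_version']}")
```

An empty list is a quick signal that the driver isn't set up on that machine yet.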

Tune and Deploy a Model on NVIDIA H100

Then start porting your models to the H100 platform. If you built them in TensorFlow, PyTorch or another framework, this should be straightforward using NVIDIA's tools. Complex models with lots of parameters like GPT-3 may require some tuning to achieve maximum throughput, but NVIDIA's put together optimization guides to help. Their tech support team is also available if you get stuck.

Once your models are up and running on the H100 GPUs, sit back and enjoy the speed! You'll likely see major performance boosts for workloads like natural language processing, speech recognition, recommendation, and computer vision. With the H100's power, you can deploy more advanced AI and at greater scale than ever before.
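To see what kind of boost you're actually getting, measure throughput before and after the move. Below is a minimal, framework-agnostic sketch: `measure_throughput` and `dummy_predict` are names invented for this example, and `dummy_predict` is a stand-in you'd replace with your own model's inference call.

```python
import time

def measure_throughput(predict, batch, num_batches=50, warmup=5):
    """Return samples per second for repeated calls to predict(batch).

    Warmup calls are excluded so one-time setup costs don't skew the number.
    Note: GPU frameworks often run asynchronously, so with a real model you
    should synchronize before reading the clock (for example with
    torch.cuda.synchronize() in PyTorch).
    """
    for _ in range(warmup):
        predict(batch)
    start = time.perf_counter()
    for _ in range(num_batches):
        predict(batch)
    elapsed = time.perf_counter() - start
    return num_batches * len(batch) / elapsed

def dummy_predict(batch):
    """Stand-in for a real model: does a little arithmetic per sample."""
    return [sum(v * v for v in sample) for sample in batch]

if __name__ == "__main__":
    batch = [[float(i) for i in range(256)] for _ in range(32)]  # 32 samples
    rate = measure_throughput(dummy_predict, batch)
    print(f"{rate:,.0f} samples/sec")
```

Run it with the same batch size and inputs on your old hardware and on the H100 so the two numbers are comparable.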

To get the latest performance updates, keep an eye on NVIDIA's software releases and check MLPerf benchmark results. NVIDIA's constantly improving the H100's software stack, so updates can significantly increase throughput for your models over time. The H100 you buy today will only get faster, allowing your AI systems to become more intelligent and handle higher loads as needed.

Zeet with NVIDIA H100 GPU: A Top Solution for AI Inference

If you're looking to deploy AI inference, the NVIDIA H100 GPU should be at the top of your list. The H100 handles the performance tuning so you can focus on what really matters: building AI that transforms your business.

And if you want to transform your team's productivity and ship new AI services in just a few days, give Zeet a spin. Zeet Blueprints make it easy to deploy infrastructure for large language models on H100 GPUs. With Zeet managing the clusters and optimizing your cloud deployments, your engineers can experiment more and serve new markets faster.

One company that has successfully used this combination is Leap AI, a company offering an API for developers to generate images and fine tune models. Leap leveraged Zeet to manage AI infrastructure. Despite having no dedicated DevOps team and only one full-time developer, they handle up to a million concurrent API requests across three different cloud providers and four Kubernetes clusters.

Before using Zeet, Leap was limited to choosing between high-power GPUs with minimal availability or operating at a lower capacity on AWS. With Zeet, they were able to deploy a multi-cloud infrastructure that eased their resource allocation dilemma. Consequently, Leap experienced a 900% increase in deployments in just six months. Moreover, Zeet's capabilities catered to Leap's specific need to scale using GPUs, effectively lifting the "speed limit" on their growth.

For Leap, Zeet also facilitated easy configuration of more GPUs via simple API calls, and enabled hassle-free multi-region cluster management. As a result, the Leap team was more productive, less constrained, and saved significantly on cloud management, deployments, and engineering horsepower costs.

Zeet has high performance compute providers integrated as deploy targets in the product. Try it out for yourself by signing up for Zeet and deploying your first 3 services for free.

The H100's power combined with Zeet's deployment simplicity is a match made in heaven for any ambitious AI team.

Happy shipping!
