NVIDIA did it again, but this time with a twist, borrowing a page from the competition's playbook. At GTC, which has grown into one of the most important events in the AI industry, the company announced the latest iteration of its hardware architecture and products. Here's a breakdown of the announcements and what they mean for the ecosystem at large.
Hopper: NVIDIA’s New GPU Architecture
GTC, which started on Monday and runs through Thursday, features more than 900 sessions. Over 200,000 developers, researchers and data scientists from more than 50 countries have registered for the event. In his GTC 2022 keynote, NVIDIA founder and CEO Jensen Huang announced a wealth of news in data center and high-performance computing, AI, design collaboration and digital twins, networking, automotive, robotics and healthcare.
Huang's assessment was that "companies are processing, refining their data, creating AI software … becoming intelligence makers." If the goal is to turn data centers into 'AI factories', as NVIDIA calls them, then it makes sense to have a transformer at the center of it.
The centerpiece of the announcements was the new Hopper GPU architecture, which NVIDIA calls "the next generation of accelerated computing." Named after Grace Hopper, a pioneering American computer scientist, the new architecture succeeds the NVIDIA Ampere architecture launched two years ago. The company also announced its first Hopper-based GPU, the NVIDIA H100.
NVIDIA claims that Hopper brings an order-of-magnitude performance leap over its predecessor, an achievement it attributes to six breakthrough innovations. Let's go through them, taking quick note of how they compare to the competition.
First, manufacturing. Built with 80 billion transistors using a cutting-edge TSMC 4N process designed for NVIDIA's accelerated computing needs, the H100 features major advances to accelerate AI, HPC, memory bandwidth, interconnect and communication, including nearly 5 terabytes per second of external connectivity. At the manufacturing level, upstarts like Cerebras and Graphcore are also pushing the limits of what is possible.
NVIDIA H100 GPU, the first to use the new Hopper architecture
Second, multi-instance GPU (MIG). MIG technology allows a single GPU to be partitioned into seven smaller, fully isolated instances to handle different types of jobs. The Hopper architecture extends MIG capabilities by up to 7x over the previous generation by offering secure multi-tenant configurations in cloud environments across each GPU instance. Run:AI, an NVIDIA partner, offers something similar at the software layer, under the name fractional GPU sharing.
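For a sense of how MIG partitioning is driven in practice, here is a rough sketch using NVIDIA's `nvidia-smi` tooling; the profile ID shown is illustrative and the available profiles vary by GPU model, so treat this as an outline rather than a recipe:

```shell
# Enable MIG mode on GPU 0 (typically requires a GPU reset)
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this card supports
nvidia-smi mig -lgip

# Create two GPU instances from a chosen profile ID, plus their
# compute instances (profile ID 9 is illustrative; pick one from -lgip)
sudo nvidia-smi mig -cgi 9,9 -C

# Verify the resulting isolated instances
nvidia-smi -L
```

Each resulting instance has its own memory, cache, and compute slices, which is what enables the secure multi-tenant configurations NVIDIA describes.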
Third, confidential computing. NVIDIA claims the H100 is the world's first accelerator with confidential computing capabilities to protect AI models and customer data while they are being processed. Customers can also apply confidential computing to federated learning in privacy-sensitive industries such as healthcare and financial services, as well as on shared cloud infrastructure. This is not a feature we have seen anywhere else.
Fourth, fourth-generation NVIDIA NVLink. To accelerate the largest AI models, NVLink combines with a new external NVLink switch to extend NVLink as a scale-up network beyond servers, connecting up to 256 H100 GPUs at 9x higher bandwidth than the previous generation, which used NVIDIA HDR Quantum InfiniBand. Again, this is NVIDIA-specific, although competitors often leverage their own specialized infrastructure to connect their hardware, too.
Fifth, DPX instructions to accelerate dynamic programming. Dynamic programming is both a mathematical optimization method and a computer programming method, originally developed in the 1950s. In the context of mathematical optimization, it usually refers to simplifying a decision by breaking it down into a sequence of decision steps over time; in programming terms, it is primarily an optimization over plain recursion.
NVIDIA notes that dynamic programming is used in a wide range of algorithms, including routing optimization and genomics, and that DPX can speed up execution by up to 40x compared to CPUs and up to 7x compared to previous-generation GPUs. We don't know of a direct equivalent from the competition, although several AI chip upstarts exploit similar techniques.
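To ground what is being accelerated, here is the textbook shape of a dynamic programming workload: a pure-Python edit distance, the same kind of table-filling recurrence that underlies genomics alignment algorithms such as Smith-Waterman. This is a plain illustration of the technique, not of DPX itself:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum insertions/deletions/substitutions to turn a into b."""
    # dist[i][j] = edits needed to turn a[:i] into b[:j]
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i  # delete all of a[:i]
    for j in range(len(b) + 1):
        dist[0][j] = j  # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,       # delete a[i-1]
                dist[i][j - 1] + 1,       # insert b[j-1]
                dist[i - 1][j - 1] + cost  # match or substitute
            )
    return dist[-1][-1]
```

Each cell depends only on its three neighbors, which is exactly the kind of regular, data-dependent recurrence that hardware instructions like DPX aim to speed up.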
The sixth innovation is what we consider most important: a new Transformer engine. As NVIDIA notes, transformers are the standard model choice for natural language processing, and one of the most important classes of deep learning models ever invented. The H100 accelerator's Transformer engine is designed to speed up these networks by as much as 6x versus the previous generation without losing accuracy. This deserves further analysis.
The Transformer Engine at the Heart of Hopper
Reading about the new Transformer engine at the heart of NVIDIA's H100, we were reminded of Intel architect Raja M. Koduri's comment to ZDNet's Tiernan Ray. Koduri noted that accelerating matrix multiplication is now an essential measure of chip performance and efficiency, which means that every chip will be a neural net processor.
Koduri was spot on, of course. Besides Intel's own efforts, this is what has driven a new generation of AI chip designs from an array of upstarts. Seeing NVIDIA frame Hopper around a Transformer engine made us wonder whether the company had radically redesigned its GPUs. GPUs were not originally designed for AI workloads; they just got good at them, and NVIDIA had the foresight and acumen to build an ecosystem around them.
Digging deeper into NVIDIA's analysis of the Hopper architecture, however, dispels the notion of a radical redesign. Hopper introduces a new streaming multiprocessor (SM) with many performance and efficiency improvements, but that is as far as it goes. This should come as no surprise, given the enormous weight of the ecosystem built around NVIDIA GPUs and the massive updates and potential incompatibilities a radical redesign would entail.
Breaking down Hopper's improvements, memory seems to be a big part of it. As Facebook's product manager for PyTorch, the popular machine learning training library, told ZDNet, "Models keep getting bigger and bigger, they get really, really big, and really expensive to train." The largest models these days often cannot be stored entirely in the memory circuits that accompany a GPU. Hopper comes with memory that is faster, larger, and shared among SMs.
Another boost comes from NVIDIA's new fourth-generation Tensor Cores, which are up to 6x faster chip-to-chip than the A100's. Tensor cores are precisely what is used for matrix multiplication. The H100 introduces a new FP8 data type, resulting in computation that is 4x faster than the previous generation's 16-bit floating-point options. On equivalent data types, there is still a 2x speedup.
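For intuition on what an 8-bit float gives up, here is a toy pure-Python rounder for E4M3, the format usually meant by "FP8" (4 exponent bits, 3 mantissa bits, largest finite value 448). This sketches the number format only, not NVIDIA's hardware, and skips NaN handling:

```python
import math

def quantize_e4m3(x: float) -> float:
    """Round a float to the nearest FP8 E4M3 value (toy sketch)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    FP8_MAX = 448.0        # largest finite E4M3 value
    MIN_NORMAL = 2.0 ** -6  # smallest normal magnitude
    if mag >= FP8_MAX:
        return sign * FP8_MAX  # saturate, as FP8 training typically does
    if mag < MIN_NORMAL:
        step = 2.0 ** -9       # subnormal spacing
        return sign * round(mag / step) * step
    e = math.floor(math.log2(mag))
    step = 2.0 ** (e - 3)      # 3 mantissa bits => 8 steps per binade
    return sign * round(mag / step) * step
```

Only eight distinct values exist between consecutive powers of two, so 0.3 lands on 0.3125, for instance. That coarse resolution is why per-layer scaling (discussed below) matters so much.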
Summary of H100 compute improvements
As for the so-called new Transformer engine, it turns out this is NVIDIA's term for "a combination of software and custom NVIDIA Hopper Tensor Core technology designed specifically to accelerate transformer model training and inference."
NVIDIA notes that the Transformer engine intelligently manages and dynamically chooses between FP8 and 16-bit calculations, automatically handling re-casting and scaling between FP8 and 16-bit in each layer. The company claims this delivers up to 9x faster AI training and up to 30x faster AI inference on large language models compared to the prior-generation A100.
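The scaling half of that description is the easier part to picture. Here is a toy sketch of the general idea, not NVIDIA's actual heuristics: measure a tensor's largest magnitude, derive a scale factor that maps it into FP8's representable range, cast with saturation, and keep the scale around to undo it later:

```python
FP8_MAX = 448.0  # largest finite value in the E4M3 FP8 format

def amax_scale(values, fp8_max=FP8_MAX):
    """Per-tensor scale mapping the largest magnitude onto the FP8 range."""
    amax = max(abs(v) for v in values)
    return fp8_max / amax if amax > 0 else 1.0

def cast_with_scale(values, scale, fp8_max=FP8_MAX):
    """Scale up and saturate; real hardware also rounds to 8-bit values."""
    return [max(-fp8_max, min(fp8_max, v * scale)) for v in values]

def uncast(values, scale):
    """Undo the scaling when a layer's output is needed in 16-bit again."""
    return [v / scale for v in values]
```

For example, a tensor of tiny gradients like `[0.001, -0.002]` gets a scale of 224,000, so its values occupy FP8's dynamic range instead of vanishing near zero. The "intelligently manages" part, choosing when FP8 is safe at all, is where NVIDIA's tuned heuristics come in.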
So while this is not a radical redesign, the combination of performance and efficiency improvements results in a 6x speedup compared to Ampere, as NVIDIA's technical blog elaborates. NVIDIA's focus on improving transformer model performance is not misplaced at all.
Transformer models are the backbone of language models widely used today, such as BERT and GPT-3. Initially developed for natural language processing use cases, their versatility is increasingly being applied to computer vision, drug discovery and more, as we have been documenting in our State of AI coverage. According to a metric shared by NVIDIA, 70% of the research published in AI in the last two years is based on transformers.
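A helpful way to see why transformer acceleration pays off so broadly: the scaled dot-product attention at the core of every transformer is essentially a stack of matrix multiplications, exactly what tensor cores accelerate. A minimal pure-Python sketch (lists of lists stand in for tensors):

```python
import math

def matmul(A, B):
    """Naive matrix product; in a real model this is the tensor-core work."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(Q[0])
    Kt = [list(col) for col in zip(*K)]          # transpose K
    scores = matmul(Q, Kt)                        # matmul #1
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V)                     # matmul #2
```

Everything outside the two `matmul` calls is cheap elementwise work, which is why a chip that speeds up matrix multiplication speeds up transformers across NLP, vision, and beyond.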
The Software Side of Things: Good News for Apache Spark Users
But what about the software side of things? In previous GTC announcements, software stack updates were a key part of the news. At this event, while the NVIDIA-tuned heuristics that dynamically choose between FP8 and FP16 calculations internally are an important part of the new Transformer engine, updates to the external-facing software stack seem comparatively minor.
NVIDIA’s Triton Inference Server and Nemo Megatron framework are receiving updates for training large language models. So are Riva, Merlin and Maxine – a speech AI SDK that includes pre-trained models, an end-to-end recommender AI framework, and an audio and video quality enhancer SDK, respectively. As NVIDIA highlighted, these are used by the likes of AT&T, Microsoft, and Snapchat.
There are also 60 SDK updates for NVIDIA's CUDA-X libraries. NVIDIA chose to highlight emerging areas such as accelerating quantum circuit simulation (cuQuantum general availability) and 6G physical-layer research (Sionna general availability). For most users, however, the good news probably lies in the update to the RAPIDS Accelerator for Apache Spark, which speeds up processing by over 3x with no code changes.
While this wasn't exactly prominent in NVIDIA's announcements, we think it should have been. An overnight 3x speedup with no code changes, with 80 percent of the Fortune 500 using Apache Spark in production, is no small news. Nor is it the first time NVIDIA has shown some love to Apache Spark users.
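For context on what "no code changes" means: the RAPIDS Accelerator is enabled entirely through Spark configuration, roughly along these lines (the jar name and version placeholder are illustrative; check NVIDIA's documentation for the build matching your Spark and CUDA versions):

```shell
spark-submit \
  --jars rapids-4-spark_2.12-<version>.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.executor.resource.gpu.amount=1 \
  your_existing_job.py   # the application code itself is unchanged
```

The plugin intercepts Spark SQL and DataFrame operations it can run on the GPU and falls back to the CPU for everything else, which is what makes the speedup transparent to existing jobs.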
Overall, NVIDIA is maintaining its pace. While the competition is fierce, with the head start NVIDIA has managed to build, a radical redesign may not really be needed.