Intel Details Ponte Vecchio GPU & Sapphire Rapids HBM Performance, Up To 2.5x Faster Than NVIDIA A100

August 23, 2022

During Hot Chips 34, Intel once again detailed its Ponte Vecchio GPUs running on a Sapphire Rapids HBM server platform.

Intel Shows off Ponte Vecchio 2-Stack GPU & Sapphire Rapids HBM CPU Performance Against NVIDIA’s A100

In the presentation by Intel Fellow & Chief GPU Compute Architect, Hong Jiang, we get some more details regarding the upcoming server powerhouses from the blue team. The Ponte Vecchio GPU comes in three configurations, starting with a single OAM and going up to an x4 Subsystem with Xe Links, either running solo or with a dual-socket Sapphire Rapids platform.

The OAM supports all-to-all topologies for both 4 GPU and 8 GPU platforms. Complementing the entire platform is Intel’s oneAPI software stack, which is built on the Level Zero API, a low-level hardware interface that supports cross-architecture programming. Some of the main features of Level Zero include:

Interface for oneAPI and other tools to accelerator devices
Fine-grain control and low-latency access to accelerator capabilities
Multi-Threaded Design
For GPUs, ships as a part of the driver

Coming to the performance metrics, a 2-Stack Ponte Vecchio GPU configuration like the one featured on a single OAM is capable of delivering up to 52 TFLOPs of FP64/FP32 compute, 419 TFLOPs of TF32 (XMX Float 32), 839 TFLOPs of BF16/FP16 and 1678 TOPs of INT8 horsepower.
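To put the peak figures above in relation to one another, the short Python sketch below divides each quoted number by the FP64/FP32 rate. The dictionary labels and the helper name `relative_throughput` are illustrative and not part of any Intel tooling; the values are simply the ones quoted in the paragraph above.

```python
# Peak throughput figures quoted above for a 2-Stack Ponte Vecchio (single OAM).
# Values are in tera-operations per second; the labels are illustrative only.
PEAK_TERA_OPS = {
    "FP64/FP32": 52,
    "TF32": 419,
    "BF16/FP16": 839,
    "INT8": 1678,
}


def relative_throughput(peaks, baseline="FP64/FP32"):
    """Return each peak figure divided by the baseline figure."""
    base = peaks[baseline]
    return {name: value / base for name, value in peaks.items()}


if __name__ == "__main__":
    for name, ratio in relative_throughput(PEAK_TERA_OPS).items():
        print(f"{name:>10}: {PEAK_TERA_OPS[name]:>5} T ({ratio:.1f}x baseline)")
```

Read this way, the TF32 figure is roughly 8x the FP64/FP32 rate, and each further step down in precision doubles the quoted peak.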

Intel also details its maximum cache sizes and the peak bandwidth offered by each of them. The Register File size on the Ponte Vecchio GPU is 64 MB and offers 419 TB/s of bandwidth, the L1 cache also comes in at 64 MB and offers 105 TB/s (4:1), and the L2 cache comes in at 408 MB and offers 13 TB/s of bandwidth (8:1), while the HBM memory pools up to 128 GB and offers 4.2 TB/s of bandwidth (4:1). There’s a range of compute efficiency techniques within Ponte Vecchio such as:

Register File:

Register Caching
Accumulators

L1/L2 Cache:

Write Through
Write Again
Write Streaming
Uncached

Prefetch:

Software (instruction) prefetch to L1 and/or L2
Command Streamer prefetch to L2 for instruction and data

Intel explains that the larger L2 cache can deliver some big gains in workloads such as the 2D-FFT case and the DNN case. Some performance comparisons between a full Ponte Vecchio GPU and a module down-configured to 80 MB and 32 MB were shown.
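The bandwidth figures quoted earlier for the register file, L1, L2 and HBM can be compared level by level. The sketch below is a minimal Python calculation that assumes the ratios given in parentheses describe each level's bandwidth relative to the level above it; that reading, the list layout and the helper name `bandwidth_steps` are assumptions made for illustration.

```python
# Memory hierarchy figures as quoted in the article: (level, capacity, peak TB/s).
HIERARCHY = [
    ("Register File", "64 MB", 419.0),
    ("L1 cache", "64 MB", 105.0),
    ("L2 cache", "408 MB", 13.0),
    ("HBM", "128 GB", 4.2),
]


def bandwidth_steps(levels):
    """Return (upper, lower, ratio) for each pair of adjacent levels."""
    steps = []
    for (upper, _, bw_upper), (lower, _, bw_lower) in zip(levels, levels[1:]):
        steps.append((upper, lower, bw_upper / bw_lower))
    return steps


if __name__ == "__main__":
    for upper, lower, ratio in bandwidth_steps(HIERARCHY):
        print(f"{upper} -> {lower}: {ratio:.1f}:1")
```

With the figures as quoted, the first two steps come out close to 4:1 and 8:1. The L2-to-HBM step works out nearer 3:1 than the quoted 4:1, so one of those two figures may be rounded or misquoted.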

But that’s not all: Intel also has performance comparisons between the NVIDIA Ampere A100 running CUDA and SYCL against its own Ponte Vecchio GPUs using SYCL. In miniBUDE, a computational workload that predicts the binding energy of a ligand with its target, the Ponte Vecchio GPU completes the test 2 times faster than the Ampere A100. There’s another performance metric in ExaSMR (Small Modular Reactors for large nuclear reactor designs). Here, the Intel GPU is shown to offer a 1.5x performance lead over the NVIDIA GPU.

It’s a bit interesting that Intel is still comparing its Ponte Vecchio GPUs to the Ampere A100, because the green team has since launched its next-gen Hopper H100 to the market and it’s already been shipping to customers. If Chipzilla feels so confident in its 2-2.5x performance figures, then I don’t think it will have any trouble competing with Hopper, unless shown otherwise.

Here’s Everything We Know About The Intel 7 Powered Ponte Vecchio GPUs

Moving over to the Ponte Vecchio specs, Intel outlined some key features of its flagship data center GPU such as 128 Xe cores, 128 RT units, HBM2e memory, and a total of 8 Xe-HPC GPUs that will be connected together. The chip will feature up to 408 MB of L2 cache in two separate stacks that will connect via the EMIB interconnect. The chip will feature multiple dies based on Intel’s own ‘Intel 7’ process and TSMC’s N7 / N5 process nodes.

Intel also previously detailed the package and die size of its flagship Ponte Vecchio GPU based on the Xe-HPC architecture. The chip will consist of 2 tiles with 16 active dies per stack. The maximum active top die size is going to be 41 mm² while the base die size, which is also referred to as the ‘Compute Tile’, sits at 650 mm². We have all the chiplets and process nodes that the Ponte Vecchio GPUs will utilize listed below:

Intel 7nm
TSMC 7nm
Foveros 3D Packaging
EMIB
10nm Enhanced Super Fin
Rambo Cache
HBM2

Following is how Intel gets to 47 tiles on the Ponte Vecchio chip:

16 Xe HPC (internal/external)
8 Rambo (internal)
2 Xe Base (internal)
11 EMIB (internal)
2 Xe Link (external)
8 HBM (external)
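The tile breakdown above can be checked with a few lines of Python. The dictionary simply restates the list; the name `total_tiles` is a hypothetical helper used only for this illustration.

```python
# Tile counts per category, as listed above.
TILES = {
    "Xe HPC": 16,
    "Rambo": 8,
    "Xe Base": 2,
    "EMIB": 11,
    "Xe Link": 2,
    "HBM": 8,
}


def total_tiles(tiles):
    """Sum the per-category tile counts."""
    return sum(tiles.values())


if __name__ == "__main__":
    print(f"Total tiles: {total_tiles(TILES)}")
```

The six categories add up to the 47 tiles Intel quotes for the full package.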

The Ponte Vecchio GPU makes use of 8 HBM 8-Hi stacks and contains a total of 11 EMIB interconnects. The whole Intel Ponte Vecchio package would measure 4843.75 mm². It is also mentioned that the bump pitch for Meteor Lake CPUs using High-Density 3D Foveros packaging will be 36 µm.

The Ponte Vecchio GPU isn’t 1 chip but a combination of several chips. It’s a chiplet powerhouse, packing the most chiplets on any GPU/CPU out there, 47 to be precise. And these are not based on just one process node but on several process nodes, as we had detailed just a few days back.

Although the Aurora Supercomputer, in which the Ponte Vecchio GPUs and Sapphire Rapids CPUs were to be used, has been pushed back due to several delays by the blue team, it is still good to see the company offering more details. Intel has since teased its next-generation Rialto Bridge GPU as the successor to the Ponte Vecchio GPUs, and it is said to begin sampling in 2023.

Next-Gen Data Center GPU Accelerators

GPU Name	AMD Instinct MI250X	NVIDIA Hopper GH100	Intel Ponte Vecchio	Intel Rialto Bridge
Packaging Design	MCM (Infinity Fabric)	Monolithic	MCM (EMIB + Foveros)	MCM (EMIB + Foveros)
GPU Architecture	Aldebaran (CDNA 2)	Hopper GH100	Xe-HPC	Xe-HPC
GPU Process Node	6nm	4N	7nm (Intel 4)	5nm (Intel 3)?
GPU Cores	14,080	16,896	16,384 ALUs (128 Xe Cores)	20,480 ALUs (160 Xe Cores)
GPU Clock Speed	1700 MHz	~1780 MHz	TBA	TBA
L2 / L3 Cache	2 x 8 MB	50 MB	2 x 204 MB	TBA
FP16 Compute	383 TOPs	2000 TFLOPs	TBA	TBA
FP32 Compute	95.7 TFLOPs	1000 TFLOPs	~45 TFLOPs (A0 Silicon)	TBA
FP64 Compute	47.9 TFLOPs	60 TFLOPs	TBA	TBA
Memory Capacity	128 GB HBM2E	80 GB HBM3	128 GB HBM2e	128 GB HBM3?
Memory Clock	3.2 Gbps	3.2 Gbps	TBA	TBA
Memory Bus	8192-bit	5120-bit	8192-bit	8192-bit
Memory Bandwidth	3.2 TB/s	3.0 TB/s	~3 TB/s	~3 TB/s
Form Factor	OAM	OAM	OAM	OAM v2
Cooling	Passive Cooling / Liquid Cooling	Passive Cooling / Liquid Cooling	Passive Cooling / Liquid Cooling	Passive Cooling / Liquid Cooling
TDP	560W	700W	600W	800W
Launch	Q4 2021	2H 2022	2022?	2024?
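The memory clock, bus width and bandwidth rows in the table can be cross-checked with the usual rule of thumb that peak bandwidth equals bus width times per-pin data rate. The sketch below is a rough Python calculation that assumes the clock row is a per-pin data rate in Gbps and uses decimal units; the function name `peak_bandwidth_tb_s` and the `ROWS` layout are hypothetical, and only the two columns with a listed clock are included.

```python
def peak_bandwidth_tb_s(bus_width_bits, data_rate_gbps):
    """Theoretical peak memory bandwidth in TB/s (decimal units).

    bus_width_bits * data_rate_gbps is Gbit/s; dividing by 8 gives GB/s,
    and dividing by 1000 gives TB/s.
    """
    return bus_width_bits * data_rate_gbps / 8 / 1000


# (bus width in bits, per-pin data rate in Gbps, bandwidth listed in the table in TB/s)
ROWS = {
    "AMD Instinct MI250X": (8192, 3.2, 3.2),
    "NVIDIA Hopper GH100": (5120, 3.2, 3.0),
}

if __name__ == "__main__":
    for name, (bus, rate, listed) in ROWS.items():
        computed = peak_bandwidth_tb_s(bus, rate)
        print(f"{name}: computed {computed:.2f} TB/s, table lists {listed} TB/s")
```

Under these assumptions the MI250X column is self-consistent (about 3.28 TB/s against the listed 3.2 TB/s). The Hopper column works out to about 2.05 TB/s, well short of the listed 3.0 TB/s, which suggests the listed clock for that part may be too low.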
