During Hot Chips 34, Intel once again detailed its Ponte Vecchio GPUs running on a Sapphire Rapids HBM server platform.

Intel Shows Off Ponte Vecchio 2-Stack GPU & Sapphire Rapids HBM CPU Performance Against NVIDIA’s A100

In the presentation by Intel Fellow & Chief GPU Compute Architect, Hong Jiang, we get some more details about the upcoming server powerhouses from the blue team. The Ponte Vecchio GPU comes in three configurations, starting with a single OAM and ranging up to an x4 Subsystem with Xe Links, either running solo or with a dual-socket Sapphire Rapids platform.

The OAM supports all-to-all topologies for both 4 GPU and 8 GPU platforms. Complementing the entire platform is Intel’s oneAPI software stack, which includes Level Zero, an API that provides a low-level hardware interface to support cross-architecture programming. Some of the main features of Level Zero include:

  • Interface for oneAPI and other tools to accelerator devices
  • Fine-grain control and low-latency access to accelerator capabilities
  • Multi-Threaded Design
  • For GPUs, ships as a part of the driver

Coming to the performance metrics, a 2-Stack Ponte Vecchio GPU configuration like the one featured on a single OAM is capable of delivering up to 52 TFLOPs of FP64/FP32 compute, 419 TFLOPs of TF32 (XMX Float 32), 839 TFLOPs of BF16/FP16 and 1678 TOPs of INT8 horsepower.
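To put the peak figures above in proportion, here is a minimal Python sketch that expresses each number format as a multiple of the FP64/FP32 rate. The dictionary keys and variable names are illustrative labels, not Intel terminology, and the values are simply the rounded numbers quoted in this article.

```python
# Peak throughput quoted for a 2-Stack Ponte Vecchio OAM, in tera-operations per second.
# Values are the rounded figures from the article; the labels are illustrative.
peak_tops = {
    "FP64/FP32 (vector)": 52,
    "TF32 (XMX)": 419,
    "BF16/FP16 (XMX)": 839,
    "INT8 (XMX)": 1678,
}

baseline = peak_tops["FP64/FP32 (vector)"]

# Speed-up of each format relative to the FP64/FP32 vector rate.
speedups = {name: value / baseline for name, value in peak_tops.items()}

for name, ratio in speedups.items():
    print(f"{name:>20}: {peak_tops[name]:>5} T/s  ({ratio:.1f}x the FP64/FP32 rate)")
```

With these rounded inputs, each step down in precision roughly doubles the matrix throughput, and TF32 lands at about eight times the vector rate.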

Intel also details its maximum cache sizes and the peak bandwidth offered by each of them. The Register File size on the Ponte Vecchio GPU is 64 MB and offers 419 TB/s of bandwidth, the L1 cache also comes in at 64 MB and offers 105 TB/s (4:1), and the L2 cache comes in at 408 MB and offers 13 TB/s of bandwidth (8:1), while the HBM memory pools up to 128 GB and offers 4.2 TB/s of bandwidth (4:1). There is a range of compute efficiency techniques within Ponte Vecchio such as:

Register File:

  • Register Caching
  • Accumulators

L1/L2 Cache:

  • Write Through
  • Write Back
  • Write Streaming
  • Uncached

Prefetch:

  • Software (instruction) prefetch to L1 and/or L2
  • Command Streamer prefetch to L2 for instruction and data

Intel explains that the larger L2 cache can deliver some big gains in workloads such as the 2D-FFT case and the DNN case. Some performance comparisons between a full Ponte Vecchio GPU and a module down-configured to 80 MB and 32 MB were shown.
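The bandwidth ratios quoted for the memory hierarchy can be sanity-checked with a few lines of Python. This is only a sketch using the rounded bandwidth figures from the article; the list layout and names are my own.

```python
# Peak bandwidth per level of the Ponte Vecchio memory hierarchy, in TB/s,
# ordered from closest to the execution units to furthest away.
levels = [
    ("Register File", 419.0),
    ("L1 cache", 105.0),
    ("L2 cache", 13.0),
    ("HBM", 4.2),
]

# Ratio between each level and the next one down.
ratios = []
for (upper, upper_bw), (lower, lower_bw) in zip(levels, levels[1:]):
    ratio = upper_bw / lower_bw
    ratios.append((upper, lower, ratio))
    print(f"{upper} -> {lower}: {ratio:.1f}:1")
```

The first two steps reproduce the quoted 4:1 and 8:1 ratios. The L2 to HBM step works out closer to 3:1 from these rounded numbers, so the quoted 4:1 presumably rests on unrounded values.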

But that's not all. Intel also has performance comparisons between the NVIDIA Ampere A100 running CUDA and SYCL against its own Ponte Vecchio GPUs using SYCL. In miniBUDE, a computational workload that predicts the binding energy of a ligand with its target, the Ponte Vecchio GPU produces the test results 2 times faster than the Ampere A100. There’s another performance metric in ExaSMR (Small Modular Reactors for large nuclear reactor designs). Here, the Intel GPU is shown to offer a 1.5x performance lead over the NVIDIA GPU.

It’s a bit interesting that Intel is still comparing its Ponte Vecchio GPUs to the Ampere A100, because the green team has since launched its next-gen Hopper H100 and it is already shipping to customers. If Chipzilla feels so confident in its 2-2.5x performance figures, then I don't think it will have any trouble competing well with Hopper unless shown otherwise.

Here's Everything We Know About The Intel 7 Powered Ponte Vecchio GPUs

Moving over to the Ponte Vecchio specs, Intel outlined some key features of its flagship data center GPU such as 128 Xe cores, 128 RT units, HBM2e memory, and a total of 8 Xe-HPC GPUs that will be connected together. The chip will feature up to 408 MB of L2 cache in two separate stacks that will connect via the EMIB interconnect. The chip will feature multiple dies based on Intel’s own ‘Intel 7’ process and TSMC’s N7 / N5 process nodes.

Intel also previously detailed the package and die size of its flagship Ponte Vecchio GPU based on the Xe-HPC architecture. The chip will consist of 2 tiles with 16 active dies per stack. The maximum active top die size is going to be 41mm2, while the base die, which is also called the ‘Compute Tile’, sits at 650mm2. We have all the chiplets and process nodes that the Ponte Vecchio GPUs will utilize listed below:

  • Intel 7nm
  • TSMC 7nm
  • Foveros 3D Packaging
  • EMIB
  • 10nm Enhanced Super Fin
  • Rambo Cache
  • HBM2

Following is how Intel gets to 47 tiles on the Ponte Vecchio chip:

  • 16 Xe HPC (internal/external)
  • 8 Rambo (internal)
  • 2 Xe Base (internal)
  • 11 EMIB (internal)
  • 2 Xe Link (external)
  • 8 HBM (external)
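The breakdown above can be tallied in a couple of lines of Python to confirm that it does add up to 47. The dictionary keys are shorthand labels for the entries in the list, nothing more.

```python
# Tile counts per type, as listed in the breakdown above.
tiles = {
    "Xe HPC": 16,
    "Rambo": 8,
    "Xe Base": 2,
    "EMIB": 11,
    "Xe Link": 2,
    "HBM": 8,
}

total_tiles = sum(tiles.values())
print(f"Total tiles: {total_tiles}")
```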

The Ponte Vecchio GPU makes use of 8 HBM 8-Hi stacks and contains a total of 11 EMIB interconnects. The whole Intel Ponte Vecchio package would measure 4843.75mm2. It is also mentioned that the bump pitch for Meteor Lake CPUs using High-Density 3D Foveros packaging will be 36 µm.

The Ponte Vecchio GPU isn’t 1 chip but a combination of several chips. It's a chiplet powerhouse, packing the most chiplets on any GPU/CPU out there, 47 to be precise. And these are not based on just one process node but on several, as we had detailed just a few days back.

Although the Aurora Supercomputer, in which the Ponte Vecchio GPUs and Sapphire Rapids CPUs were to be used, has been pushed back due to several delays by the blue team, it is still good to see the company offering more details. Intel has since teased its next-generation Rialto Bridge GPU as the successor to the Ponte Vecchio GPUs, and it is said to begin sampling in 2023.

Next-Gen Data Center GPU Accelerators

GPU Name | AMD Instinct MI250X | NVIDIA Hopper GH100 | Intel Ponte Vecchio | Intel Rialto Bridge
Packaging Design | MCM (Infinity Fabric) | Monolithic | MCM (EMIB + Foveros) | MCM (EMIB + Foveros)
GPU Architecture | Aldebaran (CDNA 2) | Hopper GH100 | Xe-HPC | Xe-HPC
GPU Process Node | 6nm | 4N | 7nm (Intel 4) | 5nm (Intel 3)?
GPU Cores | 14,080 | 16,896 | 16,384 ALUs (128 Xe Cores) | 20,480 ALUs (160 Xe Cores)
GPU Clock Speed | 1700 MHz | ~1780 MHz | TBA | TBA
L2 / L3 Cache | 2 x 8 MB | 50 MB | 2 x 204 MB | TBA
FP16 Compute | 383 TOPs | 2000 TFLOPs | TBA | TBA
FP32 Compute | 95.7 TFLOPs | 1000 TFLOPs | ~45 TFLOPs (A0 Silicon) | TBA
FP64 Compute | 47.9 TFLOPs | 60 TFLOPs | TBA | TBA
Memory Capacity | 128 GB HBM2E | 80 GB HBM3 | 128 GB HBM2e | 128 GB HBM3?
Memory Clock | 3.2 Gbps | 3.2 Gbps | TBA | TBA
Memory Bus | 8192-bit | 5120-bit | 8192-bit | 8192-bit
Memory Bandwidth | 3.2 TB/s | 3.0 TB/s | ~3 TB/s | ~3 TB/s
Form Factor | OAM | OAM | OAM | OAM v2
Cooling | Passive Cooling, Liquid Cooling | Passive Cooling, Liquid Cooling | Passive Cooling, Liquid Cooling | Passive Cooling, Liquid Cooling
TDP | 560W | 700W | 600W | 800W
Launch | Q4 2021 | 2H 2022 | 2022? | 2024?
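The Memory Clock, Memory Bus and Memory Bandwidth rows are related by a standard formula: peak bandwidth is the bus width multiplied by the per-pin data rate, divided by 8 to convert bits to bytes. The Python sketch below applies it to the two rows that list a memory clock. The function name is my own, and the result is a theoretical peak, not a measured figure.

```python
def peak_bandwidth_tbs(bus_width_bits: int, pin_rate_gbps: float) -> float:
    """Theoretical peak memory bandwidth in TB/s (decimal units).

    bus_width_bits * pin_rate_gbps gives gigabits per second across the bus;
    dividing by 8 converts to GB/s, and by 1000 to TB/s.
    """
    return bus_width_bits * pin_rate_gbps / 8 / 1000


# Inputs taken from the table rows that list both a bus width and a memory clock.
mi250x = peak_bandwidth_tbs(8192, 3.2)
gh100 = peak_bandwidth_tbs(5120, 3.2)

print(f"AMD Instinct MI250X: {mi250x:.2f} TB/s")
print(f"NVIDIA Hopper GH100: {gh100:.2f} TB/s")
```

The MI250X row is consistent with its listed 3.2 TB/s. The Hopper row does not reproduce its listed 3.0 TB/s from a 3.2 Gbps clock, which suggests one of those two table entries is approximate or preliminary.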
