Feel the love for Nvidia’s latest GPUs: Ada Lovelace

If you want the performance, you will get it, but it won’t be cheap.

This week (#38 of the year), Nvidia dominated the news with its GTC announcements. Top among them was the introduction of the long-awaited and much-rumored RTX 40 series and the associated AD100 series GPUs, which are built on the Ada Lovelace architecture.

Everything about the AD100 is supersized—compare the current GPU with the previous generation.

| | RTX 4090 | RTX 3090 | RTX 4090 as % of RTX 3090 |
|---|---|---|---|
| Architecture | Ada Lovelace | Ampere | |
| GPU | AD102 | GA102 | |
| Process node | 4 nm TSMC | 8 nm Samsung | 50% |
| Transistors (billion) | 76 | 28.3 | 269% |
| Die size (mm²) | 608 | 628 | 97% |
| CUDA cores | 16,384 | 10,496 | 156% |
| RT cores | 128 | 82 | 156% |
| Tensor cores | 512 | 328 | 156% |
| Base clock (MHz) | 2,235 | 1,395 | 160% |
| Boost clock (MHz) | 2,520 | 1,695 | 149% |
| Memory speed (Gbps) | 21 | 19.5 | 108% |
| Bandwidth (GB/s) | 1,008 | 936 | 108% |
| TDP (W) | 450 | 350 | 129% |
| MSRP | $1,599 | $1,499 (today $939) | 107% |

Comparison of Ada Lovelace to last-gen Ampere GPU.
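
The percentage column is simply the RTX 4090 figure divided by the RTX 3090 figure. A minimal sketch of that arithmetic for a few rows, using values copied from the table (the helper name `percent_of` is made up for illustration):

```python
def percent_of(new: float, old: float) -> int:
    """Return `new` expressed as a rounded percentage of `old`."""
    return round(new / old * 100)


# (RTX 4090 value, RTX 3090 value), taken from the comparison table
specs = {
    "Transistors (billion)": (76, 28.3),
    "CUDA cores": (16_384, 10_496),
    "Boost clock (MHz)": (2_520, 1_695),
    "TDP (W)": (450, 350),
}

ratios = {name: percent_of(new, old) for name, (new, old) in specs.items()}

for name, pct in ratios.items():
    print(f"{name}: {pct}%")
```

A value of 269% therefore means roughly 2.7 times as many transistors, not 269 times.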

The RTX 40 series GPUs feature a range of new technological innovations, including:

  • Streaming multiprocessors with up to 83 TFLOPS of shader power—2× over the previous generation.
  • Third-generation RT cores with up to 191 effective ray-tracing teraflops—2.8× over the previous generation.
  • Fourth-generation Tensor cores with up to 1.32 Tensor PFLOPS—5× over the previous generation using FP8 acceleration.
  • Shader Execution Reordering (SER) improves execution efficiency by rescheduling shading workloads on the fly to make better use of the GPU’s resources. Nvidia calls it as significant an innovation as out-of-order execution was for CPUs and says SER improves ray-tracing performance by up to 3× and in-game frame rates by up to 25%.
  • Ada Optical Flow Accelerator with 2× faster performance allows DLSS 3 to predict movement in a scene, enabling the neural network to boost frame rates while maintaining image quality.
  • Architectural improvements coupled with custom TSMC 4N process technology result in as much as a 2× leap in power efficiency.
  • Dual Nvidia Encoders (NVENC) cut export times by up to half and feature AV1 support. NVENC AV1 encoding is being adopted by OBS, Blackmagic Design’s DaVinci Resolve, Discord, and more.
Photo of Ada Lovelace GPU die. (Source: Nvidia)

Nvidia says the new third-generation RT cores deliver 2× faster ray-triangle intersection testing and include two crucial new hardware units. An Opacity Micromap Engine speeds up ray tracing of alpha-tested geometry by a factor of 2, and a Micro-Mesh Engine generates micro-meshes on the fly to create additional geometry.

The Micro-Mesh Engine provides the benefits of increased geometric complexity without the traditional performance and storage costs of complex geometries.

The AD102 has about 2.7 times as many transistors (269%) as the previous-generation GA102. (Source: Nvidia)

The Nvidia Ada Lovelace architecture at the heart of each GeForce RTX 40 series AIB is a massive generational leap not only in transistors, as cited above, but also, says Nvidia, in performance, efficiency, and capabilities. Built on a custom TSMC 4N process, with up to 76 billion transistors (compared to the last generation’s 28 billion), Ada is, the company claims, the most advanced GPU architecture ever created.

Ada is also very efficient, with over twice the performance at the same power as Ampere and excellent scalability and overclocking ability as power consumption increases.

Nvidia’s Ada Lovelace power-performance efficiency compared to previous generations. (Source: Nvidia)

Streaming Multiprocessors (SMs) provide the primary performance for games. Having doubled peak FP32 throughput in the last-generation GPU, Nvidia has more than doubled peak throughput again with Ada. The new GPU, says the company, can reach up to 83 TFLOPS on the GeForce RTX 4090, compared with the 40 shader TFLOPS of its fastest previous-generation GPU.

The following table shows a comparison of the three new AIBs from Nvidia.

| AIB | RTX 4090 | RTX 4080 16GB | RTX 4080 12GB |
|---|---|---|---|
| GPU | AD102 | AD103 | AD104 |
| Process technology | TSMC 4N | TSMC 4N | TSMC 4N |
| Transistors (billion) | 76 | 40? | 32? |
| Die size (mm²) | 608.4 | 380? | 300? |
| SMs | 128 | 76 | 60 |
| GPU cores (shaders) | 16,384 | 9,728 | 7,680 |
| Tensor cores | 512 | 304 | 240 |
| Ray-tracing cores | 128 | 76 | 60 |
| Boost clock (MHz) | 2,520 | 2,510 | 2,610 |
| VRAM speed (Gbps) | 21 | 23 | 21 |
| VRAM (GB) | 24 | 16 | 12 |
| VRAM bus width (bits) | 384 | 256 | 192 |
| L2 cache (MB) | 96? | 64? | 48? |
| ROPs | 192? | 112? | 80? |
| TMUs | 512? | 304? | 240? |
| TFLOPS FP32 (boost) | 82.6 | 48.8 | 40.1 |
| TFLOPS FP16 (FP8) | 661 (1,321) | 391 (781) | 321 (641) |
| Bandwidth (GB/s) | 1,008 | 736? | 504? |
| TDP (W) | 450 | 320 | 285 |
| Launch date | Oct. 2022 | Nov. 2022 | Nov. 2022 |
| Launch price ($) | 1,599 | 1,199 | 899 |

GeForce RTX 40 series specifications.
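
The FP32 and bandwidth rows follow from the other rows in the table. As a rough sketch, assuming each shader retires one fused multiply-add (2 FLOPs) per clock at the boost frequency and that bandwidth is bus width times per-pin data rate, the figures can be reproduced as follows (the function names are illustrative only):

```python
def fp32_tflops(shaders: int, boost_mhz: float) -> float:
    """Peak FP32 TFLOPS, assuming 2 FLOPs (one fused multiply-add) per shader per clock."""
    return shaders * 2 * boost_mhz * 1e6 / 1e12


def bandwidth_gb_per_s(bus_width_bits: int, speed_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: bus width times per-pin rate, divided by 8 bits per byte."""
    return bus_width_bits * speed_gbps / 8


# (shaders, boost clock in MHz, bus width in bits, VRAM speed in Gbps), copied from the table
cards = {
    "RTX 4090": (16_384, 2_520, 384, 21),
    "RTX 4080 16GB": (9_728, 2_510, 256, 23),
    "RTX 4080 12GB": (7_680, 2_610, 192, 21),
}

for name, (shaders, boost, bus, speed) in cards.items():
    print(f"{name}: {fp32_tflops(shaders, boost):.1f} TFLOPS, "
          f"{bandwidth_gb_per_s(bus, speed):.0f} GB/s")
```

These are theoretical peaks at the rated boost clock; the bandwidth figures for the two RTX 4080 models inherit the uncertainty of the table's question-marked entries.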

Nvidia increased the number of RT cores from 82 to 128 with the Ada Lovelace AD102 GPU and, the company says, more than doubled their peak throughput, as depicted in the following diagram.

Nvidia’s new 3rd-generation RT cores.

Nvidia pioneered real-time ray tracing with the RTX 20 series in 2018. In the 2022 RTX 40 series, the company has introduced a revolutionary new shader approach to ray tracing that it calls Shader Execution Reordering (SER).

Shader Execution Reordering

A GPU is a highly parallel design built from SIMD (single instruction, multiple data) processors, commonly known as shaders. SIMD works exceptionally well when the data is well organized. However, ray tracing requires computing the impact of millions of rays striking numerous different material types throughout a scene, creating a sequence of divergent, inefficient workloads for basic shaders.

Nvidia’s new SER design dynamically reorganizes these previously inefficient workloads into considerably more efficient ones, improving shader performance, says the company, by up to 200% and in-game frame rates by up to 25%.
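
The benefit of reordering can be illustrated with a toy CPU-side model. This is only a sketch of the general idea, not Nvidia's hardware mechanism or any real API: it assumes rays are processed in groups of 32 (the warp size on Nvidia GPUs), that each distinct material in a group costs one serialized shader pass, and that material IDs are random. All names are hypothetical.

```python
import random

WARP_SIZE = 32  # threads that execute in lockstep on Nvidia GPUs


def shader_passes(material_ids: list[int]) -> int:
    """Count serialized shader passes: one per distinct material in each warp."""
    total = 0
    for start in range(0, len(material_ids), WARP_SIZE):
        warp = material_ids[start:start + WARP_SIZE]
        total += len(set(warp))  # divergent materials run one after another
    return total


rng = random.Random(0)  # fixed seed so the run is repeatable
hits = [rng.randrange(8) for _ in range(1024)]  # material hit by each ray (assumed random)

unsorted_passes = shader_passes(hits)
reordered_passes = shader_passes(sorted(hits))  # group rays by material before shading

print(f"passes without reordering: {unsorted_passes}")
print(f"passes with reordering:    {reordered_passes}")
```

In this model, sorting by material lets most warps run a single shader, so the pass count falls close to the number of warps. Real hardware also has to weigh the cost of the reordering itself, which the sketch ignores.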

When not doing ray tracing, shaders calculate the appropriate levels of light, darkness, and color while rendering a 3D scene and are used in every modern game.

Ada’s 3rd-generation ray-tracing cores have twice the ray-triangle intersection throughput of the previous-generation GPU, increasing peak RT-TFLOP performance by up to 2.8×, says Nvidia. On the GeForce RTX 4090, gamers and creators will have 191 RT-TFLOPS of power at their disposal, compared to the 78 RT-TFLOPS of the company’s fastest previous-gen GPU, enabling far more immersive ray-traced worlds to be rendered at far faster speeds.

Nvidia describes the tensor cores in the Ada Lovelace GPU as 4th generation. Users, says the company, can also leverage tensor cores to enhance their broadcasts and video and voice calls in the Nvidia Broadcast app.

Ada’s new tensor cores are fast and have a new 8-bit floating point (FP8) tensor engine, increasing throughput by up to 5× to 1.32 tensor-PFLOPS on the GeForce RTX 4090, says Nvidia. The tensor cores enable DLSS, which Nvidia says is now employed in 216 released games and apps.

The frame generation technology in Nvidia’s DLSS 3 is powered by Ada’s new optical flow accelerator, which feeds pixel motion data from successive frames to the DLSS neural network. The network then generates new frames on the GPU, which raises frame rates even in CPU-bound scenarios.

Transcoding and broadcasting

GeForce RTX 4090 and GeForce RTX 4080 AIBs have two eighth-generation Nvidia encoders (NVENC) capable of AV1 encoding, which is attractive for livestreamers, video editors, and video callers.

For livestreamers, AV1 improves encoding efficiency by 40%. OBS, which develops the OBS Studio streaming app, collaborated with Nvidia to enable AV1 encoding in its next software release, expected in October. OBS says it also optimized encoding pipelines to reduce overhead by 35% for all Nvidia GPUs. Video callers also benefit from high-quality livestreaming, with Discord adding AV1 support later this year.

For video editors, Nvidia claims the dual encoders are up to 2× faster and will save creators hours each week. The company says it has collaborated with DaVinci Resolve, Voukoder (a popular plug-in for Adobe Premiere Pro), and Jianying (China’s most popular video editing app) to enable this feature in their editing apps. The updates will be available in October. And if you are interested in capturing high-res content for your videos, GeForce Experience users with GeForce RTX 40 series AIBs can now use Nvidia ShadowPlay to capture gameplay at up to 8K and 60 fps in HDR.

Nvidia DLSS 3

Nvidia’s Deep Learning Super Sampling (DLSS) was a groundbreaking revolution in AI-powered graphics and enabled performance (frame rate) improvement on GeForce RTX GPUs using dedicated tensor cores.

With Ada, the company has introduced DLSS 3, adding new AI-powered Optical Multi Frame Generation, which generates entirely new high-quality frames rather than just pixels. Through a process detailed in Nvidia’s DLSS 3 article, DLSS 3 combines DLSS Super Resolution, DLSS Frame Generation, and Nvidia Reflex to reconstruct seven-eighths of the displayed pixels, accelerating performance significantly.
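
The seven-eighths figure can be checked with simple arithmetic. The sketch below assumes, as an illustration rather than something stated in this article, that Super Resolution renders at 1080p for a 4K display (one quarter of the pixels) and that Frame Generation produces every other displayed frame:

```python
from fractions import Fraction

# Assumption: internal render at 1920x1080, output at 3840x2160
rendered_per_frame = Fraction(1920 * 1080, 3840 * 2160)  # share of pixels rendered per frame

# Assumption: one of every two displayed frames is traditionally rendered
rendered_frames = Fraction(1, 2)

rendered = rendered_per_frame * rendered_frames  # share of displayed pixels actually rendered
reconstructed = 1 - rendered                     # share produced by DLSS 3

print(f"rendered: {rendered}, reconstructed: {reconstructed}")
```

Under those assumptions only one-eighth of the displayed pixels are rendered conventionally, which leaves seven-eighths to be reconstructed.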

The data path of Nvidia’s DLSS 3 process. (Source: Nvidia)

Nvidia claims that DLSS 2 and its Super Resolution technology can increase frame rates in some CPU-limited games by up to 200%. DLSS 3 can boost them by up to 400%, says Nvidia.

Opacity Micromaps optimize ray tracing by encoding data about game details directly onto the world’s objects in the game engine ahead of time. Complex items, like foliage, are especially demanding to trace due to the many different ways rays can affect their appearance based on scene lighting and the innumerable directions rays can bounce between leaves and branches. The 3rd-generation RT core present in Ada GPUs uses opacity masks to assign the opacity state of these items, which can be opaque, transparent, or unknown.

Opacity Micromaps are new with Ada. By baking the ray-tracing characteristics of irregularly shaped and translucent objects into an opacity mask, the 3rd-generation RT cores render these complex objects faster, thereby improving performance.
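
The three opacity states suggest why the mask saves work: only the unknown case needs a shader to decide whether the ray actually hits. A minimal sketch of that decision, with hypothetical names that do not correspond to any real ray-tracing API:

```python
from enum import Enum
from typing import Callable


class Opacity(Enum):
    OPAQUE = 0       # ray always hits; no shader call needed
    TRANSPARENT = 1  # ray always passes through; no shader call needed
    UNKNOWN = 2      # must fall back to an alpha test in a shader


def ray_hits(state: Opacity, alpha_test: Callable[[], bool]) -> bool:
    """Resolve a hit from the precomputed mask, calling the shader only if needed."""
    if state is Opacity.OPAQUE:
        return True
    if state is Opacity.TRANSPARENT:
        return False
    return alpha_test()  # the costly path, taken only for UNKNOWN


shader_calls = []  # records each time the (stand-in) alpha-test shader runs


def leaf_alpha_test() -> bool:
    shader_calls.append(1)
    return True  # assumed result, for illustration only


# Mask states for four micro-triangles of a leaf (made-up data)
mask = [Opacity.OPAQUE, Opacity.TRANSPARENT, Opacity.UNKNOWN, Opacity.OPAQUE]
results = [ray_hits(state, leaf_alpha_test) for state in mask]

print(results, "shader calls:", len(shader_calls))
```

Here four intersections are resolved with a single shader call; without the mask, every one of them would need the alpha test.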

With Displaced Micro-Meshes (DMM) generated by the RT cores, a ray-tracing bounding volume hierarchy (BVH) is built, says Nvidia, 10× faster while using one-fifth the VRAM. DMMs are new primitives representing a structured mesh of micro-triangles that Ada (and only Ada) RT cores process natively, reducing storage and processing requirements compared to previous generations, which rendered complex geometries using only basic triangles.

DMM is particularly beneficial in highly detailed ray-traced games and scenes, and gives developers the performance to create photoreal games and experiences that leverage photogrammetry and super-detailed objects and surfaces.

But seeing is believing, and Nvidia has a variety of eye-catching demos on its Web page of new and upgraded games that employ the new ray-tracing features offered in the RTX 40 series AIBs.

(Source: Oxford Today)