
AMD RDNA 3 GPUs To Be More Power Efficient Than NVIDIA Ada Lovelace GPUs, Navi 31 & Navi 33 Tape Out Later This Year

Not correct.


[Image: average FPS benchmark at 3840x2160]


RTX Ampere was designed to have excess TFLOPS relative to its rasterization hardware for other workloads such as:
1. Mesh shader (compute)
2. Direct Storage GpGPU decompression (compute)
3. DirectML (compute)

Comparing a Turing SM with an Ampere SM, each Turing integer-only CUDA core was turned into an integer/floating-point CUDA core in Ampere.
Turing SM
64 Integer CUDA cores
64 floating-point CUDA cores

Ampere SM
64 integer/floating-point CUDA cores
64 floating-point CUDA cores
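
For a rough sense of what that layout change means on paper, here is a back-of-the-envelope sketch (SM counts and boost clocks are nominal spec figures, purely illustrative; the function is mine):

```python
# Rough sketch: paper FP32 throughput implied by the SM layouts above.
# SM counts and boost clocks are nominal spec figures, used only for illustration.

def peak_fp32_tflops(sm_count, fp32_lanes_per_sm, boost_ghz):
    # 2 FLOPs per lane per clock (a fused multiply-add counts as two)
    return sm_count * fp32_lanes_per_sm * 2 * boost_ghz / 1000

# Turing SM: 64 dedicated FP32 lanes (the 64 INT32 lanes add no FP32 throughput)
turing_2080ti = peak_fp32_tflops(sm_count=68, fp32_lanes_per_sm=64, boost_ghz=1.545)

# Ampere SM: 64 dedicated FP32 lanes + 64 FP32/INT32 lanes = up to 128 FP32 lanes
ampere_3090 = peak_fp32_tflops(sm_count=82, fp32_lanes_per_sm=128, boost_ghz=1.695)

print(f"RTX 2080 Ti ~{turing_2080ti:.1f} TFLOPS, RTX 3090 ~{ampere_3090:.1f} TFLOPS")
# ~13.4 vs ~35.6 -- but the second set of 64 lanes only reaches FP32 peak
# when no INT32 work is being issued on them.
```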

Try again.

Direct Storage and DirectML are GPU-agnostic technologies designed by Microsoft. If anything, they will have been developed primarily to support Microsoft's Xbox - a console that uses RDNA 2 - not Ampere.

And I know exactly how Ampere differs from Turing with respect to floating point.
BTW, did you know that before Turing, CUDA cores could do either integer math OR floating-point math? Amazing, isn't it?

Which means that GA102 does indeed have 10752 CUDA cores - that's literally how Nvidia defines GA102 - that's what their whitepapers say, so that's the truth. And if a CUDA core can do either INT or FP calcs, it's still a Vector ALU, no matter what fancy branding Nvidia gives it.

So the end result is that we have two GPUs

3090 - 82 SMs - 10496 CUDA cores (Vector ALUs) - 100% performance (relative)
6900 XT - 80 CUs - 5120 Stream Processors (Vector ALUs) - 94% performance (relative)

Where is all that extra compute power going?
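
For what it's worth, a quick sketch of relative performance per paper TFLOP, using the relative figures above and nominal boost-clock TFLOPS (illustrative values, not measurements):

```python
# Quick sketch: relative gaming performance per paper TFLOP, using the relative
# performance figures above and nominal boost-clock FP32 TFLOPS.
cards = {
    # name: (relative_performance_percent, paper_fp32_tflops_at_boost)
    "RTX 3090":   (100, 35.6),
    "RX 6900 XT": (94, 23.0),
}
for name, (perf, tflops) in cards.items():
    print(f"{name}: {perf / tflops:.2f} relative-performance points per TFLOP")
# ~2.8 for the 3090 vs ~4.1 for the 6900 XT -- i.e. a large chunk of Ampere's
# paper FP32 doesn't show up in raster gaming results.
```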
 

PaintTinJr

Member
Both AMD RX 5700 XT (via Primitive Shader) and NVIDIA Turing (via Mesh Shader) presented Next Generation Geometry Pipeline (NGGP) programming models. Microsoft has rejected RX 5700 XT's Primitive Shader NGGP.

The main point with Primitive Shaders is shader-based geometry culling that can scale with increased TFLOPS power, while rendered geometry density shouldn't exceed the resolution's pixel count.
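
For what it's worth, the per-triangle culling those shader stages perform boils down to something like this - a toy sketch of the idea only (the function and thresholds are mine, not any vendor's actual primitive/mesh shader code); the sub-pixel test is the "don't exceed the pixel grid" part:

```python
# Toy sketch of shader-style triangle culling (backface + sub-pixel test).
# Not any vendor's actual primitive/mesh shader code -- just the idea.

def cull_triangle(p0, p1, p2):
    """p0..p2 are (x, y) positions already projected to screen space (pixels)."""
    # Signed area: negative or zero -> back-facing or degenerate (CCW front faces)
    area2 = (p1[0] - p0[0]) * (p2[1] - p0[1]) - (p2[0] - p0[0]) * (p1[1] - p0[1])
    if area2 <= 0:
        return "culled: back-facing or degenerate"
    # Bounding box smaller than a pixel -> geometry density exceeds the pixel grid
    w = max(p[0] for p in (p0, p1, p2)) - min(p[0] for p in (p0, p1, p2))
    h = max(p[1] for p in (p0, p1, p2)) - min(p[1] for p in (p0, p1, p2))
    if w < 1.0 and h < 1.0:
        return "culled: smaller than one pixel"
    return "kept"

print(cull_triangle((10, 10), (30, 10), (10, 30)))               # kept
print(cull_triangle((10, 10), (10, 30), (30, 10)))               # back-facing
print(cull_triangle((10.1, 10.1), (10.5, 10.2), (10.3, 10.6)))   # sub-pixel
```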

Differences between AMD RX 5700 XT (via Primitive Shader) and NVIDIA Turing (via Mesh Shader) Next Generation Geometry Pipeline (NGGP) programming models.

[Image: Primitive Shader vs Mesh Shader NGGP comparison]

PS: Vega's NGGP is broken.

PC and XSS/XSX RDNA 2 follows NVIDIA's NGGP model.

PS5 RDNA has RDNA 2's high clock speed improvements.

NVIDIA's Mesh Shader NGGP model exists for the DirectX12U and Vulkan APIs. The Mesh Shader NGGP model is NVIDIA's toy, NOT MS's.

IMO Epic Sax CEO's point still stands.

Despite putting a lot of info into your post for the Nvidia stuff, it still doesn't show to what level these newly labelled abstract blocks are a refactoring of existing silicon functionality, or actually new silicon functionality, or even how much of it is driver/firmware changes with just new names.

On the PS5 side, we have no abstract blocks that show how the custom geometry engine works either, so it is a complete stretch to infer it is in any way deficient for primitive shading by comparison. This was discussed frequently in the next-gen thread, and in tweets from a former PlayStation 5 engineer, now working on Roblox, describing how the mesh-shading-style culling is done even better on PS5 because it culls before the vertex pipeline stage, IIRC.

Even if the hardware aspect of mesh shading is truly missing from PS5, does it even matter that much? What is the efficiency gain - synthetic and actual game engine - of Nvidia mesh shaders over AMD primitive shaders?
Even if the gain was 200%, the games that can exploit such a feature will be powered by UE5's Nanite or similar, using a SW rasterizer that bypasses those pipelines, and by Epic's numbers the games will render about 90% of all their triangles with Nanite - 70% as a worst-case estimate - so at best that 200% gain would apply to 30% of all triangles rendered.
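
Putting rough numbers on that argument - a back-of-the-envelope sketch of the capped overall gain, assuming triangle share is a fair proxy for frame time (my assumption, not Epic's):

```python
# Back-of-the-envelope: overall speedup if mesh shaders were 3x faster (a "200% gain")
# but only apply to the share of triangles NOT software-rasterised by Nanite.

def overall_speedup(hw_share, hw_gain):
    # Amdahl-style: the software-rasterised share (1 - hw_share) is unaffected
    return 1 / ((1 - hw_share) + hw_share / hw_gain)

for hw_share in (0.10, 0.30):   # 90% / 70% of triangles going through Nanite's SW path
    print(f"HW path = {hw_share:.0%} of triangles -> {overall_speedup(hw_share, 3.0):.2f}x overall")
# Roughly 1.07x and 1.25x -- assuming triangle counts are a fair proxy for frame time.
```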

The consoles also have an excess of CPU performance - especially the PS5, with an IO complex that DMAs to unified RAM from its ESRAM, barely touching the CPU compared to DirectStorage - and they are hUMA processors, so doing the CS part on an idle Zen 2 core - maybe even more efficiently - isn't beyond the realms of possibility based on the results we've seen so far at this very early stage on the PS5.

edit:
And in regards to RDNA cards, as others have been indicating in their comments, RDNA seems to offer more performance per TFLOP than the corresponding Nvidia parts, so even if it does cost some regular RDNA compute for the AMD cards, it was probably still a better use of their chip area, and more versatile, to just use RDNA compute with primitive shaders than to provide specific silicon for that feature.
 
Last edited:

Riky

$MSFT
In the latest Digital Foundry Q&A they said the Geometry Engine is a cut-down version of Mesh Shaders that was already present in RDNA 1 - see the question on PC holding back consoles.



About 1.10 in.
 
So the end result is that we have two GPUs

3090 - 82SM's - 10496 CUDA cores (Vector ALUs) - 100% performance (relative)
6900XT - 80CU - 5120 Stream processors (Vector ALUs) - 94% performance (relative)

Where is all that extra compute power going?

Well, it's kinda hard to keep that many cores fed. Funnily enough, we have kinda the reverse situation compared to most of the GCN era, where suddenly AMD has much better TFLOP efficiency...
 
In the latest Digital Foundry Q&A they said the Geometry Engine is a cut-down version of Mesh Shaders that was already present in RDNA 1 - see the question on PC holding back consoles.



About 1.10 in.

This is poor journalism or a lack of knowledge by DF.

Alex was even corrected by LeviathanGamer2 on this on Twitter.

The hardware necessary for Mesh and Primitive Shaders has been present since AMD’s Vega architecture and has been the same up until and including RDNA 2.
 
Last edited:

sircaw

Banned
In the latest Digital Foundry Q&A they said the Geometry Engine is a cut-down version of Mesh Shaders that was already present in RDNA 1 - see the question on PC holding back consoles.



About 1.10 in.

Are you stirring the pot again? :goog_halo:

BAD BOY, BAD.
 
Last edited:

Darius87

Member
In the latest Digital Foundry Q&A they said the Geometry Engine is a cut-down version of Mesh Shaders that was already present in RDNA 1 - see the question on PC holding back consoles.



About 1.10 in.

They said :messenger_grinning: You mean Alex said - we know what he's capable of saying against PS5. Now tell me which GPU supports aborting offscreen vertex processing before primitive assembly? I'll wait...
 

rnlval

Member
Direct Storage and DirectML are GPU-agnostic technologies designed by Microsoft. If anything, they will have been developed primarily to support Microsoft's Xbox - a console that uses RDNA 2 - not Ampere. (1)

And I know exactly how Ampere differs from Turing with respect to floating point.
BTW, did you know that before Turing, CUDA cores could do either integer math OR floating-point math? Amazing, isn't it? (2)

Which means that GA102 does indeed have 10752 CUDA cores - that's literally how Nvidia defines GA102 - that's what their whitepapers say, so that's the truth. And if a CUDA core can do either INT or FP calcs, it's still a Vector ALU, no matter what fancy branding Nvidia gives it.

So the end result is that we have two GPUs

3090 - 82 SMs - 10496 CUDA cores (Vector ALUs) - 100% performance (relative) (3)
6900 XT - 80 CUs - 5120 Stream Processors (Vector ALUs) - 94% performance (relative) (3)


Where is all that extra compute power going?
1. XSX's Direct Storage Decompression function is done by separate hardware on the SoC, NOT by the GPU.

PC's Direct Storage Decompression function is done by GpGPU compute, hence PC GPU needs to reserve extra compute resources for this function.

2. Pascal CUDA cores can execute integer or floating-point workloads with a caveat, i.e. an integer workload will block the floating-point workload. Did you know Pascal would suffer a pipeline stall when integer operations were calculated? (See the rough issue model sketched after this list.)

3. Flawed comparison, since you didn't factor in the Raster Op (read-write units) bottleneck for GA102. GA102 shows its TFLOPS/TIOPS superiority over NAVI 21 during GpGPU compute (using TMUs as read-write units).
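
A rough issue-slot model of point 2, purely illustrative (the ~36 INT per 100 FP instruction mix is the figure NVIDIA cited for typical game shaders in the Turing era; the function is mine, and it glosses over Ampere's second datapath being shared FP32/INT32):

```python
# Rough issue-slot model for point 2 (illustrative only): cost of a shader with
# 100 FP32 instructions plus ~36 INT32 instructions on shared lanes vs a
# separate INT pipe.

def issue_slots(fp_instr, int_instr, separate_int_pipe):
    if separate_int_pipe:
        # Turing/Ampere: INT issues on its own lanes, overlapping the FP stream
        return max(fp_instr, int_instr)
    # Pascal: FP and INT share the same CUDA cores, so INT displaces FP issue slots
    return fp_instr + int_instr

print("Pascal-style shared lanes:  ", issue_slots(100, 36, separate_int_pipe=False))  # 136
print("Turing/Ampere separate INT: ", issue_slots(100, 36, separate_int_pipe=True))   # 100
```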


Try again.
 
Last edited:

rnlval

Member
IMO Epic Sax CEO's point still stands.

Despite putting a lot of info into your post for the Nvidia stuff, it still doesn't show to what level these newly labelled abstract blocks are a refactoring of existing silicon functionality, or actually new silicon functionality, or even how much of it is driver/firmware changes with just new names.

On the PS5 side, we have no abstract blocks that show how the custom geometry engine works either, so it is a complete stretch to infer it is in any way deficient for primitive shading by comparison. This was discussed frequently in the next-gen thread, and in tweets from a former PlayStation 5 engineer, now working on Roblox, describing how the mesh-shading-style culling is done even better on PS5 because it culls before the vertex pipeline stage, IIRC (1).

Even if the hardware aspect of mesh shading is truly missing from PS5, does it even matter that much? What is the efficiency gain - synthetic and actual game engine - of Nvidia mesh shaders over AMD primitive shaders?
Even if the gain was 200%, the games that can exploit such a feature will be powered by UE5's Nanite or similar, using a SW rasterizer that bypasses those pipelines, and by Epic's numbers the games will render about 90% of all their triangles with Nanite - 70% as a worst-case estimate - so at best that 200% gain would apply to 30% of all triangles rendered (2).

The consoles also have an excess of CPU performance - especially the PS5, with an IO complex that DMAs to unified RAM from its ESRAM, barely touching the CPU compared to DirectStorage - and they are hUMA processors, so doing the CS part on an idle Zen 2 core - maybe even more efficiently - isn't beyond the realms of possibility based on the results we've seen so far at this very early stage on the PS5.

edit:
And in regards to RDNA cards, as others have been indicating in their comments, RDNA seems to offer more performance per TFLOP than the corresponding Nvidia parts, so even if it does cost some regular RDNA compute for the AMD cards, it was probably still a better use of their chip area, and more versatile, to just use RDNA compute with primitive shaders than to provide specific silicon for that feature.
1.
[Image: Turing mesh shader pipeline diagram]


Turing's next-generation geometry pipeline (NGGP) removes the classic vertex shader stage. LOL.
Notice the object culling.

Microsoft's https://devblogs.microsoft.com/directx/dev-preview-of-new-directx-12-features/

[Image: DirectX 12 Mesh Shader pipeline diagram]

NVIDIA's Task Shader was changed into MS DX12U's Amplification Shader.


What does an Amplification Shader do?

While the Mesh Shader is a fairly flexible tool, it does not allow for all tessellation scenarios and is not always the most efficient way to implement per-instance culling. For this we have the Amplification Shader. What it does is simple: dispatch threadgroups of Mesh Shaders. Each Mesh Shader has access to the data from the parent Amplification Shader and does not return anything. The Amplification Shader is optional, and also has access to groupshared memory, making it a powerful tool to allow the Mesh Shader to replace any current pipeline scenario.
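
In code terms, the Task/Amplification-to-Mesh relationship is roughly the following - a toy CPU-side sketch of the dispatch pattern, not actual HLSL, with made-up meshlet data:

```python
# Toy sketch of the Amplification (Task) -> Mesh shader dispatch pattern.
# Not HLSL -- just the control flow: the amplification stage culls per meshlet
# and only dispatches mesh-shader workgroups for what survives.

def amplification_stage(meshlets, is_visible):
    # Each surviving meshlet becomes one mesh-shader workgroup dispatch,
    # optionally carrying a small payload for the child groups to read.
    return [{"meshlet": m, "payload": {"lod": 0}} for m in meshlets if is_visible(m)]

def mesh_stage(dispatch):
    m = dispatch["meshlet"]
    # A real mesh shader would emit up to ~256 vertices/primitives per workgroup
    return f"emit {m['tri_count']} triangles for meshlet {m['id']}"

meshlets = [{"id": i, "tri_count": 64, "center_z": z} for i, z in enumerate((5.0, -2.0, 9.0))]
dispatches = amplification_stage(meshlets, is_visible=lambda m: m["center_z"] > 0)  # crude cull
print([mesh_stage(d) for d in dispatches])   # only meshlets 0 and 2 reach the mesh stage
```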



------------------------------

VEGA/NAVI v1's next-generation geometry pipeline (NGGP)

[Image: Vega/Navi NGGP pipeline diagram]




2. The resolution's pixel grid is the limit for rendered geometry, unless you consider overdraw a feature worth wasting work on.
 
Last edited:

PaintTinJr

Member
1.
[Image: Turing mesh shader pipeline diagram]


Turing's next-generation geometry pipeline (NGGP) removes the classic vertex shader stage. LOL.
Notice the object culling.

Microsoft's https://devblogs.microsoft.com/directx/dev-preview-of-new-directx-12-features/

[Image: DirectX 12 Mesh Shader pipeline diagram]

NVIDIA's Task Shader was changed into MS DX12U's Amplification Shader.


What does an Amplification Shader do?

While the Mesh Shader is a fairly flexible tool, it does not allow for all tessellation scenarios and is not always the most efficient way to implement per-instance culling. For this we have the Amplification Shader. What it does is simple: dispatch threadgroups of Mesh Shaders. Each Mesh Shader has access to the data from the parent Amplification Shader and does not return anything. The Amplification Shader is optional, and also has access to groupshared memory, making it a powerful tool to allow the Mesh Shader to replace any current pipeline scenario.



------------------------------

VEGA/NAVI v1's next-generation geometry pipeline (NGGP)

[Image: Vega/Navi NGGP pipeline diagram]




2. The resolution's pixel grid is the limit for rendered geometry, unless you consider overdraw a feature worth wasting work on.
That's more info of the same ilk. In no way does it address the questions or the wider point I made about the limited benefit in a world of SW rasterization on compute shaders, with the likes of Nanite doing the bulk of the work instead of mesh shaders.
 
1. XSX's Direct Storage Decompression function is done by separate hardware on the SoC, NOT by the GPU.

PC's Direct Storage Decompression function is done by GpGPU compute, hence PC GPU needs to reserve extra compute resources for this function.

2. Pascal CUDA cores can execute integer or floating-point workloads with a caveat, i.e. an integer workload will block the floating-point workload. Did you know Pascal would suffer a pipeline stall when integer operations were calculated?

3. Flawed comparison, since you didn't factor in the Raster Op (read-write units) bottleneck for GA102. GA102 shows its TFLOPS/TIOPS superiority over NAVI 21 during GpGPU compute (using TMUs as read-write units).


Try again.

That's great, but during gaming the TMUs are used for texture mapping, so the bottleneck is still present. Which is why, despite having nearly double the compute, the 3090 is barely 5-10% faster than the 6900 XT.
 
"Mesh Shaders, Primitive Shaders, it doesn't not matter as long as it generates small polygons".
Mark Evan Cerny - Road to Wining Another Generation, quote
 

rnlval

Member
That's great, but during gaming the TMUs are used for texture mapping, so the bottleneck is still present. Which is why, despite having nearly double the compute, the 3090 is barely 5-10% faster than the 6900 XT.

1. During raytracing, a texture cache path is used on both GPUs. The RTX 3090 beats the 6900 XT. Navi 21's CU TMUs are halted when the RT hardware is active.

[Image: Doom Eternal RT benchmark chart]


RTX 3090 = 113.3 fps

6900 XT = 70.1 fps

With Vulkan BVH RT, the RTX 3090 has a 61% advantage over the RX 6900 XT.

Both GPUs used heavy Async Compute, Variable Rate Shading (pixel raster path), BVH RT (compute path), Rapid Packed Math (when available), compute-path geometry culling, and other advanced GPU hardware features, e.g. Shader Intrinsic functions, i.e. the Vulkan API allows vendor-specific extensions.

Doom 2016's Vulkan path was the introduction of AMD's Shader Intrinsic functions for direct hardware access.

2. The RTX 3090's compute path can extract its TFLOPS potential

[Image: compute benchmark chart]


RTX 3080 Ti/3090 still has extra compute power for DirectML and DirectStorage GpGPU decompression.

For comparison

[Image: AIDA64 GPGPU benchmark results]


61% on top of the RX 6900 XT's 25 TFLOPS is about 40 TFLOPS, which is in the RTX 3090's TFLOPS range.

The RTX 3090's real-life 38.4 TFLOPS is about 54% higher than the RX 6900 XT's real-life 25 TFLOPS. Your argument that the RTX 3090 has double the TFLOPS of the RX 6900 XT is wrong.
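
For clarity, the percentages above are just ratios of the quoted figures:

```python
# The ratios behind the figures quoted above.
doom_3090, doom_6900 = 113.3, 70.1   # fps, Doom Eternal RT benchmark cited above
aida_3090, aida_6900 = 38.4, 25.0    # "real-life" TFLOPS figures cited above

print(f"Doom Eternal RT: {doom_3090 / doom_6900 - 1:.1%} advantage")   # ~61.6%
print(f"Measured TFLOPS: {aida_3090 / aida_6900 - 1:.1%} advantage")   # ~53.6%
# Either way it's well short of the ~2x that the paper-spec comparison implies.
```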

The USD price for RTX 3080 Ti and RX 6900 XT is similar which is okay for legacy raster workload, but not for the next-gen workload.

Btw, the GpGPU compute path doesn't use the texture filtering hardware; the texture filter hardware is only used when texturing. You missed a critical difference between a classic texturing workload and the compute/TMU read-write path.
 
Last edited:

rnlval

Member
That's more info of the same ilk. In no way does it address the questions or the wider point I made about the limited benefit in a world of SW rasterization on compute shaders, with the likes of Nanite doing the bulk of the work instead of mesh shaders.
Nanite is Epic's toy that runs on either AMD's Primitive Shader NGGP (via PS5) or the DX12U/Vulkan NGGP (NVIDIA's NGGP, which PC/XSX RDNA 2 copied), which includes Amplification (Task) and Mesh Shaders.

NGGP = Next-Generation Geometry Pipeline.

The pixel shader raster path has MSAA hardware. Hint: Doom 2016's Async Compute render path has MSAA disabled. Unlike Raster Ops, TMUs don't have MSAA hardware.
 
Last edited:

ToTTenTranz

Banned
2. The RTX 3090's compute path can extract its TFLOPS potential

[Image: compute benchmark chart]


RTX 3080 Ti/3090 still has extra compute power for DirectML and DirectStorage GpGPU decompression.

Using a synthetic benchmark to claim it "extracts their potential" is an odd way to put it. Potential is extracted in real loads.
A power-limited chip like the GA102 doesn't show a 60% compute advantage over Navi 21 in real loads because it can't. When doubling the FP32 units, Nvidia didn't double the caches and schedulers to take advantage of it, otherwise it would be burning a lot more power. It seems they simply promoted Ampere's INT32 ALUs to FP32, as that probably gives them a bit more flexibility at some die area cost.

In real compute loads that don't depend on driver optimization, the 3090 isn't that much faster than a 6900XT.

RT performance is a whole other issue, as Nvidia does use more dedicated units with more capability. Whether that will become increasingly important in the mid/long term is what we don't know.
 

rnlval

Member
Using a synthetic benchmark to claim it "extracts their potential" is an odd way to put it. Potential is extracted in real loads.
A power-limited chip like the GA102 doesn't show a 60% compute advantage over Navi 21 in real loads because it can't. When doubling the FP32 units, Nvidia didn't double the caches (1) and schedulers to take advantage of it, otherwise it would be burning a lot more power. It seems they simply promoted Ampere's INT32 ALUs to FP32, as that probably gives them a bit more flexibility at some die area cost.

In real compute loads that don't depend on driver optimization, the 3090 isn't that much faster than a 6900XT.(2)

RT performance is a whole other issue, as Nvidia does use more dedicated units with more capability. Whether that will become increasingly important in the mid/long term is what we don't know.
1. For RDNA 2 (NAVI 21), BVH RT traversal is processed on the shaders, hence higher register pressure when compared to GA102.

From https://images.nvidia.com/aem-dam/e...pere-GA102-GPU-Architecture-Whitepaper-V1.pdf

Compared to Turing, the GA10x SM’s combined L1 data cache and shared memory capacity is 33% larger. For graphics workloads, the cache partition capacity is doubled compared to Turing, from 32KB to 64KB.

Ray tracing denoising shaders are a good example of a workload that can benefit greatly from doubling FP32 throughput.


For graphics workloads and async compute, GA10x will allocate 64 KB L1 data / texture cache (increasing from 32 KB cache allocation on Turing), 48 KB Shared Memory, and 16 KB reserved for various graphics pipeline operations.

The full GA102 GPU contains 10752 KB of L1 cache (compared to 6912 KB in TU102). In addition to increasing the size of the L1, GA10x also features double the shared memory bandwidth compared to Turing (128 bytes/clock per SM versus 64 bytes/clock in Turing). Total L1 bandwidth for GeForce RTX 3080 is 219 GB/sec versus 116 GB/sec for GeForce RTX 2080 Super.

-----
2. The CompuBench benchmark is worse than the Doom Eternal RT game benchmark.

For CompuBench's Vertex Connection and Merge benchmark

RTX 3090 = 44.712 mPixels
RX 6900 XT = 31.444 mPixels

The RTX 3090 has a 42.2% advantage over the RX 6900 XT.

For CompuBench's Subsurface Scattering benchmark

RTX 3080 Ti = 23,762.213 mSample/s
RX 6900 XT = 16,749.949 mSample/s

The RTX 3080 Ti has a ~41.9% advantage over the RX 6900 XT.
 
Last edited:

Darius87

Member
1. For RDNA 2 (NAVI 21), BVH RT traversal is processed on the shaders, hence higher register pressure when compared to GA102.

From https://images.nvidia.com/aem-dam/e...pere-GA102-GPU-Architecture-Whitepaper-V1.pdf

Compared to Turing, the GA10x SM’s combined L1 data cache and shared memory capacity is 33% larger. For graphics workloads, the cache partition capacity is doubled compared to Turing, from 32KB to 64KB.

Ray tracing denoising shaders are a good example of a workload that can benefit greatly from doubling FP32 throughput.


For graphics workloads and async compute, GA10x will allocate 64 KB L1 data / texture cache (increasing from 32 KB cache allocation on Turing), 48 KB Shared Memory, and 16 KB reserved for various graphics pipeline operations.

The full GA102 GPU contains 10752 KB of L1 cache (compared to 6912 KB in TU102). In addition to increasing the size of the L1, GA10x also features double the shared memory bandwidth compared to Turing (128 bytes/clock per SM versus 64 bytes/clock in Turing). Total L1 bandwidth for GeForce RTX 3080 is 219 GB/sec versus 116 GB/sec for GeForce RTX 2080 Super.

-----
2. The CompuBench benchmark is worse than the Doom Eternal RT game benchmark.

For CompuBench's Vertex Connection and Merge benchmark

RTX 3090 = 44.712 mPixels
RX 6900 XT = 31.444 mPixels

The RTX 3090 has a 42.2% advantage over the RX 6900 XT.

For CompuBench's Subsurface Scattering benchmark

RTX 3080 Ti = 23,762.213 mSample/s
RX 6900 XT = 16,749.949 mSample/s

The RTX 3080 Ti has a ~41.9% advantage over the RX 6900 XT.
I don't know why you're beating a dead horse with Nvidia RT; everyone knows it's better. Also, RDNA 2's L1 is the same as Ampere's L1.
Also, what's the point of your benchmarks? Do you play benchmarks or games? If you look at game performance, the 6900 XT performs the same or a bit worse than the 3080 Ti while having 11 fewer TFLOPS, and the latter costs $200 more. To be fair you have to add cost into the equation when comparing a less vs a more powerful GPU.
 

PaintTinJr

Member
Nanite is Epic's toy that runs on either AMD's Primitive Shader NGGP (via PS5) or the DX12U/Vulkan NGGP (NVIDIA's NGGP, which PC/XSX RDNA 2 copied), which includes Amplification (Task) and Mesh Shaders.
You've given lots of info in other posts that isn't really relevant to the points, and then you make that incorrect claim, AFAIK - from watching hours and hours about UE5's Nanite and Lumen tech, and reading the Twitter posts of the former Ubisoft AC Unity lead rendering engineer (who did Nanite-type technology previously with HW acceleration and is now lead engineer on the middleware engine Unity), who also states Nanite is different because it is a SW compute shader algorithm (that gets a 3:1 gain over HW acceleration).

So I'll need a reference for that claim, unfortunately. AFAIK the Nanite triangle meshes aren't strictly triangle mesh "primitives", and as they run in a single shader call - to get the efficiency they do - how would that even work for 20M polys/frame on the PS5, as primitive shaders won't do that many in a single call, IIRC?
 
Last edited:
Using a synthetic benchmark to claim it "extracts their potential" is an odd way to put it. Potential is extracted in real loads.

That's why it's called "potential", baka. :messenger_fistbump:

Anyway, RDNA 2 still exceeded expectations. It matches and in many cases beats Ampere on rasterization. NO ONE believed that AMD would really deliver before launch. Even with RDNA 1 being good, people were still accepting that "RTG is doomed" and would eventually kind of give up and die.
But the RT side met the expectations that existed before launch, being that it would at most match Turing - and isn't that true? RDNA 2 is "Ampere with Turing's RT". Of course Ampere is faster at RT, it's a generation ahead, accelerating more things at the hardware level; of course AMD's hybrid RT will be slower, but it's a start, and AMD also delivered on the promise of bringing RT to all tiers. This is an "advantage to consumers that works as a disadvantage to AMD".

I wonder what the MCM future will mean for AMD RT.
Simply having more hardware available sure helps, but will RDNA3 already have improved RT?
 

rnlval

Member
You've given lots of info in other posts that isn't really relevant to the points, and then you make that incorrect claim, AFAIK - from watching hours and hours about UE5's Nanite and Lumen tech, and reading the Twitter posts of the former Ubisoft AC Unity lead rendering engineer (who did Nanite-type technology previously with HW acceleration and is now lead engineer on the middleware engine Unity), who also states Nanite is different because it is a SW compute shader algorithm (that gets a 3:1 gain over HW acceleration).

So I'll need a reference for that claim, unfortunately. AFAIK the Nanite triangle meshes aren't strictly triangle mesh "primitives", and as they run in a single shader call - to get the efficiency they do - how would that even work for 20M polys/frame on the PS5, as primitive shaders won't do that many in a single call, IIRC?

Nanite Exploits Primitive Shaders on PS5


The vast majority of triangles are software rasterised using hyper-optimised compute shaders specifically designed for the advantages we can exploit. As a result, we've been able to leave hardware rasterisers in the dust at this specific task. Software rasterisation is a core component of Nanite that allows it to achieve what it does. We can't beat hardware rasterisers in all cases though so we'll use hardware when we've determined it's the faster path. On PlayStation 5 we use primitive shaders for that path which is considerably faster than using the old pipeline we had before with vertex shaders. - Senior Graphics Programmer Brian Karis, Epic

From https://blog.siggraph.org/2021/04/mesh-shaders-release-the-intrinsic-power-of-a-gpu.html/

Task to Amplifier to mesh to meshlets

In 2017, to accommodate developers’ increasing appetite for migrating geometry work to compute shaders, AMD introduced a more programmable geometry pipeline stage in their Vega GPU that ran a new type of shader called a primitive shader. According to AMD corporate fellow Mike Mantor, primitive shaders have “the same access that a compute shader has to coordinate how you bring work into the shader.” Mantor said that primitive shaders would give developers access to all the data they need to effectively process geometry, as well.

Primitive shaders led to task shaders, and that led to mesh shaders.

Mesh shaders will expand the capabilities and performance of the geometry pipeline. Mesh shaders incorporate the features of Vertex and Geometry shaders into a single shader stage through batch processing of primitives and vertices data before the rasterizer. The shaders are also capable of amplifying and culling geometry.

....
Both mesh and task shaders follow the programming model of compute shaders, using cooperative thread groups to compute their results and having no inputs other than a workgroup index.
 
Last edited:

rnlval

Member
I don't know why you're beating a dead horse with Nvidia RT; everyone knows it's better (3). Also, RDNA 2's L1 is the same as Ampere's L1.
Also, what's the point of your benchmarks? Do you play benchmarks or games? (2) If you look at game performance, the 6900 XT performs the same or a bit worse than the 3080 Ti while having 11 fewer TFLOPS, and the latter costs $200 more (1). To be fair you have to add cost into the equation when comparing a less vs a more powerful GPU.

1. Not correct on pricing. From https://pcpartpicker.com/products/video-card/#c=505,498&sort=price&page=1
In USD

Cheapest RX 6900 XT = $1795.95 (via ASRock OC Formula)
Cheapest RTX 3080 Ti = $1891.99 (via EVGA XC3 ULTRA GAMING iCX3)

1891.99 - 1795.95 = $96.04

2. Doom Eternal RT is both a game and a benchmark. Running non-RT last-gen games doesn't reflect future workloads. One of the main reasons for AMD's Fine Wine is higher potential TFLOPS power when compared to the NVIDIA Kepler counterparts. NVIDIA stacked Ampere GPUs with extra-high TFLOPS compute power relative to their raster hardware improvements.

3. Not just RT, refer to Mesh Shader benchmark.
 
Last edited:

longdi

Banned
I don't understand all these recent leaks from Nvidia and AMD; we are more than a year away from anything real 🤷‍♀️
 

Darius87

Member
1. Not correct on pricing. From https://pcpartpicker.com/products/video-card/#c=505,498&sort=price&page=1
In USD

Cheapest RX 6900 XT = $1795.95 (via ASRock OC Formula)
Cheapest RTX 3080 Ti = $1891.99 (via EVGA XC3 ULTRA GAMING iCX3)

1891.99 - 1795.95 = $96.04

2. Doom Eternal RT is both a game and a benchmark. Running non-RT last-gen games doesn't reflect future workloads.
These are overblown prices because of chip shortages; the official prices from AMD are correct.

With RT on, Nvidia is better; in raster only, AMD is better and cheaper.
 

rnlval

Member
These are overblown prices because of chip shortages; the official prices from AMD are correct.

With RT on, Nvidia is better; in raster only, AMD is better and cheaper.
Official prices are useless.

With RT on, NVIDIA is better. Blender3D hardware RT is better on NVIDIA, not just games.
With the Mesh Shader benchmark, Nvidia is better.
With DirectML, Nvidia is better.


[Image: relative performance at 3840x2160]

Mostly running non-RT modes.
 
Last edited:

Darius87

Member
Official prices are useless.

With RT on, NVIDIA is better. Blender3D hardware RT is better on NVIDIA, not just games.
With the Mesh Shader benchmark, Nvidia is better.
With DirectML, Nvidia is better.


[Image: relative performance at 3840x2160]

Mostly running non-RT modes.
How are official prices useless? Do you think overclocked cards would still cost that much if AMD could meet demand?

I don't disagree with you, I'm just saying benchmarks are useless; you have to look at game performance.
If you want a fair comparison you should compare Nvidia's 1st-gen RT and ML with AMD's 1st-gen RT and ML, which would be Turing vs RDNA 2 (I don't know which is better); of course the company with more time should make the better product.
 
1. For RDNA 2 (NAVI 21), BVH RT traversal is processed on the shaders, hence higher register pressure when compared to GA102.

From https://images.nvidia.com/aem-dam/e...pere-GA102-GPU-Architecture-Whitepaper-V1.pdf

Compared to Turing, the GA10x SM’s combined L1 data cache and shared memory capacity is 33% larger. For graphics workloads, the cache partition capacity is doubled compared to Turing, from 32KB to 64KB.

Where the heck does this "Ampere is doing BVH traversal not via shaders" claim stem from? It's nowhere in the freaking whitepaper. It's a compute shader on Turing and RDNA 2. Reducing cache pressure does not mean it somehow magically works differently on Ampere.
 

rnlval

Member
I don't know why you're beating a dead horse with Nvidia RT; everyone knows it's better. Also, RDNA 2's L1 is the same as Ampere's L1.
Also, what's the point of your benchmarks? Do you play benchmarks or games? If you look at game performance, the 6900 XT performs the same or a bit worse than the 3080 Ti while having 11 fewer TFLOPS, and the latter costs $200 more. To be fair you have to add cost into the equation when comparing a less vs a more powerful GPU.

RTX 3080 Ti L1 cache has 128 KB per SM x 80 = 10,240 KB

RX 6900 XT L1 cache has 128 KB per DCU x 40 = 5,120 KB (1)

Reference:
1. https://www.amd.com/system/files/documents/rdna-whitepaper.pdf Page 17 of 25

"The graphics L1 cache is shared across a group of dual compute units"
 

Kenpachii

Member
I don't understand all these recent leaks from Nvidia and AMD; we are more than a year away from anything real 🤷‍♀️

The cards could be dropped early next year without issues and announced in a few months from now. Ampere and RDNA 2 have at best 8 months left, I think. The gen is going to be short.

That's why it's called "potential", baka. :messenger_fistbump:

Anyway, RDNA 2 still exceeded expectations. It matches and in many cases beats Ampere on rasterization. NO ONE believed that AMD would really deliver before launch. Even with RDNA 1 being good, people were still accepting that "RTG is doomed" and would eventually kind of give up and die.
But the RT side met the expectations that existed before launch, being that it would at most match Turing - and isn't that true? RDNA 2 is "Ampere with Turing's RT". Of course Ampere is faster at RT, it's a generation ahead, accelerating more things at the hardware level; of course AMD's hybrid RT will be slower, but it's a start, and AMD also delivered on the promise of bringing RT to all tiers. This is an "advantage to consumers that works as a disadvantage to AMD".

I wonder what the MCM future will mean for AMD RT.
Simply having more hardware available sure helps, but will RDNA3 already have improved RT?

MCM is the future. If AMD can drop a chip based on this concept in a working state, single-die GPUs are dead.

[image]


[image]


We're going to jump into a different dimension performance-wise. It's not a 2080 Ti to 3090 performance jump, it's going to be a 780 Ti to 1080 Ti performance jump.

Hell, this could also mean we see PRO consoles coming sooner rather than later.

2-3x isn't far-fetched if AMD goes all out; however, the questions are going to be how well it works, how ready they are, what watt budget they want to push, and how much the RDNA 3 architecture is improved over RDNA 2. All of that heavily depends on decisions AMD will be making, as they could simply also have other problems that reduce the performance output severely.

RT-wise it will improve, and probably be better than what Ampere puts on the screen because of pure performance, but I don't think it will get close to or beat Nvidia if they are able to slam 2x the performance forward as well and double their RT effort.

Anyway, RDNA 3 is a glimpse into the future, and it's incredibly exciting to see this thing in action.
 
Last edited:

rnlval

Member
Where the heck does this "Ampere is doing BVH traversal not via shaders" claim stem from? It's nowhere in the freaking whitepaper. It's a compute shader on Turing and RDNA 2. Reducing cache pressure does not mean it somehow magically works differently on Ampere.
Read https://www.techspot.com/article/2151-nvidia-ampere-vs-amd-rdna2/

[image]

RTX's BVH traversal workload is processed by a discrete hardware unit.


For RDNA 2

This part of the CU performs ray-box or ray-triangle intersection checks -- the same as the RT Cores in Ampere. However, the latter also accelerates BVH traversal algorithms, whereas in RDNA 2 this is done via compute shaders using the SIMD 32 units.


RTX Turing accelerates 3 of 3** RT functions.

RDNA 2 accelerates 2 of 3** RT functions.


**Core RT functions, i.e. triangle intersection testing, bounding box intersection testing, and BVH traversal.
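
The split described above maps onto the traversal loop roughly like this - a pseudo-implementation sketch, not either vendor's code, with the intersection tests stubbed out:

```python
# Sketch of a BVH traversal loop, annotated with which steps are fixed-function
# on Turing/Ampere vs shader-executed on RDNA 2 (per the breakdown above).
# The intersection tests are stubs -- this only illustrates the control flow.

def ray_box_hit(ray, box):        # box test: fixed-function on both vendors' RT hardware
    return True                   # stub

def ray_tri_hit(ray, tri):        # triangle test: fixed-function on both vendors' RT hardware
    return {"t": tri["t"]}        # stub: pretend every triangle is hit

def trace_ray(ray, root):
    stack, closest = [root], None
    while stack:                               # the traversal loop itself is what
        node = stack.pop()                     # RDNA 2 runs on the SIMD32 shaders
        if not ray_box_hit(ray, node["box"]):  # and Turing/Ampere's RT core handles in HW
            continue
        if "tris" in node:                     # leaf node
            for tri in node["tris"]:
                hit = ray_tri_hit(ray, tri)
                if hit and (closest is None or hit["t"] < closest["t"]):
                    closest = hit
        else:
            stack.extend(node["children"])
    return closest

leaf = {"box": None, "tris": [{"t": 4.0}, {"t": 2.5}]}
root = {"box": None, "children": [leaf]}
print(trace_ray(ray=None, root=root))   # closest hit: {'t': 2.5}
```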

Try again.
 

rnlval

Member
How are official prices useless? Do you think overclocked cards would still cost that much if AMD could meet demand?

I don't disagree with you, I'm just saying benchmarks are useless; you have to look at game performance.
If you want a fair comparison you should compare Nvidia's 1st-gen RT and ML with AMD's 1st-gen RT and ML, which would be Turing vs RDNA 2 (I don't know which is better); of course the company with more time should make the better product.
I have about $1850 from selling my MSI RTX 2080 Ti Gaming X Trio and ASUS RTX 2080 EVO OC; convince me to buy an RX 6900 XT over an RTX 3080 Ti.

PS: I would buy an RX 6800 XT over an RTX 2080 Ti or RTX 3070 Ti.
 
Last edited:

longdi

Banned
The cards could be dropped early next year without issues and announced in a few months from now. Ampere and RDNA 2 have at best 8 months left, I think. The gen is going to be short.

You mean early 2023, right? That is the date I am hearing. :messenger_smiling_with_eyes:

The era of RDNA 2 and Ampere doesn't even feel like it has started.
 

rnlval

Member
The cards could be dropped early next year without issues and announced in a few months from now. Ampere and RDNA 2 have at best 8 months left, I think. The gen is going to be short.



MCM is the future. If AMD can drop a chip based on this concept in a working state, single-die GPUs are dead.

[image]


[image]


We're going to jump into a different dimension performance-wise. It's not a 2080 Ti to 3090 performance jump, it's going to be a 780 Ti to 1080 Ti performance jump.

Hell, this could also mean we see PRO consoles coming sooner rather than later.

2-3x isn't far-fetched if AMD goes all out; however, the questions are going to be how well it works, how ready they are, what watt budget they want to push, and how much the RDNA 3 architecture is improved over RDNA 2. All of that heavily depends on decisions AMD will be making, as they could simply also have other problems that reduce the performance output severely.

RT-wise it will improve, and probably be better than what Ampere puts on the screen because of pure performance, but I don't think it will get close to or beat Nvidia if they are able to slam 2x the performance forward as well and double their RT effort.

Anyway, RDNA 3 is a glimpse into the future, and it's incredibly exciting to see this thing in action.
AMD Van Gogh APU (e.g. Steam Deck) has Deep Learning DSP hardware

AMD uses two DSPs from Cadence for their CV
- Vision Q6 DSP (2 cores)
- Vision C5 DSP (2 cores)

DSP = Digital Signal Processor
CV = Computer Vision.


Expect AMD to include Tensor DSP with future AMD GPUs and APUs.
 
Last edited:

Darius87

Member
I have about $1850 from selling my MSI RTX 2080 Ti Gaming X Trio and ASUS RTX 2080 EVO OC; convince me to buy an RX 6900 XT over an RTX 3080 Ti.

PS: I would buy an RX 6800 XT over an RTX 2080 Ti or RTX 3070 Ti.
After 7 months Nvidia released the 3080 Ti just to counter AMD's 6900 XT, so if you care about RT you should buy it.
 

PaintTinJr

Member

Nanite Exploits Primitive Shaders on PS5


The vast majority of triangles are software rasterised using hyper-optimised compute shaders specifically designed for the advantages we can exploit. As a result, we've been able to leave hardware rasterisers in the dust at this specific task. Software rasterisation is a core component of Nanite that allows it to achieve what it does. We can't beat hardware rasterisers in all cases though so we'll use hardware when we've determined it's the faster path. On PlayStation 5 we use primitive shaders for that path which is considerably faster than using the old pipeline we had before with vertex shaders. - Senior Graphics Programmer Brian Karis, Epic

From https://blog.siggraph.org/2021/04/mesh-shaders-release-the-intrinsic-power-of-a-gpu.html/

Task to Amplifier to mesh to meshlets

In 2017, to accommodate developers’ increasing appetite for migrating geometry work to compute shaders, AMD introduced a more programmable geometry pipeline stage in their Vega GPU that ran a new type of shader called a primitive shader. According to AMD corporate fellow Mike Mantor, primitive shaders have “the same access that a compute shader has to coordinate how you bring work into the shader.” Mantor said that primitive shaders would give developers access to all the data they need to effectively process geometry, as well.

Primitive shaders led to task shaders, and that led to mesh shaders.

Mesh shaders will expand the capabilities and performance of the geometry pipeline. Mesh shaders incorporate the features of Vertex and Geometry shaders into a single shader stage through batch processing of primitives and vertices data before the rasterizer. The shaders are also capable of amplifying and culling geometry.

....
Both mesh and task shaders follow the programming model of compute shaders, using cooperative thread groups to compute their results and having no inputs other than a workgroup index.
I'm confused, are you now moving the goalposts, or proving my points by agreeing with me? Your quote shows that 70-90% of scene geometry bypasses primitive and mesh shaders for Nanite - as I stated in my first post, defending Epic Sax CEO's post about PS5 primitive shaders.

"For that path" is the non nanite rendering using proxy meshes - that also needs UE4 lighting with hw RTX fallback lighting, rather than Lumen's SW signed distance field SW RT on compute shaders.

That has been my point from the beginning. At best 30% of a frame's work will be slightly better served by Nvidia HW at parity UE4 performance vs AMD, for that 30%. And vs the PS5 it may not even have that advantage, depending on whether the PS5's custom geometry engine + primitive shaders + cache scrubbers + fill rate (from high clocks, which helps SW Nanite + SW Lumen) yields a greater benefit for the PS5.

Processing the proxy meshes through primitive or mesh shaders also brings up another oddity. The proxy meshes will normally only be 2k-3k polygons - based on that being the default for UE5 - and with maybe only a third of those polys facing the camera anyway. So even when rendering these fallback or HW RT situations, the ratio of poly counts to instances doesn't consistently present a problem of the scale needed to get all the benefits of mesh shading over the conventional VS, GS, FS stages, never mind the smaller benefit of mesh shading over primitive shading. So IMO it is a lot of great technology, but it is maybe too late, because Nanite's limitations/missing features are likely to change, giving mesh shading less and less area where it is useful.
 
Last edited:
They said :messenger_grinning: You mean Alex said - we know what he's capable of saying against PS5. Now tell me which GPU supports aborting offscreen vertex processing before primitive assembly? I'll wait...

You must be joking, surely? The Xbox Series X's GPU has far better support for this scenario than the PS5 does, due to Mesh Shaders. Oh, and it can do it without the Input Assembler. The PS5 still needs the Input Assembler.
 
Oh, and it can do it without the Input Assembler. The PS5 still needs the Input Assembler.
Most traditional vertex-to-pixel shader pipelines process geometry whole and in order using a linear index buffer. Does the PS5 need an index buffer?
The PS2 had mesh shaders, if you didn't know, but then Nvidia introduced hardware T&L in the GeForce, and later came vertex shaders, geometry shaders, and hull and domain shaders. Hardware T&L was great at the time: CPUs were single-core with no vector extensions such as SSE. The vertex shader was more energy-efficient than the PS2's VU and could be used to emulate hardware T&L easily. It was an inevitable path, but the wrong one in the long run.
The greatest difference between the primitive shading (PS) in the PS5 and the mesh shading (MS) in the Xbox Series X is that PS has a dedicated path for the hardware processing of tessellation (on the primitive units, programmable). In MS, hardware tessellation is abolished; it works through compute shaders and is configured in the Amplification|Task Shader (DX12U|Nvidia naming). No index buffers for PS5 and XSX, btw.
 
Last edited:
1. During raytracing, a texture cache path is used on both GPUs. The RTX 3090 beats the 6900 XT. Navi 21's CU TMUs are halted when the RT hardware is active.

[Image: Doom Eternal RT benchmark chart]


RTX 3090 = 113.3 fps

6900 XT = 70.1 fps

With Vulkan BVH RT, the RTX 3090 has a 61% advantage over the RX 6900 XT.

Both GPUs used heavy Async Compute, Variable Rate Shading (pixel raster path), BVH RT (compute path), Rapid Packed Math (when available), compute-path geometry culling, and other advanced GPU hardware features, e.g. Shader Intrinsic functions, i.e. the Vulkan API allows vendor-specific extensions.

Doom 2016's Vulkan path was the introduction of AMD's Shader Intrinsic functions for direct hardware access.

2. The RTX 3090's compute path can extract its TFLOPS potential

[Image: compute benchmark chart]


RTX 3080 Ti/3090 still has extra compute power for DirectML and DirectStorage GpGPU decompression.

For comparison

[Image: AIDA64 GPGPU benchmark results]


61% on top of the RX 6900 XT's 25 TFLOPS is about 40 TFLOPS, which is in the RTX 3090's TFLOPS range.

The RTX 3090's real-life 38.4 TFLOPS is about 54% higher than the RX 6900 XT's real-life 25 TFLOPS. Your argument that the RTX 3090 has double the TFLOPS of the RX 6900 XT is wrong.

The USD price for RTX 3080 Ti and RX 6900 XT is similar which is okay for legacy raster workload, but not for the next-gen workload.

Btw, the GpGPU compute path doesn't use the texture filtering hardware; the texture filter hardware is only used when texturing. You missed a critical difference between a classic texturing workload and the compute/TMU read-write path.

This is a non-sequitur.
What does Ray Tracing performance have to do with Ampere's extra compute?
You mentioned TMUs can be leveraged to enhance Ampere's compute throughput in certain workloads. Those workloads aren't games, as games need the TMUs for texture processing.
What does AMD using part of the TMUs for BVH intersection testing have to do with this? If anything it just proves my argument - when TMUs try to fulfil a dual purpose it hampers performance.
What does AIDA64 have to do with this?

Ampere is a massively bottlenecked architecture that throws massive amounts of compute at the problem for relatively little gain in gaming performance.

Nanite Exploits Primitive Shaders on PS5


The vast majority of triangles are software rasterised using hyper-optimised compute shaders specifically designed for the advantages we can exploit. As a result, we've been able to leave hardware rasterisers in the dust at this specific task. Software rasterisation is a core component of Nanite that allows it to achieve what it does. We can't beat hardware rasterisers in all cases though so we'll use hardware when we've determined it's the faster path. On PlayStation 5 we use primitive shaders for that path which is considerably faster than using the old pipeline we had before with vertex shaders. - Senior Graphics Programmer Brian Karis, Epic

From https://blog.siggraph.org/2021/04/mesh-shaders-release-the-intrinsic-power-of-a-gpu.html/

Task to Amplifier to mesh to meshlets

In 2017, to accommodate developers’ increasing appetite for migrating geometry work to compute shaders, AMD introduced a more programmable geometry pipeline stage in their Vega GPU that ran a new type of shader called a primitive shader. According to AMD corporate fellow Mike Mantor, primitive shaders have “the same access that a compute shader has to coordinate how you bring work into the shader.” Mantor said that primitive shaders would give developers access to all the data they need to effectively process geometry, as well.

Primitive shaders led to task shaders, and that led to mesh shaders.

Mesh shaders will expand the capabilities and performance of the geometry pipeline. Mesh shaders incorporate the features of Vertex and Geometry shaders into a single shader stage through batch processing of primitives and vertices data before the rasterizer. The shaders are also capable of amplifying and culling geometry.

....
Both mesh and task shaders follow the programming model of compute shaders, using cooperative thread groups to compute their results and having no inputs other than a workgroup index.

Primitive shaders are a thing AMD made. If the PS5 is using them, you can bet your ass so are RDNA 2, RDNA 3, and whatever comes after.

RTX 3080 Ti L1 cache has 128 KB per SM x 80 = 10,240 KB

RX 6900 XT L1 cache has 128 KB per DCU x 40 = 5,120 KB (1)

Reference:
1. https://www.amd.com/system/files/documents/rdna-whitepaper.pdf Page 17 of 25

"The graphics L1 cache is shared across a group of dual compute units"

What does this have to do with anything?

Read https://www.techspot.com/article/2151-nvidia-ampere-vs-amd-rdna2/

[image]

RTX's BVH traversal workload is processed by a discrete hardware unit.


For RDNA 2

This part of the CU performs ray-box or ray-triangle intersection checks -- the same as the RT Cores in Ampere. However, the latter also accelerates BVH traversal algorithms, whereas in RDNA 2 this is done via compute shaders using the SIMD 32 units.


RTX Turing accelerates 3 of 3** RT functions.

RDNA 2 accelerates 2 of 3** RT functions.


**Core RT functions, i.e. triangle intersection testing, bounding box intersection testing, and BVH traversal.

Try again.

What does ray tracing have to do with anything, when this conversation started with Ampere's massive underutilisation of its FP32?
The 3080 has 2x the compute of the 2080 Ti. Is it twice as fast? No.
The 3090 has nearly 2x the compute of the 6900 XT. Is it twice as fast? No. Is it even 50% faster? No.
You can spam technical specs and documents and synthetic benchmarks all you like, but that doesn't change the fact that in the real world, with real games, Ampere is massively bottlenecked.
 

rnlval

Member
you have to take clocks into account also.
You claimed RDNA 2's L1 is the same as Ampere's L1. The RX 6900 XT doesn't have 2X the clock speed of the RTX 3080 Ti to make up the difference.

Furthermore, AIB OC edition beats RTX 3090 FE.

RX 6900 XT Red Devil AIB OC edition doesn't scale as well as MSI GeForce RTX 3080 Ti AIB OC.
 

rnlval

Member
This is a non-sequitur.
What does Ray Tracing performance have to do with Ampere's extra compute? (1)
You mentioned TMUs can be leveraged to enhance Ampere's compute throughput in certain workloads. Those workloads aren't games, as games need the TMUs for texture processing. (2)

What does AMD using part of the TMUs for BVH intersection testing have to do with this? If anything it just proves my argument - when TMUs try to fulfil a dual purpose it hampers performance. (3)
What does AIDA64 have to do with this? (4)

Ampere is a massively bottlenecked architecture that throws massive amounts of compute at the problem for relatively little gain in gaming performance. (5)



Primitive shaders are a thing AMD made. If the PS5 is using it, you can bet your ass so is RDNA2 and 3 and whatever comes after. (6)



What does this have to do with anything?



What does ray tracing have to do with anything, when this conversation started with Ampere's massive underutilisation of its FP32?
The 3080 has 2x the compute of the 2080 Ti. Is it twice as fast? No. (7)
The 3090 has nearly 2x the compute of the 6900 XT. Is it twice as fast? No. Is it even 50% faster? No.
You can spam technical specs and documents and synthetic benchmarks all you like, but that doesn't change the fact that in the real world, with real games, Ampere is massively bottlenecked.
1. The raytracing denoise pass is done via compute; read my post.

2. PC DirectStorage decompression is done via GpGPU compute. Try again.

3. My post answered the Ampere RT vs RDNA 2 RT question. Follow the thread.

4. Ampere's compute TFLOPS are real via a certain path.

5. PC DirectStorage decompression function is done via GpGPU path. DirectML is another GpGPU path.

6. Wrong. PC and XSX RDNA 2 were modified to follow the DirectX12U and Vulkan counterparts. I answered the question.

7. Unlike the RDNA 2 competition, RTX 2080 Ti has a separate TIOPS resource that hides extra performance from the typical TFLOPS debate.

[image]



The RTX 2080 Ti FE's real-life average clock speed is higher than the paper spec, i.e. 1824 MHz (~15.9 TFLOPS, not including the partly used TIOPS).
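
The arithmetic behind that ~15.9 TFLOPS figure, using the 2080 Ti's 4352 CUDA cores and the observed clock:

```python
# FP32 throughput of the RTX 2080 Ti at the observed 1824 MHz average clock.
cuda_cores, observed_ghz = 4352, 1.824
print(f"{cuda_cores * 2 * observed_ghz / 1000:.1f} TFLOPS FP32")   # ~15.9
# The 68 x 64 INT32 lanes add separate integer throughput (TIOPS) on top of this.
```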
 

MikeM

Member
Most traditional vertex-to-pixel shader pipelines process geometry whole and in order using a linear index buffer. Does the PS5 need an index buffer?
The PS2 had mesh shaders, if you didn't know, but then Nvidia introduced hardware T&L in the GeForce, and later came vertex shaders, geometry shaders, and hull and domain shaders. Hardware T&L was great at the time: CPUs were single-core with no vector extensions such as SSE. The vertex shader was more energy-efficient than the PS2's VU and could be used to emulate hardware T&L easily. It was an inevitable path, but the wrong one in the long run.
The greatest difference between the primitive shading (PS) in the PS5 and the mesh shading (MS) in the Xbox Series X is that PS has a dedicated path for the hardware processing of tessellation (on the primitive units, programmable). In MS, hardware tessellation is abolished; it works through compute shaders and is configured in the Amplification|Task Shader (DX12U|Nvidia naming). No index buffers for PS5 and XSX, btw.
So... which is better?
 

Darius87

Member
You must be joking, surely? The Xbox Series X's GPU has far better support for this scenario than the PS5 does, due to Mesh Shaders. Oh, and it can do it without the Input Assembler. The PS5 still needs the Input Assembler.
Mesh shaders are essentially the same thing as primitive shaders with fewer steps, and the GE is also fully programmable; you can do the same things as with the assembler.
 

Darius87

Member
You claimed RDNA 2's L1 is the same as Ampere's L1. The RX 6900 XT doesn't have 2X the clock speed of the RTX 3080 Ti to make up the difference.
The L1 is the same size on both cards at 128KB. Not every operation happens every cycle; that's why you have to take clocks into account. I know the cache bandwidth is still lower, but it doesn't matter that much because the 6900 XT has a massive L3, and because of how RT works on AMD cards.
 

twilo99

Member
Rumors say both are going to require over 400W of power at the top end? Crazy. Should be a huge performance jump based just on that.
 