
Nvidia Ampere teraflops and how you cannot compare them to Turing

psorcerer

Banned
TL;DR: 1 Ampere TF = 0.72 Turing TF, or 30TF (Ampere) = 21.6TF (Turing)

Reddit Q&A

To accomplish this goal, the Ampere SM includes new datapath designs for FP32 and INT32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores. As a result of this new design, each Ampere SM partition is capable of executing either 32 FP32 operations per clock, or 16 FP32 and 16 INT32 operations per clock. All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.

A reminder from the Turing whitepaper:
First, the Turing SM adds a new independent integer datapath that can execute instructions concurrently with the floating-point math datapath. In previous generations, executing these instructions would have blocked floating-point instructions from issuing.

So, a Turing SM can execute 64 INT32 + 64 FP32 ops per clock.
An Ampere SM can execute either 64 INT32 + 64 FP32 or 128 FP32 ops per clock.

Which means that if a game executes 0 (zero) INT32 instructions, then Ampere = 2x Turing.
And if a game executes a 50/50 mix of INT32 and FP32, then Ampere = Turing exactly.

So how many INT32 are there on average?
According to Nvidia:

we typically see about 36 additional integer pipe instructions for every 100 floating point instructions

Some math: 36 / (100 + 36) ≈ 26%, i.e. in an average game's instruction stream about 26% of instructions are INT32.

So we can now calculate what happens to both Ampere and Turing on an instruction stream of 26% INT32 + 74% FP32.
I have written a small program to do that. But you can compute an analytical upper bound easily: Turing's time is set by its FP32 pipe, which handles the 74% FP32 share through only half of the SM's issue slots, while Ampere can fill all of its slots, so 74% / 50% = 1.48, or +48%.
My program shows a slightly smaller number, +44% (that's because of edge cases where the last INT32 ops in a batch cannot be distributed evenly, as only one pipeline per block of 16 cores can issue INT32).
So the theoretical absolute max is +48%; in practice the achievable max is +44%.
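Not the actual program I used, but a minimal sketch of the slot model behind that bound (assuming only the per-clock issue rates quoted above):

```python
def ampere_speedup(int_frac):
    """Speedup of an Ampere SM over a Turing SM on a stream where
    int_frac of all instructions are INT32 (simplified slot model)."""
    fp_frac = 1.0 - int_frac
    # Turing: 64 FP32 slots + 64 INT32 slots per clock; the busier pipe wins
    turing_cycles = max(fp_frac, int_frac) / 64
    # Ampere: 128 slots per clock in total, but at most 64 of them INT32
    ampere_cycles = max((fp_frac + int_frac) / 128, int_frac / 64)
    return turing_cycles / ampere_cycles

print(ampere_speedup(0.00))  # 2.00 -> pure FP32: Ampere = 2x Turing
print(ampere_speedup(0.26))  # 1.48 -> the analytical upper bound
print(ampere_speedup(0.50))  # 1.00 -> 50/50 mix: Ampere = Turing
```

The +44% comes from layering the 16-wide issue granularity on top of this; the slot model above deliberately ignores it.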

Thus every 2TF of Ampere are worth only 1.44TF of Turing performance (i.e. 1 Ampere TF = 0.72 Turing TF).

Let's check the actual data Nvidia gave us:
3080 = 30TF (Ampere) = 21.6TF (Turing) = 2.14x 2080 (10.07TF Turing)
Nvidia is even more conservative than that and gives us: 3080 = 2x 2080
3070 = 20.4TF (Ampere) = 14.7TF (Turing) = 1.86x 2070 (7.88TF Turing)
Nvidia is massively more conservative here, giving us: 3070 = 1.6x 2070
Actually, if we average the two max numbers that Nvidia gives us (they explicitly say "up to"), we get an even lower effective ratio of 1 Ampere TF = 0.65 Turing TF.
Which suggests that maybe these new FP32/INT32 mixed pipelines cannot execute FP32 at full speed (or cannot execute all the instructions).
We do know that Turing had reduced register file access on the INT32 pipe (64 vs 256 for FP32); if it's the same here (and everything suggests that Ampere is just a Turing facelift), then obviously not all FP32 instruction sequences can run on these mixed pipelines.
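For what it's worth, that 0.65 can be reproduced from Nvidia's two "up to" claims; a back-of-the-envelope sketch:

```python
# (ampere_tf, turing_baseline_tf, nvidia_claimed_speedup)
claims = [
    (30.0, 10.07, 2.0),  # "3080 = up to 2x 2080"
    (20.4,  7.88, 1.6),  # "3070 = up to 1.6x 2070"
]
# Turing TF per Ampere TF implied by each claim, then averaged
ratios = [speedup * base / amp for amp, base, speedup in claims]
print(sum(ratios) / len(ratios))  # ~0.645, i.e. 1 Ampere TF ~ 0.65 Turing TF
```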

Anyway, a TF table:

Card            | Ampere TF                | Turing TF (me) | Turing TF (NV)
3080 (Ampere)   | 30                       | 21.6           | 19.5
3070 (Ampere)   | 20.4                     | 14.7           | 13.3
2080Ti (Turing) | 18.75 (me) or 20.7 (NV)  | 13.5           | 13.5
2080 (Turing)   | 14 (me) or 15.5 (NV)     | 10.1           | 10.1
2070 (Turing)   | 10.4 (me) or 11.5 (NV)   | 7.5            | 7.5
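(A quick sketch to rebuild the conversions from the two ratios above; rounding may differ by 0.1 from the table:)

```python
MINE, NV = 0.72, 0.65  # Turing TF per Ampere TF: my ratio vs NV's implied one

for card, tf in [("3080", 30.0), ("3070", 20.4)]:                   # Ampere cards
    print(f"{card}: {tf * MINE:.1f} (me) / {tf * NV:.1f} (NV) Turing TF")
for card, tf in [("2080Ti", 13.5), ("2080", 10.1), ("2070", 7.5)]:  # Turing cards
    print(f"{card}: {tf / MINE:.2f} (me) / {tf / NV:.1f} (NV) Ampere TF")
```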

Bonus round: RDNA1 TF
RDNA1 has no separate INT32 pipeline; all INT32 instructions are handled in the main stream. That makes it essentially the same as Ampere, except there is no skew on the last instructions, so the full +48% theoretical max applies here (about +2.3% over Ampere).

Card           | RDNA1 TF | Turing TF (me) | Turing TF (NV)
5700XT (RDNA1) | 10.0     | 7.2            | ?

Amusingly enough, the 5700XT's actual performance is pretty similar to the 2070's, and these adjusted TF numbers show exactly that (10TF vs 10-11TF).

Update: why Ampere is just a Turing facelift.
 
 
Yes I was gonna post about this but the tugging off merry-go-round (circle jerk) was so overwhelming the other day I let it slide.

These '30 tflops' are Nvidia BS marketing. The 1.9x perf-per-watt improvement was marketing too. Yeah, only to reach 60fps at certain settings in a certain title, comparing certain cards!
 

SF Kosmo

Al Jazeera Special Reporter
This is all kind of academic. I think what Digital Foundry showed, with 3080 giving a roughly 75% uplift over 2080Ti and a 90%+ uplift on RT intensive stuff is a good indicator of what we're getting. That's real world stuff that is still at least somewhat CPU bound.
 

diffusionx

Gold Member
Now translate that to AMD FLOPS and you see how all the PCMR flexing is meaningless.

I honestly don't give a single shit about teraflops; that seems like console posturing, and it became that because both console platforms were architecturally similar and could be directly compared.... Honestly, the only thing I care about is benchmarks. I think most who game on PC feel the same way.
 
So basically, in a nutshell, the Xbox Series X is still more powerful than a 3070, considering the X's closed architecture. Perhaps the consoles aren't so "weak" after all.
LoL nah, well maybe by the end of the gen, when the 3070 is long forgotten and not supported anymore.
 

SF Kosmo

Al Jazeera Special Reporter
Are you sure it's 2080Ti and not the vanilla 2080?
Ah, correct.

But I find it interesting that even nVidia seems to correct for this. The chart at 30:50 in their presentation shows the 3070 (20 TFlops) as only very narrowly ahead of the 2080Ti (13 TFlops?). And they claim the 3080 to be about 2x the vanilla 2080.
 

Entroyp

Member
Interesting to see efficiency per teraflop going backwards... this might mean Big Navi won't suck that much (at least on shader performance).
 

psorcerer

Banned
But I find it interesting that even nVidia seems to correct for this. The chart at 30:50 in their presentation shows the 3070 (20 TFlops) as only very narrowly ahead of the 2080Ti (13 TFlops?). And they claim the 3080 to be about 2x the vanilla 2080.

Yup, that's why it's pretty fishy.
Anyway, these 2x CUDA cores and 2x TF for Ampere are inflated as hell compared to Turing.
 
Comparing TFLOPS across architectures has always been tricky business.

But you're trying to argue that Ampere is less powerful than it seems while I would argue that it's actually MORE powerful.

It's really getting to the point where trying to use TFLOPS as a comparison is nearly useless, especially vs Nvidia, because no one else has any kind of answer to DLSS.

How do you compare the relative performance of ANY other GPU to a GPU that can take a 1440p or even 1080p input and pump out a 4K output that is nearly indistinguishable (and in some cases superior) to native 4K? How do you do that? You could argue that it's cheating, but if your eyes can't tell the difference and you're getting nearly double the framerate, then what does it matter?

Taking DLSS 2.0 (and 2.1) into account, I could argue that Ampere can perform more like another GPU that has 50 TFLOPS, but no DLSS.
 

SF Kosmo

Al Jazeera Special Reporter
So basically, in a nutshell, the Xbox Series X is still more powerful than a 3070, considering the X's closed architecture. Perhaps the consoles aren't so "weak" after all.
No, probably not. They're in the same ballpark as far as shader compute, but the 3070 has better RT performance and the tensor cores for AI stuff.

Consoles do generally have the advantage of being able to optimize for a specific target rather than make compromises to scale, but MS seems to be fucking that part up.
 
This is all kind of academic. I think what Digital Foundry showed, with 3080 giving a roughly 75% uplift over 2080Ti and a 90%+ uplift on RT intensive stuff is a good indicator of what we're getting. That's real world stuff that is still at least somewhat CPU bound.

See, this 75% figure is pulled out of a green arse. It's wrong; please stop repeating massively inflated numbers.

EDIT: So you got the cards mixed up, 75% over 2080 sounds more realistic.
 

psorcerer

Banned
But you're trying to argue that Ampere is less powerful than it seems while I would argue that it's actually MORE powerful.

Not really. I'm actually saying two things:
1. The 3080 is a 2080Ti in disguise. Same arch (sans small improvements) on a smaller node, and thus faster clocks.
2. Current NV "marketing TF" are not directly comparable to Turing.

How do you compare the relative performance of ANY other GPU to a GPU that can take a 1440p or even 1080p input and pump out a 4K output that is nearly indistinguishable (and in some cases superior) to native 4K?

I think everybody (including Intel) will have DL upscaling a year from now.
 

nochance

Banned
Not quite. It is quite easy to calculate the performance by looking at the number of processing units. The way it handles instructions is a benefit on top of sheer processing power.
 
Not really. I'm actually saying two things:
1. The 3080 is a 2080Ti in disguise. Same arch (sans small improvements) on a smaller node, and thus faster clocks.
2. Current NV "marketing TF" are not directly comparable to Turing.

I think everybody (including Intel) will have DL upscaling a year from now.

You're really trying to argue that a 3080 is a 2080ti in disguise?

Good luck with that considering ....



You can see right here that a 3080 is between 70 and 100% more performant than a 2080ti.

You think consoles are going to have an equivalent answer to DLSS 2.0 in a year? I doubt they ever will. They can come up with all the checkerboarding methods they want, but they will never be equivalent to DLSS 2.0+
 

nochance

Banned
Not really. I'm actually saying two things:
1. The 3080 is a 2080Ti in disguise. Same arch (sans small improvements) on a smaller node, and thus faster clocks.
2. Current NV "marketing TF" are not directly comparable to Turing.

I think everybody (including Intel) will have DL upscaling a year from now.
This is factually false. The 3080 has 8,704 cores vs 4,352 on the 2080 Ti.
 

diffusionx

Gold Member
Not really. I'm actually saying two things:
1. The 3080 is a 2080Ti in disguise. Same arch (sans small improvements) on a smaller node, and thus faster clocks.

You can read all about the architecture here:


You are accusing Nvidia of lying and claiming that the 3080 is something it is not. That is a pretty heavy accusation.
 
So is the 3080 twice as fast as the 2080 or not? I mean this is bordering on false marketing at this point.

No, not even close, at least once we see averages from more than a few games.

You're really trying to argue that a 3080 is a 2080ti in disguise?

Good luck with that considering ....



You can see right here that a 3080 is between 70 and 100% more performant than a 2080ti.


What in the wo.... You seriously think that in games where a 2080 Ti gets around 100fps, the 3080 will clock around 200fps? Or are you talking about RT performance? That's astounding.
 

psorcerer

Banned
You are accusing Nvidia of lying and claiming that the 3080 is something it is not. That is a pretty heavy accusation.

It's not lying, it's exaggerating.
Purely theoretically, Ampere has 2x the FP32 cores, but in reality half of them are shared with INT32 (which is what NV says in their Reddit Q&A).
 
Numbers please.
According to my numbers in the OP: 3080 = 2080Ti + 60% (theoretical max).

You can just watch the video. It's actual gameplay with an FPS counter.

If you were right (you're not), then the 3080's framerate while playing DOOM would be the same as the 2080 Ti's framerate. But it isn't.

So now are you going to admit you were wrong, or move the goalposts?
 

psorcerer

Banned
CUDA cores are a known and defined element.

Yup.
In Turing, a CUDA core has 1x FP32 + 1x INT32 ALU = 2 ALUs, but only one of them is FP-capable.
In Ampere, a CUDA core has 1x FP32 + 1x INT32/FP32 ALU = 2 ALUs, and both are FP-capable.
NV counts the Turing pair as 1 core and the Ampere pair as 2 cores, although the die size of the two is roughly the same.
Purely semantically inflating the number of cores that were already present in Turing!
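To see how that counting plays out in the headline numbers, here is a sketch using the published core counts and reference boost clocks (an FMA counts as 2 ops per clock; both cards happen to have 68 SMs):

```python
# "Marketing TF" = cores x 2 ops/clock (FMA) x boost clock (GHz) / 1000
cards = {
    "3080 (Ampere)":   (8704, 1.710),  # 68 SMs x 128: both ALUs of each pair counted
    "2080Ti (Turing)": (4352, 1.545),  # 68 SMs x 64: only the FP32 ALU counted
}
for name, (cores, ghz) in cards.items():
    print(f"{name}: {cores * 2 * ghz / 1000:.1f} TF")
# 3080: ~29.8 TF, 2080Ti: ~13.4 TF -- the same ALU-pair layout, counted differently
```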
 