
Microsoft Xbox Series X's AMD Architecture Deep Dive at Hot Chips 2020

FritzJ92

Member
The narrative that PS5 has strengths in those areas becomes more and more eroded as we gain more information about the Series X. PS5 has high raw SSD speed and I/O bandwidth. Series X answers this with roughly 40% of the raw bandwidth plus a claimed ~60% bandwidth and memory saving from SFS.
PS5 has high-end custom audio tech capable of powerful audio. Series X answers this with high-end custom audio tech that appears to be just as capable, or a negligible amount less. You only see word soup in here because there are characters who enjoy trying to downplay any of the Series X's tech because it doesn't fit the narrative that Sony created for the PS5.
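
A rough back-of-envelope on those figures (illustration only: both vendors' published raw/compressed numbers, and Microsoft's claimed 2x-3x SFS multiplier treated as applying to texture streaming; no equivalent credit is given to the PS5 side here):

xsx_raw, ps5_raw = 2.4, 5.5            # GB/s, raw SSD reads
xsx_comp, ps5_comp = 4.8, 9.0          # GB/s, typical compressed throughput
sfs_multiplier = 2.5                   # midpoint of the claimed 2x-3x SFS saving

print(f"raw ratio: {xsx_raw / ps5_raw:.0%}")                                    # ~44%
print(f"XSX effective for texture streaming: {xsx_comp * sfs_multiplier:.1f} GB/s")
print(f"PS5 compressed (no SFS-style credit applied): {ps5_comp:.1f} GB/s")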

It becomes clearer as we gain more details about the Series X that the reason Sony leaned so heavily into the 'strengths' of their I/O bandwidth and audio customizations in Road to PS5 is that their system is not as impressive as the Series X in almost any other area. Then you learn more about the Series X and those 'strengths' seem to vanish altogether.

I've been reading this from the start and that seems to be the case. Sony's advantages aren't really that huge in comparison, and Xbox has a lot of minor advantages as well... Some people think the SSD is important, while others think the GPU is.
 

Elog

Member
You cannot decouple the software from the hardware! Not even just that, but you are misinformed if you think XvA is only a software solution.

I have never said it is 'only' a software based solution and never will. The point is that it contains clear process steps that build CPU overhead.

You are "willing to bet" on a belief and not much more, because you are not actually factoring how XvA works on the software and hardware front.

Once again - the point is that PS5 has a hardware path without CPU overhead while XSX does not

A hardware path without CPU overhead always beats a path with CPU overhead on latency (and throughput), everything else equal. And in this case 'everything else' is not even equal; it is to the benefit of the PS5.

The reason for me writing 'willing to bet' is based on the above, together with the acknowledgement that none of us has actual data from these machines in hand, i.e. we are basing our writing on available information.

MS wants to make aspects of XvA scalable and future-proofed for ever-advancing hardware configurations to come, and the biggest benefit of putting certain functions in software rather than explicit fixed-function hardware is that the former can more readily and easily be updated and improved as time goes on. The latter? It requires a new hardware design, which costs more in terms of end cost to users who want to take advantage of it, and is also harder to update in the field because it is not a software solution and does not benefit from the distribution methods software enjoys.

You are absolutely right. They have chosen a flexible solution (i.e. CPU overhead) that can be updated over time but lost throughput and latency in the process. That is the tradeoff they have made - I respect them for that. You are trying to argue there was no cost to that decision, though, which I find odd.

You say we should base this on things that we know, but I have actually read through the various MS patents relating to their flash storage technologies (such as FlashMap) and looked into conversations engineers from the team have had on Twitter, and they contradict your assumptions quite absolutely.

I am very curious to see the technical text stating that a hardware-only path without CPU overhead has higher latency and lower throughput, everything else equal. You won't find that ;)

The truth is this narrative is completely idiotic; you can't get hardware to work without software (APIs, algorithms, OSes, kernels, etc.). Almost every company in the tech field dabbles in both hardware and software out of necessity. It's a pretty generalized idea and sells short the work both players put in.

CPU overhead is a big thing so I am not sure why you argue about that. Honestly, the entire field of RISC processors for example is about this, i.e. 'how can I lower the loss of speed in the CPU and still maintain some level of flexibility'.
 

PaintTinJr

Member
......

Okay, I think I see where you are coming from. About the latency thing though, what specific type of data are you referring to? If it's texture data, then apparently, if we go by the words of the DiRT 5 developer, it can't be anything severe. According to them they can take texture data, fetch it, use it, discard it and replace it mid-frame. Granted, that is a cross-gen game, but it's one of the few examples we've seen of any next-gen games with gameplay, and it's one of the more impressive ones IMHO. So I'm just curious what specific type of data you are referring to here.

WRT zlib being difficult/risky for an ASIC, I'm curious about that as well. PS5 is also able to use zlib, though it has dedicated decompression for Kraken. In this scenario, if PS5 games were to also use zlib, would they not also have to dedicate some CU resources for what you describe? And what amount of CU resources would have to be utilized for this as you are describing it?

EDIT: I did a little looking and the Eurogamer Series X article states that the decompression block is what is running the zlib decompression algorithm.



I think this here more or less supports the conclusion that the decompression block handles zlib. Granted, it doesn't say anything about offloading it from the GPU, but then again, neither did Sony when describing their decompression hardware. And realistically, it doesn't make much sense that zlib would be too risky for a dedicated ASIC while Kraken is fine; they are both compression/decompression algorithms at the end of the day. They function differently in ways, sure, but still...
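
For anyone wondering what that work looks like without a dedicated block, here is a minimal sketch (Python's standard zlib module; the asset path is hypothetical) of the CPU-side streaming DEFLATE decompression the hardware decompressor is meant to take off the cores:

import zlib

def stream_decompress(path, chunk_size=256 * 1024):
    # Every byte here costs CPU cycles - exactly the work the consoles'
    # dedicated decompression blocks are there to absorb.
    d = zlib.decompressobj()
    out = bytearray()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            out += d.decompress(chunk)
        out += d.flush()
    return bytes(out)

data = stream_decompress("assets/texture_pack.z")   # hypothetical asset file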

Yes, you're right about Tempest, it's a repurposed CU (singular; AMD groups them as dual compute units, so they come in pairs; Sony took one core of one of those pairs and repurposed it for the Tempest Engine to simulate an SPU, basically). You're also right that Series X's audio only specifies SPFP, not DPFP, but at the same time it isn't taking a GPU compute core and repurposing it, so that figures to be the expected outcome here. FWIW, though, Series X is perfectly capable of HRTF; in fact the One X also supported it. HRTF is actually new for Sony systems via PS5, but it's been a feature on MS systems since at least the One X.

Hopefully what you are describing in the end there isn't a prelude to completely shutting down the idea Tempest could possibly be used for some non-audio tasks; from the sounds of it that could be the case, but I guess if devs wanted to get really specific with the chip they could push it in that kind of direction. But the cost may not be worth it. I don't think using the audio chip in such a way (and FWIW, something roughly similar could theoretically be done with Series X's audio processor, generally speaking) would be any kind of "secret sauce" whatsoever but it would make for cool examples of tech in systems being used in creative ways. I still can't think of any game that used the sound processor for graphics/logic purposes outside of Shining Force III on the SEGA Saturn, and that was decades ago.

I get where you are going wrong there, but Sony haven't committed to an ASIC for zlib or derivatives (Kraken, Oodle, etc.).

The IO complex is using two programmable co-processors and specialist super-low-latency SRAM to accommodate compression and decompression algorithm acceleration in whatever solution it evolves into over the generation. So even if they used zlib, they'd just update the IO complex code, and as the IO complex data is moved directly into the unified RAM it wouldn't need any CUs for that type of work. To put it in context, Cerny stated that the IO complex decompressor is supposed to be equal to 9 of the PS5's Zen 2 cores attempting the same work - which is also why I said the XsX might be using AVX2 for its CPU decompression, as clearly the Zen 2 cores are suited to the work.

As for your Goosen quote, that feels like an end to the idea of AVX2 decompression on XsX, and possibly even the CUs for decompression. It is just a shame that they don't seem to want to reveal the details of the decompression block in the way Sony have - maybe that's what the ARM processors you mentioned are being used for.

In regards to SFS, I suspect DiRT is able to replace data mid-frame because it is (probably) only streaming from the 10GB GPU pool, with pre-staged assets ready to use for that race. What I was referring to was streaming in from the SSD like an open-world game does. It would defeat the purpose of SFS saving on bandwidth and memory use to stage the assets in GPU RAM in the traditional way, but you can't derive and return sampler feedback data without lower-order placeholder data loaded and rendered first. And because of the round-trip delay of the sampler feedback info, some of those early transition frames will need to be rendered as is AFAIK, which would be the data latency I was talking about.
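
A conceptual sketch of the round trip being described (just the shape of the loop and its delay, not the actual D3D12 Sampler Feedback API; desired_mip and issue_ssd_load are hypothetical stand-ins):

FEEDBACK_DELAY = 2          # frames before feedback written in frame N is readable

resident_mip = {}           # texture -> finest mip currently resident (lower = finer)
pending = []                # (frame when feedback becomes readable, {texture: wanted mip})

def desired_mip(tex):
    return 0                # stand-in for what sampler feedback would report

def issue_ssd_load(tex, mip):
    resident_mip[tex] = mip # stand-in for an async SSD read + decompress

def render_frame(frame, visible_textures):
    # Early frames render with the lower-order placeholder mips already resident.
    pending.append((frame + FEEDBACK_DELAY, {t: desired_mip(t) for t in visible_textures}))
    # Only feedback that has completed its round trip can drive SSD loads, so the
    # first few transition frames are rendered "as is" - the latency in question.
    while pending and pending[0][0] <= frame:
        _, wanted = pending.pop(0)
        for tex, mip in wanted.items():
            if resident_mip.get(tex, 99) > mip:
                issue_ssd_load(tex, mip)

for frame in range(4):
    render_frame(frame, ["cliff_albedo", "road_normal"])   # mips only upgrade from frame 2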
 

D.Final

Banned
I get where you are going wrong there, but Sony haven't committed to an ASIC for zlib or derivatives (Kraken, Oodle, etc.).

The IO complex is using two programmable co-processors and specialist super-low-latency SRAM to accommodate compression and decompression algorithm acceleration in whatever solution it evolves into over the generation. So even if they used zlib, they'd just update the IO complex code, and as the IO complex data is moved directly into the unified RAM it wouldn't need any CUs for that type of work. To put it in context, Cerny stated that the IO complex decompressor is supposed to be equal to 9 of the PS5's Zen 2 cores attempting the same work - which is also why I said the XsX might be using AVX2 for its CPU decompression, as clearly the Zen 2 cores are suited to the work.

As for your Goosen quote, that feels like an end to the idea of AVX2 decompression on XsX, and possibly even the CUs for decompression. It is just a shame that they don't seem to want to reveal the details of the decompression block in the way Sony have - maybe that's what the ARM processors you mentioned are being used for.

In regards to SFS, I suspect DiRT is able to replace data mid-frame because it is (probably) only streaming from the 10GB GPU pool, with pre-staged assets ready to use for that race. What I was referring to was streaming in from the SSD like an open-world game does. It would defeat the purpose of SFS saving on bandwidth and memory use to stage the assets in GPU RAM in the traditional way, but you can't derive and return sampler feedback data without lower-order placeholder data loaded and rendered first. And because of the round-trip delay of the sampler feedback info, some of those early transition frames will need to be rendered as is AFAIK, which would be the data latency I was talking about.
I agree
 
I have never said it is 'only' a software based solution and never will. The point is that it contains clear process steps that build CPU overhead.

You're grossly overstating this overhead. You keep bringing it up as if it is taxing on the system, when all evidence points to the opposite.

Also, if we want to get technical, PS5's solution does not 100% absolve the CPU from doing anything; the CPU still needs to instruct the processor core of the I/O block on what data to fetch, move, etc., let alone initialize it for work. That still requires the CPU's input. It's just that all of the grunt work from there on is handled by the I/O block. On the Series systems somewhat more after that point is handled by the CPU, but only about 1/10th of a CPU core (on the OS core), which is nowhere near the CPU overhead you seem to think it is.
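
To make that division of labour concrete, a hedged sketch (not either console's actual interface; the descriptor fields are illustrative) of why the per-request CPU cost stays tiny - the CPU only fills a small descriptor, and all the per-byte work happens in the dedicated block:

from dataclasses import dataclass
from queue import Queue

@dataclass
class IoRequest:
    ssd_offset: int     # where the compressed data lives on the SSD
    length: int         # compressed bytes to read
    dest_address: int   # where in unified RAM the decompressed output should land

commands = Queue()

def cpu_submit(offset, length, dest):
    # Roughly the extent of the CPU's per-request involvement on either console:
    # build a small descriptor and hand it to the I/O hardware.
    commands.put(IoRequest(offset, length, dest))

def io_block_service():
    # Stand-in for the dedicated hardware: the read, decompress and DMA into RAM
    # all happen here, with no further per-byte work on the CPU cores.
    while not commands.empty():
        req = commands.get()
        # ... read req.length bytes at req.ssd_offset, decompress, write to req.dest_address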

Once again - the point is that PS5 has a hardware path without CPU overhead while XSX does not

On an absolute technical level I've already disproven this with the response above.

A hardware path without CPU overhead always beats a path with CPU overhead on latency (and throughput), everything else equal. And in this case 'everything else' is not even equal; it is to the benefit of the PS5.

Again, has been disproven. There is no 100% path "without" CPU overhead because the CPU still needs to initialize and coordinate dedicated hardware components to do their job, similar to how it is responsible for issuing instructions to the GPU (traditionally).

The reason for me writing 'willing to bet' is based on the above, together with the acknowledgement that none of us has actual data from these machines in hand, i.e. we are basing our writing on available information.

This is a faith-based argument, though. And at least from what I can tell you haven't looked at all available information when it pertains to MS. Also at the end of the day you aren't really disagreeing with anything I'm saying because my argument has never been that XvA suddenly "closes the gap" or surpasses PS5's solution, but you seem to think that is what I've been arguing.

My entire point is that XvA's optimizations greatly narrow the SSD I/O performance gap between Series X and PS5, and is implemented in a way where direct comparisons aren't even particularly valid a good deal of the time. It is also more hardware-agnostic which of course comes with its own drawbacks (tho not nearly in the way you seem to be postulating), but has the benefit of being scalable to future hardware implementations in a way Sony's solution is not (since it relies so much more on absolute fixed-function hardware).

You are absolutely right. They have chosen a flexible solution (i.e. CPU overhead) that can be updated over time but lost throughput and latency in the process. That is the tradeoff they have made - I respect them for that. You are trying to argue there was no cost to that decision, though, which I find odd.

Latency is not a fixed, static element; algorithms improve and hardware designs improve over time, and therefore latency can drop. You will eventually see PC SSD I/O solutions using quality drives and the aspects of XvA that MS ports over (such as DirectStorage) outperform Sony's solution. Will that be immediate? No. But within a year or two it is absolutely possible. THAT is the benefit of a more hardware-agnostic design that nonetheless maps very well onto new hardware to leverage it as best as possible.

CPU overhead is a big thing so I am not sure why you argue about that. Honestly, the entire field of RISC processors for example is about this, i.e. 'how can I lower the loss of speed in the CPU and still maintain some level of flexibility'.

No, RISC processors came about as a means of paring down complex CISC instruction sets and isolating specific instructions to simpler hardware implementations for devices targeting specific markets. The trade-off being that simple instructions execute faster on RISC but more complex operations take more cycles to complete. None of that has anything to do with CPU overhead.

Speaking of which, you're still overstating it in this specific case. A software-based solution running on two generations of processors, with the latter having a more modern design, better architecture, etc., will run better (and therefore have much lower CPU overhead) on the latter. Again, MS have been able to reduce the XvA stack on the CPU to 1/10th of a CPU core; do you think this would suddenly balloon on a more advanced desktop CPU paired with similar SSD I/O hardware to the next-gen systems? Absolutely not.

I get where you are going wrong there, but Sony haven't committed to an ASIC for zlib or derivatives kraken, oodle, etc.

The IO complex is using two programmable co-processors and specialist super low latency SRAM memory to accommodate compression and decompression algorithm acceleration in whatever solution it evolves into, over the generation. So even if they used zlib, they'd just update the IO complex code, as the IO complex data is moved directly into the unified RAM and so wouldn't need any CUs for that type of work.. To put in context Cerny, stated that the IO complex decompressor is supposed to be equal to 9 of the PS5 Zen2 cores attempting the same work - which is also why I said the XsX might be using AVX2 for its CPU decompression, as clearly the Zen2 cores are suited to the work.

As for your Goosen quote, that feels like an end to the idea of AVX2 decompression on XsX, and possibly even the CUs for decompression. It is just a shame that they don't seem to want to reveal the details of the decompression block in the way Sony have - maybe that's what the ARM processors you mentioned are being used for.

In regards of SFS, I suspect Dirt is able to replace data mid-frame because it is (probably) only streaming in from the 10GB GPU pool, with pre-stage assets ready to use for that race. What I was referring to was streaming in from the SSD like an open world game does. It would defeat the purpose of SFS saving on bandwidth and memory use to stage the assets in GPU ram in the traditional way,, but you can't derive and return sampler feedback data without lower order place holder data loaded and rendered, first. And because of the round trip delay of the sampler feedback info, some of those early transition frames will need to be rendered as is AFAIK, which would be the data latency I was talking about.

"specialist super low latency SRAM"

There are many different types of SRAM, with varying latency, that also often depends on the size. For all we know Sony could've gone for cheaper SRAM with higher latency that could still be somewhat lower than DDR-based memories but not by a hefty amount. Conversely, if they've gone with high-quality SRAM with seemingly very low latency, then the cache size will be quite small, which will affect other performance metrics of the SSD I/O. Seeing as how they've had to be considerate of both GDDR6 and NAND memory prices, I'd suggest they've gone with some middle-of-the-road SRAM for the I/O block cache, so it actually remains to be seen just how low the latency actually is.

Yes, you touch on them being able to update the code for the I/O hardware; that is essentially firmware. It is the software to the hardware - the same stuff I was just telling Elog about. People underestimate how crucial good software is to leveraging the hardware that lies underneath. So I am at least glad you have acknowledged this by stating how Sony would evolve the firmware over the generation. This is also what MS will do for their I/O hardware over the generation, too.

Also yes PS5 decompression can seemingly do 9 Zen 2 cores worth of decompression on the I/O block, but then how are you conflating this to saying Series X would need to use its own actual CPU for data decompression or CU cores of the GPU? That system has a decompression block as well. Is it as much as Sony's in throughput? No. But it doesn't need to be. So this really just moreso works against the idea that MS would need to use the CPU for decompression work or CU cores for texture data decompression (unless I missed something and you suggested Sony would have to use some CUs on their GPU for a similar purpose, if by some chance the I/O decompression hardware is not doing this?)

So with that said I don't see why you would state this, then assume Goosen's quote is suggesting what you suggest. If Series X were a PC with no specialized decompression hardware then I'd agree this is what they'd be doing, and maybe this is something they can do on PC. But I don't see any context for Series X (or Series S) needing to set aside these resources when there is no insistence PS5 does this, and both systems have I/O-purposed hardware in them to handle these very tasks.

EDIT: Wanted to add in real quick, if the insistence comes from the fact PS5's SSD I/O has more comparable raw horsepower to it, we also need to keep in mind it needs that for its higher raw bandwidth totals. Series X is targeting lower raw bandwidth with its solution, so it doesn't need as much of that hardware built right into the I/O solution (this is not to suggest it is meek by any measures here, however).
 

PaintTinJr

Member
I think we are saying the same thing, and you did give a great detailed summary - thanks for that :). Please feel free to pick the following apart; quite an enjoyable conversation.

Oodle Texture includes Rate Distortion Optimisation as well as block layout optimisation to improve the Kraken lossless compression step (enhancing compression rate), while BC7Prep (and I was thinking BCPack too) adds a further layout-reordering optimisation that needs to be undone (in software on PS5, whereas the BCPack decoder is able to undo its version of the same optimisation in the HW decompression block, requiring no further decoding work).

BC7Prep: http://cbloomrants.blogspot.com/2020/06/oodle-texture-bc7prep-data-flow.html -> requires an extra decoding step in SW

Oodle Texture: http://cbloomrants.blogspot.com/2020/06/oodle-texture-slashes-game-sizes.html (this includes the encoding and decoding pipeline steps too, quite handy) -> includes RDO, improves Kraken compression rate, and requires no additional SW decoding step





My only point is that, based on the advertised equivalent compressed I/O bandwidth from both MS and Sony (2.4 GB/s to ~4.8 GB/s for XSX vs 5.5 GB/s to 8-9 GB/s on PS5... the latter may be a bit more conservative based on historical data, but let's assume it is not), BCPack averages higher compression rates for textures and there is no mention of a required GPU-based or CPU-based decoding step. I do think Sony was already factoring in RDO pre-processing before Kraken compression, but not BC7Prep, as they would need a giant * next to it (*requires GPU decoding step, might lower performance ;)).
Great post - I wanted to reply earlier - lots of horse's-mouth info, which certainly helps piece the solutions together.

To be fair (after looking specifically at BC7 formats) I'm a little surprised at how convoluted (not just complex) the solutions for real-time texture decompression are now. I'm a great believer in self-contained, elegant solutions that can be easily described on the back of an envelope. But I was reading recently about the ASTC format (Adaptive Scalable Texture Compression) from the Khronos Group/OpenGL standards people, and it was interesting that they make the argument that using more compute is a reasonable trade-off for compression now, because memory speed increases are massively outstripped by increases in compute year-on-year.

I do wonder at what point real-time Jpeg decompression acceleration in GPUs might be reconsidered as the desirable choice for texture mapping..
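
For a sense of the numbers behind that compute-for-bandwidth argument, here is the storage side of it at a few common fixed rates (simple arithmetic only: one 4096x4096 texture, no mip chain):

texels = 4096 * 4096
for name, bpp in [("RGBA8 uncompressed", 32), ("BC7 / BC3", 8),
                  ("BC1", 4), ("ASTC 8x8", 2), ("ASTC 12x12", 0.89)]:
    print(f"{name:>20}: {texels * bpp / 8 / 2**20:7.1f} MiB")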
 

PaintTinJr

Member
You're grossly overstating this overhead. You keep bringing it up as if it is taxing on the system, when all evidence points to the opposite.

Also, if we want to get technical, PS5's solution does not 100% absolve the CPU from doing anything; the CPU still needs to instruct the processor core of the I/O block on what data to fetch, move, etc., let alone initialize it for work. That still requires the CPU's input. It's just that all of the grunt work from there on is handled by the I/O block. On the Series systems somewhat more after that point is handled by the CPU, but only about 1/10th of a CPU core (on the OS core), which is nowhere near the CPU overhead you seem to think it is.



On an absolute technical level I've already disproven this with the response above.



Again, has been disproven. There is no 100% path "without" CPU overhead because the CPU still needs to initialize and coordinate dedicated hardware components to do their job, similar to how it is responsible for issuing instructions to the GPU (traditionally).



This is a faith-based argument, though. And at least from what I can tell you haven't looked at all available information when it pertains to MS. Also at the end of the day you aren't really disagreeing with anything I'm saying because my argument has never been that XvA suddenly "closes the gap" or surpasses PS5's solution, but you seem to think that is what I've been arguing.

My entire point is that XvA's optimizations greatly narrow the SSD I/O performance gap between Series X and PS5, and is implemented in a way where direct comparisons aren't even particularly valid a good deal of the time. It is also more hardware-agnostic which of course comes with its own drawbacks (tho not nearly in the way you seem to be postulating), but has the benefit of being scalable to future hardware implementations in a way Sony's solution is not (since it relies so much more on absolute fixed-function hardware).



Latency is not a fixed, static element; algorithms improve and hardware designs improve over time, and therefore latency can drop. You will eventually see PC SSD I/O solutions using quality drives and the aspects of XvA that MS ports over (such as DirectStorage) outperform Sony's solution. Will that be immediate? No. But within a year or two it is absolutely possible. THAT is the benefit of a more hardware-agnostic design that nonetheless maps very well onto new hardware to leverage it as best as possible.



No, RISC processors came about as a means of paring down complex CISC instruction sets and isolating specific instructions to simpler hardware implementations for devices targeting specific markets. The trade-off being that simple instructions execute faster on RISC but more complex operations take more cycles to complete. None of that has anything to do with CPU overhead.

Speaking of which, you're still overstating it in this specific case. A software-based solution running on two generations of processors, with the latter having a more modern design, better architecture, etc., will run better (and therefore have much lower CPU overhead) on the latter. Again, MS have been able to reduce the XvA stack on the CPU to 1/10th of a CPU core; do you think this would suddenly balloon on a more advanced desktop CPU paired with similar SSD I/O hardware to the next-gen systems? Absolutely not.



"specialist super low latency SRAM"

There are many different types of SRAM, with varying latency, that also often depends on the size. For all we know Sony could've gone for cheaper SRAM with higher latency that could still be somewhat lower than DDR-based memories but not by a hefty amount. Conversely, if they've gone with high-quality SRAM with seemingly very low latency, then the cache size will be quite small, which will affect other performance metrics of the SSD I/O. Seeing as how they've had to be considerate of both GDDR6 and NAND memory prices, I'd suggest they've gone with some middle-of-the-road SRAM for the I/O block cache, so it actually remains to be seen just how low the latency actually is.

Yes, you touch on them being able to update the code for the I/O hardware; that is essentially firmware. It is the software to the hardware - the same stuff I was just telling Elog about. People underestimate how crucial good software is to leveraging the hardware that lies underneath. So I am at least glad you have acknowledged this by stating how Sony would evolve the firmware over the generation. This is also what MS will do for their I/O hardware over the generation, too.

Also yes PS5 decompression can seemingly do 9 Zen 2 cores worth of decompression on the I/O block, but then how are you conflating this to saying Series X would need to use its own actual CPU for data decompression or CU cores of the GPU? That system has a decompression block as well. Is it as much as Sony's in throughput? No. But it doesn't need to be. So this really just moreso works against the idea that MS would need to use the CPU for decompression work or CU cores for texture data decompression (unless I missed something and you suggested Sony would have to use some CUs on their GPU for a similar purpose, if by some chance the I/O decompression hardware is not doing this?)

So with that said I don't see why you would state this, then assume Goosen's quote is suggesting what you suggest. If Series X were a PC with no specialized decompression hardware then I'd agree this is what they'd be doing, and maybe this is something they can do on PC. But I don't see any context for Series X (or Series S) needing to set aside these resources when there is no insistence PS5 does this, and both systems have I/O-purposed hardware in them to handle these very tasks.

EDIT: Wanted to add in real quick, if the insistence comes from the fact PS5's SSD I/O has more comparable raw horsepower to it, we also need to keep in mind it needs that for its higher raw bandwidth totals. Series X is targeting lower raw bandwidth with its solution, so it doesn't need as much of that hardware built right into the I/O solution (this is not to suggest it is meek by any measures here, however).
Just a quick reply to say I should have worded my previous quote about Goosen better, but I was actually agreeing with you that the XsX won't be using CUs for decompression, or using AVX2 for decompression - after reading that quote and remembering you mentioned the hardware decompression block in the Hot Chips die shot in an earlier post.

As for the SRAM in the IO complex, or even it running without much CPU interaction, I think it is worth considering that Sony already hardware-accelerated real-time zlib decompression in the PS3 generation, using just an SPU with a virtually latency-free 256KB local store inside the SPU - which had to hold both executable and data in that 256KB. And the SPUs were fully capable of running independently from the PPU - once they were configured and kicked into action - so I wouldn't assume that the IO complex will be an inferior solution to an SPU solution. The performance of the IO complex being probably 5 years ahead of anything generic PC hardware will match - never mind better, unless Sony/AMD release the solution - would have me speculate that the SRAM chip will be the fastest available and probably no bigger than the smallest/cheapest version of that module that is 1MB or bigger - effectively 4x the memory they had for the local store in the SPUs, giving them more headroom.

Another consideration about the IO complex is that it might be the other half of the CU pair that was reworked to provide the SPU-esque Tempest engine. But obviously this thread is about the XsX info, so it is probably worth mentioning that, in spite of the difference in decompression block numbers, Microsoft must still have put some serious cost and effort into the decompression block - if that is also (presumably) where they are providing the (DRAM) cache solution for their cacheless SSD that was mentioned in the next-gen thread some months back. I really hope they see the technical interest in the XvA technology and provide more info before the console launch.
 
Just a quick reply to say I should have worded my previous quote about Goosen better, but I was actually agreeing with you that the XsX won't be using CUs for decompression, or using AVX2 for decompression - after reading that quote and remembering you mentioned the hardware decompression block in the Hot Chips die shot in an earlier post.

As for the SRAM in the IO complex, or even it running without much CPU interaction, I think it is worth considering that Sony already hardware-accelerated real-time zlib decompression in the PS3 generation, using just an SPU with a virtually latency-free 256KB local store inside the SPU - which had to hold both executable and data in that 256KB. And the SPUs were fully capable of running independently from the PPU - once they were configured and kicked into action - so I wouldn't assume that the IO complex will be an inferior solution to an SPU solution. The performance of the IO complex being probably 5 years ahead of anything generic PC hardware will match - never mind better, unless Sony/AMD release the solution - would have me speculate that the SRAM chip will be the fastest available and probably no bigger than the smallest/cheapest version of that module that is 1MB or bigger - effectively 4x the memory they had for the local store in the SPUs, giving them more headroom.

Another consideration about the IO complex is that it might be the other half of the CU pair that was reworked to provide the SPU-esque Tempest engine. But obviously this thread is about the XsX info, so it is probably worth mentioning that, in spite of the difference in decompression block numbers, Microsoft must still have put some serious cost and effort into the decompression block - if that is also (presumably) where they are providing the (DRAM) cache solution for their cacheless SSD that was mentioned in the next-gen thread some months back. I really hope they see the technical interest in the XvA technology and provide more info before the console launch.

Ah okay, I appreciate that clarification on the Goosen stuff, it had me perplexed there for a moment xD. Regarding the PS5 SSD I/O setup, it's certainly possible they could be doing what you suggest. The comparison with the SPU is interesting because, while Sony have said nothing about customizing the hardware in the I/O block to act like an SPU similar to what they're doing with Tempest, I guess theoretically they could try something like this at least with the cache in the I/O block? Though, in PS3's case, as you mention, it's "only" 256 KB; that's good for an L2 cache of the main processor (not the coherency engines) of the PS5 SSD I/O, but the way Sony presented the graphic it looked like off-chip SRAM, so I guess we could call it a fat L3 or even "L4" cache.

In that case, the size would need to be in the MBs, and as we also know, cache latency takes a bigger hit moving down the stack: the registers/"L0$" are always faster than the L1$, the L1$ always faster than the L2$, etc. The trade-off, tho, has always been that the lower-level caches have greater capacities. Even the GDDR6 in the next-gen consoles can be considered a sort of cache if we're sticking to memory not directly on the chip as cache. I see you are of the thought that they're going with 1 MB SRAM chips, to try matching the local store caches of the PS3 SPU in latency figures? That's an interesting possibility, and it might be doable if they really REALLY want to hit that specific aspect of the performance. However, I also think an SRAM cache that small could affect other parts of the design like the look-up table; actually, I had some folks link me some possible insider details a couple of months ago that detailed a lot of the PS5's SSD I/O structure. Gave it a read, thought it was pretty stimulating (whether it was all true or not is still up for discussion), and I've thought a lot about that info. They did specifically go into parts of the LUT and it'd reinforce your idea of them going with a very small but very fast/very low-latency SRAM cache. I'll try finding the messages and pulling that speculation forward (IIRC it was from a Chinese tech forum and translated to English).

Anyway, my point is just trying to illustrate that if the SRAM cache Sony showed is similar in function to the DDR cache other SSDs would generally use, it's more or less off-chip so would fit that kind of "L4" cache role, just being SRAM rather than DRAM. This is me assuming you've been speaking about the SRAM cache Sony have placed as a replacement for the typical DRAM cache, and not SRAM cache as in on-chip local cache, since the diagram shot from Road to PS5 put the SRAM cache in the role of the former (at least IMO). I kind of have a different take on the actual potential capacity of that SRAM cache, but I do think your idea on the size (and therein latency performance) is equally valid, especially considering it does line up with some possible info I was forwarded about two months back. So anyway, to the speculation: it will have better latency than the equivalent DRAM cache in other SSDs, but worse latency than on-chip SRAM and almost certainly than the L2$ of a PS3 SPU, while also being smaller in capacity than the DRAM cache on other SSDs (though it could be larger than some expect; since Sony gets to enjoy economies of scale the way, say, Sabrent can't, they can probably get up to 128 MB of this SRAM and still keep things relatively affordable. Most high-end SSDs on PC that I know of have DRAM caches of around 512 MB to 1 GB tho).

It's an interesting point of their SSD I/O design nonetheless and shows what tradeoffs they've made in capacity to gain an advantage in speed and latency. But I guess the question still remains how does this affect the rest of the I/O, which is something we'll have to see down the line. WRT Series X SSD I/O and handling the DRAM-less cache, there's actually a few posts from B3D from iroboto and DSoup which had some very plausible ideas on what MS could be doing. I'll have to also look for those again and edit this post sometime, but their ideas kind of fit into what the FlashMap papers go into detail on, and like you I hope there's a similar "deep dive" of sorts into XvA in the near future, though there may be parts of it they are saving on going into until things are ready for deployment in the PC space.
 

rnlval

Member
L1 is shared per shader array.
4 shader arrays means 4 L1 caches and 4 ROP blocks.
I.e. less L1 per CU compared to the 36CU variant.
XSX GPU's higher DCU count results in higher LDS (Local Data Share), vector L0 cache, instruction cache, and scalar cache SRAM on-chip storage, i.e. ~30% more than the 20-DCU RX 5700 XT.



XSX GPU has 5MB L2 cache which is 25% higher than RX 5700 XT's 4MB L2 cache.
 

rnlval

Member
It is not because the pixel fillrate is bandwidth limited on PS5. On series X it is ROP limited.
Wrong narrative.



Better learn some math and computer science.

RGBA8: 1825 MHz x 64 ROPs x 4 bytes = 467 GB/s (ROPS bound)
RGBA16F: 1825 MHz x 64 ROPs x 8 bytes = 934 GB/s (BW bound)
RGBA32F: 1825 MHz x 64 ROPs x 16 bytes = 1,868 GB/s (BW bound)

The tiled-caching rendering technique into the L2 cache will be important, and the XSX GPU has 5MB of L2 cache, which is 25% more than the RX 5700 XT's.

ROPS-bound cases can be worked around with the texture units. Review Avalanche Studios' GDC 2014 lecture on it.
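
In code form, the arithmetic above - bandwidth needed to sustain 64 ROPs at 1825 MHz for each format, versus the 560 GB/s of the XSX's GPU-optimal pool (blend/read traffic, compression and overdraw ignored):

clock_ghz, rops, available = 1.825, 64, 560          # 560 GB/s for the 10GB pool
for fmt, bytes_per_pixel in [("RGBA8", 4), ("RGBA16F", 8), ("RGBA32F", 16)]:
    needed = clock_ghz * rops * bytes_per_pixel      # GB/s of write traffic at peak fill
    verdict = "ROPS bound" if needed <= available else "BW bound"
    print(f"{fmt:>8}: {needed:6.0f} GB/s -> {verdict}")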
 

Marlenus

Member
Wrong narrative.



Better learn some math and computer science.

RGBA8: 1825 MHz x 64 ROPs x 4 bytes = 467 GB/s (ROPS bound)
RGBA16F: 1825 MHz x 64 ROPs x 8 bytes = 934 GB/s (BW bound)
RGBA32F: 1825 MHz x 64 ROPs x 16 bytes = 1,868 GB/s (BW bound)

The tiled-caching rendering technique into the L2 cache will be important, and the XSX GPU has 5MB of L2 cache, which is 25% more than the RX 5700 XT's.

ROPS-bound cases can be worked around with the texture units. Review Avalanche Studios' GDC 2014 lecture on it.

My point was that while PS5 has an on-paper pixel fillrate advantage, it cannot use it because it has less bandwidth, so their RBE performance is going to be about the same even though the peak theoretical numbers differ.

My narrative had nothing to do with what you are saying, so maybe learn to read and comprehend.
 

psorcerer

Banned
XSX GPU's higher DCU count results in higher LDS (Local Data Share), vector L0 cache, instruction cache, and scalar cache SRAM on-chip storage, i.e. ~30% more than the 20-DCU RX 5700 XT.



XSX GPU has 5MB L2 cache which is 25% higher than RX 5700 XT's 4MB L2 cache.

Overall the same L0 caches per CU as the 5700 XT, but much lower L1 and lower L2 per CU.
The GPU is a streaming processor, so it may not impact perf that much though.
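
Putting numbers on both posts (using only the figures already quoted in this thread: 26 vs 20 DCUs, 5 MB vs 4 MB L2; per-DCU L0/LDS sizes are fixed, so those totals scale with DCU count while L2 does not):

xsx_cus, n10_cus = 52, 40          # 26 vs 20 dual compute units
xsx_l2, n10_l2 = 5.0, 4.0          # MB of L2
print(f"L0/LDS totals: +{xsx_cus / n10_cus - 1:.0%}")      # scales with DCU count
print(f"L2 total:      +{xsx_l2 / n10_l2 - 1:.0%}")
print(f"L2 per CU: XSX {xsx_l2 * 1024 / xsx_cus:.0f} KB vs 5700 XT {n10_l2 * 1024 / n10_cus:.0f} KB")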
 

rnlval

Member
My point was that while PS5 has an on-paper pixel fillrate advantage, it cannot use it because it has less bandwidth, so their RBE performance is going to be about the same even though the peak theoretical numbers differ.

My narrative had nothing to do with what you are saying, so maybe learn to read and comprehend.
Your narrative on XSX being ROPS limited is still wrong.
 

rnlval

Member
Overall the same L0 caches per CU as the 5700 XT, but much lower L1 and lower L2 per CU.
The GPU is a streaming processor, so it may not impact perf that much though.
Modern GPUs with a multi-MB L2 cache connected to the ROPS/TMU/GEO blocks have "tiled caching" rendering.



L0 cache scales with DCU count. The XSX GPU's L0 cache is 30% higher than the RX 5700 XT's. Higher overall L0 cache and LDS storage makes it possible to keep more data in flight on the chip before hitting the slower L2 and external memory.

XSX GPU's 5MB L2 cache is 25% larger than the RX 5700 XT's 4 MB L2 cache.

At 4K resolution, the Gears 5 built-in benchmark shows the XSX GPU scaling about 25% higher than the RX 5700 XT's results.
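
That ~25% figure also lines up with the raw compute ratio, for what it's worth (a rough check, assuming the RX 5700 XT's advertised 1905 MHz boost clock):

def tflops(cus, clock_ghz):
    return cus * 64 * 2 * clock_ghz / 1000   # FP32 lanes x FMA x clock

xsx, navi10 = tflops(52, 1.825), tflops(40, 1.905)
print(f"XSX {xsx:.2f} TF vs RX 5700 XT {navi10:.2f} TF -> +{xsx / navi10 - 1:.0%}")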
 
XSX GPU's higher DCU count results in higher LDS (Local Data Share), vector L0 cache, instruction cache, and scalar cache SRAM on-chip storage, i.e. ~30% more than the 20-DCU RX 5700 XT.



XSX GPU has 5MB L2 cache which is 25% higher than RX 5700 XT's 4MB L2 cache.
I mean, it's RDNA2 after all, 5700XT is RDNA1.
 

Marlenus

Member
Your narrative on XSX being ROPS limited is still wrong.

For typical 32bit textures the Series X is ROP limited.

Using the shaders (not texture units) can be useful but it cannot do everything so is situational.

The bottom line is that the extra clockspeed of the PS5 does nothing to help pixel fillrate in real world scenarios because it does not have the memory bandwidth available to max it out.

For the Series X and typical 32-bit textures, despite being ROP bound, the bandwidth requirements exceed the bandwidth of the PS5, so Series X pixel fillrate performance in the real world is going to be higher despite the lower theoretical number.

In the end they end up being pretty close (within 5% of each other) so realistically it will make zero difference.
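
The "within 5%" conclusion in numbers (a simplification: 64 ROPs assumed for both, pure RGBA8 write traffic, and each console's full memory bandwidth treated as available to the ROPs):

for name, clock_ghz, bw in [("PS5", 2.23, 448), ("XSX", 1.825, 560)]:
    theoretical = clock_ghz * 64 * 4               # GB/s of RGBA8 writes at peak fill
    sustainable = min(theoretical, bw)             # clamped by memory bandwidth
    print(f"{name}: theoretical {theoretical:.0f} GB/s, sustainable {sustainable:.0f} GB/s")
# 467 vs 448 -> roughly 4% apart, i.e. "within 5%"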
 

Fafalada

Fafracer forever
The bottom line is that the extra clockspeed of the PS5 does nothing to help pixel fillrate in real world scenarios because it does not have the memory bandwidth available to max it out.
ROP limited workloads are unlikely to be using much in terms of textures (or shader work for that matter) - we're mostly talking things like filling attribute buffers, depth-prepass, shadow-map generation, etc. How much that ends up contributing to the overall frame-time will entirely depend on specific game and workloads of its render pipeline.

I do wonder at what point real-time Jpeg decompression acceleration in GPUs might be reconsidered as the desirable choice for texture mapping..
Basically never.
Block compression already exists that achieves 2bpp at quality comparable to the 4bpp that BC formats allow. It can get as low as 1bpp at some 'acceptable' loss.
JPEG's variable bit rate will always be more cumbersome to implement/use in a GPU pipeline, and given that 1bpp is pretty much the threshold where the quality dropoff for everything (JPEG included) becomes untenable for the storage gains - there really isn't any great competitive benefit to using it over said state-of-the-art block compression.

Maybe someone will eventually come up with an ML method that can fabricate (instead of decompress) texture data, similar to how DLSS 'increases' resolution - and that will eventually get us past the 1bpp threshold at acceptable perceptual quality - but that would just cement JPEG as a non-option.
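
One way to see the fixed-rate vs variable-rate point: with a block format the GPU can compute the address of any texel's block directly, which a variable bit-rate stream like JPEG cannot offer. A minimal sketch, assuming a BC7 texture whose width is a multiple of 4 (16 bytes per 4x4 block):

def bc7_block_offset(x, y, width, block_bytes=16, block_dim=4):
    # Closed-form address of the block holding texel (x, y) - no decoding needed.
    blocks_per_row = width // block_dim
    return ((y // block_dim) * blocks_per_row + (x // block_dim)) * block_bytes

print(bc7_block_offset(1000, 2000, 4096))   # jump straight to this byte offset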
 

PaintTinJr

Member
....
Basically never.
Block compression already exists that achieves 2bpp at quality comparable to the 4bpp that BC formats allow. It can get as low as 1bpp at some 'acceptable' loss.
JPEG's variable bit rate will always be more cumbersome to implement/use in a GPU pipeline, and given that 1bpp is pretty much the threshold where the quality dropoff for everything (JPEG included) becomes untenable for the storage gains - there really isn't any great competitive benefit to using it over said state-of-the-art block compression.

Maybe someone will eventually come up with an ML method that can fabricate (instead of decompress) texture data, similar to how DLSS 'increases' resolution - and that will eventually get us past the 1bpp threshold at acceptable perceptual quality - but that would just cement JPEG as a non-option.
Unfortunately, I firmly disagree with your assessment of JPEG (compared to BC formats) and its usefulness to game graphics looking forward.

I've tried a few times to write a counter argument and have found it ends up too much down the technical rabbit hole - hence why I haven't replied sooner - but I now think the best way to explain my opposite viewpoint is with the following question:

Do you think the algorithms in BCn, or even ASTC, are data-agnostic enough - like JPEG's mathematics-based signal-processing algorithm is - for them to remain largely unchanged, and equally effective, when GPUs properly move beyond RGB_888 and the hand-adjusted, hacked lighting models of this gen?

I believe that compression formats are very useful to offset slowly rising memory bandwidth gains, but long term my instincts tell me that fidelity preservation will require lossless compression, or lossy ratios of less than 4:1, for PSNR gains to store improved rendering results - the Ericsson Texture Compression 2 paper claims that a 1 dB gain in PSNR is noticeable to a viewer - and in that situation I believe that compression will follow the signal's needs and low bits-per-pixel solutions won't be applicable.

IMHO DXT/BCn compression has seen easy, freebie PSNR wins in game graphics - since the beginning - by increasing the DXT/BCn texture resolution faster than the game framebuffer resolution (as we are only just looking at 4K native for this coming gen in mainstream gaming).

With every increase in DXT/BCn texture resolution, the high-frequency signal detail being lost either goes unnoticed because the error is minified and filtered when composited into a lower-resolution framebuffer, or the data used to generate the compressed image isn't providing enough high-frequency detail for the compression blocks to struggle to find a good encoding - a problem that I think will become a big issue as better indirect GI gets used, like it is in the UE5 demo.

Game graphics converging with film CGI seems a little way off, but a closer possibility next gen. And just like in film CGI, (IMHO) motion-JPEG compression will be considered the more predictable, resilient, filterable and more informative encoding for ML to analyse - per frame or frame sequence - and enhance.
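
For reference, the PSNR measure that the 1 dB claim is stated in, as a minimal sketch for 8-bit data (not tied to any particular codec):

import math

def psnr(original, decoded, max_value=255):
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    return float("inf") if mse == 0 else 10 * math.log10(max_value ** 2 / mse)

print(f"{psnr([10, 200, 30, 90], [12, 198, 33, 91]):.1f} dB")   # toy 4-sample example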
 

rnlval

Member
For typical 32bit textures the Series X is ROP limited.

Using the shaders (not texture units) can be useful but it cannot do everything so is situational.

The bottom line is that the extra clockspeed of the PS5 does nothing to help pixel fillrate in real world scenarios because it does not have the memory bandwidth available to max it out.

For the Series X and typical 32-bit textures, despite being ROP bound, the bandwidth requirements exceed the bandwidth of the PS5, so Series X pixel fillrate performance in the real world is going to be higher despite the lower theoretical number.

In the end they end up being pretty close (within 5% of each other) so realistically it will make zero difference.
Nope,

RGBA8 framebuffer: 1825 MHz x 64 ROPs x 4 bytes = 467 GB/s (ROPS bound)
RGBA16F framebuffer: 1825 MHz x 64 ROPs x 8 bytes = 934 GB/s (BW bound)
RGBA32F framebuffer: 1825 MHz x 64 ROPs x 16 bytes = 1,868 GB/s (BW bound)

Texture data are fetched via TMU units!

Also



ROPS bypass method is via UAV texture path i.e. TMU.
 