Nostremitus
Member
You guys should request that this thread be moved to Community before it's locked like the GPU one...
He lives! It got locked??? Laaaame.
He lives!
Some more bits about the processor in here
http://fail0verflow.com/blog/2014/console-hacking-2013-omake.html
They mention Starbuck quite a few times in that article but never specify anything about it. Did we ever find out exactly what it is? Decent article anyway, though not sure who thought PPC750 was in-order.
Unfortunate to see the GPU thread locked BTW. We were still occasionally getting some semi interesting bits of info from it. I agree with stopping all the general chit chat in there, but not locking due to age.
I've done some research and it seems that it is as I suspected: manually assigning CPU threads gives better performance. It seems that it decreases execution times by a substantial amount.
http://zone.ni.com/reference/en-XX/help/371361J-01/lvconcepts/cpu_assign/
The threads have to be allocated manually on the Wii U CPU, which makes the Wii U CPU a little harder to use, but gives better overall performance. I'd imagine that, in conjunction with the shorter pipelines, this makes the Espresso's performance far higher than what people imagine.
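For what it's worth, that NI page is about assigning LabVIEW code to cores on a desktop OS, not about the Wii U. The general idea of pinning a thread to a core instead of letting the scheduler migrate it looks roughly like this - a minimal Linux/pthreads sketch, not the Cafe OS API (the core number and the worker's job are made up):

// build: g++ -pthread affinity.cc  (g++ defines _GNU_SOURCE, which the affinity call needs)
#include <pthread.h>
#include <sched.h>
#include <cstdio>

static void *worker(void *) {
    // imagine audio mixing, physics, etc. running here on whichever core we were pinned to
    return nullptr;
}

int main() {
    pthread_t t;
    pthread_create(&t, nullptr, worker, nullptr);

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);   // pin the worker to core 1 (hypothetical choice)
    if (pthread_setaffinity_np(t, sizeof(set), &set) != 0)
        std::perror("pthread_setaffinity_np");

    pthread_join(t, nullptr);
    return 0;
}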
Why has the gpu thread been locked?
It had become a general speculation megathread... of sorts. I think it would've been better to rename it and move it to community as the Nintendo Speculation Community Thread, but that's just me.
I think I asked this earlier, but what are the benefits/problems of having the developer manually pick which code runs on which CPU core? No one has ever spoken on this as far as I can remember.
Wouldn't there be some benefit to not having the overhead and iffiness of the CPU auto-delegating tasks? That would be one process less the CPU needs to execute.
In fact, the SMPization of the 750 in the Espresso is not perfect. There appears to be a bug that affects load-exclusive and store-exclusive instructions (an explicit cache flush is required), which means that using SMP Linux with them will require patching the kernel and libpthread to work around it (and possibly other software that directly uses these instructions to e.g. implement atomics). They would’ve never shipped such a blatant bug on a general-purpose CPU, but I guess they called it good enough for a game console since they can just work around it in the SDK (which they do: the Cafe OS locking primitives have the workaround).
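For context, the load-exclusive/store-exclusive instructions in question are PowerPC's lwarx/stwcx. reservation pair, the building block for locks and atomics. Below is a sketch of such a primitive; the dcbf line is only a guess at where the "explicit cache flush" workaround might sit, since the actual Cafe OS sequence isn't public (PowerPC-only inline asm, GCC syntax):

#include <cstdint>

// Atomic 32-bit add via a reservation loop. Only meaningful when compiled for
// a PowerPC target. The dcbf is NOT a documented requirement - it is a guess
// at the kind of explicit flush the Cafe OS locking primitives reportedly add.
static inline uint32_t atomic_add32(volatile uint32_t *p, uint32_t v) {
    uint32_t old, tmp;
    __asm__ __volatile__(
        "1:                  \n"
        "   dcbf   0,%3      \n"   // speculated workaround: flush the line before reserving it
        "   lwarx  %0,0,%3   \n"   // load word and set the reservation
        "   add    %1,%0,%4  \n"
        "   stwcx. %1,0,%3   \n"   // store only if the reservation still holds
        "   bne-   1b        \n"   // reservation lost -> retry
        : "=&r"(old), "=&r"(tmp), "+m"(*p)
        : "r"(p), "r"(v)
        : "cc", "memory");
    return old;
}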
wsippel got it locked by posting some random rumor he heard in there instead of making a new thread and it derailed into multiple pages of speculation about Nintendo's next console, etc.
I don't think that link is the proof you're looking for.
Can you clarify your original question?
Clarify it? What about it do you not understand?
Replacing the entire execution state of a CPU core including L2 cache is a ~2MB data transfer (1MB* out, 1MB* in). You can do this literally thousands of times per second even on the Wii U's not-so-fast memory subsystem. Doing it 60 times per second (a decent forced scheduling rate for a games console, if we assume such a thing is even done) is kinda negligible.
*technically, L2 + L1I + size of architectural registers (2^20 + 2^15 + 32 * 4 bytes for the "big" core); L1D doesn't count because all its data is mirrored in L2. 1MB is not far off.
I assume you're talking of thread context switches here, right? If so, on one hand, you're forgetting quite a bit from the cpu's programming-model state - the 32 64-bit FPR registers. On the other hand, context switching rarely concerns the caches - their content is subject to the common caching/eviction policies. On the third hand, 60Hz is a rather arbitrary rate to context switch - normally it's neither regular, nor at this rate.
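As a quick sanity check on the claim above, here is the arithmetic spelled out. The 12.8GB/s figure is the commonly cited peak for the Wii U's 64-bit DDR3-1600 main memory and is an assumption here, not a confirmed spec:

#include <cstdio>

int main() {
    constexpr double bytes_per_switch = 2.0 * 1024 * 1024;  // ~1MB out + ~1MB in, per the post above
    constexpr double switches_per_sec = 60.0;
    constexpr double assumed_peak_bw  = 12.8e9;             // bytes/s - assumed, not confirmed

    double used = bytes_per_switch * switches_per_sec;      // ~126 MB/s
    std::printf("%.0f MB/s, i.e. %.1f%% of the assumed peak\n",
                used / 1e6, 100.0 * used / assumed_peak_bw);
    return 0;
}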
Well, being designed to perform well at clock speeds in excess of 4GHz and hitting 150W barriers well before that, causing all sorts of unexpected compromises and throttling, was part of it.
The hyper-long pipeline wasn't ideal, true, but that doesn't mean shorter is always better either. I believe I linked an engineering study early in this thread showing 11-14 stages being most optimal for most code regardless of clock speed, and iirc that's mostly where Intel targets (Haswell is a 14 stager).
Even if those were best case scenarios, that's from back in the Pentium 4 days, and even generation to generation back then they were improving the misprediction rate. I don't know what it is now, but I'd be interested to see it too.
About your talk of pipeline flushes, modern processors don't even flush every stage in the pipeline on a misprediction, just the relevant stages now. So something with a XX stage pipeline may flush half of them if half are irrelevant, or any other fraction. I forget what this is called, but I think anything post Core 2 has it. I'll try to find the name.
Anyways, even if some of my rambling did need correction, you agree that pipeline length isn't black and white, as was my point, right? There's a balance to be struck, and crazy short isn't necessarily ideal just as crazy long isn't - that was what I was getting at. This "lower always = higher IPC" notion is what I was attacking.
My guess is krizzx is talking about the cooperative multithreading that the Wii U OS employs. It allows for fine-grained control over latencies, largely by removing the need to save/restore thread contexts. Speaking of which..
I assume you're talking of thread context switches here, right? If so, on one hand, you're forgetting quite a bit from the cpu's programming-model state - the 32 64bit FPR registers. On the other hand, context switching rarely concerns the caches - their content is subject to the common caching/eviction policies. On the third hand, 60hz is a rather arbitrary rate to context switch - normally it's neither regular, nor at this rate.
/context switch
Sorry about getting back to you only now, tipoo:
Why are you bringing up throttling in a general IPC topic? And p4 was neither the first nor the last design to employ active speed scaling to avoid critical conditions. The most recent HPC Intel part I worked with does that just as well - no Prescott-scale issues whatsoever.
You seem to be confusing subjects again - the length of the pipeline is dictated by the target clocks. There's no 'N-stage pipeline is optimal for software, regardless of clock speed'. The optimal pipeline regardless of clock speed would be one which does not allow stalls, be that in the form of bubbles or flushes. Of course, there's no such thing.
This past week I was fighting a bottleneck in critical code path where the branch predictor's success rate was staying in the 60s, on a top-of-the-crop Xeon cpu. I had to rewrite the entire algorithm to fix that, because there's no branch predictor in the entire world which could do well on the original one, not with the given data set.
You're confusing mispredictions with branch-target misses. The latter result in large bubbles (as large as 8 clocks) on modern Intels. The former result in straight royal flushes, just like most other speculations gone bad (look up 'machine clears' in recent Intel pipelines).
Among two hypothetical architectures A and B, where B is essentially A with fewer-but-fatter stages, B will always have better IPC. The Bobcat-Jaguar example you used earlier does not show what you think it does, because all the enhancements AMD did to improve the IPC on Jaguar were also applicable to Bobcat, but never occurred. Once again, Jaguar had improved IPC despite its longer pipeline, not because of it.
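On the branch-predictor point above ("no branch predictor in the entire world could do well on the original one"), here is a hypothetical example of the kind of rewrite that sidesteps the problem - turning a data-dependent branch into a branchless select, so there is nothing left to predict. This is not blu's actual code, which was never posted:

#include <cstdint>
#include <cstddef>

// Branchy version: on random data the branch mispredicts roughly half the time.
uint64_t sum_over_threshold(const uint32_t *v, size_t n, uint32_t threshold) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        if (v[i] >= threshold)          // data-dependent, effectively unpredictable
            sum += v[i];
    return sum;
}

// Branchless version: the comparison becomes a 0/all-ones mask, so there is no
// branch left to predict (compilers typically emit a setcc/select here).
uint64_t sum_over_threshold_branchless(const uint32_t *v, size_t n, uint32_t threshold) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        uint64_t mask = (v[i] >= threshold) ? ~uint64_t{0} : uint64_t{0};
        sum += v[i] & mask;
    }
    return sum;
}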
Depends on how you look at it.
That is correct.
Also, I have an offtopic question. Is the PS4 CPU stronger than the CELL?
That's a bit of a strange way to look at things, especially since GPUs have been doing double precision for a while now. The division is more along the lines of: CPUs do low-latency tasks and tasks that otherwise won't fit the GPU, and GPUs do high-throughput, high-latency 'streaming' tasks.
GPUs are used for single precision while CPUs are used for double precision. Both can do the other, though. CPUs will have SIMD cores, and GPUs can do compute processes, but neither replaces the other.
Anyway, it only had 6 SPUs (SIMD cores) and 1 PPU (general processing/scheduling). So this thing was relatively fast for SP. But because it only had 1 general core, its DP was crappy.
Cell is not 'relatively fast' at SP, it's abnormally fast for a CPU. Also, Cell SPEs can do DP. They don't particularly excel at that (not until the 2008 server-targeted chip revision anyway), but they don't do 0 DP FLOPS either; in fact they do more DP FLOPS than the PPE (just because the latter cannot do doubles via AltiVec). Of course, power-conservative x86 cores don't excel at DP either. Fun fact: on Bobcat you can actually get a 'deceleration' if you try to SIMD-ify some double precision code, just because the 64-bit ALUs do one double at a time, but you still pay the potential price for the SIMD-ification overhead, i.e. just getting your data into a SIMD-friendly shape might not come for free.
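To illustrate the SIMD-ification overhead blu mentions: if doubles aren't already laid out in adjacent pairs, you pay shuffles/packs just to build each vector, and on a core whose FP datapath is 64 bits wide (as on Bobcat) the packed op gets split in two anyway, so the packing is a net loss. A hedged SSE2 sketch - the strided dot-product scenario is made up for illustration:

#include <emmintrin.h>   // SSE2

// Scalar: one multiply-add per element, no packing needed.
double dot_stride_scalar(const double *a, const double *b, int n, int stride) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += a[i * stride] * b[i * stride];
    return s;
}

// "SIMD-ified": the strided accesses force us to assemble each __m128d by hand,
// which is exactly the kind of overhead that can eat the theoretical gain.
double dot_stride_sse2(const double *a, const double *b, int n, int stride) {
    __m128d acc = _mm_setzero_pd();
    for (int i = 0; i + 1 < n; i += 2) {
        __m128d va = _mm_set_pd(a[(i + 1) * stride], a[i * stride]); // two scalar loads + pack
        __m128d vb = _mm_set_pd(b[(i + 1) * stride], b[i * stride]);
        acc = _mm_add_pd(acc, _mm_mul_pd(va, vb));
    }
    double tmp[2];
    _mm_storeu_pd(tmp, acc);
    double s = tmp[0] + tmp[1];
    if (n & 1) s += a[(n - 1) * stride] * b[(n - 1) * stride];       // leftover element
    return s;
}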
It's been speculated that, taking the information above into consideration, the Wii U's total bandwidth in gigabytes per second - including the possible 1024 bits per macro and a GPU which, according to TechPowerUp, clocks in at 550MHz - would come out to around 563.2GB per second. Keep in mind that the Xbox One runs about 170GB per second of bandwidth between the DDR3 and eSRAM, as outlined by Xbit Labs.
I know people are pretty well over this and what not but I saw this article and thought it might be interesting to look at. It is more about the GPU but that thread is dead so I wasn't sure where else to post it.
Source
With lines like that, I'd like to get the thoughts of the people who know about this stuff.
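For what it's worth, the article's 563.2GB/s appears to come from arithmetic along these lines; the eight-macro count and the assumption that every macro is read at full width every cycle are speculation, which is largely why figures like this were treated with skepticism:

#include <cstdio>

int main() {
    // Assumptions needed to reproduce the article's number - none are confirmed:
    constexpr double bits_per_macro = 1024.0;  // speculated eDRAM macro width
    constexpr double macro_count    = 8.0;     // speculated macro count for the 32MB pool
    constexpr double clock_hz       = 550e6;   // GPU clock per TechPowerUp

    double bytes_per_sec = bits_per_macro * macro_count / 8.0 * clock_hz;
    std::printf("%.1f GB/s\n", bytes_per_sec / 1e9);  // prints 563.2 GB/s
    return 0;
}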
So, why would they do this as a 512KB/2MB/512KB fixed L2 cache setup, instead of a modern dynamic-allocation L2 cache depending on each core's need?
Huh?
LLC (last-level-cache) sharing is normally done at L3*. Intel's Smart Cache is little but marketing speak when it comes to (non-LLC) L2; there L2 is not really shared - each core has its L2 portion, and that is subject to cache coherence protocols like any other cpu under the sun. Now, if somehow only one core remained running, while all the rest idled, then ISC might give an advantage not unlike Turbo Boost (another common technique these days), so the last remaining core could take hold of the entire L2. But how often do you expect such a condition to occur, and why would it be desirable?
L3 is almost always shared by all cores, understood, but L2 isn't always core-exclusive. In modern quads, for instance, often each set of two CPU cores will share a pool of L2. Not quite fully dynamic, but not quite fully set either - it's divided two ways and then dynamic between those pools, rather than divided four ways. So L2 is somewhere between set-per-core L1 and dynamic L3.
Which modern CPU by which vendor does that? The pdf you linked to talks exclusively about LLC sharing, and Intel hasn't used L2 LLC past the early Core architecture days. The reason they stopped doing that is that it did not pan out particularly well. Here's some reading: http://ixbtlabs.com/articles2/cpu/rmmt-l2-cache.html
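Tangentially, on Linux you can settle the "who shares which cache level" question empirically for whatever CPU you're sitting at - a small sketch assuming the standard sysfs cache-topology layout:

#include <fstream>
#include <iostream>
#include <string>

int main() {
    // Walks cpu0's cache indices and prints which CPUs share each level.
    for (int idx = 0; ; ++idx) {
        std::string base = "/sys/devices/system/cpu/cpu0/cache/index" + std::to_string(idx) + "/";
        std::ifstream level(base + "level");
        if (!level) break;                         // ran out of cache indices
        std::ifstream type(base + "type"), shared(base + "shared_cpu_list");
        std::string l, t, s;
        std::getline(level, l);
        std::getline(type, t);
        std::getline(shared, s);
        std::cout << "L" << l << " " << t << ", shared by CPUs " << s << "\n";
    }
    return 0;
}

On a typical recent Intel quad this prints per-core L1/L2 (shared at most with the SMT sibling) and a single L3 shared by all cores, which matches the description above.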
So, why would they do this as a 512KB/2MB/512KB fixed L2 cache setup, instead of a modern dynamic-allocation L2 cache depending on each core's need? And the split also prevents asset sharing, I think, which can be quite costly. Is it just a reduction in die size, or the effort it would take to make the 750s compatible with dynamic cache, or is there an end-user performance benefit too?
It somewhat makes sense for backwards compatibility, but there should have been a way to just enable the right amount of cache for that without having to gimp the Wii U mode.
TL;DR, shared L2 was a bad idea and has been abandoned for generations now.
As long as this thread has been bumped, there's something I've been wanting to ask GAF's advice on, with regards to Latte, the Wii U GPU - the dedicated thread for which was unfortunately locked.
I've done some reading on unconventional lighting techniques and learned about a method for producing shadows that provides surprisingly high accuracy at a rather small computational cost, but relies on having some additional GPU instructions and architectural changes that aren't necessarily implemented in any standard (at least not sufficiently), and are probably difficult to emulate efficiently.
I'm referring to Irregular Shadow Maps, or more accurately, using an Irregular Z-buffer to facilitate the rendering of alias-free shadow maps.
Hopefully some on GAF are familiar with this technique and with the difficulties in implementing it. Perhaps it was even discussed already in previous threads, in which case I'd love to be enlightened.
While I've never implemented the irregular Z-buffer algorithm above, an ongoing pet project of mine does something conceptually in the same vein - real-time raytraced global ambient occlusion on voxels. Currently it's CPU-only, but for the next iteration it should be GPGPU (and a lot faster). So to answer your question indirectly - yes, modern GPUs have the facilities to implement that algorithm, and an entire domain of such algorithms, including actual raytracing. Of course, various implementations can yield various performances (and various benefits - from accurate point-light shadows, to soft shadows from area lights, to AO, etc, etc), and to tell how fast Latte would be on some of those you'd have to actually run that on the hw. But Latte does have MSAA reads, and scatters, and some of the requirements of the algorithm are 'fakable', so I would not be shocked if Latte could actually run that viably.
edit: for the record, neither NintendoLand nor MK8 seem to be using that irregular shadow maps algo - as high-res as the shadows they demonstrate are, both titles show certain artifacts that irregular shadow maps should have eliminated. Apropos, back in early NL days I had my suspicions NL could be using stencil shadows, but a close inspection showed the tell-tale shadow map artifacts.
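For anyone unfamiliar with the technique being discussed, here's a very condensed CPU-side sketch of the irregular Z-buffer idea: the "depth buffer" stores, per light-space cell, the exact eye-visible points that landed there, and occluders are then tested against those exact points, which is what removes the aliasing. Purely conceptual - not how any particular GPU or engine implements it:

#include <vector>

struct Point3 { float x, y, z; };              // an eye-visible point, already in light space (x,y in [0,1), z = depth)
struct Cell   { std::vector<int> samples; };   // indices of the points that landed in this light-space cell

struct IrregularZBuffer {
    int w, h;
    std::vector<Cell>   cells;
    std::vector<Point3> pts;
    std::vector<bool>   inShadow;

    IrregularZBuffer(int w_, int h_, std::vector<Point3> lightSpacePts)
        : w(w_), h(h_), cells(static_cast<size_t>(w_) * h_),
          pts(std::move(lightSpacePts)), inShadow(pts.size(), false) {
        // Bin each visible point into the cell it falls into - this is the
        // "irregular" part: sample positions come from the eye pass, not from a regular grid.
        for (int i = 0; i < static_cast<int>(pts.size()); ++i) {
            int cx = static_cast<int>(pts[i].x * w);
            int cy = static_cast<int>(pts[i].y * h);
            if (cx >= 0 && cx < w && cy >= 0 && cy < h)
                cells[static_cast<size_t>(cy) * w + cx].samples.push_back(i);
        }
    }

    // Called for every cell an occluder covers when the scene is rasterized
    // from the light, with the occluder's depth at that cell.
    void testOccluder(int cx, int cy, float occluderDepth) {
        for (int i : cells[static_cast<size_t>(cy) * w + cx].samples)
            if (occluderDepth < pts[i].z)      // something lies between the light and this point
                inShadow[i] = true;
    }
};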
While we are on the subject of the GPU, exactly what type of lighting would you say they are using in MK8? http://cdn.nintendonews.com/wp-content/uploads/2014/04/mario_kart_8_rainbow_road.jpg
http://technabob.com/blog/wp-content/uploads/2014/04/mario_kart_8_rainbow_road-620x343.jpg
http://www.j1studios.com/wordpress/wp-content/uploads/Mario-Kart-8-3.jpg
I notice that all of the light sources give off a corona. Even on the headlights on the random vehicles.
The visuals stood out quite a bit to me, primarily due to the lighting complexity. Exactly where would you estimate the limit of light sources for the Wii U? I remember someone making a thread about the lighting a while back, and the lighting seems to be, on average, a few steps above what I'm used to seeing in the last gen, even in most of my PC games.
Going all the way back to ZombiU, lighting has always been the biggest thing that has stood out to me on the Wii U.
Also, on the subject of shadows, why do so many games on the Wii U have really blocky shadows (BlOps2 and Ghosts, for example), yet they are always so smooth and plentiful in Nintendo-made games and some major third-party titles? Is there a problem with the hardware's ability to produce shadows, or were the games with the blocky shadows simply using the hardware poorly?
They use some of the most competent deferred shading/deferred lighting (not clear which, as it's practically impossible to tell apart) I've seen in a very long time. What it does is allow the use of orders of magnitude more light sources than immediate shading techniques. If you pay really close attention to the latest trailer, you'd notice how every minute light source affects the character. While on the wheel, Mario's gloves get a blue hue from the antigrav kart's neon-blue headlights and underlights; karts and characters always get lit by 'local' illumination events like passing over a glowing boost bar, engine exhaust fires, explosion-style lightups - basically everything you'd normally expect to change the lighting scape around an object in the real world does that in the game. That, combined with some impeccable self-shadowing, can fool the brain into taking those for physical, be that plastic or rubber, objects. Nothing fools the brain that something on a screen is real like rich lighting interactions do.
That's just a bloom effect. The new thing here (new for the MK series) is its combination with DOF effects, which produces some really nice photo-like views.
Shadows get blocky whenever their shadow maps (i.e. textures that hold shadow info) get inadequate resolutions for the given distance from the camera. That was arguably last gen's greatest issue with shadows - when otherwise well-shaded titles would suffer from inadequate shadowmap res. And I'm not even referring to the rest of the artifacts typical for shadowmaps. The paper efyu_lemonardo brought up describes a shadowmap technique addressing all those shortcomings. Anyway, 'good' shadowmaps require extra fillrate (somewhat alleviated by GPUs' ability to draw shadowmaps at higher rates than anything else), some extra vertex processing for the extra pass that draws the shadow map, and the associated BW that goes with those. Just like with deferred shading, though, proper amounts of eDRAM help with shadowmaps as well.
There seem to be some hints of this lighting in Smash which weren't there in previous videos - projectiles didn't light the platform or opponents.
Smash is not a shading showpiece, at least not in my book.
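To make the deferred-shading point above concrete, here's a conceptual CPU-side sketch of a lighting pass: geometry is rasterized once into a G-buffer, and each additional light is then just an accumulation sweep over pixels, independent of scene complexity - which is why dozens or hundreds of small lights become affordable. All names and structs are made up for illustration; MK8's actual renderer is not public:

#include <vector>
#include <cstddef>
#include <cmath>
#include <algorithm>

struct Vec3    { float x, y, z; };
struct GSample { Vec3 pos, normal, albedo; };   // one G-buffer texel: position, normal, material colour
struct Light   { Vec3 pos, color; float radius; };

static Vec3 shade(const GSample &g, const Light &l) {
    Vec3 d{l.pos.x - g.pos.x, l.pos.y - g.pos.y, l.pos.z - g.pos.z};
    float dist = std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
    if (dist < 1e-6f || dist >= l.radius) return {0.0f, 0.0f, 0.0f};
    float ndotl = std::max(0.0f, (g.normal.x * d.x + g.normal.y * d.y + g.normal.z * d.z) / dist);
    float att   = 1.0f - dist / l.radius;       // cheap distance falloff
    return {g.albedo.x * l.color.x * ndotl * att,
            g.albedo.y * l.color.y * ndotl * att,
            g.albedo.z * l.color.z * ndotl * att};
}

// Lighting pass: every additional light is just another accumulation sweep.
// (A real implementation rasterizes light volumes so each light only touches
// the pixels it can actually reach; the early-out above stands in for that.)
void lighting_pass(const std::vector<GSample> &gbuf,
                   const std::vector<Light>   &lights,
                   std::vector<Vec3>          &framebuffer) {
    for (const Light &l : lights)
        for (std::size_t i = 0; i < gbuf.size(); ++i) {
            Vec3 c = shade(gbuf[i], l);
            framebuffer[i].x += c.x;
            framebuffer[i].y += c.y;
            framebuffer[i].z += c.z;
        }
}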
Does locked cache DMA mean being able to use the cache like the PS2 scratchpad or the Cell SPUs local memory? I assume a dev could switch between that and cache mode? Any idea of the performance implications if a developer really dug in in terms of micromanaging scratchpad memory very well, like a handful did for the Cell?
https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/A88091CAFE0F19CE852575EE0073078A/$file/To%20CL%20-%20CL%20Special%20Features%206-22-09.pdf
It's not exactly TCM, but actually it's more flexible than that. The thing is, PPC has had cache control ops since the early days of the architecture, and Gekko takes that idea one step further. That explicit control over cache lines, combined with the ability to lock half of the L1D, effectively turns that half into a scratchpad - but one that still participates in the cache coherency protocols. For instance, if you locked your cache and issued an op to load the cacheline from address N (or did a DMA), then did some access to address N scratchpad-style, and then called a routine that walks a large chunk of memory passing over N, that routine would get a cache hit on N just as if N was sitting in non-locked cache. You cannot do that with a scratchpad.
Yes, devs can switch at will.
Perhaps somebody who's done substantial work on both platforms would be qualified to answer that.
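A rough sketch of the locked-cache streaming pattern described above follows. The LC* calls are host-side stand-ins with names modeled on the old GameCube homebrew SDK, not the real Cafe OS API, and the "work" is a placeholder - the point is just the shape of the workflow: lock half of L1d, DMA a block in, crunch it at L1 speed, DMA the result out:

#include <cstdint>
#include <cstring>

// Host-side stand-ins so the sketch compiles anywhere. On the console these
// would be SDK calls; the names are modeled on the old GameCube SDK and are
// NOT the real Cafe OS API.
alignas(32) static uint8_t fake_locked_l1[16 * 1024];           // half of the 32KB L1d, once locked
static void  LCEnable() {}                                      // lock half of L1d
static void *LCGetBase() { return fake_locked_l1; }             // base of the 16KB locked window
static void  LCLoadBlocks(void *dst, const void *src, uint32_t blocks)  { std::memcpy(dst, src, blocks * 32u); }  // DMA in, 32-byte blocks
static void  LCStoreBlocks(void *dst, const void *src, uint32_t blocks) { std::memcpy(dst, src, blocks * 32u); }  // DMA out
static void  LCQueueWait(uint32_t) {}                           // wait for the DMA queue to drain

static void process_chunk(float *chunk, uint32_t n) {
    for (uint32_t i = 0; i < n; ++i) chunk[i] *= 2.0f;          // placeholder "work at L1 speed"
}

// Streams a large buffer through the locked half of L1d in 16KB chunks.
// For simplicity this assumes nfloats is a multiple of 8 (one 32-byte block).
void stream_through_locked_cache(const float *src, float *dst, uint32_t nfloats) {
    constexpr uint32_t kChunk = 16u * 1024u / sizeof(float);    // floats per 16KB window
    LCEnable();
    float *scratch = static_cast<float *>(LCGetBase());
    for (uint32_t done = 0; done < nfloats; done += kChunk) {
        uint32_t n      = (nfloats - done < kChunk) ? (nfloats - done) : kChunk;
        uint32_t blocks = (n * sizeof(float)) / 32u;
        LCLoadBlocks(scratch, src + done, blocks);              // main memory -> locked L1
        LCQueueWait(0);
        process_chunk(scratch, n);
        LCStoreBlocks(dst + done, scratch, blocks);             // locked L1 -> main memory
        LCQueueWait(0);
    }
}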
I'm not sure if this is true.
Mario Kart 8 seems to be affected, as it cannot render the game on the TV and the GamePad at the same time.