Nostremitus
Member
You guys should request that this thread be moved to Community before it's locked like the GPU one...
He lives! It got locked??? Laaaame.
He lives!
Some more bits about the processor in here
http://fail0verflow.com/blog/2014/console-hacking-2013-omake.html
They mention Starbuck quite a few times in that article but never specify anything about it. Did we ever find out exactly what it is? Decent article anyway, though not sure who thought PPC750 was in-order.
Unfortunate to see the GPU thread locked BTW. We were still occasionally getting some semi interesting bits of info from it. I agree with stopping all the general chit chat in there, but not locking due to age.
I've done some research and it seems that it is as I suspected: manually assigning CPU threads gives better performance. It seems that it decreases execution times by a substantial amount.
http://zone.ni.com/reference/en-XX/help/371361J-01/lvconcepts/cpu_assign/
The threads have to be allocated manually on the Wii U CPU, which makes the Wii U CPU a little harder to use, but gives better overall performance. I'd imagine that, in conjunction with the shorter pipelines, this makes the Espresso's performance far higher than what people imagine.
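For what it's worth, that NI page is about assigning LabVIEW code to cores on a desktop OS, not about the Wii U. The general idea of pinning a thread to a core instead of letting the scheduler migrate it looks roughly like this - a minimal Linux/pthreads sketch, not the Cafe OS API (the core number and the worker's job are made up):

// build: g++ -pthread affinity.cc  (g++ defines _GNU_SOURCE, which the affinity call needs)
#include <pthread.h>
#include <sched.h>
#include <cstdio>

static void *worker(void *) {
    // imagine audio mixing, physics, etc. running here on whichever core we were pinned to
    return nullptr;
}

int main() {
    pthread_t t;
    pthread_create(&t, nullptr, worker, nullptr);

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);   // pin the worker to core 1 (hypothetical choice)
    if (pthread_setaffinity_np(t, sizeof(set), &set) != 0)
        std::perror("pthread_setaffinity_np");

    pthread_join(t, nullptr);
    return 0;
}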
Why has the gpu thread been locked?
It had become a general speculation megathread... of sorts. I think it would've been better to rename it and move it to community as the Nintendo Speculation Community Thread, but that's just me.
I think I asked this earlier, but what are the benefits/problems of having the developer manually pick which code runs on which CPU core? No one has ever spoken on this as far as I can remember.
Wouldn't there be some benefit to not having the overhead and iffiness of the CPU auto-delegating tasks? That would be one process less the CPU needs to execute.
In fact, the SMPization of the 750 in the Espresso is not perfect. There appears to be a bug that affects load-exclusive and store-exclusive instructions (an explicit cache flush is required), which means that using SMP Linux with them will require patching the kernel and libpthread to work around it (and possibly other software that directly uses these instructions to e.g. implement atomics). They would’ve never shipped such a blatant bug on a general-purpose CPU, but I guess they called it good enough for a game console since they can just work around it in the SDK (which they do: the Cafe OS locking primitives have the workaround).
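For context, the load-exclusive/store-exclusive instructions in question are PowerPC's lwarx/stwcx. reservation pair, the building block for locks and atomics. Below is a sketch of such a primitive; the dcbf line is only a guess at where the "explicit cache flush" workaround might sit, since the actual Cafe OS sequence isn't public (PowerPC-only inline asm, GCC syntax):

#include <cstdint>

// Atomic 32-bit add via a reservation loop. Only meaningful when compiled for
// a PowerPC target. The dcbf is NOT a documented requirement - it is a guess
// at the kind of explicit flush the Cafe OS locking primitives reportedly add.
static inline uint32_t atomic_add32(volatile uint32_t *p, uint32_t v) {
    uint32_t old, tmp;
    __asm__ __volatile__(
        "1:                  \n"
        "   dcbf   0,%3      \n"   // speculated workaround: flush the line before reserving it
        "   lwarx  %0,0,%3   \n"   // load word and set the reservation
        "   add    %1,%0,%4  \n"
        "   stwcx. %1,0,%3   \n"   // store only if the reservation still holds
        "   bne-   1b        \n"   // reservation lost -> retry
        : "=&r"(old), "=&r"(tmp), "+m"(*p)
        : "r"(p), "r"(v)
        : "cc", "memory");
    return old;
}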
wsippel got it locked by posting some random rumor he heard in there instead of making a new thread and it derailed into multiple pages of speculation about Nintendo's next console, etc.
I don't think that link is the proof you're looking for.
Can you clarify your original question?
Clarify it? What about it do you not understand?
Replacing the entire execution state of a CPU core including L2 cache is a ~2MB data transfer (1MB* out, 1MB* in). You can do this literally thousands of times per second even on the Wii U's not-so-fast memory subsystem. Doing it 60 times per second (a decent forced scheduling rate for a games console, if we assume such a thing is even done) is kinda negligible.
*technically, L2 + L1I + size of architectural registers (2^20 + 2^15 + 32 * 4 bytes for the "big" core); L1D doesn't count because all its data is mirrored in L2. 1MB is not far off.
I assume you're talking of thread context switches here, right? If so, on one hand, you're forgetting quite a bit from the cpu's programming-model state - the 32 64-bit FPR registers. On the other hand, context switching rarely concerns the caches - their content is subject to the common caching/eviction policies. On the third hand, 60Hz is a rather arbitrary rate to context switch - normally it's neither regular, nor at this rate.
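As a quick sanity check on the claim above, here is the arithmetic spelled out. The 12.8GB/s figure is the commonly cited peak for the Wii U's 64-bit DDR3-1600 main memory and is an assumption here, not a confirmed spec:

#include <cstdio>

int main() {
    constexpr double bytes_per_switch = 2.0 * 1024 * 1024;  // ~1MB out + ~1MB in, per the post above
    constexpr double switches_per_sec = 60.0;
    constexpr double assumed_peak_bw  = 12.8e9;             // bytes/s - assumed, not confirmed

    double used = bytes_per_switch * switches_per_sec;      // ~126 MB/s
    std::printf("%.0f MB/s, i.e. %.1f%% of the assumed peak\n",
                used / 1e6, 100.0 * used / assumed_peak_bw);
    return 0;
}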
Well, being designed to perform well at clock speeds in excess of 4GHz and hitting 150W barriers well before that, causing all sorts of unexpected compromises and throttling, was part of it.
The hyper-long pipeline wasn't ideal, true, but that doesn't mean shorter is always better either. I believe I linked an engineering study early in this thread showing 11-14 stages being most optimal for most code regardless of clock speed, and iirc that's mostly where Intel targets (Haswell is a 14 stager).
Even if those were best case scenarios, that's from back in the Pentium 4 days, and even generation to generation back then they were improving the misprediction rate. I don't know what it is now, but I'd be interested to see it too.
About your talk of pipeline flushes, modern processors don't even flush every stage in the pipeline on a misprediction, just the relevant stages now. So something with a XX stage pipeline may flush half of them if half are irrelevant, or any other fraction. I forget what this is called, but I think anything post Core 2 has it. I'll try to find the name.
Anyways, even if some of my rambling did need correction, you agree that pipeline length isn't black and white, as was my point, right? There's a balance to be struck, and crazy short isn't necessarily ideal just as crazy long isn't - that was what I was getting at. This "lower always = higher IPC" notion is what I was attacking.
My guess is krizzx is talking about the cooperative multithreading that the Wii U OS employs. It allows for fine-grained control over latencies, largely by removing the need to save/restore thread contexts. Speaking of which..
I assume you're talking of thread context switches here, right? If so, on one hand, you're forgetting quite a bit from the cpu's programming-model state - the 32 64bit FPR registers. On the other hand, context switching rarely concerns the caches - their content is subject to the common caching/eviction policies. On the third hand, 60hz is a rather arbitrary rate to context switch - normally it's neither regular, nor at this rate.
/context switch
Sorry about getting back to you only now, tipoo:
Why are you bringing up throttling in a general IPC topic? And p4 was neither the first nor the last design to employ active speed scaling to avoid critical conditions. The most recent HPC Intel part I worked with does that just as well - no Prescott-scale issues whatsoever.
You seem to be confusing subjects again - the length of the pipeline is dictated by the target clocks. There's no 'N-stage pipeline is optimal for software, regardless of clock speed'. The optimal pipeline regardless of clock speed would be one which does not allow stalls, be that in the form of bubbles or flushes. Of course, there's no such thing.
This past week I was fighting a bottleneck in critical code path where the branch predictor's success rate was staying in the 60s, on a top-of-the-crop Xeon cpu. I had to rewrite the entire algorithm to fix that, because there's no branch predictor in the entire world which could do well on the original one, not with the given data set.
You're confusing mispredictions with branch-target misses. The latter result in large bubbles (as large as 8 clocks) on modern Intels. The former result in straight royal flushes, just like most other speculations gone bad (look up 'machine clears' in recent Intel pipelines).
Among two hypothetical architectures A and B, where B is essentially A with fewer-but-fatter stages, B will always have better IPC. The Bobcat-Jaguar example you used earlier does not show what you think it does, because all the enhancements AMD did to improve the IPC on Jaguar were also applicable to Bobcat, but never occurred. Once again, Jaguar had improved IPC despite its longer pipeline, not because of it.
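On the branch-predictor point above ("no branch predictor in the entire world could do well on the original one"), here is a hypothetical example of the kind of rewrite that sidesteps the problem - turning a data-dependent branch into a branchless select, so there is nothing left to predict. This is not blu's actual code, which was never posted:

#include <cstdint>
#include <cstddef>

// Branchy version: on random data the branch mispredicts roughly half the time.
uint64_t sum_over_threshold(const uint32_t *v, size_t n, uint32_t threshold) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        if (v[i] >= threshold)          // data-dependent, effectively unpredictable
            sum += v[i];
    return sum;
}

// Branchless version: the comparison becomes a 0/all-ones mask, so there is no
// branch left to predict (compilers typically emit a setcc/select here).
uint64_t sum_over_threshold_branchless(const uint32_t *v, size_t n, uint32_t threshold) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        uint64_t mask = (v[i] >= threshold) ? ~uint64_t{0} : uint64_t{0};
        sum += v[i] & mask;
    }
    return sum;
}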
Depends on how you look at it.
That is correct.
Also, I have an offtopic question. Is the PS4 CPU stronger than the CELL?
That's a bit of a strange way to look at things, especially since GPUs have been doing double precision for a while now. The division is more along the lines of: CPUs do low-latency tasks and tasks that otherwise won't fit the GPU, and GPUs do high-throughput, high-latency 'streaming' tasks.
GPUs are used for single precision while CPUs are used for double precision. Both can do the other, though. CPUs will have SIMD cores, and GPUs can do compute processes, but neither replaces the other.
Anyway, it only had 6 SPUs (SIMD cores) and 1 PPU (general processing/scheduling). So this thing was relatively fast for SP. But because it only had 1 general core, its DP was crappy.
Cell is not 'relatively fast' at SP, it's abnormally fast for a CPU. Also, Cell SPEs can do DP. They don't particularly excel at that (not until the 2008 server-targeted chip revision anyway), but they don't do 0 DP FLOPS either; in fact they do more DP FLOPS than the PPE (just because the latter cannot do doubles via AltiVec). Of course, power-conservative x86 cores don't excel at DP either. Fun fact: on Bobcat you can actually get a 'deceleration' if you try to SIMD-ify some double precision code, just because the 64-bit ALUs do one double at a time, but you still pay the potential price for the SIMD-ification overhead, i.e. just getting your data into a SIMD-friendly shape might not come for free.
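To illustrate the SIMD-ification overhead blu mentions: if doubles aren't already laid out in adjacent pairs, you pay shuffles/packs just to build each vector, and on a core whose FP datapath is 64 bits wide (as on Bobcat) the packed op gets split in two anyway, so the packing is a net loss. A hedged SSE2 sketch - the strided dot-product scenario is made up for illustration:

#include <emmintrin.h>   // SSE2

// Scalar: one multiply-add per element, no packing needed.
double dot_stride_scalar(const double *a, const double *b, int n, int stride) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += a[i * stride] * b[i * stride];
    return s;
}

// "SIMD-ified": the strided accesses force us to assemble each __m128d by hand,
// which is exactly the kind of overhead that can eat the theoretical gain.
double dot_stride_sse2(const double *a, const double *b, int n, int stride) {
    __m128d acc = _mm_setzero_pd();
    for (int i = 0; i + 1 < n; i += 2) {
        __m128d va = _mm_set_pd(a[(i + 1) * stride], a[i * stride]); // two scalar loads + pack
        __m128d vb = _mm_set_pd(b[(i + 1) * stride], b[i * stride]);
        acc = _mm_add_pd(acc, _mm_mul_pd(va, vb));
    }
    double tmp[2];
    _mm_storeu_pd(tmp, acc);
    double s = tmp[0] + tmp[1];
    if (n & 1) s += a[(n - 1) * stride] * b[(n - 1) * stride];       // leftover element
    return s;
}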
It's been speculated that, taking the information above into consideration, the Wii U's total bandwidth in gigabytes per second - including the possible 1024 bits per macro and a GPU which, according to TechPowerUp, clocks in at 550MHz - would come out to around 563.2GB per second. Keep in mind that the Xbox One runs about 170GB per second of bandwidth between the DDR3 and eSRAM, as outlined by Xbit Labs.
I know people are pretty well over this and what not but I saw this article and thought it might be interesting to look at. It is more about the GPU but that thread is dead so I wasn't sure where else to post it.
Source
With lines like that, I'd like to get the thoughts of the people who know about this stuff.
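For what it's worth, the article's 563.2GB/s appears to come from arithmetic along these lines; the eight-macro count and the assumption that every macro is read at full width every cycle are speculation, which is largely why figures like this were treated with skepticism:

#include <cstdio>

int main() {
    // Assumptions needed to reproduce the article's number - none are confirmed:
    constexpr double bits_per_macro = 1024.0;  // speculated eDRAM macro width
    constexpr double macro_count    = 8.0;     // speculated macro count for the 32MB pool
    constexpr double clock_hz       = 550e6;   // GPU clock per TechPowerUp

    double bytes_per_sec = bits_per_macro * macro_count / 8.0 * clock_hz;
    std::printf("%.1f GB/s\n", bytes_per_sec / 1e9);  // prints 563.2 GB/s
    return 0;
}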
So, why would they do this as a 512KB/2MB/512KB fixed L2 cache setup, instead of a modern dynamic-allocation L2 cache depending on each core's need?
Huh?
LLC (last-level-cache) sharing is normally done at L3*. Intel's Smart Cache is little but marketing speak when it comes to (non-LLC) L2; there L2 is not really shared - each core has its L2 portion, and that is subject to cache coherence protocols like any other cpu under the sun. Now, if somehow only one core remained running, while all the rest idled, then ISC might give an advantage not unlike Turbo Boost (another common technique these days), so the last remaining core could take hold of the entire L2. But how often do you expect such a condition to occur, and why would it be desirable?
L3 is almost always shared by all cores, understood, but L2 isn't always core-exclusive. In modern quads, for instance, often each set of two CPU cores will share a pool of L2. Not quite fully dynamic, but not quite fully set either - it's divided two ways and then dynamic between those pools, rather than divided four ways. So L2 is somewhere between set-per-core L1 and dynamic L3.
Which modern CPU by which vendor does that? The pdf you linked to talks exclusively about LLC sharing, and Intel hasn't used L2 LLC past the early Core architecture days. The reason they stopped doing that is that it did not pan out particularly well. Here's some reading: http://ixbtlabs.com/articles2/cpu/rmmt-l2-cache.html
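Tangentially, on Linux you can settle the "who shares which cache level" question empirically for whatever CPU you're sitting at - a small sketch assuming the standard sysfs cache-topology layout:

#include <fstream>
#include <iostream>
#include <string>

int main() {
    // Walks cpu0's cache indices and prints which CPUs share each level.
    for (int idx = 0; ; ++idx) {
        std::string base = "/sys/devices/system/cpu/cpu0/cache/index" + std::to_string(idx) + "/";
        std::ifstream level(base + "level");
        if (!level) break;                         // ran out of cache indices
        std::ifstream type(base + "type"), shared(base + "shared_cpu_list");
        std::string l, t, s;
        std::getline(level, l);
        std::getline(type, t);
        std::getline(shared, s);
        std::cout << "L" << l << " " << t << ", shared by CPUs " << s << "\n";
    }
    return 0;
}

On a typical recent Intel quad this prints per-core L1/L2 (shared at most with the SMT sibling) and a single L3 shared by all cores, which matches the description above.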
So, why would they do this as a 512KB/2MB/512KB fixed L2 cache setup, instead of a modern dynamic-allocation L2 cache depending on each core's need? And the split also prevents asset sharing, I think, which can be quite costly. Is it just a reduction in die size, or the effort it would take to make the 750s compatible with dynamic cache, or is there an end-user performance benefit too?
It somewhat makes sense for backwards compatibility, but there should have been a way to just enable the right amount of cache for that without having to gimp the Wii U mode.
TL;DR, shared L2 was a bad idea and has been abandoned for generations now.
As long as this thread has been bumped, there's something I've been wanting to ask GAF's advice on, with regards to Latte, the Wii U GPU - the dedicated thread for which was unfortunately locked.
I've done some reading on unconventional lighting techniques and learned about a method for producing shadows that provides surprisingly high accuracy at a rather small computational cost, but relies on having some additional GPU instructions and architectural changes that aren't necessarily implemented in any standard (at least not sufficiently), and are probably difficult to emulate efficiently.
I'm referring to Irregular Shadow Maps, or more accurately, using an Irregular Z-buffer to facilitate the rendering of alias-free shadow maps.
Hopefully some on GAF are familiar with this technique and with the difficulties in implementing it. Perhaps it was even discussed already in previous threads, in which case I'd love to be enlightened.
While I've never implemented the irregular Z-buffer algorithm above, an ongoing pet project of mine does something conceptually in the same vein - real-time raytraced global ambient occlusion on voxels. Currently it's CPU-only, but for the next iteration it should be GPGPU (and a lot faster). So to answer your question indirectly - yes, modern GPUs have the facilities to implement that algorithm, and an entire domain of such algorithms, including actual raytracing. Of course, various implementations can yield various performances (and various benefits - from accurate point-light shadows, to soft shadows from area lights, to AO, etc, etc), and to tell how fast Latte would be on some of those you'd have to actually run that on the hw. But Latte does have MSAA reads, and scatters, and some of the requirements of the algorithm are 'fakable', so I would not be shocked if Latte could actually run that viably.
edit: for the record, neither NintendoLand nor MK8 seem to be using that irregular shadow maps algo - as high-res as the shadows they demonstrate are, both titles show certain artifacts that irregular shadow maps should have eliminated. Apropos, back in early NL days I had my suspicions NL could be using stencil shadows, but a close inspection showed the tell-tale shadow map artifacts.
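For anyone unfamiliar with the technique being discussed, here's a very condensed CPU-side sketch of the irregular Z-buffer idea: the "depth buffer" stores, per light-space cell, the exact eye-visible points that landed there, and occluders are then tested against those exact points, which is what removes the aliasing. Purely conceptual - not how any particular GPU or engine implements it:

#include <vector>

struct Point3 { float x, y, z; };              // an eye-visible point, already in light space (x,y in [0,1), z = depth)
struct Cell   { std::vector<int> samples; };   // indices of the points that landed in this light-space cell

struct IrregularZBuffer {
    int w, h;
    std::vector<Cell>   cells;
    std::vector<Point3> pts;
    std::vector<bool>   inShadow;

    IrregularZBuffer(int w_, int h_, std::vector<Point3> lightSpacePts)
        : w(w_), h(h_), cells(static_cast<size_t>(w_) * h_),
          pts(std::move(lightSpacePts)), inShadow(pts.size(), false) {
        // Bin each visible point into the cell it falls into - this is the
        // "irregular" part: sample positions come from the eye pass, not from a regular grid.
        for (int i = 0; i < static_cast<int>(pts.size()); ++i) {
            int cx = static_cast<int>(pts[i].x * w);
            int cy = static_cast<int>(pts[i].y * h);
            if (cx >= 0 && cx < w && cy >= 0 && cy < h)
                cells[static_cast<size_t>(cy) * w + cx].samples.push_back(i);
        }
    }

    // Called for every cell an occluder covers when the scene is rasterized
    // from the light, with the occluder's depth at that cell.
    void testOccluder(int cx, int cy, float occluderDepth) {
        for (int i : cells[static_cast<size_t>(cy) * w + cx].samples)
            if (occluderDepth < pts[i].z)      // something lies between the light and this point
                inShadow[i] = true;
    }
};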
While we are on the subject of the GPU, exactly what type of lighting would you say they are using in MK8? http://cdn.nintendonews.com/wp-content/uploads/2014/04/mario_kart_8_rainbow_road.jpg
http://technabob.com/blog/wp-content/uploads/2014/04/mario_kart_8_rainbow_road-620x343.jpg
http://www.j1studios.com/wordpress/wp-content/uploads/Mario-Kart-8-3.jpg
I notice that all of the light sources give off a corona. Even on the headlights on the random vehicles.
The visuals stood out quite a bit to me, primarily due to the lighting complexity. Exactly where would you estimate the limit of light sources for the Wii U? I remember someone making a thread about the lighting a while back, and the lighting seems to be, on average, a few steps above what I'm used to seeing in the last gen, even in most of my PC games.
Going all the way back to ZombiU, lighting has always been the biggest thing that has stood out to me on the Wii U.
Also, on the subject of shadows, why do so many games on the Wii U have really blocky shadows (BlOps2 and Ghosts, for example), yet they are always so smooth and plentiful in Nintendo-made games and some major third-party titles? Is there a problem with the hardware's ability to produce shadows, or were the games with the blocky shadows simply using the hardware poorly?
They use some of the most competent deferred shading/deferred lighting (not clear which, as it's practically impossible to tell apart) I've seen in a very long time. What it does is allow the use of orders of magnitude more light sources than immediate shading techniques. If you pay really close attention to the latest trailer, you'd notice how every minute light source affects the character. While on the wheel, Mario's gloves get a blue hue from the antigrav kart's neon-blue headlights and underlights; karts and characters always get lit by 'local' illumination events like passing over a glowing boost bar, engine exhaust fires, explosion-style lightups - basically everything you'd normally expect to change the lighting scape around an object in the real world does that in the game. That, combined with some impeccable self-shadowing, can fool the brain into taking those for physical, be that plastic or rubber, objects. Nothing fools the brain that something on a screen is real like rich lighting interactions do.
That's just a bloom effect. The new thing here (new for the MK series) is its combination with DOF effects, which produces some really nice photo-like views.
Shadows get blocky whenever their shadow maps (i.e. textures that hold shadow info) get inadequate resolutions for the given distance from the camera. That was arguably last gen's greatest issue with shadows - when otherwise well-shaded titles would suffer from inadequate shadowmap res. And I'm not even referring to the rest of the artifacts typical for shadowmaps. The paper efyu_lemonardo brought up describes a shadowmap technique addressing all those shortcomings. Anyway, 'good' shadowmaps require extra fillrate (somewhat alleviated by GPUs' ability to draw shadowmaps at higher rates than anything else), some extra vertex processing for the extra pass that draws the shadow map, and the associated BW that goes with those. Just like with deferred shading, though, proper amounts of eDRAM help with shadowmaps as well.
There seem to be some hints of this lighting in Smash which weren't there in previous videos - projectiles didn't light the platform or opponents.
Smash is not a shading showpiece, at least not in my book.
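To make the deferred-shading point above concrete, here's a conceptual CPU-side sketch of a lighting pass: geometry is rasterized once into a G-buffer, and each additional light is then just an accumulation sweep over pixels, independent of scene complexity - which is why dozens or hundreds of small lights become affordable. All names and structs are made up for illustration; MK8's actual renderer is not public:

#include <vector>
#include <cstddef>
#include <cmath>
#include <algorithm>

struct Vec3    { float x, y, z; };
struct GSample { Vec3 pos, normal, albedo; };   // one G-buffer texel: position, normal, material colour
struct Light   { Vec3 pos, color; float radius; };

static Vec3 shade(const GSample &g, const Light &l) {
    Vec3 d{l.pos.x - g.pos.x, l.pos.y - g.pos.y, l.pos.z - g.pos.z};
    float dist = std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
    if (dist < 1e-6f || dist >= l.radius) return {0.0f, 0.0f, 0.0f};
    float ndotl = std::max(0.0f, (g.normal.x * d.x + g.normal.y * d.y + g.normal.z * d.z) / dist);
    float att   = 1.0f - dist / l.radius;       // cheap distance falloff
    return {g.albedo.x * l.color.x * ndotl * att,
            g.albedo.y * l.color.y * ndotl * att,
            g.albedo.z * l.color.z * ndotl * att};
}

// Lighting pass: every additional light is just another accumulation sweep.
// (A real implementation rasterizes light volumes so each light only touches
// the pixels it can actually reach; the early-out above stands in for that.)
void lighting_pass(const std::vector<GSample> &gbuf,
                   const std::vector<Light>   &lights,
                   std::vector<Vec3>          &framebuffer) {
    for (const Light &l : lights)
        for (std::size_t i = 0; i < gbuf.size(); ++i) {
            Vec3 c = shade(gbuf[i], l);
            framebuffer[i].x += c.x;
            framebuffer[i].y += c.y;
            framebuffer[i].z += c.z;
        }
}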
Does locked cache DMA mean being able to use the cache like the PS2 scratchpad or the Cell SPUs local memory? I assume a dev could switch between that and cache mode? Any idea of the performance implications if a developer really dug in in terms of micromanaging scratchpad memory very well, like a handful did for the Cell?
https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/A88091CAFE0F19CE852575EE0073078A/$file/To%20CL%20-%20CL%20Special%20Features%206-22-09.pdf
It's not exactly TCM, but actually it's more flexible than that. The thing is, PPC has had cache control ops since the early days of the architecture, and Gekko takes that idea one step further. That explicit control over cache lines, combined with the ability to lock half of the L1D, effectively turns that half into a scratchpad - but one that still participates in the cache coherency protocols. For instance, if you locked your cache and issued an op to load the cacheline from address N (or did a DMA), then did some access to address N scratchpad-style, and then called a routine that walks a large chunk of memory passing over N, that routine would get a cache hit on N just as if N was sitting in non-locked cache. You cannot do that with a scratchpad.
Yes, devs can switch at will.
Perhaps somebody who's done substantial work on both platforms would be qualified to answer that.
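A rough sketch of the locked-cache streaming pattern described above follows. The LC* calls are host-side stand-ins with names modeled on the old GameCube homebrew SDK, not the real Cafe OS API, and the "work" is a placeholder - the point is just the shape of the workflow: lock half of L1d, DMA a block in, crunch it at L1 speed, DMA the result out:

#include <cstdint>
#include <cstring>

// Host-side stand-ins so the sketch compiles anywhere. On the console these
// would be SDK calls; the names are modeled on the old GameCube SDK and are
// NOT the real Cafe OS API.
alignas(32) static uint8_t fake_locked_l1[16 * 1024];           // half of the 32KB L1d, once locked
static void  LCEnable() {}                                      // lock half of L1d
static void *LCGetBase() { return fake_locked_l1; }             // base of the 16KB locked window
static void  LCLoadBlocks(void *dst, const void *src, uint32_t blocks)  { std::memcpy(dst, src, blocks * 32u); }  // DMA in, 32-byte blocks
static void  LCStoreBlocks(void *dst, const void *src, uint32_t blocks) { std::memcpy(dst, src, blocks * 32u); }  // DMA out
static void  LCQueueWait(uint32_t) {}                           // wait for the DMA queue to drain

static void process_chunk(float *chunk, uint32_t n) {
    for (uint32_t i = 0; i < n; ++i) chunk[i] *= 2.0f;          // placeholder "work at L1 speed"
}

// Streams a large buffer through the locked half of L1d in 16KB chunks.
// For simplicity this assumes nfloats is a multiple of 8 (one 32-byte block).
void stream_through_locked_cache(const float *src, float *dst, uint32_t nfloats) {
    constexpr uint32_t kChunk = 16u * 1024u / sizeof(float);    // floats per 16KB window
    LCEnable();
    float *scratch = static_cast<float *>(LCGetBase());
    for (uint32_t done = 0; done < nfloats; done += kChunk) {
        uint32_t n      = (nfloats - done < kChunk) ? (nfloats - done) : kChunk;
        uint32_t blocks = (n * sizeof(float)) / 32u;
        LCLoadBlocks(scratch, src + done, blocks);              // main memory -> locked L1
        LCQueueWait(0);
        process_chunk(scratch, n);
        LCStoreBlocks(dst + done, scratch, blocks);             // locked L1 -> main memory
        LCQueueWait(0);
    }
}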
I'm not sure if this is true.
Mario Kart 8 seems to be affected, as it cannot render the game on the TV and the GamePad at the same time.