
Absolutely HUGE, HUGE write-up on CELL and thus - Playstation3

If overclocked sufficiently (over 3.0GHz) and using some very optimised code (SSE assembly), 5 dual-core Opterons directly connected via HyperTransport should be able to achieve a similar level of performance in stream processing as a single Cell.

The PlayStation 3 is expected to have 4 Cells.

:lol
 

doncale

Banned
Ars Technica comments on the huge Cell article

http://arstechnica.com/news.ars/post/20050124-4551.html

Cell "analysis" a mixed bag

1/24/2005 11:33:48 PM, by Hannibal

Last week, OS News published an analysis of IBM's Cell-related patents. This article presents some of the information in the patents in an easily digestible format, but it has some serious flaws, as well. And I'm not talking about Cell-specific flaws, though there are those, but what appear to be problems with the author's understanding of basic computer architecture.

For instance, the author, Nicholas Blachford, starts off with a fantastic and completely made-up benchmark estimate for how fast Cell will complete a SETI@Home work unit (i.e. 5 mins). In the footnotes, we find that this number is extrapolated from the SETI numbers for a 1.33GHz G4. The extrapolation is done using a combination of real (for the G4) and hypothetical (for the Cell) FLOPS ratings, which are not only fairly meaningless as a cross-platform performance metric but also take no account of the kinds of platform-specific optimizations that are all-important for SETI performance. So this is pretty much hogwash.

In another part of the article, Blachford claims that the cell processing units have no "cache." Instead, they each have a "local memory" that fetches data from main memory in 1024-bit blocks. Well, that's sort of like saying that an iMac doesn't have a "monitor," but it does have a surface on which visual output is displayed. In other words, the Cell "local memories," which are roughly analogous to the vector units' "scratchpad RAM" on the PS2's Emotion Engine, function as caches for the PUs. What has thrown the author for a loop is that they're small, and that being tied to each cellular processing unit means they don't function in the memory hierarchy in exactly the same way an L1 does in a traditional processor design. They do, however, cache things. But maybe I'm being nitpicky with this.

Blachford also declares that the longstanding problems inherent in code parallelism and multithreaded programming are now solved, because the Cell will just miraculously do all this stuff for you via fancy compiler and process scheduling tricks. Unfortunately, parallelization is a fundamental application design problem, rooted in the inherently serial nature of many of the tasks we ask computers to perform. There are good parallelizing compilers out there, but they can only extract parallelism that's already latent in the input code and in the algorithm that the code implements; they can't magically parallelize an inherently serial sequence of steps.

These are just three of the many basic flaws in this article. Furthermore, the article is chock full of wild-eyed and completely unsubstantiated claims about exactly how much butt, precisely measured in kilograms and centimeters squared, that the Cell will kick, and how hard, measured in decibels, that the Cell will rock. I'm as excited about the Cell as the next geek, but there's no need to go way over the top like this about hardware that won't even see the light of day for a year. And it's especially ill-advised to compare it to existing hardware and declare that we have a hands-down winner.

Finally, to address something more specific to the Cell architecture itself, on page 1 we find this claim:

It has been speculated that the vector units are the same as the AltiVec units found in the PowerPC G4 and G5 processors. I consider this highly unlikely as there are several differences. Firstly, the number of registers is 128 instead of AltiVec's 32; secondly, the APUs use a local memory whereas AltiVec does not; thirdly, Altivec is an add-on to the existing PowerPC instruction set and operates as part of a PowerPC processor, while the APUs are completely independent processors.

The author appears to be confusing an instruction set with an implementation. The 128-register detail is a problem, because, as the author correctly points out, conventional Altivec has only 32 vector registers. So obviously it's a given that Cell won't be using straight-up Altivec. But it's entirely possible that it'll use some kind of 128-register derivative of the Altivec instruction set. The fact that the individual processing units have a local cache has little to do with whether or not the PUs themselves implement some hypothetical Altivec derivative. Finally, the statement, "Altivec is an add-on to the existing PowerPC instruction set," is correct, but the rest of that sentence--"and operates as part of a PowerPC processor"--doesn't make a whole lot of sense to me in this context. Altivec is an ISA extension that is implemented in different ways on different PowerPC processors. The Cell processor's PUs could very well implement a hypothetical 128-register Altivec2 ISA extension, or they could implement some other SIMD ISA extension. The fact that SIMD code, written to whatever ISA, is farmed out to individual PUs has nothing to do with it. (If what I just said confuses you, you might check out this article.)

Anyway, I could go on, but I'll stop here. You get the idea. Caveat lector and such.

I should note that the author has published some "Clarifications" on his website, and he does back off some of the wackier claims. For instance, in response to the criticisms of his claims about magical code parallelization, he says, "This is not true. You still have to break up problems into software Cells." Um, yeah. Precisely.

At any rate, if you have some intermediate level of computer science knowledge and you read the article with a critical eye, throwing out things that are obviously bogus and/or overblown, then you can actually pick up some information on the architecture. Mind you, there are no new revelations in the article (except for the stuff that's made up (e.g. SETI) and/or wrong (e.g. "check it out! no cache!")), but Blachford did manage to pull together a lot of what's already known into one place.
 

doncale

Banned
oh and look here, a Blachford rebuttal to the Ars article :lol (fight!)


http://www.blachford.info/computer/Cells/Rebuttal.html

Rebuttal to Ars article.

Any program has bugs. We have software testers to find them so they can be reported. A big article is like this as well: there are bugs in it too. In an article, however, we use human languages, which, unlike computer languages, can be interpreted in different ways. This means that even if there are no specific errors, what readers take away from an article may still be incorrect.

Professional authors have their equivalent of software testers; these are called editors. I am not a professional author, and my article did not get any editing apart from my own.

Editing yourself alone is not always a good idea: you know what you've written, and it's all too easy to miss bits while reading. Another problem is that you know what you mean, so you're not going to misinterpret yourself. A third problem is that if you're not perfect at spelling or grammar (I'm not), you're not going to notice those errors.

I've been reading the various comments on the article around the web, and as you might expect, some things have indeed been misinterpreted. I've also had quite a bit of correspondence on the article, some of which pointed out errors; if you've e-mailed me, I'd like to take the opportunity to thank you.

In the above cases I have gone through the article and corrected the mistakes. Where I've seen that points have been misinterpreted, I've reworked them to make the relevant point clearer. For people who have already read the article, I added an extra section pointing out these clarifications.

For reasons unknown, the author "Hannibal" at tech site Ars Technica decided to do a short write-up critiquing my article. He seems somewhat unimpressed and raises a number of points which he considers flawed.

My article is based on a reading of the Sony patent and other sources of information; the aim was to help people understand what the Cell architecture would look like and what its potential is. It is speculative by nature; it is not a scientific paper.

Looking at it in a pedantic manner, as Hannibal seems to have done, is completely pointless. He manages to take small misunderstandings and blow them up into major points.

Why he decided to do this in public is beyond me, but as such I feel I should answer these points.

Two of the points he makes were already corrected before his article was published; the other two are at best "differences of interpretation".



SETI time Estimation

"For instance, the author, Nicholas Blachford, starts off with a fantastic and completely made-up benchmark estimate for how fast Cell will complete a SETI@Home work unit (i.e. 5 mins)."

The SETI figure of 5 minutes for 4 Cells to complete a unit is a "calculated guess". It relies on a number of assumptions which may or may not be correct. It also relies on the maximum theoretical performance of the Cell. The same goes for the Opteron comparison.

Given that we don't yet have the chip to test and there is no hard data to look at, it is safe to assume any figure I give should be taken with a pinch of salt. I would have thought this was obvious to anyone reading it; evidently not.

The Cell may not even get close to this level of performance in real life, but its theoretical performance is so high that even at 25% it's still going to blow everything else clean out of the water.

GPUs already exhibit performance massively beyond any desktop processor; they're just not used in wide or general-purpose apps. I'm really not saying anything that spectacular here. Get SETI running on an Nvidia or ATI GPU if you don't believe me.

Before his article was published I had already made it clear in the article that the figure was a bit of a guess, and made a note of it in the clarifications section. This point was thus redundant.



Compilers

"Blachford also declares that the longstanding problems inherent in code parallelism and multithreaded programming are now solved, because the Cell will just miraculously do all this stuff for you via fancy compiler and process scheduling tricks."

I did not say anything of the sort. One sentence, if read in a pedantic manner and taken out of context, could potentially be seen to say this; it's certainly not what I meant. What I was really saying was that after code has been split into Cells, the infrastructure handles how they are distributed. That is, once a program is broken into software Cells, you don't need to worry about the number of hardware Cells they are computed on.
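The idea can be sketched in a few lines of C: the program produces independent work packets, and a scheduler maps them onto however many processing units happen to exist. The names and the round-robin policy are invented for illustration, not taken from the patents:

```c
/* "Software Cells vs hardware Cells" sketch: code that produces work
 * packets never needs to know how many hardware Cells are present;
 * the runtime maps packets to units. Names are hypothetical. */
int assign_packet(int packet_id, int num_hw_cells)
{
    /* trivial round-robin placement policy */
    return packet_id % num_hw_cells;
}
```

The same packet stream runs unchanged whether `num_hw_cells` is 1 or 16; only the mapping changes.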

That said, this technology does indeed exist; try the following exact phrases in Google:

"auto-parallelizing" compiler
"auto vectorizing" compiler

I have read on several occasions that IBM have been involved in auto-vectorising efforts over the last year, perhaps this is why.
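For what "auto-vectorising" means in practice: a loop whose iterations are independent and walk memory with unit stride can be turned into SIMD code automatically, e.g. by gcc with -O3 -ftree-vectorize, or by IBM's compilers. A generic C example, not tied to Cell:

```c
/* saxpy: the textbook auto-vectorisable loop. Iterations are
 * independent and access memory with unit stride, so a vectorising
 * compiler can process several elements per SIMD instruction
 * instead of one per scalar instruction. */
void saxpy(float a, const float *x, float *y, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```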

Again, this had already been corrected, so the point was redundant.



Local Memory vs Cache

In another part of the article, Blachford claims that the cell processing units have no "cache." Instead, they each have a "local memory" that fetches data from main memory in 1024-bit blocks. Well, that's sort of like saying that an iMac doesn't have a "monitor," but it does have a surface on which visual output is displayed.

In the analogy given, both the "surface" and the "monitor" would perform the same function. The local memory and the cache do a similar job, but there are distinct differences between the two.
It's true to say they solve the same problem, but they do it in very different ways, and there are trade-offs in both approaches.

A CRT and LCD may perform the same function but that's not the same as saying they are the same thing.

A cache divides a system's memory map into blocks and can hold a portion of each of those blocks. While modern caches can be controlled to a degree, they are not addressable in the same way as memory. This limits how much you can hold from any given area: if you are working on a 50K block of RAM, the cache will only hold a small part of it at any one time.

A local memory is of a fixed size and is directly addressable. There are no portions to worry about other than the maximum size of the memory. If you want to load 50K into an APU's local memory you just load it.

If your application involves iterating over a block of this size many times, which approach is going to be faster? One has to keep going to RAM; the other does not.

On the other hand, if you are multitasking, the cache approach makes a lot more sense, since when you switch tasks at least part of the data you want is in memory already. This makes less sense if you have many cores and spread applications across them.
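The programming-model difference the two sides are arguing over can be sketched in C. The size and names here are hypothetical, purely to contrast transparent caching with an explicitly staged local store:

```c
#include <string.h>

#define LOCAL_SIZE (50 * 1024)        /* hypothetical local-store size */
static unsigned char local_store[LOCAL_SIZE];

/* Cache model: software just reads main memory; the hardware decides
 * which blocks to keep close, and only a portion of a large working
 * set is resident at any one time. */
unsigned char read_through_cache(const unsigned char *main_mem, size_t i)
{
    return main_mem[i];
}

/* Local-memory model: software explicitly copies the whole 50K block
 * into directly addressable fast memory, then iterates over it as
 * many times as it likes without touching main RAM again. */
void stage_into_local(const unsigned char *main_mem, size_t offset)
{
    memcpy(local_store, main_mem + offset, LOCAL_SIZE);
}
```

The cache version is simpler to program; the local-store version gives software full control over what stays resident.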



AltiVec vs APUs

In reference to my points on the APUs most likely not being AltiVec this is written:

The author appears to be confusing an instruction set with an implementation.

I state three reasons why I consider the APUs to be using a different instruction set from AltiVec:

1) The number of registers is different.
2) The APUs use local memory.
3) AltiVec is part of the PowerPC instruction set and operates as part of it.

The important points are 2 and 3.

The fact that the individual processing units have a local cache has little to do with whether or not the PUs themselves implement some hypothetical AltiVec derivative.

In AltiVec, data is moved from memory into registers and processed; the results are then written from the registers back to memory. If you try that on an APU you'll not get very far, primarily because the instructions to do it do not exist.

The APUs can move data between main memory and local memory, or between local memory and registers. They cannot move data between registers and main memory.
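That restriction can be modelled as an API with only two transfer paths. This is a hypothetical sketch of the data-flow rules described in the patent, not real Cell code; all function names are invented:

```c
#include <string.h>
#include <stdint.h>

#define LOCAL_WORDS 1024
static uint32_t local_mem[LOCAL_WORDS];  /* hypothetical APU local store */

/* Path 1: main memory <-> local memory (DMA-style block transfer). */
void dma_to_local(const uint32_t *main_mem, size_t off, size_t words)
{
    memcpy(local_mem, main_mem + off, words * sizeof(uint32_t));
}

/* Path 2: local memory <-> register (plain load). */
uint32_t load_from_local(size_t i)
{
    return local_mem[i];
}

/* There is deliberately no load_from_main(): in this model the
 * register <-> main memory path does not exist on an APU, whereas an
 * AltiVec vector load reads main memory (through the cache) directly. */
```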

Finally, the statement, "Altivec is an add-on to the existing PowerPC instruction set," is correct, but the rest of that sentence--"and operates as part of a PowerPC processor"--doesn't make a whole lot of sense to me in this context.

The APUs will have to control the flow of instructions, so you will probably find there are some extra instruction units and registers to handle this.

AltiVec uses the PowerPC general-purpose registers and instruction units for this. If the APUs did the same, they would in effect be full PowerPC cores. I see no indication of this whatsoever, and I think it would go against the entire "Cray on a chip" philosophy of the Cell.

To repeat what I said in the article:

"There will no doubt be a great similarity between the two but don't expect any direct compatibility. It should however be relatively simple to convert between the two."



To Conclude

Hannibal usually writes very good articles which I enjoy reading. Quite why he decided to write this pedantic rant is beyond me. It is something of a disappointment, especially his "ivory tower" tone.

Of course if he thought the piece had flaws he could have just sent me an email pointing them out. That's what I'd do and that's what others have done.



If I am being enthusiastic, I believe I have justification for being so. Follow the references in part 5, especially "Stream" and anything on GPUs. Similar technology already exists and is already delivering incredible performance.
 

Dr_Cogent

Banned
I just hope that this doesn't turn out like the PS2 where very few devs really took advantage of the resources because Sony apparently didn't provide that much middleware or make it easy to develop for in general.

As far as whether this will end up in the PS3 or not, who knows. Sony can change anything up until a certain "point".
 

shpankey

not an idiot
CaptainABAB said:
I guess I'll have to wait until Ars Technica writes up an analysis.
I'm personally waiting for Anand (anandtech.com) to do the CELL. Only tech guru online I trust.

P.S.
and :lol at the author getting owned in his original article and then trying to explain his lies, falsehoods and exaggerations as being "bugs" in his article :lol holy shmokes :lol
 

Dsal

it's going to come out of you and it's going to taste so good
Wow, this guy wrote this huge write-up barely knowing what he was talking about, then leapt to inane conclusions. Any technical reader of this article has to stop every five sentences or so and just shake his head and laugh at what an idiot this guy is.

Writing software for this is gonna be... good times... can't wait to start on 9 zillion apulets... executed out of order... yeah....
 

HokieJoe

Member
shpankey said:
I'm personally waiting for Anand (anandtech.com) to do the CELL. Only tech guru online I trust.

P.S.
and :lol at the author getting owned in his original article and then trying to explain his lies, falsehoods and exaggerations as being "bugs" in his article :lol holy shmokes :lol


The cat over at LostCircuits is pretty good too.
 

jimbo

Banned
Whether or not everything he wrote is blown out of proportion, it's still a damn good article that pretty much enlightened me on what's so great about Cell. It seems like an amazing chip either way you look at it, but I just have to wonder. If the Cell is this powerful, wouldn't an Nvidia GPU limit its power? Why not just make your own Cell-based GPU, Sony?
 

Phoenix

Member
Both the original article and the rebuttal to that article are actually fairly good reads, but both suffer from the lack of an actual unit, so both sides are engaging in a fair amount of speculation. One particular problem with Blachford's claims is that he's dealing with the 'best case performance' of the theoretical maximum. Well, you can take those numbers and wipe your ass with them - those numbers aren't the ones you'll ever see. Ars has a problem too, however, in that there are some fairly significant advances in parallelism in tools, compilers, AND on the CPU that they seem to be writing off, as if the Cell is nothing more than a hyperthreading processor.

Anyways, interesting to watch them throw feces at each other at such an early stage in the game.
 

Deku

Banned
Phoenix said:
Both the original article and the rebuttal to that article are actually fairly good reads, but both suffer from the lack of an actual unit, so both sides are engaging in a fair amount of speculation. One particular problem with Blachford's claims is that he's dealing with the 'best case performance' of the theoretical maximum. Well, you can take those numbers and wipe your ass with them - those numbers aren't the ones you'll ever see. Ars has a problem too, however, in that there are some fairly significant advances in parallelism in tools, compilers, AND on the CPU that they seem to be writing off, as if the Cell is nothing more than a hyperthreading processor.

Anyways, interesting to watch them throw feces at each other at such an early stage in the game.


Given Sony's track record of overpromising and underdelivering, it should be common sense not to assume best-case performance for any of their products, especially when speculating without a unit to test the speculations on.

At the end of the day, Sony faces the same pressures all consumer electronics makers face: costs and production capacity. Those tend to shave a lot of the theoretical best case off the final product's performance.
 