Uttar
12-05-02, 01:55 PM
Hello everyone,
As you probably know, nVidia is currently having problems keeping high profit margins. Their explanation is that Intel is trying to sell off its whole stock of low-end Celerons ( only compatible with PCI ), which results in higher demand for things like the GF2 MX and TNT2. And their profits on low-end GPUs aren't as high as on high-end ones.
Now, what's the problem here? TNT2s, for example, are made on 0.25 - and converting a chip from 0.25 to 0.15 or 0.13 is easier said than done.
Now, as I said in a fairly recent thread of mine, nVidia's VS architecture is probably more flexible for deriving low-end chips.
And I wouldn't be surprised if there was more to it than it seems. Maybe Triangle Setup is done in a way that is easier to change.
So, maybe nVidia wants to derive everything from their NV3x core. Including TNT2 or GeForce 2 MX/GTS replacements...
The problem? LMA 3 & Intellisampling. Both of those almost certainly take a fair number of transistors. So, for very low performing solutions, it would be strange to include them. But if you had to remove them, it would become a lot harder to derive the chip.
The solution? Decrease their efficiency. Who'd need so much power in the antialiasing unit ( final stage, to blend the samples ) when turning on 4X Antialiasing would kill performance? And who'd need the 4 Z Sampling units, too?
After all, making it run horribly with more than 2X Antialiasing, and doing so *on purpose*, is okay for mainstream. It justifies buying high-end cards.
As for Color Compression, Fast Color Clear, Adaptive Aniso... You can't do much about that. And that's why the transistor count will remain quite high. But is it that important, really?
How much does each part really take? That's an important question. Many websites would suggest the pipelines take 90% of the die. I suggest they get their Voodoo out of their case and put in a GeForce :)
Fact is, that simply doesn't make any kind of sense. AnandTech published an image with the GF4 launch which shows how much of the die each component takes.
Here it is: http://www.anandtech.com/showdoc.html?i=1583&p=2
I wouldn't say that's amazingly accurate. But it'll certainly give a fair bit more info than someone saying everything is in the pipelines!
It pretty much comes down to:
2VS = 12.5% -> 8M
2D/Video/HDVP + Display = 25% -> 12%
Pixel Shader = 12.5% -> 7.5M
Multisampling = 12% -> 7.5M
Texture Unit = 9% -> 6.5M
Host Interface = 9% -> 6.5M
LMA 2 = 20%
Does it sound surprising that so little is in the pixel pipelines and texturing? It's only about 15M transistors! Or nearly 25% of the NV25 die.
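For anyone who wants to check the arithmetic, here is a small sketch that adds up the list above. It uses only the shares and transistor estimates quoted in this post (the rows without a clear transistor figure are left as None), and the helper names are just made up for illustration.

```python
# Die shares (%) and transistor estimates (millions) as quoted in the post.
# None marks rows where the post gives no clear transistor figure.
breakdown = {
    "2VS": (12.5, 8.0),
    "2D/Video/HDVP + Display": (25.0, None),
    "Pixel Shader": (12.5, 7.5),
    "Multisampling": (12.0, 7.5),
    "Texture Unit": (9.0, 6.5),
    "Host Interface": (9.0, 6.5),
    "LMA 2": (20.0, None),
}


def total_share(names):
    """Sum of the die shares (in %) for the given component names."""
    return sum(breakdown[name][0] for name in names)


def total_transistors(names):
    """Sum of the quoted transistor estimates (in millions)."""
    return sum(breakdown[name][1] for name in names)


all_shares = total_share(breakdown)
pixel_side = ["Pixel Shader", "Texture Unit"]

print("all components:", all_shares, "%")
print("pixel shader + texturing:",
      total_share(pixel_side), "% /",
      total_transistors(pixel_side), "M transistors")
```

With the numbers as listed, the shares add up to the whole die, and pixel shading plus texturing comes out a little under the "about 15M / nearly 25%" quoted above.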
And... What could possibly explain all of that?
Cache. AGP needs cache because being able to hold on to newly received information is critical. As for the Vertex Shader & Triangle Setup, you have to wait for the Pixel Pipelines to be ready to operate on those pixels - so keeping a cache is important so you can process a few vertices in advance. That way, once the Pixel Pipelines are ready, they read the caches and can operate immediately.
As for MultiSampling - it obviously requires caches to store the samples for blending. I don't want to speculate too much on where the caches are used, but it obviously requires some of them. And doing the blending requires transistors, too.
The real problem here? As you can see, simply reducing VS & PS power isn't going to do it. Multisampling and LMA take too much.
If you went with 50% of the PS and VS power of the GeForce 4, you'd get 18M fewer transistors. And that's still a 40+M transistor part.
As a Trident interview says, 30M transistors cost 4 times less than 60M transistors. So, I'd guess 40M transistors cost 2 times less than 60M transistors. Not bad, but performance is also 2 times lower...
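One way to sanity-check that guess: assume cost follows a simple power law in transistor count and fit the exponent to the single Trident data point. The power law is purely an assumption for illustration (it is not something the interview states), and the function names are made up.

```python
import math


def fit_exponent(n_small, n_large, cost_ratio):
    """Exponent k such that (n_large / n_small) ** k == cost_ratio.

    Assumes cost ~ transistors ** k, which is only a rough model.
    """
    return math.log(cost_ratio) / math.log(n_large / n_small)


def times_cheaper(n_ref, n, k):
    """How many times cheaper a part with n transistors is than one with n_ref."""
    return (n_ref / n) ** k


# Single data point from the post: 30M costs 4 times less than 60M.
k = fit_exponent(30, 60, 4)

print("fitted exponent:", round(k, 3))
print("40M vs 60M:", round(times_cheaper(60, 40, k), 2), "times cheaper")
```

Under that assumption the exponent comes out at about 2, and a 40M part lands a bit above 2 times cheaper than a 60M one, so the guess is in the right ballpark.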
But there's something else we've got to consider. Reducing cache size. With the GF4, more than 30% of the die is cache. That's 20M transistors! Where's the cache used? Just about everywhere. But that includes multisampling. Now, obviously, since such a mainstream part would have significantly less memory bandwidth available, so much multisampling power is *useless* - so, instead of 12%, you could have something more like 6% by reducing its cache and logic. That could save 4M transistors. Nothing major, but it's still something. By doing several other things, in the end, you'd get a part with half the transistors and half the performance, and it would be 4 times less costly to produce.
So, let's talk about the NV30 now... The final problem is that it has even more LMA/Intellisampling power than the NV25. And LMA transistor count can't be reduced so easily without reducing performance quite significantly. And no, since it's not done concurrently with the VS/PS, cutting 50% of it *and* 50% of the PS/VS pipelines won't result in a 50% performance decrease but more like a 65% one.
The advantage is that the VS transistor count is a lot easier to reduce.
The solution? Simply don't worry too much about all of that and reduce cache size everywhere it wouldn't reduce performance too significantly. And then cut back PS & VS. Finally, replace the memory with a significantly cheaper kind ( slower DDR1 I suppose ) and put in 64MB of RAM instead of 128MB. A good reason for that is that since you're suggesting people not use more than 2X AA with that card, more memory bandwidth is useless.
Conclusion?
First of all, sorry if I repeated myself. If I did, please say so and I'll eventually edit this post. It took me a LONG while to write this as I searched for information and constantly modified the post to make it more correct.
You know where I'm going with this? The NV34.
The NV34 is going to be nVidia's new low end card. Nobody knows what it's really going to be. And what I'm doing here is finding a financial reason for a specific card model then backing it up by showing it's logical.
Thus, here are my personal expectations for the NV34:
2 Pixel Pipelines, 2 or 3 times less VS power clock for clock than the NV30, decreased cache sizes making 4X AA more costly and making 2X AA the only acceptable mode.
Transistor count: roughly 45M transistors.
It would probably cost between 5 and 6 times less to produce than the NV30.
Price: $89 for a high-clocked one ( 325MHz with 64MB slow DDR1? ) and $59 for a low-clocked one ( 225MHz with 32MB even slower DDR1? )
The high-clocked one could easily beat the Radeon 9000 Pro when AA/Aniso is activated because of Intellisampling ( the Radeon 9000 only has supersampling ) and the low-clocked one probably couldn't beat the Radeon 9000, but that ain't the goal anyway. Of course, neither would even touch the Radeon 9500. That's NV31's job. And that isn't this thread's topic.
The low-end model could be stuffed in every single low-end PC on the planet. Think about the revenue! And putting everything in 0.13 instead of 0.25 for things like the TNT2 would certainly increase profit margins ( once 0.13 yields at TSMC become better, of course )
Remember, 0.13 only takes 27% of the space 0.25 uses! So, since the TNT2 is about 10.5M transistors on 0.25, it wouldn't take much more space to do 45M transistors on 0.13! And you can put higher clock rates, too.
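Here is that die-space comparison as a quick sketch. It assumes ideal scaling, meaning the area per transistor shrinks with the square of the process size and nothing else changes; real chips won't scale that cleanly. The transistor counts are the ones used in this post, and the names are just for illustration.

```python
def area_scale(old_node, new_node):
    """Area of one transistor on new_node relative to old_node.

    Assumes ideal scaling: area goes with the square of the feature size.
    """
    return (new_node / old_node) ** 2


TNT2_TRANSISTORS_M = 10.5   # figure used in the post, on 0.25
NV34_TRANSISTORS_M = 45.0   # speculated count from the post, on 0.13

scale = area_scale(0.25, 0.13)

# Die area of the speculated 0.13 part relative to the TNT2 on 0.25.
relative_area = NV34_TRANSISTORS_M * scale / TNT2_TRANSISTORS_M

print("0.13 vs 0.25 area per transistor:", round(scale * 100, 1), "%")
print("45M on 0.13 vs TNT2 on 0.25:", round(relative_area, 2), "x the area")
```

Under those assumptions, the 45M part would only need somewhat more die space than the TNT2 does today.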
It will be used for mobile chips.
It might eventually see the light of day as a very low end workstation chip, for people who want to see exactly what their result looks like but who don't care too much about how much time it takes; they just want to be sure it runs fine. But it's not certain at all that it'll even exist.
I believe this is what nVidia is so excited about. The amazing flexibility of their architecture, enabling mainstream components to be very different from the high-end components. And they actually hinted at it when they said on a financial website that low-end derivatives would come "sooner than expected".
I'll see if I can find a link for that tomorrow.
Any comment? Possible idea? Speculation? Link? Feel free to post, I'd love to know what you people think.
Uttar
NOTE, added by edit: All of this is speculation. I've got no source within nVidia or anything. However, I think it's educated speculation, and I find it really possible and logical.