What could nVidia do to increase their profit margins? ( NV34 )


Uttar
12-05-02, 01:55 PM
Hello everyone,

As you probably know, nVidia is currently having trouble keeping its profit margins high. Their explanation is that Intel is trying to sell off its whole stock of low-end Celerons ( only compatible with PCI ), which results in higher demand for things like the GF2 MX and TNT2. And their profits on low-end GPUs aren't as high as on high-end ones.

Now, what's the problem here? TNT2s, for example, are made on 0.25 - and converting a chip from 0.25 to 0.15 or 0.13 is easier said than done.
Now, as I said in a quite recent thread of mine, nVidia's VS architecture is probably more flexible for deriving low-end chips.
And I wouldn't be surprised if there was more to it than it seems. Maybe Triangle Setup is done in an easier-to-change way.

So, maybe nVidia wants to derive everything from their NV3x core. Including TNT2 or GeForce 2 MX/GTS replacements...
The problem? LMA 3 & Intellisampling. Both of those almost certainly take a fair number of transistors. So, for very low performing solutions, it would be strange to include them. But if you had to remove them, it would become a lot harder to derive the chip.
The solution? Decrease their efficiency. Who'd need so much power in the antialiasing unit ( final stage, to blend the samples ) when turning on 4X Antialiasing would kill performance? And who'd need the 4 Z Sampling units, too?
After all, making it perform horribly with more than 2X Antialiasing, and doing so *on purpose*, is okay for mainstream. It justifies buying high-end cards.

As for Color Compression, Fast Color Clear, Adaptive Aniso... You can't do much about that. And that's why the transistor count will remain quite high. But is it that important, really?

How much does each part really take? That's an important question. Many websites would suggest the pipelines take 90% of the die. I suggest they get their Voodoo out of their case and put in a GeForce :)
Fact is, that simply doesn't make any sense. AnandTech's GF4 launch article includes an image which shows how much of the die each component takes.

Here it is: http://www.anandtech.com/showdoc.html?i=1583&p=2
I wouldn't say that's amazingly accurate. But it'll certainly give a fair bit more info than someone saying everything is in the pipelines!

That pretty much is:
2VS = 12.5% -> 8M
2D/Video/HDVP + Display = 25% -> 12%
Pixel Shader = 12.5% -> 7.5M

Multisampling = 12% -> 7.5M
Texture Unit = 9% -> 6.5M
Host Interface = 9% -> 6.5M
LMA 2 = 20%

Does it sound surprising that so little is in the pixel pipelines and texturing? It's only about 15M transistors! Or nearly 25% of the NV25 die.
And... What could possibly explain all of that?
Cache. AGP needs cache because being able to keep that newly received information is critical. As for the Vertex Shader & Triangle Setup, you have to wait for the Pixel Pipelines to be ready to operate on those pixels - so keeping a cache is important so you can process a few vertices in advance. That way, once the Pixel Pipelines are ready, they read the caches and can operate immediately.
As for MultiSampling - it obviously requires caches to store the samples for blending. I don't want to speculate too much on where the caches are used, but it obviously requires some of them. And doing the blending requires transistors, too.

The real problem here? As you can see, simply reducing VS & PS power isn't going to do it. Multisampling and LMA take too much.
If you considered 50% PS and VS power on the GeForce 4, you'd get 18M transistors less. And that's a 40+M transistor part.
As a Trident interview says, 30M transistors cost 4 times less than 60M transistors. So, I'd guess 40M transistors cost 2 times less than 60M transistors. Not bad, but performance is also 2 times lower...
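To make that guess a bit more explicit, here is a minimal sketch. It assumes cost follows a simple power law of transistor count, with the exponent fitted from the single Trident data point ( 60M costs 4 times as much as 30M ). The power law itself and the function names are assumptions for illustration only; real die cost also depends on yield, process and packaging.

```python
import math

def fit_exponent(size_ratio: float, cost_ratio: float) -> float:
    """Exponent k such that cost_ratio == size_ratio ** k (assumed power law)."""
    return math.log(cost_ratio) / math.log(size_ratio)

def times_cheaper(big_transistors_m: float, small_transistors_m: float, k: float) -> float:
    """How many times cheaper the smaller chip is if cost ~ transistors ** k."""
    return (big_transistors_m / small_transistors_m) ** k

# Single data point from the Trident interview: 60M costs 4x what 30M costs.
k = fit_exponent(60 / 30, 4.0)
print(f"fitted exponent k = {k:.2f}")
print(f"40M vs 60M: about {times_cheaper(60, 40, k):.2f} times cheaper")
```

Under that assumption the 40M part comes out a little over 2 times cheaper than the 60M one, which is in line with the guess above.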

But there's something we've also got to consider. Reducing cache size. With the GF4, more than 30% of the die is cache. That's 20M transistors! Where's the cache used? Just about everywhere. But that includes multisampling. Now, obviously, since such a mainstream part would have significantly less memory bandwidth available, so much multisampling power is *useless* - so, instead of 12%, you could have something more like 6% by reducing its cache and logic. That could save 4M transistors. Nothing major, but it's still something. By doing several other things, in the end, you'd get a part which has 2 times fewer transistors for 2 times less performance and it would be 4 times less costly to produce.


So, let's talk about the NV30 now... The final problem is that it got even more LMA/Intellisampling power than the NV25. And LMA transistor count can't be reduced so easily without reducing performance quite significantly. And no, since it's not concurrently done with the VS/PS, cutting 50% of it *and* 50% of the PS/VS pipelines won't result in a 50% performance decrease but more of a 65% one.

The advantage is that the VS transistor count is a lot easier to reduce.

The solution? Simply don't care too much about all of that and reduce cache size everywhere it wouldn't reduce performance too significantly. And then cut back PS & VS. Finally, replace the memory with a significantly cheaper kind ( slower DDR1 I suppose ) and put 64MB of RAM instead of 128MB. A good reason for that is that since you're suggesting not to use more than 2X AA with that card, more memory bandwidth is useless.


Conclusion?

First of all, sorry if I repeated myself. If I did, please say so and I'll eventually edit this post. It took me a LONG while to write this as I searched for information and constantly modified the post to make it more correct.

You know where I'm going with this? The NV34.
The NV34 is going to be nVidia's new low end card. Nobody knows what it's really going to be. And what I'm doing here is finding a financial reason for a specific card model then backing it up by showing it's logical.

Thus, here are my personal expectations for the NV34:
2 Pixel Pipelines, 2 or 3 times less VS power clock for clock than the NV30, decreased cache sizes making 4X AA more costly and making 2X AA the only acceptable mode.
Transistor count: +- 45M transistors.
It would probably cost between 5 and 6 times less to produce than the NV30.
Price: $89 for a high-clocked one ( 325Mhz with 64MB slow DDR1? ) and $59 for a low clocked one ( 225Mhz with 32MB even slower DDR1? )
The high-clocked one could easily beat the Radeon 9000 Pro when AA/Aniso is activated because of Intellisampling ( the Radeon 9000 only got supersampling ) and the low-clocked one probably couldn't beat the Radeon 9000, but that ain't the goal anyway. Of course, none would even touch the Radeon 9500. That's NV31's job. And that isn't this thread's topic.
The low-end model could be stuffed in every single low-end PC on the planet. Think about the revenue! And putting everything in 0.13 instead of 0.25 for things like the TNT2 would certainly increase profit margins ( once 0.13 yields at TSMC become better, of course )
Remember, 0.13 only takes 27% of the space 0.25 uses! So, since the TNT2 is about 10.5M transistors on 0.25, it wouldn't take much more space to do 45M transistors on 0.13! And you can put higher clock rates, too.
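The 27% figure is just the square of the feature-size ratio. Here's a minimal sketch of that arithmetic, assuming die area scales with the square of the process size and that transistor density scales ideally ( real shrinks rarely scale that cleanly ). The function names are made up for illustration.

```python
def area_scale(old_process_um: float, new_process_um: float) -> float:
    """Ideal area of the same design after a shrink, as a fraction of the old area."""
    return (new_process_um / old_process_um) ** 2

def relative_die_area(old_transistors_m: float, new_transistors_m: float,
                      old_process_um: float, new_process_um: float) -> float:
    """Area of the new chip relative to the old one, under ideal scaling."""
    return (new_transistors_m / old_transistors_m) * area_scale(old_process_um, new_process_um)

scale = area_scale(0.25, 0.13)
ratio = relative_die_area(10.5, 45.0, 0.25, 0.13)
print(f"0.13 uses {scale:.1%} of the area of 0.25")
print(f"45M on 0.13 vs 10.5M on 0.25: {ratio:.2f}x the die area")
```

With these inputs, a 45M transistor chip on 0.13 would be roughly 16% larger than the 10.5M transistor TNT2 on 0.25.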

It will also be used for mobile chips.
It might eventually see the light of day as a very low end workstation chip, for people who want to see exactly what their result looks like but who don't care too much about how much time it takes; they just want to be sure it runs fine. But it's not certain at all that it'll even exist.


I believe this is what nVidia is so excited about. The amazing flexibility of their architecture, enabling mainstream components to be very different from the high-end components. And they actually hinted at it when they said on a financial website that low-end derivatives would come "sooner than expected".
I'll see if I can find a link for that tomorrow.

Any comment? Possible idea? Speculation? Link? Feel free to post, I'd love to know what you people think.


Uttar

NOTE, added by edit: All of this is speculation. I've got no source within nVidia or anything. However, I think it's educated speculation, and I find it really possible and logical.

PreservedSwine
12-05-02, 09:25 PM
I thought about reading all that, but then decided the simple answer would have to do...

Raise their prices.....:p

Bigus Dickus
12-05-02, 11:07 PM
Originally posted by Uttar
Price: $89 for a high-clocked one ( 325Mhz with 64MB slow DDR1? ) and $59 for a low clocked one ( 225Mhz with 32MB even slower DDR1? )
The high-clocked one could easily beat the Radeon 9000 Pro when AA/Aniso is activated because of Intellisampling ( the Radeon 9000 only got supersampling ) and the low-clocked one probably couldn't beat the Radeon 9000, but that ain't the goal anyway. Of course, none would even touch the Radeon 9500. That's NV31's job. And that isn't this thread's topic.
The low-end model could be stuffed in every single low-end PC on the planet. Think about the revenue!

Yeah, except that the Radeon 9000 is now several months old (half a year), and will be replaced by the RV350 core likely by next summer. That will be a cut back R300 core, much as you are describing, with DX9 compliance and .13u manufacturing.

So, what good does it do for the cut back NV34 to compete with the 9000 when its competition is RV350 derived products? ATi has done their homework as well. ;)

Uttar
12-06-02, 01:14 AM
Originally posted by Bigus Dickus
Yeah, except that the Radeon 9000 is now several months old (half a year), and will be replaced by the RV350 core likely by next summer. That will be a cut back R300 core, much as you are describing, with DX9 compliance and .13u manufacturing.

So, what good does it do for the cut back NV34 to compete with the 9000 when its competition is RV350 derived products? ATi has done their homework as well. ;)

Good point.
But if nVidia delivers with their Intellisample technology, they could use lower clocked memory modules. And that would make their NV34 cheaper to produce.
I kinda think that the RV350 probably won't compete with the NV34. Fact is, ATI's RV products rarely made a horrible dip in performance. I'd guess more on a 4 pipeline, 2 VS architecture with 128 bit bus and less efficient multisampling.

And as I explained at the top of the post, the goal is not only to compete with the 9000. It's mostly to replace the old crap nVidia is still producing, such as the GeForce 2 MX and TNT2. It obviously can't be done in a day, but I really think nVidia wants to do that.

And I'd guess ATI would be taken by surprise if nVidia decided to create a NV3x chip for the very-low end. It would rather have been a competitor to Trident XP4. If it didn't suck :P

Bigus Dickus
12-06-02, 01:31 AM
Well, true, I think the RV350 will still remain roughly where the current R300 is performance-wise, and somewhere around the 9000 Pro / 9500 non-Pro cost-wise. It is possible though that ATi will produce something akin to the 9500 Pro/non-Pro versions. That is, an RV350 Pro with 8 pipelines and R300-ish performance, and an RV350 non-Pro with 4 pipelines and 9500 Pro-ish performance, the first falling above and the second falling below the $100 mark (or, perhaps the distinction would be 128 bit vs. 256 bit memory interface... whatever). Not bad for DX9 cards if that is the case.

As far as TNT2 and GF2MX class cards... are they still actually manufacturing those? I guess so, with so many OEM's still using them, huh? In that case, the Radeon 9000, as cheap as it already is, could probably be sold for ~$60 if it was switched to a .13u process with 64 MB of mediocre memory. Not a barn burner, that's sure. It's only 40 million transistors, and I doubt a cut down NV30 could really get much smaller, though it might retain higher performance.

StealthHawk
12-06-02, 03:58 AM
what nvidia really needs to compete in the low end is a motherboard for Intel processors that has nvidia integrated graphics on it. nvidia's biggest threat is probably not ATI, but Intel.

Uttar
12-06-02, 05:28 AM
Originally posted by Bigus Dickus
As far as TNT2 and GF2MX class cards... are they still actually manufacturing those? I guess so, with so many OEM's still using them, huh? In that case, the Radeon 9000, as cheap as it already is, could probably be sold for ~$60 if it was switched to a .13u process with 64 MB of mediocre memory. Not a barn burner, that's sure. It's only 40 million transistors, and I doubt a cut down NV30 could really get much smaller, though it might retain higher performance.

Yes, they're still producing those. And yes, a RV250 on 0.13 would be very cheap to produce and would be highly regarded by OEMs.

But you're forgetting a few things. First of all, if it was sold with 64MB mediocre memory, its performance would be so darn low it wouldn't even make sense to use the RV250 core.
Of course, what you mean by mediocre may not be as bad as what I imply by it. My definition of mediocre would be 175Mhz ( effectively 350Mhz ) DDR memory.
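For a rough feel of what such memory delivers, here is a small sketch of the peak bandwidth arithmetic. DDR transfers twice per clock, which is where the "effectively 350Mhz" comes from. The 128-bit bus width is purely an assumption for the example, since nothing above states the bus width of this hypothetical card, and the extra clocks are only there for comparison.

```python
def ddr_bandwidth_gb_s(clock_mhz: float, bus_width_bits: int) -> float:
    """Peak DDR bandwidth in GB/s: two transfers per clock, bus_width_bits per transfer."""
    transfers_per_second = clock_mhz * 1e6 * 2
    bytes_per_transfer = bus_width_bits / 8
    return transfers_per_second * bytes_per_transfer / 1e9

ASSUMED_BUS_WIDTH_BITS = 128  # assumption, not a known spec of any card discussed here

for clock_mhz in (175, 200, 300):
    bandwidth = ddr_bandwidth_gb_s(clock_mhz, ASSUMED_BUS_WIDTH_BITS)
    print(f"{clock_mhz}Mhz DDR on a {ASSUMED_BUS_WIDTH_BITS}-bit bus: {bandwidth:.1f} GB/s peak")
```

These are theoretical peaks only; effective bandwidth depends on the memory controller and on how well the compression and Z techniques work.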

Of course, you're going to suppose the performance of the NV34 with such memory would be *horrible* - think again. So, what's the difference between NV3x and RV2xx?
The NV3x got Early Z, as does NV2x. The RV2xx ( or the R2xx, for that matter ) doesn't. That significantly reduces fillrate usage, and thus less bandwidth is required there.
Also, the RV250 got SSAA, so even 2X antialiasing is out of the question. The NV3x architecture got a very efficient MSAA algorithm ( I refuse to compare it to the R300 algorithm because else we'd start a flame war and I don't want that )
Finally, the NV3x got Fast Color Clear, as the R3xx, but the R(V)2xx doesn't have that.

The NV34 would run fine with very low quality memory. However, since it would have 2 pipelines and little memory bandwidth ( and little memory, too, since it would only have 64MB and potentially a 32MB model ) , the most used settings would probably be:
800x600 with 2X AA. After all, with those settings, the frame buffer @ 32BPP only takes 10MB. So, if you also consider the Z buffer, we'd get about 16MB. That's a little less than 15MB for textures if we got 32MB. In other words, no Doom 3 for you!
If a game requires that many textures, you can always fall back to 640x480 with Quincunx & Aniso ( since Aniso will be quite cheap according to what we know ) - that only takes a 6MB frame buffer, leaving at least 22MB for the textures.
And since it got AGP 8X, it's a lot cheaper to resend textures. So, in many cases, it'll all be fine. You shouldn't be hoping for stellar performance at $59, after all :)
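A memory budget like this can be sketched in a few lines. The assumptions here are mine: one resolved front buffer, one multisampled back buffer and one multisampled Z buffer, all at 32 bits, with Quincunx treated as a 2-sample mode. Real drivers may allocate things differently, so treat the output as a ballpark rather than a reproduction of the figures above.

```python
def buffer_bytes(width: int, height: int, bytes_per_pixel: int = 4, samples: int = 1) -> int:
    """Size of one buffer in bytes; 4 bytes per pixel corresponds to 32BPP."""
    return width * height * bytes_per_pixel * samples

def video_memory_budget(width: int, height: int, aa_samples: int, total_mb: float):
    """Return (MB used by buffers, MB left for textures) under the assumed layout."""
    front = buffer_bytes(width, height)                      # resolved image being displayed
    back = buffer_bytes(width, height, samples=aa_samples)   # multisampled colour buffer
    depth = buffer_bytes(width, height, samples=aa_samples)  # multisampled Z buffer
    used_mb = (front + back + depth) / 1e6
    return used_mb, total_mb - used_mb

for width, height in ((800, 600), (640, 480)):
    used, free = video_memory_budget(width, height, aa_samples=2, total_mb=32)
    print(f"{width}x{height} with 2 samples: {used:.1f}MB in buffers, {free:.1f}MB left of 32MB")
```

With this layout the buffers come to about 9.6MB at 800x600 and about 6.1MB at 640x480, which is in the same range as the frame buffer figures quoted above, though the accounting for the Z buffer may not be the same.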

In conclusion, the highly efficient memory saving techniques of the NV34 would let nVidia have higher margins by using cheaper memory and eventually less memory too.


Uttar

Bigus Dickus
12-06-02, 10:02 AM
I was thinking along the lines of a 350/250 core/memory part. Since you're speculating on 2 pipelines for the NV3x value variant, I don't think that a 4 pipe RV2xx would have a difficult time remaining competitive. It would have twice the fillrate, so 2X supersampling AA would still be a possibility, especially at the 800x600 res you're talking about.

But, I really see no reason to produce such a chip. It is probably possible to use a 4 pipe 128-bit RV350 variant on .13u to produce a sub-$100 card.

Uttar
12-06-02, 12:06 PM
Originally posted by Bigus Dickus
I was thinking along the lines of a 350/250 core/memory part. Since you're speculating on 2 pipelines for the NV3x value variant, I don't think that a 4 pipe RV2xx would have a difficult time remaining competitive. It would have twice the fillrate, so 2X supersampling AA would still be a possibility, especially at the 800x600 res you're talking about.

But, I really see no reason to produce such a chip. It is probably possible to use a 4 pipe 128-bit RV350 variant on .13u to produce a sub-$100 card.

As I said, that all depends on whether nVidia delivers with their color compression. If it gave a 5% hit for 2X MSAA at 800x600 even with 175Mhz DDR1 ( remember the GF4 Ti4600 does that with 300Mhz DDR1 ) or maybe 200Mhz DDR1, the RV2xx SSAA wouldn't be able to touch it. The 8500, for example, would probably have a 15% hit since it got a 20% hit at 1024x768

ATI wouldn't have the Aniso advantage, since the NV34 would also have an adaptive algorithm.

Anyway, while a RV250 on 0.13 might beat the NV34, it would need significantly better memory to do so. And that would increase its cost, so for the same cost, the NV34 would have better performance.

Now, of course, I could be wrong here. Maybe nVidia will surprise me by announcing a NV34 not even being able to win against the Radeon 9000 Pro. We'll see that in a few months anyway, but nothing stops us from speculating :)

Anyway, yeah, a RV350 would probably beat the NV34. The question is... Did ATI think about doing this? *hopes he didn't just give ATI the idea to do this* And does ATI even think it's an interesting opportunity?


Uttar

StealthHawk
12-06-02, 02:53 PM
Originally posted by Uttar
The 8500, for example, would probably have a 15% hit since it got a 20% hit at 1024x768

how did you get that 15% number? SSAA always has around the same performance hit. which is 50% for 2x FSAA, unless perhaps you are in an extreme CPU limited engine.
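a tiny model of that argument, as a sketch only: it assumes the frame rate is whichever is lower of a cpu-bound rate and a fill-bound rate, and that N x supersampling divides the fill-bound rate by N. the frame rates used below are made-up numbers, not benchmarks.

```python
def fps_with_ssaa(fill_limited_fps: float, cpu_limited_fps: float, samples: int) -> float:
    """Frame rate when SSAA renders `samples` times the pixels (assumed simple model)."""
    return min(cpu_limited_fps, fill_limited_fps / samples)

def performance_hit(fps_no_aa: float, fps_aa: float) -> float:
    """Fractional drop relative to the frame rate without AA."""
    return 1 - fps_aa / fps_no_aa

# Hypothetical fill-limited game: the card is the bottleneck.
fill_no_aa = fps_with_ssaa(100, 1000, samples=1)
fill_2x = fps_with_ssaa(100, 1000, samples=2)
print(f"fill limited: {performance_hit(fill_no_aa, fill_2x):.0%} hit for 2x SSAA")

# Hypothetical CPU-limited game: the CPU caps the frame rate at 60 either way.
cpu_no_aa = fps_with_ssaa(200, 60, samples=1)
cpu_2x = fps_with_ssaa(200, 60, samples=2)
print(f"cpu limited: {performance_hit(cpu_no_aa, cpu_2x):.0%} hit for 2x SSAA")
```

in the fill limited case the hit is 50%, and in the cpu limited case it disappears, which is why a cpu limited benchmark can make SSAA look cheap.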

and where did that 5% hit for 2x FSAA with a gf4 come from :confused:

Uttar
12-06-02, 04:30 PM
Originally posted by StealthHawk
how did you get that 15% number? SSAA always has around the same performance hit. which is 50% for 2x FSAA, unless perhaps you are in an extreme CPU limited engine.

and where did that 5% hit for 2x FSAA with a gf4 come from :confused:

Err, I only checked *one* benchmark for those numbers. And it sounds like it wasn't a good idea.

http://www.anandtech.com/showdoc.html?i=1583&p=12
http://www.anandtech.com/showdoc.html?i=1583&p=13

Here, you can see that at 1024x768 @ 2X AA, the Radeon 8500 goes from 58.7 to 46.4, or a 25% performance hit. So sorry for the 20% number, it's 25%. So I'd guess it's between 15% and 20% performance hit with 800x600
As for the GF4 Ti4600, it goes from 85.6 to 78.8, a 9% performance hit. Icky, I really gotta review how I estimate % - without a calculator, I estimated 5%! Icky!
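For anyone who wants to redo these percentages with a calculator, here is a short sketch using the four frame rates quoted above. The result depends on which score you divide by, so both conventions are shown; the function names are just illustrative.

```python
def drop_from_baseline(fps_no_aa: float, fps_aa: float) -> float:
    """Percentage lost relative to the score without AA."""
    return (fps_no_aa - fps_aa) / fps_no_aa * 100

def lead_over_aa_score(fps_no_aa: float, fps_aa: float) -> float:
    """Percentage by which the score without AA exceeds the AA score."""
    return (fps_no_aa - fps_aa) / fps_aa * 100

scores = {"Radeon 8500": (58.7, 46.4), "GF4 Ti4600": (85.6, 78.8)}
for card, (no_aa, aa) in scores.items():
    print(f"{card}: {drop_from_baseline(no_aa, aa):.1f}% drop from the no-AA score, "
          f"{lead_over_aa_score(no_aa, aa):.1f}% measured against the AA score")
```

Measured as a drop from the no-AA score, the hits come out around 21% and 8%; measured against the AA score they are around 26.5% and 8.6%, which is closer to the rounded figures above.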

And it probably isn't a good idea to use that benchmark at all. Unreal got way too many vertices and is too CPU limited, and in most other games, the performance hit is significantly higher. Sorry, my mistake.
Here's a slightly better example: Quake 3. A less typical benchmark would be better, but oh well.

http://www6.tomshardware.com/graphic/02q3/0207181/radeon9000-10.html#fsaa

Radeon 8500: +- 60% performance hit
Ti4200: +- 30% performance hit

So that reasoning was kinda flawed. 2X SSAA at 800x600 wouldn't be acceptable for a RV250 on 0.13, even with higher clocks.
So, only the RV350 could pose a threat to the NV34 at those price levels, unless I made a mistake somewhere. And did I?



Uttar

Bigus Dickus
12-07-02, 02:29 AM
Originally posted by Uttar
So that reasoning was kinda flawed. 2X SSAA at 800x600 wouldn't be acceptable for a RV250 on 0.13, even with higher clocks.
So, only the RV350 could pose a threat to the NV34 at those price levels, unless I made a mistake somewhere. And did I?

Errmm.... yeah, I think you made a mistake somewhere. Regardless of the % drop, the current RV250 IS capable of 2X SSAA at 800x600 with perfectly acceptable framerates. With .13u and a core clock bump, I don't see how you figure things will get worse. :confused:

Just what is your logic here? You've been arguing that, based on lack of early Z rejection (though I thought that has been in since the original Radeon), lack of color compression, and lack of MSAA (where compression would help, and fillrate isn't an issue), that the RV250 wouldn't run AA/resolution modes that some chopped down guess of an NV3x core could.

Why don't you just look at existing performance graphs for the RV250 core, and see what it is capable of running? From the QIII URL you posted, the RV250 core is running 2X quality AA at 1024 x 768 perfectly fine. How you could have backed out of all your calculations that it wouldn't be "acceptable" at even lower resolutions I just don't understand.

Uttar
12-07-02, 04:49 AM
Originally posted by Bigus Dickus
Errmm.... yeah, I think you made a mistake somewhere. Regardless of the % drop, the current RV250 IS capable of 2X SSAA at 800x600 with perfectly acceptable framerates. With .13u and a core clock bump, I don't see how you figure things will get worse. :confused:

Just what is your logic here? You've been arguing that, based on lack of early Z rejection (though I thought that has been in since the original Radeon), lack of color compression, and lack of MSAA (where compression would help, and fillrate isn't an issue), that the RV250 wouldn't run AA/resolution modes that some chopped down guess of an NV3x core could.

Why don't you just look at existing performance graphs for the RV250 core, and see what it is capable of running? From the QIII URL you posted, the RV250 core is running 2X quality AA at 1024 x 768 perfectly fine. How you could have backed out of all your calculations that it wouldn't be "acceptable" at even lower resolutions I just don't understand.

Sorry for the Early Z rejection part; I think I wasn't sufficiently clear.
Both Hierarchical Z ( available on the original Radeon and since the NV2x ) and Early Z reject pixels before the pixel pipeline. Early Z, only available on NV2x, NV3x and R3xx, is simply more effective at doing that.

I know the current RV250 is able to do 1024x768 and 2XSSAA in Quake 3. And in fact, I guess even the NV34 would be able to do that quite easily. Quake 3 is a good benchmark to calculate AA cost since it's not CPU/VS limited at all. And it's the same story in RTCW.

So, let me redo my calculations :P
Hmm, well if the DDR1 was clocked at 200Mhz instead of my proposed 175Mhz, 4XAA might be very cheap too. Let's suppose a 30% performance hit. And you could even balance that Memory Bandwidth hit with a fillrate hit, potentially making 4X Aniso free since it's adaptive and shouldn't hit more than 30% even with 2 pipelines.

That sounds fairly acceptable in a lot of games. Of course, that won't work in Doom 3, but that's not what mainstream is for :)

Could this even be possible, or is that performance estimation too optimistic?


Uttar

EDIT: I realized I forgot to mention why I took 800x600 in the first place.
The explanation is simple: With the NV30, you'd want 1600x1200 nearly everywhere. So, everywhere you could have that but not more, with 2 pipelines instead of 8, you'd have to take 800x600.
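That scaling can be written out as a quick sketch. It assumes the same clock speed and that the number of pixels you can afford scales linearly with the number of pipelines, which is a simplification ( memory bandwidth and CPU limits are ignored ).

```python
def scaled_resolution(width: int, height: int, pipelines_from: int, pipelines_to: int):
    """Resolution with the same per-pipeline pixel load, keeping the aspect ratio."""
    # Pixel count scales with the pipeline ratio, so each dimension scales with its square root.
    factor = (pipelines_to / pipelines_from) ** 0.5
    return round(width * factor), round(height * factor)

print(scaled_resolution(1600, 1200, pipelines_from=8, pipelines_to=2))
```

Going from 8 pipelines to 2 leaves a quarter of the pixels, and a quarter of 1600x1200 is exactly 800x600.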