NV30's 4 Z-sampling units/pipe: really a disadvantage?
Hello everyone,
Let me quote Bigus Dickus:
The NV30 only has 4 Z-sampling units per pipeline, which means it can achieve a maximum of 4 MSAA samples per pixel per clock.
I'm *not* quoting this because it's incorrect. That information is perfectly correct.
However, it isn't quite as important as all of you people seem to think IMO.
With the NV25's 4 Z-sampling units/pipe, everyone was happy because the maximum was 4X AA.
Another detail has to be remembered: the GeForce 4 had 2 TMUs/pipe. So even pixels using 2 textures (very common) only took 1 clock.
On the GeForce FX, however, we only get 1 TMU/pipe. And we still get 4 Z-sampling units/pipe.
So, if nVidia uses concurrency correctly and does the Z sampling at the same time as the TMUs are working, so that both are ready at the same time when using 2 textures...
It's just as if we had 8 Z-sampling units/pipe on a 2 TMU/pipe architecture.
The only problem occurs with single-textured games when using more than 4X AA, where the performance is indeed slower than ATI's 6 Z Sampling units/pipe architecture.
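To make the argument above concrete, here is a rough sketch of the clock counting. It assumes texturing and Z sampling overlap perfectly inside a pipe, which is exactly the "IF they designed it correctly" part. The function name and the perfect-overlap model are my own illustration, not anything nVidia has published.

```python
from math import ceil


def clocks_per_pixel(textures, aa_samples, tmus_per_pipe, z_units_per_pipe):
    """Clocks one pipe spends on a pixel, ASSUMING texturing and
    Z sampling run fully in parallel (hypothetical model)."""
    texture_clocks = ceil(textures / tmus_per_pipe)
    z_clocks = ceil(aa_samples / z_units_per_pipe)
    # With perfect overlap, the slower of the two sets the pace.
    return max(texture_clocks, z_clocks)


# 1 TMU and 4 Z-sampling units per pipe, the NV30 figures quoted above.
for textures in (1, 2):
    for aa in (4, 6, 8):
        clocks = clocks_per_pixel(textures, aa, tmus_per_pipe=1, z_units_per_pipe=4)
        print(f"{textures} texture(s), {aa}X AA: {clocks} clock(s)")
```

In this model a dual-textured pixel costs 2 clocks at 4X, 6X and 8X alike, so the second Z-sampling clock is hidden. Only the single-textured case goes from 1 clock to 2 above 4X, which is where a pipe with 6 Z-sampling units would pull ahead at 6X.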
But nVidia is really trying to push high-quality graphics; they don't want us to go back to the single texturing era. And single-textured games performance should be so darn high with the NV30 because they're probably old games... So it isn't a real problem.
Conclusion? While ATI will have an advantage in single-textured cases with their 6 Z-Sampling units/pipe, nVidia will be just as good with their 4 Z-Sampling units/pipe IF they designed it correctly. And we can only hope that they were able to do so with $400M...
Of course, many games use both single-textured and multi-textured polys. But since there's generally a majority of multi-textured polys, the performance hit won't be as high as many people thought.
Now, maybe I made a mistake here. Maybe all of this is wrong. And that's why it's a good thing this is a forum - you people can flame, correct or annoy me. Please feel free to do so! :)
Uttar
EDIT: Corrected typo in the thread title
gstanford
11-30-02, 10:58 AM
I don't know the answer to that. I can only speculate that nVidia has some way around a 4 z sample limit per pipe since you can use 6x FSAA and 8x FSAA on the GF4 with the correct drivers.
Antialiasing Modes - 11/25/02 4:42 am - By: MikeC - Source: Email
The German web site 3DCenter chips in with more information on the two new antialiasing modes in the Detonator 40.72 and 41.03 drivers. Their article gives a short description of them and provides some Unreal Tournament 2003 benchmarks and screenshots.
Note that the new antialiasing modes can be enabled by the aTuner or RivaTuner utilities.
Greg
Originally posted by gstanford
I don't know the answer to that. I can only speculate that nVidia has some way around a 4 z sample limit per pipe since you can use 6x FSAA and 8x FSAA on the GF4 with the correct drivers.
Greg
Hmm, I don't quite think you understand the problem...
Finding a workaround simply ISN'T possible. You've *got* to take two clocks for Z sampling when you do 6X AA and 8X AA on the GeForce FX.
My point is that, if nVidia did their design correctly, that won't be a problem when multitexturing. So, in practice, it isn't as problematic as most people have speculated.
That is, if I'm not making a mistake in my reasoning.
Uttar
Bigus Dickus
12-02-02, 01:28 AM
Here's a question: when they do the mixed MS/SS modes for 6Xs and 8X, does the SS portion require another clock cycle?
In any case, since the NV30 will likely be heavily bandwidth limited on the 6X and 8X modes, burning up fillrate to calculate more Z values probably won't be that big of a problem. Just another reality check for the guys continuously quoting the 48 GB/s bandwidth numbers, and running around shouting "free AA! free AA!"
Though, since the mixed modes are already carrying a fillrate penalty for the supersampling portion, burning up another clock cycle might actually bring them down into the realm of core limited, not memory limited (though that would seem to be pretty far fetched).
Originally posted by Bigus Dickus
Here's a question: when they do the mixed MS/SS modes for 6Xs and 8X, does the SS portion require another clock cycle?
In any case, since the NV30 will likely be heavily bandwidth limited on the 6X and 8X modes, burning up fillrate to calculate more Z values probably won't be that big of a problem. Just another reality check for the guys continuously quoting the 48 GB/s bandwidth numbers, and running around shouting "free AA! free AA!"
Though, since the mixed modes are already carrying a fillrate penalty for the supersampling portion, burning up another clock cycle might actually bring them down into the realm of core limited, not memory limited (though that would seem to be pretty far fetched).
Well, "free AA!" obviously is marketing BS. Maybe we'll get 4X AA at a very low performance cost at many resolutions ( but I'd be surprised by 16x12 with 4X AA for free... )
Well, the whole idea of the supersampling portion is that since the memory is already being hit so darn much, it's fine to hit fillrate a little too since that would only put it back as the bottleneck. That's why 4xS gives, in most cases, exactly the same performance hit as 4x.
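A toy version of that bottleneck argument, with completely made-up millisecond figures: frame time is set by whichever of fillrate or memory is slower, so extra fill work is "free" as long as memory stays the slower one. It deliberately ignores any extra memory traffic from the supersampled part, which is an assumption on my side.

```python
def frame_time_ms(fill_ms, memory_ms):
    """Frame time when the slower of the two stages sets the pace
    (simplified model: the stages overlap perfectly)."""
    return max(fill_ms, memory_ms)


# Made-up numbers, purely for illustration.
plain_4x = frame_time_ms(fill_ms=5.0, memory_ms=12.0)
mixed_4xs = frame_time_ms(fill_ms=10.0, memory_ms=12.0)   # SS part doubles the fill cost
core_limited = frame_time_ms(fill_ms=14.0, memory_ms=12.0)  # fill cost now exceeds memory

print(plain_4x, mixed_4xs, core_limited)
```

The first two cases come out identical because memory is still the limit. The third shows the far-fetched scenario where the extra clocks push the chip into being core limited.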
Uttar
gstanford
12-02-02, 02:33 AM
The low cost, almost free anti aliasing claim goes back a very long way.
David Kirk was interviewed approx 1 to 1.5 years ago (by cnet or zd net I think) and in that interview he said nVidia had pretty much solved the performance hit problem of FSAA.
Unfortunately I've never been able to find that interview since (perhaps you have to subscribe to cnet who I am fairly certain were behind the interview), and if I saved a local copy It's long since gone. David also talked a little about NV30's pipelines IIRC.
Perhaps someone here knows the interview I'm referring to and can provide a link.
Greg
Look, this marketing crap is meant for previous generation (DX 8.1) 3d games. I would be surprised if you can run Quake 4 at 2048x1600 32bpp with no performance penalty with 4x FSAA enabled on the GFFX.
StealthHawk
12-02-02, 06:01 AM
Originally posted by K.I.L.E.R
Look, this marketing crap is meant for previous generation (DX 8.1) 3d games. I would be surprised if you can run Quake 4 at 2048x1600 32bpp with no performance penalty with 4x FSAA enabled on the GFFX.
why would you be surprised then? Quake 3 is a previous generation game, that would fall under the DX7 or even DX6 class game.
Bigus Dickus
12-02-02, 10:33 AM
Originally posted by Uttar
Well, the whole idea of the supersampling portion is that since the memory is already being hit so darn much, it's fine to hit fillrate a little too since that would only put it back as the bottleneck. That's why 4xS gives, in most cases, exactly the same performance hit as 4x.
Uttar
Well, except that supersampling does not lend itself to compression very well at all, whereas multisampling does. If we were talking about 8X MSAA vs. 8X MS/SSAA (with 4X SS 2X MS like someone suggested in another thread - which I think is probably incorrect, but this is hypothetical :) ) then I think there will be a definite bandwidth penalty due to the compression-unfriendly supersampling portion.
Originally posted by Bigus Dickus
Well, except that supersampling does not lend itself to compression very well at all, whereas multisampling does. If we were talking about 8X MSAA vs. 8X MS/SSAA (with 4X SS 2X MS like someone suggested in another thread - which I think is probably incorrect, but this is hypothetical :) ) then I think there will be a definite bandwidth penalty due to the compression-unfriendly supersampling portion.
Hmm, very interesting point you've got there.
I think nVidia found a way to compress pixels which aren't quite identical, but very similar. So, since those pixels would be very near, the difference wouldn't be too great.
Obviously, that compression can't be as efficient. So, you're making an interesting point: Will 4xS be significantly more costly than 4x on the NV30?
Uttar
Bigus Dickus
12-02-02, 04:29 PM
I don't think 4Xs will be significantly more expensive than 4X MSAA. After all, in that case we're only talking about the difference in 2X SSAA and 2X MSAA as far as compressibility is concerned, and even the NV30 should have enough bandwidth to handle the 2X SSAA portion without compression fairly easily.
However, the story might change with increased samples. At 6X or 8X AA, bandwidth is going to start really bringing things down performance-wise, even with the NV30's hyped compression techniques. At 6X and 8X, every little bit of bandwidth saved or wasted will make a difference in framerates, so I think the difference between compressible 6X MSAA and semi-compressible 6Xs will make a difference.
There's also a discussion over whether 8X is 4X SSAA + 2X MSAA, or 4X MSAA + 2X SSAA. The latter makes more sense to me, but I guess we'll just have to wait and see. If it is the former, you can bet that the inability to compress supersampling very much will make a rather large difference in performance. I wouldn't expect such a mode to be playable under many new games.
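A back-of-the-envelope sketch of why the split matters. It assumes a very simple compression model that is my own guess, not a description of the NV30: supersampled samples are never compressed, while multisamples collapse to a single stored value on "interior" pixels and only edge pixels keep every sample. The 0.75 interior fraction is a made-up number.

```python
def relative_traffic(ss_factor, ms_factor, interior_fraction):
    """Framebuffer traffic relative to no AA, under a HYPOTHETICAL model:
    interior pixels store 1 value for all MS samples, edge pixels store
    ms_factor values, and every SS sample pays that cost in full."""
    per_ss_sample = interior_fraction * 1 + (1 - interior_fraction) * ms_factor
    return ss_factor * per_ss_sample


interior = 0.75  # assumed share of pixels with no polygon edge in them

print("8X as 4X SS + 2X MS:", relative_traffic(4, 2, interior))
print("8X as 2X SS + 4X MS:", relative_traffic(2, 4, interior))
print("8X pure MS:         ", relative_traffic(1, 8, interior))
```

With these assumptions the 4X SS + 2X MS split moves noticeably more data than 4X MS + 2X SS, and both move more than a pure 8X multisampled mode. The real ratios depend entirely on how the hardware compression behaves.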
Well what game would you choose? Mr. I have a large ... :eek: ;) :D
Originally posted by StealthHawk
why would you be surprised then? Quake 3 is a previous generation game, that would fall under the DX7 or even DX6 class game.
StealthHawk
12-07-02, 12:08 AM
Originally posted by K.I.L.E.R
Well what game would you choose? Mr. I have a large ... :eek: ;) :D
hmm, what you said didn't make sense to me. you said that all the marketing junk was for DX8.1 games. this implies that gfFX should run really fast with all the eye candy on in older games. but then you say you would be surprised if a really old game like Quake 3 ran at high res with 4x FSAA with no penalty :confused:
Originally posted by StealthHawk
hmm, what you said didn't make sense to me. you said that all the marketing junk was for DX8.1 games. this implies that gfFX should run really fast with all the eye candy on in older games. but then you say you would be surprised if a really old game like Quake 3 ran at high res with 4x FSAA with no penalty :confused:
He said "Quake 4" , not "Quake 3"
K.I.L.E.R sure knows how to confuse us all :) Ah well, bad K.I.L.E.R, bad! Next time, put the 4 in bold please.
Uttar
LMFAO!!! :D
Someone noticed the 4 :)
StealthHawk
12-07-02, 07:05 AM
Originally posted by K.I.L.E.R
LMFAO!!! :D
Someone noticed the 4 :)
lol, point that out next time before you let me go on a tirade :o
hell if Doom 3 runs at 60fps at 1280 with 4x FSAA i'll be very impressed. i'll also buy one immediately :D For Quake 4 i'll drop the res down to 1024.
Couldn't you just read my post? :p ;) :D
It's quicker. :p
Originally posted by StealthHawk
lol, point that out next time before you let me go on a tirade :o
hell if Doom 3 runs at 60fps at 1280 with 4x FSAA i'll be very impressed. i'll also buy one immediately :D For Quake 4 i'll drop the res down to 1024.
According to an nVidia-released benchmark, it runs at about 45FPS in High Quality at 1280x1024 *without* FSAA or Aniso.
Sure, drivers can help, but miracles ain't gonna happen :)
And remember Doom 3 is mostly single player - as long as the FPS doesn't go below 30 too much and never below 25, it's fine :)
http://www.anandtech.com/video/showdoc.html?i=1749&p=7
Oh, and note the level name. "nvdemo3"
Uttar
StealthHawk
12-07-02, 10:46 PM
Originally posted by K.I.L.E.R
Couldn't you just read my post? :p ;) :D
It's quicker. :p
whenever i see "Quake #" i always assume Quake 3 when people mention benchmark numbers. i suspect that other people don't pay that much attention either, but if it's only me, hey, speed reading works....at times :D
AngelGraves13
12-10-02, 02:30 AM
Originally posted by StealthHawk
whenever i see "Quake #" i always assume Quake 3 when people mention benchmark numbers. i suspect that other people don't pay that much attention either, but if it's only me, hey, speed reading works....at times :D
That game has been benchmarked more than all the Intel processors to date! :D
Originally posted by gstanford
The low cost, almost free anti aliasing claim goes back a very long way.
David Kirk was interviewed approx 1 to 1.5 years ago (by cnet or zd net I think) and in that interview he said nVidia had pretty much solved the performance hit problem of FSAA.
Unfortunately I've never been able to find that interview since (perhaps you have to subscribe to cnet who I am fairly certain were behind the interview), and if I saved a local copy It's long since gone. David also talked a little about NV30's pipelines IIRC.
Perhaps someone here knows the interview I'm referring to and can provide a link.
Greg
Uhhh. . .is that the one where he claimed that was accomplished by switching the optimization default from non-FSAA to FSAA? In other words, non-FSAA would be slowed down (by losing the previous optimizations that were in place to its benefit) and FSAA would be quicker (by gaining optimizations appropriate to it), leaving apparently no (or little) difference in performance between the two?
Of course "slowed down" on non-FSAA would be masked in moving from one generation to the next by the higher clock, more pipes, and greater bandwidth so that even the non-FSAA would appear faster compared to the previous generation card. But it would be "slower" compared to what it would have been if they hadn't changed the optimization patterns to benefit FSAA instead.
If that's the interview you're referring to, then I remember it too, and no I don't have a copy.<shrugs>
Originally posted by Geo
Uhhh. . .is that the one where he claimed that was accomplished by switching the optimization default from non-FSAA to FSAA? In other words, non-FSAA would be slowed down (by losing the previous optimizations that were in place to its benefit) and FSAA would be quicker (by gaining optimizations appropriate to it), leaving apparently no (or little) difference in performance between the two?
Of course "slowed down" on non-FSAA would be masked in moving from one generation to the next by the higher clock, more pipes, and greater bandwidth so that even the non-FSAA would appear faster compared to the previous generation card. But it would be "slower" compared to what it would have been if they hadn't changed the optimization patterns to benefit FSAA instead.
If that's the interview you're referring to, then I remember it too, and no I don't have a copy.<shrugs>
Okay, this interview is from November 15, 2002 and sounds very similar to (tho shorter than) what I remember the previous interview saying. So I'd say the previous thinking given above appears to still be in place:
http://www.extremetech.com/article2/0,3973,713565,00.asp
First and foremost, according to David Kirk, "4X AA is the design center of NV30. It is the most efficient operation of the hardware…. The focus is making 4X FSAA run fast, so everything is sized, everything is computed, and every data path's width is targeted at 4X FSAA."
Originally posted by Geo
Okay, this interview is from November 15, 2002 and sounds very similar to (tho shorter than) what I remember the previous interview saying. So I'd say the previous thinking given above appears to still be in place:
http://www.extremetech.com/article2/0,3973,713565,00.asp
First and foremost, according to David Kirk, "4X AA is the design center of NV30. It is the most efficient operation of the hardware…. The focus is making 4X FSAA run fast, so everything is sized, everything is computed, and every data path's width is targeted at 4X FSAA."
The "data path's width" bit might be quite interesting.
Wouldn't that mean the 128 bit bus usage was optimized for 4X AA? So, if the performance hit is lower with 4XAA... It might largely be because it's more optimized than before with 4XAA, and you can't optimize as much for other modes then ( of course, 2X AA's performance cost isn't going to be higher... )
Uttar