So, what's next? Let's begin the NV40 speculation!


Uttar
12-21-02, 08:49 AM
Hello everyone,

Just felt like posting a 100% speculation thread. Of course, if anyone has any sources, feel free to share them. But the point here is mostly to understand "The NV30 is great and stuff, but how could you even enhance that?!"
I'm not going to talk about the NV35 here, as I'd guess it's nothing more than increased efficiency, memory and clock speeds. Maybe a few more things, but nothing more exciting than NV10->NV15 and NV20->NV25; no amazing features, but real nice performance increases.

Now, the first obvious step is PS3.0 and VS3.0
It's fairly likely the R400 will support both, but nothing is certain yet and we could be disappointed by ATI. However, what is pretty much certain is that the NV40 will support both of them. After all, with more than 200M transistors ( 0.09 micron easily allows that ), not supporting that would be insane.

But let's suppose both VS3.0 and PS3.0 are in the NV40. That's great; but what are we getting besides that?
In the past, nVidia added features to nearly every generation of GPU.
For the TNT, it was 32 bit color
For the GF1, it was T&L
For the GF3, it was programmable shaders, LMA ( Z-related optimizations ) & Multisampling
For the GFFX, it's higher color precision ( 64BPP + 128BPP ) & Intellisample ( Color & Aniso optimizations )

Now, all of those systems could get higher efficiency with the NV40. But that's more of an NVx5 ( NV35/NV45 for example ) advancement. So... What could they invent, really?

First of all, a new channel added to the standard RGBA: Luminance. I think it's John Carmack who asked for this, and I don't think it would be too hard to implement it. So, if they can get some free JC quotes, they'd obviously implement it ASAP :)
Now, if JC never asked for this, maybe I'm just imagining things. Anyone care to correct/certify this?

Next is the Programmable Primitive Processor.
Currently, the Primitive Processor is quite simply known as the "Triangle Setup" stage of the GPU. It is *very* fast, and it's absolutely never the bottleneck. This stage was introduced by the Voodoo 2 IIRC.
The original Voodoo still asked the CPU to do that - so, with the original Voodoo, you could use the CPU as a programmable primitive processor, and you weren't required to use triangles. It simply was preferred, because they were a lot faster than most other forms, and a lot more flexible ( you could do every type of primitive with triangles, it just sometimes requires a lot of them )

The idea with the current Triangle Setup engine is very simple: determine the edges using DDA, then determine the value of every variable at each pixel from how near that pixel is to each of the three vertices. That's fast, and effective.
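
To make the interpolation step concrete, here is a minimal Python sketch. It assumes the "how near the pixel is to each vertex" weighting is plain barycentric weighting, it ignores perspective correction, and it uses edge functions instead of a DDA edge walk. The names edge and interpolate are made up for illustration; this is not a description of how any particular chip does it.

```python
def edge(a, b, p):
    """Twice the signed area of triangle a-b-p; the sign tells which side of a->b the point p is on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def interpolate(v0, v1, v2, attrs, p):
    """Interpolate one per-vertex attribute at pixel p, or return None if p is outside the triangle."""
    area = edge(v0, v1, v2)
    if area == 0:
        return None  # degenerate triangle: nothing to draw
    # Each weight is the share of the triangle's area opposite one vertex.
    w0 = edge(v1, v2, p) / area
    w1 = edge(v2, v0, p) / area
    w2 = edge(v0, v1, p) / area
    if w0 < 0 or w1 < 0 or w2 < 0:
        return None  # pixel lies outside at least one edge
    return w0 * attrs[0] + w1 * attrs[1] + w2 * attrs[2]
```

For example, interpolate((0, 0), (4, 0), (0, 4), [1.0, 0.0, 0.0], (1, 1)) gives 0.5, because the pixel gets half of its weight from the first vertex. The point is that every edge here is a straight line, which is exactly the limitation discussed next.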

The problem, however, is that those edges cannot be curves. They are simple lines. And that's why you may sometimes require a lot more triangles ( and thus vertices )
With a programmable primitive processor, you could do curves and a lot of other things. It would all be very nice; it's a HOS dream.
It would potentially save a lot of vertex processing power while increasing curve quality. And what does that mean? Well, with fewer vertices, there is more power available for each vertex. Which means the NV30's insanely high instruction limits might actually become useful! :)

But a problem shows up with a programmable primitive processor. It's very nice to be able to do curves and stuff... But it would be even nicer if you weren't limited to only 3 vertices per primitive. A lot nicer.
But doing a DIP ( DrawIndexedPrimitive ) call simply for a few rare 7-vertex primitives would be highly inefficient. So, what's the solution?
The index list system would probably be modified. Instead of assuming 3 vertices per primitive because it's a triangle list, for example, all of that would have to be specified manually. So, who said AGP 8X would not be useful for years? :)

The idea would instead be to begin every primitive with a header saying how many vertices are used and what the values of the per-primitive variables are ( yes, there'd be variables per primitive too )
After that, you'd have the vertex used for each index. Vertex Caches would thus still be effective. As for Primitive Strips... Well, you could obviously say that the first two vertices of a primitive are the last two vertices of the primitive just before. But there's a problem here:
Degenerate Triangles were acceptable because Triangle Setup was the fastest stage. Now, in some cases, it might really no longer be. So, I think stripping might become significantly less useful, but it'll still be used for specific cases obviously.
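
As a rough illustration of the header idea, a flat index stream could be decoded like this. The layout (vertex count, then per-primitive variable count, then the variables, then the indices) and the name parse_primitives are assumptions made up for this example; they are not a real API or a real format.

```python
def parse_primitives(stream):
    """Split a flat list into (per_primitive_variables, vertex_indices) pairs.

    Hypothetical layout of each primitive:
        [vertex_count, variable_count, variables..., indices...]
    """
    primitives = []
    pos = 0
    while pos < len(stream):
        if pos + 2 > len(stream):
            raise ValueError("truncated primitive header")
        vertex_count, variable_count = stream[pos], stream[pos + 1]
        pos += 2
        variables = stream[pos:pos + variable_count]
        pos += variable_count
        indices = stream[pos:pos + vertex_count]
        pos += vertex_count
        if len(variables) != variable_count or len(indices) != vertex_count:
            raise ValueError("truncated primitive body")
        primitives.append((variables, indices))
    return primitives
```

With this layout, a curved 3-vertex primitive with one variable and a 7-vertex primitive with none can sit in the same stream, so a single draw call covers both. Indices still point into a shared vertex buffer, so a vertex cache keeps working as before.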


So, those are all very nice features. But we've got a new problem here. With true curves, aliasing is guaranteed to be HORRIBLE. And I mean it.
So, what could be done on the AntiAliasing front?

Like it or not, the next step is Fragment Antialiasing.
And yes, Matrox already implemented that. But from my understanding, it isn't exactly traditional Fragment Antialiasing. I may have to research a little more on that, and it certainly merits an entire new thread IMO. But to summarize it, I think it would simply be at most 4 colors for each pixel, plus 16 masks of 2 bits each to select which of those colors each sample uses.
Such an algorithm would probably in many cases give something like 16x quality in best cases and 4x quality in worst cases, all of that at the cost of 6x - And really, most of the time, it would be 9x quality or better.
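
Here is a small Python sketch of how such a pixel could be resolved. It assumes the 16 two-bit masks are packed into one 32-bit integer, one mask per sample, and that unused color slots are padded so there are always four entries. The name resolve_pixel is hypothetical, and this is just one possible reading of the scheme, not Matrox's or anyone else's actual implementation.

```python
def resolve_pixel(colors, mask):
    """Average the 16 sub-samples of one pixel.

    colors: exactly four (r, g, b) tuples stored for the pixel (pad unused slots).
    mask:   32-bit integer; bits 2*i and 2*i+1 hold the color index of sample i.
    """
    total = [0, 0, 0]
    for i in range(16):
        index = (mask >> (2 * i)) & 0b11  # 2-bit selector for sample i
        for channel in range(3):
            total[channel] += colors[index][channel]
    return tuple(value / 16 for value in total)
```

A pixel fully covered by one surface uses a single color and a mask of 0, while an edge pixel crossed by two surfaces gets 16 levels of blending between them. Quality only drops back towards the 4x case when more than four surfaces touch the same pixel.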

But with 6x Antialiasing-like costs, memory is again going to be a problem. With the NV30, nVidia introduced their Intellisample technology, whose primary advantage is Color Compression, even when not using FSAA ( which suggests it's a significantly more advanced algorithm than ATI's, but that might also be a problem as in extreme cases, it might be less efficient than ATI's algorithm. I guess benchmarks will tell us that. )
However, now that we're optimizing both Z & Color... What's left, really? Yes, increasing efficiency, but that can't do miracles.

Well, if we exclude Color, Z and the Frame Buffer from the equation... What's left?
Textures.

Yes, we currently have texture compression ( DXTC for example )
But none of it is based on FP ( Floating Point ) , so there may be a few possible tricks there.

Remember the 64BPP Texture Compression rumor for the NV30? Nobody ever trusted it, because it had no basis and it seemed too amazing to be true.
But as strange as this may seem, if nVidia wants to gain efficiency in their memory bandwidth systems, it's the only optimization left.

I wouldn't say they're going to do it around FP or anything. But what's nearly certain, however, is that DXTC isn't sufficient anymore. Or maybe they're going to push PS-generated colors instead of textures a little more, but that seems insane. So I wouldn't bet on that at all.


Considering 0.09 should enable nVidia to do a 200M+ transistors part with the NV40, I don't think all of this is impossible at all.

A last point here: the R400. I don't think it'll have a primitive processor or Fragment AA. After all, it's still 0.13, and putting too many things in it would be way too hard. But we might see a few of the things the NV40 will have, thus nearly certainly giving it an edge over the NV30 & NV35 feature-wise. As for performance, it'll nearly certainly beat the NV30. As for the NV35, I'm not betting anything here - we know way too little right now.

Expected NV40 paper launch: Between February 2004 and early November 2004.
Expected NV40 in-store availability: Between April 2004 and late December 2004.


Hope you enjoyed this huge post. It was fun to write, as it's always great to imagine what the future holds for us. I'm certain I made a fair number of mistakes or showed some obvious bias towards nVidia. If you spot them, please post them! :)


Uttar

EDIT: I'd just like to remind all of you that the Fragment Antialiasing & Programmable Primitive Processor systems could be implemented in a LOT of ways. This is just one way to do it, and it's unlikely that exactly this way is going to be used.
Also, as always, this is nothing more than speculation. No sources here, people. If you want that type of info, go to The Inquirer.

nutball
12-21-02, 09:50 AM
I'd be very surprised to see an engine which supports any sort of HOSs which don't get turned into triangles at some point in the rendering pipeline.

The lowly triangular polygon has just so many things going for it (for a start: simplicity), that I think it'll be some time yet before it's displaced from its current position as the lynchpin of modern CGI.

Uttar
12-21-02, 10:05 AM
Originally posted by nutball
The lowly triangular polygon has just so many things going for it (for a start: simplicity), that I think it'll be some time yet before it's displaced from its current position as the lynchpin of modern CGI.

I don't think the triangle will really be destroyed. It might still be used in most places. But the flexibility to use other types of primitives would be very nice for specific cases.

Of course, triangles are a LOT simpler. And that's certainly why they were chosen originally. But today, that no longer makes sense. Triangle Setup is significantly faster than the other stages and it'll become even faster in relative terms, simply because the other stages are becoming slower and slower ( which is compensated by increased frequencies & concurrency )

Triangle Setup became TOO basic, so making it more complex is the next logical step to increase visual quality. And even if it's not in the NV40, then it'll certainly be in the R500, released at around the same time if all goes as planned.

BTW, I forgot to note in the original post that for optimal performance & flexibility, a programmable primitive processor might require Dynamic Branching, so that switching between loads of different programs isn't required and different types of primitives can be in the same model.


Uttar

Mod
12-21-02, 10:31 AM
I think it's going to be released cheaper. For a good reason and a good end:

good reason:

Since we have the "physical limitation of transistor miniaturization versus chip size" problem, as explained in this thread ( http://www.nvnews.net/vbulletin/showthread.php?s=&threadid=5280 ), the engineers will naturally end up working longer to produce the NV40 and others.

Keeping this in mind, the R&D costs will be spread over time, and so be easily absorbed by nVidia's earnings over this period. Besides, there will be more time for the 90nm production process to mature.

good end:

To stimulate the sales of video cards, since they have been decreasing considerably in the last 2 years, due to their high prices.
Take a look here: http://www.startribune.com/stories/535/3524128.html


So I think NV40 may be released even at US$ 250.00 or a bit lower.

I know this is a rather low estimate, but I am just guessing that video cards will keep losing market share ( so nVidia will make an effort to lower the price ), and for the reasons explained above.

nutball
12-21-02, 10:50 AM
Originally posted by Uttar
I don't think the triangle will really be destroyed. It might still be used in most places. But the flexibility to use other types of primitives would be very nice for specific cases.


And therein lies the problem. Higher-order surfaces aren't really dreadfully useful. Some breeds of them are quite difficult to work with.

The way I look at it, I can't see that in the short to medium term future generations of consumer hardware will start to use techniques beyond those used in high-end CGI currently. At the moment all high-end modelling, regardless of whether HOSs are used at all, ends up producing triangle meshes.

As it stands there's more than enough potential in the existing pipeline for it to have a life of at least a few more years yet. I mean this both in the sense of its creative possibilities, and the scope for increasing performance by throwing more transistors at the problem (there's still a great deal of potential parallelism to be exploited).

What do I think NV40 and friends look like? I think it will be wider, faster and deeper. I think rendering to intermediate buffers will become more of a focus of the hardware. Increased flexibility in this sort of area. I don't think NV40/R400 will be a leap forward like NV30/R300 are.

Mod
12-21-02, 11:42 AM
Originally posted by nutball
And therein lies the problem. Higher-order surfaces aren't really dreadfully useful. Some breeds of them are quite difficult to work with.


Correct me if I am wrong, but Uttar was telling us a way to implement HOS at hardware level that helps implement it at software level without the difficulties programmers find nowadays.

Uttar
12-21-02, 01:21 PM
First of all, that method wouldn't be amazingly easy to work with for programmers. But IMO, it might be easier than the current systems IF the engine was designed around it from the start.

Trying to "adapt" an engine to use such programmable primitive processors ( let's call it a PPP ) might lead to horrible results.

*However* , I did make one important assumption in my reasoning: That a PPP would be able to handle primitives with more or fewer than 3 vertices ( example of fewer: a circle. You could specify the center and a few parameters to describe what happens when you move out from the center )

Nothing proves that. Maybe we're getting a PPP which requires 3 vertices, no matter what.
Maybe we aren't going to see more flexibility until the NV50 or NV60. We simply don't know.

A PPP which only works with 3 vertices would be interesting, and fairly easy to implement. And for programmers, it would obviously be easier to convert to it ( Don't expect a patch/expansion pack doing that, however! It's still a fairly significant change... )
The idea, then, might be to determine the edges of the triangle based on per-primitive info ( & constants )
So, every primitive would get variables such as the intensity of the curve. It might also use some more complex system to determine the values of each pixel in the primitive. Stuff like that.
Nothing as revolutionary as being able to use more than 3 vertices, but it would still be very nice ( Note, however, that using fewer than 3 vertices might be possible: simply pass an empty vertex as index! )

Such a system would still allow indexing for vertices. I guess it would be a lot more realistic to hope for that in the NV40.

One question remains: If a primitive has 3 vertices, but its edges aren't necessarily straight lines linking A to B, A to C and B to C, then is it still a triangle?


Uttar

nutball
12-21-02, 02:59 PM
Originally posted by Mod
Correct me if I am wrong, but Uttar was telling us a way to implement HOS at hardware level that helps implement it at software level without the difficulties programmers find nowadays.

That wasn't what I meant by "difficult". Sure, some programmers have neither the maths nor programming capabilities to do HOSs. I think you'll find the top guys at the top games firms have both.

The problem is that HOSs, at least of the traditional form (cubic patches, and so on) just aren't a good approximation for much of the real world. They are difficult to use, not difficult to program.

nutball
12-21-02, 03:04 PM
Originally posted by Uttar
*However* , I did make one important assumption in my reasoning: That a PPP would be able to handle primitives with more or fewer than 3 vertices ( example of fewer: a circle. You could specify the center and a few parameters to describe what happens when you move out from the center )

So is what you're suggesting a vertex program which can generate new vertices? This sort of model might be useful.

Point is, a circle can be approximated to the resolution of the screen by a collection of straight lines. In the same way a curved 3D surface can be approximated by an arbitrary number of planar patches (ie. triangles).

A triangular patch (which is what you're asking about in your final question) can be approximated by a collection of planar triangles. So what does all the extra maths buy you which some simple maths and some brute force doesn't? Not a lot.
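
To put a rough number on "approximated to the resolution of the screen", here is a small Python sketch. It assumes the approximation is a regular polygon inscribed in the circle and that "good enough" means the chord never strays more than half a pixel from the true circle. The function name and the half-pixel default are illustrative assumptions only.

```python
import math


def segments_for_circle(radius_px, tolerance_px=0.5):
    """Smallest number of straight segments whose chords stay within tolerance_px of the circle."""
    if radius_px <= tolerance_px:
        return 3  # anything this small is covered by the minimum polygon
    # The largest gap between a chord and the arc is r * (1 - cos(pi / n)).
    half_angle = math.acos(1.0 - tolerance_px / radius_px)
    return max(3, math.ceil(math.pi / half_angle))
```

The segment count grows only with the square root of the radius, which is why brute force scales reasonably well here: a circle 100 pixels in radius needs a few dozen segments, not hundreds.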

Uttar
12-21-02, 04:16 PM
Originally posted by nutball
So is what you're suggesting a vertex program which can generate new vertices? This sort of model might be useful.

Actually, no, I'm not suggesting a vertex program which can generate new vertices. The PPP would simply treat the vertices in another way than the traditional Triangle Setup one, using variables & Constants ( VS/PS also uses variables & constants; so it's the same idea )

Originally posted by nutball
Point is, a circle can be approximated to the resolution of the screen by a collection of straight lines. In the same way a curved 3D surface can be approximated by an arbitrary number of planar patches (ie. triangles).

A triangular patch (which is what you're asking about in your final question) can be approximated by a collection of planar triangles. So what does all the extra maths buy you which some simple maths and some brute force doesn't? Not a lot.

That's where you're wrong.
With traditional HOS, you'd transform a triangle patch ( thanks for giving me the correct term, BTW :) ) into many triangles.
And thus... many vertices.
The NV30's amazing VS limitations are currently really, really... insane. I don't think anyone is going to use that much for a while because it kills performance very easily if you apply it to too many vertices.

The problem is that models in recent games can easily have about 2500 vertices ( let's not confuse vertices & polys, people. There could be nearly 4 times more polys than vertices in models, most of the time )
And why are they using so god darn many vertices/polys? To make lines look like curves.

Let's consider the following image, released by ATI to advertise the Radeon 9700 Displacement Mapping:

http://www.pl.tomshardware.com/graphic/02q3/020718/images/displacementmap.gif

In the Displacement Mapped N-Patch example, you get 10 vertices instead of 3 vertices. Now, the Vertex Shader has to work on all of those vertices, roughly tripling its workload!

If we want to be able to use more instructions per vertex, it might be useful to use FEWER vertices. And that's exactly what a PPP enables. You've only got to run the huge instruction program for 3 vertices instead of the 10 vertices and you've got a better curve quality.
All of that because the PPP can use those 3 vertices to determine the pixels used for the primitive instead of requiring 10 vertices. It's a very significant saving.

In other words, yes, Brute Force can do it. But you'd have to sacrifice Vertex Shading quality. And you wouldn't want that, now, would you? :)
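
For reference, the 10-vertex figure matches a triangular patch split into 3 segments per edge, which is an assumption about what the image shows. A short Python sketch of how the counts grow with tessellation level (the function names are made up for this example):

```python
def patch_vertex_count(segments_per_edge):
    """Vertices in a triangular patch uniformly subdivided into n segments per edge."""
    n = segments_per_edge
    return (n + 1) * (n + 2) // 2


def patch_triangle_count(segments_per_edge):
    """Small triangles produced by the same subdivision."""
    return segments_per_edge ** 2


def vertex_work_ratio(segments_per_edge):
    """How many times more vertices the shader sees compared with the original 3."""
    return patch_vertex_count(segments_per_edge) / 3
```

With 1 segment per edge you get the original 3 vertices; with 3 segments per edge you get 10 vertices and 9 triangles, so a bit more than three times the vertex work for that patch. In a full mesh the vertices on patch edges are shared between neighbours, so the real ratio depends on the model.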


Uttar

Mod
12-21-02, 08:01 PM
Originally posted by nutball
The problem is that HOSs, at least of the traditional form (cubic patches, and so on) just aren't a good approximation for much of the real world. They are difficult to use, not difficult to program.

There was an ATI video where they showed a dolphin made with HOS. It looked awesome to me.

Maybe most programming teams don't use HOS because most of the people that buy the games aren't bothered by the blocky graphics, since the overall eye candy and the simple increase in polys in the models compensate for the blocky effect.

Uttar
12-22-02, 04:11 AM
Originally posted by Mod
There was an ATI video where they showed a dolphin made with HOS. It looked awesome to me.

I think his point is that besides dolphins, you ain't gonna find many real world uses :D

What about making SimDolphin?


Uttar

nutball
12-22-02, 04:16 AM
Originally posted by Mod
There was an ATI video where they showed a dolphin made with HOS. It looked awesome to me.

Maybe most programming teams don't use HOS because most of the people that buy the games aren't bothered by the blocky graphics, since the overall eye candy and the simple increase in polys in the models compensate for the blocky effect.

Yes, that was precisely my point! :) Demos are fine and that, but they are only demos. ;)

nutball
12-22-02, 04:40 AM
Originally posted by Uttar
Actually, no, I'm not suggesting a vertex program which can generate new vertices. The PPP would simply treat the vertices in another way than the traditional Triangle Setup one, using variables & Constants ( VS/PS also uses variables & constants; so it's the same idea )


Ah, OK. Interestingly enough, I think that might be a feature we're going to see in the near future (spawning new vertices in a vertex program).


That's where you're wrong.
With traditional HOS, you'd transform a triangle patch ( thanks for giving me the correct term, BTW :) ) into many triangles.
And thus... many vertices.
The NV30's amazing VS limitations are currently really, really... insane. I don't think anyone is going to use that much for a while because it kills performance very easily if you apply it to too many vertices.


Right, so the solution you're suggesting makes life easier for the vertex shading unit, at least in the sense that there's fewer vertices for it to work on.

Trouble is it makes life a lot more difficult for other parts of the pipeline. A few examples which come to mind:

a) Calculating normals to do per-pixel lighting is a lot more difficult to do for non-planar patches. Currently of course much per-pixel lighting is starting to be done with bump maps (specified using textures), so maybe this is less of an issue. Even with bump mapping I think you still need to be able to calculate the normal to your surface at a given point.

b) Rasterisation. Scan-converting a non-planar patch is not trivial! Try writing some code to do it, and compare that with the few lines of code it takes to rasterise a planar polygon. Then translate that into transistors!

c) Interpolation. This is somewhat related to b), but the advantage of using planar polygons is that it's trivial to calculate a "distance" over the surface from one point to another (Pythagoras' theorem). Hence you can interpolate parameters (eg. texture coordinates) across a surface in a realistic looking way. On a non-planar surface this is no longer simple. You're into line integrals, and all sorts of fun like that. As an example, can you write down an expression for the distance travelled along a 2-dimensional Bezier curve as a function of its parametric variable? It's very challenging. Even if there is an analytic solution (my maths runs out waaay before elliptic integrals), it's a right pain to implement. That's an often required operation, on a simple, well understood 2-D cubic curve.
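
To illustrate point c), here is a minimal Python sketch of the usual workaround: approximating the distance travelled along a 2-D cubic Bezier curve numerically, by chopping it into short straight segments. The function names, the step count and the polyline method are choices made for this example, not something taken from any hardware design.

```python
import math


def bezier_point(p0, p1, p2, p3, t):
    """Evaluate a 2-D cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    b0, b1, b2, b3 = u * u * u, 3 * u * u * t, 3 * u * t * t, t * t * t
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)


def bezier_length(p0, p1, p2, p3, t_end=1.0, steps=1000):
    """Approximate the distance travelled from t = 0 to t = t_end with a polyline."""
    length = 0.0
    prev = bezier_point(p0, p1, p2, p3, 0.0)
    for i in range(1, steps + 1):
        cur = bezier_point(p0, p1, p2, p3, t_end * i / steps)
        length += math.hypot(cur[0] - prev[0], cur[1] - prev[1])
        prev = cur
    return length
```

Compare that loop of a thousand evaluations with the planar case, where the same distance is a single square root. Doing anything like this per pixel is what makes interpolation over curved patches expensive.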

I'm sure there are many other parts of the pipeline which become more difficult, it's too early on a Sunday morning to think of them!


Let's consider the following image, released by ATI to advertise the Radeon 9700 Displacement Mapping:

http://www.pl.tomshardware.com/graphic/02q3/020718/images/displacementmap.gif

In the Displacement Mapped N-Patch example, you get 10 vertices instead of 3 vertices. Now, the Vertex Shader has to work on all of those vertices, roughly tripling its workload!

If we want to be able to use more instructions per vertex, it might be useful to use FEWER vertices. And that's exactly what a PPP enables. You've only got to run the huge instruction program for 3 vertices instead of the 10 vertices and you've got a better curve quality.
All of that because the PPP can use those 3 vertices to determine the pixels used for the primitive instead of requiring 10 vertices. It's a very significant saving.


Thing is that vertex shaders, as I understand their implementation, are distinct functional units. Extra units can be added pretty easily, given a large enough transistor budget.

Vertex bandwidth across host interfaces is also an issue, but can also be addressed by increasing the quantity of local memory on the card.

Extra vertex units and more memory are both "more of what we've got now", and as such are a hell of a lot easier to implement than re-jigging large fractions of the whole pipeline to deal with non-planar polygons.


In other words, yes, Brute Force can do it. But you'd have to sacrifice Vertex Shading quality. And you wouldn't want that, now, would you? :)


Ahhh, well this is the age old question of which is better, an elegant solution or a brute force solution!

My memory of the history of the information technology sector says to me that 9 times out of 10 brute force will win, because it's cheaper to develop and gets to market first.

Hence Windows, the x86 architecture, and so on.

Mod
12-22-02, 10:05 AM
Originally posted by Uttar
I think his point is that besides dolphins, you ain't gonna find many real world uses :D
Uttar

I think there are many things HOS could be used for.

Fingers, arms, noses, everything that has a round shape in a human body. You make these sub-models independently and then connect them to make the final model ( the human body in this case ).

In airplanes, there are many round parts too. Mainly jets ( they look like dolphins... ).

You could make very round hills in flight simulators, etc...

The result would be very good, but it is easier to make everything in compact and high poly models, and cheat in the end with eye candy.

Uttar
12-22-02, 05:36 PM
Originally posted by nutball
Ah, OK. Interestingly enough, I think that might be a feature we're going to see in the near future (spawning new vertices in a vertex program).

Right, so the solution you're suggesting makes life easier for the vertex shading unit, at least in the sense that there's fewer vertices for it to work on.
Trouble is it makes life a lot more difficult for other parts of the pipeline. A few examples which come to mind:
...
I'm sure there are many other parts of the pipeline which become more difficult, it's too early on a Sunday morning to think of them!
...
Ahhh, well this is the age old question of which is better, an elegant solution or a brute force solution!

My memory of the history of the information technology sector says to me that 9 times out of 10 brute force will win, because it's cheaper to develop and gets to market first.

Hence Windows, the x86 architecture, and so on.

Hmm, all of those are good points. I'll have to check for a couple of those points to be certain there are no not-too-hard ways to fix those.

Just one thing I'd like to add: time isn't really an issue here. Everything is redesigned every generation. And the most important factor is that it's compatible with the previous generation. So, PPP could take a long time to be in GPUs. But someday, it'll be in them. My bet is either NV40 ( but in a rudimentary form, with many limitations ) or NV50 ( with still some limitations )

As for Vertex Programs spawning new Vertices... It's unlikely to come any time soon, and nearly certainly not before PPP. It's a fundamental change in the way you think about shading programs. The current idea is to simply execute instruction after instruction, without worrying about the outside world. That works just plain great.
But if you want to spawn new vertices, the increased level of flexibility required is HUGE. Probably not as important as PPP, but you'd really need a good reason to add all that flexibility! A really, really good one.
For example... Where is the index for the vertex going to come from? And that's just one of the billions of problems coming with such an idea.

Mod: I'm no HOS expert, but I think what he meant to say is that HOS currently isn't sufficiently flexible, and that the examples you just gave probably couldn't be implemented without cutting some corners, lowering the visual quality and not giving a result as amazing as with the dolphin. Did I understand that right?


Uttar

sancheuz
12-22-02, 07:09 PM
Nv40:
Clock Rate: 2.1 GHz
DDR III at 2 GHz effective
7nm die size
512 bit memory
853 Gb/s transfer
8 TMUs
30 Render Pipes
340 million transistors

Capabilities:
Polygons: 1.5 trillion
AA: 46x at no frame cost
256 bit color capabilities

Price: $2,489.25

Mod
12-22-02, 08:52 PM
If NV40 is released in H1 2004, it won't have more than 190 million transistors because they have to follow Moore's law from now on.

Uttar
12-23-02, 05:41 AM
Originally posted by Mod
If NV40 is released in H1 2004, it won't have more than 190 million transistors because they have to follow Moore's law from now on.

Actually, after Huang's recent comment on lengthening product cycles, I'm really beginning to think the NV40 will be more of an H2 2004 part.
So we'll see 1 refresh of the NV30, and one refresh of the NV35. That would be as if we had had 1 refresh of the GeForce 4 ( and we never had one besides the crappy NV28, since the NV30 was always expected to come sooner )

My bet is 230M transistors. That seems to be about what you'd get if you consider 120M transistors and use 0.09 instead of 0.13

As for clock rate... I'd say it all depends on whether Black Diamond works. But in H2 2004, I'd be surprised if TSMC still hadn't figured it out...
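
The back-of-the-envelope version of that estimate, as a small Python sketch. It assumes the same die area and ideal scaling, where transistor density goes up with the square of the feature-size ratio; real processes rarely scale that cleanly, and the function name is made up for this example.

```python
def scaled_transistors(count_millions, old_process_um, new_process_um):
    """Transistor count for the same die area under ideal area scaling."""
    linear_shrink = old_process_um / new_process_um
    return count_millions * linear_shrink ** 2
```

Calling scaled_transistors(120, 0.13, 0.09) gives roughly 250 million, so a bet of 230M sits a little below the ideal figure for a die of the same size.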


Uttar

Mod
12-23-02, 08:30 AM
Originally posted by Uttar

Mod: I'm no HOS expert, but I think what he meant to say is that HOS currently isn't sufficiently flexible, and that the examples you just gave probably couldn't be implemented without cutting some corners, lowering the visual quality and not giving a result as amazing as with the dolphin. Did I understand that right?

Not necessarily. Because in HOS the inclination of the curves that make up the patched surfaces can be equal, or almost equal, at the extremal control points ( usually the primitives ) of the curve. So, by adjusting the frontier points, the corner effects you mentioned could go unnoticed, but this is difficult to get right.