Uttar
12-21-02, 08:49 AM
Hello everyone,
Just felt like posting a 100% speculation thread. Of course, if anyone has any sources, feel free to share them. But the point here is mostly to answer the question: "The NV30 is great and stuff, but how could you even improve on that?!"
I'm not going to talk about the NV35 here, as I'd guess it's nothing more than increased efficiency, memory and clock speeds. Maybe a few more things, but nothing more exciting than NV10->NV15 and NV20->NV25; no amazing features, but real nice performance increases.
Now, the first obvious step is PS3.0 and VS3.0.
It's fairly likely the R400 will support both, but nothing is certain yet and we could be disappointed by ATI. However, what is pretty much certain is that the NV40 will support both of them. After all, with more than 200M transistors ( 0.09 easily allows that ), not supporting that would be insane.
But let's suppose both VS3.0 and PS3.0 are in the NV40. That's great; but what are we getting besides that?
In the past, nVidia added features to nearly every generation of GPU. For the TNT, it was 32 bit color
For the GF1, it was T&L
For the GF3, it was programmable shaders, LMA ( Z-related optimizations ) & Multisampling
For the GFFX, it's higher color precision ( 64BPP + 128BPP ) & Intellisample ( Color & Aniso optimizations )
Now, all of those systems could get higher efficiency with the NV40. But that's more of an NVx5 ( NV35/NV45 for example ) advancement. So... What could they invent, really?
First of all, a new channel added to the standard RGBA: Luminance. I think it was John Carmack who asked for this, and I don't think it would be too hard to implement. So, if they can get some free JC quotes, they'd obviously implement it ASAP :)
Now, if JC never asked for this, maybe I'm just imagining things. Anyone care to correct/confirm this?
Next is the Programmable Primitive Processor.
Currently, the Primitive Processor is quite simply known as the "Triangle Setup" stage of the GPU. It is *very* fast, and it's absolutely never the bottleneck. This stage was introduced by the Voodoo 2 IIRC.
The original Voodoo still asked the CPU to do that - so, with the original Voodoo, you could use the CPU as a programmable primitive processor, and you weren't required to use triangles. Triangles were simply preferred, because they were a lot faster than most other forms, and a lot more flexible ( you could do every type of primitive with triangles, it just sometimes requires a lot of them ).
The idea with the current Triangle Setup engine is very simple: determine the edges using DDA, then determine the value of every variable at each pixel from how near that pixel is to each of the three vertices. That's fast, and effective.
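A minimal Python sketch of those two steps, with the assumptions stated up front: edge_dda walks one edge scanline by scanline using a fixed x increment (the DDA part), and barycentric turns "how near the pixel is to each vertex" into three weights. Barycentric weights are a stand-in here, not a claim about what any real chip does, and the function names and the integer-scanline convention are made up for illustration.

```python
def edge_dda(x0, y0, x1, y1):
    """Yield (y, x) for each integer scanline y in [min(y0, y1), max(y0, y1)).

    Assumes y0 and y1 are integers. x advances by a constant slope per
    scanline, which is the whole point of a DDA: one add per step.
    """
    if y0 > y1:
        x0, y0, x1, y1 = x1, y1, x0, y0
    if y0 == y1:
        return  # a horizontal edge crosses no scanline
    slope = (x1 - x0) / (y1 - y0)  # change in x for one step in y
    x = float(x0)
    for y in range(y0, y1):
        yield y, x
        x += slope


def barycentric(p, a, b, c):
    """Return the weights (wa, wb, wc) of 2D point p in triangle a, b, c."""
    def cross(o, u, v):
        # z component of (u - o) x (v - o): twice a signed triangle area
        return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])

    area = cross(a, b, c)
    if area == 0:
        raise ValueError("degenerate triangle")
    return (cross(p, b, c) / area,
            cross(p, c, a) / area,
            cross(p, a, b) / area)


def interpolate(p, triangle, values):
    """Blend one per-vertex value (a color channel, a UV, ...) at point p."""
    weights = barycentric(p, *triangle)
    return sum(w * v for w, v in zip(weights, values))
```

For the triangle (0,0), (4,0), (0,4) and the pixel (1,1), the weights come out as 0.5, 0.25 and 0.25, so per-vertex values 0, 4 and 8 blend to 3. Note that everything here is linear in x and y, which is exactly why the edges have to be straight.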
The problem, however, is that those edges cannot be curves. They are simple lines. And that's why you may sometimes require a lot more triangles ( and thus vertices )
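To put a rough number on "a lot more", here is a small Python sketch that counts how many straight edges it takes to approximate a full circle within a given error. The function name and the choice of a circle as the test curve are assumptions for illustration only; the formula is the standard chord-to-arc (sagitta) bound.

```python
import math


def segments_for_circle(radius, max_error):
    """Number of straight segments needed so that no chord strays more than
    max_error from a circle of the given radius (same units for both)."""
    if radius <= 0 or max_error <= 0:
        raise ValueError("radius and max_error must be positive")
    if max_error >= radius:
        return 3  # anything closed is already within tolerance
    # A chord spanning angle theta deviates by radius * (1 - cos(theta / 2)).
    half_angle = math.acos(1.0 - max_error / radius)
    return max(3, math.ceil(math.pi / half_angle))
```

Under these assumptions, a circle 100 pixels in radius needs 32 segments to stay within half a pixel, and the count grows roughly with the square root of radius / max_error. A primitive with real curved edges would describe the same outline with a handful of control points.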
With a programmable primitive processor, you could do curves and a lot of other things. It would all be very nice; it's a HOS ( higher-order surfaces ) dream.
It would potentially save a lot of vertex processing power while increasing curve quality. And what does that mean? Well, with fewer vertices, there is more power available for each vertex. Which means the NV30's insane limits might actually become useful! :)
But a problem shows up with a programmable primitive processor. It's very nice to be able to do curves and stuff... But it would be even nicer if you weren't limited to only 3 vertices per primitive. A lot nicer.
But doing a DIP ( DrawIndexedPrimitive ) call simply for a few rare 7-vertex primitives would be highly inefficient. So, what's the solution?
The index list system would probably be modified. Instead of assuming it's 3 vertices/tri because it's a triangle list, for example, all of that would have to be specified manually. So, who said AGP 8X would not be useful for years? :)
The idea would instead be to begin every primitive with a header saying how many vertices are used and giving the values of the variables passed for that primitive ( yes, there'd be variables per primitive too ).
After that, you'd have the index of each vertex used. Vertex Caches would thus still be effective. As for Primitive Strips... Well, you could obviously say that the first two vertices of a primitive are the last two vertices of the primitive just before. But there's a problem here:
Degenerate Triangles were acceptable because Triangle Setup was the fastest stage. Now, in some cases, it might really no longer be. So, I think stripping might become significantly less useful, but it'll still be used for specific cases obviously.
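The header-plus-indices layout described above could look something like this in Python. It is only a sketch: the field order, the extra "number of per-primitive variables" field, and both function names are made up for illustration, and a real index stream would be packed binary rather than a Python list.

```python
def encode_primitives(primitives):
    """Flatten (per_primitive_vars, vertex_indices) pairs into one stream.

    Layout per primitive (hypothetical):
        [vertex count, variable count, variables..., vertex indices...]
    """
    stream = []
    for prim_vars, indices in primitives:
        stream.append(len(indices))    # header: how many vertices follow
        stream.append(len(prim_vars))  # header: how many per-primitive values
        stream.extend(prim_vars)
        stream.extend(indices)
    return stream


def decode_primitives(stream):
    """Inverse of encode_primitives; raises on a truncated stream."""
    primitives = []
    pos = 0
    while pos < len(stream):
        if pos + 2 > len(stream):
            raise ValueError("truncated header")
        n_indices, n_vars = stream[pos], stream[pos + 1]
        pos += 2
        prim_vars = stream[pos:pos + n_vars]
        pos += n_vars
        indices = stream[pos:pos + n_indices]
        pos += n_indices
        if len(prim_vars) != n_vars or len(indices) != n_indices:
            raise ValueError("truncated primitive")
        primitives.append((prim_vars, indices))
    return primitives
```

A triangle and a 7-vertex primitive can then share one stream, and because the indices still point into the same vertex buffer, a vertex cache keyed on index works exactly as before. The cost is two extra entries per primitive, which is where the extra bus traffic comes from.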
So, those are all very nice features. But we've got a new problem here. With true curves, aliasing is guaranteed to be HORRIBLE. And I mean it.
So, what could be done on the AntiAliasing front?
Like it or not, the next step is Fragment Antialiasing.
And yes, Matrox already implemented that. But from my understanding, it isn't exactly traditional Fragment Antialiasing. I may have to research that a little more, and it certainly merits an entire new thread IMO. But to summarize, I think it would simply be a maximum of 4 colors for each pixel, plus 16 masks of 2 bits each ( one per sub-sample ) to select which color each sub-sample uses.
Such an algorithm would probably give something like 16x quality in the best cases and 4x quality in the worst cases, all at roughly the cost of 6x - and really, most of the time, it would be 9x quality or better.
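Here is a small Python sketch of that storage scheme, assuming 16 sub-samples per pixel, up to 4 stored colors, and one 2-bit color selector per sub-sample packed into a single 32-bit word. The names, the bit order and the plain box-filter resolve are all assumptions for illustration, not a description of any shipping hardware.

```python
SAMPLES = 16     # sub-samples per pixel
MAX_COLORS = 4   # distinct colors a pixel can hold, so 2 bits select one


def pack_mask(sample_indices):
    """Pack 16 color selectors (each 0..3) into one 32-bit integer."""
    if len(sample_indices) != SAMPLES:
        raise ValueError("expected one selector per sub-sample")
    mask = 0
    for i, idx in enumerate(sample_indices):
        if not 0 <= idx < MAX_COLORS:
            raise ValueError("selector out of range")
        mask |= idx << (2 * i)  # sub-sample i lives in bits 2i and 2i+1
    return mask


def resolve(colors, mask):
    """Average the 16 sub-samples into the final (r, g, b) pixel."""
    totals = [0, 0, 0]
    for i in range(SAMPLES):
        idx = (mask >> (2 * i)) & 0b11
        for channel in range(3):
            totals[channel] += colors[idx][channel]
    return tuple(t // SAMPLES for t in totals)


def color_storage_bits(bits_per_color=32):
    """Bits per pixel for the color side only: 4 colors plus 16 selectors."""
    return MAX_COLORS * bits_per_color + SAMPLES * 2
```

With red covering 12 sub-samples and blue covering 4, resolve gives (191, 0, 63), an edge gradient with 16 possible steps even though only two colors are stored. Under these assumptions the color side alone comes to 160 bits, five times a 32-bit pixel; the 6x figure is the post's own estimate and is not derived here.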
But with 6x Antialiasing-like costs, memory is again going to be a problem. With the NV30, nVidia introduced their Intellisample technology, whose primary advantage is Color Compression, even when not using FSAA ( which proves it's a significantly more advanced algorithm than ATI's, but that might also be a problem, as in extreme cases it might be less efficient than ATI's algorithm. I guess benchmarks will tell us. )
However, now that we're optimizing both Z & Color... What's left, really? Yes, increasing efficiency, but that can't do miracles.
Well, if we exclude Color, Z and the Frame Buffer from the equation... What's left?
Textures.
Yes, we currently have texture compression ( DXTC, for example ).
But none of it is based on FP ( Floating Point ), so there may be a few possible tricks there.
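For reference, this is roughly what the simplest DXTC mode ( DXT1 ) does: each 4x4 block of texels is stored in 8 bytes as two 16-bit RGB565 endpoint colors plus a 2-bit selector per texel, all in integer math. The Python decoder below is a sketch under assumptions: the function names are made up, alpha is ignored, and the rounding of the interpolated colors is one plausible choice, since implementations differ slightly there.

```python
import struct


def rgb565_to_rgb888(value):
    """Expand a 16-bit RGB565 color to 8 bits per channel."""
    r = (value >> 11) & 0x1F
    g = (value >> 5) & 0x3F
    b = value & 0x1F
    # Replicate the top bits into the low bits so that 31 maps to 255.
    return ((r << 3) | (r >> 2), (g << 2) | (g >> 4), (b << 3) | (b >> 2))


def decode_dxt1_block(block):
    """Decode one 8-byte DXT1 block into a 4x4 grid of (r, g, b) tuples."""
    if len(block) != 8:
        raise ValueError("a DXT1 block is exactly 8 bytes")
    c0_raw, c1_raw, selectors = struct.unpack("<HHI", block)
    c0 = rgb565_to_rgb888(c0_raw)
    c1 = rgb565_to_rgb888(c1_raw)
    if c0_raw > c1_raw:
        # Four-color mode: two extra colors at 1/3 and 2/3 between endpoints.
        c2 = tuple((2 * a + b) // 3 for a, b in zip(c0, c1))
        c3 = tuple((a + 2 * b) // 3 for a, b in zip(c0, c1))
    else:
        # Three-color mode: a midpoint, plus black (transparent in the
        # real format; alpha is not modeled in this sketch).
        c2 = tuple((a + b) // 2 for a, b in zip(c0, c1))
        c3 = (0, 0, 0)
    palette = (c0, c1, c2, c3)
    # Texel (row, col) uses bits 2*(4*row + col) and the one above it.
    return [[palette[(selectors >> (2 * (4 * row + col))) & 0b11]
             for col in range(4)]
            for row in range(4)]
```

That works out to 4 bits per texel, with the endpoints limited to 5/6/5-bit integer channels. The limitation is visible right in rgb565_to_rgb888: there is simply no room for the range or precision of a floating-point texel.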
Remember the 64BPP Texture Compression rumor for the NV30? Nobody ever trusted it, because it had no basis and it seemed too amazing to be true.
But as strange as this may seem, if nVidia wants to gain efficiency in their memory bandwidth systems, it's the only optimization left.
I wouldn't say they're going to build it around FP or anything. But what's nearly certain, however, is that DXTC isn't sufficient anymore. Or maybe they're going to push PS-generated colors instead of textures a little more, but that seems insane. So I wouldn't bet on that at all.
Considering 0.09 should enable nVidia to do a 200M+ transistor part with the NV40, I don't think any of this is impossible at all.
A last point here: the R400. I don't think it'll have a primitive processor or Fragment AA. After all, it's still 0.13, and putting too many things in it would be way too hard. But we might see a few of the things the NV40 will have, thus nearly certainly giving it an edge over the NV30 & NV35 feature-wise. As for performance, it'll nearly certainly beat the NV30. As for the NV35, I'm not betting anything here - we know way too little right now.
Expected NV40 paper launch: Between February 2004 and early November 2004.
Expected NV40 in-store availability: Between April 2004 and late December 2004.
Hope you enjoyed this huge post. It was fun to write, as it's always great to imagine what the future holds for us. I'm certain I made a fair number of mistakes or showed obvious bias towards nVidia. If you spot any, please post them! :)
Uttar
EDIT: I'd just like to remind all of you that the Fragment Antialiasing & Programmable Primitive Processor systems could be implemented in a LOT of ways. This is just one way to do it, and it's unlikely that exactly this approach is going to be used.
Also, as always, this is nothing more than speculation. No sources here, people. If you want that type of info, go to The Inquirer.