Re: EQ2 Shader 3.0 upgrade

That does sound more or less accurate to me. The NV35 should have been able to use upgraded FP16 units and combine them into a single 32-bit float. Obviously this would cause throughput to halve. There's no doubt that the GeForce FX 5900 was heavily engineered with FP16 in mind, as it was never really capable of running FP32 at anything but half speed. As for the GeForce FX 5800 and the NV30, I don't really think that the card was ever "designed" with DirectX 9.0 in mind. It had the capability to run FP32 through one of its shader units, but its other shader units were largely integer based, as you mentioned. So I don't really see anything wrong with your history of the FX 5900/6800.

((Well, one minor mistake: the FX 5800 was 4x2 and the 6800 was 16x1. I don't really consider the Z-pixel capabilities of either hardware as "true" pipelines, because that would make the 6800 32x1.)) But that's a pretty easy thing to forget.

The 6800 changed this up by offering 2 FP32 ALUs in 16 independent pipelines, each capable of executing 8 floating point instructions in dual issue, so double the theoretical pixel shader performance of the NV35. Probably faster than that due to the use of a SIMD programming model vs. VLIW (and better utilization as a 16x1 and not 8x2). Not sure what's going on with constant registers, but I'm assuming it's the same. It has FP16 normalize, so it still has some FP16-specific optimizations. Registers are once again orthogonal for full FP32 performance. In this case I can't imagine any cost other than space and some slight latency penalty for using FP32 compared to FP16. Assuming it used the same register combiner scheme, you would have 2x the number of FP16 registers needed, which could come in handy, maybe to ease the state change penalty associated with storing constant registers in instruction slots.
This was actually one of the hardest things for people to understand at the time. People assumed the GeForce 6 used the same shader setup as the FX cards, and it didn't. People assumed FP32 would mean a drastic slowdown on the NV40 and NV45, but the reality was that there was too much else going on in the pipeline for that ever to become a primary bottleneck. The 3Danalyze program was great for toying with this back when the FP16/FP32 split existed.

Actually, if you look at these RightMark3D benches, you'll notice the 7800/7900 series copes with FP16 -> FP32 even better than the GeForce 6 series. I can't say why the Blinn shader seems to do better, but my guess is that it makes more use of MADD-type instructions, along with other pipeline improvements. Obviously, IMO, the GeForce 6 and GeForce 7 still share heritage with the NV35/NV3x, something a lot of people will not want to admit.

Also, with regard to NV35, would it be fair to say that since two registers were still needed for 1 FP32 instruction, NV35 was limited by latency for issuing FP32 instructions?
Perfectly fair. In all my experience, FP32 was roughly half the speed of FP16 on the GeForce FX 5900, which is why PP (partial precision) did so much good for the FX 5900 in comparison to the FX 5800, which really needed FX16 to do anything at acceptable speeds. I'm not sure if you remember the early FX drama, but there were a ton of pictures of "vase renderings". When the FX 5800 was out, Nvidia replaced all FP operations with FX16 operations. This went away when the 5900 came out and the hardware could cope with FP16 better.
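
For anyone who wants to see what dropping from FP32 to FP16 costs numerically, here is a rough sketch in Python. It assumes the GPU's FP16 behaves like IEEE 754 half precision (10-bit mantissa) and FP32 like IEEE 754 single precision, which is a simplification of the real hardware. The helper names (`round_trip`, `to_fp16`, `to_fp32`) are made up for illustration and have nothing to do with any driver or shader API.

```python
import struct


def round_trip(value: float, fmt: str) -> float:
    """Pack a Python float into the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def to_fp16(value: float) -> float:
    # "<e" is IEEE 754 half precision (assumed stand-in for shader FP16).
    return round_trip(value, "<e")


def to_fp32(value: float) -> float:
    # "<f" is IEEE 754 single precision (assumed stand-in for shader FP32).
    return round_trip(value, "<f")


def main() -> None:
    # 0.1 is a typical colour-like value, 2049.0 a texel-coordinate-like value.
    for v in (0.1, 2049.0):
        print(f"value={v!r}  fp16={to_fp16(v)!r}  fp32={to_fp32(v)!r}")


if __name__ == "__main__":
    main()
```

Small colour-like values survive FP16 reasonably well, while larger values such as texel coordinates lose whole integer steps. That is roughly why partial precision was fine for some shaders and visibly wrong for others.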
