Originally Posted by ChrisRay
Yes they do.
GeForce FX 5800/5200/5600 prefer FX12 integer-based operations. They are unable to run FP16 at any acceptable speed.
GeForce FX 5900/5700 replaced 2 integer units with FP16 units. However, they lacked the register space to fully utilize those units, and did not have the capability of running SM 2.0 code with any floating-point proficiency. FP16 did perform better than on the NV30 but was still largely hampered by register space, as the new units did not solve that problem. FP32 simply increases register usage, so the problem just got bigger.
GeForce 6 hardware offered a greatly increased register file. Therefore, substituting integer calls for floating-point calls no longer offered large performance improvements, which had been a big problem with the FX series. And FP32 was nowhere near as devastating to performance as on the GeForce FX. This is where the distinction is clearly drawn: in games such as Half Life 2 and Far Cry, forcing FP16 or FP32 only caused minor performance deficits ((usually within the range of 2-3 FPS)). The percentage loss for going from FP16 to FP32 is not that large, but there is still some benefit to FP16 on the GeForce 6 cards.
The GeForce 7 series further improved on this with more register space, although the change was minor compared to the FX-to-GeForce-6 jump.
The GeForce 8 just ignores PP and performs all operations at full precision ((which is FP32 by SM 3.0/DX10 specification)). DirectX 10 does not support partial precision, but the GeForce 8 will simply ignore the hint entirely.
So yes, each of these architectures behaves differently when FP16 and FP32 operations are requested.
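The generational behavior described above can be laid out as a small lookup table; here is a sketch in Python, where every entry is just a restatement of the claims in the quote (illustrative labels, not benchmark data):

```python
# Per-generation shader precision behavior, as summarized in the post above.
# The strings are paraphrases of the post's claims, not measured results.
precision_behavior = {
    "GeForce FX 5800/5200/5600 (NV30-class)": {
        "preferred": "FX12 integer",
        "fp16": "too slow to be usable",
        "fp32": "worse still (doubles register pressure)",
    },
    "GeForce FX 5900/5700 (NV35/NV36)": {
        "preferred": "FP16",
        "fp16": "faster than NV30, still register-limited",
        "fp32": "register usage doubles, problem gets bigger",
    },
    "GeForce 6 (NV40)": {
        "preferred": "FP32 (FP16 gives only a small win)",
        "fp16": "minor gain, ~2-3 FPS in HL2 / Far Cry",
        "fp32": "near-full speed thanks to larger register file",
    },
    "GeForce 7 (G70)": {
        "preferred": "FP32",
        "fp16": "even smaller gain than on NV40",
        "fp32": "full speed",
    },
    "GeForce 8 (G80)": {
        "preferred": "FP32 only",
        "fp16": "partial-precision hint ignored",
        "fp32": "always used (full precision per SM 3.0/DX10)",
    },
}

for gpu, info in precision_behavior.items():
    print(f"{gpu}: prefers {info['preferred']}")
```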
After a little crash course -:> Let me know if my terminology is off, or if the general idea is sound, or vice versa.
From what I see, the only reason that NV40 benefits from FP16 pp or FX12 int pp is that the latency associated with executing FP32 operands is higher than with FP16 or int12, plus maybe a smaller memory footprint.

NV30's (and 40's) ability to use lower precisions had to do with the fact that they didn't have constant registers, but stored the data in instruction slots instead, and could therefore store either int or variable float precisions. This had some associated penalties for state changes, though.

From recollection, although CineFX's ALUs were FP32 precision, it only had the ability to execute half the FP32 operands per cycle due to register limitations. So we have 64 temporary registers, but 2 count for 1 FP32 value. Simultaneously, the 5800 could store 64 FP16 values in its temporary registers, or 32 FP32, so its registers were orthogonal in that sense. R300 was limited to 32 temporary registers regardless of the precision used. So maximum throughput should be 64 FP16 instructions vs. R300's 32 FP24. In the 5900's case, 1 register counted for 1 FP32 value, so it should in theory be able to execute 64 FP32 instructions.

The R300 also has 32 floating-point constant registers. However, as stated above, for flexibility (or because the 5800 was probably geared more toward DirectX 8.1 than 9), the FX 5800 has no constant registers and instead uses its copious instruction slots to store fragment program instructions. It can store int instructions as well, for FX12, but carries heavy penalties for instruction state changes.
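The register-file arithmetic in that paragraph can be sanity-checked with a toy calculation. This is a sketch of the "slot" model as I read it in the post (one FP32 temporary consuming two FP16-sized slots); the slot counts are the post's figures, not a documented spec:

```python
# Toy model: a register file of fixed FP16-sized slots.
# An FP16 temporary consumes 1 slot; an FP32 temporary consumes 2.
def max_temps(total_fp16_slots: int, slots_per_temp: int) -> int:
    """How many temporaries fit in the register file."""
    return total_fp16_slots // slots_per_temp

# NV30 (5800): 64 FP16-sized slots, per the post.
nv30_fp16 = max_temps(64, 1)  # 64 FP16 temporaries
nv30_fp32 = max_temps(64, 2)  # or only 32 FP32 temporaries

# NV35 (5900): per the post, 1 register per FP32 value,
# so the full 64 remain usable even at FP32.
nv35_fp32 = max_temps(64, 1)

# R300: 32 temporary registers regardless of precision (internally FP24),
# so FP16/FP32 hints map to the same budget.
r300_any = max_temps(32, 1)

print(f"NV30: {nv30_fp16} @ FP16, {nv30_fp32} @ FP32")
print(f"NV35: {nv35_fp32} @ FP32 | R300: {r300_any} @ FP24")
```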
The 6800 changed this up by offering 2 FP32 ALUs in 16 independent pipelines, each capable of executing 8 floating-point instructions in dual issue, so double the theoretical pixel shader performance of the NV35. Probably faster than that in practice due to the SIMD programming model vs. VLIW (and better utilization as 16*1 rather than 8*2). I'm not sure what's going on with the constant registers, but I assume it's the same. There's an FP16 normalize, so it still has some FP16-specific optimizations. Registers are once again orthogonal for full FP32 performance. In this case I can't imagine any reason other than register space and some slight latency penalty for using FP32 compared to FP16. Assuming it used the same register combiner scheme, you would have 2x the number of FP16 registers needed, which could come in handy to ease the state-change penalty associated with storing constant registers in instruction slots.
G80 is an obvious extension and evolution of the G70 ALU. Do they do away with register combiners for FP16 ops?
Does this sound right?
Also, with regard to NV35: would it be fair to say that, since two registers were still needed for 1 FP32 instruction, NV35 was limited by latency when issuing FP32 instructions, whereas NV40 had less to arbitrate since 1 FP32 instruction = 1 register? However, would 2 FP16 instructions fit in 1 register? If so, with a good driver, efficiency can be gained when an FP32 instruction fills out the register (whereas the FP16 instructions can be packed more tightly).
edit: And btw, the "32" constant registers typically reported for NV30 match nicely with the "512" instruction slots, as this allows 16 FP32 instructions to be issued concurrently, right? So doesn't this in effect mean that NV30 has half the FP32 performance that R300 has for FP24, due to the instruction-slot limit?
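The edit's arithmetic can be written out explicitly. This sketch only reproduces the numbers as stated in the question (512 slots, 32 constant registers, R300 at 32 FP24 per cycle); whether the hardware actually divides this way is exactly the open question being asked:

```python
# Arithmetic as posed in the question above -- not a confirmed spec.
instruction_slots = 512
constant_registers = 32

# The question's suggestion: slots / constant registers -> concurrent FP32.
concurrent_fp32 = instruction_slots // constant_registers

# The implied conclusion: NV30 manages half of R300's FP24 rate at FP32.
r300_fp24_per_cycle = 32
nv30_fp32_per_cycle = r300_fp24_per_cycle // 2

print(f"concurrent FP32: {concurrent_fp32}")
print(f"NV30 FP32 vs R300 FP24: {nv30_fp32_per_cycle} vs {r300_fp24_per_cycle}")
```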