
View Full Version : Think the Nv40 will offer 8 Floating Point Shader Units?



suburbanguy
08-29-03, 07:18 PM
R100 / Radeon64 was a 2x3

R200 / Radeon 8500 was a 4x2

R300 / Radeon 9700 was a 8x1

R400 (now R420) could be a 16x1, 8x2 or 12x1?

Although, since ATI doubles the pipes per generation, most likely 16x1.

Then R500 might be a wicked 32x1 or 16x2 :D

or R500 might do away with pipes completely ?
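
To put the "pipes x texture units" notation in numbers, here is a minimal sketch. It assumes "AxB" means A pixel pipelines with B texture units each; the helper name `per_clock` is made up for illustration, and the R420 entries are only the guesses from this post, not confirmed specs.

```python
# Layouts quoted in the post: (pixel pipelines, texture units per pipeline).
CONFIGS = {
    "R100": (2, 3),
    "R200": (4, 2),
    "R300": (8, 1),
    # Speculative R420 candidates from the post, not confirmed hardware.
    "R420 (guess A)": (16, 1),
    "R420 (guess B)": (8, 2),
    "R420 (guess C)": (12, 1),
}

def per_clock(pipes, tmus_per_pipe):
    """Return (pixels per clock, texels per clock) for a pipes x TMUs layout."""
    return pipes, pipes * tmus_per_pipe

for chip, (pipes, tmus) in CONFIGS.items():
    pixels, texels = per_clock(pipes, tmus)
    print(f"{chip}: {pipes}x{tmus} -> {pixels} pixels/clock, {texels} texels/clock")
```

Under that reading, 16x1 and 8x2 give the same texel rate, but only 16x1 continues the doubling of pixel pipes.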

aapo
08-30-03, 06:22 AM
Originally posted by Demirug
There is one pixel pipeline (aka pixel shader aka pixel processor) that includes many small FP units. Each of these units works on 4 datasets (pixels) at the same time.

Hmm... This clears things up a lot for me. It is indeed a very different construction than I thought.

This raises an additional question: if the pixel pipeline is really that similar in NV30 and NV35, how is it possible that NV35 is so much better in shading-intensive tests?

Demirug
08-30-03, 07:13 AM
Originally posted by aapo
Hmm... This clears things up a lot for me. It is indeed a very different construction than I thought.

This raises an additional question: if the pixel pipeline is really that similar in NV30 and NV35, how is it possible that NV35 is so much better in shading-intensive tests?

It is similar but not identical. NV30 has Int units too. Most of these units are replaced with FP units in NV35. But it looks like these units support two different operation modes:

- 1 FP32 base operation
- 2 Int12 base operations

aapo
08-30-03, 11:48 AM
Originally posted by Demirug
But it looks like these units support two different operation modes:

- 1 FP32 base operation
- 2 Int12 base operations

This is new information too. I thought FX12 operations were done on a one-to-one emulation basis on NV35. :) So, NV35 can actually do FX12 operations faster than FP32 operations? Why would the FX12 units exist in NV30 if the FX operations can be done efficiently in FP32 units?

BTW: Sorry about the flood of questions, but you seem to be very knowledgeable about the NV3X architecture, as you stated at the NFI forums. I hope your sources have the correct information. ;)

Demirug
08-30-03, 12:29 PM
Originally posted by aapo
This is new information too. I thought FX12 operations were done on a one-to-one emulation basis on NV35. :) So, NV35 can actually do FX12 operations faster than FP32 operations? Why would the FX12 units exist in NV30 if the FX operations can be done efficiently in FP32 units?

The two operation modes are not confirmed information.

They are a result of some tests showing that integer performance is nearly equal between NV30 and NV35, while FP performance is improved.

The theoretical peak performance figures per clock are:
NV30: (8 Tex or 4 FP) + 8 Int
NV35: (8 Tex or 4 FP) + (4 FP or 8 Int)

To compare:

R300/R350: 8 Tex + ((8 FP/Int Vec3 + 8 FP/Int scalar) or 8 FP/Int Vec4)

Theoretically, NV3X should also be able to execute a Vec3 and a scalar operation at the same time in place of one Vec4 operation. But the pixel shader diagram I have does not contain information about every possible path configuration. There are only examples for the default PS operations.

To repeat myself: none of this is 100% confirmed.
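
As a rough illustration of how to read the per-clock formulas above, here is a small sketch. It treats "+" as units running in parallel and "or" as alternative modes of a single unit; the data layout and the `peak` helper are invented for this example and only encode the unconfirmed figures from the post.

```python
# Each chip is a list of units. Each unit maps an operation type to how many
# of that operation it can issue per clock when used in that mode ("or").
# Units in the list work in parallel ("+"). Figures are the unconfirmed
# ones from the post, not vendor specifications.
MODELS = {
    "NV30": [{"tex": 8, "fp": 4}, {"int": 8}],
    "NV35": [{"tex": 8, "fp": 4}, {"fp": 4, "int": 8}],
}

def peak(chip, op):
    """Peak ops per clock of one type if every capable unit runs in that mode."""
    return sum(unit.get(op, 0) for unit in MODELS[chip])

for chip in MODELS:
    print(chip, {op: peak(chip, op) for op in ("tex", "fp", "int")})
```

With these numbers the integer peak is the same on both chips while the FP peak doubles on NV35, which is the pattern the tests mentioned above are said to show.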

Originally posted by aapo
BTW: Sorry about the flood of questions, but you seem to be very knowledgeable about the NV3X architecture, as you stated at the NFI forums. I hope your sources have the correct information. ;)

I am sure that my source has the correct information, but I am not sure if I have understood everything correctly.

Uttar
08-31-03, 02:43 AM
Originally posted by Demirug
The two operation modes are not confirmed information.

They are a result of some tests showing that integer performance is nearly equal between NV30 and NV35, while FP performance is improved.

The theoretical peak performance figures per clock are:
NV30: (8 Tex or 4 FP) + 8 Int
NV35: (8 Tex or 4 FP) + (4 FP or 8 Int)

To compare:

R300/R350: 8 Tex + ((8 FP/Int Vec3 + 8 FP/Int scalar) or 8 FP/Int Vec4)

Theoretically, NV3X should also be able to execute a Vec3 and a scalar operation at the same time in place of one Vec4 operation. But the pixel shader diagram I have does not contain information about every possible path configuration. There are only examples for the default PS operations.

To repeat myself: none of this is 100% confirmed.

I am sure that my source has the correct information, but I am not sure if I have understood everything correctly.

Hmm, first, I wouldn't say "or 8FP/Int Vec4" for the R300. I'd rather say it simply uses the Vec3 and Scalar units to do a Vec4, so the "or" is not accurate IMO. Yes, that's a detail, doesn't matter much :)

As for the NV35: That makes sense, but I'm personally rather suspecting:
NV30: (8 Tex or 4 FP32) + 8 Int
NV35: (8 Tex or 4 FP32) + (4FP32 or 8 FP16)

It just seems to me the FP32 performance is 2/3 of the FP16 performance on the NV35, even in very simple programs where register usage, on the NV30, didn't matter at all. This has been shown with a few tests I've done, as well as my Dawn Precision Patches, but with a very small performance increase with FX12.

That actually makes a lot of sense: in the NV30, they had FP32 units, and FP32 registers that could each split into two FP16 registers (or maybe FP16 registers that could combine, I'm not sure on that part).
And in the NV35, they'd probably have 2 FP16 units in that second unit, capable of doing FP32 in two cycles (or maybe they combine and do 1 FP32 op in one cycle, which is practically the same, but could save a cycle - like anyone would care!)
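
The 2/3 figure can be checked against the suspected NV35 layout with a little arithmetic. This is only a sketch: it assumes (this is not stated in the post) that the first unit issues 4 arithmetic ops per clock at either precision, and that no texture ops are in flight.

```python
from fractions import Fraction

# Suspected NV35 layout from the post: (8 Tex or 4 FP32) + (4 FP32 or 8 FP16).
# Assumption, not from the post: the first unit does 4 ops/clock at FP16 too.
FIRST_UNIT_FP = 4
SECOND_UNIT = {"fp32": 4, "fp16": 8}

def arithmetic_peak(precision):
    """Peak arithmetic ops per clock with both units doing math."""
    return FIRST_UNIT_FP + SECOND_UNIT[precision]

ratio = Fraction(arithmetic_peak("fp32"), arithmetic_peak("fp16"))
print(ratio)  # 2/3
```

So under those assumptions, 8 FP32 ops per clock against 12 FP16 ops per clock gives exactly the 2/3 ratio.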


Uttar

Demirug
08-31-03, 07:01 AM
Originally posted by Uttar
Hmm, first, I wouldn't say "or 8FP/Int Vec4" for the R300. I'd rather say it simply uses the Vec3 and Scalar units to do a Vec4, so the "or" is not accurate IMO. Yes, that's a detail, doesn't matter much :)

Yes, but if I want to write it as a formula, it is hard to do it any other way.

Originally posted by Uttar
As for the NV35: That makes sense, but I'm personally rather suspecting:
NV30: (8 Tex or 4 FP32) + 8 Int
NV35: (8 Tex or 4 FP32) + (4FP32 or 8 FP16)

It just seems to me the FP32 performance is 2/3 of the FP16 performance on the NV35, even in very simple programs where register usage, on the NV30, didn't matter at all. This has been shown with a few tests I've done, as well as my Dawn Precision Patches, but with a very small performance increase with FX12.

Yes, this is possible too, but so far nobody has confirmed it. The main reason I am not using this statement at the moment is that nVidia says they are able to execute 12 operations per clock (8 tex + 4 fp32 matches this). I believe that if they could do 2 fp16 operations instead of one fp32, they would tell you that they are able to do 16 operations per clock.

Originally posted by Uttar
That actually makes a lot of sense: in the NV30, they had FP32 units, and FP32 registers that could each split into two FP16 registers (or maybe FP16 registers that could combine, I'm not sure on that part).
And in the NV35, they'd probably have 2 FP16 units in that second unit, capable of doing FP32 in two cycles (or maybe they combine and do 1 FP32 op in one cycle, which is practically the same, but could save a cycle - like anyone would care!)

Uttar

The register memory is organized as 64-bit (4*16) cells. Two cells can be bound together for fp32.

What exactly they changed from NV30 to NV35 is outside my scope, because I only have information about the design that all NV3X chips share.
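
To make the 64-bit cell layout described above concrete, here is a minimal sketch using Python's `struct` module (format code `e` is IEEE half precision, `f` is single precision). The `cells_needed` helper and the sample values are made up for illustration; this only models sizes, not how the hardware actually addresses registers.

```python
import struct

CELL_BITS = 64  # one register cell: 4 x 16 bits, per the post

def cells_needed(components, bits_per_component):
    """Number of 64-bit cells needed to hold one register (ceiling division)."""
    total = components * bits_per_component
    return -(-total // CELL_BITS)

# A 4-component FP16 value fits exactly one cell (8 bytes).
fp16_vec4 = struct.pack("<4e", 1.0, 0.5, 0.25, 0.0)
# A 4-component FP32 value needs two cells (16 bytes).
fp32_vec4 = struct.pack("<4f", 1.0, 0.5, 0.25, 0.0)

print(len(fp16_vec4) * 8, cells_needed(4, 16))  # 64 1
print(len(fp32_vec4) * 8, cells_needed(4, 32))  # 128 2
```

This is why an FP32 register costs twice as much register memory as an FP16 one in such a scheme.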