CineFX (NV30) Inside discussion @ 3dcenter



Sazar
09-02-03, 04:34 AM
a look @ the architecture of the nv30/35 and r300/350 and what is occurring behind the scenes...

saw the link posted by mboeller @ b3d... so credit to him for the link... www.3dcenter.com for the article...

http://www.3dcenter.de/artikel/cinefx/index_e.php

^^^ is the translated article link...

Remi
09-02-03, 08:19 AM
:clap: :clap: :clap: :clap: :clap: :clap: :clap: :clap: :thumbsup:

Bert
09-02-03, 12:11 PM
Wow, great article. These German guys are awesome! This quite nicely explains the odd performance characteristics of NV30 ...

If we compare NV30 to its competitor from ATi on a per-clock basis, the NV30 receives a thorough beating from R300. ATi comes out ahead with 8 texture ops and 8 arithmetic ops in parallel. The higher core clock improves the situation a bit for NV30, but it still only manages 2600 MTO/s + 700 MAO/s or 2000 MTO/s + 1000 MAO/s.

The Shader Core allows texture ops to be exchanged for arithmetic ops in a 2:1 ratio (see graph below). The R300, as the R9700 Pro, reaches 2600 MTO/s + 2600 MAO/s without needing this balancing option. nVidia can only dare to dream of this raw performance, or try to overclock an NV30 to 1 GHz. Not only the award for best performance per clock but also the one for best overall raw performance goes to ATi's R300.
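
As a sanity check, those figures are just per-clock op counts multiplied by the core clock. A rough sketch (the 325 MHz R9700 Pro and 500 MHz 5800 Ultra clocks are my assumption; the per-clock counts follow the mixes quoted above):

#include <cstdio>

// Millions of ops per second = ops issued per clock * core clock in MHz
static int mops(int ops_per_clock, int clock_mhz) { return ops_per_clock * clock_mhz; }

int main() {
    // R300 (R9700 Pro): 8 texture + 8 arithmetic ops every clock
    printf("R300: %d MTO/s + %d MAO/s\n", mops(8, 325), mops(8, 325));   // 2600 + 2600
    // NV30 (5800 Ultra): the 2:1 trade-off gives mixes such as 4 tex + 2 arith per clock
    printf("NV30: %d MTO/s + %d MAO/s\n", mops(4, 500), mops(2, 500));   // 2000 + 1000
    return 0;
}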

Sazar
09-02-03, 12:27 PM
well the nv35 is a different cup of tea from the nv30 per their tests and that is pretty cool to read about... its improvements and similarities...

digitalwanderer
09-02-03, 12:49 PM
Urgh, me head is swooning from sickness...could someone give me a thumbnail of what all those charts & graphs mean?

Bert
09-02-03, 02:02 PM
Originally posted by digitalwanderer
Urgh, me head is swooning from sickness...could someone give me a thumbnail of what all those charts & graphs mean?

They show how the different processors deal with various mixtures of texture/arithmetic instructions to optimize throughput. The article explains the internal architecture of CineFX, which leads to some interesting insights:


With the right instruction mix, NV30 is able to outperform R300
However, shaders optimized as advertised by Microsoft/ATI (which is the majority since they were first to market) are the worst case for NV30
NV35 in particular increases this worst-case performance
The performance can further be increased by reordering shader instructions (basically interleaving tex fetches and arithmetic ops)
Det40 does this by having replacement shaders in the driver for specific games
Det50 could generalize this for all applications by recompiling shaders
Reducing temp register usage will increase throughput too, because if more temps are used, fewer pixels can circulate in the pipeline.
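
A crude way to picture that last point (the register-file size below is a made-up placeholder, not the real NV30 figure):

// Pixels the pipeline can keep in flight when every in-flight pixel has to
// hold its temporaries in a fixed-size on-chip register file.
int pixels_in_flight(int register_slots, int temps_per_pixel) {
    return register_slots / temps_per_pixel;
}
// With a hypothetical 256-slot file: 2 temps -> 128 pixels in flight,
// 8 temps -> only 32, so texture latency is hidden far less effectively.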

digitalwanderer
09-02-03, 02:08 PM
Originally posted by Bert
They show how the different processors deal with various mixtures of texture/arithmetic instructions to optimize throughput. The article explains the internal architecture of CineFX, which leads to some interesting insights:


With the right instruction mix, NV30 is able to outperform R300
However, shaders optimized as advertised by Microsoft/ATI (which is the majority since they were first to market) are the worst case for NV30
NV35 in particular increases this worst-case performance
The performance can further be increased by reordering shader instructions (basically interleaving tex fetches and arithmetic ops)
Det40 does this by having replacement shaders in the driver for specific games
Det50 could generalize this for all applications by recompiling shaders
Reducing temp register usage will increase throughput too, because if more temps are used, fewer pixels can circulate in the pipeline.

Thanks Bert, I'm much obliged.

So the general gist I'm getting is that there IS a longshot possibility that the Det 5's could do something, but only for NV35?

The Baron
09-02-03, 03:17 PM
# Det40 does this by having replacement shaders in the driver for specific games
# Det50 could generalize this for all applications by recompiling shaders
Does anyone else have a sinking feeling that IQ might go down in all games that use PS2.0 when The Next Det Series comes out?

andypski
09-02-03, 03:27 PM
Originally posted by Bert
They show how the different processors deal with various mixtures of texture/arithmetic instructions to optimize throughput. The article explains the internal architecture of CineFX, which leads to some interesting insights:


With the right instruction mix, NV30 is able to outperform R300
Only if they are right about NV30/35 and R300 performance characteristics. Perhaps they don't know the whole story?

Just for starters, they acknowledge that in their graphs they have ignored any effects of scalar co-issue to keep them simple, and that they don't know for certain whether NV30 or NV35 can do this type of operation.

- Andy.

Bert
09-02-03, 04:00 PM
Originally posted by andypski
Only if they are right about NV30/35 and R300 performance characteristics.
Yep. It would be nice to support this with benchmark data. Although the arguments are pretty much in line with what we see with NV pixel shaders.

Sazar
09-02-03, 07:53 PM
Originally posted by The Baron
Does anyone else have a sinking feeling that IQ might go down in all games that use PS2.0 when The Next Det Series comes out?

perhaps, but it need not necessarily be a big IQ drop.. at least perhaps not noticeable...

nvidia could well re-order their instructions in a manner that allows for decent enough performance while not handicapping their hardware with heavy duty shader instructions...

end result == what we have been referring to as hackery... but there will likely be a speed boost...

what I found odd was the difference in characteristics between the nv30 and the nv35... seems strange considering the architectural similarities between the 2...

whatever happens... nvidia will not be following the dx9 api too closely it would seem... not if they are reducing precision and replacing shaders the way it seems likely they will have to do in order to compete at the top end...

Descombobulator
09-03-03, 05:38 PM
Originally posted by The Baron
Does anyone else have a sinking feeling that IQ might go down in all games that use PS2.0 when The Next Det Series comes out?

I have a sinking feeling that choosing the best video card makes solving work hunger look trivial.

MUYA
09-03-03, 05:44 PM
Originally posted by Bert
They show how the different processors deal with various mixtures of texture/arithmetic instructions to optimize throughput. The article explains the internal architecture of CineFX, which leads to some interesting insights:


With the right instruction mix, NV30 is able to outperform R300
However, shaders optimized as advertised by Microsoft/ATI (which is the majority since they were first to market) are the worst case for NV30
NV35 in particular increases this worst-case performance
The performance can further be increased by reordering shader instructions (basically interleaving tex fetches and arithmetic ops)
Det40 does this by having replacement shaders in the driver for specific games
Det50 could generalize this for all applications by recompiling shaders
Reducing temp register usage will increase throughput too, because if more temps are used, fewer pixels can circulate in the pipeline.


Thank you Bert, because after reading that article I had a migraine and still was none the wiser as to what was going on :p

Cheers Man!!!!

MUYA

Nutty
09-03-03, 06:22 PM
Re-ordering instructions doesn't mean there will be an IQ drop at all. In fact, CPUs re-order instructions where possible, and if they got any of it wrong, your whole PC would come to a grinding halt.

Changing the shader operations or lowering precision can, and probably will, lead to lower IQ, but not in all instances.

It would be nice if the new shader re-order/compiler built into Det50's does give a nice speed boost. I think this is what JC was referring to a while back, when he mentioned that a future driver update would improve the compilation of shader operations to improve performance.

Just a matter of when they get released, which should be quite soon given HL2 will be out shortly; and if they resort to manual shader replacement, you will see it as an obvious speed increase in the drivers released around HL2's launch.

The Baron
09-03-03, 07:14 PM
Okay. Phew. I was worried that if it was a way to generalize Det40-style shader replacements, it would result in IQ losses from lower precision.

Glad to hear that if it is a much-improved compiler, IQ should remain the same while speed increases a good bit.

sebazve
09-04-03, 07:48 AM
the thing is that gffx users will have to wait for new releases of the dets to get decent ps performance, and it has to be done by hand for all the games, so i think only the popular ones will be touched, leaving the rest well...:barf:

Zenikase
09-04-03, 08:57 AM
Originally posted by sebazve
the thing is that gffx users will have to wait for new releases of the dets to get decent ps performance, and it has to be done by hand for all the games, so i think only the popular ones will be touched, leaving the rest well...:barf:

Originally posted by Bert
Det50 could generalize this for all applications by recompiling shaders


It seems like NVIDIA is getting a little ahead of themselves. Out-of-order execution? Maybe they'll build that and branch prediction into the NV40 hardware, if they haven't done so already. It's another sign of video processing units becoming more and more like CPUs.

I tried reading that article and my head started spinning after the first three pages. Maybe I'll muster up the courage to try it again today.

digitalwanderer
09-04-03, 09:51 AM
Originally posted by Descombobulator
I have a sinking feeling that choosing the best video card makes solving work hunger look trivial.
No, actually nVidia is making it pretty damned easy this round... ;)

Descombobulator
09-04-03, 10:24 AM
'Work' hunger?? ...<sigh>...what am I on :rolleyes:

Descombobulator
09-04-03, 10:26 AM
Originally posted by Bert
Det50 could generalize this for all applications by recompiling shaders


Can someone explain to me what exactly this entails?

Edit: I mean in terms of what happens in the application, the drivers, the hardware with respect to shaders.

Skuzzy
09-04-03, 01:10 PM
DirectX supplies default shaders for both pixel and vertex. If you want to use custom shaders, then you write your own in shader assembly, or HLSL (I am leaving Cg out of this as it has no real merit any more).

Now you have your custom shader written, but it is just an ASCII text file. So you have to get it compiled. The DirectX API has functions for doing this. You pass some information to the function along with the shader code.
The DX function does a call to the video-driver-supplied compiler (this is supplied to the video card companies as part of the DX driver dev kit and has to be modified before it can be used at all). The video driver takes the source code and turns it into the specific opcodes needed for its video hardware.
A handle to this code is passed back to the game/application, which is used for each successive call to draw a primitive (SetVertexShader and SetPixelShader). The registers are also set with the data needed for the shaders using the SetVertexShaderConstant and SetPixelShaderConstant functions.
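
Roughly, in Direct3D 9 terms the calls look something like this (a bare-bones sketch, error checking left out; assumes you already have an IDirect3DDevice9* and your shader source in a string):

#include <d3d9.h>
#include <d3dx9.h>
#include <string.h>

// Assemble the ASCII shader text, then hand the resulting token stream to the
// driver, which turns it into the opcodes its hardware actually runs.
IDirect3DPixelShader9* MakePixelShader(IDirect3DDevice9* dev, const char* src)
{
    ID3DXBuffer* tokens = NULL;
    D3DXAssembleShader(src, (UINT)strlen(src), NULL, NULL, 0, &tokens, NULL);
    IDirect3DPixelShader9* ps = NULL;
    dev->CreatePixelShader((const DWORD*)tokens->GetBufferPointer(), &ps);
    tokens->Release();
    return ps;   // the handle the game keeps and re-uses
}

// Per draw call:
//   dev->SetPixelShader(ps);
//   float c0[4] = { 1.0f, 0.5f, 0.25f, 1.0f };
//   dev->SetPixelShaderConstantF(0, c0, 1);   // fill constant register c0
//   dev->DrawPrimitive(...);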

Need more information?

Zenikase
09-04-03, 02:46 PM
I am retarded, watch me go.

Descombobulator
09-04-03, 03:00 PM
Thanks Skuzzy, that's cleared things up for me.

So if custom shaders are created, are these written once, by nVidia, to enhance all games, or are they created on a game by game basis by the application developers (with help from nVidia)?

The Baron
09-04-03, 03:25 PM
That's the thing--custom shaders don't have to be written. A custom compiler will be used rather than custom shaders in order to optimize for the NV3x architecture.

ATI did the same thing on a shader in 3DMark03 a while back, gained about 8% performance from it.

If it happens, it's a very good solution (it won't make up for hardware shortcomings, but it will HELP). This is another reason why I like OGL1.5--the compiler is ALWAYS provided by the IHV rather than the ARB or something like that, so basically every company will optimize OGL1.5 performance before they release a driver with support for it.
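
For a concrete picture of the OpenGL side: with the GLSL extensions that accompany 1.5 (ARB_shader_objects / ARB_fragment_shader), the high-level source string goes straight to the driver, so each IHV's own compiler gets to schedule it for its hardware. A minimal sketch (extension entry points assumed to be loaded already; error checking omitted):

// Compile a trivial GLSL fragment shader entirely inside the IHV's driver.
const char* src = "void main() { gl_FragColor = vec4(1.0, 0.0, 0.0, 1.0); }";

GLhandleARB fs = glCreateShaderObjectARB(GL_FRAGMENT_SHADER_ARB);
glShaderSourceARB(fs, 1, &src, NULL);
glCompileShaderARB(fs);                    // the driver's compiler runs here

GLhandleARB prog = glCreateProgramObjectARB();
glAttachObjectARB(prog, fs);
glLinkProgramARB(prog);
glUseProgramObjectARB(prog);               // used for all subsequent draws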

Bert
09-04-03, 03:59 PM
To clarify this a bit ... this is from page 5 of the 3DCenter article:
A pixel shader 2.0 written according to the recommendations from ATi or created with the current DX9-SDK HLSL compiler unfortunately highlights all those problems. Such shaders use many temp registers and have texture instructions packed in a block. A shader of version 1.4 uses the same principles.


The problem is that for PS 1.4 and 2.0, the recommended instruction order is to bunch all texture fetches at the beginning, store them in temporary registers, and then perform any arithmetic operations on them.

Here is a much simplified and inaccurate example:
temp1=fetch();
temp2=fetch();
temp3=fetch();
temp4=temp1+temp2;
result=temp3*temp4;

If such code is executed by NV30, much performance is wasted because it is built to fetch and calculate at the same time. With optimized code, it would perform much better:
temp1=fetch();
temp1=temp1+fetch();
result=temp1*fetch();

This not only uses fewer registers, but also fewer cycles. But it is exactly the same mathematically, so there is no IQ loss.

Now a good compiler can analyze the first code example above and transform it into the second. However, a good compiler does not grow on trees, but has to be written. That's why it was easier for NVIDIA to just rewrite some shaders and put them into Det40, but we all hope they will eventually get it done ;)

To quote 3DCenter again:
nVidia will have to tackle these problems, the sooner the better. Microsoft is working on an SDK update which will include a new version of the compiler that can produce more "NV30-friendly" code with fewer temp registers and paired texture ops. But if developers use PS 1.4 or generate only one PS 2.0 shader, this compiler won't help nVidia much. They will have to solve the problem in the driver.