I understand that NVIDIA requires a standard nvidia-bug-report before offering any assistance, a necessary if somewhat limiting policy. I will provide one as soon as I have access to the machine again. In the meantime, here is a description of the symptoms and the machine as an initial step.
The machine has dual Opteron 252 (single-core) CPUs on a Tyan S2895 (aka K8WE) board with fully populated RAM (16 GB) and BIOS v1.04 from 12/06. The power supply is a Silverstone ST-65ZF 650W. Tyan published BIOS 1.05 this month, but I hesitate to use it because of serious problem reports. I upgraded from a single 6800GT to dual 7900GS cards. The 6800GT was rock stable with BIOS 1.01T and the 1.0.8776 driver. The second 7900GS required a BIOS upgrade to be recognized properly and is connected to the first via an SLI bridge. The upgrade is mostly a test of the viability of using high-end Quadros in SLI for our work group.
I run 64-bit Gentoo with kernel 2.6.20-r8 and one of the latest NVIDIA drivers, 100.14.09. Enabling SLI via the xorg.conf option line, in any of the SLI modes, appears to work according to Xorg.0.log and nvidia-settings. The problem shows up at seemingly random moments when a new OpenGL-rendered window is displayed. The CPU usage of X then quickly goes to 100% and X becomes completely unresponsive, to the point of ignoring keyboard events. The system keeps running and I can log in remotely. Killing the X server and the OpenGL application does not bring back the display or keyboard; the only recourse is to shut down. Init still works most of the time if invoked within minutes of the incident, so I can often shut down safely.
In Auto or AFR SLI mode, this happens often enough to make the system unusable. So far it has happened every time when the SLI mode is set to SFR.
dmesg does not show a "PCI: Using MMCONFIG" line, so there is no problem in that respect, I gather.
In the Xorg.0.log I once observed lines beginning
(WW) NVIDIA(0): WAIT ...
at the time of the incident. But in most cases there are no messages in this log or in the kernel log at the time of the freeze.
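For anyone who wants to check their own logs for the same two markers, here is a minimal sketch in Python. The function name scan_log and the dictionary keys are my own choices for illustration, not part of any NVIDIA or Xorg tool, and the WAIT message is matched only by the prefix quoted above, since I did not record the rest of the line.

```python
# Markers taken from the description above. The NVIDIA warning is matched
# by its prefix only, because the remainder of the message was not recorded.
MMCONFIG_MARKER = "PCI: Using MMCONFIG"
NVIDIA_WAIT_PREFIX = "(WW) NVIDIA(0): WAIT"


def scan_log(text):
    """Return the lines of `text` that contain either marker.

    `text` is the full contents of a log, for example saved dmesg output
    or /var/log/Xorg.0.log read into a string.
    """
    hits = {"mmconfig": [], "nvidia_wait": []}
    for line in text.splitlines():
        # Substring matching tolerates timestamps in front of the message.
        if MMCONFIG_MARKER in line:
            hits["mmconfig"].append(line.strip())
        if NVIDIA_WAIT_PREFIX in line:
            hits["nvidia_wait"].append(line.strip())
    return hits


# Usage with the one line quoted above:
print(scan_log("(WW) NVIDIA(0): WAIT ..."))
```

To run it against real logs, read each file into a string first, e.g. open("/var/log/Xorg.0.log").read(), and pass that to scan_log. Empty lists for both keys would match what I see in most of the freezes.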
An unrelated problem required me to use the kernel command line option noexec=off. http://gentoo-wiki.com/HOWTO_nVidia_...or_OpenGL_apps
recommends enabling a no-execute bit in the BIOS because it interferes with the NVIDIA driver. Is that correct, and could it be a remedy here?
Since I did not observe any SLI performance increase with the OpenGL apps I use, even when SLI works, we may end up not using a dual setup in any case. A single 7900GS works fine and provides a small performance increase over the 6800GT, subjectively about 10%. Stability is much more important to us than SLI. We may be able to test a Quadro as well in the near future.
I would be grateful for any comments or tips, whether kernel, BIOS, or driver related.