Re: Instability of NVIDIA drivers on Linux 64 bit - thoughts?


I know it's been a while since anyone posted here, but this is the only thread I can find which describes the very same problem I'm currently having (and have had for several months...). The first post in this thread by Rambetter describes the problem/symptoms accurately - everything there goes for me as well.

However, in contrast to the reports in here, my stability problems remain after upgrading to the latest driver. My current version is the official 195.36.15 downloaded from NVIDIA's site, but I had the same problem with the earlier 180, 185 and 190 driver versions.

I can't reproduce the freeze at will, but it inevitably occurs, whether after 5 minutes or 5 hours of work. Today I had only one freeze; last week I had six in one day. The freezes also seem to come in bursts: for the past month they have been frequent (several times per day), but before that the previous burst was maybe a month earlier.

I see a possibly interesting - or not? - variation of the problem when I start X from the console in runlevel 3 ("init 3") with "startx -- -logverbose 6" (this was of course done in order to produce the nvidia bug report as required). Rather than the usual total lockup requiring a reboot that I get when X is started by init, I get a series of short 5-10 s freezes that continue for as long as any 3D graphics program is active (incl. glxgears). In between the freezes, however, I am able to shut the program(s) down manually. If I start one of the programs again, the freezes resume immediately. Also, once the freezes have started, they come back as soon as I start a 3D graphics program, even after shutting down X and starting a new session from the console.

Either way, Xorg.0.log starts to produce a few lines similar to what is reported in the previous posts, e.g.:
(WW) May 06 12:49:54 NVIDIA(0): WAIT (0, 6, 0x8000, 0x00003f68, 0x00003f68)

Sometimes, a couple of short (5-10 seconds) freezes will precede the "major freeze-up", after which a line like that will appear in Xorg.0.log.
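In case it helps anyone correlate the freezes with the log, here is a rough Python sketch for pulling those WAIT warnings out of Xorg.0.log. The pattern is modelled only on the single line quoted above, so other driver versions may well format the message differently. The function name find_wait_warnings is just something I made up, the year has to be supplied because the log line doesn't carry one, and the month parsing assumes an English/C locale.

```python
import re
from datetime import datetime

# Pattern modelled on the one sample line from this post (assumption:
# other driver versions may print the warning in a different shape).
WAIT_RE = re.compile(
    r"^\(WW\)\s+(?P<stamp>\w{3} \d{2} \d{2}:\d{2}:\d{2})\s+"
    r"NVIDIA\((?P<screen>\d+)\):\s+WAIT\s+\((?P<args>[^)]*)\)"
)


def find_wait_warnings(lines, year=None):
    """Return a list of (timestamp, screen, args) for each WAIT warning.

    `lines` is any iterable of log lines (an open file works).
    `year` defaults to the current year because the log omits it.
    """
    if year is None:
        year = datetime.now().year
    hits = []
    for line in lines:
        match = WAIT_RE.match(line.strip())
        if match is None:
            continue  # not a WAIT warning, skip it
        stamp = datetime.strptime(
            "%d %s" % (year, match.group("stamp")), "%Y %b %d %H:%M:%S"
        )
        # Keep the arguments as strings; their meaning is not documented here.
        args = [a.strip() for a in match.group("args").split(",")]
        hits.append((stamp, int(match.group("screen")), args))
    return hits
```

Passing an open handle on the log (typically /var/log/Xorg.0.log, but that path is an assumption about the setup) gives a list of timestamps that can be compared against the times the freezes were noticed.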

GPU heating does not seem to be the problem. I was monitoring it the other day and it was stable around 70 deg. C until the freeze, after which it seemed to drop a little (to around 68 degrees - I was looking at the nvidia-settings panel from a remote host via ssh -X).
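Rather than watching the panel by hand, the temperature could be logged to a file so the last reading before a freeze survives. The following is only a sketch under a few assumptions: Python 3 is available, nvidia-settings is on the PATH, and the query "[gpu:0]/GPUCoreTemp" with the terse flag returns a bare number on this driver version. The names read_gpu_temp and log_temps are hypothetical helpers, not part of any NVIDIA tool.

```python
import subprocess
import time


def read_gpu_temp(display=":0"):
    """Ask nvidia-settings for the core temperature of GPU 0.

    Assumes terse output (-t) is just the number, e.g. "70".
    """
    out = subprocess.check_output(
        ["nvidia-settings", "-c", display, "-q", "[gpu:0]/GPUCoreTemp", "-t"]
    )
    return int(out.decode().strip().splitlines()[0])


def log_temps(read_temp, samples, interval=5.0, sleep=time.sleep, out=None):
    """Take `samples` readings, `interval` seconds apart.

    Each reading is flushed to `out` (if given) straight away, so the
    file is up to date even if the machine locks up afterwards.
    """
    readings = []
    for i in range(samples):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        temp = read_temp()
        readings.append((stamp, temp))
        if out is not None:
            out.write("%s %d\n" % (stamp, temp))
            out.flush()
        if i + 1 < samples:
            sleep(interval)
    return readings
```

For example, log_temps(read_gpu_temp, 720, out=open("gpu_temp.log", "a")) would record roughly an hour of readings at the default 5-second interval; the file name is arbitrary.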

I'm working with 3D graphics applications (structural/computational chemistry) on a high-end HP Z800 workstation with 2 quad-core Xeon x86_64 CPUs, a Quadro FX 3800 and 24 GB RAM, running CentOS 5.4 with kernel v. 2.6.18-164.15.1.el5 (should be the latest official CentOS kernel, afaik). As mentioned, I have the latest NV-driver (195.36.15). I have also updated the BIOS to the latest version (v. 03.07), and the problem persists.

I'm thinking it could be a hardware-related issue. Other people in the office have the exact same setup as me and have never had this problem. I have tried various xorg.conf settings, e.g. decreasing the update frequency so as not to push the screen to the limit, but that doesn't help (and I don't see why it should, either - if that were actually the problem, I figure I would see a bad image on the screen or something, not an Xorg process running at 100% cpu).

Latest nvidia bug report is attached. I was running X with "startx -- -logverbose 6".

Hope you can help.


