I bought a computer some time ago which came with an 8600 GS. I experienced stability problems, so I returned it. Now I've bought another one in pieces (edit: I mean in parts), including an 8600 GT. The stability problems are still there, so I've investigated them as much as I could. Here are the results.
All of them turn out to be related to a small CPU monitor applet called wmcpu. I'm using version 1.3 (link), the Debian flavour (link). My X server is also the Debian flavour, sid version. I'm using WindowMaker, but that doesn't seem relevant, as I could also reproduce the crashes with IceWM. wmcpu is designed as a WindowMaker dock app but runs in a regular window under other window managers.
Of course I could live without wmcpu (although it's quite useful), but I'm afraid that some day I'll launch a different program which uses the same drawing function and crashes my X server.
Description of the problem
I'm using version 173.14.12 of the drivers. After starting the X server, including wmcpu, everything is fine. It takes a random amount of time, ranging from a few minutes to a couple of days of uninterrupted session, to start failing. When it does, the symptoms are that the whole screen starts blinking at exactly the refresh rate of wmcpu. A look at /var/log/Xorg.0.log reveals many repetitions of this (obtained with -logverbose 6), one per blink, I assume:
(II) NVIDIA(0): The NVIDIA X driver has encountered an error; attempting to
(II) NVIDIA(0): recover...
(II) NVIDIA(0): Initialized GPU GART.
(II) NVIDIA(0): Error recovery was successful.
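In case anyone wants to check the "one per blink" assumption on their own machine, here is a minimal sketch that counts these recovery cycles in the log. It assumes the messages look exactly like the excerpt above and that the log is at /var/log/Xorg.0.log; the function name count_recoveries is just something I made up for illustration.

```python
import re
from pathlib import Path

# Patterns taken from the log excerpt above; other driver versions
# may word these messages differently (assumption).
ERROR_RE = re.compile(r"NVIDIA\(\d+\): The NVIDIA X driver has encountered an error")
RECOVERED_RE = re.compile(r"NVIDIA\(\d+\): Error recovery was successful")


def count_recoveries(log_text):
    """Return (errors, successful_recoveries) found in the given log text."""
    errors = 0
    recovered = 0
    for line in log_text.splitlines():
        if ERROR_RE.search(line):
            errors += 1
        elif RECOVERED_RE.search(line):
            recovered += 1
    return errors, recovered


if __name__ == "__main__":
    log_path = Path("/var/log/Xorg.0.log")  # default location on my system
    if log_path.exists():
        text = log_path.read_text(errors="replace")
        errors, recovered = count_recoveries(text)
        print(f"errors: {errors}, successful recoveries: {recovered}")
```

Comparing the error count with the time the screen has been blinking and the wmcpu refresh interval should show whether there really is one recovery per blink.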
Once the blinking starts, killing wmcpu brings the apparent stability back. (With the earlier version of the driver, 173.14.09 I think, the system crashed after just a few blinks, so I couldn't diagnose anything and didn't even know wmcpu was the cause.) If I don't kill it, switching to a console via Ctrl+Alt+F1 (in my case the consoles are in text mode, I have no framebuffer) immediately crashes the X server with a segfault. I often have to regain control of the keyboard with SysRq+R (unRaw); on rare occasions the computer locks up, even failing to respond to ssh, and I have to press the reset button. The same type of crash happens if I launch wmcpu again instead of using Ctrl+Alt+F1. Sometimes a backtrace is generated and sometimes it's not; when it is, it always contains the same function in the first line: wfbCopyNtoN.
I've said apparent stability because, from then on, the system (or the card) is in a permanent failure status, in the following sense. When I launch X again, starting wmcpu immediately causes the segfault, unless I first run "nvidia-settings -a InitialPixmapPlacement=2" (seen in the performance problems thread), which causes the blinking to reappear instead of a segfault. That setting doesn't prevent the crash if I switch to a console, though. The GPU GART initialization apparently resets InitialPixmapPlacement to 1, so if I kill wmcpu and restart it without setting InitialPixmapPlacement again, I get a segfault again.
Only a reboot makes the system stable again and gets wmcpu working, until it starts failing again some time later. Unloading the nvidia module with rmmod does not help. It seems that the card itself is what stays in a failure status.
In case it matters, I always launch X via startx from the console; I have no xdm, kdm, gdm or similar. I have already tried all settings of NvAGP and everything else listed in the stability problems thread, to no avail. The only thing I haven't checked is whether my motherboard's BIOS is the latest version, but I doubt that would help.
Attached are the nvidia-bug-report.log (renamed to nvidia-bug-report_1.log) after an X crash provoked by starting wmcpu while in the failure status, and a verbose log made by forcing the blinking with the InitialPixmapPlacement=2 setting, then pressing Ctrl+Alt+F1 to force a crash.