Various hangs requiring reboot with GTX 275 896MB

There seem to be problems on my machine with pretty much any driver version. I'm currently using 256.35, but it also happens with 256.44. I have also had problems with pre-256 drivers (195, 190, 185, 180), but I can't use any of those versions now because my current display locks at an eye-burning 100% brightness with pre-256 series drivers. The display is a 120 Hz Alienware, but the problems also happened with my previous (60 Hz) display. I have tried 8x and 16x PCI-E 2.0 slots on several different mainboards (DFI T3eH6, Asus P6T, currently Gigabyte X58A-UD3R in the 8x slot).

Message signaled interrupts are enabled for the card, but disabling that doesn't help.

I am currently using KDE 4.4.5 with compiz, but the hangs can still occur even when using kwin with compositing disabled.

The hangs rarely happen shortly after a reboot. Usually they happen after the system has been running for 3-7 days (7 days is pushing it; usually it has already happened by then). The most common trigger seems to be task switching between a game and some other application. This happens with both native games such as Nexuiz and Quake 3, and games running under Wine, e.g. Left 4 Dead.

I believe a related symptom is that a task switch away from a game sometimes causes all applications to redraw extremely slowly. Sometimes this leads to a complete hang; other times switching back to the game is possible, and then things continue to run normally.

Also related (I believe): after a few days it usually becomes impossible to rmmod nvidia (for example, to change driver versions), even if there have been no problems so far and Xorg has stopped cleanly. The module is allegedly still in use, but ps shows no instances of Xorg.
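To make the "module still in use" symptom easier to report, here is a minimal sketch that reads the use count for the nvidia module from /proc/modules. The function name module_usage is only an illustration of mine, and the sketch assumes the usual /proc/modules layout (name, size, use count, dependents); it does not say what is holding the reference, only what count the kernel reports.

```python
from pathlib import Path


def module_usage(name, modules_text):
    """Return (use_count, dependents) for a module, or None if it is not listed.

    modules_text is the content of /proc/modules. use_count is None when the
    kernel prints "-" instead of a number (module unloading not supported).
    """
    for line in modules_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == name:
            use_count = int(fields[2]) if fields[2].isdigit() else None
            # The dependents column is "-" or a comma-terminated list.
            if fields[3] == "-":
                dependents = []
            else:
                dependents = [d for d in fields[3].split(",") if d]
            return use_count, dependents
    return None


if __name__ == "__main__":
    modules_file = Path("/proc/modules")
    if modules_file.exists():
        print(module_usage("nvidia", modules_file.read_text()))
```

If this still prints a non-zero use count after Xorg has exited, that at least confirms the kernel thinks something holds the module, which is consistent with rmmod refusing to unload it.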

When the hangs occur, sometimes the mouse cursor can still be moved and switching to a VT might be possible, other times there's no response to any input. However, the system is always still contactable via ssh, and I have observed a number of different strange behaviours when the system is in this state:
  • Xorg is fully utilising one CPU core; this happens pretty much every time
  • kill -9 is required to stop Xorg; again pretty much every time
  • often, it's not possible to rmmod nvidia even when ps shows there are no instances of Xorg (module is apparently still in use)
  • if the module is still in use and I attempt to rmmod -f nvidia, the kernel oopses and startx fails until after a reboot - but often, after trying to rmmod -f nvidia, the system will hang if I try to reboot and a power cycle is required
  • sometimes it is impossible to kill Xorg, even with repeated kill -9
  • sometimes the GPU seems to lock up completely, and it isn't even possible to do a chvt (the command just hangs until ^C)
I can't always access another machine to ssh from. I'll try to find a working laptop so I can get a post-crash bug report (or any other info that might be useful) next time it happens.
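In case it helps with collecting information over ssh while the machine is in the hung state, here is a rough sketch that reports the kernel's process state for Xorg by reading /proc/<pid>/stat. The helper names are my own, and it assumes the X server's command name is "Xorg" (on some systems it is "X"). A state of D (uninterruptible sleep) would be one possible explanation for kill -9 having no effect, while R would fit the busy CPU core; that interpretation is a guess on my part, not something I have confirmed.

```python
from pathlib import Path


def parse_proc_stat(stat_text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first "(" and the last ")".
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    command = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, command, state


def find_processes(name):
    """Return a list of (pid, state) for every process with this command name."""
    results = []
    for stat_file in Path("/proc").glob("[0-9]*/stat"):
        try:
            pid, command, state = parse_proc_stat(stat_file.read_text())
        except (OSError, ValueError, IndexError):
            continue  # process exited while scanning, or unreadable entry
        if command == name:
            results.append((pid, state))
    return results


if __name__ == "__main__":
    for pid, state in find_processes("Xorg"):
        print(pid, state)
```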
