Old 04-13-05, 11:20 AM   #1
gilboa
Trying to make headway into finding the Xid crashes source...

Hello all,

Since buying a new Leadtek FX6600GT card, my once stable machine went to hell:
I'm constantly getting Xid 6, 8 and 23 errors when trying to run OpenGL software and games.
(See bug report: http://www.nvnews.net/vbulletin/showthread.php?t=48965)

Being pig-headed, I decided to forget about sleep last night and tried to isolate the problem (which seems to be shared by *many* FX6600/GT users).

Before I begin:
A. While I write Linux kernel code for a living, I don't pretend to know how the nVidia driver works. I'm basing my findings on semi-educated (at times, abysmal) guesswork.
B. I know how to play OpenGL games; I know nothing about the inner workings of the OpenGL library.
C. If you find one of my assumptions to be wrong, please feel free to educate me.

A short description of the problem:
A. The driver loads.
B. The OpenGL software starts.
C. Somewhere along the line, the machine locks up and the OpenGL process (or its parent X process) eats 100% CPU.
D. The machine becomes unresponsive.
E. "NVRM: Xid nn" messages appear in the kernel log.
F. I can access the machine using SSH after it has crashed and kill the runaway process (be it X, glgears, doom3, etc).
G. The nvidia kernel module doesn't oops.

Machine configuration:
Asus A7M266D.
2 x Athlon MP 2400
1GB RAM
Leadtek A6600GT.
SCSI
Fedora Core 3, 2.6.11.

Step one: rule out unlikely sources:
Hardware:
A. The machine doesn't lock after a crash - I can access it using SSH.
B. The machine was rock solid with older nVidia cards.
C. The FX6600GT eats less power than the previous FX5900 that was running on this machine.

Kernel:
A. The kernel doesn't oops.
B. The machine doesn't lock after a crash - I can access it using SSH.
C. I've tested all the 6x and 7x drivers with 2.6.9, 10 and 11. They all behaved the same.
D. The same driver/kernel was rock solid with older nVidia cards.
E. The runaway process eats the CPU, not the kernel.
F. NvAGP and agpgart behave the same.

Driver: (Here the guesswork begins)
A. The driver doesn't oops.
B. I can rmmod the driver after an X/software crash.
C. I can kill the faulty process. (It doesn't go into an uninterruptible sleep - which would have suggested a kernel module crash.)
D. I can restart the X session, without restarting the driver. (Which should suggest that the driver data contexts have survived the crash)
E. Using the driver's debug flags didn't add anything useful.

X Problem:
A. Same X configuration works just fine with the same driver and software with the old FX5900 card.
B. 2D works just fine.
C. RenderAccel and Composite enabled, or disabled, behave the same.

User-mode Process problem:
A. The same processes (glgears, xscreensaver, Quake3, UT2004, etc.) worked just fine before the upgrade.
B. Have all of them gone mad?

In order to gain some insight into the driver's inner workings, I went looking for a debug switch; thankfully, I found one.
$ modprobe nvidia NVreg_ResmanDebugLevel=0 (The value seems to be a debug-level umask)
Once I loaded the driver with this flag, my kernel log got flooded with nvidia NVRM messages.
Using the same flag, I started an X session and ran Enemy Territory until it crashed. Nothing in the kernel log suggested that anything had gone terribly wrong (besides the "usual" Xid message and the gazillion ioctl calls that, I assume, originated from the nVidia libGL*.so library).
I've attached the nvidia-bug-report.log of this debug session. (Beware: the file is compressed with bzip2... this log is *big*).

As the kernel side of OpenGL seems to function OK, I decided to check the user-mode side of things.
Sadly enough, unlike the MesaGL library, nVidia's library doesn't include any tracing (LIBGL_DEBUG-like) capability, so I had to resort to strace.
I started glgears under strace and after a long while (and a 1.2GB log) it crashed.
In the wee hours of the night I started sifting through the huge log file trying to find something interesting... and lucky for me... I did.
I've cut most of the log to make it uploadable. (See attached gears_short.log, again compressed with bzip2)

The strace log looks something like this:
Initialization. (Lines 0 - 20682)
Notice the nvidiactl open at line 452.
...
while running (Sample: 20811 till 20837)
{
Get current time.
Check if there's anything to read in the nvidiactl character device. (Called by libGL, I assume)
Yield CPU
}
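For readers who want to see what one pass of that loop amounts to, here is a minimal Python sketch. It is only an illustration: whether libGL uses select(), poll() or something else is an assumption, the function name idle_iteration is made up, and a pipe stands in for the nvidiactl device so the snippet runs on any Linux box without the driver.

```python
import os
import select
import time


def idle_iteration(fd):
    """One pass of the loop: timestamp, readiness check, yield.

    Returns (timestamp, readable). The select() call is an assumption;
    the trace only shows that the descriptor is checked for pending data.
    """
    now = time.time()                                # get current time
    readable, _, _ = select.select([fd], [], [], 0)  # zero timeout: check only
    if hasattr(os, "sched_yield"):
        os.sched_yield()                             # yield the CPU
    return now, bool(readable)


# A pipe stands in for the nvidiactl character device (hypothetical stand-in).
r, w = os.pipe()
print(idle_iteration(r)[1])   # nothing has been written yet
os.write(w, b"x")
print(idle_iteration(r)[1])   # one byte is now pending
os.close(r)
os.close(w)
```

A loop built from nothing but these three steps never blocks, which would be consistent with the process sitting at 100% CPU once nothing ever becomes readable again.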


Crash:

The kernel log was the usual one:

Apr 13 02:36:54 gilboa-home-dev kernel: NVRM: Xid: 23, L1 -> L0
Apr 13 02:36:54 gilboa-home-dev kernel: NVRM: Xid: 6, PE0004 1818 bfaadd49 0031fe04 ffffffff fffffff
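When the kernel log is flooded with debug output, it helps to pull out just the Xid lines. The sketch below is one way to do that in Python; the regular expression is an assumption based only on the two lines quoted above, and parse_xid_lines is a made-up helper name, not part of any nVidia tool.

```python
import re

# Pattern inferred from the two log lines quoted in this post (an assumption,
# not a documented format): "NVRM: Xid: <number>, <payload>".
XID_RE = re.compile(r"NVRM: Xid: (\d+), (.*)$")


def parse_xid_lines(lines):
    """Return a list of (xid_number, payload) for every NVRM Xid line."""
    events = []
    for line in lines:
        match = XID_RE.search(line.rstrip("\n"))
        if match:
            events.append((int(match.group(1)), match.group(2)))
    return events


# The two lines from the kernel log above, copied verbatim.
sample = [
    "Apr 13 02:36:54 gilboa-home-dev kernel: NVRM: Xid: 23, L1 -> L0",
    "Apr 13 02:36:54 gilboa-home-dev kernel: NVRM: Xid: 6, PE0004 1818 "
    "bfaadd49 0031fe04 ffffffff fffffff",
]

for xid, payload in parse_xid_lines(sample):
    print(xid, payload)
```

To run it against a real log, pass an open file object (for example the system's kernel log file) instead of the sample list.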

Now, if you look at the log midway through second 1113349014 (seconds since 1970), the "while" loop above is replaced by a burst of ioctls (1113349014 begins at line 20841, crash at 27,159 till 27,309), again, I assume, between libGL and the kernel driver.
After the crash (27,310) the loop returns to its old self; the process eats 100% CPU and X doesn't handle events.
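A burst like this is easier to spot by counting ioctl calls per second than by reading a huge log by eye. The Python sketch below does that under the assumption that the trace was taken with strace -ttt, so every line starts with an epoch timestamp followed by the syscall name; the helper name and the sample lines are synthetic and are not taken from gears_short.log.

```python
import re
from collections import Counter

# Assumed line shape (strace -ttt): "<epoch>.<usec> <syscall>(<args>) = <ret>"
LINE_RE = re.compile(r"^(\d+)\.\d+\s+(\w+)\(")


def syscalls_per_second(lines, name="ioctl"):
    """Count calls to `name` per whole epoch second."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group(2) == name:
            counts[int(match.group(1))] += 1
    return counts


# Synthetic sample lines, for illustration only.
sample_trace = [
    "100.000100 gettimeofday({100, 100}, NULL) = 0",
    "100.000200 ioctl(4, 0x1, 0x0) = 0",
    "101.000100 ioctl(4, 0x1, 0x0) = 0",
    "101.000200 ioctl(4, 0x1, 0x0) = 0",
    "101.000300 sched_yield() = 0",
    "101.000400 ioctl(4, 0x1, 0x0) = 0",
]

counts = syscalls_per_second(sample_trace)
busiest_second, calls = counts.most_common(1)[0]
print(busiest_second, calls)
```

On a real trace, the second with an outsized count is the place to start reading.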

As I have no access to nVidia's ioctl documentation, I can't really do anything with the trace. Hopefully someone at nVidia (Zander?) will be able to use it.

Please,
A brand new copy of Doom3 is waiting on my shelf for nVidia to fix this bug.
I WANT TO PLAY DOOM3!

Help? Zander? Anyone?

(Oh... don't forget to bunzip2 the logs before viewing them... duh!)
Attached Files
File Type: txt gears_short.log.bz2.txt (40.5 KB, 545 views)
File Type: txt nvidia-bug-report.log.bz2.txt (31.5 KB, 586 views)
Last edited by gilboa; 04-13-05 at 12:42 PM.