Old 04-13-05, 11:20 AM   #1
gilboa
Linux addict...
 
Join Date: Jan 2004
Posts: 540
Default Trying to make headway into finding the Xid crashes source...

Hello all,

Since buying a new Leadtek FX6600GT card, my once stable machine went to hell:
I'm getting constant Xid 6, 8 and 23 errors when trying to run OpenGL software and games.
(See bug report: http://www.nvnews.net/vbulletin/showthread.php?t=48965)

Being pig-headed, I decided to forget about sleep last night and tried to isolate the problem (which seems to be shared by *many* FX6600/GT users).

Before I begin:
A. While I write Linux kernel code for a living I don't pretend to know how the nVidia driver works. I'm basing my findings on semi-educated (at times, abysmal) guess-work.
B. I know how to play OpenGL games; I know nothing about the inner workings of the OpenGL library.
C. If you find one of my assumptions to be wrong, please feel free to educate me.

A short description of the problem:
A. The driver loads.
B. The OpenGL software starts.
C. Somewhere along the line, the machine locks up and the OpenGL process (or its parent X process) eats 100% CPU.
D. The machine becomes unresponsive.
E. "NVRM: Xid nn" messages appear in the kernel log.
F. I can access the machine using SSH after it has crashed and kill the runaway process (be that X, glgears, doom3, etc.).
G. The nvidia kernel module doesn't oops.

Machine configuration:
Asus A7M266D.
2 x Athlon MP 2400
1GB RAM
Leadtek A6600GT.
SCSI
Fedora Core 3, 2.6.11.

Step one: rule out unlikely sources:
Hardware:
A. The machine doesn't lock after a crash - I can access it using SSH.
B. The machine was rock solid with older nVidia cards.
C. The FX6600GT draws less power than the previous FX5900 that was running on this machine.

Kernel:
A. The kernel doesn't oops.
B. The machine doesn't lock after a crash - I can access it using SSH.
C. I've tested all the 6xxx and 7xxx drivers with kernels 2.6.9, 2.6.10 and 2.6.11. They all behaved the same.
D. The same driver/kernel was rock solid with older nVidia cards.
E. The runaway process eats the CPU, not the kernel.
F. NvAGP and agpgart behave the same.

Driver. (Here the guess work begins)
A. The driver doesn't oops.
B. I can rmmod the driver after an X/software crash.
C. I can kill the faulty process. (It doesn't go into an uninterruptible sleep - which would have suggested a kernel module crash.)
D. I can restart the X session, without restarting the driver. (Which should suggest that the driver data contexts have survived the crash)
E. Using the driver's debug flags didn't add anything useful.

X Problem:
A. Same X configuration works just fine with the same driver and software with the old FX5900 card.
B. 2D works just fine.
C. RenderAccel and Composite enabled, or disabled, behave the same.

User-mode Process problem.
A. The same processes (glgears, xscreensaver, Quake3, UT2004, etc.) worked just fine before the upgrade.
B. Have all of them gone mad?

In order to gain some insight into the driver's inner workings, I went looking for a debug switch. Thankfully, I found one:
$ modprobe nvidia NVreg_ResmanDebugLevel=0 (The value seems to be a debug-level umask)
Once I start the driver with this flag, my kernel log gets flooded with nvidia NVRM messages.
Using the same flag, I started an X session and ran an Enemy Territory session until it crashed. Nothing in the kernel log suggested that anything had gone terribly wrong (besides the "usual" Xid message and the gazillion ioctl calls that [I assume] originated from the nVidia libGL*.so library).
I've attached the nvidia-bug-report.log of this debug session. (Beware: the file is compressed by bz2... this log is *big*).

As the kernel side of OpenGL seems to function OK, I decided to check the user-mode side of things.
Sadly, unlike the Mesa GL library, nVidia didn't include any tracing (LIBGL_DEBUG-like) capability, so I had to resort to strace.
I started glgears under strace and after a long while (and a 1.2GB log) it crashed.
In the wee hours of the night I started sifting through the huge log file trying to find something interesting... and, lucky for me, I did.
I've cut most of the log to make it uploadable. (See attached gears_short.log, again compressed with bz2.)

The strace log looks something like this:
Initialization. (Lines 0 - 20682)
Notice the nvidiactl open at line 452.
...
while running (Sample: 20811 till 20837)
{
Get current time.
Check if there's anything to read in the nvidiactl character device. (Called by libGL, I assume)
Yield CPU
}


Crash:

The kernel log was the usual one:

Apr 13 02:36:54 gilboa-home-dev kernel: NVRM: Xid: 23, L1 -> L0
Apr 13 02:36:54 gilboa-home-dev kernel: NVRM: Xid: 6, PE0004 1818 bfaadd49 0031fe04 ffffffff fffffff

Now, if you look at the log partway through second 1113349014 [since 1970...], the "while" loop above is replaced by a burst of ioctls (second 1113349014 begins at line 20841; the crash spans lines 27,159 to 27,309), again, I assume, between libGL and the kernel driver.
After the crash (line 27,310) the loop returns to its old self; the process eats 100% CPU and X doesn't handle events.
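
For anyone who wants to repeat this kind of search on their own trace without scrolling through a gigabyte of text, here is a minimal sketch in Python. It assumes the trace was taken with `strace -ttt`, so that every line starts with an epoch timestamp; the function name and the sample lines (including the ioctl numbers) are made up for illustration and are not taken from the attached log. It simply counts syscalls per second so that a sudden run of ioctls stands out.

```python
import re
from collections import Counter, defaultdict

# Matches an optional PID column (as printed by strace -f), an epoch
# timestamp (as printed by strace -ttt) and the syscall name after it.
LINE_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\d+)\.\d+\s+(\w+)\(")


def syscalls_per_second(lines):
    """Return {epoch_second: Counter({syscall_name: count})}."""
    buckets = defaultdict(Counter)
    for line in lines:
        m = LINE_RE.match(line)
        if m:  # lines that do not look like a syscall are skipped
            buckets[int(m.group(1))][m.group(2)] += 1
    return buckets


# Hypothetical sample lines, NOT copied from gears_short.log.
sample = [
    "1113349013.900000 gettimeofday({1113349013, 900000}, NULL) = 0",
    "1113349013.900100 sched_yield() = 0",
    "1113349014.100000 ioctl(4, 0xc0144632, 0xbfffe000) = 0",
    "1113349014.100200 ioctl(4, 0xc0144632, 0xbfffe000) = 0",
    "1113349014.100400 ioctl(4, 0xc0144632, 0xbfffe000) = 0",
]

for second, counts in sorted(syscalls_per_second(sample).items()):
    print(second, dict(counts))
```

On a real trace you would pass the open file object instead of `sample`; a second whose count is dominated by `ioctl` rather than the usual time/poll/yield mix is the place to start reading.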

As I have no access to nVidia's ioctl documentation, I can't really do anything with the trace. Hopefully someone at nVidia (Zander?) will be able to use it.

Please,
A brand new copy of Doom3 is waiting on my shelf for nVidia to fix this bug.
I WANT TO PLAY DOOM3!

Help? Zander? Anyone?

(Oh... don't forget to bunzip2 the logs before viewing them... duh!)
Attached Files
File Type: txt gears_short.log.bz2.txt (40.5 KB, 546 views)
File Type: txt nvidia-bug-report.log.bz2.txt (31.5 KB, 587 views)
__________________
DEV-NG: Intel S2600C0, 2xE52658V2, 32GB, 4x2TB, GTX680, F19/x86_64, Dell U2711.
DEV: Intel S5520SC, 2xX5680, 36GB, 5x320GB, GTX550, F19/x86_64, Dell U2711 (^).
SRV: Tyan Tempest i5400XT, 2xE5335, 8GB, 4x2TB, 9800GTX, F19/x86-64, Dell U2412.
LAP: ASUS N56VJ, i7-3630QM, 16GB, 1TB, 635M, F19/x86_64.

Last edited by gilboa; 04-13-05 at 12:42 PM.
Old 04-13-05, 01:26 PM   #2
angshuman
Pink GBA
 
Join Date: Apr 2005
Location: State College, PA
Posts: 207
Default Re: Trying to make headway into finding the Xid crashes source...

Good luck with your debugging

And thanks for the debug switch... I'll try to use that to track down my problem (which seems quite different from yours).

Angshuman
Old 04-13-05, 01:36 PM   #3
LubosD
Registered User
 
Join Date: Jan 2005
Location: Czech Republic
Posts: 451
Default Re: Trying to make headway into finding the Xid crashes source...

I'd just add some more info:

Sometimes (after an Xid) the system "unfreezes", but the image is scrambled and doesn't change (only music plays). It is enough to exit the game (uhh, difficult without seeing where you click). Then you can start the game again.

OR

It doesn't unfreeze and if you wait a few minutes, you can get multiple Xids in the log like this:

Code:
Apr 13 15:36:02 localhost kernel: NVRM: Xid: 23,  L1 -> L0
Apr 13 15:36:02 localhost kernel: NVRM: Xid: 8, Channel 00000020
Apr 13 15:38:33 localhost kernel: NVRM: Xid: 23,  L1 -> L0
Apr 13 15:38:33 localhost kernel: NVRM: Xid: 8, Channel 00000020
Apr 13 15:41:20 localhost kernel: NVRM: Xid: 23,  L1 -> L0
Apr 13 15:41:20 localhost kernel: NVRM: Xid: 8, Channel 00000000
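
In case it helps anyone compare logs, here is a minimal sketch in Python (not part of the driver or of any nVidia tool) that pulls the Xid number and the trailing detail out of lines like the ones above. The function name `parse_xids` is made up for illustration, and the script assigns no meaning to the Xid numbers; it only counts them.

```python
import re
from collections import Counter

# Captures the Xid number and whatever text follows the comma.
XID_RE = re.compile(r"NVRM: Xid: (\d+),\s*(.*)$")


def parse_xids(lines):
    """Return a list of (xid_number, detail_text), one per NVRM Xid line."""
    found = []
    for line in lines:
        m = XID_RE.search(line)
        if m:
            found.append((int(m.group(1)), m.group(2).strip()))
    return found


# The six lines quoted in the log excerpt above.
log = [
    "Apr 13 15:36:02 localhost kernel: NVRM: Xid: 23,  L1 -> L0",
    "Apr 13 15:36:02 localhost kernel: NVRM: Xid: 8, Channel 00000020",
    "Apr 13 15:38:33 localhost kernel: NVRM: Xid: 23,  L1 -> L0",
    "Apr 13 15:38:33 localhost kernel: NVRM: Xid: 8, Channel 00000020",
    "Apr 13 15:41:20 localhost kernel: NVRM: Xid: 23,  L1 -> L0",
    "Apr 13 15:41:20 localhost kernel: NVRM: Xid: 8, Channel 00000000",
]

events = parse_xids(log)
print(Counter(xid for xid, _ in events))
```

Feeding it a whole kernel log (one line per list item) gives a quick tally of which Xid numbers a given machine produces, which makes it easier to line up reports from different posters.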
Old 04-13-05, 02:36 PM   #4
technikolor
Registered User
 
Join Date: Jul 2003
Posts: 18
Default Re: Trying to make headway into finding the Xid crashes source...

I just put an XFX 6600GT in my AthlonMP box (Tyan Thunder K7). It's "locking" or "crashing" too. In reality it's not really a crash at all: at some undetermined point X just seems to come unglued and goes 99% busy on a single proc (thanks to the dual procs it doesn't choke the system) and just chews away forever. If you attempt to kill the process, it reattaches to init (PID 1) as a zombie and keeps chewing up CPU. You can't strace it, because it's all happening in userland.

In order to get around this problem I upgraded to the latest Xorg release (6.8.2), tried running a 2.4.30 kernel instead of 2.6.x, disabled VGA on the console completely, and downgraded from NV driver 7174 to 6629; nothing worked. Sometimes it takes a dump after seconds, sometimes after hours; you can't tell. Generally it happens when you load images, such as in a browser. All you can do is log into the box from another system and reboot it; you can't get X shut down no matter what you try.

I haven't tried a driver older than 6629 yet, but I probably should. I did disable Composite completely, since some folks have suggested that it caused problems, but that's not helping either. When the card manages to stay up while I use 3D it works great: I've played Q3A and some mods, and they play nicely at 1600x1200 and pretty smoothly too, but X randomly freaking out on you just isn't practical. In the meantime I've reverted to the Xorg "nv" driver and I'm not having problems at all.
__________________
Tyan ThunderK7
Dual AthlonMP 1.2G
1GB PC2100 Reg ECC
XFX 6600GT AGP
TrueCombat: Q3A Mod - truecombat.com
Old 04-13-05, 03:57 PM   #5
gilboa
Linux addict...
 
Join Date: Jan 2004
Posts: 540
Default Re: Trying to make headway into finding the Xid crashes source...

LubosD,

Please post your complete hardware configuration... maybe we can find a common ground...?
Old 04-13-05, 05:01 PM   #6
Ironi
Registered User
 
Join Date: Dec 2004
Posts: 86
Default Re: Trying to make headway into finding the Xid crashes source...

Several people have reported various different ways to eliminate the "active mouse pointer freeze," but so far no one has determined exactly what is causing the issue. I personally have experienced it with 6629 through 7174; others claim that 6629 is fine while only the 7xxx drivers are problematic.

Anyways, the various things that you can toggle to try to eliminate the problem include:
  • Composite (this is problematic anyways, and should be disabled).
  • RenderAccel (I don't notice much of a difference whether it's enabled or not).
  • GW-Write Mask AGP Request (or something along those lines, if it's not missing entirely) in the System BIOS configuration. Enabling it seems to reduce AGP bandwidth somewhat.
  • APIC. I don't recommend disabling this, personally: most modern mobos (even uniprocessor) support IO-APIC, a drastically superior way of providing and assigning IRQs. If your board does support IO-APIC, then disabling APIC support entirely probably impacts performance in a negative way.

Also, overclocking might be a factor as well, but AFAIK no one has mentioned it.
Old 04-13-05, 05:03 PM   #7
LubosD
Registered User
 
Join Date: Jan 2005
Location: Czech Republic
Posts: 451
Default Re: Trying to make headway into finding the Xid crashes source...

Quote:
Originally Posted by gilboa
LubosD,

Please post your complete hardware configuration... maybe we can find a common ground...?
OK, here's the complete description (except external devices):

Code:
Motherboard FIC VI11 (SiS 645DX chipset)
Pentium 4 2.4 GHz, 533MHz FSB
1792 MB RAM DDR (PC2700)
HDD Seagate 40GB (UATA/100)
HDD Maxtor 250GB (UATA/133)
DVD-ROM Samsung 16x
DVD±RW DL Teac DV-W516GA
Inno3D GeForce 6800 Ultra 256MB GDDR3 AGP (running on 4x only)
Leadtek WinFast TV2000 Deluxe
Kouwell KW7207 (3xIEEE1394a, 3xUSB 2.0, 2xSATA)
Realtek 8139
Creative SB Audigy
VBIOS version: 5.40.02.15.06

Last edited by LubosD; 04-13-05 at 05:11 PM. Reason: more info
Old 04-13-05, 05:33 PM   #8
Ironi
Registered User
 
Join Date: Dec 2004
Posts: 86
Default Re: Trying to make headway into finding the Xid crashes source...

Interesting... my hardware is similar to yours. Relevant items:

Soyo P4S 645DX Dragon Ultra
P4 2.53, 533Mhz FSB
1GB PC2700
GF 5700 Personal Cinema (@ AGP 4x)
Creative Audigy 2 ZS

FWIW, I enabled AGP write masking in the SBIOS and the problem disappeared. My system is stable with Composite (yeah, I know), RenderAccel, and APIC enabled (although IO-APIC isn't supported by my mobo). It's slightly overclocked too, but I'm not touching the video card's clocks.

Old 04-14-05, 12:09 AM   #9
nagual.hsu
Registered User
 
Join Date: Feb 2005
Posts: 19
Default Re: Trying to make headway into finding the Xid crashes source...

Quote:
Originally Posted by gilboa
Hello all,

As I have no access to nVidia's ioctls documentation, I can't really do anything with the trace. Hopefully someone at nVidia (Zander?) will be able to use it.

Please,
A brand new copy of Doom3 is waiting on my shelf for nVidia to fix this bug.
I WANT TO PLAY DOOM3!

Help? Zander? Anyone?

(Oh... don't forget to bunzip2 the logs before viewing them... duh!)
This is quite an effort. Thanks, gilboa! My guess is similar to yours, but I didn't trace it in such detail. Sometimes different hardware exposes problems hidden in the driver/library pair. I mean, a buggy driver can occasionally leave the application waiting/checking forever (whether busy-waiting or not) without necessarily causing a kernel oops. I've found this in some Linux open source drivers.

Anyway, I hope nVidia puts an emphasis on debugging this. If they do and finally erase this bug, they should thank you. (Because, apparently, no one at nVidia has tried inserting flags line by line, plus some printks, to hunt this down, i.e. the same way I hunted the "occasional" bugs in the open source driver. The best debugging tool is one you develop yourself, according to what the bug looks like. Common debugging tools won't help much in these hard cases.)
Old 04-14-05, 05:24 AM   #10
gilboa
Linux addict...
 
Join Date: Jan 2004
Posts: 540
Default Re: Trying to make headway into finding the Xid crashes source...

Quote:
My guess is similar to yours, but I didn't trace it in such detail. Sometimes different hardware exposes problems hidden in the driver/library pair. I mean, a buggy driver can occasionally leave the application waiting/checking forever (whether busy-waiting or not) without necessarily causing a kernel oops. I've found this in some Linux open source drivers.
Thought about it.
If nVidia has a non-blocking function where the GL library issues the command and then continuously uses another ioctl to poll its completion, the behavior could be the same.
When you poll something, you usually sleep or resched between polling attempts; as the user-land application eats all of the CPU, I can assume that the GL busy-waiting function was designed with the assumption that completion would be (close to) immediate.
What I don't get is this: if the driver indeed fails in user-land context, how come the user-land application doesn't get stuck (be that in the first function, or in the subsequent poll attempts) in an un-interruptible sleep (D state)?
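
To illustrate the difference I have in mind, here is a minimal sketch in Python. It is NOT nVidia's code and says nothing about how libGL really waits; `poll_until`, `never_completes` and `completes_on_third_check` are hypothetical names invented for this example. The idea is only that a poll loop which sleeps between attempts and carries a deadline gives up cleanly when completion never arrives, whereas a bare busy-wait would spin at 100% CPU forever.

```python
import time


def poll_until(is_done, timeout=1.0, interval=0.01):
    """Poll is_done() until it returns True or `timeout` seconds pass.

    Sleeps `interval` seconds between attempts, so a completion that
    never arrives costs little CPU, and returns (False, attempts)
    at the deadline instead of spinning forever.
    """
    deadline = time.monotonic() + timeout
    attempts = 0
    while True:
        attempts += 1
        if is_done():
            return True, attempts
        if time.monotonic() >= deadline:
            return False, attempts
        time.sleep(interval)


# Hypothetical stand-ins for a "has the command completed?" check.
def never_completes():
    return False


calls = {"n": 0}


def completes_on_third_check():
    calls["n"] += 1
    return calls["n"] >= 3


print(poll_until(completes_on_third_check))
print(poll_until(never_completes, timeout=0.05))
```

Removing both the `time.sleep` call and the deadline check from the loop turns it into the kind of busy-wait that would match the symptom described above: the caller never blocks in the kernel, stays killable, and burns a full CPU.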

Anyways, it would have been nice if I could get my hands on a printk-able version of the driver... nVidia should release two sets of drivers: a normal version and a debug (or trace) version. If the normal version dies on you, install the trace version and post your findings instead of posting a mute "I run Quake, the machine locks".

Cheers,
Old 04-14-05, 08:56 AM   #11
alexmk
Registered User
 
Join Date: Apr 2005
Posts: 1
Default Re: Trying to make headway into finding the Xid crashes source...

gilboa, how are interrupts configured on your PC? PIC or APIC? If the latter, could you please check whether the nVidia interrupt is level-triggered or edge-triggered? As far as I remember, it is supposed to be level-triggered.
Old 04-14-05, 10:55 AM   #12
gilboa
Linux addict...
 
Join Date: Jan 2004
Posts: 540
Default Re: Trying to make headway into finding the Xid crashes source...

alexmk,

APIC. (SMP + SCSI with noapic is not a pretty sight...)
nVidia interrupts are level-triggered.

Code:
cat /proc/interrupts
           CPU0       CPU1
  0:   56361040   56358817    IO-APIC-edge  timer
  1:        611        505    IO-APIC-edge  i8042
  8:          0          1    IO-APIC-edge  rtc
  9:          1          0   IO-APIC-level  acpi
 14:     506560     506438    IO-APIC-edge  ide0
 15:     506397     506600    IO-APIC-edge  ide1
169:     167897     155629   IO-APIC-level  aic7xxx
177:     108476         13   IO-APIC-level  CMI8738-MC6, ehci_hcd, eth0
185:          0          0   IO-APIC-level  EMU10K1
193:         20         86   IO-APIC-level  ohci_hcd
201:       1287        475   IO-APIC-level  ohci_hcd, nvidia
NMI:          0          0
LOC:  112707564  112720690
ERR:          4
MIS:          0
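
For others who want to run the same check, here is a minimal sketch in Python that looks up a driver in /proc/interrupts-style text and reports the IRQ number, the trigger type and any devices sharing the line. It assumes the 2.6-era "IO-APIC-level" / "IO-APIC-edge" column format shown above (other kernels may print this column differently), and `find_irq` is a made-up name for this example.

```python
def find_irq(text, driver):
    """Return (irq, trigger, devices) for the line naming `driver`, else None."""
    for line in text.splitlines():
        if ":" not in line:
            continue  # header line with CPU names
        irq, rest = line.split(":", 1)
        fields = rest.split()
        for i, field in enumerate(fields):
            # The controller column looks like "IO-APIC-level".
            if field.startswith("IO-APIC-"):
                names = " ".join(fields[i + 1:]).split(",")
                devices = [name.strip() for name in names]
                if driver in devices:
                    return irq.strip(), field[len("IO-APIC-"):], devices
                break
    return None


# Three lines copied from the listing above.
sample = """\
           CPU0       CPU1
193:         20         86   IO-APIC-level  ohci_hcd
201:       1287        475   IO-APIC-level  ohci_hcd, nvidia
"""

print(find_irq(sample, "nvidia"))
```

On the listing above this reports IRQ 201 as level-triggered and shows that the nvidia driver shares the line with ohci_hcd. Whether that sharing matters for the Xid problem is not something this output can tell you.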
Copyright ©1998 - 2014, nV News.