#1
Linux addict...
Join Date: Jan 2004
Posts: 512
Hello all,
Since buying a new Leadtek FX6600GT card, my once-stable machine went to hell: I'm getting constant Xid 6, 8 and 23 errors when trying to run OpenGL software and games. (See bug report: http://www.nvnews.net/vbulletin/showthread.php?t=48965)
Being pig-headed, I decided to forget about sleep last night and tried to isolate the problem (which seems to be shared by *many* FX6600/GT users).

Before I begin:
A. While I write Linux kernel code for a living, I don't pretend to know how the nVidia driver works. I'm basing my findings on semi-educated (at times, abysmal) guess-work.
B. I know how to play OpenGL games; I know nothing about the inner workings of the OpenGL library.
C. If you find one of my assumptions to be wrong, please feel free to educate me.

A short description of the problem:
A. The driver loads.
B. The OpenGL software starts.
C. Somewhere along the line, the machine locks and the OpenGL process (or its parent X process) eats 100% CPU.
D. The machine becomes unresponsive.
E. "NVRM: Xid nn" messages appear in the kernel log.
F. I can access the machine using SSH after it has crashed and kill the runaway process (be that X, glgears, doom3, etc).
G. The nvidia kernel module doesn't oops.

Machine configuration:
Asus A7M266D.
2 x Athlon MP 2400
1GB RAM
Leadtek A6600GT.
SCSI
Fedora Core 3, 2.6.11.

Step one: rule out unlikely sources:

Hardware:
A. The machine doesn't lock after a crash - I can access it using SSH.
B. The machine was rock solid with older nVidia cards.
C. The FX6600GT eats less power than the previous FX5900 that was running on this machine.

Kernel:
A. The kernel doesn't oops.
B. The machine doesn't lock after a crash - I can access it using SSH.
C. I've tested all the 6x and 7x drivers with 2.6.9, 10 and 11. They all behaved the same.
D. The same driver/kernel was rock solid with older nVidia cards.
E. The runaway process eats the CPU, not the kernel.
F. NvAGP and agpgart behave the same.

Driver: (Here the guess work begins)
A. The driver doesn't oops.
B. I can rmmod the driver after an X/software crash.
C. I can kill the faulty process. (It doesn't go into an uninterruptible sleep - which would have suggested a kernel module crash.)
D. I can restart the X session without restarting the driver. (Which should suggest that the driver data contexts have survived the crash.)
E. Using the driver's debug flags didn't add anything useful.

X problem:
A. The same X configuration works just fine with the same driver and software with the old FX5900 card.
B. 2D works just fine.
C. RenderAccel and Composite, enabled or disabled, behave the same.

User-mode process problem:
A. The same processes (glgears, xscreensaver, Quake3, UT2004, etc.) worked just fine before the upgrade.
B. Have all of them gone mad?

In order to gain some insight into the driver's inner workings I looked for a debug switch; thankfully, I found one.
$ modprobe nvidia NVreg_ResmanDebugLevel=0
(The value seems to be a debug-level mask.)
Once I start the driver with this flag, my kernel log gets flooded with nvidia NVRM messages.
Using the same flag I started an X session and ran an enemy-territory session until it crashed.
Nothing in the kernel log suggested that anything had gone terribly wrong. (Besides the "usual" Xid message and the gazillion ioctl calls that [I assume] originated from the nVidia libGL*.so library.)
I've attached the nvidia-bug-report.log of this debug session. (Beware: the file is compressed by bz2... this log is *big*.)

As the kernel side of OpenGL seems to function OK, I decided to check the user-mode side of things.
Sadly enough, unlike the MesaGL library, nVidia didn't include any tracing (LIBGL_DEBUG-like) capability, so I had to resort to strace.
I started glgears under strace and after a long while (and a 1.2GB log) it crashed.
In the wee hours of the night I started sifting through the huge log file trying to find something interesting... and lucky for me... I did.
I've cut most of the log, trying to make it up-loadable. (See attached gears_short.log, again compressed by bz2.)

The strace log looks something like this:
Initialization. (Lines 0 - 20682)
Notice the nvidiactl open at line 452.
...
while running (Sample: lines 20811 till 20837)
{
Get current time.
Check if there's anything to read in the nvidiactl character device. (Called by libGL, I assume)
Yield CPU
}

Crash:
The kernel log was the usual one:
Apr 13 02:36:54 gilboa-home-dev kernel: NVRM: Xid: 23, L1 -> L0
Apr 13 02:36:54 gilboa-home-dev kernel: NVRM: Xid: 6, PE0004 1818 bfaadd49 0031fe04 ffffffff fffffff

Now, if you look at the log partway through second 1113349014 [since 1970...], the "while" loop above is replaced by a burst of ioctls (1113349014 begins at line 20841, crash at lines 27,159 till 27,309), again, I assume, between libGL and the kernel driver.
After the crash (line 27,310) the loop returns to its old self; the process eats 100% CPU and X doesn't handle events.

As I have no access to nVidia's ioctl documentation, I can't really do anything with the trace. Hopefully someone at nVidia (Zander?) will be able to use it.

Please, a brand new copy of Doom3 is waiting on my shelf for nVidia to fix this bug. I WANT TO PLAY DOOM3!
Help? Zander? Anyone?
(Oh... don't forget to bunzip2 the logs before viewing them... duh!)
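For anyone who wants to look for the same ioctl burst in their own trace, here is a minimal Python sketch that counts calls per wall-clock second. It assumes the trace was taken with epoch timestamps (as `strace -ttt` produces); the post does not say which strace flags were used, so that is an assumption. The sample lines, the file descriptor and the ioctl request numbers are invented for illustration and are not taken from gears_short.log.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Assumed line shape (strace -ttt):
#   1113349014.100000 ioctl(5, 0xc0144632, 0xbfe3a1c0) = 0
# An optional leading PID column (strace -f) is tolerated.
LINE_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\d+)\.\d+\s+(\w+)\(")


def syscalls_per_second(lines, name="ioctl"):
    """Count how many times syscall `name` appears in each epoch second."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group(2) == name:
            counts[int(match.group(1))] += 1
    return counts


def busiest_second(counts):
    """Return (epoch second, call count, UTC time string) for the peak."""
    second, hits = max(counts.items(), key=lambda item: item[1])
    stamp = datetime.fromtimestamp(second, tz=timezone.utc)
    return second, hits, stamp.strftime("%Y-%m-%d %H:%M:%S UTC")


# Synthetic sample: one normal loop iteration, then a short ioctl burst.
sample = [
    "1113349013.900100 gettimeofday({1113349013, 900100}, NULL) = 0",
    "1113349013.900300 ioctl(5, 0xc0144632, 0xbfe3a1c0) = 0",
    "1113349013.900500 sched_yield() = 0",
    "1113349014.100000 ioctl(5, 0xc0144632, 0xbfe3a1c0) = 0",
    "1113349014.100200 ioctl(5, 0xc0144632, 0xbfe3a1c0) = 0",
    "1113349014.100400 ioctl(5, 0xc0144632, 0xbfe3a1c0) = 0",
]

counts = syscalls_per_second(sample)
print(busiest_second(counts))
```

Second 1113349014 converts to 23:36:54 UTC on 12 April 2005, which lines up with the 02:36:54 kernel-log entries above if the machine's clock was three hours ahead of UTC. On a real 1.2GB log you would pass the open file object instead of a list, so the trace is read line by line.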
__________________
TEST: Tyan Tiger K8WE, 2xOpt248, 3GB, 3x80GB, 8600GT, CentOS5/x86-64.
SRV: Gigabyte M56-S3, A64/5000X2, 4GB, 4x320GB, 8600GT, F11/x86-64, E228WFP+L1732S.
DEV: Tyan Tempest i5400XT, 2xE5335, 8GB, 3x320GB, 9800GTX, F11/x86-64, 2408WFP.
Last edited by gilboa; 04-13-05 at 01:42 PM.
#2
Pink GBA
Join Date: Apr 2005
Location: State College, PA
Posts: 217
Good luck with your debugging. And thanks for the debug switch... I'll try to use that to track down my problem (which seems quite different from yours).
Angshuman
#3
ASUS GeForce 7950GX2
I'd just add some more info:
Sometimes (after an Xid) the system "unfreezes", but the image is scrambled and doesn't change (only the music plays). It is enough to exit the game (uhh, difficult without seeing where you click). Then you can start the game again.
OR
It doesn't unfreeze, and if you wait a few minutes you get multiple Xids in the log like this:
Code:
Apr 13 15:36:02 localhost kernel: NVRM: Xid: 23, L1 -> L0
Apr 13 15:36:02 localhost kernel: NVRM: Xid: 8, Channel 00000020
Apr 13 15:38:33 localhost kernel: NVRM: Xid: 23, L1 -> L0
Apr 13 15:38:33 localhost kernel: NVRM: Xid: 8, Channel 00000020
Apr 13 15:41:20 localhost kernel: NVRM: Xid: 23, L1 -> L0
Apr 13 15:41:20 localhost kernel: NVRM: Xid: 8, Channel 00000000
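To compare reports across machines it can help to tally which Xid numbers show up and how often. The following Python sketch does that for kernel-log lines in the format quoted above; the regular expression is only an assumption based on these six lines and may need adjusting for other driver versions. It makes no attempt to interpret what each Xid means.

```python
import re
from collections import Counter

# Matches lines such as "... kernel: NVRM: Xid: 8, Channel 00000020".
# The pattern is inferred from the log excerpt in this thread only.
XID_RE = re.compile(r"NVRM: Xid: (\d+), (.*)$")


def tally_xids(log_lines):
    """Count occurrences of each Xid number in an iterable of log lines."""
    counts = Counter()
    for line in log_lines:
        match = XID_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


log = [
    "Apr 13 15:36:02 localhost kernel: NVRM: Xid: 23, L1 -> L0",
    "Apr 13 15:36:02 localhost kernel: NVRM: Xid: 8, Channel 00000020",
    "Apr 13 15:38:33 localhost kernel: NVRM: Xid: 23, L1 -> L0",
    "Apr 13 15:38:33 localhost kernel: NVRM: Xid: 8, Channel 00000020",
    "Apr 13 15:41:20 localhost kernel: NVRM: Xid: 23, L1 -> L0",
    "Apr 13 15:41:20 localhost kernel: NVRM: Xid: 8, Channel 00000000",
]

xid_counts = tally_xids(log)
for xid, count in sorted(xid_counts.items()):
    print(f"Xid {xid}: {count} occurrence(s)")
```

For the excerpt above this reports three occurrences each of Xid 8 and Xid 23.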
#4
Registered User
Join Date: Jul 2003
Posts: 19
I just put an XFX 6600GT in my AthlonMP box (Tyan Thunder K7). It's "locking" or "crashing" too. In reality it's not really a crash at all; at some undetermined point X just seems to come unglued and goes 99% busy on a single proc (thanks to the dual procs it doesn't choke the system) and just chews away forever. If you attempt to kill the process it reattaches to PID 1 as a zombie and keeps chewing up CPU. You can't strace it because it's all happening in userland.
In order to get around this problem I upgraded to the latest Xorg release (6.8.2), tried running a 2.4.30 kernel instead of 2.6.x, disabled VGA on the console completely, and downgraded from NV driver 7174 to 6629; nothing's worked. Sometimes it takes a dump after seconds, sometimes after hours, you can't tell. Generally it happens when you load images, such as in a browser. All you can do is log into the box from another system and reboot it; you can't get X to shut down no matter what you try. I haven't tried a driver older than 6629 yet but I probably should. I did disable Composite completely since some folks have suggested that it caused problems, but that's not helping either. When the card manages to stay up while I use 3D it works great: I've played Q3A and some mods, and they play nicely at 1600x1200 and pretty smoothly too, but X randomly freaking out on you just isn't practical. In the meantime I've reverted to the Xorg "nv" driver and I'm not having problems at all.
__________________
Tyan ThunderK7 Dual AthlonMP 1.2G 1GB PC2100 Reg ECC XFX 6600GT AGP
TrueCombat: Q3A Mod - truecombat.com
#5
Linux addict...
Join Date: Jan 2004
Posts: 512
LubosD,
Please post your complete hardware configuration... maybe we can find a common ground...?
#6
Registered User
Join Date: Dec 2004
Posts: 87
Several people have reported various different ways to eliminate the "active mouse pointer freeze," but so far no one has determined exactly what is causing the issue. I personally have experienced it with 6629 through 7174; others claim that 6629 is fine while only the 7xxx drivers are problematic.
Anyways, the various things that you can toggle to try to eliminate the problem include:
Also, overclocking might be a factor, but AFAIK no one has mentioned it.
#7
ASUS GeForce 7950GX2
Quote:
Code:
Motherboard FIC VI11 (SiS 645DX chipset)
Pentium 4 2.4 GHz, 533MHz FSB
1792 MB RAM DDR (PC2700)
HDD Seagate 40GB (UATA/100)
HDD Maxtor 250GB (UATA/133)
DVD-ROM Samsung 16x
DVD±RW DL Teac DV-W516GA
Inno3D GeForce 6800 Ultra 256MB GDDR3 AGP (running on 4x only)
Leadtek WinFast TV2000 Deluxe
Kouwell KW7207 (3xIEEE1394a, 3xUSB 2.0, 2xSATA)
Realtek 8139
Creative SB Audigy
Last edited by LubosD; 04-13-05 at 06:11 PM. Reason: more info
#8
Registered User
Join Date: Dec 2004
Posts: 87
Interesting... my hardware is similar to yours. Relevant items:
Soyo P4S 645DX Dragon Ultra
P4 2.53, 533MHz FSB
1GB PC2700
GF 5700 Personal Cinema (@ AGP 4x)
Creative Audigy 2 ZS
FWIW, I enabled AGP write masking in the SBIOS and the problem disappeared. My system is stable with Composite (yeah, I know), RenderAccel, and APIC enabled (although IO-APIC isn't supported by my mobo). It's slightly overclocked too, but I'm not touching the video card's clocks.
#9
Registered User
Join Date: Feb 2005
Posts: 19
Quote:
trace it in such detail. Sometimes different hardware brings out problems hidden in the driver/library pair. I mean, a buggy driver can occasionally leave the application waiting/checking forever (busy-waiting or not) without necessarily causing a kernel oops. I've found this in some Linux open source drivers.
Anyway, I hope nVidia puts an emphasis on debugging this. If they do and finally squash this bug, they should thank you. (Because apparently no one at nVidia tries to insert flags line by line and add some printks to hunt this down, i.e. the same way I hunted the "occasional" bugs in the open source driver. The best debugging tool is the one you develop yourself according to what the bug looks like. Common debugging tools won't help much in these hard cases.)
#10
Linux addict...
Join Date: Jan 2004
Posts: 512
Quote:
If nVidia has a non-blocking function where the GL library issues the command and then continuously uses another ioctl to poll for its completion, the behavior could be the same.
When you poll something, you usually sleep or resched between polling attempts. As the user-land application eats all of the CPU, I can assume that the GL busy-waiting function was designed with the assumption that completion will be (close to) immediate.
What I don't get is this: if the driver indeed fails in user-land context, how come the user-land application doesn't get stuck (be that in the first function, or in the subsequent poll attempts) in an uninterruptible sleep? (Zombie)
Anyways, it would have been nice if I could get my hands on a printk-able version of the driver... nVidia should have released two sets of drivers: a normal and a debug (or rather, trace) version. If the normal version dies on you, install the trace version and post your findings instead of posting a mute "I run quake, the machine locks".
Cheers,
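To make the polling point concrete, here is a small Python sketch of a completion poll that sleeps between attempts and gives up after a deadline, instead of spinning forever. Everything in it (`poll_until`, `FakeCommand`, the timeout values) is hypothetical and written for illustration; it says nothing about how nVidia's libGL actually waits.

```python
import time


def poll_until(is_done, timeout_s=1.0, interval_s=0.001):
    """Poll is_done() until it returns True or the deadline passes.

    Sleeping between attempts gives the CPU back to other processes,
    and the deadline keeps a lost completion from becoming an endless loop.
    Returns (completed, number_of_polls).
    """
    deadline = time.monotonic() + timeout_s
    attempts = 0
    while time.monotonic() < deadline:
        attempts += 1
        if is_done():
            return True, attempts
        time.sleep(interval_s)
    return False, attempts


class FakeCommand:
    """Hypothetical stand-in for a submitted command; done after N polls."""

    def __init__(self, ready_after):
        self.polls = 0
        self.ready_after = ready_after

    def is_done(self):
        self.polls += 1
        return self.polls >= self.ready_after


# A command that completes on the third poll.
print(poll_until(FakeCommand(ready_after=3).is_done))

# A command that never completes: the caller gets a timeout, not a hang.
completed, polls = poll_until(lambda: False, timeout_s=0.05, interval_s=0.01)
print(completed, polls)
```

A loop without the sleep and without the deadline would behave like the runaway process described in this thread if the completion it waits for never arrives: 100% CPU, still killable, no kernel oops.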
#11
Registered User
Join Date: Apr 2005
Posts: 1
gilboa, how are interrupts on your PC configured? PIC or APIC? If the latter, could you please check whether the nVidia interrupt is level-triggered or edge-triggered? As far as I remember, it is supposed to be level-triggered.
#12
Linux addict...
Join Date: Jan 2004
Posts: 512
alexmk,
APIC. (SMP + SCSI with noapic is not a pretty sight...) nVidia interrupts are level-triggered. Code:
cat /proc/interrupts
           CPU0       CPU1
  0:   56361040   56358817    IO-APIC-edge  timer
  1:        611        505    IO-APIC-edge  i8042
  8:          0          1    IO-APIC-edge  rtc
  9:          1          0   IO-APIC-level  acpi
 14:     506560     506438    IO-APIC-edge  ide0
 15:     506397     506600    IO-APIC-edge  ide1
169:     167897     155629   IO-APIC-level  aic7xxx
177:     108476         13   IO-APIC-level  CMI8738-MC6, ehci_hcd, eth0
185:          0          0   IO-APIC-level  EMU10K1
193:         20         86   IO-APIC-level  ohci_hcd
201:       1287        475   IO-APIC-level  ohci_hcd, nvidia
NMI:          0          0
LOC:  112707564  112720690
ERR:          4
MIS:          0
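For others who want to answer the same question on their own box, this Python sketch pulls one device's line out of /proc/interrupts text and reports its trigger mode and which devices share the IRQ. It assumes the 2.6-era column layout shown above (IRQ number, per-CPU counts, controller type such as IO-APIC-level, then a comma-separated device list); other kernels format this file differently. The function name `find_irq` is made up for this example.

```python
def find_irq(interrupts_text, device):
    """Return IRQ details for `device`, or None if it is not listed."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].endswith(":"):
            continue  # header line or blank line
        # Skip the per-CPU counters that follow the IRQ label.
        index = 1
        while index < len(fields) and fields[index].isdigit():
            index += 1
        if index >= len(fields):
            continue  # summary rows such as NMI/LOC/ERR/MIS
        controller = fields[index]
        devices = [d.strip() for d in " ".join(fields[index + 1:]).split(",")]
        if device in devices:
            return {
                "irq": fields[0].rstrip(":"),
                "controller": controller,
                "level_triggered": controller.endswith("-level"),
                "shared_with": [d for d in devices if d != device],
            }
    return None


# A few lines copied from the output posted above.
SAMPLE = """\
           CPU0       CPU1
  0:   56361040   56358817    IO-APIC-edge  timer
193:         20         86   IO-APIC-level  ohci_hcd
201:       1287        475   IO-APIC-level  ohci_hcd, nvidia
NMI:          0          0
ERR:          4
"""

info = find_irq(SAMPLE, "nvidia")
print(info)
```

On a live system you would pass `open("/proc/interrupts").read()` instead of `SAMPLE`. For the posted output, the nvidia line is on IRQ 201, level-triggered, and shared with ohci_hcd.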