Old 04-24-12, 01:03 AM   #20
Psicopatico
Re: [BUG] nvidia crashes kernel with 'Xid 13' and attempted to yield the CPU while at

Hi all,

I just wanted to share my experience with these hard lock-ups and report that I've found a solution for my case, although it is limited in its applicability.

First of all, my relevant hardware:
- Motherboard is an ASRock AM2NF3-VSTA with the latest available BIOS update (P3.30)
- CPU is an AMD Phenom II X4 965, not overclocked (in fact, I usually keep it underclocked, capping its maximum frequency from 3.4GHz down to 2.7GHz via the cpufreq kernel module)
- GPU is a GeForce 7600 GT AGP with BIOS version 05.73.22.51.95. The relevant output of "lspci -v" follows:
Quote:
01:00.0 VGA compatible controller: nVidia Corporation G73 [GeForce 7600 GT] (rev a2) (prog-if 00 [VGA controller])
Subsystem: XFX Pine Group Inc. Device 2247
Flags: bus master, 66MHz, medium devsel, latency 248, IRQ 19
Memory at fd000000 (32-bit, non-prefetchable) [size=16M]
Memory at e0000000 (32-bit, prefetchable) [size=256M]
Memory at fc000000 (32-bit, non-prefetchable) [size=16M]
[virtual] Expansion ROM at feae0000 [disabled] [size=128K]
Capabilities: [60] Power Management version 2
Capabilities: [44] AGP version 3.0
Kernel driver in use: nvidia
The system in use is openSUSE 11.4 "Celadon":
Quote:
Linux linux 2.6.37.6-0.7-desktop #1 SMP PREEMPT 2011-07-21 02:17:24 +0200 i686 athlon i386 GNU/Linux
I've been plagued by NVRM Xid faults for so long that I've lost count.
Here's a very small sample of the offending ones, obtained by grepping the system logs:
Quote:
Apr 18 11:25:09 linux kernel: [1662639.915779] NVRM: Xid (0000:01:00): 6, PE0000 1718 00000000 00316ee0 ffffffff 00000000
Apr 18 11:25:37 linux kernel: [1662667.396163] NVRM: Xid (0000:01:00): 8, Channel 00000020
Apr 18 11:43:42 linux kernel: [ 937.566287] NVRM: Xid (0000:01:00): 6, PE0002 1f0c 4608239b 00110ce4 bf000000 be800000
Apr 18 11:43:42 linux kernel: [ 937.604027] NVRM: Xid (0000:01:00): 6, PE0000 1f28 3f0f0775 00000000 ffffffff 00000000
Apr 18 12:39:02 linux kernel: [ 853.312233] NVRM: Xid (0000:01:00): 7, Ch 0000001e M 00000000 D 00000000 intr 00011000
Apr 18 12:42:41 linux kernel: [ 106.738263] NVRM: Xid (0000:01:00): 13, 0000 e0014200 00000062 00000184 0000baf5 00000002
Apr 19 01:56:22 linux kernel: [ 2429.440420] NVRM: Xid (0000:01:00): 50, L0 -> L0
Apr 19 02:58:52 linux kernel: [ 8865.582396] NVRM: Xid (0001:00): 47, L0 -> L0
Apr 19 08:09:18 linux kernel: [ 5752.276583] NVRM: Xid (0001:00): 3, C 00000000 SC 00000006 M 00000318 Data 00000000
Don't mind the differences in how the data is presented; I switched driver versions along the way.
As you can see, there's a wide range of cases, all mixed with repeated "NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context" messages.
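
For anyone who wants to tally the Xid codes in their own logs, a small shell sketch along these lines should work. The function name xid_tally is made up for illustration, and the log path in the usage comment is an assumption that depends on your distribution. It only counts how often each Xid number appears; it does not interpret the codes.

```shell
# Count NVRM Xid events per error code from kernel log text on stdin.
# xid_tally is a hypothetical helper name, not part of any NVIDIA tool.
xid_tally() {
    # Extract the number that follows "NVRM: Xid (<bus id>): ",
    # then sort numerically and count the occurrences of each code.
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p' \
        | sort -n \
        | uniq -c \
        | awk '{ print "Xid " $2 ": " $1 }'
}

# Usage (the log path is an assumption; adjust it for your system):
# xid_tally < /var/log/messages
```

Each output line has the form "Xid <code>: <count>", which makes it easy to see which fault dominates before and after a configuration change.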

I think I tried each and every trick I found with the help of the Mighty Google:
- rolled back through every older driver version (betas included) down to the 260.19.04 release (older ones wouldn't build on my system);
- switched between the kernel's AGPGART and the AGP support supplied with the NVIDIA driver;
- lowered GPU and memory clocks to rule out temperature issues (though this card doesn't run hot: the maximum I've seen is 78C under full load at stock clocks);
- switched off ACPI PCI IRQ sharing at boot, as this is a multi-core system with HPET;
- switched off the SideBand Addressing and Fast Writes AGP options;
- others I can't even remember.

It always ended in failure: the graphics system hung repeatedly. Remote access via SSH always remained possible, so I could reboot the machine cleanly.
The interesting bit is that this always occurred while running software under Wine (mostly games; the worst offenders were the already mentioned World of Tanks and Need for Speed World, among others) but never with native applications (whether games, Compiz and the like, or even glxgears).

Well, in the end I tried, mostly by chance, lowering the AGP rate from 8X to 4X by reloading the kernel module with the NVreg_ReqAGPRate=4 option.
Result: my system has been rock-solid for almost a week. Not a single crash. Not one.
Yes, it took a noticeable performance hit, but I'd gladly trade that for stability any day.
Also, I understand this solution won't apply to everyone, as most systems today use PCI Express rather than AGP, but - hey! - it's a start.
Net result: I'm now on the latest recommended driver version (295.20), with the module loaded and set up for AGP 4X, and so far, so good.
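
For reference, reloading the module with the lower AGP rate might look roughly like the following. This is a sketch under assumptions, not a tested recipe: it must be run as root with the X server stopped (for example from an SSH session), the modprobe.d file name is made up for illustration, and the /proc status file is only present on drivers that still include AGP support.

```shell
# Assumption: X is stopped, otherwise the module is in use and cannot be unloaded.
modprobe -r nvidia

# Reload the driver, requesting AGP 4X instead of the default 8X.
modprobe nvidia NVreg_ReqAGPRate=4

# Optional: make the setting persistent across reboots.
# The file name 50-nvidia-agp.conf is a hypothetical choice.
echo "options nvidia NVreg_ReqAGPRate=4" > /etc/modprobe.d/50-nvidia-agp.conf

# Check which AGP rate the driver actually negotiated.
cat /proc/driver/nvidia/agp/status
```

Setting the option in modprobe.d means the rate applies however the module gets loaded, so there is no need to repeat the manual reload after every boot.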

I hope this helps and enlightens someone.

Regards