Old 08-08-07, 02:21 PM   #1
mppardee
Registered User
 
Join Date: Jul 2004
Posts: 8
advanced debugging of kernel oops

Is it possible to tell which card in a 4-card system has a hardware problem from the kernel messages and Xorg.log? There is one X server spanning 4 screens, one screen per card. The setup worked fine for a year and is now failing periodically with a kernel oops, but the machine is halfway across the country and I'd like to replace only the one card that has failed. Can we also tell from this information whether the motherboard is bad instead of one of the cards? Because the problem only happens once a week or so, it would be overly time-consuming to replace one piece of hardware at a time and see if it is fixed.

The output from dmesg:
http://open-sense.com/downloads/dmesg-4cards.log

The Xorg.log doesn't show any errors, but is there enough info about memory ranges to determine which card is the culprit?
http://open-sense.com/downloads/Xorg-4cards.log

excerpt from dmesg where problem starts:
[33260.705478] NVRM: Xid (0001:00): 8, Channel 00000000
[33268.117899] NVRM: Xid (0001:00): 6, PE0000 17c4 00990000 0000ed90 00990000 00990000
[33268.122775] BUG: unable to handle kernel NULL pointer dereference at virtual address 00000080
[33268.122779] printing eip:
[33268.122781] f9288057
[33268.122782] *pde = 00000000
[33268.122785] Oops: 0000 [#1]
[33268.122787] SMP
[33268.122789] Modules linked in: udf nvidia(P) ppdev lp tc1100_wmi sony_acpi pcc_acpi dev_acpi video sbs ipv6 i2c_ec dock container button battery asus_acpi backlight ac fuse psmouse serio_raw pcspkr parport_pc parport k8temp snd_via82xx_modem snd_via82xx gameport snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_mpu401_uart snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device snd soundcore i2c_viapro i2c_core af_packet sk98lin shpchp pci_hotplug amd64_agp agpgart joydev evdev tsdev squashfs loop unionfs nls_cp437 usbhid hid isofs ide_cd cdrom ata_generic via82cxxx generic ehci_hcd uhci_hcd usbcore sata_via libata scsi_mod skge thermal processor fan fbcon tileblit font bitblit softcursor vesafb capability commoncap
[33268.122842] CPU: 0
[33268.122843] EIP: 0060:[<f9288057>] Tainted: P VLI
[33268.122845] EFLAGS: 00213287 (2.6.20-16-generic #2)
[33268.122999] EIP is at _nv008760rm+0xf/0x20 [nvidia]
[33268.123003] eax: 00000000 ebx: f42a4000 ecx: f42d8000 edx: f42d8000
[33268.123006] esi: f4edda28 edi: f4edda38 ebp: f4edd9bc esp: f4edd9ac
[33268.123009] ds: 007b es: 007b ss: 0068
[33268.123012] Process Xorg (pid: 8566, ti=f4edc000 task=f5aeb030 task.ti=f4edc000)
[33268.123014] Stack: 000000a6 000000a1 00000a00 00000010 f4eddac0 f92d0b58 f42d8000 00000000
[33268.123021] 00000009 00000000 000000ff 00000001 f4eddb58 f42a4000 f93a2999 f4ec7400
[33268.123027] f42a4000 f4edda20 f42b1000 f42b1800 f432f000 f42a3000 f7fcac00 f42a4000
[33268.123034] Call Trace:
[33268.123056] [<f92d0b58>] _nv003367rm+0x908/0x9e8 [nvidia]
[33268.123233] [<f93a2999>] _nv005646rm+0xf1/0xfc [nvidia]
[33268.123456] [<f920fc9f>] _nv002939rm+0x2b/0x84 [nvidia]
[33268.123594] [<f920feb1>] _nv002933rm+0x1d/0x2c [nvidia]
[33268.123716] [<f920feba>] _nv002933rm+0x26/0x2c [nvidia]
[33268.123840] [<f9284be2>] _nv008763rm+0x52/0x70 [nvidia]
[33268.123991] [<f93a3197>] _nv005892rm+0x23/0x28 [nvidia]
...
...
...
[33268.132202] [<f94a5211>] nv_kern_unlocked_ioctl+0x0/0x1d [nvidia]
[33268.132366] [<c01826ab>] do_ioctl+0x2b/0x90
[33268.132388] [<c018276c>] vfs_ioctl+0x5c/0x2a0
[33268.132408] [<c0182a22>] sys_ioctl+0x72/0x90
[33268.132426] [<c01031f0>] sysenter_past_esp+0x69/0xa9
[33268.132432] [<c0104629>] set_intr_gate+0x19/0x30
[33268.132481] =======================
[33268.132482] Code: 14 50 6a 03 8b 45 10 50 51 53 8b 42 14 ff d0 90 8d 74 26 00 8b 5d e8 89 ec 5d c3 90 55 89 e5 83 ec 08 8b 4d 08 8b 45 0c 83 c4 f8 <8b> 90 80 00 00 00 50 51 8b 42 04 ff d0 89 ec 5d c3 55 89 e5 8b
[33268.132509] EIP: [<f9288057>] _nv008760rm+0xf/0x20 [nvidia] SS:ESP 0068:f4edd9ac
[33268.132651]
Old 08-08-07, 02:31 PM   #2
netllama
NVIDIA Corporation
 
Join Date: Dec 2004
Posts: 8,763
Re: advanced debugging of kernel oops

How did you determine that this is a hardware problem?
Old 08-08-07, 03:13 PM   #3
zander
NVIDIA Corporation
 
Join Date: Aug 2002
Posts: 3,740
Re: advanced debugging of kernel oops

Does this problem reproduce with 1.0-9639?
Old 08-08-07, 04:23 PM   #4
mppardee
Registered User
 
Join Date: Jul 2004
Posts: 8
Re: advanced debugging of kernel oops

Thanks for the quick response, guys. This machine was just fine for over a year, and there are 10 more machines just like it in service at the same site with the same software and hardware, but this is the only one experiencing this problem. If the logs can't show which card is the likely culprit, then we'll just replace everything.

I may try 9639 if the customer has more patience. Is there any way to enable more debugging messages?
Old 08-08-07, 04:42 PM   #5
chunkey
#!/?*
 
Join Date: Oct 2004
Posts: 662
Re: advanced debugging of kernel oops

Quote:
Originally Posted by mppardee

[33260.705478] NVRM: Xid (0001:00): 8, Channel 00000000
[33268.117899] NVRM: Xid (0001:00): 6, PE0000 17c4 00990000 0000ed90 00990000 00990000
[33268.122775] BUG: unable to handle kernel NULL pointer dereference at virtual address 00000080
This is just a wild guess, but 0001:00 looks like a PCI tree number.
=> It's probably the card in the AGP slot?
Old 08-08-07, 05:04 PM   #6
zander
NVIDIA Corporation
 
Join Date: Aug 2002
Posts: 3,740
Re: advanced debugging of kernel oops

Correct, the value in the parentheses is a PCI bus:device tuple.
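
For a long dmesg log, something like the following rough Python sketch could pull the bus:device tuple out of each Xid line and count events per card. It is only an illustration, not an NVIDIA tool. The helper names (parse_xid, count_xids_by_device, xorg_bus_id) are made up for this example. It assumes the two fields are printed in hexadecimal and that the card sits at function 0, which the Xid line does not report. The BusID string follows the decimal PCI:bus:device:function form used in xorg.conf, so it can be compared against the Device sections there.

```python
import re
from collections import Counter

# Matches lines such as:
#   [33260.705478] NVRM: Xid (0001:00): 8, Channel 00000000
XID_RE = re.compile(r"NVRM: Xid \(([0-9A-Fa-f]+):([0-9A-Fa-f]+)\): (\d+),")


def parse_xid(line):
    """Return (bus, device, xid_code) as ints, or None if the line has no Xid."""
    m = XID_RE.search(line)
    if m is None:
        return None
    # Assumption: bus and device are printed in hexadecimal.
    return int(m.group(1), 16), int(m.group(2), 16), int(m.group(3))


def count_xids_by_device(lines):
    """Count Xid events per (bus, device) pair across a dmesg log."""
    counts = Counter()
    for line in lines:
        parsed = parse_xid(line)
        if parsed is not None:
            bus, device, _code = parsed
            counts[(bus, device)] += 1
    return counts


def xorg_bus_id(bus, device, function=0):
    """Format a BusID string in the decimal form used in xorg.conf.

    Assumption: function 0, since the Xid line only gives bus and device.
    """
    return f"PCI:{bus}:{device}:{function}"


# Sample lines copied from the excerpt in the first post.
sample = [
    "[33260.705478] NVRM: Xid (0001:00): 8, Channel 00000000",
    "[33268.117899] NVRM: Xid (0001:00): 6, PE0000 17c4 00990000 0000ed90 00990000 00990000",
    "[33268.122785] Oops: 0000 [#1]",
]

for (bus, device), n in count_xids_by_device(sample).items():
    print(f"{xorg_bus_id(bus, device)}: {n} Xid event(s)")
```

If every Xid in the full log points at the same bus:device, that at least narrows down which card the driver was talking to when things went wrong. It does not by itself prove the card rather than the slot or the motherboard is at fault.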