Old 08-13-12, 10:25 PM   #61
rockob
Registered User
 
Join Date: Nov 2008
Posts: 95
Default Re: Random crashes, NVRM Xid messages

Sadly, the just-released 304.37 driver does not fix the problem (though I think, according to the release notes, it is supposed to). The symptoms are slightly different, but X still locks up completely and requires a cold boot.

I didn't actually see the "GPU has fallen off the bus" message (or any Xid errors) in the log, but the video did freeze after about a minute of gameplay in Crysis2 (the audio kept going, though), and when I tried to close its window, its bumblebee X process froze, locking the main X shortly afterwards. I *was* able to ssh into the machine, which is new, but I was unable to kill -9 any of the locked processes, and restarting lightdm failed, as did a reboot command. Only a hard reset fixed it. fwiw, I was running the 3.6-rc1 kernel.
Old 08-19-12, 06:23 AM   #62
Iesos
Registered User
 
Join Date: Apr 2012
Posts: 15
Default Re: Random crashes, NVRM Xid messages

304.37 does not fix the problem. The crash is a bit less severe: it crashes the application (in this case Diablo 3) and makes it impossible to kill. It also seemed to stop X from starting new graphical programs, for some reason.

The new error in messages is:

Quote:
Aug 19 13:03:00 localhost kernel: NVRM: Xid (0000:01:00): 13, 0003 00000000 00009197 00002390 00000000 00000000
Aug 19 13:03:02 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Aug 19 13:03:04 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Aug 19 13:03:06 localhost kernel: NVRM: GPU at 0000:01:00.0 has fallen off the bus.
Aug 19 13:04:06 localhost kernel: INFO: rcu_sched detected stall on CPU 6 (t=6000 jiffies)
Aug 19 13:04:06 localhost kernel: Pid: 28499, comm: Diablo III.exe Tainted: P O 3.3.8-gentoo-jesus22 #1
Aug 19 13:04:06 localhost kernel: Call Trace:
Aug 19 13:04:06 localhost kernel: <IRQ> [<ffffffff810bdf9d>] ? __rcu_pending+0x1ed/0x400
Aug 19 13:04:06 localhost kernel: [<ffffffff810be710>] ? rcu_check_callbacks+0x90/0xd0
Aug 19 13:04:06 localhost kernel: [<ffffffff8106355f>] ? update_process_times+0x3f/0x80
Aug 19 13:04:06 localhost kernel: [<ffffffff810992cb>] ? tick_sched_timer+0x5b/0xb0
Aug 19 13:04:06 localhost kernel: [<ffffffff81076821>] ? __run_hrtimer.clone.30+0x61/0x140
Aug 19 13:04:06 localhost kernel: [<ffffffff81077100>] ? hrtimer_interrupt+0xd0/0x200
Aug 19 13:04:06 localhost kernel: [<ffffffff81024d13>] ? smp_apic_timer_interrupt+0x63/0xa0
Aug 19 13:04:06 localhost kernel: [<ffffffff8165241e>] ? apic_timer_interrupt+0x6e/0x80
Aug 19 13:04:06 localhost kernel: <EOI> [<ffffffffa00d5527>] ? _nv015043rm+0x43/0x51 [nvidia]
Aug 19 13:04:06 localhost kernel: [<ffffffffa00d4aea>] ? _nv014655rm+0xba/0x1c2 [nvidia]
Aug 19 13:04:06 localhost kernel: [<ffffffffa04750a1>] ? _nv009368rm+0x9ad/0xdf8 [nvidia]
Aug 19 13:04:06 localhost kernel: [<ffffffffa0132c06>] ? _nv002323rm+0x2de/0x30a [nvidia]
Aug 19 13:04:06 localhost kernel: [<ffffffffa0132e21>] ? _nv002027rm+0x1ef/0x205 [nvidia]
Aug 19 13:04:06 localhost kernel: [<ffffffffa034011e>] ? _nv005899rm+0x6dd/0x707 [nvidia]
Aug 19 13:04:06 localhost kernel: [<ffffffffa0342ef4>] ? _nv006015rm+0xc7/0x61b [nvidia]
Aug 19 13:04:06 localhost kernel: [<ffffffffa0342ea2>] ? _nv006015rm+0x75/0x61b [nvidia]
Aug 19 13:04:06 localhost kernel: [<ffffffffa04b4332>] ? _nv010267rm+0xad/0x256 [nvidia]
Aug 19 13:04:06 localhost kernel: [<ffffffffa04c0af1>] ? _nv010265rm+0x106/0x432 [nvidia]
Aug 19 13:04:06 localhost kernel: [<ffffffffa0429f65>] ? _nv008173rm+0x24e/0x4a9 [nvidia]
Aug 19 13:04:06 localhost kernel: [<ffffffffa0435c0a>] ? _nv008168rm+0x17b/0x4e0 [nvidia]
Aug 19 13:04:06 localhost kernel: [<ffffffffa00c4e34>] ? _nv001073rm+0x1de9/0x2d09 [nvidia]
Aug 19 13:04:06 localhost kernel: [<ffffffffa00c2f94>] ? _nv001039rm+0xd23/0xd59 [nvidia]
Aug 19 13:04:06 localhost kernel: [<ffffffffa00c30be>] ? _nv001073rm+0x73/0x2d09 [nvidia]
Aug 19 13:04:06 localhost kernel: [<ffffffffa00bb5b8>] ? _nv000947rm+0x26/0x147 [nvidia]
Aug 19 13:04:06 localhost kernel: [<ffffffffa069d0ed>] ? _nv001106rm+0x34d/0xaaf [nvidia]
Aug 19 13:04:06 localhost kernel: [<ffffffffa06a8ce6>] ? rm_ioctl+0x76/0x100 [nvidia]
Aug 19 13:04:06 localhost kernel: [<ffffffffa06c7472>] ? nv_kern_ioctl+0x152/0x480 [nvidia]
Aug 19 13:04:06 localhost kernel: [<ffffffffa06ccf9c>] ? nv_kern_vma_release+0x6c/0x160 [nvidia]
Aug 19 13:04:06 localhost kernel: [<ffffffffa06c77bc>] ? nv_kern_compat_ioctl+0x1c/0x30 [nvidia]
Aug 19 13:04:06 localhost kernel: [<ffffffff811447f4>] ? compat_sys_ioctl+0x84/0xdc0
Aug 19 13:04:06 localhost kernel: [<ffffffff810ebbe5>] ? do_munmap+0x2d5/0x360
Aug 19 13:04:06 localhost kernel: [<ffffffff81011c15>] ? read_tsc+0x5/0x20
Aug 19 13:04:06 localhost kernel: [<ffffffff81092758>] ? getnstimeofday+0x48/0xc0
Aug 19 13:04:06 localhost kernel: [<ffffffff81092820>] ? do_gettimeofday+0x10/0x50
Aug 19 13:04:06 localhost kernel: [<ffffffff81652ed6>] ? sysenter_dispatch+0x7/0x21
Which seems to be very similar to the message in the newer thread: http://www.nvnews.net/vbulletin/showthread.php?t=167848
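In case it helps anyone comparing logs across machines, here is a minimal Python sketch for pulling the Xid events out of a syslog dump like the one quoted above. The function name `extract_xids` and the regular expression are my own, written against the line format shown in this thread; other driver versions may format the line differently.

```python
import re

# Matches lines such as:
#   ... kernel: NVRM: Xid (0000:01:00): 13, 0003 00000000 00009197 ...
# The pattern is an assumption based only on the log lines in this thread.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),\s*(?P<detail>.*)"
)


def extract_xids(lines):
    """Return a list of (pci_address, xid_number, detail) tuples."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append(
                (
                    match.group("pci"),
                    int(match.group("xid")),
                    match.group("detail").strip(),
                )
            )
    return events
```

Feeding it the lines of /var/log/messages (for example `extract_xids(open(path))`) gives a list that is easy to compare between crashes.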
Old 08-19-12, 06:38 PM   #63
rockob
Registered User
 
Join Date: Nov 2008
Posts: 95
Default Re: Random crashes, NVRM Xid messages

I have found that the severity of the crashes varies - at first it seems that only the application graphics have locked up, but after time other things start to fail. And if I try to kill the hung task, X often locks up. In any case, the kernel can't reboot without a hard reset.

And 304.37 seems much worse than its predecessors for this bug, particularly with kernel 3.6-rc2. I can barely get 30 seconds into crysis2 or cod mw3 before it crashes. Increasing the priority of the wine task doesn't help any more, either.
Old 08-20-12, 02:00 AM   #64
rockob
Registered User
 
Join Date: Nov 2008
Posts: 95
Default Re: Random crashes, NVRM Xid messages

Quote:
Originally Posted by sandipt View Post
NVIDIA internal Bug ID: 970252 to track this issue.
@sandipt: could you let us know the status of bug #970252? Was 304.37 supposed to fix this issue?
Old 08-20-12, 09:27 AM   #65
Iesos
Registered User
 
Join Date: Apr 2012
Posts: 15
Default Re: Random crashes, NVRM Xid messages

Quote:
Originally Posted by rockob View Post
I have found that the severity of the crashes varies - at first it seems that only the application graphics have locked up, but after time other things start to fail. And if I try to kill the hung task, X often locks up. In any case, the kernel can't reboot without a hard reset.

And 304.37 seems much worse than its predecessors for this bug, particularly with kernel 3.6-rc2. I can barely get 30 seconds into crysis2 or cod mw3 before it crashes. Increasing the priority of the wine task doesn't help any more, either.
Well, I would not be surprised if beta/rc kernels cause problems. I'm running 304.37 on a stable kernel, and it seems to be more stable than the older ones (<304.37), in the sense that the crash is less severe.

Anyway, it is not fixed. It would be nice if we could get some pointers: is there a diagnostics tool we could use to help resolve this bug? Are there some tests we can run? nvidia options? X org flags? optirun flags?
Old 08-20-12, 08:04 PM   #66
rockob
Registered User
 
Join Date: Nov 2008
Posts: 95
Default Re: Random crashes, NVRM Xid messages

Iesos, are you running using bumblebee and wine? Over at the bumblebee forum they suggested that, in order to rule out bumblebee/virtualgl, I try opening another program with optirun so that there's an Xorg server running on display :8, and then run the game in demo mode, sending the output directly to :8 by exporting DISPLAY=:8 first. But I'm having trouble getting crysis2 to run in its benchmark mode (missing dlls), and in any case doing it this way you can't interact with the program at all, so I can't click the first messagebox that asks me if it's OK to keep running with my graphics card. So if you're using bumblebee and wine, are you able to try running a demo to see if it still crashes?

It doesn't matter whether I run with the 3.6-rc2 kernel or the stable 3.5.2 kernel, nvidia always crashes and X seems to remain stable until you try to kill the nvidia process or launch another X program.
Old 08-21-12, 09:31 AM   #67
Iesos
Registered User
 
Join Date: Apr 2012
Posts: 15
Default Re: Random crashes, NVRM Xid messages

Quote:
Originally Posted by rockob View Post
Iesos, are you running using bumblebee and wine? Over at the bumblebee forum they suggested that, in order to rule out bumblebee/virtualgl, I try opening another program with optirun so that there's an Xorg server running on display :8, and then run the game in demo mode, sending the output directly to :8 by exporting DISPLAY=:8 first. But I'm having trouble getting crysis2 to run in its benchmark mode (missing dlls), and in any case doing it this way you can't interact with the program at all, so I can't click the first messagebox that asks me if it's OK to keep running with my graphics card. So if you're using bumblebee and wine, are you able to try running a demo to see if it still crashes?
First, could you give me a link to that discussion?

Second, how would that rule out bumblebee/vgl? No matter how I start a server on :8, if I run the program on :8 then it is bumblebee/vgl that takes care of it, no?

And running demos won't work for me. There are probably only some specific graphical elements that get generated that crash my computer. I have been running a lot of stuff on the nvidia card, just to see if I can get the same effects (benchmarks, desktops, native linux games); only some games with wine crash like this.

I tried the MSI thing you mentioned, and I also got some more time out of it. But the crash was harder; syslog could only record: "Aug 20 21:10:07 localhost kernel: NVRM: Xid (0000:" and then the computer locked up, so syslog couldn't record anything more.
Old 08-21-12, 11:29 AM   #68
Iesos
Registered User
 
Join Date: Apr 2012
Posts: 15
Default Re: Random crashes, NVRM Xid messages

I have played around with some options, and I don't want it all to go to waste, so I'll summarize them here.

* Not messing around, the usual Xid is 13, together with a trace of some driver crashing, an "Attempted to yield the CPU" message, etc.

* Activating MSI, the error seems to change to a Xid 32. Since nothing documents what these codes mean, maybe this is not so helpful for us, but it may be for the nv devs.

* Running it with "taskset 1", so that the process keeps to one CPU, even made it possible to kill the process without the computer crashing. The whole output into messages was then:

Code:
Aug 21 18:18:24 localhost kernel: NVRM: Xid (0000:01:00): 31, Ch 00000003, engmask 00000101, intr 30000000
Aug 21 18:18:24 localhost kernel: NVRM: Xid (0000:01:00): 39, CCMDs 00000004 000090b5
Aug 21 18:18:26 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Aug 21 18:18:28 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Aug 21 18:19:25 localhost kernel: Clocksource tsc unstable (delta = -1962370138 ns)
Aug 21 18:19:25 localhost kernel: ------------[ cut here ]------------
Aug 21 18:19:25 localhost kernel: WARNING: at drivers/gpu/drm/i915/i915_irq.c:649 ironlake_irq_handler+0x4f2/0x500()
Aug 21 18:19:25 localhost kernel: Hardware name: Dell System XPS L502X
Aug 21 18:19:25 localhost kernel: Missed a PM interrupt
Aug 21 18:19:25 localhost kernel: Modules linked in: nvidia(PO) coretemp bbswitch(O) rtc cdc_ether usbnet cdc_acm snd_hda_codec_hdmi snd_hda_codec_realtek dell_wmi sparse_keymap sg dcdbas snd_hda_intel snd_hda_codec xhci_hcd ehci_hcd thermal [last unloaded: nvidia]
Aug 21 18:19:25 localhost kernel: Pid: 29938, comm: Diablo III.exe Tainted: P           O 3.3.8-gentoo-jesus23 #2
Aug 21 18:19:25 localhost kernel: Call Trace:
Aug 21 18:19:25 localhost kernel: <IRQ>  [<ffffffff8105697b>] ? warn_slowpath_common+0x7b/0xc0
Aug 21 18:19:25 localhost kernel: [<ffffffff81056a75>] ? warn_slowpath_fmt+0x45/0x50
Aug 21 18:19:25 localhost kernel: [<ffffffff813b3412>] ? ironlake_irq_handler+0x4f2/0x500
Aug 21 18:19:25 localhost kernel: [<ffffffff81011c75>] ? read_tsc+0x5/0x20
Aug 21 18:19:25 localhost kernel: [<ffffffff810b917a>] ? handle_irq_event_percpu+0x3a/0x140
Aug 21 18:19:25 localhost kernel: [<ffffffff810b92ba>] ? handle_irq_event+0x3a/0x70
Aug 21 18:19:25 localhost kernel: [<ffffffff810bc107>] ? handle_edge_irq+0x67/0x100
Aug 21 18:19:25 localhost kernel: [<ffffffff8100c5b5>] ? handle_irq+0x15/0x20
Aug 21 18:19:25 localhost kernel: [<ffffffff8100c283>] ? do_IRQ+0x53/0xd0
Aug 21 18:19:25 localhost kernel: [<ffffffff816602ee>] ? common_interrupt+0x6e/0x6e
Aug 21 18:19:25 localhost kernel: [<ffffffff8105c900>] ? __do_softirq+0x50/0x120
Aug 21 18:19:25 localhost kernel: [<ffffffff81098b9f>] ? clockevents_program_event+0x6f/0x120
Aug 21 18:19:25 localhost kernel: [<ffffffff81661e5c>] ? call_softirq+0x1c/0x30
Aug 21 18:19:25 localhost kernel: [<ffffffff8100c625>] ? do_softirq+0x65/0xa0
Aug 21 18:19:25 localhost kernel: [<ffffffff8105cc3e>] ? irq_exit+0x8e/0xb0
Aug 21 18:19:25 localhost kernel: Switching to clocksource hpet
Aug 21 18:19:25 localhost kernel: [<ffffffff81025178>] ? smp_apic_timer_interrupt+0x68/0xa0
Aug 21 18:19:25 localhost kernel: [<ffffffff816615de>] ? apic_timer_interrupt+0x6e/0x80
Aug 21 18:19:25 localhost kernel: <EOI>  [<ffffffffa0694015>] ? _nv014794rm+0x36/0x3a [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00c40f8>] ? _nv014846rm+0x2d/0x33 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa043ef86>] ? _nv009814rm+0xea/0x13a [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa04612ea>] ? _nv004046rm+0x4a81/0xae8b [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa041d220>] ? _nv008399rm+0x60/0xa2 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa0430d1f>] ? _nv008400rm+0xcbf/0xf94 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00ba2bd>] ? _nv001092rm+0x404/0x485 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00b79e3>] ? _nv001073rm+0x1998/0x2d09 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00b7986>] ? _nv001073rm+0x193b/0x2d09 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00b5f94>] ? _nv001039rm+0xd23/0xd59 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00b6033>] ? _nv016414rm+0xe/0x26 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00b654f>] ? _nv001073rm+0x504/0x2d09 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00b5f94>] ? _nv001039rm+0xd23/0xd59 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00b6033>] ? _nv016414rm+0xe/0x26 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00b62d7>] ? _nv001073rm+0x28c/0x2d09 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00b5f94>] ? _nv001039rm+0xd23/0xd59 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00b6007>] ? _nv016416rm+0x3d/0x5b [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa0699a77>] ? _nv001082rm+0xdf/0x1c3 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa069c00c>] ? rm_free_unused_clients+0x98/0x12d [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffff8107b27c>] ? __wake_up_sync_key+0x4c/0x90
Aug 21 18:19:25 localhost kernel: [<ffffffffa06bb14b>] ? nv_kern_ctl_close+0x7b/0x130 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffff811028fa>] ? fput+0xea/0x240
Aug 21 18:19:25 localhost kernel: [<ffffffff810fecbf>] ? filp_close+0x5f/0x90
Aug 21 18:19:25 localhost kernel: [<ffffffff8105a0f5>] ? put_files_struct+0x75/0xf0
Aug 21 18:19:25 localhost kernel: [<ffffffff8105a3a6>] ? do_exit+0x166/0x7e0
Aug 21 18:19:25 localhost kernel: [<ffffffff8165eb90>] ? __schedule+0x2a0/0x6e0
Aug 21 18:19:25 localhost kernel: [<ffffffff8105acc3>] ? do_group_exit+0x53/0xd0
Aug 21 18:19:25 localhost kernel: [<ffffffff81066f19>] ? get_signal_to_deliver+0x199/0x4e0
Aug 21 18:19:25 localhost kernel: [<ffffffff81066f62>] ? get_signal_to_deliver+0x1e2/0x4e0
Aug 21 18:19:25 localhost kernel: [<ffffffff81009ecd>] ? do_signal+0x9d/0x780
Aug 21 18:19:25 localhost kernel: [<ffffffff81093599>] ? ktime_get_ts+0xb9/0xe0
Aug 21 18:19:25 localhost kernel: [<ffffffff81011c75>] ? read_tsc+0x5/0x20
Aug 21 18:19:25 localhost kernel: [<ffffffff8109354d>] ? ktime_get_ts+0x6d/0xe0
Aug 21 18:19:25 localhost kernel: [<ffffffff8100a635>] ? do_notify_resume+0x65/0x90
Aug 21 18:19:25 localhost kernel: [<ffffffff810a65c3>] ? compat_sys_clock_gettime+0x83/0xa0
Aug 21 18:19:25 localhost kernel: [<ffffffff81660df2>] ? int_signal+0x12/0x17
Aug 21 18:19:25 localhost kernel: ---[ end trace b40b7c165a9fd073 ]---
Aug 21 18:19:30 localhost kernel: NVRM: GPU at 0000:01:00.0 has fallen off the bus.
Can anyone help me decipher these messages? hpet/tsc clocksource? Why is intel crashing? What exactly is crashing it? etc.
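As a first step towards reading a trace like this, a small Python sketch can split the frames into "kernel proper" and "inside a module" (here `[nvidia]`). The names `parse_frames` and `frames_by_module` and the regular expression are my own assumptions based on the lines in this post. As far as I understand it, the "?" in front of a symbol means the address was found on the stack but the kernel could not verify that it is really part of the call chain, so those frames should be read with some caution.

```python
import re
from collections import Counter

# One frame looks like:
#   [<ffffffffa0694015>] ? _nv014794rm+0x36/0x3a [nvidia]
# The trailing "[module]" is absent for symbols built into the kernel.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?:\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(lines):
    """Return (symbol, module) pairs; module is 'kernel' when none is given."""
    frames = []
    for line in lines:
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group("symbol"), match.group("module") or "kernel"))
    return frames


def frames_by_module(lines):
    """Count how many frames of the trace belong to each module."""
    return Counter(module for _symbol, module in parse_frames(lines))
```

On the trace above this should show the top of the stack (the interrupt part, up to `<EOI>`) in the kernel and i915, and the interrupted code below it inside the nvidia module, called from `do_exit`.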

Old 08-21-12, 06:50 PM   #69
rockob
Registered User
 
Join Date: Nov 2008
Posts: 95
Default Re: Random crashes, NVRM Xid messages

@Iesos:

The bumblebee issue I raised is: https://github.com/Bumblebee-Project...bee/issues/237

You can use bumblebee to start an X server on :8, and if you then do "DISPLAY=:8 someprogram" then it will attach to the server on :8. Since you haven't run it using optirun, virtualgl won't know about it and won't link its I/O with the main server on display 0. Hence if it crashes, it's not related to virtualgl. (But you'd have to monitor the syslog to see the nvidia driver crashing, because there will be no I/O from it.)
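For anyone who wants to script that test, here is a rough Python sketch of the "DISPLAY=:8 someprogram" step. The helper names `env_for_display` and `run_on_display` are made up for illustration, and it assumes the X server on :8 has already been started by bumblebee as described above.

```python
import os
import subprocess


def env_for_display(display=":8", base_env=None):
    """Copy the environment and point DISPLAY at the secondary X server."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = display
    return env


def run_on_display(argv, display=":8"):
    """Run argv attached directly to `display`, bypassing optirun/VirtualGL."""
    return subprocess.run(argv, env=env_for_display(display), check=False)
```

Something like `run_on_display(["wine", "game.exe"])` (the program name is a placeholder) would then start the game on :8 without going through optirun, so the only way to see what happens is to watch the syslog.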

Things I've tried: turning off all the Sync to VBlanks in nvidia-settings, turning off autogroup scheduling, deadline vs cfq scheduling, and setting iommu to passthrough (because it fixed the nvidia crash for someone here: http://forums.opensuse.org/english/g...nel-3-3-a.html). But none of it stops nvidia from crashing.

I still get Xid 13 errors even with MSI. I do also get other Xid errors when nvidia crashes, seemingly at random.

TSC is the Time Stamp Counter (http://en.wikipedia.org/wiki/Time_Stamp_Counter) and HPET is the High Precision Event Timer (http://en.wikipedia.org/wiki/High_Precision_Event_Timer). The kernel prefers TSC because it's faster, but will switch to HPET if it detects that TSC is unstable. We could try booting with clocksource=hpet on the kernel command line to see if it's related to this bug (but I have seen at least one other post where it didn't make any difference other than removing the error message). The wiki suggests that HPET can cause missed interrupts, though. My system has TSC, HPET, and ACPI_PM clocksources (cat /sys/devices/system/clocksource/clocksource0/available_clocksource) so I guess I could also try ACPI_PM to see if it makes any difference.
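To check which clocksource is active before and after a crash, something like this Python sketch would do. The sysfs directory is the one quoted above; the function names are my own.

```python
from pathlib import Path

# sysfs directory mentioned in the post above.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def parse_clocksources(available_text, current_text):
    """Return (current, [available, ...]) from the raw sysfs file contents."""
    return current_text.strip(), available_text.split()


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Read the current and available clocksources from sysfs."""
    base = Path(base)
    return parse_clocksources(
        (base / "available_clocksource").read_text(),
        (base / "current_clocksource").read_text(),
    )
```

Calling `read_clocksources()` right after boot and again after the "Clocksource tsc unstable" message should show whether the kernel really switched away from TSC.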

The intel gpu isn't crashing, it's just warning about a missed interrupt, which I'm guessing is the one that nvidia messed up. Or maybe it's related to the TSC/HPET thing.

Using taskset 1 to stop the PC crashing is interesting. Does it lock up that CPU core though? Even if X doesn't crash immediately when I kill the hung process, the affected CPU stays locked at max speed and presumably can't be used until a reboot.
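For reference, taskset reads its mask as a hexadecimal bitmask, so "taskset 1" means mask 0x1, i.e. CPU 0 only. A rough Python equivalent, in case someone wants to script the test, might look like this. The helper names are mine, and `os.sched_setaffinity` is Linux-only.

```python
import os
import subprocess


def mask_to_cpus(mask):
    """Convert a taskset-style bitmask into a set of CPU indices."""
    return {bit for bit in range(mask.bit_length()) if (mask >> bit) & 1}


def run_pinned(argv, mask=0x1):
    """Run argv with its CPU affinity restricted to the CPUs in `mask`."""
    cpus = mask_to_cpus(mask)
    # pid 0 means "the calling process"; this runs in the child before exec.
    return subprocess.run(
        argv,
        preexec_fn=lambda: os.sched_setaffinity(0, cpus),
        check=False,
    )
```

So `run_pinned(argv, 0x1)` corresponds to "taskset 1 ..." and `mask_to_cpus(0x5)` would be CPUs 0 and 2.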

Update: using hpet doesn't help ("sudo -s" then "echo hpet>/sys/devices/system/clocksource/clocksource0/current_clocksource"), I got this crash:

Aug 22 08:25:33 sierra kernel: [26684.862433] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 304.37 Wed Aug 8 19:52:48 PDT 2012
Aug 22 08:25:34 sierra kernel: [26686.133486] nvidia 0000:01:00.0: irq 55 for MSI/MSI-X
Aug 22 08:25:35 sierra kernel: [26687.021997] NVRM: GPU at 0000:01:00: GPU-1b1589e9-15df-5ca5-919b-2f748fae640f
Aug 22 08:28:23 sierra kernel: [26854.694480] NVRM: Xid (0000:01:00): 13, 0006 00000000 00009197 00002390 00000000 00000000
Aug 22 08:28:25 sierra kernel: [26856.698611] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Aug 22 08:28:27 sierra kernel: [26858.697999] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Aug 22 08:28:45 sierra kernel: [26876.970584] Clocksource tsc unstable (delta = -1769858659 ns)
Aug 22 08:28:45 sierra kernel: [26876.970589] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
Aug 22 08:28:45 sierra kernel: [26876.971993] [sched_delayed] sched: RT throttling activated

Although the tsc unstable message does still appear.
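One thing that might be worth tracking is how long the GPU survives between the first Xid and the "fallen off the bus" message. A small Python sketch, assuming the log lines carry the bracketed seconds-since-boot timestamp as in the log above (the function name and regex are my own):

```python
import re

# Captures the "[26854.694480]" uptime stamp and the message after it.
STAMP_RE = re.compile(r"kernel: \[\s*(?P<secs>\d+\.\d+)\]\s+(?P<msg>.*)")


def seconds_from_xid_to_bus_loss(lines):
    """Seconds between the first Xid and the following bus-loss message.

    Returns None if either message is missing.
    """
    first_xid = None
    for line in lines:
        match = STAMP_RE.search(line)
        if not match:
            continue
        stamp = float(match.group("secs"))
        msg = match.group("msg")
        if first_xid is None and "NVRM: Xid" in msg:
            first_xid = stamp
        elif first_xid is not None and "has fallen off the bus" in msg:
            return stamp - first_xid
    return None
```

For the log above that works out to roughly 22 seconds (26876.970589 minus 26854.694480).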
Old 08-24-12, 04:30 AM   #70
rockob
Registered User
 
Join Date: Nov 2008
Posts: 95
Default Re: Random crashes, NVRM Xid messages

For the record, setting persistence mode on the nvidia card doesn't stop the crash either. (To do this, you have to set KeepUnusedXServer=true in /etc/bumblebee/bumblebee.conf, "sudo restart bumblebeed" and then "sudo optirun /usr/lib/nvidia-current/bin/nvidia-smi -pm 1" before running the game, because by default persistence mode is disabled if bumblebeed unloads the nvidia module and turns the card to low power mode.)
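To double-check that the setting actually took before running the game, a quick Python sketch can parse the config text. This assumes bumblebee.conf is INI-style and that KeepUnusedXServer lives in a [bumblebeed] section; the section name is my assumption, so check your own file. The function name is made up.

```python
import configparser


def keeps_unused_xserver(conf_text, section="bumblebeed"):
    """Return True if KeepUnusedXServer is enabled in the given config text.

    The section name is an assumption; option names are matched
    case-insensitively by configparser.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.getboolean(section, "KeepUnusedXServer", fallback=False)
```

Usage would be along the lines of `keeps_unused_xserver(open("/etc/bumblebee/bumblebee.conf").read())`.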

This time I got a Xid 31 error:

Code:
Aug 24 17:21:23 sierra kernel: [75848.971725] pci 0000:01:00.0: power state changed by ACPI to D0
Aug 24 17:21:23 sierra kernel: [75849.008904] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=none,decodes=none:owns=none
Aug 24 17:21:23 sierra kernel: [75849.009029] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  304.37  Wed Aug  8 19:52:48 PDT 2012
Aug 24 17:21:23 sierra kernel: [75849.023087] nvidia 0000:01:00.0: irq 55 for MSI/MSI-X
Aug 24 17:21:24 sierra kernel: [75849.915655] NVRM: GPU at 0000:01:00: GPU-1b1589e9-15df-5ca5-919b-2f748fae640f
Aug 24 17:26:17 sierra kernel: [76143.099200] NVRM: Xid (0000:01:00): 31, Ch 00000006, engmask 00000101, intr 10000000
Aug 24 17:26:17 sierra kernel: [76143.103234] NVRM: Xid (0000:01:00): 39, CCMDs 00000007 000090b5
Aug 24 17:26:19 sierra kernel: [76145.102562] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Aug 24 17:26:40 sierra kernel: [76166.103395] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
Old 08-25-12, 09:54 AM   #71
Iesos
Registered User
 
Join Date: Apr 2012
Posts: 15
Default Re: Random crashes, NVRM Xid messages

Quote:
Originally Posted by rockob View Post
For the record, setting persistence mode on the nvidia card doesn't stop the crash either. (To do this, you have to set KeepUnusedXServer=true in /etc/bumblebee/bumblebee.conf, "sudo restart bumblebeed" and then "sudo optirun /usr/lib/nvidia-current/bin/nvidia-smi -pm 1" before running the game, because by default persistence mode is disabled if bumblebeed unloads the nvidia module and turns the card to low power mode.)

This time I got a Xid 31 error:

Code:
Aug 24 17:21:23 sierra kernel: [75848.971725] pci 0000:01:00.0: power state changed by ACPI to D0
Aug 24 17:21:23 sierra kernel: [75849.008904] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=none,decodes=none:owns=none
Aug 24 17:21:23 sierra kernel: [75849.009029] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  304.37  Wed Aug  8 19:52:48 PDT 2012
Aug 24 17:21:23 sierra kernel: [75849.023087] nvidia 0000:01:00.0: irq 55 for MSI/MSI-X
Aug 24 17:21:24 sierra kernel: [75849.915655] NVRM: GPU at 0000:01:00: GPU-1b1589e9-15df-5ca5-919b-2f748fae640f
Aug 24 17:26:17 sierra kernel: [76143.099200] NVRM: Xid (0000:01:00): 31, Ch 00000006, engmask 00000101, intr 10000000
Aug 24 17:26:17 sierra kernel: [76143.103234] NVRM: Xid (0000:01:00): 39, CCMDs 00000007 000090b5
Aug 24 17:26:19 sierra kernel: [76145.102562] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Aug 24 17:26:40 sierra kernel: [76166.103395] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
I tried changing the clock source at boot; it only got rid of that message.

Activating MSI did some good, but the crash still seemed inevitable.

With taskset, one processor is locked up and a hard reboot is needed to get it back.

With persistence mode something new happened. The same nvidia errors appeared, but the process was killable, and everything went back to normal. To try it out I started the game again, and then VGL complained about opengl-something. I lost the message.

Is there any debugging mode for the nvidia module? Or some kernel setting that would help debugging this?
Old 08-25-12, 04:35 PM   #72
rockob
Registered User
 
Join Date: Nov 2008
Posts: 95
Default Re: Random crashes, NVRM Xid messages

Quote:
Originally Posted by Iesos View Post
I tried changing the clock source at boot; it only got rid of that message.

Activating MSI did some good, but the crash still seemed inevitable.

With taskset, one processor is locked up and a hard reboot is needed to get it back.

With persistence mode something new happened. The same nvidia errors appeared, but the process was killable, and everything went back to normal. To try it out I started the game again, and then VGL complained about opengl-something. I lost the message.

Is there any debugging mode for the nvidia module? Or some kernel setting that would help debugging this?
There is a debug level defined in os-interface.c. The code indicates that it defaults already to max debug level:

Code:
// The current debug display level (default to maximum debug level)
U032 cur_debuglevel = 0xffffffff;
I haven't looked more in depth at the code. Perhaps you need to define the DEBUG macro and rebuild the module for the debug level to take effect?

I wanted to see if nvidia would behave better with the new Xserver 1.13RC, so I tried it in Ubuntu 12.10 and after some shenanigans needed to get the deb package to install, can confirm that 304.37 still crashes on Xserver 1.13. One interesting side note was that for the first time ever the crashed process seemed to close down cleanly when I closed the frozen window - my CPUs even went down to the lowest frequency afterwards and all the wine processes vanished. But five minutes later when I returned to the computer it had locked with a blank screen, which may have been related to the crash.

As an aside, when I first tried it out on ubuntu 12.10, bumblebee loaded nouveau instead of nvidia, and it turned out it could run crysis2, albeit at slower framerates. Pity about the slower framerates, because it doesn't crash.