#61
Registered User
Join Date: Nov 2008
Posts: 95
Sadly, the just-released 304.37 driver does not fix the problem (though according to the release notes I think it is supposed to). The symptoms are slightly different, but X still locks up completely and requires a cold boot.
I didn't actually see the "GPU has fallen off the bus" message (or any Xid errors) in the log, but the video froze after about a minute of gameplay in Crysis2 (the audio kept going, though). When I tried to close its window, its bumblebee X process froze, and the main X server locked up shortly afterwards. I *was* able to ssh into the machine, which is new, but I was unable to kill -9 any of the locked processes, and restarting lightdm failed, as did a reboot command. Only a hard reset fixed it. fwiw, I was running the 3.6-rc1 kernel.

#62
Registered User
Join Date: Apr 2012
Posts: 15
304.37 does not fix the problem. The crash is a bit less severe: it crashes the application (in this case Diablo 3) and makes it impossible to kill. It also seemed to stop X from starting new graphical programs, for some reason.
The new error in messages is: Quote:

#63
Registered User
Join Date: Nov 2008
Posts: 95
I have found that the severity of the crashes varies. At first it seems that only the application's graphics have locked up, but over time other things start to fail. And if I try to kill the hung task, X often locks up. In any case, the system can't reboot without a hard reset.
And 304.37 seems much worse than its predecessors for this bug, particularly with kernel 3.6-rc2. I can barely get 30 seconds into crysis2 or cod mw3 before it crashes. Increasing the priority of the wine task doesn't help any more, either.

#64
Registered User
Join Date: Nov 2008
Posts: 95

#65
Registered User
Join Date: Apr 2012
Posts: 15
Quote:
Anyway, it is not fixed. It would be nice if we could get some pointers. For example, is there a diagnostics tool we could use to help resolve this bug? Are there tests we can run? nvidia options? Xorg flags? optirun flags?

#66
Registered User
Join Date: Nov 2008
Posts: 95
Iesos, are you running using bumblebee and wine? Over at the bumblebee forum they suggested that, in order to rule out bumblebee/virtualgl, I try opening another program with optirun so that there's an Xorg server running on display :8, and then run the game in demo mode, sending the output directly to :8 by exporting DISPLAY=:8 first. But I'm having trouble getting crysis2 to run in its benchmark mode (missing dlls), and in any case you can't interact with the program at all when doing it this way, so I can't click the first messagebox that asks me if it's OK to keep running with my graphics card. So if you're using bumblebee and wine, are you able to try running a demo to see if it still crashes?
It doesn't matter whether I run the 3.6-rc2 kernel or the stable 3.5.2 kernel: nvidia always crashes, and X seems to remain stable until you try to kill the nvidia process or launch another X program.

#67
Registered User
Join Date: Apr 2012
Posts: 15
Quote:
Second, how would that rule out bumblebee/vgl? No matter how I start a server on :8, if I run the program on :8 then it is bumb/vgl that takes care of it, no? And running demos won't work for me. There are probably only some specific graphical elements that crash my computer when they get generated. I have been running a lot of stuff on the nvidia card just to see if I can get the same effect (benchmarks, desktops, native linux games), but only some games under wine crash like this. I tried the MSI thing you mentioned, and I also got some more time out of it. But the crash was harder; syslog could only record: "Aug 20 21:10:07 localhost kernel: NVRM: Xid (0000:" and then the computer locked up, so syslog couldn't do anything more.

#68
Registered User
Join Date: Apr 2012
Posts: 15
I have played around with some options, and I don't want it all to go to waste, so I'll summarize them here.
* Not messing around, the usual Xid is 13, together with a trace of some driver crashing, an "Attempted to yield the CPU" message, etc.
* Activating MSI, the error seems to change to an Xid 32. Since nothing documents what these mean, maybe this is not so helpful for us, but it may be for the nv devs.
* Running it with "taskset 1", so that the process keeps to one CPU, even made it possible to kill the process without the computer crashing. The whole output into messages was then:
Code:
Aug 21 18:18:24 localhost kernel: NVRM: Xid (0000:01:00): 31, Ch 00000003, engmask 00000101, intr 30000000
Aug 21 18:18:24 localhost kernel: NVRM: Xid (0000:01:00): 39, CCMDs 00000004 000090b5
Aug 21 18:18:26 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Aug 21 18:18:28 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Aug 21 18:19:25 localhost kernel: Clocksource tsc unstable (delta = -1962370138 ns)
Aug 21 18:19:25 localhost kernel: ------------[ cut here ]------------
Aug 21 18:19:25 localhost kernel: WARNING: at drivers/gpu/drm/i915/i915_irq.c:649 ironlake_irq_handler+0x4f2/0x500()
Aug 21 18:19:25 localhost kernel: Hardware name: Dell System XPS L502X
Aug 21 18:19:25 localhost kernel: Missed a PM interrupt
Aug 21 18:19:25 localhost kernel: Modules linked in: nvidia(PO) coretemp bbswitch(O) rtc cdc_ether usbnet cdc_acm snd_hda_codec_hdmi snd_hda_codec_realtek dell_wmi sparse_keymap sg dcdbas snd_hda_intel snd_hda_codec xhci_hcd ehci_hcd thermal [last unloaded: nvidia]
Aug 21 18:19:25 localhost kernel: Pid: 29938, comm: Diablo III.exe Tainted: P O 3.3.8-gentoo-jesus23 #2
Aug 21 18:19:25 localhost kernel: Call Trace:
Aug 21 18:19:25 localhost kernel: <IRQ> [<ffffffff8105697b>] ? warn_slowpath_common+0x7b/0xc0
Aug 21 18:19:25 localhost kernel: [<ffffffff81056a75>] ? warn_slowpath_fmt+0x45/0x50
Aug 21 18:19:25 localhost kernel: [<ffffffff813b3412>] ? ironlake_irq_handler+0x4f2/0x500
Aug 21 18:19:25 localhost kernel: [<ffffffff81011c75>] ? read_tsc+0x5/0x20
Aug 21 18:19:25 localhost kernel: [<ffffffff810b917a>] ? handle_irq_event_percpu+0x3a/0x140
Aug 21 18:19:25 localhost kernel: [<ffffffff810b92ba>] ? handle_irq_event+0x3a/0x70
Aug 21 18:19:25 localhost kernel: [<ffffffff810bc107>] ? handle_edge_irq+0x67/0x100
Aug 21 18:19:25 localhost kernel: [<ffffffff8100c5b5>] ? handle_irq+0x15/0x20
Aug 21 18:19:25 localhost kernel: [<ffffffff8100c283>] ? do_IRQ+0x53/0xd0
Aug 21 18:19:25 localhost kernel: [<ffffffff816602ee>] ? common_interrupt+0x6e/0x6e
Aug 21 18:19:25 localhost kernel: [<ffffffff8105c900>] ? __do_softirq+0x50/0x120
Aug 21 18:19:25 localhost kernel: [<ffffffff81098b9f>] ? clockevents_program_event+0x6f/0x120
Aug 21 18:19:25 localhost kernel: [<ffffffff81661e5c>] ? call_softirq+0x1c/0x30
Aug 21 18:19:25 localhost kernel: [<ffffffff8100c625>] ? do_softirq+0x65/0xa0
Aug 21 18:19:25 localhost kernel: [<ffffffff8105cc3e>] ? irq_exit+0x8e/0xb0
Aug 21 18:19:25 localhost kernel: Switching to clocksource hpet
Aug 21 18:19:25 localhost kernel: [<ffffffff81025178>] ? smp_apic_timer_interrupt+0x68/0xa0
Aug 21 18:19:25 localhost kernel: [<ffffffff816615de>] ? apic_timer_interrupt+0x6e/0x80
Aug 21 18:19:25 localhost kernel: <EOI> [<ffffffffa0694015>] ? _nv014794rm+0x36/0x3a [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00c40f8>] ? _nv014846rm+0x2d/0x33 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa043ef86>] ? _nv009814rm+0xea/0x13a [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa04612ea>] ? _nv004046rm+0x4a81/0xae8b [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa041d220>] ? _nv008399rm+0x60/0xa2 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa0430d1f>] ? _nv008400rm+0xcbf/0xf94 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00ba2bd>] ? _nv001092rm+0x404/0x485 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00b79e3>] ? _nv001073rm+0x1998/0x2d09 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00b7986>] ? _nv001073rm+0x193b/0x2d09 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00b5f94>] ? _nv001039rm+0xd23/0xd59 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00b6033>] ? _nv016414rm+0xe/0x26 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00b654f>] ? _nv001073rm+0x504/0x2d09 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00b5f94>] ? _nv001039rm+0xd23/0xd59 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00b6033>] ? _nv016414rm+0xe/0x26 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00b62d7>] ? _nv001073rm+0x28c/0x2d09 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00b5f94>] ? _nv001039rm+0xd23/0xd59 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa00b6007>] ? _nv016416rm+0x3d/0x5b [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa0699a77>] ? _nv001082rm+0xdf/0x1c3 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffffa069c00c>] ? rm_free_unused_clients+0x98/0x12d [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffff8107b27c>] ? __wake_up_sync_key+0x4c/0x90
Aug 21 18:19:25 localhost kernel: [<ffffffffa06bb14b>] ? nv_kern_ctl_close+0x7b/0x130 [nvidia]
Aug 21 18:19:25 localhost kernel: [<ffffffff811028fa>] ? fput+0xea/0x240
Aug 21 18:19:25 localhost kernel: [<ffffffff810fecbf>] ? filp_close+0x5f/0x90
Aug 21 18:19:25 localhost kernel: [<ffffffff8105a0f5>] ? put_files_struct+0x75/0xf0
Aug 21 18:19:25 localhost kernel: [<ffffffff8105a3a6>] ? do_exit+0x166/0x7e0
Aug 21 18:19:25 localhost kernel: [<ffffffff8165eb90>] ? __schedule+0x2a0/0x6e0
Aug 21 18:19:25 localhost kernel: [<ffffffff8105acc3>] ? do_group_exit+0x53/0xd0
Aug 21 18:19:25 localhost kernel: [<ffffffff81066f19>] ? get_signal_to_deliver+0x199/0x4e0
Aug 21 18:19:25 localhost kernel: [<ffffffff81066f62>] ? get_signal_to_deliver+0x1e2/0x4e0
Aug 21 18:19:25 localhost kernel: [<ffffffff81009ecd>] ? do_signal+0x9d/0x780
Aug 21 18:19:25 localhost kernel: [<ffffffff81093599>] ? ktime_get_ts+0xb9/0xe0
Aug 21 18:19:25 localhost kernel: [<ffffffff81011c75>] ? read_tsc+0x5/0x20
Aug 21 18:19:25 localhost kernel: [<ffffffff8109354d>] ? ktime_get_ts+0x6d/0xe0
Aug 21 18:19:25 localhost kernel: [<ffffffff8100a635>] ? do_notify_resume+0x65/0x90
Aug 21 18:19:25 localhost kernel: [<ffffffff810a65c3>] ? compat_sys_clock_gettime+0x83/0xa0
Aug 21 18:19:25 localhost kernel: [<ffffffff81660df2>] ? int_signal+0x12/0x17
Aug 21 18:19:25 localhost kernel: ---[ end trace b40b7c165a9fd073 ]---
Aug 21 18:19:30 localhost kernel: NVRM: GPU at 0000:01:00.0 has fallen off the bus.
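Since the Xid number seems to change with each option tried (13, 31, 32, 39), it may help to pull the codes out of the log automatically when comparing runs. This is only a rough sketch, not an nvidia tool: the function name `extract_xids` and the regular expression are my own assumptions, based purely on the "NVRM: Xid (...)" lines as they appear in the logs in this thread.

```python
import re

# Assumed line shape, taken from the logs posted in this thread, e.g.:
#   NVRM: Xid (0000:01:00): 31, Ch 00000003, engmask 00000101, intr 30000000
# Other driver versions may format Xid lines differently.
XID_RE = re.compile(r"NVRM: Xid \(([0-9a-fA-F:.]+)\): (\d+),")


def extract_xids(log_text):
    """Return (pci_address, xid_code) tuples in order of appearance.

    Hypothetical helper for illustration; truncated lines such as
    'NVRM: Xid (0000:' simply produce no match.
    """
    return [(m.group(1), int(m.group(2))) for m in XID_RE.finditer(log_text)]


if __name__ == "__main__":
    sample = (
        "Aug 21 18:18:24 localhost kernel: NVRM: Xid (0000:01:00): 31, "
        "Ch 00000003, engmask 00000101, intr 30000000\n"
        "Aug 21 18:18:24 localhost kernel: NVRM: Xid (0000:01:00): 39, "
        "CCMDs 00000004 000090b5\n"
    )
    for address, code in extract_xids(sample):
        print(address, code)
```

Feeding it the contents of the messages file after each crash gives a quick list of which Xid codes showed up with which setting.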

#69
Registered User
Join Date: Nov 2008
Posts: 95
@Iesos:
The bumblebee issue I raised is: https://github.com/Bumblebee-Project...bee/issues/237
You can use bumblebee to start an X server on :8, and if you then do "DISPLAY=:8 someprogram" it will attach to the server on :8. Since you haven't run it using optirun, virtualgl won't know about it and won't link its I/O with the main server on display 0. Hence if it crashes, it's not related to virtualgl. (But you'd have to monitor the syslog to see the nvidia driver crashing, because there will be no I/O from it.)
Things I've tried: turning off all the Sync to VBlanks in nvidia-settings, turning off autogroup scheduling, deadline vs cfq scheduling, and setting iommu to passthrough (because it fixed the nvidia crash for someone here: http://forums.opensuse.org/english/g...nel-3-3-a.html). But none of it stops nvidia from crashing. I still get Xid 13 errors even with MSI. I do also get other Xid errors when nvidia crashes, seemingly at random.
TSC is the time stamp counter (http://en.wikipedia.org/wiki/Time_Stamp_Counter) and hpet is the high precision event timer (http://en.wikipedia.org/wiki/High_Precision_Event_Timer). The kernel prefers TSC because it's faster, but will switch to HPET if it detects that TSC is unstable. We could try booting with clocksource=hpet on the kernel command line to see if it's related to this bug (but I have seen at least one other post where it didn't make any difference other than removing the error message). The wiki suggests that HPET can cause missed interrupts, though. My system has TSC, HPET, and ACPI_PM clocksources (cat /sys/devices/system/clocksource/clocksource0/available_clocksource) so I guess I could also try ACPI_PM to see if it makes any difference.
The intel gpu isn't crashing, it's just warning about a missed interrupt, which I'm guessing is the one that nvidia messed up. Or maybe it's related to the TSC/HPET thing.
Using taskset 1 to stop the PC crashing is interesting. Does it lock up that CPU core though? Even if X doesn't crash immediately when I kill the hung process, the affected CPU stays locked at max speed and presumably can't be used until a reboot.
Update: using hpet doesn't help ("sudo -s" then "echo hpet>/sys/devices/system/clocksource/clocksource0/current_clocksource"), I got this crash:
Aug 22 08:25:33 sierra kernel: [26684.862433] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 304.37 Wed Aug 8 19:52:48 PDT 2012
Aug 22 08:25:34 sierra kernel: [26686.133486] nvidia 0000:01:00.0: irq 55 for MSI/MSI-X
Aug 22 08:25:35 sierra kernel: [26687.021997] NVRM: GPU at 0000:01:00: GPU-1b1589e9-15df-5ca5-919b-2f748fae640f
Aug 22 08:28:23 sierra kernel: [26854.694480] NVRM: Xid (0000:01:00): 13, 0006 00000000 00009197 00002390 00000000 00000000
Aug 22 08:28:25 sierra kernel: [26856.698611] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Aug 22 08:28:27 sierra kernel: [26858.697999] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Aug 22 08:28:45 sierra kernel: [26876.970584] Clocksource tsc unstable (delta = -1769858659 ns)
Aug 22 08:28:45 sierra kernel: [26876.970589] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
Aug 22 08:28:45 sierra kernel: [26876.971993] [sched_delayed] sched: RT throttling activated
Although the tsc unstable message does still appear.
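To double-check which clocksource is actually in use before each test run, something like the following could read the two sysfs files mentioned above. It's only a sketch: the helper name `read_clocksources` and the `base_dir` parameter are my own inventions for illustration, and it only reads the files (switching still needs root and the echo command above).

```python
from pathlib import Path

# Directory taken from the paths quoted in this post.
CLOCKSOURCE_DIR = "/sys/devices/system/clocksource/clocksource0"


def read_clocksources(base_dir=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources read from sysfs.

    Hypothetical helper; base_dir is a parameter only so the function
    can be pointed at a copy of the files on a machine without sysfs.
    """
    base = Path(base_dir)
    # available_clocksource holds a space-separated list, e.g. "tsc hpet acpi_pm"
    available = (base / "available_clocksource").read_text().split()
    # current_clocksource holds a single name followed by a newline
    current = (base / "current_clocksource").read_text().strip()
    return current, available


if __name__ == "__main__":
    if Path(CLOCKSOURCE_DIR).is_dir():
        current, available = read_clocksources()
        print("current:", current)
        print("available:", ", ".join(available))
    else:
        print("no clocksource sysfs directory found on this machine")
```

Recording the current clocksource next to each crash log would make it easier to tell whether the "tsc unstable" message came from a run where hpet was really active.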

#70
Registered User
Join Date: Nov 2008
Posts: 95
For the record, setting persistence mode on the nvidia card doesn't stop the crash either. (To do this, you have to set KeepUnusedXServer=true in /etc/bumblebee/bumblebee.conf, "sudo restart bumblebeed" and then "sudo optirun /usr/lib/nvidia-current/bin/nvidia-smi -pm 1" before running the game, because by default persistence mode is disabled if bumblebeed unloads the nvidia module and turns the card to low power mode.)
This time I got an Xid 31 error:
Code:
Aug 24 17:21:23 sierra kernel: [75848.971725] pci 0000:01:00.0: power state changed by ACPI to D0
Aug 24 17:21:23 sierra kernel: [75849.008904] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=none,decodes=none:owns=none
Aug 24 17:21:23 sierra kernel: [75849.009029] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 304.37 Wed Aug 8 19:52:48 PDT 2012
Aug 24 17:21:23 sierra kernel: [75849.023087] nvidia 0000:01:00.0: irq 55 for MSI/MSI-X
Aug 24 17:21:24 sierra kernel: [75849.915655] NVRM: GPU at 0000:01:00: GPU-1b1589e9-15df-5ca5-919b-2f748fae640f
Aug 24 17:26:17 sierra kernel: [76143.099200] NVRM: Xid (0000:01:00): 31, Ch 00000006, engmask 00000101, intr 10000000
Aug 24 17:26:17 sierra kernel: [76143.103234] NVRM: Xid (0000:01:00): 39, CCMDs 00000007 000090b5
Aug 24 17:26:19 sierra kernel: [76145.102562] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Aug 24 17:26:40 sierra kernel: [76166.103395] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
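One thing that might be worth comparing between runs is how long the card survives between the first Xid and the "fallen off the bus" message. A minimal sketch, assuming the log has the bracketed seconds-since-boot timestamps shown above; the name `seconds_until_bus_loss` and both patterns are my own assumptions, not anything from the driver.

```python
import re

# Kernel timestamps look like "[76143.099200]" (seconds since boot).
# Both patterns are assumptions based on the log lines in this thread.
FIRST_XID_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+NVRM: Xid ")
BUS_LOSS_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+NVRM: GPU at \S+ has fallen off the bus"
)


def seconds_until_bus_loss(log_text):
    """Seconds between the first Xid line and the bus-loss line.

    Hypothetical helper; returns None if either line is missing or the
    log has no bracketed timestamps.
    """
    first_xid = FIRST_XID_RE.search(log_text)
    bus_loss = BUS_LOSS_RE.search(log_text)
    if first_xid is None or bus_loss is None:
        return None
    return float(bus_loss.group(1)) - float(first_xid.group(1))


if __name__ == "__main__":
    sample = (
        "Aug 24 17:26:17 sierra kernel: [76143.099200] NVRM: Xid (0000:01:00): 31, "
        "Ch 00000006, engmask 00000101, intr 10000000\n"
        "Aug 24 17:26:40 sierra kernel: [76166.103395] NVRM: GPU at 0000:01:00.0 "
        "has fallen off the bus.\n"
    )
    print(seconds_until_bus_loss(sample))
```

It only looks at the first match of each pattern, so a log containing several crashes would need to be split per boot first.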

#71
Registered User
Join Date: Apr 2012
Posts: 15
Quote:
Activating MSI did some good, but the crash still seemed inevitable. With taskset, one processor is locked up and a hard reboot is needed to get it back. With persistence mode something new happened: the same nvidia errors appeared, but the process was killable, and everything went back to normal. To try it out I started the game again, and then VGL complained about opengl-something (I lost the message). Is there any debugging mode for the nvidia module? Or some kernel setting that would help debug this?

#72
Registered User
Join Date: Nov 2008
Posts: 95
Quote:
Code:
// The current debug display level (default to maximum debug level)
U032 cur_debuglevel = 0xffffffff;
I wanted to see if nvidia would behave better with the new Xserver 1.13RC, so I tried it in Ubuntu 12.10 and, after some shenanigans needed to get the deb package to install, I can confirm that 304.37 still crashes on Xserver 1.13. One interesting side note was that for the first time ever the crashed process seemed to close down cleanly when I closed the frozen window: my CPUs even went down to the lowest frequency afterwards and all the wine processes vanished. But five minutes later, when I returned to the computer, it had locked up with a blank screen, which may have been related to the crash.
As an aside, when I first tried it out on ubuntu 12.10, bumblebee loaded nouveau instead of nvidia, and it turned out it could run crysis2, albeit at slower framerates. Pity about the slower framerates, because it doesn't crash.