08-21-12, 06:50 PM   #69
rockob
Registered User
 
Join Date: Nov 2008
Posts: 95
Re: Random crashes, NVRM Xid messages

@Iesos:

The bumblebee issue I raised is: https://github.com/Bumblebee-Project...bee/issues/237

You can use bumblebee to start an X server on :8, and if you then do "DISPLAY=:8 someprogram" then it will attach to the server on :8. Since you haven't run it using optirun, virtualgl won't know about it and won't link its I/O with the main server on display 0. Hence if it crashes, it's not related to virtualgl. (But you'd have to monitor the syslog to see the nvidia driver crashing, because there will be no I/O from it.)
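A rough sketch of that test, in case it helps. It runs a program against a given display and then prints any NVRM lines that were logged while it ran. The function name run_and_report and the log path /var/log/syslog are my assumptions (some distros log kernel messages elsewhere), and it assumes the X server on :8 is already up.

```shell
#!/bin/sh
# Log file to watch. /var/log/syslog is an assumption; override LOG if
# your distro writes kernel messages somewhere else.
LOG="${LOG:-/var/log/syslog}"

# run_and_report DISPLAYNAME CMD [ARGS...]
# (hypothetical helper) Run CMD with DISPLAY set to DISPLAYNAME, then print
# the NVRM lines that were appended to $LOG while it was running.
run_and_report() {
    disp=$1
    shift
    # Remember how long the log was before the program started.
    before=$(wc -l < "$LOG")
    DISPLAY="$disp" "$@"
    status=$?
    # Show only the lines added since then that mention NVRM.
    tail -n "+$(($before + 1))" "$LOG" | grep 'NVRM'
    return "$status"
}
```

For example, "run_and_report :8 someprogram" (you may need root to read the log).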

Things I've tried: turning off all the Sync to VBlanks in nvidia-settings, turning off autogroup scheduling, deadline vs cfq scheduling, and setting iommu to passthrough (because it fixed the nvidia crash for someone here: http://forums.opensuse.org/english/g...nel-3-3-a.html). But none of it stops nvidia from crashing.

I still get Xid 13 errors even with MSI. I do also get other Xid errors when nvidia crashes, seemingly at random.

TSC is the Time Stamp Counter (http://en.wikipedia.org/wiki/Time_Stamp_Counter) and HPET is the High Precision Event Timer (http://en.wikipedia.org/wiki/High_Precision_Event_Timer). The kernel prefers TSC because it's faster, but will switch to HPET if it detects that TSC is unstable. We could try booting with clocksource=hpet on the kernel command line to see if it's related to this bug (but I have seen at least one other post where it didn't make any difference other than removing the error message). The wiki suggests that HPET can cause missed interrupts, though. My system has TSC, HPET, and ACPI_PM clocksources (cat /sys/devices/system/clocksource/clocksource0/available_clocksource), so I guess I could also try ACPI_PM to see if it makes any difference.
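For trying the different clocksources without rebooting, something like the following might do. It is only a sketch: the function names has_clocksource and switch_clocksource are made up for illustration, the sysfs directory is the one quoted above, and writing to the real file needs root. The change only lasts until the next reboot.

```shell
#!/bin/sh
# Sysfs directory quoted in the post; override CS_DIR to point elsewhere.
CS_DIR="${CS_DIR:-/sys/devices/system/clocksource/clocksource0}"

# has_clocksource NAME (hypothetical helper)
# Succeed if NAME is listed in available_clocksource.
has_clocksource() {
    for cs in $(cat "$CS_DIR/available_clocksource"); do
        if [ "$cs" = "$1" ]; then
            return 0
        fi
    done
    return 1
}

# switch_clocksource NAME (hypothetical helper)
# Select NAME, but only if the kernel offers it.
switch_clocksource() {
    if ! has_clocksource "$1"; then
        echo "clocksource $1 not available" >&2
        return 1
    fi
    printf '%s\n' "$1" > "$CS_DIR/current_clocksource"
}
```

Run as root, "switch_clocksource acpi_pm" would then select ACPI_PM. Note that the names in sysfs are lowercase (tsc, hpet, acpi_pm).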

The Intel GPU isn't crashing; it's just warning about a missed interrupt, which I'm guessing is the one that nvidia messed up. Or maybe it's related to the TSC/HPET thing.

Using taskset 1 to stop the PC crashing is interesting. Does it lock up that CPU core though? Even if X doesn't crash immediately when I kill the hung process, the affected CPU stays locked at max speed and presumably can't be used until a reboot.

Update: switching to hpet at runtime doesn't help ("sudo -s" then "echo hpet>/sys/devices/system/clocksource/clocksource0/current_clocksource"). I still got this crash:

Aug 22 08:25:33 sierra kernel: [26684.862433] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 304.37 Wed Aug 8 19:52:48 PDT 2012
Aug 22 08:25:34 sierra kernel: [26686.133486] nvidia 0000:01:00.0: irq 55 for MSI/MSI-X
Aug 22 08:25:35 sierra kernel: [26687.021997] NVRM: GPU at 0000:01:00: GPU-1b1589e9-15df-5ca5-919b-2f748fae640f
Aug 22 08:28:23 sierra kernel: [26854.694480] NVRM: Xid (0000:01:00): 13, 0006 00000000 00009197 00002390 00000000 00000000
Aug 22 08:28:25 sierra kernel: [26856.698611] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Aug 22 08:28:27 sierra kernel: [26858.697999] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Aug 22 08:28:45 sierra kernel: [26876.970584] Clocksource tsc unstable (delta = -1769858659 ns)
Aug 22 08:28:45 sierra kernel: [26876.970589] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
Aug 22 08:28:45 sierra kernel: [26876.971993] [sched_delayed] sched: RT throttling activated

Note that the "Clocksource tsc unstable" message still appears, even with hpet selected as the clocksource.
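To keep track of which Xid errors turn up across crashes, a small filter like this could pull the Xid numbers out of a saved log. It is a sketch only: the function name xid_codes is made up, and the pattern is based solely on the format of the Xid line in the log above, so other driver versions may print it differently.

```shell
#!/bin/sh
# xid_codes FILE (hypothetical helper)
# Print the Xid number from every "NVRM: Xid (...): N, ..." line in FILE,
# one per line, in the order they were logged.
xid_codes() {
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p' "$1"
}
```

For example, "xid_codes /var/log/syslog | sort -n | uniq -c" would count how often each Xid occurs (the log path is an assumption and depends on the distro).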