01-04-11, 01:02 AM   #48
jpi110
Gentoo User
 
Join Date: Jan 2011
Location: Portland, Oregon
Posts: 14
Re: "[mi] EQ overflowing. The server is probably stuck in an infinite loop."

I ran into this same problem for a while and couldn't figure out the source of the issue. For sanity, I rarely recompile my kernel except when obvious bugs or performance issues impact me.

I'm using Gentoo on an amd64 arch. Hardware is:

2 Quad Core Opteron 2378s
16G of RAM
2 Nvidia Geforce GTX 295s (I also tried 2 GTX 275s) in SLI mode

I kept banging my head against this since my 2.6.34-gentoo-r1 kernel was just fine for many iterations of wine and the opengl apps I used -- as well as nvidia driver versions. At some point, though, I began to notice exactly the same error messages as posted in this thread.

I began to suspect IOMMU issues, bad drives, something with the drivers, but the last thing that had ever occurred to me was the kernel - after all it had worked for a very long time with no issues.

4677152 Aug 19 20:04 kernel-genkernel-x86_64-2.6.34-gentoo-r1

So, at least 4 months without issue. Then, suddenly, these errors cropped up. Google searches for them didn't turn up anything useful. That changed today, when I added "Gentoo" as a search keyword, which pointed me to this thread.

I've poked around with both the kernel settings suggested here (sanitizing the MTRR) and the X-specific settings. One side effect (since fixed, as noted further down): it reduced the amount of video memory available on the cards (1792M x 2 ... but only 1792M of that will be seen in SLI mode) to 1024M. I'll work out how to fix that problem at a later date.

It should be noted, however, that you'll need to recompile (via module-rebuild) your kernel drivers even if you think the settings changes are innocuous and/or it's the same kernel version. Some values it uses are apparently determined at initial compile time rather than detected on boot.

I also did as the other poster suggested in putting the nvidia-settings to always prefer performance (rather than adaptive). I'll post my findings once I can determine whether or not it works without crashing at some random interval later.

At the moment, however, the application seems to have started without causing my system to immediately die. When it did die before, X turned the monitor off on the 295s (rather than the app freezing on screen like it did on the 275s), with this reported in /var/log/messages:

Jan 2 14:11:56 unimatrix-01 kernel: [577935.374469] NVRM: os_pci_init_handle: invalid context!
Jan 2 14:11:56 unimatrix-01 kernel: [577935.374471] NVRM: os_pci_init_handle: invalid context!
Jan 2 14:11:56 unimatrix-01 kernel: [577935.374573] NVRM: Xid (0006:00): 6, PE0001
Jan 2 14:11:58 unimatrix-01 kernel: [577937.374307] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jan 2 14:12:00 unimatrix-01 kernel: [577939.374336] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
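
For anyone comparing logs, here is a rough Python sketch that tallies NVRM messages and pulls out Xid reports from lines shaped like the excerpt above. The function name summarize_nvrm and both regular expressions are my own, and they assume the exact syslog layout shown in this post; other syslog configurations will likely need different patterns.

```python
import re
from collections import Counter

# Assumed layout (taken from the excerpt in this post):
#   "<date> <host> kernel: [<uptime>] NVRM: <message>"
NVRM_LINE = re.compile(
    r"kernel: \[\s*(?P<uptime>\d+\.\d+)\] NVRM: (?P<message>.*)$"
)
# Xid report in the form seen above: "Xid (0006:00): 6, PE0001"
XID = re.compile(r"^Xid \((?P<device>[^)]+)\): (?P<code>\d+), (?P<detail>.*)$")

# The five lines quoted in the post, used only as sample input.
SAMPLE = """\
Jan 2 14:11:56 unimatrix-01 kernel: [577935.374469] NVRM: os_pci_init_handle: invalid context!
Jan 2 14:11:56 unimatrix-01 kernel: [577935.374471] NVRM: os_pci_init_handle: invalid context!
Jan 2 14:11:56 unimatrix-01 kernel: [577935.374573] NVRM: Xid (0006:00): 6, PE0001
Jan 2 14:11:58 unimatrix-01 kernel: [577937.374307] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jan 2 14:12:00 unimatrix-01 kernel: [577939.374336] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
"""


def summarize_nvrm(lines):
    """Return (counts of non-Xid NVRM messages, list of Xid reports)."""
    counts = Counter()
    xids = []
    for line in lines:
        match = NVRM_LINE.search(line.rstrip("\n"))
        if match is None:
            continue  # not an NVRM kernel line
        message = match.group("message")
        xid = XID.match(message)
        if xid:
            xids.append((
                float(match.group("uptime")),  # seconds since boot
                xid.group("device"),
                int(xid.group("code")),
                xid.group("detail"),
            ))
        else:
            counts[message] += 1
    return counts, xids


counts, xids = summarize_nvrm(SAMPLE.splitlines())
for message, count in counts.most_common():
    print(f"{count}x {message}")
for uptime, device, code, detail in xids:
    print(f"Xid {code} on {device} at {uptime:.6f}s: {detail}")
```

To run it against a real log, pass an open file instead of SAMPLE, e.g. summarize_nvrm(open("/var/log/messages", errors="replace")). The sketch only groups and counts; it makes no attempt to interpret what a given Xid code means.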

Forgot to mention: I also ran into identical issues with 2.6.36-gentoo-r5. (And 2.6.36-ck-r3 and 2.6.34-ck-r3)

Edit to add: So far, and mind you it's only been about a 40-minute test, the problem appears to have been resolved by these fixes. However, I will know more tonight when I can run it for longer (sometimes the problem only occurs after 2 hours or so - and other times I can go an entire day before it just randomly appears).

I did also fix the amount of video memory available after enabling the MTRR sanitation by setting the iommu boot parameter.
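
If you want to confirm which of these options your running kernel was actually booted with, a small Python sketch like the following can parse the kernel command line. Note the assumptions: I haven't listed the iommu value I used, so the script only reports whatever is present; enable_mtrr_cleanup and disable_mtrr_cleanup are the boot-time switches for the MTRR sanitizer as I understand them from the kernel documentation; and the names parse_cmdline, report and WATCHED are made up for this example.

```python
import shlex

# Parameters of interest for this thread (an assumed, illustrative selection).
WATCHED = ("iommu", "enable_mtrr_cleanup", "disable_mtrr_cleanup")


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in shlex.split(cmdline):
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def report(cmdline):
    """Describe whether each watched parameter is present on the command line."""
    params = parse_cmdline(cmdline)
    lines = []
    for name in WATCHED:
        if name not in params:
            lines.append(f"{name}: not set")
        elif params[name] is None:
            lines.append(f"{name}: set (flag)")
        else:
            lines.append(f"{name}={params[name]}")
    return "\n".join(lines)
```

Usage would be something like print(report(open("/proc/cmdline").read())). This only shows what was passed at boot; it says nothing about options compiled into the kernel configuration itself.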

One particular symptom to watch for when your display crashes, especially if you're using a multi-CPU system, shows up in a program like htop. I had it running on my laptop sitting next to the system in question. I noted a few anomalies: the log entries posted above, and CPU activity immediately jumping to 100% on one CPU (of the 8). X was the consumer in this case, as expected. X could not be killed manually, though any other application could be terminated as usual.
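
If you would rather catch the pegged CPU from a script than by watching htop, here is a minimal Python sketch that compares two snapshots of /proc/stat and lists the CPUs that were busy almost the whole interval. The function names and the 95% threshold are my own choices, and it assumes the usual /proc/stat field order (user, nice, system, idle, iowait, irq, softirq, steal).

```python
def parse_proc_stat(text):
    """Map 'cpuN' -> (busy_ticks, total_ticks) from the contents of /proc/stat."""
    samples = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep per-CPU lines ("cpu0 ...") and skip the aggregate "cpu" line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        ticks = [int(value) for value in fields[1:9]]  # user .. steal
        idle = sum(ticks[3:5])  # idle + iowait both count as not busy
        total = sum(ticks)
        samples[fields[0]] = (total - idle, total)
    return samples


def busy_percent(before, after):
    """Per-CPU busy percentage between two parse_proc_stat() results."""
    result = {}
    for cpu, (busy_after, total_after) in after.items():
        if cpu not in before:
            continue
        busy_before, total_before = before[cpu]
        elapsed = total_after - total_before
        if elapsed > 0:
            result[cpu] = 100.0 * (busy_after - busy_before) / elapsed
    return result


def pegged_cpus(before, after, threshold=95.0):
    """Names of CPUs at or above the (assumed) busy threshold, in percent."""
    usage = busy_percent(before, after)
    return sorted(cpu for cpu, pct in usage.items() if pct >= threshold)
```

To use it, read /proc/stat, sleep for a few seconds, read it again, and pass both parsed snapshots to pegged_cpus. Run it over SSH from another machine, since the local display is gone by the time this symptom appears. It only tells you which CPU is saturated, not which process is responsible; htop or top is still needed for that.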

It seems to be a race condition somewhere, but I don't know enough about the code or troubleshooting the drivers to determine the cause. Though, if provided some steps to take beyond what I was able to see in the logs (and htop, and dstat), I'd be happy to take them and provide the requisite data. I do think I have process accounting turned on as well if that would provide any measurable or useful data. I did, however, check the nvidia reporting tool and it didn't provide any useful data beyond what I was able to observe myself.

Last edited by jpi110; 01-04-11 at 06:33 PM. Reason: Updated with results of most recent test