Old 12-19-10, 08:26 PM   #37
ShiningArcanine
Registered User
 
Join Date: Aug 2006
Posts: 106
Re: "[mi] EQ overflowing. The server is probably stuck in an infinite loop."

I recompiled my kernel with CONFIG_MTRR_SANITIZER enabled, and all of the issues seem to have gone away.

There is an unusually high number of Gentoo Linux users posting about this issue (myself included), and almost all Gentoo Linux users compile their own kernels. Most Gentoo Linux users modify a kernel .config file provided by kernel-seeds.org, which has CONFIG_MTRR_SANITIZER disabled by default. For those having this issue, please try recompiling your kernels with CONFIG_MTRR_SANITIZER enabled. If you are having the issue and already have CONFIG_MTRR_SANITIZER enabled in your kernel, please report that as well. If enough people report what their kernel configurations are, perhaps we can get to the bottom of this.
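
For anyone unsure how their kernel was built, the check can be scripted. The following is only a minimal Python sketch, not an official tool: it scans the text of a kernel .config for CONFIG_MTRR* options and reports whether CONFIG_MTRR_SANITIZER is set to y. The function names (mtrr_options, sanitizer_enabled) and the sample text are made up for illustration.

```python
import re

def mtrr_options(config_text):
    """Return a dict of the CONFIG_MTRR* options found in kernel .config text."""
    options = {}
    for line in config_text.splitlines():
        line = line.strip()
        # Enabled options look like NAME=value.
        m = re.match(r"(CONFIG_MTRR\w*)=(.*)$", line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        # Disabled options are recorded as comments: "# NAME is not set".
        m = re.match(r"# (CONFIG_MTRR\w*) is not set$", line)
        if m:
            options[m.group(1)] = None
    return options

def sanitizer_enabled(config_text):
    """True only if CONFIG_MTRR_SANITIZER is built in (=y)."""
    return mtrr_options(config_text).get("CONFIG_MTRR_SANITIZER") == "y"

# Hypothetical sample: a config with the sanitizer left disabled.
SAMPLE_CONFIG = """\
CONFIG_MTRR=y
# CONFIG_MTRR_SANITIZER is not set
"""

print("sanitizer enabled:", sanitizer_enabled(SAMPLE_CONFIG))
```

To check a real system, read your own .config into a string (on Gentoo it usually lives at /usr/src/linux/.config) and pass that to sanitizer_enabled() instead of the sample.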
Old 12-19-10, 09:18 PM   #38
alexmurray
Registered User
 
Join Date: Mar 2010
Posts: 11

Or apparently you can enable the mtrr sanitizer dynamically at boot time by specifying the following boot command line options:

Code:
enable_mtrr_cleanup mtrr_spare_reg_nr=0
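
To confirm that the options actually reached the kernel after a reboot, you can look at the running kernel's command line (/proc/cmdline). Below is a small Python sketch of that check, under the assumption that the options appear as plain space-separated tokens. The function name mtrr_cmdline_settings and the sample string (including root=/dev/sda1) are hypothetical.

```python
def mtrr_cmdline_settings(cmdline):
    """Extract the MTRR cleanup options from a kernel command line string."""
    settings = {"cleanup": None, "spare_reg_nr": None}
    for token in cmdline.split():
        if token == "enable_mtrr_cleanup":
            settings["cleanup"] = True
        elif token == "disable_mtrr_cleanup":
            settings["cleanup"] = False
        elif token.startswith("mtrr_spare_reg_nr="):
            # Keep whatever value was passed; this sketch does not judge it.
            settings["spare_reg_nr"] = int(token.split("=", 1)[1])
    return settings

# Hypothetical command line; on a live system read /proc/cmdline instead.
SAMPLE_CMDLINE = "root=/dev/sda1 enable_mtrr_cleanup mtrr_spare_reg_nr=1"

print(mtrr_cmdline_settings(SAMPLE_CMDLINE))
```

A value of None for "cleanup" means neither option was given on the command line, so the kernel falls back to whatever default it was compiled with.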
Old 12-20-10, 11:13 AM   #39
ShiningArcanine
Registered User
 
Join Date: Aug 2006
Posts: 106

I did some more reading on this. The zero value is likely wrong. It should be:

Code:
enable_mtrr_cleanup mtrr_spare_reg_nr=1
You can check whether or not you need MTRR cleanup by running cat /proc/mtrr. Here is what mine looked like before enabling MTRR cleanup:

Quote:
reg00: base=0x0d0000000 ( 3328MB), size= 256MB, count=1: uncachable
reg01: base=0x0e0000000 ( 3584MB), size= 512MB, count=1: uncachable
reg02: base=0x000000000 ( 0MB), size= 8192MB, count=1: write-back
reg03: base=0x200000000 ( 8192MB), size= 512MB, count=1: write-back
reg04: base=0x220000000 ( 8704MB), size= 256MB, count=1: write-back
With MTRR cleanup enabled, cat /proc/mtrr shows:

Quote:
reg00: base=0x000000000 ( 0MB), size= 2048MB, count=1: write-back
reg01: base=0x080000000 ( 2048MB), size= 1024MB, count=1: write-back
reg02: base=0x0c0000000 ( 3072MB), size= 256MB, count=1: write-back
reg03: base=0x100000000 ( 4096MB), size= 4096MB, count=1: write-back
reg04: base=0x200000000 ( 8192MB), size= 512MB, count=1: write-back
reg05: base=0x220000000 ( 8704MB), size= 256MB, count=1: write-back
If cat /proc/mtrr shows uncachable lines, then you need to do something about it. One way is to go into your BIOS, find the MTRR setting and change it from continuous to discrete. The other way is to enable MTRR cleanup, either via a boot parameter or by recompiling your kernel.
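
If you would rather not eyeball the output, the check for uncachable lines can be sketched in a few lines of Python. This is an illustration only: the regular expression assumes the /proc/mtrr line format shown in this thread (sizes in MB), and the names parse_mtrr and uncachable_entries are made up. The sample is the "before" listing from this post.

```python
import re

# Matches lines such as
# "reg00: base=0x0d0000000 ( 3328MB), size= 256MB, count=1: uncachable".
# Assumption: sizes are reported in MB, as in every listing posted here.
MTRR_LINE = re.compile(
    r"reg(?P<reg>\d+): base=0x(?P<base>[0-9a-fA-F]+) \(\s*(?P<base_mb>\d+)MB\), "
    r"size=\s*(?P<size_mb>\d+)MB, count=(?P<count>\d+): (?P<type>[\w-]+)"
)

def parse_mtrr(text):
    """Turn /proc/mtrr text into a list of dicts; unrecognized lines are skipped."""
    entries = []
    for line in text.splitlines():
        m = MTRR_LINE.search(line)
        if m is None:
            continue
        entries.append({
            "reg": int(m.group("reg")),
            "base": int(m.group("base"), 16),   # base address in bytes
            "base_mb": int(m.group("base_mb")),
            "size_mb": int(m.group("size_mb")),
            "type": m.group("type"),
        })
    return entries

def uncachable_entries(entries):
    """Return only the entries marked uncachable."""
    return [e for e in entries if e["type"] == "uncachable"]

SAMPLE = """\
reg00: base=0x0d0000000 ( 3328MB), size= 256MB, count=1: uncachable
reg01: base=0x0e0000000 ( 3584MB), size= 512MB, count=1: uncachable
reg02: base=0x000000000 ( 0MB), size= 8192MB, count=1: write-back
reg03: base=0x200000000 ( 8192MB), size= 512MB, count=1: write-back
reg04: base=0x220000000 ( 8704MB), size= 256MB, count=1: write-back
"""

for e in uncachable_entries(parse_mtrr(SAMPLE)):
    print("reg%02d: %dMB uncachable at %dMB" % (e["reg"], e["size_mb"], e["base_mb"]))
```

On a live system, replace SAMPLE with the contents of /proc/mtrr. The hex base is kept alongside the MB figure so the two can be cross-checked (base divided by 2**20 should equal the MB value in parentheses).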

Here are my references:

http://www.gentoo.org/doc/en/nvidia-guide.xml#doc_chap4
http://en.gentoo-wiki.com/wiki/MTRR
http://linuxindetails.wordpress.com/category/x11/

The Gentoo Nvidia guide says that having uncachable entries in /proc/mtrr is a problem. The MTRR page at the unofficial Gentoo Wiki explains the nature of the MTRR settings. The blog covers the kernel parameters.

With the MTRR fixed, the issues I have been having appear to have gone away. I think the reason is that as programs run, their memory is placed at random locations in system memory by heap randomization. If you have uncachable MTRR entries, then anything allocated inside of them will suffer a performance penalty, and the resulting lag can expose race conditions that would not normally be triggered. At the same time, there could be multiple issues at work here, so fixing this one will not necessarily fix the other issues affecting people, but so far, my issues appear to be fixed.
Old 12-20-10, 02:11 PM   #40
braingravyuk
Registered User
 
Join Date: Oct 2010
Location: Manchester, UK
Posts: 10

Hi,
Had a look at my mtrr and it had the following output:
Code:
$ cat /proc/mtrr
reg00: base=0x1c0000000 ( 7168MB), size= 1024MB, count=1: uncachable
reg01: base=0x000000000 (    0MB), size= 8192MB, count=1: write-back
reg02: base=0x0c0000000 ( 3072MB), size= 1024MB, count=1: uncachable
reg03: base=0x0bf800000 ( 3064MB), size=    8MB, count=1: uncachable
I enabled MTRR cleanup in the kernel with the settings stated in the previous post, which gave me this:
Code:
$ cat /proc/mtrr 
reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
reg01: base=0x080000000 ( 2048MB), size= 1024MB, count=1: write-back
reg02: base=0x0bf800000 ( 3064MB), size=    8MB, count=1: uncachable
reg03: base=0x100000000 ( 4096MB), size= 2048MB, count=1: write-back
reg04: base=0x180000000 ( 6144MB), size= 1024MB, count=1: write-back
The system still crashed, though.

This friend was in my Xorg.log.old:
(WW) NVIDIA(0): WAIT (0, 6, 0x8000, 0x0000a390, 0x0000a390)

This was in my kernel log:
NVRM: Xid (0003:00): 53, CMDre 00000000 0000089c 0100cb12 00000007 00000000

As for my MTRR, where is the 8MB uncachable entry coming from? Any ideas?
Old 12-20-10, 05:47 PM   #41
ShiningArcanine
Registered User
 
Join Date: Aug 2006
Posts: 106

I think that I overstated the issue that uncachable memory causes, as there are certain conditions when uncachable memory is desirable. In your case, some component on your motherboard likely earmarks 8MB of your system's RAM, so that region is set as being uncachable to keep your cache from corrupting data in that region of memory.
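
It may help to see that the 8MB entry is not new. Where MTRR ranges overlap, uncachable takes precedence over write-back, so the effective write-back coverage is the write-back ranges minus the uncachable ones. The Python sketch below works that out for the two /proc/mtrr listings in the post above (values in MB). The helper names are invented for this example, and it only models the two memory types that appear in this thread.

```python
def merge(ranges):
    """Merge overlapping or touching (start, end) ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def subtract(ranges, holes):
    """Cut every hole out of every range; both inputs must already be merged."""
    result = []
    for start, end in ranges:
        cursor = start
        for h_start, h_end in holes:
            if h_end <= cursor or h_start >= end:
                continue  # hole does not touch the remaining part of this range
            if h_start > cursor:
                result.append((cursor, h_start))
            cursor = max(cursor, h_end)
        if cursor < end:
            result.append((cursor, end))
    return result

def effective_write_back(entries):
    """entries: (base_mb, size_mb, type) tuples. Returns write-back ranges in MB."""
    wb = merge((b, b + s) for b, s, t in entries if t == "write-back")
    uc = merge((b, b + s) for b, s, t in entries if t == "uncachable")
    return subtract(wb, uc)

# The two listings from the previous post, as (base MB, size MB, type).
BEFORE = [(7168, 1024, "uncachable"), (0, 8192, "write-back"),
          (3072, 1024, "uncachable"), (3064, 8, "uncachable")]
AFTER = [(0, 2048, "write-back"), (2048, 1024, "write-back"),
         (3064, 8, "uncachable"), (4096, 2048, "write-back"),
         (6144, 1024, "write-back")]

print("before:", effective_write_back(BEFORE))
print("after: ", effective_write_back(AFTER))
```

Both layouts come out as the same two write-back ranges, 0 to 3064MB and 4096 to 7168MB. In other words, the cleanup re-expressed the same map with fewer uncachable entries, and the 8MB region at 3064MB was already there as reg03 in the original listing.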

Anyway, having uncachable entries in the MTRR does not cause crashes per se, but when there is buggy code on your system, it can trigger race conditions when regions of memory used by programs are marked as uncachable. Try upgrading your kernel to 2.6.36.2, your nvidia driver to 260.19.29 and your xorg server to 1.9.3. Those are the versions I am using on my system and things have been better since I fixed the MTRR issue.

An alternative that has worked for other people is downgrading xorg server to 1.8.x.
Old 12-21-10, 12:23 AM   #42
rleon
Registered User
 
Join Date: Dec 2010
Posts: 3

I am still experiencing crashes and soft lockups despite having MTRR cleanup enabled and running the latest NVIDIA driver.
Code:
[  1159.112] (II) NVIDIA GLX Module  260.19.29  Wed Dec  8 12:24:30 PST 2010
cat /proc/mtrr
Code:
reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
reg01: base=0x080000000 ( 2048MB), size= 1024MB, count=1: write-back
reg02: base=0x100000000 ( 4096MB), size= 1024MB, count=1: write-back
grep MTRR /usr/src/linux/.config
Code:
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=1
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1
uname -a
Code:
Linux tux 2.6.36-gentoo-r5
tail /var/log/messages
Code:
Dec 20 21:04:10 tux kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Dec 20 21:04:27 tux kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Dec 20 21:04:43 tux kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Dec 20 21:04:47 tux kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Dec 20 21:05:40 tux kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Dec 20 21:06:18 tux kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Dec 20 21:06:45 tux su[2218]: pam_unix(su:session): session closed for user root
Dec 20 21:07:01 tux su[2298]: Successful su for root by leon
Dec 20 21:07:01 tux su[2298]: + /dev/tty1 leon:root
Dec 20 21:07:01 tux su[2298]: pam_unix(su:session): session opened for user root by leon(uid=1000)
Xorg.0.log
Code:
[  1182.971] (WW) NVIDIA(0): WAIT (2, 6, 0x8000, 0x000000b0, 0x00005878)
[  1189.971] (WW) NVIDIA(0): WAIT (1, 6, 0x8000, 0x000000b0, 0x00005878)
[  1193.021] (WW) NVIDIA(0): WAIT (2, 6, 0x8000, 0x000000b0, 0x000069dc)
[  1205.953] (WW) NVIDIA(0): WAIT (0, 6, 0x8000, 0x00008cb8, 0x00008cb8)
[  1208.956] (WW) NVIDIA(0): WAIT (2, 6, 0x8000, 0x00008cb8, 0x0000734c)
[  1341.581] (WW) NVIDIA(0): WAIT (0, 6, 0x8000, 0x00007914, 0x00007914)
[  1455.864] (WW) NVIDIA(0): WAIT (0, 6, 0x8000, 0x00006160, 0x00006160)
[  1472.152] (WW) NVIDIA(0): WAIT (0, 6, 0x8000, 0x0000b880, 0x0000b880)
[  1475.168] (WW) NVIDIA(0): WAIT (2, 6, 0x8000, 0x0000c810, 0x0000c1cc)
[ 1478.958] [mi] EQ overflowing. The server is probably stuck in an infinite loop.
Code:
[  1478.958] 
Backtrace:
[  1478.958] 0: /usr/bin/X (xorg_backtrace+0x28) [0x4a2708]
[  1478.958] 1: /usr/bin/X (mieqEnqueue+0x1f4) [0x4a20a4]
[  1478.958] 2: /usr/bin/X (xf86PostMotionEventP+0xc4) [0x47f164]
[  1478.958] 3: /usr/lib64/xorg/modules/input/evdev_drv.so (0x7fb71ff1c000+0x423f) [0x7fb71ff2023f]
[  1478.958] 4: /usr/bin/X (0x400000+0x6d1e7) [0x46d1e7]
[  1478.959] 5: /usr/bin/X (0x400000+0x11b3f3) [0x51b3f3]
[  1478.959] 6: /lib/libpthread.so.0 (0x7fb725a3b000+0xf010) [0x7fb725a4a010]
[  1478.959] 7: /usr/lib64/xorg/modules/drivers/nvidia_drv.so (0x7fb7209c0000+0x6c1dd) [0x7fb720a2c1dd]
[  1478.959] 8: /usr/lib64/xorg/modules/drivers/nvidia_drv.so (0x7fb7209c0000+0x6ce19) [0x7fb720a2ce19]
[  1478.959] 9: /usr/lib64/xorg/modules/drivers/nvidia_drv.so (0x7fb7209c0000+0xcc686) [0x7fb720a8c686]
[  1478.959] 10: /usr/lib64/xorg/modules/drivers/nvidia_drv.so (0x7fb7209c0000+0x983b2) [0x7fb720a583b2]
[  1478.959] 11: /usr/lib64/xorg/modules/drivers/nvidia_drv.so (0x7fb7209c0000+0x3c142d) [0x7fb720d8142d]
[  1478.959] 12: /usr/bin/X (0x400000+0x16684c) [0x56684c]
[  1478.959] 13: /usr/bin/X (0x400000+0xa4b9c) [0x4a4b9c]
[  1478.959] 14: /usr/bin/X (BlockHandler+0x50) [0x434420]
[  1478.959] 15: /usr/bin/X (WaitForSomething+0x141) [0x45cec1]
[  1478.959] 16: /usr/bin/X (0x400000+0x2efc2) [0x42efc2]
[  1478.959] 17: /usr/bin/X (0x400000+0x2490b) [0x42490b]
[  1478.959] 18: /lib/libc.so.6 (__libc_start_main+0xfd) [0x7fb7249cebbd]
[  1478.959] 19: /usr/bin/X (0x400000+0x24499) [0x424499]
Attached Files
File Type: gz nvidia-bug-report.log.gz (39.8 KB, 107 views)
Old 12-21-10, 09:06 AM   #43
rleon
Registered User
 
Join Date: Dec 2010
Posts: 3

Just for the record:
"EQ overflowing" is not a bug, dang it. It's a symptom.
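
Since the EQ overflow message is only the last thing X logs, it can be more telling to count what piles up in the logs before it. Here is a rough Python sketch that tallies the message types quoted in this thread. The signature strings are copied from the logs posted above; the dictionary keys and the function name count_signatures are arbitrary names chosen for this example.

```python
from collections import Counter

# Substrings taken from the kernel and Xorg log excerpts in this thread.
SIGNATURES = {
    "xid": "NVRM: Xid",
    "os_schedule": "NVRM: os_schedule: Attempted to yield the CPU",
    "wait": "(WW) NVIDIA(0): WAIT",
    "eq_overflow": "[mi] EQ overflowing",
}

def count_signatures(lines):
    """Count how many log lines contain each known signature."""
    counts = Counter()
    for line in lines:
        for name, needle in SIGNATURES.items():
            if needle in line:
                counts[name] += 1
    return counts

# A few lines copied from the posts above.
SAMPLE_LINES = [
    "Dec 20 21:04:10 tux kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context",
    "Dec 20 21:04:27 tux kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context",
    "[  1475.168] (WW) NVIDIA(0): WAIT (2, 6, 0x8000, 0x0000c810, 0x0000c1cc)",
    "[ 1478.958] [mi] EQ overflowing. The server is probably stuck in an infinite loop.",
    "Dec 20 21:07:01 tux su[2298]: Successful su for root by leon",
]

for name, count in sorted(count_signatures(SAMPLE_LINES).items()):
    print(name, count)
```

Feeding it the lines of /var/log/messages and Xorg.0.log shows whether the NVRM and WAIT messages were already accumulating before the EQ overflow appeared.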
Old 12-21-10, 02:12 PM   #44
rleon
Registered User
 
Join Date: Dec 2010
Posts: 3

I found a solution for my case. Via nvidia-select, I set PowerMizer to maximum performance instead of adaptive mode: no more soft lockups, no more crashes.
Old 12-23-10, 01:45 PM   #45
ShiningArcanine
Registered User
 
Join Date: Aug 2006
Posts: 106

How do you obtain nvidia-select?
Old 12-23-10, 02:43 PM   #46
braingravyuk
Registered User
 
Join Date: Oct 2010
Location: Manchester, UK
Posts: 10

It's a separate package in Gentoo, just emerge media-video/nvidia-settings (remember to grab unstable).

As for changing PowerMizer, it still crashed for me, so no fix for me yet. I have not bothered to downgrade the Xorg server yet.
Old 12-30-10, 10:55 AM   #47
one_and_only
Registered User
 
Join Date: Mar 2007
Posts: 47

And some info from the dev team about MTRR:
http://www.nvnews.net/vbulletin/showthread.php?t=133270
I don't know if I understood zander properly, but according to what he wrote, if PAT is working properly, you don't need to manipulate MTRR... So how could that possibly help?
Old 01-04-11, 01:02 AM   #48
jpi110
Gentoo User
 
Join Date: Jan 2011
Location: Portland, Oregon
Posts: 14

I ran into this same problem for a while and couldn't figure out the source of the issue. For sanity, I rarely recompile my kernel except when obvious bugs or performance issues impact me.

I'm using Gentoo on an amd64 arch. Hardware is:

2 Quad Core Opteron 2378s
16G of RAM
2 Nvidia Geforce GTX 295s (I also tried 2 GTX 275s) in SLI mode

I kept banging my head against this since my 2.6.34-gentoo-r1 kernel was just fine for many iterations of wine and the opengl apps I used -- as well as nvidia driver versions. At some point, though, I began to notice exactly the same error messages as posted in this thread.

I began to suspect IOMMU issues, bad drives, something with the drivers, but the last thing that had ever occurred to me was the kernel - after all it had worked for a very long time with no issues.

4677152 Aug 19 20:04 kernel-genkernel-x86_64-2.6.34-gentoo-r1

So, at least 4 months without issue. Then, suddenly, these errors cropped up. Google searches for it really didn't give any useful data. Though, obviously today that changed when I added the required Gentoo keyword which pointed me to this thread.

I've poked around with both some of the kernel settings suggested here (sanitizing the MTRR) and X-specific settings. Fixed: It's shortened the amount of video memory I have on the cards (1792 x 2 ... but only 1792 of that will be seen in SLI mode) to 1024M. I'll work out how to fix that problem at a later date.

It should be noted, however, that you'll need to recompile (via module-rebuild) your kernel drivers even if you think the settings changes are innocuous and/or it's the same kernel version. Some values it uses are apparently determined at initial compile time rather than detected on boot.

I also did as the other poster suggested in putting the nvidia-settings to always prefer performance (rather than adaptive). I'll post my findings once I can determine whether or not it works without crashing at some random interval later.

At the moment, however, the application seems to have started without causing my system to immediately die (X turns the monitor off on the 295s rather than the app freezing on screen like it did on the 275s) with this reported in /var/log/messages:

Jan 2 14:11:56 unimatrix-01 kernel: [577935.374469] NVRM: os_pci_init_handle: invalid context!
Jan 2 14:11:56 unimatrix-01 kernel: [577935.374471] NVRM: os_pci_init_handle: invalid context!
Jan 2 14:11:56 unimatrix-01 kernel: [577935.374573] NVRM: Xid (0006:00): 6, PE0001
Jan 2 14:11:58 unimatrix-01 kernel: [577937.374307] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jan 2 14:12:00 unimatrix-01 kernel: [577939.374336] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

Forgot to mention: I also ran into identical issues with 2.6.36-gentoo-r5. (And 2.6.36-ck-r3 and 2.6.34-ck-r3)

Edit to add: At the moment, mind you it's only been about a 40 minute test, the problem appears to have resolved itself with these fixes. However, I will know more tonight when I can run it for longer (sometimes the problem only occurs after 2 hours or so - and other times I can go an entire day before it just randomly appears).

I did also fix the amount of video memory available after enabling the MTRR sanitation by setting the iommu boot parameter.

One particular symptom to watch for when your display crashes, especially on a multi-CPU system, shows up in a program like htop. I had it running on my laptop sitting next to the system in question. I noted a few anomalies: the log messages posted above, and CPU activity immediately jumping to 100% on one CPU (of the 8). X was the consumer in this case, as expected. X could not be killed manually, although any other application could be terminated.
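
If htop is not at hand, the same "one CPU pinned at 100%" pattern can be spotted from two snapshots of /proc/stat. What follows is only a Python sketch under stated assumptions: it uses the first eight numeric fields of each cpuN line (user, nice, system, idle, iowait, irq, softirq, steal), counts idle and iowait as not busy, and ignores the later guest fields on the assumption that they are already included in user time. The function names, the threshold and the two sample snapshots are invented for illustration.

```python
def cpu_times(stat_text):
    """Return {cpu_name: (busy_ticks, total_ticks)} from /proc/stat text."""
    times = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Keep per-CPU lines (cpu0, cpu1, ...); skip the aggregate "cpu" line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        values = [int(v) for v in fields[1:9]]
        total = sum(values)
        idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
        times[fields[0]] = (total - idle, total)
    return times

def busy_fractions(before_text, after_text):
    """Fraction of time each CPU was busy between two snapshots."""
    before = cpu_times(before_text)
    after = cpu_times(after_text)
    fractions = {}
    for name, (busy_after, total_after) in after.items():
        if name not in before:
            continue
        busy_before, total_before = before[name]
        d_total = total_after - total_before
        if d_total > 0:
            fractions[name] = (busy_after - busy_before) / d_total
    return fractions

def pegged_cpus(fractions, threshold=0.95):
    """Names of CPUs at or above the (arbitrary) busy threshold."""
    return sorted(name for name, f in fractions.items() if f >= threshold)

# Hypothetical snapshots of a two-CPU machine taken a moment apart.
SNAP_A = """\
cpu  110 0 60 1830 0 0 0 0
cpu0 100 0 50 850 0 0 0 0
cpu1 10 0 10 980 0 0 0 0
"""
SNAP_B = """\
cpu  215 0 65 1920 0 0 0 0
cpu0 200 0 50 850 0 0 0 0
cpu1 15 0 15 1070 0 0 0 0
"""

print(pegged_cpus(busy_fractions(SNAP_A, SNAP_B)))
```

On a real system you would read /proc/stat twice a few seconds apart (for example over ssh from another machine, since the local display is dead) and pass both texts in. A single CPU stuck near 1.0 while the others stay low matches the behaviour described above.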

It seems to be a race condition somewhere, but I don't know enough about the code or troubleshooting the drivers to determine the cause. Though, if provided some steps to take beyond what I was able to see in the logs (and htop, and dstat), I'd be happy to take them and provide the requisite data. I do think I have process accounting turned on as well if that would provide any measurable or useful data. I did, however, check the nvidia reporting tool and it didn't provide any useful data beyond what I was able to observe myself.

Last edited by jpi110; 01-04-11 at 06:33 PM. Reason: Updated with results of most recent test