#37
Registered User
Join Date: Aug 2006
Posts: 106
I recompiled my kernel with CONFIG_MTRR_SANITIZER and all of the issues seem to have gone away.
There is an unusually high number of Gentoo Linux users posting about this issue (myself included), and almost all Gentoo Linux users compile their own kernels. Most Gentoo Linux users modify a kernel .config file provided by kernel-seeds.org, which has CONFIG_MTRR_SANITIZER disabled by default. For those having this issue, please try recompiling your kernels with CONFIG_MTRR_SANITIZER enabled. If you are having the issue and already have CONFIG_MTRR_SANITIZER enabled in your kernel, please report that you do. If enough people report what their kernel configurations are, perhaps we can get to the bottom of this.
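For anyone unsure what their kernel was built with, here is a minimal Python sketch (not from the original posts) that scans kernel config text for the MTRR options. The function name `mtrr_options` and the sample text are made up for illustration; on a real system you would feed it the contents of your own .config, wherever your kernel sources keep it.

```python
import re

def mtrr_options(config_text):
    """Return a dict of CONFIG_MTRR* options found in kernel config text.

    Options written as '# CONFIG_FOO is not set' are reported as 'n'.
    """
    options = {}
    for line in config_text.splitlines():
        line = line.strip()
        set_match = re.match(r"^(CONFIG_MTRR\w*)=(.*)$", line)
        unset_match = re.match(r"^# (CONFIG_MTRR\w*) is not set$", line)
        if set_match:
            options[set_match.group(1)] = set_match.group(2)
        elif unset_match:
            options[unset_match.group(1)] = "n"
    return options

# Hypothetical sample: a config with the sanitizer disabled.
sample = """\
CONFIG_MTRR=y
# CONFIG_MTRR_SANITIZER is not set
"""

found = mtrr_options(sample)
sanitizer_on = found.get("CONFIG_MTRR_SANITIZER") == "y"
print("MTRR options:", found)
print("sanitizer enabled:", sanitizer_on)
```

If the sanitizer shows up as not set, that matches the kernel-seeds.org default described above.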
#38
Registered User
Join Date: Mar 2010
Posts: 11
Or apparently you can enable the mtrr sanitizer dynamically at boot time by specifying the following boot command line options:
Code:
enable_mtrr_cleanup mtrr_spare_reg_nr=0
#39
Registered User
Join Date: Aug 2006
Posts: 106
I did some more reading on this. The zero value is likely wrong. It should be:
Code:
enable_mtrr_cleanup mtrr_spare_reg_nr=1
Here are my references:
http://www.gentoo.org/doc/en/nvidia-guide.xml#doc_chap4
http://en.gentoo-wiki.com/wiki/MTRR
http://linuxindetails.wordpress.com/category/x11/
The Gentoo Nvidia guide says that having uncachable in /proc/mtrr is a problem. The MTRR article at the unofficial Gentoo Wiki explains the nature of the MTRR settings. The blog talks about the kernel parameters.
With the MTRR fixed, the issues I have been having appear to have gone away. I think the reason is that as stuff runs, it is allocated randomly in system memory by heap randomization. If you have uncachable MTRR entries, then anything allocated inside of them will suffer a performance penalty, which will cause lag that can expose race conditions in code that normally would not occur. At the same time, there could be multiple issues at work here, so fixing this issue will not fix the other issues affecting people, but so far, my issues appear to be fixed.
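To confirm that the boot options actually reached the kernel, something like the following Python sketch could be used. It is only an illustration: the function name `mtrr_boot_options` and the sample command line (including the root= value) are made up, and on a running system the real string can be read from /proc/cmdline.

```python
def mtrr_boot_options(cmdline):
    """Pick the MTRR cleanup options out of a kernel command line string."""
    result = {"cleanup": None, "spare_reg_nr": None}
    for token in cmdline.split():
        if token == "enable_mtrr_cleanup":
            result["cleanup"] = True
        elif token == "disable_mtrr_cleanup":
            result["cleanup"] = False
        elif token.startswith("mtrr_spare_reg_nr="):
            # Keep only the number after the '=' sign.
            result["spare_reg_nr"] = int(token.split("=", 1)[1])
    return result

# Hypothetical command line for illustration only.
sample = "root=/dev/sda3 enable_mtrr_cleanup mtrr_spare_reg_nr=1"
print(mtrr_boot_options(sample))
```

A value of None for either key means the option was not given on the command line, in which case the kernel falls back to whatever defaults it was compiled with.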
#40
Registered User
Join Date: Oct 2010
Location: Manchester, UK
Posts: 10
Hi,
Had a look at my mtrr and it had the following output:
Code:
$ cat /proc/mtrr
reg00: base=0x1c0000000 ( 7168MB), size= 1024MB, count=1: uncachable
reg01: base=0x000000000 ( 0MB), size= 8192MB, count=1: write-back
reg02: base=0x0c0000000 ( 3072MB), size= 1024MB, count=1: uncachable
reg03: base=0x0bf800000 ( 3064MB), size= 8MB, count=1: uncachable
Code:
$ cat /proc/mtrr
reg00: base=0x000000000 ( 0MB), size= 2048MB, count=1: write-back
reg01: base=0x080000000 ( 2048MB), size= 1024MB, count=1: write-back
reg02: base=0x0bf800000 ( 3064MB), size= 8MB, count=1: uncachable
reg03: base=0x100000000 ( 4096MB), size= 2048MB, count=1: write-back
reg04: base=0x180000000 ( 6144MB), size= 1024MB, count=1: write-back
This friend was in my Xorg.log.old:
(WW) NVIDIA(0): WAIT (0, 6, 0x8000, 0x0000a390, 0x0000a390)
This was in my kernel log:
NVRM: Xid (0003:00): 53, CMDre 00000000 0000089c 0100cb12 00000007 00000000
As for my MTRR, where is the 8MB coming from, any ideas?
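The two /proc/mtrr listings above can be compared with a short Python sketch. It assumes the usual x86 rule that an uncachable range wins wherever it overlaps a write-back one; the helper names `parse_mtrr` and `write_back_ranges` are invented for this example, and the input is simply the two listings copied from this post.

```python
import re

MB = 1024 * 1024
LINE = re.compile(
    r"reg\d+: base=0x([0-9a-f]+) \(\s*\d+MB\), size=\s*(\d+)MB, count=\d+: (\S+)"
)

def parse_mtrr(text):
    """Return (start_mb, end_mb, type) tuples parsed from /proc/mtrr text."""
    regions = []
    for base_hex, size_mb, kind in LINE.findall(text):
        start = int(base_hex, 16) // MB
        regions.append((start, start + int(size_mb), kind))
    return regions

def write_back_ranges(regions):
    """Merge the write-back coverage, with uncachable ranges carved out."""
    # Collect every boundary, then classify each slice between boundaries.
    edges = sorted({edge for start, end, _ in regions for edge in (start, end)})
    merged = []
    for lo, hi in zip(edges, edges[1:]):
        kinds = {kind for start, end, kind in regions if start <= lo and hi <= end}
        if "write-back" in kinds and "uncachable" not in kinds:
            if merged and merged[-1][1] == lo:
                merged[-1] = (merged[-1][0], hi)  # extend the previous range
            else:
                merged.append((lo, hi))
    return merged

first = """
reg00: base=0x1c0000000 ( 7168MB), size= 1024MB, count=1: uncachable
reg01: base=0x000000000 ( 0MB), size= 8192MB, count=1: write-back
reg02: base=0x0c0000000 ( 3072MB), size= 1024MB, count=1: uncachable
reg03: base=0x0bf800000 ( 3064MB), size= 8MB, count=1: uncachable
"""
second = """
reg00: base=0x000000000 ( 0MB), size= 2048MB, count=1: write-back
reg01: base=0x080000000 ( 2048MB), size= 1024MB, count=1: write-back
reg02: base=0x0bf800000 ( 3064MB), size= 8MB, count=1: uncachable
reg03: base=0x100000000 ( 4096MB), size= 2048MB, count=1: write-back
reg04: base=0x180000000 ( 6144MB), size= 1024MB, count=1: write-back
"""

print(write_back_ranges(parse_mtrr(first)))   # [(0, 3064), (4096, 7168)]
print(write_back_ranges(parse_mtrr(second)))  # [(0, 3064), (4096, 7168)]
```

Under that assumption both listings describe the same cacheable memory, 0 to 3064MB and 4096 to 7168MB. The second one expresses it without the large uncachable overlays, and the 8MB uncachable region just below the 3072MB mark is the only one left.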
#41
Registered User
Join Date: Aug 2006
Posts: 106
I think that I overstated the issue that uncachable memory causes, as there are certain conditions when uncachable memory is desirable. In your case, some component on your motherboard likely earmarks 8MB of your system's RAM, so that region is set as being uncachable to keep your cache from corrupting data in that region of memory.
Anyway, having uncachable entries in the MTRR does not cause crashes per se, but when there is buggy code on your system, it is possible for it to trigger race conditions when regions of memory used for programs are marked as being uncachable. Try upgrading your kernel to 2.6.36.2, your nvidia driver to 260.19.29 and your xorg server to 1.9.3. Those are the versions I am using on my system and things have been better since I fixed the MTRR issue. An alternative that has worked for other people is downgrading xorg server to 1.8.x.
#42
Registered User
Join Date: Dec 2010
Posts: 3
I am still experiencing crashes and soft lockups despite having MTRR enabled and the latest NVIDIA driver.
Code:
[ 1159.112] (II) NVIDIA GLX Module 260.19.29 Wed Dec 8 12:24:30 PST 2010
Code:
reg00: base=0x000000000 ( 0MB), size= 2048MB, count=1: write-back
reg01: base=0x080000000 ( 2048MB), size= 1024MB, count=1: write-back
reg02: base=0x100000000 ( 4096MB), size= 1024MB, count=1: write-back
Code:
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=1
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1
Code:
Linux tux 2.6.36-gentoo-r5
Code:
Dec 20 21:04:10 tux kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Dec 20 21:04:27 tux kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Dec 20 21:04:43 tux kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Dec 20 21:04:47 tux kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Dec 20 21:05:40 tux kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Dec 20 21:06:18 tux kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Dec 20 21:06:45 tux su[2218]: pam_unix(su:session): session closed for user root
Dec 20 21:07:01 tux su[2298]: Successful su for root by leon
Dec 20 21:07:01 tux su[2298]: + /dev/tty1 leon:root
Dec 20 21:07:01 tux su[2298]: pam_unix(su:session): session opened for user root by leon(uid=1000)
Code:
[ 1182.971] (WW) NVIDIA(0): WAIT (2, 6, 0x8000, 0x000000b0, 0x00005878)
[ 1189.971] (WW) NVIDIA(0): WAIT (1, 6, 0x8000, 0x000000b0, 0x00005878)
[ 1193.021] (WW) NVIDIA(0): WAIT (2, 6, 0x8000, 0x000000b0, 0x000069dc)
[ 1205.953] (WW) NVIDIA(0): WAIT (0, 6, 0x8000, 0x00008cb8, 0x00008cb8)
[ 1208.956] (WW) NVIDIA(0): WAIT (2, 6, 0x8000, 0x00008cb8, 0x0000734c)
[ 1341.581] (WW) NVIDIA(0): WAIT (0, 6, 0x8000, 0x00007914, 0x00007914)
[ 1455.864] (WW) NVIDIA(0): WAIT (0, 6, 0x8000, 0x00006160, 0x00006160)
[ 1472.152] (WW) NVIDIA(0): WAIT (0, 6, 0x8000, 0x0000b880, 0x0000b880)
[ 1475.168] (WW) NVIDIA(0): WAIT (2, 6, 0x8000, 0x0000c810, 0x0000c1cc)
Code:
[ 1478.958] Backtrace:
[ 1478.958] 0: /usr/bin/X (xorg_backtrace+0x28) [0x4a2708]
[ 1478.958] 1: /usr/bin/X (mieqEnqueue+0x1f4) [0x4a20a4]
[ 1478.958] 2: /usr/bin/X (xf86PostMotionEventP+0xc4) [0x47f164]
[ 1478.958] 3: /usr/lib64/xorg/modules/input/evdev_drv.so (0x7fb71ff1c000+0x423f) [0x7fb71ff2023f]
[ 1478.958] 4: /usr/bin/X (0x400000+0x6d1e7) [0x46d1e7]
[ 1478.959] 5: /usr/bin/X (0x400000+0x11b3f3) [0x51b3f3]
[ 1478.959] 6: /lib/libpthread.so.0 (0x7fb725a3b000+0xf010) [0x7fb725a4a010]
[ 1478.959] 7: /usr/lib64/xorg/modules/drivers/nvidia_drv.so (0x7fb7209c0000+0x6c1dd) [0x7fb720a2c1dd]
[ 1478.959] 8: /usr/lib64/xorg/modules/drivers/nvidia_drv.so (0x7fb7209c0000+0x6ce19) [0x7fb720a2ce19]
[ 1478.959] 9: /usr/lib64/xorg/modules/drivers/nvidia_drv.so (0x7fb7209c0000+0xcc686) [0x7fb720a8c686]
[ 1478.959] 10: /usr/lib64/xorg/modules/drivers/nvidia_drv.so (0x7fb7209c0000+0x983b2) [0x7fb720a583b2]
[ 1478.959] 11: /usr/lib64/xorg/modules/drivers/nvidia_drv.so (0x7fb7209c0000+0x3c142d) [0x7fb720d8142d]
[ 1478.959] 12: /usr/bin/X (0x400000+0x16684c) [0x56684c]
[ 1478.959] 13: /usr/bin/X (0x400000+0xa4b9c) [0x4a4b9c]
[ 1478.959] 14: /usr/bin/X (BlockHandler+0x50) [0x434420]
[ 1478.959] 15: /usr/bin/X (WaitForSomething+0x141) [0x45cec1]
[ 1478.959] 16: /usr/bin/X (0x400000+0x2efc2) [0x42efc2]
[ 1478.959] 17: /usr/bin/X (0x400000+0x2490b) [0x42490b]
[ 1478.959] 18: /lib/libc.so.6 (__libc_start_main+0xfd) [0x7fb7249cebbd]
[ 1478.959] 19: /usr/bin/X (0x400000+0x24499) [0x424499]
#43
Registered User
Join Date: Dec 2010
Posts: 3
Just as an info:
"EQ overflowing" is not a bug, dang it. It's a symptom.
#44
Registered User
Join Date: Dec 2010
Posts: 3
I found a solution for my case. Via nvidia-select, I configured PowerMizer to maximum performance instead of adaptive mode, and there are no more soft lockups and no more crashes.
#45
Registered User
Join Date: Aug 2006
Posts: 106
How do you obtain nvidia-select?
#46
Registered User
Join Date: Oct 2010
Location: Manchester, UK
Posts: 10
It's a separate package in Gentoo, just emerge media-video/nvidia-settings (remember to grab unstable).
As for changing PowerMizer, it still crashed for me, so no fix for me yet. I have not bothered to downgrade the Xorg server yet.
#47
Registered User
Join Date: Mar 2007
Posts: 47
And some info from the dev team about MTRR:
http://www.nvnews.net/vbulletin/showthread.php?t=133270
I don't know if I understood zander properly, but according to what he wrote, if PAT is working properly, you don't need to manipulate MTRR... So how could that possibly help?
#48
Gentoo User
Join Date: Jan 2011
Location: Portland, Oregon
Posts: 14
I ran into this same problem for a while and couldn't figure out the source of the issue. For sanity, I rarely recompile my kernel except when obvious bugs or performance issues impact me.
I'm using Gentoo on an amd64 arch. Hardware is:
2 Quad Core Opteron 2378s
16G of RAM
2 Nvidia Geforce GTX 295s (I also tried 2 GTX 275s) in SLI mode
I kept banging my head against this since my 2.6.34-gentoo-r1 kernel was just fine for many iterations of wine and the opengl apps I used -- as well as nvidia driver versions. At some point, though, I began to notice exactly the same error messages as posted in this thread. I began to suspect IOMMU issues, bad drives, something with the drivers, but the last thing that would ever have occurred to me was the kernel - after all it had worked for a very long time with no issues.
4677152 Aug 19 20:04 kernel-genkernel-x86_64-2.6.34-gentoo-r1
So, at least 4 months without issue. Then, suddenly, these errors cropped up. Google searches for it really didn't give any useful data. Though, obviously today that changed when I added the required Gentoo keyword which pointed me to this thread.
I've poked around both with some of the settings suggested in the kernel (Sanitizing the MTRR) and X specific settings.
Fixed: It's shortened the amount of video memory I have on the cards (1792x 2 ... but only 1792 of that will be seen in SLI mode) to 1024M. I'll work out how to fix that problem at a later date.
It should be noted, however, that you'll need to recompile (via module-rebuild) your kernel drivers even if you think the settings changes are innocuous and/or it's the same kernel version. Some values it uses are apparently determined at initial compile time rather than detected on boot.
I also did as the other poster suggested in putting the nvidia-settings to always prefer performance (rather than adaptive). I'll post my findings once I can determine whether or not it works without crashing at some random interval later.
At the moment, however, the application seems to have started without causing my system to immediately die (X turns the monitor off on the 295s rather than the app freezing on screen like it did on the 275s) with this reported in /var/log/messages:
Jan 2 14:11:56 unimatrix-01 kernel: [577935.374469] NVRM: os_pci_init_handle: invalid context!
Jan 2 14:11:56 unimatrix-01 kernel: [577935.374471] NVRM: os_pci_init_handle: invalid context!
Jan 2 14:11:56 unimatrix-01 kernel: [577935.374573] NVRM: Xid (0006:00): 6, PE0001
Jan 2 14:11:58 unimatrix-01 kernel: [577937.374307] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Jan 2 14:12:00 unimatrix-01 kernel: [577939.374336] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Forgot to mention: I also ran into identical issues with 2.6.36-gentoo-r5. (And 2.6.36-ck-r3 and 2.6.34-ck-r3)
Edit to add: At the moment, mind you it's only been about a 40 minute test, the problem appears to have resolved itself with these fixes. However, I will know more tonight when I can run it for longer (sometimes the problem only occurs after 2 hours or so - and other times I can go an entire day before it just randomly appears).
I did also fix the amount of video memory available after enabling the MTRR sanitation by setting the iommu boot parameter.
One particular thing to try when your display crashes, especially if you're using a multi-cpu system, is to run a program like htop. I had it running on my laptop sitting next to the system in question. I noted a few anomalies both in the logs (posted above) as well as CPU activity immediately jumping to 100% on one cpu (of the 8). X was the consumer in this case, as was expected. X could not be killed manually, while any other application could be terminated. It seems to be a race condition somewhere, but I don't know enough about the code or troubleshooting the drivers to determine the cause.
Though, if provided some steps to take beyond what I was able to see in the logs (and htop, and dstat), I'd be happy to take them and provide the requisite data. I do think I have process accounting turned on as well if that would provide any measurable or useful data. I did, however, check the nvidia reporting tool and it didn't provide any useful data beyond what I was able to observe myself.
Last edited by jpi110; 01-04-11 at 06:33 PM. Reason: Updated with results of most recent test