|
|
#1 | |
|
Registered User
Join Date: Jul 2005
Location: Munich
Posts: 910
|
Hi!
I have just updated my fedora core 4 installation to 2.6.15-1.1830_FC4smp and re-installed then nvidia graphics driver version 1.0-8178. The system was working not "too bad" with a hand-made 2.6.13 kernel (maybe once a week, the system freezed when switching virtual terminals), but with the new kernel, i'm now experiencing X lockups during "normal desktop operation" (i.e. while browsing "my usual" web-sites with Mozilla). When remote logging in with ssh, "top" says that the X-Server takes 99.9% processing time, "dmesg" reports lots of XIDs and "/var/log/Xorg.0.log" tells a couple of lines begining with "(WW) NVIDIA(0): WAIT". I guess these observations have been reported about a thousend time by other users at the nvnews.net nvidia linux forum. (Please refere to the nvidia-bug-report.log attached). I tried the kernel boot option "irqpoll" as suggested by "dmesg", but this only lead to a complete system freeze at some random time while doing something on the desktop instead of "only" a X lockup (the box doesn't responds to network pings). BTW.: i also tried a hand-made 2.6.15 kernel (vanilla source tree) and a 2.6.15-rt16 (realtime preemption), but with the same effect. I also recognized "strange" timing effects with 2.6.14, especially when the kernel was patched with Ingo Molnars realtime premption feature. From what i know, 2.6.15 includes most of Thomas Gleixners ktimer patch. I'm now suspecting for quite some time that there is a race condition between the kernel, the nvidia kernel module and the nvidia X-server module, where the ktimer changes might have a big impact in the timing. Hardware configuration: * Processor: Athlon64 4400+ (dual core, 2.2GHz, 2 x 1024 KByte) * Mainboard: ASUS A8N-SLI Premium (latest BIOS 1009) * PCIe Graphics cards: 2x Gainward Ultra/3500PCX (GeForce 7800GTX) * PCIe RAID Controller: Areca/Tekram ARC-1210 * Power Supply: Tagan 580 Watt The Mainboard is configured to operate the two PEG slots in PCIe x8 mode, each. The two graphic cards operate a triple display setup (first card in twinview mode, second card "tied in" with xinerama). I could try if things work any better when using only one graphics cards and when using an ATA hard disk instead of the RAID set, but i doubt this will cause any improovment since the described effect doesn't show up with 2.6.13. I didn't yet tried "maxcpus=1" which typically helped for the realtime preemption kernels. Any ideas what could be the problem? If you need additional information, please don't hesitate to contact me! Thanks in advance for your feedback! regards Bernhard |
|
|
|
|
|
|
#2 | |
|
Registered User
Join Date: Jul 2005
Location: Munich
Posts: 910
|
Please find attached the nvidia-bug-report.log file.
regards Bernhard |
|
|
|
|
|
|
#3 |
|
Registered User
Join Date: Jul 2005
Location: Munich
Posts: 910
|
Hi!
Applying "maxcpus=1" as kernel option seems to help. At least with vanilla-2.6.15 and 2.6.15-1.1830_FC4smp, the X-Server worked for a quater of an hour, each (typically the effect showed up within a few minutes, if not even seconds). 2.6.15-rt16 basically seems to work, as well, but there is now the effect that if moving a glxgears window, then the screen freezes for pretty exact three seconds (but the pointer still moves), then proceeds. This behaviour has been reported several time in the nvnews.net nvidia Linux forum. For me this effect only showed up with recent (means "ktimer" enhanced) realtime preemption kernels with two CPUs active. Appying "maxcpus=1" typcially made the effect going away (however, at the expense of a core). regards Bernhard |
|
|
|
|
|
#4 | |
|
Registered User
Join Date: Feb 2006
Posts: 4
|
Hi
I was curious about what ktimer was. I found out and then I read about this: http://seclists.org/lists/linux-kern.../Jan/6493.html Could it be that it has any interest since it states that ktimer is/will not be used in the new kernels since it is superseeded by hrtimers? (more I dont know at all - just thought it might help you and the nVidia driver developers!) |
|
|
|
|
|
|
#5 | |
|
Registered User
Join Date: Jul 2005
Location: Munich
Posts: 910
|
Quote:
polished version of ktimers: http://lwn.net/Articles/167315/ (see chapter "The high-resolution timer API"). Means: the hrtimers API has been keept, but under the hood, it's actually ktimer. regards Bernhard |
|
|
|
|
|
|
#6 | |
|
Registered User
Join Date: Jul 2005
Location: Munich
Posts: 910
|
Hi!
Here are some further details: i downgraded to 2.6.14-1.1656_FC4smp (RPM binary package) which doesn't include the new hrtimer/ktimer code in the hope that it will work better than 2.6.15-1.1830_FC4smp, but it behaves pretty similar, with the exception that the machine silently reboots after a couple of seconds when the screen has been frozen (instead of still being remotly accesible through ssh). I was able to catch the kernel console message trough the serial console during the crash "irq 50: nobody cared (try booting with the "irqpoll" option)". See full kernel log "irqroute", attached (since the computer reboots, i could generate an nvidia-bug-rebutr.log). I tried it with irqpoll as suggested, but then the kernel doesn't even come up and complaines "INIT: BUG: spinlock recursion on CPU#1, init/1 (Not tainted)" (see "irqpoll" log, attached) Applying "pci=routeirq" has no effect. Applying "pci=routeirq irqpoll" results in the same effect described above as if only "irqpoll" was applied (means pci=irqroute has no effect either way). Also, not applying "idle=poll" doesn't make any difference concerning stability. Ingo Molnar once suggested adding this option because the TSCs of the two cores may skew, resulting in strange (but very rare) timing behaviours in UT2004, which helped for 2.6.13, BTW. Technically, the follwing procedure leads to an "irq 50: nobody cared": 1. an irq 50 occures 2. the kernel interrupt layer calls all device interrupt handlers sharing this irq (a soundblaster and the first 7800GTX in my case), 3. the device interrupt handler are checking the hardware registers to figure out the root cause for the interrupt, react accordingly and then return "IRQ_HANDLED" to indicate that the irq was proberly servered. 4. if non of the device interrupt handlers returns IRQ_HANDLED, then the kernel disable the interrupt to make sure that the interrupt doesn't occur in a loop (this might happen, because the interrupt was not acknowladged by some driver interrupt handler), locking up the system. So if an "irq *: nobody cared" occures, then the kernel was obviously interrupted, but the root cause for the interrupt somehow "disapeared". Maybe the interrupt status register in the GPU is only valid for a certain time and the kernel couldn't check the status in time? Anayway, in earlier kernels (2.4), a device interrupt handler didn't provide a return value. Instead, the kernel had some kind of an interrupt-loop detection mechanism and disabled an interrupt only when so and so many interrupts per seconds dropped in (which could cause problems for certain applications where such a high number of interrupts was actually ok). It might be a good idea to go back, implementing the old scheme: even if the nvidia isr didn't handled the interrupt properly, maybe it does it with the next interrupt?! it's at least probably better than just disabling the whole irq :-) BTW.: the soundblaster PCI card has never been active at the times the system crashes - i could try it without the card physically installed, but i doubt it has an influence since the kernel worked mostly stable with 2.6.13 (i have experienced the described lockup maybe three times in maybe four month, always within the first few seconds after login after poweron). regards Bernhard |
|
|
|
|
|
|
#7 | |
|
Registered User
Join Date: Jul 2005
Location: Munich
Posts: 910
|
Quote:
return IRQ_HANDLED no matter what nv_kern_isr() reports, so that the interrupt will never be disabled. Since the nvidia driver is the only active device on irq 50, it sould always return IRQ_HANDLED anyway. However, now the system (2.6.15-1.1830_FC4smp) freezes with the following message (captured on serial console): Code:
NVRM: Xid: 16, Head 00000000 Count 00002ac5 NVRM: Xid: 16, Head 00000001 Count 00002ac4 NVRM: Xid: 8, Channel 00000020 do_IRQ: stack overflow: 324 after the X-server freezed. I guess the kernel keeps now looping in the nvidia isr. Concerning the stack overflow: all FC4 smp kernels are compiled with 4K kernel stack size, but as far as i know, that shouldn't be an issue since 1.0-6106 and newer. regards Bernhard |
|
|
|
|
|
|
#8 |
|
Registered User
Join Date: Jul 2005
Location: Munich
Posts: 910
|
Hi!
I have tried out a couple of setups and it seems that single card operation is not affected. Means the X-Server or the System only freezes in dual core, dual card operation (dual card, single core seems also to be ok). I have created patches for the kernel and the nvidia driver that re-orders the interrupt action processing, so that the nvidia isr handler is always called first in case the irq is shared with another device - this helped to a certain extend: Test setup 1: * snd_emu10k1 (Soundblaster) and snd_intel8x0 (nvidia on-chip sound) kernel modules unloaded (Soundblaster shared irq 50 with 7800 GTX residing in the first PEG slot). * X-server configured for dualhead in twinview mode * running desktop applictions: + fvwm95 (window manager) + Mozilla 1.7.12 + several xterms + glxgears glxgears is not strictly necessary to reproduce the problem, but glxgears+mozilla typcially freezes the system within minutes, if not seconds. * neither soundblaster nor the second 7800 GTX (residing in second PEG slot) have been removed. * kernel: 2.6.15-1.1830_FC4smp Result: seems to work stable, however, glxgears stops rotating the gears after about 20 minutes, but the console still tells reasonable FPS numbers. Restarting glxgears makes the gears rotating, again. I have observerd this effect several times with other kernels, so it might be an application releated problem. For details, refer to nvidia-bug-report.log-testsetup1. Test setup 2: * like test setup 1, but with snd_emu10k1 module loaded after nvidia module (isr action order: "emu10k1, nvidia") Result: same as for test setup 1. Test setup 3: * triple head setup * snd_emu10k1 unloaded * application like at test setup 1 * kernel: 2.6.15-1.1830_FC4smp Result: X-Server "partly" freezes: glxgears still running, pointer can still be moved, but windows dont react any more to user inputs. Effect showed up almost instantly after firing up glxgears. Remote ssh login still possible. For details, refer to nvidia-bug-report.log-testsetup3. Test setup 4: * triple head setup * snd_emu10k1 loaded * application like at test setup 1 * patch for kernel and nvidia driver to hook in the nvidia driver at the beginning of the interrupt action queue (lets call it "defer-patc"h). * kernel: vanilla 2.6.15 + defer-patch. Result: Almost complete X-Server freeze: application stop operation, but after a few seconds, "Disabling IRQ" (...) apeared in one of the xterms. Effect showed up almost instantly after firing up glxgears. Remote ssh login still possible. For details, refer to nvidia-bug-report.log-testsetup4. Test setup 5: * like test setup 4, but with snd_emu10k1 unloaded Result: Seemed to work stable for about 30 minutes, then crashed with: irq 66: nobody cared (try booting with the "irqpoll" option) [<c014063a>] __report_bad_irq+0x2a/0x90 [<c013fded>] handle_IRQ_event+0x3d/0x70 [<c0140747>] note_interrupt+0x87/0xe0 [<c013ff1c>] __do_IRQ+0xfc/0x110 [<c0105989>] do_IRQ+0x19/0x30 [<c0103c76>] common_interrupt+0x1a/0x20 [<c034e0d9>] _spin_unlock_irq+0x9/0x20 [<c034c82d>] schedule+0x54d/0x7a0 [<c011b55c>] sys_sched_yield+0x7c/0xa0 [<c011b55c>] sys_sched_yield+0x7c/0xa0 [<c01031d7>] sysenter_past_esp+0x54/0x75 handlers: [<f9064944>] (nv_kern_isr+0x0/0x64 [nvidia]) Disabling IRQ #66 NVRM: Xid: 16, Head 00000000 Count 0000375b NVRM: Xid: 16, Head 00000001 Count 0000375a NVRM: Xid: 8, Channel 00000020 Complete freeze, hence the nvidia-bug-report.log-testsetup5 has been done prior to the crash. |
|
|
|
|
|
#9 |
|
Registered User
Join Date: Jul 2005
Location: Munich
Posts: 910
|
Here are the patches to re-order processing of interrupt handlers,
just in case somebody might be interessted. regards Bernhard |
|
|
|
|
|
#10 |
|
Registered User
Join Date: Jul 2005
Location: Munich
Posts: 910
|
Hi!
I was wondering why PCIe devices actually share interrupts with other PCI devices since PCIe card are physically always using "Message Signaled Interrupts" (MSI) until i figured out that the kernel operates all PCIe devices in "pin-irq assertion mode" (legacy PCI interrupt emulation mode) unless a device driver explicitly enable MSI. So i have added "pci_enable_msi(dev)" in nv_kern_probe at line 3687 in nv.c, straight after "pci_enable_device(dev)" and now the interrupts for the nvidia cards changed to uniq interrupt vectors: first card from 66 to 90, second card from 217 to 82. So any potential conflicts with other device drivers, sharing the same IRQ, should be gone. However, when now starting the X-Server, the kernel tells NVRM: RmInitAdapter failed! (0x12:0x2b:1438) NVRM: rm_init_adapter(1) failed and Xorg.0.log tells: (EE) NVIDIA(0): The NVIDIA kernel module does not appear to be receiving (EE) NVIDIA(0): interrupts generated by the NVIDIA graphics device. Please refer to the attached nvidia-bug-report.log for details. The kernel/Documentation/MSI-HOWTO.txt says that it is only necessary to add pci_enable_msi() prior to request_irq(), so it should work. According to the MSI-HOWTO.txt, using MSI is a good idea and should be added to any PCIe device driver. If i have spare time, i will try to figure out what's the problem. regards Bernhard |
|
|
|
|
|
#11 |
|
Registered User
Join Date: Jul 2005
Location: Munich
Posts: 910
|
Hi!
Just to keep you in the loop: i have activated some additional debugging options in the kernel and do now get a different console output: Debugging options: Code:
# # Kernel hacking # # CONFIG_PRINTK_TIME is not set CONFIG_DEBUG_KERNEL=y CONFIG_MAGIC_SYSRQ=y CONFIG_LOG_BUF_SHIFT=15 CONFIG_DETECT_SOFTLOCKUP=y # CONFIG_SCHEDSTATS is not set # CONFIG_DEBUG_SLAB is not set CONFIG_DEBUG_SPINLOCK=y CONFIG_DEBUG_SPINLOCK_SLEEP=y # CONFIG_DEBUG_KOBJECT is not set # CONFIG_DEBUG_HIGHMEM is not set # CONFIG_DEBUG_BUGVERBOSE is not set # CONFIG_DEBUG_INFO is not set # CONFIG_DEBUG_FS is not set # CONFIG_DEBUG_VM is not set # CONFIG_FRAME_POINTER is not set # CONFIG_RCU_TORTURE_TEST is not set # CONFIG_EARLY_PRINTK is not set CONFIG_DEBUG_STACKOVERFLOW=y # CONFIG_DEBUG_STACK_USAGE is not set # CONFIG_DEBUG_PAGEALLOC is not set # CONFIG_4KSTACKS is not set CONFIG_X86_FIND_SMP_CONFIG=y CONFIG_X86_MPPARSE=y Code:
irq 66: nobody cared (try booting with the "irqpoll" option) [<c014117a>] __report_bad_irq+0x2a/0x90 [<c014094d>] handle_IRQ_event+0x3d/0x70 [<c0141287>] note_interrupt+0x87/0xe0 [<c0140a7c>] __do_IRQ+0xfc/0x110 [<c0105846>] do_IRQ+0x36/0x70 [<c0103b9a>] common_interrupt+0x1a/0x20 [<c01030aa>] sysenter_past_esp+0x3/0x75 handlers: [<f8908ab0>] (snd_emu10k1_interrupt+0x0/0x540 [snd_emu10k1]) [<f905b9f4>] (nv_kern_isr+0x0/0x64 [nvidia]) Disabling IRQ #66 NVRM: Xid: 16, Head 00000000 Count 00002bbb NVRM: Xid: 16, Head 00000001 Count 00002bba NVRM: Xid: 8, Channel 00000020 Unable to handle kernel paging request at virtual address 01000077 printing eip: c0116d98 *pde = 00000000 Oops: 0000 [#1] SMP Modules linked in: autofs4 sunrpc nvidia snd_intel8x0 snd_emu10k1 snd_rawmidi srCPU: 12632256 EIP: 0060:[<c0116d98>] Tainted: P VLI EFLAGS: 00010006 (2.6.15) EIP is at do_page_fault+0xa8/0x651 eax: f597e000 ebx: 00000000 ecx: 0000007b edx: 00ffffff esi: 00000001 edi: c0116cf0 ebp: f59862e0 esp: f597e080 ds: 007b es: 007b ss: 0068 Unable to handle kernel paging request at virtual address 00c0c138 printing eip: c0116d98 *pde = 00000000 BUG: spinlock lockup on CPU#0, xterm/2895, c03a1c08 [<c0220450>] __spin_lock_debug+0x90/0xc0 [<c02204e7>] _raw_spin_lock+0x67/0x90 [<c0179d20>] __pollwait+0x0/0xd0 [<c034e65c>] _spin_lock_irq+0xc/0x10 [<c010434f>] die+0x3f/0x190 [<c013e24f>] search_module_extables+0x7f/0x90 [<c011fb3b>] printk+0x1b/0x20 [<c0116f0c>] do_page_fault+0x21c/0x651 [<c0116cf0>] do_page_fault+0x0/0x651 [<c0103ccf>] error_code+0x4f/0x54 [<c0116cf0>] do_page_fault+0x0/0x651 just secondary effects of a dying system. BTW.: i also tried it with a generic x86 support (somebody claimed that the nvidia driver don't work well without) and with mem=512M as kernel boot option, but it didn't helped. regards Bernhard |
|
|
|
|
|
#12 |
|
Registered User
Join Date: Jul 2005
Location: Munich
Posts: 910
|
Hi!
just to keep you in the loop: i have "acid tested" several kernel versions and it seems that the regression started in 2.6.14.rc1: Code:
Kernel Test 2.6.13 passed 2.6.14-rc1 failed 2.6.14-rc3 failed 2.6.14 failed 2.6.14-1.1656_FC4smp failed 2.6.15 failed 2.6.15-1.1830_FC4smp failed 2.6.15.4 failed cards and two glxgears windows, also running on seperate graphic cards. It seems that the regression is worst for 2.6.14 and .15, where even bringing up the first glxgears window typically results in a freeze within seconds, whereas for the .14-rc kernels and the .15.4 kernel it usualy takes at least a couple of minutes to bring them down. 2.6.13 worked like a charm for one hour (then aborted). Somebody can confirm this observation? I went through the Changelog for 2.6.14-rc1 in order to figure out what might cause the difference, but there are quite a lot of changes ... regards Bernhard |
|
|
|
![]() |
| Thread Tools | |
|
|