02-06-06, 04:03 PM   #1
JaXXoN
X lockup with nv-1.0-8178 and 2.6.15-1.1830_FC4smp

Hi!

I have just updated my Fedora Core 4 installation to
2.6.15-1.1830_FC4smp and then re-installed the NVIDIA graphics
driver, version 1.0-8178.

The system was working not "too bad" with a hand-built
2.6.13 kernel (maybe once a week it froze when switching
virtual terminals), but with the new kernel I'm now
experiencing X lockups during normal desktop operation
(i.e. while browsing my usual web sites with Mozilla).

When logging in remotely via ssh, "top" shows the X server
at 99.9% CPU, "dmesg" reports lots of Xids, and
"/var/log/Xorg.0.log" contains a couple of lines beginning with
"(WW) NVIDIA(0): WAIT". I guess these observations have been
reported about a thousand times by other users in the
nvnews.net NVIDIA Linux forum.
(Please refer to the attached nvidia-bug-report.log.)

I tried the kernel boot option "irqpoll" as suggested
by "dmesg", but this only led to a complete system
freeze at some random point during desktop use,
instead of "only" an X lockup (the box
doesn't respond to network pings).

BTW: I also tried a hand-built 2.6.15 kernel (vanilla
source tree) and 2.6.15-rt16 (realtime preemption),
with the same effect. I also noticed "strange"
timing effects with 2.6.14, especially when the kernel was
patched with Ingo Molnar's realtime preemption feature. From
what I know, 2.6.15 includes most of Thomas Gleixner's
ktimer patch. For quite some time now I have suspected
a race condition between the kernel, the nvidia
kernel module and the nvidia X server module, where
the ktimer changes might have a big impact on the timing.


Hardware configuration:
* Processor: Athlon64 X2 4400+ (dual core, 2.2 GHz, 2 x 1024 KB L2 cache)
* Mainboard: ASUS A8N-SLI Premium (latest BIOS 1009)
* PCIe Graphics cards: 2x Gainward Ultra/3500PCX (GeForce 7800GTX)
* PCIe RAID Controller: Areca/Tekram ARC-1210
* Power Supply: Tagan 580 Watt

The mainboard is configured to operate the two PEG slots in
PCIe x8 mode each. The two graphics cards drive a triple-display
setup (first card in TwinView mode, second card tied in
with Xinerama). I could try whether things work any better with
only one graphics card, or with an ATA hard disk
instead of the RAID set, but I doubt that would bring any
improvement, since the described effect doesn't show up with
2.6.13. I haven't yet tried "maxcpus=1", which typically helped
for the realtime preemption kernels.
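
For illustration, the relevant xorg.conf pieces look roughly like this (a sketch: the BusIDs and identifiers are examples, and the Monitor/Screen sections as well as the TwinView metamode options are omitted):

Code:
# triple-display layout: card 0 drives two monitors via TwinView,
# card 1 drives the third, glued together with Xinerama
Section "ServerLayout"
    Identifier "TripleHead"
    Screen     0 "Screen0"
    Screen     1 "Screen1" RightOf "Screen0"
    Option     "Xinerama" "on"
EndSection

Section "Device"
    Identifier "Card0"            # first 7800 GTX (TwinView pair)
    Driver     "nvidia"
    BusID      "PCI:5:0:0"
    Option     "TwinView" "True"
EndSection

Section "Device"
    Identifier "Card1"            # second 7800 GTX (third display)
    Driver     "nvidia"
    BusID      "PCI:6:0:0"
EndSection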


Any ideas what could be the problem?

If you need additional information, please don't hesitate
to contact me!

Thanks in advance for your feedback!

regards

Bernhard
02-06-06, 04:05 PM   #2
JaXXoN
Re: X lockup with nv-1.0-8178 and 2.6.15-1.1830_FC4smp

Please find attached the nvidia-bug-report.log file.

regards

Bernhard
Attached Files
nvidia-bug-report.log.zip (49.2 KB)
02-06-06, 05:39 PM   #3
JaXXoN
Re: X lockup with nv-1.0-8178 and 2.6.15-1.1830_FC4smp

Hi!

Applying "maxcpus=1" as kernel option seems to help. At least
with vanilla-2.6.15 and 2.6.15-1.1830_FC4smp, the X-Server
worked for a quater of an hour, each (typically the effect showed
up within a few minutes, if not even seconds). 2.6.15-rt16 basically
seems to work, as well, but there is now the effect that if
moving a glxgears window, then the screen freezes for pretty
exact three seconds (but the pointer still moves), then proceeds.
This behaviour has been reported several time in the nvnews.net
nvidia Linux forum. For me this effect only showed up with recent
(means "ktimer" enhanced) realtime preemption kernels with
two CPUs active. Appying "maxcpus=1" typcially made the
effect going away (however, at the expense of a core).
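
For reference, the option just goes on the kernel line of the boot loader entry; for a GRUB setup it looks like this (the paths and the root volume are examples from a typical FC4 install, adjust as needed):

Code:
# /boot/grub/grub.conf
title Fedora Core (2.6.15-1.1830_FC4smp, single CPU)
        root (hd0,0)
        kernel /vmlinuz-2.6.15-1.1830_FC4smp ro root=/dev/VolGroup00/LogVol00 maxcpus=1
        initrd /initrd-2.6.15-1.1830_FC4smp.img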

regards

Bernhard
02-06-06, 06:54 PM   #4
eddtux
Re: X lockup with nv-1.0-8178 and 2.6.15-1.1830_FC4smp

Hi

I was curious what ktimer was.
I found out, and then I read this:
http://seclists.org/lists/linux-kern.../Jan/6493.html

Could this be of any interest, given that it states that ktimer is not (or will not be) used in new kernels, since it has been superseded by hrtimers?

(More than that I don't know at all; I just thought it might help you and the nVidia driver developers!)
02-06-06, 07:31 PM   #5
JaXXoN
Re: X lockup with nv-1.0-8178 and 2.6.15-1.1830_FC4smp

Quote:
Originally Posted by eddtux
Could this be of any interest, given that it states that ktimer is not (or will not be) used in new kernels, since it has been superseded by hrtimers?

From what I understand, the current hrtimers (re-)implementation is a
polished version of ktimers: http://lwn.net/Articles/167315/ (see the chapter
"The high-resolution timer API"). That means the hrtimers API has
been kept, but under the hood it's actually ktimers.

regards

Bernhard
02-07-06, 09:08 AM   #6
JaXXoN
Re: X lockup with nv-1.0-8178 and 2.6.15-1.1830_FC4smp

Hi!

Here are some further details: I downgraded to 2.6.14-1.1656_FC4smp
(RPM binary package), which doesn't include the new hrtimer/ktimer
code, in the hope that it would work better than 2.6.15-1.1830_FC4smp,
but it behaves pretty similarly, with the exception that the
machine silently reboots a couple of seconds after the
screen freezes (instead of remaining remotely accessible
through ssh).

I was able to catch the kernel console message during the crash
through the serial console: "irq 50: nobody cared
(try booting with the "irqpoll" option)". See the full kernel log
"irqroute", attached (since the computer reboots, I could not
generate an nvidia-bug-report.log). I tried it with irqpoll as suggested,
but then the kernel doesn't even come up and complains "INIT: BUG:
spinlock recursion on CPU#1, init/1 (Not tainted)" (see the "irqpoll" log,
attached).

Applying "pci=routeirq" has no effect. Applying
"pci=routeirq irqpoll" results in the same effect described
above as if only "irqpoll" was applied (means pci=irqroute has
no effect either way).

Also, leaving out "idle=poll" makes no difference to
stability. Ingo Molnar once suggested adding this
option because the TSCs of the two cores may skew, resulting
in strange (but very rare) timing behaviour in UT2004;
it helped for 2.6.13, BTW.


Technically, the following sequence leads to an
"irq 50: nobody cared":

1. An irq 50 occurs.

2. The kernel interrupt layer calls all device interrupt
handlers sharing this irq (a Soundblaster and the first 7800 GTX
in my case).

3. Each device interrupt handler checks its hardware
registers to figure out the root cause of the interrupt,
reacts accordingly, and then returns "IRQ_HANDLED" to indicate
that the irq was properly serviced.

4. If none of the device interrupt handlers returns
IRQ_HANDLED, the kernel disables the interrupt to make
sure it doesn't fire in an endless loop (which could
happen, because the interrupt was never acknowledged by any
driver interrupt handler), locking up the system.

So if an "irq *: nobody cared" occurs, the kernel
was obviously interrupted, but the root cause of the
interrupt somehow "disappeared". Maybe the interrupt status
register in the GPU is only valid for a certain time, and the
kernel couldn't check the status in time?
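
To illustrate steps 3 and 4 with a minimal example: this is what a 2.6.15-era shared-irq handler looks like for a hypothetical device (the register offsets and names are made up; only the request_irq()/IRQ_HANDLED machinery is the real kernel API):

Code:
#include <linux/interrupt.h>
#include <linux/pci.h>
#include <asm/io.h>

#define MYDEV_IRQ_STATUS 0x00   /* hypothetical register offsets */
#define MYDEV_IRQ_ACK    0x04

struct mydev { void __iomem *mmio; };

/* step 3: check our own hardware; the return value feeds step 4 */
static irqreturn_t mydev_isr(int irq, void *dev_id, struct pt_regs *regs)
{
        struct mydev *dev = dev_id;
        u32 status = readl(dev->mmio + MYDEV_IRQ_STATUS);

        if (!status)
                return IRQ_NONE;    /* not ours: some other device on this line */

        writel(status, dev->mmio + MYDEV_IRQ_ACK); /* acknowledge at the device */
        return IRQ_HANDLED;         /* irq properly serviced */
}

/* registration: SA_SHIRQ (2.6.15 naming; later IRQF_SHARED) opts in to sharing:
 *     request_irq(pdev->irq, mydev_isr, SA_SHIRQ, "mydev", dev);
 */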

Anyway, in earlier (2.4) kernels, a device interrupt handler
didn't provide a return value. Instead, the kernel had a kind of
interrupt-loop detection mechanism and disabled an interrupt
only when more than a certain number of interrupts per second
came in (which could cause problems for applications where
such a high interrupt rate was actually legitimate).
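
For comparison, 2.6 does still count before pulling the plug; the logic in kernel/irq/spurious.c looks roughly like this (simplified from memory, so take the details with a grain of salt):

Code:
/* note_interrupt(), simplified: runs after each pass over the handlers */
desc->irq_count++;
if (action_ret != IRQ_HANDLED)
        desc->irqs_unhandled++;

if (desc->irq_count >= 100000) {
        /* only if almost the entire window went unhandled: */
        if (desc->irqs_unhandled > 99900) {
                __report_bad_irq(irq, desc, action_ret); /* "nobody cared" */
                desc->status |= IRQ_DISABLED;            /* "Disabling IRQ #nn" */
                desc->handler->disable(irq);
        }
        desc->irq_count = 0;
        desc->irqs_unhandled = 0;
}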

It might be a good idea to go back to the old
scheme: even if the nvidia isr didn't handle this interrupt
properly, maybe it will handle the next one?!
That's at least probably better than just disabling the
whole irq :-)

BTW: the Soundblaster PCI card has never been active at
the times the system crashed. I could try it with the
card physically removed, but I doubt it has an influence,
since the kernel worked mostly stable with 2.6.13 (I
experienced the described lockup maybe three times in about
four months, always within the first few seconds after login
after power-on).

regards

Bernhard
Attached Files
nv-crash-output-irqroute.txt.zip (5.2 KB)
nv-crash-output-irqpoll.txt.zip (5.4 KB)
02-07-06, 05:59 PM   #7
JaXXoN
Re: X lockup with nv-1.0-8178 and 2.6.15-1.1830_FC4smp

Quote:
Originally Posted by JaXXoN
It might be a good idea to go back to the old scheme

Instead, I have now changed the nvidia interrupt handler to always
return IRQ_HANDLED, no matter what nv_kern_isr() actually detects,
so that the interrupt will never be disabled. Since the nvidia driver
is the only active device on irq 50, it should always return
IRQ_HANDLED anyway. However, the system (2.6.15-1.1830_FC4smp) now
freezes with the following message (captured on the serial console):

Code:
NVRM: Xid: 16, Head 00000000 Count 00002ac5
NVRM: Xid: 16, Head 00000001 Count 00002ac4
NVRM: Xid: 8, Channel 00000020
do_IRQ: stack overflow: 324
It took a couple of seconds for this message to show up
after the X server froze. I guess the kernel is now
looping in the nvidia isr. Concerning the stack overflow:
all FC4 smp kernels are compiled with 4K kernel stacks,
but as far as I know, that shouldn't be an issue since 1.0-6106
and newer.
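
For clarity, the experiment boils down to something like this in nv.c (the surrounding code is paraphrased from memory, and the rm_isr() call is a stand-in for the actual interface into the binary core):

Code:
/* nv.c sketch: claim every interrupt, no matter what the core says */
static irqreturn_t nv_kern_isr(int irq, void *arg, struct pt_regs *regs)
{
        nv_linux_state_t *nvl = arg;
        int handled = rm_isr(nvl);  /* hypothetical call into the core */

        (void)handled;
        /* was: return IRQ_RETVAL(handled);
         * now: always IRQ_HANDLED, so note_interrupt() can never reach
         * its "nobody cared" threshold and disable the line */
        return IRQ_HANDLED;
}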

regards

Bernhard
02-08-06, 10:21 AM   #8
JaXXoN
Re: X lockup with nv-1.0-8178 and 2.6.15-1.1830_FC4smp

Hi!

I have tried out a couple of setups, and it seems that
single-card operation is not affected. That is, the X server
or the system only freezes in dual-core, dual-card
operation (dual-card, single-core also seems to be OK).

I have created patches for the kernel and the
nvidia driver that re-order the interrupt action
processing, so that the nvidia isr is always
called first when the irq is shared with another
device. This helped to a certain extent:



Test setup 1:

* snd_emu10k1 (Soundblaster) and snd_intel8x0 (nvidia on-chip sound)
kernel modules unloaded (the Soundblaster shares irq 50 with the
7800 GTX residing in the first PEG slot).

* X-server configured for dualhead in twinview mode

* running desktop applications:
+ fvwm95 (window manager)
+ Mozilla 1.7.12
+ several xterms
+ glxgears

glxgears is not strictly necessary to reproduce the
problem, but glxgears+mozilla typically freezes the
system within minutes, if not seconds.

* neither the Soundblaster nor the second 7800 GTX (residing in the
second PEG slot) has been removed.

* kernel: 2.6.15-1.1830_FC4smp

Result: seems to run stable; however, glxgears stops rotating
the gears after about 20 minutes, while the console still reports
reasonable FPS numbers. Restarting glxgears makes the gears
rotate again. I have observed this effect several times
with other kernels, so it might be an application-related
problem. For details, refer to nvidia-bug-report.log-testsetup1.



Test setup 2:

* like test setup 1, but with snd_emu10k1 module loaded after
nvidia module (isr action order: "emu10k1, nvidia")

Result: same as for test setup 1.



Test setup 3:

* triple head setup

* snd_emu10k1 unloaded

* applications as in test setup 1

* kernel: 2.6.15-1.1830_FC4smp

Result: the X server "partly" freezes: glxgears keeps running
and the pointer can still be moved, but windows no longer
react to user input. The effect showed up almost instantly
after firing up glxgears. Remote ssh login is still
possible. For details, refer to nvidia-bug-report.log-testsetup3.



Test setup 4:

* triple head setup

* snd_emu10k1 loaded

* applications as in test setup 1

* patch for kernel and nvidia driver to hook the nvidia
driver in at the beginning of the interrupt action
queue (let's call it the "defer-patch").

* kernel: vanilla 2.6.15 + defer-patch.

Result: almost complete X server freeze: applications stop
working, but after a few seconds, "Disabling IRQ" (...)
appeared in one of the xterms. The effect showed up almost
instantly after firing up glxgears. Remote ssh login is still
possible. For details, refer to nvidia-bug-report.log-testsetup4.



Test setup 5:

* like test setup 4, but with snd_emu10k1 unloaded

Result: seemed to run stable for about 30 minutes, then crashed
with:

Code:
irq 66: nobody cared (try booting with the "irqpoll" option)
[<c014063a>] __report_bad_irq+0x2a/0x90
[<c013fded>] handle_IRQ_event+0x3d/0x70
[<c0140747>] note_interrupt+0x87/0xe0
[<c013ff1c>] __do_IRQ+0xfc/0x110
[<c0105989>] do_IRQ+0x19/0x30
[<c0103c76>] common_interrupt+0x1a/0x20
[<c034e0d9>] _spin_unlock_irq+0x9/0x20
[<c034c82d>] schedule+0x54d/0x7a0
[<c011b55c>] sys_sched_yield+0x7c/0xa0
[<c011b55c>] sys_sched_yield+0x7c/0xa0
[<c01031d7>] sysenter_past_esp+0x54/0x75
handlers:
[<f9064944>] (nv_kern_isr+0x0/0x64 [nvidia])
Disabling IRQ #66
NVRM: Xid: 16, Head 00000000 Count 0000375b
NVRM: Xid: 16, Head 00000001 Count 0000375a
NVRM: Xid: 8, Channel 00000020

Complete freeze; hence nvidia-bug-report.log-testsetup5
was generated prior to the crash.
Attached Files
nvidia-bug-report.log-testsetup1.zip (49.5 KB)
nvidia-bug-report.log-testsetup3.zip (49.6 KB)
nvidia-bug-report.log-testsetup4.zip (48.4 KB)
nvidia-bug-report.log-testsetup5.zip (48.1 KB)

02-08-06, 10:24 AM   #9
JaXXoN
Re: X lockup with nv-1.0-8178 and 2.6.15-1.1830_FC4smp

Here are the patches to re-order the processing of interrupt
handlers, just in case somebody might be interested.
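
Roughly, the idea is the following; this is a sketch of the mechanism, not the literal diffs (those are in the attachments below):

Code:
/* kernel side (setup_irq() in kernel/irq/manage.c, sketch): a new
 * irqaction is normally appended at the tail of the shared list;
 * actions flagged SA_NODEFER get inserted at the head instead, so
 * their handler runs first on a shared irq. */
struct irqaction **p = &desc->action;
if (new->flags & SA_NODEFER) {
        new->next = *p;
        *p = new;               /* nvidia isr now runs before emu10k1 */
} else {
        while (*p)
                p = &(*p)->next;
        *p = new;               /* everybody else: tail, as usual */
}

/* driver side (nv.c, sketch): request the irq with the extra flag:
 *     request_irq(dev->irq, nv_kern_isr, SA_SHIRQ | SA_NODEFER, "nvidia", nvl);
 */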

regards

Bernhard
Attached Files
defer.patch.zip (437 Bytes)
nv-1.0-8178-2.6.15-SA_NODEFER.patch.zip (455 Bytes)
02-08-06, 10:29 AM   #10
JaXXoN
Re: X lockup with nv-1.0-8178 and 2.6.15-1.1830_FC4smp

Hi!

I was wondering why PCIe devices actually share interrupts
with other PCI devices, given that PCIe cards physically always
use "Message Signaled Interrupts" (MSI), until I figured out
that the kernel operates all PCIe devices in "pin-irq
assertion mode" (legacy PCI interrupt emulation) unless
a device driver explicitly enables MSI.

So I added "pci_enable_msi(dev)" in nv_kern_probe at
line 3687 in nv.c, straight after "pci_enable_device(dev)",
and the interrupts for the nvidia cards now changed to
unique interrupt vectors: first card from 66 to 90, second
card from 217 to 82. So any potential conflicts with other
device drivers sharing the same IRQ should be gone.
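
In code, the change looks roughly like this (the context inside nv_kern_probe() is paraphrased, so treat it as a sketch rather than the literal diff; pci_enable_device() and pci_enable_msi() are the real kernel APIs):

Code:
/* nv.c, nv_kern_probe(), around line 3687 (sketch): */
if (pci_enable_device(dev) != 0)
        goto failed;

/* ask the PCI core for a Message Signaled Interrupt: on success the
 * device gets its own vector and dev->irq is updated, so the later
 * request_irq(dev->irq, ...) picks up the MSI vector automatically */
if (pci_enable_msi(dev) != 0)
        printk(KERN_INFO "NVRM: MSI unavailable, using legacy pin irq\n");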

However, when now starting the X server, the kernel reports

NVRM: RmInitAdapter failed! (0x12:0x2b:1438)
NVRM: rm_init_adapter(1) failed

and Xorg.0.log reports:

(EE) NVIDIA(0): The NVIDIA kernel module does not appear to be receiving
(EE) NVIDIA(0): interrupts generated by the NVIDIA graphics device.

Please refer to the attached nvidia-bug-report.log for details.

The kernel's Documentation/MSI-HOWTO.txt says that it is only necessary
to add pci_enable_msi() prior to request_irq(), so this should work.

According to the MSI-HOWTO.txt, using MSI is a good idea and should
be added to any PCIe device driver. If I have spare time, I will try
to figure out what the problem is.

regards

Bernhard
Attached Files
nvidia-bug-report.log-msi.zip (45.9 KB)
02-09-06, 09:42 AM   #11
JaXXoN
Re: X lockup with nv-1.0-8178 and 2.6.15-1.1830_FC4smp

Hi!

Just to keep you in the loop: I have activated some additional
debugging options in the kernel and now get different
console output:

Debugging options:

Code:
#
# Kernel hacking
#
# CONFIG_PRINTK_TIME is not set
CONFIG_DEBUG_KERNEL=y
CONFIG_MAGIC_SYSRQ=y
CONFIG_LOG_BUF_SHIFT=15
CONFIG_DETECT_SOFTLOCKUP=y
# CONFIG_SCHEDSTATS is not set
# CONFIG_DEBUG_SLAB is not set
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
# CONFIG_DEBUG_KOBJECT is not set
# CONFIG_DEBUG_HIGHMEM is not set
# CONFIG_DEBUG_BUGVERBOSE is not set
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_FS is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_FRAME_POINTER is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_EARLY_PRINTK is not set
CONFIG_DEBUG_STACKOVERFLOW=y
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_PAGEALLOC is not set
# CONFIG_4KSTACKS is not set
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y
Console output:

Code:
irq 66: nobody cared (try booting with the "irqpoll" option)
 [<c014117a>] __report_bad_irq+0x2a/0x90
 [<c014094d>] handle_IRQ_event+0x3d/0x70
 [<c0141287>] note_interrupt+0x87/0xe0
 [<c0140a7c>] __do_IRQ+0xfc/0x110
 [<c0105846>] do_IRQ+0x36/0x70
 [<c0103b9a>] common_interrupt+0x1a/0x20
 [<c01030aa>] sysenter_past_esp+0x3/0x75
handlers:
[<f8908ab0>] (snd_emu10k1_interrupt+0x0/0x540 [snd_emu10k1])
[<f905b9f4>] (nv_kern_isr+0x0/0x64 [nvidia])
Disabling IRQ #66
NVRM: Xid: 16, Head 00000000 Count 00002bbb
NVRM: Xid: 16, Head 00000001 Count 00002bba
NVRM: Xid: 8, Channel 00000020
Unable to handle kernel paging request at virtual address 01000077
 printing eip:
c0116d98
*pde = 00000000
Oops: 0000 [#1]
SMP
Modules linked in: autofs4 sunrpc nvidia snd_intel8x0 snd_emu10k1 snd_rawmidi srCPU:    12632256
EIP:    0060:[<c0116d98>]    Tainted: P      VLI
EFLAGS: 00010006   (2.6.15)
EIP is at do_page_fault+0xa8/0x651
eax: f597e000   ebx: 00000000   ecx: 0000007b   edx: 00ffffff
esi: 00000001   edi: c0116cf0   ebp: f59862e0   esp: f597e080
ds: 007b   es: 007b   ss: 0068
Unable to handle kernel paging request at virtual address 00c0c138
 printing eip:
c0116d98
*pde = 00000000
BUG: spinlock lockup on CPU#0, xterm/2895, c03a1c08
 [<c0220450>] __spin_lock_debug+0x90/0xc0
 [<c02204e7>] _raw_spin_lock+0x67/0x90
 [<c0179d20>] __pollwait+0x0/0xd0
 [<c034e65c>] _spin_lock_irq+0xc/0x10
 [<c010434f>] die+0x3f/0x190
 [<c013e24f>] search_module_extables+0x7f/0x90
 [<c011fb3b>] printk+0x1b/0x20
 [<c0116f0c>] do_page_fault+0x21c/0x651
 [<c0116cf0>] do_page_fault+0x0/0x651
 [<c0103ccf>] error_code+0x4f/0x54
 [<c0116cf0>] do_page_fault+0x0/0x651
However, I think this page fault and the spinlock lockup are
just secondary effects of a dying system.

BTW: I also tried it with generic x86 support (somebody claimed
that the nvidia driver doesn't work well without it) and with
mem=512M as a kernel boot option, but neither helped.

regards

Bernhard
02-10-06, 03:34 PM   #12
JaXXoN
Re: X lockup with nv-1.0-8178 and 2.6.15-1.1830_FC4smp

Hi!

Just to keep you in the loop: I have "acid tested" several kernel
versions, and it seems that the regression started in 2.6.14-rc1:

Code:
Kernel                 Test
2.6.13                 passed
2.6.14-rc1             failed
2.6.14-rc3             failed
2.6.14                 failed
2.6.14-1.1656_FC4smp   failed
2.6.15                 failed
2.6.15-1.1830_FC4smp   failed
2.6.15.4               failed
Acid test means: running two Mozilla windows on separate graphics
cards, plus two glxgears windows, also on separate graphics
cards. The regression seems worst for 2.6.14 and 2.6.15,
where even bringing up the first glxgears window typically results in a
freeze within seconds, whereas the .14-rc kernels and the .15.4
kernel usually take at least a couple of minutes to bring down.
2.6.13 worked like a charm for one hour (then I aborted the test).

Can somebody confirm this observation?

I went through the changelog for 2.6.14-rc1 in order to figure out
what might cause the difference, but there are quite a lot of changes ...

regards

Bernhard