4363 SMP Deadlock
Thanks for your hard work in bringing the Linux drivers to stability and for monitoring this site. I'm sure I speak for everyone when I say we appreciate having someone from Nvidia listen to us and interact with us.
After the DFP+SMP bug got fixed in the 4363 release I decided to give it a shot yesterday. The X server started up nicely, but unfortunately made it only about 30 minutes before another problem re-appeared. The X server appears to be in deadlock, failing to process any events but managing to use up 100% of all processors. This issue has been discussed for release 4191 but to my knowledge has not yet been resolved.
Before I spend too much time trying to isolate what is causing this, is anyone else still having this problem or did it clear up for other users in this release?
My system specs follow. Please see attached for X configuration file. More information is available upon request.
SuperMicro X5DAE Mobo (Intel E7505)
GeForce 4 Ti 4600 AGP [NV25]
Dual P4 Xeon 2.4GHz
Dual Sony SDM-X72 Digital Displays
RedHat 9 (with all current updates)
Does not seem to be affected by AGP being disabled, CRT vs DFP, kernel version, glibc version, etc.
Relevant to all releases later than 3123. (That is to say, I am using 3123 since it does not have this problem.)
I was just looking through the release notes and noticed the info about disabling APIC. I will try this when I get home tonight.
I tried disabling APIC last night to no avail. As much as I'd like to keep an open mind and accept the possibility that this is hardware related, I have difficulty in doing so given that release 3123 does not have this problem, yet all the 4xxx releases do.
This is when it becomes extremely frustrating not to have source code. If I did, the first thing I'd do is a diff between 3123 and 4363 and take a look at all the mutexes and other changes. But I can't. I can only guess at what the problem is, and my current guess is that there's a mutex somewhere getting stuck in deadlock.
Andy, what information can I give Nvidia? To reiterate, as best as I can tell, this issue does not concern AGP, RenderAccel, APIC, etc.
You can try disabling ACPI rather than APIC, too -- or disable both of them.
However, this isn't a deadlock. In a classic deadlock on blocking locks, there's 0% CPU utilization -- both threads (or all threads, or all processes, or whatever's involved in the deadlock) are waiting for each other to do something, using no CPU. This happens if thread 1 acquires lock 1, and thread 2 acquires lock 2, then each tries to acquire the other's lock. They'll sleep forever, because neither one can wake up to release the lock it holds.
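To make the two-lock scenario concrete, here is a minimal Python sketch. It has nothing to do with the X server or the NVIDIA driver; the names `run_deadlock_demo`, `worker`, and `got_second` are made up for illustration. Each thread takes one lock and then tries for the other. A timeout is used on the second acquire so the demo finishes instead of hanging; with a plain `acquire()` both threads would sleep forever.

```python
import threading
import time


def run_deadlock_demo(timeout=0.5):
    """Two threads each hold one lock and try to take the other one."""
    lock1 = threading.Lock()
    lock2 = threading.Lock()
    barrier = threading.Barrier(2)  # keeps the two threads in step
    got_second = {}

    def worker(name, first, second):
        with first:
            # Wait until both threads hold their first lock.
            barrier.wait()
            # A plain second.acquire() would block forever here (deadlock).
            # The timeout is only there so the demo can report and exit.
            got_second[name] = second.acquire(timeout=timeout)
            # Keep holding the first lock until both attempts are over.
            barrier.wait()
            if got_second[name]:
                second.release()

    t1 = threading.Thread(target=worker, args=("thread 1", lock1, lock2))
    t2 = threading.Thread(target=worker, args=("thread 2", lock2, lock1))
    cpu_start = time.process_time()
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    cpu_used = time.process_time() - cpu_start
    return got_second, cpu_used


if __name__ == "__main__":
    result, cpu = run_deadlock_demo()
    print(result)
    print(f"CPU seconds used while both threads were stuck: {cpu:.4f}")
```

Neither thread gets its second lock, and the CPU time reported should be small compared with the wall-clock time spent waiting, because both threads are asleep rather than spinning.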
What you're seeing, if it's any kind of locking issue, is a livelock, where the processes involved are spinning in some sort of busy-wait loop. However, AFAIK, X is not multithreaded -- and there's only one process running. So it can't deadlock or livelock with other threads. I therefore doubt that it's any kind of locking -- though I don't have source either, so it could be some resource contention on the card that we can't see.
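The difference between sleeping on a resource and spinning on it shows up directly in CPU time. The sketch below is only an illustration in Python, not anything taken from the driver; `cpu_used_while_waiting` is a hypothetical helper. It waits for the same event twice, once in a busy-wait loop and once in a blocking wait, and returns the process CPU time each approach consumed.

```python
import threading
import time


def cpu_used_while_waiting(busy, wall_seconds=0.2):
    """Return process CPU seconds consumed while one thread waits for an event."""
    ready = threading.Event()

    def waiter():
        if busy:
            # Busy-wait: poll the flag in a tight loop, burning CPU.
            while not ready.is_set():
                pass
        else:
            # Blocking wait: the thread sleeps until the event is set.
            ready.wait()

    thread = threading.Thread(target=waiter)
    cpu_start = time.process_time()
    thread.start()
    time.sleep(wall_seconds)  # the "resource" is unavailable for this long
    ready.set()
    thread.join()
    return time.process_time() - cpu_start


if __name__ == "__main__":
    print("busy-wait CPU seconds:", cpu_used_while_waiting(busy=True))
    print("blocking  CPU seconds:", cpu_used_while_waiting(busy=False))
```

A process stuck in the first kind of loop pegs a processor, which matches the symptom described in this thread; a process stuck in the second kind sits at roughly 0%.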
Same thing here
SMP Pentium III Apollo PRO 133 chipset GeForce3 Ti 200, 2.4.21-rc1 kernel.
CONFIG_X86_GOOD_APIC, CONFIG_X86_IO_APIC and CONFIG_X86_LOCAL_APIC all set to "y".
2x AGP using the NVIDIA driver.
The X server spins for a while and eventually starts up (after 1-2 minutes). After that everything seems to be ok, can't say for sure, at least the undergrads haven't come to me to tell me for the nth time what I already know :-\
PS: I *hate* this discussion board system. Thanks for keeping it up and running and all of that, but I still hate it...
Yeah, I guess I wasn't using correct terminology. By deadlock I meant the X server appears to be waiting for a resource and becomes dead because it never gets it.
I also assumed that X is multithreaded. It seems to me if it wants to interact with multiple clients connected on Unix domain sockets and over TCP and be interactive with all of them, while also driving a high bandwidth video display, it needs to accept() connections in one thread and process incoming events on several other threads. But I guess it could also use select() or poll() to service all events and accept new connections in a giant while loop.
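For what it's worth, the single-threaded select() approach looks roughly like the Python sketch below. This is not the X server's code; `make_listener` and `serve_round` are hypothetical names, and the "protocol" is just echoing bytes back in upper case. One thread watches the listening socket and every connected client, and handles whichever ones are ready.

```python
import select
import socket


def make_listener():
    """Create a non-blocking TCP listener on an OS-chosen localhost port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen()
    listener.setblocking(False)
    return listener


def serve_round(listener, clients, timeout=0.5):
    """Run one pass of the event loop; return how many sockets were ready."""
    readable, _, _ = select.select([listener, *clients], [], [], timeout)
    for sock in readable:
        if sock is listener:
            # A new client is waiting: accept it and start watching it.
            conn, _addr = listener.accept()
            clients.append(conn)
        else:
            data = sock.recv(4096)
            if data:
                # Stand-in for real request handling.
                sock.sendall(data.upper())
            else:
                # Empty read means the client closed the connection.
                clients.remove(sock)
                sock.close()
    return len(readable)


if __name__ == "__main__":
    listener = make_listener()
    clients = []
    print("listening on", listener.getsockname())
    while True:  # the "giant while loop"
        serve_round(listener, clients)
```

With this design a single thread stays responsive to many clients, so seeing only one busy processor is consistent with a server built this way.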
Either way, in this case I don't believe X is livelocking itself or another process. It only spins on 1 processor, implying only 1 thread. However, it could be in a spinlock waiting for, say, an irq or some other resource to change status.
So my new diagnosis is that 4363 is not playing nice with some hardware on my system, but 3123 does play nice with it. So I'll try disabling ACPI and see what happens.
m2-: see the README for long delays on X start - check out the IgnoreDisplayDevices option.
WaxyLemon: I haven't seen this in-house, but I'll try to repro. Please send details to email@example.com.
I believe the issue is related to this one:
in which you wrote that you have reproduced it in house. But I will submit a bug report in order to provide additional information.
I also haven't checked to see if ACPI is the magic bullet yet. I'll do that tonight.
That thread involves a hang on TNT2s when viewing large images. I don't think it's related to your bug.