#1 | |
|
Registered User
Join Date: Aug 2009
Posts: 9
|
I have a reproducible problem whereby an application, when using a specific feature, causes X and the application to enter a live-locked state.

X CPU use reaches 100% while it spins in a tight loop, and the application similarly live-locks. I can reliably reproduce this on an HP ws460c Blade workstation (with a Quadro 6000 card) running RHEL 5.5 and driver 295.20. The blade has the latest HP firmware loaded. I can't reproduce this effect on an HP Z800 workstation with the same application, driver version and OS version with either a Quadro 5800 or 6000 graphics card installed. This suggests either a timing problem or something specific to the architecture with this NVIDIA feature in use.

The application seems to be using a shader to map color values to a polygon surface. Without this shader enabled everything's fine. Once this shader is enabled, the application will hang within seconds. The X process goes to 100% and is in a loop attempting to perform an ioctl() on /dev/nvidiactl:

    setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
    rt_sigreturn(0) = 0
    write(27, "\1\1\202\0\376\0\203\0", 8) = 8
    semop(0, 0x7ffff461e2d0, 1) = 0
    semop(0, 0x7ffff461e2e0, 1) = 0
    select(28, [27], NULL, NULL, {0, 0}) = 1 (in [27], left {0, 0})
    read(27, "\3", 1) = 1
    read(27, "\0\240\6\230\2 \0\30\0", 9) = 9
    ioctl(8, 0xc020462a, 0x7ffff461e070) = 0
    ioctl(8, 0xc020462a, 0x7ffff461e070) = 0
    ioctl(8, 0xc0384641, 0x7ffff461e150) = 0

    # lsof -p 13081 | grep ' 8u'
    X 13081 root 8u CHR 195,255 20153 /dev/nvidiactl

/var/log/Xorg.0.log contains:

    nvLock: client timed out, taking the lock
    nvLock: client timed out, taking the lock
    nvLock: client timed out, taking the lock
    nvLock: client timed out, taking the lock
    nvLock: client timed out, taking the lock

The application is in a loop, periodically waiting for an ioctl() on /dev/nvidiactl.

dmesg sometimes shows Xids like:

    NVRM: Xid (0000:06:00): 8, Channel 00000004
    NVRM: Xid (0000:06:00): 31, Ch 00000004, engmask 00000101, intr 10000000
    NVRM: Xid (0000:06:00): 31, Ch 00000004, engmask 00000101, intr 10000000
    NVRM: Xid (0000:06:00): 31, Ch 00000004, engmask 00000101, intr 10000000
    NVRM: Xid (0000:06:00): 31, Ch 00000004, engmask 00000101, intr 10000100
    NVRM: Xid (0000:06:00): 8, Channel 00000001
    NVRM: Xid (0000:06:00): 8, Channel 00000004
    NVRM: Xid (0000:06:00): 32, Channel ID 00000004 intr 00800000

During the problem Xorg still updates the framebuffer, but only about one frame every 5-10 seconds, so there's no complete deadlock. To get out of the live lock, the X server and/or the application have to be killed.

I've seen other variants of behaviour when stracing the X server (a tight loop on an rt_sigaction() call, constantly stat'ing /proc/{PID_OF_APPLICATION}/cmdline), but these vary, probably depending on when I get an strace attached to the server. In all cases the X server is too busy to update the framebuffer.

(The Blade is being accessed via HP RGS but this isn't a factor. The same behavior is reproducible when directly attached to the blade with a local monitor. The problem can't be reproduced using RGS to a Z800 workstation.)

Bug log attached, taken while the problem was happening...
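For anyone comparing an affected machine against a working one, it may help to tally the Xid events rather than eyeball them. The following is only a rough sketch: the regular expression is inferred from the dmesg lines quoted in this post, not from any NVIDIA specification, and `tally_xids` and `SAMPLE` are hypothetical names used for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the dmesg lines quoted in the post (an assumption,
# not an official format), e.g.:
#   NVRM: Xid (0000:06:00): 31, Ch 00000004, engmask 00000101, intr 10000000
XID_RE = re.compile(r"NVRM: Xid \(([0-9a-fA-F:.]+)\): (\d+),")


def tally_xids(log_text):
    """Count Xid events, keyed by (PCI address, Xid number)."""
    counts = Counter()
    for match in XID_RE.finditer(log_text):
        counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# The dmesg excerpt from the post, one event per line.
SAMPLE = """\
NVRM: Xid (0000:06:00): 8, Channel 00000004
NVRM: Xid (0000:06:00): 31, Ch 00000004, engmask 00000101, intr 10000000
NVRM: Xid (0000:06:00): 31, Ch 00000004, engmask 00000101, intr 10000000
NVRM: Xid (0000:06:00): 31, Ch 00000004, engmask 00000101, intr 10000000
NVRM: Xid (0000:06:00): 31, Ch 00000004, engmask 00000101, intr 10000100
NVRM: Xid (0000:06:00): 8, Channel 00000001
NVRM: Xid (0000:06:00): 8, Channel 00000004
NVRM: Xid (0000:06:00): 32, Channel ID 00000004 intr 00800000
"""

if __name__ == "__main__":
    # Print the most frequent (device, Xid) pairs first.
    for (device, xid), count in tally_xids(SAMPLE).most_common():
        print(f"{device}  Xid {xid}: {count}")
```

In practice the saved output of dmesg would be passed in place of `SAMPLE`; running the same tally on the Z800 should show whether any of these Xids ever appear there.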
|
#2 | |
|
Registered User
Join Date: Aug 2009
Posts: 9
|
Watching an strace of the X process before triggering the hang, we definitely halt in an ioctl call on /dev/nvidiactl.

The application is spinning on access to the socket it uses to communicate with the X server. I also sometimes see this in Xorg.0.log:

    (WW) Feb 27 17:44:43 NVIDIA(0): WAIT (2, 7, 0x8000, 0x00003708, 0x0000373c)
    (WW) Feb 27 17:44:48 NVIDIA(0): WAIT (0, 7, 0x8000, 0x0000373c, 0x0000373c)
    (WW) Feb 27 17:44:51 NVIDIA(0): WAIT (2, 7, 0x8000, 0x00003ab4, 0x00003b00)
    (WW) Feb 27 17:44:58 NVIDIA(0): WAIT (1, 7, 0x8000, 0x00003ab4, 0x00003b00)
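To check how often these WAIT warnings recur, the timestamps can be pulled out of Xorg.0.log and the gaps between them computed. This is a minimal sketch under stated assumptions: the pattern is inferred from the four lines quoted in this post, the meaning of the WAIT arguments is not known and they are ignored, and because the log lines carry no year, the `year` parameter is a placeholder. `wait_gaps` and `SAMPLE` are hypothetical names.

```python
import re
from datetime import datetime

# Pattern inferred from the log lines quoted in the post (an assumption), e.g.:
#   (WW) Feb 27 17:44:43 NVIDIA(0): WAIT (2, 7, 0x8000, 0x00003708, 0x0000373c)
WAIT_RE = re.compile(
    r"\(WW\) (\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) NVIDIA\(\d+\): WAIT \("
)


def wait_gaps(log_text, year=2000):
    """Return the seconds elapsed between consecutive WAIT warnings.

    The log has no year, so `year` is a placeholder used only to build
    complete datetimes; it does not affect gaps within a single year.
    Month names are parsed with %b, which assumes an English/C locale.
    """
    stamps = [
        datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
        for m in WAIT_RE.finditer(log_text)
    ]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


# The Xorg.0.log excerpt from the post, one warning per line.
SAMPLE = """\
(WW) Feb 27 17:44:43 NVIDIA(0): WAIT (2, 7, 0x8000, 0x00003708, 0x0000373c)
(WW) Feb 27 17:44:48 NVIDIA(0): WAIT (0, 7, 0x8000, 0x0000373c, 0x0000373c)
(WW) Feb 27 17:44:51 NVIDIA(0): WAIT (2, 7, 0x8000, 0x00003ab4, 0x00003b00)
(WW) Feb 27 17:44:58 NVIDIA(0): WAIT (1, 7, 0x8000, 0x00003ab4, 0x00003b00)
"""

if __name__ == "__main__":
    print(wait_gaps(SAMPLE))
```

For the excerpt above the gaps are a few seconds each, which is in the same range as the one-frame-every-5-10-seconds framebuffer updates described in the first post, though that may be coincidence.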
|
#3 |
|
NVIDIA Corporation
Join Date: Mar 2005
Posts: 2,487
|
Can you please send us a test case we can use to reproduce the problem?
|
#4 | |
|
Registered User
Join Date: Aug 2009
Posts: 9
|
It's a complex multi-part application with a 42GB data set so that's not really a practical option. If possible, I'll PM you with the details...
|
#5 |
|
Registered User
Join Date: Nov 2003
Posts: 14
|
We seem to be having a similar problem with one of our applications. For us, the problem seems to be isolated to a particular hardware configuration. We have four Dell M6600 laptops with Quadro 3000Ms, and three of them will intermittently give us the "nvLock: client timed out, taking the lock" problem. We are running version 290.10. The other systems we have are running the same versions of the driver, OS, and kernel and do not have this problem.
I tried version 295.20 and that didn't fix the problem.