02-27-12, 07:28 AM
lmv

NVidia live-lock with Xorg and app on ws460c HP Blade Workstation with Quadro 6000

I have a reproducible problem whereby an application, when using a specific feature, causes X and the application to enter a live-locked state.

X CPU use reaches 100% while it spins in a tight loop; the application similarly live-locks.

I can reliably reproduce this on an HP ws460c Blade workstation (with a Quadro 6000 card) running RHEL 5.5 and driver 295.20. The blade has the latest HP firmware loaded.

I can't reproduce this on an HP Z800 workstation with the same application, driver version and OS version, with either a Quadro 5800 or 6000 graphics card installed. This suggests either a timing problem or something specific to the blade's architecture when this NVidia feature is in use.

The application seems to be using a shader to map color values to a polygon surface. Without this shader enabled everything's fine. Once this shader is enabled, the application hangs within seconds.
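
For illustration only, here's a rough sketch of the kind of color-mapping fragment shader I mean. This is NOT the application's actual shader (I don't have its source); it's just a minimal GLSL 1.10 lookup-table shader plus a C helper to compile it, assuming a current OpenGL 2.x context:

/* Hypothetical sketch -- NOT the application's real shader. A scalar
 * "dataValue" is mapped through a 1D color table onto the surface. */
#define GL_GLEXT_PROTOTYPES 1
#include <stdio.h>
#include <GL/gl.h>
#include <GL/glext.h>

static const char *frag_src =
    "uniform sampler1D colorTable; /* value -> RGBA lookup */\n"
    "varying float dataValue;      /* per-vertex scalar */\n"
    "void main(void)\n"
    "{\n"
    "    gl_FragColor = texture1D(colorTable, dataValue);\n"
    "}\n";

/* Compile and link the shader; assumes a current GL 2.x context. */
GLuint build_colormap_program(void)
{
    GLuint fs = glCreateShader(GL_FRAGMENT_SHADER);
    GLuint prog = glCreateProgram();
    GLint ok;

    glShaderSource(fs, 1, &frag_src, NULL);
    glCompileShader(fs);
    glGetShaderiv(fs, GL_COMPILE_STATUS, &ok);
    if (!ok) {
        char log[1024];
        glGetShaderInfoLog(fs, sizeof log, NULL, log);
        fprintf(stderr, "compile failed: %s\n", log);
        return 0;
    }
    glAttachShader(prog, fs);
    glLinkProgram(prog);
    return prog;
}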

The X process goes to 100% CPU and loops attempting ioctl() calls on /dev/nvidiactl:

setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
rt_sigreturn(0) = 0
write(27, "\1\1\202\0\376\0\203\0", 8) = 8
semop(0, 0x7ffff461e2d0, 1) = 0
semop(0, 0x7ffff461e2e0, 1) = 0
select(28, [27], NULL, NULL, {0, 0}) = 1 (in [27], left {0, 0})
read(27, "\3", 1) = 1
read(27, "\0\240\6\230\2 \0\30\0", 9) = 9
ioctl(8, 0xc020462a, 0x7ffff461e070) = 0
ioctl(8, 0xc020462a, 0x7ffff461e070) = 0
ioctl(8, 0xc0384641, 0x7ffff461e150) = 0

# lsof -p 13081 | grep ' 8u'
X 13081 root 8u CHR 195,255 20153 /dev/nvidiactl

/var/log/Xorg.0.log contains:

nvLock: client timed out, taking the lock
nvLock: client timed out, taking the lock
nvLock: client timed out, taking the lock
nvLock: client timed out, taking the lock
nvLock: client timed out, taking the lock

The application is also in a loop, periodically waiting on an ioctl() on /dev/nvidiactl.
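
For reference, I captured these traces by attaching strace to the already-running processes, along these lines (13081 was my X server's PID; <app_pid> stands in for the application's PID):

# strace -p 13081
# strace -p <app_pid> -e trace=ioctl

The -e trace=ioctl filter is just a convenience for watching the /dev/nvidiactl calls; a plain attach shows the full loop.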

dmesg sometimes shows Xids like these (if NVIDIA's published Xid descriptions apply here, 8 means the GPU stopped processing, 31 is a GPU memory page fault, and 32 is an invalid or corrupted push buffer stream):

NVRM: Xid (0000:06:00): 8, Channel 00000004
NVRM: Xid (0000:06:00): 31, Ch 00000004, engmask 00000101, intr 10000000
NVRM: Xid (0000:06:00): 31, Ch 00000004, engmask 00000101, intr 10000000
NVRM: Xid (0000:06:00): 31, Ch 00000004, engmask 00000101, intr 10000000
NVRM: Xid (0000:06:00): 31, Ch 00000004, engmask 00000101, intr 10000100
NVRM: Xid (0000:06:00): 8, Channel 00000001
NVRM: Xid (0000:06:00): 8, Channel 00000004
NVRM: Xid (0000:06:00): 32, Channel ID 00000004 intr 00800000

During the problem Xorg still updates the framebuffer, but only about one frame every 5-10 seconds, so it's not a complete deadlock.

To get out of the live-lock, the X server and/or the application have to be killed.

I've seen other variants of behaviour when stracing the X server (a tight loop on an rt_sigaction() call, or constantly stat()ing /proc/{PID_OF_APPLICATION}/cmdline), but these vary, probably depending on when I get strace attached to the server. In all cases the X server is too busy to update the framebuffer.

(The blade is being accessed via HP RGS, but this isn't a factor: the same behaviour is reproducible when directly attached to the blade with a local monitor, and the problem can't be reproduced using RGS to a Z800 workstation.)

Bug log attached, taken while the problem was happening...
Attached Files
nvidia-bug-report.log.gz (81.0 KB)