nV News Forums

 
 


Brad.Scalio 05-13-10 06:04 PM

Repeatable X Crash, dual FX580 on HP Z600
 
2 Attachment(s)
Greetings

Problem: Spontaneous, seemingly random, yet fairly repeatable X crashes when users resize windows or change window state (maximizing, minimizing, etc.).

Running dual FX580 cards on HP Z600 workstations

OS: RedHat Enterprise Linux 5.2
KERNEL: 2.6.18-92.el5PAE
XORG-X11: 1.1.1-48.41.el5
NVRM: NVIDIA UNIX x86 Kernel Module 195.36.15
GCC: 4.1.2 20071124 (Red Hat 4.1.2-42)
GPU: NVIDIA Quadro FX580 (512)
RedHat Ticket: 861593

Attached are the bug report and Xorg logfile during crash.

Here is the backtrace from gdb attached to Xorg when it crashed. Unfortunately, some of the frames (0x00000001?) look like nonsense; I have seen gdb on RHEL5 show similar issues when attached to a running process. There is also no debuginfo for the nvidia driver, which accounts for the remaining missing symbols.

(gdb) bt f
#0 0x001a2410 in __kernel_vsyscall ()
No symbol table info available.
#1 0x00394d10 in raise () from /lib/libc.so.6
No symbol table info available
#2 0x00396621 in abort () from /lib/libc.so.6
No symbol table info available.
#3 0x080a0d55 in ddxGiveUp () at xf86Init.c:1261
i = <value optimized out>
#4 0x081ae6e3 in AbortServer () at log.c:408
No locals.
#5 0x081aec76 in FatalError (f=0x81c1c2c "Caught signal %d. Server aborting\n") at log.c:554
args = 0xbffa5d94 "\v"
beenhere = 1
#6 0x080e5a70 in xf86SigHandler (signo=11) at xf86Events.c:1484
No locals.
#7 <signal handler called>
No symbol table info available.
#8 0x014bbf88 in _nv001111X () from /usr/lib/xorg/modules/drivers/nvidia_drv.so
No symbol table info available.
#9 0x0a507300 in ?? ()
No symbol table info available.
#10 0x0a507300 in ?? ()
No symbol table info available.
#11 0xbffa6158 in ?? ()
No symbol table info available.
#12 0xbffa6154 in ?? ()
No symbol table info available.
#13 0x00000001 in ?? ()
No symbol table info available.
#14 0xbffa623c in ?? ()
No symbol table info available.
#15 0x013899d0 in _nv002838X () from /usr/lib/xorg/modules/drivers/nvidia_drv.so
No symbol table info available.
#16 0x0a24c458 in ?? ()
No symbol table info available.
#17 0x08156b63 in getDrawableDamageRef (pDrawable=0x0) at damage.c:77
pPixmap = (PixmapPtr) 0x2

I have tried various things. Most of the options I am used to (XaaNoPixmapCache, XaaNoOffScreenPixmaps, AccelMethod "xaa") are not activated when I add them to the Device sections of xorg.conf; I assume these have been incorporated into the driver in the newer versions.

I have tried older drivers as well; 185.18.29 is the driver pre-packaged on the HP resource CD as approved for this workstation.

I understand that the Nehalem chipsets aren't officially supported until RHEL5.3; however, I highly doubt that whatever hardware acceleration may or may not be occurring is directly responsible for this problem.

Any suggestions, comments, complaints, etc. are more than welcome.

JaXXoN 05-13-10 08:27 PM

Re: Repeatable X Crash, dual FX580 on HP Z600
 
AFAIR, RHEL 5.2 uses 4k kernel stacks. The nvidia driver
worked pretty well with only 4k for a while, but maybe there
is a regression? Can you please try a kernel with 8k?

Just pure speculation, though.

Bernhard

Brad.Scalio 05-14-10 05:50 AM

Re: Repeatable X Crash, dual FX580 on HP Z600
 
Thanks for the reply. Yes, the CONFIG_4KSTACKS flag is set to 'y' in the kernel configuration. Unfortunately, we are tied into CM to have this specific kernel. One option we are pursuing is updating to RHEL5.3, which "officially" supports the Nehalem chipsets, so that we can take that out of the equation as a possible factor.
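
A minimal sketch of how that check could be scripted across many workstations. It assumes a 32-bit x86 kernel (where stacks are 8k unless CONFIG_4KSTACKS=y) and that the distribution installs the kernel config as a file; the function name is made up for illustration.

```shell
# Print the kernel stack size implied by a kernel config file.
# Assumption: 32-bit x86 kernel, where stacks are 8k unless CONFIG_4KSTACKS=y.
kernel_stack_size() {
    config_file="$1"
    if grep -q '^CONFIG_4KSTACKS=y' "$config_file"; then
        echo "4k"
    else
        echo "8k"
    fi
}
```

On a RHEL-style install the config of the running kernel is normally under /boot, so a call would look like: kernel_stack_size "/boot/config-$(uname -r)"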

We have about 40 of these workstations out there, and there are consistent repeatable steps that can make X crash -- namely, opening up an application on each of the three heads, moving one partly off-screen, then launching another application (glxgears even crashes it) and resizing the windows.

Brad.Scalio 05-14-10 06:05 AM

XAA options with NVIDIA drivers 180.X +
 
Does anyone know whether the XAA options for off-screen pixmaps and the pixmap cache are still valid with the newer drivers? Whenever I put these options in xorg.conf I get messages that they are ignored.
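
A rough sketch for pulling those "ignored" messages out of the server log. It assumes the log is at /var/log/Xorg.0.log and that the server words the warning as Option "..." is not used, which may differ between X server versions; the function name is made up for illustration.

```shell
# List the options the X server reported as unused in its log.
# Assumptions: default log path, and the warning text 'Option "..." is not used'.
unused_xorg_options() {
    log_file="${1:-/var/log/Xorg.0.log}"
    grep 'Option ".*" is not used' "$log_file"
}
```

Called without arguments it reads the default log; pass another path to inspect a saved log from a crashed session.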

JaXXoN 05-14-10 07:14 AM

Re: Repeatable X Crash, dual FX580 on HP Z600
 
Quote:

Originally Posted by Brad.Scalio (Post 2251303)
we are tied into CM to have this specific kernel

Nevertheless I'd recommend compiling a kernel on your own, just to see if it
makes any difference. I couldn't see any hints in your logs as to what the
problem could be, so I fear you will have to perform a couple of experiments,
e.g. trying another distribution.

BTW: do your LCDs really have just 1280x1024? I guess those are projectors?
In this case you may like to consider purchasing a Matrox TripleHead2Go video
splitter, but in your case that would be 40 x $300 = $12,000 for all your workstations
(I guess you could get a volume discount), so I'm not sure if this is really a good
solution for you :-) On the other hand, depending on your requirements, multi-GPU
setups can be a PITA and those TH2Gs make life *much* easier on Linux, so investing
in those TH2Gs may even pay off over time (because of the reduced maintenance
effort).

Please check the following threads for details:

http://www.nvnews.net/vbulletin/showthread.php?t=133740
http://www.nvnews.net/vbulletin/showthread.php?t=126134

regards

Bernhard

Brad.Scalio 05-14-10 07:30 AM

Re: Repeatable X Crash, dual FX580 on HP Z600
 
PITA is an understatement

I will try the 8k stack and see how that goes. We have 40 out there right now, but we have stopped our deployment of these new workstations. In total, we have 1,005 of them, so forking over half a million for the TH2G is not an option right now ;-) although I am sure the cost of keeping these in a warehouse will come close to that if we don't resolve it soon.

JaXXoN 05-14-10 02:33 PM

Re: Repeatable X Crash, dual FX580 on HP Z600
 
Quote:

Originally Posted by Brad.Scalio (Post 2251326)
we have 1,005 of them, so forking over half a million for the TH2G is not an option right now ;-)

I see. I guess you already filed a bug report with nvidia? Maybe it is possible
to get in contact with nvidia on a support contract basis (fixing such a bug
shouldn't require more than 50 hours)? I mean they shouldn't have a problem
with you paying them to fix what should just work beforehand, eh? SCNR! :-)

Bernhard

Brad.Scalio 05-14-10 03:38 PM

Re: Repeatable X Crash, dual FX580 on HP Z600
 
I put the bug report in this thread. Is there a formal way to file such a report? I obtained a core dump from gdb as well as the memory maps for Xorg. Since gdb didn't provide much information in the backtrace, we were going to go through the core dump and see if we could find anything useful.

How would one go about contacting NVIDIA to get a bug report started, and is there contact information for such "support" sales?
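
One way the saved memory maps can help with the unresolved frames: look up which mapping contains a given address and how far into the mapped file it falls. A minimal sketch, assuming the maps were saved in /proc/PID/maps format and the addresses are 32-bit as on this PAE kernel; the function name is made up for illustration.

```shell
# Print the mapping that contains an address, plus the offset into the mapped file.
# Usage: locate_address 0x014bbf88 saved_maps_file
# Assumption: the file is in /proc/PID/maps format (start-end perms offset dev inode path).
locate_address() {
    addr=$(( $1 ))
    while read -r range perms offset dev inode path; do
        start=$(( 0x${range%-*} ))
        end=$(( 0x${range#*-} ))
        if [ "$addr" -ge "$start" ] && [ "$addr" -lt "$end" ]; then
            # File offset = distance into the mapping + the mapping's own file offset.
            printf '%s +0x%x\n' "${path:-[anonymous]}" $(( addr - start + 0x$offset ))
            return 0
        fi
    done < "$2"
    return 1
}
```

The resulting offset can then be compared against the symbol table of the library (for example with nm or objdump) even when gdb cannot unwind through it.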

Brad.Scalio 05-14-10 03:56 PM

Re: Repeatable X Crash, dual FX580 on HP Z600
 
One last thing today: I am going to try turning off DamageEvent reporting in xorg.conf to see if this "resolves" the problem or, more to the point, prevents the problem from causing a crash. At this point, prevention is as acceptable as a remedy.

Brad.Scalio 05-14-10 04:25 PM

Re: Repeatable X Crash, dual FX580 on HP Z600
 
(**) NVIDIA(0): Option "DamageEvents" "no"
(**) NVIDIA(0): Enabling RENDER acceleration
(II) NVIDIA(0): Support for GLX with the Damage and Composite X extensions is
(II) NVIDIA(0): enabled.

Even though I set DamageEvents to "no", I still see an informational message saying that GLX with the Damage and Composite X extensions is enabled. I assume this is OK, since for this problem I care more about XAA/EXA than GLX: the problem isn't connected to GLX in any way, and we aren't using any OpenGL calls in the apps that are causing the crashes.

JaXXoN 05-14-10 06:42 PM

Re: Repeatable X Crash, dual FX580 on HP Z600
 
Quote:

Originally Posted by Brad.Scalio (Post 2251664)
(II) NVIDIA(0): Support for GLX with the Damage and Composite X extensions is

hmm ... it's known that the composite extension doesn't work well with xinerama,
but as far as I can tell, xinerama is not enabled in your setup, so it should be fine.
Maybe composite also isn't stable with a multi-screen setup.

You may try disabling several extensions, just to see at which point things
start to get stable:

Code:

Section "Extensions"
    Option        "RENDER"      "False"
    Option        "DAMAGE"      "False"
    Option        "Composite"    "False"
EndSection
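
After restarting X it is worth confirming that the extensions are really gone. A small sketch that filters xdpyinfo output for the three names; it assumes xdpyinfo lists each extension on its own indented line, and the function name is made up for illustration.

```shell
# Filter xdpyinfo output (read from stdin) down to the three extensions of interest.
# Prints one line per extension the server still advertises; prints nothing if all are off.
still_enabled_extensions() {
    awk 'NF == 1 && ($1 == "RENDER" || $1 == "DAMAGE" || $1 == "Composite") { print $1 }'
}
```

Typical use from a terminal inside the X session: xdpyinfo | still_enabled_extensions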

Also, have you tried some of the kernel boot options that often helped in the past,
like: noapic noacpi maxcpus=1


Concerning the "official" bug report: the announcement of earlier driver
versions suggested to send a mail to "linux-bugs at nvidia.com":
http://www.nvidia.de/object/linux-di....36.24-de.html
Maybe that still works.

As for purchasing support, I guess the only "official" way is to contact
customer care (I guess you could file your bug report there, as well):
http://www.nvidia.com/object/driverq...assurance.html
(Then click "NVIDIA Customer Care")

With 2000 FX580s lying around, I would expect you to get at least a
little attention :-)

regards

Bernhard

Brad.Scalio 05-17-10 04:17 PM

Re: Repeatable X Crash, dual FX580 on HP Z600
 
From what I have seen so far, this is a problem in the nvidia driver. Another, less likely possibility is a bug in Xorg, or in both.

The code being executed is in response to a window changing size or moving. Based on the size, it could be either a top-level window or one of the larger panes in GFE (an in-house application that seems to always trigger the crash).

I think this code runs regardless of whether the DAMAGE, RENDER, or COMPOSITE extensions are active. The nvidia driver probably assumes OpenGL will be used, so turning off GLX will probably not prevent this code from running either, but it may be worth trying anyway. This wouldn't be a solution, since we do have one in-house app that requires OpenGL :-(

This is the call stack as I have been able to determine it:

_nv001111X+711280+104 (nvidia GCops PolyFillRect)
damagePolyFillRect+97 (damage.c:1238)
??? (not sure...)
??? _nv001111X+711280+?? (nvidia GCops PolyFillRect --- not actually seen in stack (because of tail call?))
miDbePositionWindow+1217 (midbe.c:713)
compPositionWindow+89
miSlideAndSizeWindow+461
compResizeWindow+172
ConfigureWindow+3120
ProcConfigureWindow+161
Dispatch+410
main+1157

I've opened a customer care support ticket, FWIW; we'll see what comes of it. Right now I have HP and RedHat involved as well but, shockingly, everyone is pointing fingers at everyone else.


All times are GMT -5.