Old 12-19-06, 03:26 PM   #1
avoncampe
Registered User
 
Join Date: Dec 2006
Posts: 12
Random hangs on CentOS 4.4

I manage a little over a dozen CentOS 4.4 systems that have been hanging randomly. The first time I didn't think much of it, but it has been happening more frequently and to more users lately. When these systems hang the display is frozen and you can't log into them remotely (they don't even respond to a ping). This morning, I had another hang, but this time I was able to ssh in and observe what was going on. First, a little background about our setup (all systems are identical):

o Lenovo ThinkCentre M52
o 3 GB memory, 80 GB hard disk
o XFX GeForce 6200 graphics card

The hang this morning left the screen frozen on a screensaver (sorry, I don't know which one). I could not switch to an alternate console, but after logging in remotely I noticed that the X server was using 100% of the CPU, as shown by this excerpt from top:

top - 08:44:22 up 10 days, 23:00, 10 users, load average: 1.04, 1.01, 1.00
Tasks: 115 total, 2 running, 113 sleeping, 0 stopped, 0 zombie
Cpu(s): 99.7% us, 0.3% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Mem: 3113468k total, 1361240k used, 1752228k free, 87312k buffers
Swap: 3047416k total, 0k used, 3047416k free, 957756k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4381 root 25 0 67748 42m 7776 R 99.8 1.4 782:53.37 X

In addition, the following lines were in /var/log/messages:

Dec 18 19:56:02 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000001
Dec 18 19:56:03 hepdsw04 kernel: NVRM: Xid (0001:00): 9, Channel 00000020 Instance 00000000 Intr 00100000
Dec 18 19:56:09 hepdsw04 Synergy 1.3.1: NOTE: CServerProxy.cpp,315: server is dead
Dec 18 19:56:10 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000020
Dec 18 19:56:11 hepdsw04 kernel: NVRM: Xid (0001:00): 9, Channel 00000020 Instance 00000000 Intr 00100000
Dec 18 19:56:18 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000020
Dec 18 19:56:19 hepdsw04 kernel: NVRM: Xid (0001:00): 9, Channel 00000020 Instance 00000000 Intr 00100000
Dec 18 19:56:26 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000020
Dec 18 19:56:27 hepdsw04 kernel: NVRM: Xid (0001:00): 9, Channel 00000020 Instance 00000000 Intr 00100000
Dec 18 19:56:34 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000001
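
To compare entries like these across a dozen machines, a small script can tally them per host and Xid number. This is a minimal sketch and not an NVIDIA tool: the regular expression is inferred only from the lines quoted above, the function name tally_xid is made up for illustration, and the script says nothing about what each Xid number means.

```python
import re
from collections import Counter

# Pattern inferred from the /var/log/messages lines quoted above (an assumption,
# not an official format description):
#   "<timestamp> <host> kernel: NVRM: Xid (<device>): <code>, Channel <hex>"
XID_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\skernel:\s"
    r"NVRM: Xid \((?P<device>[^)]*)\): (?P<code>\d+), Channel (?P<channel>[0-9a-fA-F]+)"
)

def tally_xid(lines):
    """Count NVRM Xid entries per (host, Xid code); other lines are ignored."""
    counts = Counter()
    for line in lines:
        match = XID_RE.match(line)
        if match:
            counts[(match["host"], int(match["code"]))] += 1
    return counts

if __name__ == "__main__":
    sample = [
        "Dec 18 19:56:02 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000001",
        "Dec 18 19:56:03 hepdsw04 kernel: NVRM: Xid (0001:00): 9, Channel 00000020 Instance 00000000 Intr 00100000",
        "Dec 18 19:56:09 hepdsw04 Synergy 1.3.1: NOTE: CServerProxy.cpp,315: server is dead",
    ]
    for (host, code), n in sorted(tally_xid(sample).items()):
        print(f"{host} Xid {code}: {n}")
```

In practice the lines would come from reading /var/log/messages (or a central log, if remote syslog is set up) instead of the hard-coded sample.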

I installed the NVIDIA driver on all systems by first creating a custom installer using the command "./NVIDIA-Linux-x86-1.0-8774-pkg1.run --add-this-kernel", and then running the resulting script using the -s option on all the other systems while at run level 3. I repeat these steps whenever a new kernel is released. I just noticed that the 1.0-9631 driver was released, and I will be installing that shortly. BTW, whenever I run the custom installer with the -s option as described above, I get the following warning:

Your driver installation has been altered since it was initially installed; this may happen, for example, if you have since installed the NVIDIA driver through a mechanism other than the nvidia-installer (such as rpm or with the NVIDIA tarballs). The nvidia-installer will attempt to uninstall as best it can. Please see the file '/var/log/nvidia-installer.log' for details.

Is this expected? Is this the right way to quickly install the nvidia driver on multiple systems?

Finally, attached is the output from nvidia-bug-report.sh and my xorg.conf file.

In summary, I have been experiencing random system hangs where I cannot even get into the system. Today, I was able to get in and collect the above information. I can configure syslog on these systems to log any useful information remotely, so that we can get some more data. Please let me know what I should include in the first column of /etc/syslog.conf.

Thanks,
Alfred
Attached Files
File Type: gz nvidia.tar.gz (23.6 KB, 109 views)
Old 12-19-06, 03:46 PM   #2
netllama
NVIDIA Corporation
 
Join Date: Dec 2004
Posts: 8,763
Re: Random hangs on CentOS 4.4

I'd be curious if this problem reproduces with 1.0-9631 and 1.0-9742.

I have a few questions:
0) Did these systems ship with the GeForce 6200, or was that added aftermarket?
1) Have you verified that you're using the latest BIOS for the system?
2) Please post the output from lspci.
3) If you boot with the noapic and/or acpi=off kernel parameters, does that have any impact on the problem?
4) Does this problem persist with a more recent kernel?
5) Are you able to set up a serial console to capture kernel messages from when the system has hung?

Thanks,
Lonni
Old 12-19-06, 04:29 PM   #3
avoncampe
Registered User
 
Join Date: Dec 2006
Posts: 12
Re: Random hangs on CentOS 4.4

Quote:
Originally Posted by netllama
I'd be curious if this problem reproduces with 1.0-9631 and 1.0-9742.
It can take days or weeks for the hang to occur, so it's not easy to tell if the new driver fixes it. BTW, what is 1.0-9742? I can't find anything later than 1.0-9631 on the download page.

Quote:
0) Did these systems ship with the GeForce 6200, or was that added aftermarket?
Aftermarket. I could not get the on-board video to drive the LCDs at their native 1600x1200 resolution, so I ordered and installed the GeForce 6200 cards. And I had to install the binary driver, because I could not configure the open source driver to use the correct refresh frequency.

Quote:
1) Have you verified that you're using the latest BIOS for the system?
I just checked, and it looks like there is a newer BIOS available. I'll be installing that shortly as well as the latest NVIDIA driver (1.0-9631).

Quote:
2) Please post the output from lspci.
# lspci
00:00.0 Host bridge: Intel Corporation 945G/GZ/P/PL Express Memory Controller Hub (rev 02)
00:01.0 PCI bridge: Intel Corporation 945G/GZ/P/PL Express PCI Express Root Port (rev 02)
00:1b.0 Audio device: Intel Corporation 82801G (ICH7 Family) High Definition Audio Controller (rev 01)
00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1 (rev 01)
00:1c.1 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 2 (rev 01)
00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #3 (rev 01)
00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #4 (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01)
00:1f.2 IDE interface: Intel Corporation 82801GB/GR/GH (ICH7 Family) Serial ATA Storage Controller IDE (rev 01)
00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 01)
01:00.0 VGA compatible controller: nVidia Corporation NV44 [GeForce 6200 TurboCache(TM)] (rev a1)
02:00.0 Ethernet controller: Intel Corporation 82573E Gigabit Ethernet Controller (Copper) (rev 03)
0a:0a.0 Serial controller: NetMos Technology PCI 9835 Multi-I/O Controller (rev 01)

Quote:
3) If you boot with the noapic and/or acpi=off kernel parameters, does that have any impact on the problem?
I have not tried this, and it will take a while to see if this fixes the problem.

Quote:
4) Does this problem persist with a more recent kernel?
I am running the latest and greatest CentOS 4.4 (RHEL 4 Update 4 clone) kernel.

Quote:
5) Are you able to set up a serial console to capture kernel messages from when the system has hung?
I have not tried this yet, but will be looking into this soon.

Thanks,
Alfred
Old 12-19-06, 04:38 PM   #4
netllama
NVIDIA Corporation
 
Join Date: Dec 2004
Posts: 8,763
Re: Random hangs on CentOS 4.4

1.0-9742:
http://www.nzone.com/object/nzone_do...etadriver.html

I understand that you're running the latest RHEL4 kernel; however, I was interested in whether a kernel.org kernel helped at all.

Thanks,
Lonni
Old 12-19-06, 05:07 PM   #5
avoncampe
Registered User
 
Join Date: Dec 2006
Posts: 12
Re: Random hangs on CentOS 4.4

Quote:
Originally Posted by netllama
Since my problem is intermittent, unless we can determine the cause and I know for sure that this new driver fixes the problem, I'd rather not install a beta driver.

Quote:
I understand that you're running the latest RHEL4 kernel, however I was interested in whether a kernel.org kernel helped at all.
That defeats the purpose of going with an Enterprise distribution in the first place. I want the stability that comes with RHEL/CentOS. Otherwise, I'd be using FC5 or FC6.

Alfred
Old 12-19-06, 11:42 PM   #6
avoncampe
Registered User
 
Join Date: Dec 2006
Posts: 12
Re: Random hangs on CentOS 4.4

Over the next few days I will be updating the BIOS and installing the latest NVIDIA driver on all the CentOS 4.4 systems. But before I do, I wanted to confirm that the method I described above for installing the NVIDIA driver is the recommended one:

Step 1: ./NVIDIA-Linux-x86-1.0-XXXX-pkg1.run --add-this-kernel
Step 2: ./NVIDIA-Linux-x86-1.0-XXXX-pkg1-custom.run -s

(Step 1 is performed on one system and step 2 on all the remaining systems.) As mentioned in the original post, this produces a warning.
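
The two steps can be sketched as a small script that builds the commands to run from the machine where the custom installer is created. This is only an illustration of the procedure as I described it: the host names are placeholders, distributing the file with scp and running it over ssh as root are assumptions, and the driver version stays a parameter. It only prints the commands rather than executing them.

```python
import shlex

def installer_names(version):
    """Return (stock, custom) installer file names for a driver version such as '8774'."""
    stock = f"NVIDIA-Linux-x86-1.0-{version}-pkg1.run"
    custom = f"NVIDIA-Linux-x86-1.0-{version}-pkg1-custom.run"
    return stock, custom

def build_commands(version, hosts, remote_dir="/root"):
    """Build the command list: step 1 once locally, step 2 on each remaining host."""
    stock, custom = installer_names(version)
    # Step 1: create a custom installer that includes a module for the running kernel.
    commands = [[f"./{stock}", "--add-this-kernel"]]
    for host in hosts:
        target = f"{remote_dir}/{custom}"
        # Assumption: scp/ssh as root; the post does not say how the file is distributed.
        commands.append(["scp", custom, f"root@{host}:{target}"])
        # Step 2: silent install (-s); in the post this is run while at run level 3.
        commands.append(["ssh", f"root@{host}", f"sh {shlex.quote(target)} -s"])
    return commands

if __name__ == "__main__":
    # 'hostA' and 'hostB' are placeholder names, not machines from this thread.
    for cmd in build_commands("8774", ["hostA", "hostB"]):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the commands first makes it easy to review them before running anything on a dozen machines.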

Ideally, I'd like to be able to install the NVIDIA driver for a newly installed kernel without booting into that kernel first. I haven't been able to get this to work, so I would appreciate either a step-by-step description or a pointer to a HOWTO.

Finally, what is the meaning of the entries in /var/log/messages that I included in the original post? Here are some examples:

Dec 18 19:56:02 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000001
Dec 18 19:56:03 hepdsw04 kernel: NVRM: Xid (0001:00): 9, Channel 00000020 Instance 00000000 Intr 00100000

Thanks,
Alfred
All times are GMT -5.


Copyright 1998 - 2014, nV News.