nV News Forums (http://www.nvnews.net/vbulletin/index.php)
-   NVIDIA Linux (http://www.nvnews.net/vbulletin/forumdisplay.php?f=14)
-   -   Crashing when SMP enabled (http://www.nvnews.net/vbulletin/showthread.php?t=74547)

StevenChamberla 08-04-06 10:56 PM

Crashing when SMP enabled
 
2 Attachment(s)
Hi, the following bug might not even be caused by the nvidia driver, but I currently have no reliable way to determine whether the nvidia driver is or is not at fault.

I've been trying for over a week now to use my new dual-core Athlon properly in Linux. Everything seems okay in a uniprocessor kernel, but when I enable support for SMP, xorg almost always crashes the system hard. No log files are written to, and sshd becomes unreachable so I cannot determine any details after a crash.

What is most awkward is that when xorg crashes, I am unable to switch VT, reboot with ctrl-alt-delete, or even sync the disks with the magic SysRq key; therefore I am having to power cycle which has caused frequent filesystem corruption and has even resulted in my two oldest IDE drives dying after repeated spinning up/down.

First of all, I think the most important thing to mention is that the 'nv' driver (not 'nvidia') will not work at all, on either uniprocessor or SMP kernels. The display becomes garbled (mostly green lines with flashing areas) and then the kernel will sometimes crash. If someone could help me get the 'nv' driver working, I could establish for certain whether the 'nvidia' module is at fault.

I have tried kernels 2.6.16, 2.6.17, 2.6.17.6, 2.6.17.7 with no success. I always use the reiser4 patch otherwise I cannot mount my root partition to test it. I have tried with and without Ingo's realtime-preempt patches. I have only tried nvidia driver version 1.0-8762 but I'm under the impression that an earlier version will not compile under these kernels. I have tried resetting my kernel config to the 'defaults' by running 'make menuconfig' without a .config present, and then enabling the bare minimum to get the system to start up.

I run a 'make clean && make && make install' in the extracted nvidia kernel module source directory and start xorg using the 'nvidia' driver. In a uniprocessor kernel (with no additional kernel options), things seem to work perfectly, but if SMP is enabled I most often see the left-hand display (CRT, VGA) go all white and the right-hand display (LCD, DVI) go all black.

I have tried booting the SMP kernel using options "noacpi acpi=off pci=noapic irqpoll" and I do see the nVidia logo but both displays enter standby straight after and the system locks up. (I also tried the option "noapic", but something then crashes the kernel before X loads; the name of the offending module disappears off the top of the screen and I am then unable to Shift-PageUp.)

Since my graphics card uses PCI-Express, I don't know if the "NvAGP" option has any effect. However, I have tried setting the value to zero, to no effect; even with this setting, "NvAGP" still shows as "3" in /proc/driver/nvidia/registry.
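To compare what the kernel module actually reports against what was configured, the registry file can be read programmatically. The following is a minimal sketch, assuming the file holds one `Name: value` pair per line (the thread does not show its contents, so that layout is an assumption). The helper names `parse_registry` and `read_registry` and the sample text are hypothetical.

```python
def parse_registry(text):
    """Parse 'Name: value' lines into a dict; lines without a colon are skipped."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            settings[name.strip()] = value.strip()
    return settings


def read_registry(path="/proc/driver/nvidia/registry"):
    """Read the driver's registry file (present only while the module is loaded)."""
    with open(path) as handle:
        return parse_registry(handle.read())


# Hypothetical sample in the assumed format, not copied from a real system.
SAMPLE = "NvAGP: 3\nSomeOtherKey: 0\n"
```

Calling `read_registry().get("NvAGP")` after loading the module would then show whether a changed setting reached the kernel side at all.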

Finally I'll just describe the hardware: Athlon 64 X2 4400+, Asus A8N-SLI (tried upgrading to latest BIOS), XFX GeForce 6600 GT (PCI-Express).

Any help will be very much appreciated. Please understand that each time I try anything different I risk filesystem corruption since the system crashes with disks mounted rw and I am left unable to sync. I don't know of a way around this.

Note: Had to split nvidia-bug-report.log into two files because of this error
"Your file of 159.5 KB bytes exceeds the forum's limit of 100.0 KB for this filetype."

Thank you all!
--
Steven Chamberlain
steven@pyro.eu.org

jdtate101 08-06-06 01:14 PM

Re: Crashing when SMP enabled
 
I had the same sort of thing in both suse and ubuntu...turns out some motherboards have problems with SMP (specifically the irqbalancer). If you have smp enabled with irqbalance running, try turning that off, it worked for me, and now I have a 100% stable system. This was supposed to be fixed in 2.6.17.

Also try turning off the composite setting if you have it enabled in the xorg.conf file, that also was a major point of instability for me.

StevenChamberla 08-06-06 01:45 PM

Re: Crashing when SMP enabled
 
Quote:

Originally Posted by jdtate101
I had the same sort of thing in both suse and ubuntu...turns out some motherboards have problems with SMP (specifically the irqbalancer). If you have smp enabled with irqbalance running, try turning that off, it worked for me, and now I have a 100% stable system. This was supposed to be fixed in 2.6.17.

Thanks for the advice, I still had the problem in 2.6.17 but I will try anyway by booting the kernel with the 'noirqbalancer' option.
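One way to see whether interrupts are actually being spread across both cores is to look at the per-CPU counters in /proc/interrupts before and after changing the boot option. Below is a rough sketch that parses that file; the function names and the sample text are made up for illustration, and the reading of the result (an IRQ served by one CPU only) is just a hint, not proof that balancing is on or off.

```python
def parse_interrupts(text):
    """Return (cpu_names, {irq_label: [count per CPU]}) from /proc/interrupts text."""
    lines = text.splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()[:len(cpus)]
        # Summary rows such as ERR carry a single total, not per-CPU counts.
        if len(fields) == len(cpus) and all(f.isdigit() for f in fields):
            table[label.strip()] = [int(f) for f in fields]
    return cpus, table


def single_cpu_irqs(table):
    """IRQs whose interrupts were all delivered to one CPU."""
    return sorted(irq for irq, counts in table.items()
                  if sum(counts) > 0 and max(counts) == sum(counts))


# Hypothetical sample, for illustration only.
SAMPLE = """\
           CPU0       CPU1
  0:     100000          0    IO-APIC-edge  timer
 16:       5000       4800   IO-APIC-level  nvidia
ERR:          0
"""
```

On a live system the input would come from `open("/proc/interrupts").read()`; comparing two snapshots taken a few seconds apart shows which CPU is servicing each device.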

Quote:

Originally Posted by jdtate101
Also try turning off the composite setting if you have it enabled in the xorg.conf file, that also was a major point of instability for me.

I do not have the 'composite' extension in my xorg.conf file, but it seems to be somehow built-into Debian's distribution of xorg:
Code:

(II) Initializing built-in extension COMPOSITE

I've not been able to disable it; even adding the following had no effect:
Code:

Section "Extensions"
    Option      "Composite"    "Disable"
EndSection

I think I may have to recompile xorg myself to disable the extension.

I also realised that I have not yet tried disabling TwinView so when I am ready to restart for another test, I will do that at the same time. If I can get either the 'nv' or 'nvidia' driver to work with TwinView disabled I can be more sure that the problem lies with the 'nvidia' driver and not the kernel or hardware.

chunkey 08-06-06 02:10 PM

Re: Crashing when SMP enabled
 
Quote:

Originally Posted by StevenChamberla
What is most awkward is that when xorg crashes, I am unable to switch VT, reboot with ctrl-alt-delete, or even sync the disks with the magic SysRq key; therefore I am having to power cycle which has caused frequent filesystem corruption and has even resulted in my two oldest IDE drives dying after repeated spinning up/down.

Are you sure your PSU is OK?
Since I've "more or less the same system".
4200+ X2, A8N-SLI (Premium) 1009 Bios, 2x Gigabyte 6600GT (SLI works.)

StevenChamberla 08-06-06 02:31 PM

Re: Crashing when SMP enabled
 
Quote:

Originally Posted by chunkey
Are you sure your PSU is OK?
Since I've "more or less the same system".
4200+ X2, A8N-SLI (Premium) 1009 Bios, 2x Gigabyte 6600GT (SLI works.)

May I ask what size PSU you are using? I understand SLI requires a lot of power whereas I use only a single 6600GT.

I was actually suspicious about my PSU already...
I noticed that all of the voltages (3.3 V, 12 V, etc.) were measuring noticeably lower than they should be. My cheap PSU rated at 550W was originally powering five SCSI disk drives as well as the two IDEs, and a large number of fans to keep them all cool. This was stable with my single-core Athlon64.

However, I tried powering up the system with power only to the motherboard and CPU fan connected, and the BIOS hardware monitor showed that the voltages were still low. I assumed from this that it was probably 'natural' for that PSU to put out those voltages (and the voltage drop wasn't due to being overloaded). This is when I (perhaps wrongly) 'ruled-out' the PSU as being the problem. Unfortunately with the hard drives disconnected I can't boot Linux at all so I can't test the stability under an SMP kernel.
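To put a number on "lower than they should be", each reading can be compared with its nominal value. This is a small sketch assuming the usual ±5% tolerance that the ATX12V design guide gives for the positive rails. The readings in it are hypothetical (no actual figures were posted), and BIOS hardware monitors are often imprecise, so a multimeter reading would be a better input.

```python
TOLERANCE = 0.05  # assumed +/-5 % band for the +3.3 V, +5 V and +12 V rails


def rail_deviation(nominal, measured):
    """Signed deviation of a reading from its nominal value, as a fraction."""
    return (measured - nominal) / nominal


def out_of_spec(readings, tolerance=TOLERANCE):
    """Return {nominal: deviation} for rails outside the tolerance band."""
    bad = {}
    for nominal, measured in readings.items():
        deviation = rail_deviation(nominal, measured)
        if abs(deviation) > tolerance:
            bad[nominal] = deviation
    return bad


# Hypothetical readings for illustration; not taken from the thread.
readings = {3.3: 3.20, 5.0: 4.90, 12.0: 11.30}
```

With these example numbers only the 12 V rail falls outside the band, at roughly 5.8% low.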

Also, when using the dual-core chip under a uniprocessor kernel, I've not experienced any crashes in over a week, whereas the SMP kernel will crash as soon as X starts up.

If you think the PSU's low output could be causing this I will get hold of a better quality PSU to try. In the meantime I have already ordered an external storage array to house my SCSI drives and power them, to take load off the base unit's PSU. I may even use it as an opportunity to 'start over' with a fresh Debian install and see if I run into the same problems again or not.

Thanks again, I really appreciate these suggestions.

StevenChamberla 08-06-06 04:03 PM

Re: Crashing when SMP enabled
 
As I write this I am finally using an SMP kernel and the 'nvidia' driver, TwinView enabled.

I'm currently in the process of figuring out which of the many things I changed has stopped X from crashing the whole system on startup. I recompiled xorg without 'composite', enabled SMP in the kernel, recompiled the nvidia drivers and booted the kernel with 'noirqbalancer' and probably changed other things.

I'm not yet totally convinced that everything is 100% stable -- can anyone recommend any benchmarks or similar software to ensure everything is working properly?

Thanks very much to the two people who replied with helpful advice.

I will be in touch once I figure out exactly what happened to get X to not crash.

chunkey 08-06-06 04:27 PM

Re: Crashing when SMP enabled
 
- games like ut2004, quake4/doom3 (you can get the demos), chromium, tuxracer, ...
- googleearth (for linux)
- memtest86+ / cpuburn
- x11perf
- NVCrash
- run glxgears... and wait... wait... wait...
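For a quick check that needs nothing beyond a Python interpreter, the idea behind a burn-in test can be sketched as follows: give every core the same deterministic workload and compare the results. This is only an illustration, not a replacement for memtest86+ or cpuburn. The names `burn`, `stress` and `is_consistent` are invented here, and it relies on CPython's hashlib releasing the GIL for large buffers so that the threads really run in parallel.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK = bytes(range(256)) * 4096  # 1 MiB of fixed, known data


def burn(rounds):
    """Feed the same block into SHA-256 `rounds` times and return the hex digest."""
    h = hashlib.sha256()
    for _ in range(rounds):
        h.update(BLOCK)  # large updates release the GIL, so threads overlap
    return h.hexdigest()


def stress(workers, rounds):
    """Run identical workloads on `workers` threads; return the set of distinct digests."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return set(pool.map(burn, [rounds] * workers))


def is_consistent(workers=None, rounds=200):
    """True if every worker computed the same digest."""
    workers = workers or os.cpu_count() or 1
    return len(stress(workers, rounds)) == 1
```

Raising `rounds` lengthens the run. A `False` result would point at computation errors, while a hard lockup simply means the call never returns.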

jdtate101 08-08-06 09:24 AM

Re: Crashing when SMP enabled
 
Glad to hear it worked for you!

It's incredibly frustrating to have crash after crash and not be able to lock down the cause. It took me over 4 rebuilds to figure it out, by which time I was "almost" ready to go back to Windows. :(

My Ubuntu is now 100% stable and rocks along with Quake4 at a right old lick.

Perhaps they will sort this out soon, so I can turn Irqbalancer on again.

Just for the forum search engine, other side effects of this bug are:

1) Jumpy crazy mouse & keyboard input after periods of low cpu activity, only solved by a reboot. (only USB keyb/mouse are affected)

2) HDD crashed "sense data lost" errors for sata drives, requiring a reboot (as the cpu has lost the irq for the drive).

3) Hardlocks when running intensive 3D programs.

Hope this helps anyone who searches for these things.....


StevenChamberla 08-12-06 05:49 PM

Re: Crashing when SMP enabled
 
Hello again...

I thought this issue was solved but I realised soon after posting that the system would crash under high CPU load. This would happen with/without SMP and the kernel would also crash when not using X at all. But I also have other problems which do seem to be specific to the nvidia driver.

So, in response to a suggestion in this thread I changed the PSU for a much more powerful one to no effect. I've also started over with a fresh Debian install onto a new external storage array, doing away with software RAID and reiser4 to simplify things.

I have been trying as many kernel versions as possible to see if there is one which works for me. For realtime audio I require my kernel to be patched with Ingo Molnar's 'realtime-preempt' patches, which limits my choice of kernels. Some of the patched kernels will not even compile (2.6.13-rt14).

I originally had the nvidia driver working under 2.6.17-rt8. The problem with this was the random crashes under high CPU load, and realtime preemption did not seem to be working properly (JACK daemon suffering from frequent xruns).

I have just tested 2.6.14-rt22 with an interesting result. Without X running I've not had any crashes yet. The open-source 'nv' driver is broken, as always. But if I start X with the 'nvidia' driver I suffer the exact same problem as I originally reported here (left display all white, right display all black, and complete lockup). However, the 'vesa' driver works fine, though without TwinView support.

To try and get the 'nvidia' driver working, I tried the same things as I did under 2.6.17:
  • booting the kernel with 'noirqbalancer' option
  • booting with 'noapic acpi=off' (2.6.14-rt22 crashes before init)
  • recompiling Xorg with 'composite' disabled
  • RenderAccel 0
  • disabling TwinView

Annoyingly, Windows XP (32-bit) works perfectly, and has never crashed on me yet.

I am truly puzzled as to why this bug keeps coming back. In the meantime I can only think to try other kernels, maybe even going back to a realtime-patched 2.6.12 and/or older 'nvidia' drivers.

I've attached a nvidia-bug-report.log with X running under the 'vesa' driver, since the 'nvidia' driver crashes the system. There are some kernel errors to note, perhaps a result of using the realtime patches.

Thanks for any help!

netllama 08-12-06 05:54 PM

Re: Crashing when SMP enabled
 
I don't see anything attached to your latest post; however, if you're seeing a crash when not using X at all, then you really need to start ruling out factors unrelated to the nvidia X driver.

Perhaps the system is overheating?
Perhaps you have faulty hardware?
Perhaps the BIOS has a bug?

Setting up a serial console would be a good first step to track down any kernel messages that are generated when the system hangs.
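As a rough illustration of the capture side: with the crashing machine booted with something like `console=ttyS0,115200n8 console=tty0` on the kernel command line, a second machine on the other end of the cable can log whatever arrives. The sketch below assumes the receiving port has already been set to the matching baud rate (for example with stty). The function name `capture` and the log file name are hypothetical.

```python
import os
import time


def capture(source, sink, clock=time.time):
    """Copy lines from `source` to `sink`, prefixing each with a timestamp.

    Each line is flushed (and fsynced when `sink` is a real file) right away,
    so the last message before a hang is already on disk.
    """
    count = 0
    for line in source:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(clock()))
        sink.write("[%s] %s\n" % (stamp, line.rstrip("\r\n")))
        sink.flush()
        try:
            os.fsync(sink.fileno())
        except (AttributeError, OSError, ValueError):
            pass  # in-memory sinks have no file descriptor
        count += 1
    return count
```

On the logging machine this could be run as `capture(open("/dev/ttyS0", errors="replace"), open("crash.log", "a"))`, left running until the hang occurs.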

You really need to ensure that the system is stable without X before trying to debug a more complex setup that includes X.

Thanks,
Lonni

nukem 08-12-06 06:12 PM

Re: Crashing when SMP enabled
 
I had a similar problem with my setup. I found that disabling any parts of the motherboard that I did not use helped a lot. For example, I disabled the nvidia ethernet card (I use the Marvell one) and the Sil SATA controller (I use the nvidia one). What I found is that the problem really comes down to a timing issue: the system clock seems to speed up under load. Try compiling something and watch your clock (make sure you can see seconds). I put ntp on my system and all seems to be fine. Oh, one more thing: it was more of a problem on older kernels (like yours); I'm on the latest (2.6.17.8) and it's fine.

StevenChamberla 08-12-06 06:56 PM

Re: Crashing when SMP enabled
 
2 Attachment(s)
Quote:

Originally Posted by netllama
I don't see anything attached to your latest post

Oops, sorry, I forgot to attach it, so here it is.

After a careful read-through, I just stumbled upon this... it's coming from the 'vesa' driver, should I be worried?
Code:

(WW) VESA(0): Bad V_BIOS checksum

The 'dmesg' is quite messy for this kernel, with warnings from the realtime-preempt code and a timer-related bug.

Quote:

Originally Posted by netllama
Perhaps the system is overheating?

CPU temp 50°C, and is never higher than this
Motherboard temp. stays below 40°C
I will need to boot into Windows to check the GPU temperature but I believe it stays between 40-60°C which I hope is normal.

Quote:

Originally Posted by netllama
Perhaps you have faulty hardware?

I never had problems until I changed the Athlon64 chip for a brand-new Athlon64 X2... the rest is all less than one year old, so I really hope it's not a hardware fault.

Quote:

Originally Posted by netllama
Perhaps the BIOS has a bug?

When I first experienced the crash starting the 'nvidia' driver it was using the BIOS as shipped; I have since flashed it with the latest version currently available.

Quote:

Originally Posted by netllama
Setting up a serial console would be a good first step to track down any kernel messages that are generated when the system hangs.

Oh, I will try this as soon as I can. For this I'll need to find a long serial cable...

Quote:

Originally Posted by netllama
You really need to ensure that the system is stable without X before trying to debug a more complex setup that includes X.

The easiest way to crash it before was to set a kernel or Xorg compile going. My system crashed more than 10 times(!) under 2.6.17 trying to run a compile for 2.6.14. Now under 2.6.14-rt22 with SMP enabled I can run X under the 'vesa' driver and compile kernels and Xorg in one go. I still only have 2 hours' uptime so I guess it's too soon to be sure.

For now I think I'll build all the kernels I can from 2.6.12 to 2.6.17, and if my system manages to do that without crashing I can be more confident that this kernel is stable when not running the 'nvidia' driver. And if it is somehow a hardware bug, maybe I'll get lucky with the right combination of kernel/options/driver...


All times are GMT -5.