nV News Forums > Linux Support Forums > NVIDIA Linux

Old 01-19-11, 09:32 AM   #1
Steffen M.
Registered User
 
Join Date: Jul 2005
Posts: 5
Sporadic freezes and the changes between 260.19.21, 260.19.26 and 260.19.29

Hi all,

I've encountered sporadic freezes of my system that most probably began with 260.19.21 on kernel 2.6.35.9 (or even earlier?). "Freeze" here means that the system doesn't respond in any way: the mouse doesn't move, the keyboard doesn't work (the keyboard LEDs don't even toggle, X.Org can't be killed, CTRL+ALT+Fx doesn't work, and even ALT+SYSRQ does nothing), and the machine doesn't answer pings on the network. Nothing is logged.

The reason I'm posting this in an NVIDIA forum is that the freezes only occurred during desktop activity on the machine. Running jobs in the background didn't trigger them, and even very high load (e.g. Prime95 on six cores for several days) didn't cause the problem, but sitting in front of the machine and using Firefox or Flash applications did. I therefore see some kind of relation between the freezes and the graphics part of the system.

I've run memory tests (memtest86) and stress tests (Prime95). No errors have been found. The RAM which I use is ECC-capable, ECC is activated in BIOS. No machine check events have been logged.

The problem appeared when running NVIDIA driver 260.19.21 (or earlier?) together with kernel 2.6.35.9. It persisted after upgrading to 260.19.26 (Beta). After that, around Christmas, I upgraded both the kernel and the NVIDIA driver, so I have been running kernel 2.6.35.10 and NVIDIA driver 260.19.29 since then. The problem seems to be gone now; at least I haven't had any freeze since the upgrade.

What I would like to know is whether anyone else had similar problems that were fixed by upgrading either the NVIDIA driver to 260.19.29 or the kernel to 2.6.35.10. I didn't find anything in the changelog that would explain why the problem might have been solved.

My dilemma is that I don't know whether the problem is definitely gone or whether it has just become rarer and simply hasn't shown up again yet. So I would be very glad to hear the opinions and experiences of other people who suffered from this problem. Perhaps one of you made a similar observation.

Of course, I know that upgrading kernel and NVIDIA drivers at the same time wasn't the best idea to work out what the cause of the problem might have been.

Some information about my system:

- Mainboard: ASUS M4A89TD Pro/USB3, Chipset: AMD 890FX
- BIOS version: 1101 (most recent version)
- Processor: AMD Phenom II X6 1090T
- Memory: 4 x Kingston KVR1333D3E9S/4G
- Graphics Adapter: PNY NVIDIA GeForce 6600GT

- Kernel: 2.6.35.9, later 2.6.35.10 (from "kernel.org"), x86_64
- X.Org: 1.8.2 (openSUSE 11.3, x86_64)

Thank you very much in advance!

Kind regards,
Steffen
Old 01-19-11, 12:32 PM   #2
jpi110
Gentoo User
 
Join Date: Jan 2011
Location: Portland, Oregon
Posts: 14
Re: Sporadic freezes and the changes between 260.19.21, 260.19.26 and 260.19.29

Quote:
Some information about my system:

- Mainboard: ASUS M4A89TD Pro/USB3, Chipset: AMD 890FX
- BIOS version: 1101 (most recent version)
- Processor: AMD Phenom II X6 1090T
- Memory: 4 x Kingston KVR1333D3E9S/4G
- Graphics Adapter: PNY NVIDIA GeForce 6600GT

- Kernel: 2.6.35.9, later 2.6.35.10 (from "kernel.org"), x86_64
- X.Org: 1.8.2 (openSUSE 11.3, x86_64)
I have a system somewhat similar to yours. ASUS M4N98TD Evo with the same CPU. Our video cards, however, are radically different. I use 2 Nvidia GeForce GTX 295s. I ran into similar stability issues with 2 CPUs (not 2 cores) as recently as last week.

However, my issues were resolved by switching to a 1 cpu (6 core) model. It's hard to say if that is the same issue for you since there's no logging or debugging to confirm/deny what's going on. My suggestion, if possible, is to have a terminal window open (ssh most likely) to your system from another one. Watch the load (htop/top/dstat). See if one of the CPU cores is at 100% when the system dies. Check /var/log/messages (or whatever is appropriate for your distribution).

And check to see if /var/log/Xorg.0.log says something cryptic about a driver issue and that it's trying to recover (but never does). This would definitely be symptomatic of what I (and several others) have seen.
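
A minimal sketch of the "watch the load from another machine" idea, assuming a Linux system where /proc/stat is readable. It is meant to be started inside an SSH session so that the last line printed before a freeze stays visible on the remote terminal. The function names (read_cpu_times, busy_percent, monitor) are made up for this illustration.

```python
import time


def read_cpu_times(stat_text):
    """Return {cpu_name: (busy_jiffies, total_jiffies)} for the per-core lines of /proc/stat."""
    times = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Per-core lines look like "cpu0 user nice system idle iowait ...".
        # The aggregate "cpu" line and all non-CPU lines are skipped.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # Use only the first eight counters; later "guest" counters are already part of "user".
        values = [int(v) for v in fields[1:9]]
        idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
        total = sum(values)
        times[fields[0]] = (total - idle, total)
    return times


def busy_percent(before, after):
    """Per-core busy percentage between two samples returned by read_cpu_times()."""
    result = {}
    for cpu, (busy1, total1) in after.items():
        busy0, total0 = before.get(cpu, (0, 0))
        delta_total = total1 - total0
        result[cpu] = 100.0 * (busy1 - busy0) / delta_total if delta_total > 0 else 0.0
    return result


def monitor(interval=1.0, path="/proc/stat"):
    """Print one timestamped line of per-core usage per interval until interrupted."""
    with open(path) as f:
        previous = read_cpu_times(f.read())
    while True:
        time.sleep(interval)
        with open(path) as f:
            current = read_cpu_times(f.read())
        usage = busy_percent(previous, current)
        line = " ".join(f"{cpu}={pct:5.1f}%" for cpu, pct in sorted(usage.items()))
        # flush=True so the line reaches the SSH terminal immediately.
        print(time.strftime("%H:%M:%S"), line, flush=True)
        previous = current
```

Calling monitor() over SSH and looking at the last line that arrived shows whether one core was pinned at 100% just before the machine stopped responding.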

Last edited by jpi110; 01-19-11 at 03:18 PM. Reason: Update to clarify word choice.
Old 01-28-11, 02:38 PM   #3
Steffen M.
Registered User
 
Join Date: Jul 2005
Posts: 5
Re: Sporadic freezes and the changes between 260.19.21, 260.19.26 and 260.19.29

Hi jpi110,

thank you very much for your reply!

At the time when I started this thread, I thought that upgrading to kernel 2.6.35.10 and NVIDIA 260.19.29 made the problem go away.

However, I have had freezes again in the meantime, but maybe I've caught the culprit...

Quote:
Originally Posted by jpi110 View Post
I have a system somewhat similar to yours. ASUS M4N98TD Evo with the same CPU. Our video cards, however, are radically different. I use 2 Nvidia GeForce GTX 295s. I ran into similar stability issues with 2 CPUs (not 2 cores) as recently as last week.
So the problem you ran into occurred on a different board, did I get this correctly? (I'm a bit confused as the M4N98TD is a single-socket board).

Quote:
Originally Posted by jpi110 View Post
However, my issues were resolved by switching to a 1 cpu (6 core) model. It's hard to say if that is the same issue for you since there's no logging or debugging to confirm/deny what's going on.
I fully understand, as I have/had the same problem: Not even a tiny bit of logging information.

I tried a lot of things to get some logging information, but the machine just freezes totally, i.e. network connections (like SSH) also just ran into timeouts. The last visible output of vmstat, top, htop and so on, run over a remote login, did not show anything of interest. There is no process that is greedy for memory or CPU. Besides the hints you gave me, I also used the "netconsole" kernel module, which works at a very low level - at least the documentation states this. But even with "netconsole", I didn't get any information before the crash. There is only one thing that I have not tried: the serial console - simply because my second computer, a MacBook, doesn't have an RS232 serial interface and I didn't have a serial-to-USB converter at hand.
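
When nothing at all gets logged, one more thing that can at least pin down the moment of a freeze is a heartbeat sent to a second machine. The following is only a sketch under assumptions: it uses plain UDP datagrams from user space (so it is not a replacement for netconsole or a serial console), and the host and port passed to send_heartbeats are placeholders for whatever listener runs on the other computer.

```python
import os
import socket
import time


def format_heartbeat(sequence, now, loadavg):
    """Build one heartbeat line: sequence number, local timestamp, 1/5/15 min load averages."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    load = " ".join(f"{value:.2f}" for value in loadavg)
    return f"{sequence} {stamp} load={load}"


def send_heartbeats(host, port, interval=1.0):
    """Send one UDP datagram per interval to (host, port) until interrupted.

    host and port are hypothetical; they must point at a UDP listener
    on a second machine that records what it receives.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sequence = 0
    while True:
        message = format_heartbeat(sequence, time.time(), os.getloadavg())
        sock.sendto(message.encode("ascii"), (host, port))
        sequence += 1
        time.sleep(interval)
```

The last sequence number and timestamp received on the other machine bound the time of the freeze to within one interval, which makes it easier to line up with whatever else was running.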

But the day before yesterday, I found a way to trigger the freezes within, let's say, about one hour. This made things much easier! I started to use the program "memtester" (not to be confused with "memtest86" and its variants). The memtester program runs as an ordinary process under the regular Linux operating system. So on a machine with, for example, 16 GB of RAM you can let memtester check 10 GB, which leaves 6 GB for working. This way you can do long-term tests of the RAM while working on the machine. Of course, as the operating system sits between memtester and the hardware, you are most probably not as accurate at finding defective memory as with memtest86.
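
The sizing arithmetic can be sketched as follows. This assumes the MemTotal line of /proc/meminfo as the source of the installed size and assumes memtester's usual command form of a size with an "M" suffix followed by a loop count; the helper names and the default headroom are invented for this example.

```python
def parse_mem_total_kib(meminfo_text):
    """Return the MemTotal value (in KiB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def memtester_size_mib(mem_total_kib, headroom_mib):
    """Memory to hand to memtester, leaving headroom_mib for normal work."""
    size = mem_total_kib // 1024 - headroom_mib
    if size <= 0:
        raise ValueError("headroom is larger than the available memory")
    return size


def memtester_command(meminfo_text, headroom_mib=2048, loops=1):
    """Build an argument list such as ['memtester', '10240M', '1'] (assumed CLI form)."""
    size = memtester_size_mib(parse_mem_total_kib(meminfo_text), headroom_mib)
    return ["memtester", f"{size}M", str(loops)]
```

With 16 GB reported and 6 GB of headroom this yields a 10 GB test region, matching the example above. Note that MemTotal is usually somewhat below the installed size, and memtester typically needs root privileges to lock the memory it tests.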

However, the most interesting thing (for my case) is: running memtester with 14 GB of RAM caused my machine to freeze after minutes, or at the latest within about one hour. I could reproduce it several times in one day. Again, there was nothing in any logs: no ECC error, no failure found by memtester, nothing interesting in "/var/log/*", nothing on the "netconsole".

Having an application that triggers the freezes helps a lot in this case. I planned to remove memory modules for further tests, because stressing the memory seems to trigger the freezes, so they may be memory-related. But before I started doing this, I looked at the memory-related BIOS settings. Disabling some energy-saving features (C1E, Memory Tristate, DDR power down and so on) didn't help. (I was quite sure that I had tried these before, but it was difficult to do so systematically when there was nothing to trigger the freezes and when weeks passed without any.)

After that I deactivated "Memory Hole Remapping" in the BIOS. Since then, I have not been able to reproduce a freeze. Running "memtester" for almost two days hasn't caused any freezes yet.

When I activate Memory Hole Remapping again, the freezes start to occur again.

Certainly, I cannot say for sure that the problem is solved by disabling this BIOS setting, but at the moment there are very strong indications that this setting was actually the crucial switch. It would, by the way, also explain why I already had the feeling that using virtual machines seemed to trigger the freezes. Working with one or two virtual machines has been the most memory-consuming usage pattern on the affected computer in the last months. The more memory is consumed by applications, the higher the probability that the system freezes when Memory Hole Remapping is active. At least it looks like something is overwriting regions of memory that should not be overwritten.

The next step would be to check whether disabling Memory Hole Remapping has any negative side effects (perhaps some MB of memory are lost - if that's all, it should be no problem with 16 GB of RAM). And then the questions: Is it a bug in the BIOS? Is it a problem in the kernel? Does one of the installed PCIe or PCI cards have problems with the remapping and cause the whole trouble when remapping is activated? Is even the ancient GeForce 6600GT the cause? (I plan to buy a graphics adapter from the GTX 5xx series in the next months to run some simulations that require CUDA on this machine.)

For now, it seems that I've at least found a workaround, and I really hope that it was a step in the right direction.

Perhaps this can be also helpful for others...

Best regards,
Steffen
Old 01-29-11, 06:28 AM   #4
f1f0
Linux User
 
Join Date: Nov 2006
Posts: 46
Re: Sporadic freezes and the changes between 260.19.21, 260.19.26 and 260.19.29

Glad that you've got it solved. Here is another piece of anecdotal evidence that may help should you run into the problem again:

I've been using a custom kernel with "No Forced Preemption" (CONFIG_PREEMPT_NONE) and CONFIG_HZ_100 for a very long time. A few months ago I changed to CONFIG_PREEMPT and CONFIG_HZ_1000 out of necessity and got complete system freezes (SysRq didn't help) at intervals ranging from 12 hours to 3 days. As soon as I switched back to the old settings, everything was fine again.

Granted, there are many factors involved (custom kernel, 64-bit Arch Linux, MSI enabled, etc.), but no other change I made had any positive impact on the freezing.
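
To compare these settings between two kernels, something like the following sketch can help. It assumes the kernel configuration is available as text, for example from /proc/config.gz, which only exists when the kernel was built with CONFIG_IKCONFIG_PROC; the function names and the WATCHED tuple are illustrative, not part of any tool.

```python
import gzip

# Preemption and timer-frequency options of interest (illustrative selection).
WATCHED = (
    "CONFIG_PREEMPT_NONE",
    "CONFIG_PREEMPT_VOLUNTARY",
    "CONFIG_PREEMPT",
    "CONFIG_HZ_100",
    "CONFIG_HZ_250",
    "CONFIG_HZ_300",
    "CONFIG_HZ_1000",
    "CONFIG_HZ",
)


def parse_kernel_config(config_text, names=WATCHED):
    """Return {option: value} for the watched options that are set in a kernel .config text."""
    found = {}
    for line in config_text.splitlines():
        line = line.strip()
        # Unset options appear as comments ("# CONFIG_X is not set") and are skipped.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key in names:
            found[key] = value
    return found


def read_running_config(path="/proc/config.gz"):
    """Parse the running kernel's config; the path is an assumption and may not exist."""
    with gzip.open(path, "rt") as f:
        return parse_kernel_config(f.read())
```

Running parse_kernel_config on the config of a "good" and a "bad" kernel and comparing the two dictionaries shows at a glance whether the preemption model or timer frequency differs.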
All times are GMT -5.