bms20 08-22-05 06:04 PM

x86_64 lock-up problems - any ideas?

I'm running a new setup: Tyan K8WE / Opteron 252 / LeadTek 7800GTX. I need to run Open Inventor / Coin3D for graphics programming.

I have installed 64-bit Fedora Core 4 on the machine, and am running the 7676 drivers.

What seems to happen is that after about 4 hours of starting and restarting Coin3D-based programs, the machine stops rendering the 3D OpenGL display. Instead of the 3D objects I'd expect, I get a black window. After several more tries the machine hard locks, leaving the mouse frozen and the sound and network hung.

Are any of you experiencing these sorts of problems?

Have any of you fixed them?

Can you give me any tips on how to debug / workaround this problem?

My machine is running kernel 2.6.11-1.1369_FC4.

Any help would be greatly appreciated!

netllama 08-22-05 06:36 PM

If your entire system is truly locking up (such that you cannot ping/ssh/telnet it either), then I'd suggest setting up a serial console so that you can try to capture any kernel messages that are getting triggered at the time of the lockup.

If your system is not locked up, and you can telnet/ssh into it, then please generate and attach an nvidia-bug-report.log for review.
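
To make the two cases above concrete, here is a minimal sketch of a check run from a second machine. It assumes a Linux `ping` (where `-W` is a timeout in seconds) and SSH on port 22. The function names are made up for illustration, and the middle case (ping answers but SSH does not) is my own assumption about what to do, not something stated in the post.

```python
import socket
import subprocess


def ping_ok(host: str, timeout_s: int = 2) -> bool:
    """Return True if a single ICMP echo gets a reply (uses the system ping)."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # ping binary missing or not executable: treat as "no reply".
        return False
    return result.returncode == 0


def ssh_port_open(host: str, port: int = 22, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to the SSH port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def next_step(ping: bool, ssh: bool) -> str:
    """Map the two reachability checks onto a suggested next action."""
    if ssh:
        return "log in remotely and generate nvidia-bug-report.log"
    if ping:
        # Assumption: the kernel still answers ICMP but userspace is stuck.
        return "no remote login possible: capture kernel output on a serial console"
    return "hard lock: capture kernel output on a serial console"


# Example use (the address is hypothetical):
#   host = "192.168.0.10"
#   print(next_step(ping_ok(host), ssh_port_open(host)))
```

Keeping `next_step` separate from the network calls means the decision logic can be checked without a locked-up machine on hand.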


bms20 08-23-05 05:40 AM

Hi Lonni,

I actually placed my post prior to reading the "how to report a bug"...

The computer is at home, so I'll bring the laptop home to log in remotely and wait for the crash to happen. I believe that it is hard locking though, as the music player (xmms) stops. Normally, when I have had other problems, the music player keeps running.

Good idea on the serial port console. I'll do that too.


TwinFinity 08-23-05 05:48 AM

Hi bms20,

You are not alone; I have the same problems.

But it only happens when running as root.
The machine is fine when run as a user.
Also the machine is perfect under XP64...

I am running a K8WE/ 2xOpteron 275/ LeadTek 7800GTX
64 bit Fedora core 4, 7676 drivers.
Using a custom kernel as the dual cores oopsed with powernow on stock FC4 kernel.

Logged in as root, all I need to do to lock it up instantly is open the desktop properties, select an OpenGL screensaver (the preview shows nicely), and then change to a different OpenGL screensaver. It locks up. Or run glxgears twice. It locks up.
Also, the machine occasionally acts as if a key is stuck down on the keyboard (e.g. after pressing n, "nnnnnnnn...."), then locks up entirely.
It responds only to ping, and the power button occasionally manages to shut it down.

I'd been logging in as root as I'd not gotten the machine stable enough to dream of setting up my user account properly...

I put a serial console on and rebooted; then, for a change, I logged in as a standard user. Three hours later the machine was still working... Ran through all the OpenGL screensavers and no problem.

So as root it locks up, as a user it's fine. Possibly a weird permissions problem..?

Hope you find a solution, will post again if I find anything more!


netllama 08-23-05 10:48 AM

There is a known bug on some motherboards with the 7800GTX where running glxgears twice will result in a lockup. Can you please generate and post an nvidia-bug-report.log for review?


TwinFinity 08-23-05 11:41 AM

Hi Lonni,

Please find attached the bug-report.
This was generated just after the second run of glxgears froze the machine, while my ssh session was still running. The machine locked up completely immediately afterwards and needed a reset to restart.


netllama 08-23-05 12:00 PM

Thanks for the bug report. Unfortunately, I'm not seeing any of the errors that normally accompany the known bug in your bug report. In fact, I'm not seeing any errors at all, other than hald-probe-volume seg faulting twice.

I see that you're running FC4, yet you're not using an FC4 kernel. If you switch back to the latest FC4 kernel (2.6.12-1.1398_FC4), does the behavior or the kernel output change?


TwinFinity 08-23-05 12:41 PM

I've attached the 2.6.12-1.1398_FC4 (SMP) bug report; it does exactly the same thing.

I am using 2.6.13-rc6 because the stock FC4 kernel doesn't handle the dual cores properly and oopses a lot. But the only thing that takes the machine down is OpenGL...

It is very strange. It works as a regular user. Crashes as root.


netllama 08-23-05 12:47 PM

I see an Oops, but I think that's due to the dual core issue you mentioned. I'm still not seeing any errors related to the X crash. Are you able to set up a serial console to see if the kernel outputs anything when your system actually fully locks up?
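
For the capturing side of a serial console, a rough sketch of a logger that timestamps each line and flushes it straight away, so the last message before a lockup is not left sitting in a buffer. It assumes the locking machine boots with something like `console=ttyS0,115200n8 console=tty0` on the kernel command line and that the capturing machine's port is already configured (for example with `stty -F /dev/ttyS0 115200 raw`). The device name, the baud rate and the `log_console` helper are all assumptions for illustration.

```python
import io
import time
from typing import BinaryIO, Callable, TextIO


def log_console(stream: BinaryIO, out: TextIO,
                clock: Callable[[], float] = time.time) -> int:
    """Copy console lines from stream to out, prefixing each with a timestamp.

    Returns the number of lines written. Every line is flushed immediately.
    """
    count = 0
    # readline() returns b"" at end of stream; on a real serial device it
    # simply blocks until the next line arrives.
    for raw in iter(stream.readline, b""):
        text = raw.decode("ascii", errors="replace").rstrip("\r\n")
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(clock()))
        out.write(f"[{stamp}] {text}\n")
        out.flush()
        count += 1
    return count


# Example use on the capturing machine (paths are hypothetical):
#   with open("/dev/ttyS0", "rb") as port, open("console.log", "a") as log:
#       log_console(port, log)
```

The timestamps make it easy to line up the last kernel message with the moment the second glxgears was started.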


chunkey 08-23-05 01:12 PM

Hmm, can you try a nosmp kernel?
And this Oops is probably caused by the Cool&Quiet kernel module.
(The funny thing is that this Oops occurs just after nvidia's module is loaded...)


NVRM: loading NVIDIA Linux x86_64 NVIDIA Kernel Module  1.0-7676  Fri Jul 29 13:15:16 PDT 2005
Losing some ticks... checking if CPU frequency changed.

Unable to handle kernel NULL pointer dereference at 0000000000000024 RIP:
PGD 7c630067 PUD 7d185067 PMD 0
Oops: 0002 [1] SMP
Modules linked in: sr_mod ...
Pid: 12, comm: events/2 Tainted: P      2.6.12-1.1398_FC4smp
RIP: 0010:[<ffffffff8011dae1>] <ffffffff8011dae1>{query_current_values_with_pending_wait+65}

Process events/2 (pid: 12, threadinfo ffff81007fc50000, task ffff810001f07780)

Call Trace:<ffffffff8011e0b1>{powernowk8_get+129} <ffffffff802e68a3>{cpufreq_get+115}

ACPI: PCI Interrupt 0000:02:00.0[A] -> Link [LNK3] -> GSI 19 (level, high) -> IRQ 201
PCI: Setting latency timer of device 0000:02:00.0 to 64
NVRM: loading NVIDIA Linux x86_64 NVIDIA Kernel Module  1.0-7676  Fri Jul 29 13:15:16 PDT 2005
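
To see at a glance which code the Oops went through, here is a small sketch that pulls the `{symbol+offset}` annotations out of a log excerpt. The regular expression and the `oops_symbols` helper are illustrative only, and the sketch assumes the offsets are printed in decimal, as they appear in the lines quoted above.

```python
import re

# Matches kernel symbol annotations such as {powernowk8_get+129}.
SYMBOL_RE = re.compile(r"\{([A-Za-z_][A-Za-z0-9_.]*)\+(\d+)\}")


def oops_symbols(log_text: str) -> list[tuple[str, int]]:
    """Return (symbol, offset) pairs in the order they appear in the log."""
    return [(name, int(offset)) for name, offset in SYMBOL_RE.findall(log_text)]


# Two lines copied from the excerpt above.
EXCERPT = """\
RIP: 0010:[<ffffffff8011dae1>] <ffffffff8011dae1>{query_current_values_with_pending_wait+65}
Call Trace:<ffffffff8011e0b1>{powernowk8_get+129} <ffffffff802e68a3>{cpufreq_get+115}
"""

for name, offset in oops_symbols(EXCERPT):
    print(f"{name} +{offset}")
```

Every symbol it finds here (`query_current_values_with_pending_wait`, `powernowk8_get`, `cpufreq_get`) belongs to the CPU frequency scaling path, which fits the guess that this Oops comes from the Cool&Quiet module and not from the nvidia module itself.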

TwinFinity 08-23-05 07:50 PM

With the non-smp kernel everything works as expected.

When the machine locks using an SMP kernel there is no oops data output on a serial console.


netllama 08-23-05 08:36 PM

To confirm, you are seeing the normal bootup output over the serial console? I'm honestly not sure what else to suggest at this point. All I can say is that your process for triggering the lockups is identical to the known bug.


