nV News Forums (http://www.nvnews.net/vbulletin/index.php)
-   NVIDIA Linux (http://www.nvnews.net/vbulletin/forumdisplay.php?f=14)
-   -   Lockups with Quadro FX 3400 + HP xw9300 + 7676 (http://www.nvnews.net/vbulletin/showthread.php?t=55572)

siersmak 08-24-05 05:16 PM

Lockups with Quadro FX 3400 + HP xw9300 + 7676
 
1 Attachment(s)
I'm trying to get the 7676 x86-64 driver running on SuSE 9.3 on an HP xw9300 (dual Opteron 250s, 8 GB RAM, nForce4 chipset). I've been experiencing two different lockups. One involves only an Xid:

Aug 24 16:53:16 ekk2 kernel: NVRM: Xid: 25, L1 -> L0
Aug 24 16:53:16 ekk2 kernel: NVRM: Xid: 13, 0005 beef3097 00004097 00001748 00000000 00000002

This error locks up my opengl app momentarily, but the X server has always recovered so far.

The other error I get starts with this message in the log:

Aug 24 17:05:26 ekk2 kernel: NVRM: not remapping 0x1000 bytes, 0x3c00000 total
Aug 24 17:05:26 ekk2 kernel: NVRM: VM: nv_vm_malloc_pages: failed to sg map pages
Aug 24 17:05:26 ekk2 kernel: ----------- [cut here ] --------- [please bite here ] ---------
Aug 24 17:05:26 ekk2 kernel: Kernel BUG at pageattr:154
Aug 24 17:05:26 ekk2 kernel: invalid operand: 0000 [1] SMP
Aug 24 17:05:26 ekk2 kernel: CPU 1

This error is more serious (and it seems to be more frequent too), as the X server never recovers and I have to shut down remotely. For some reason nvidia-bug-report.sh never finishes with the second error either.

I've attached a bug report generated after the Xid error. Please let me know if you have any insight.

Thank you,
Ken

Edit: It's an NVIDIA nForce Professional 2200 chipset actually, not nForce4.

netllama 08-25-05 01:31 PM

Re: Lockups with Quadro FX 3400 + HP xw9300 + 7676
 
siersmak,
The full lockup that you're reporting looks like a known kernel bug, which is documented in the NVIDIA README in the section titled "The X86-64 platform (AMD64/EM64T) and 2.6 kernels". The odd thing is that you're running a 2.6.11 kernel, so this bug should have already been resolved, unless SuSE did something odd with their kernel build. Can you try upgrading to a 2.6.12 based kernel from kernel.org and see if that helps at all?
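
As a rough aid for the kernel question above, the following sketch compares a kernel release string against a minimum version. The function names are made up for illustration, and "2.6.11.4-smp" is a hypothetical release string (only "2.6.12.5-smp" appears in this thread). A version number alone does not prove a fix is present, since distribution kernels can carry different patches.

```python
import re


def kernel_version_tuple(release):
    """Return the leading dotted numeric part of a release string as a tuple of ints."""
    match = re.match(r"(\d+(?:\.\d+)*)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def is_at_least(release, minimum):
    """True if the release is at or above `minimum`, compared component by component."""
    minimum = tuple(minimum)
    return kernel_version_tuple(release)[:len(minimum)] >= minimum


# Hypothetical pre-upgrade string versus the release seen later in this thread.
for release in ("2.6.11.4-smp", "2.6.12.5-smp"):
    print(release, is_at_least(release, (2, 6, 12)))
```

On a live system the release string could be taken from `platform.release()` in the Python standard library instead of being typed in by hand.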

For both of these issues, do you have a reliable means of reproducing them, or are they seemingly random?

Thanks,
Lonni

siersmak 08-25-05 01:43 PM

Re: Lockups with Quadro FX 3400 + HP xw9300 + 7676
 
Thanks for the reply, netllama. I will try your suggestion of running on a 2.6.12 kernel next.

The lockups are reproducible: I start up glxgears, then open one or more copies of my company's OpenGL-based modelling code, and it will usually crash relatively quickly (within 15 minutes or so). I can get it to crash more quickly if I open multiple copies of my code.

Once during my testing today I only had a few xterms open and the KDE control center running when it locked up solid with the nv_vm_malloc_pages error.

I'll let you know how it goes with a 2.6.12 kernel.

netllama 08-25-05 02:00 PM

Re: Lockups with Quadro FX 3400 + HP xw9300 + 7676
 
siersmak,
I should clarify that upgrading the kernel may only resolve the full system lockup. The first issue you noted, where there is a brief, recoverable lockup, may very well be a different and unrelated bug.

Since it sounds like your company's OpenGL based modeling code is the only reliable reproduction path, would you be able to provide me with that code (or a compiled binary, if that is more acceptable) so that I can attempt to reproduce this? You can email it to linux-bugs@nvidia.com if you're not comfortable with posting the code here.

If you'd prefer to wait until after you've tested a 2.6.12.x kernel first, that's fine too.

Thanks,
Lonni

siersmak 08-25-05 03:48 PM

Re: Lockups with Quadro FX 3400 + HP xw9300 + 7676
 
1 Attachment(s)
netllama,

Upgrading to 2.6.12.5 appears to be a huge improvement. I am still seeing Xids, but so far they have all been recoverable, and I'm not seeing any of the nv_vm_malloc_pages errors from my original post.

Here's another log file from a recent Xid error. Anything that happened in the syslog before 15:31 was before I rebooted with the new kernel. You will notice that I'm not passing anything special to the kernel or to the nvidia driver. I'm going to start tinkering with a few things (swiotlb, apic, etc) but if anything catches your eye please let me know.

Thanks for your help,
Ken

siersmak 08-25-05 04:21 PM

Re: Lockups with Quadro FX 3400 + HP xw9300 + 7676
 
I tried passing only swiotlb=65536 to the kernel, and I saw a few Xids, so I tried adding noapic as well. Without noapic, my video card gets its own IRQ. But with noapic, it ends up sharing with a few others on IRQ 5. Then, after a few minutes of stressing the card pretty hard, I got this:

Quote:

Aug 25 16:13:20 ekk2 kernel: NVRM: not remapping 0x1000 bytes, 0x3c00000 total
Aug 25 16:13:20 ekk2 kernel: NVRM: VM: nv_vm_malloc_pages: failed to sg map pages
Aug 25 16:13:20 ekk2 kernel: ----------- [cut here ] --------- [please bite here ] ---------
Aug 25 16:13:20 ekk2 kernel: Kernel BUG at "arch/x86_64/mm/pageattr.c":154
Aug 25 16:13:20 ekk2 kernel: invalid operand: 0000 [1] SMP
Aug 25 16:13:20 ekk2 kernel: CPU 1
Aug 25 16:13:20 ekk2 kernel: Modules linked in: nvidia tulip forcedeth iptable_filter ip_tables af_packet usbserial usbcore floppy freq_table thermal snd_pcm_oss processor snd_mixer_oss fan button battery ac snd_intel8x0 snd_ac97_codec snd_pcm snd_timer snd ipv6 soundcore snd_page_alloc edd evdev joydev st sr_mod sg parport_pc lp parport video1394 ohci1394 raw1394 ieee1394 capability commoncap jfs dm_mod reiserfs ide_cd cdrom ide_disk mptscsih mptbase sata_nv libata amd74xx ide_core sd_mod scsi_mod
Aug 25 16:13:20 ekk2 kernel: Pid: 8676, comm: meshid Tainted: P 2.6.12.5-smp
Aug 25 16:13:20 ekk2 kernel: RIP: 0010:[<ffffffff80121e48>] <ffffffff80121e48>{__change_page_attr+728}
Aug 25 16:13:20 ekk2 kernel: RSP: 0018:ffff8100ac403ad8 EFLAGS: 00010282
Aug 25 16:13:20 ekk2 kernel: RAX: 00000000e6e001e3 RBX: 8000000000000163 RCX: 0000000000000000
Aug 25 16:13:20 ekk2 kernel: RDX: 0000000000000070 RSI: 00000000001e6fe0 RDI: ffff810000011000
Aug 25 16:13:20 ekk2 kernel: RBP: 8000000000000163 R08: 03fffffffffff000 R09: 0000000000000000
Aug 25 16:13:20 ekk2 kernel: R10: 00000000ffffffff R11: 0000000000000000 R12: ffff810001000380
Aug 25 16:13:20 ekk2 kernel: R13: ffff8101e6fe0000 R14: ffff8100000109b8 R15: 0000000000000001
Aug 25 16:13:20 ekk2 kernel: FS: 00002aaaace0e360(0000) GS:ffffffff804d1dc0(0000) knlGS:0000000000000000
Aug 25 16:13:20 ekk2 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 25 16:13:20 ekk2 kernel: CR2: 0000000008849000 CR3: 00000001ef394000 CR4: 00000000000006e0
Aug 25 16:13:20 ekk2 kernel: Process meshid (pid: 8676, threadinfo ffff8100ac402000, task ffff8101ff447660)
Aug 25 16:13:20 ekk2 kernel: Stack: ffff8101e6fe0000 00000000001e6fe0 0000000000000000 0000000000000000
Aug 25 16:13:20 ekk2 kernel: ffffffff7fffffff ffffffff8012205c ffff810100000680 0000000000000163
Aug 25 16:13:20 ekk2 kernel: 8000000000000163 ffff810000000000
Aug 25 16:13:20 ekk2 kernel: Call Trace:<ffffffff8012205c>{change_page_attr_addr+140 } <ffffffff8853098d>{:nvidia:nv_vm_malloc_pages+3934 }
Aug 25 16:13:20 ekk2 kernel: <ffffffff8016432a>{cache_alloc_refill+426} <ffffffff8852d48d>{:nvidia:nv_alloc_pages+988}
Aug 25 16:13:20 ekk2 kernel: <ffffffff883334a8>{:nvidia:_nv002119rm+152} <ffffffff8831c385>{:nvidia:_nv001557rm+343}
Aug 25 16:13:20 ekk2 kernel: <ffffffff8831c077>{:nvidia:_nv001562rm+503} <ffffffff8845bdc0>{:nvidia:_nv004528rm+28}
Aug 25 16:13:20 ekk2 kernel: <ffffffff88317f8f>{:nvidia:_nv002947rm+139} <ffffffff8833b25f>{:nvidia:_nv001445rm+379}
Aug 25 16:13:20 ekk2 kernel: <ffffffff88339e72>{:nvidia:rm_ioctl+58} <ffffffff8852eb14>{:nvidia:nv_kern_ioctl+945}
Aug 25 16:13:20 ekk2 kernel: <ffffffff80191d34>{do_ioctl+116} <ffffffff8019200d>{vfs_ioctl+685}
Aug 25 16:13:20 ekk2 kernel: <ffffffff802229a1>{__up_write+49} <ffffffff801920aa>{sys_ioctl+106}
Aug 25 16:13:20 ekk2 kernel: <ffffffff8010ea9e>{system_call+126}
Aug 25 16:13:20 ekk2 kernel:
Aug 25 16:13:20 ekk2 kernel: Code: 0f 0b 72 ab 34 80 ff ff ff ff 9a 00 41 8b 04 24 f6 c4 08 74
Aug 25 16:13:20 ekk2 kernel: RIP <ffffffff80121e48>{__change_page_attr+728} RSP <ffff8100ac403ad8>

So now I've removed the noapic option and I'm back to having a few Xids.
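
To see which devices end up on the same IRQ as the nvidia module after a change such as noapic, a small parser over the contents of /proc/interrupts can help. This is a sketch only: it assumes the 2.6-era layout (one count column per CPU, a single controller token, then comma-separated device names), and the sample text and device names below are hypothetical, not taken from this machine.

```python
# Hypothetical /proc/interrupts contents; counts and device names are made up.
SAMPLE_INTERRUPTS = """\
           CPU0       CPU1
  0:     123456          0          XT-PIC  timer
  5:       4321          0          XT-PIC  nvidia, ohci1394, eth0
 14:        100          0          XT-PIC  ide0
NMI:         10         12
"""


def shared_irqs(interrupts_text):
    """Return {irq_number: [device, ...]} for IRQs claimed by more than one device."""
    lines = interrupts_text.strip().splitlines()
    if not lines:
        return {}
    cpu_count = len(lines[0].split())  # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        # Skip summary rows (NMI, LOC, ERR, ...) and rows without a device tail.
        if not label.strip().isdigit() or len(fields) <= cpu_count + 1:
            continue
        # Layout assumed: per-CPU counts, one controller token, then device names.
        tail = " ".join(fields[cpu_count + 1:])
        devices = [name.strip() for name in tail.split(",") if name.strip()]
        if len(devices) > 1:
            shared[int(label)] = devices
    return shared


print(shared_irqs(SAMPLE_INTERRUPTS))
```

On a real system the input would come from reading /proc/interrupts; newer kernels print extra columns, so the field offsets here would need adjusting.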

netllama 08-25-05 04:28 PM

Re: Lockups with Quadro FX 3400 + HP xw9300 + 7676
 
siersmak,
Are you able to tie any specific actions to triggering the Xids that you're seeing? At this stage, I'll need to have a means of reproducing the Xids in order to investigate further.

Thanks,
Lonni

siersmak 08-25-05 04:29 PM

Re: Lockups with Quadro FX 3400 + HP xw9300 + 7676
 
One thing I'm considering is updating the video card BIOS. I downloaded a video card BIOS from HP, but this is not my machine, so I'm a little afraid of doing it. The video card BIOS is currently version 05.40.02.35.05, and the update would bring it up to 5.40.02.41.04-06 Rev. A.

(http://h20000.www2.hp.com/bizsupport...T%29&tech=BIOS)

Do you have any advice as far as that's concerned?

-Ken

siersmak 08-25-05 04:31 PM

Re: Lockups with Quadro FX 3400 + HP xw9300 + 7676
 
Quote:

Are you able to tie any specific actions to triggering the Xids that you're seeing? At this stage, I'll need to have a means of reproducing the Xids in order to investigate further.

I understand. Unfortunately I can't send you my code. I'll try to conjure up something I can send to you tomorrow morning.

Thanks,
Ken

netllama 08-25-05 05:18 PM

Re: Lockups with Quadro FX 3400 + HP xw9300 + 7676
 
siersmak,
There's always some risk when upgrading any BIOS. If you're upgrading an HP provided VBIOS, then it should be relatively safe, and they should support it. Whether it helps this problem is hard to say since I don't know what the actual problem is right now.

Thanks,
Lonni

siersmak 08-26-05 09:34 AM

Re: Lockups with Quadro FX 3400 + HP xw9300 + 7676
 
gl-117 produces a bunch of Xids, in either KDE or fvwm2:

Quote:

Aug 26 09:29:15 ekk2 kernel: NVRM: Xid: 25, L1 -> L0
Aug 26 09:29:15 ekk2 kernel: NVRM: Xid: 6, PE0002 0000 beef4901 0000fde8 beef1e20 00000000
Aug 26 09:29:15 ekk2 kernel: NVRM: Xid: 25, L1 -> L0
Aug 26 09:29:15 ekk2 kernel: NVRM: Xid: 6, PE0002 194c 00000000 00603e74 ae10a040 00000000
Aug 26 09:29:15 ekk2 kernel: NVRM: Xid: 25, L1 -> L0
Aug 26 09:29:15 ekk2 kernel: NVRM: Xid: 6, PE0002 194c 00000000 00000180 00000000 00000000
Aug 26 09:29:15 ekk2 kernel: NVRM: Xid: 25, L0 -> L0
Aug 26 09:29:15 ekk2 kernel: NVRM: Xid: 6, PE0002 194c 00000000 00000000 00000000 00000000
Aug 26 09:29:15 ekk2 kernel: NVRM: Xid: 6, PE0002 194c 00000000 00000000 00000000 00000000
Aug 26 09:29:15 ekk2 kernel: NVRM: Xid: 25, L1 -> L0
Aug 26 09:29:15 ekk2 kernel: NVRM: Xid: 6, PE0002 194c 00000000 00000000 beef1e20 00000000
Aug 26 09:29:15 ekk2 kernel: NVRM: Xid: 25, L1 -> L0
Aug 26 09:29:15 ekk2 kernel: NVRM: Xid: 6, PE0002 194c 00000000 00000180 00000000 00000000
Aug 26 09:29:15 ekk2 kernel: NVRM: Xid: 25, L0 -> L0
Aug 26 09:29:15 ekk2 kernel: NVRM: Xid: 6, PE0002 194c 00000000 00000000 00000000 00000000
Aug 26 09:29:15 ekk2 kernel: NVRM: Xid: 6, PE0002 194c 00000000 00000000 00000000 00000000
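
For comparing runs (different kernels, boot options, or test programs), it can help to tally the Xid numbers in the syslog rather than reading them by eye. The sketch below assumes the "NVRM: Xid: N, ..." line format shown in this thread; the function name is made up for illustration, and no meaning is attached to the individual Xid numbers.

```python
import re
from collections import Counter

# Matches the Xid number in lines such as "... kernel: NVRM: Xid: 25, L1 -> L0".
XID_RE = re.compile(r"NVRM: Xid: (\d+),")


def count_xids(lines):
    """Count how many times each Xid number appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Three lines copied from the logs posted in this thread.
sample = [
    "Aug 24 16:53:16 ekk2 kernel: NVRM: Xid: 13, 0005 beef3097 00004097 00001748 00000000 00000002",
    "Aug 26 09:29:15 ekk2 kernel: NVRM: Xid: 25, L1 -> L0",
    "Aug 26 09:29:15 ekk2 kernel: NVRM: Xid: 6, PE0002 0000 beef4901 0000fde8 beef1e20 00000000",
]

for xid, count in sorted(count_xids(sample).items()):
    print(f"Xid {xid}: {count}")
```

To run it over a whole log, pass an open file object (for example the system log file, whose path varies by distribution) instead of the sample list.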

netllama 08-26-05 12:44 PM

Re: Lockups with Quadro FX 3400 + HP xw9300 + 7676
 
siersmak,
I believe that I've reproduced this problem. On my xw9300/FX3400, when I run 64-bit gl-117, I get very similar Xids. Are you experiencing any functionality or performance problems when running gl-117 when those Xids are generated, or are they just silently happening with no noticeable problems?

Thanks,
Lonni


All times are GMT -5.