#1 | |
|
Registered User
Join Date: Feb 2003
Location: ISRAEL
Posts: 31
|
At this juncture, all I can do is express my dismay. The X server crash phenomenon after logout with multiple local X servers under GDM has been known for a long time. There have been numerous posts, but no solution. We have tried all the suggestions on the forum posts, but to no avail. Worse yet, there is only silence from Nvidia. We're close to losing an important client whose patience has run out. Shame I can't attach a few tears to this post after all the sleepless nights I've invested in attempting to apply a solution.
It is definitely an Nvidia issue. Users of the Linux open nv driver seldom see the crash. I'd be happy to forgo the accelerated video and use nv, but we have not been able to run the nv driver on multiple local X servers. Multi-local X server users with chips from other manufacturers do not have the problem.
In misery,
Yitzhak
P.S. Don't get me wrong, fellows. I'm sure you Nvidia people work hard and the Linux community appreciates that. But unfortunately, we need a solution or a workaround. What choice do we have now?
__________________
Best, Yitzhak |
#2 | |
|
NVIDIA Corporation
Join Date: Dec 2004
Posts: 8,763
|
Please post a bug report, along with information about your hardware. This is clearly a problem specific to your environment, and not global to all users.
Thanks, Lonni |
#3 |
|
Registered User
Join Date: Feb 2003
Location: ISRAEL
Posts: 31
|
Thank you very much for your willingness to assist.
Attached is a report after a typical crash. The crash invariably occurs after a logout on the multi-user system. In this case, it was X2. It may not crash immediately after logout, or at all, but we almost never see crashes until after the first logout. Preventing a reset of the X server (AlwaysRestartServer=false in gdm.conf) does not help. Like many of my colleagues who run multiple local X servers, I sense it is an Nvidia problem. One indication is that your bug report script logs only the first X server (X0). I have included the other X server logs in the attached zip file together with gdm.conf. Believe me, we're desperate.
__________________
Best, Yitzhak |
#4 | |
|
NVIDIA Corporation
Join Date: Dec 2004
Posts: 8,763
|
Yitzhak,
Does this problem also exist with 1.0-7676? Which vendor's nForce2 motherboard are you using? Which BIOS version? In my experience, ruby-kernel issues are often tied to the specific hardware & OS configuration in use. Is shipping a system to NVIDIA where this reproduces an option for you?
Thanks, Lonni |
#5 |
|
Registered User
Join Date: Jul 2004
Posts: 8
|
Hello,
We too have an X crashing problem after logout, but we have also found a scenario where a similar problem occurs on initialization and is repeatable. We are NOT using ruby or anything kernel specific, and the problem occurs on both Asus VIA and nforce4 boards. We filed a very detailed bug report on November 18th, and haven't heard anything. Indeed, the problem is specific to situations where more than one X server is running, but it is not specific to a board or distro -- so far we have not found one user or company that has been able to avoid this crashing with multiple users, using many different approaches, boards, and distros. Also, using the nv driver prevents the problem, but obviously that has other drawbacks, including that more recent cards won't even work with the nv driver.
Here is the bug report I filed. Is linux-bugs@nvidia.com still the correct address?
Dear NVIDIA,
My company sells multi-user computers, and we are in the process of moving to Ubuntu (instead of standard Debian) and switching to the latest nvidia drivers (required because newer cards don't work with the old nvidia drivers).
Summary: Now we are seeing problems with X servers crashing and not being able to start back up again. We have finally found a repeatable scenario for this normally intermittent problem. There are 4 cards, each independent with their own keyboard and mouse. The X server on :3 won't come up, and complains about /dev/nvidia1 in the gdm log, but X server :3 should not be trying to use /dev/nvidia1. This same problem occurs intermittently in other configurations, i.e. all screens come up fine and 1/2 hour later when someone logs out/in one screen crashes and won't come back up until we run /etc/init.d/gdm restart.
Key Points:
1. Please tell us what we need to do to help you debug this. We are willing to put in the time to debug it, but we've tried everything we can possibly think of over the past 2 weeks. We are willing to sign an NDA, send you hardware, make a trip to your headquarters, etc., whatever it takes. We even have a multi-user live CD we can give you so you don't need to spend time configuring your software.
2. It is mainly intermittent, and occurs ~ 1/20 times when a user is logging out/in -- however, we now have a situation in which all cards will not start up and it is repeatable.
3. The messages in /var/log/gdm/:X when display X crashes and won't restart reference a /dev/nvidiaY device where Y is a different display -- in other words, one video card is trying to interfere with another card.
4. Running /etc/init.d/gdm restart will get the cards back up again after a crash (but the repeatable problem where the fourth card won't come up can't be fixed in any way).
5. We have finally found a completely repeatable situation under which this occurs, so we can debug and get you any information you need.
6. We have read about several other people having this exact same problem, some on your forums. We have tried all the suggestions to no avail.
7. We have even tried your "leaked" 8168 drivers, which behave the same or worse.
8. We have replicated this on several different motherboards. We saw one thread where you blamed the motherboard, so we bought an Asus nvidia nforce4 chipset motherboard so we could have a complete nvidia solution. We don't know of any motherboard that could possibly be "better".
9. This particular repeatable problem happens to occur on a dual pci-e motherboard with 2 pci-e cards and two pci cards, but as I mentioned it happens with normal agp boards, so dual pci-e isn't the problem.
10. I know you probably think multi-user isn't important, but more and more people are switching to it. If we can't resolve this, we will have to look at other brands of video cards and nvidia's linux and multi-user friendly reputation won't be as good.
If we find the root cause of this problem, that could fix other bugs or prevent more serious problems down the road, so it really is in your best interest to get this fixed. If you have any questions, please let me know. A bug report is available with extra log and configuration files included: http://groovix.com/nvidia-bug-report...groovix.tar.gz
Thank You,
Michael Pardee
Open Sense Solutions LLC
http://opensensesolutions.com |
#6 | |
|
NVIDIA Corporation
Join Date: Aug 2002
Posts: 3,740
|
@mppardee: your problem isn't necessarily related to yitzhakbg's. In your case, the problem appears to be related to exhaustion of the kernel's virtual address space. It is likely that disabling vesafb or, if it's already disabled, increasing the kernel's virtual address space size with the vmalloc kernel parameter (e.g. to 196MB) will help.
#7 |
|
NVIDIA Corporation
Join Date: Dec 2004
Posts: 8,763
|
Michael,
I apologize for the delay in responding to your email. I've just sent you a reply, and am including it here as well:
Looking at the log file, it seems that the kernel's virtual address space is barely large enough to support your configuration under normal circumstances, and, depending on what's going on in the system, the space remaining with all but the last X server up can be fragmented enough for the register mapping attempt to fail. The system in question is a Linux/x86 system with 2GB of RAM; the kernel's virtual address space is < 128MB to begin with.
A couple of things worth trying:
- make sure vesafb is disabled (it's unclear if it's actually attached to cards in this case).
- use the 'vmalloc' kernel parameter to increase the kernel's virtual address space size (e.g. to 196MB).
If neither of the above helps, and you are still willing to ship a system that easily reproduces this problem, please let me know, and I'll provide you with a shipping address. Getting a system that easily reproduces this is the best way to ensure that development can investigate further.
Thanks,
Lonni J Friedman
NVIDIA Corporation |
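For anyone who wants to check how much kernel virtual address space is actually in play before and after changing the vmalloc parameter, here is a minimal Python sketch. It assumes the kernel exposes the VmallocTotal, VmallocUsed and VmallocChunk fields in /proc/meminfo and the boot parameters in /proc/cmdline; the function names are made up for illustration and are not part of any NVIDIA tool.

```python
import re


def vmalloc_summary(meminfo_text):
    """Return the Vmalloc* fields (in kB) found in /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        m = re.match(r"(Vmalloc\w+):\s+(\d+)\s*kB", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


def kernel_cmdline_vmalloc(cmdline_text):
    """Return the value of vmalloc= on a kernel command line, or None if absent."""
    for token in cmdline_text.split():
        if token.startswith("vmalloc="):
            return token.split("=", 1)[1]
    return None
```

Typical use would be to pass in the contents of /proc/meminfo and /proc/cmdline, once with all but the last X server running and once after a failure, and compare the numbers. How the kernel accounts for these fields varies between versions, so treat the output as a rough indicator only.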
#8 |
|
Registered User
Join Date: Jul 2004
Posts: 8
|
For anyone else tracking this issue:
Using vmalloc=256M as a kernel parameter allowed us to initialize all screens successfully in that pci-e machine. However, our main problem is the crashes and failure to restart that occur after several login/logout cycles on all of our machines, which happens even with vmalloc=512M and vesafb disabled. The logs looked the same, so we thought the initialization problem was the same as the crashing problem. The main symptom is i/o errors on /dev/nvidiaX in the gdm log for display Y, where Y and X aren't the same. We are sending a machine to nvidia for testing -- thanks! |
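The symptom described above (a display's gdm log mentioning a /dev/nvidiaX that belongs to another display) can be checked mechanically. This is a rough Python sketch, assuming that display :N is meant to use /dev/nvidiaN, as the reports in this thread imply; on other setups the mapping depends on the BusID assignments in xorg.conf. The function name is hypothetical.

```python
import re

# Matches numbered device nodes only; /dev/nvidiactl is deliberately ignored.
NVIDIA_DEV = re.compile(r"/dev/nvidia(\d+)")


def mismatched_devices(display, log_text):
    """Return the /dev/nvidiaN indices in log_text that differ from the display number.

    Assumes display :N should only ever touch /dev/nvidiaN.
    """
    seen = {int(n) for n in NVIDIA_DEV.findall(log_text)}
    return sorted(n for n in seen if n != display)
```

Run it over the text of each /var/log/gdm/:N log with the matching display number; any non-empty result points at a log worth attaching to the bug report.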
#9 |
|
Registered User
Join Date: Feb 2003
Location: ISRAEL
Posts: 31
|
I'm pretty sure that Mike and I have the same problem. We also report that the problem does not occur until at least one logout (resetting X) has occurred. It also occurs at around the same frequency. The heck of it is that since it's a multi-user system, having to reset gdm knocks out the other two or three innocent users of the system.
Before we get down to the nitty-gritty, can one of you tell us how we can reset only the offending X server without having to restart GDM (or XDM, KDM)? That would take a lot of pain away while we work on solving the problem.
__________________
Best, Yitzhak |
#10 |
|
NVIDIA Corporation
Join Date: Dec 2004
Posts: 8,763
|
You could just kill the X process; however, I'm not sure that's going to serve as a reasonable workaround.
-Lonni |
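As a sketch of what killing only one X process could look like, the Python below reads the server's PID from its lock file and signals just that process. It assumes the X server for display :N keeps its PID as padded decimal text in /tmp/.XN-lock, and that the display manager respawns a local server that exits; the second assumption is not confirmed anywhere in this thread. The function names are illustrative only, and the script needs enough privileges to signal the X process.

```python
import os
import signal


def lock_file_path(display):
    """Path of the lock file for display :N (assumed location)."""
    return "/tmp/.X%d-lock" % display


def parse_lock_pid(text):
    """Parse the PID stored in an X lock file (padded decimal text)."""
    return int(text.strip())


def kill_display_server(display, sig=signal.SIGTERM):
    """Send sig to the X server that owns display :N and return its PID."""
    with open(lock_file_path(display)) as f:
        pid = parse_lock_pid(f.read())
    os.kill(pid, sig)
    return pid
```

For example, kill_display_server(2) would signal only the server behind :2 and leave the other seats alone. Whether the driver comes back cleanly afterwards is exactly the open question here.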
#11 |
|
Registered User
Join Date: Feb 2003
Location: ISRAEL
Posts: 31
|
After having implemented the multi-user local X server solution over the past years on different hardware platforms, both Intel and AMD, I am all the more convinced that Mike and I are talking about the same thing. Daniel Weingartner in Brazil, a multi-user pioneer, had to reluctantly abandon Nvidia in favor of SiS 315 boards, which solved the problem for him.
If I may suggest, you folks at Nvidia can easily implement your own multi-user system for testing in less than an hour. For you, it's a piece of cake. Just install a standard Linux distro with an AGP card and two or three PCI cards. Install Aivis' faketty module, see: http://lkml.org/lkml/2005/10/4/25 It's quick and simple. I've included xorg.conf and gdm.conf and you're off and running. Start GDM with three or four local servers and then crash the system by repeatedly pressing <CTRL><ALT><BKSP>. You can do it the polite way, by logging in and logging out until the servers crash, but <CTRL><ALT><BKSP> is quicker.
Believe me, if we had Nvidia working smoothly, the rapidly growing multi-user local X server community would embrace you. If you could provide us with a reasonably priced multi-GPU card (it must absolutely be multi-GPU with separate PCI addresses, like the Matrox G-550), we'd all be storming ahead. In the meantime, I'm up a creek and need a paddle fast. If I don't get a solution, we'll have to switch the video boards. Painful, but what can I do?
Yitzhak
__________________
Best, Yitzhak |
#12 |
|
Registered User
Join Date: Jul 2004
Posts: 8
|
Yitzhak, we are sending nvidia a machine on Monday, already set up with automatic logins/logouts to exhibit the problem. Multi-user is only a small fraction of their customer base and they are busy trying to get a new release out, so we have to do everything we can to make their job easier.
If the next driver release fixes the "NVRM: RmInitAdapter failed!" problem, I'm hoping that might fix our problem, but here's more info from /var/log/messages using the nvidia 8168 leaked driver if anyone cares, right before a total system freeze (7676 doesn't freeze the machine, a gdm restart gets things back to normal.)
Nov 26 18:05:13 localhost kernel: [4294842.118000] NVRM: RmInitAdapter failed! (0x23:0xffffffff:676)
Nov 26 18:05:13 localhost kernel: [4294842.118000] e1165849
Nov 26 18:05:13 localhost kernel: [4294842.118000] Modules linked in: fuse rfcomm l2cap bluetooth cpufreq_userspace cpufreq_stats freq_table cpufreq_powersave cpufreq_ondemand cpufreq_conservative nvidia agpgart video tc1100_wmi sony_acpi pcc_acpi hotkey dev_acpi i2c_acpi_ec button battery container ac ipv6 af_packet analog gameport snd_mpu401 snd_mpu401_uart snd_rawmidi snd_seq_device pcspkr rtc ohci1394 shpchp pci_hotplug snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_nforce2 i2c_core dm_mod tsdev evdev sr_mod sbp2 ieee1394 psmouse mousedev parport_pc lp parport md ext3 jbd thermal processor fan usbhid skge sata_sil forcedeth ehci_hcd ohci_hcd usbcore sd_mod ide_cd cdrom ide_generic sata_nv libata scsi_mod amd74xx ide_core unix tileblit font bitblit cfbcopyarea cfbimgblt cfbfillrect softcursor capability commoncap
Nov 26 18:05:13 localhost kernel: [4294842.118000] CPU: 0
Nov 26 18:05:13 localhost kernel: [4294842.118000] EIP: 0060:[pg0+551372873/1069995008] Tainted: P VLI
Nov 26 18:05:13 localhost kernel: [4294842.118000] EFLAGS: 00013016 (2.6.12-10-386)
Nov 26 18:05:13 localhost kernel: [4294842.118000] EIP is at _nv002298rm+0x15/0x1c [nvidia]
Nov 26 18:05:13 localhost kernel: [4294842.118000] eax: e4700000 ebx: c99a5800 ecx: 00000001 edx: 00000050
Nov 26 18:05:13 localhost kernel: [4294842.118000] esi: c9b08000 edi: ca3d0400 ebp: c94f5e10 esp: c94f5e10
Nov 26 18:05:13 localhost kernel: [4294842.118000] ds: 007b es: 007b ss: 0068
Nov 26 18:05:13 localhost kernel: [4294842.118000] Process Xorg (pid: 9713, threadinfo=c94f4000 task=c9971020)
Nov 26 18:05:13 localhost kernel: [4294842.118000] Stack: c94f5e40 e12aa06a ca3d0400 c99a5800 00000140 00000001 c99a5800 c9b08000
Nov 26 18:05:13 localhost kernel: [4294842.118000] 00000001 00075e40 c99a5800 c9b08000 c94f5e70 e116b9f1 c99a5800 00000140
Nov 26 18:05:13 localhost kernel: [4294842.118000] 00000001 00000023 c9e1dd40 00003297 00000000 c94f5ea8 00000008 00000007
Nov 26 18:05:13 localhost kernel: [4294842.118000] Call Trace:
Nov 26 18:05:13 localhost kernel: [4294842.118000] [pg0+552702058/1069995008] _nv004895rm+0x8a/0x94 [nvidia]
Nov 26 18:05:13 localhost kernel: [4294842.118000] [pg0+551397873/1069995008] rm_set_interrupts+0x129/0x144 [nvidia]
Nov 26 18:05:13 localhost kernel: [4294842.118000] [pg0+553637341/1069995008] os_release_sema+0x21/0x3e [nvidia]
Nov 26 18:05:13 localhost kernel: [4294842.118000] [pg0+551368050/1069995008] _nv002248rm+0x12/0x18 [nvidia]
Nov 26 18:05:13 localhost kernel: [4294842.118000] [pg0+551368002/1069995008] _nv002340rm+0x12/0x18 [nvidia]
Nov 26 18:05:13 localhost kernel: [4294842.118000] [pg0+551396955/1069995008] rm_init_adapter+0x77/0x8c [nvidia]
Nov 26 18:05:13 localhost kernel: [4294842.118000] [pg0+551396937/1069995008] rm_init_adapter+0x65/0x8c [nvidia]
Nov 26 18:05:13 localhost kernel: [4294842.118000] [pg0+551396918/1069995008] rm_init_adapter+0x52/0x8c [nvidia]
Nov 26 18:05:13 localhost kernel: [4294842.118000] [pg0+553624975/1069995008] nv_kern_open+0x116/0x207 [nvidia]
Nov 26 18:05:13 localhost kernel: [4294842.118000] [pg0+553627720/1069995008] nv_kern_isr+0x0/0x5b [nvidia]
Nov 26 18:05:13 localhost kernel: [4294842.118000] [pg0+553625081/1069995008] nv_kern_open+0x180/0x207 [nvidia]
Nov 26 18:05:13 localhost kernel: [4294842.118000] [chrdev_open+215/240] chrdev_open+0xd7/0xf0
Nov 26 18:05:13 localhost kernel: [4294842.118000] [dentry_open+190/391] dentry_open+0xbe/0x187
Nov 26 18:05:13 localhost kernel: [4294842.118000] [filp_open+65/73] filp_open+0x41/0x49
Nov 26 18:05:13 localhost kernel: [4294842.118000] [sys_chown+54/65] sys_chown+0x36/0x41
Nov 26 18:05:13 localhost kernel: [4294842.118000] [sys_open+56/179] sys_open+0x38/0xb3
Nov 26 18:05:13 localhost kernel: [4294842.118000] [sysenter_past_esp+84/117] sysenter_past_esp+0x54/0x75
Nov 26 18:05:13 localhost kernel: [4294842.118000] Code: 8b 4d 14 8b 55 10 d1 ea 8b 80 7c 01 00 00 66 89 0c 50 89 ec 5d c3 55 89 e5 8b 45 0c 8b 4d 14 8b 55 10 c1 ea 02 8b 80 7c 01 00 00 <89> 0c 90 89 ec 5d c3 55 89 e5 83 ec 10 56 53 8b 5d 0c 8b 75 10
Nov 26 18:05:14 localhost kernel: [4294842.118000] <3>irq 18: nobody cared (try booting with the "irqpoll" option.
Nov 26 18:05:14 localhost kernel: [4294842.846000] [__report_bad_irq+49/116] __report_bad_irq+0x31/0x74
Nov 26 18:05:14 localhost kernel: [4294842.846000] [note_interrupt+125/162] note_interrupt+0x7d/0xa2
Nov 26 18:05:14 localhost kernel: [4294842.846000] [__do_IRQ+133/177] __do_IRQ+0x85/0xb1
Nov 26 18:05:14 localhost kernel: [4294842.846000] [do_IRQ+25/36] do_IRQ+0x19/0x24
Nov 26 18:05:14 localhost kernel: [4294842.846000] [common_interrupt+26/32] common_interrupt+0x1a/0x20
Nov 26 18:05:14 localhost kernel: [4294842.846000] [schedule+1167/1188] schedule+0x48f/0x4a4
Nov 26 18:05:14 localhost kernel: [4294842.846000] [sys_sched_yield+89/98] sys_sched_yield+0x59/0x62
Nov 26 18:05:14 localhost kernel: [4294842.846000] [sysenter_past_esp+84/117] sysenter_past_esp+0x54/0x75
Thanks, Mike |
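To correlate failures like this one with login/logout times, the RmInitAdapter lines can be pulled out of /var/log/messages with a few lines of Python. This is only a sketch built around the syslog line format shown in this post; the function name is invented, and the parenthesised status codes are returned as opaque strings because their meaning is not documented in this thread.

```python
import re

# Timestamp, then host, then "kernel:", an optional [uptime] stamp, then the NVRM message.
NVRM_FAIL = re.compile(
    r"(\w{3} +\d+ [\d:]{8}) \S+ kernel: (?:\[[^\]]*\] )?"
    r"NVRM: RmInitAdapter failed! \(([^)]*)\)"
)


def rm_init_failures(log_text):
    """Return (timestamp, status) pairs for every RmInitAdapter failure in log_text."""
    return NVRM_FAIL.findall(log_text)
```

Feeding it the contents of /var/log/messages gives one tuple per failure, which can then be lined up against the session start and end times in the gdm logs.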