driver bsods; similar issues?


pooman
08-16-04, 11:39 PM
This message was copied from the Guru3D forum. I hope someone can help solve this problem. Thank you!


[system specs]

CPU:
2x Athlon MP 1.2GHz
OS:
Win2k3 and Win2k
Mobo:
Tyan Tiger MPX S2466-4M
RAM:
1GB PC2100 ECC, Samsung & Crucial
Video:
BFG Nvidia FX 5900
Storage:
Maxtor 30GB, 2x WD 120GB SE
Audio:
C-Media 8738


[ summary ]

I have prepared a list of the driver sets I have tried that BSOD on my
system when running DirectX graphics, along with the one and only set,
44.03, that doesn't!

ATI Catalyst has customer feedback on their driver sets. I'm new
to Nvidia so I have no idea how to let Nvidia know of my issue.
Can anyone tell me?


[ background ]

Just so that we can speak freely, I've some experience with
developing drivers and digital circuits.


[ settings ]

I would like to remind everyone that my system is typically used as
a server, so it is not overclocked. It sits at room
temperature (25C). I have tried every setting known to me, and this
includes every combination of: AGP 1x, 2x, 4x, AGP Aperture Size
32MB-512MB, FastWrite on/off, PCI bus mastering timing values,
memory timings and all ECC modes, DirectX AGP on/off, removing
devices from the motherboard, disconnecting devices that draw power
(HDs, CD-ROM), testing DirectX 8.1-9.0c, running Driver Cleaner
and manually checking that the .INF files and registry entries are gone,
and ... I forget what else. I have the AMD 760 MPX chipset; the only
problem I know of with it is the 768 southbridge PCI-33 bandwidth
limitation (no video problems in the errata).

EDIT: Memtest86 for two days on extended tests turns out fine.

There is no way for me to verify that the AGP slot is not damaged in
any way. The power supply checks out fine using the digital multimeter
(near TruePower's specs and within a 10% tolerance level).


[ crash tests ]

Ironically, I have not encountered any BSOD with OpenGL. With DirectX,
Far Cry loads about 1/3 of the way and then BSODs or crashes to desktop.
The video driver is still in use so it takes a reboot to try anything
that uses 3D again. I proceeded to modify the Far Cry system
settings to use OpenGL rather than DirectX, and with OpenGL I finished
playing the game from beginning to end without a hitch (split up into
several days of game play of course).

America's Army crashes in maps, most notably SF Hospital. Typically it
will BSOD upon completion of caching the map textures or, with a few drivers,
it will crash to desktop within the first round. A reboot is necessary
to do any 3D when video crashes to desktop.

3DMark03 is the primary test performed on the driver sets. I found that
the game test 4, Mother Nature, is the demo that BSODs the system nearly
all the time. Within 10 seconds of running this demo the system will BSOD.


[ the list ]

pre-44.03 to 65.62 were tested on Windows Server 2003, whose architecture
is similar to Windows XP. The same errors were confirmed on a supported
platform, Windows 2000 Server (SP3, SP4), for 56.72 to 65.62.


pre-44.03; 5900NU card unsupported ("PCI\VEN_10DE&DEV_0331" missing).

44.03; MOTHER NATURE STUTTERS BUT DOESN'T CRASH!

45.23; nv4_disp.dll; BugCheck 10000050, {ffffffb0, 1, bfadc75b, 0}
Rebooted before Mother Nature started.

45.33; 5900NU card unsupported ("PCI\VEN_10DE&DEV_0331" missing).

52.16; nv4_disp.dll; BugCheck 1000008E, {c0000005, bfb31450, ee43a994, 0}
LAST_CONTROL_TRANSFER: from bfbaada9 to bfb31450

Rebooted at the beginning of Mother Nature (camera on water);
two small yellow rectangular blocks moving, entire top half
of the screen is yellow, became full screen yellow, and then
system crashed.

53.03; nv4_disp.dll; BugCheck 1000008E, {c0000005, bfb3b6a0, ee519990, 0}
LAST_CONTROL_TRANSFER: from bfbb57d9 to bfb3b6a0

Rebooted at the beginning of Mother Nature (camera on water);
two small yellow rectangular blocks moving.

53.04; 5900NU card unsupported ("PCI\VEN_10DE&DEV_0331" missing).

56.64; nv4_disp.dll; BugCheck 1000008E, {c0000005, bfbd275f, edc10990, 0}
LAST_CONTROL_TRANSFER: from bfaf8f1e to bfbd275f

Rebooted at the beginning of Mother Nature (camera on water);
two small yellow rectangular blocks moving, entire top half
of the screen is yellow, became full screen yellow, and then
system crashed.

56.72; nv4_disp.dll; BugCheck 1000008E, {c0000005, bfbee871, ee05572c, 0}

eax=00000000 ebx=00000763 ecx=00000763 edx=00001d8c esi=bc758b40 edi=00000000
eip=bfbee871 esp=ee0557a0 ebp=00000000 iopl=0 nv up ei pl nz ac po nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010216
nv4_disp+0x228871:
bfbee871 f3ab rep stosd es:00000000=????????

Rebooted at the beginning of Mother Nature (camera on water);
two small yellow rectangular blocks moving.

61.76; nv4_disp.dll; BugCheck 1000008E, {c0000005, bfb3d6d4, ee15da8c, 0}

eax=00000000 ebx=bc1db780 ecx=bfc96e38 edx=00000438 esi=bc3e9648 edi=bc1db780
eip=bfb3d6d4 esp=ee15db00 ebp=bc1db780 iopl=0 nv up ei ng nz na po nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010286
nv4_disp+0x1776d4:
bfb3d6d4 8b4830 mov ecx,[eax+0x30] ds:0023:00000030=????????

Rebooted at the beginning of Mother Nature (camera on water).

61.77; nv4_disp.dll; BugCheck 1000008E, {c0000005, bfb3d6d4, ee3f7aa0, 0}

eax=00000000 ebx=bc1db780 ecx=bfc96e38 edx=00000438 esi=bc3e9748 edi=bc1db780
eip=bfb3d6d4 esp=ee3f7b14 ebp=bc1db780 iopl=0 nv up ei ng nz na po nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010286
nv4_disp+0x1776d4:
bfb3d6d4 8b4830 mov ecx,[eax+0x30] ds:0023:00000030=????????

Rebooted at the beginning of Mother Nature (camera on water).

65.62; another 0xC0000005 exception from nv4_disp.dll, but I am
unmotivated to reinstall this set.


Besides 45.23, every crashing driver fails with 0xC0000005 (access violation).
I don't really want to point fingers and say someone made a boo boo and
attempted to write to memory that doesn't exist, but the outcome of each
BSOD appears that way. I don't know what got added after 44.03 but whatever
it is, on Far Cry, America's Army, and 3DMark03's Mother Nature, the
call to particular graphic routines is a bomb waiting to explode.
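
The bugcheck lines in the list above can be tallied mechanically. What follows is only a minimal Python sketch, not Nvidia or Microsoft tooling: the REPORTS table is copied by hand from the list, and the helper names (parse_bugcheck, faulting_instruction) are made up for illustration. It assumes the usual meaning of the parameters: for 1000008E, parameter 1 is the exception code and parameter 2 the faulting instruction; for 10000050, parameter 1 is the referenced address and parameter 3 the instruction that referenced it. It also subtracts the nv4_disp+offset values shown in the two register dumps from eip to recover the address the driver was loaded at.

```python
import re
from collections import Counter

# Bugcheck lines copied by hand from the list above (driver set -> text).
REPORTS = {
    "45.23": "BugCheck 10000050, {ffffffb0, 1, bfadc75b, 0}",
    "52.16": "BugCheck 1000008E, {c0000005, bfb31450, ee43a994, 0}",
    "53.03": "BugCheck 1000008E, {c0000005, bfb3b6a0, ee519990, 0}",
    "56.64": "BugCheck 1000008E, {c0000005, bfbd275f, edc10990, 0}",
    "56.72": "BugCheck 1000008E, {c0000005, bfbee871, ee05572c, 0}",
    "61.76": "BugCheck 1000008E, {c0000005, bfb3d6d4, ee15da8c, 0}",
    "61.77": "BugCheck 1000008E, {c0000005, bfb3d6d4, ee3f7aa0, 0}",
}

# Symbolic names for the two bugcheck codes seen in the list.
BUGCHECK_NAMES = {
    0x10000050: "PAGE_FAULT_IN_NONPAGED_AREA_M",
    0x1000008E: "KERNEL_MODE_EXCEPTION_NOT_HANDLED_M",
}

PATTERN = re.compile(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}")


def parse_bugcheck(line):
    """Return (code, [param1..param4]) parsed from one BugCheck line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a BugCheck line: %r" % line)
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, params


def faulting_instruction(code, params):
    """Address of the instruction that faulted (assumed parameter layout)."""
    return params[1] if code == 0x1000008E else params[2]


parsed = {version: parse_bugcheck(text) for version, text in REPORTS.items()}
by_code = Counter(BUGCHECK_NAMES[code] for code, _ in parsed.values())

# Load address of nv4_disp.dll, from "nv4_disp+0x228871" at eip=bfbee871
# (56.72) and "nv4_disp+0x1776d4" at eip=bfb3d6d4 (61.76).
base_5672 = 0xBFBEE871 - 0x228871
base_6176 = 0xBFB3D6D4 - 0x1776D4

if __name__ == "__main__":
    for version, (code, params) in parsed.items():
        print(version, BUGCHECK_NAMES[code],
              hex(faulting_instruction(code, params)))
    print(dict(by_code))
    print(hex(base_5672), hex(base_6176))
```

Both subtractions come out to bf9c6000, so the two dumps agree on where nv4_disp.dll was loaded, and every faulting instruction in the list lies above that address.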

EDIT: Though 3DMark03's Mother Nature doesn't crash with 44.03, America's Army 2.1 does (tested on SF Hospital): "BugCheck 10000050, {ffffffe0, 1, bfafc7d7, 0}". nv4_disp.dll basically attempted to write to an invalid address. 0xffffffe0 is most likely a relative offset that was added to a NULL pointer or to a non-allocated block of memory.
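
The "offset added to a NULL pointer" reading can be checked with a little arithmetic. This is a small Python sketch with made-up helper names (as_signed32, effective_address); it only reinterprets the addresses already quoted in this post and says nothing about what the driver code actually does.

```python
def as_signed32(value):
    """Interpret a 32-bit value as a two's-complement signed integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def effective_address(base, offset):
    """base + offset with 32-bit wrap-around, as on a 32-bit x86 system."""
    return (base + offset) & 0xFFFFFFFF


if __name__ == "__main__":
    # Referenced addresses from the two 10000050 bugchecks (44.03 and 45.23).
    for addr in (0xFFFFFFE0, 0xFFFFFFB0):
        print(hex(addr), "== NULL + (%d)" % as_signed32(addr))

    # 61.76 / 61.77 dumps: mov ecx,[eax+0x30] with eax=00000000.
    print(hex(effective_address(0x00000000, 0x30)))
```

Read as signed values, ffffffe0 is NULL minus 0x20 and ffffffb0 is NULL minus 0x50. The 61.76 and 61.77 dumps show the same pattern in the other direction: eax is zero and the instruction reads from eax+0x30, which is the 00000030 address in the dump.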

pooman
08-23-04, 05:00 AM
Addendum:

66.00 BugCheck 1000008E, {c0000005, bfc31ff0, edc6c838, 0}
BugCheck 1000008E, {c0000005, bfb4c36f, efbad664, 0}


Will there be no end?

<cries>

pooman
08-24-04, 05:21 AM
EDIT:



And the saga continues. I'm all alone in this matter. At this point I have three guesses as to what could be the problem and these are:

(1) NV4_DISP.DLL was not tested with FX series on any version of Windows with the SMP kernel installed.

Prior to my BFG FX 5900, I had a Gainward FX 5700U and it bsod'ed just the same. I returned it when I found green pixel and model stretching artifacts. This is the reason for my assumption on the FX series.

I will have to try this at a later date by disabling one of the CPUs before reinstalling Windows; reinstallation is not a simple matter on this workstation/semi-server machine.

(2) NV4_DISP.DLL may be trying to access already freed memory.

As code written by less experienced programmers often does, the video driver could be using a portion of freed pages of memory and causing an access violation (0xC0000005). Since NV4_DISP.DLL runs in kernel mode (ring 0), the NT kernel does not recover from this error unless NV4_DISP.DLL itself sets up its own recovery handler (kernel-mode SEH). The second version of this problem is that NV4_DISP.DLL could be using, through the freed pointer, a portion of memory that has since been allocated to something else. This would explain why some people can continue running their systems without causing a "fatal" crash.

A friend of mine suggested that it could be a multi-threaded race condition, but I argued that it could not be, because then a good portion of users would experience BSODs and Nvidia would have tended to the problem earlier. However, this idea is not totally thrown out the window, because freed memory can still work in certain cases, like the previously mentioned second version of the freed buffer problem.

Since I don't believe that Nvidia has the same level of verification process for software as for their hardware, the chance of memory errors is very high. (The assumption is that companies ignore resource issues, e.g. buffer overflows, to work on more pressing features to woo new customers).
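
The two versions of the freed-memory problem in guess (2) can be shown with a toy model. This Python sketch is purely illustrative: ToyPool, AccessViolation, and the addresses are invented, and it does not model the NT memory manager or anything inside NV4_DISP.DLL. It only shows why a stale pointer sometimes faults at once and sometimes quietly scribbles over memory that now belongs to someone else.

```python
class AccessViolation(Exception):
    """Stands in for 0xC0000005 in this toy model."""


class ToyPool:
    """Hypothetical allocator that reuses the most recently freed address."""

    def __init__(self):
        self.mapped = {}       # address -> {"owner": ..., "value": ...}
        self.free_list = []    # addresses available for reuse
        self.next_addr = 0x1000

    def alloc(self, owner):
        if self.free_list:
            addr = self.free_list.pop()
        else:
            addr = self.next_addr
            self.next_addr += 0x1000
        self.mapped[addr] = {"owner": owner, "value": 0}
        return addr

    def free(self, addr):
        del self.mapped[addr]
        self.free_list.append(addr)

    def write(self, addr, value):
        if addr not in self.mapped:
            raise AccessViolation(hex(addr))
        self.mapped[addr]["value"] = value


def stale_write_after_free():
    """Version 1: the freed page is unmapped, so the write faults."""
    pool = ToyPool()
    p = pool.alloc("driver")
    pool.free(p)
    try:
        pool.write(p, 0xDEAD)
    except AccessViolation:
        return "fault"
    return "no fault"


def stale_write_after_reuse():
    """Version 2: the address was handed out again before the stale write."""
    pool = ToyPool()
    p = pool.alloc("driver")
    pool.free(p)
    q = pool.alloc("other")    # same address as p, new owner
    pool.write(p, 0xDEAD)      # no fault, but it corrupts "other"
    return pool.mapped[q]["owner"], pool.mapped[q]["value"]


if __name__ == "__main__":
    print(stale_write_after_free())
    print(stale_write_after_reuse())
```

In the first case the failure is immediate and visible; in the second nothing crashes at the point of the bad write, which is the situation where a system keeps running without a "fatal" crash.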

(3) Bad video card or system hardware.

I have gone through several tests as shown in my first post and none seem to point to a hardware problem. However, I will not discount this possibility, for two reasons: the video card components could really be faulty, or the motherboard could be faulty.

The video card fault could come from two places. One is bad components such as capacitors, resistors, or even the GPU itself; the MTBF of components is good but not bulletproof. The other could have something to do with memory mapping.

The memory mapping relates to the problem proposed in (2): for some reason the video card's memory, or interface memory mapped into the system's virtual address space, could become inaccessible. This is an unlikely scenario, but it could be possible if the PCI/AGP bus were inaccessible due to other devices hogging the master line, or if there were genuinely a defect in the AGP slot (a friend had this happen to him as well; for him, however, the video card did not work at all). I have not been successful in testing the AGP slot. No software that I know of can test it, nor do I currently have the equipment necessary to diagnose the slot (an oscilloscope and an FPGA test board).

...

The problem is still a burning issue. I really value stability over performance gains. For those of you who are unfamiliar with the possible scenarios of this problem, imagine this hypothetical case:

You are writing a report (or programming) and the OS has not yet flushed its disk cache ("sync"). You are also copying files from one drive to another. You visit a website to look up some information on, say, Bezier curves and you happen to arrive at a site that has a Java applet. Say the Java applet uses graphics acceleration and Sun linked the Windows port to DirectX's Direct3D. Just when the applet runs, the system BSODs and tells you 0xC0000005 in the BugCheck. You just lost a good portion of your work just by visiting a stinking website.

At this time I am not willing to cut my losses and buy a new video card. Money doesn't fall from the sky. Besides, this is a call to restructure how Nvidia works with consumers, because they distance themselves from us by sending us to the vendors, who point us back to Nvidia.

Please speak up if you have similar problems! It's healthy to have open discussions on areas to improve stability rather than 1-2 FPS gains.

ekotan
08-25-04, 07:35 PM
It's unlikely to be a multithreading issue because pretty much all P4 processors right now commonly found in PCs use SMT (aka HTT or Hyperthreading Technology), because of which all companies had to fix their threading bugs (Creative was the biggest offender here, their drivers were notoriously unstable on multi-processor PCs, but their current Audigy2 driver seems to work well - thank you Intel, for pushing Hyperthreading so heavily; it may not be real SMP, but it did force companies to fix those threading and resource-locking bugs and race conditions in their drivers). Besides, NVIDIA has a lot of professional customers and those workstations almost always have dual processors installed. This is why I doubt it's an NV driver bug.

Ensure your case has good cooling and that nothing is overheating. Take out your graphics card, clean its contacts with rubbing alcohol and re-seat it carefully. Try removing all unnecessary PCI cards and disabling audio to see if it helps. Even though the BSOD lists the NV driver as the culprit, it could be misleading since something else in the system might be causing the NV driver to crash.

I'd suspect your GFX card could be an early card revision which is incompatible with your motherboard. I had that problem with my old Radeon 9700 Pro card and the only way to fix constant BSODs was to RMA it for a newer card revision. Nothing else helped and I was tearing my hair out until I tried an RMA as a last resort. Even if you don't want to buy another card, can you borrow another FX card and try with that? Or a BIOS flash to a newer revision of the VGA BIOS might help. I had to do this with my current 6800U card to achieve perfect stability, because the initial BIOS was running the core at different clock speeds in 2D and 3D modes and that was crashing Doom3 sometimes. My definition of stability is zero BSODs.

If it helps, I now never get BSODs with my current system (3.2 GHz P4EE, 6800U, 1GB CL2 PC3200 RAM, 1TB SATA RAID-5) which runs 3D Studio Max, Photoshop and several video editing apps along with many intensive games like Doom3 and NFSU. I stress-test by running two copies of Prime95 on top of a looping 3DMark03 and a video in a window as well as heavy network activity in the background for 48 hours and it just never fails. That's almost as stable as my Power Mac G5. :)

E@

Bah!
08-25-04, 11:16 PM
pooman, are you absolutely certain it's a driver issue?

I only ask because when i built my last system i was constantly getting the same exact errors. I ran Memtest as well as a ton of other tests and all of them came back with zero errors. I must have tried 20 different driver sets and all of them had the same issues when playing games or 3dmark, etc.

Finally when i was about to RMA my video card i just decided to take out a memory stick for the hell of it, and it worked. It seems that the two sticks (different brands) that i was using in my old system didn't agree with each other on my new board. Because of passing memtest with flying colors i hadn't even thought about messing with the memory but i'm glad i did.

Dunno if it will help you or not, just thought i'd share my experience with you.

pooman
08-26-04, 05:13 AM
Thank you people for replying. I thought no one had noticed this lonely thread.

I have tried a lot of tricks in the book and no hardware is faulty.

As for the suggestion to clean the contacts with rubbing alcohol, I strongly disapprove of that. Cleaning metal contacts can be cheaply and safely done by rubbing an eraser against the pins. Take any old oxidized ISA card (I'm sure you can find one somewhere) and you will see the difference on the first rub.

***
This is my second day on Windows XP. The system has not crashed on any test that I throw at it.
***

Apparently I have stumbled upon an issue where NV4_DISP.DLL is not fully compatible with Windows Advanced Server 2000 and Windows Server 2003 Enterprise. I withdraw any remarks against Win2k3 because Nvidia is not yet officially supporting it. However, why should the Win2k server exhibit BSODs similar to that of Win2k3 server? This issue raises questions whether or not the Nvidia development team ran regression tests on other platforms besides the mainstream OSs. The Windows servers are not mainstream I'll give you that. They cost a lot of money.

I use the Windows servers because I have a genuine need for them. I'm mainly a device driver developer. I also do hardware as a hobby, or if the job calls for it (yuck). I know it's a driver problem. I've seen issues like these many times before. This is why I'm so distraught seeing this kind of driver problem. I've even sent Nvidia an invitation to provide me with the NV4_DISP.DLL symbol files so that I may fix the bug and produce referenceable correlations with their developers. So far I've heard nothing but tumbleweeds blowing by.

I'm not about to rant or moan. This is truly a call to fix something broken so that people can get on with their lives. Unfortunately my environment isn't as popular as Linux, BSD, or Windows workstations, so fixes come yearly if ever.

Why are people trying to submit to NVDVD feedback? I tried it, but I think they'll just toss it in the trash since it's not even their job to deal with driver issues.

Bah!
08-26-04, 09:20 AM
Apparently I have stumbled upon an issue where NV4_DISP.DLL is not fully compatible with Windows Advanced Server 2000 and Windows Server 2003 Enterprise. I withdraw any remarks against Win2k3 because Nvidia is not yet officially supporting it. However, why should the Win2k server exhibit BSODs similar to that of Win2k3 server? This issue raises questions whether or not the Nvidia development team ran regression tests on other platforms besides the mainstream OSs. The Windows servers are not mainstream I'll give you that. They cost a lot of money.

I know it's a long shot, but maybe you can get one of their drivers for their professional cards and mod the .inf to add your 5900. Maybe those drivers will be compatible with the Server OS's.

Other than that i feel for ya. Good luck with getting a fix :(

ekotan
08-26-04, 06:56 PM
I'm surprised you have strong feelings against rubbing alcohol, it's a perfectly safe method. I don't use pencils, so I don't have an eraser; but there's always plenty of alcohol in my house. Kidding, bygones. :)

You're being bitten by the fact that companies only test their products with a few popular target platforms. This is because supporting each additional platform (writing drivers, certifying your product and training staff to deliver customer support on that platform) costs a lot of money - more than most people realise or appreciate. The equation boils down to a simple case of return on investment: A vendor is unlikely to support a platform unless there is a financially profitable revenue stream from doing so.

Since most people don't use 3D accelerators in conjunction with Windows Server versions, it's very probable that NVIDIA did not test their driver on that OS very thoroughly. High pricing aside, there's no practical benefit to running a server OS for gaming. Those OSes are tuned for disk and network I/O. They are built to support multiple users running apps simultaneously. They reserve the lion's share of RAM for caching purposes. None of this is conducive to gaming. In fact, all the DirectX video and audio acceleration is turned off in Win2K3 Server by default. This leads me to believe that MS may have discovered problems with DirectX apps in their in-house testing and so chosen to turn it off out of the box to maximise stability. Who knows?

If you're not getting any attention from NVIDIA, then one option is to sell your NVIDIA card and get another brand whose vendor will support you on your chosen OS. Another option is to dual-boot with the desktop versions of Windows XP/2K.

E@

pooman
08-27-04, 02:15 AM
[ This is a long one, it's still talking about the topic :) ]


I believe we are thinking the same things, just expressed in a different way.

I agree testing is a big factor that determines whether or not a platform will be supported. Testing on Windows Server has been a bit more rigorous since Windows Server 2003 Standard came about. Memory management on Win2k3 is stricter than on its Win2k predecessors. However, we should not forget that a server-based platform should be as well tuned for multitasking, if not better.

Win2k3 has something that resembles round-robin task scheduling with an anti-starvation algorithm that gives a longer quantum to execute code. This is actually something that benefits most workstations in addition to servers, because all applications will at least be in a running state. What we have now is a different story. XP and Win2k3 are a little different in resource management. The basic architecture remains the same as it has since NT 4.0. Most drivers for older "platforms" will still work on newer ones because of this similarity. The DDKs for each platform explain the differences in new and obsolete API calls. One may also perform a simple API extraction from NTOSKRNL.EXE to see undocumented but visible API calls.

I am going to stop short of blathering and conclude that Windows server is the way Windows was supposed to be. Workstations are cut-down versions of the server, optimized for users. I'm sure many of you couldn't care less about Windows' features at all: Firewall, Security, and other useless items. Read over the changes in Windows XP SP2. You will notice that a large majority of fixes are the same in Windows Server 2003 SP1 beta 1218 (current version). This comes down to what Microsoft considers to be "tuning for the purpose". As I'll restate, architecturally these platforms are equivalent but not equal.

The reason for pointing out the differences between Windows workstation and server becomes more obvious when you relate this to Nvidia's Linux and FreeBSD support. I will not deny that these operating systems are very good. Linux can be both workstation and server depending on how one manually configures it. FreeBSD is a long-running stable operating system that counts its system uptime in years (not days). In fact Microsoft has run BSD to keep Hotmail up for years. I remember Microsoft pulling out their Windows Whistler servers from handling microsoft.com and putting back BSD. All these platforms have things in common.

Most notably for our discussion, modern operating systems are tweaked from a certain set of code to change from a workstation to a server. Don't get me wrong, you won't be able to turn your Windows XP into a Windows Server 2003 just by a few registry hacks. For instance, certain services like HTTP.SYS handle incoming connections in kernel mode to lessen context switches. You're basically missing registry entries and files, and there are some limitations hard coded in NTOSKRNL.EXE to prevent more than 2 CPU support (excluding the HyperThreading logical processors). Developing such a video card driver would cost the same as developing it on any other platform, such as Linux and FreeBSD. One couldn't argue that supporting those platforms was not equally expensive.

As far as I know, many hardware vendors do not pass Windows Server 2003's pricey WHQL. After many embarrassing fiascos, Microsoft raised the bar on their own quality assurance. I can say that any video card I plug into the system will work using just the default VGA drivers. Again, a workstation is not just a workstation by today's standards (refer to the rebuttal above). A properly written device driver should simply work on the current product lines and especially recover from a system crash when the error was caused by a video graphics library.