View Full Version : data corruption with nvidia chipsets // memhole mapping/iommu related

12-22-06, 08:59 PM

I've already tried to "resolve" this via the Nvidia knowledgebase, but either they don't want to know about the issue or there is no one competent enough to give information or solutions about it.
They finally pointed me to this forum and told me that Linux support would be handled here (they did not realise that this is probably a hardware flaw and not OS related).

I must admit that I'm a little bit fed up with Nvidia's policy in such matters, and thus I will only describe the problem briefly.
If any competent chipset engineer reads this, they might want to read the main discussion thread (and some spin-off threads) on the issue, which takes place on the linux-kernel mailing list (again, this is probably not Linux related).
You can find the archive here: http://marc.theaimsgroup.com/?t=116502121800001&r=1&w=2

Now a short description:
-I (and many others) have found a data corruption issue that occurs on AMD Opteron / Nvidia chipset systems.

-What happens: if one reads/writes large amounts of data, errors occur.
We test this the following way: create a huge amount of test data, compute md5sums of it (or hashes with other algorithms), then verify them over and over.
The test shows differences (refer to the lkml thread for more information about this). Always in different files (!!!!). It may happen on read AND write access.
Note that even for affected users the error occurs rarely (but this is of course still far too often). My personal tests show roughly the following:
Test data: 30GB (of random data); I verify with sha512sum 50 times (that is what I call one complete test), so I verify 30*50 GB in total. In one complete test there are about 1-3 files with differences, each with about 100 corrupted bytes (at least very small amounts of data, far below a MB).
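The test procedure described above can be sketched as a shell script. This is only an illustrative sketch: file count, file sizes and the number of passes are scaled far down here (the real test used 30GB of random data and 50 verification passes), and the file names are made up.

```shell
#!/bin/sh
# Sketch of the verification test described above. Sizes and pass count
# are scaled down for illustration; the real test used 30GB and 50 passes.
set -e
dir=$(mktemp -d)

# Create a few files of random data and record their checksums once.
for i in 1 2 3; do
    dd if=/dev/urandom of="$dir/data$i" bs=1024 count=256 2>/dev/null
done
(cd "$dir" && sha512sum data* > checksums.sha512)

# One "complete test" = re-verifying the stored checksums several times.
# On an affected system, occasional passes report a mismatch even though
# the data on disk was never intentionally changed.
passes=5
failures=0
for n in $(seq 1 "$passes"); do
    (cd "$dir" && sha512sum -c --quiet checksums.sha512) \
        || failures=$((failures + 1))
done
echo "$passes passes completed, $failures with mismatches"

rm -r "$dir"
```

On a healthy system this should report zero mismatches; on an affected box, some passes fail, and which files differ changes from run to run.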

-It probably happens with all the nforce chipsets (see the lkml thread, where everybody lists their hardware).

-The cause is not individual defective hardware (dozens of high-quality memory, CPU, PCI bus, HDD bad-block, PCI parity, ECC, etc. tests showed this, and the issue remained even with different hardware components).

-It is probably not an operating-system bug, although Windows does not suffer from it. The reason is that Windows is (too stupid) ... I mean unable to use the hardware IOMMU at all.

-It happens with both PATA and SATA disks. To be exact: it may be that this has nothing to do with hard disks specifically.
It is probably PCI-DMA related (see lkml for more information and the reasons for this thesis).

-Only users with a lot of main memory (I don't know the exact value by heart and I'm too lazy to look it up)... say 4GB, will suffer from this problem.
Why? Only users who need the memory hole mapping and the IOMMU will suffer from the problem (this is why we think it is chipset related).

-We found two "workarounds", but both have big problems:
Workaround 1: Disable Memory Hole Mapping in the system BIOS entirely.
The issue no longer occurs, BUT you lose a big part of your main memory (depending on the size of the memhole, which itself depends on the PCI devices). In my case I lose 1.5GB of my 4GB. Most users will probably lose 1GB.
=> unacceptable

Workaround 2: As mentioned, Windows does not suffer from the problem because it always uses a software IOMMU. (btw: the same applies to Intel CPUs with EM64T/Intel 64; these CPUs don't even have a hardware IOMMU.)
Linux is able to use the hardware IOMMU (which of course speeds up the whole system).
If you tell the (Linux) kernel to use a software IOMMU (with the kernel parameter iommu=soft), the issue does not appear.
=> this is better than workaround 1 but still not really acceptable. Why? There are the following problems:

The hardware IOMMU, and systems with that much main memory, are largely used in computing centres. Those groups won't give up the hardware IOMMU in general simply because some Opteron (and perhaps Athlon) / Nvidia combinations cause problems.
(I can say this because I work at the Leibniz Supercomputing Centre, one of the largest in Europe.)

But as we don't know the exact cause of the issue, we cannot selectively switch to iommu=soft for the affected mainboards/chipsets/CPU steppings/and the like.

We'd have to use a kernel-wide iommu=soft as a catch-all solution.
But it is highly unlikely that this would be accepted by the Linux community (not to mention end users like the supercomputing centres), and I don't even want to talk about other OSes.
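For anyone who nevertheless wants to apply workaround 2 in the meantime: it boils down to adding iommu=soft to the kernel command line in the boot loader and checking after reboot which IOMMU the kernel actually picked. A minimal sketch follows; the GRUB file path and the log patterns are assumptions based on typical distros and 2.6-era kernel messages, so adjust for your system.

```shell
#!/bin/sh
# Sketch: check whether the iommu=soft workaround is active on this boot.
# Step 1 (manual, done beforehand): append "iommu=soft" to the kernel line
# in your boot loader config, e.g. GRUB legacy's /boot/grub/menu.lst, and
# reboot:
#   kernel /boot/vmlinuz-2.6.x root=/dev/sda1 ro iommu=soft

# Step 2: confirm the parameter was actually passed to the running kernel.
if grep -qw 'iommu=soft' /proc/cmdline; then
    mode="soft"
else
    mode="default (possibly the hardware GART IOMMU)"
fi
echo "IOMMU mode requested on this boot: $mode"

# Step 3 (optional): the kernel log usually mentions which mechanism is in
# use -- "swiotlb" for the software IOMMU, "GART" for the hardware one.
dmesg 2>/dev/null | grep -i -E 'swiotlb|gart' || true
```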

So we (and of course all customers, especially professional ones) need Nvidia's help.

Perhaps this could be solved via BIOS fixes, but of course not by the "stupid" solution of disabling the hardware IOMMU via the BIOS.
Perhaps the cause is a Linux kernel bug (although this is highly unlikely).
Last but not least, perhaps this is an AMD Opteron/Athlon issue (note: these CPUs have the memory controller integrated directly) and/or an Nvidia nforce chipset issue.


btw: Answers from Nvidia engineers/developers, or from end users who suffer from this issue too, should please be posted to the lkml thread (see above for the link), or, if that is not possible, here.
You may even contact me via email (calestyo@scientia.net) or personal messages.

PS: Please post any other resources/links to threads about this or similar problems.

01-01-07, 12:20 PM
This looks like a duplicate of this thread:

01-03-07, 09:16 AM
This looks like a duplicate of this thread:
Yes it is... in fact I'm the original author of the lkml thread he refers to...

01-03-07, 09:46 PM
Hi Christoph,

I think I've been experiencing that problem for years, & with non-nvidia chipsets.

Some years ago, I had problems with those VIA Apollo Pro chipsets for the Pentium III. There were bad transfers over the network cards, 'noise' grabbing frames from those PCTV cards, corrupt files while copying from one HD to another, ...

There were problems with those VIA KT266A chipsets for the Athlon XP too. I remember problems with the network & HD transfers.

nForce2 mobos for the Athlon XP had huge problems with the nVidia IDE driver, & they never fixed it for everybody.

nForce4 still has the problem with the IDE driver. Also, there are problems with the nForce networking too, if you enable that nVidia network shield (a lot of IP packets get corrupted). And now I've found another problem with the h264 hw acceleration: it sends corrupted packets to the gfx card too, because I sometimes see random artifacts, & it can sometimes hang the machine as well.

I've heard about the same problems with the nForce6 chipsets here. I think the problem should be under control now with all those BIOS updates.

On the other hand, I've been using intel chipsets for years too, since those BX chipsets for the Pentium II/III up until the 875P/975X recently, and I don't remember any 'weird' problems with those chipsets.

nForce chipsets have been great for me, but seriously, I'm tired of all these weird problems, & now that intel has the performance crown again, I think the best platform is an intel cpu & chipset if you want a stable system.


01-04-07, 08:22 AM

Thanks for your input.

It's difficult to tell whether the problems you've experienced have to do with this problem (or the problem we discuss at lkml, in case this one here is a different one)...
Anyway, if it is true that "our" problem comes from memory hole mapping and/or hardware-IOMMU issues, it would be unlikely that you've had the same problem. Those Pentium III/VIA Apollo systems and so on knew nothing about the AMD64 architecture and thus nothing about the IOMMU or memory hole mapping (as they were 32-bit based).