nForce 4 corrupting data written to HDD
Hello,
I want to describe a serious problem that I have had since I got my new PC with an Asus M2NPV-VM and 2 Samsung HDDs (one SATA and one PATA). When bigger files are written they get corrupted; sometimes it happens even with smaller ones. The issue occurs with both 32-bit and 64-bit kernels, 2.6.17 through 2.6.19 (some compiled by me, some stock Debian kernels). The problem is explained more broadly here, at LKML: http://lkml.org/lkml/2006/12/2/197
My lspci:
00:00.0 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.1 RAM memory: nVidia Corporation C51 Memory Controller 0 (rev a2)
00:00.2 RAM memory: nVidia Corporation C51 Memory Controller 1 (rev a2)
00:00.3 RAM memory: nVidia Corporation C51 Memory Controller 5 (rev a2)
00:00.4 RAM memory: nVidia Corporation C51 Memory Controller 4 (rev a2)
00:00.5 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.6 RAM memory: nVidia Corporation C51 Memory Controller 3 (rev a2)
00:00.7 RAM memory: nVidia Corporation C51 Memory Controller 2 (rev a2)
00:02.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:03.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:04.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:05.0 VGA compatible controller: nVidia Corporation C51PV [GeForce 6150] (rev a2)
00:09.0 RAM memory: nVidia Corporation MCP51 Host Bridge (rev a2)
00:0a.0 ISA bridge: nVidia Corporation MCP51 LPC Bridge (rev a3)
00:0a.1 SMBus: nVidia Corporation MCP51 SMBus (rev a3)
00:0a.2 RAM memory: nVidia Corporation MCP51 Memory Controller 0 (rev a3)
00:0b.0 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3)
00:0b.1 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3)
00:0d.0 IDE interface: nVidia Corporation MCP51 IDE (rev a1)
00:0e.0 IDE interface: nVidia Corporation MCP51 Serial ATA Controller (rev a1)
00:10.0 PCI bridge: nVidia Corporation MCP51 PCI Bridge (rev a2)
00:10.1 Audio device: nVidia Corporation MCP51 High Definition Audio (rev a2)
00:14.0 Bridge: nVidia Corporation MCP51 Ethernet Controller (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
04:05.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
04:08.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
netllama
12-05-06, 11:58 AM
How reliably does this corruption reproduce?
Do you have a file that will become corrupted every time it is copied?
Thanks,
Lonni
Although I can't reproduce it well right now, the fact is that it occurs with this computer. My previous computer had an nForce 2 and worked very well.
However, when I moved to the newer computer I transferred a PATA disk to it, and it didn't last a month before I got the first ext3 inconsistency. After fsck passed, everything looked all right, until it started to happen more and more frequently. Some system files seemed to be corrupted and were still being loaded.
So I decided to do a clean Debian amd64 install on my second disk, the SATA one. I was happy with it until recently, when it started to show the same signs. Now it boots with some scary messages appearing, and it stops when it comes to starting xorg (the nvidia logo flashes a few times) with a message that xorg startup didn't succeed. Even mplayer segfaults.
I don't have a spare HDD to corrupt right now, but I will probably get one after the weekend. Then I will be able to do some more testing.
I would be glad if you looked at the mailing list thread I linked. There are some more sophisticated problem reports there than mine.
Thank you
netllama
12-05-06, 05:03 PM
I reviewed the LKML thread that you referenced and the problem descriptions there sound vastly different from what you're reporting. Your issue sounds like filesystem and/or in-memory corruption; the issue on LKML, however, isn't occurring at the filesystem level, but in the files themselves.
Additionally, you stated that "it boots with some scary messages" and "mplayer segfaults". It sounds like the data on your disk(s) is getting corrupted even when it's not being actively written to, which is not the same issue as was reported on LKML.
At this point, the information that you've provided suggests a hardware problem (faulty RAM or disk). If you can provide information that suggests otherwise, I can look into your issue further.
I'll look into the LKML issue, however the information that you've provided here is not the same as what was reported on the LKML.
Thanks,
Lonni
I think it highly likely the problem reported in the LKML thread is the same problem reported here by ~domasj. Please read all my messages in that LKML thread. On my system the problem is reproducible 100% of the time. My system had exhibited odd behavior from day one. But the problems occurred extremely infrequently and typically involved symptoms such as being unable to uncompress large archives of diagnostic data sent by customers. I simply shrugged off such problems as being due to the file being corrupted before it reached me.
But eventually my VMware Windows XP guest image exhibited problems. Again, I initially shrugged it off as MS Windows being its usual flaky self. But attempts to do a scratch install kept randomly failing during the installation. So I rebuilt the VMware image on my home Linux server. I then brought that image to work and finished the install by loading IBM specific apps (e.g., Lotus Notes). I then compressed the VMware image with bzip2 and took a copy home. When I attempted to uncompress it I found three files were corrupted. When I attempted to uncompress the original files on my nVidia based workstation I found it impossible to do so. At that point I started doing controlled tests. The results are documented in the LKML thread.
Feel free to contact me if you wish to investigate this further. But as a system support specialist for the past sixteen years who has seen all manner of silent data corruption, I'm pretty confident there is an nForce chipset problem.
netllama
12-05-06, 07:42 PM
krader,
I read through your input on the LKML thread, and have set up a system with the same A8N-SLI Deluxe motherboard that you're using. I've been copying a single 30GB file between two SATA disks on the nFORCE SATA controller, and comparing the sha512sum of the two files after each copying iteration, and haven't run into any differences thus far. I'll let it continue to run; however, I should note that we had an issue exactly like this brought to our attention several months ago, and after investigation it was determined to be an SBIOS bug on that particular vendor's board (it was not Asus, however).
From your LKML post, you stated "copying certain 2 GiB files would result in at least five bytes, and as many as thirty, being corrupted every single time". Can you provide me with these 2GB files, along with details on how you were detecting the corruption? If this issue is specific to the type of data in the file(s), then I'll need you or someone else experiencing this problem to provide the files that trigger the problem.
Thanks,
Lonni
Based on my testing, and the behavior of my system, I don't think there is much doubt that the failure is sensitive to the data pattern and overall "load" on the system. I attempted to upload the four files which consistently exhibit corruption when copied but the attempt failed. They range from 316 MiB to 1.1 GiB bzip2 compressed. If you give me an FTP location I'll be happy to upload them.
When you eventually get the files you'll need to uncompress them first. Note that my system had a 2.2 GHz AMD Athlon 64 dual-core CPU. The filesystems were ext3 with no unusual options. Also, while others have reported corruption with PATA disks, I was unable to reproduce it with them. It appeared that the speed of the disks is a factor.
You'll also note from the LKML thread that I updated the BIOS to the current GA version. That had no noticeable effect. A coworker and I think a likely explanation is that the BIOS is being too aggressive in configuring the chipset (i.e., choosing settings that maximize performance at the expense of stability) and so exceeding the capabilities of some component. But I triple-checked that all BIOS settings were at their most conservative values, so if the BIOS is at fault then it is beyond my control.
netllama
12-06-06, 05:04 PM
krader,
I'll send you a PM shortly with details on where you can upload the files.
In this thread, please detail specifically how you are reproducing the problem, along with how much time is needed to reliably & consistently reproduce the problem.
Thanks,
Lonni
Below is the script I was using. It should be fairly obvious how to adapt it to your configuration. In short:
1) cd to the destination directory
2) copy from the source directory to the current directory
3) flush dirty pages to disk by reading a couple of large files
4) calculate md5sums for the copied files
5) compare the checksums to known correct checksums
6) if any checksums are incorrect do a byte comparison of all files
On my system this would report at least one file being corrupted on every iteration. How long this takes will depend on the speed of your system. Obviously since we don't have root cause, and therefore don't know where the problem lies, there is a risk that you won't be able to reproduce the failure. Your best chance will be to adhere as closely to my configuration as possible.
I'll upload the output of a few runs of the following script as attachments to this thread.
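#!/bin/ksh
# Repeatedly copy the test files and verify the copies against known-good md5sums.
integer i=0
cd /vmware/c || exit 1    # 1) destination directory; stop if it is missing
while : ; do
    i=i+1                 # "integer" variables evaluate arithmetic on assignment
    # 2) copy from the source directory to the current directory
    cp /home/krader/WinXP/* .
    # 3) push the fresh copies out of the page cache by reading two large files
    cat /vmware/WinXP/Windows\ XP-f001.vmdk /vmware/WinXP/Windows\ XP-f002.vmdk > /dev/null
    # 4) checksum the copies
    md5sum Windows* > /tmp/x
    echo iteration $i
    # 5) compare against the known-good checksums
    if diff ~/good.vmware.md5sums /tmp/x ; then
        :
    else
        # 6) on any mismatch, list the differing bytes of every file
        for f in Windows* ; do
            echo "$f"
            cmp -l "/home/krader/WinXP/$f" "$f"
        done
    fi
done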
netllama
12-06-06, 10:22 PM
It doesn't look like you've uploaded the files required to reproduce this.
-Lonni
Correct, I have not uploaded the files which should reproduce this. From my update five hours ago (comment #7):
> I attempted to upload the four files which consistently exhibit corruption when copied but the attempt failed. They range from 316 MiB to 1.1 GiB bzip2 compressed. If you give me an FTP location I'll be happy to upload them.
netllama
12-06-06, 11:51 PM
Right, and see my comment #8.
Sorry, I interpreted "PM" to mean you would send me an email. I now see the "private messages" link at the top of this page and have uploaded the files: file1.bz2 thru file4.bz2
netllama
12-08-06, 02:18 PM
All 4 of these files are reported to have been corrupted when I attempt to bunzip2 them:
bunzip2: Data integrity error when decompressing.
Input file = file1.bz2, output file = file1
The md5sum of the files that you uploaded to the FTP server is the same as what I downloaded from the FTP server.
Am I supposed to be using the files in this state (without bunzip2'ing them first)?
I can't believe I made such a bone-headed mistake. I neglected to put my FTP client in binary mode before uploading the files. I've confirmed my local copies are good with "bunzip2 -t" and am resending the files in binary mode.
Here are the sizes and md5sums you should get after downloading them:
-rw-r--r-- 1 krader krader 492496034 2006-12-08 17:22:31 file1.bz2
-rw-r--r-- 1 krader krader 741700564 2006-12-08 17:23:30 file3.bz2
-rw-r--r-- 1 krader krader 1108086333 2006-12-08 17:25:29 file4.bz2
-rw-r--r-- 1 krader krader 330596543 2006-12-08 17:21:16 file5.bz2
43b730dc2d6a98d3b926c71145e344e0 file1.bz2
9e9c33dcb6e9a85d913d0748175acfc2 file3.bz2
cb3290ae77d1e409144de5abd58cee5e file4.bz2
73fcb62d9a1508408a3f25e96b1ffa9c file5.bz2
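One way to check downloads against a list like the one above is to save the checksum lines to a file and let md5sum verify them. This is a minimal sketch assuming GNU coreutils; the function name `verify_downloads` is made up for illustration, and the list is expected to be in md5sum's own output format (hash, two spaces, file name).

```shell
#!/bin/sh
# verify_downloads DIR LIST   (hypothetical helper, not from the thread)
#   DIR  - directory holding the downloaded files
#   LIST - checksum list in "md5sum" output format
# Returns 0 only if every listed file is present and matches its checksum.
verify_downloads() {
    dir=$1
    list=$2
    # The list is opened before the cd, so a relative LIST path still works.
    # The subshell keeps the caller's working directory untouched.
    ( cd "$dir" && md5sum -c - ) < "$list"
}
```

A file transferred in ASCII mode would show up here as FAILED, which catches the problem before anyone tries to decompress it.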
netllama
12-11-06, 08:16 PM
I've run with your files & script for 3 iterations, and have not been able to reproduce the problem.
I have a few questions:
0) Are you able to consistently reproduce the corruption within 3 iterations with the files & script you provided?
1) Does this still reproduce if you are using a non-SMP kernel?
2) Does this still reproduce if you reduce the RAM to 1GB or 512MB (booting with the mem= kernel parameter)?
Thanks,
Lonni
I have searched around a bit and found out that this (or something very similar) is quite a common problem. People have reported data corruption on nForce 4 based systems regardless of the OS. http://www.nforcershq.com/forum/1-vt5108.htm has a huge thread about it. It seems that this behaviour is caused by hardware, not by software. Although people write about the problem disappearing after a BIOS update, that hasn't helped me. I have now noticed that both the NVIDIA GeForce 6150 GPU (with a heatsink) and the NVIDIA nForce 430 MCP (a bare chip) get so hot during operation that it isn't possible to keep a finger on them. Is it possible that the heat causes my problems?
netllama
12-12-06, 01:29 PM
domasj,
As I stated earlier, the problem that you reported in this thread has different symptoms than the LKML posts you referenced. The nforcershq thread that you referenced appears to have at least a half dozen different, seemingly related problems, with an assortment of potential solutions & workarounds.
You referenced filesystem corruption, whereas the LKML posts are not filesystem corruption, but rather file corruption. You've not provided any additional information to suggest that you're hitting the same issue as in the LKML posts. Are you seeing file corruption or filesystem corruption or both? Please provide detailed instructions on how I can reproduce the problem(s) that you initially reported.
Thanks,
Lonni
The fact is that both of my drives are more or less corrupted. So now I will try to reformat a partition and see what happens with huge files. I should also mention that before the serious fs corruption I had several files copied wrongly from one disk to the other. I copied about 5 GB of photos and some of them turned out corrupted. I replaced the bad ones one by one with the originals, which corrected them; I'll see if I can still find some that I missed. Here it is: http://jozita.lt/Problem.zip contains three pairs of photos corrupted during a PATA->SATA copy operation.
Here are the md5sums for a copy from PATA to a fresh ext3 partition on SATA:
17050436099fb9a05ad03ae2cce1a607 debian-update-3.1r4-i386-1.iso - the source
... and the copies (filenames changed on purpose :) :
a1ce1f3e703aae62ec940eaf5b8019f4 debian-update-3.1r4-i386-0.iso
7762914497922b5fc09b936ec15c094b debian-update-3.1r4-i386-1.iso
1c6e5d218835539f97128e1de3e100a3 debian-update-3.1r4-i386-2.iso
2ac382a59210a3059aa7b2ee3902b0de debian-update-3.1r4-i386-3.iso
Lastly, because of these differences I decided to check the files once more. Every file gave a different md5sum again! Even on the fresh partition the same file gives a different md5sum each time.
netllama
12-12-06, 02:47 PM
Are you stating that the original debian-update-3.1r4-i386-1.iso that you copied has had its md5sum change over time, even though it's not been touched in any way?
With regard to the jpg's in your Problem.zip, I've just copied them from one SATA disk to another, and their md5sums have remained the same. Do these images consistently get corrupted every time you copy them?
Have you verified that you're using the latest BIOS for the motherboard?
Additionally, please provide output from the following commands:
cat /proc/cpuinfo
free -m
uname -a
Thanks,
Lonni
> Are you stating that the original debian-update-3.1r4-i386-1.iso that you copied has had its md5sum change over time, even though it's not been touched in any way?
Yes, it reads differently every time. I also used sha512sum with the same result: different reads. The iso file is 620 MB, but when I try the same with smaller files I get consistent reads.
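The "reads differently every time" observation can be checked without copying anything. The sketch below hashes the same file several times and counts the distinct checksums; `count_distinct_reads` is a hypothetical name, and sha512sum from GNU coreutils is assumed. Note that a file small enough to stay in the page cache may be served from RAM on repeat reads, which could be one reason smaller files look stable.

```shell
#!/bin/sh
# count_distinct_reads FILE [N]   (hypothetical helper, not from the thread)
# Reads FILE N times (default 5), hashing each pass, and prints the number of
# distinct checksums seen. Any result other than 1 means at least two reads
# of the same, unmodified file returned different data.
count_distinct_reads() {
    file=$1
    n=${2:-5}
    i=0
    while [ "$i" -lt "$n" ]; do
        # Hash from stdin so every output line has the same "-" file name.
        sha512sum < "$file"
        i=$((i + 1))
    done | sort -u | wc -l | tr -d ' '
}
```

For example, `count_distinct_reads debian-update-3.1r4-i386-1.iso 10` would be expected to print 1 on a healthy system.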
> With regard to the jpg's in your Problem.zip, I've just copied them from one SATA disk to another, and their md5sums have remained the same. Do these images consistently get corrupted every time you copy them?
As I said, I have corrected other photos by just copying them once more. I believe it doesn't depend on the file; some part of the data stream gets corrupted somewhere.
> Have you verified that you're using the latest BIOS for the motherboard?
> Additionally, please provide output from the following commands:
> cat /proc/cpuinfo
> free -m
> uname -a
Yesterday I updated the BIOS to 0603, which is the most recent one.
Here you are:
cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 79
model name : AMD Athlon(tm) 64 Processor 3000+
stepping : 2
cpu MHz : 1808.435
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow up pni cx16 lahf_lm svm cr8legacy ts fid vid ttp tm stc
bogomips : 3619.75
free -m
             total       used       free     shared    buffers     cached
Mem:           979        954         25          0          3        605
-/+ buffers/cache:        345        633
Swap:          956         66        890
uname -a
Linux debian 2.6.18-1-686 #1 SMP Sat Oct 21 17:21:28 UTC 2006 i686 GNU/Linux
I think the photo corruption happened while using a 64-bit kernel, if that helps.
netllama
12-12-06, 03:19 PM
No one in the LKML thread you referenced reported files getting corrupted on-disk. This still sounds like a separate and unrelated problem, potentially with faulty hardware. Earlier you reported filesystem corruption at bootup. Do you have a log of the corruption errors? Have you run memtest86 to verify that your memory isn't faulty?
With respect to the jpg's getting corrupted, that sounds like the same issue as in the LKML posts, however in light of the filesystem corruption, it's hard to say for certain whether you're not just hitting a variant. If you copy the same jpg file 100 times, what percentage of the time does it end up corrupted?
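The 100-copy question can be answered with a small loop. This is only a sketch: `copy_failure_rate` and the `copy.N` file names are invented for illustration, and because cmp runs right after cp it may be comparing pages still held in memory rather than what reached the disk.

```shell
#!/bin/sh
# copy_failure_rate SRC DESTDIR [N]   (hypothetical helper, not from the thread)
# Copies SRC into DESTDIR N times (default 100), compares each copy with the
# original byte for byte, and prints "<bad> of <N> copies differ (<pct>%)".
copy_failure_rate() {
    src=$1
    destdir=$2
    n=${3:-100}
    bad=0
    i=1
    while [ "$i" -le "$n" ]; do
        copy="$destdir/copy.$i"
        cp "$src" "$copy"
        # cmp -s prints nothing and returns non-zero when the files differ.
        if cmp -s "$src" "$copy"; then
            rm -f "$copy"
        else
            bad=$((bad + 1))
            echo "mismatch kept for inspection: $copy" >&2
        fi
        i=$((i + 1))
    done
    echo "$bad of $n copies differ ($((bad * 100 / n))%)"
}
```

Bad copies are left in place so they can be examined with `cmp -l` afterwards.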
How many memory modules are you using, and which brand?
Thanks,
Lonni
> 0) Are you able to consistently reproduce the corruption within 3
> iterations with the files & script you provided?
Yes, as stated before at least one file exhibits corruption on every
iteration. I spent four days doing nothing but running tests while changing
variables (e.g., the source of the data, whether regular async or direct I/O
was used).
> 1) Does this still reproduce if you are using a non-SMP kernel?
That is one thing I did not try.
> 2) Does this still reproduce if you reduce the RAM to 1GB or 512MB
> (booting with the mem= kernel parameter)?
Yes, the problem still occurred after booting with "mem=1g". Problem also
occurs if I remove half the memory (2 x 1 GiB DIMMS). Problem still occurs
if I swap the pairs of DIMMs I removed with the ones still in the system.
Memtest86 was run 24 hours on the full 4 GiB without error.
> No one in the LKML thread you referenced reported files getting corrupted
> on-disk
Please read that discussion thread again. I reported that the on-disk copy was corrupted. Also, I deliberately tested the case where the copy of the file would remain wholly in memory. I copied a file just over 1 GiB in size with no other activity on the system. An immediate cmp(1) of the source and copy showed no errors. I then read two different 2 GiB files in order to clear the buffer cache of those original files; thus forcing the copy to be synced to disk. Running cmp(1) again showed multiple bytes were corrupted.
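The two-stage comparison described above can be sketched roughly as follows. The function name `copy_and_recheck` and its messages are hypothetical, and the explicit `sync` is an addition that was not part of the described test. Reading other files only evicts the copy from the page cache if those files are large relative to RAM, so a "identical" result on a small test proves little.

```shell
#!/bin/sh
# copy_and_recheck SRC DST [BIGFILE...]   (hypothetical helper)
# 1. copy SRC to DST and compare immediately (likely served from page cache)
# 2. flush dirty pages, read the BIGFILEs to push the copy out of the cache
# 3. compare again, which should now read DST back from disk
# Returns 0 if the second comparison matches, 1 if it differs, 2 if cp fails.
copy_and_recheck() {
    src=$1
    dst=$2
    shift 2
    cp "$src" "$dst" || return 2
    if cmp -s "$src" "$dst"; then
        echo "right after copy: identical"
    else
        echo "right after copy: DIFFERS"
    fi
    sync                       # assumption: make sure the copy is written out
    for big in "$@"; do
        cat "$big" > /dev/null # reading large files displaces cached pages
    done
    if cmp -s "$src" "$dst"; then
        echo "after eviction: identical"
        return 0
    fi
    echo "after eviction: DIFFERS"
    return 1
}
```

On kernels that provide it, writing to /proc/sys/vm/drop_caches as root is another way to empty the page cache between the two comparisons.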
Test results like the above are why I originally suspected the fault lay with the nForce 4 SATA controller. However, subsequent tests showed the corruption occurs even when the disks are attached to a Promise TX2 controller in a PCI slot and when using the onboard Silicon Image 3114 RAID controller. Multiple disks of different models were also tested and the problem occurred with each disk. The SATA cables were also replaced without affecting the symptoms.
So we know the problem is definitely not the SATA controller, cables, or disk. There is good reason to believe the problem is not due to defective memory. This leaves the nForce 4 chipset, the AMD Athlon64 X2 CPU, a mainboard design error, or an error in how the ASUS BIOS is configuring the hardware as the remaining possibilities. Since different people are reporting similar systems with mainboards from different vendors the mainboard design can probably be ruled out. Note that as part of testing this I flashed the BIOS to the most recent version. That had no effect on the symptoms.
netllama
12-12-06, 10:38 PM
Clearly, something differs between your system and mine, as this problem is not reproducing here.