CentOS 5.2 network bugs, kernel panic (forcedeth, MCP67)


paix
06-18-09, 11:52 AM
Hi all.

I have some strange problems with the network card.
On one occasion I got a kernel panic.
You can see a screenshot from the IPKVM here: http://paix.org.ua/tmp/panik_260509.jpg

Sometimes under high network load my server becomes unreachable over the network, although through the IPKVM the server itself works fine; after service network restart the network card starts working again.

Also under high network load, the server very often answers pings with very long round-trip times:
64 bytes from xxx.xxx: icmp_seq=71 ttl=56 time=2218 ms
64 bytes from xxx.xxx: icmp_seq=72 ttl=56 time=2208 ms
64 bytes from xxx.xxx: icmp_seq=74 ttl=56 time=1047 ms

The server is connected by a 100 Mbit link, and I ping it from a neighbouring machine.
After
service network restart
the server returns to normal behavior.
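
To quantify the symptom, the ping replies quoted above can be summarized with a short script. This is only a sketch in Python: the 100 ms threshold for a "slow" reply and the helper names parse_ping and summarize are assumptions made for illustration, not part of any existing tool.

```python
import re

# Matches the sequence number and round-trip time in a Linux ping reply line.
RTT_RE = re.compile(r"icmp_seq=(\d+).*?time=([\d.]+) ms")

def parse_ping(lines):
    """Return a list of (icmp_seq, rtt_ms) tuples from ping reply lines."""
    replies = []
    for line in lines:
        m = RTT_RE.search(line)
        if m:
            replies.append((int(m.group(1)), float(m.group(2))))
    return replies

def summarize(replies, slow_ms=100.0):
    """List slow replies and the sequence numbers missing between first and last reply.

    slow_ms is an assumed threshold; on a local 100 Mbit link normal
    round-trip times should be far below it.
    """
    seqs = [seq for seq, _ in replies]
    missing = sorted(set(range(seqs[0], seqs[-1] + 1)) - set(seqs)) if seqs else []
    slow = [(seq, rtt) for seq, rtt in replies if rtt > slow_ms]
    return {"slow": slow, "missing": missing}

# The three replies quoted in the post.
sample = [
    "64 bytes from xxx.xxx: icmp_seq=71 ttl=56 time=2218 ms",
    "64 bytes from xxx.xxx: icmp_seq=72 ttl=56 time=2208 ms",
    "64 bytes from xxx.xxx: icmp_seq=74 ttl=56 time=1047 ms",
]

result = summarize(parse_ping(sample))
print(result)
```

On the three replies quoted above, every reply counts as slow and icmp_seq=73 is reported as missing, so the excerpt shows packet loss as well as high latency.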

I am running an OpenVZ kernel, which is based on the current RHEL5 kernel.

#uname -a
Linux domain 2.6.18-128.1.1.el5.028stab062.3 #1 SMP Sun May 10 18:54:51 MSD 2009 x86_64 x86_64 x86_64 GNU/Linux


Base Board Information
Manufacturer: ASUSTeK Computer INC.
Product Name: M2N-VM DVI

# from dmesg:
00:0a.0 Ethernet controller: nVidia Corporation MCP67 Ethernet (rev a2)
forcedeth.c: Reverse Engineered nForce ethernet driver. Version 0.60.
forcedeth: using HIGHDMA
eth0: forcedeth.c: subsystem: 01043:82b3 bound to 0000:00:0a.0


# lspci
00:00.0 RAM memory: nVidia Corporation MCP67 Memory Controller (rev a2)
00:01.0 ISA bridge: nVidia Corporation MCP67 ISA Bridge (rev a2)
00:01.1 SMBus: nVidia Corporation MCP67 SMBus (rev a2)
00:02.0 USB Controller: nVidia Corporation MCP67 OHCI USB 1.1 Controller (rev a2)
00:02.1 USB Controller: nVidia Corporation MCP67 EHCI USB 2.0 Controller (rev a2)
00:04.0 USB Controller: nVidia Corporation MCP67 OHCI USB 1.1 Controller (rev a2)
00:04.1 USB Controller: nVidia Corporation MCP67 EHCI USB 2.0 Controller (rev a2)
00:06.0 IDE interface: nVidia Corporation MCP67 IDE Controller (rev a1)
00:07.0 Audio device: nVidia Corporation MCP67 High Definition Audio (rev a1)
00:08.0 PCI bridge: nVidia Corporation MCP67 PCI Bridge (rev a2)
00:09.0 IDE interface: nVidia Corporation MCP67 AHCI Controller (rev a2)
00:0a.0 Ethernet controller: nVidia Corporation MCP67 Ethernet (rev a2)
00:0b.0 PCI bridge: nVidia Corporation MCP67 PCI Express Bridge (rev a2)
00:0c.0 PCI bridge: nVidia Corporation MCP67 PCI Express Bridge (rev a2)
00:0d.0 PCI bridge: nVidia Corporation MCP67 PCI Express Bridge (rev a2)
00:0e.0 PCI bridge: nVidia Corporation MCP67 PCI Express Bridge (rev a2)
00:0f.0 PCI bridge: nVidia Corporation MCP67 PCI Express Bridge (rev a2)
00:10.0 PCI bridge: nVidia Corporation MCP67 PCI Express Bridge (rev a2)
00:11.0 PCI bridge: nVidia Corporation MCP67 PCI Express Bridge (rev a2)
00:12.0 VGA compatible controller: nVidia Corporation GeForce 7050 PV / nForce 630a (rev a2)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control


Any advice will be greatly appreciated!
Thanks!

paix
06-18-09, 11:55 AM
Sorry, I'm running CentOS 5.3, not 5.2

whig
06-18-09, 06:59 PM
Since you can restart the network, can you run "top" to see whether any processes are using 99% CPU/RAM/swap?

paix
06-19-09, 04:13 AM
Quote from whig: You can restart the network, can you run "top" to see if any processes are using 99% cpu/ram/swap?

No CPU-, RAM- or disk-intensive processes are running when the network card becomes unavailable (or while it is not responding).
There are also no messages about this in log/messages or dmesg.

paix
06-19-09, 09:09 AM
Recently I got two panics while testing the network with iperf.

kernel booted with
irqpoll nousb noapic

http://paix.org.ua/tmp/panic_190609.jpg

kernel booted with
nousb noapic
and

alias eth0 forcedeth
options forcedeth optimization_mode=1

http://paix.org.ua/tmp/panic2_190609.jpg

Also there is one interesting oops in log/messages:
kernel: skb_over_panic: text:ffffffff881bf46f len:15398 put:15398 head:ffff8100a25c5800 data:ffff8100a25c5810 tail:ffff8100a25c9436 end:ffff8100a25c5e80 dev:eth0
kernel: ----------- [cut here ] --------- [please bite here ] ---------
kernel: Kernel BUG at net/core/skbuff.c:96
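
The addresses in the skb_over_panic line can be checked with a little arithmetic. The sketch below, in Python, assumes the usual sk_buff meaning of the fields (head and end bound the allocated buffer, data and tail bound the packet data in it); the function names parse_skb_over_panic and analyze are made up for this example.

```python
import re

# The skb_over_panic line from log/messages, as posted above.
LINE = (
    "skb_over_panic: text:ffffffff881bf46f len:15398 put:15398 "
    "head:ffff8100a25c5800 data:ffff8100a25c5810 "
    "tail:ffff8100a25c9436 end:ffff8100a25c5e80 dev:eth0"
)

def parse_skb_over_panic(line):
    """Extract the numeric fields of a skb_over_panic message."""
    fields = dict(re.findall(r"\b(len|put|head|data|tail|end):([0-9a-f]+)", line))
    out = {k: int(fields[k], 10) for k in ("len", "put")}  # decimal byte counts
    out.update({k: int(fields[k], 16) for k in ("head", "data", "tail", "end")})  # addresses
    return out

def analyze(f):
    """Derive buffer geometry from the parsed addresses."""
    return {
        "buffer_size": f["end"] - f["head"],  # bytes allocated for the buffer
        "headroom": f["data"] - f["head"],    # bytes reserved before the data
        "data_len": f["tail"] - f["data"],    # bytes the skb claims to hold
        "overrun": f["tail"] - f["end"],      # bytes written past the end
    }

fields = parse_skb_over_panic(LINE)
info = analyze(fields)
print(info)
```

For the line above this gives a 1664-byte buffer with 16 bytes of headroom, a data length of 15398 bytes (matching len and put), and a tail that lies 13750 bytes past end. In other words, something asked to put 15398 bytes into a buffer with room for at most 1648, which is far more than a standard 1500-byte Ethernet frame. That would be consistent with a bogus frame length coming from the NIC or driver, but the log line alone does not prove where the length came from.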

I also booted the kernel with the crashkernel=128M@16M option and have kdump running, but unfortunately no core was saved. I couldn't reboot the server via the IPKVM either, so I asked support to do a hardware reboot of the server.

dmesg:
http://paix.org.ua/tmp/dmesg_190609.txt

whig
06-20-09, 07:18 PM
Post the results of this command in the problem condition:
top -b -n 1

paix
06-22-09, 07:39 AM
Quote from whig: Post the results of this command in the problem condition: top -b -n 1

The NIC completely freezes the server. I can't even reboot the server through the IPKVM :(

I stressed the NIC today with the iperf package (from EPEL; description: Iperf is a tool to measure maximum TCP bandwidth) and got the panic.

The screenshot from the IPKVM is here: http://paix.org.ua/tmp/panic_220609.jpg
Unfortunately it doesn't contain any information that helps identify the problem.

The kernel was loaded with

kernel /vmlinuz-2.6.18-128.1.1.el5.028stab062.3 ro root=/dev/VolGroupSys/LogVolRoot crashkernel=128M@16M nousb noapic debug=2
and
options forcedeth optimization_mode=1


# dmesg |grep forcedeth
forcedeth.c: Reverse Engineered nForce ethernet driver. Version 0.62-Driver Package V1.25.
forcedeth: using HIGHDMA
eth0: forcedeth.c: subsystem: 01043:82b3 bound to 0000:00:0a.0

whig
06-24-09, 04:46 AM
As you mentioned earlier, service network restart resumes the network; post the requested dump because it could be indicative.

paix
07-06-09, 03:42 AM
This topic is continued here:
http://www.centos.org/modules/newbb/viewtopic.php?topic_id=20835 (CentOS 5.3 nvidia nForce network bugs, kernel panics (forcedeth, MCP67))