The server has 4 network cards, paired 2 by 2 into 2 bonds.
External network: bond0: eth0 up & running, eth1 active-backup
Internal network: bond1: eth2 up & running, eth3 active-backup
Three times in the past 2 months the main adapter (eth0) has failed, and each time the failover (bond0 is in active-backup mode) did not bring up eth1.
We have no sysadmin in the company to help us solve the issue.
For now I fix it by bringing eth0 down and then up again, but each incident means 5 to 10 minutes of downtime until someone connects and fixes it.
Here is the output of ifconfig:
bond0     Link encap:Ethernet HWaddr 90:b1:1c:xxxxx
          inet addr:195.178.186.222 Bcast:195.178.xxxxxxx Mask:255.255.255.224
          inet6 addr: fe80::92xxxxa:4b1e/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
          RX packets:11806289 errors:0 dropped:563346 overruns:0 frame:0
          TX packets:15209428 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2314496738 (2.3 GB) TX bytes:17247449206 (17.2 GB)

bond1     Link encap:Ethernet HWaddr 00:10:1xxxx:ce
          inet addr:192.168.0.1 Bcast:192.168.0.255 Mask:255.255.255.0
          inet6 addr: fe80::210:18ff:fed3:b1ce/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
          RX packets:161091053340 errors:0 dropped:1071 overruns:0 frame:13821
          TX packets:112926434041 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:99357307904176 (99.3 TB) TX bytes:45744253012472 (45.7 TB)

eth0      Link encap:Ethernet HWaddr 90:b1:xxxxxx4b:1e
          UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
          RX packets:11806289 errors:0 dropped:563346 overruns:0 frame:0
          TX packets:15209428 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2314496738 (2.3 GB) TX bytes:17247449206 (17.2 GB)
          Interrupt:16

eth1      Link encap:Ethernet HWaddr 90:b1:1xxxxxx:1e
          UP BROADCAST SLAVE MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
          Interrupt:17

eth2      Link encap:Ethernet HWaddr 00:10:xxxxx1:ce
          UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
          RX packets:161091053340 errors:0 dropped:1070 overruns:0 frame:13821
          TX packets:112926434041 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:99357307904176 (99.3 TB) TX bytes:45744253012472 (45.7 TB)
          Interrupt:48

eth3      Link encap:Ethernet HWaddr 00:10xxxb1:ce
          UP BROADCAST SLAVE MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
          Interrupt:52

lo        Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:65536 Metric:1
          RX packets:6935638599 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6935638599 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:18028725295176 (18.0 TB) TX bytes:18028725295176 (18.0 TB)

Here is the bonding driver status for bond0 (the format of /proc/net/bonding/bond0):
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eth0 (primary_reselect always)
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 90:b1:1c:4a:4b:1e
Slave queue ID: 0

Slave Interface: eth1
MII Status: down
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr: 90:b1:1c:4a:4b:1f
Slave queue ID: 0
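
In case it helps anyone checking the same thing on their own machine, here is a minimal Python sketch of how a bonding status dump like this one could be scanned for slaves that have no link. It is only an illustration: the function names `slaves_by_mii_status` and `down_slaves` are made up for this post, and the sample text is a trimmed copy of the status shown here. On this output it reports eth1, the backup that is supposed to take over from eth0.

```python
import re

# Trimmed copy of the bonding status quoted in this post.
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth0
MII Status: up

Slave Interface: eth0
MII Status: up

Slave Interface: eth1
MII Status: down
"""


def slaves_by_mii_status(bonding_text):
    """Map each slave interface name to its reported MII status."""
    status = {}
    current = None  # slave section being read; None while in the bond header
    for raw in bonding_text.splitlines():
        line = raw.strip()
        m = re.match(r"Slave Interface:\s*(\S+)", line)
        if m:
            current = m.group(1)
            continue
        m = re.match(r"MII Status:\s*(\S+)", line)
        # The bond-level "MII Status" line appears before any slave section,
        # so it is skipped while current is still None.
        if m and current is not None:
            status[current] = m.group(1)
    return status


def down_slaves(bonding_text):
    """Return the slaves whose MII status is anything other than 'up'."""
    return [name for name, s in slaves_by_mii_status(bonding_text).items()
            if s != "up"]


if __name__ == "__main__":
    print(down_slaves(SAMPLE))
```

To run it against a live system, the text could be read from the bond's status file instead of `SAMPLE` (assuming the usual /proc/net/bonding/ location).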

Here is the kernel log from one of the failures:
Mar 3 10:38:16 localhost kernel: [93739227.917537] tg3 0000:02:00.0 eth0: 0x000068b0: 0xe0011514, 0x00000000, 0x00000000, 0x00000000
Mar 3 10:38:16 localhost kernel: [93739227.930035] tg3 0000:02:00.0 eth0: 0x000068e0: 0x00000000, 0x00000000, 0x00000000, 0x0001c2cc
Mar 3 10:38:16 localhost kernel: [93739227.942529] tg3 0000:02:00.0 eth0: 0x000068f0: 0x00ff000e, 0x00ff0000, 0x00000000, 0x04444444
...
Mar 3 10:38:17 localhost kernel: [93739228.141585] tg3 0000:02:00.0 eth0: 4: NAPI info [0000000a:0000000a:(0000:0000:01ff):04dc:(04dc:04dc:0000:0000)]
Mar 3 10:38:17 localhost kernel: [93739228.201559] bonding: bond0: link status definitely down for interface eth0, disabling it
Mar 3 10:38:17 localhost kernel: [93739228.216343] tg3 0000:02:00.0 eth0: Link is down
Mar 3 10:38:18 localhost kernel: [93739229.253266] bonding: bond0: now running without any active interface !
Mar 3 10:38:18 localhost kernel: [93739229.253331] tg3 0000:08:00.0 eth2: transmit timed out, resetting
Mar 3 10:38:19 localhost kernel: [93739230.509553] tg3 0000:08:00.0 eth2: 0x00000000: 0x165f14e4, 0x00100406, 0x02000000, 0x00800010
Mar 3 10:38:19 localhost kernel: [93739230.521603] tg3 0000:08:00.0 eth2: 0x00000010: 0xd90a000c, 0x00000000, 0xd90b000c, 0x00000000
Mar 3 10:38:19 localhost kernel: [93739230.533658] tg3 0000:08:00.0 eth2: 0x00000020: 0xd90c000c, 0x00000000, 0x00000000, 0x200314e4
Mar 3 10:38:19 localhost kernel: [93739230.545704] tg3 0000:08:00.0 eth2: 0x00000030: 0xdd000000, 0x00000048, 0x00000000, 0x0000010f
Mar 3 10:38:19 localhost kernel: [93739230.557755] tg3 0000:08:00.0 eth2: 0x00000040: 0x00000000, 0xa5000000, 0xc8035001, 0x64002008
Mar 3 10:38:19 localhost kernel: [93739230.569808] tg3 0000:08:00.0 eth2: 0x00000050: 0x818c5803, 0x78000000, 0x0086a005, 0x00000000
...
Mar 3 10:38:23 localhost kernel: [93739234.611688] tg3 0000:08:00.0 eth2: 4: Host status block [00000001:000000df:(0000:0000:0a0f):(0000:0000)]
Mar 3 10:38:23 localhost kernel: [93739234.624030] tg3 0000:08:00.0 eth2: 4: NAPI info [000000c4:000000c4:(0000:0000:01ff):09d4:(01d4:01d4:0000:0000)]
Mar 3 10:38:23 localhost kernel: [93739234.699205] bonding: bond1: link status definitely down for interface eth2, disabling it
Mar 3 10:38:23 localhost kernel: [93739234.738410] tg3 0000:08:00.0: tg3_stop_block timed out, ofs=1400 enable_bit=2
Mar 3 10:38:23 localhost kernel: [93739234.850735] tg3 0000:08:00.0: tg3_stop_block timed out, ofs=c00 enable_bit=2
Mar 3 10:38:23 localhost kernel: [93739234.977285] tg3 0000:08:00.0 eth2: Link is down
Mar 3 10:38:25 localhost kernel: [93739236.081087] bonding: bond1: now running without any active interface !
What could cause the bond to fail to switch over to the hot backup card?
(Any leads on where to look for the root cause of the adapter failure itself would also be much appreciated.)