kernel oops: bnx2 do_IRQ causing network connection to drop

Issues related to hardware problems
DrewL
Posts: 2
Joined: 2015/06/25 15:02:32

kernel oops: bnx2 do_IRQ causing network connection to drop

Post by DrewL » 2015/06/25 15:53:59

Hi Folks --

Not sure yet if this belongs in hardware or software, so please bear with me...

We're running into a kernel oops on a system recently upgraded to CentOS 6.6. The root of the issue seems to be the bnx2 driver. The system has two bnx2 NICs in a bonded configuration. Whatever is being triggered, the end result is that the eth0 (active) card flaps and causes failover several times in a single event.

To try to correct the issue in software, we upgraded the kernel to the latest. Unfortunately this does not seem to have fixed it (same driver version, different kernel). Looking further, the cause could be anything from faulty hardware right up to kernel-level driver issues. The hardware itself is an IBM HS22 7870 blade, which has two onboard Broadcom NetXtreme II cards. We had also considered disabling irqbalance, but that's something we can't really do in this case.

Hoping someone else might have run into this, or can point me in the right direction! I'm still trying to figure out if this is faulty hardware, or if there are settings that need tweaking or firmware/software that needs to be updated here.

I've pasted everything to http://pastebin.centos.org/29146/14352469/ with a password of 'bnx2'.

Here are some extracts from today's /var/log/messages (see pastebin for the full log)...

[ log ]# zegrep "kernel|bond|eth" messages
Jun 25 04:35:48 bwas-02-a kernel: do_IRQ: 15.212 No irq handler for vector (irq -1)
Jun 25 04:36:07 bwas-02-a kernel: ------------[ cut here ]------------
Jun 25 04:36:07 bwas-02-a kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26b/0x280() (Not tainted)
... [ snip ]
Jun 25 04:36:07 bwas-02-a kernel: Call Trace:
Jun 25 04:36:07 bwas-02-a kernel: <IRQ> [<ffffffff81074e47>] ? warn_slowpath_common+0x87/0xc0
Jun 25 04:36:07 bwas-02-a kernel: [<ffffffff81074f36>] ? warn_slowpath_fmt+0x46/0x50
Jun 25 04:36:07 bwas-02-a kernel: [<ffffffff8147e67b>] ? dev_watchdog+0x26b/0x280
Jun 25 04:36:07 bwas-02-a kernel: [<ffffffff81060e1e>] ? scheduler_tick+0x11e/0x260
Jun 25 04:36:07 bwas-02-a kernel: [<ffffffff8147e410>] ? dev_watchdog+0x0/0x280
Jun 25 04:36:07 bwas-02-a kernel: [<ffffffff81087e07>] ? run_timer_softirq+0x197/0x340
Jun 25 04:36:07 bwas-02-a kernel: [<ffffffff810b03c5>] ? tick_dev_program_event+0x65/0xc0
Jun 25 04:36:07 bwas-02-a kernel: [<ffffffff8107d901>] ? __do_softirq+0xc1/0x1e0
Jun 25 04:36:07 bwas-02-a kernel: [<ffffffff810b049a>] ? tick_program_event+0x2a/0x30
Jun 25 04:36:07 bwas-02-a kernel: [<ffffffff8100c38c>] ? call_softirq+0x1c/0x30
Jun 25 04:36:07 bwas-02-a kernel: [<ffffffff8100fbd5>] ? do_softirq+0x65/0xa0
Jun 25 04:36:07 bwas-02-a kernel: [<ffffffff8107d7b5>] ? irq_exit+0x85/0x90
Jun 25 04:36:07 bwas-02-a kernel: [<ffffffff81533c6a>] ? smp_apic_timer_interrupt+0x4a/0x60
Jun 25 04:36:07 bwas-02-a kernel: [<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20
Jun 25 04:36:07 bwas-02-a kernel: <EOI> [<ffffffff812eab5e>] ? intel_idle+0xde/0x170
Jun 25 04:36:07 bwas-02-a kernel: [<ffffffff812eab41>] ? intel_idle+0xc1/0x170
Jun 25 04:36:07 bwas-02-a kernel: [<ffffffff81426167>] ? cpuidle_idle_call+0xa7/0x140
Jun 25 04:36:07 bwas-02-a kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
Jun 25 04:36:07 bwas-02-a kernel: [<ffffffff81522e8d>] ? start_secondary+0x2be/0x301
Jun 25 04:36:07 bwas-02-a kernel: ---[ end trace 057af7c3f5dab123 ]---
... [ snip ]
Jun 25 10:46:21 bwas-02-a kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Jun 25 10:46:21 bwas-02-a kernel: bonding: bond0: making interface eth1 the new active one.
Jun 25 10:46:22 bwas-02-a kernel: bonding: bond0: link status definitely up for interface eth0.
Jun 25 10:46:22 bwas-02-a kernel: bonding: bond0: making interface eth0 the new active one.
Jun 25 10:46:25 bwas-02-a kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Jun 25 10:46:25 bwas-02-a kernel: bonding: bond0: making interface eth1 the new active one.
Jun 25 10:46:26 bwas-02-a kernel: bonding: bond0: link status definitely up for interface eth0.
Jun 25 10:46:26 bwas-02-a kernel: bonding: bond0: making interface eth0 the new active one.
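
For anyone wanting to quantify the flapping rather than eyeball it, here is a rough Python sketch that counts down-then-up cycles per bonded interface from bonding lines like the ones above. The function name `count_flaps` and the regular expression are my own; it only assumes the message wording matches what is quoted in this post.

```python
import re

# Matches bonding driver lines such as:
# "... bonding: bond0: link status definitely down for interface eth0, disabling it"
LINK_RE = re.compile(
    r"bonding: (?P<bond>\S+): link status definitely (?P<state>up|down) "
    r"for interface (?P<iface>\w+)"
)


def count_flaps(lines):
    """Return {(bond, iface): number of down->up cycles} for the given log lines."""
    last_state = {}
    flaps = {}
    for line in lines:
        m = LINK_RE.search(line)
        if not m:
            continue  # not a link status line (e.g. "making interface ... active")
        key = (m.group("bond"), m.group("iface"))
        state = m.group("state")
        # A flap is counted when an interface comes back up after being down.
        if state == "up" and last_state.get(key) == "down":
            flaps[key] = flaps.get(key, 0) + 1
        last_state[key] = state
    return flaps


# Sample input: the bonding lines from the extract above.
sample = [
    "Jun 25 10:46:21 bwas-02-a kernel: bonding: bond0: link status definitely down for interface eth0, disabling it",
    "Jun 25 10:46:21 bwas-02-a kernel: bonding: bond0: making interface eth1 the new active one.",
    "Jun 25 10:46:22 bwas-02-a kernel: bonding: bond0: link status definitely up for interface eth0.",
    "Jun 25 10:46:22 bwas-02-a kernel: bonding: bond0: making interface eth0 the new active one.",
    "Jun 25 10:46:25 bwas-02-a kernel: bonding: bond0: link status definitely down for interface eth0, disabling it",
    "Jun 25 10:46:25 bwas-02-a kernel: bonding: bond0: making interface eth1 the new active one.",
    "Jun 25 10:46:26 bwas-02-a kernel: bonding: bond0: link status definitely up for interface eth0.",
    "Jun 25 10:46:26 bwas-02-a kernel: bonding: bond0: making interface eth0 the new active one.",
]
print(count_flaps(sample))
```

To run it against a real log, pass an open file instead of the sample list, e.g. `count_flaps(open("/var/log/messages"))`.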

DrewL
Posts: 2
Joined: 2015/06/25 15:02:32

Re: kernel oops: bnx2 do_IRQ causing network connection to d

Post by DrewL » 2015/07/07 12:28:42

Just a quick update to let you all know what has been tried so far to correct the problem. Each change was allowed to 'soak' for at least a week. In each case we have run into the same "No irq handler" error in less than 10 days.
  • kernel upgrade to the latest [ 2.6.32-358 -> 2.6.32-504.23.4 ] with a full 'yum update'
  • bnx2 module upgrade to the latest Broadcom/QLogic reference code, built from source
  • firmware update for the Broadcom NetXtreme II from the vendor
Just ran into a new IRQ error with the associated kernel oops and link flaps again this morning. :-\
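
To track how often the error recurs across soak periods, a small Python sketch like the following could pull every "No irq handler" event out of the logs. It assumes the two numbers in `do_IRQ: 15.212` are the CPU number and the interrupt vector, which is my reading of the x86 kernel message format and worth double-checking against your kernel source. The name `parse_unhandled` is made up for this example.

```python
import re

# Assumed format: "do_IRQ: <cpu>.<vector> No irq handler for vector (irq <n>)"
IRQ_RE = re.compile(
    r"do_IRQ: (?P<cpu>\d+)\.(?P<vector>\d+) No irq handler for vector"
)


def parse_unhandled(lines):
    """Return a list of (timestamp, cpu, vector) tuples, one per matching line."""
    events = []
    for line in lines:
        m = IRQ_RE.search(line)
        if m:
            # A classic syslog timestamp ("Jun 25 04:35:48") is the first 15 characters.
            timestamp = line[:15]
            events.append((timestamp, int(m.group("cpu")), int(m.group("vector"))))
    return events


# Sample input: the do_IRQ line from the first post in this thread.
sample = [
    "Jun 25 04:35:48 bwas-02-a kernel: do_IRQ: 15.212 No irq handler for vector (irq -1)",
    "Jun 25 04:36:07 bwas-02-a kernel: ------------[ cut here ]------------",
]
for event in parse_unhandled(sample):
    print(event)
```

If the same CPU or vector shows up every time, that may be a useful detail to include in a bug report or a support case with the hardware vendor.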

Replacing the hardware is next, I think. Sort of suspected that this was the problem from the beginning.

TrevorH
Site Admin
Posts: 33219
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: kernel oops: bnx2 do_IRQ causing network connection to d

Post by TrevorH » 2015/07/07 13:32:22

Did you try appending pci=nomsi yet as I suggested on IRC?
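
For context, `pci=nomsi` is a kernel boot parameter that disables MSI interrupts system-wide. Below is a hedged Python sketch for checking whether the running kernel was booted with it and whether a NIC's interrupts are still listed as MSI. The helper names and the sample `/proc/interrupts` text are hypothetical illustrations, not output from the system in this thread.

```python
def cmdline_has_nomsi(cmdline):
    """True if 'pci=nomsi' appears as a separate parameter on the kernel command line."""
    return "pci=nomsi" in cmdline.split()


def msi_lines(interrupts_text, iface):
    """Return /proc/interrupts lines for iface whose interrupt type mentions MSI."""
    hits = []
    for line in interrupts_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The device name is the last column; MSI-X queues may appear as e.g. "eth0-0".
        if fields[-1].startswith(iface) and "MSI" in line:
            hits.append(line.strip())
    return hits


# Hypothetical sample data, for illustration only.
sample_cmdline = "ro root=/dev/mapper/vg-root rhgb quiet pci=nomsi"
sample_interrupts = """\
           CPU0       CPU1
  0:        120          0   IO-APIC-edge      timer
 68:      48211          0   PCI-MSI-edge      eth0
 69:       1032          0   PCI-MSI-edge      eth1
"""

print(cmdline_has_nomsi(sample_cmdline))
print(msi_lines(sample_interrupts, "eth0"))
```

On a live system the inputs would come from `open("/proc/cmdline").read()` and `open("/proc/interrupts").read()`. On CentOS 6 the parameter would normally be appended to the kernel line in /boot/grub/grub.conf, followed by a reboot.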
