Infiniband ipoib_wq hangs

Post by lmklassen » 2015/03/31 13:50:15

Over the last few months we have seen a number of ipoib_wq hangs (approx. once per month per system). Because IPoIB hangs, the NFS mounts that use the IB interface also hang. This prevents the system from even rebooting, so a power reset is needed to regain functionality. We have a mix of systems, and it only happens on two Intel S2600GZ based systems using Intel AXX1FDRIBIOM Infiniband HCAs. They are running the latest 6.6 kernel, and the HCA firmware has been updated to the latest version.

I have come across http://permalink.gmane.org/gmane.linux.network/311094 which discusses a similar hang in cm_destroy_id, but nothing that would indicate how to rectify the situation.

Any insight into diagnosing and rectifying the issue would be appreciated.

The error message that is generated is:

Mar 30 18:04:16 hubel1 kernel: INFO: task ipoib_wq:2197 blocked for more than 120 seconds.
Mar 30 18:04:16 hubel1 kernel: Not tainted 2.6.32-504.el6.x86_64 #1
Mar 30 18:04:16 hubel1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 30 18:04:16 hubel1 kernel: ipoib_wq D 000000000000000e 0 2197 2 0x00000000
Mar 30 18:04:16 hubel1 kernel: ffff88401196bc50 0000000000000046 0000000000000000 ffff88401196bbe0
Mar 30 18:04:16 hubel1 kernel: ffff882000000001 0000000100000246 0009d8388d65eebf 0000000000000002
Mar 30 18:04:16 hubel1 kernel: ffff88401196bbe0 00000001a511ffd8 ffff88400e946638 ffff88401196bfd8
Mar 30 18:04:16 hubel1 kernel: Call Trace:
Mar 30 18:04:16 hubel1 kernel: [<ffffffff8152a8c5>] schedule_timeout+0x215/0x2e0
Mar 30 18:04:16 hubel1 kernel: [<ffffffff81063bf3>] ? perf_event_task_sched_out+0x33/0x70
Mar 30 18:04:16 hubel1 kernel: [<ffffffff8152a543>] wait_for_common+0x123/0x180
Mar 30 18:04:16 hubel1 kernel: [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
Mar 30 18:04:16 hubel1 kernel: [<ffffffff8152a65d>] wait_for_completion+0x1d/0x20
Mar 30 18:04:16 hubel1 kernel: [<ffffffffa02b08d5>] cm_destroy_id+0xf5/0x320 [ib_cm]
Mar 30 18:04:16 hubel1 kernel: [<ffffffffa02b0c20>] ib_destroy_cm_id+0x10/0x20 [ib_cm]
Mar 30 18:04:16 hubel1 kernel: [<ffffffffa02f7437>] ipoib_cm_free_rx_reap_list+0xa7/0x110 [ib_ipoib]
Mar 30 18:04:16 hubel1 kernel: [<ffffffffa02f74a0>] ? ipoib_cm_rx_reap+0x0/0x20 [ib_ipoib]
Mar 30 18:04:16 hubel1 kernel: [<ffffffffa02f74b5>] ipoib_cm_rx_reap+0x15/0x20 [ib_ipoib]
Mar 30 18:04:16 hubel1 kernel: [<ffffffff81097fe0>] worker_thread+0x170/0x2a0
Mar 30 18:04:16 hubel1 kernel: [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
Mar 30 18:04:16 hubel1 kernel: [<ffffffff81097e70>] ? worker_thread+0x0/0x2a0
Mar 30 18:04:16 hubel1 kernel: [<ffffffff8109e66e>] kthread+0x9e/0xc0
Mar 30 18:04:16 hubel1 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Mar 30 18:04:16 hubel1 kernel: [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
Mar 30 18:04:16 hubel1 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
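
Since this recurs about once a month on each system, it may help to pull every hung-task report out of the logs and compare the traces. Below is a rough Python sketch that does that for lines in the format shown above. The function name `parse_hung_tasks` and the dictionary keys are made up for illustration, and the regular expressions assume the 2.6.32-style trace layout from this log; other kernel versions may print frames differently.

```python
import re

# Header line of a hung-task report, e.g.
# "kernel: INFO: task ipoib_wq:2197 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

# One stack frame, e.g.
# "[<ffffffffa02b08d5>] cm_destroy_id+0xf5/0x320 [ib_cm]"
# A leading "? " marks a frame the kernel considers unreliable.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\? )?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_hung_tasks(lines):
    """Return one dict per hung-task report found in the given log lines."""
    reports = []
    current = None
    for line in lines:
        header = HUNG_RE.search(line)
        if header:
            current = {
                "task": header.group("name"),
                "pid": int(header.group("pid")),
                "timeout": int(header.group("secs")),
                "frames": [],  # (function, module or None), innermost first
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            if frame.group("unreliable"):
                continue  # skip "?" frames, keep only the reliable ones
            current["frames"].append((frame.group("func"), frame.group("module")))
    return reports
```

Feeding it the lines of /var/log/messages (for example `parse_hung_tasks(open("/var/log/messages"))`) and comparing the `frames` lists should show whether every hang ends in `cm_destroy_id` or whether the traces differ between incidents.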

Re: Infiniband ipoib_wq hangs

Post by aks » 2015/03/31 19:14:25

I'm afraid the only insight I can provide (based on information provided) is that the process is "hogging" CPU time - perhaps google that to see if it's a known problem and/or a fix is available.

Re: Infiniband ipoib_wq hangs

Post by TrevorH » 2015/03/31 20:53:21

That's not hogging CPU time; the task is just not responding in a timely manner. That could be a hardware error or it could be a driver error.
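
One way to tell the two cases apart is the task state: the "D" after the task name in the trace header means uninterruptible sleep, whereas a task that is actually burning CPU would normally show up as "R" (running). As a minimal sketch, assuming a Linux /proc filesystem, the state of a live task can be read from /proc/<pid>/stat. The helper names `task_state` and `read_task_state` are hypothetical.

```python
def task_state(stat_line):
    """Extract the one-letter state from the contents of /proc/<pid>/stat.

    The format is "pid (comm) state ppid ...". The comm field may itself
    contain spaces or parentheses, so split at the last ')' rather than
    on whitespace.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def read_task_state(pid):
    """Read the current state letter of a running task (Linux only)."""
    with open(f"/proc/{pid}/stat") as fh:
        return task_state(fh.read())
```

For the task in the log above that would be `read_task_state(2197)` on the affected machine while the hang is in progress.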
