Need Help Troubleshooting Infiniband on CentOS 7

Issues related to configuring your network
TronCarter
Posts: 3
Joined: 2015/05/27 22:17:08

Need Help Troubleshooting Infiniband on CentOS 7

Post by TronCarter » 2015/05/28 12:38:44

I'm new to CentOS, but have used other Linux variants for many years. I know that the key to troubleshooting is to narrow down and eventually identify the actual problem(s), and I could use some help doing so.

Background:
I have a 6-year-old, 10-node i86 cluster running Solaris 10, and each of the nodes has a Mellanox Infiniband card. I have a Qlogic 9024 Infiniband switch that all nodes are connected to. 9 of the nodes were purchased initially and have (what I would call) 'Infiniband only' cards, in that the connectors on the back are the CX4 style. So those are CX4 cable to CX4 connector on the Qlogic. The 10th node was purchased a few years later (refurbished, so I don't know the age) and it has a QSFP+ (I believe) connector on the card and a cable that converts QSFP+ to CX4 and plugs into the Qlogic. Everything works as expected on the current setup.

The Project:
I have been tasked with adding three new nodes to the cluster and at the same time changing everything from Solaris to CentOS. The new nodes are Dell R620s with Mellanox QSFP+ style Infiniband cards. Over the past few weeks I have used the three nodes to test installing CentOS and to document the process.

What I know:
All of the non-Infiniband stuff works fine, but when I moved the three new nodes into the server room and connected them to the Qlogic switch for the first time I had problems. I don't get a blue light on the Qlogic port like I do with the first 10 (operational) nodes. In CentOS, running an 'ip link' command shows the state of both infiniband ports to be DOWN. From what I have read, if I am connected properly to the Qlogic, at minimum the output of 'ip link' should show an INIT state for the one port that is connected (It is a dual-port card and I am just using one port/cable). I have the "Infiniband Support" group installed. I can see p1p1 and p1p2 listed in ifconfig or ip link.

Code:

6: p1p1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT qlen 1000
    link/ether e4:1d:2d:06:d6:f0 brd ff:ff:ff:ff:ff:ff
7: p1p2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT qlen 1000
    link/ether e4:1d:2d:06:d6:f1 brd ff:ff:ff:ff:ff:ff

What I don't know:
Perhaps the new Infiniband cards are either not compatible with the Qlogic 9024, or by default they run at a higher speed and therefore I need to configure the switch for some 12X ports.
Perhaps I don't have them configured properly in CentOS and the switch sees a misconfigured card and doesn't activate the port. I know you can get different modules that plug into the back of a QSFP+ Infiniband card and make its ports into ethernet ports, and perhaps other options as well. Do I need to tell CentOS what type of module or connection I'm using (ethernet, ib, etc.)? I didn't need to specify in any of the Solaris installations what type of connection it was. I believe once I had the firmware updated and ran devfsadm -C, I was able to assign an IP address and it worked.

I could use any help in getting pointed in the right direction or maybe someone has done this before and can guide me through the process.

Tron

TronCarter
Posts: 3
Joined: 2015/05/27 22:17:08

Re: Need Help Troubleshooting Infiniband on CentOS 7

Post by TronCarter » 2015/05/28 14:22:28

A little further info after some poking around:

I logged into the CLI on the Qlogic and set ports 22, 23, and 24 for ENABLED AUTO 1X/4X/12X. I moved the three new server cables to those ports and rebooted the Qlogic. No change, I still don't get a blue light. I also tried moving one of the working cables from port 10 to 11. The blue light came on and it started working.

Also, the ports on the new cards are actually SFP+.

TronCarter
Posts: 3
Joined: 2015/05/27 22:17:08

Re: Need Help Troubleshooting Infiniband on CentOS 7

Post by TronCarter » 2015/05/28 20:17:52

Even more info:

Code:

root@SOL11:/users/gaudio# dmesg | grep infi
[   18.194733] infiniband mlx4_0: ib_register_mad_agent: QP 0 not supported
[   48.242321] infiniband mlx4_0: ib_register_mad_agent: QP 0 not supported
[   78.288526] infiniband mlx4_0: ib_register_mad_agent: QP 0 not supported
[  108.334354] infiniband mlx4_0: ib_register_mad_agent: QP 0 not supported

...and many more lines like this. So, I googled that error and was only able to come up with one reference to it (in Japanese), and that person said they got the same error when they upgraded to CentOS 7.1. I am going to try an install of 7.0 and see what happens.

aalexson
Posts: 1
Joined: 2016/04/09 20:36:37

Re: Need Help Troubleshooting Infiniband on CentOS 7

Post by aalexson » 2016/04/09 20:39:27

Hi Tron,
Have you gotten anywhere with this? I'm getting the same issue. I've connected two servers directly together, and at least the ports on the servers come up. When I connect the servers to a Voltaire 4036e switch, there is no link on the ports.

Thanks in advance.

Gerb
Posts: 3
Joined: 2018/08/16 09:53:19

Re: Need Help Troubleshooting Infiniband on CentOS 7

Post by Gerb » 2018/08/16 10:05:32

The same message occurs when connecting two ports that have the "Ethernet" link layer (see output of ibstat). You can switch the link layer on Mellanox cards with a line like this in /etc/modprobe.d/mlx4.conf:

options mlx4_core port_type_array=1,2

Then unload the mlx4_en, mlx4_ib and mlx4_core kernel modules with modprobe -r, and load mlx4_core again with modprobe.

This means that the first port should get link layer "Infiniband" and the second "Ethernet". I accidentally put a cable between two "Ethernet" ports and got the same message. Using the two "Infiniband" ports it worked, provided that opensm runs. See also "modinfo -p mlx4_core".
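
The steps above can be sketched as a small shell script. This is only a rough outline, not a tested procedure: the function names (show_link_layers, port_type_line, reload_mlx4) are made up for illustration, and it assumes the card uses the mlx4 driver and exposes its ports under /sys/class/infiniband. The reload function is defined but not called, because unloading can fail if other modules or services still use mlx4.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; none of these names come from
# Mellanox tooling.

# Print the link layer ("InfiniBand" or "Ethernet") of every port found
# under a sysfs-style tree. Defaults to /sys/class/infiniband.
show_link_layers() {
    root="${1:-/sys/class/infiniband}"
    for f in "$root"/*/ports/*/link_layer; do
        [ -r "$f" ] || continue          # glob matched nothing, or unreadable
        port_dir=$(dirname "$f")
        port=$(basename "$port_dir")
        dev=$(basename "$(dirname "$(dirname "$port_dir")")")
        echo "$dev port $port: $(cat "$f")"
    done
}

# Build the line for /etc/modprobe.d/mlx4.conf from per-port link layers.
# As in the post above: 1 = Infiniband, 2 = Ethernet.
port_type_line() {
    types=""
    for layer in "$@"; do
        case "$layer" in
            ib)  code=1 ;;
            eth) code=2 ;;
            *)   echo "unknown link layer: $layer" >&2; return 1 ;;
        esac
        types="${types:+$types,}$code"
    done
    echo "options mlx4_core port_type_array=$types"
}

# Reload the driver so the new option takes effect. Needs root, and the
# unload is refused while anything else still holds the modules.
reload_mlx4() {
    modprobe -r mlx4_en mlx4_ib mlx4_core && modprobe mlx4_core
}

show_link_layers
port_type_line ib eth
```

Run as written, it lists the current link layer of each port (nothing, if no device is present) and prints the options line for "first port Infiniband, second port Ethernet". After putting that line into /etc/modprobe.d/mlx4.conf, calling reload_mlx4 as root and then show_link_layers again should show whether the change took.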

Gerben
