InfiniBand ibnetdiscover excessive timeouts

Post by RafaARV » 2014/11/28 00:48:29

Hi everybody,

I have an InfiniBand network with 165 nodes on a Mellanox IS5200 switch (it has 6 spines but only 5 are populated, with 10 leaves; spine #1 is the one that is not populated). It runs at QDR, and all the nodes have the IBM1060110028 card (a variation of the Mellanox ConnectX-3 made for IBM) with firmware 2.32.5180.

I'm experimenting with one node in preparation for some upgrades, so I installed CentOS 7 with MLNX_OFED 2.3.2 on it. Apparently everything works fine, but I have two problems. The first, minor one is that I can't find the SDP module (even though I installed OFED with the "--all" flag). The second is that when I run the "ibnetdiscover" command on the test node, I get a bunch of timeouts before getting the response. To be more specific:



src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,1 Attr 0xff90:1) bad status 110; Connection timed out
src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,2 Attr 0xff90:1) bad status 110; Connection timed out
src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,4 Attr 0xff90:1) bad status 110; Connection timed out
src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,3 Attr 0xff90:1) bad status 110; Connection timed out
src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,6 Attr 0xff90:1) bad status 110; Connection timed out
.
.
.
src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,19,34,14 Attr 0xff90:1) bad status 110; Connection timed out
src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,19,34,15 Attr 0xff90:1) bad status 110; Connection timed out
src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,19,34,16 Attr 0xff90:1) bad status 110; Connection timed out
src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,19,34,17 Attr 0xff90:1) bad status 110; Connection timed out
src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,19,34,18 Attr 0xff90:1) bad status 110; Connection timed out
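
The messages above all share one format, so they can be tallied instead of read one by one. Below is a minimal sketch in Python, assuming the stderr output of ibnetdiscover has been saved as text lines. The function name `summarize_timeouts` and the regular expression are illustrative only and are not part of any InfiniBand tool. It counts the failures per attribute and per directed-route hop count, which may help show whether the timeouts come from one kind of query or from one region of the fabric.

```python
import re
from collections import Counter

# Matches the "bad status" lines quoted in the post (format assumed from those lines).
TIMEOUT_RE = re.compile(
    r"DR path slid (?P<slid>\d+); dlid (?P<dlid>\d+); (?P<path>[\d,]+) "
    r"Attr (?P<attr>0x[0-9a-fA-F]+):(?P<mod>\d+)\) bad status (?P<status>\d+)"
)

def summarize_timeouts(lines):
    """Count failed queries per attribute ID and per number of hops."""
    by_attr = Counter()
    by_hops = Counter()
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if not m:
            continue  # skip anything that is not a "bad status" line
        path = [int(p) for p in m.group("path").split(",")]
        by_attr[m.group("attr")] += 1
        # The printed path starts with 0, so the hop count is len(path) - 1.
        by_hops[len(path) - 1] += 1
    return by_attr, by_hops

# Three lines copied from the output above.
sample = [
    "src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,1 Attr 0xff90:1) bad status 110; Connection timed out",
    "src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,2 Attr 0xff90:1) bad status 110; Connection timed out",
    "src/query_smp.c:197; umad (DR path slid 0; dlid 0; 0,1,19,34,14 Attr 0xff90:1) bad status 110; Connection timed out",
]

by_attr, by_hops = summarize_timeouts(sample)
print(dict(by_attr))
print(dict(by_hops))
```

In the quoted lines every failure carries the same attribute value (0xff90), so a tally like this would mainly confirm that a single query type is timing out rather than the discovery as a whole.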



There are about 165 timeouts (I'm not posting them all here because there are a LOT) before it gives me the topology. The topology file looks fine (I'm copying the start of it as an example):



#
# Topology file: generated on Thu Nov 27 18:37:26 2014
#
# Initiated from node 0002c903002014e0 port 0002c903002014e1

vendid=0x2c9
devid=0xbd36
sysimgguid=0x2c90200489038
switchguid=0x2c902004849f8(2c902004849f8)
Switch 36 "S-0002c902004849f8" # "MF0;switch-thubat:IS5200/L12/U1" base port 0 lid 21 lmc 0
[1] "H-0002c90300202f80"[1](2c90300202f81) # "compute001 HCA-1" lid 99 4xQDR
[2] "H-0002c90300202060"[1](2c90300202061) # "compute011 HCA-1" lid 79 4xQDR
[3] "H-0002c90300203200"[1](2c90300203201) # "compute021 HCA-1" lid 37 4xQDR
[4] "H-0002c90300202cc0"[1](2c90300202cc1) # "compute031 HCA-1" lid 143 4xQDR
[5] "H-0002c90300202d60"[1](2c90300202d61) # "compute041 HCA-1" lid 80 4xQDR
[6] "H-0002c90300203130"[1](2c90300203131) # "compute051 HCA-1" lid 58 4xQDR
[7] "H-0002c90300202c70"[1](2c90300202c71) # "compute061 HCA-1" lid 89 4xQDR
[8] "H-0002c90300202f00"[1](2c90300202f01) # "compute071 HCA-1" lid 34 4xQDR
[9] "H-0002c90300202dd0"[1](2c90300202dd1) # "compute081 HCA-1" lid 157 4xQDR
[10] "H-0002c90300fe0040"[1](2c90300fe0041) # "mgt02 HCA-1" lid 23 4xQDR
[11] "H-0002c90300202040"[1](2c90300202041) # "compute093 HCA-1" lid 62 4xQDR
[12] "H-0002c903002021e0"[1](2c903002021e1) # "cgcompute009 HCA-1" lid 166 4xQDR
[13] "H-0002c90300202390"[1](2c90300202391) # "compute103 HCA-1" lid 108 4xQDR
[14] "H-0002c903002022e0"[1](2c903002022e1) # "compute113 HCA-1" lid 167 4xQDR
[15] "H-0002c903002015f0"[1](2c903002015f1) # "cgcompute019 HCA-1" lid 176 4xQDR
[16] "H-0002c903002020b0"[1](2c903002020b1) # "compute122 HCA-1" lid 130 4xQDR
[17] "H-0002c903002021a0"[1](2c903002021a1) # "compute132 HCA-1" lid 119 4xQDR
[18] "H-0002c90300201520"[1](2c90300201521) # "cgcompute024 HCA-1" lid 44 4xQDR
[19] "S-0002c90200488e28"[34] # "MF0;switch-thubat:IS5200/S01/U1" lid 5 4xQDR
[20] "S-0002c90200488e28"[35] # "MF0;switch-thubat:IS5200/S01/U1" lid 5 4xQDR
[21] "S-0002c90200488e28"[36] # "MF0;switch-thubat:IS5200/S01/U1" lid 5 4xQDR
[22] "S-0002c90200488e78"[34] # "MF0;switch-thubat:IS5200/S02/U1" lid 7 4xQDR
[23] "S-0002c90200488e78"[35] # "MF0;switch-thubat:IS5200/S02/U1" lid 7 4xQDR
[24] "S-0002c90200488e78"[36] # "MF0;switch-thubat:IS5200/S02/U1" lid 7 4xQDR
[25] "S-0002c90200490490"[34] # "MF0;switch-thubat:IS5200/S03/U1" lid 9 4xQDR
[26] "S-0002c90200490490"[35] # "MF0;switch-thubat:IS5200/S03/U1" lid 9 4xQDR
[27] "S-0002c90200490490"[36] # "MF0;switch-thubat:IS5200/S03/U1" lid 9 4xQDR
[28] "S-0002c90200488e10"[34] # "MF0;switch-thubat:IS5200/S04/U1" lid 4 4xQDR
[29] "S-0002c90200488e10"[35] # "MF0;switch-thubat:IS5200/S04/U1" lid 4 4xQDR
[30] "S-0002c90200488e10"[36] # "MF0;switch-thubat:IS5200/S04/U1" lid 4 4xQDR
[31] "S-0002c90200490478"[34] # "MF0;switch-thubat:IS5200/S05/U1" lid 6 4xQDR
[32] "S-0002c90200490478"[35] # "MF0;switch-thubat:IS5200/S05/U1" lid 6 4xQDR
[33] "S-0002c90200490478"[36] # "MF0;switch-thubat:IS5200/S05/U1" lid 6 4xQDR
[34] "S-0002c90200488e80"[34] # "MF0;switch-thubat:IS5200/S06/U1" lid 8 4xQDR
[35] "S-0002c90200488e80"[35] # "MF0;switch-thubat:IS5200/S06/U1" lid 8 4xQDR
[36] "S-0002c90200488e80"[36] # "MF0;switch-thubat:IS5200/S06/U1" lid 8 4xQDR
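
A topology excerpt like the one above can be checked mechanically, for example to compare the number of hosts and uplinks per switch between two runs. This is a rough sketch in Python, assuming the port lines keep the layout shown in the excerpt. The name `summarize_ports` and the pattern are hypothetical helpers, not something shipped with ibnetdiscover.

```python
import re
from collections import Counter

# Port line layout assumed from the excerpt:
#   [port] "H-guid"[peer_port](port_guid) # "name" lid N rate
#   [port] "S-guid"[peer_port] # "name" lid N rate
PORT_RE = re.compile(
    r'^\[(?P<port>\d+)\]\s+"(?P<kind>[HS])-(?P<guid>[0-9a-f]+)"\[(?P<peer_port>\d+)\]'
    r'(?:\([0-9a-f]+\))?\s+#\s+"(?P<name>[^"]*)"\s+lid\s+(?P<lid>\d+)\s+(?P<rate>\S+)'
)

def summarize_ports(lines):
    """Return the host names and the number of links to each peer switch."""
    hosts = []
    switch_links = Counter()
    for line in lines:
        m = PORT_RE.match(line.strip())
        if not m:
            continue  # header lines such as "Switch 36 ..." are skipped
        if m.group("kind") == "H":
            hosts.append(m.group("name"))
        else:
            switch_links[m.group("name")] += 1
    return hosts, switch_links

# Four port lines copied from the excerpt above.
sample_ports = [
    '[1] "H-0002c90300202f80"[1](2c90300202f81) # "compute001 HCA-1" lid 99 4xQDR',
    '[10] "H-0002c90300fe0040"[1](2c90300fe0041) # "mgt02 HCA-1" lid 23 4xQDR',
    '[19] "S-0002c90200488e28"[34] # "MF0;switch-thubat:IS5200/S01/U1" lid 5 4xQDR',
    '[20] "S-0002c90200488e28"[35] # "MF0;switch-thubat:IS5200/S01/U1" lid 5 4xQDR',
]

hosts, switch_links = summarize_ports(sample_ports)
print(hosts)
print(dict(switch_links))
```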



But I was wondering what these errors mean and why I only get them on this InfiniBand network. (I connected this test node to another IB network [other InfiniBand cards, another switch, and CentOS 6.5 on the other nodes] and I didn't get a single timeout.) Also, on the other nodes of the problematic network I don't get these errors, but they have OFED 1.5.4.1-rc2 and CentOS 6.2 (with the same firmware).

Any help will be VERY VERY appreciated, as I have spent almost 1 month trying to work this out.

Thank you!

Re: InfiniBand ibnetdiscover excessive timeouts

Post by chemal » 2014/11/28 18:39:27

CentOS already comes with full support for InfiniBand. If you are replacing it with Mellanox OFED, then you should probably ask their support for help.
