RPC/NFS Problems Under IO


Post by Blorpy » 2018/01/12 05:27:51

I'm having a problem that I haven't been able to pin down in a new setup across three physical servers.

The setup uses glusterfs (3.12.3) as the clustered storage, with the kernel NFS server (CentOS 7.4.1708) as the frontend to clients. The goal was somewhat-HA NFS, using a gluster replica 3 volume and CTDB from the Samba project to handle moving the VIP for NFS. It works great (mostly) until you throw IO at it. Then CTDB triggers a failover because `rpcinfo -T tcp 127.0.0.1 nfs`, which it uses to check whether NFS is "working", reports "RPC: Timed out" and "program 100003 version 3 is not available". This only happens when IO is occurring; I can let these boxes run for days without IO and a failover is never triggered.
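To take CTDB out of the picture while debugging, the same kind of check can be timed by hand. The following is only a rough Python sketch of an equivalent probe, not what CTDB or rpcinfo actually run: it sends an ONC RPC NULL call (procedure 0) for program 100003 version 3 straight to the NFS port and reports how long the reply takes. The port (2049), the 10 second timeout, and the names `build_null_call` and `probe` are my assumptions. Unlike rpcinfo, it does not ask rpcbind where the service lives.

```python
import socket
import struct
import time

NFS_PROGRAM = 100003  # program number from the CTDB log
NFS_VERSION = 3       # version named in the "not available" message
NFS_PORT = 2049       # assumption: nfsd listens on the standard NFS port


def build_null_call(xid, prog, vers):
    """Build a record-marked ONC RPC v2 call to procedure 0 with AUTH_NONE."""
    # xid, msg_type=CALL(0), rpcvers=2, prog, vers, proc=0,
    # cred flavor/length = 0/0, verifier flavor/length = 0/0
    body = struct.pack(">10I", xid, 0, 2, prog, vers, 0, 0, 0, 0, 0)
    # Record marker: top bit means "last fragment", low 31 bits are the length.
    marker = 0x80000000 | len(body)
    return struct.pack(">I", marker) + body


def probe(host="127.0.0.1", port=NFS_PORT, timeout=10.0):
    """Send one NULL call and return (status, seconds elapsed)."""
    start = time.monotonic()
    xid = int(time.time()) & 0xFFFFFFFF
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(build_null_call(xid, NFS_PROGRAM, NFS_VERSION))
            reply = sock.recv(4096)
        status = "reply" if reply else "closed"
    except socket.timeout:
        status = "timeout"
    except OSError as exc:
        status = f"error: {exc}"
    return status, time.monotonic() - start
```

Calling `probe()` once a second and logging the status and latency next to a timestamp should show whether replies slow down gradually under IO or stall outright.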

I don't understand why rpcinfo is reporting this, but I assume that something related to NFS is choking momentarily. There are no errors reported in the journal or /var/log/messages at all beyond what CTDB provides:

Code:

Jan 12 04:48:49 nfs2 ctdb-eventd[3509]: 60.nfs: ERROR: nfs failed RPC check:
Jan 12 04:48:49 nfs2 ctdb-eventd[3509]: 60.nfs: rpcinfo: RPC: Timed out
Jan 12 04:48:49 nfs2 ctdb-eventd[3509]: 60.nfs: program 100003 version 3 is not available
Jan 12 04:48:49 nfs2 ctdb-eventd[3509]: monitor event failed


The same thing seems to happen on any of these three boxes. I've tried upping the open FD limit for the rpc and rpcuser users, thinking that maybe something was hammering RPC, but it made no difference. Using rpcdebug for nfs didn't yield any useful information, and so far running rpcbind in debug mode hasn't either.

The load on these boxes when IO is happening isn't too crazy: load averages of 26.97, 10.27, 3.92. For context, these are boxes with 20 real cores (40 threads), 256GB of RAM, two 10G links in a port channel, and 16 10k SAS drives in RAID10.

I'm at a loss here and not sure where to continue debugging.


Re: RPC/NFS Problems Under IO

Post by Blorpy » 2018/01/18 06:09:37

The only other detail I've been able to find is that, when the problem happens, the poll() call in rpcinfo times out while waiting for a reply:

Code:

write(3, "\200\0\0(\v\330\235\255\0\0\0\0\0\0\0\2\0\1\206\243\0\0\0\0\0\0\0\0\0\0\0\0"..., 44) = 44
poll([{fd=3, events=POLLIN}], 1, 10000) = 0 (Timeout)
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
write(2, "rpcinfo: RPC: Timed out\n", 24rpcinfo: RPC: Timed out
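For anyone reading the trace, the buffer in the first write() can be decoded as an ONC RPC call. Below is a small Python sketch that unpacks only the 32 bytes strace printed before truncating with "..."; the function name `decode_rpc_call` is mine, and the field layout assumes the standard record-marked RPC v2 call header.

```python
import struct

# The 32 visible bytes from the strace write(), split one field per literal.
VISIBLE = (
    b"\200\0\0("        # record marker
    b"\v\330\235\255"   # xid
    b"\0\0\0\0"         # message type
    b"\0\0\0\2"         # RPC version
    b"\0\1\206\243"     # program
    b"\0\0\0\0"         # program version
    b"\0\0\0\0"         # procedure
    b"\0\0\0\0"         # credential flavor
)


def decode_rpc_call(data):
    """Decode the first eight big-endian 32-bit words of an RPC-over-TCP call."""
    marker, xid, mtype, rpcvers, prog, vers, proc, cred = struct.unpack(
        ">8I", data[:32]
    )
    return {
        "last_fragment": bool(marker & 0x80000000),
        "fragment_length": marker & 0x7FFFFFFF,
        "xid": xid,
        "message_type": "CALL" if mtype == 0 else "REPLY",
        "rpc_version": rpcvers,
        "program": prog,
        "version": vers,
        "procedure": proc,
        "cred_flavor": cred,
    }


decoded = decode_rpc_call(VISIBLE)
for field, value in decoded.items():
    print(f"{field}: {value}")
```

The fragment length of 40 plus the 4-byte marker matches the 44 bytes written, and the program field is 100003 (NFS) with procedure 0, the NULL call. The version field is 0, which fits the rpcinfo man page's description of probing version 0 when no version is given on the command line. So the request does go out, and it is the reply that never arrives within the 10000 ms poll.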


In addition, enabling debug for rpc shows the following around the time of the problem:

Code:

Jan 16 20:26:03 nfs1 kernel: svc: socket ffff882fae621740 TCP (listen) state change 10
Jan 16 20:26:03 nfs1 kernel: svc: socket ffff882fb426ae80 TCP (listen) state change 1
Jan 16 20:26:13 nfs1 kernel: svc: socket ffff882fb426ae80 TCP (listen) state change 8
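Assuming the number after "state change" is the kernel's TCP socket state (sk_state), which I have not confirmed against the 3.10 source, the values can be mapped to names using the constants from the Linux header include/net/tcp_states.h. A short Python sketch, with `state_name` and `OBSERVED` as my own names:

```python
# TCP state numbers as defined in the Linux kernel's include/net/tcp_states.h.
TCP_STATES = {
    1: "TCP_ESTABLISHED",
    2: "TCP_SYN_SENT",
    3: "TCP_SYN_RECV",
    4: "TCP_FIN_WAIT1",
    5: "TCP_FIN_WAIT2",
    6: "TCP_TIME_WAIT",
    7: "TCP_CLOSE",
    8: "TCP_CLOSE_WAIT",
    9: "TCP_LAST_ACK",
    10: "TCP_LISTEN",
    11: "TCP_CLOSING",
}


def state_name(number):
    """Return the symbolic name for a TCP state number."""
    return TCP_STATES.get(number, f"unknown({number})")


# (timestamp, state) pairs copied from the three log lines above.
OBSERVED = [("20:26:03", 10), ("20:26:03", 1), ("20:26:13", 8)]

for timestamp, state in OBSERVED:
    print(timestamp, state, state_name(state))
```

Read that way, a connection arrives on the listening socket and becomes established at 20:26:03, then moves to CLOSE_WAIT (the peer closed its side) exactly 10 seconds later. That lines up with rpcinfo's 10 second poll timeout: the client appears to give up and close before the server ever answers.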


I tried upping the FD limit once again for root, rpc, and rpcuser, but to no avail.

Any help is appreciated.