The setup: glusterfs (3.12.3) as the clustered storage, with the kernel NFS server (CentOS 7.4.1708) as the frontend to clients. The goal is somewhat-HA NFS, using a gluster replica 3 volume, with CTDB (from Samba) handling the VIP moves for NFS. It works great (mostly) until you throw IO at it; then CTDB triggers a failover because the `rpcinfo -T tcp 127.0.0.1 nfs` check it uses to decide whether NFS is "working" reports "RPC: Timed out" and "program 100003 version 3 is not available". This only happens while IO is occurring; I can let these boxes run for days without IO and a failover is never triggered.
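For anyone wanting to reproduce the probe by hand outside of CTDB, here is a minimal sketch of the same check, with an explicit timeout so a hung portmapper is distinguishable from a slow reply (guarded so it degrades on a box without the rpc utilities installed):

```shell
# Run the same health probe CTDB's 60.nfs event script relies on.
# Program 100003 is nfs; exit status 0 means it is registered and answering.
if command -v rpcinfo >/dev/null 2>&1; then
    status=$(timeout 10 rpcinfo -T tcp 127.0.0.1 nfs >/dev/null 2>&1; echo $?)
    echo "rpcinfo exit status: $status"
else
    echo "rpcinfo not installed; run this on one of the NFS nodes"
fi
```

Running this in a loop while generating IO should show whether the check itself flaps independently of CTDB's monitoring interval.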
I don't understand why rpcinfo is reporting this, but I assume something NFS-related is choking momentarily. Nothing is reported in the journal or /var/log/messages beyond what CTDB itself logs:
Jan 12 04:48:49 nfs2 ctdb-eventd: 60.nfs: ERROR: nfs failed RPC check:
Jan 12 04:48:49 nfs2 ctdb-eventd: 60.nfs: rpcinfo: RPC: Timed out
Jan 12 04:48:49 nfs2 ctdb-eventd: 60.nfs: program 100003 version 3 is not available
Jan 12 04:48:49 nfs2 ctdb-eventd: monitor event failed
The same thing happens on any of the three boxes. I've tried raising the open-FD limit for the rpc and rpcuser users, thinking something might be hammering RPC, but it made no difference. Using rpcdebug for nfs didn't yield anything useful, and so far neither has running rpcbind in debug mode.
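For anyone retracing those steps, the debug knobs mentioned above were along these lines. This is a sketch, not the exact commands used: `rpcdebug` ships with nfs-utils and needs root, and on the server side the relevant module is `nfsd` rather than `nfs`; `rpcbind -d` runs rpcbind in the foreground with debug output.

```shell
# Turn on kernel RPC/NFS-server debugging (messages land in dmesg/journal),
# reproduce the IO load, then switch it back off. Requires root.
if command -v rpcdebug >/dev/null 2>&1; then
    echo "rpcdebug available; toggling nfsd/rpc debug flags"
    rpcdebug -m nfsd -s all   # all nfsd (server-side) debug flags
    rpcdebug -m rpc  -s all   # sunrpc layer as well
    # ... reproduce the failing IO here, watch dmesg ...
    rpcdebug -m nfsd -c all   # clear the flags again
    rpcdebug -m rpc  -c all
else
    echo "rpcdebug not installed (part of nfs-utils)"
fi
```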
The load on these boxes while IO is happening isn't crazy: 26.97, 10.27, 3.92 (1/5/15-minute averages). For context, these are boxes with 20 real cores (40 threads), 256GB of RAM, 2x10G links in a port channel, and 16 10k SAS drives in RAID10.
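To put those figures against the hardware, the averages work out to well under one runnable task per hardware thread (though note that on Linux the load average also counts tasks in uninterruptible IO sleep, so disk-bound waits inflate it without meaning the CPUs are busy):

```shell
# Load averages divided by the 40 hardware threads (20 cores, SMT):
awk 'BEGIN { printf "%.2f %.2f %.2f\n", 26.97/40, 10.27/40, 3.92/40 }'
# -> 0.67 0.26 0.10 tasks per thread for the 1/5/15-minute averages
```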
I'm at a loss here and not sure where to continue debugging.