cLVM is driving me nuts

Issues related to applications and software problems
jeffinto
Posts: 9
Joined: 2019/05/09 19:52:32

cLVM is driving me nuts

Post by jeffinto » 2019/05/09 20:08:48

Hi Everyone,

First, I'd like to thank you guys for having such an awesome forum; I've never had to post before as a result. But this time I can't find a solution anywhere.

I have 2 identical servers: hardware, package versions, everything. I followed this guide to set up GFS2:
https://access.redhat.com/documentation ... samba-haaa

However, after the last step only the node I was working on mounted it successfully. The other one said:

Code: Select all

Couldn't find device

After some digging I found the issue: the system didn't create the volume group device because of an

Code: Select all

Error: locking on node2: volume is busy on another node
I checked and changed the lvm.conf locking type to 3 on all machines, but still nothing.
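
One quick way to confirm that clustered locking is actually in effect on each node is to check the locking_type in lvm.conf and the clustered bit in the VG attributes. A minimal sketch (the VG name is the one from the logs below):

Code: Select all

grep -E '^[[:space:]]*locking_type' /etc/lvm/lvm.conf   # should report locking_type = 3 on every node
vgs -o vg_name,vg_attr clustered_vg                     # a 'c' in the 6th attribute character means the VG is clustered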

I tried killing clvmd and stonith restarted the node as it should.

I can successfully move the mount point between them, but not mount it together so I'm sure it's a locking issue. I just can't find anything wrong to fix it.

Any help would be most appreciated.

Messages log

Code: Select all

May  9 15:34:33 csn2 clvm(clvmd)[8349]: INFO:  Error locking on node 2: Volume is busy on another node 0 logical volume(s) in volume group "clustered_vg" now active
May  9 15:34:33 csn2 Filesystem(fs)[8572]: INFO: Running start for /dev/clustered_vg/kvm_gfs on /var/lib/libvirt/images
May  9 15:34:33 csn2 Filesystem(fs)[8572]: ERROR: Couldn't find device [/dev/clustered_vg/kvm_gfs]. Expected /dev/??? to exist
May  9 15:34:33 csn2 lrmd[7888]:  notice: fs_start_0:8572:stderr [ ocf-exit-reason:Couldn't find device [/dev/clustered_vg/kvm_gfs]. Expected /dev/??? to exist ]
May  9 15:34:33 csn2 crmd[7892]:  notice: csn2-gfs-fs_start_0:31 [ ocf-exit-reason:Couldn't find device [/dev/clustered_vg/kvm_gfs]. Expected /dev/??? to exist\n ]
May  9 15:34:34 csn2 Filesystem(fs)[8654]: WARNING: Couldn't find device [/dev/clustered_vg/kvm_gfs]. Expected /dev/??? to exist
May  9 15:34:34 csn2 Filesystem(fs)[8654]: INFO: Running stop for /dev/clustered_vg/kvm_gfs on /var/lib/libvirt/images
May  9 15:34:34 csn2 lrmd[7888]:  notice: fs_stop_0:8654:stderr [ blockdev: cannot open /dev/clustered_vg/kvm_gfs: No such file or directory ]
vgdisplay

Code: Select all

# vgdisplay
  --- Volume group ---
  VG Name               clustered_vg
  System ID             
  Format                lvm2
  Metadata Areas        6
  Metadata Sequence No  3
  VG Access             read/write
  VG Status             resizable
  Clustered             yes
  Shared                no
  MAX LV                0
  Cur LV                1
  Open LV               0 (this is 1 on whatever machine has the mount)
  Max PV                0
  Cur PV                6
  Act PV                6
section of lvm.conf

Code: Select all

locking_type = 3

        # Configuration option global/wait_for_locks.
        # When disabled, fail if a lock request would block.
        wait_for_locks = 1

        # Configuration option global/fallback_to_clustered_locking.
        # Attempt to use built-in cluster locking if locking_type 2 fails.
        # If using external locking (type 2) and initialisation fails, with
        # this enabled, an attempt will be made to use the built-in clustered
        # locking. Disable this if using a customised locking_library.
        fallback_to_clustered_locking = 1

aks
Posts: 3073
Joined: 2014/09/20 11:22:14

Re: cLVM is driving me nuts

Post by aks » 2019/05/10 16:30:59

You don't say what your underlying tech is, but I'd start by looking at the SCSI layer - is it shared (MPIO/reserve_policy and friends)?
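
For example, you could poke at that layer on both nodes with something like this (just a sketch; multipath only matters if device-mapper-multipath is actually in use, and sg_persist comes from the sg3_utils package):

Code: Select all

lsscsi                                        # list the SCSI devices each node sees
multipath -ll                                 # show multipath topology, if multipathd is in use
sg_persist --in --read-keys /dev/sdb          # check for SCSI-3 persistent reservation keys on a shared disk
sg_persist --in --read-reservation /dev/sdb   # check whether a reservation is currently held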

hunter86_bg
Posts: 2019
Joined: 2015/02/17 15:14:33
Location: Bulgaria

Re: cLVM is driving me nuts

Post by hunter86_bg » 2019/05/12 15:55:57

For editing lvm.conf, use lvmconf with the clustering switch and let it enable/stop the services, as your LVM caching (lvmetad) needs to be stopped.

Note: always rebuild the initramfs after changing LVM/multipath configuration files, then reboot.
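
For reference, that would be roughly the following (a sketch; the exact lvmconf invocation is shown again further down this thread):

Code: Select all

lvmconf --enable-cluster --services --startstopservices   # set locking_type=3 and handle the clvmd/lvmetad services
dracut -f                                                  # rebuild the initramfs for the running kernel
reboot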

Check that dlm is up and running; start it manually if needed.
What are the firewall settings on each node? Keep in mind that dlm will switch to SCTP when using 2 corosync rings, and this is NOT SUPPORTED, nor is firewalld configured for it.
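
A quick sketch of what those checks could look like on each node (assuming systemd and firewalld):

Code: Select all

pgrep -a dlm_controld          # dlm_controld should be running (started here by the pacemaker controld resource)
dlm_tool status                # show DLM cluster membership
dlm_tool ls                    # list DLM lockspaces; clvmd should appear once it has joined
firewall-cmd --list-all        # the high-availability service must be allowed on both nodes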

Also, I would recommend providing all the steps you have done so far (from your history or personal notes).

jeffinto
Posts: 9
Joined: 2019/05/09 19:52:32

Re: cLVM is driving me nuts

Post by jeffinto » 2019/05/29 15:31:23

Thank you for the replies, guys. The underlying tech is a Promise J630 SAS device. It's a SAS enclosure with no RAID of its own and has 2 dual-connection SAS modules.

This is from bash_history

Step 1: Install packages & set up the firewall

Code: Select all

yum install gfs2-utils lvm2-cluster dlm pcs fence-agents-ipmilan
firewall-cmd --permanent --add-service=high-availability
firewall-cmd --reload
I also set up name resolution for both machines in the hosts file on each machine.
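
For reference, that name resolution would look something like this (a sketch; the IPs and hostnames are taken from the corosync output later in this thread):

Code: Select all

# /etc/hosts on both nodes
192.168.0.1   csn1
192.168.0.2   csn2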

Step 2: Configure PCS

Code: Select all

pcs property set no-quorum-policy=freeze
pcs property set stonith-enabled=true
pcs property show
pcs staus
pcs status
pcs resource create dlm ocf:pacemaker:controld op monitor interval=30s on-fail=fence clone interleave=true ordered=true
pcs resource create clvmd ocf:heartbeat:clvm op monitor interval=30s on-fail=fence clone interleave=true ordered=true
pcs constraint order start dlm-clone then clvmd-clone
pcs constraint colocation add clvmd-clone with dlm-clone
pcs status
pcs stonith create ilocsn1 fence_ilo3 pcmk_host_list="csn1" ipaddr="csn1" login="" passwd=""
pcs status
pcs stonith create ilocsn2 fence_ilo3 pcmk_host_list="csn2" ipaddr="csn2" login="" passwd=""
pcs status
Step 3: Set up cLVM

Code: Select all

pvcreate /dev/sdb
pvcreate /dev/sdc
pvcreate /dev/sdd
pvcreate /dev/sde
pvcreate /dev/sdh
pvcreate /dev/sdi
vgcreate -Ay -cy clustered_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdh /dev/sdi
lvcreate -i3 -m1 -L3T -n gfs clustered_vg
I checked this on the other node afterwards and was sure it was working correctly, since on that machine these drives are sd[a-g] and it picked that up on its own.

Step 4: Make the GFS2 volume

Code: Select all

mkfs.gfs2 -p lock_dlm -j 2 -t cluster_core:gfs /dev/clustered_vg/gfs
Step 5: Add the GFS volume to cluster

Code: Select all

pcs resource create fs ocf:heartbeat:Filesystem device="/dev/clustered_vg/kvm_gfs" directory="/var/lib/libvirt/images" fstype="gfs2" --clone
pcs constraint order start clvmd-clone then fs-clone
pcs constraint colocation add fs-clone with clvmd-clone
pcs status
pcs edit
pcs resource edit
pcs resource edit clvmd-clone with_cmirrord=true
pcs resource update clvmd-clone with_cmirrord=true
pcs status
Troubleshooting:
1- Edited clvmd-clone to add "cmirrord=true"
2- Changed lvm.conf to use locking_type=3
3- Performed updates, which included a new kernel, which should have rebuilt the initramfs
4- While the machines were rebooting, I realized I could scan and mount the shared LVM volume if the other machine did not have a lock (see the sketch below), so I know both machines are capable of using it.
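
For context, that manual test would look roughly like this (a sketch, not the exact history; the LV path follows Steps 3 and 4 above and the mount point is the one used by the cluster resource):

Code: Select all

vgscan                              # rescan for volume groups
vgchange -ay clustered_vg           # activate the LVs of the clustered VG on this node
mount -t gfs2 /dev/clustered_vg/gfs /var/lib/libvirt/images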

The error is with LVM locking, not GFS locking, since I cannot even activate the VG to get at the LV or partition.

Code: Select all

Error: locking on node[1-2]: volume is busy on another node
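
To see from each node whether LVM agrees with that, something like this on both nodes might help (a minimal sketch):

Code: Select all

lvscan                                          # shows each LV as ACTIVE or inactive on this node
lvs -o lv_name,vg_name,lv_attr clustered_vg     # an 'a' in the 5th lv_attr character means the LV is active here
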
Can anyone see where I botched the configuration?

aks
Posts: 3073
Joined: 2014/09/20 11:22:14

Re: cLVM is driving me nuts

Post by aks » 2019/05/30 17:11:39

Nothing is doing your I/O fencing I think.
Perhaps have a look at: https://www.server-world.info/en/note?o ... emaker&f=3
(Not verified, but from casual observation it seems okay.)

jeffinto
Posts: 9
Joined: 2019/05/09 19:52:32

Re: cLVM is driving me nuts

Post by jeffinto » 2019/05/30 20:48:19

Hi aks,

Thank you for the reply. Isn't that what stonith is for?

Code: Select all

[admin@csn2 ~]# pcs status
Cluster name: cluster_core
Stack: corosync
Current DC: csn1 (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Thu May 30 16:42:39 2019
Last change: Wed May  1 14:31:18 2019 by admin via cibadmin on csn1

2 nodes configured
8 resources configured

Online: [ csn1 csn2 ]

Full list of resources:

 Clone Set: dlm-clone [dlm]
     Started: [ csn1 csn2 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ csn1 csn2 ]
 Clone Set: fs-clone [fs]
     Started: [ csn1 ]
     Stopped: [ csn2 ]

ilocsn1	(stonith:fence_ilo3):	Started csn1
ilocsn2	(stonith:fence_ilo3):	Started csn2

Failed Actions:
* ilocsn2_start_0 on csn1 'unknown error' (1): call=49, status=Timed Out, exitreason='',
    last-rc-change='Sun May 26 03:23:37 2019', queued=0ms, exec=60303ms
* ilocsn1_monitor_60000 on csn1 'unknown error' (1): call=34, status=Timed Out, exitreason='',
    last-rc-change='Sun May 26 03:24:32 2019', queued=0ms, exec=20004ms
* fs_start_0 on csn2 'not installed' (5): call=31, status=complete, exitreason='Couldn't find device [/dev/clustered_vg/kvm-gfs]. Expected /dev/??? to exist',
    last-rc-change='Sun May 26 03:29:49 2019', queued=0ms, exec=89ms


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
I should also point out that I'm not using iSCSI underneath; it's shared SAS drives, so I don't think there is an issue at that level, since both machines can see all the drives and access them together. This was tested with a small EXT4 partition before trying GFS to make sure I wasn't just wasting my time.

aks
Posts: 3073
Joined: 2014/09/20 11:22:14

Re: cLVM is driving me nuts

Post by aks » 2019/05/30 20:56:10

Isn't that what stonith is for?
Nope. STONITH is Shoot The Other Node In The Head: it deals with things like network failures, cluster "un-awareness", or a cluster partition.
It's not I/O fencing (which essentially arbitrates SCSI calls).
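
For shared SCSI/SAS storage, the usual example of that kind of arbitration is SCSI-3 persistent reservation fencing with the fence_scsi agent. Nobody in this thread has set this up, so treat the following purely as an illustrative sketch (the device path is a placeholder):

Code: Select all

yum install fence-agents-scsi
pcs stonith create scsi-fence fence_scsi \
    pcmk_host_list="csn1 csn2" \
    devices="/dev/disk/by-id/YOUR-SHARED-DISK" \
    meta provides=unfencing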

jeffinto
Posts: 9
Joined: 2019/05/09 19:52:32

Re: cLVM is driving me nuts

Post by jeffinto » 2019/05/30 21:02:29

Alright that makes sense then.

I'm currently researching this, but I'd love to hear any advice on how to set up I/O fencing for SAS drives. They are fully managed by CentOS (an HBA card instead of RAID; the SAS enclosure is a JBOD).

hunter86_bg
Posts: 2019
Joined: 2015/02/17 15:14:33
Location: Bulgaria

Re: cLVM is driving me nuts

Post by hunter86_bg » 2019/05/30 23:52:44

Just use a small (10MB) shared disk for SBD (a.k.a. poison pill) fencing and disable HP ASR on both servers.
Also, how many corosync rings do you have?
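
If you go the SBD route, initializing the poison-pill device is roughly this (a sketch; the device path is a placeholder and the exact pcs/sbd integration steps depend on your pcs version):

Code: Select all

yum install sbd
sbd -d /dev/disk/by-id/YOUR-SMALL-SHARED-DISK create   # write the SBD metadata to the small shared disk
sbd -d /dev/disk/by-id/YOUR-SMALL-SHARED-DISK dump     # verify the header was written
# then point SBD_DEVICE in /etc/sysconfig/sbd at the same disk on every node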

Are you sure that you didn't use the 'pcs auth' command in the beginning?
For the LVM configuration, use:

Code: Select all

lvmconf --enable-cluster --services --startstopservices  
It's far less error-prone.

Warning: always test fencing from all nodes against all nodes before deploying your resources. Time synchronization is also of high importance.
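
Those last two checks could be as simple as this (a sketch; run the fence tests one at a time and expect the target node to reboot):

Code: Select all

pcs stonith fence csn2        # run on csn1: should power-cycle csn2
pcs stonith fence csn1        # run on csn2: should power-cycle csn1
timedatectl                   # confirm "NTP synchronized: yes" on both nodes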

jeffinto
Posts: 9
Joined: 2019/05/09 19:52:32

Re: cLVM is driving me nuts

Post by jeffinto » 2019/06/01 02:50:06

I only seem to have the one ring

Code: Select all

# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
	id	= 192.168.0.1
	status	= ring 0 active with no faults

# pcs status corosync

Membership information
----------------------
    Nodeid      Votes Name
         1          1 csn1 (local)
         2          1 csn2

# corosync-cmapctl |grep members
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(192.168.0.1) 
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(192.168.0.2) 
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
I don't see it in the history, but I am pretty sure I did use pcs auth to get things started.

I looked up SBD fencing, and every tutorial I can find shows using it together with STONITH. I already have iLO3 STONITH I/O fencing set up. Can you please explain why I need the second I/O fence?

The times are synced and I've verified that either node can kill the other.

The lvmconf command did result in output on both machines.

csn1

Code: Select all

Removed symlink /etc/systemd/system/sysinit.target.wants/lvm2-lvmetad.socket.
csn2

Code: Select all

Removed symlink /etc/systemd/system/sysinit.target.wants/lvm2-lvmetad.socket.
Removed symlink /etc/systemd/system/sockets.target.wants/lvm2-lvmetad.socket.
However, it didn't change anything even after a reboot, except the mount is now on the other host.
