[SOLVED] Problem with large mirrored volume in cluster with CLVM and cmirror

jdito · Post by **jdito** » 2010/06/29 17:48:18

Hi, I'm having an issue getting a mirrored volume working in a cluster. (If there's a better subforum to put this in, please let me know.) I'll put all relevant config files at the end of the post for readability.

I'm running a 4-node cluster that's going to be used as a webhosting platform in the future. The backing storage is a pair of AC&NC JetStor 516 iS units, exposing iSCSI mounts to the servers. I'm able to create a mirrored volume of up to around 2 TB with no issues:

[code]
[root@newt ~]# pvcreate /dev/mapper/jetstor0[12]
Physical volume "/dev/mapper/jetstor01" successfully created
Physical volume "/dev/mapper/jetstor02" successfully created
[root@newt ~]# vgcreate VolGroup01 /dev/mapper/jetstor0[12]
Clustered volume group "VolGroup01" successfully created
[root@newt ~]# lvcreate -n hosting_mirror -l 500000 -m1 --corelog --nosync VolGroup01 /dev/mapper/jetstor0[12]
WARNING: New mirror won't be synchronised. Don't read what you didn't write!
Logical volume "hosting_mirror" created
[root@newt ~]#
[/code]

I'm then able to use the mirrored volume normally. However, larger volumes fail with the following error:

[code]
[root@newt ~]# lvcreate -n hosting_mirror -l 550000 -m1 --corelog --nosync VolGroup01 /dev/mapper/jetstor0[12]
WARNING: New mirror won't be synchronised. Don't read what you didn't write!
Error locking on node puma: Command timed out
Error locking on node frog: Command timed out
Error locking on node newt: Command timed out
Aborting. Failed to activate new LV to wipe the start of it.
Error locking on node puma: Command timed out
Error locking on node frog: Command timed out
Error locking on node newt: Command timed out
Unable to deactivate failed new LV. Manual intervention required.
[/code]

Not sure where the exact cutoff is, but it's between 500000 and 550000 4 MB PEs, so between 2.0 and 2.2 TB.
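The 2.0 to 2.2 TB figures treat 1 TB as 1,000,000 MiB. As a rough check in binary units, the sketch below converts an extent count to TiB. The helper name `extents_to_tib` is made up for illustration; the extent counts and the 4 MiB PE size are the ones from the commands above.

```shell
#!/bin/sh
# Hypothetical helper: convert an LVM extent count to TiB (two decimals),
# given the physical extent (PE) size in MiB.
extents_to_tib() {
    extents=$1
    pe_mib=$2
    # extents * PE size gives MiB; dividing by 1024 twice gives TiB.
    awk -v e="$extents" -v p="$pe_mib" 'BEGIN { printf "%.2f\n", e * p / 1024 / 1024 }'
}

extents_to_tib 500000 4   # the request that succeeded
extents_to_tib 550000 4   # the request that timed out
```

So the failure threshold sits somewhere between about 1.91 TiB and 2.10 TiB.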

Cluster hardware: one Dell 2650 and three Dell 2850s. The 2850s are 64-bit; the 2650 is 32-bit. I tried removing the 32-bit machine from the cluster and still had the same problem with the remaining machines. They're all running CentOS 5.5, fully updated via yum.

There's no problem creating non-mirrored volumes of any size. I attempted to create a non-mirrored volume and convert it to a mirrored volume, but got the same error. I created a single-node cluster and was able to create a mirrored volume of any size, so it appears to be a problem with communication between the nodes. (I also tried to activate the mirror I created on the single-node cluster on the main cluster, but that didn't work.) I checked lvm.conf, but there didn't appear to be any timeout settings that could be modified.

Is this a known issue with CLVM? Am I doing something dumb? Any input would be greatly appreciated. Let me know if there's any other information that would be helpful.

SANITIZED CONFIG FILE DUMP FOLLOWS

/etc/cluster/cluster.conf:
[code]
[root@newt ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="7" name="webhosting">
    <fence_daemon post_fail_delay="0" post_join_delay="90"/>
    <clusternodes>
        <clusternode name="newt" votes="1" nodeid="1">
            <fence>
                <method name="1">
                    <device name="fence01" option="off" port="2"/>
                    <device name="fence02" option="off" port="2"/>
                    <device name="fence01" option="on" port="2"/>
                    <device name="fence02" option="on" port="2"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="frog" votes="1" nodeid="2">
            <fence>
                <method name="1">
                    <device name="fence01" option="off" port="6"/>
                    <device name="fence02" option="off" port="6"/>
                    <device name="fence01" option="on" port="6"/>
                    <device name="fence02" option="on" port="6"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="toad" votes="1" nodeid="3">
            <fence>
                <method name="1">
                    <device name="fence01" option="off" port="7"/>
                    <device name="fence02" option="off" port="7"/>
                    <device name="fence01" option="on" port="7"/>
                    <device name="fence02" option="on" port="7"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="puma" votes="1" nodeid="4">
            <fence>
                <method name="1">
                    <device name="fence01" option="off" port="3"/>
                    <device name="fence02" option="off" port="3"/>
                    <device name="fence01" option="on" port="3"/>
                    <device name="fence02" option="on" port="3"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman cluster_id="10161"/>
    <fencedevices>
        <fencedevice agent="fence_apc" name="fence01" ipaddr=""
                     login="" passwd=""/>
        <fencedevice agent="fence_apc" name="fence02" ipaddr=""
                     login="" passwd=""/>
    </fencedevices>
    <rm>
        <failoverdomains/>
        <resources/>
    </rm>
</cluster>
[/code]

/etc/hosts:
[code]
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6

# Cluster members
*.*.*.66 newt.* newt
*.*.*.67 puma.* puma
*.*.*.71 frog.* frog
*.*.*.72 toad.* toad
[/code]

lvm.conf is all defaults, except it's switched to clustered locking.

jdito · Post by **jdito** » 2010/06/29 21:41:47

I also have some strace files from working and non-working runs, but they're about 1800 lines each. Diff is about 700 lines. If anybody would like to see any of the files, let me know.

jdito · Post by **jdito** » 2010/06/30 17:16:45

Just tried bumping up the PE size for the volume group to 64M, no difference. So it's definitely a size issue, and not a problem with the number of PEs being used.
[Edit] Also, I made a mistake in my first post. The PE size was 4M, not 4K. [/Edit]

AlanBartlett · Post by **AlanBartlett** » 2010/06/30 17:54:45

[quote]
[Edit] Also, I made a mistake in my first post. The PE size was 4M, not 4K. [/Edit]
[/quote]
Moderator edited so that the line in post #1 now reads --

[code]
Not sure where the exact cutoff is, but it's between 500000 and 550000 4 MB PEs, so between 2.0 and 2.2 TB.
[/code]

jdito · Post by **jdito** » 2010/06/30 19:16:57

Solved! Turned out to be a problem with the mirror region size, as illustrated here: https://bugzilla.redhat.com/show_bug.cgi?id=514814
Somehow managed to search Google without finding that for about 2 weeks.
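The thread doesn't record the exact command that worked, so the following is only a sketch of what a region-size fix could look like. It assumes the rule of thumb I recall from Red Hat's LVM administration guide for large clustered mirrors: take the mirror size in terabytes, round it up to the next power of 2, and pass that to `lvcreate` as `-R` (`--regionsize`, in MiB). The helper name `suggest_region_mib` is hypothetical, and the script only prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: suggest a mirror region size (in MiB) for a mirror of
# the given size in MiB, using the assumed "size in TiB rounded up to the
# next power of 2" rule of thumb.
suggest_region_mib() {
    size_mib=$1
    # Round the size up to a whole number of TiB (1 TiB = 1048576 MiB).
    tib=$(( (size_mib + 1048575) / 1048576 ))
    # Find the smallest power of 2 that is >= that number.
    r=1
    while [ "$r" -lt "$tib" ]; do
        r=$(( r * 2 ))
    done
    echo "$r"
}

# The failing request from the first post: 550000 extents of 4 MiB each.
extents=550000
size_mib=$(( extents * 4 ))
region=$(suggest_region_mib "$size_mib")

# Print the candidate command for review; nothing is executed here.
echo "lvcreate -n hosting_mirror -l $extents -m1 --corelog --nosync -R $region VolGroup01 /dev/mapper/jetstor0[12]"
```

For the 550000-extent request this suggests `-R 4`. Check the value against the bug report and the documentation for your release before relying on it.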

Post by **toracat** » 2010/06/30 19:28:19

Thank you for reporting back with the solution. This thread is now marked SOLVED.

jdito · Post by **jdito** » 2010/06/30 19:43:56

Hopefully it'll save someone else from pounding their head against this for as long as I did :-)