Steps for Setting Up and Connecting the Cluster Hardware

After installing Red Hat Linux, set up the cluster hardware components and verify the installation to ensure that the cluster systems recognize all the connected devices. Note that the exact steps for setting up the hardware depend on the type of configuration. See the Section called Choosing a Hardware Configuration for more information about cluster configurations.

To set up the cluster hardware, follow these steps:

  1. Shut down the cluster systems and disconnect them from their power source.

  2. Set up the point-to-point Ethernet and serial heartbeat channels, if applicable. See the Section called Configuring Heartbeat Channels for more information about performing this task.

  3. When using power switches, set up the devices and connect each cluster system to a power switch. See the Section called Configuring Power Switches for more information about performing this task.

    In addition, it is recommended to connect each power switch (or each cluster system's power cord if not using power switches) to a different UPS system. See the Section called Configuring UPS Systems for information about using optional UPS systems.

  4. Set up the shared disk storage according to the vendor instructions and connect the cluster systems to the external storage enclosure. See the Section called Configuring Shared Disk Storage for more information about performing this task.

    In addition, it is recommended to connect the storage enclosure to redundant UPS systems. See the Section called Configuring UPS Systems for more information about using optional UPS systems.

  5. Turn on power to the hardware, and boot each cluster system. During the boot-up process, enter the BIOS utility to modify the system setup, as follows:

  6. Exit from the BIOS utility, and continue to boot each system. Examine the startup messages to verify that the Linux kernel has been configured and can recognize the full set of shared disks. Use the dmesg command to display console startup messages. See the Section called Displaying Console Startup Messages for more information about using this command.

  7. Verify that the cluster systems can communicate over each point-to-point Ethernet heartbeat connection by using the ping command to send packets over each network interface.

  8. Set up the quorum disk partitions on the shared disk storage. See the Section called Configuring Quorum Partitions for more information about performing this task.
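The checks in steps 6 and 7 can be scripted roughly as follows. This is only a sketch: the helper names, the example addresses, and the "scsi disk" search pattern are assumptions made for illustration, and the wording of kernel startup messages varies between kernel versions and drivers.

```shell
#!/bin/sh
# Sketch only: helper names, addresses, and the grep pattern are
# illustrative assumptions, not part of the cluster software.

# PING can be overridden (for example, for a dry run).
PING=${PING:-ping}

# Read kernel startup messages on stdin and print the lines that
# mention SCSI disks. The pattern is an assumption; adjust it to match
# the messages that the local kernel and drivers actually print.
list_scsi_disk_messages() {
    grep -i 'scsi disk'
}

# Send three packets to each heartbeat address given as an argument.
# Prints one result line per address; returns non-zero if any failed.
check_heartbeat_links() {
    failed=0
    for addr in "$@"; do
        if "$PING" -c 3 "$addr" >/dev/null 2>&1; then
            echo "$addr reachable"
        else
            echo "$addr NOT reachable"
            failed=1
        fi
    done
    return "$failed"
}

# Example usage (hypothetical addresses on the point-to-point links):
#   dmesg | list_scsi_disk_messages
#   check_heartbeat_links 10.0.0.2 10.0.1.2
```

Run the disk check on each cluster system and compare the results against the full set of shared disks that is expected.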

Configuring Heartbeat Channels

The cluster uses heartbeat channels as a policy input during failover of the cluster systems. For example, if a cluster system stops updating its timestamp on the quorum partitions, the other cluster system will check the status of the heartbeat channels to determine if additional time should be allotted prior to initiating a failover.

A cluster must include at least one heartbeat channel. It is possible to use an Ethernet connection for both client access and a heartbeat channel. However, it is recommended to set up additional heartbeat channels for high availability, using redundant Ethernet heartbeat channels, in addition to one or more serial heartbeat channels.

For example, if using both an Ethernet and a serial heartbeat channel, and the cable for the Ethernet channel is disconnected, the cluster systems can still check status through the serial heartbeat channel.

To set up a redundant Ethernet heartbeat channel, use a network crossover cable to connect a network interface on one cluster system to a network interface on the other cluster system.

To set up a serial heartbeat channel, use a null modem cable to connect a serial port on one cluster system to a serial port on the other cluster system. Be sure to connect corresponding serial ports on the cluster systems; do not connect to the serial port that will be used for a remote power switch connection. If support for more than two cluster members is added in the future, the use of serial-based heartbeat channels may be deprecated.

Configuring Power Switches

Power switches enable a cluster system to power-cycle the other cluster system before restarting its services as part of the failover process. The ability to remotely disable a system ensures data integrity is maintained under any failure condition. It is recommended that production environments use power switches or watchdog timers in the cluster configuration. Only development (test) environments should use a configuration without power switches (type "None"). Refer to the Section called Choosing the Type of Power Controller for a description of the various types of power switches. Note that within this section, the general term "power switch" also includes watchdog timers.

In a cluster configuration that uses physical power switches, each cluster system's power cable is connected to a power switch, and each cluster system is connected to the switch that powers the other system through either a serial or network connection (depending on switch type). When failover occurs, a cluster system can use this connection to power-cycle the other cluster system before restarting its services.

Power switches protect against data corruption if an unresponsive (or hanging) system becomes responsive after its services have failed over, and issues I/O to a disk that is also receiving I/O from the other cluster system. In addition, if a quorum daemon fails on a cluster system, the system is no longer able to monitor the quorum partitions. If power switches or watchdog timers are not used in the cluster, then this error condition may result in services being run on more than one cluster system, which can cause data corruption and possibly system crashes.

It is strongly recommended to use power switches in a cluster. However, administrators who are aware of the risks may choose to set up a cluster without power switches.

A cluster system may hang for a few seconds if it is swapping or has a high system workload. For this reason, adequate time is allowed prior to concluding another system has failed (typically 12 seconds).

A cluster system may "hang" indefinitely because of a hardware failure or kernel error. In this case, the other cluster system will notice that the hung system is not updating its timestamp on the quorum partitions, and is not responding to pings over the heartbeat channels.

If a cluster system determines that a hung system is down, and power switches are used in the cluster, the cluster system will power-cycle the hung system before restarting its services. Clusters configured to use watchdog timers will self-reboot under most system hangs. This will cause the hung system to reboot in a clean state, and prevent it from issuing I/O and corrupting service data.

If power switches are not used in the cluster, and a cluster system determines that a hung system is down, it will set the status of the failed system to DOWN on the quorum partitions, and then restart the hung system's services. If the hung system becomes responsive, it will notice that its status is DOWN, and initiate a system reboot. This will minimize the time that both cluster systems may be able to issue I/O to the same disk, but it does not provide the data integrity guarantee of power switches. If the hung system never becomes responsive and no power switches are in use, then a manual reboot is required.
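The decision sequence described above can be summarized in a short shell sketch. It models only the order of actions; the function name and its yes/no arguments are hypothetical, and the cluster software makes these decisions internally.

```shell
#!/bin/sh
# Sketch of the failover decision order described in the text.
# failover_plan and its arguments are hypothetical names; this prints
# the actions a healthy cluster system would take and performs none.
#
#   $1  yes if the peer has stopped updating its quorum timestamp
#   $2  yes if the peer still responds over a heartbeat channel
#   $3  yes if power switches are configured
failover_plan() {
    timestamp_stale=$1
    heartbeat_alive=$2
    power_switch=$3

    if [ "$timestamp_stale" != yes ]; then
        echo "peer is up: no action"
        return 0
    fi
    if [ "$heartbeat_alive" = yes ]; then
        # Heartbeats are a policy input: allow more time before failover.
        echo "allow additional time"
        return 0
    fi
    if [ "$power_switch" = yes ]; then
        echo "power-cycle peer"
    else
        echo "set peer status to DOWN on quorum partitions"
    fi
    echo "restart peer services"
}
```

For example, failover_plan yes no no lists the two actions taken when a hung system is declared down in a cluster without power switches. In that case nothing forces the hung system to stop issuing I/O, which is why the configuration is recommended only for development environments.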

When used, power switches must be set up according to the vendor instructions. However, some cluster-specific tasks may be required to use a power switch in the cluster. See the Section called Setting Up Power Switches in Appendix A for detailed information on power switches (including information about watchdog timers). Be sure to take note of any caveats or functional attributes of specific power switch types. Note that the cluster-specific information provided in this document supersedes the vendor information.

When cabling power switches, take special care to ensure that each cable is plugged into the appropriate outlet. This is crucial because there is no independent means for the software to verify correct cabling. Incorrect cabling can cause the wrong system to be power-cycled, or cause one system to conclude incorrectly that it has successfully power-cycled another cluster member.

After setting up the power switches, perform these tasks to connect them to the cluster systems:

  1. Connect the power cable for each cluster system to a power switch.

  2. On each cluster system, connect a serial port to the serial port on the power switch that provides power to the other cluster system. The cable used for the connection depends on the type of power switch. For example, an RPS-10 power switch uses null modem cables, while a network-attached power switch requires a network cable.

  3. Connect the power cable for each power switch to a power source. It is recommended to connect each power switch to a different UPS system. See the Section called Configuring UPS Systems for more information.

After the installation of the cluster software, test the power switches to ensure that each cluster system can power-cycle the other system before starting the cluster. See the Section called Testing the Power Switches in Chapter 3 for information.

Configuring UPS Systems

Uninterruptible power supply (UPS) systems provide a highly available source of power. Ideally, use a redundant solution that incorporates multiple UPSs (one per server). For maximal fault tolerance, it is possible to incorporate two UPSs per server, as well as APC's Automatic Transfer Switches, to manage the power and shutdown of the server. The choice between these solutions depends solely on the level of availability desired.

It is not recommended to use a large UPS infrastructure as the sole source of power for the cluster. A UPS solution dedicated to the cluster itself allows for more flexibility in terms of manageability and availability.

A complete UPS system must be able to provide adequate voltage and current for a prolonged period of time. While there is no single UPS to fit every power requirement, a solution can be tailored to fit a particular configuration. Visit APC's UPS configurator at http://www.apcc.com/template/size/apc to find the correct UPS configuration for your server. The APC Smart-UPS product line ships with software management for Red Hat Linux. The name of the RPM package is pbeagent.

If the cluster disk storage subsystem has two power supplies with separate power cords, set up two UPS systems, and connect one power switch (or one cluster system's power cord if not using power switches) and one of the storage subsystem's power cords to each UPS system. A redundant UPS system configuration is shown in Figure 2-3.

Figure 2-3. Redundant UPS System Configuration

An alternative redundant power configuration is to connect both power switches (or both cluster systems' power cords) and the disk storage subsystem to the same UPS system. This is the most cost-effective configuration, and provides some protection against power failure. However, if a power outage occurs, the single UPS system becomes a possible single point of failure. In addition, one UPS system may not be able to provide enough power to all the attached devices for an adequate amount of time. A single UPS system configuration is shown in Figure 2-4.

Figure 2-4. Single UPS System Configuration

Many vendor-supplied UPS systems include Linux applications that monitor the operational status of the UPS system through a serial port connection. If the battery power is low, the monitoring software will initiate a clean system shutdown. As this occurs, the cluster software will be properly stopped, because it is controlled by a System V run level script (for example, /etc/rc.d/init.d/cluster).

See the UPS documentation supplied by the vendor for detailed installation information.

Configuring Shared Disk Storage

In a cluster, shared disk storage is used to hold service data and two quorum partitions. Because this storage must be available to both cluster systems, it cannot be located on disks that depend on the availability of any one system. See the vendor documentation for detailed product and installation information.

There are some factors to consider when setting up shared disk storage in a cluster:

The following list details the parallel SCSI requirements, which must be adhered to when parallel SCSI buses are employed in a cluster environment:

See the Section called SCSI Bus Configuration Requirements in Appendix A for more information.

In addition, it is strongly recommended to connect the storage enclosure to redundant UPS systems for a highly-available source of power. See the Section called Configuring UPS Systems for more information.

See the Section called Setting Up a Single-Initiator SCSI Bus and the Section called Setting Up a Fibre Channel Interconnect for more information about configuring shared storage.

After setting up the shared disk storage hardware, partition the disks and then either create file systems or raw devices on the partitions. Two raw devices must be created for the primary and the backup quorum partitions. See the Section called Configuring Quorum Partitions, the Section called Partitioning Disks, the Section called Creating Raw Devices, and the Section called Creating File Systems for more information.

Setting Up a Single-Initiator SCSI Bus

A single-initiator SCSI bus has only one cluster system connected to it, and provides host isolation and better performance than a multi-initiator bus. Single-initiator buses ensure that each cluster system is protected from disruptions due to the workload, initialization, or repair of the other cluster system.

When using a single or dual-controller RAID array that has multiple host ports and provides simultaneous access to all the shared logical units from the host ports on the storage enclosure, it is possible to set up two single-initiator SCSI buses to connect each cluster system to the RAID array. If a logical unit can fail over from one controller to the other, the process must be transparent to the operating system. Note that some RAID controllers restrict a set of disks to a specific controller or port. In this case, single-initiator bus setups are not possible.

A single-initiator bus must adhere to the requirements described in the Section called SCSI Bus Configuration Requirements in Appendix A. In addition, see the Section called Host Bus Adapter Features and Configuration Requirements in Appendix A for detailed information about terminating host bus adapters and configuring a single-initiator bus.

To set up a single-initiator SCSI bus configuration, the following is required:

  • Enable the on-board termination for each host bus adapter.

  • Enable the termination for each RAID controller.

  • Use the appropriate SCSI cable to connect each host bus adapter to the storage enclosure.

Setting host bus adapter termination is usually done in the adapter BIOS utility during system boot. To set RAID controller termination, refer to the vendor documentation. Figure 2-5 shows a configuration that uses two single-initiator SCSI buses.

Figure 2-5. Single-Initiator SCSI Bus Configuration

Figure 2-6 shows the termination in a single-controller RAID array connected to two single-initiator SCSI buses.

Figure 2-6. Single-Controller RAID Array Connected to Single-Initiator SCSI Buses

Figure 2-7. Dual-Controller RAID Array Connected to Single-Initiator SCSI Buses

Setting Up a Fibre Channel Interconnect

Fibre Channel can be used in either single-initiator or multi-initiator configurations.

A single-initiator Fibre Channel interconnect has only one cluster system connected to it. This may provide better host isolation and better performance than a multi-initiator bus. Single-initiator interconnects ensure that each cluster system is protected from disruptions due to the workload, initialization, or repair of the other cluster system.

If employing a RAID array that has multiple host ports, and the RAID array provides simultaneous access to all the shared logical units from the host ports on the storage enclosure, set up two single-initiator Fibre Channel interconnects to connect each cluster system to the RAID array. If a logical unit can fail over from one controller to the other, the process must be transparent to the operating system.

Figure 2-8 shows a single-controller RAID array with two host ports, and the host bus adapters connected directly to the RAID controller, without using Fibre Channel hubs or switches.

Figure 2-8. Single-Controller RAID Array Connected to Single-Initiator Fibre Channel Interconnects

Figure 2-9. Dual-Controller RAID Array Connected to Single-Initiator Fibre Channel Interconnects

If a dual-controller RAID array with two host ports on each controller is used, a Fibre Channel hub or switch is required to connect each host bus adapter to one port on both controllers, as shown in Figure 2-9.

If a multi-initiator Fibre Channel interconnect is used, then a Fibre Channel hub or switch is required. In this case, each host bus adapter is connected to the hub or switch, and the hub or switch is connected to a host port on each RAID controller.

Configuring Quorum Partitions

Two raw devices on shared disk storage must be created for the primary quorum partition and the backup quorum partition. Each quorum partition must have a minimum size of 10 MB. The amount of data in a quorum partition is constant; it does not increase or decrease over time.

The quorum partitions are used to hold cluster state information. Periodically, each cluster system writes its status (either UP or DOWN), a timestamp, and the state of its services. In addition, the quorum partitions contain a version of the cluster database. This ensures that each cluster system has a common view of the cluster configuration.

To monitor cluster health, the cluster systems periodically read state information from the primary quorum partition and determine if it is up to date. If the primary partition is corrupted, the cluster systems read the information from the backup quorum partition and simultaneously repair the primary partition. Data consistency is maintained through checksums and any inconsistencies between the partitions are automatically corrected.

If a system is unable to write to both quorum partitions at startup time, it will not be allowed to join the cluster. In addition, if an active cluster system can no longer write to both quorum partitions, the system will remove itself from the cluster by rebooting (and may be remotely power cycled by the healthy cluster member).

The following are quorum partition requirements:

  • Both quorum partitions must have a minimum size of 10 MB.

  • Quorum partitions must be raw devices. They cannot contain file systems.

  • Quorum partitions can be used only for cluster state and configuration information.
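The size requirement can be checked against the kernel's partition list. The following is a minimal sketch, assuming the usual /proc/partitions layout in which the first four columns are major, minor, #blocks (in 1 KB units), and name; quorum_size_ok is a hypothetical helper and the partition names are examples only.

```shell
#!/bin/sh
# Sketch only: quorum_size_ok is a hypothetical helper.
# Reads /proc/partitions-style lines (major minor #blocks name) on
# stdin, where #blocks is in 1 KB units, and reports whether the named
# partition holds at least 10 MB (10240 blocks).
quorum_size_ok() {
    awk -v part="$1" '
        $4 == part { found = 1; blocks = $3 }
        END {
            if (!found) {
                print part ": not found"
                exit 1
            }
            if (blocks + 0 < 10240) {
                print part ": too small (" blocks " KB)"
                exit 1
            }
            print part ": ok (" blocks " KB)"
        }'
}

# Example usage (hypothetical partition name):
#   quorum_size_ok sdb1 < /proc/partitions
```

The helper checks only the size; it cannot confirm that a partition is a raw device or that it is reserved for cluster state.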

The following are recommended guidelines for configuring the quorum partitions:

  • It is strongly recommended to set up a RAID subsystem for shared storage, and use RAID 1 (mirroring) to make the logical unit that contains the quorum partitions highly available. Optionally, parity RAID can be used for high availability. Do not use RAID 0 (striping) alone for quorum partitions.

  • Place both quorum partitions on the same RAID set, or on the same disk if RAID is not employed, because both quorum partitions must be available in order for the cluster to run.

  • Do not put the quorum partitions on a disk that contains heavily-accessed service data. If possible, locate the quorum partitions on disks that contain service data that is rarely accessed.

See the Section called Partitioning Disks and the Section called Creating Raw Devices for more information about setting up the quorum partitions.

See the Section called Editing the rawdevices File in Chapter 3 for information about editing the rawdevices file to bind the raw character devices to the block devices each time the cluster systems boot.

Partitioning Disks

After shared disk storage hardware has been set up, partition the disks so they can be used in the cluster. Then, create file systems or raw devices on the partitions. For example, two raw devices must be created for the quorum partitions using the guidelines described in the Section called Configuring Quorum Partitions.

Invoke the interactive fdisk command to modify a disk partition table and divide the disk into partitions. While in fdisk, use the p command to display the current partition table and the n command to create new partitions.

The following example shows how to use the fdisk command to partition a disk:

  1. Invoke the interactive fdisk command, specifying an available shared disk device. At the prompt, specify the p command to display the current partition table.

    # fdisk /dev/sde
    Command (m for help): p
    
    Disk /dev/sde: 255 heads, 63 sectors, 2213 cylinders 
    Units = cylinders of 16065 * 512 bytes 
    
    Device    Boot    Start       End    Blocks   Id  System
    /dev/sde1             1       262   2104483+  83  Linux 
    /dev/sde2           263       288    208845   83  Linux

  2. Determine the number of the next available partition, and specify the n command to add the partition. If there are already three partitions on the disk, then specify e for extended partition or p to create a primary partition.

    Command (m for help): n
    Command action 
       e   extended 
       p   primary partition (1-4)

  3. Specify the partition number required:

    Partition number (1-4): 3

  4. Press the [Enter] key or specify the next available cylinder:

    First cylinder (289-2213, default 289): 289

  5. Specify the partition size that is required:

    Last cylinder or +size or +sizeM or +sizeK (289-2213, default 2213): +2000M

    Note that large partitions will increase the cluster service failover time if a file system on the partition must be checked with fsck. Quorum partitions must be at least 10 MB.

  6. Specify the w command to write the new partition table to disk:

    Command (m for help): w
    The partition table has been altered! 
    
    Calling ioctl() to re-read partition table. 
    
    WARNING: If you have created or modified any DOS 6.x 
    partitions, please see the fdisk manual page for additional 
    information. 
    
    Syncing disks.

  7. If a partition was added while both cluster systems are powered on and connected to the shared storage, reboot the other cluster system in order for it to recognize the new partition.
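When choosing a size in step 5, it can help to estimate how many cylinders a partition will occupy. The sketch below uses the geometry from the fdisk output in this example (255 heads, 63 sectors, 512-byte sectors, so 16065 * 512 = 8225280 bytes per cylinder). The helper name is hypothetical, 1 MB is taken as 1048576 bytes, and fdisk's own rounding may differ slightly.

```shell
#!/bin/sh
# Sketch only: cylinders_for_mb is a hypothetical helper that estimates
# how many cylinders a partition of a given size occupies, using the
# geometry printed by fdisk in this section's example.
HEADS=255
SECTORS=63            # sectors per track
SECTOR_BYTES=512

cylinders_for_mb() {
    cyl_bytes=$((HEADS * SECTORS * SECTOR_BYTES))   # 8225280 bytes
    want_bytes=$(($1 * 1024 * 1024))
    # Round up to a whole number of cylinders.
    echo $(( (want_bytes + cyl_bytes - 1) / cyl_bytes ))
}

cylinders_for_mb 10      # minimum quorum partition size; prints 2
cylinders_for_mb 2000    # size requested in step 5; prints 255
```

The same arithmetic explains the Blocks column in the partition table: /dev/sde2 spans cylinders 263 through 288, which is 26 cylinders, or 26 * 8225280 / 1024 = 208845 blocks of 1 KB.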

After partitioning a disk, format the partition for use in the cluster. For example, create file systems or raw devices for quorum partitions.

See the Section called Creating Raw Devices and the Section called Creating File Systems for more information.

For basic information on partitioning hard disks at installation time, see The Official Red Hat Linux x86 Installation Guide. Appendix E. An Introduction to Disk Partitions of The Official Red Hat Linux x86 Installation Guide also explains the basic concepts of partitioning.

For basic information on partitioning disks using fdisk, refer to the following URL: http://kb.redhat.com/view.php?eid=175.

Creating Raw Devices

After partitioning the shared storage disks, create raw devices on the partitions. File systems reside on block devices (for example, /dev/sda1), which cache recently used data in memory in order to improve performance. Raw devices do not utilize system memory for caching. See the Section called Creating File Systems for more information.

Linux supports raw character devices that are not hard-coded against specific block devices. Instead, Linux uses a character major number (currently 162) to implement a series of unbound raw devices in the /dev/raw directory. Any block device can have a character raw device front-end, even if the block device is loaded later at runtime.

To create a raw device, edit the /etc/sysconfig/rawdevices file to bind a raw character device to the appropriate block device. Once bound to a block device, a raw device can be opened, read, and written.

Quorum partitions and some database applications require raw devices, because these applications perform their own buffer caching for performance purposes. Quorum partitions cannot contain file systems because, if state data were cached in system memory, the cluster systems would not have a consistent view of the state data.

Raw character devices must be bound to block devices each time a system boots. To ensure that this occurs, edit the /etc/sysconfig/rawdevices file and specify the quorum partition bindings. If using a raw device in a cluster service, use this file to bind the devices at boot time. See the Section called Editing the rawdevices File in Chapter 3 for more information.

After editing /etc/sysconfig/rawdevices, the changes take effect either after a reboot or after executing the following command:

# service rawdevices restart

Query all the raw devices by using the command raw -aq:

# raw -aq 
/dev/raw/raw1   bound to major 8, minor 17 
/dev/raw/raw2   bound to major 8, minor 18 
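The raw -aq output identifies each binding by device number, not by name. As a rough aid, the sketch below translates a minor number under block major 8 into a device name. It assumes the conventional Linux numbering in which major 8 covers the first 16 SCSI disks and each disk owns 16 consecutive minor numbers; sd_name is a hypothetical helper, valid only for minors 0 through 255.

```shell
#!/bin/sh
# Sketch only: sd_name is a hypothetical helper. It relies on the
# conventional Linux numbering for block major 8 (the first 16 SCSI
# disks): each disk owns 16 consecutive minor numbers, with minor
# N*16 for the whole disk and the following 15 for its partitions.
sd_name() {
    minor=$1
    disk=$((minor / 16))
    part=$((minor % 16))
    letter=$(printf '%s' abcdefghijklmnop | cut -c "$((disk + 1))")
    if [ "$part" -eq 0 ]; then
        echo "/dev/sd$letter"
    else
        echo "/dev/sd$letter$part"
    fi
}

sd_name 17    # prints /dev/sdb1
sd_name 18    # prints /dev/sdb2
```

Under that assumption, the two bindings in the example output correspond to /dev/sdb1 and /dev/sdb2.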

Note that, for raw devices, there is no cache coherency between the raw device and the block device. In addition, requests must be 512-byte aligned both in memory and on disk. For example, the standard dd command cannot be used with raw devices because the memory buffer that the command passes to the write system call is not aligned on a 512-byte boundary.

For more information on using the raw command, refer to the raw(8) manual page.

Creating File Systems

Use the mkfs command to create an ext2 file system on a partition. Specify the drive letter and the partition number. For example:

# mkfs -t ext2 -b 4096 /dev/sde3

For optimal performance of shared file systems, a 4 KB block size was specified in the above example. Note that it is necessary in most cases to specify a 4 KB block size when creating a file system, since many of the mkfs file system build utilities default to a 1 KB block size, which can cause long fsck times.

Similarly, to create an ext3 file system, use the following command. The -j option creates the journal that distinguishes ext3 from ext2:

# mkfs -t ext2 -j -b 4096 /dev/sde3 

For more information on creating filesystems, refer to the mkfs(8) manual page.
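To confirm the block size of an existing file system, the superblock can be inspected. The following is a minimal sketch, assuming the "Block size:" field as printed by dumpe2fs -h or tune2fs -l; fs_block_size is a hypothetical helper, and /dev/sde3 is the partition from the example above.

```shell
#!/bin/sh
# Sketch only: fs_block_size is a hypothetical helper that extracts the
# "Block size:" field from superblock information read on stdin, in the
# format printed by "dumpe2fs -h" or "tune2fs -l".
fs_block_size() {
    awk -F: '/^Block size/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Example usage (requires root and an existing ext2 or ext3 file system):
#   dumpe2fs -h /dev/sde3 2>/dev/null | fs_block_size
```

A value of 4096 indicates that the 4 KB block size was applied.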