HIGAKNOWIT.COM
23 Aug 2011

GFS2 clustered file system over iSCSI SAN

1. ABSTRACT

The need for multiple nodes to access a file system simultaneously for read/write operations has become a requirement in many IT environments. There are different scenarios for such implementations, but the one that made us research the topic was the need to give multiple nodes simultaneous read/write access to a file repository that resides on a SAN system. In this article we review how Red Hat's GFS2 clustered file system, built on top of an iSCSI SAN, can be used to meet this requirement.

2. BACKGROUND

The building blocks that compose this solution are:

  • the SAN storage system, which in our scenario uses the iSCSI protocol
  • a 10GbE network, on top of which we run the SAN
  • Red Hat Clustering Suite, which lets us build the GFS2 file system with clustering support
  • the Linux sg3_utils package, which provides the tools for a fencing mechanism built on top of SCSI-3 persistent reservations
  • CentOS Linux, the operating system on top of which the solution is built

For further reading, please visit the following pages:

GFS2 Overview - Wikipedia
Red Hat GFS2 Product Page
Red Hat RHEL 5 Cluster Suite Overview
Linux sg3_utils package

3. ENVIRONMENT

Below we'll review in more detail the diagram of the solution, the hardware, the software and its configuration files, and finally we'll explain step by step how it was all brought together.

3.1. DIAGRAM

GFS2 Logical Diagram

The logical diagram above shows how the nodes participating in the cluster are connected. There are 2 subnets: the Production Subnet, which serves the connections from clients, and the SAN Subnet, which connects the cluster nodes to the SAN storage system. The addressing scheme we're using is also shown in the diagram.

Please note that the SAN Subnet runs on top of a 10Gb Ethernet network, while the Production Subnet runs on top of a 1Gb Ethernet network.

3.2. HARDWARE

The hardware used for this example is the following:

Cluster nodes

4x virtual machines, which for this example run on top of a VMware vSphere cluster. Each of them has:
- 2x vCPUs
- 2GB memory
- 20GB virtual disk
- 1x 1GbE NIC for Production Subnet (E1000 module)
- 1x 10GbE NIC for SAN Subnet (VMXNET3 module)

SAN storage system

1x HP P4500 G2 5.4TB SAS Storage System (composed of 2 nodes)
1x HP P4000 G2 10G BASE SFP+ Upgrade Kit (upgrading the above system to provide 10GbE functionality)

3.3. SOFTWARE

CentOS Linux
Red Hat Clustering Suite
Linux sg3_utils
iSCSI utils

3.4. CONFIGURATION

Below are all the configuration files required to build the cluster. We'll start with the Anaconda kickstart file used to set up the cluster nodes:

# Install OS instead of upgrade
################################
install

# Use HTTP installation media
###############################
url --url http://centos.slu.cz/mirror/5/os/x86_64

# System language
##################
lang en_US.UTF-8

# System keyboard
##################
keyboard us

# Root password
################
rootpw --iscrypted $1$.CPm5e0jrFjXGh$0WnEHf4/kA9DmRki

# Firewall configuration
#########################
firewall --disabled

# System authorization information
###################################
authconfig --enableshadow --enablemd5

# SELinux configuration
########################
selinux --disabled

# System timezone
##################
timezone --utc Europe/Prague

# System bootloader configuration
##################################
bootloader --location=mbr

# Clear the Master Boot Record
###############################
zerombr

# Partition clearing information
#################################
clearpart --all --initlabel

# Disk partitioning information
################################
part /boot --fstype ext2 --size=250 --asprimary
part pv.1 --size=100 --grow --asprimary
volgroup VolGroup00 --pesize=32768 pv.1
logvol swap --fstype=swap --name=SWAP --vgname=VolGroup00 --size=4096
logvol / --fstype ext4 --name=ROOT --vgname=VolGroup00 --size=20480 --grow

# Do not configure the X Window System
#######################################
skipx

# Use text mode install
########################
text

# Run the Setup Agent on first boot
####################################
firstboot --disabled

# Reboot after installation
############################
reboot

%packages
@base
@core
@clustering
@cluster-storage
samba3x
vim-enhanced
e4fsprogs
iscsi-initiator-utils
sg3_utils-libs
sg3_utils
ctdb

%post --log=/root/anaconda-kickstart-post.log

### Disable all services
############################################
printf "1. DISABLE NOT NEEDED SERVICES\n";

for i in `chkconfig --list | grep 0:off | awk '{print $1}'`
do
    printf "chkconfig $i off\n";
    /sbin/chkconfig $i off;
done

### Enable only needed services
############################################
printf "2. ENABLE ONLY NEEDED SERVICES\n";

for i in acpid crond haldaemon irqbalance lvm2-monitor messagebus network sendmail sshd syslog
do
    printf "chkconfig $i on\n";
    /sbin/chkconfig $i on;
done

### Disable SELINUX
############################################
printf "3. DISABLE SELINUX\n";
# /etc/sysconfig/selinux is normally a symlink to /etc/selinux/config, so edit the real file
sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config

Unattended Linux installation with Anaconda will be covered in a separate article. For now you can read more details here.

Before we continue with the node configuration, the SAN has to be set up. The configuration of the SAN is outside the scope of this article and will be explained in another one, but we'll mention here that we've created a LUN on the SAN storage system and configured it to grant read/write access to all nodes based on their iSCSI Qualified Names (IQNs). For this access to be granted correctly, the IQN has to be set correctly on each node. This is controlled by the /etc/iscsi/initiatorname.iscsi file.

Below is an example of the content of this file after our modification:

InitiatorName=iqn.1994-05.com.redhat:node01
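Only the part after the colon differs between nodes, so the line can be derived from the node name. The following is a minimal sketch, assuming the iqn.1994-05.com.redhat prefix shown above; make_initiator_line is a hypothetical helper name and not part of iscsi-initiator-utils.

```shell
#!/bin/bash
# Build the InitiatorName line for a given node name.
# make_initiator_line is a hypothetical helper used only for illustration.
make_initiator_line() {
    local node="$1"
    printf 'InitiatorName=iqn.1994-05.com.redhat:%s\n' "$node"
}

# Print the line for node01; on a real node the output would be
# redirected (as root) into the initiator name file.
make_initiator_line node01
```

Whatever name is generated here has to match the IQN that was granted access to the LUN on the SAN side.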

Now that the IQNs are set up correctly, we have to set up the host names so that each node can resolve the names of the others. In this article we do this by modifying the /etc/hosts file. The same file is distributed to all nodes, so that each of them can reach the others by name. Below is the content of the file:

# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1              localhost.localdomain localhost
::1                    localhost6.localdomain6 localhost6
10.0.0.1               node01.localdomain node01
10.0.0.2               node02.localdomain node02
10.0.0.3               node03.localdomain node03
10.0.0.4               node04.localdomain node04
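Because the node entries follow a regular pattern, they can be generated rather than typed by hand. This is a rough sketch that assumes the addressing used above (10.0.0.N for nodeNN in localdomain); gen_cluster_hosts is a hypothetical helper name.

```shell
#!/bin/bash
# Print /etc/hosts entries for nodes 1..N, following the pattern
# 10.0.0.<n>  node<nn>.localdomain  node<nn>
# gen_cluster_hosts is a hypothetical helper used only for illustration.
gen_cluster_hosts() {
    local count="$1" n
    for ((n = 1; n <= count; n++)); do
        printf '10.0.0.%d\tnode%02d.localdomain node%02d\n' "$n" "$n" "$n"
    done
}

# Entries for the four nodes of this article.
gen_cluster_hosts 4
```

On a node, the output could be appended to /etc/hosts after the two localhost lines.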

The next configuration file is /etc/cluster/cluster.conf, which is required by the cman service. In this file we list all cluster nodes and define the fencing mechanism we'll be using. In our case this is fence_scsi, which relies on sg3_utils to work with SCSI-3 persistent reservations.

Please note that for this configuration file to work we first have to complete the step above and edit the /etc/hosts file so that it contains all node host names.

Below is the content of /etc/cluster/cluster.conf:

<?xml version="1.0"?>
<cluster alias="FILES" config_version="1" name="FILES">
        <fence_daemon post_fail_delay="0" post_join_delay="8"></fence_daemon>
        <clusternodes>
                <clusternode name="node01" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="SCSI_fence" nodename="node01"></device>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="node02" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="SCSI_fence" nodename="node02"></device>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="node03" nodeid="3" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="SCSI_fence" nodename="node03"></device>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="node04" nodeid="4" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="SCSI_fence" nodename="node04"></device>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman></cman>
        <fencedevices>
                <fencedevice agent="fence_scsi" name="SCSI_fence"></fencedevice>
        </fencedevices>
        <rm>
                <failoverdomains>
                </failoverdomains>
        </rm>
        <dlm plock_ownership="1" plock_rate_limit="0"></dlm>
        <gfs_controld plock_rate_limit="0"></gfs_controld>
</cluster>

The following part of the above code:

        <dlm plock_ownership="1" plock_rate_limit="0"></dlm>
        <gfs_controld plock_rate_limit="0"></gfs_controld>

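is used to fine-tune the GFS2 locking performance. Setting plock_rate_limit to 0 removes the limit on the number of POSIX lock operations per second, and plock_ownership="1" lets a node cache ownership of the POSIX locks it uses. Without this part we might experience serious performance deterioration.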

IMPORTANT!!! This configuration file has to be identical on all nodes of the cluster. After creating the file on one of the nodes, we have to copy it to all other members of the cluster.

Once this is done we can use the following commands to verify the config on all nodes:

ccs_tool lsnode
ccs_tool lsfence
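The requirement that every node named in cluster.conf is also present in /etc/hosts can be checked mechanically as well. The sketch below relies only on sed and awk and on the attribute layout of the file shown above; list_cluster_nodes and missing_from_hosts are hypothetical helper names, and the sed expression is a simple text match, not a real XML parser.

```shell
#!/bin/bash
# List the clusternode names declared in a cluster.conf-style file.
# Hypothetical helper; matches lines of the form <clusternode name="...">.
list_cluster_nodes() {
    sed -n 's/.*<clusternode name="\([^"]*\)".*/\1/p' "$1"
}

# Report every cluster node name that has no entry in the given hosts file.
# Returns 0 when all names are present, 1 otherwise.
missing_from_hosts() {
    local conf="$1" hosts="$2" node rc=0
    for node in $(list_cluster_nodes "$conf"); do
        # Look for the name in any host-name column of a non-comment line.
        if ! awk -v n="$node" '
                !/^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                END   { exit !found }' "$hosts"; then
            echo "missing: $node"
            rc=1
        fi
    done
    return "$rc"
}

# Usage on a node:
#   missing_from_hosts /etc/cluster/cluster.conf /etc/hosts
```

An empty result with exit status 0 means every cluster node name can be found in the hosts file.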

4. SOLUTION EXPLANATION

Now that all four nodes are set up and the above configuration is in place, we can move on to explaining how all of it is made to work together.

The first thing to do is to discover the SAN LUN that was set up. This has to be done on all nodes and is a prerequisite for the next steps. To do this we run the following command:

[root@node01 ~]# iscsiadm -m discovery -t sendtargets -p [ip_address_of_iSCSI_target_server]

The above command produces output in the following format, which indicates that we were able to discover the LUN provided by the SAN storage system:

target_IP:port,target_portal_group_tag proper_target_name
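If the discovery output needs to be processed by a script, each line can be split into its portal and target name with plain shell parameter expansion. This is a small sketch; parse_discovery_line is a hypothetical helper, and the address and IQN in the example are made-up sample values, not the ones from our SAN.

```shell
#!/bin/bash
# Split one line of sendtargets discovery output into portal and target name.
# Expected format: target_IP:port,target_portal_group_tag proper_target_name
# parse_discovery_line is a hypothetical helper used only for illustration.
parse_discovery_line() {
    local line="$1"
    local portal_and_tag="${line%% *}"    # text before the first space
    local target="${line#* }"             # text after the first space
    local portal="${portal_and_tag%,*}"   # drop the ",tag" suffix
    printf 'portal=%s target=%s\n' "$portal" "$target"
}

# Sample values only (documentation address range, example.com IQN).
parse_discovery_line '192.0.2.10:3260,1 iqn.2003-10.com.example:files'
```

The two values are what a manual login with iscsiadm -m node -T [target] -p [portal] --login would take, should logging in by hand be preferred over letting the iscsi service do it.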

After this step is done, we have to start the iscsi and iscsid services and make sure that they're configured to start at boot time. This has to be done on all nodes of the cluster and can be accomplished like this:

[root@node01 ~]# /etc/init.d/iscsi start
[root@node01 ~]# /etc/init.d/iscsid start
[root@node01 ~]# chkconfig iscsi on
[root@node01 ~]# chkconfig iscsid on

Now that the SAN LUN is visible on all of the nodes and both /etc/hosts and /etc/cluster/cluster.conf are distributed to all nodes, we can move on with the cluster setup.

The next thing to do is to start cman and rgmanager on all cluster nodes, and then to make sure that these two services are configured to start at boot time:

[root@node01 ~]# /etc/init.d/cman start
[root@node01 ~]# /etc/init.d/rgmanager start
[root@node01 ~]# chkconfig cman on
[root@node01 ~]# chkconfig rgmanager on

To verify that the above is set up correctly, we can use the following command:

[root@node01 ~]# clustat
Member Name     ID   Status
------ ----     ---- ------
node01             1 Online, Local
node02             2 Online
node03             3 Online
node04             4 Online

The command output shows that all the nodes are online, and which node we're running the command on.
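The same check can be scripted, for example before moving on to the LVM steps. The sketch below assumes the column layout of the clustat output shown above (name, numeric ID, status); offline_members is a hypothetical helper name.

```shell
#!/bin/bash
# Read clustat-style member lines on stdin and print the name of every
# member whose status does not start with "Online".
# Exit status is 0 when all listed members are online, 1 otherwise.
# offline_members is a hypothetical helper used only for illustration.
offline_members() {
    awk '
        $2 ~ /^[0-9]+$/ {                 # member rows: name, numeric ID, status
            if ($3 !~ /^Online/) { print $1; bad = 1 }
        }
        END { exit bad + 0 }
    '
}

# Fed with the sample output from above, nothing is reported.
offline_members <<'EOF'
Member Name     ID   Status
------ ----     ---- ------
node01             1 Online, Local
node02             2 Online
node03             3 Online
node04             4 Online
EOF
```

On a running cluster the function would be used as clustat | offline_members.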

Now that the clustering services are up, we have to make LVM2 cluster aware. To do this we first modify the LVM2 configuration by running the following command:

[root@node01 ~]# lvmconf --enable-cluster

!Please note that the above command has to be run on all of the nodes!

Now we can move forward and create an LVM2 volume on the SAN LUN. Note that pvcreate takes only the device as its argument; the names are given to the volume group and the logical volume. This is done like this:

[root@node01 ~]# pvcreate /dev/sdb
[root@node01 ~]# vgcreate -c y VG_FILES /dev/sdb
[root@node01 ~]# lvcreate -n LV_FILES -L 20G VG_FILES

!Please note that the above steps are performed only on the first node of the cluster!

Once we have completed the above steps we have to start clvmd on all nodes and make sure that the service will start after reboot. (If the LVM commands above complain that clustered locking is unavailable, start clvmd first and rerun them.)

[root@node01 ~]# /etc/init.d/clvmd start
[root@node01 ~]# chkconfig clvmd on

Now we have the SAN LUN visible on all nodes, the clustering services running on all nodes, cluster-aware LVM2 configured and a logical volume created on the SAN LUN. The next step in the cluster setup is to format the LVM2 logical volume with the GFS2 clustered file system. To do this we run the following command:

[root@node01 ~]# mkfs -t gfs2 -p lock_dlm -t FILES:PV_FILES -j 4 /dev/mapper/VG_FILES-LV_FILES

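The meaning of the command parameters is:

mkfs -t "file_system_type" -p "locking_protocol" -t "cluster_name":"file_system_name" -j "number_of_journals" "device"

The second -t is passed on to mkfs.gfs2 and sets the lock table name. Its first part has to match the cluster name in /etc/cluster/cluster.conf (FILES in our case); the second part is a name for the file system that is unique within the cluster (we simply used PV_FILES; it doesn't have to match any LVM name). One journal is needed for each node that will mount the file system, so the number of journals equals the number of nodes in the cluster.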

!!! Please note that this command has to be executed on only one of the cluster nodes and must not be executed on the others!
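To avoid mistakes in the lock table name or the journal count, the command line can be assembled from its parts and reviewed before it is run. The following sketch only prints the command and never formats anything; build_mkfs_gfs2 is a hypothetical helper, and the 16-character limit on the file system name is an assumption taken from the RHEL 5 GFS2 documentation.

```shell
#!/bin/bash
# Print the mkfs command for a GFS2 file system with one journal per node.
# build_mkfs_gfs2 is a hypothetical helper; it only prints the command.
build_mkfs_gfs2() {
    local cluster="$1" fsname="$2" nodes="$3" device="$4"
    # Assumption: file system names are limited to 1-16 characters,
    # as described in the RHEL 5 GFS2 documentation.
    if [ "${#fsname}" -lt 1 ] || [ "${#fsname}" -gt 16 ]; then
        echo "file system name must be 1-16 characters" >&2
        return 1
    fi
    printf 'mkfs -t gfs2 -p lock_dlm -t %s:%s -j %d %s\n' \
        "$cluster" "$fsname" "$nodes" "$device"
}

# Cluster name from cluster.conf, file system name, node count, device.
build_mkfs_gfs2 FILES PV_FILES 4 /dev/mapper/VG_FILES-LV_FILES
```

With the values of this article the printed line is the same command that was used above.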

At this point our GFS2 clustered file system is ready for use. One additional thing we have to do is tell all cluster nodes to mount the file system at boot time. On CentOS 5, GFS2 entries in /etc/fstab are mounted at boot by the gfs2 init script, so that service should be enabled too (chkconfig gfs2 on). The line to put in /etc/fstab is:

/dev/mapper/VG_FILES-LV_FILES   /mnt/files        gfs2    defaults,noatime,nodiratime     0 0
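Since the same line has to be added on every node, it helps to add it in a way that is safe to repeat. This is a minimal sketch that appends a line only if it is not already present; ensure_line is a hypothetical helper, and the example works on a scratch file instead of the real /etc/fstab.

```shell
#!/bin/bash
# Append a line to a file only if an identical line is not already there.
# ensure_line is a hypothetical helper used only for illustration.
ensure_line() {
    local file="$1" line="$2"
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Demonstrated on a scratch file; on a node the target would be /etc/fstab.
scratch=$(mktemp)
entry='/dev/mapper/VG_FILES-LV_FILES   /mnt/files        gfs2    defaults,noatime,nodiratime     0 0'
ensure_line "$scratch" "$entry"
ensure_line "$scratch" "$entry"   # second call changes nothing
cat "$scratch"                    # the entry appears once
rm -f "$scratch"
```

The match is on the whole line, so an entry that differs only in spacing would still be added a second time.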

To mount it right away, without rebooting, we create the mount point if it doesn't exist yet and run:

[root@node01 ~]# mkdir -p /mnt/files
[root@node01 ~]# mount /dev/mapper/VG_FILES-LV_FILES /mnt/files

With this step done, the recipe is complete and we can explore the possibilities of the GFS2 clustered file system further.

5. INTERESTING READING THAT HELPED DURING THE CREATION OF THIS ARTICLE

1. Cluster Manager
2. CentOS Kickstart Tips & Tricks
3. Red Hat Cluster Suite
4. GFS2 Setup Notes
5. iSCSI setup on Red Hat 5 / CentOS 5
6. Red Hat Cluster Suite Overview
7. Configuring and managing Red Hat Cluster
8. LVM Administrator's Guide
9. Red Hat Global File System
10. SCSI Reservations & Red Hat Cluster Suite