Sun Cluster 3.1 on Solaris 9
26 November 2009 // Sun Cluster

The Internet has loads of information on how to build a Sun Cluster, but none of it, at least that I can find, tells you why you take each of the required steps.

I think it’s a big problem that we have a generation of sys-admins who have learnt everything they know from HOWTOs. It’s like studying for an exam by just learning the questions that will be on the paper (as I believe happens in schools these days). The immediate results will be good, but you get no depth of knowledge, and probably won’t be able to fix what you’ve made when it breaks, or apply the techniques to anything else. So, I’ve annotated some of my Sun Cluster notes, to try to explain the theory behind what’s going on.

My hardware for this project is a Sun 3510 array for shared storage, and a couple of v210s. Fine machines though they are, you wouldn’t use v210s for a real HA cluster, because they have too many single points of failure. There’s only one power supply, and a single PCI slot, which limits you to one HBA or one extra NIC.

Planning

I’ve seen (mainly Veritas) clusters develop issues later in their lifetimes because people change them too much. My number one tip for building a cluster is to get it right the first time. My number two tip is to leave the damn thing alone. So, there’s a bit of planning to be done first. Here are the things you need to decide.

Cluster Name

This is the name you give to the cluster as a whole. It can be anything you like, and you won’t use it much. I’m going to call mine robot.

cluster name    robot

Node Hostnames and IP Addresses

You’ll need three public network IP addresses for each host, because IPMP has to be used on all interfaces. We’ll only be using one IPMP group, called talk on both nodes. You don’t have to have the same group name across nodes, but it seems sensible to me, especially if you are using multiple groups.

node 1                  sc31-01 on 192.168.1.61
node 2                  sc31-02 on 192.168.1.62
node 1 test addresses   192.168.1.63 and 192.168.1.64
node 2 test addresses   192.168.1.65 and 192.168.1.66

cluster transport interfaces   bge2 and bge3

Service Names and IP Addresses

Each service, for instance NFS, Apache, Oracle, whatever, requires a logical hostname and IP address.

My services are going to be a highly available NFS service, and an Apache service which runs in parallel on both nodes. Each of these will need filesystems for the data to share, which implies a VxVM disk group, a floating IP address, and a logical hostname. I’ll use the same name for the service name and logical hostname.

service name   service type   IP address     disk group   resource group
------------------------------------------------------------------------
robot-nfs      failover       192.168.1.70   clnfs        nfs-rg
robot-www      scalable       192.168.1.71   clwww        www-rg

The service names and IP addresses need to be in the /etc/hosts files on both nodes.
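As a rough sketch of that requirement, here are the entries built from the planning tables above, along with a small function to confirm a hosts file contains them. check_hosts is a hypothetical helper of my own, not part of Solaris or Sun Cluster, and it assumes a POSIX shell and awk (on Solaris 9, ksh and nawk rather than /bin/sh and the old awk).

```shell
# Entries that should appear in /etc/hosts on both nodes. The names and
# addresses come from the planning tables above.
cluster_hosts='192.168.1.61 sc31-01
192.168.1.62 sc31-02
192.168.1.70 robot-nfs
192.168.1.71 robot-www'

# check_hosts FILE: hypothetical helper. Prints every planned entry which
# FILE does not list against the planned address, and returns non-zero if
# any are missing.
check_hosts() {
    rc=0
    while read ip name
    do
        if awk -v ip="$ip" -v name="$name" '
            $1 == ip { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
            END { exit(found ? 0 : 1) }' "$1"
        then
            :
        else
            echo "missing: $ip $name"
            rc=1
        fi
    done <<EOF
$cluster_hosts
EOF
    return $rc
}
```

Run it as check_hosts /etc/hosts on each node. No output means every planned name is present.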

Cabling

On each host I’ve connected bge0 and bge1 to a switch. These interfaces will be put into IPMP groups later, to provide maximum resilience. Ideally you’d have, say, a QFE card in your server, and use a bge interface and a qfe interface, so if a card failed you’d still have connectivity. It would also be nice to be able to connect each NIC to a different switch, but we’re only in a lab, so we won’t.

bge2 and bge3 will team up to provide the private cluster interconnect. Again, you’d ideally have these on separate NIC cards and separate switches, but I’m just going between them with a couple of crossover cables. If you have more than two nodes, that obviously means you need a switch or two - crossovers won’t cut it. When you start to configure the cluster with scinstall, it will check that there’s no other traffic on the private network, so set up a private cluster VLAN.

In the old days of Sun Cluster 2, the private network was for the cluster heartbeat and, if I recall correctly, not a lot else. In version 3, things are very different. There’s a great deal of communication between nodes, even to the point where a node has full access to storage to which it is not physically attached. All that happens over the private network, so make it reliable, and make it fast. The cluster framework will trunk multiple connections together.

A lack of cables means I’ve had to cut corners on my FC connections. In a production cluster you’d have at least two connections between each host and the storage array, but I’ve only got two cables, so it’s a single point of failure for me, and no opportunity to demonstrate multipathing.

The Hosts

My hosts are going to be called sc31-01 and sc31-02. Let’s check them out with s-audit.

# s-audit.sh platform

'platform' audit on sc31-01

       hostname : sc31-01
       hardware : Sun Fire V210 (64-bit SPARC)
 virtualization : none
            CPU : 2 @ 1336MHz
         memory : 2048Mb physical
         memory : 2.0Gb swap
            OBP : 4.30.4.a
       ALOM f/w : v1.6.10
        ALOM IP : 192.168.1.136
        storage : disk: 2 x 73GB SCSI
        storage : CD/DVD: 1 x ATA (empty)
           card : scsi-fcp (SUNW,qlc PCI0@66MHz) QLA2342
         EEPROM : local-mac-address?=true
         EEPROM : scsi-initiator-id=7
         EEPROM : auto-boot-on-error?=true
         EEPROM : auto-boot?=true
         EEPROM : boot-device=/pci@1c,600000/scsi@2/disk@0,0:a
         EEPROM : use-nvramrc?=true
         EEPROM : diag-level=max
       devalias : disk0 /pci@1c,600000/scsi@2/disk@0,0:b
       devalias : disk1 /pci@1c,600000/scsi@2/disk@1,0:b
# s-audit.sh platform

'platform' audit on sc31-02

       hostname : sc31-02
       hardware : Sun Fire V210 (64-bit SPARC)
 virtualization : none
            CPU : 2 @ 1336MHz
         memory : 2048Mb physical
         memory : 2.0Gb swap
            OBP : 4.30.4.a
       ALOM f/w : v1.6.10
        ALOM IP : 192.168.1.137
        storage : disk: 2 x 73GB SCSI
        storage : CD/DVD: 1 x ATA (empty)
           card : scsi-fcp (SUNW,qlc PCI0@66MHz) QLA2342
         EEPROM : local-mac-address?=true
         EEPROM : scsi-initiator-id=7
         EEPROM : auto-boot-on-error?=true
         EEPROM : auto-boot?=true
         EEPROM : boot-device=disk:a disk0 disk1 net
         EEPROM : use-nvramrc?=true
         EEPROM : diag-level=min
       devalias : disk0 /pci@1c,600000/scsi@2/disk@0,0:b
       devalias : disk1 /pci@1c,600000/scsi@2/disk@1,0:b

All looks good, with a QLogic HBA in each host to talk to the 3510. Note that the scsi-initiator-id is the same on both hosts. Because we’re using fibre, that’s okay, but if we were using old-skool SCSI to connect to the shared storage, we’d have to change the value on one of the hosts to avoid a collision. Just drop one of the values to 6 so it maintains the highest possible priority. Note also that local-mac-address? is set to true. This is important because we have to use IPMP later, which requires that every NIC has a unique MAC address.

Installing Solaris 9

I Jumpstart using my own framework, issuing these commands:

# setup_client.sh -sp -mc1t0d0:c1t1d0 -f all \
  /js/ufs/images/sparc/9-905hw-ga sc31-01

# setup_client.sh -sp -mc1t0d0:c1t1d0 -f all \
  /js/ufs/images/sparc/9-905hw-ga sc31-02

A little bit of profile editing is necessary. Here’s a suitable profile.

install_type    initial_install
system_type     standalone
cluster         SUNWCuser

partitioning    explicit
filesys mirror:d0 c1t0d0s0 c1t1d0s0       400     /
filesys mirror:d1 c1t0d0s1 c1t1d0s1       free    /opt
filesys mirror:d3 c1t0d0s3 c1t1d0s3       200     /globaldevices
filesys mirror:d4 c1t0d0s4 c1t1d0s4       2048    swap
filesys mirror:d5 c1t0d0s5 c1t1d0s5       1024    /var
filesys mirror:d6 c1t0d0s6 c1t1d0s6       1024    /usr
metadb  c1t0d0s7 size 8192 count 4
metadb  c1t1d0s7 size 8192 count 4

There are two things to notice here. First, Sun Cluster requires the SUNWCuser install cluster. Second, there is the /globaldevices mountpoint. When the cluster framework is configured, /globaldevices will be turned into a global device mounted at /global/.devices/node@n, and all global devices require unique minor numbers, which means unique metadevice numbers. So, on one of your hosts use d3 for /globaldevices, and on the other, use d13. If you don’t do that, booting the cluster will give you an error of the form

WARNING - Unable to mount one or more of the following filesystem(s):
        /global/.devices/node@2
If this is not repaired, global devices will be unavailable.
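One way to avoid that error is to generate the second node's profile from the first, changing only the /globaldevices metadevice. This is a sketch: make_node2_profile is a hypothetical helper, and it assumes the profile uses exactly the mirror:d3 naming shown above.

```shell
# make_node2_profile: hypothetical helper. Reads a JumpStart profile on
# stdin and writes it to stdout with the /globaldevices mirror renamed
# from d3 to d13. Every other line passes through untouched.
make_node2_profile() {
    sed '/\/globaldevices/s/mirror:d3 /mirror:d13 /'
}
```

You would run something like make_node2_profile < profile.sc31-01 > profile.sc31-02, where both file names are made up for the example.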

Though we will be using Veritas Volume Manager for cluster filesystems, I prefer mirroring my boot disks with DiskSuite. VxVM encapsulation has always seemed messy to me, and I’ve seen people get into a real tangle trying to correct disk failures when it’s been used. (That’s not Veritas’s fault - the technology works, if you know how to use it.) There’s also an issue with device minor numbers and VxVM boot disks with Sun Cluster, which I’m not going to go into.

Post Install

Obviously it’s always a good idea to fully patch your fresh install. Use PCA, not the junk that Sun supply. Because we’re going to use a 3510 we need FC drivers, which for Solaris 9 (and earlier) are in the SAN package.

# uncompress -c SAN_4.4.13_install_it.tar.Z | tar -xf -
# cd SAN_4.4.13_install_it
# ./install_it

And do a reconfiguration reboot. This, of course, must be done on both hosts.

We also want sccli, which means a download and install of the Sun StorEdge 3000 Family Storage Products v2.3. It seems sensible to me to put it on both nodes.

Configuring the 3510

Given that the 3510 has excellent hardware RAID capabilities, it would seem sensible to use those to present a single logical volume to both hosts. But I’m not going to do that. I’m going to present Solaris with four logical disks, and put those in VxVM disk groups. This is so I can better illustrate how VxVM would be used in a clustered environment using a JBOD. (Some people even prefer to do disk management through VxVM. It does have certain advantages.)

Connect to the 3510 with sccli and have a look at what disks we have:

# sccli
sccli: selected device /dev/es/ses2 [SUN StorEdge 3510 SN#099E20]
sccli> show disks
Ch     Id      Size   Speed  LD     Status     IDs                      Rev
----------------------------------------------------------------------------
 2(3)   0  136.73GB   200MB  ld0    ONLINE     SEAGATE ST314670FSUN146G 055A
                                                   S/N 0643K13D
                                                  WWNN 20000014C3849F4B

There are eleven more, but they’re all the same. You can see this disk belongs to logical disk unit ld0 and is on channels 2 and 3.

This is important because in a 3510 we deal far more with logical disks. A logical disk can be created from one or more physical disks using a number of RAID configurations. All we want to do in this example is a one-to-one mapping, where each physical disk simply “hides behind” a logical one.

sccli> show ld
LD    LD-ID        Size  Assigned  Type   Disks Spare  Failed Status
------------------------------------------------------------------------
ld0   301B8CB2 136.48GB  Primary   NRAID  1     0      0      Good
                         Write-Policy Default          StripeSize 128KB
ld1   46D3E0AA 136.48GB  Primary   NRAID  1     0      0      Good
                         Write-Policy Default          StripeSize 128KB
ld2   01F2BEE0 136.48GB  Primary   NRAID  1     0      0      Good
                         Write-Policy Default          StripeSize 128KB
ld3   58D32860 136.48GB  Primary   NRAID  1     0      0      Good
                         Write-Policy Default          StripeSize 128KB
ld4   0ECFC55E 136.48GB  Primary   NRAID  1     0      0      Good
                         Write-Policy Default          StripeSize 128KB
...

This is the default configuration of a 3510, and it’s just what we want.

If you look on the back of a 3510 you’ll see the GBICs are numbered. These are the channels through which the hosts connect, and each channel must be assigned a SCSI target number for each of the 3510’s controllers. The channel IDs show up as the target number on the hosts. If you’re connected to the primary controller you see the disks as PID targets; via the secondary controller, you see them as SID targets.

sccli> show channels
Ch  Type    Media   Speed   Width  PID / SID
--------------------------------------------
 0  Host    FC(L)   2G      Serial  40 / N/A
 1  Host    FC(L)   N/A     Serial  N/A / 42
 2  DRV+RCC FC(L)   2G      Serial  14 / 15
 3  DRV+RCC FC(L)   2G      Serial  14 / 15
 4  Host    FC(L)   2G      Serial  44 / N/A
 5  Host    FC(L)   N/A     Serial  N/A / 46
 6  Host    LAN     N/A     Serial  N/A / N/A

Channels 2 and 3 are for the disks to communicate with the controllers, so they are private. I’m only going to use 0 and 1, which are clearly labelled on the back of the array. I’ll give channel 0 IDs 40 and 42, and channel 1 IDs 41 and 43.

sccli> configure channel 0 pid 40
sccli: changes will not take effect until controller is reset
sccli> configure channel 1 pid 41
sccli: changes will not take effect until controller is reset
sccli> configure channel 1 sid 43
sccli: changes will not take effect until controller is reset
sccli> configure channel 0 sid 42
sccli: changes will not take effect until controller is reset
sccli> show channels
Ch  Type    Media   Speed   Width  PID / SID
--------------------------------------------
 0  Host    FC(L)   2G      Serial  40 / 42
 1  Host    FC(L)   N/A     Serial  41 / 43
...
sccli> reset controller

We have physical disks masquerading as logical disks, we have SCSI channels, so we need something to put those disks on those channels. That’s what mappings are for.

We need to map each logical disk to both channels - i.e. both controllers. Then, each host will be able to see each disk. The map command is as follows

map logical_disk channel.id.disk_number

So to map ld0 to disk 0 on both channels:

sccli> map ld0 0.40.0
sccli: mapping ld0-00 to 0.40.0
sccli> map ld0 1.41.0
sccli: mapping ld0-00 to 1.41.0

Reset the 3510, and, as promised:

[sc31-01]# echo | format
...
       2. c3t40d0 <SUN-StorEdge3510-423A cyl 35211 alt 2 hd 64 sec 127>
          /pci@1d,700000/SUNW,qlc@1/fp@0,0/ssd@w216000c0ff899e20,0

[sc31-02]# echo | format
       2. c3t41d0 <SUN-StorEdge3510-423A cyl 35211 alt 2 hd 64 sec 127>
          /pci@1d,700000/SUNW,qlc@1/fp@0,0/ssd@w226000c0ff999e20,0

If you were doing this properly, you would map each disk to the primary and secondary controllers, so each host had two paths to each disk. That way, a failure in one 3510 controller, or one HBA would not result in any loss of storage connectivity.

Installing Sun Cluster 3.1

So we have two computers installed with Solaris 9, both connected to the same shared storage, cabled into a public network, and with crossover cables forming a private network. We’re all ready to install the Sun Cluster software.

First you have to install Sun Web Console. As a minimal systems nut, I’ve always been uncomfortable with the amount of junk that Sun Cluster depends on. Surely if there were ever a case for a hardened minimal system, it’s on a cluster? Apparently not. Installing Web Console is as simple as cd-ing to the sun_web_console/2.1 directory and running

# ./setup

Then go to the sun_cluster directory and run the installer. You’ll need X for this, so either do the old xhost/DISPLAY trick, or ssh -X.

# ./installer

If you need any more information than that, then you probably shouldn’t be trying to build a cluster. Do the “typical” install; there’s little to be gained by going “custom”.

If you want to see what was installed and where it went:

$ pkginfo | egrep "Cluster|cacao|jdmk|mdmx"
application SUNWcacao            Cacao Component
application SUNWcacaocfg         Cacao configuration files
application SUNWjdmk-runtime     Java DMK 5.1 Runtime Library
application SUNWjdmk-runtime-jmx Java DMK 5.1 JMX libraries
application SUNWscdev            Sun Cluster developer support
application SUNWscgds            Sun Cluster Generic Data Service
application SUNWscman            Sun Cluster Manual Pages
application SUNWscmasa           Sun Cluster Managability and Serviceability Agent
application SUNWscnm             Sun Cluster name
application SUNWscr              Sun Cluster, (root)
system      SUNWscrsm            Sun Cluster RSM Transport
application SUNWscsal            Sun Cluster SyMON agent library
application SUNWscsam            Sun Cluster SyMON modules
application SUNWscu              Sun Cluster, (Usr)
application SUNWscvm             Sun Cluster VxVM Support

Or look at the files in those packages

$ pkginfo | egrep "Cluster|cacao|jdmk|mdm" | awk '{ print $2 }' | \
> while read p
> do
>   pkgchk -l "$p"
> done | grep Path | sort

and you’ll see most stuff is in /usr/cluster, so add /usr/cluster/bin to your PATH. Most interestingly, though, there are files in /kernel. If you had the “pleasure” of working with Sun Cluster 2, you’ll recall that it was entirely a userland application. Just stuff that sat on top of Solaris, monitored applications and paths, and failed services over to other nodes (you hoped). Sun Cluster 3 is a proper, tightly integrated piece of software. The part of it most tightly bound to the kernel is that which makes global filesystems work.

Also installed are Sun Explorer, a Sun PS tool which finds out detailed system information, some Apache SSL extensions, and SunPlex packages. SunPlex is a rudimentary web based GUI to Sun Cluster. I’ve never used it, and it isn’t even going to be a part of Sun Cluster any more, so don’t worry about it. You don’t need it for the cluster to work.

If you reboot and watch the console, you’ll see a message complaining that the SunPlex installer requires Apache packages. Don’t worry about it.

Configuring the Cluster Framework

Up to now we’ve only done normal Solaris stuff. Here’s where we start to move into the cluster’s world.

Node 01

Run

[sc31-01]# scinstall

and choose option 1 to Install a cluster or cluster node. You now have the option to install your cluster all in one go, or one node at a time. Force of habit, I always do it one node at a time. Sun showed me how to do it that way, so it’s the way I’ve always used. So, option 2 for me.

You’ll notice that scinstall is a wrapper to the other sc* commands, and it always shows you the commands it runs. I like this, because it gives you a look at how things work underneath.

So, agree for sccheck to examine your system. It is able to apply patches if you want it to. I don’t.

When asked, supply the cluster name you decided on earlier, and sccheck will do its thing.

You will have to supply the names of the other nodes in your cluster, so the framework can begin creating the Cluster Configuration Repository, or CCR. This is a little database of flat files, stored in /etc/cluster/ccr, which keeps track of the members of the cluster and their states, as well as other important stuff like the disk paths, disk groups, and the configuration of the cluster transport. DON’T MESS WITH IT!

I’ve never worked on a site that used the DES authentication method, so I always skip that. I always accept the default transport address and netmask too.

My cluster is using crossover cables on bge2 and bge3, so I have no junctions. If you do use junctions, you will be asked to supply names for them. Those names are only meaningful to the cluster. You’ll be informed that DLPI will be used for the transport. This is a simple, low-level protocol that uses the MAC layer rather than IP.

Next you’ll be on to the global device filesystem. This is the method by which any cluster node is able to access storage physically attached to any other node. Clever stuff. Remember how I told you to put /globaldevices in your Jumpstart profile? We’ll need it now.

You’ll see the scinstall command that is going to be run. In my case it’s

# scinstall -ik \
  -C robot \
  -F \
  -T node=sc31-01,node=sc31-02,authtype=sys \
  -A trtype=dlpi,name=bge2 -A trtype=dlpi,name=bge3 \
  -B type=direct

Pretty simple eh? That’s what I like about Sun Cluster - nothing ever seems any more complicated than it needs to be. -C sets the cluster name; -F says this is the first node in the cluster; -T lists the nodes that will form the cluster, and the authentication used when members join (system, because we declined the DES encrypted option); -A lists the cluster transport interconnects, and specifies the DLPI protocol will be used; -B is used to list transport junctions, and we have a direct (crossover) connection.

When the system reboots, watch the console. You’ll see some errors:

  Configuring the /dev directory (compatibility devices)
  /usr/cluster/bin/scdidadm: Could not load DID instance list.  Cannot
  open /etc/cluster/ccr/did_instances.

DID is the device ID pseudo-driver. It’s the mechanism used to access remote storage. Every disk on every cluster node is assigned a unique DID, which is the same on every node. So applications which access filesystems through a DID, using /dev/global/dsk, rather than /dev/dsk, will work on any cluster node. As of this moment, the DID database hasn’t been built.

Booting as part of a cluster
NOTICE: CMM: Node sc31-01 (nodeid = 1) with votecount = 1 added.
NOTICE: CMM: Node sc31-01: attempting to join cluster.
NOTICE: CMM: Cluster has reached quorum.

Ah, quorum. Imagine a two node cluster running a parallel database - each node updates the other. If that cluster loses its transport interconnect, both nodes carry on working, assuming the other node is down, and updating their own local copy of the database without sending updates to the other node, breaking the synchronization. Bad news eh? So, you need some mechanism which in such an event says “this half of the cluster is the one to trust”, and which shuts the other half down. If each node had a vote when this decision was made, it would be a tie - one all. So, there’s a quorum device, usually a disk, which affiliates itself to one particular node, and has the casting vote. At the moment, our cluster has one node, with one vote, so only one vote is required for quorum. It has that.

NOTICE: CMM: Node sc31-01 (nodeid = 1) is up; new incarnation number =
1323099728.
NOTICE: CMM: Cluster members: sc31-01.
NOTICE: CMM: node reconfiguration #1 completed.
NOTICE: CMM: Node sc31-01: joined cluster.
ip: joining multicasts failed (18) on clprivnet0 - will use link layer
broadcasts for multicast

Remember how I said the transport interconnects would be grouped together into a single logical link? That’s clprivnet0. Remember, too, that we were told during the configuration that we were going to use DLPI (i.e., link layer).

Configuring DID devices
did instance 1 created.
did subpath sc31-01:/dev/rdsk/c0t0d0 created for instance 1.
did instance 2 created.
did subpath sc31-01:/dev/rdsk/c1t0d0 created for instance 2.
did instance 3 created.
did subpath sc31-01:/dev/rdsk/c1t1d0 created for instance 3.
did instance 4 created.
did subpath sc31-01:/dev/rdsk/c3t40d0 created for instance 4.
did instance 5 created.
did subpath sc31-01:/dev/rdsk/c3t40d3 created for instance 5.
did instance 6 created.
did subpath sc31-01:/dev/rdsk/c3t40d2 created for instance 6.
did instance 7 created.
did subpath sc31-01:/dev/rdsk/c3t40d1 created for instance 7.

And there are the DID devices I was telling you about. Once the node is up, we can have a look at those.

Node 02

Again, run scinstall and choose option 1. This time, we’ll be adding this machine as a node in an existing cluster.

Our “sponsoring node” is sc31-01, the boss of the cluster right now, and the name of the cluster is still robot.

As before, let sccheck make sure everything is okay. I’ve always found the autodiscovery of the transport interfaces works perfectly, so let it happen. It will check with you before adding anyway, so make sure everything looks right.

sc31-01:bge2  -  sc31-02:bge2
sc31-01:bge3  -  sc31-02:bge3

Looks good to me. This time my scinstall command is

# scinstall -ik \
  -C robot \
  -N sc31-01 \
  -A trtype=dlpi,name=bge2 -A trtype=dlpi,name=bge3 \
  -B type=direct \
  -m endpoint=:bge2,endpoint=sc31-01:bge2 \
  -m endpoint=:bge3,endpoint=sc31-01:bge3

Which is similar to before, but with the name of the sponsoring node supplied by -N, rather than the names of all the nodes in the cluster, and -m used to specify the transport interconnects. Let it install, let it reboot, and watch the consoles.

If you still had your sc31-01 console open, you’d have seen the cl_runtime daemon inform you that it had seen the cluster transport links being created.

Final Configuration

Earlier I talked about quorum, and there being a device with a casting vote. We have to set that up now. Let’s examine the quorum status with scstat.

$ scstat -q

-- Quorum Summary --

  Quorum votes possible:      1
  Quorum votes needed:        1
  Quorum votes present:       1


-- Quorum Votes by Node --

                    Node Name           Present Possible Status
                    ---------           ------- -------- ------
  Node votes:       sc31-01             1        1       Online
  Node votes:       sc31-02             0        0       Online


-- Quorum Votes by Device --

                    Device Name         Present Possible Status
                    -----------         ------- -------- ------

Only one quorum vote, and it belongs to sc31-01. A node which can’t see enough votes for quorum will panic. Try rebooting sc31-01 and watch sc31-02. With no votes of its own, it loses quorum and panics. That’s not much of a cluster is it? The whole idea is that if one node goes down the other takes over! We need to finish the quorum setup.

The final quorum device will be a disk that is always visible to both servers. That is, something in the 3510. We can see what disks are available, and their global paths, using scdidadm.

# scdidadm -L
1        sc31-01:/dev/rdsk/c0t0d0       /dev/did/rdsk/d1
2        sc31-01:/dev/rdsk/c1t0d0       /dev/did/rdsk/d2
3        sc31-01:/dev/rdsk/c1t1d0       /dev/did/rdsk/d3
4        sc31-01:/dev/rdsk/c3t40d0      /dev/did/rdsk/d4
4        sc31-02:/dev/rdsk/c3t41d0      /dev/did/rdsk/d4
5        sc31-01:/dev/rdsk/c3t40d3      /dev/did/rdsk/d5
5        sc31-02:/dev/rdsk/c3t41d3      /dev/did/rdsk/d5
6        sc31-01:/dev/rdsk/c3t40d2      /dev/did/rdsk/d6
6        sc31-02:/dev/rdsk/c3t41d2      /dev/did/rdsk/d6
7        sc31-01:/dev/rdsk/c3t40d1      /dev/did/rdsk/d7
7        sc31-02:/dev/rdsk/c3t41d1      /dev/did/rdsk/d7
8        sc31-02:/dev/rdsk/c0t0d0       /dev/did/rdsk/d8
9        sc31-02:/dev/rdsk/c1t0d0       /dev/did/rdsk/d9
10       sc31-02:/dev/rdsk/c1t1d0       /dev/did/rdsk/d10

This shows all the disks with their “traditional” paths and their new-fangled global device IDs. I’d say disk 4 looks a suitable candidate, wouldn’t you?
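If you would rather not pick candidates out by eye, the listing can be filtered for DID instances that more than one node reports. This is only a sketch: shared_dids is a hypothetical helper, it assumes the three-column scdidadm -L layout shown above, and it wants a POSIX awk (nawk on Solaris 9).

```shell
# shared_dids: hypothetical helper. Reads `scdidadm -L` style output on
# stdin and prints the DID device of every instance which is reported by
# more than one node, i.e. the disks both nodes can reach.
shared_dids() {
    awk '
        NF >= 3 {
            split($2, path, ":")        # path[1] is the node name
            key = $1 SUBSEP path[1]
            if (!(key in seen)) {       # count each node once per instance
                seen[key] = 1
                nodes[$1]++
                did[$1] = $3
            }
        }
        END {
            for (i in nodes)
                if (nodes[i] > 1) print did[i]
        }' | sort
}
```

Fed with scdidadm -L | shared_dids, it should list only the disks in the 3510, since the internal disks each belong to a single node.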

You can add a quorum device through scsetup. Run it on either node, and it will ask if you want to add a quorum disk. Handy, that. Say “yes”, and tell it to use d4. You’ll see the “real” command to do this is

# scconf -a -q globaldev=d4

scsetup offered us that option by default because the cluster was still in “installmode”. When you exit scsetup this time, you will be able to turn that off, which you should do, because the cluster framework is now fully installed.

Now have another look at the quorum status.

$ scstat -q

-- Quorum Summary --

  Quorum votes possible:      3
  Quorum votes needed:        2
  Quorum votes present:       3


-- Quorum Votes by Node --

                    Node Name           Present Possible Status
                    ---------           ------- -------- ------
  Node votes:       sc31-01             1        1       Online
  Node votes:       sc31-02             1        1       Online


-- Quorum Votes by Device --

                    Device Name         Present Possible Status
                    -----------         ------- -------- ------
  Device votes:     /dev/did/rdsk/d4s2  1        1       Online

If you reboot node 1, node 2 will have the quorum device, giving it the two votes needed to keep the cluster running. Clever, isn’t it?
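The numbers in both scstat outputs above are consistent with a simple majority rule: you need more than half of the possible votes. Here is that arithmetic as a sketch. The function names are mine, and it needs a shell with $(( )) arithmetic (ksh or bash, not Solaris 9's /bin/sh).

```shell
# quorum_needed POSSIBLE: votes required for a majority of POSSIBLE votes.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# has_quorum PRESENT POSSIBLE: succeeds if PRESENT votes form a majority
# of POSSIBLE votes.
has_quorum() {
    [ "$1" -ge "$(quorum_needed "$2")" ]
}
```

quorum_needed 1 gives 1, which is the single-node case from earlier, and quorum_needed 3 gives 2. With node 1 down, node 2 holds its own vote plus the quorum disk's, so has_quorum 2 3 succeeds. A node on its own without the disk fails has_quorum 1 3.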

What’s Running?

Let’s have a look at a running cluster framework.

$ pgrep -fl cl
    4 cluster

This is the daddy, the boss of them all, encapsulating the kernel part of the cluster framework. It’s a kernel process, so you can’t kill it. Never.

85 /usr/cluster/lib/sc/failfastd

This has the ability to panic a machine if requested by another part of the framework.

87 /usr/cluster/lib/sc/clexecd
88 /usr/cluster/lib/sc/clexecd

clexecd is the mechanism through which cluster nodes issue commands to each other, and from the kernel to userland.

1770 /usr/cluster/lib/sc/cl_eventd

Looks out for system events generated within the cluster (nodes joining and such like), and passes notification of them between nodes.

2136 -su -c cd /opt/SUNWcacao ; /usr/j2se/bin/java -Xms4M -Xmx64M  -classpath /opt

The Common Agent Container. This is purely for the benefit of SunPlex. Feel free to rename /etc/rc3.d/S79cacao and neuter it.

1774 /usr/cluster/lib/sc/rpc.fed

I don’t fully understand what the fork execution daemon does. It’s something to do with starting and stopping resources.

1745 /usr/cluster/lib/sc/sparcv9/rpc.pmfd

This handles all the process monitoring for the cluster framework itself, so it’s very important. It restarts other cluster daemons, and makes sure applications and fault monitors keep running. An oddity of rpc.pmfd is that it uses /proc to monitor these processes, which is why any attempt to use truss or pfiles on a cluster process gives you a “process is traced” message.

1769 /usr/cluster/lib/sc/cl_eventlogd

As you would expect, this is responsible for cluster logging. It writes to a binary log, /var/cluster/logs/eventlog. I’m not aware of any way you can read this file.

1798 /usr/cluster/lib/sc/rgmd

The resource group manager. We’ll learn about resource groups soon; this brings them up and down on the relevant nodes.

1789 /usr/cluster/bin/pnmd

The public network monitoring daemon. It keeps an eye on your IPMP groups.

1948 /usr/cluster/lib/sc/scdpmd

Monitors disk paths.

2155 /usr/cluster/lib/sc/cl_ccrad

The daemon which manages the CCR.

We’ve seen the quorum information already. We can also use scstat to check the cluster transport paths.

$ scstat -W

-- Cluster Transport Paths --

                    Endpoint            Endpoint            Status
                    --------            --------            ------
  Transport path:   sc31-01:bge3        sc31-02:bge3        Path online
  Transport path:   sc31-01:bge2        sc31-02:bge2        Path online
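If you want to check this from a script rather than by eye, a filter along these lines would do. It is a sketch only: down_paths is a hypothetical helper, and it assumes the scstat -W layout shown above, treating anything other than “Path online” as a problem.

```shell
# down_paths: hypothetical helper. Reads `scstat -W` style output on stdin
# and prints the endpoints and status of every transport path whose status
# is anything other than "Path online".
down_paths() {
    awk '$1 == "Transport" && $2 == "path:" {
        status = $5
        for (i = 6; i <= NF; i++) status = status " " $i
        if (status != "Path online") print $3, $4, "(" status ")"
    }'
}
```

scstat -W | down_paths prints nothing when both interconnects are healthy.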

That’s all the cluster framework configured. But aside from informing us when other nodes join or leave the cluster, it doesn’t do anything.

VxVM Disk Groups

Sun Cluster offers two ways of running applications. You can run scalable applications, like Apache, which run simultaneously on both nodes, or failover applications, like NFS, which run on one node, and migrate to another should that node fail. Both nodes can see the same storage but, if we were running a service on one node, we wouldn’t want the other node to have access to its storage - something bad could happen.

So, we put disks into groups, using either VxVM or SVM, and import and deport those groups to and from hosts as required.

I’ve been around a bit, but I’ve never seen a production cluster running its resource disk groups on SVM. So, we’ll ignore that and use VxVM. A copy of Storage Foundation Basic is fine for this exercise, because we’re not going to use enough disks, volumes, or file systems to run into the restrictions.

I’m not going to tell you how to install SFB, because it’s very easy, varies from one version to another, and this article is getting way too long already. Don’t enable enclosure based naming (not that there’s anything wrong with it), don’t set up a default disk group, and don’t enable server management.

One thing to be sure of is that you have the same major number for vxio on every node, even nodes which aren’t physically connected to the disks. So, run

$ grep vxio /etc/name_to_major

on every node and make sure you get the same result. If not, edit the files so you do. Let’s see if VxVM can see the disks. If it can’t, we’re in trouble.

[sc31-01]# vxdisk list
DEVICE       TYPE            DISK         GROUP        STATUS
c1t0d0s2     auto:none       -            -            online invalid
c1t1d0s2     auto:none       -            -            online invalid
c3t40d0s2    auto:cdsdisk    -            -            online
c3t40d1s2    auto:cdsdisk    -            -            online
c3t40d2s2    auto:cdsdisk    -            -            online
c3t40d3s2    auto:cdsdisk    -            -            online

There are the disks we’re interested in, on controller 3, and I’m going to create two disk groups, one for my Apache service called clwww and one called clnfs for my NFS service. If I were to put everything in one group, then I wouldn’t be able to run a service on each node. I’m a bit of a VxVM wuss, so I add my disks with vxdiskadd. That lets you create the disk group as it goes, so it’s minimal fuss. Accepting default answers to all the questions is fine.

# vxdiskadd c3t40d0 c3t40d1
...
# vxdiskadd c3t40d2 c3t40d3
...
# vxdisk list
DEVICE       TYPE            DISK         GROUP        STATUS
c1t0d0s2     auto:none       -            -            online invalid
c1t1d0s2     auto:none       -            -            online invalid
c3t40d0s2    auto:cdsdisk    clwww01      clwww        online
c3t40d1s2    auto:cdsdisk    clwww02      clwww        online
c3t40d2s2    auto:cdsdisk    clnfs01      clnfs        online
c3t40d3s2    auto:cdsdisk    clnfs02      clnfs        online

That’ll do. (Note that you don’t have to have a rootdg group any more. I don’t know why you ever did.)
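Before moving on from VxVM, the vxio major number check from earlier can be scripted, which is handy once you have more than two nodes. This is only a sketch: it assumes a POSIX-ish shell (ksh or bash, not the old Solaris /bin/sh), and that you’ve already copied each node’s /etc/name_to_major somewhere local. The function names and file paths are mine, not anything the cluster provides.

```shell
#!/bin/sh
# Sketch: compare the vxio major number across copies of /etc/name_to_major.
# The file names passed in are hypothetical; fetch one copy per node however
# you normally would.

# Print the vxio major number found in one name_to_major file.
vxio_major() {
    awk '$1 == "vxio" { print $2 }' "$1"
}

# Return 0 if every file given has the same, non-empty vxio major number.
vxio_majors_match() {
    first=""
    for f in "$@"; do
        major=$(vxio_major "$f")
        if [ -z "$major" ]; then
            echo "no vxio entry in $f" >&2
            return 1
        fi
        if [ -z "$first" ]; then
            first=$major
        elif [ "$major" != "$first" ]; then
            echo "mismatch: $f has $major, expected $first" >&2
            return 1
        fi
    done
    return 0
}
```

You’d call it as something like vxio_majors_match /tmp/name_to_major.sc31-01 /tmp/name_to_major.sc31-02, and only carry on if it returns success.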

IPMP

Every public network interface on a Sun Cluster has to be part of an IPMP group. The framework doesn’t actually care if you have multiple NICs in a group, so if you’re pushed for interfaces, you can set up a group with a single NIC and it will work. But, we have two.

There are a number of ways to configure IPMP groups, but the most common is to have two NICs on mutual failover, which is what we’ll be doing. So, in the hosts files add:

# IPMP group "talk"
192.168.1.63 sc31-01-bge0-test
192.168.1.64 sc31-01-bge1-test

# IPMP group "talk"
192.168.1.65 sc31-02-bge0-test
192.168.1.66 sc31-02-bge1-test

Then edit /etc/hostname.bge0 and /etc/hostname.bge1 on sc31-01 so they respectively read

sc31-01 group talk up
addif sc31-01-bge0-test -failover deprecated up

and

sc31-01-bge1-test group talk -failover deprecated up

Do the same on the other node, changing sc31-01 for sc31-02, and reboot.
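If you’d rather not hand-edit four nearly identical files, here’s a minimal sketch that prints the contents for a given node. It follows the layout above (data address plus test address on the first NIC, test address only on the second) and takes the group name as an argument; the IPMP group in this article is talk. The function names are my own invention, and it assumes a POSIX-ish shell such as ksh.

```shell
# Sketch: print the contents of the two /etc/hostname.<nic> files for a node.
# Arguments: node hostname, NIC name, IPMP group name.

# First NIC: carries the node's public address plus a test address.
ipmp_primary_file() {
    node=$1 nic=$2 group=$3
    printf '%s group %s up\n' "$node" "$group"
    printf 'addif %s-%s-test -failover deprecated up\n' "$node" "$nic"
}

# Second NIC: carries only a test address until a failover happens.
ipmp_secondary_file() {
    node=$1 nic=$2 group=$3
    printf '%s-%s-test group %s -failover deprecated up\n' "$node" "$nic" "$group"
}
```

On sc31-02 you could then redirect ipmp_primary_file sc31-02 bge0 talk into /etc/hostname.bge0 and ipmp_secondary_file sc31-02 bge1 talk into /etc/hostname.bge1. The test hostnames it produces have to match what you put in the hosts file.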

You can check your IPMP is working with if_mpadm. -d takes a link down, -r reinstates it. Watch the console and you should see messages of the form

in.mpathd[2284]: Successfully failed over from NIC bge0 to NIC bge1

This performs a graceful manual failover of the group. If you want to simulate pulling the plug, use a command of the form:

# ifconfig bge0 modinsert ldterm@2

To “plug it back in” use

# ifconfig bge0 modremove ldterm@2

IPMP itself is not cluster-aware; it only works on a single node. As mentioned earlier, pnmd watches IPMP groups across the whole cluster, and if it detects the complete failure of a group, it will try to migrate across whatever applications depended on that group. Let’s see if pnmd knows about our new IPMP group.

$ scstat -i

-- IPMP Groups --

              Node Name           Group   Status         Adapter   Status
              ---------           -----   ------         -------   ------
  IPMP Group: sc31-01             talk    Online         bge1      Online
  IPMP Group: sc31-01             talk    Online         bge0      Online

  IPMP Group: sc31-02             talk    Online         bge1      Online
  IPMP Group: sc31-02             talk    Online         bge0      Online
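If you want to check that output from a script rather than by eye, something like the following would do. It’s a sketch that assumes the column layout shown above and treats anything other than Online as a problem; I’m not claiming to know every status string scstat can print. The function name is hypothetical.

```shell
# Sketch: read `scstat -i` output on stdin, print every IPMP group line whose
# group status or adapter status is not "Online", and return non-zero if any
# such line was seen.
ipmp_not_online() {
    awk '
        $1 == "IPMP" && $2 == "Group:" {
            # fields: IPMP Group: node group group-status adapter adapter-status
            if ($5 != "Online" || $7 != "Online") {
                print $3, $4, $6, $5 "/" $7
                bad = 1
            }
        }
        END { exit bad + 0 }
    '
}
```

Used as scstat -i | ipmp_not_online, it stays silent and returns success when everything is healthy, so it drops easily into a cron job or monitoring check.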

Agents

The cluster framework is now more-or-less complete, with resilient paths to storage and public and private networks. Now we need to think about the agents - the software which goes between your normal, cluster-unaware application software, and the cluster framework.

Agents stop and start applications (you don’t use standard rc scripts on clustered applications, or control them manually), and they also contain “fault monitor” logic, which continually checks the application’s health. They do this using rpc.pmfd.

The agent packages come in a separate file to the cluster software, so unpack it and run the installer. Most clusters I’ve seen just install all the agents whether they need them or not, but you don’t have to do it like that - choose “Custom Install”, and after the language page you get a list of all available agents. Just select “no install” for the ones you won’t use. Since the whole point of a cluster is to run applications on multiple machines, you clearly need to install the agents on all your cluster nodes.

Resource Types

Part of each software agent is its “resource type”. This groups all the features of the agent into a “resource” which can be used by the cluster. Resource types have to be added to the framework, and they’re all called SUNW.something.

Resource Groups

Resource groups are groups of resources. Got that? So, what are resources? Pretty much anything that makes up a running cluster, but isn’t part of the cluster framework. So, that could be a virtual hostname or IP address, a running mysqld or httpd, shared storage, you know, all the stuff that actually does something.

If you have a failover service, then everything in the resource group fails over together, so the contents of the group are always bound to the same node. Scalable resource groups must exist on multiple nodes simultaneously.

Commonly, a resource group describes a single application, say Apache, MySQL or Oracle. But it doesn’t have to. Say you have an Apache/PHP/MySQL stack - you could put everything in a single resource group so it’s all always running on one host and is easier to manage and keep track of. Some people even put everything the cluster does in a single group, and have a second node purely as a standby.

Remember earlier I mentioned “resource types”. Well, every resource must have a resource type. You can see which ones are currently registered with the framework using scrgadm. (Sun Cluster Resource Group ADMinistration. Geddit?)

$ scrgadm -pv | grep "resource type"

At this point all that’s registered are the SUNW.LogicalHostname and SUNW.SharedAddress resource types. The former allows failover services to present a logical IP address on the public network; the latter is a pseudo-load-balancer used by scalable services. We’ll add more soon.

Making an NFS failover resource group

I think I’ve now told you enough that I can add my first resource group without anything coming as a big surprise. I’m going to add an NFS resource. This isn’t scalable, it’s a failover resource, and the floating IP will be 192.168.1.70, as I decided earlier. I’ll have one directory for my data, mounted at /global/nfs/data, and a second for my admin files, at /global/nfs/admin. These will be in the clnfs VxVM group we created earlier. The first job then, is to make two filesystems in that disk group. We’ll mirror them.

# vxassist -g clnfs make nfsdata 10g layout=mirror

(I don’t need to specify the disks, because there are only two in the group.)

# vxassist -g clnfs make nfsadmin 100m layout=mirror

If you try to create a filesystem on those new volumes, you’ll be told:

# newfs /dev/vx/rdsk/clnfs/nfsadmin
/dev/vx/rdsk/clnfs/nfsadmin: No such device or address

But, the device file is there. Check if you don’t believe me. The problem is that the disk group isn’t registered with the cluster framework.

# scconf -a -D type=vxvm,name=clnfs,nodelist=sc31-01:sc31-02

The clnfs disk group will now show up in the output of an scstat -D command, and you can create filesystems on the new volumes:

# newfs /dev/vx/rdsk/clnfs/nfsdata
...
# newfs /dev/vx/rdsk/clnfs/nfsadmin

We now have to put these filesystems in the vfstab on both nodes. How you do it is up to you. Remember, NFS is a failover service, so it can only run on one node at a time. You may feel it’s sensible to only have the filesystems mounted on the host currently running the service. If you do, set up your vfstab like this:

/dev/vx/dsk/clnfs/nfsadmin /dev/vx/rdsk/clnfs/nfsadmin /global/nfs/admin ufs 2 no logging
/dev/vx/dsk/clnfs/nfsdata /dev/vx/rdsk/clnfs/nfsdata /global/nfs/data ufs 2 no logging

Note that there’s a no in the mount at boot column. This tells Solaris not to mount the filesystem - it’s done by the cluster when it’s required. Specifically, it’s done by the cluster’s SUNW.HAStoragePlus resource, which we’ll learn about later.

However, some people prefer to have the filesystem mounted globally. This has the advantage of not having to unmount and mount when the service fails over, minimizing the risk of a problem if, for instance, someone’s messed up the mountpoint. If you want to do that, tell Solaris to mount it, and to mount it as a global filesystem.

/dev/vx/dsk/clnfs/nfsadmin /dev/vx/rdsk/clnfs/nfsadmin /global/nfs/admin ufs 2 yes global,logging
/dev/vx/dsk/clnfs/nfsdata /dev/vx/rdsk/clnfs/nfsdata /global/nfs/data ufs 2 yes global,logging

Whichever way you do it, don’t forget to make the mountpoints on both nodes:

# mkdir -p /global/nfs/data /global/nfs/admin
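The two vfstab styles differ only in the mount-at-boot and options columns, so if you have a lot of volumes you might generate the lines rather than type them. Here’s a minimal sketch; the function name and its “failover”/“global” mode argument are my own labels, and it assumes UFS on VxVM volumes exactly as above.

```shell
# Sketch: print one vfstab line for a UFS filesystem on a VxVM volume.
# Arguments: disk group, volume, mountpoint, mode.
# mode "failover": not mounted at boot, the cluster mounts it on demand.
# mode "global":   mounted at boot as a global filesystem.
vfstab_line() {
    dg=$1 vol=$2 mnt=$3 mode=$4
    case $mode in
        failover) boot=no;  opts=logging ;;
        global)   boot=yes; opts=global,logging ;;
        *) echo "mode must be failover or global" >&2; return 1 ;;
    esac
    echo "/dev/vx/dsk/$dg/$vol /dev/vx/rdsk/$dg/$vol $mnt ufs 2 $boot $opts"
}
```

For example, vfstab_line clnfs nfsdata /global/nfs/data global prints the global-style line for the data volume. Check the output before appending it to /etc/vfstab on each node.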

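If you’re going the global way, mount the directories. Next create the dfstab. Its contents are just like any standard Solaris /etc/dfs/dfstab file, but the NFS agent expects it to be named dfstab.<resource name>, so for the nfs-res resource we’ll create shortly it’s dfstab.nfs-res.

# mkdir /global/nfs/admin/SUNW.nfs
# cat /global/nfs/admin/SUNW.nfs/dfstab.nfs-res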
share  -F nfs  -o rw,anon=0  -d "NFS cluster share"  /global/nfs/data

Make sure the NFS agent is installed:

$ pkginfo SUNWscnfs
application SUNWscnfs      Sun Cluster NFS Server Component

If not, install the agents as explained above. We need to register two resource types. The first, SUNW.nfs is the one that knows how to do all the proper NFS-ey stuff. As you work through these commands, keep running scstat -g to see how the resource group list grows.

# scrgadm -a -t SUNW.nfs

The -a tells scrgadm we’re adding something, -t specifies that it’s a resource type. We now have to add SUNW.HAStoragePlus, which is a dependency of SUNW.nfs, and takes care of really basic filesystem stuff like mounting and unmounting. (There’s also the even more fundamental SUNW.HAStorage, but you don’t need to know about that any more.) Go steady with that shift key, this stuff is case-sensitive!

# scrgadm -a -t SUNW.HAStoragePlus

Those are the only resource types we need, so we can create the resource group.

# scrgadm -a -g nfs-rg -h sc31-01,sc31-02 -y PathPrefix=/global/nfs/admin

Here we’re specifying the resource group name with the -g flag, the hosts on which that group can run with -h, and using -y to set the PathPrefix property. Each resource type has a bunch of properties, some of which need to be set. Some of them are common to all resource types, and you can see what they are by reading the rg_properties man page. You have to set PathPrefix for SUNW.nfs because it points to the directory under which the agent looks for the SUNW.nfs directory and its dfstab file. Obviously, that has to be visible on every node which can host the resource group.

As I said earlier, every resource group needs a logical hostname, in our case robot-nfs.

# scrgadm -a -L -g nfs-rg -l robot-nfs

Again, -a is for “add” and -g is the resource group. -L specifies that we’re working on a logical hostname (I always think they should use a command word here); -l is used to say what that hostname is.

Now add a resource of type SUNW.HAStoragePlus, which we’ll call nfs-stor. This needs a few more options.

# scrgadm -a -g nfs-rg -j nfs-stor -t SUNW.HAStoragePlus \
  -x FilesystemMountpoints=/global/nfs/admin,/global/nfs/data \
  -x AffinityOn=true

-j tells scrgadm we’re creating a new resource, in this case called, sensibly enough, nfs-stor, of type SUNW.HAStoragePlus, which as before is specified by -t. I mentioned that resource types have properties, which scrgadm sets with the -y flag. Well, they also have “extension properties”, which you set with -x. These are listed in the section 5 man pages too, and the difference is that all resource types have “properties”, whilst “extension properties” are particular to a single resource type.

FilesystemMountpoints is a comma-separated list of mounts belonging to the resource. Because you’re only listing the mountpoints, you need appropriate vfstab entries on all the nodes, providing all the pertinent information to mount.

AffinityOn tells the cluster to keep the resource group on the same node as the device group. This provides optimum performance as the raw device files and physical paths are always used. Otherwise you could end up with the device group on one node and the resource group on another, in which case all I/O would have to pass through the cluster interconnect.

All that’s left to do is add the NFS service itself, nfs-res. This depends on the nfs-stor resource, and we specify this dependency with the Resource_dependencies property.

# scrgadm -a -g nfs-rg -j nfs-res -t SUNW.nfs \
  -y Resource_dependencies=nfs-stor

Now you can start the resource group, and see where the various resources are running:

# scswitch -Z -g nfs-rg
# scstat -g

To switch the resource group from sc31-01 to sc31-02:

# scswitch -z -g nfs-rg -h sc31-02

If you want, physically pull the plug on the node hosting the service and watch the cluster do its thing.

Should you need to take the resource group completely offline, to change the filesystems or somesuch,

# scswitch -F -g nfs-rg

will do the job.
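To recap the NFS build, here are the same commands gathered into one function, in the order they have to run. It’s a sketch rather than a finished tool: the run wrapper and the RUN variable are hypothetical helpers of mine that print each command by default, so you can read the plan before letting it loose on a real node with RUN=1. It assumes a POSIX-ish shell such as ksh, and that the disk group, vfstab entries and dfstab file are already in place.

```shell
# Sketch: the NFS resource group steps from this section, in order.
# By default each command is only printed; set RUN=1 to execute for real.
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "$*"
    fi
}

build_nfs_rg() {
    # Register the two resource types, then create the group.
    run scrgadm -a -t SUNW.nfs &&
    run scrgadm -a -t SUNW.HAStoragePlus &&
    run scrgadm -a -g nfs-rg -h sc31-01,sc31-02 -y PathPrefix=/global/nfs/admin &&
    # Logical hostname, storage, then the NFS service which depends on storage.
    run scrgadm -a -L -g nfs-rg -l robot-nfs &&
    run scrgadm -a -g nfs-rg -j nfs-stor -t SUNW.HAStoragePlus \
        -x FilesystemMountpoints=/global/nfs/admin,/global/nfs/data \
        -x AffinityOn=true &&
    run scrgadm -a -g nfs-rg -j nfs-res -t SUNW.nfs \
        -y Resource_dependencies=nfs-stor &&
    # Bring the whole group online.
    run scswitch -Z -g nfs-rg
}
```

Chaining with && means that when the commands are run for real, the first failure stops the sequence, so you don’t end up trying to start a half-built group.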

Making an Apache Scalable Resource Group

Apache is one of a small number of applications that can be run in a scalable resource group. That is, an instance of Apache runs on multiple cluster nodes, and those instances serve requests in parallel.

The clever bit of Sun Cluster that makes this possible is the SUNW.SharedAddress resource type that we stumbled across earlier. On one node it presents a floating IP address on the public network IPMP group. On the other nodes, that address is set as a virtual address on the loopback interface. That means that Apache (or some other application) sees the floating IP and is able to bind to it, so thinks it’s on the public network. Secretly, the cluster interconnect is handling all the traffic. Clever stuff.

The first decision to make is how to install Apache - globally or locally. Some people like to have a separate copy on each node. This makes it simple to upgrade the software on one node, move to that node, and roll-back to the other if there are problems, but it means you have to maintain multiple copies of config files. I generally like to maintain a single copy, mounted on a global filesystem, but in this instance I’m going to install the standard Solaris 9 Apache package on both nodes. From my Solaris 9 Jumpstart Product directory:

# pkgadd -d . SUNWapchr SUNWapchu

to install on each node. This installs Apache in /usr/apache, so the path to the httpd binary will be /usr/apache/bin/httpd. We need to tell the cluster about this later, so it can run the program.

I do need a global filesystem for my content, so I can be sure all the Apaches are reading from the same hymn sheet. Remember the clwww disk group from earlier? Let’s register it with the cluster and make a filesystem.

# vxassist -g clwww make wwwdata 1000m layout=mirror
# scconf -a -D type=vxvm,name=clwww,nodelist=sc31-01:sc31-02
# newfs /dev/vx/rdsk/clwww/wwwdata

Now put a line in the vfstab to make sure that’s mounted globally and mount it.

/dev/vx/dsk/clwww/wwwdata /dev/vx/rdsk/clwww/wwwdata /global/www/data ufs 2 yes global,logging

# mkdir -p /global/www/data
# mount /global/www/data

Check it’s mounted on both nodes, and throw together a working /etc/apache/httpd.conf (make sure it’s the same on both nodes), put a bit of content in /global/www/data/ and make sure you can start and access Apache on both nodes. Remember to stop it once you’ve done.

Now ensure the Apache agent is installed with

# pkginfo SUNWscapc
application SUNWscapc      Sun Cluster Apache Web Server Component

Looks good. Now, register the resource types. We’ll use the apache one, and, since we have storage, good old SUNW.HAStoragePlus. We’ve already registered that with the framework though, so we don’t need to do it again.

# scrgadm -a -t SUNW.apache

As I said, we’re using the SharedAddress resource. Though it exists to provide scalable resources, it is a failover resource. Sounds odd? Well, remember I said that the floating IP exists on one server only, hence the failover part. So that needs its own failover resource group, which we’ll call sa-rg. We already know how to create resource groups with logical hostnames - the only difference here is to use -S for shared rather than -L for logical.

# scrgadm -a -g sa-rg -h sc31-01,sc31-02
# scrgadm -a -g sa-rg -S -l robot-www

Now on to the scalable resource group. This is dependent on the shared address group we just created, and we want it running on both hosts whenever possible.

# scrgadm -a -g www-rg -y RG_dependencies=sa-rg \
  -y Desired_primaries=2 \
  -y Maximum_primaries=2

Desired_primaries tells the cluster framework how many nodes you’d like the shared service to run on, if possible. A sys-admin can up the number of nodes at any time, but the Maximum_primaries value sets a hard limit.
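Changing Desired_primaries later is a one-liner with scrgadm’s -c (change) flag. As a small sketch, here’s a helper that refuses a value above Maximum_primaries and otherwise prints the command to run. The function name is hypothetical, it only prints rather than executes, and you have to tell it the current Maximum_primaries yourself.

```shell
# Sketch: print (not run) the scrgadm command that changes Desired_primaries
# on a resource group, refusing values above Maximum_primaries.
# Arguments: resource group, new Desired_primaries, current Maximum_primaries.
set_desired_primaries_cmd() {
    rg=$1 desired=$2 maximum=$3
    if [ "$desired" -gt "$maximum" ]; then
        echo "Desired_primaries ($desired) exceeds Maximum_primaries ($maximum)" >&2
        return 1
    fi
    echo "scrgadm -c -g $rg -y Desired_primaries=$desired"
}
```

So set_desired_primaries_cmd www-rg 1 2 gives you the command to drop the web service back to a single node, while asking for 3 with a maximum of 2 is rejected.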

# scrgadm -a -g www-rg -j www-stor -t SUNW.HAStoragePlus \
  -x FilesystemMountpoints=/global/www/data

Here we’re telling SUNW.HAStoragePlus to make sure we have our web content, just like we did with the NFS group earlier. Now we can create the Apache service.

# scrgadm -a -g www-rg -j www-res -t SUNW.apache \
  -y Resource_dependencies=www-stor \
  -y Scalable=TRUE \
  -y Network_resources_used=robot-www \
  -x Bin_dir=/usr/apache/bin

As before, we supply properties with -y and extension properties with -x. We set a dependency on the SUNW.HAStoragePlus resource we just created, so Apache won’t try to start if it has no content to serve. We tell the cluster this is a scalable service, and we tell it to use the robot-www shared address resource, which lives in the sa-rg failover group we made, for its logical address. Then we supply the directory holding the httpd binary.

All that remains is to fire up the resource group. But wait - remember the dependency!

# scswitch -Z -g sa-rg
# scswitch -Z -g www-rg

Tail the log files on both servers, and hit the floating IP address. You can move the service on or off nodes with commands of the form

# scswitch -z -g www-rg -h sc31-01

Hopefully that explains not only the “how to” part of building a Sun Cluster, but a lot of the “why”.
