UNIX SYSTEM ADMINISTRATION: HACMP Failover Test Scenarios

CLUSTER FAILOVER TEST SCENARIOS IN AN AIX ENVIRONMENT

This document covers cluster failover test scenarios in an AIX environment.

In AIX, there are normally three ways to perform failover testing:

1. Manual Failover by moving the Resource Group

2. Automatic Failover by abruptly halting the nodes

3. Failover testing by removing the attached hardware (disabling the NICs, pulling cables, etc.)

Important points to validate as a system administrator before performing any failover test:

1. Keep a current data backup handy.

2. Take a cluster snapshot.

3. Back up the configuration (including the RG attributes and filesystem details).

4. If cross-mounting is configured, verify the exports file and compare it against the cross-mounted filesystems.

In one case we found that a cluster filesystem had been mounted as an ordinary NFS mount, which caused problems during the failover test. The cluster looks for entries in the file "/usr/es/sbin/cluster/etc/exports", if it exists, when mounting and unmounting the filesystems.

5. If the RGs go into an error state during the failover test, the cluster may refuse to execute any cluster commands, and you may need to reboot the nodes. Inform the relevant teams in advance that a reboot of both nodes may be required if anything goes wrong.
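The exports check in point 4 can be scripted. The following is a minimal sketch, not a definitive procedure: `missing_exports` is a hypothetical helper name, and it assumes the exports file uses the usual layout in which the first whitespace-separated field of each non-comment line is the exported directory.

```shell
#!/bin/sh
# Hypothetical helper: print every filesystem from the argument list that has
# no entry in the given exports file.
# Assumption: the first field of each non-blank, non-comment line is the
# exported path.
# Usage: missing_exports EXPORTS_FILE FS [FS...]
missing_exports() {
    exports_file=$1
    shift
    for fs in "$@"; do
        if ! awk -v fs="$fs" '
                /^[[:space:]]*#/ { next }        # skip comment lines
                NF && $1 == fs   { found = 1 }   # exact match on the path
                END { exit found ? 0 : 1 }
            ' "$exports_file"; then
            echo "$fs"
        fi
    done
}
```

For example, `missing_exports /usr/es/sbin/cluster/etc/exports /data01 /data02` would list whichever of the two filesystems has no entry in the cluster exports file (`/data01` and `/data02` are placeholder names). An empty result means every listed filesystem has an entry.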

Manual Failover Testing by Moving the RGs

Steps:

1. Open a console session on both nodes.

2. Verify where the resource groups are available before the failover test.

Command to use: # /usr/es/sbin/cluster/utilities/clRGinfo

# clRGinfo
-----------------------------------------------------------------------------
Group Name     Group State       Node
-----------------------------------------------------------------------------
RES_01         ONLINE            node1    >>>>> RG (RES_01) is currently active on node1
               OFFLINE           node2

RES_02         ONLINE            node2
               OFFLINE           node1

3. In this example we will manually move the resource group (RES_01) from node1 to node2.

4. From node1, run the command #smitty clstop

node1# smitty clstop

                         Stop Cluster Services

Type or select values in entry fields.
Press Enter AFTER making all desired changes.

                                                  [Entry Fields]
* Stop now, on system restart or both             now                     +
  Stop Cluster Services on these nodes            [node1]                 +   >>>>> select the node
  BROADCAST cluster shutdown?                     true                    +
* Select an Action on Resource Groups             Move Resource Groups        >>>>> select this option for manual failover

5. The next screen asks which resource group to move and which node to move it to. Select the appropriate resource group and press Enter; the failover will start.

6. From node2, verify the RG status using the command #/usr/es/sbin/cluster/utilities/clRGinfo

1st probable output

# clRGinfo
-----------------------------------------------------------------------------
Group Name     Group State       Node
-----------------------------------------------------------------------------
RES_01         OFFLINE           node1
               ACQUIRING         node2    >>>>> failover initiated; node2 is acquiring the resource group

RES_02         ONLINE            node2
               OFFLINE           node1

2nd probable output

# clRGinfo
-----------------------------------------------------------------------------
Group Name     Group State       Node
-----------------------------------------------------------------------------
RES_01         OFFLINE           node1
               ONLINE            node2    >>>>> failover completed successfully; node2 has acquired the resource group (RES_01)

RES_02         ONLINE            node2
               OFFLINE           node1

Note: When the cluster is stopped on node1, the first thing executed is the application stop script, which brings down the applications; the application filesystems are then unmounted. If your application stop script cannot stop all application processes, some filesystems cannot be unmounted and the failover fails. Once all resources are down on node1, HACMP starts bringing up the resources on node2. The application start script is the last thing HACMP runs.

7. Verify the status of the cluster using the command #lssrc -ls clstrmgrES. It should be in the "stable" state; if so, everything is fine.

8. Perform a server-level health check to validate that the filesystems and cluster IPs have moved successfully.

9. Ask the APP/DB team to start the APP/DB services or validate the APP/DB status after the failover.
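The RG status and cluster manager checks in the steps above can also be read by a script rather than by eye. This is a minimal sketch under stated assumptions: `rg_online_node` and `clstrmgr_state` are hypothetical helper names, the clRGinfo layout is assumed to match the sample output in this post (single-word states, group name only on the first line of each group), and the `lssrc -ls clstrmgrES` output is assumed to contain a line beginning with "Current state:" (for example ST_STABLE when the cluster is stable).

```shell
#!/bin/sh
# Hypothetical helper: read clRGinfo output on stdin and print the node on
# which the given resource group is ONLINE (prints nothing if it is not).
rg_online_node() {
    awk -v rg="$1" '
        /^-+$/ || /^Group Name/ { next }                 # skip rules and header
        NF == 3 { group = $1; state = $2; node = $3 }    # first line of a group
        NF == 2 { state = $1; node = $2 }                # continuation line
        (NF == 2 || NF == 3) && group == rg && state == "ONLINE" { print node }
    '
}

# Hypothetical helper: read "lssrc -ls clstrmgrES" output on stdin and print
# the value that follows "Current state:".
clstrmgr_state() {
    awk -F': *' '/^Current state:/ { print $2; exit }'
}

# Intended use on a cluster node (not executed here):
#   /usr/es/sbin/cluster/utilities/clRGinfo | rg_online_node RES_01
#   lssrc -ls clstrmgrES | clstrmgr_state
```

Both helpers only parse text, so they can be tried against saved command output before being used on a live cluster.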

Forcing an Automatic Failover by Halting the Active Node (typically not recommended, but an option)

HACMP can distinguish a deliberate shutdown from an abrupt shutdown of a node caused by a hardware failure. When you try to force a failover by bringing down the active node, the shutdown and reboot commands will not trigger a failover.

Only the halt command forces an automatic RG failover from the server side.

1. Log in to node1 and run the command #halt -q as the root user. This brings node1 down abruptly and forces the RGs available on node1 to fail over automatically to node2.

2. Log in to node2 and verify the resource group status using the command below.

# clRGinfo
-----------------------------------------------------------------------------
Group Name     Group State       Node
-----------------------------------------------------------------------------
RES_01         OFFLINE           node1
               ONLINE            node2    >>>>> failover completed successfully; node2 has acquired the resource group (RES_01)

RES_02         ONLINE            node2
               OFFLINE           node1

3. Verify that all the filesystems and IPs are available on node2 after the automatic failover.

4. Ask the APP/DB team to validate the APP/DB status and start the services (if applicable).
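After node1 is halted, the RG may show as ACQUIRING for a while before it comes ONLINE on node2, so a polling loop can stand in for rerunning clRGinfo by hand. This is a rough sketch only: `wait_for_rg` is a hypothetical helper, the status command is passed in as an argument so the loop can be tried off-cluster, and the parsing assumes the clRGinfo layout shown in this post.

```shell
#!/bin/sh
# Hypothetical helper: poll a status command until the resource group is
# reported ONLINE on the expected node, or give up after TRIES attempts.
# Usage: wait_for_rg RG NODE TRIES DELAY_SECONDS STATUS_COMMAND [ARGS...]
wait_for_rg() {
    rg=$1 node=$2 tries=$3 delay=$4
    shift 4
    while [ "$tries" -gt 0 ]; do
        # Assumption: the group name appears only on the first line of each
        # group, and state and node are the last two columns of each line.
        if "$@" 2>/dev/null | awk -v rg="$rg" -v node="$node" '
                NF == 3 { group = $1 }
                (NF == 2 || NF == 3) && group == rg &&
                    $(NF - 1) == "ONLINE" && $NF == node { ok = 1 }
                END { exit ok ? 0 : 1 }'; then
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

On node2 this could be called as `wait_for_rg RES_01 node2 30 10 /usr/es/sbin/cluster/utilities/clRGinfo`, which checks up to 30 times at 10-second intervals (both values are arbitrary) and returns non-zero if the RG never comes ONLINE on node2.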

3 comments:

Jaq, May 3, 2018 at 9:13 AM
Hi abhishek. I came across your blog and had a question about PowerHA.

Good morning community. Newbie question here. I just got thrown into PowerHA; I took a first class but have not yet put it into practical use. My scenario:

A 2-node cluster; the DATABASE (resources) is running on the secondary server.

I need to perform maintenance on the primary. Which option should I use, and why: Unmanage Resource Groups or Bring Resource Group Offline? To me they seem to perform the same action IF nothing is running on said server.
abhishek, May 4, 2018 at 12:29 PM
1. Unmanage Resource Groups - The application will not be impacted; it can keep running on the other node. In terms of production impact, that means no application downtime.

Normally this is preferred whenever we are making network-level changes and the like.

In a 2-node active-passive cluster, it is also the best option when it is difficult to get the appropriate downtime.

2. Bring Resource Group Offline - This means your database will also be down for the whole maintenance window.

=================================================================

Example: In the given scenario, suppose the customer, after a lot of requests, agrees to only 1 hour of downtime, but the maintenance activity requires 2 hours. The activity is also critical and can't be postponed.

In this scenario we would go with the Unmanage Resource Groups option and perform the activity on the primary node. Otherwise, if the customer can afford the downtime, the safe way is to bring the resource group offline and then do the tasks.
Anonymous, May 7, 2018 at 6:50 AM
Hi abhishek. Thank you so much for the reply and also thank you for posting this blog. I sure hope you post more.


Friday, July 15, 2016
