Friday, July 15, 2016

HACMP Failover Test Scenarios


                          

      CLUSTER FAILOVER TEST SCENARIOS IN AIX ENVIRONMENT

This document covers cluster failover test scenarios in an AIX environment.

In AIX, there are normally three ways to perform failover testing:
1.       Manual failover by moving the Resource Group
2.       Automatic failover by abruptly halting a node
3.       Failover testing by removing attached hardware (disabling NICs, pulling cables, etc.)




Important points to validate, as a system administrator, before performing any failover test:

1. A data backup should be available.

2. A cluster snapshot should be taken.

3. A configuration backup should be taken (including the RG attributes and FS details).

4. If cross-mounting is configured, verify the exports file and compare it against the cross-mounted filesystems.
    In one case we found that a cluster filesystem had been mounted as a normal NFS mount, which caused problems during the failover test. If the file "/usr/es/sbin/cluster/etc/exports" exists, the cluster looks for its entries when mounting and unmounting the filesystems.

5. If an RG goes into an error state during the failover test, the cluster may refuse to execute any cluster commands, and you may have to reboot the nodes. Keep the relevant teams informed in advance that a reboot of both nodes may be required if anything goes wrong.
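
For the cross-mount check in point 4, the comparison can be scripted. The following is only a minimal sketch: it assumes the exports file lists one directory per line as the first field, and the function names and sample entries are hypothetical, not taken from a real cluster.

```python
def exported_paths(exports_text):
    """Return the set of directories listed in an exports-style file."""
    paths = set()
    for line in exports_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Assumption: the first whitespace-separated field is the directory.
        paths.add(line.split()[0])
    return paths


def missing_from_exports(crossmounted, exports_text):
    """List cross-mounted filesystems that have no entry in the exports file."""
    exported = exported_paths(exports_text)
    return sorted(fs for fs in crossmounted if fs not in exported)


# Hypothetical sample content standing in for /usr/es/sbin/cluster/etc/exports.
SAMPLE_EXPORTS = """\
# cluster exports
/appdata -root=node1:node2,access=node1:node2
/applogs -root=node1:node2
"""

if __name__ == "__main__":
    # The list of cross-mounted filesystems is supplied by hand here.
    print(missing_from_exports(["/appdata", "/applogs", "/appbackup"], SAMPLE_EXPORTS))
```

Any filesystem reported here is cross-mounted according to your list but has no entry in the cluster exports file, so it is worth checking by hand before the test.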



    Manual Failover Testing by moving the RGs

Steps:
1.  Open a console session on both nodes.
2.  Verify the Resource Group availability on the nodes before the failover test.
               Command to use: # /usr/es/sbin/cluster/utilities/clRGinfo
# clRGinfo
-----------------------------------------------------------------------------
Group Name     Group State          Node
-----------------------------------------------------------------------------
RES_01         ONLINE               node1      >>>>>  RG (RES_01) currently active on node1
               OFFLINE              node2

RES_02         ONLINE               node2
               OFFLINE              node1

3.   In this case, we are going to manually move the resource group (RES_01) from node1 to node2.
4.   From node1, run the command # smitty clstop
                  node1# smitty clstop
                               Stop Cluster Services

Type or select values in entry fields.
Press Enter AFTER making all desired changes.

                                                        [Entry Fields]
* Stop now, on system restart or both                 now                    +
  Stop Cluster Services on these nodes               [node1 ]                +      >>>>>  select the node
  BROADCAST cluster shutdown?                         true                   +
* Select an Action on Resource Groups                 Move Resource Groups          >>>>>  select this option for a manual failover


5. The next screen asks for the Resource Group to move and the destination node. Select the appropriate Resource Group and press Enter; the failover will start.

6. From node2, verify the RG status using the command # /usr/es/sbin/cluster/utilities/clRGinfo
First possible output:


# clRGinfo
-----------------------------------------------------------------------------
Group Name     Group State          Node
-----------------------------------------------------------------------------
RES_01         OFFLINE              node1
               ACQUIRING            node2      >>>>>  failover initiated; node2 is acquiring the Resource Group

RES_02         ONLINE               node2
               OFFLINE              node1


Second possible output:

# clRGinfo
-----------------------------------------------------------------------------
Group Name     Group State          Node
-----------------------------------------------------------------------------
RES_01         OFFLINE              node1
               ONLINE               node2      >>>>>  failover completed successfully; node2 has acquired the Resource Group (RES_01)

RES_02         ONLINE               node2
               OFFLINE              node1

Note: When cluster services are stopped on node1, the first thing executed is the cluster stop script, which brings down the applications and unmounts all application filesystems. If your application stop script cannot stop all application processes, some filesystems cannot be unmounted and the failover fails. Once all resources are down on node1, HACMP starts bringing up the resources on node2. Running the application start script is the last thing HACMP does.

7. Verify the cluster status using the command # lssrc -ls clstrmgrES. It should be in the "stable" state; if so, everything is fine.
8. Perform a server-level health check to validate that the filesystems and cluster IPs have moved successfully.
9. Inform the APP/DB team to start the APP/DB services, or to validate the APP/DB status, after the failover.
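
The clRGinfo and lssrc checks in the steps above can also be done by a small script instead of by eye. The sketch below is illustrative only: it parses text shaped like the clRGinfo output shown in this post and looks for a "Current state:" line in the lssrc output. The function names are hypothetical, the sample text is hand-written, and the assumption that the stable state is reported as ST_STABLE should be confirmed on your own cluster.

```python
def parse_clrginfo(output):
    """Parse clRGinfo-style text into {group: {node: state}}."""
    groups = {}
    current = None
    for line in output.splitlines():
        stripped = line.strip()
        # Skip blank lines, dashed rulers and the column header.
        if not stripped or stripped.startswith("-") or stripped.startswith("Group Name"):
            continue
        fields = line.split()
        if not line[0].isspace():
            # A non-indented line starts a new resource group.
            current, fields = fields[0], fields[1:]
        if current is None or len(fields) < 2:
            continue
        # The last field is the node; everything before it is the state.
        groups.setdefault(current, {})[fields[-1]] = " ".join(fields[:-1])
    return groups


def online_node(groups, group):
    """Return the node on which the group is ONLINE, or None."""
    for node, state in groups.get(group, {}).items():
        if state == "ONLINE":
            return node
    return None


def cluster_state(lssrc_output):
    """Return the value of the 'Current state:' line, or None if absent."""
    for line in lssrc_output.splitlines():
        if line.strip().startswith("Current state:"):
            return line.split(":", 1)[1].strip()
    return None


# Hand-written samples in the shape of the outputs shown in this post.
SAMPLE_RG = """\
-----------------------------------------------------------------------------
Group Name     Group State          Node
-----------------------------------------------------------------------------
RES_01         OFFLINE              node1
               ONLINE               node2

RES_02         ONLINE               node2
               OFFLINE              node1
"""
# Assumption: a stable cluster manager reports ST_STABLE on this line.
SAMPLE_LSSRC = "Current state: ST_STABLE\n"

if __name__ == "__main__":
    groups = parse_clrginfo(SAMPLE_RG)
    print(online_node(groups, "RES_01"), cluster_state(SAMPLE_LSSRC))
```

On a real node the two sample strings would be replaced by the captured output of clRGinfo and lssrc -ls clstrmgrES.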


  Forcing automatic failover by abruptly halting the active node (typically not recommended, but an option)
HACMP can differentiate between a deliberate shutdown and an abrupt shutdown of a node caused by a hardware failure. When forcing a failover by bringing down the active node, the shutdown and reboot commands will not trigger a failover.
                                 Only the halt command forces an automatic RG failover from the server end.

1.       Log in to node1 and run the command # halt -q as the root user. This brings node1 down abruptly and forces the RGs active on node1 to fail over automatically to node2.
2.       Log in to node2 and verify the Resource Group status on node2 using the command below.

# clRGinfo
-----------------------------------------------------------------------------
Group Name     Group State          Node
-----------------------------------------------------------------------------
RES_01         OFFLINE              node1
               ONLINE               node2      >>>>>  failover completed successfully; node2 has acquired the Resource Group (RES_01)

RES_02         ONLINE               node2
               OFFLINE              node1

3.       Verify that all the filesystems and IPs are available on node2 after the automatic failover.
4.       Inform the APP/DB team to validate the APP/DB status and start the services (if applicable).
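
After halting node1, the RG may show ACQUIRING on node2 for a while before it becomes ONLINE, so the clRGinfo check usually has to be repeated. A polling loop along the following lines is one way to do that. It is a sketch under assumptions: `fetch` is a hypothetical callable that returns the current clRGinfo text, and the retry count and delay are arbitrary placeholders, not recommended values.

```python
import time


def rg_state_on(output, group, node):
    """Return the state of `group` on `node` in clRGinfo-style text, or None."""
    current = None
    for line in output.splitlines():
        stripped = line.strip()
        # Skip blank lines, dashed rulers and the column header.
        if not stripped or stripped.startswith("-") or stripped.startswith("Group Name"):
            continue
        fields = line.split()
        if not line[0].isspace():
            # A non-indented line starts a new resource group.
            current, fields = fields[0], fields[1:]
        if current == group and len(fields) >= 2 and fields[-1] == node:
            return " ".join(fields[:-1])
    return None


def wait_for_online(fetch, group, node, attempts=30, delay=10.0, sleep=time.sleep):
    """Poll fetch() until `group` is ONLINE on `node`; return True on success."""
    for attempt in range(attempts):
        if rg_state_on(fetch(), group, node) == "ONLINE":
            return True
        # Do not sleep after the final attempt.
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Hand-written samples: the RG is first being acquired, then online on node2.
ACQUIRING = """\
RES_01         OFFLINE              node1
               ACQUIRING            node2
"""
ONLINE = """\
RES_01         OFFLINE              node1
               ONLINE               node2
"""

if __name__ == "__main__":
    canned = iter([ACQUIRING, ACQUIRING, ONLINE])
    # Sleeping is stubbed out so the demonstration finishes immediately.
    print(wait_for_online(lambda: next(canned), "RES_01", "node2", delay=0, sleep=lambda s: None))
```

If the loop returns False, the RG never reached ONLINE within the allowed attempts and the cluster logs should be checked before handing over to the APP/DB team.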














3 comments:

  1. Hi abhishek. Came across your blog and had a question about PowerHA.

    Good morning community. Newbie question here. I just got thrown into PowerHA; I took a first class but still have not put this into practical use. My scenario:

    2 node cluster, DATABASE(resources) is running on secondary server.

    I need to perform maintenance on the primary. Which option should I use, and why: Unmanage Resource Groups or Bring Resource Group Offline? To me they seem to perform the same action IF nothing is running on said server.

  2. 1. Un-manage resource group: the application is not impacted and can keep running on the other node. At the production-impact level, that means no downtime for the application.

    Normally this is preferred whenever we are making network-level changes, etc.

    In a two-node active-passive cluster, if it is difficult to get an appropriate downtime window, it is the best option.



    2. Bring resource group offline: this means your database is also brought down for the whole maintenance window.

    =================================================================

    Example: in the given scenario, suppose the customer, after many requests, agrees to only 1 hour of downtime, but the maintenance activity requires 2 hours. The activity is also critical and can't be postponed.

    In this scenario we would go with the un-manage resource group option and perform the activity on the primary node. Otherwise, if the customer can afford the downtime, the safe way is to bring the resource group offline and then do the tasks.

  3. Hi abhishek. Thank you so much for the reply and also thank you for posting this blog. I sure hope you post more.
