CLUSTER FAILOVER TEST SCENARIOS IN AIX ENVIRONMENT
This document covers cluster failover test scenarios in an AIX environment.
In AIX, there are normally three ways to perform failover testing:
1. Manual failover by moving the resource group
2. Automatic failover by abruptly halting the nodes
3. Failover testing by removing the attached hardware (disabling the NICs, disconnecting cables, etc.)
Note: If a resource group (RG) goes into an error state during a failover test, there are cases where the cluster will not allow you to execute any cluster commands. In that case you may need to reboot the nodes. Keep the relevant teams informed in advance that a reboot of both nodes may be required if any issues occur.
Manual Failover Testing by moving the RGs
Steps:
1. Take a console session on both nodes.
2. Verify the resource group availability on the nodes before the failover test.
Command to be used: # /usr/es/sbin/cluster/utilities/clRGinfo
# clRGinfo
-----------------------------------------------------------------------------
Group Name     Group State      Node
-----------------------------------------------------------------------------
RES_01         ONLINE           node1    >>>>> RG (RES_01) currently active on node1
               OFFLINE          node2
RES_02         ONLINE           node2
               OFFLINE          node1
3. In this case, we are going to manually move the resource group (RES_01) from node1 to node2.
4. From node1, run the command # smitty clstop
node1# smitty clstop
Stop Cluster Services
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
                                              [Entry Fields]
* Stop now, on system restart or both          now                    +
  Stop Cluster Services on these nodes         [node1]                +   >>>>> select the node
  BROADCAST cluster shutdown?                  true                   +
* Select an Action on Resource Groups          Move Resource Groups       >>>>> need to select this option for manual failover
5. The next screen asks for the resource group to move and the node to move it to. Select the appropriate resource group and press Enter; this starts the failover.
6. From node2, verify the RG status using the command # /usr/es/sbin/cluster/utilities/clRGinfo
1st probable output
# clRGinfo
-----------------------------------------------------------------------------
Group Name     Group State      Node
-----------------------------------------------------------------------------
RES_01         OFFLINE          node1
               ACQUIRING        node2    >>>>> failover initiated and node2 is acquiring the resource group
RES_02         ONLINE           node2
               OFFLINE          node1
2nd probable output
# clRGinfo
-----------------------------------------------------------------------------
Group Name     Group State      Node
-----------------------------------------------------------------------------
RES_01         OFFLINE          node1
               ONLINE           node2    >>>>> failover completed successfully, node2 has acquired the resource group (RES_01)
RES_02         ONLINE           node2
               OFFLINE          node1
Note: When stopping the cluster on node1, the first thing executed is the cluster stop script. It brings down the applications and unmounts all application filesystems. If your application stop script is not able to stop all application processes, some filesystems cannot be unmounted and the failover fails. Once all resources are down on node1, HACMP starts to bring up all resources on node2. Running the application start script is the last thing HACMP does.
7. Verify the status of the cluster using the command # lssrc -ls clstrmgrES. It should be in the "stable" state; if so, everything is fine.
8. Perform a server-level health check to validate that the filesystems and cluster IPs have moved successfully.
9. Inform the APP/DB team to start the APP/DB services or validate the APP/DB status after failover.
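The clRGinfo and lssrc checks in the steps above lend themselves to scripting. The following is a minimal sketch and not part of the original procedure: rg_online_node and cluster_is_stable are hypothetical helper names. The parsing assumes that clRGinfo prints the simple table layout shown above with single-word states, and that lssrc -ls clstrmgrES reports a line reading "Current state: ST_STABLE" when the cluster manager is stable. Check both assumptions against the output on your own PowerHA level before relying on it.

```shell
# Hypothetical helper: print the node on which resource group $1 is ONLINE.
# Reads clRGinfo output on stdin. Assumes the first row of each group has
# three fields (name, state, node) and continuation rows have two.
rg_online_node() {
    awk -v rg="$1" '
        NF != 2 && NF != 3 { next }                 # skip rulers, header, blank lines
        NF == 3 { current = $1; state = $2; node = $3 }
        NF == 2 { state = $1; node = $2 }
        current == rg && state == "ONLINE" { print node; exit }
    '
}

# Hypothetical helper: succeed if the cluster manager reports a stable state.
# Reads "lssrc -ls clstrmgrES" output on stdin; the exact wording of the
# state line is an assumption to verify on your system.
cluster_is_stable() {
    grep -q 'Current state: ST_STABLE'
}

# Usage on a cluster node:
#   /usr/es/sbin/cluster/utilities/clRGinfo | rg_online_node RES_01
#   lssrc -ls clstrmgrES | cluster_is_stable && echo "cluster stable"
```

Both helpers read from standard input rather than calling the cluster commands themselves, so they can be tried against saved output before being used on a live node. While a group is still in the ACQUIRING state, rg_online_node prints nothing.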
Forcing automatic failover by halting the active node
(typically not recommended, but an option)
HACMP is intelligent enough to differentiate between a deliberate shutdown and an abrupt shutdown of a node due to a hardware failure. When forcing a failover by bringing down the active node, the shutdown and reboot commands will not trigger a failover. From the server end, only the halt command forces the automatic RG failover.
1. Log in to node1 and run the command # halt -q as the root user. This brings node1 down abruptly and forces the RG active on node1 to fail over automatically to node2.
2. Log in to node2 and verify the resource group status on node2 using the command below.
# clRGinfo
-----------------------------------------------------------------------------
Group Name     Group State      Node
-----------------------------------------------------------------------------
RES_01         OFFLINE          node1
               ONLINE           node2    >>>>> failover completed successfully, node2 has acquired the resource group (RES_01)
RES_02         ONLINE           node2
               OFFLINE          node1
3. Verify that all the filesystems and IPs are available on node2 after the automatic failover.
4. Inform the APP/DB team to validate the APP/DB status and start up the services (if applicable).
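After the halt, node2 needs some time to acquire the resource group, so a single clRGinfo call may still show a transitional state. The polling loop below is a rough sketch under stated assumptions: wait_for_rg is a hypothetical helper, the CLRGINFO variable and the retry count and delay defaults are illustrative choices rather than PowerHA settings, and the parsing assumes the simple clRGinfo table layout shown above.

```shell
# Path to clRGinfo as given in this document; overridable for testing.
CLRGINFO=${CLRGINFO:-/usr/es/sbin/cluster/utilities/clRGinfo}

# Hypothetical helper: wait until resource group $1 is ONLINE on node $2.
# $3 = number of attempts (default 30), $4 = seconds between attempts (default 10).
# Returns 0 once the group is ONLINE on the target node, 1 if attempts run out.
wait_for_rg() {
    rg=$1
    target=$2
    tries=${3:-30}
    delay=${4:-10}
    while [ "$tries" -gt 0 ]; do
        if "$CLRGINFO" 2>/dev/null | awk -v rg="$rg" -v target="$target" '
            NF == 3 { current = $1; state = $2; node = $3 }
            NF == 2 { state = $1; node = $2 }
            (NF == 2 || NF == 3) && current == rg && state == "ONLINE" && node == target { found = 1 }
            END { exit(found ? 0 : 1) }
        '; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}

# Usage on node2 after node1 has been halted:
#   wait_for_rg RES_01 node2 && echo "RES_01 is ONLINE on node2"
```

A bounded number of attempts is used so that a resource group stuck in an error state makes the helper return a failure instead of looping forever.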
Hi abhishek. Came across your blog and had a question about PowerHA.
Good morning community. Newbie question here. I just got thrown into PowerHA; I took a first class but still have not put this into practical use. My scenario:
A 2-node cluster, with the DATABASE (resources) running on the secondary server.
I need to perform maintenance on the primary. Which option should I use, and why: Unmanage Resource Groups or Bring Resource Group Offline? To me they seem to perform the same action IF nothing is running on said server.
1. Unmanage resource group - The application will not be impacted; it can keep running on the other node. In terms of production impact, that means no downtime for the application.
Normally this is preferred whenever we are doing network-level changes etc.
In a 2-node active-passive cluster, if it is difficult to get the appropriate downtime, it is the best option.
2. Bring resource group offline - Bringing the resource group offline means you also bring your database down for the whole maintenance window.
=================================================================
Example: In the given scenario, suppose the customer is ready to provide a downtime of only 1 hour after a lot of requests, but the maintenance activity requires 2 hours of downtime. The activity is also critical and can't be postponed.
In this scenario, we would prefer to go with the unmanage resource group option and perform the activity on the primary node. Otherwise, if the customer can afford the downtime, the safe way is to bring the resource group offline and then do the tasks.
Hi abhishek. Thank you so much for the reply and also thank you for posting this blog. I sure hope you post more.