Thursday, April 09, 2015

Cluster issue (netmon.cf) -- solved

                           HACMP CLUSTER ISSUE.

Each HACMP cluster node has only one Ethernet adapter, and it is virtual.


           IP configuration on both nodes

node-1
 en0:
   boot IP       : 192.168.3.7
   persistent IP : 10.1.1.16

node-2
 en0:
   boot IP       : 192.168.3.8
   persistent IP : 10.1.1.18

Problem statement


After performing the re-configuration, when we started the cluster, one node was automatically getting rebooted.


              Understanding the exact issue.

1. We verified the cluster logs (the hacmp.out and cluster.log files) and found error messages on both nodes.

       hacmp.out log entry on "node-1"

dec  5 23:07:52 node-1 user:notice HACMP for AIX: EVENT START: fail_interface node-1 192.168.3.7    >>> this indicates that there is some issue with the boot-IP interface
dec  5 23:07:52 node-1 user:notice HACMP for AIX: EVENT COMPLETED: fail_interface node-1 192.168.3.7 0
dec  5 23:07:59 node-1 local0:crit clstrmgrES[7143544]: Sun dec  5 23:07:59 announcementCb: Called, state=ST_STABLE, provider token 1
dec  5 23:07:59 node-1 local0:crit clstrmgrES[7143544]: Sun dec  5 23:07:59 announcementCb: GsToken 2, AdapterToken -1, rm_GsToken 1
dec  5 23:07:59 node-1 local0:crit clstrmgrES[7143544]: Sun dec  5 23:07:59 announcementCb: GRPSVCS announcment code=512; exiting
dec  5 23:07:59 node-1 local0:crit clstrmgrES[7143544]: Sun dec  5 23:07:59  CHECK FOR FAILURE OF RSCT SUBSYSTEMS (topsvcs or grpsvcs) >>> this indicates that there can be a heartbeat issue
dec  5 23:07:59 node-1 daemon:err|error haemd[13041862]: LPP=PSSP,Fn=emd_gsi.c,SID=1.4.1.37,L#=1395, haemd: 2521-032 Cannot dispatch group services (1). >>> this again indicates that there is some issue with the boot IPs
dec  5 23:07:59 node-1 user:notice HACMP for AIX: clexit.rc : Unexpected termination of clstrmgrES.
dec  5 19:08:00 node-1 user:notice HACMP for AIX: clexit.rc : Halting system immediately!!!


                      Cluster.log error entry

cllsstbys: No communication interfaces found.


Similar error messages were seen on the other node as well.


Now the question arises: why were both nodes complaining that the boot-IP interfaces were down?


1. We re-validated the cluster configuration and ran ping and heartbeat tests from both nodes to pinpoint the exact issue.
                           The cluster configuration was fine: offline synchronization completed successfully, the heartbeat links were operational, and the ping tests passed.
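The manual ping check from node-1 toward node-2 can be sketched as a small shell helper (the addresses are the ones from this cluster; the helper name and the `-c`/`-w` ping flags are illustrative, not HACMP tooling):

```shell
# Reachability test run from node-1 toward node-2.
# check_peer is just an illustrative helper, not an HACMP command.
check_peer() {
    ip="$1"
    if ping -c 1 -w 2 "$ip" >/dev/null 2>&1; then
        echo "$ip reachable"
    else
        echo "$ip unreachable"
    fi
}

check_peer 192.168.3.8    # node-2 boot IP
check_peer 10.1.1.18      # node-2 persistent IP
```

In our case both targets answered, which is what made the "interface down" reports so puzzling.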

 2. Then we started verifying the cluster-related configuration files, and there we finally found the root cause.


                 Root Cause


 After going through the netmon.cf file, which is normally used in virtualized cluster environments, we found the following entries:

        Node-1

!REQD !ALL 192.168.2.8

  The node-1 Ethernet adapter will be considered up only if it can ping 192.168.2.8.

  Here was the issue: the entry in the netmon.cf file was incorrect, i.e. the wrong IP was specified. Because of this entry, node-1 kept trying to reach 192.168.2.8; since this IP does not exist and is unreachable, the cluster marked the interface as down.
         Node-2

!REQD !ALL 192.168.2.7

 The node-2 Ethernet adapter will be considered up only if it can ping 192.168.2.7.


     Here too the netmon.cf entry was wrong: since 192.168.2.7 was also unreachable, the cluster marked that interface as down as well in the cluster log.


This led to a condition where each node thought the other node was unreachable and tried to grab the RGs (resource groups); to maintain data integrity, the other node was rebooted.
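To see why a wrong target IP takes the interface down, the effect of a "!REQD !ALL <target>" line can be simulated with a short shell sketch (a simplification only: the real decision is made inside RSCT topology services, and netmon_check is a hypothetical helper, not an HACMP command):

```shell
# Simplified model of netmon.cf "!REQD" handling: for each required target,
# the interface is declared up only if the target answers a ping.
netmon_check() {
    cf="$1"
    grep '^!REQD' "$cf" | while read -r _ _ target; do
        if ping -c 1 -w 2 "$target" >/dev/null 2>&1; then
            echo "interface UP (reached $target)"
        else
            echo "interface DOWN (cannot reach $target)"
        fi
    done
}

# With the bad entry from node-1's file, the target never answers,
# so the interface is declared down:
printf '!REQD !ALL 192.168.2.8\n' > ./netmon_demo.cf
netmon_check ./netmon_demo.cf
```

This is exactly the trap: the health of a perfectly working adapter ends up depending on an address that was never reachable in the first place.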



                   Solution provided


We modified the entry in the netmon.cf file on both nodes as follows:

 Node-1

!REQD !ALL 192.168.3.8


Node-2

!REQD !ALL 192.168.3.7

After that we synchronized the cluster and started it again, and the issue was resolved.
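A minimal sketch of the fix on node-1 (written to a scratch path here; on a real system the file lives at /usr/es/sbin/cluster/netmon.cf, and the cluster must be verified and synchronized afterwards):

```shell
# Scratch path for illustration; the real file on AIX is
# /usr/es/sbin/cluster/netmon.cf.
CF=./netmon.cf

# node-1 must point at an address that really exists on node-2 (its boot IP).
printf '!REQD !ALL 192.168.3.8\n' > "$CF"

# On node-2 the entry is the mirror image:
# printf '!REQD !ALL 192.168.3.7\n' > /usr/es/sbin/cluster/netmon.cf

cat "$CF"
```

The key design point: each node's required ping target must be an address the peer actually configures, so that netmon's up/down verdict tracks real network health.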


Note: simply removing the netmon.cf file would also have resolved the issue.