HACMP CLUSTER ISSUE.
The HACMP cluster nodes each have only one Ethernet adapter, which is virtual.
IP Configuration on both nodes
node-1
en0:
boot-ip : 192.168.3.7
persistent-ip : 10.1.1.16
node-2
en0:
boot-ip : 192.168.3.8
persistent-ip : 10.1.1.18
Problem statement
After performing the re-configuration, when we started the cluster, one of the nodes was automatically rebooted.
Understanding the exact issue.
1. We verified the cluster logs (hacmp.out and cluster.log) and found error messages on both nodes.
hacmp.out log entry on "node-1"
dec 5 23:07:52 node-1 user:notice HACMP for AIX: EVENT START: fail_interface node-1 192.168.3.7 >>> this indicates an issue with the boot-ip interface
dec 5 23:07:52 node-1 user:notice HACMP for AIX: EVENT COMPLETED: fail_interface node-1 192.168.3.7 0
dec 5 23:07:59 node-1 local0:crit clstrmgrES[7143544]: Sun dec 5 23:07:59 announcementCb: Called, state=ST_STABLE, provider token 1
dec 5 23:07:59 node-1 local0:crit clstrmgrES[7143544]: Sun dec 5 23:07:59 announcementCb: GsToken 2, AdapterToken -1, rm_GsToken 1
dec 5 23:07:59 node-1 local0:crit clstrmgrES[7143544]: Sun dec 5 23:07:59 announcementCb: GRPSVCS announcment code=512; exiting
dec 5 23:07:59 node-1 local0:crit clstrmgrES[7143544]: Sun dec 5 23:07:59 CHECK FOR FAILURE OF RSCT SUBSYSTEMS (topsvcs or grpsvcs) >>> this indicates a possible heartbeat issue
dec 5 23:07:59 node-1 daemon:err|error haemd[13041862]: LPP=PSSP,Fn=emd_gsi.c,SID=1.4.1.37,L#=1395, haemd: 2521-032 Cannot dispatch group services (1). >>> this again points to an issue with the boot IPs
dec 5 23:07:59 node-1 user:notice HACMP for AIX: clexit.rc : Unexpected termination of clstrmgrES.
dec 5 19:08:00 node-1 user:notice HACMP for AIX: clexit.rc : Halting system immediately!!!
Cluster.log error entry
cllsstbys: No communication interfaces found.
Similar error messages were received on the other node as well.
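To isolate these events quickly on each node, the interface-failure and halt messages can be filtered out of hacmp.out. The sketch below runs against an inline sample so the filter itself is the focus; log locations vary by version (/tmp/hacmp.out on older HACMP releases, /var/hacmp/log/hacmp.out on newer PowerHA), so adjust the path for a real system:

```shell
# Sketch: pull interface-failure and halt events out of hacmp.out.
# An inline sample stands in for the real log file here.
cat > /tmp/hacmp.out.sample <<'EOF'
dec 5 23:07:52 node-1 user:notice HACMP for AIX: EVENT START: fail_interface node-1 192.168.3.7
dec 5 23:07:59 node-1 local0:crit clstrmgrES[7143544]: announcementCb: Called, state=ST_STABLE
dec 5 23:07:59 node-1 user:notice HACMP for AIX: clexit.rc : Unexpected termination of clstrmgrES.
EOF
# fail_interface marks the adapter going down; clexit.rc marks the
# cluster manager terminating, which is what halts the node.
grep -E 'fail_interface|clexit\.rc' /tmp/hacmp.out.sample
```

Seeing both patterns together, on both nodes, is the hint that each side considers its own interface dead.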
The question now was: why were both nodes reporting that their boot-ip interfaces were down?
1. We re-validated the cluster configuration, ran ping tests, and performed a heartbeat functionality test from both nodes to pinpoint the issue.
The cluster configuration was fine: offline synchronization completed successfully, the heartbeat links were operational, and the ping tests succeeded.
2. We then went through the cluster-related configuration files and finally found the root cause.
Root Cause
While reviewing the netmon.cf file, which is normally used in virtualized cluster environments, we found the following entries:
Node-1
!REQD !ALL 192.168.2.8
The node-1 Ethernet adapter will be considered up only if it can ping 192.168.2.8.
Here was the issue: the entry in the netmon.cf file was incorrect, i.e. the wrong IP was specified. Because of this entry, node-1 kept trying to reach 192.168.2.8; since that IP does not exist and is unreachable, the cluster marked the interface as down.
Node-2
!REQD !ALL 192.168.2.7
The node-2 Ethernet adapter will be considered up only if it can ping 192.168.2.7.
The same problem existed here: since 192.168.2.7 was also unreachable, the cluster marked this interface as down as well.
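The `!REQD !ALL <target>` form means netmon declares the interface up only if the listed target answers a ping. A quick way to audit which addresses each node's interface state depends on is to extract them from netmon.cf. This is a rough sketch, not an official tool; /usr/es/sbin/cluster/netmon.cf is the usual location, and an inline sample (reproducing the faulty node-1 entry) stands in for the real file:

```shell
# Sketch: list the ping targets that netmon.cf makes mandatory.
# Inline sample reproduces the faulty node-1 entry from above.
cat > /tmp/netmon.cf.sample <<'EOF'
!REQD !ALL 192.168.2.8
EOF
# For each !REQD line, the third field is the address that must answer ping.
awk '$1 == "!REQD" { print "interface is up only if " $3 " answers ping" }' /tmp/netmon.cf.sample
```

Pinging each extracted target from the node itself would have exposed the bad entries immediately.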
This led to a condition in which each node believed the other was unreachable and tried to acquire the resource groups (RGs); to maintain data integrity, the cluster then rebooted the node.
Solution provided
We corrected the entries in the netmon.cf file on both nodes as follows:
Node-1
!REQD !ALL 192.168.3.8
Node-2
!REQD !ALL 192.168.3.7
We then synchronized the cluster and started it again, and the issue was resolved.
Note: removing the netmon.cf file entirely would also have resolved the issue.
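As a final safeguard before synchronizing, it is worth cross-checking that every netmon.cf target is an address that actually exists in the cluster, such as the peer's boot IP. The sketch below illustrates such a check; the file paths and the known-good address list are assumptions for illustration:

```shell
# Known-good cluster addresses (the boot IPs from the configuration above).
printf '%s\n' 192.168.3.7 192.168.3.8 > /tmp/known_ips
# Sample of the corrected node-1 netmon.cf entry.
printf '!REQD !ALL 192.168.3.8\n' > /tmp/netmon.cf.node1
# Flag any !REQD target that is not a known cluster address.
awk '$1 == "!REQD" { print $3 }' /tmp/netmon.cf.node1 | while read -r ip; do
  if grep -Fqx "$ip" /tmp/known_ips; then
    echo "$ip: ok"
  else
    echo "$ip: not a known cluster address -- fix before syncing"
  fi
done
```

Had such a check been run against the original entries, 192.168.2.8 and 192.168.2.7 would have been flagged before the cluster ever started.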