Saturday, December 16, 2017

lspv shows newly assigned LUN as "VeritasVolumes" instead of "None"

Problem

AIX's lspv shows a newly assigned LUN as "VeritasVolumes" instead of "None". The odd part: the Veritas Volume Manager (VxVM) package had been removed from the server long back, and no Veritas services were configured on it.

Error:

test# extendvg -f datavg hdisk3
Disk hdisk3 is already in volume group VeritasVolumes

Probable cause

The lspv command reads the ODM customized database, so the output we were getting reflects the PV attributes defined there. It may be that hdisk3 was earlier part of Veritas Volume Manager and the proper procedure was not followed when the Veritas volumes were removed, leaving inconsistencies behind in the customized database.


Understanding the exact cause

1) To understand the exact issue, we pulled the disk details from the customized database. We found that the problematic disk had a PV attribute set under "VeritasVolumes", which was not present for the other AIX LVM disks; we cross-checked all the disks to confirm that only hdisk3 was affected.

# odmget -q value=hdisk3 CuAt

CuAt:
        name = "VeritasVolumes"
        attribute = "pv"
        value = "hdisk3"
        type = "R"
        generic = ""
        rep = ""
        nls_index = 0

2) We removed the disk from the server using the rmdev command, to see whether this attribute value would be removed along with it. No luck: all the other attributes were removed, but this one was not. This gave us the impression that the attribute either needs to be removed by the correct VxVM command or forcefully deleted from the customized ODM database.


Resolution


Normally, the process to remove a disk from VxVM control, when VxVM is configured, is the following:


1. Tell the vxconfigd daemon to enter enabled mode:
 #vxdctl enable
2. Check the disk details:
# vxdisk -e list|grep hdisk3
test_aks0_1242 auto      -             -            online       hdisk3     std

3. Uninitialise the device to remove the VxVM information:
# /etc/vx/bin/vxdiskunsetup -C test_aks0_1242

4. Remove the disk from VxVM's view:
# vxdisk rm test_aks0_1242

5. lspv now shows the device without the VeritasVolumes tag:
# lspv|grep hdisk3
hdisk3        none                                None


In our case, however, the VxVM package itself was not present, so it was not possible to remove the disk from VxVM control using any VxVM command. The steps below were used instead.


Step 1) Validated the hdisk3 disk details:
test# lspv | grep hdisk3
hdisk3          none                                VeritasVolumes

Step 2) Removed the disk, to check whether the ODM PV attribute information of hdisk3 gets cleared along with it:

test# rmdev -Rdl hdisk3
hdisk3 deleted

Step 3) Found that even after removing the disk, the PV attribute of hdisk3 was not cleared, while all the other ODM entries for the disk were gone:
test# odmget -q value=hdisk3 CuAt

CuAt:
        name = "VeritasVolumes"
        attribute = "pv"
        value = "hdisk3"
        type = "R"
        generic = ""
        rep = ""
        nls_index = 0
test# odmget -q name=hdisk3 CuAt
test# odmget -q name=hdisk3 CuDv
test# odmget -q value3=hdisk3 CuDvDr
test# odmget -q name=hdisk3 CuDep
test# odmget -q name=hdisk3 CuVPD

Step 4) Removed the PV attribute from the customized ODM:

# odmdelete -q value=hdisk3 -o CuAt
0518-307 odmdelete: 1 objects deleted.

Step 5) Ran cfgmgr to reconfigure the disk:

#cfgmgr
test# lspv | grep hdisk3
hdisk3          none                                none

Note: it is not required to remove the disk with rmdev first; the stale ODM definition can be deleted directly.
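If more than one disk carries the stale attribute, the cleanup can be scripted. A small ksh sketch (the disk names are illustrative; always confirm each match with odmget before deleting):

for d in hdisk3 hdisk4        # illustrative disk names
do
        odmdelete -q "name=VeritasVolumes and value=$d" -o CuAt
done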

Thursday, November 09, 2017

HMC Commandline




Getting the frame details



hscroot@hmc-op:~> lssyscfg -r sys -F name
op710-1-xxxxxxx
op710-2-xxxxxxx
op720-1-xxxxxxx


Getting the LPAR details in the frame with status

hscroot@hmc-op:~> lssyscfg -m op710-2-SN1008B2A -r lpar -F name,lpar_id,state
op710-2-Client5-Fedora-Core-4,6,Running
op710-2-Client4-openSUSE-10.0,5,Running
op710-2-Client3-Debian-3.1,4,Running
op710-2-Client2-RHAS4U3,3,Running
op710-2-Client1-SLES9SP3,2,Running
op710-2-VIO-Server,1,Running

Getting the resource allocation for a frame

hscroot@HMC:~> lshwres -r mem -m Server-8204-XXX-XXXX --level sys
configurable_sys_mem=114688,curr_avail_sys_mem=256,pend_avail_sys_mem=256,installed_sys_mem=114688,max_capacity_sys_mem=deprecated,
deconfig_sys_mem=0,sys_firmware_mem=2560,mem_region_size=256,configurable_num_sys_huge_pages=0,curr_avail_num_sys_huge_pages=0,pend_avail_num_sys_huge_pages=0,max_num_sys_huge_pages=6,requested_num_sys_huge_pages=0,huge_page_size=16384,total_sys_bsr_arrays=16,bsr_array_size=8,curr_avail_sys_bsr_arrays=0,max_mem_pools=0
hscroot@HMC:~>
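The processor allocation for the frame can be queried the same way (a sketch; the flags follow the same pattern as the memory query above):

hscroot@HMC:~> lshwres -r proc -m Server-8204-XXX-XXXX --level sys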

Getting the resource allocation for an LPAR

HMC:~> lssyscfg -m Server-8206-E48-XXXXXXX  -r prof --filter "lpar_names=test_retail"
name=test_retail_Profile_OK,lpar_name=test_retail,lpar_id=2,lpar_env=aixlinux,all_resources=0,min_mem=28872,desired_mem=28872,max_mem=28872,min_num_huge_pages=0,
desired_num_huge_pages=0,max_num_huge_pages=0,proc_mode=ded,min_procs=6,desired_procs=6,max_procs=6,sharing_mode=share_idle_procs,"io_slots=,lpar_io_pool_ids=none,
max_virtual_slots=10,"virtual_serial_adapters=0/server/1/any//any/1,1/server/1/any//any/1",virtual_scsi_adapters=none,virtual_eth_adapters=none,hca_adapters=none,boot_mode=norm,conn_monitoring=1,auto_start=1,power_ctrl_lpar_ids=none,work_group_id=none,redundant_err_path_reporting=0,bsr_arrays=0,lhea_logical_ports=none,lhea_capabilities=none,lpar_proc_compat_mode=default,
electronic_err_reporting=null,virtual_fc_adapters=none



Changing the memory allocation for an LPAR

chsyscfg -r prof -m Server-8206-E48-SN2239B16  -i "name=test_retail_Profile_OK,lpar_name=test_retail,min_mem=94208,desired_mem=94208,max_mem=94208"


Changing the Virtual CPU parameter for an LPAR

chsyscfg -r prof -m Server-8206-E48-SN2239B16  -i "name=test_retail_Profile_OK,lpar_name=test_retail,min_procs=7,desired_procs=7,max_procs=7"

Changing the entitled capacity for an LPAR

chsyscfg -r prof -m Server-8206-E48-SN2239B16  -i "name=test_retail_Profile_OK,lpar_name=test_retail,min_proc_units=0.1,desired_proc_units=0.2,max_proc_units=2.0"
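Note that chsyscfg -r prof only edits the profile, so the new values take effect the next time the LPAR is activated with that profile. On a running LPAR the same resources can be changed dynamically (DLPAR) with chhwres; a sketch with illustrative quantities:

chhwres -r mem -m Server-8206-E48-SN2239B16 -o a -p test_retail -q 1024      Add 1024 MB of memory to the running LPAR
chhwres -r proc -m Server-8206-E48-SN2239B16 -o a -p test_retail --procs 1   Add one dedicated processor to the running LPAR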


Starting and shutting down an LPAR

To start the LPAR named "test_retail" with the profile "test_retail_Profile_OK" (the listing below shows how to check LPAR states first):


hscroot@hmc-570:~> lssyscfg -m Server-9110-510-SN100129A -r lpar -F name,lpar_id,state,default_profile
VIOS1.3-FP8.0,1,Running,default
linux_test,2,Not Activated,client_default



chsysstate -m Server-8206-E48-SN2239B16  -r lpar -o on -n test_retail -f  test_retail_Profile_OK

Shutting down the LPAR "test12" immediately:

chsysstate -m SYSTEM-9131-52A-SN10XXXXX -r lpar -o shutdown -n test12  --immed
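To open a console session on an LPAR (the counterpart of the rmvterm command listed below), using the same illustrative system and LPAR names:

mkvterm -m SYSTEM-9131-52A-SN10XXXXX -p test12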


Important commands


lshmc -v   Shows vital product data, such as the serial number.
lshmc -V   Shows the release of the HMC.
lshmc -n   Shows network information of the HMC.
hmcshutdown -r -t now   Reboots the HMC.
lssysconn -r all   Shows the connected managed systems.
chhmcusr -u hscpe -t passwd -v abc1234   Changes the password of user hscpe.
lshmcusr   Lists the users of the HMC.
monhmc -r disk   Shows the filesystems of the HMC.
monhmc -r proc   Shows details of the processors.
monhmc -r mem   Shows details of the memory.
rmvterm -m SYSTEM-9117-570-SN10XXXXX -p name   Forces the closure of a virtual terminal session.
lspartition -dlpar   Shows DLPAR-capable partitions.


And now let's issue some commands to a VIOS using viosvrcmd.

hscroot@hmc-570:~> viosvrcmd -m Server-9115-520-SNxxxxx -p VIOS1.3-FP8.0 -c "mkvg -f -vg datavg hdisk2 hdisk3"
datavg
hscroot@hmc-570:~> viosvrcmd -m Server-9115-520-SNxxxxxx -p VIOS1.3-FP8.0 -c "mklv -lv testlv datavg 10G"
testlv
hscroot@hmc-570:~> viosvrcmd -m Server-9115-520-SNxxxxxx -p VIOS1.3-FP8.0 -c "lsvg -lv datavg"
datavg:
LV NAME             TYPE       LPs   PPs   PVs  LV STATE      MOUNT POINT
testlv              jfs        160   160   1    closed/syncd  N/A

Cluster Issue

Why did the SP2 failover fail?

Observations:

1. After analyzing the logs, we noticed that the cluster event "get_disk_vg_fs" had failed. To pinpoint where the actual issue was and why this event failed, we dug deeper into the logs and found that the cluster services had issues while activating/mounting the cluster filesystem /sapmnt/SP2.





   

2. When we initiate the SP2 cluster failover, it unmounts the filesystems and exports the VGs from node1, and after this it imports all the VGs and mounts the respective filesystems on node2. As per the logs, the cluster VGs were successfully exported from node1 and imported on node2, but the cluster failed while mounting the /sapmnt/SP2 filesystem.

3. Once we had these details from the cluster logs, we investigated further to learn why the cluster was facing issues with /sapmnt/SP2 during the failover. We found that /sapmnt/SP2 was already NFS-mounted: the filesystem had been mounted manually on node2 using the normal NFS commands. That means the cluster was not able to mount the filesystem because it was already mounted.



4. We verified with the SAP/DB team on a call whether /sapmnt/SP2 was required on node2 and, upon confirmation, unmounted it. As per the application team, this filesystem is needed wherever the SP2 application is running, so we configured /sapmnt/SP2 as an NFS cross-mount inside the cluster to meet the requirement, then performed the cluster failover test and application validation again.

Everything was fine.
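For reference, two quick checks that would have exposed this condition up front, sketched for PowerHA/HACMP (the hacmp.out path varies by release; /var/hacmp/log/hacmp.out is assumed here):

mount | grep sapmnt                               Is /sapmnt/SP2 already mounted, and from where?
grep -i get_disk_vg_fs /var/hacmp/log/hacmp.out   Locate the failed event in the cluster log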


Tuesday, September 12, 2017

Dynamic Routing -gated services aix


In TCP/IP, routing can be one of two types:

1. Static routing
2. Dynamic routing

 With static routing, you maintain the routing table manually using the route command. Static routing is practical for a single network communicating with one or two other networks.
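For example, a static route on AIX can be added as below (a sketch; the addresses are hypothetical, and a route added this way does not survive a reboot unless it is also stored in the ODM, e.g. via smitty mkroute):

# route add -net 192.168.2.0 -netmask 255.255.255.0 192.168.1.254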

* Note -  However, as your network begins to communicate with more networks, the number of gateways increases, and so does the amount of time and effort required to maintain the routing table manually.
With dynamic routing, daemons update the routing table automatically. Routing daemons continuously receive information broadcast by other routing daemons, and so continuously update the routing table.



In AIX, TCP/IP provides two daemons for use in dynamic routing:

1. The routed daemon
2. The gated daemon

The gated daemon supports:

a) Routing Information Protocol (RIP) and Routing Information Protocol Next Generation (RIPng)
b) Exterior Gateway Protocol (EGP)
c) Border Gateway Protocol (BGP) and BGP4+
d) Defense Communications Network Local-Network Protocol (HELLO)
e) Open Shortest Path First (OSPF)
f) Simple Network Management Protocol (SNMP), and more


Routing daemons can operate in one of two modes:
1. Passive
2. Active

In active mode, routing daemons both broadcast routing information about their local network periodically to gateways and hosts, and receive routing information from hosts and gateways. In passive mode, routing daemons receive routing information from hosts and gateways, but do not attempt to keep remote gateways updated (they do not advertise their own routing information).

Dynamic routing daemons must be run in passive (quiet) mode when run on a host that is not a gateway.

Recently I came across an environment where gated services were used with the OSPF routing protocol. This was something new for me, so I started reading PDFs and blogs to understand the exact concepts.

The most important point: if you want to understand the complete configuration, you first need to understand the routing protocol itself, how it works, and its network terminology.





Now let us go through the basic concepts of the OSPF routing protocol that will be helpful for the configuration.


OSPF

  • Dynamic routing protocol
  • Link-state technology
  • Runs over IP, protocol 89
  • Designed by the IETF for TCP/IP
  • Supports VLSM, i.e. subnetting
  • Multi-vendor: a standard protocol supported by all vendors
  • Fast rerouting: OSPF detects changes in the topology, such as link failures, and converges on a new loop-free routing structure within seconds
  • Minimises routing protocol traffic
  • Low bandwidth requirements
  • Supports different types of areas
  • Route summarisation and authentication
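To tie this back to AIX: gated reads its configuration from /etc/gated.conf, and a minimal OSPF stanza might look like the sketch below (the router ID, interface name, and parameter values are all hypothetical; check the gated.conf syntax for your AIX level before using it). The daemon itself is then started under SRC with startsrc -s gated.

routerid 192.168.1.10 ;

ospf yes {
    backbone {
        interface en0 {
            priority 5 ;
            hellointerval 10 ;
        } ;
    } ;
} ;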


 Under construction  ....  


Sunday, September 10, 2017

Network performance: some points


Recently I was trying to understand an issue in which the customer complained that their network connections were getting dropped. The network team had worked on it for a long time and then came to the Unix team to look from the server end as well.

Since it was a virtualized environment, we started looking from the network end first, and also asked the application team to let us know how these connections are set up, hoping that some tuning on both ends would resolve the issue.
Network stats
=============
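The counters below were taken from the server's TCP statistics; on AIX they typically come from a command like:

# netstat -s -p tcp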

108038312 packets received
                67173530 acks (for 3510816000 bytes)
                295731 duplicate acks
                0 acks for unsent data
                97425484 packets (2215095896 bytes) received in-sequence
                22985 completely duplicate packets (28717295 bytes)
                0 old duplicate packets
                8552 packets with some dup. data (5423403 bytes duped)
                8332754 out-of-order packets (461387377 bytes)

Understanding the reason for these out-of-order packets and duplicate packets at the receiving end: there are certain scenarios to consider:

1. Network congestion
2. The adapter (EtherChannel) configuration
3. The adapter buffers, etc.

The Adapter Configuration
=====================
In our scenario, the EtherChannel is configured as link aggregation, but with the "round-robin" algorithm.

Let us first understand the round-robin algorithm:


Round-robin: all outgoing traffic is spread evenly across all of the adapters in the EtherChannel. It provides the highest bandwidth optimization for the AIX server system. While round-robin distribution is the ideal way to utilize all the links equally, we should also consider that it introduces the potential for out-of-order packets at the receiving system.

The out-of-order packets and duplicate ACKs can all be due to the "round-robin" EtherChannel algorithm, or they may indicate some other network issue.



2. We noticed that a lot of TCP ACK packets are getting delayed. This is normal behavior of TCP/IP on UNIX, but for applications demanding high performance (response time) it may be an issue.

This is normally customized at the application level, but AIX also provides an option to overcome it. The TCP_NODELAY socket option is disabled by default, which means the TCP Nagle algorithm is used on network transmissions, delaying the sending of small successive packets.

The Nagle algorithm means that a TCP connection can have only one outstanding acknowledgement for a small segment. Clearly this causes delays in sending further packets until either the acknowledgement is received or TCP can bundle more data into a full segment. Setting tcp_nodelay to 1 is a dynamic change and can improve response time. This is sometimes very helpful in getting network throughput for applications demanding fast response times, but it increases CPU overhead and may lead to network congestion.
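On AIX there are two related knobs here, sketched below with an illustrative interface name; verify both attributes exist at your AIX level before relying on them. TCP_NODELAY itself is a per-socket option set by the application via setsockopt(), but tcp_nodelay is also available as an interface-specific network option (ISNO), and delayed ACKs are governed separately by the tcp_nodelayack tunable:

chdev -l en0 -a tcp_nodelay=1     Enable tcp_nodelay as an ISNO on interface en0
no -o tcp_nodelayack=1            Acknowledge segments immediately instead of delaying ACKs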



Before reaching a conclusion, we also need to validate various other parameters. ... Under construction ...

Friday, August 25, 2017

NIM mksysb Restoration issue - "image.data has invalid logical volume data"


Recently we were working on a DR exercise. We were trying to restore a mksysb and faced a unique issue; even after checking the IBM sites, we were not able to find satisfactory steps to a resolution.

Issue noticed and troubleshooting steps:
----------------

  1. While restoring, we received errors and were redirected back to the main (language selection) page afterwards.
  2. We noticed that the system hit an error while parsing the image.data file and threw the message "image.data has invalid logical volume data".
  3. To isolate the issue further, we created a new image.data file from the source server and tried the mksysb restore again after adding the image.data resource.
  4. No luck: we faced the same errors even then. This made it clear that the issue was not with the image.data file as such, but with its content.
  5. We validated image.data files from many other servers that had been restored successfully.
  6. After a lot of investigation and comparison against the image.data files of the successfully restored servers, we figured out that the problematic server had no paging space (hd6) defined under rootvg.


Resolution:

  • As a hit-and-trial method, we edited the image.data file of the server and added the paging space logical volume into it (a trimmed sketch follows below).
  • Created a new image.data resource and assigned it to the NIM client.
  • After doing so, the NIM mksysb restoration proceeded smoothly without any issues.
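For illustration, a trimmed sketch of the lv_data stanza added for hd6. Treat every value (source disk, LPs, PP_SIZE, copies) as a placeholder that must match the actual rootvg layout, and compare against a stanza from a known-good image.data before editing:

lv_data:
        VOLUME_GROUP= rootvg
        LV_SOURCE_DISK_LIST= hdisk0
        LOGICAL_VOLUME= hd6
        TYPE= paging
        MAX_LPS= 512
        COPIES= 1
        LPs= 16
        STALE_PPs= 0
        INTER_POLICY= minimum
        INTRA_POLICY= middle
        MOUNT_POINT=
        LV_STATE= opened/syncd
        PERMISSION= read/write
        PP_SIZE= 64
        RELOCATABLE= y
        UPPER_BOUND= 32
        LABEL= None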










Thursday, July 27, 2017

sendmail issue: sendmail listening only on localhost

Problem Statement 
================
The servers test01 and test02 are in a cluster. The customer's concern was that they were not able to reach the cluster IP on port number 25.

Analysis and understanding the exact problem 
======================================
1. First we validated that the sendmail services were running; they were fine.
2. We tried to telnet to the cluster IP on port number 25; the connection was rejected. We then tried to telnet to all the IPs configured on the network interfaces using port number 25, and those failed as well.
3. We tried to telnet to localhost on port 25; this was successful, which means there was no port-level blocking on the server.
4. We validated the sendmail configuration file to check the relay server configuration and found it OK.
5. Now the question arises: if the port is open and we are able to telnet to localhost, why are we not able to do so for the other IPs? Why were the sendmail services listening only on localhost? This was the actual issue that needed to be sorted out.


Searching for a solution, we went through the net and found some useful documents; sharing the same here.

For sendmail:
============ 
https://www.novell.com/support/kb/doc.php?id=7003912

After performing this change in the configuration file, it was working fine for all the IPs.
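The change in question is typically the DaemonPortOptions line in sendmail.cf: when it carries Addr=127.0.0.1, sendmail binds to localhost only. A hedged before/after sketch (the exact line varies with the sendmail version):

O DaemonPortOptions=Port=smtp, Addr=127.0.0.1, Name=MTA     Listens on localhost only
O DaemonPortOptions=Port=smtp, Name=MTA                     Listens on all interfaces

On AIX, restart sendmail afterwards, for example:

# stopsrc -s sendmail
# startsrc -s sendmail -a "-bd -q30m"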

For postfix
==========

1. Open the /etc/postfix/main.cf file:
# vi /etc/postfix/main.cf
2. Append/modify the line as follows to bind to localhost (127.0.0.1) only:
inet_interfaces = 127.0.0.1
3. If you need to bind to both 127.0.0.1 and 192.168.2.1, enter:
inet_interfaces = 192.168.2.1,127.0.0.1
(Conversely, inet_interfaces = all makes Postfix listen on all interfaces.)
Save and close the file. You need to stop and start Postfix when this parameter changes, so type the following to restart it:
# /etc/init.d/postfix restart