Saturday, September 19, 2015

Introduction to GPFS Filesystem




  • IBM introduced the GPFS filesystem in 1998.
  • GPFS is a high-performance clustered file system developed by IBM.
  • GPFS provides concurrent, high-speed file access to applications executing on multiple nodes of a cluster.
  • It is a shared-disk file system that provides fast data access from all nodes in a homogeneous or heterogeneous cluster of servers running AIX, Linux, or Windows.
  • All nodes in a GPFS cluster mount the same GPFS journaled filesystem, allowing multiple nodes to work on the same data at the same time.




GPFS Filesystem internals 


A file system (or stripe group) consists of a set of disks that are used to store file metadata as well as data and structures used by GPFS, including quota files and GPFS recovery logs.


                 How does the GPFS filesystem work?

Whenever a disk is added to a GPFS filesystem, a file system descriptor is written on it. The file system descriptor is written at a fixed position
on each disk, which helps GPFS identify the disk and its place in the file system.

The filesystem descriptor contains file system specifications and information about the state of the file system.


GPFS uses the concepts of inodes, indirect blocks and data blocks to store and access file data on the disks.
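
As a quick way to see what GPFS knows about an existing filesystem and its disks, the standard listing commands can be used (gpfs0 is the filesystem name used in the examples later in this post):

# mmlsfs gpfs0      >> lists the filesystem attributes (block size, inodes, mount point, etc.)
# mmlsdisk gpfs0    >> lists every disk in the filesystem with its usage type and status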


                 What is metadata?

Inodes and indirect blocks are considered metadata.
The metadata for each file is stored in its inode and contains information such as the file size and the last modification timestamp.

For faster access, the inode of a small file also contains the addresses of all disk blocks that hold the file's data.


You can control which disks GPFS uses for storing metadata when creating the file system using the mmcrfs command or
when modifying the file system at a later time by issuing the mmchdisk command.


How to define which disks will be used for storing metadata?


As already discussed, the format of the disk descriptor file is:

Diskname:::DiskUsage:FailureGroup::StoragePool:

The DiskUsage field determines what kind of data will be stored on the disk.

Below are the options that can be used (an example descriptor file follows the list).

  • dataAndMetadata   >>  indicates that the disk stores both data and metadata
  • dataOnly          >>  indicates that the disk stores only data
  • metadataOnly      >>  indicates that the disk stores only metadata
  • descOnly          >>  indicates that the disk holds only a copy of the file system descriptor
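
For example, a descriptor file that dedicates one NSD to metadata and another to data only could look like the sketch below; the NSD names and the failure group value are purely illustrative:

nsd01:::metadataOnly:-1::system
nsd02:::dataOnly:-1::system

Such a file is then passed to mmcrfs (when creating the filesystem) or mmadddisk (when extending it) with the -F option.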



         

We can also use the same options with the mmchdisk command to change the disk usage of an existing disk.


After changing the disk usage parameter with the mmchdisk command, we need to run the mmrestripefs command with the -r option to re-allocate the data
according to the new disk parameters. This is an online activity, but the mmrestripefs command is I/O intensive, so it should be executed when the
I/O load is low.

ex. mmchdisk gpfs0 change -d "gpfsnsd:::dataOnly"

After this, confirm whether the change has been applied successfully using the below command:
mmlsdisk gpfs0
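
Putting the whole sequence together, a minimal sketch (assuming the filesystem is gpfs0 and the NSD is gpfsnsd, as in the example above) looks like this:

# mmchdisk gpfs0 change -d "gpfsnsd:::dataOnly"   >> change the disk usage of the NSD
# mmlsdisk gpfs0                                  >> verify the new disk usage setting
# mmrestripefs gpfs0 -r                           >> re-place existing data according to the new setting (I/O intensive)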


GPFS and memory


GPFS uses three areas of memory:


  •  memory allocated from the kernel heap, 
  • memory allocated within the daemon segment, and 
  • shared segments accessed from both the daemon and the kernel.


Memory allocated from the kernel heap
GPFS uses kernel memory for control structures such as vnodes and related structures that establish the necessary relationship with the operating system.

Memory allocated within the daemon segment
GPFS uses daemon segment memory for file system manager functions. Because of that, the file system manager node requires more daemon memory, since token states for the entire file system are initially stored there.

File system manager functions requiring daemon memory include:

  • Structures that persist for the execution of a command
  • Structures that persist for I/O operations
  • States related to other nodes
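
To see which node is currently acting as the file system manager (and therefore carrying this extra daemon memory), the mmlsmgr command can be used:

# mmlsmgr gpfs0    >> shows the file system manager node for the filesystem gpfs0
# mmlsmgr          >> with no argument, lists the manager node for every filesystem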



Shared segments accessed from both the daemon and the kernel

Shared segments consist of both pinned and unpinned memory that is allocated at daemon startup.
The initial values are the system defaults. However, you can change these values later using the mmchconfig command.
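
To check the values currently in effect, mmlsconfig can be queried for the individual parameters discussed in the rest of this section:

# mmlsconfig pagepool
# mmlsconfig maxFilesToCache
# mmlsconfig maxStatCache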


The pinned memory is called the pagepool and is configured by setting the pagepool cluster configuration parameter.
This pinned area of memory is used for storing file data and for optimizing the performance of various data access patterns.


In a non-pinned area of the shared segment, GPFS keeps information about open and recently opened files. This information is held in two forms:
    1. full inode cache
    2. stat cache



Pinned  memory


GPFS uses pinned memory (also called pagepool memory) for storing file data and metadata in support of I/O operations.
With some access patterns, increasing the amount of pagepool memory can increase I/O performance.


Increased pagepool memory can be useful in the following cases:

  • There are frequent writes that can be overlapped with application execution.
  • There is frequent reuse of file data that can fit in the pagepool.
  • The I/O pattern contains sequential reads large enough that prefetching the data improves performance.


Pinned memory regions cannot be swapped out to disk, which means that GPFS will always consume at least the value of pagepool in system memory.
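
As a sketch of how the pagepool size can be changed (the 2G value is purely illustrative; size the pagepool according to your workload and the memory available on the node):

# mmchconfig pagepool=2G

Depending on the GPFS level, the new pagepool size may take effect only after GPFS is restarted on the node.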


Non-pinned memory
There are two levels of cache used to store file metadata:

Inode cache
The inode cache contains copies of inodes for open files and for some recently used files that are no longer open.
The maxFilesToCache parameter controls the number of inodes cached by GPFS.

Every open file on a node consumes space in the inode cache.
Additional space in the inode cache is used to store the inodes of recently used files, in case another application needs that data.

The number of open files can exceed the value defined by the maxFilesToCache parameter so that applications can keep operating. However,
when the maxFilesToCache number is exceeded, there is no further caching of recently opened files, and only the inodes of open files are kept in the cache.


Stat cache
The stat cache contains enough information to respond to inquiries about the file and open it, but not enough information to read from it or write to it.

A stat cache entry consumes significantly less memory than a full inode. The default value of maxStatCache is four times the maxFilesToCache parameter.

This value may be changed through the maxStatCache parameter on the mmchconfig command.
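
For example, both cache sizes can be adjusted in a single mmchconfig invocation; the numbers below are only illustrative, and changes to these parameters typically take effect after GPFS is restarted on the affected nodes:

# mmchconfig maxFilesToCache=4000,maxStatCache=16000
# mmlsconfig maxFilesToCache    >> verify the new setting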



Monday, September 14, 2015

Adding space or disks to a GPFS Filesystem

          Steps to add disks to the filesystem

Step 1 : Before adding disks to GPFS, record the details of the existing GPFS disks using the mmlsnsd command.


       # mmlsnsd
         File system   Disk name    NSD servers
         --------------------------------------------------------------------------
          gpfs0         nsd08        (directly attached)
          gpfs0         nsd09        (directly attached)


      # mmlsnsd -m  >> this shows, for each NSD, the NSD volume ID and the corresponding local device name
 

Step 2 : Before adding the disks to the GPFS filesystem, we need to create the
         GPFS NSDs using the mmcrnsd command.

         For creating an NSD we need to create a disk descriptor file. The format of the file is as follows;
         it is not necessary to define all fields.


         DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName:StoragePool

  I am going to add hdisk1, hdisk2, hdisk3, hdisk4, hdisk5 and hdisk6 to the filesystem gpfs0.


    Create the file /tmp/abhi/gpfs-disks.txt :

hdisk1:::dataAndMetadata::nsd01::
hdisk2:::dataAndMetadata::nsd02::
hdisk3:::dataAndMetadata::nsd03::
hdisk4:::dataAndMetadata::nsd04::
hdisk5:::dataAndMetadata::nsd05::
hdisk6:::dataAndMetadata::nsd06::



#mmcrnsd -F /tmp/abhi/gpfs-disks.txt

mmcrnsd: Processing disk hdisk1
mmcrnsd: Processing disk hdisk2
mmcrnsd: Processing disk hdisk3
mmcrnsd: Processing disk hdisk4
mmcrnsd: Processing disk hdisk5
mmcrnsd: Processing disk hdisk6
mmcrnsd: Propagating the cluster configuration data to all
  affected nodes.  This is an asynchronous process.

Once the command is successful, we can see the NSD names corresponding to the disks in the lspv output.



# lspv
hdisk0          00c334b6af00e77b                    rootvg          active
hdisk1          none                                nsd01
hdisk2          none                                nsd02
hdisk3          none                                nsd03
hdisk4          none                                nsd04
hdisk5          none                                nsd05
hdisk6          none                                nsd06
hdisk8          none                                nsd08
hdisk9          none                                nsd09


We also need to verify using the mmlsnsd command.

# mmlsnsd
 File system   Disk name    NSD servers
--------------------------------------------------------------------------
 gpfs0         nsd08        (directly attached)
 gpfs0         nsd09        (directly attached)
 (free disk)   nsd01        (directly attached)
 (free disk)   nsd02        (directly attached)
 (free disk)   nsd03        (directly attached)
 (free disk)   nsd04        (directly attached)
 (free disk)   nsd05        (directly attached)
 (free disk)   nsd06        (directly attached)


Step 3 : After this we need to add the new NSDs to the filesystem.

Before adding the disks to the GPFS filesystem, we need to create another disk descriptor file.
Since some of the parameters were already defined while creating the NSDs, there is no need to define them again here;
only the fields DiskName, DiskUsage, FailureGroup and StoragePool need to be specified.

By default a GPFS cluster has one storage pool, "system", but we can define more storage pools as per our requirement.

DiskName:::DiskUsage:FailureGroup::StoragePool:

    cat /tmp/abhi/gpfs-disk.txt
nsd01:::dataAndMetadata:-1::system
nsd02:::dataAndMetadata:-1::system
nsd03:::dataAndMetadata:-1::system
nsd04:::dataAndMetadata:-1::system
nsd05:::dataAndMetadata:-1::system
nsd06:::dataAndMetadata:-1::system

# mmadddisk gpfs0 -F /tmp/abhi/gpfs-disk.txt -r   >> the -r option re-balances the existing data across all the disks, including the new ones

Note: Re-balancing data is an I/O intensive job; it is not recommended to use this option during peak load.

Once added, verify the new filesystem size using df -gt and also check the output of mmlsnsd.

# mmlsnsd
 File system   Disk name    NSD servers
--------------------------------------------------------------------------
 gpfs0         nsd08        (directly attached)

 gpfs0         nsd09        (directly attached)

 gpfs0         nsd01        (directly attached)

 gpfs0         nsd02        (directly attached)

 gpfs0         nsd03        (directly attached)

 gpfs0         nsd04        (directly attached)

 gpfs0         nsd05        (directly attached)

 gpfs0         nsd06        (directly attached)
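
As a final check, assuming the filesystem is mounted at /gpfs0 (the mount point here is an assumption; adjust it to your environment), the added capacity can be confirmed with:

# df -g /gpfs0       >> filesystem size as seen by the operating system
# mmdf gpfs0         >> free and used space per disk and per storage pool, as seen by GPFS
# mmlsdisk gpfs0     >> status and disk usage type of every disk in the filesystem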