Integrate Nagios and Nimble

Discussion created by franku on Apr 10, 2013
Latest reply on Oct 14, 2015 by windows.oc@vpbank.com

(I tried posting this as a blog, because that's where I thought it would fit, but this will do. I'm hoping people can use this and/or improve upon it, which would imply some sort of discussion anyway.)

 

 

For us it was mandatory to be able to monitor the various services delivered by the Nimble storage device in Nagios, so I decided to dive into this and see what I could come up with.

 

Of course, Nagios is very good at reading SNMP values, but SNMP functionality is (at the time of this writing) rather limited on our Nimble storage device: it only covers logical volume size usage.

 

Some things our organisation wanted to monitor (and produce graphs of) are:

  • Individual disk statuses
  • SSD wearout (so that we know when it hits 20 % remaining life or less and can ring some bells)
  • Volume space utilization AND overall compression rates
  • Individual LUN space utilization and compression rates
  • Snapshot space utilization and compression
  • The Nimble's system load

 

To check these values, SNMP is obviously not going to cut it, so for now my approach is SSH with a passwordless key.

This key needs to be imported on the Nimble, also through SSH.
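For anyone starting from scratch, creating the passwordless key on the Nagios server might look roughly like the sketch below. The function name make_monitor_key and the key path are made up for illustration, and RSA 2048 is only an assumption about what the array accepts, so check that against your Nimble OS version. Importing the public key is done with the Nimble's own CLI and is not shown here.

```shell
#!/bin/sh
# Hypothetical helper: create a passwordless SSH key pair for monitoring.
# Run it as the user whose crontab will make the ssh calls.
make_monitor_key() {
    keyfile=$1    # e.g. "$HOME/.ssh/nimble_monitor" (assumed path)
    if [ -e "$keyfile" ]; then
        echo "refusing to overwrite $keyfile" >&2
        return 1
    fi
    # -N "" sets an empty passphrase, -q keeps ssh-keygen quiet
    ssh-keygen -q -t rsa -b 2048 -N "" -f "$keyfile" || return 1
    chmod 600 "$keyfile"
    echo "public key to import on the array: $keyfile.pub"
}
```

After make_monitor_key "$HOME/.ssh/nimble_monitor", the cron entries can point ssh at the key with -i "$HOME/.ssh/nimble_monitor" and add -o BatchMode=yes so that ssh fails instead of waiting for a password prompt.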

 

To make this work, we have to go through a few steps, so let me begin with the first:

 

- Step 1, getting the data from the Nimble device into the /tmp/ directory of the Nagios server.

We never want monitoring to cause too much load on a device, so use the crontab of your Nagios server to do the following:

 

#I use the uptime command below to determine the current system load of the Nimble device

*/1 * * * * sleep 1 && ssh admin@nimble01 uptime |cut -d : -f5|cut -d " " -f2-8|cut -d "," -f1 >/tmp/nimble01-load

#commands below return the volume info of, in our case, the volumes homedirs and novell-data

*/5 * * * * sleep 5 && ssh admin@nimble01 vol --info homedirs >/tmp/nimble01-homedirs
*/5 * * * * sleep 8 && ssh admin@nimble01 vol --info novell-data >/tmp/nimble01-novell

#command below returns info about general volume usage and compression

*/5 * * * * sleep 32 && ssh admin@nimble01 array --info >/tmp/nimble01-array

#command below returns info about the physical disks in the nimble, note the tail -16, as we only have 16 disks in our nimble device!

*/5 * * * * sleep 37 && ssh admin@nimble01 disk --list |tail -16 >/tmp/nimble01-disk

#commands below return info about the SSD cache disks in the nimble, in our case, we have 4

*/5 * * * * sleep 39 && ssh admin@nimble01 disk --info 7 >/tmp/nimble01-ssd1
*/5 * * * * sleep 41 && ssh admin@nimble01 disk --info 8 >/tmp/nimble01-ssd2
*/5 * * * * sleep 43 && ssh admin@nimble01 disk --info 9 >/tmp/nimble01-ssd3
*/5 * * * * sleep 45 && ssh admin@nimble01 disk --info 10 >/tmp/nimble01-ssd4

All the data collected is put in individual files in /tmp on your Nagios server, and we can use those files in the next step:
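One possible refinement of the cron lines above: a plain > redirect empties the target file before ssh has produced anything, so a check that runs at that moment (or after a failed ssh call) sees an empty file. A minimal sketch of a wrapper that only replaces the file when the command succeeded is shown here; fetch_to_file is a made-up name and not part of the attached scripts.

```shell
#!/bin/sh
# Hypothetical helper: run a command and store its output atomically.
# Usage: fetch_to_file TARGET COMMAND [ARGS...]
fetch_to_file() {
    target=$1
    shift
    # Temporary file next to the target, so mv is a rename on the same filesystem
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    if "$@" >"$tmp" 2>/dev/null && [ -s "$tmp" ]; then
        chmod 644 "$tmp"     # mktemp creates mode 600; keep it readable for the nagios user
        mv "$tmp" "$target"
    else
        rm -f "$tmp"         # keep the previous data file untouched
        return 1
    fi
}
```

In a small script called from cron this would be used as, for example, fetch_to_file /tmp/nimble01-array ssh admin@nimble01 array --info.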

 

- Step 2, making scripts in Nagios to process these data files.

 

The scripts I created are attached to this post, and I will give some short info on where to place them.

- Put the scripts in your <nagios-installation-directory>/libexec (you should find other nagios scripts here)

- Make sure the scripts are executable by user nagios (if you run Nagios as user nagios, that is): chown nagios <scriptname>, and chmod +x <scriptname>
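Because the check scripts only read files from /tmp, they would keep reporting old values if the cron jobs ever stopped working. The attached scripts may or may not guard against that; a minimal sketch of such a guard is shown below. The name is_stale is made up, and find -mmin is assumed to be available (it is in GNU and BSD find).

```shell
#!/bin/sh
# Hypothetical guard: succeed (exit 0) when a data file is missing
# or was last modified more than MAXAGE minutes ago.
# Usage: is_stale FILE MAXAGE
is_stale() {
    [ -e "$1" ] || return 0
    # find prints the file name only if it is older than $2 minutes
    [ -n "$(find "$1" -mmin +"$2" 2>/dev/null)" ]
}
```

A check script could start with something like: if is_stale /tmp/nimble01-array 15; then echo "UNKNOWN: data file is stale"; exit 3; fi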

 

Script names:

1, check_nimble_array

  --Syntax: ./check_nimble_array nimble01-array volusage 2000 3000

    This will return something like: MB used: 2750204 Free: 4921073 MB  |B=2883797909504;2097152000;3145728000;; & 'total'=8116998504448;;;;

    Flags explanation:

          nimble01-array: is the name of the file you saved in /tmp through your crontab (read above section, step 1)

          volusage: the flag we use because we want to display the general volume usage

          2000: Warning value in MB

          3000: Critical value in MB

     Possible flags:

          volusage, volcompress (for general volume compression ratio), snapusage, and snapcompress (for general snapshot compression ratio)
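The numbers in the sample output above line up if 1 MB is taken as 1048576 bytes: the thresholds 2000 and 3000 MB become 2097152000 and 3145728000 bytes in the performance data. A small sketch of that conversion follows; the function names are made up, and this is only an illustration of the arithmetic, not the attached script. It assumes a shell with 64-bit arithmetic.

```shell
#!/bin/sh
# Sketch: build the byte-based performance data from values given in MB.
# 1 MB = 1048576 bytes, which matches the sample output above.
mb_to_bytes() {
    echo $(( $1 * 1048576 ))
}

usage_perfdata() {
    # $1 = used MB, $2 = warning MB, $3 = critical MB
    echo "B=$(mb_to_bytes "$1");$(mb_to_bytes "$2");$(mb_to_bytes "$3");;"
}
```

Calling usage_perfdata 2750204 2000 3000 prints B=2883797909504;2097152000;3145728000;; which is the first part of the performance data shown in the example.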

 

2, check_nimble_disks

     --Syntax: ./check_nimble_disks nimble01-disk 3

     This will return something like: 3 WD-WCAW34572760 HDD 1000.20 in use okay AC-103153 B.0

     If the status of the disk is "okay" or "spare", it returns an OK state to Nagios; otherwise it returns a CRITICAL state.

     Flags explanation:

          nimble01-disk: is the name of the file you saved in /tmp through your crontab (read above section, step 1)

          3: the number of the disk, in our case, can be 1 to 16
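To illustrate the logic described above, here is a minimal sketch of such a disk check. It is not the attached script: check_disk is a made-up name, and it assumes each line of the data file starts with the disk number and contains the status as a separate word, as in the sample line above.

```shell
#!/bin/sh
# Sketch of a disk check; exit codes follow the Nagios plugin
# convention (0 = OK, 2 = CRITICAL, 3 = UNKNOWN).
# Usage: check_disk DATAFILE DISKNUMBER
check_disk() {
    file=$1
    slot=$2
    # Pick the first line whose first column equals the disk number
    line=$(awk -v s="$slot" '$1 == s { print; exit }' "$file" 2>/dev/null)
    if [ -z "$line" ]; then
        echo "UNKNOWN: disk $slot not found in $file"
        return 3
    fi
    echo "$line"
    # OK only when "okay" or "spare" appears as a whole word
    case " $line " in
        *" okay "*|*" spare "*) return 0 ;;
        *) return 2 ;;
    esac
}
```

A real plugin would end with exit instead of return; the function form just makes it easy to try out against a copy of /tmp/nimble01-disk.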

 

3, check_nimble_load

     --Syntax: ./check_nimble_load nimble01-load 1 2

     This will return something like: System load: 0.44 |load=0.44;1;2;;

     The data after the | is performance data; if you have something like pnp4nagios, this performance data will be put in a nice graph.

     Flags explanation:

     nimble01-load: is the name of the file you saved in /tmp through your crontab (read above section, step 1)

     1: if load passes 1, return warn state

     2: if load passes 2, return CRIT state
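As an illustration of how a check like this can be put together, here is a minimal sketch. It is not the attached script: check_load is a made-up name, and it assumes the data file contains just the load figure, which is what the cut pipeline in step 1 is meant to produce. The performance data follows the usual label=value;warn;crit;min;max layout.

```shell
#!/bin/sh
# Sketch of a load check; exit codes follow the Nagios plugin
# convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).
# Usage: check_load DATAFILE WARN CRIT
check_load() {
    file=$1
    warn=$2
    crit=$3
    [ -r "$file" ] || { echo "UNKNOWN: cannot read $file"; return 3; }
    load=$(tr -d '[:space:]' <"$file")
    case $load in
        ''|*[!0-9.]*) echo "UNKNOWN: unexpected content in $file"; return 3 ;;
    esac
    # The shell cannot compare decimals, so let awk do it
    state=$(awk -v l="$load" -v w="$warn" -v c="$crit" \
        'BEGIN { if (l + 0 > c + 0) print 2; else if (l + 0 > w + 0) print 1; else print 0 }')
    echo "System load: $load |load=$load;$warn;$crit;;"
    return "$state"
}
```

With a data file containing 0.44, check_load /tmp/nimble01-load 1 2 prints System load: 0.44 |load=0.44;1;2;; and returns 0.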

 

4, check_nimble_ssd

     --Syntax: ./check_nimble_ssd nimble01-ssd3 25 10

     This will return something like: ok Wearout percentage: 100%  |wo=100;25;10;; & 'total'=100;;;;

     (To be honest, I'm currently not sure whether this works properly: our Nimble is a couple of weeks old and the disks are brand new, so it's to be expected that wearout is 100% OK. Correct me if I'm wrong.)

     Flags explanation:

     nimble01-ssd3: is the name of the file you saved in /tmp through your crontab (read above section, step 1), and ssd3 is the identifier of SSD 3 in the filename!

     25: if it gets lower than 25 %, it returns warn state

     10: if it gets lower than 10 %, it returns CRIT state
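Note that the thresholds work the other way round here compared to the load check: a lower value is worse. A minimal sketch of just that comparison is shown below. wear_state is a made-up name, the output is simplified compared to the sample above, and getting the percentage out of the disk --info output is left out because its format is not shown in this post.

```shell
#!/bin/sh
# Sketch: map a remaining-life percentage to a Nagios state.
# Lower is worse, so the comparisons are "less than".
# Usage: wear_state REMAINING_PERCENT WARN CRIT
wear_state() {
    case $1 in
        ''|*[!0-9]*) echo "UNKNOWN: '$1' is not a whole percentage"; return 3 ;;
    esac
    if [ "$1" -lt "$3" ]; then
        rc=2    # below the critical threshold
    elif [ "$1" -lt "$2" ]; then
        rc=1    # below the warning threshold
    else
        rc=0
    fi
    echo "Wearout percentage: $1% |wo=$1;$2;$3;;"
    return "$rc"
}
```

For a brand new disk, wear_state 100 25 10 prints Wearout percentage: 100% |wo=100;25;10;; and returns 0.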

 

5, check_nimble_volume

     --Syntax: ./check_nimble_volume nimble01-homedirs volusage 1000 2000

     This will return something like: MB used: 1238577 |B=1298742116352;2144047464448;2473901162496;; & 'total'=3298534883328;;;;

     Flags explanation:

     nimble01-homedirs: is the name of the file you saved in /tmp through your crontab (read above section, step 1)

     volusage is the flag that gives us the volume usage of volume homedirs

     1000 and 2000 are respectively the warning and critical values.

     Other flags possible: volusage, volcompress (show compression ratio of this individual volume), snapusage (show space used by snapshots), snapcompress (show snapshot compression ratio)

 

 

- Step 3, making Nagios command definitions:

Locate the commands.cfg file in your Nagios installation directory (by default <nagios>/etc/objects/commands.cfg) and edit it using your favorite editor: vi, of course!

Now we have to define the scripts in this file as Nagios commands. Here they are (it's a lot of text):

Please note that if your Nagios plugins directory (libexec) is NOT /usr/local/nagios/libexec, you need to change the paths in the command definitions below!

 

#./check_nimble_array nimble01-array snapusage 100000 400000
#./check_nimble_array nimble01-array volusage 4039202 4739202
#./check_nimble_array nimble01-array snapcompress
#./check_nimble_array nimble01-array volcompress
define command{
        command_name    check_nimble_array
        command_line    /usr/local/nagios/libexec/check_nimble_array $ARG1$ $ARG2$ $ARG3$ $ARG4$ $ARG5$
        }
define command{
        command_name    check_nimble_compression
        command_line    /usr/local/nagios/libexec/check_nimble_array_c $ARG1$ $ARG2$ $ARG3$ $ARG4$ $ARG5$
        }

#./check_nimble_disks nimble01-disk 16
define command{
        command_name    check_nimble_disks
        command_line    /usr/local/nagios/libexec/check_nimble_disks $ARG1$ $ARG2$
        }

#./check_nimble_volume nimble01-novell-data volusage
#warning and critical values come from the volume QUOTA
define command{
        command_name    check_nimble_volume
        command_line    /usr/local/nagios/libexec/check_nimble_volume $ARG1$ $ARG2$ $ARG3$ $ARG4$ $ARG5$
        }
define command{
        command_name    check_nimble_compression_volume
        command_line    /usr/local/nagios/libexec/check_nimble_volume_c $ARG1$ $ARG2$ $ARG3$ $ARG4$ $ARG5$
        }

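#./check_nimble_ssd nimble01-ssd1 25 10
define command{
        command_name    check_nimble_ssd
        command_line    /usr/local/nagios/libexec/check_nimble_ssd $ARG1$ $ARG2$ $ARG3$
        }

#./check_nimble_load nimble01-load 1 2
#(needed by the "System load" service in step 4)
define command{
        command_name    check_nimble_load
        command_line    /usr/local/nagios/libexec/check_nimble_load $ARG1$ $ARG2$ $ARG3$
        }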


- Step 4, making a Nagios configuration.

This all depends on how you've built your Nagios installation, but to help you out a bit, here's an example that I'm using:


define service{
        use     template-you-generally-use
        host_name                       nimble01.yourdomain.local

        service_description           Volume Usage
        check_command              check_nimble_array!nimble01-array!volusage!4039202!4739202
        check_interval                   1               ; Actively check the service every minute
        retry_interval                     1               ; Schedule service check retries at 1 minute intervals
        max_check_attempts        3                       ; Re-check the service up to 3 times in order to determine its final (hard) state
        }

define service{
        use    template-you-generally-use
        host_name                        nimble01.yourdomain.local
        service_description             Snapshot Usage
        check_command               check_nimble_array!nimble01-array!snapusage!100000!400000
        check_interval                   1               ; Actively check the service every minute
        retry_interval                     1               ; Schedule service check retries at 1 minute intervals
        max_check_attempts        3                       ; Re-check the service up to 3 times in order to determine its final (hard) state
        }
define service{
        use     template-you-generally-use
        host_name                        nimble01.yourdomain.local
        service_description             Volume compression ratio
        check_command                check_nimble_compression!nimble01-array!volcompress
        check_interval                    1               ; Actively check the service every minute
        retry_interval                      1               ; Schedule service check retries at 1 minute intervals
        max_check_attempts         3                       ; Re-check the service up to 3 times in order to determine its final (hard) state
        }
define service{
        use     template-you-generally-use
        host_name                        nimble01.yourdomain.local
        service_description             Snapshot compression ratio
        check_command                check_nimble_compression!nimble01-array!snapcompress
        check_interval                    1               ; Actively check the service every minute
        retry_interval                      1               ; Schedule service check retries at 1 minute intervals
        max_check_attempts         3                       ; Re-check the service up to 3 times in order to determine its final (hard) state
        }
define service{
        use     template-you-generally-use
        host_name                       nimble01.yourdomain.local
        service_description           Disk 01
        check_command              check_nimble_disks!nimble01-disk!1
        check_interval                  25               ; Actively check the service every 25 minutes
        retry_interval                    10               ; Schedule service check retries at 10 minute intervals
        max_check_attempts       3                       ; Re-check the service up to 3 times in order to determine its final (hard) state
        }
define service{
        use     template-you-generally-use
        host_name                       nimble01.yourdomain.local
        service_description            homedirs volume usage
        check_command               check_nimble_volume!nimble01-homedirs!volusage
        check_interval                   1               ; Actively check the service every minute
        retry_interval                      1              ; Schedule service check retries at 1 minute intervals
        max_check_attempts         5                       ; Re-check the service up to 5 times in order to determine its final (hard) state
}
define service{
        use    template-you-generally-use
        host_name                       nimble01.yourdomain.local
        service_description            homedirs compression
        check_command               check_nimble_compression_volume!nimble01-homedirs!volcompress
        check_interval                   1               ; Actively check the service every minute
        retry_interval                     1               ; Schedule service check retries at 1 minute intervals
        max_check_attempts        5                       ; Re-check the service up to 5 times in order to determine its final (hard) state
}
define service{
        use     template-you-generally-use
        host_name                       nimble01.yourdomain.local
        service_description            homedirs snapshot usage
        check_command               check_nimble_volume!nimble01-homedirs!snapusage
        check_interval                   1               ; Actively check the service every minute
        retry_interval                     1               ; Schedule service check retries at 1 minute intervals
        max_check_attempts        5                       ; Re-check the service up to 5 times in order to determine its final (hard) state
}
define service{
        use     template-you-generally-use
        host_name                       nimble01.yourdomain.local
        service_description            homedirs snapshot compression
        check_command               check_nimble_compression_volume!nimble01-homedirs!snapcompress
        check_interval                   1               ; Actively check the service every minute
        retry_interval                     1               ; Schedule service check retries at 1 minute intervals
        max_check_attempts        5                       ; Re-check the service up to 5 times in order to determine its final (hard) state
}
define service{
        use     your-template-name-here
        host_name                       nimble01.yourdomain.local
        service_description            SSD 4 Wearout time phy-disk-10
        check_command               check_nimble_ssd!nimble01-ssd4!25!10
        check_interval                   1               ; Actively check the service every minute
        retry_interval                     1               ; Schedule service check retries at 1 minute intervals
        max_check_attempts        5                       ; Re-check the service up to 5 times in order to determine its final (hard) state
}
define service{
        use     your-template-name-here
        host_name                       nimble01.yourdomain.local
        service_description           System load
        check_command              check_nimble_load!nimble01-load!35!45
        check_interval                  1               ; Actively check the service every minute
        retry_interval                    1               ; Schedule service check retries at 1 minute intervals
        max_check_attempts       5                       ; Re-check the service up to 5 times in order to determine its final (hard) state
}
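After editing commands.cfg and the service definitions, it is worth letting Nagios verify the configuration before reloading it; nagios -v followed by the main config file does that. A small sketch of a wrapper is shown below. The function name, the NAGIOS_BIN variable and both default paths are assumptions for a source install under /usr/local/nagios, so adjust them to your own layout.

```shell
#!/bin/sh
# Sketch: run the Nagios pre-flight check on the main config file.
# NAGIOS_BIN and the default config path are assumed locations.
NAGIOS_BIN=${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}

verify_config() {
    cfg=${1:-/usr/local/nagios/etc/nagios.cfg}
    if "$NAGIOS_BIN" -v "$cfg" >/dev/null 2>&1; then
        echo "config OK: $cfg"
    else
        echo "config check failed, run '$NAGIOS_BIN -v $cfg' for details" >&2
        return 1
    fi
}
```

Only reload or restart Nagios when verify_config succeeds; a missing command definition or a typo in a service block shows up here instead of after the restart.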

 




- Step 5, EXPECT monitoring bugs: these scripts are beta at best and specifically designed for my own use. If you have any problems, questions or remarks, be sure to let me know.


Hoping that somebody finds this useful.

Regards,

Frank Uittenbosch




