Thursday, May 15, 2008

Using Simple Network Management Protocol

The Simple Network Management Protocol (SNMP) is built in to many devices, but often the tools and software that can read and parse this information are too large and complicated when you only want to check a quick statistic or track a particular device or issue. This article looks at some simplified methods for getting SNMP information from your devices and how to integrate this information into the rest of your network's data map.

About this series

The typical UNIX® administrator has a key range of utilities, tricks, and systems he or she uses regularly to aid in the process of administration. There are key utilities, command line chains, and scripts that are used to simplify different processes. Some of these tools come with the operating system, but a majority of the tricks come through years of experience and a desire to ease the system administrator's life. The focus of this series is on getting the most from the available tools across a range of different UNIX environments, including methods of simplifying administration in a heterogeneous environment.



 

SNMP basics

There are many ways you can monitor your UNIX server. See the Resources for some examples of the type of monitoring available. Monitoring a single server is not a problem, but monitoring the same information across a number of servers can present problems. If one of the servers you are in charge of runs out of disk space, you want to know about it before it starts to affect your users and clients.

Monitoring multiple servers in this way, especially if they use a variety of different operating systems, can be a problem. The differences in command line tools, output formats, values, and other information all complicate what should otherwise be a simple process. What is required is a solution that provides a generic interface to the information that works, irrespective of the UNIX variant you are using.

The Simple Network Management Protocol (SNMP) provides a method for managing information about different systems. An agent runs on each system and reports information using SNMP to different managing systems.

SNMP is a built-in component of many network devices such as routers and switches, and is frequently the only method available for retrieving statistics and status information remotely (without logging in to some sort of interface). On most hosts you will need to explicitly run SNMP software to expose information about the host over SNMP.

Information can be retrieved from an agent explicitly, by sending it a GET request, or the agent can send unsolicited notifications to management systems using TRAP or INFORM messages. In addition, managing systems can set information and parameters on the agent, but this is usually only used to change the network configuration.

The types of information that can be shared can be quite varied. It can be everything from network settings, statistics, and metric data for network interfaces, through to monitoring CPU load and disk space.

The SNMP standard does not define what information the agent returns; instead, the available information is defined by Management Information Bases (MIBs). The MIBs define the structure of the information that is returned, and are organized into a hierarchical structure using object identifiers (OIDs). You access information within an agent by requesting data using a specific location within the MIB structure.

For example, some of the more common IDs are shown in Listing 1.


Listing 1. SNMP object IDs
 
                
sysDescr.0      1.3.6.1.2.1.1.1.0
sysObjectID.0   1.3.6.1.2.1.1.2.0
sysUpTime.0     1.3.6.1.2.1.1.3.0
sysContact.0    1.3.6.1.2.1.1.4.0
sysName.0       1.3.6.1.2.1.1.5.0
sysLocation.0   1.3.6.1.2.1.1.6.0
sysServices.0   1.3.6.1.2.1.1.7.0
ifNumber.0      1.3.6.1.2.1.2.1.0

 

You can see from this list that the OIDs are numerical and, effectively, in sequence. When obtaining information you can use a GET request to obtain a specific value, or GETNEXT to get the property that follows the last one you read. You can also use the names. All but the last of the names shown above are part of the system tree (ifNumber belongs to the interfaces tree), so you can read a value by using an OID such as 'system.sysUpTime.0'.
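
The ordering that GETNEXT relies on can be sketched in a few lines of Python. This is only a toy model, not an SNMP implementation: the AGENT dictionary stands in for an agent, the OIDs are taken from Listing 1, and the stored values and the helper names (oid_key, get_next, walk) are assumptions made for illustration.

```python
# Toy model of an agent's data: OID string -> value.
# The OIDs come from Listing 1; the values are illustrative only.
AGENT = {
    "1.3.6.1.2.1.1.1.0": "Linux tweedledum",
    "1.3.6.1.2.1.1.3.0": 34145553,
    "1.3.6.1.2.1.1.4.0": "root@Unknown",
    "1.3.6.1.2.1.1.5.0": "tweedledum",
    "1.3.6.1.2.1.2.1.0": 2,
}


def oid_key(oid):
    """Turn '1.3.6.1' into (1, 3, 6, 1) so that OIDs compare numerically."""
    return tuple(int(part) for part in oid.strip(".").split("."))


def get_next(agent, oid):
    """GETNEXT: return (oid, value) for the first OID ordered after `oid`."""
    following = sorted((o for o in agent if oid_key(o) > oid_key(oid)), key=oid_key)
    if not following:
        return None  # nothing left after this OID
    return following[0], agent[following[0]]


def walk(agent, root):
    """Repeat GETNEXT from `root` until the result leaves the subtree."""
    prefix = oid_key(root)
    results = []
    current = root
    while True:
        step = get_next(agent, current)
        if step is None or oid_key(step[0])[:len(prefix)] != prefix:
            return results
        results.append(step)
        current = step[0]


print(get_next(AGENT, "1.3.6.1.2.1.1.1.0"))
print(len(walk(AGENT, "1.3.6.1.2.1.1")))
```

The OIDs are compared as tuples of integers rather than as strings, because a plain string comparison would sort a sub-identifier of 10 before 2.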

The values that you read are also of specific types. You can read integer, counter, and string values, among others, that are all defined as 'scalar' objects. Some of these types carry specific significance. For example, time interval values are reported as 'timeticks,' or hundredths of a second. These values need to be converted into a more readable human form before being displayed. There are also MIB objects that return tabular data. This is handled by returning additional OID instances that can be grouped together to make an SNMP table.
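
As a minimal sketch of the timeticks conversion, the following Python function turns a raw value into the days, hours:minutes:seconds form. The function name format_timeticks is an assumption for illustration, and the output format simply imitates the one the command line tools print in the listings later in this article.

```python
def format_timeticks(ticks):
    """Convert SNMP timeticks (hundredths of a second) to readable text."""
    seconds, hundredths = divmod(ticks, 100)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    clock = "%d:%02d:%02d.%02d" % (hours, minutes, seconds, hundredths)
    if days == 0:
        return clock
    return "%d day%s, %s" % (days, "" if days == 1 else "s", clock)


print(format_timeticks(34145553))  # 3 days, 22:50:55.53
print(format_timeticks(867411))    # 2:24:34.11
```

The two sample values are the raw timeticks that appear in Listing 2 and Listing 4.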

From a security perspective, SNMP agents can be associated with a specific community, and managing systems access information by using the community as a method of validating their access to the agent. In Version 1 of the SNMP standard, the community string was the only method of securing or restricting access. With Version 2 of the SNMP standard, the security was improved, but could be complex to handle. With Version 3, considered the current version since 2004, the standard was improved with explicit authentication and access control systems.



 

Getting SNMP statistics

There are many different ways of obtaining information from SNMP systems, including using professional management tools, programming interfaces, and command line tools.

Of the command line tools, probably the best known and easiest to use is the snmpwalk command, which is part of a larger suite of SNMP tools that allow you to obtain information from SNMP agents directly from the command line. This command will walk the entire subtree of a given management value and return all the information about the system contained within the subtree.

For example, Listing 2 shows the output when querying a local system for all the information within the 'system' tree.


Listing 2. 'Walking' an SNMP tree
 
                
$ snmpwalk -Os -c MCSLP -v 1 localhost system
sysDescr.0 = STRING: Linux tweedledum 2.6.23-gentoo-r8 
             #1 SMP Tue Feb 12 16:32:14 GMT 2008 x86_64
sysObjectID.0 = OID: netSnmpAgentOIDs.10
sysUpTimeInstance = Timeticks: (34145553) 3 days, 22:50:55.53
sysContact.0 = STRING: root@Unknown
sysName.0 = STRING: tweedledum
sysLocation.0 = STRING: serverroom
sysORLastChange.0 = Timeticks: (0) 0:00:00.00
sysORID.1 = OID: snmpFrameworkMIBCompliance
sysORID.2 = OID: snmpMPDCompliance
sysORID.3 = OID: usmMIBCompliance
sysORID.4 = OID: snmpMIB
sysORID.5 = OID: tcpMIB
sysORID.6 = OID: ip
sysORID.7 = OID: udpMIB
sysORID.8 = OID: vacmBasicGroup
sysORDescr.1 = STRING: The SNMP Management Architecture MIB.
sysORDescr.2 = STRING: The MIB for Message Processing and Dispatching.
sysORDescr.3 = STRING: The management information definitions for 
                                    the SNMP User-based Security Model.
sysORDescr.4 = STRING: The MIB module for SNMPv2 entities
sysORDescr.5 = STRING: The MIB module for managing TCP implementations
sysORDescr.6 = STRING: The MIB module for managing IP and ICMP implementations
sysORDescr.7 = STRING: The MIB module for managing UDP implementations
sysORDescr.8 = STRING: View-based Access Control Model for SNMP.
sysORUpTime.1 = Timeticks: (0) 0:00:00.00
sysORUpTime.2 = Timeticks: (0) 0:00:00.00
sysORUpTime.3 = Timeticks: (0) 0:00:00.00
sysORUpTime.4 = Timeticks: (0) 0:00:00.00
sysORUpTime.5 = Timeticks: (0) 0:00:00.00
sysORUpTime.6 = Timeticks: (0) 0:00:00.00
sysORUpTime.7 = Timeticks: (0) 0:00:00.00
sysORUpTime.8 = Timeticks: (0) 0:00:00.00

 

You can see here a range of information about the host, including the operating system (in sysDescr.0), the amount of time that the system has been available (sysUpTimeInstance), and the location of the machine. The interval time here is shown in both its original value (timeticks) and the converted, human-readable days, hours:minutes:seconds.
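
If you want to post-process this kind of output, the 'name = TYPE: value' layout is regular enough to parse. The following Python sketch assumes that layout and treats any line without it as a wrapped continuation of the previous value (as in the sysDescr.0 entry above). The parse_walk name and the regular expression are illustrative assumptions, and other output formats of the tool are not handled.

```python
import re

# One result per line: name, then ' = ', then a type label, a colon and the value.
LINE = re.compile(r"^(?P<name>\S+) = (?P<type>[A-Za-z0-9-]+): ?(?P<value>.*)$")


def parse_walk(text):
    """Parse 'name = TYPE: value' lines into a list of (name, type, value)."""
    rows = []
    for raw in text.splitlines():
        match = LINE.match(raw)
        if match:
            rows.append([match["name"], match["type"], match["value"].strip()])
        elif raw.strip() and rows:
            # No 'name = TYPE:' prefix: treat it as a wrapped continuation line.
            rows[-1][2] = (rows[-1][2] + " " + raw.strip()).strip()
    return [tuple(row) for row in rows]


SAMPLE = """\
sysDescr.0 = STRING: Linux tweedledum 2.6.23-gentoo-r8
             #1 SMP Tue Feb 12 16:32:14 GMT 2008 x86_64
sysUpTimeInstance = Timeticks: (34145553) 3 days, 22:50:55.53
sysName.0 = STRING: tweedledum
sysLocation.0 = STRING: serverroom
"""

for name, kind, value in parse_walk(SAMPLE):
    print("%-18s %-10s %s" % (name, kind, value))
```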

The uptime or availability of a machine is a very common use for SNMP, as it provides probably the most convenient and efficient method for determining whether a machine is up and processing requests. Other solutions that have been described in past parts of the series include ping or using rwho and ruptime. These latter two solutions are very CPU and network intensive and not very friendly in terms of their resource utilization.

Note, however, the limitation of the uptime described here: the information shown is the uptime of the SNMP agent, not the uptime of the entire machine. In most situations the two are the same, especially for devices with built-in SNMP monitoring, such as network routers and switches. For computers that expose their status through SNMP, there may be a discrepancy between system and SNMP agent uptime.

You can get a quicker idea of the status of a machine through SNMP using snmpstatus. This obtains a number of data points from a specified SNMP agent, including the IP address, description, uptime, and network statistics (interface packets received/transmitted, and IP packets received/transmitted). For example, Listing 3 shows the simplified information for a Solaris host.


Listing 3. Simplified information
 
                
$ snmpstatus -v1 -c public t1000
[192.168.0.26]=>[SunOS t1000 5.11 snv_81 sun4v] Up: 2:12:10.20
Interfaces: 4, Recv/Trans packets: 643/160 | IP: 456/60
2 interfaces are down!

 

This machine has recently been rebooted (hence the low uptime and packet statistics). The snmpstatus command has also determined that two of the interfaces on the machine (which has four Ethernet ports) are down. This is a good example of the sort of warning information that SNMP can provide to help notify you of an issue that requires further investigation.

For obtaining a specific piece of information, you can use the snmpget command, which reads one or more OIDs directly and reports their value. For special types, it will also convert to a human-readable format. For example, to get the system uptime and contact, use the command shown in Listing 4.


Listing 4. Getting system uptime and contact information
 
                
$ snmpget -v1 -c public t1000 system.sysUpTime.0 system.sysContact.0
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (867411) 2:24:34.11
SNMPv2-MIB::sysContact.0 = STRING: "System administrator"

 

In isolation, all of these methods are useful, but in reality, you need to be able to monitor and track multiple machines and multiple OIDs to get a full picture of what is going on. We can do this by using one of the many programmable interfaces to SNMP.
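
Short of a full programming interface, the command line tools themselves can be scripted. The following Python sketch wraps snmpget with the standard subprocess module. It assumes the snmpget command from the suite used above is on the PATH and that each value fits on one line; the function names, the 10 second timeout, and the default community of public are illustrative assumptions.

```python
import subprocess


def snmpget_command(host, oids, community="public", version="1"):
    """Build the argument list for one snmpget call against one host."""
    # -Oqv asks the tool to print only the values, one per line.
    return ["snmpget", "-v", version, "-c", community, "-Oqv", host] + list(oids)


def poll(hosts, oids, **options):
    """Return {host: {oid: value}}; hosts that fail map to None."""
    results = {}
    for host in hosts:
        command = snmpget_command(host, oids, **options)
        try:
            done = subprocess.run(command, capture_output=True, text=True,
                                  timeout=10, check=True)
        except (OSError, subprocess.SubprocessError):
            # Covers a missing snmpget binary, a non-zero exit, and a timeout.
            results[host] = None
            continue
        results[host] = dict(zip(oids, done.stdout.splitlines()))
    return results


# Example call (needs reachable agents, so it is left commented out):
# print(poll(["tweedledum", "t1000"],
#            ["system.sysUpTime.0", "system.sysContact.0"]))
```

Starting one process per host is simple but slow for many hosts, which is one reason to move to a native interface.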



 

Getting SNMP data programmatically

The Net::SNMP module for Perl obtains information from one or more agents using SNMP. Similar interfaces are available for other languages, including Python, Ruby, and PHP (see Resources). You start by creating a session that communicates (and, if necessary, authenticates) with the SNMP agent on the desired host. Once you have an active and valid session, you can request data from the agent directly for one or more OIDs. The information is returned as a reference to a hash that maps each OID to its corresponding value.

Listing 5 shows a very simple script that will obtain the system uptime for each of the hosts supplied on the command line.


Listing 5. Getting a single SNMP agent property with Perl and Net::SNMP
 
                
#! /usr/local/bin/perl

use strict;
use warnings;

use Net::SNMP;

# Numerical OID for system.sysUpTime.0
my $uptimeOID = '1.3.6.1.2.1.1.3.0';

foreach my $host (@ARGV)
{
    # Open a session to the SNMP agent on this host
    my ($session, $error) = Net::SNMP->session(
        -hostname  =>  $host,
        -community => 'public',
        -port      => 161
        );

    # Without a session there is nothing to query, so skip this host
    if (!defined($session))
    {
        warn ("ERROR for $host: $error\n");
        next;
    }

    my $result = $session->get_request(
        -varbindlist => [$uptimeOID]
        );

    if (!defined($result))
    {
        warn ("ERROR: " . $session->error . "\n");
    }
    else
    {
        # The result is a hash reference keyed by OID
        printf("Uptime for %s: %s\n",$host, $result->{$uptimeOID});
    }

    $session->close;
}

 

In the script, we've provided the full numerical OID for the system.sysUpTime property. You have to supply the list of OIDs to obtain when using the get_request() method as a reference to an array, and then pull the information back out from the hash that is returned. In Listing 5 we build the array reference dynamically during the call, and then use the OID as the hash key when printing out the result.

Using the script, we can get a list of the uptimes for each host supplied on the command line (see Listing 6).


Listing 6. List of uptimes for each host
 
                
$ perl uptime.pl tweedledum t1000
Uptime for tweedledum: 4 minutes, 52.52
Uptime for t1000: 6 minutes, 26.12

 

Of course, watching this information manually is hardly efficient.


 

Tracking SNMP data over time

Viewing a single instance of an SNMP OID property at one time is not always very useful. Often you want to monitor something over time (for example, availability), or you want to monitor for changes in particular values. A good example is disk space. SNMP can be configured to record all sorts of information, and disk space is commonly monitored so that you can identify not only when the free space reaches a particular level, but also when there is a significant change in it, which might signify a problem.

For example, Listing 7 shows a callback-based solution to constantly monitor the diskspace. In the script, we output a running total, but it could be configured to only output the warning message that is triggered when there is a reduction in the diskspace.


Listing 7. Getting a running view of SNMP properties
 
                
#! /usr/local/bin/perl

use strict;
use warnings;
use Net::SNMP qw(snmp_dispatcher);

my $diskspaceOID = '1.3.6.1.4.1.2021.9.1.7.1';

foreach my $host (@ARGV)
{
    my ($session, $error) = Net::SNMP->session(
        -hostname    => $host,
        -nonblocking => 0x1,
        );

    if (!defined($session))
    {
        warn "ERROR: $host produced $error - not monitoring\n";
    }
    else
    {
        my ($last_poll) = (0);

        $session->get_request(
            -varbindlist => [$diskspaceOID],
            -callback    => [
                 \&diskspace_cb, \$last_poll
            ]
            );
    }
}

snmp_dispatcher();

exit 0;

sub diskspace_cb
{
    my ($session, $last_poll) = @_;

    if (!defined($session->var_bind_list))
    {
        printf("%-15s  ERROR: %s\n", $session->hostname, $session->error);
    }
    else
    {
        my $space = $session->var_bind_list->{$diskspaceOID};

        if ($space < ${$last_poll})
        {
            my $diff = ((${$last_poll}-$space)/${$last_poll})*100;
            printf("WARNING: %s has lost %0.2f%% diskspace)\n",
                   $session->hostname,$diff);
        }

        printf("%-15s  Ok (%s)\n",
               $session->hostname,
               $space
               );

        ${$last_poll} = $space;
    }

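    # Queue the next poll in 60 seconds. No -callback is given, so the
    # original callback and its arguments are reused.
    $session->get_request(
        -delay       => 60,
        -varbindlist => [$diskspaceOID]
        );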
}

 

The script is in two parts, and uses some functionality within the Net::SNMP module that allows you to call a function when an SNMP value is obtained from a host, coupled with the ability to continually monitor hosts and SNMP objects in a simple, but efficient, loop.

The first part sets up each host to monitor the information. We are only monitoring one piece of information, but we could monitor others as part of the solution. The object is configured as 'non-blocking,' so that each request is queued rather than sent immediately, and the script does not wait on a host that cannot be reached but simply moves on to the next host. Finally, in the call to get_request(), we submit the callback information. The first argument here is a reference to the function to be called when the response is received from the agent. The second is an argument that will be supplied to the function when it is called. Once every host has been set up, the call to snmp_dispatcher() starts the event loop that sends the queued requests and invokes the callbacks.

We'll use this argument to be able to record and track the previous value returned by the SNMP call. Within the callback function, we compare the newly returned value and the previous value. If there's a reduction, we calculate the percentage reduction and then report a warning.

The final part of the callback is to specify that another retrieval should occur, here specifying that the next retrieval should be delayed by 60 seconds. The existing callback information is retained. In effect, the script obtains the value from the SNMP agent, calls the callback function, which then queues up another retrieval in the future. Because the same callback is already defined, the process repeats in an endless loop.
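
The comparison at the heart of the callback can be isolated from the SNMP plumbing. The following Python sketch keeps the previous reading in a closure and reports the percentage lost. The make_tracker name is an assumption for illustration, and the readings are the free space values that appear in the sample run in Listing 10.

```python
def make_tracker():
    """Return a function that remembers the previous reading for one host."""
    last = {"value": None}

    def check(space):
        """Return the percentage lost since the previous reading, or None."""
        previous = last["value"]
        last["value"] = space
        if previous is None or previous <= 0 or space >= previous:
            return None  # first reading, or no reduction
        return (previous - space) / previous * 100

    return check


check = make_tracker()
for reading in (50319024, 48976392, 48166292, 48166292):
    lost = check(reading)
    if lost is not None:
        print("WARNING: lost %.2f%% diskspace" % lost)
    print("Ok (%d)" % reading)
```

Creating one tracker per host plays the same role as the separate $last_poll variable that each session is given in Listing 7.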

Incidentally, the script uses the dskAvail OID value, and calculates the percentage difference based on the last and new values. The dskTable tree that this property is part of actually has a disk percentage property that we could have queried, instead of calculating it manually. However, the value returned is probably not finely grained enough to be useful.

You can see this property and current values by using snmpwalk to output the dskTable tree, which itself is part of the UCD MIB (Listing 8).


Listing 8. Getting a dump of available MIB data
 
                
$ snmpwalk -v 1 localhost -c public UCD-SNMP-MIB::dskTable
UCD-SNMP-MIB::dskIndex.1 = INTEGER: 1
UCD-SNMP-MIB::dskPath.1 = STRING: /
UCD-SNMP-MIB::dskDevice.1 = STRING: /dev/sda3
UCD-SNMP-MIB::dskMinimum.1 = INTEGER: 100000
UCD-SNMP-MIB::dskMinPercent.1 = INTEGER: -1
UCD-SNMP-MIB::dskTotal.1 = INTEGER: 72793272
UCD-SNMP-MIB::dskAvail.1 = INTEGER: 62024000
UCD-SNMP-MIB::dskUsed.1 = INTEGER: 7071512
UCD-SNMP-MIB::dskPercent.1 = INTEGER: 10
UCD-SNMP-MIB::dskPercentNode.1 = INTEGER: 3
UCD-SNMP-MIB::dskErrorFlag.1 = INTEGER: noError(0)
UCD-SNMP-MIB::dskErrorMsg.1 = STRING:
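
Because every column of the table ends in the same row index, output like Listing 8 can be regrouped into rows. The following Python sketch does this for single-index tables and then derives a finer-grained usage figure from dskUsed and dskAvail. The table_rows name is an assumption, and computing the percentage as used / (used + available) is one possible definition; whether the agent calculates dskPercent the same way is not confirmed here.

```python
from collections import defaultdict


def table_rows(lines):
    """Group 'MIB::column.index = TYPE: value' lines into {index: {column: value}}."""
    rows = defaultdict(dict)
    for line in lines:
        if " = " not in line:
            continue
        name, _, rest = line.partition(" = ")
        # Drop the 'MIB::' prefix, then split the column name from the row index.
        column, _, index = name.rpartition("::")[2].rpartition(".")
        value = rest.partition(":")[2].strip()
        rows[int(index)][column] = value
    return dict(rows)


SAMPLE = [
    "UCD-SNMP-MIB::dskPath.1 = STRING: /",
    "UCD-SNMP-MIB::dskTotal.1 = INTEGER: 72793272",
    "UCD-SNMP-MIB::dskAvail.1 = INTEGER: 62024000",
    "UCD-SNMP-MIB::dskUsed.1 = INTEGER: 7071512",
]

for index, row in table_rows(SAMPLE).items():
    used, avail = int(row["dskUsed"]), int(row["dskAvail"])
    print("%s: %.2f%% used" % (row["dskPath"], used / (used + avail) * 100))
```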

 

To find the property in the first place, you can dump all the known properties by using snmptranslate. By filtering the output with grep you can see the information you want; for example, run snmptranslate -Ts | grep dsk at the shell prompt.

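To get the numerical OID for a name, use snmptranslate and provide the name with the -On option (see Listing 9). The script in Listing 7 appends .1 to this OID to select the first row of the table (dskAvail.1).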


Listing 9. Using snmptranslate
 
                
$ snmptranslate -On UCD-SNMP-MIB::dskAvail 
.1.3.6.1.4.1.2021.9.1.7

 

Running the script, we get a running commentary (and warnings) for the disk space usage on the specified host. See Listing 10.


Listing 10. Monitoring disk space automatically
 
                
$ perl diskspace-auto.pl tweedledum
tweedledum       Ok (50319024)
WARNING: tweedledum has lost 2.67% diskspace)
tweedledum       Ok (48976392)
WARNING: tweedledum has lost 1.65% diskspace)
tweedledum       Ok (48166292)
tweedledum       Ok (48166292)
tweedledum       Ok (48166292)
tweedledum       Ok (48166292)

 

You can see from this output that we have lost some significant space out of the space available on this disk on the specified host. To monitor more hosts, just add more hostnames on the command line.



 

Publishing information through an SNMP agent

The Net-SNMP package, which provides the command line tools used above, includes a daemon, snmpd, which can be configured to expose a variety of information using SNMP. The configuration for the information to be exposed is controlled using the /etc/snmpd.conf file (the exact location varies between systems).

For example, Listing 11 shows the snmpd.conf file on the host used in the earlier examples in this article.


Listing 11. Sample snmpd.conf file
 
                
syslocation  serverroom
proc  imapd 20 10
disk  / 100000
load  5 10 10

 

Each of these lines populates different information. In the example, we set the location of the machine, and then configure some specific items to monitor.

The proc section monitors a specific process, shown here as a monitor for the IMAP daemons for a mail service. The numbers following the option specify the maximum number of processes allowed to be running, and the minimum number that should be running. You can use this to make sure that a particular service is running, and that you haven't exceeded a capacity that might indicate a fault. When the process count goes above the MAX value or below the MIN value, the agent sets an error flag for the process in the UCD MIB (prErrorFlag), which managing systems can poll and which can be used as the basis for an SNMP trap.

For the disk, you specify the path to the directory to be monitored and the minimum size (in kilobytes) that the disk should have free. Again, an error flag (dskErrorFlag, shown in Listing 8) is set if the free disk space dips below this value.

Finally, the load information sets the maximum load averages for 1, 5, and 15 minutes. These are the same figures that the uptime command reports as load averages for these intervals. Like the other configured limits, an error is flagged when these limits are exceeded.
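
To make the meaning of the numbers in Listing 11 concrete, the following Python sketch reads the three directives and checks a process count against the configured range. It only illustrates how the thresholds are interpreted. It handles just the argument forms used in Listing 11 (for example, not a percentage for disk), and the function names are assumptions, not part of any SNMP package.

```python
def parse_snmpd_conf(text):
    """Extract the proc, disk and load thresholds from snmpd.conf-style text."""
    limits = {"proc": {}, "disk": {}, "load": None}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on spaces
        if not fields:
            continue
        directive, args = fields[0], fields[1:]
        if directive == "proc" and len(args) == 3:
            limits["proc"][args[0]] = {"max": int(args[1]), "min": int(args[2])}
        elif directive == "disk" and len(args) == 2:
            limits["disk"][args[0]] = int(args[1])          # minimum free, in KB
        elif directive == "load" and len(args) == 3:
            limits["load"] = tuple(float(a) for a in args)  # 1, 5, 15 minutes
    return limits


def proc_ok(limits, name, count):
    """True if `count` running processes is within the configured range."""
    rule = limits["proc"][name]
    return rule["min"] <= count <= rule["max"]


CONF = """\
syslocation  serverroom
proc  imapd 20 10
disk  / 100000
load  5 10 10
"""

limits = parse_snmpd_conf(CONF)
print(limits)
print(proc_ok(limits, "imapd", 12), proc_ok(limits, "imapd", 25))
```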

Manually setting this information is not difficult, but also not ideal. A simple menu-based solution, snmpconf, is available if you want a more straightforward method of setting the configuration.


 

Summary

Monitoring your servers and devices is a process that can be very complex, especially as the number of devices in your network increases. SNMP is an efficient, and extensible, method for exposing and reporting this information. Because the interface is consistent across all the devices, you can get uptime, network statistics, disk space, and even process monitoring using the same methods across multiple hosts.

In this article we've looked both at the basics of SNMP and also how to read specific values from different hosts. Using the Net::SNMP Perl module we have also examined methods for reading information, using both one-hit and continual monitoring-based solutions. Finally, we examined the methods for configuring additional information to be exposed on a system so that you can customize and monitor the systems you need for your network when using the snmpd daemon.
