Use Monitoring Tools

Monitor a SLES 12 System

Creating baselines is a key task for most administrators.
Information to create a baseline can include the following:

* Boot log information

* Hardware information (/proc/ and /sys/)

* Hardware information (command line utilities)

* System and process information (command line utilities)

* Hard drive usage

As a system administrator, you are probably responsible for documenting and monitoring your systems. Most administrators use a variety of utilities to develop an initial baseline when the system is deployed. This provides a snapshot of how the system was performing at the point right after it was initially installed. Then you create subsequent baselines at regular intervals over time. You compare these baselines against your initial baseline to evaluate performance trends. Analyzing your baselines against your system documentation and your change log can help you identify issues that may be impacting performance.

To develop your system documentation and baselines, you need to evaluate the following questions:

Does the system boot normally?
What is the version number of the kernel?
What services are running on the system?
What is the average system load?

In this objective, you are introduced to SUSE Linux Enterprise Server 12 tools that you can use to answer these questions.

Areas to Monitor

Boot Process
CPU
Memory
Storage
Network

A tool that creates a report about all of these areas is supportconfig.

Monitor the Boot Process

Press Esc during boot
View the content of the kernel ring buffer with dmesg
View log entries with journalctl -b
View the /var/log/boot.log file

When SUSE Linux Enterprise 12 initially starts, you can press Esc to view system boot messages. These boot messages contain a wealth of valuable system information.

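If you want to turn off the splash screen and always view the boot messages, replace splash=silent with splash=none in /boot/grub2/grub.cfg. Because grub.cfg is regenerated by grub2-mkconfig, also make the change in the GRUB_CMDLINE_LINUX_DEFAULT setting in /etc/default/grub so that it persists.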

However, most of the messages scroll by so quickly that they are difficult to read. If an error message were to be displayed, it’s unlikely that you would be able to read it before it scrolled off the screen.

With systemd, these messages are stored in systemd's journal and can be viewed with the journalctl -b command. When rsyslog is started, these messages are also written to the /var/log/messages file.

They are also stored in the kernel ring buffer, but the capacity of the kernel ring buffer is quite limited. Therefore, the oldest entries in the kernel ring buffer are deleted when new entries are added to it. The current content of the kernel ring buffer can be viewed with the dmesg command.
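Because boot messages scroll by quickly, it can help to filter them after the fact. The following is a minimal sketch, not a definitive method: boot_problems and sample_boot_log are hypothetical helper names, the search pattern is an assumption about what counts as a problem, and the sample messages are made up for illustration.

```shell
#!/bin/sh
# boot_problems: hypothetical helper that reads boot messages on stdin and
# keeps only lines that look like problems (the pattern is an assumption).
boot_problems() {
    # -i: ignore case, -E: extended regular expressions
    grep -iE 'error|fail|warn'
}

# Made-up sample input standing in for real boot messages.
sample_boot_log() {
    cat <<'EOF'
[    0.000000] Linux version 3.12.28-4-default
[    1.204511] ata1: SATA link up 3.0 Gbps
[    2.881234] EXT4-fs (sda2): mounted filesystem with ordered data mode
[    3.102938] ACPI Error: Method parse/execution failed
EOF
}

sample_boot_log | boot_problems

# On a live system, the same filter can be fed from the sources named above:
#   dmesg | boot_problems
#   journalctl -b | boot_problems
```

A plain text filter like this only narrows down what to read. It does not replace looking at the surrounding messages for context.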

Get CPU Information

View configuration regarding the CPU

cat /proc/cpuinfo

View information about the host and kernel

uname -a

The /proc/ directory contains a great deal of information about the running SUSE Linux Enterprise Server 12 system, including the hardware information stored in the kernel memory space. The /proc/ directory and all of its subdirectories and files aren’t “real” files. Instead, they are dynamically generated when you access them. However, you can view the contents of the files within /proc/ using standard Linux shell commands such as cat, more, and less. For example, if you enter cat /proc/cpuinfo, output is generated from data stored in kernel memory that displays information such as the CPU model name and cache size. Some of the more commonly used files in /proc/ include the following:

/proc/devices: Displays information about the devices installed in your Linux system
/proc/cpuinfo: Displays processor information
/proc/ioports: Displays information about the I/O ports in your server
/proc/interrupts: Displays information about the IRQ assignments in your Linux system
/proc/dma: Displays information about the DMA (Direct Memory Access) channels used in your Linux system
/proc/bus/pci/devices: Displays information about the PCI (Peripheral Component Interconnect) devices in your Linux system
/proc/scsi/scsi: Displays summary information about the SCSI devices installed in your Linux system
/proc/partitions: Displays information about the disk partitions on your system
/proc/sys/: Contains a series of subdirectories and files that contain kernel variables

The uname command displays information about the host and the currently running kernel.
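For a baseline, the interesting values from /proc/cpuinfo can be pulled out with standard text tools. This is a sketch under stated assumptions: count_cpus, cpu_model, and sample_cpuinfo are hypothetical names, and the sample text only imitates the layout of /proc/cpuinfo.

```shell
#!/bin/sh
# count_cpus: hypothetical helper; counts the "processor" entries in
# /proc/cpuinfo-style text read from stdin.
count_cpus() {
    grep -c '^processor'
}

# cpu_model: hypothetical helper; prints the first "model name" value.
cpu_model() {
    awk -F': ' '/^model name/ { print $2; exit }'
}

# Made-up sample in the layout of /proc/cpuinfo.
sample_cpuinfo() {
    cat <<'EOF'
processor       : 0
model name      : Example CPU @ 2.40GHz
cache size      : 4096 KB

processor       : 1
model name      : Example CPU @ 2.40GHz
cache size      : 4096 KB
EOF
}

sample_cpuinfo | count_cpus
sample_cpuinfo | cpu_model

# On a live system:
#   count_cpus < /proc/cpuinfo
#   cpu_model < /proc/cpuinfo
#   uname -r    # kernel release only, instead of the full uname -a output
```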

Monitor CPU Utilization

View information about the current load, the logged-in users, and the time the system has been running:

uptime

View information regarding processes, memory, CPU utilization, current load:

top
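The load average reported by uptime and top is easier to compare between machines when it is related to the number of CPUs. The sketch below assumes that dividing the load by the CPU count is a useful rule of thumb. load_per_cpu is a hypothetical helper name.

```shell
#!/bin/sh
# load_per_cpu: hypothetical helper.
#   $1: load average (for example, the 1-minute value)
#   $2: number of CPUs
# Prints the load per CPU with two decimal places.
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

load_per_cpu 2.00 4

# On a live system, the first field of /proc/loadavg is the 1-minute load:
#   load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" \
#                "$(grep -c '^processor' /proc/cpuinfo)"
```

A value that stays well above 1 over a longer period suggests that processes are regularly waiting for CPU time or I/O.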

Monitor Memory Utilization

Areas:

* RAM

* Swap

Commands:

free

top

Mem: Contains information about the physical memory:
total: Total amount of available physical memory, in KB. The number is lower than the installed physical memory, because the kernel itself uses a small part of the memory.
used: Amount of memory that is in use by applications and for buffered and cached data.
free: Memory that is not used and available at the moment.
Shared/buffers/cached: More detailed information about how the memory is used.
-/+ buffers/cache: Some of the memory on a Linux system is used to cache data for applications or devices. Parts of this memory can be freed when it is needed for other purposes.

The -/+ buffers/cache line is the buffer-adjusted line. It shows how much memory would be used and how much would be free if the buffers and the cache were released.

Swap: Shows information about the utilization of swap space, including the total, used, and free amounts.

As accessing the hard disk is much slower than accessing physical memory, the performance of the whole system is affected when a lot of swap space has to be used. Usually this happens when there is not enough physical memory to perform the desired functionality of a system. It can also happen if an application requests much more memory than it actually needs.

You can use the top command to find programs that use a lot of memory. By default, top sorts the process list by CPU utilization. By typing F, then n, and pressing Enter, you can change the sort column to memory utilization, so that the largest memory consumers appear at the top of the list. In most versions of top, pressing M (uppercase) switches to the same sort order directly.
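The buffer-adjusted value described above can be recomputed from the Mem: line as free + buffers + cached. The sketch below assumes the free output layout with separate buffers and cached columns, as described in this section. Newer versions of free use a different layout. The helper names and the sample numbers are made up for illustration.

```shell
#!/bin/sh
# mem_reclaimable_kb: hypothetical helper. Reads free output on stdin and
# prints free + buffers + cached from the Mem: line (columns 4, 6, and 7).
mem_reclaimable_kb() {
    awk '/^Mem:/ { print $4 + $6 + $7 }'
}

# swap_used_kb: hypothetical helper. Prints the used column of the Swap: line.
swap_used_kb() {
    awk '/^Swap:/ { print $3 }'
}

# Made-up sample in the older free layout (values in KB).
sample_free() {
    cat <<'EOF'
             total       used       free     shared    buffers     cached
Mem:       2049316     622716    1426600          0      20564     282632
-/+ buffers/cache:     319520    1729796
Swap:      2103292          0    2103292
EOF
}

sample_free | mem_reclaimable_kb
sample_free | swap_used_kb

# On a live system with this layout:
#   free | mem_reclaimable_kb
```

In the sample, the computed value matches the free column of the -/+ buffers/cache line, which is the point of that line.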

Monitor Hard Disks

Monitor individual disks

smartctl

Monitor RAIDs

mdadm

Hard disks fail. It is not so much a matter of if they fail, but when they fail.

The smartctl command (package smartmontools) can be used to view the SMART (Self-Monitoring, Analysis and Reporting Technology) information from a hard disk.

If you activate logwatch (package logwatch), it also includes a brief section with SMART information. Create a /etc/logwatch/conf/logwatch.conf file with the following lines:

    Output = mail
    MailTo = account@example.com
    Print = No

The cron job in /etc/cron.daily/0logwatch checks logfiles and sends a mail to the account above.

mdadm: In the /etc/sysconfig/mdadm file, set the MDADM_MAIL variable to a valid email address, and make sure that mail sent to this address from this machine arrives at its destination.

Activate the monitoring with

systemctl enable mdmonitor

and

systemctl start mdmonitor

If, for instance, a hard disk in a RAID fails, you will be sent a mail.
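In addition to mail notification, the state of software RAID arrays can be checked by hand. The sketch below is an illustration that goes beyond the text above: it assumes the kernel's /proc/mdstat status file, where an underscore in the [UU] status field marks a missing or failed member. degraded_arrays and sample_mdstat are hypothetical names, and the sample is made up.

```shell
#!/bin/sh
# degraded_arrays: hypothetical helper. Reads /proc/mdstat-style text on
# stdin and prints the name of every array whose member status field
# (for example [U_]) contains an underscore.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/           { name = $1 }
        /\[[U_]*_[U_]*\]/       { print name }
    '
}

# Made-up sample in the layout of /proc/mdstat.
sample_mdstat() {
    cat <<'EOF'
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      1048512 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[1](F) sda2[0]
      2096128 blocks super 1.2 [2/1] [U_]

unused devices: <none>
EOF
}

sample_mdstat | degraded_arrays

# On a live system:
#   degraded_arrays < /proc/mdstat
#   smartctl -H /dev/sda    # overall SMART health of one disk (needs root)
```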

Monitor Memory Utilization with vmstat

* Reports virtual memory statistics

* si and so columns show swapping activity:

si (swap in, read) and so (swap out, write), per second


If a lot of used swap memory is displayed in free, this can indicate a performance bottleneck caused by a lack of physical memory. But this is not always the case. Sometimes a lot of memory is copied to the swap partition but is never touched again. The performance of the system is affected only when the swap memory is actually accessed.

You can use the vmstat command to display the activity of swap memory. vmstat 1 lets vmstat repeat its output every second:

    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     0  0      0 1426600  20564 282632    0    0    64    10   35   51  1  1 97  1  0
     0  0      0 1426576  20564 282668    0    0     0     0   37   41  0  0 100  0  0
     0  0      0 1426576  20564 282668    0    0     0     0   27   37  0  0 100  0  0

The si and so columns are of interest in this case. si stands for swap in, which means that data is transferred from the swap space to main memory. so stands for swap out, which means that data is transferred from main memory to the swap space. In the example above, there is no swap activity.

The first line of the output displays the average values since the system was started. The lines that follow show the average values since the last output.
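When vmstat runs for a while, counting the samples that show swap activity is easier than reading every line. The following sketch assumes the column order shown above, with si and so as columns 7 and 8. swap_active_samples and sample_vmstat are hypothetical names, and the sample values are made up.

```shell
#!/bin/sh
# swap_active_samples: hypothetical helper. Reads vmstat output on stdin and
# prints how many samples have a nonzero si (column 7) or so (column 8).
# Header lines are skipped because their first field is not a number.
swap_active_samples() {
    awk '$1 ~ /^[0-9]+$/ && ($7 > 0 || $8 > 0) { n++ } END { print n + 0 }'
}

# Made-up sample: the second data line shows swap-in and swap-out activity.
sample_vmstat() {
    cat <<'EOF'
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0  51200  14266  20564 282632    0    0    64    10   35   51  1  1 97  1  0
 1  2  51800  10120  20100 280100  120   40   480   160  210  340  5  3 60 32  0
 0  0  51800  14576  20564 282668    0    0     0     0   27   37  0  0 100  0  0
EOF
}

sample_vmstat | swap_active_samples

# On a live system, collect ten one-second samples:
#   vmstat 1 10 | swap_active_samples
```

Because the first data line shows averages since boot, a nonzero value there alone does not mean the system is swapping right now.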

Monitor Storage Performance

Rule out system load or RAM problems first

vmstat

bi (blocks in, read) and bo (blocks out, write), per second


In the vmstat output, the bi and bo columns show how much data is read from and written to the disks. A lot of data read from or written to the disk does not necessarily mean that the disk subsystem is too slow. Depending on the disk types and the disk configuration, a disk load that completely stalls one system can be handled easily by another. A performance problem caused by the disk subsystem usually shows up when processes have to wait for data to be delivered from or written to the disk.

Look at per disk I/O statistics

iostat


iostat (package sysstat) can be used to determine the average time a program has to wait for data from the disk.

Every output contains two blocks of information. The first block displays information of the CPU utilization, like top or uptime. The second block shows the information about the requested disk device.

The first output represents the average values since the system was started. All following outputs show the average values since the last update period. The block that displays the device information first shows some details about the amount of data that is read from or written to the device. To find out if the disk subsystem has a performance bottleneck, focus on the following columns of the extended statistics (iostat -x):

await: Average time, in milliseconds, that an application has to wait until its I/O request is performed. It includes the time the request spends in the queue and the time needed to service it.
svctm: Average service time, in milliseconds, that an I/O request needs to be performed. The sysstat documentation warns that this value is unreliable, so treat it with caution.
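To spot devices with a high await value, the column can be located by its header name, because the column position differs between sysstat versions. This is a sketch under stated assumptions: slow_devices and sample_iostat are hypothetical names, the 100 ms threshold is an arbitrary example, and the sample output is made up in the style of iostat -x.

```shell
#!/bin/sh
# slow_devices: hypothetical helper. Reads iostat -x output on stdin and
# prints every device whose await value exceeds the limit given in $1
# (milliseconds). The await column is found by name in the Device header.
slow_devices() {
    awk -v limit="$1" '
        /^Device/ {
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "await") col = i
            next
        }
        col && NF >= col && $col + 0 > limit { print $1 }
    '
}

# Made-up sample in the style of iostat -x.
sample_iostat() {
    cat <<'EOF'
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.00    0.00    1.00    1.00    0.00   97.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.10     2.30    1.20    3.40    20.50    45.60    28.74     0.05   10.50    8.20   11.30   2.10   0.97
sdb               0.00     0.00    0.50    0.10     4.00     0.80    16.00     0.30  250.00  240.00  300.00   9.00   0.54
EOF
}

sample_iostat | slow_devices 100

# On a live system, take three five-second samples:
#   iostat -x 5 3 | slow_devices 100
```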

Monitor the Network

ip -s link show
ethtool -S interface
iptraf
ss (replacement for netstat)

ip -s link show displays a snapshot of the interface statistics, including transmitted, received, and dropped packets.
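The dropped-packet counters in the ip -s link show output can be extracted per interface for a baseline. The sketch below assumes the layout in which each RX: header line is followed by a line of values, with dropped as the fourth value. rx_dropped and sample_ip_output are hypothetical names, and the sample output is made up.

```shell
#!/bin/sh
# rx_dropped: hypothetical helper. Reads "ip -s link show" output on stdin
# and prints "<interface> <dropped RX packets>" for each interface.
rx_dropped() {
    awk '
        /^[0-9]+: / { name = $2; sub(/:$/, "", name) }   # "eth0:" -> "eth0"
        /^ *RX:/    { getline; print name, $4 }          # values follow header
    '
}

# Made-up sample in the style of "ip -s link show".
sample_ip_output() {
    cat <<'EOF'
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    RX: bytes  packets  errors  dropped overrun mcast
    4200       42       0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    4200       42       0       0       0       0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast
    1234567    8901     0       3       0       12
    TX: bytes  packets  errors  dropped carrier collsns
    7654321    6789     0       0       0       0
EOF
}

sample_ip_output | rx_dropped

# On a live system:
#   ip -s link show | rx_dropped
```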

iptraf is an ncurses-based IP LAN monitor that generates various network statistics, including TCP information, UDP counts, ICMP and OSPF information, Ethernet load information, node statistics, IP checksum errors, and others. If the command is issued without any command-line options, the program starts in interactive mode, with the various facilities accessed through the main menu.

GNOME System Monitor and KDE System Guard are graphical programs that show various parameters of system load, including network load.

Supportconfig

Purpose: Resolve support calls faster

* Collect 95% of needed information into one spot

* Teach engineers troubleshooting commands

* Organize information

Team integration
Supportconfig is a bash script owned by the supportutils package that gathers system information

Supportconfig's primary purpose is to resolve calls faster. This is accomplished by collecting as much information as possible the first time, organizing that information into one directory by topics, and including the commands used to gather information.

Needed Information

Saves all files into one directory
Files end in *.txt for default editors across platforms
Archives the files with tar and compresses them with bzip2

* Default file naming convention is nts_<hostname>_<date>_<time>.tbz

* Use -B to control the file name convention

* See supportconfig(8) and supportconfig.conf(5)

Since the primary goal is to resolve calls faster, some information may be duplicated in more than one topical file. As much information as possible is collected the first time around, minimizing repeat requests for information. Each file ends in a *.txt extension which is recognized by most default text editors across operating systems. All the files are tarred into one file and then compressed with bzip2 to minimize the time to upload the information to support engineers and central supportconfig repositories. You can change the default tar ball file name convention by using the -B startup option.

Organisation

Files arranged by topic
File entry format

* Header #==[ <string> ]==#

* Command Executed - # command [options]

* Command Output

File Layout

* RPM validation

* Service Status

* Informational commands

* Configuration files

* Log files

Information is gathered by topic when applicable. For example, all Logical Volume Management (LVM) information is saved in the lvm.txt file. All /etc/*.conf files are saved in the etc.txt file. The most common troubleshooting topics are included in supportconfig. If there is no file for a troubleshooting topic (for example, Squid), you can use the general troubleshooting files, such as etc.txt, env.txt, chkconfig.txt, messages.txt, and rpm.txt.

Supportconfig runs a lot of system commands. For each command that is executed, an entry is written to the topical file. The entry begins with a header #==[ <string> ]==#, such as #==[ Verification ]==#, #==[ Command ]==#, or #==[ Configuration File ]==#, followed by the exact command being executed, and finally the output of that command. The header and command are logged to the file before the command itself is executed. The purpose of the header is instructive, because it is hard to remember all the system commands. If you make a change to the server during the troubleshooting process, the supportconfig output shows you which command to use to retest your changes, or you can run another supportconfig for comparison.
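Because every entry records the command that produced it, the commands can be listed from a topical file. The following sketch assumes the entry layout described above: a Command header line, followed by a line with the command prefixed by "# ". list_commands and sample_topic_file are hypothetical names, and the sample file content is made up.

```shell
#!/bin/sh
# list_commands: hypothetical helper. Reads a supportconfig topical file on
# stdin and prints the command line that follows each Command header.
list_commands() {
    awk '
        /^#==\[ Command \]=+#/ {
            if ((getline line) > 0) {
                sub(/^# /, "", line)   # strip the leading "# " marker
                print line
            }
        }
    '
}

# Made-up sample in the entry format described above.
sample_topic_file() {
    cat <<'EOF'
#==[ Command ]==#
# /bin/uname -a
Linux host1 3.12.28-4-default #1 SMP x86_64 GNU/Linux

#==[ Configuration File ]==#
# /etc/hostname
host1

#==[ Command ]==#
# /usr/bin/uptime
 10:15am  up 3 days  2:04,  2 users,  load average: 0.05, 0.10, 0.12
EOF
}

sample_topic_file | list_commands

# With an extracted supportconfig directory:
#   list_commands < lvm.txt
```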

Most of the topical files are arranged in a logical order. The RPM packages owning the primary files of a service are validated, followed by the service's state. Is it turned on to start at boot? Is it currently running? Informational commands specific to the service, configuration files, and log files follow. We often forget to check basic information when troubleshooting problems. Putting the basic RPM and service information at the top of the file prompts you to look at the basics first.

Running Supportconfig


Possible options include:

-F lists the supportconfig features
-o toggles the supportconfig feature on or off based on its current setting
-i runs a minimal supportconfig, but includes the listed feature(s)
-u uploads the tarball to the Novell FTP server
-r <srnum> includes an 11-digit Novell service request number in the file name
See supportconfig(8) for all startup options and supportconfig.conf(5) for all configuration file options.
