The complexities of systems monitoring

By Roger Mitan

Monitoring your network and systems is essential to maintaining a healthy environment. Adding monitoring may seem like a straightforward task, but a quick look below the surface reveals the complexities of the tools, the metrics to monitor, the thresholds, and the automation.

First, let’s look at some general categories of monitoring tools. For simplicity, I have grouped them into the following categories:

  • Basic monitoring
  • Detailed monitoring
  • Remote management and monitoring (RMM)

Basic monitoring tracks the up/down status of systems and devices. It is as simple as it sounds. There are no metrics involved, usually just a ping to an IP address (the numerical identifier assigned to the device) and an alert that is sent if a certain number of pings fail.
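The up/down logic can be sketched in a few lines of Python. This is only an illustration, not how any particular product works. The class name, the address, and the limit of three failures are assumptions, and I have assumed the failures must be consecutive. The ping itself is left out, so each result is passed in as True or False.

```python
from dataclasses import dataclass


@dataclass
class UpDownMonitor:
    """Tracks ping results for one device and decides when to alert."""

    address: str
    failure_limit: int = 3          # assumed: alert after 3 consecutive failed pings
    consecutive_failures: int = 0
    alerted: bool = False

    def record(self, ping_succeeded: bool) -> bool:
        """Record one ping result; return True only when a new alert should fire."""
        if ping_succeeded:
            # Any successful ping resets the counter and clears the alert state.
            self.consecutive_failures = 0
            self.alerted = False
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_limit and not self.alerted:
            # Fire once per outage instead of on every failed ping.
            self.alerted = True
            return True
        return False


monitor = UpDownMonitor("192.0.2.10")  # documentation-only example address
results = [False, False, True, False, False, False, False]
alerts = [monitor.record(r) for r in results]
```

In this sketch a single successful ping resets the count, so a device that drops an occasional packet never triggers an alert, while a real outage alerts once on the third missed ping.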

Detailed monitoring is much more complicated. With network devices, for example, it involves in-depth monitoring of metrics such as CPU, memory, throughput, packet loss, and cyclic redundancy check (CRC) errors. With systems there are thousands of metrics, such as memory utilization, CPU utilization, service status, disk operations, disk queues, Active Directory latency, SQL deadlocks, SQL index searches, and virtual machine parameters. Each of these metrics has a corresponding alert threshold to consider. Setting a threshold too low can generate an excessive number of alerts, which often leads to alert fatigue. Setting it too high can easily cause something critical to be missed.
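To make the threshold trade-off concrete, here is a small Python sketch. The CPU readings are made-up sample values and the function names are hypothetical. It only counts how many readings would have crossed a given threshold.

```python
def classify(value: float, warning: float, critical: float) -> str:
    """Map a metric reading to a severity using two thresholds."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


def count_alerts(samples: list[float], threshold: float) -> int:
    """Count how many readings would have raised an alert at this threshold."""
    return sum(1 for s in samples if s >= threshold)


# Made-up CPU utilization readings (percent), one per polling interval.
cpu_samples = [35, 42, 55, 61, 48, 72, 96, 58, 44, 67]

too_low = count_alerts(cpu_samples, 50)    # 6 alerts: noisy, invites alert fatigue
workable = count_alerts(cpu_samples, 90)   # 1 alert: only the real spike
too_high = count_alerts(cpu_samples, 99)   # 0 alerts: the spike to 96 is missed
```

With these sample readings, a threshold of 50 percent alerts on more than half of the polls, while a threshold of 99 percent stays silent through a spike to 96 percent.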

Detailed monitoring tools also involve escalation chains, which alert various tiers of support personnel depending on criticality and the amount of time an alert has gone unacknowledged. Many of these tools also have built-in trending and prediction engines. These can alert you when something such as network bandwidth or system disk activity is higher or lower than normal for a given time period, and they can use monitoring history to predict future failures or capacity issues.
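Both ideas can be sketched briefly in Python. Everything here is an assumption for illustration: the tier names, the delays, the rule that critical alerts escalate twice as fast, and the use of a simple straight-line fit for prediction. Commercial tools generally use more sophisticated models.

```python
# Hypothetical chain: (minutes unacknowledged before notifying, who is notified).
ESCALATION_CHAIN = [
    (0, "tier-1 on-call"),
    (15, "tier-2 engineer"),
    (45, "engineering manager"),
]


def tiers_to_notify(minutes_unacknowledged: float, critical: bool = False) -> list[str]:
    """Return every tier that should have been alerted by now."""
    # Assumed rule: critical alerts escalate in half the time.
    factor = 0.5 if critical else 1.0
    return [tier for delay, tier in ESCALATION_CHAIN
            if minutes_unacknowledged >= delay * factor]


def days_until_full(daily_usage_pct: list[float], capacity_pct: float = 100.0):
    """Fit a least-squares line to daily disk usage and extrapolate to capacity.

    Returns the number of days after the last sample, or None if usage is
    flat or shrinking (or there are too few samples to fit a line).
    """
    n = len(daily_usage_pct)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage_pct) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage_pct))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity_pct - intercept) / slope
    return max(0.0, day_full - (n - 1))


normal = tiers_to_notify(30)                  # tier 1 and tier 2
urgent = tiers_to_notify(30, critical=True)   # all three tiers
forecast = days_until_full([70, 72, 74, 76, 78])  # growing 2 points per day
```

In this example a disk that grows from 70 to 78 percent over five days is projected to fill in about 11 more days, which is the kind of early warning a prediction engine is meant to give.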

Remote management and monitoring takes detailed monitoring a step further by allowing remote access through the monitoring system and, more importantly, by providing automation to handle tasks such as installing updates, clearing temporary files when disk usage is too high, and restarting failed services. These and many other management scripts can be run manually or automatically in response to monitoring alerts.
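The alert-driven automation can be pictured as a lookup from alert type to management script. The Python sketch below is a simplified illustration. The alert names, host name, and functions are hypothetical, and each "script" only returns a description of what it would do rather than touching a real system.

```python
from typing import Callable


def clear_temp_files(host: str) -> str:
    # Placeholder: a real script would delete temporary files on the host.
    return f"clear temporary files on {host}"


def restart_service(host: str) -> str:
    # Placeholder: a real script would restart the failed service.
    return f"restart failed service on {host}"


# Hypothetical mapping of alert types to automated remediation scripts.
REMEDIATIONS: dict[str, Callable[[str], str]] = {
    "disk_usage_high": clear_temp_files,
    "service_down": restart_service,
}


def handle_alert(alert_type: str, host: str) -> str:
    """Run the matching script, or fall back to a human when none exists."""
    action = REMEDIATIONS.get(alert_type)
    if action is None:
        return f"no automation for {alert_type}; notify a technician about {host}"
    return action(host)


handled = handle_alert("disk_usage_high", "fileserver01")
unhandled = handle_alert("fan_failure", "fileserver01")
```

Alerts without a matching script still reach a person, which reflects the point that automation supplements the alerting described earlier rather than replacing it.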

With this general knowledge of monitoring systems, the next step is to determine which of these categories is right for your organization and then determine the best product for the job based on your budget and needs.

There are many tools available in all of these categories ranging from free and/or open source to expensive commercial offerings. Determining which tool or combination of tools to use can be as complicated as understanding monitoring itself.

Although open source tools generally have no upfront cost, the man-hours involved in implementing and managing the monitoring solution could easily add up to the cost of a full-time employee. Even many of the commercial offerings require a large amount of maintenance and tuning. When comparing features, functionality and cost, it is important to factor in this cost of labor. Do you want a low-cost tool that has to be constantly maintained to ensure everything is being properly monitored, or do you want a tool that can handle most of this work for you and comes with a responsive support team to help with the rest? You should also decide whether you want the tool to be part of your infrastructure or delivered as a Software as a Service (SaaS) offering, and whether it can monitor your cloud systems as well as your local infrastructure.

The importance and complexities of monitoring are easily overlooked, and monitoring is often added as an afterthought. As a result, monitoring systems are often either underutilized or not maintained well enough to ensure the environment is being properly monitored. Instead of sifting through the mounds of options on your own, you should choose a trusted partner that is experienced with all of these nuances and can provide you with an unbiased approach to monitoring that fits your budget and requirements.

Roger Mitan is the director of engineering with BlueBridge Networks, a downtown Cleveland-headquartered data-center and cloud computing business. He can be reached at (216) 621-2583.


