Date
February 3, 2022
Reading Time
11 Min.

Monitoring vs. Alarming


By

Kay Thriemer

and

Mirko Quarg

Our beautiful API just keeps getting better: monitoring is now up and running too! 👉 https://pentacor.de/2021/10/11/vertrauen-ist-gut-kontrolle-ist-besser/

Photo by Carlos Muza on Unsplash

We’ve integrated a range of monitoring strategies to make sure we always have a clear view of how our system is performing.

With infrastructure monitoring, we start by covering the essential basics—making sure the API is even able to function.
That includes tracking CPU usage, memory consumption, disk activity, and network performance.
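
To make these basics a bit more concrete, here is a minimal sketch of how such host metrics could be collected in Python with the psutil library. In practice this job is done by the agents of our monitoring tools; the sketch only illustrates which values are involved.

```python
# Minimal sketch of an infrastructure check using the psutil library.
import time
import psutil

def collect_host_metrics() -> dict:
    """Collect a single snapshot of the basic host metrics."""
    disk = psutil.disk_usage("/")
    net = psutil.net_io_counters()
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),   # CPU usage over 1 s
        "memory_percent": psutil.virtual_memory().percent,
        "disk_used_percent": disk.percent,
        "net_bytes_sent": net.bytes_sent,
        "net_bytes_recv": net.bytes_recv,
    }

if __name__ == "__main__":
    while True:
        print(collect_host_metrics())  # in practice: ship to Prometheus/Datadog
        time.sleep(60)                 # fixed collection interval
```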

By embedding an agent directly into the application, we log incoming requests, responses, processing times, and detect exceptions. This data is sent to our monitoring tools and forms what we call our passive monitoring.
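
The following is a rough sketch of this idea for a Python WSGI application: a small middleware that logs the request path, response status, processing time, and exceptions. In our setup this work is done by the monitoring tool's agent; the logging here merely stands in for sending the data to the monitoring backend.

```python
# Sketch of the idea behind passive monitoring: a WSGI middleware that
# records requests, response status, processing time, and exceptions.
import time
import logging

logger = logging.getLogger("passive-monitoring")

class MonitoringMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        status_holder = {}

        def capturing_start_response(status, headers, exc_info=None):
            status_holder["status"] = status   # remember the response status
            return start_response(status, headers, exc_info)

        start = time.perf_counter()
        try:
            response = self.app(environ, capturing_start_response)
        except Exception:
            logger.exception("unhandled exception for %s", path)
            raise
        finally:
            duration_ms = (time.perf_counter() - start) * 1000
            logger.info("request path=%s status=%s duration_ms=%.1f",
                        path, status_holder.get("status"), duration_ms)
        return response
```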

However, passive monitoring has its limits—for example, if no one sends requests to the API for a while, we wouldn’t catch potential issues.
That’s why we also use synthetic monitoring: it actively triggers our various API endpoints, simulating the kind of traffic we expect from our users.
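
A synthetic monitor can be as simple as a loop that calls selected endpoints at a fixed interval and records status and latency, as in this sketch (the endpoint URLs are placeholders):

```python
# Sketch of a synthetic monitor: actively probes endpoints at a fixed
# interval and records status code and latency.
import time
import requests  # third-party HTTP client

ENDPOINTS = [
    "https://api.example.com/health",
    "https://api.example.com/v1/orders",
]

def probe(url: str) -> dict:
    start = time.perf_counter()
    try:
        response = requests.get(url, timeout=5)
        ok, status = response.status_code < 500, response.status_code
    except requests.RequestException:
        ok, status = False, None
    return {
        "url": url,
        "ok": ok,
        "status": status,
        "latency_ms": (time.perf_counter() - start) * 1000,
    }

if __name__ == "__main__":
    while True:
        for url in ENDPOINTS:
            print(probe(url))  # in practice: forward to the monitoring tool
        time.sleep(60)
```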

All of this monitoring depends on metrics, which we can then analyze further using tools like Prometheus or Datadog.

But what exactly are metrics—and which ones are actually useful?

Metrics capture values that reflect the state of our systems at a specific point in time—for example, the number of users currently logged into a web application. That’s why metrics are typically collected at regular intervals—every second, every minute, or at another fixed rate—to help monitor system behavior over time.

There are two key categories of metrics: work metrics and resource metrics.

For every system that makes up our software infrastructure, we identify which work and resource metrics are available—and we collect them accordingly.

Work Metrics

Work metrics reflect the system’s state at a high level. It’s helpful to break these down into four main types:

  • Throughput measures how much work the system completes over a given period of time. This is usually shown as an absolute value—for example, a web server might handle 500 requests per second.
  • Success metrics indicate how many operations were successfully completed. For a web server, that could be the percentage of HTTP 2xx responses.
  • Error metrics track failed operations. These are typically shown as an error rate over time or normalized per unit of work (e.g., errors per request). They’re often monitored separately from success metrics, especially when there are multiple types of errors with different levels of severity. For example, HTTP 5xx responses on a web server would count toward the error metric.
  • Performance metrics measure how efficiently a component is operating. The most common one is latency—the time it takes to complete a task. Latency can be reported as an average or percentile, like: “99% of requests are completed within 0.1 seconds.”

These metrics are essential for effective monitoring. They help answer critical questions about a system’s internal state and behavior:
Is the system available and functioning as intended?
How well is it performing?
What’s the error rate?
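
As a small illustration, the four work metrics could be derived from a batch of request samples roughly like this (the sample format and the nearest-rank percentile are our own choices):

```python
# Sketch: deriving the four work metrics from a batch of request samples.
from dataclasses import dataclass
import math

@dataclass
class RequestSample:
    status: int          # HTTP status code
    duration_s: float    # processing time in seconds

def work_metrics(samples: list[RequestSample], window_s: float) -> dict:
    total = len(samples)
    successes = sum(1 for s in samples if 200 <= s.status < 300)
    errors = sum(1 for s in samples if s.status >= 500)
    durations = sorted(s.duration_s for s in samples)
    # 99th-percentile latency (nearest-rank method)
    p99 = durations[max(0, math.ceil(0.99 * total) - 1)] if durations else None
    return {
        "throughput_rps": total / window_s,               # work done per second
        "success_rate": successes / total if total else None,
        "error_rate": errors / total if total else None,  # errors per request
        "latency_p99_s": p99,
    }
```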

Resource Metrics

Most components in our software infrastructure act as resources for other systems. Some of these are low-level resources—for example, a server’s resources include physical components like the CPU, memory, hard drives, and network interfaces. But higher-level components, such as a database, can also be considered resources if another system depends on them.

Resource metrics help us build a detailed picture of system health, making them especially valuable when investigating or diagnosing issues. For each resource in our system, we collect metrics across four key areas:

  • Utilization measures how much of a resource’s capacity is being used—typically expressed as a percentage of time the resource is busy or how much of its total capacity is in use.
  • Saturation reflects how many tasks the resource has queued up that it hasn't been able to process yet—such as items waiting in a queue.
  • Errors capture internal problems that might not be immediately visible during normal operations.
  • Availability shows the percentage of time the resource successfully responded to requests. This metric is only clearly defined for resources that can be actively and regularly checked for availability.

To illustrate this, here’s an overview of common metrics across different resource types:

Resource | Utilization | Saturation | Errors | Availability
Microservice | Average percentage of time the service was busy | Requests not yet processed | Internal errors of the microservice | Percentage of time the microservice was available
Database | Average percentage of time each database connection was busy | Unprocessed requests | Internal errors, e.g. replication errors | Percentage of time the database was accessible

Metric Types

In addition to these categories, the type of a metric is also important. The metric type influences how the metric is displayed in a tool such as Datadog or Prometheus.

The following different types exist:

  • Count
  • Rate
  • Gauge
  • Histogram

The Count metric type represents the number of events during a specific time interval. It is used, for example, to record the total number of all connections to a database or the number of requests to an endpoint. The Count metric type differs from the Rate metric type, which records the number of events per second in a defined time interval. The Rate metric type can be used to record how often things happen, such as the frequency of connections to a database or requests to an endpoint.

The Gauge metric type provides a snapshot value: for each time interval, the last recorded value is reported. A metric of type gauge is primarily used for values that are measured continuously, e.g., the available hard drive space.

The Histogram metric type can be used to determine the statistical distribution of certain values during a defined time interval. It makes it possible to determine, for example, the average, count, median, or maximum of the measured values.
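
The following sketch shows how these types map onto the Python prometheus_client library (metric names are examples; a rate is usually not stored directly but derived from a counter at query time, e.g., rate(api_requests_total[5m]) in PromQL):

```python
# Sketch of the metric types using the Python prometheus_client library.
from prometheus_client import Counter, Gauge, Histogram, start_http_server
import random, shutil, time

REQUESTS = Counter("api_requests_total", "Total number of requests to the API")
DISK_FREE = Gauge("disk_free_bytes", "Currently available disk space in bytes")
LATENCY = Histogram("api_request_duration_seconds",
                    "Distribution of request processing times")

def handle_request():
    REQUESTS.inc()                              # Count: one more event
    with LATENCY.time():                        # Histogram: observe the duration
        time.sleep(random.uniform(0.01, 0.1))   # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)                     # exposes /metrics for scraping
    while True:
        DISK_FREE.set(shutil.disk_usage("/").free)  # Gauge: snapshot value
        handle_request()
```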

Events

In addition to metrics that are recorded more or less continuously, some monitoring systems can also record events: discrete, infrequent occurrences that can play a crucial role in understanding the behavioral changes of our system. Some examples:

  • Changes: internal code releases, builds, and build errors
  • Warnings: internally generated warnings or notifications from third-party providers
  • Scaling resources: adding or removing hosts

Unlike a single metric data point, which is generally only meaningful in context, an event usually contains enough information to be interpreted on its own. Events capture what happened at a particular point in time, with optional additional information.
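
A minimal example of such a self-contained event record might look like this (the field names are our own choice):

```python
# Sketch of an event record that is interpretable on its own.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Event:
    title: str                      # what happened
    timestamp: datetime             # when it happened
    source: str                     # which system reported it
    tags: list[str] = field(default_factory=list)
    details: str = ""               # optional additional information

release_event = Event(
    title="Deployed api-service v2.3.1",
    timestamp=datetime.now(timezone.utc),
    source="ci-pipeline",
    tags=["release", "api-service"],
)
```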

Events are sometimes used to generate warnings – someone should be notified of events that indicate that critical work has failed. But more often they are used to investigate problems and correlate across systems. Events should be treated like metrics – they are valuable data that need to be collected wherever possible.

But what should good data look like?

The data collected should have four characteristics:

Good comprehensibility:
We should be able to quickly determine how each metric or event was captured and what it represents. During an outage, we do not want to spend time trying to figure out what our data means. We keep metrics and events as simple as possible and name them clearly.

Appropriate time interval:
We need to collect our metrics at appropriate intervals so that problems actually become visible. If we collect metrics too infrequently or average values over long time windows, we lose the ability to accurately reconstruct a system’s behavior. For example, periods of 100 percent resource utilization are obscured when they are averaged with periods of lower utilization. On the other hand, collecting metrics, and synthetic monitoring in particular, puts a certain continuous load on the system:
a) it “takes away” resources from the “actual” work, and
b) it overlays the data, because synthetic monitoring is naturally reflected in the passive monitoring as well.
If the chosen interval is too short, the system may become noticeably overloaded or the collected data may no longer be meaningful.

Scope reference:
Suppose each of our services operates in multiple regions and we can check the overall health of each region or their combinations. It is then important to be able to allocate metrics to the appropriate regions so that we can alert on problems in the relevant region and investigate outages quickly.
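
With Prometheus-style metrics, this scope is typically expressed as labels, as in this sketch (metric name, label names, and regions are examples):

```python
# Sketch: attaching the scope (here: region and endpoint) to a metric as
# labels, so that alerts and investigations can be narrowed down to the
# affected region.
from prometheus_client import Counter

REQUESTS_BY_REGION = Counter(
    "api_requests_by_region_total",
    "Total number of requests to the API, broken down by region and endpoint",
    ["region", "endpoint"],          # labels define the scope
)

# The same metric, recorded separately per region and endpoint:
REQUESTS_BY_REGION.labels(region="eu-central", endpoint="/v1/orders").inc()
REQUESTS_BY_REGION.labels(region="us-east", endpoint="/v1/orders").inc()
```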

Sufficiently long data retention:
If we discard data too early or our monitoring system aggregates our metrics after some time to reduce storage costs, we lose important information about the past. Keeping our raw data for a year or more makes it much easier to know what is “normal,” especially if our metrics show monthly, seasonal, or annual fluctuations.

With our collected metrics, we are now able to provide information about the health of our system at any given time.

BUT: Who wants to spend all day looking at the charts generated and checking the status of the system? There should be something that lets us know when unusual things happen.

An alarming system

Drawing attention to the essentials

Automated alerts are essential for monitoring. They allow us to detect problems anywhere in our infrastructure so that we can quickly identify their causes and minimize service disruptions and interruptions.
While metrics and other measurements facilitate monitoring, alerts draw attention to the specific systems that require observation, inspection, and intervention.

But alerts are not always as effective as they could be. In particular, real problems often get lost in a sea of messages. Here we want to describe a simple approach to effective alerting:

  • Alert generously, but judiciously
  • Alert about symptoms rather than causes
  • When should you alert someone (or no one)?

An alert should communicate something specific about our systems in plain language: “Two Cassandra nodes are down” or “90 percent of all web requests take more than 0.5 seconds to process and respond.” By automating alerts for as many of our systems as possible, we can respond quickly to problems and provide a better service. It also saves us time by freeing us from the constant, manual checking of metrics.

Levels of Alert Urgency

Not all alerts have the same urgency. Some require immediate intervention, some require eventual intervention, and some indicate areas that may require attention in the future. All alerts should be logged in at least one centralized location to allow easy correlation with other metrics and events.

Alerts as Records (Low Severity)

Many alerts are not associated with a service issue, so a human may not even need to look at them. For example, if a service responds to user requests much more slowly than usual, but not so slowly that end users find it annoying, this should generate a low-severity alert. It is recorded in monitoring and stored for future reference or investigation. After all, temporary problems that could be responsible, such as network congestion, often disappear on their own. But should the service later return a large number of timeouts, this information provides an invaluable basis for our investigation.

Warnings as Notifications (Medium Severity)

The next level of alerting urgency concerns problems that require intervention but not immediately. The data storage space may be running low and should be scaled up in the next few days. Sending an email and/or posting a notification in a dedicated chat channel is a perfect way to deliver these alerts – both message types are highly visible but they don’t wake anyone up in the middle of the night and don’t disrupt our workflow.

Warnings as Alerts (High Severity)

The most urgent alerts should be given special treatment and escalated immediately to get our attention quickly. For example, response times for our web application should have an internal SLA that is at least as aggressive as our most stringent customer-facing SLA. Any instance of response times exceeding our internal SLA requires immediate attention, regardless of the time of day.
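
Putting the three levels together, the routing of an alert could be sketched like this (the channel functions are placeholders for whatever integrations are actually in use):

```python
# Sketch of routing alerts by urgency.
from enum import Enum

class Severity(Enum):
    LOW = "low"        # record only
    MEDIUM = "medium"  # notify, but don't wake anyone
    HIGH = "high"      # escalate immediately

# placeholder integrations
def record_in_monitoring(msg): print("record:", msg)
def post_to_chat_channel(msg): print("chat:", msg)
def send_email(msg): print("email:", msg)
def page_on_call_engineer(msg): print("page:", msg)

def route_alert(message: str, severity: Severity) -> None:
    record_in_monitoring(message)          # every alert is logged centrally
    if severity is Severity.MEDIUM:
        post_to_chat_channel(message)      # e.g. dedicated alerts channel
        send_email(message)
    elif severity is Severity.HIGH:
        page_on_call_engineer(message)     # regardless of the time of day
```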

When should you leave a sleeping engineer alone?  😉

Photo by No Revisions on Unsplash

When we consider setting an alarm, we ask ourselves three questions to determine the urgency of the alarm and how it should be handled:

Is this problem real?
It may seem obvious but if the problem is not real, it should not normally generate an alarm. The following examples may trigger alarms but are probably not symptomatic of real problems. Alerting on events like these contributes to alert fatigue and may cause more serious problems to be ignored:

  • Metrics in a test environment are outside established limits.
  • A single server is performing very slowly, but it is part of a cluster with fast failover to other machines, and it reboots periodically anyway.
  • Planned upgrades result in a large number of machines being reported as offline.

If the problem actually exists, a warning should be generated. Even if the alert is not linked to a notification (via email or chat), it should be recorded in our monitoring system for later analysis and correlation.

Does this problem require attention?
There are very real reasons to call someone away from work, sleep, or their private time. However, we should only do so when there really is no other way; in other words, if we can adequately automate the response to a problem, we should consider doing that instead. If the problem is real and requires (human) attention, an alert should be generated to notify someone who can investigate and fix it. Depending on the severity of the problem, this may well wait until the next morning. We can therefore distinguish between different notification channels: at a minimum, the notification should be sent via email, chat, or a ticketing system so that the recipient can prioritize their response. Beyond that, phone calls, push notifications to mobile phones, and the like are also an option when attention needs to be drawn to a problem as quickly as possible.

Is this problem urgent?
Not all problems are emergencies. For example, perhaps a moderately high percentage of system responses were very slow or a slightly higher percentage of queries are returning stale data. Both problems may need to be fixed soon but not at 4:00 a.m. If, on the other hand, the performance of a key system drops or stops working, we should check immediately. If the symptom is real and requires attention and it is acute, an urgent alert should be generated.
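
These three questions can be condensed into a small decision helper, sketched here (how the booleans are determined is, of course, the hard part):

```python
# Sketch of the three triage questions as a decision helper.
def triage(is_real: bool, needs_attention: bool, is_urgent: bool) -> str:
    if not is_real:
        return "no alert"                     # avoid alert fatigue
    if not needs_attention:
        return "record only"                  # keep for later analysis/correlation
    if not is_urgent:
        return "notify (email/chat/ticket)"   # can wait until morning
    return "page on-call"                     # wake someone up
```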

Fortunately, monitoring solutions such as Prometheus or Datadog offer the option of connecting various communication channels such as email, Slack (chat), or even SMS.

Symptom Alert

In general, an alert is the most appropriate type of warning when the system we are responsible for can no longer process requests with acceptable throughput, latency, or error rates. This is the kind of problem we want to know about immediately.

The fact that our system is no longer doing useful work is a symptom – that is, it is the manifestation of a problem that can have a number of different causes.

For example: if our website has been responding very slowly for the last three minutes, this is a symptom. Possible causes include high database latency, down application servers, high load, etc. Wherever possible, we base our alerting on symptoms rather than causes.

Alerting on symptoms surfaces real, often user-facing problems rather than hypothetical or internal ones. Let’s compare alerting on a symptom, such as slow website responses, with alerting on a possible cause of that symptom, such as high utilization of our web servers:
Our users neither know nor care about the server load as long as the website still responds quickly, and we would only be annoyed at having to deal with something that is noticeable purely internally and may return to normal levels without any intervention.
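
A symptom check of this kind could be sketched as a sliding-window rule, for example “p90 latency above 0.5 seconds for 3 minutes” (threshold and window size are example values):

```python
# Sketch of a symptom check: alert only if the symptom (slow responses)
# persists over a window.
import time
from collections import deque

WINDOW_S = 180            # 3 minutes
THRESHOLD_S = 0.5         # acceptable response time

_samples: deque[tuple[float, float]] = deque()   # (timestamp, duration)

def record_response(duration_s: float, now: float | None = None) -> bool:
    """Record one response time; return True if the symptom alert should fire."""
    now = time.time() if now is None else now
    _samples.append((now, duration_s))
    while _samples and _samples[0][0] < now - WINDOW_S:
        _samples.popleft()                       # drop samples outside the window
    durations = sorted(d for _, d in _samples)
    p90 = durations[int(0.9 * (len(durations) - 1))]
    return p90 > THRESHOLD_S
```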

Durable Alert Definitions

Another good reason to alert on symptoms is that symptom-based alerts tend to be durable: regardless of how the underlying system architecture changes, we will still be notified when the system no longer works as it should, even without updating our alert definitions.

Exception to the Rule: Early Warning Signs

It is sometimes necessary to focus our attention on a small handful of metrics, even if the system is functioning appropriately.
Early warning values reflect an unacceptably high probability that serious symptoms will soon develop and require immediate intervention.

Hard disk space is a classic example. Unlike a shortage of free memory or CPU, the system is unlikely to recover once we run out of disk space, and we usually have little time before the system comes to a hard stop. Of course, if we can notify someone with enough lead time, nobody has to be woken up in the middle of the night. Better yet, we can anticipate some of the situations in which space is running low and create an automated fix based on data we can afford to lose, such as deleting logs or data that exists elsewhere.
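
Such an early warning could, for example, be based on a simple linear extrapolation of recent disk usage, as in this sketch (the lead time and the linear model are assumptions):

```python
# Sketch of an early-warning check for disk space: extrapolate recent
# growth linearly and warn if the disk is projected to fill up within
# the desired lead time.
import shutil

LEAD_TIME_H = 48          # how much advance warning we want

def hours_until_full(used_before: int, used_now: int, hours_between: float,
                     total: int) -> float | None:
    """Linear extrapolation; returns None if usage is not growing."""
    growth_per_hour = (used_now - used_before) / hours_between
    if growth_per_hour <= 0:
        return None
    return (total - used_now) / growth_per_hour

def check_disk(used_before: int, hours_between: float, path: str = "/") -> bool:
    usage = shutil.disk_usage(path)
    remaining = hours_until_full(used_before, usage.used, hours_between, usage.total)
    return remaining is not None and remaining < LEAD_TIME_H  # True -> warn now
```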

Conclusion: Take Symptoms Seriously

  • We only send an alert if symptoms of serious problems are detected in our system or if a critical and finite resource limit (e.g. hard drive space) is about to be reached.
  • We set up our monitoring system so that it records alerts as soon as it recognises that real problems have occurred in our infrastructure, even if these problems have not yet affected overall performance.

Process chain monitoring

Hooray, we get an alarm: “Our storage system has failed.”

This is still no cause for celebration, as the boss is at the door and wants to know which business functions are affected. With our current setup, it is not possible to provide this information immediately.

Let’s take a closer look at this with an example:

A hypothetical company operates internal and external services. The external ones consume internal services. Both types of services have dependencies on resources, such as our storage system.

With a dependency tree, it is possible to model these dependencies. If an element in our tree fails, we can automatically determine which dependent systems are affected. This enables advance information to be sent to the helpdesk, first-level support, etc., and thus takes pressure and stress off the team working on the problem.

By using weighted nodes, it is possible to calculate the criticality. This in turn makes it possible to prioritize work on individual outages.
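
A very simplified sketch of such a weighted dependency model: given a failed element, determine all systems that depend on it, directly or indirectly, and add up their weights to get a criticality score (the systems and weights are made-up examples).

```python
# Sketch of a dependency tree for process chain monitoring.
DEPENDS_ON = {
    "webshop (external)": ["order service (internal)", "login service (internal)"],
    "order service (internal)": ["storage system"],
    "login service (internal)": ["user database"],
    "canteen-menu bot": ["chat tool"],
}

WEIGHT = {  # business criticality per system
    "webshop (external)": 10,
    "order service (internal)": 5,
    "login service (internal)": 5,
    "canteen-menu bot": 1,
}

def affected_by(failed: str) -> set[str]:
    """All systems that directly or indirectly depend on the failed element."""
    affected: set[str] = set()
    changed = True
    while changed:
        changed = False
        for system, deps in DEPENDS_ON.items():
            if system not in affected and (failed in deps or affected & set(deps)):
                affected.add(system)
                changed = True
    return affected

def criticality(failed: str) -> int:
    return sum(WEIGHT.get(system, 0) for system in affected_by(failed))

print(affected_by("storage system"))  # {'order service (internal)', 'webshop (external)'}
print(criticality("storage system"))  # 15
```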

Fortunately, parallel failures only occur in theory and certainly not on weekends or public holidays.

It should be noted here that the IT landscape in companies is “alive.” The consequence of this is that anyone who has this kind of end-to-end monitoring must also keep this model/tool up to date. This always means effort, which is often only worthwhile for critical company processes.

For example, if a bot that publishes the current canteen menu in a chat tool fails, it is not worth rolling out end-to-end monitoring for this.

Conclusion

Monitoring enables us to make statements about the status of our systems at any time.
Alarming draws attention to system anomalies.
Process chain monitoring allows us to make a statement regarding the effects of anomalies.

The prerequisite for a sensible and successful monitoring and alarming system is an overall concept, which must be developed...