Trust, but verify

Date: October 11, 2021
Tags: API, Monitoring

We’ve just built another great API. Now all we need are some users to actually start consuming it. We’ve already played around with how it all works. And to make sure our consumers have a good experience, we reviewed once more what we need to keep in mind so they don’t end up hating our API:

We put a lot of effort into the design. We use error codes and return helpful error responses—or at least we think so. The services behind our API are pretty fast too. We tested them. Everything worked right out of the box, we’re satisfied, and the API is published—the party can begin!

Then suddenly someone spoils the fun and asks us if we’re actually monitoring our API. Monitoring?
Well yes, we do. If server memory runs low or the disk fills up, we get notified and can react. Our infrastructure is under control!

But the spoilsport doesn’t give up. And honestly, he doesn’t really care about our shiny servers. What he still wants to know is whether we are monitoring our API. Why? After all, everything has been tested on multiple levels—from unit tests to system integration tests to end-to-end tests. All successful. And clearly, the API works. Isn’t that enough?

But what happens if an external system we depend on suddenly stops working properly? Oh! Right. In that case, we’d definitely have a problem. Sure, we can fall back on resilience patterns to implement mitigation strategies, but failures in other systems will inevitably affect us. Fine, maybe we’ll add some monitoring after all: external APIs probably expose a health or status endpoint. We could just have our application call those regularly, aggregate the results, and then make sure our own health endpoint only returns “OK” if all of our friends are doing fine as well. We add our status endpoint into monitoring, and if something changes, we’ll know. Done! That really should be enough, right?
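The aggregation idea above fits in a few lines. In the sketch below, the dependency names and check functions are hypothetical stand-ins; in a real service, each check would wrap an HTTP call against that dependency’s health endpoint.

```python
def aggregate_health(dependency_checks):
    """Run each dependency check; report per-dependency results and an overall status.

    `dependency_checks` maps a dependency name to a callable returning True/False.
    """
    results = {}
    for name, check in dependency_checks.items():
        try:
            results[name] = "OK" if check() else "DOWN"
        except Exception:
            # An unreachable or crashing dependency must not crash our own endpoint.
            results[name] = "DOWN"
    overall = "OK" if all(status == "OK" for status in results.values()) else "DEGRADED"
    return {"status": overall, "dependencies": results}
```

Our own status endpoint would then serve the result of `aggregate_health(...)` and only answer “OK” when every friend is doing fine as well.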

Um… no. Not really. Not even close. As early as October 2018, API calls accounted for 83% of all web traffic. The vast majority of that traffic comes from cloud applications and digital transformation. Which means what our API and other APIs are doing is pretty important—to our customers, and to their customers: people. If something goes wrong, we should definitely want to know—before the customer calls, because our aggregated status information might not tell the whole story. It only reflects API availability, not errors that occur when the API is actually invoked. And let’s be honest: we can safely assume our customer has better things to do than spend all day checking whether the API we built is actually working (especially if we’re not even doing it ourselves). Chances are they’ll be using endpoints other than just the status endpoint. So if they call us, it either means we all just got lucky that they happened to stumble upon an unexpected issue, or that they got called by their own customers.

Given the reach and scale of APIs, it’s safe to assume that when our customer gets called, their customers have not only reported the issue to them but also made it public and vented their frustration. Others quickly jump on board, and a social media storm is triggered. Eventually, the press picks it up—and our customer’s competitors couldn’t dream of better advertising. They’ll be thrilled!

Earlier this week, Facebook, Instagram, and WhatsApp were all down due to a router configuration change. Back in February, Facebook Messenger was unavailable; most users in the UK had problems receiving messages, and some couldn’t log in at all. Also in February, LinkedIn greeted its users with a generic “An error has occurred” thanks to a configuration issue. In December last year, multiple Google services—including YouTube and Gmail—were down because an authentication service failed. And in November, when Amazon Web Services suffered an outage, half the internet went down with it.

It’s a bit like the infamous “This should never happen” comment in a catch block for an exception we were convinced could never be thrown—until it is. Especially with distributed cloud applications and mutual external dependencies, the potential causes of failure are many and far from obvious.

Okay, we’re convinced. We’re definitely well-advised to continuously keep an eye on the expected behavior of our API. So, what are our options?


Infrastructure Monitoring

We actually wanted to monitor the API itself—but isn’t it enough to just track the status and metrics of all the underlying components? If the servers for our application are running, the databases are reachable, and the API gateway is healthy, can’t we assume the API is working as well?

Hmm… no. We’ve already been down that road. An API is more than just the sum of its parts. If the infrastructure isn’t functioning as expected, we can safely assume the API won’t work either. But the reverse isn’t true: error-free infrastructure is no guarantee of an error-free API. With infrastructure monitoring, we’re only covering the basic prerequisite—checking whether the API can function at all.

For real insight, something’s still missing… So, next!

Passive Monitoring

Let’s just lie in wait: we integrate an agent into the application that provides the API and observe what’s going on.


Our agent records incoming requests along with their corresponding responses and detects exceptions. It logs how long request processing takes and with which status the request is ultimately completed. All of this information is forwarded to a monitoring tool, allowing us to identify changes over time:

  • Is the number of requests increasing?
  • Is processing taking longer and longer?
  • Are errors occurring more frequently—and even disproportionately so?

If so, something might be brewing. Something isn’t quite right. If we’re lucky enough that our application logs are collected in the same tool as the agent’s information, we can start looking for correlations and get to the bottom of the unexpected behavior.
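Such an agent can be as simple as a wrapper around the request handler. In this sketch (an illustration, not any particular vendor’s agent), `RECORDS` stands in for the monitoring backend the real agent would forward to, and requests/responses are plain dicts:

```python
import time

RECORDS = []  # stand-in for the monitoring backend the agent forwards to

def monitored(handler):
    """Wrap a request handler: record duration, final status, and exceptions."""
    def wrapper(request):
        start = time.perf_counter()
        try:
            response = handler(request)
            status = response.get("status", 200)
        except Exception:
            # The agent observes the failure; the caller still gets a response.
            status = 500
            response = {"status": 500, "body": "internal error"}
        RECORDS.append({
            "path": request.get("path"),
            "status": status,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })
        return response
    return wrapper
```

Plotting the recorded statuses and durations over time is exactly what lets us answer the three questions above.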

So, we’re definitely smarter than before and now capable of monitoring expected API behavior based on live traffic. Problem solved! Well—at least we’ll notice when we have a problem. Done!

Wait. Not so fast!

Exceptions in our application and even error responses may indicate that we might have a problem—but that doesn’t necessarily mean we actually do. It always takes at least two parties. On the one hand, it’s good to sweep our own doorstep first and assume that something on our side may not be working as expected. On the other hand, there are also the users of our API. And no one guarantees us that they will behave the way we envisioned and hoped.

In fact, the potential errors are diverse and ideally can be identified via the HTTP status codes—defined in HTTP/1.1 (RFC 2616, whose semantics have since been revised, most recently in RFC 9110). Six different 5xx status codes signal errors on our (server) side, while three times as many 4xx status codes indicate that the issue is more likely on the client side. Either the HTTP protocol is far more imaginative in categorizing and describing client errors—or perhaps it really isn’t so unlikely that the “fault” lies with the consumer of our API, because something in their request isn’t correct or we cannot process it.

But hold on—we don’t want to play the blame game. We want our API to provide value to its consumers. So if we see an increase in client “errors,” it might just be an indication that consumers want to use the API differently than we expected—or that our documentation has been misunderstood. Aside from deliberate attacks on our API with the goal of harming our systems, we don’t assume that malicious requests are being made. We believe in the goodwill of the client! And if we not only invest in API monitoring but also in API management, we actually know the clients and consumers of our API. That means we can proactively reach out to them when anomalies appear in monitoring, clarify the situation, and work together to improve either the API itself or their understanding of it.

But what happens if we don’t receive any requests at all—neither “good” nor “bad”? If there’s no live traffic for us to monitor? Does that mean everything’s fine?

Not necessarily. If there are no requests at all, it could simply mean that no one is interested in using our (fully functional) API. From the perspective of monitoring expected functionality, that would technically be good news—yet it would still be very, very disappointing. At the same time, a lack of requests might also indicate that someone does want to use our API but cannot, because something else—possibly right outside our door—isn’t working. And we wouldn’t notice it. That’s pretty frustrating.

Our goal was to ensure that our API works for consumers. Achieving that requires more than just knowing when requests fail.

Synthetic Monitoring

For continuous monitoring, we also need continuous requests. Synthetic monitoring is the solution here.


If we only want to check whether our API is fundamentally reachable, the simplest approach is to regularly call a health or status check endpoint of our API. By evaluating the HTTP status codes of the responses, we can at least determine whether the API is accessible from the outside.
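A reachability probe of this kind is tiny. In the sketch below the actual HTTP call is injected as a callable (in a real setup it would wrap an HTTP client issuing a GET against the status endpoint); everything else is just interpreting the status code:

```python
def probe_status_endpoint(fetch_status):
    """Treat any 2xx answer from the status endpoint as 'reachable'.

    `fetch_status` is an injected callable that performs the real HTTP GET
    and returns the status code, or raises if the API cannot be reached.
    """
    try:
        return 200 <= fetch_status() < 300
    except Exception:
        # Connection refused, timeout, DNS failure: not reachable from outside.
        return False
```

A scheduler (cron, or the monitoring tool itself) would call this every minute or so and alert on consecutive failures.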

But what about any backend services behind our own backend service? Our API might be available—and still not actually working correctly.

Our API, its backend, and the chain of backends behind it

Of course, our API’s health check could aggregate the health checks of all its backends and only report “I’m fine!” if all the others are fine as well. But where does that stop? A backend’s backend could have its own backends—and most likely we’ll never be able (or even allowed) to call all of them, let alone know they exist. This approach definitely has its limits—or perhaps no limits at all.

So what else can we do if a simple health check isn’t enough?

If we’re already sending regular requests to a status endpoint of our API, we can do the same with other endpoints—multiple ones, maybe the particularly important or business-critical ones. In other words: synthetic monitoring allows us to simulate the behavior of a “real” user.

We call the various API endpoints exactly as we assume or expect our consumers would. If we’re lucky, we control everything we need: the requests themselves are known, of course. If we also know the backend, its business logic, and the data, we can determine exactly what a correct and expected response should look like—and verify that too.

At this point, the backend-backend-backends and any other hidden components essentially no longer matter. Our API becomes a complete black box, and we only verify that it behaves correctly at the business logic level. If the normal user flow involves more than one request, then we don’t stop after the first call—we monitor the entire business flow.
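A business flow like this can be driven by a small step runner. In the sketch below, the order-flow steps and the client interface are hypothetical; the in-memory `FakeOrderClient` merely stands in for a real API client so the flow can be exercised:

```python
def run_flow(client, steps):
    """Execute (name, call, validate) steps in order.

    Returns the name of the first failing step, or None if the whole flow
    behaved as a real consumer would expect.
    """
    context = {}  # lets later steps reuse earlier results (e.g. an order id)
    for name, call, validate in steps:
        response = call(client, context)
        if not validate(response, context):
            return name
    return None

class FakeOrderClient:
    """In-memory stand-in for a real API client, used only to exercise the flow."""
    def __init__(self):
        self.orders = {}
    def create_order(self, item):
        order_id = len(self.orders) + 1
        self.orders[order_id] = item
        return {"id": order_id, "item": item}
    def get_order(self, order_id):
        return self.orders.get(order_id)

ORDER_FLOW = [
    ("create order",
     lambda c, ctx: c.create_order("widget"),
     lambda r, ctx: ctx.update(order_id=r["id"]) or r["item"] == "widget"),
    ("fetch order",
     lambda c, ctx: c.get_order(ctx["order_id"]),
     lambda r, ctx: r == "widget"),
]
```

The validators check responses at the business level, which is precisely what turns a reachability probe into true synthetic monitoring.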

But wait, there’s still a challenge: in most cases, the data exposed by our API is not immutable or static. And when monitoring full business flows, we’re not just reading data—we’ll be creating new records, updating existing ones, maybe even deleting data. That’s often not feasible in a production system, or the expected responses for monitoring might change, making them less predictable. We’ll need to think carefully about this before implementing monitoring. Test data, access to it, and how it’s handled will always be a tricky issue!

Another challenge with monitoring business flows in production environments is potential conflicts with reporting and analytics. Naturally, there is (legitimate) business interest in measuring how many calls—for example, orders—were triggered via our API. If we’re not careful and don’t filter out synthetic monitoring requests, the numbers might look great at first glance, as though the API is being used frequently and consistently—but in reality, no “real” users are behind those calls.
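One common mitigation (an assumption here, not something the tools prescribe) is to tag every synthetic request, for example with a custom marker header, and exclude tagged requests when computing business numbers. The header name below is made up:

```python
SYNTHETIC_HEADER = "X-Synthetic-Monitor"  # hypothetical marker header

def count_real_orders(request_log):
    """Count order calls, ignoring requests tagged as synthetic monitoring."""
    return sum(
        1
        for request in request_log
        if request["path"] == "/orders"
        and SYNTHETIC_HEADER not in request.get("headers", {})
    )
```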

On the other hand, the regularly repeated synthetic requests allow us to track metrics like response time, identify trends, and detect anomalies. We just need to be very clear about which data is being used for what purpose and how it is presented.

Speaking of metrics …


When we talk about additional metrics, we can briefly revisit infrastructure monitoring or passive monitoring: if we crack open the black box and dive deeper into the application logic, various tools (for example, Prometheus or Datadog) give us the ability to define our own metrics that can then be incorporated into monitoring.

Regardless of the very different approaches these tools take, custom metrics allow us to monitor business processes inside the application with arbitrary granularity and detail, relatively independent of external conditions. We can define counters, histograms, or values that may increase or decrease. Strictly speaking, though, that’s already another topic in itself.
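As a plain-Python illustration of the metric types just mentioned, here is a minimal sketch of a counter and a histogram. Real client libraries (such as the official Prometheus clients) additionally provide labels, registries, and an HTTP endpoint that exposes the metrics for scraping:

```python
class Counter:
    """Monotonically increasing value, e.g. 'orders processed'."""
    def __init__(self):
        self.value = 0
    def inc(self, amount=1):
        self.value += amount

class Histogram:
    """Observations recorded in cumulative buckets, Prometheus-style:
    an observation increments every bucket whose upper bound it fits under."""
    def __init__(self, buckets):
        self.buckets = sorted(buckets)
        self.counts = {bound: 0 for bound in self.buckets}
    def observe(self, value):
        for bound in self.buckets:
            if value <= bound:
                self.counts[bound] += 1
```

A gauge (a value that may increase or decrease) would look like the counter with an additional `dec` method.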

We can also integrate monitoring agents into frontend applications to understand how users move through our website, where they run into problems, or which functionality they don’t use at all. But again—that’s a whole different field …

Takeaway

There are many ways to monitor infrastructure, APIs, and applications. Each has its own capabilities and limitations. We don’t necessarily need to use all of them, but we can be certain that operating without monitoring is not a good idea. Flying blind is never advisable.

At the same time, we shouldn’t overlook the fact that monitoring does indeed deliver a lot of value—but especially synthetic monitoring can also be quite costly to implement: we need dedicated test users and test data in production systems, and monitoring calls may need to be filtered out of reporting metrics so as not to complicate or even invalidate business reporting.

In any case, we should carefully consider how we can test—reasonably, with appropriate effort, and above all with real value—whether our systems are functioning as they should.

Trust may be good, but in this case, control is definitely better!