In an earlier post, we examined how microservice applications, as complex distributed systems, have more complex and less predictable performance profiles. Here we examine the state of monitoring systems, and their effectiveness and limitations in helping operations (Ops) teams manage application performance.
THE STATE OF MONITORING
There are many available monitoring systems and frameworks, both commercial and open source, that provide volumes of real-time telemetry on microservices, from the container level down to the cloud infrastructure.
The scope of these monitoring systems includes both white-box monitoring, which inspects application internals and metrics for debugging, and black-box monitoring, which detects externally visible symptoms of existing or possibly imminent problems. In addition, the industry is standardizing on monitoring each microservice via its four golden signals, i.e., latency, traffic, errors, and saturation.
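To make the four golden signals concrete, here is a minimal sketch of how they might be derived for one service over one time window. The `Request` record, the `golden_signals` function, the choice of the 95th percentile for latency, and the use of CPU as the saturation resource are all assumptions made for illustration; real monitoring systems compute these from their own telemetry.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    latency_ms: float
    ok: bool


def golden_signals(requests: List[Request], window_s: float,
                   cpu_used: float, cpu_limit: float) -> Dict[str, float]:
    """Summarize latency, traffic, errors, and saturation for one window."""
    n = len(requests)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile; 0.0 when there were no requests.
    p95 = latencies[max(0, math.ceil(0.95 * n) - 1)] if n else 0.0
    errors = sum(1 for r in requests if not r.ok)
    return {
        "latency_p95_ms": p95,                       # latency
        "traffic_rps": n / window_s,                 # traffic
        "error_rate": errors / n if n else 0.0,      # errors
        "saturation": cpu_used / cpu_limit,          # saturation (CPU assumed)
    }


if __name__ == "__main__":
    sample = [Request(10.0, True)] * 9 + [Request(100.0, False)]
    print(golden_signals(sample, window_s=5.0, cpu_used=0.5, cpu_limit=1.0))
```

Even in this small form, the four values answer different questions: latency and errors describe what users experience, traffic describes demand, and saturation describes how close the service is to its limits.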
Ops teams are under pressure 24×7 to extract the pertinent information, understand and manage each incident to limit its business impact, and then conduct a post-mortem or root cause analysis to determine a fix for the problem.
Typically, Ops is looking to answer the following:
- How do we detect an otherwise undetected condition that may impact a customer-facing service?
- Is the alert benign or a false positive, or does it signal a real incident?
- Does the alert indicate users are being negatively affected or not?
- How do we ensure that we are fixing the root cause, and not just the alert, i.e., the symptoms, so that we do not face these alerts again?
When we talk to Ops teams, we find that these performance-related concerns are the same ones they had when they managed monolithic applications over the past decades. What is different is that these questions are harder to answer for microservices, given their increased complexity.
WHAT MAKES MICROSERVICE APPLICATIONS COMPLEX
Microservice applications are characterized by strong coupling between component services. As a result, there is a lot of inter-service communication. That means the performance of one service can depend on many other services, and it in turn can affect the performance of many other services. The graph of connectivity between services, i.e., the microservices graph, of some large cloud providers can contain hundreds of services, and the resulting high degree of interdependency creates ‘dependency hell’. The average application we come across has far fewer services. But even with a dozen to twenty services, a failure or degradation in one service can still trigger multiple alerts across the application. Many of these alerts are false positives that the Ops team has to contend with.
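The way a single fault fans out into several alerts can be illustrated with a small dependency graph. The sketch below is only an illustration: the service names, the `CALLS` map, and the `affected_by` function are hypothetical, and it assumes that any service that directly or indirectly calls a failing service may raise its own alert.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical call graph: each service maps to the services it calls.
CALLS: Dict[str, List[str]] = {
    "frontend": ["cart", "catalog"],
    "cart": ["pricing", "inventory"],
    "catalog": ["inventory"],
    "pricing": [],
    "inventory": ["db"],
    "db": [],
}


def affected_by(calls: Dict[str, List[str]], failing: str) -> Set[str]:
    """Return every service that depends, directly or transitively, on `failing`."""
    # Invert the edges so we can walk from a service to its callers.
    callers: Dict[str, List[str]] = {svc: [] for svc in calls}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)

    # Breadth-first walk up the graph from the failing service.
    seen: Set[str] = set()
    queue = deque([failing])
    while queue:
        svc = queue.popleft()
        for caller in callers.get(svc, []):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


if __name__ == "__main__":
    print(sorted(affected_by(CALLS, "db")))
```

In this six-service example, a problem in `db` can surface as alerts in four other services, although only one of them points at the root cause.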
So it is not surprising that Ops teams have to resort to time-consuming analysis with a variety of tools.
CHALLENGES WITH EXISTING MONITORING SYSTEMS
We see three areas where today’s monitoring systems do not adequately help Ops teams address microservice performance issues.
Layered Obfuscation
Microservice applications are more than two-dimensional service maps. They are built on containers, provisioned on an orchestration layer, e.g., Kubernetes, which in turn is built on a cloud infrastructure layer. An incident in the application, the container, the orchestration layer, or the infrastructure services can affect multiple services, and its origin is masked by these layers of obfuscation.
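One way to see past the layering is to map alerting services down to the resources they share. The following sketch assumes two hypothetical lookup tables, `SERVICE_PODS` and `POD_NODE`, standing in for what an orchestrator such as Kubernetes would report; the names and the `shared_nodes` function are illustrative and not part of any real API.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical placement data: service -> pods, and pod -> node.
SERVICE_PODS: Dict[str, List[str]] = {
    "cart": ["cart-1"],
    "pricing": ["pricing-1"],
    "catalog": ["catalog-1"],
}
POD_NODE: Dict[str, str] = {
    "cart-1": "node-a",
    "pricing-1": "node-a",
    "catalog-1": "node-b",
}


def shared_nodes(alerting_services: List[str],
                 service_pods: Dict[str, List[str]],
                 pod_node: Dict[str, str]) -> Dict[str, List[str]]:
    """Return nodes that host more than one of the alerting services."""
    services_on_node: Dict[str, Set[str]] = defaultdict(set)
    for svc in alerting_services:
        for pod in service_pods.get(svc, []):
            node = pod_node.get(pod)
            if node is not None:
                services_on_node[node].add(svc)
    # Keep only nodes shared by at least two alerting services.
    return {node: sorted(svcs)
            for node, svcs in services_on_node.items() if len(svcs) > 1}


if __name__ == "__main__":
    print(shared_nodes(["cart", "pricing"], SERVICE_PODS, POD_NODE))
```

If two seemingly unrelated services alert at the same time and both run on the same node, the node becomes a candidate cause that a service map alone would not reveal.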
Degraded Mode of Operation
As with many distributed systems, especially a network of services, there are many points of failure in a microservice application. Even before a failure is manifestly obvious, the application could be operating in a degraded mode, because the overall application has many more possible states and microservice applications are designed so that individual services fail gracefully. Unfortunately, a degraded state can impact critical performance service levels, and such states can herald imminent failures.
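The distinction between healthy, degraded, and failed can be sketched as a simple classification over a service's error rate and latency. The thresholds, the `State` enum, and the `classify` function below are assumptions chosen for illustration; actual service-level objectives vary by application.

```python
from enum import Enum


class State(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    FAILED = "failed"


def classify(error_rate: float, p95_latency_ms: float,
             slo_error_rate: float = 0.01,
             slo_latency_ms: float = 300.0,
             fail_error_rate: float = 0.5) -> State:
    """Classify a service from its error rate and p95 latency.

    All thresholds are illustrative defaults, not recommended values.
    """
    if error_rate >= fail_error_rate:
        return State.FAILED
    # Still serving requests, but outside its service-level objectives.
    if error_rate > slo_error_rate or p95_latency_ms > slo_latency_ms:
        return State.DEGRADED
    return State.HEALTHY


if __name__ == "__main__":
    print(classify(error_rate=0.02, p95_latency_ms=120.0))
```

A binary up/down check would report the service in this example as up, while a three-state view flags it as degraded before it fails outright.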
Integrated Real-Time Insights
With many sources of data in a complex system, adding more dashboards only leads to ‘cognitive overload’, forcing the Ops team to go through a long series of visual queries looking for patterns that may not be applicable. More importantly, while distributed tracing tools can help in the diagnosis of performance problems, that approach forces the Ops team to operate primarily in a post-facto mode. What they lack is an integrated view into what is happening within and across the application without adding a dashboard per metric.
It is no wonder that Ops teams still need war rooms and have to put in hours of onerous work to handle performance issues, often resolving them well after the fact.
LACK OF INSIGHTS
In summary, Ops teams today have access to many resources to monitor their applications. However, the underlying complexity of microservice applications in terms of internal dependencies, the layered obfuscation of problems, the degraded mode of operations, and the lack of integrated real-time insights into the application create serious challenges for Ops teams.
In a future blog we will discuss what Ops teams could use to handle performance issues in microservice applications more proactively and effectively.
‘Monitoring Distributed Systems’, Rob Ewaschuk, Google Site Reliability Engineering. https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/
‘Evolution of Microservices’, Adrian Cockcroft, 2016, Craft Conference. https://www.slideshare.net/adriancockcroft/evolution-of-microservices-craft-conference
‘The Hows, Whys and Whats of Monitoring Microservices’, Dave Swersky, 2018. https://thenewstack.io/the-hows-whys-and-whats-of-monitoring-microservices
‘Debugging Microservices: Lessons from Google, Facebook, Lyft’, Joab Jackson, 3 Jul 2018. https://thenewstack.io/debugging-microservices-lessons-from-google-facebook-lyft/