Developers need information about technical failures in distributed systems and applications so that they can react quickly and prevent outages. Observability stands for a comprehensive approach that incorporates numerous factors for monitoring and observing the behavior of software. A central tool in this field is Prometheus.
Monitoring infrastructure for applications is becoming increasingly important, in part because the number of users and devices connected to the Internet grows every year. How does the system behave under load? How long does a transaction take? What about response times? Organizations should understand the extent to which a system is available, how it works, and how users perceive an application. So, in addition to the usual metrics, other criteria are gaining importance for better understanding how system architectures and software work and behave. This trend is known as observability and includes at least three pillars: metrics, logging, and tracing. Observability, however, also covers any insight that helps organizations better understand applications, how they behave, and how they function. It is about systematically identifying what software solutions are doing and how they are performing in order to prevent failures or to restore the software quickly after a failure (recovery).

A strategic approach is essential to systematically evaluate and combine metrics, logs, and profiles into a complete picture of the systems. But are the tools currently in use suitable for maximizing availability and minimizing the average time to problem resolution? No matter how much time and money companies invest in the availability of a system, there will always be incidents and failures. It is therefore important to prepare for, investigate, and evaluate such events. Is monitoring still relevant in the age of observability? This question is answered by looking at the evolution of monitoring and related tools over the last few years, and especially since the availability of the monitoring tool Prometheus.
Prometheus allows you to look inside the software
One trend is clear: monitoring is evolving toward whitebox monitoring. Unlike blackbox monitoring, whitebox tools observe the internal workings of a process, rather than just checking externally whether the application or process responds as expected. This is the small but important difference: whitebox monitoring observes behavior, not just reactions. It therefore enables proactive monitoring: errors, technical defects, and other events can be predicted and prevented before they occur. The original developers of Prometheus were inspired by Borgmon, the monitoring solution Google created to watch over its internal orchestration system. A tool like Borgmon was missing in the world outside Google, so the programmers, then working at SoundCloud, decided to develop such a solution.
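The difference can be sketched in a few lines. The following is a hypothetical illustration, not Prometheus code: `check_health` and `RequestTracker` are made-up names. A blackbox probe only sees whether a service answers, while a whitebox metric exposes what is happening inside the process.

```python
# Hypothetical sketch contrasting blackbox and whitebox monitoring.
# The names (check_health, RequestTracker) are illustrative only.

def check_health(respond):
    """Blackbox: probe from outside; all we learn is whether the service answers."""
    return "up" if respond() else "down"

class RequestTracker:
    """Whitebox: the process itself exposes counters about its internal behavior."""
    def __init__(self):
        self.requests_total = 0
        self.errors_total = 0

    def handle(self, ok):
        self.requests_total += 1
        if not ok:
            self.errors_total += 1

    def error_ratio(self):
        return self.errors_total / max(self.requests_total, 1)

tracker = RequestTracker()
for ok in [True, True, False, True]:
    tracker.handle(ok)

print(check_health(lambda: True))  # blackbox only reports "up"
print(tracker.error_ratio())       # whitebox reveals a rising error ratio: 0.25
```

A blackbox check would report "up" even while the error ratio climbs; the whitebox counters make that degradation visible before the service stops responding entirely.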
The success of Prometheus is based primarily on its reliability. This quality is central to a monitoring tool, since the solution that monitors the systems needs to be the most robust part of the infrastructure – everything else depends on it working. Prometheus achieves this reliability in part through its pull model: the server actively fetches metrics from its targets. This does not mean that pull is the only viable mode, but it makes reliable operation easier.
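The pull model can be sketched as a scrape loop. This is a simplified, assumed illustration: the in-memory "targets" stand in for HTTP endpoints serving the text-based exposition format, and the helper names are invented. The key property it shows is that a failing target cannot block the server; a failed scrape is simply recorded and is itself a signal.

```python
# Minimal sketch of a pull-based scrape loop. Targets are simulated as
# callables returning a text payload in the style of Prometheus' exposition
# format; a real server would fetch these over HTTP.

def parse_exposition(text):
    """Parse 'metric_name value' lines into a dict (comment lines ignored)."""
    samples = {}
    for line in text.strip().splitlines():
        if line.startswith("#") or not line.strip():
            continue
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples

# Simulated targets: each returns its current metrics payload when pulled.
targets = {
    "app-1": lambda: "# HELP http_requests_total Total requests.\nhttp_requests_total 1027",
    "app-2": lambda: "http_requests_total 42",
}

def scrape_all(targets):
    """The server pulls every target; a dead target yields None, not a hang."""
    results = {}
    for name, fetch in targets.items():
        try:
            results[name] = parse_exposition(fetch())
        except Exception:
            results[name] = None  # a failed scrape is itself a signal
    return results

print(scrape_all(targets))
```

Because the monitoring server initiates every request, it stays in control of its own load and timing – the monitored applications only need to expose a cheap, stateless endpoint.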
Prometheus can be set up as a single, statically linked binary that can be launched and updated very easily in any type of environment – whether containers are used or not. This simplicity, combined with reliable functionality, represents an important factor in the success of Prometheus.
Multidimensional data model
Prometheus relies on a multidimensional data model to identify time series, i.e. temporal sequences of data points. When Prometheus was developed, no integrated monitoring system allowed querying time series by a subset of their dimensions. OpenTSDB allowed similar queries, but caused high operational costs that Prometheus wanted to avoid. In its first version, Prometheus stored all of its data in the embedded database LevelDB. In the second storage version, LevelDB was used only to index the time series, while each time series was written to its own file. This worked very well for a long time, as the developers had anticipated dynamic environments rather than static virtual machines and built Prometheus accordingly. However, the scale and rate of change of today’s large Kubernetes clusters – and the possibility of running multiple clusters – exceeded all assumptions. The biggest challenge here was the cardinality of the metrics and their churn. Cardinality is the total number of time series recorded by Prometheus; it describes the number of time series sharing the same metric name but differing in individual label values. Churn describes the lifetime of the time series. The worst case for Prometheus is a high churn rate, where time series start and stop frequently: in the second storage version, a new file had to be created for each new series. The problem: millions of time series lead to millions of files, for which many file systems have to be specially tuned – or with which they may not cope at all.
The Prometheus team therefore developed a third storage version to solve this problem. Instead of storing one file per time series, the storage now consists of blocks, each of which is a fully functional, immutable database with its own copy of the index. As with many databases, the kernel can now map the storage efficiently from disk into memory. The new storage architecture solves the scalability problem and significantly reduces resource consumption in most scenarios. It was the main reason for the Prometheus 2.0 release in November 2017.
Stabilizing features
Since this release, the goal of the Prometheus project has been to stabilize existing functions. The team implemented a six-week release cycle, more detailed release documentation, and a lead role, and regularly has external security testing performed. In addition, there are a number of automated performance tests to identify problems during the development phase and verify Prometheus’ performance before release. The work paid off, as the Cloud Native Computing Foundation (CNCF) awarded Prometheus graduate status in August 2018. Prometheus is thus considered a future-proof project with stable performance and security that is not majority-owned by a single company. The CNCF seal is an important milestone for the project and the entire community.
Conclusion
Monitoring is becoming increasingly important, but it is only the entry point to successful observability. Together with other criteria, it remains a strong factor in better understanding distributed systems and applications. The future focus will be on correlating these different, and ever more numerous, observability signals. Alerting driven by metrics will continue to be the starting point for resolving errors and failures as quickly as possible.
Frederic Branczyk is a software engineer at Red Hat (joined in the course of the CoreOS acquisition), part of the Prometheus core team, and head of the Kubernetes Special Interest Group.
He is committed to significantly advancing observability tools. His goal is to develop modern infrastructure and SRE tools that help us understand the operational aspects of applications.