Monitoring Exa's Infrastructure

Overview

As an Internet Service Provider, Exa manages a national network alongside a complex SaaS (Software as a Service) platform - our content filtering service, Surfprotect. Detecting issues with our infrastructure early is crucial to maintaining high service quality. Through periodic reviews and postmortems held on past incidents, we aim to constantly improve our monitoring platform to provide faster problem detection and better diagnostic information.

This article covers the overhaul of our monitoring infrastructure over the past two years and the results and actions from our latest postmortem assessment. After investigation, we concluded that to aid our development team with debugging new, unexpected problems, we needed to perform continuous service profiling and provide significantly more detailed data about how our services perform under normal and outage conditions.

Our Starting Point

Our original monitoring solution was the result of organic evolution, with three open-source tools in use:

  • Nagios provided service availability checks and alerting.
  • Munin collected server statistics with long-term graphs.
  • Cacti provided network infrastructure availability checks and long-term graphs.

Monitoring servers and network devices via ICMP and polling our web services over HTTP provided a decent overview of our network and services. While this did not give us real-time information, we could still react promptly when network devices or services became unavailable. Access to resource utilisation graphs averaged over 5 minutes allowed us to proactively analyse our servers' performance history and predict future traffic patterns for capacity planning.

Our Challenges

While we were initially happy with this setup, as the number of devices monitored increased and our requirements and expectations grew, we encountered difficulties due to the organic growth of the platform and its lack of flexibility. In particular, we found that:

  • alerts were reactive instead of proactive
  • configuration was manual and time-intensive
  • we had missed configuration entries for some devices
  • we had no way to aggregate the data into a single view for reporting
  • our alerting rules were inconsistent between tools
  • notification duplication was causing notification fatigue

Our original platform always left us on the back foot during incidents due to a lack of proactive warning. In addition, our alerts lacked detail; each trigger required manual investigation to find the root cause.

Our services team configured all three tools manually since none had convenient API access. As a result, system configuration was time-consuming and error-prone, leading to unmonitored devices and gaps in infrastructure visibility.

We produce capacity management reports monthly to ensure our systems have sufficient capacity to cover customer demand. Graphs from our infrastructure are a critical part of these reports. However, with our data split between three tools, report creation was a slow and tedious task that took up much of our team's time.

Our engineers received hundreds of near-useless notifications every day. Overly sensitive triggers led to many repetitive alerts, and using multiple tools to send alerts only exacerbated the problem. The flood of alerts led to notification fatigue and caused critical warnings to go unnoticed.

Ultimately, these issues led us to consider alternatives. We concluded that our monitoring platform required a complete overhaul to produce the results we needed.

Our First Steps

Surfprotect Healthchecker

Our previous monitoring platform only let us infer service health from the health of the physical hosts; the lack of end-to-end service monitoring allowed incidents to go unnoticed.

Before we even started looking at a new solution, we introduced the Surfprotect Healthchecker. Since our service is a web proxy, most off-the-shelf tools are not well-equipped to monitor it effectively. We developed the Healthchecker to provide our engineers with high-resolution insight into the end-to-end performance of our services.

The Healthchecker service sends genuine web requests through the Surfprotect service to various endpoints; by monitoring the response latency and whether each page was correctly blocked or allowed, we get a well-rounded view of whether the service is performing as we would expect.
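
For illustration, the core of such a check can be sketched in a few lines of Go; the proxy address, test endpoint, and the assumption that blocked pages return HTTP 403 are all hypothetical, not how the Healthchecker is actually implemented.

```go
// A minimal sketch of a proxied health check, assuming a hypothetical proxy
// address and test endpoint, and that blocked pages return HTTP 403.
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"
	"time"
)

// checkThroughProxy sends one request via the proxy, returning the round-trip
// latency and whether the response looks like a block page.
func checkThroughProxy(proxyAddr, target string) (time.Duration, bool, error) {
	proxyURL, err := url.Parse(proxyAddr)
	if err != nil {
		return 0, false, err
	}
	client := &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
		Timeout:   10 * time.Second,
	}

	start := time.Now()
	resp, err := client.Get(target)
	latency := time.Since(start)
	if err != nil {
		return latency, false, err
	}
	defer resp.Body.Close()

	// Hypothetical convention: the filter serves blocked pages with HTTP 403.
	return latency, resp.StatusCode == http.StatusForbidden, nil
}

func main() {
	latency, blocked, err := checkThroughProxy("http://proxy.example:8080", "http://example.com/")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("latency=%v blocked=%v\n", latency, blocked)
}
```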

Surfprotect Healthchecker Example 1

An example of the Healthchecker view for a proxy in normal operation

During incidents or service maintenance, the Healthchecker shows the impact on customer traffic for a given service. For example, higher latency causes darker sections, and a complete outage shows as a solid block. This information allows us to estimate what our customers are experiencing when using Surfprotect.

Surfprotect Healthchecker Example 2

An example of the Healthchecker view for a proxy during maintenance

The Surfprotect Healthchecker is available to the public at health.surfprotect.co.uk.

Netdata

To improve our server monitoring capabilities, we started using an open-source tool called Netdata. Below is an example of the overview Netdata provides for one of our proxy servers.

Netdata Example View

Netdata provides thousands of high-resolution metrics for a deep-dive view of a server's performance. Netdata collects around 7,000 metrics on this particular server and can display 650 different graphs on its built-in web interface.

So far, Netdata has been incredibly useful in tracking down many underlying causes of operating system-level issues and has been deployed on all our servers and virtual machines. When our alerting indicates a problem with a server, our first step is to use Netdata to examine the server in more depth. It provides an excellent, detailed view of a server's health at a 1-second polling resolution. This resolution allows us to quickly identify and troubleshoot any issues that might arise, especially transient problems that are not visible on a 5-minute average view. In addition, our incident procedure now uses Netdata's export functionality to snapshot servers during and after an incident for later review as part of an investigation or postmortem.

Major Overhaul

Zabbix

Zabbix now performs both ICMP and SNMP polling for our infrastructure. Like Munin, it also provides a capable local agent, which we have deployed to all our servers and virtual machines. Zabbix also has a powerful web interface for viewing alerts, metrics, and configuration.

With Zabbix monitoring all of our infrastructure, it was the logical place to implement alerting, made even more compelling by its improved alerting rules with automatic anomaly detection. It took effort and tuning, but Zabbix became almost silent during regular operations, finally resolving our notification fatigue issue.

As our monitoring matured, we started to design and ingest more detailed metrics from our software using the Prometheus format, exposing many useful but previously invisible performance indicators, such as internal service latency. These values can provide early warning signs of incidents and assist with postmortem investigations into performance bottlenecks.
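
As a rough illustration of what exposing these metrics involves, here is a minimal Go sketch using the Prometheus client library; the metric name and port are illustrative, not the ones our software actually exports.

```go
// A minimal sketch of exposing Prometheus-format metrics from a Go service;
// the metric name and port are illustrative.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestsTotal counts every request the service handles.
var requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
	Name: "proxy_requests_total",
	Help: "Total number of requests handled by the proxy.",
})

func handle(w http.ResponseWriter, r *http.Request) {
	requestsTotal.Inc()
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/", handle)
	// All registered metrics appear at /metrics in the Prometheus text
	// format, where Zabbix (or any compatible poller) can scrape them.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```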

Since implementing Zabbix, we have increased our metric collection rate to just over 5,300 metrics per second, coming from around 2,500 unique devices and a total of about 117,000 metrics. While adding new devices to Zabbix, we have also worked on optimising which metrics we store to reduce our storage requirements.

Zabbix Processed Values

Our processed values over time on Zabbix since it was commissioned

Our Zabbix stack consists of three moderately powerful physical servers: one primary and two satellites, of which one provides storage and the other hosts additional services. Because Zabbix allows a per-metric polling rate, we have kept the storage requirement reasonable at around 1 TB for two years' worth of history. We can monitor critical metrics at an interval of up to once per second while saving storage space on non-critical metrics by polling them less frequently, such as once per hour or once per day.

To avoid replicating the manual configuration nightmare, we knew we had to automate Zabbix's device configuration from the start. To synchronise multiple data sources, we decided to use Ansible due to its strong integration with both Zabbix and our DCIM (Data Centre Infrastructure Management) tool, Netbox. Since the rest of our automation is performed with Ansible, we also wrote a custom plugin for our in-house customer service management system. All infrastructure and customer circuits are now automatically added to Zabbix as they go live.

We now monitor previously unseen metrics using SNMP polling of our network infrastructure. For example, monitoring the transmit power of our optical fibre transceivers paid off almost immediately by allowing us to detect a problem with an optic before it failed. Shortly after adding this check, we received an alert that the optical power had varied; upon inspection, we saw the following graph of one of our laser diodes dropping in transmit power more quickly than we would typically expect. We promptly replaced the transceiver, preventing one of our core network links from going down unexpectedly.

Optical Level Example

Power level of a failing diode, before and after replacement
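
For readers unfamiliar with this kind of polling, the sketch below shows roughly how a transceiver sensor can be read over SNMP in Go using the gosnmp library; the target address and OID are placeholders (the real OID depends on the vendor's optical-monitoring MIB), and in practice Zabbix performs this polling for us.

```go
// A minimal sketch of reading a transceiver sensor over SNMP with the gosnmp
// library; the target address and OID are placeholders.
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/gosnmp/gosnmp"
)

func main() {
	snmp := &gosnmp.GoSNMP{
		Target:    "192.0.2.10", // hypothetical switch address
		Port:      161,
		Community: "public",
		Version:   gosnmp.Version2c,
		Timeout:   2 * time.Second,
		Retries:   1,
	}
	if err := snmp.Connect(); err != nil {
		log.Fatal(err)
	}
	defer snmp.Conn.Close()

	// Placeholder OID; the real transmit-power OID depends on the vendor's
	// optical-monitoring MIB.
	result, err := snmp.Get([]string{"1.3.6.1.4.1.0.0.0"})
	if err != nil {
		log.Fatal(err)
	}
	for _, v := range result.Variables {
		fmt.Printf("%s = %v\n", v.Name, v.Value)
	}
}
```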

Since then, our networking team has used Grafana dashboards to monitor our network's status in real time and closely track our customers' connections, with a centralised view of all infrastructure.

PPP Sessions Migration

PPP session counters on some of our LNS infrastructure during a customer migration.

Grafana

While Zabbix provides graph and dashboard functionality, we knew that down the line we would need to integrate other data sources, such as profiles and logs. Grafana supports a wide range of backend data sources, providing a centralised view of all available data.

With dedicated dashboards created for critical infrastructure such as Surfprotect, we now have a centralised, near real-time view of how our services perform. In addition, we use six displays in our office to show metric and alert dashboards, providing instant access to critical metrics and currently detected issues, as shown below.

Surfprotect Dashboard

Our Grafana dashboard for Surfprotect used in our office, with latency metrics from the Surfprotect Healthchecker

Services Incident Dashboard

Our services incident dashboard showing certificate expiry warnings.

Grafana dashboards provide insights into how our services are performing and how much load they are experiencing. We create reports from these dashboards to assist with our service capacity planning, allowing us to plan future infrastructure improvements. These reports can now be produced much more quickly and with much higher accuracy than was previously possible.

Surfprotect Overall Bandwidth

Grafana Panel of our Surfprotect bandwidth

With these new dashboards, we can see our total Surfprotect throughput at any given time. We can also see traffic spikes on certain occasions, such as during England's 2022 World Cup game, when traffic through Surfprotect more than doubled without issue.

Major traffic events show us how well our capacity planning is performing. Since we encountered no slowdowns during this peak, we can be confident that our current capacity is sufficient to meet customer requirements.

Grafana's integration with many data sources allows us to pull analytics data from our backend Surfprotect services to see traffic patterns. By analysing this data, we can better understand how our customers use our services and identify areas for improvement.

To illustrate this point, here are some examples of dashboards we now have access to using Grafana to bring all data into a single view.

Surfprotect Latency Grafana

Grafana Panel showing the Healthchecker Surfprotect end-to-end latency in high resolution.

Surfprotect Heatmap Extract

Part of our Surfprotect deep-dive dashboard showing volumetric and latency statistics from one of our proxy servers (Full Dashboard View)

Current Work

Grafana Phlare

Applications can expose profiling data, giving developers a sampled view of where CPU time is spent in a process and where memory is allocated or held. This data is valuable for optimising software performance.
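
As a hedged illustration, if a service is written in Go (an assumption, not a statement about our codebase), exposing such profiling endpoints can be as simple as the sketch below; continuous profilers can then pull pprof samples from these endpoints on a schedule.

```go
// A minimal sketch of exposing pprof profiling endpoints, assuming a Go
// service; a continuous profiler can pull samples from these endpoints
// periodically.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// CPU profiles are served at /debug/pprof/profile and heap profiles at
	// /debug/pprof/heap; a profiler scrapes these on a schedule.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```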

During postmortems on recent incidents, we identified a potential improvement to our monitoring: continuous profiling.

We have had access to profile data for some time; however, our postmortem discussions revealed a need to gather profile data continuously rather than manually. Continuous profile data gives developers much more information, as it shows how an application changes under different load scenarios. Furthermore, once we know how a service behaves during regular operation, issues become apparent as differences in behaviour visible in the profile data.

While we were investigating potential options for continuous profiling, Grafana Labs (the company behind the Grafana software) released the first version of a new tool called Grafana Phlare. Every 10 seconds, Phlare makes profile sample requests to a list of user-defined profiling endpoints.

Grafana Phlare Example

A dashboard showing Phlare data for CPU time and Memory allocations

This functionality will allow a much faster response to more complex performance-related issues with our in-house software. We are currently in the process of testing and deploying Phlare to our Surfprotect proxies. Phlare gives our software developers critical information about the performance of our software, allowing them to diagnose issues faster and tune performance with confidence.

Future Plans

Providing Customer Access to Service Statistics

As part of our ongoing efforts to improve our monitoring capabilities, we are planning to add customer circuit monitoring metrics to our customer panel. This new feature will allow customers to view the quality and utilisation of their service in real time, giving them greater visibility and control. It will also help customers plan better for their future needs and make more informed decisions about their service.

Victoria Metrics

Victoria Metrics is a high-performance software suite for time series data; its primary use case is scraping and storing software metrics. We were initially intrigued by Victoria Metrics's high scrape and storage performance.

We began by collecting the Prometheus metrics exposed by our services using Zabbix; however, we encountered some limitations. All metrics scraped by Zabbix must be pre-configured in the UI. This limitation goes against our goal of complete automation since new metrics added by our developers are not automatically stored.

Metrics such as service latency are often better visualised as histograms. Unfortunately, Zabbix does not support histograms natively; therefore, to display histograms in Grafana, manual configuration is required. This manual step similarly goes against our automation goals. Victoria Metrics has full histogram support and will ingest all metrics available at the requested endpoint without pre-configuration.
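
To illustrate why native histogram support matters, here is a minimal Go sketch that records latency as a Prometheus histogram; the metric name and bucket boundaries are illustrative. Once scraped, the buckets can be charted as a heatmap in Grafana or reduced to percentiles.

```go
// A minimal sketch of recording latency as a Prometheus histogram; the metric
// name and bucket boundaries are illustrative.
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var upstreamLatency = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "upstream_request_duration_seconds",
	Help:    "Latency of upstream requests handled by the proxy.",
	Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5},
})

// observe wraps a unit of work and records its duration in the histogram.
// In Grafana the buckets can be charted as a heatmap or reduced to
// percentiles with histogram_quantile().
func observe(work func()) {
	start := time.Now()
	work()
	upstreamLatency.Observe(time.Since(start).Seconds())
}

func main() {
	observe(func() { time.Sleep(20 * time.Millisecond) })
}
```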

During testing, we've seen impressive query performance on large datasets, retrieving almost a million rows in under one second. This result is particularly impressive because we abandoned a similar query on Zabbix when it took too long to return. Performance like this allows quick, high-resolution queries when analysing historical data for comparison, without having to aggregate older data into trends.

Victoria Metrics Query Performance Example

Victoria Metrics retrieving almost a million rows in under one second during testing.

While we are trying to avoid returning to a multi-vendor solution, we can reduce the impact of adding extra tools significantly since Victoria Metrics is fully automatable, and the advantages outweigh the initial setup work required.

Victoria Metrics cannot fully replace Zabbix, since Zabbix performs ICMP, SNMP, and OS monitoring better in our environment. By combining the two tools we get the advantages of both, and with Grafana we can seamlessly integrate metrics from each into a single dashboard.

We are currently in the process of evaluating Victoria Metrics and plan to begin rollout later this year.

Distributed Traces

When working on service optimisation, we need to find which piece of software in the request chain is causing slowdowns. Using our end-to-end monitoring and service metrics, we can see the data required for diagnostics; however, this process can be time-consuming.

The introduction of traces would centralise our access to the required data and simplify the visualisation of latency across our services. By following requests through their journey, we can get a full view of where a request is spending time and how we can improve our service quality.
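
As a rough sketch of what instrumenting a request chain might look like, the example below creates nested spans with OpenTelemetry in Go; the span and attribute names are hypothetical, and a real deployment would also configure a tracer provider with an exporter to a backend such as Jaeger or Grafana Tempo.

```go
// A minimal sketch of creating nested trace spans with OpenTelemetry; the
// span and attribute names are hypothetical. A real deployment would also
// configure a tracer provider with an exporter to Jaeger or Grafana Tempo.
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

func handleRequest(ctx context.Context, url string) {
	tracer := otel.Tracer("surfprotect-proxy") // instrumentation name is illustrative

	// Parent span covering the whole request through the proxy.
	ctx, span := tracer.Start(ctx, "proxy.handle_request")
	defer span.End()
	span.SetAttributes(attribute.String("request.url", url))

	// Child span for one stage of the chain; its duration appears as a
	// segment on the trace timeline, making slow stages easy to spot.
	_, filterSpan := tracer.Start(ctx, "proxy.filter_lookup")
	// ... filtering decision would happen here ...
	filterSpan.End()
}

func main() {
	handleRequest(context.Background(), "https://example.com/")
}
```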

Many tools are available to store trace data, such as Jaeger and Grafana Tempo. We have yet to explore this technique in-depth, but it may help us get a more comprehensive view of how our services perform.

Centralised Logging

During an incident investigation, the goal is to track down the series of events that caused the problem. Since each of our services logs locally, we can follow a request from host to host to wherever the issue is occurring.

We are currently working on automating our service deployments; part of this work is migrating services into containers, which raises logging concerns. To keep up with the dynamic creation and destruction of containers, we can implement centralised logging to capture logs from all containers regardless of location and ship them to a central log storage system. This central view gives engineers access to all service logs irrespective of where or how they are deployed.
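
As a small illustration of the service side of this, the Go sketch below writes structured JSON logs to stdout; the field names are hypothetical. A log collector (for example, Promtail shipping to Grafana Loki) can then gather container stdout regardless of where the container runs.

```go
// A minimal sketch of structured logging to stdout; the field names are
// hypothetical. A collector such as Promtail can ship container stdout to
// Grafana Loki regardless of where the container runs.
package main

import (
	"log/slog"
	"os"
)

func main() {
	// JSON logs on stdout are easy for a log shipper to parse and label.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	logger.Info("request filtered",
		"request_id", "abc123", // hypothetical request identifier
		"decision", "blocked",
		"latency_ms", 12,
	)
}
```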

There are many tools available for centralised logging; however, for our use case, the one we have been investigating is Grafana Loki, as it can integrate with the rest of our monitoring stack and performs well on low-spec hardware.

Conclusion

Monitoring is a never-ending project that we must keep on top of to ensure we provide our engineers with access to critical service metrics and alerts. Our monitoring has evolved from a static setup with manual configuration to a highly dynamic environment with automated configuration and hundreds of services.

To provide our customers with an insight into our operations, we provide a couple of public services:

  • Status page - Our status page provides notifications regarding ongoing incidents and planned maintenance.
  • Healthchecker - The healthchecker gives an overview of Surfprotect service performance.