
16 Nov 2023 · 8 min read

How to Choose the Right Kubernetes Monitoring Stack Tools

This in-depth guide covers metrics, tracing, log aggregation, and APM solutions for gaining full visibility into your Kubernetes infrastructure and applications. Monitor metrics, traces, logs, and performance to optimize resources, cut costs, and ensure availability.

Jack Dwyer


Introduction to Kubernetes Monitoring

Effective monitoring and observability are critical for running production Kubernetes clusters and workloads. As a dynamic orchestration system for containers, Kubernetes has many moving parts and complex interactions between those components. Without comprehensive monitoring and logging, it becomes very difficult to track the health and performance of a Kubernetes cluster, troubleshoot issues, and optimize resource utilization.

The key aspects of monitoring and observability for Kubernetes include:

  • Metrics - Measuring and collecting key performance indicators related to nodes, pods, containers, network traffic and more in real-time. Common metrics include CPU, memory, disk usage, request rates, error rates, etc.
  • Logging - Aggregating and analyzing logs generated by nodes, containers, Kubernetes components, and applications. Logging provides insights into operations, security and errors.
  • Tracing - Following the path of requests through microservices to pinpoint performance issues and errors. Distributed tracing is essential for monitoring complex microservices architectures.
  • Alerting - Setting up thresholds and alerts for critical metrics so that anomalies and issues can be detected proactively before they cause performance problems or downtime.

The goals of monitoring and observability for Kubernetes include ensuring high availability of critical applications, optimizing resource utilization, meeting SLAs, detecting and troubleshooting operational issues quickly, and gaining insights into usage patterns for capacity planning.

Effective Kubernetes monitoring requires choosing the right open source tools or enterprise monitoring platforms tailored for container infrastructure. The subsequent sections provide more details on metrics, logging, tracing and the leading Kubernetes monitoring tools.

Monitoring Kubernetes Cluster Resources

Keeping a close eye on cluster resources like CPU, memory, disk, and network usage is crucial for optimizing Kubernetes performance and reliability. By actively monitoring resource usage, you can identify trends, anomalies, and capacity constraints before they cause problems.

Some key cluster resource metrics to monitor include:

CPU Usage

  • Overall cluster CPU utilization percentage
  • CPU requests vs CPU limits
  • CPU usage breakdown by node
  • CPU throttling events

Monitoring CPU usage helps identify nodes under contention and workloads that are resource-starved. Spikes in CPU usage may indicate an issue that warrants investigation.
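As a minimal sketch of what such a check might look like, the function below flags nodes whose CPU utilization crosses a contention threshold. The node names, percentages, and the 80% threshold are all illustrative, not values from a real cluster:

```python
# Sketch: flag nodes whose CPU utilization exceeds a contention threshold.
# Node names and metric values are illustrative, not from a real cluster.

def find_contended_nodes(node_cpu_pct, threshold=80.0):
    """Return node names whose CPU utilization percentage exceeds the threshold."""
    return sorted(name for name, pct in node_cpu_pct.items() if pct > threshold)

usage = {"node-a": 92.5, "node-b": 41.0, "node-c": 85.1}
print(find_contended_nodes(usage))  # ['node-a', 'node-c'] - investigate these first
```

In practice the input dictionary would be populated from a metrics source such as the Kubernetes Metrics API or Prometheus rather than hard-coded.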

Memory Usage

  • Total memory capacity
  • Overall memory usage percentage
  • Memory requests vs limits
  • Memory usage breakdown by node
  • Page faults, cache misses

Tracking memory usage is important to ensure workloads have sufficient available memory. Memory pressure can cause Kubernetes to evict pods.

Disk Usage

  • Overall disk read/write rates
  • Disk capacity and utilization percentage
  • Disk usage breakdown by node
  • Disk IO throttling events

Monitoring disk usage helps avoid running out of disk space and identify workloads causing excessive disk IO that may warrant optimization.

Network Usage

  • Network receive/transmit rates
  • Network errors and dropped packets
  • Network traffic breakdown by node

Monitoring network usage provides visibility into pod communication and services. Network errors can indicate connectivity issues.

In addition to current usage metrics, it's also important to monitor node capacity and utilization over time. This allows you to predict when you may need to add additional nodes to your cluster to handle increased demand. The key is setting up proactive monitoring and alerts to notify you before lack of resources causes application outages or performance issues.
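To make the capacity-planning idea concrete, here is a simple sketch that fits a linear trend to historical utilization samples and estimates how many days remain before capacity is exhausted. The sample data and the assumption of linear growth are illustrative; real capacity forecasting is usually more sophisticated:

```python
# Sketch: project when a linearly growing resource will hit capacity.
# Samples are (day, percent_used) pairs; the data is purely illustrative.

def days_until_full(samples, capacity_pct=100.0):
    """Fit a least-squares line and estimate days until capacity is reached.

    Returns None if usage is flat or shrinking."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    slope_num = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope_den = sum((x - mean_x) ** 2 for x, _ in samples)
    slope = slope_num / slope_den
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    last_day = samples[-1][0]
    return (capacity_pct - intercept) / slope - last_day

# At 1% growth per day, 37 days remain from the last sample (day 3, 63%).
history = [(0, 60.0), (1, 61.0), (2, 62.0), (3, 63.0)]
print(days_until_full(history))  # 37.0
```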

Monitoring Kubernetes Pods and Containers

Kubernetes pods and containers are key components that require close monitoring to ensure application availability and optimal resource utilization. Here are some key metrics and strategies for monitoring pods and containers:

Tracking Pod Status and Restarts

Monitoring pod status gives insight into the health and availability of your applications. Key metrics include:

  • Pod phase - Pending, Running, Succeeded, Failed, Unknown
  • Pod restarts - This helps identify frequent restarts that may indicate an issue
  • Pod readiness - Checks if a pod is ready to serve requests

Alerts can be set up for pods stuck in non-running phases or pods that are restarting frequently. The Kubernetes API and CLI tools provide access to pod status and metrics.
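A sketch of this kind of check is shown below: it surfaces pods that are stuck in a non-running phase or restarting too often. The pod records mimic fields exposed by the Kubernetes API, but the names, phases, and the restart threshold are made up for illustration:

```python
# Sketch: surface pods stuck in a non-running phase or restarting often.
# Pod records mimic Kubernetes API fields; the values are illustrative.

HEALTHY_PHASES = {"Running", "Succeeded"}

def unhealthy_pods(pods, max_restarts=5):
    """Return names of pods that warrant investigation."""
    flagged = []
    for pod in pods:
        if pod["phase"] not in HEALTHY_PHASES or pod["restarts"] > max_restarts:
            flagged.append(pod["name"])
    return flagged

pods = [
    {"name": "api-1", "phase": "Running", "restarts": 0},
    {"name": "api-2", "phase": "Running", "restarts": 12},  # crash-looping
    {"name": "job-1", "phase": "Pending", "restarts": 0},   # stuck scheduling
]
print(unhealthy_pods(pods))  # ['api-2', 'job-1']
```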

Monitoring Container CPU/Memory Usage

Container resource usage should be tracked to identify over- or under-utilization. Important metrics are:

  • CPU utilization - Percentage of allocated CPU being used
  • Memory utilization - Percentage of allocated memory being used
  • Throttling - When a container hits resource limits

Comparing requests vs limits helps identify misconfigurations. Alerts can detect when containers are repeatedly throttled, which indicates insufficient resources.
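The requests-vs-limits comparison can be sketched as a simple linting function. For brevity the quantities here are plain millicore/MiB numbers; real manifests use Kubernetes quantity strings like `500m` or `256Mi`, which would need parsing first:

```python
# Sketch: flag containers whose requests/limits look misconfigured.
# Quantities are plain millicore/MiB numbers; real manifests use quantity
# strings like "500m" or "256Mi".

def check_container(requests, limits):
    """Return a list of human-readable warnings for one container."""
    warnings = []
    for resource in ("cpu", "memory"):
        req, lim = requests.get(resource), limits.get(resource)
        if req is None:
            warnings.append(f"no {resource} request set")
        elif lim is not None and req > lim:
            warnings.append(f"{resource} request exceeds limit")
    return warnings

print(check_container({"cpu": 500, "memory": 512}, {"cpu": 250, "memory": 1024}))
# ['cpu request exceeds limit']
```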

Watching for Resource Quota Breaches

Resource quotas cap the resources that pods and containers in a namespace can consume. Monitoring should check for quota breaches via metrics like:

  • CPU/memory quota usage
  • Number of pods/services approaching the quota limit

Alerts can notify when quotas are close to being breached. Proactive quota increases can prevent application disruption and outages.
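A near-breach check might look like the following sketch, which warns when usage crosses a configurable fraction of the hard quota. The resource names mirror Kubernetes quota keys, but the numbers and the 90% threshold are illustrative:

```python
# Sketch: warn when namespace quota usage crosses a percentage threshold.
# Resource names mirror Kubernetes quota keys; the numbers are illustrative.

def quota_warnings(used, hard, threshold=0.9):
    """Return resources whose usage is at or above `threshold` of the quota."""
    return sorted(
        r for r in hard
        if hard[r] > 0 and used.get(r, 0) / hard[r] >= threshold
    )

used = {"pods": 47, "requests.cpu": 18}
hard = {"pods": 50, "requests.cpu": 20}
print(quota_warnings(used, hard))  # both are at >= 90% of their quota
```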

Carefully tracking pod status, container resource usage, and quota breaches provides critical visibility into the health and resource needs of Kubernetes applications. The right monitoring and alerts enable preemptive action to optimize application uptime and performance.

Application and Service Monitoring

Monitoring applications and services running in Kubernetes is crucial to ensure performance and availability. Key metrics to monitor include:

Response Times and Latency

  • Measure the time it takes for an application to respond to requests. This can be broken down into:
      • Network latency - The time for the request and response to travel over the network
      • Server processing time - The time an application takes to process the request and generate the response
  • High response times indicate performance problems. Set thresholds for warning and critical alerts.
  • Segment response times by endpoints, geographic regions, and other dimensions to isolate poorly performing areas.
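Latency thresholds are usually set on percentiles rather than averages, since averages hide tail latency. The sketch below computes percentiles from raw samples using the nearest-rank method; the millisecond values are illustrative:

```python
# Sketch: compute latency percentiles from raw response-time samples using
# the nearest-rank method. Sample values are illustrative milliseconds.

import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

latencies_ms = [12, 15, 14, 200, 16, 13, 18, 17, 950, 14]
# The median looks fine, but the tail reveals a problem.
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95))  # 15 950
```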

Errors

  • Track the rate of HTTP errors such as 4xx and 5xx status codes. High error rates signify application issues.
  • Log error messages to help troubleshoot the root cause. Search logs for correlated spikes in errors.
  • Monitor for timeout errors as these often indicate overloaded applications or services.

Performance vs. SLAs

  • Compare response times, error rates, and throughput against service level objectives (SLOs).
  • Use availability monitoring to track uptime and downtime against SLOs.
  • Align monitoring with business KPIs like orders processed, signups completed, etc.
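A common way to compare performance against an SLO is through an error budget: the fraction of requests the SLO permits to fail. The sketch below turns an SLO target into a budget and reports how much remains; the request counts are illustrative:

```python
# Sketch: turn an SLO target into an error budget and check what's left.
# Numbers are illustrative; a 99.9% SLO allows 0.1% of requests to fail.

def error_budget_remaining(slo_pct, total_requests, failed_requests):
    """Return the fraction of the error budget still unspent (can go negative)."""
    allowed_failures = (1 - slo_pct / 100) * total_requests
    return 1 - failed_requests / allowed_failures

# 1,000,000 requests against a 99.9% SLO -> a budget of 1,000 failures;
# 400 failures spends 40% of it, leaving roughly 0.6 of the budget.
print(error_budget_remaining(99.9, 1_000_000, 400))
```

Alerting when the remaining budget drops quickly (a high "burn rate") is a standard way to catch SLO risk early.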

End User Experience

  • Employ synthetic monitoring to simulate user transactions through your applications. Measure success rate, response times, and functional correctness.
  • Implement real user monitoring to measure actual user interactions. Capture JS errors, slow endpoints, failed transactions.
  • Set up outside-in monitoring from geographically diverse regions to continually measure end user experience.

Carefully monitoring key application and service metrics helps ensure they are meeting performance and availability goals. Configure intelligent alerts so teams can rapidly detect and troubleshoot any issues.

Open Source Kubernetes Monitoring Tools

Open source tools provide a flexible and customizable way to monitor Kubernetes. Here are some of the most popular open source options:

Prometheus

Prometheus is a time-series database for collecting and storing metrics. It scrapes metrics exposed by Kubernetes components and workloads. Key features include:

  • Multi-dimensional data model with time series data identified by metric name and key/value pairs
  • PromQL query language to generate ad-hoc graphs and alerts
  • Supports service discovery to automatically detect Kubernetes pods and services
  • Highly customizable with easy integration of custom exporters and alerts

Prometheus works great with Grafana for visualizing the data.
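To illustrate what scraped metrics look like, the sketch below parses one line of the Prometheus text exposition format into its metric name, labels, and value. It deliberately handles only the simple, common case (no label escaping, timestamps, or `# HELP`/`# TYPE` lines):

```python
# Sketch: parse one line of the Prometheus text exposition format.
# Handles the simple, common case only (no escaping or timestamps).

import re

LINE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')

def parse_metric(line):
    """Return (name, labels_dict, value) for one exposition-format line."""
    name, raw_labels, value = LINE.match(line).groups()
    labels = dict(LABEL.findall(raw_labels or ""))
    return name, labels, float(value)

sample = 'container_cpu_usage_seconds_total{pod="api-1",namespace="prod"} 42.5'
print(parse_metric(sample))
# ('container_cpu_usage_seconds_total', {'pod': 'api-1', 'namespace': 'prod'}, 42.5)
```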

cAdvisor

cAdvisor (Container Advisor) provides per-container resource usage statistics such as CPU, memory, filesystem, and network usage. It is integrated into the kubelet agent on each Kubernetes node to collect resource usage metrics.

Elastic Stack

The Elastic Stack (formerly the ELK stack) comprises Elasticsearch, Logstash, Kibana, and Beats for log aggregation, storage, and analysis. Fluentd is also commonly used to ship Kubernetes logs to Elasticsearch. Key components:

  • Elasticsearch: Distributed search and analytics engine for log storage
  • Logstash, Fluentd: Collect and ship logs from Kubernetes pods
  • Kibana: Visualize logs and perform advanced log analytics
  • Beats: Lightweight data shippers for log collection

The Elastic stack provides powerful log management capabilities for Kubernetes.

Kubernetes Dashboard

The official Kubernetes Dashboard is a general-purpose web UI for cluster monitoring and administration. It provides an overview of applications, workloads, pods, nodes, and namespaces, along with resource usage metrics and health status.

Lens

Kubernetes Lens is an open source Kubernetes IDE for visualization and management of clusters and workloads. It provides detailed views into pods, nodes, containers, logs and configuration. Resource metrics and health status are also displayed.

Open source tools provide powerful yet customizable Kubernetes monitoring options. Solutions like Prometheus, Elastic stack and Lens offer deep visibility into cluster health, application performance and logs.

Enterprise Kubernetes Monitoring Platforms

For organizations that require an enterprise-grade solution, there are robust commercial Kubernetes monitoring platforms available that provide an aggregated view across metrics, logs, and traces. These solutions offer advanced features beyond open source options for monitoring large and complex Kubernetes environments.

Datadog

Datadog is a leading SaaS monitoring and analytics platform designed specifically for containers, microservices and cloud infrastructure. Key capabilities of Datadog for Kubernetes monitoring include:

  • Pre-built Kubernetes dashboards and integrations
  • Infrastructure monitoring covering nodes, deployments, pods, containers
  • Application performance monitoring with distributed tracing
  • Advanced algorithms for anomaly detection
  • Powerful alerting and collaboration tools
  • Scales to monitor any size Kubernetes cluster
  • Storage of metrics for up to 1 year for trend analysis

Datadog seamlessly integrates data from Kubernetes, applications, tools like Prometheus and databases into one unified platform. This provides full-stack observability across infrastructure and applications.

Sysdig

Sysdig offers Kubernetes-native monitoring combining metrics, events, and context. Key features include:

  • Kubernetes health monitoring with intelligent alerts
  • Application monitoring and topology mapping
  • Operational intelligence using metadata to connect metrics and events
  • Embedded Prometheus metrics scraping
  • Distributed tracing for microservices
  • Customizable dashboards and reporting

Sysdig specializes in container monitoring and aims to simplify troubleshooting performance issues across Kubernetes clusters, nodes, and workloads.

Dynatrace

Dynatrace offers an AI-powered observability and monitoring platform with advanced capabilities for Kubernetes environments, including:

  • Automatic discovery of Kubernetes objects, topology and mapping
  • AI-powered root cause analysis and anomaly detection
  • Application performance management and distributed tracing
  • Log aggregation and analytics
  • User experience and behavior monitoring
  • Closed-loop remediation workflows

Dynatrace leverages an AIOps approach combining metrics, logs, and traces with AI and automation to monitor Kubernetes clusters at scale. This simplifies cloud-native observability.

Benefits of Enterprise Kubernetes Monitoring

Enterprise Kubernetes monitoring platforms provide benefits like:

  • Consolidated observability - metrics, logs, traces in one platform
  • Powerful visualization and pre-built dashboards
  • Advanced APM, anomaly detection and tracing capabilities
  • Scalability to monitor large, complex environments
  • Reliability with high-availability architecture
  • Security features and access controls
  • Long term metric storage for trend analysis
  • 24x7 support from Kubernetes experts

The comprehensive capabilities of these solutions justify their cost for monitoring business-critical Kubernetes clusters and cloud-native infrastructure.

Logging Architecture and Aggregation

Logging and log management is a critical component of monitoring Kubernetes. The logging architecture in Kubernetes consists of several components:

Log Collection

The logs from applications and Kubernetes system components need to be collected and stored centrally. This is done by log collection agents like Fluentd. Logs are collected from:

  • Containers - Stdout and stderr logs from all containers
  • Kubernetes Components - API server, scheduler, controller manager, etc.
  • Nodes - System logs from the Kubernetes nodes

Fluentd runs as a DaemonSet on each node to collect logs and forward them to central storage. Filebeat can also be used for log collection.

Log Storage and Aggregation

The collected logs are stored in a central place for aggregation. This allows you to search and analyze logs from all sources in one place. Popular choices for log aggregation include:

  • Elasticsearch - Stores logs and enables complex search, analytics and visualizations with Kibana.
  • Graylog - Open source log management solution. Provides search and analytics.
  • Amazon CloudWatch Logs - Fully managed log aggregation service on AWS.

Elasticsearch is a common choice as the storage backend for logs in Kubernetes. Fluentd forwards logs to Elasticsearch for central aggregation.

Log Parsing

Raw log data is hard to analyze. The log aggregation system parses the raw logs to extract metadata like timestamps, source, log levels etc. This enables powerful search and filtering.

For example, in the Elastic Stack, Logstash parses incoming logs before they are stored in Elasticsearch. Logstash filters extract structured fields from unstructured log data.
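A minimal sketch of this kind of parsing is shown below: a regular expression extracts timestamp, level, and message fields from a raw line, similar to what a grok filter does. The log format is an assumption; real pipelines define patterns per log source:

```python
# Sketch: extract timestamp, level, and message fields from a raw log line,
# similar to a Logstash grok filter. The log format is an assumption.

import re

PATTERN = re.compile(
    r'^(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\S*)\s+'
    r'(?P<level>[A-Z]+)\s+'
    r'(?P<message>.*)$'
)

def parse_log_line(line):
    match = PATTERN.match(line)
    return match.groupdict() if match else {"message": line}  # keep unparsable lines

raw = "2023-11-16T10:42:07Z ERROR connection to db-primary timed out"
print(parse_log_line(raw))
# {'timestamp': '2023-11-16T10:42:07Z', 'level': 'ERROR',
#  'message': 'connection to db-primary timed out'}
```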

Log Analysis and Visualization

Once aggregated, logs need to be analyzed and visualized to derive insights. This is done through:

  • Search and filtering logs based on metadata.
  • Visualizations like charts, graphs and dashboards.
  • Analytics like anomaly detection.

Kibana provides rich search, filtering, dashboards and visualizations for logs stored in Elasticsearch. Other analytics tools can also be connected to the log store.

This architecture allows aggregating logs from across the distributed Kubernetes environment and enables log analytics at scale.

Tracing Architecture in Kubernetes

Distributed tracing is a critical technique for monitoring and troubleshooting distributed applications like those running on Kubernetes. There are many open-source tools on the market for this; however, Zeet handles all of this natively, so you don't have to worry about it should you want to leverage Kubernetes' powerful offering.

Connecting Traces to Logs and Metrics

To get maximum insights, traces need to be correlated with log data and metrics. For example, linking a trace to the logs emitted when a specific request was handled can reveal useful debugging information.

Connecting traces with Prometheus metrics like request rates, error counts, and durations allows detecting anomalies. Some tracing platforms like Jaeger allow storing trace metadata within logs using the OpenTelemetry logging exporter.
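One common correlation technique is stamping every log record with the active trace ID, so logs can later be joined against traces. The sketch below does this with Python's standard `logging` module; the trace ID is generated locally for illustration, whereas real systems take it from OpenTelemetry context propagation:

```python
# Sketch: stamp every log record with the current trace ID so traces can be
# joined against logs later. The trace ID here is generated locally for
# illustration; real systems take it from OpenTelemetry context propagation.

import logging
import uuid

class TraceIdFilter(logging.Filter):
    """Attach a trace_id attribute to every record passing through."""

    def __init__(self, trace_id):
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record):
        record.trace_id = self.trace_id  # available to the formatter below
        return True

logger = logging.getLogger("api")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(trace_id)s %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.addFilter(TraceIdFilter(uuid.uuid4().hex))
logger.warning("slow upstream call")  # the log line now carries the trace ID
```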

Using Tracing for Troubleshooting

When issues arise like high latency or errors, distributed tracing gives tremendous troubleshooting abilities. Just find the problematic trace and inspect each span to pinpoint the root cause.

You can see the exact microservice that introduced the latency or error and all side effects. Tracing empowers developers to quickly remediate issues by going directly to the relevant logs and metrics for the specific failing request.

With distributed tracing architecture set up on Kubernetes using Jaeger, Zipkin or similar tools, you gain deep monitoring, alerting and troubleshooting capabilities for microservices. Tracing and connecting it with logs and metrics provides powerful Kubernetes observability.

Setting up Alerting and Notifications

Alerting is a critical component of any monitoring strategy. Alerts allow you to get notified in real-time when critical issues occur or metrics breach certain thresholds. This gives you the opportunity to quickly troubleshoot and resolve problems before they cause major outages.

When setting up alerts for your Kubernetes environment, you should consider alerting on key metrics like CPU and memory utilization, application errors and latency, pod restart rate, and node status changes. You can define alerting rules on these metrics with reasonable thresholds based on your applications and infrastructure.

Many Kubernetes monitoring tools like Prometheus let you define alerting rules on metrics that fire when the configured conditions are met. You can also set up alerts on the logs collected from containers and applications running in the cluster. Log-based alerts can notify you of application errors, authorization failures, node issues, and more.
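The core semantics of such a rule, including the "hold for N intervals before firing" behavior that Prometheus expresses with a `for:` clause, can be sketched as follows. The threshold, interval count, and samples are illustrative:

```python
# Sketch: evaluate a Prometheus-style alert rule with a "for" duration --
# the alert only fires after the condition has held for N consecutive
# evaluations. Thresholds and samples are illustrative.

def evaluate_alert(samples, threshold, for_intervals):
    """Return the index at which the alert fires, or None if it never does."""
    consecutive = 0
    for i, value in enumerate(samples):
        consecutive = consecutive + 1 if value > threshold else 0
        if consecutive >= for_intervals:
            return i
    return None

cpu_pct = [55, 70, 91, 93, 95, 92, 60]  # one-minute samples
print(evaluate_alert(cpu_pct, 90, 3))   # 4: third consecutive minute above 90
```

Requiring the condition to hold for several intervals avoids paging on brief, self-healing spikes.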

Once alerts are triggered, you need to carefully design how these alerts will be communicated to the appropriate teams. Integrating your monitoring system with communication channels like email, Slack or PagerDuty is important for timely notifications. Critical alerts should always be routed to on-call or SRE teams so problems can be quickly mitigated.

Runbooks should be created for each possible alert scenario. These provide clear guidelines on how to troubleshoot and resolve the issue indicated by the alert. Steps to confirm the problem, identify the root cause, remediate and recover should be outlined in the runbook. These will help responders take quick action when alerts are triggered.

A mature alerting setup with proper notifications and runbooks is essential for maximizing the value of your Kubernetes monitoring and gaining observability into cluster health and application performance.

Tips for Effective Kubernetes Monitoring

Monitoring Kubernetes can seem daunting at first, but following best practices will ensure you gain observability into your applications and infrastructure. Here are some tips to effectively monitor your Kubernetes environment:

Avoid Common Monitoring Anti-Patterns

  • Don't just monitor Kubernetes, monitor your apps too. Focus on application health and performance, not just Kubernetes metrics. Instrument your apps for logging, metrics, and tracing.
  • Don't drown in data. Start with a minimal set of metrics and expand as needed. Too many metrics can overwhelm and require complex monitoring.
  • Don't just collect metrics, analyze and act on them. Set up alerts and dashboards tailored to your environment. Use monitoring to answer questions about your apps and infrastructure.
  • Avoid vanity metrics. Monitor metrics that provide real value and insight into your environment, not just fancy visuals.
  • Don't just monitor, debug. Use logs, traces, and metrics together to debug problems quickly. Monitoring should help you take action.

Define Service Level Objectives

  • Start with business goals first. What is the user experience you want to deliver? Define these goals like application latency, uptime, etc.
  • Turn goals into quantifiable SLOs. Example: 99.95% of requests must complete in under 500ms. SLOs enable measurable monitoring.
  • Monitor metrics tied to SLOs. This brings focus to the key metrics that align monitoring to business goals. Alert when SLOs are at risk.

Iterate and Improve Over Time

  • Reevaluate metrics and alerts frequently. As your system evolves, your monitoring needs to as well.
  • Expand monitoring as required. Start small and add additional metrics, tools, and integrations as needed over time. Avoid premature optimization.
  • Learn from incidents. Use post-mortems to identify gaps in observability and improve monitoring to prevent repeating issues.

Effective monitoring is a practice that evolves over time. Focus on iterating monitoring to meet the needs of users and the business.
