15 Nov

A Guide to Kubernetes Scaling: Tips for HPA, VPA, Cluster Autoscaling, and More

Master Kubernetes scaling with this guide. Learn horizontal and vertical pod autoscaling, cluster and manual scaling. Includes tips for configuring HPA, metrics, targets, and monitoring. Handbook for optimizing Kubernetes workloads.

Jack Dwyer

Platform Engineering + DevOps

Introduction to Scaling in Kubernetes

Kubernetes makes it easy to scale your applications to meet fluctuating demands. Scaling refers to increasing or decreasing the resources available to your application to match usage. There are several reasons you may need to scale your apps in Kubernetes:

  • Handle increased traffic - Scale up to add more compute resources when traffic spikes so your app remains performant. Kubernetes can rapidly deploy additional pods to handle the increased load.
  • Reduce costs during low usage - Scale down your app's pods when traffic is low to stop wasting resources and lower your costs. Idle resources can be automatically reclaimed in Kubernetes.
  • Respond to events - Specific events like new product launches or scheduled promotions can drive temporary traffic surges. You can plan ahead and configure scaling to automatically adjust to known upcoming events.
  • Improve reliability - Running multiple instances of your app provides redundancy. If one instance goes down, others can still serve traffic, improving reliability.

Kubernetes offers several methods for scaling your apps:

  • Manual scaling - You directly set the number of pod replicas to deploy. Useful for predictable workloads.
  • Horizontal Pod Autoscaling (HPA) - Kubernetes automatically scales the pods in a deployment or replica set based on observed CPU utilization or other metric targets.
  • Vertical Pod Autoscaling (VPA) - Automatically adjusts CPU and memory resource allocations for your pods based on historical usage data.
  • Cluster Autoscaling - Allows Kubernetes to automatically add or remove nodes from your cluster as needed.

Each scaling method has advantages and tradeoffs that will be discussed in this guide. By understanding the available scaling options, you can design robust, auto-scaling applications in Kubernetes.

Horizontal Pod Autoscaling (HPA)

Horizontal Pod Autoscaling (HPA) allows you to automatically scale the number of pods in a replication controller, deployment, replica set or stateful set based on observed CPU utilization or other custom metrics.

HPA helps ensure your applications have enough pods to handle demand by dynamically scaling pods up or down. This is especially useful for workloads with fluctuating traffic.

How HPA Works

The Horizontal Pod Autoscaler controller periodically (every 15 seconds by default) checks the metrics you configure: resource metrics such as CPU utilization, custom metrics reported for pods or other Kubernetes objects, or external metrics served by a metrics API. It compares the current value of each metric against the target value set for it.

Based on the current metrics versus target value, the Horizontal Pod Autoscaler will scale the number of pods up or down to match the desired metric target.

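It uses a proportional algorithm: the desired replica count is derived from the ratio between the current and target metric values, so the size of each adjustment reflects how far the metric is from its target. Deviations within a small tolerance (10% by default) are ignored, which avoids scaling on every minor fluctuation.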
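
The scaling rule documented for the HPA controller can be written as a single ratio. The worked example below assumes a hypothetical deployment running 4 replicas at 90% average CPU utilization against a 50% target; the numbers are illustrative only.

```latex
\[
\text{desiredReplicas}
  = \left\lceil \text{currentReplicas} \times
    \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
  = \left\lceil 4 \times \frac{90}{50} \right\rceil
  = \lceil 7.2 \rceil
  = 8
\]
```

Rounding up means the controller errs toward having enough capacity rather than too little.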

Metrics Used By HPA

By default, the Horizontal Pod Autoscaler uses the CPU utilization metric across all pods to determine scaling actions. The CPU utilization is compared to the target CPU utilization percentage set by the user.

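In addition to CPU, custom metrics can be used for scaling decisions, provided a metrics adapter (for example, one backed by Prometheus) exposes them to Kubernetes:

  • Pods metric: A custom metric reported for each pod, such as requests per second, averaged across all pods in the target workload
  • Object metric: A custom metric describing a single Kubernetes object, such as the requests per second handled by an Ingress
  • External metric: A metric from a system outside the cluster, such as Datadog or AWS CloudWatch, exposed through an external metrics API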

This allows scaling based on application-specific metrics like requests per second, database connections, or other custom metrics that represent your workload.
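
As a rough sketch of scaling on an application-specific metric, the manifest below targets an average of 100 requests per second per pod. It assumes a metrics adapter is already installed and exposes a metric named http_requests_per_second; that metric name, the myapp Deployment, and the numbers are all hypothetical.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp                 # hypothetical workload
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods                # per-pod custom metric, averaged across pods
      pods:
        metric:
          name: http_requests_per_second   # assumed to be served by your metrics adapter
        target:
          type: AverageValue
          averageValue: "100"   # add pods when the average exceeds 100 req/s per pod
```

If the adapter does not serve the metric, the HPA reports an error in its status and does not scale on it, so check the HPA's conditions after applying.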

Setting Up HPA

To set up a Horizontal Pod Autoscaler for your deployment or replica set:

  1. Define the target CPU utilization percentage as an argument to the autoscaler
  2. Specify the minimum and maximum number of pods
  3. Deploy the HPA object
  4. The HPA controller will begin polling the CPU metric and scaling the pods based on the target
  5. You can also specify custom metrics instead of CPU utilization if desired

The HPA will efficiently scale pods up and down to maintain resource utilization at the target level specified. This helps provide high availability and optimal resource usage for your applications.
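
One detail worth showing: a CPU utilization target is a percentage of each container's CPU request, so the workload being scaled needs requests set. Below is a minimal sketch of such a Deployment, assuming the cluster runs a resource metrics source such as metrics-server. The image name and resource values are placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  # replicas is left unset so the HPA owns the pod count
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:1.0   # placeholder image
          resources:
            requests:
              cpu: 250m        # HPA CPU utilization is measured against this request
              memory: 256Mi
            limits:
              memory: 512Mi
```

With a 50% target and a 250m request, the autoscaler aims for roughly 125m of average CPU usage per pod.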

Vertical Pod Autoscaling (VPA)

Vertical Pod Autoscaling (VPA) allows Kubernetes to automatically adjust the CPU and memory requests and limits for pods based on their historical resource utilization.

What is VPA?

VPA is a Kubernetes add-on that monitors the resource usage of pods and containers over time. Based on this historical data, VPA can automatically set optimal resource requests and limits for each pod to match their actual needs.

For example, if a pod is using on average 50% of its requested CPU, VPA can automatically decrease the CPU request to more closely match the actual usage. This helps prevent over-provisioning of resources and allows more efficient use of the underlying compute resources.

The key components of VPA are:

  • VPA Recommender - Monitors pod resource usage and recommends CPU and memory requests
  • VPA Updater - Evicts pods whose requests differ significantly from the recommendation so they are recreated with updated values
  • VPA Admission Controller - Intercepts pod creation requests and applies VPA recommendations

How VPA Works

The VPA recommender collects metrics on CPU and memory usage for the containers in each pod. It analyzes these metrics to determine the optimal resource requests and limits for each pod.

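The VPA updater then acts on pods whose resources differ significantly from the recommendation. Traditionally a running pod's requests cannot be changed in place, so the updater evicts the pod and lets its controller recreate it; expect restarts when updates are applied automatically. Newer Kubernetes and VPA releases are adding in-place resizing, but check what your versions support before relying on it.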

The VPA admission controller runs alongside the API server and intercepts pod creation requests. If a VPA resource exists for that pod, the admission controller will apply the recommended resources before allowing pod creation.

Benefits of VPA

Some key benefits of using Vertical Pod Autoscaling include:

  • More efficient resource utilization - VPA allows right-sizing resource allocations to actual usage
  • Avoid overprovisioning - No more guessing resource requirements
  • Improved performance - Pods have the resources they need when they need it
  • Cost savings - Reduce waste from unused allocated resources
  • Automated optimizations - VPA continually optimizes configurations as usage changes

Setting Up VPA

To use VPA in your Kubernetes cluster, you first need to install the VPA add-on. This can be done via Helm or declarative installation.

Next, create a VerticalPodAutoscaler resource for each workload you want to manage. Its targetRef points at the Deployment or other controller whose pods VPA should monitor.

Finally, choose an update mode. With updateMode set to Off, VPA only publishes recommendations for you to review and apply manually; with Auto or Recreate, it applies them by evicting and recreating pods.

With VPA enabled, the resource requests and limits will be continually optimized to match your pod's requirements. VPA takes the guesswork out of resource allocation.
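
A minimal sketch of a VPA resource, assuming the VPA add-on and its CRDs are installed and that a Deployment named myapp exists. It starts in recommendation-only mode, and the min and max bounds are illustrative values.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: myapp-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp                # hypothetical workload
  updatePolicy:
    updateMode: "Off"          # only publish recommendations; switch to "Auto" to apply them
  resourcePolicy:
    containerPolicies:
      - containerName: "*"     # apply the bounds to every container in the pod
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "1"
          memory: 1Gi
```

Starting with Off lets you compare the recommendations in the VPA's status with your current requests before allowing any evictions.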

Cluster Autoscaler

The Cluster Autoscaler helps automatically scale up and down the nodes in your Kubernetes cluster based on resource needs. It monitors pod resource requests, scheduling events, and node utilization metrics to determine when and how many nodes to add or remove. Let's look at how the Cluster Autoscaler works:

How Cluster Autoscaler Works

  • Watches the cluster for pods that fail to schedule and for nodes whose requested resources stay low relative to their capacity.
  • Makes scaling decisions to add or remove nodes if certain conditions are met. For example, it may scale up if there are pending pods that failed to schedule on the current nodes due to insufficient resources.
  • Integrates with the cloud provider API to actually add or remove nodes when it decides to scale up or down.
  • Supports scaling node pools separately if using node pools.
  • Evicts pods if needed before removing nodes to gracefully scale down.
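
Graceful scale-down works best when the autoscaler knows how many pods of a workload may be disrupted at once. The Cluster Autoscaler honors PodDisruptionBudgets when draining nodes. Here is a minimal sketch, assuming pods labeled app: myapp and a hypothetical floor of two available pods.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 2            # never drain below two running pods
  selector:
    matchLabels:
      app: myapp             # hypothetical label on the workload's pods
```

A budget that can never be satisfied (for example, minAvailable equal to the replica count) can block node removal entirely, so leave some room.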

Integrating with Cloud Providers

The Cluster Autoscaler relies on the cloud provider APIs to actually add and remove nodes. It uses out-of-the-box integration with:

  • AWS EC2
  • GCE
  • Azure AKS

To integrate with your cloud provider, you need to provide details like IAM roles, instance types, regions etc. when deploying Cluster Autoscaler.

Deploying Cluster Autoscaler

To enable automatic node scaling, you need to deploy Cluster Autoscaler to your cluster. Here are the key steps:

  • Deploy the Cluster Autoscaler manifest after configuring details for your cloud provider.
  • Annotate node pools you want to scale with
  • Set min and max nodes per node pool or cluster.
  • Adjust CA parameters as needed like scale down delays, utilization thresholds etc.

With Cluster Autoscaler running, you can sit back and allow it to automatically add or remove nodes based on real-time resource needs in your Kubernetes cluster.
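
To make the parameters above concrete, here is a fragment of the container spec from a Cluster Autoscaler Deployment. It is a sketch rather than a complete manifest: it assumes AWS, a hypothetical node group named my-node-group, and that the service account, RBAC rules, and cloud credentials are configured separately.

```yaml
# Fragment of the cluster-autoscaler pod spec (not a complete manifest)
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:<version>   # choose the tag matching your Kubernetes version
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --nodes=1:10:my-node-group               # min:max:name of the node group to scale
      - --scale-down-unneeded-time=10m           # how long a node must be unneeded before removal
      - --scale-down-utilization-threshold=0.5   # nodes below 50% requested resources are removal candidates
```

Managed Kubernetes services often expose the same settings through their own node pool configuration instead of a self-managed Deployment.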

Manual Scaling Methods

While Kubernetes provides automated scaling methods like HPA and Cluster Autoscaler, sometimes manually scaling your workloads is the best approach. Here are some use cases where manual scaling makes sense:

Scaling Deployments, ReplicaSets, and StatefulSets

The easiest way to manually scale Kubernetes workloads is by setting the number of replicas directly on controller objects like Deployments, ReplicaSets, or StatefulSets.

For example, to scale a Deployment named myapp to 5 replicas, you can run:

kubectl scale deployment myapp --replicas=5

This will immediately spin up or remove pods to match the desired replica count.

Predictable Scaling Needs

If your application traffic follows a predictable pattern, like spiking during business hours, you may not need the complexity of auto-scaling.

Simply ramp up the number of replicas manually before a traffic spike begins, and scale back down after traffic subsides.

Limiting Scaling

In some cases you may want to limit the maximum scaling, even if auto-scaling is enabled.

A Deployment has no maximum replica field of its own. The upper bound comes from maxReplicas on the HorizontalPodAutoscaler, which the autoscaler will never exceed.


Scaling to Zero for Maintenance

Manual scaling also lets you reduce replicas to zero when an application needs to be temporarily shut down for maintenance.

Pass --replicas=0 to kubectl scale to terminate all pods of the controller object.

So in summary, although auto-scaling often makes sense for cloud native workloads, don't overlook manual scaling. It shines for predictable loads, guardrails, and maintenance workflows.

Declarative Scaling Policies

Kubernetes provides several ways to set scaling policies in a declarative manner. This gives you guardrails and control over how your applications scale in response to changes in demand or load.

Some key ways you can declaratively manage scaling include:

Minimum and Maximum Replicas

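You can set minReplicas and maxReplicas on a HorizontalPodAutoscaler that targets a Deployment, ReplicaSet or StatefulSet. These fields belong to the autoscaler, not to the workload itself, and constrain it to stay within a range you define.

For example:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 6

This limits the autoscaler to only scale between 2 and 6 replicas. With no metrics listed, it uses the default CPU utilization target.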

Default Replicas

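In addition to min and max, the replicas field on the workload defines the initial number of pods. Once an HPA manages the workload, the autoscaler overwrites this value as it scales, so consider omitting replicas from manifests you re-apply regularly; otherwise each apply resets the pod count.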

Target Resource Utilization

For Horizontal Pod Autoscaling (HPA), you can set target metrics like CPU or memory utilization. This gives the autoscaler a goal to scale to. For example:

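apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50  # replaces targetCPUUtilizationPercentage from autoscaling/v1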

Here the HPA will aim to maintain 50% CPU utilization by scaling between 2-6 replicas.

Using declarative policies allows you to set clear boundaries and targets for scaling instead of relying solely on the autoscaler. This provides more control and predictable scaling behavior.
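
Targets can also be combined. The sketch below declares both a CPU and a memory target for the same hypothetical myapp Deployment. When several metrics are listed, the HPA computes a replica count for each and uses the largest. The 70% memory figure is an illustrative assumption.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50   # percent of requested CPU
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70   # percent of requested memory
```

Memory is often a weaker scaling signal than CPU because many applications do not release memory when load drops, so verify that adding pods actually lowers per-pod memory use.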

Choosing the Right Scaling Method

When it comes to scaling your Kubernetes workloads, you have several options to choose from including Horizontal Pod Autoscaling (HPA), Vertical Pod Autoscaling (VPA), Cluster Autoscaler, and manual scaling methods. How do you decide which one is right for your use case? Here are some key considerations when selecting a scaling method:

Workload Characteristics

  • Type of application - Is it stateful or stateless? Long running or short lived? This impacts whether HPA or CA are suitable.
  • Resource usage - Is CPU or memory the bottleneck? This determines whether HPA or VPA are more applicable.
  • Traffic patterns - Is demand stable, cyclical or spiky? More unpredictable workloads favor HPA.
  • Number of nodes - Do you need to scale nodes or just pods? Cluster Autoscaler only helps with nodes.
  • Level of automation - Do you need full automation or is manual control ok? HPA and CA provide full automation.

Pros and Cons

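  • HPA - Responds quickly to load changes. Only CPU and memory metrics work out of the box; custom and external metrics need a metrics adapter. Can thrash if not tuned properly.
  • VPA - Optimizes resource allocation. May impact workload density, and applying recommendations usually restarts pods. Needs historical metrics, and should not act on the same CPU or memory metric as an HPA for the same workload.
  • Cluster Autoscaler - Scales nodes efficiently. Metrics and tuning vary by cloud provider.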
  • Manual Scaling - Direct control over scaling. Labor intensive. Risk of over/under provisioning.

Decision Criteria

  • Current bottlenecks - CPU, memory, node capacity?
  • Traffic patterns - stable or highly variable?
  • Level of automation needed - manual ok or need full automation?
  • Application characteristics - stateful, stateless, long/short running?
  • Existing metrics & monitoring - can leverage metrics for autoscaling?

Considering these factors will help determine the right method or combination of scaling approaches for your specific workload and environment. The best option provides efficiency, performance and automation while minimizing resource overprovisioning and scaling events. Assess your situation, try different approaches iteratively, and continue optimizing over time.

Monitoring and Alerting for Scaling

To keep your scaling setup running smoothly, you need visibility into what's happening under the hood. Monitoring key metrics and setting up alerts allows you to stay on top of scaling events and respond quickly if needed. Here are some best practices for monitoring and alerts around scaling:

Scaling Metrics to Watch

  • CPU utilization - This metric gives visibility into how much your pods are using the allocated CPU resources. Watch for sustained high utilization as a trigger for scaling.
  • Memory usage - Monitor memory consumption to ensure pods have enough available memory. Spikes may indicate a need to scale up.
  • Pod startup latency - Track how long it takes for new pods to start up. Long startup times can indicate problems with horizontal scaling.
  • Nodes ready/unavailable - Monitor the number of ready and unavailable nodes in your cluster. Significant unavailable nodes may trigger the need for cluster autoscaling.
  • Pod status - Watch for pods stuck in pending status that can't be scheduled. This may indicate a need for more nodes. Also watch for pod failures that may require scaling down problematic deployments.
  • Application metrics - Incorporate key app metrics like transactions, latency, errors to understand the impact of scaling on your services.

Scaling Alerts

  • CPU/Memory thresholds - Set alerts on CPU and memory usage to notify you when pods run consistently close to their requests or limits.
  • Pending pods - Trigger alerts when pods remain stuck in the pending state for too long, which indicates issues with scaling up.
  • Scaling events - Generate alerts on significant scaling up or down events like rapidly adding/removing nodes or 100+ pods at once.
  • Application errors - Be notified of spikes in application errors that may be related to scaling.
  • Auto-recovery - Configure auto-recovery alerts to automatically roll back bad scaling releases.
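
Two of these alerts can be sketched as Prometheus alerting rules. This assumes Prometheus scrapes kube-state-metrics, which is where the metric names below come from; the alert names, durations, and severity labels are hypothetical choices.

```yaml
groups:
  - name: scaling-alerts
    rules:
      # A pod has stayed in the Pending phase for 10 minutes
      - alert: PodPendingTooLong
        expr: kube_pod_status_phase{phase="Pending"} == 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been Pending for 10 minutes"
      # An HPA has been pinned at its maxReplicas for 15 minutes
      - alert: HPAAtMaxReplicas
        expr: |
          kube_horizontalpodautoscaler_status_current_replicas
            >= kube_horizontalpodautoscaler_spec_max_replicas
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is at its maximum replica count"
```

An HPA that sits at its maximum for long periods usually means either the ceiling is too low or the target is too aggressive.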

Responding to Scaling

  • Check metrics dashboards - When an alert triggers, go to your metrics dashboards to analyze the issue.
  • Review logs - Check logs of related controllers, pods, nodes to find clues on the scaling problem.
  • Tune configurations - If scaling too much/little, adjust your HPA targets, node autoscaling parameters.
  • Debug bottlenecks - Profile your apps, identify bottlenecks preventing scaling from working optimally.
  • Add resources - If consistently resource constrained, increase node sizes, pod resources.
  • Spread load - Distribute load better by adding nodes/pods or partitioning data.

With good visibility and alerts around scaling, you can catch issues early and respond quickly to keep your applications running smoothly. Adjust configurations and resources as needed based on data.

Autoscaling Best Practices

When configuring autoscaling in Kubernetes, it's important to start small and tune it iteratively based on real workload patterns. Here are some best practices to follow:

  • Start conservatively - When first enabling autoscaling, be conservative with the minimum and maximum number of replicas. Slowly expand the limits as you gain confidence. Drastic autoscaling can lead to resource starvation or overprovisioning.
  • Tune metrics and targets carefully - The metrics and target values will determine how autoscaling behaves for your workload. Take time to understand spikes and trends before setting target values. Choose metrics that truly reflect demand.
  • Watch scaling events - Check the HorizontalPodAutoscaler events frequently to see if scaling is happening as expected. Look for patterns in how replicas are added or removed. Tweak the configurations if scaling happens too rapidly or slowly.
  • Iterate and improve over time - Autoscaling is complex and workloads are dynamic. Expect to periodically revisit and adjust the metrics, targets, min/max replicas etc. as you learn more. The ideal values will evolve over time.
  • Set alerts for unexpected scaling - Use monitoring alerts for cases like rapid scaling up and down in short periods or hitting max/min replica constraints frequently. These indicate the HPA configuration needs adjustment.
  • Test before deploying changes - Perform load tests to validate autoscaling changes before rolling out to production traffic. Ensure the metrics and scaling behave as expected under simulated conditions.

Following these best practices will help you gain increased control and predictability over how autoscaling behaves for your critical workloads. Take an incremental iterative approach to autoscaling for maximum benefits.

Common Scaling Issues

Scaling Kubernetes workloads can be tricky sometimes. Here are some of the common issues you may encounter and how to troubleshoot them:

Oscillating Scaling

This is when the number of pods keeps fluctuating up and down frequently. Some common causes include:

  • The metric hovers right around the target. The HPA uses a single target value with a small tolerance rather than separate scale-up and scale-down thresholds, so revisit the target value if the workload sits at the boundary.
  • Usage spikes are triggering scaling but then subsiding quickly. Consider using a longer stabilization window to smooth out temporary spikes.
  • Time lags in scaling actions mean pods are still ramping up when the autoscaler decides to scale down. Increase the scale-down stabilization window to allow previous actions to take effect.
  • Conflicting horizontal and vertical scaling. VPA may be resizing pods while HPA is scaling pod counts. Try disabling VPA temporarily.
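
Several of these causes can be dampened with the behavior field of an autoscaling/v2 HPA. The sketch below is for a hypothetical myapp Deployment, and the window lengths and rate limits are illustrative starting points rather than recommended values.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # act on the highest recommendation from the last 5 minutes
      policies:
        - type: Percent
          value: 50                     # remove at most 50% of current pods...
          periodSeconds: 60             # ...per minute
    scaleUp:
      stabilizationWindowSeconds: 0     # react to load increases immediately
      policies:
        - type: Pods
          value: 4                      # add at most 4 pods per minute
          periodSeconds: 60
```

Keeping scale-up fast and scale-down slow is a common asymmetry: it protects against traffic spikes while preventing the replica count from flapping.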

Slow Scaling

Sometimes scaling up or down can take too long. Some reasons this can happen:

  • Node autoscaling is slow to add new nodes for the extra pods. Consider pre-provisioning extra nodes.
  • Pods are configured with long termination periods blocking scaling down. Decrease termination grace period.
  • Backlogged scheduler due to resource bottlenecks. Check for node resource saturation issues.
  • Delays in metric collection mean old data is used for scaling decisions. Increase metric scrape frequency.
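
One way to pre-provision spare capacity is with low-priority placeholder pods. They reserve room on nodes, and the scheduler preempts them as soon as real workloads need the space; the evicted placeholders then go pending and prompt the autoscaler to add a node. The names, replica count, and request sizes below are assumptions to adapt to your cluster.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1                      # below the default priority of 0, so regular pods can preempt these
globalDefault: false
description: "Placeholder pods that real workloads may preempt."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-placeholder
spec:
  replicas: 2                  # amount of headroom to keep warm
  selector:
    matchLabels:
      app: capacity-placeholder
  template:
    metadata:
      labels:
        app: capacity-placeholder
    spec:
      priorityClassName: overprovisioning
      terminationGracePeriodSeconds: 0
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # does nothing; only holds the reservation
          resources:
            requests:
              cpu: "1"
              memory: 1Gi
```

The trade-off is cost: the reserved capacity is paid for whether or not it is used, so size the placeholders to match the typical burst you need to absorb.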

Metric Issues

The chosen scaling metric can also cause problems:

  • Metric does not correlate well with actual load. Pick a metric that matches the workload's behavior.
  • Spiky or volatile metrics lead to thrashing. Use rate metrics or longer averaging windows.
  • Metrics are delayed or missing leading to bad decisions. Reduce metric delays and increase reliability.
  • Targets are based on incorrect assumptions about metric meaning. Validate with load testing.


Troubleshooting Steps

Some ways to troubleshoot scaling issues:

  • Review the horizontal pod autoscaler events and descriptions.
  • Check the metric scraper logs for errors or delays.
  • Validate metric values with live query or debugging tool.
  • Plot historical metric data to visualize trends over time.
  • Simulate load to check if scaling happens as expected.
  • Use kubectl commands to check component status and logs.

With careful metric selection, threshold configuration and monitoring, most scaling issues can be identified and fixed.
