
Scaling Applications in Kubernetes: A Guide to HPA, VPA, and KEDA for Production Workloads

Anish Bista


Introduction

In today’s fast-paced digital world, where user demands fluctuate unpredictably, ensuring your applications perform optimally is a challenge. Kubernetes provides a powerful framework for scaling applications dynamically, but selecting the right scaling strategy is crucial. Whether it's responding to peak traffic or reducing costs during idle periods, scaling enhances performance, ensures reliability, and optimizes resource utilization.

In this guide, we’ll dive into three essential scaling mechanisms in Kubernetes: Horizontal Pod Autoscaler (HPA), Vertical Pod Autoscaler (VPA), and KEDA (Kubernetes Event-Driven Autoscaling), with practical use cases and step-by-step implementation. Let’s begin by exploring why scaling is essential for cloud infrastructure.

Why Is Scaling Essential for Your Cloud Infrastructure?

  • Optimize Costs: According to the Flexera 2024 State of the Cloud Report, 32% of cloud costs are wasted, with 75% of organizations reporting an increase in cloud waste. Effective scaling ensures resources are used only when needed, preventing over-provisioning and saving costs.
  • Boost Performance: 77% of performance issues stem from improper resource allocation. Scaling ensures that your applications have enough resources during high demand, maintaining smooth user experiences.
  • Enhance Operations: Automate everything, including capacity management, to reduce manual interventions and improve operational efficiency.

Horizontal Pod Autoscaler (HPA)

The Horizontal Pod Autoscaler (HPA) is one of the most essential features in Kubernetes, designed to help applications automatically scale based on real-time resource utilization. It enables applications to scale efficiently and effectively without manual intervention, ensuring that resources are allocated appropriately in response to changes in demand.

By continuously monitoring the performance of your pods, the HPA can dynamically adjust the number of pod replicas within a deployment, replica set, or stateful set. This scaling process is driven by various performance metrics such as CPU and memory usage, as well as custom application-specific or external metrics, providing a comprehensive and adaptable approach to autoscaling.

Key Use Cases

  • Handling Varying Workloads: Applications often face fluctuations in demand. For example, e-commerce platforms experience higher traffic during sales events. HPA ensures that the right number of pods are running to handle this dynamic load, thus preventing resource bottlenecks.
  • Optimizing Infrastructure Costs: Scaling down during off-peak hours helps save infrastructure costs. For example, a service running on a weekend may not need as many resources as during weekdays, and HPA can scale down the pods accordingly.
  • Ensuring High Availability: During sudden load spikes or increased user activity (such as a viral campaign), HPA automatically scales up the number of pods to maintain the application's availability, preventing service disruptions due to resource exhaustion.

How HPA Works

Working of HPA (source: kubecost.com)

The Horizontal Pod Autoscaler works by monitoring the resource utilization of the pods and adjusting the number of pods running in a cluster based on predefined thresholds.

It continuously evaluates certain metrics (CPU, memory, or custom metrics) and compares them to the target values. If the actual resource usage exceeds or falls below the target, the HPA controller adjusts the replica count of the deployment to align the workload with the available resources.

Metrics Supported

  • CPU Usage: This is one of the most common metrics. HPA monitors the CPU usage of each pod. If the CPU usage exceeds a target threshold (for example 80%), HPA will scale up the number of pods to spread the load. Conversely, if CPU usage is consistently low, HPA will scale down the pod count to save resources.

  • Memory Usage: Like CPU usage, memory consumption is also monitored. HPA scales the number of pods based on memory usage patterns, ensuring that pods are not starved for memory or excessively over-provisioned.

  • Custom Metrics: In some cases, application-specific metrics (e.g., request rate, database connections, queue length) are used for scaling decisions. These metrics can be provided by the application itself or collected through a custom monitoring solution.

  • External Metrics: HPA can scale pods based on external factors such as traffic at a cloud load balancer or external API request counts. These metrics are collected and exposed through the Kubernetes external metrics API (a combined example follows this list).
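As a rough illustration of combining these metric types, the sketch below pairs a CPU utilization target with an external metric. The metric name, targets, and workload names are hypothetical, and an external metrics adapter (for example a Prometheus or cloud-provider adapter) is assumed to expose the external metric:

```yaml
# Illustrative only: an autoscaling/v2 HPA that scales on CPU *and* an external queue metric.
# Assumes an external metrics adapter exposes "queue_messages_ready"; all names are hypothetical.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: External
      external:
        metric:
          name: queue_messages_ready   # name depends on your metrics adapter
        target:
          type: AverageValue
          averageValue: "30"           # aim for roughly 30 queued messages per replica
```

When several metrics are listed, HPA computes a desired replica count for each and uses the largest, so the deployment scales on whichever signal is under the most pressure.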

Kubernetes Metrics Server

The Metrics Server is a lightweight cluster add-on that collects resource metrics from each node and pod in the cluster. It aggregates the CPU and memory usage data reported by the kubelet on each node and exposes it through the resource metrics API, which is what HPA reads to make scaling decisions (a configuration excerpt follows the list below).

  • The Metrics Server scrapes metrics at short, regular intervals (on the order of 15–60 seconds, depending on its configured resolution).
  • It’s a cluster-wide service, meaning it monitors the usage of resources across all nodes in the Kubernetes cluster.
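If `kubectl get hpa` shows `<unknown>` targets, the Metrics Server is usually missing or unable to reach the kubelets. As a hedged sketch (the flags below are from the upstream metrics-server component; the image tag is illustrative), the relevant part of its Deployment looks roughly like:

```yaml
# Excerpt of the metrics-server container spec (kube-system namespace); values are illustrative.
containers:
  - name: metrics-server
    image: registry.k8s.io/metrics-server/metrics-server:v0.7.1   # pin to the version you actually deploy
    args:
      - --metric-resolution=15s     # how frequently kubelet metrics are scraped
      - --kubelet-insecure-tls      # only for local/test clusters with self-signed kubelet certificates
```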

Scaling Algorithm

The HPA adjusts the number of pods based on a simple formula:

```
desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))
```
  • currentReplicas: The current number of pods.
  • currentMetricValue: The current observed value of the metric (e.g., the current average CPU utilization).
  • desiredMetricValue: The target value configured in the HPA (e.g., the target CPU utilization).

For example, if you have 3 replicas, the target CPU utilization is 50%, and the current average utilization is 80%, HPA computes ceil(3 × 80 / 50) = 5 and scales the deployment to 5 replicas, spreading the load across more pods.

Triggers for Scaling

  1. Scaling UP:
    • High CPU Usage: If the CPU usage of the pods exceeds a certain threshold (e.g., 80%), the HPA will increase the number of replicas to distribute the load.
    • Increased Requests: When the number of requests to the application increases (e.g., due to higher user activity or a sudden burst in traffic), the HPA will scale up to ensure that there are enough pods to handle the load.
    • Custom Metrics Exceeding Thresholds: If any application-specific metrics (e.g., queue length, database connections) exceed their thresholds, HPA can trigger scaling up.
  2. Scaling DOWN:
    • Low Resource Usage: If the pods consistently use fewer resources than requested (e.g., the CPU usage drops below the target), HPA can scale down the number of replicas to save resources and reduce infrastructure costs.
    • Idle Periods: During periods of low traffic (such as non-peak hours), the HPA will reduce the number of pods to prevent unnecessary resource consumption.
    • Low Custom Metrics: When custom metrics (e.g., traffic, request rate) are lower than expected, HPA scales down to save resources.

Best Practices for HPA

  1. Set Appropriate Metric Thresholds: The effectiveness of HPA depends on well-chosen metric targets. Targets set too high delay scaling and risk overload; targets set too low trigger unnecessary scaling actions.
  2. Use Multiple Metrics: Combine multiple metrics (e.g., CPU and memory) for better control over pod scaling, or use custom and external metrics to reflect application-specific performance.
  3. Test Scaling Strategies: Regularly test how your application behaves under load. Set realistic targets for your metrics and monitor how HPA scales your pods in different scenarios.
  4. Avoid Over-scaling: Cap the replica range (minReplicas/maxReplicas) so scaling decisions do not produce an excessive number of pods, and consider rate-limiting scaling itself, as in the sketch after this list.
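One way to keep scaling from thrashing is the `behavior` section of the autoscaling/v2 API, which controls how aggressively replicas are added or removed. A minimal sketch, with illustrative names and values:

```yaml
# Hypothetical example: an HPA whose "behavior" section limits how quickly replicas are removed.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                        # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                          # illustrative target deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes of low usage before removing pods
      policies:
        - type: Percent
          value: 50                    # remove at most 50% of current replicas...
          periodSeconds: 60            # ...per minute
    scaleUp:
      stabilizationWindowSeconds: 0    # react to spikes immediately
```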

With these detailed insights, you should have a better understanding of how Horizontal Pod Autoscaler (HPA) operates in Kubernetes, its triggers, and how to implement it efficiently in your environments!

HPA Limitations

  • Limited Metrics: By default, HPA supports only CPU and memory metrics; custom metrics require additional setup, such as the Prometheus Adapter.
  • Slow Scaling: Scaling decisions are delayed during sudden traffic spikes.
  • Resource Wastage: HPA does not adjust per-pod resource requests or limits.
  • Lack of Granularity: Inefficient for applications requiring per-pod optimizations.


Demo: Scaling an NGINX Deployment with HPA

  1. Create Deployment:

    • We create a Kubernetes deployment for NGINX with two replicas. The kubectl create deployment command deploys NGINX and ensures two pods are running by default.
    ```bash
    kubectl create deployment nginx --image=nginx --replicas=2
    ```
  2. Expose Deployment as Service:

    • We expose the NGINX deployment as a service so that it can be accessed externally via a LoadBalancer. The service will route traffic to the NGINX pods on port 80.
    ```bash
    kubectl expose deployment nginx --type=LoadBalancer --port=80 --target-port=80 --name=nginx-service
    ```
  3. Ingress Configuration:

    • An Ingress is created to manage external HTTP access to the NGINX service. This configuration uses the NGINX ingress controller to route traffic to the nginx-service on port 80.
    ```yaml
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: nginx-ingress
      annotations:
        kubernetes.io/ingress.class: nginx
    spec:
      ingressClassName: nginx
      rules:
        - http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: nginx-service
                    port:
                      number: 80
    ```
  4. Configure HPA:

    • Horizontal Pod Autoscaler (HPA) is configured to automatically scale the NGINX deployment based on CPU and memory utilization.
    • We set a minimum of 1 replica and a maximum of 10 replicas, with scaling triggered when the CPU utilization exceeds 80% or the memory usage exceeds 500Mi.
    ```yaml
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: nginx-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: nginx
      minReplicas: 1
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 80
        - type: Resource
          resource:
            name: memory
            target:
              type: AverageValue
              averageValue: 500Mi
    ```
  5. Load Testing with K6:

    • We use K6, a load testing tool, to simulate traffic to the NGINX service.
    • The script generates 100 virtual users (vus) making requests to the service for 30 seconds.
    • The test checks whether the NGINX service responds successfully with a status of 200.
    ```javascript
    import http from 'k6/http';
    import { check } from 'k6';

    export const options = {
      vus: 100,
      duration: '30s',
    };

    const BASE_URL = 'http://<LoadBalancerIP>';

    export default function () {
      const url = `${BASE_URL}`;
      const resp = http.get(url);
      check(resp, {
        'endpoint was successful': (resp) => resp.status === 200,
      });
    }
    ```
    • After saving the script, you run it with the k6 run k6-test.js command to perform the load test, which simulates real traffic and triggers the HPA to scale the deployment.

Vertical Pod Autoscaler (VPA) in Kubernetes

Kubernetes is a powerful platform for orchestrating containerized applications, but one of the challenges when managing these applications is ensuring that resources—specifically CPU and memory—are allocated appropriately. Efficient resource management is crucial for the stability and performance of applications, especially as workloads can fluctuate over time. This is where the Vertical Pod Autoscaler (VPA) comes into play. It automatically adjusts the CPU and memory (RAM) requests and limits for containers running in a pod to match their actual resource usage.

Let's dive into a detailed explanation of VPA, how it works, when to use it, and best practices for implementing it.

Introduction

The Vertical Pod Autoscaler (VPA) is a Kubernetes component that helps manage and optimize the resource allocation for pods. Specifically, it automatically adjusts the resource requests (minimum resources required for a container) and limits (maximum resources a container can use) for CPU and memory based on the real-time usage patterns of the pods.

The goal of VPA is to ensure that pods get the right amount of resources, preventing them from being over- or under-allocated, which could lead to performance issues or waste of resources. By fine-tuning the resource requests and limits, VPA makes sure that containers get the resources they need for optimal performance, without over-provisioning or wasting resources.

How VPA Works

Working of VPA (source: kubecost.com)

VPA works by continuously monitoring the resource usage of containers and adjusting their CPU and memory settings. Here’s how it operates:

Scaling Modes

  • Auto Mode:

    • Automatically adjusts the resource requests and limits of a pod based on its usage.
    • This is the most dynamic mode, where VPA constantly optimizes resource allocations in real-time.
  • Initial Mode:

    • Sets the initial resource requests and limits for a pod at startup, but does not make adjustments after that.
    • This is useful when you want to set reasonable defaults for your applications without further interference.
  • Off Mode:

    • VPA provides recommendations for resource adjustments but does not apply them automatically.
    • This mode is useful when you want to monitor the pod's resource consumption and review recommendations before taking action.

When VPA Adjusts Resources

VPA can adjust the resource allocation in the following scenarios:

  • Pod is using more or less than requested: If a pod consistently consumes more CPU or memory than requested, VPA will increase the resource allocation. Conversely, if a pod uses fewer resources than requested, VPA will reduce the allocation.

  • Changes in historical usage patterns: VPA learns over time and adjusts its predictions. If a pod's resource consumption increases or decreases suddenly, it will adjust the resources accordingly.

  • OOMKills or CPU throttling: If a pod is Out of Memory (OOM) killed or throttled due to high CPU usage, VPA will intervene and adjust the pod's resource allocation to prevent these issues from recurring.

Key Differences Between HPA and VPA

Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA) are both important components in Kubernetes for autoscaling, but they serve different purposes. Here’s a quick comparison:

| Aspect | Horizontal Pod Autoscaler (HPA) | Vertical Pod Autoscaler (VPA) |
| --- | --- | --- |
| Scaling Focus | Scales the number of pod replicas based on metrics like CPU or memory utilization. | Adjusts the CPU and memory requests/limits for individual pods. |
| Scaling Direction | Scaling out: increases the number of pod replicas when load increases. | Scaling up/down: adjusts the resource allocation (CPU/memory) of existing pods without changing the number of replicas. |
| When to Use | Use when workloads need to be spread across more pods to handle increased load. | Use when a pod requires more or fewer resources based on actual usage patterns. |
| Impact on Pod Count | Affects the number of pods (horizontal scaling). | Affects the resource allocation within a pod (vertical scaling). |

Components of the VPA

VPA is composed of several components, each responsible for a specific task in the autoscaling process:

  • Metrics Server:

    • Collects real-time resource usage data (CPU, memory) from the pods.
    • This data is essential for VPA to make accurate decisions on resource adjustments.
  • VPA Recommender:

    • Analyzes the collected data to generate recommendations for resource adjustments.
    • The recommender takes into account historical usage patterns and current resource consumption to determine the optimal CPU and memory requests/limits for each pod.
  • VPA Updater:

    • Applies the recommended resource adjustments to the pods.
    • The updater ensures that the new resource requests and limits are applied correctly to improve performance and prevent resource wastage.
  • VPA Admission Controller:

    • Ensures that the changes are correctly applied when a pod is created or updated.
    • It validates that the resource adjustments comply with VPA requirements during pod creation or updates.
  • Deployment:

    • The Deployment defines the pods, which are modified based on VPA's recommendations.
    • VPA ensures that the correct resource allocation is maintained throughout the pod’s lifecycle.

Setting Up VPA

To use VPA, you need to create a VerticalPodAutoscaler resource. Below is an example YAML manifest to configure VPA for a deployment:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: 'Auto'
```

In this example, VPA will automatically adjust the resource requests and limits for the my-app deployment based on real-time usage patterns.

Configuring Recommendations and Policies

VPA allows you to fine-tune how recommendations are generated. For instance, you can set minimum and maximum resource limits for CPU and memory to prevent excessive resource allocation or under-allocation. Here’s how you can configure these limits:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  resourcePolicy:
    containerPolicies:
      - containerName: my-app-container
        minAllowed:
          memory: '512Mi'
          cpu: '500m'
        maxAllowed:
          memory: '2Gi'
          cpu: '2'
```

This ensures that the container named my-app-container will never request less than 512Mi of memory or 500m CPU, nor more than 2Gi of memory or 2 CPUs.

Best Practices

  • Start with Off Mode:

    • Begin by using Off mode to observe VPA’s recommendations before applying them automatically. This provides valuable insight into how VPA would adjust resources without affecting the application’s stability; a recommendation-only manifest is sketched after this list.
  • Use VPA in Conjunction with HPA:

    • For workloads that require both horizontal scaling (HPA) and vertical scaling (VPA), use them together. VPA adjusts the pod’s resource allocation, while HPA manages the number of pod replicas.
  • Set Limits:

    • Always define minimum and maximum resource limits for the pods. This prevents VPA from allocating resources that exceed the acceptable range for your application.
  • Monitor Performance:

    • Once VPA is enabled, monitor the performance of your pods to ensure that the resource adjustments lead to more efficient resource usage and better application stability.
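A minimal sketch that combines the practices above, assuming a Deployment named my-app (all names and values are illustrative): a recommendation-only VPA with explicit guard rails on how far recommendations may go.

```yaml
# Hypothetical sketch: recommendation-only VPA with guard rails; names and values are illustrative.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa-recommend
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: 'Off'          # generate recommendations only; never evict pods
  resourcePolicy:
    containerPolicies:
      - containerName: '*'     # apply the policy to all containers in the pod
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: '1'
          memory: 1Gi
```

Once you are comfortable with the recommendations it produces, switching updateMode to 'Initial' or 'Auto' lets VPA start applying them.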

Disadvantages of VPA

While VPA is a powerful tool for managing pod resources, it does have some limitations:

  • Slow to Scale Down:

    • VPA is conservative about reducing resources once they have been increased; its recommendations are based on historical usage, so over-provisioned requests can persist for some time, leading to inefficiency.
  • Pod Disruption:

    • Resource changes usually require pod restarts, which can lead to downtime for applications, especially in stateful applications or high-availability systems.
  • Limited Scope:

    • VPA only adjusts resource requests and limits for individual pods. It doesn’t affect replica scaling or other external factors that may influence the performance of the pod.

Kubernetes Event-driven Autoscaling (KEDA): Scaling Beyond Resource Metrics

Introduction

Introduction to KEDA

KEDA (Kubernetes Event-driven Autoscaling) is an open-source project designed to provide event-driven scaling for Kubernetes workloads. It scales deployments or stateful sets based on external events, complementing HPA and VPA. With KEDA, you can scale Kubernetes workloads based on events such as messages from a Kafka topic, HTTP requests, or custom metrics.

Key Features of KEDA

  • Event-Driven Scaling: Scales workloads in response to external events such as Kafka messages, HTTP requests, or custom metrics.
  • Seamless Integration: Works alongside HPA and VPA, allowing you to combine event-driven scaling with resource-based scaling for a more comprehensive autoscaling solution.
  • Wide Range of Supported Event Sources: KEDA supports a variety of event sources like Kafka, RabbitMQ, AWS SQS, Azure Event Hubs, and custom metrics.

Why Do We Need KEDA Despite HPA and VPA?

Kubernetes provides powerful autoscaling tools like Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA), but they come with limitations that prevent them from efficiently handling dynamic, event-driven workloads. HPA scales based on resource metrics such as CPU and memory usage, while VPA adjusts resource requests to optimize steady-state workloads. However, in the world of modern cloud-native applications, many workloads are not driven solely by CPU or memory consumption. Instead, they respond to external events, such as messages in a queue, HTTP requests, or other custom events.

For example, applications relying on message queues (like Kafka or RabbitMQ) or custom event sources need to scale dynamically in response to changing traffic volumes, not just resource utilization. Here, HPA and VPA fall short because they cannot react to these external triggers. This is where KEDA (Kubernetes Event-Driven Autoscaling) comes in, providing a much-needed solution for scaling workloads based on events rather than resource metrics.

How KEDA Solves the Limitations of HPA and VPA

KEDA is built specifically to handle event-driven workloads. It enables autoscaling based on external events such as:

  • Kafka, RabbitMQ, or Azure Event Hubs messages.
  • HTTP requests or custom metrics that reflect external activity.
  • Other event sources like Google Pub/Sub, AWS SQS, and more.

KEDA works in conjunction with HPA and VPA, allowing you to scale based on both resource metrics and event-driven triggers. This flexibility makes KEDA ideal for applications in modern cloud-native environments, where the workload can fluctuate rapidly due to external stimuli, like sudden traffic surges or message bursts in a queue.

How KEDA Works

KEDA works by connecting Kubernetes to external event sources. It uses two key components to achieve this:

  1. ScaledObject: Defines scaling logic based on external events for a deployment or stateful set. The ScaledObject specifies the event source (e.g., Kafka, RabbitMQ), and the scaling behavior (e.g., scale based on message count).
  2. ScaledJob: Used for scaling Kubernetes Jobs based on external events. For example, jobs can be launched to process messages in a queue as they arrive (a sketch follows the ScaledObject example below).

Example ScaledObject Configuration

Here’s an example of how to configure a ScaledObject that scales a deployment based on Kafka messages:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-app-scaled-object
spec:
  scaleTargetRef:
    name: my-app-deployment
  triggers:
    - type: kafka
      metadata:
        topic: my-topic
        consumerGroup: my-consumer-group
        bootstrapServers: my-kafka-broker:9092
        lagThreshold: '5'
```

In this example, the application scales based on the consumer lag on the Kafka topic my-topic. The lagThreshold parameter is the average lag per replica that KEDA targets: when the lag for my-consumer-group exceeds 5 messages per replica, KEDA adds replicas to catch up.
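A ScaledJob follows the same pattern but launches Kubernetes Jobs instead of scaling a long-running Deployment. The sketch below reuses the same Kafka trigger; the worker image and resource names are hypothetical:

```yaml
# Hedged sketch: a ScaledJob that spawns worker Jobs based on Kafka consumer lag.
# The container image and resource names are hypothetical.
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: my-app-scaled-job
spec:
  jobTargetRef:
    template:
      spec:
        containers:
          - name: worker
            image: example.com/queue-worker:latest   # hypothetical worker image
        restartPolicy: Never
  pollingInterval: 30          # check the event source every 30 seconds
  maxReplicaCount: 20          # never run more than 20 Jobs at once
  triggers:
    - type: kafka
      metadata:
        topic: my-topic
        consumerGroup: my-consumer-group
        bootstrapServers: my-kafka-broker:9092
        lagThreshold: '5'
```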

Setting Up KEDA

To install KEDA, follow these steps:

  1. Add the KEDA Helm Repo:

    ```bash
    helm repo add kedacore https://kedacore.github.io/charts
    ```
  2. Update Helm Repo:

    ```bash
    helm repo update
    ```
  3. Install the KEDA Helm Chart:

    ```bash
    helm install keda kedacore/keda --namespace keda --create-namespace
    ```

After installation, KEDA will be ready to manage event-driven scaling in your Kubernetes environment.

Best Practices for Using KEDA

  1. Choose the Right Event Source: Depending on your workload, choose an event source that aligns with your use case. For example, if you are processing messages in a queue, Kafka or RabbitMQ might be the best choice.
  2. Monitor and Troubleshoot: Always monitor the scaling behavior of your workloads and the event sources. Tools like Prometheus and Grafana can help you track performance and diagnose issues in your scaling logic.
  3. Fine-Tune Scaling Parameters: Adjust parameters like min/max replica counts, lag thresholds, polling intervals, and cooldown periods to optimize scaling behavior and avoid unnecessary scaling events (see the excerpt after this list).
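For reference, these knobs live directly on the ScaledObject. A hedged sketch, extending the earlier Kafka example with illustrative values:

```yaml
# Illustrative tuning of a ScaledObject; field names are from the KEDA ScaledObject spec, values are examples.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-app-scaled-object
spec:
  scaleTargetRef:
    name: my-app-deployment
  pollingInterval: 15        # how often KEDA checks the event source (seconds)
  cooldownPeriod: 300        # wait 5 minutes after the last activity before scaling back down
  minReplicaCount: 0         # allow scale-to-zero when there is no work
  maxReplicaCount: 50        # hard upper bound on replicas
  triggers:
    - type: kafka
      metadata:
        topic: my-topic
        consumerGroup: my-consumer-group
        bootstrapServers: my-kafka-broker:9092
        lagThreshold: '5'
```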

Choosing the Right Tool for Your Use Case

Here’s a quick comparison to help decide when to use HPA, VPA, and KEDA:

  • When to Use HPA: For predictable, resource-based scaling (e.g., CPU/memory usage).
  • When to Use VPA: When optimizing steady workloads with fixed resource needs.
  • When to Use KEDA: For event-driven workloads that react to external triggers (e.g., Kafka, HTTP requests).

Combining Tools for Optimal Scaling

A hybrid strategy can often deliver the best results, combining the strengths of HPA, VPA, and KEDA. For example:

  • KEDA can handle scaling based on event-driven triggers (e.g., message queues).
  • HPA can scale the number of replicas based on CPU/memory usage for steady, request-driven traffic.
  • VPA can adjust resource requests for steady-state workloads.

This combination ensures that your Kubernetes environment can handle both predictable and unpredictable workloads efficiently.

Conclusion

Scaling applications in Kubernetes is essential for maintaining performance and cost efficiency in production. While HPA and VPA are great for resource-based autoscaling, they fall short in handling dynamic, event-driven workloads. This is where KEDA (Kubernetes Event-Driven Autoscaling) excels, enabling autoscaling based on external events like message queues, HTTP requests, or custom metrics.

By combining HPA, VPA, and KEDA, Kubernetes can handle both resource-based and event-driven scaling. Together, these tools ensure your applications are scalable, resilient, and cost-effective, making Kubernetes an even more powerful platform for modern cloud-native workloads. There are more advanced mechanisms such as cluster autoscaling, descheduler and Karpenter that we can cover in the future. If you need help with Kubernetes scaling, contact us today.
