Autoscaling explained
Traffic is not flat. A product launch, a viral post, or a Black Friday sale can multiply request volume in minutes. Running enough servers for peak load 24/7 wastes money; running too few drops users during spikes. Autoscaling closes that gap by adding or removing compute automatically based on signals you define. This guide covers vertical versus horizontal scaling, reactive and predictive policies, and which metrics to trust (CPU, requests per second, queue depth). It then explains how the Kubernetes Horizontal Pod Autoscaler and cluster autoscaler interact, scale-to-zero cold-start traps, hysteresis and cooldown windows, and cost guardrails, and closes with a Harbor Fleet order API worked example, a strategy decision table, common pitfalls, and a production checklist. It is meant to be read alongside our load balancing guide and Prometheus monitoring explainer.
Vertical vs horizontal scaling
Vertical scaling (scale up) gives each instance more CPU, RAM, or disk. It is simple — change an instance type, restart, done — but hits hard ceilings (largest VM size, single-node database limits) and usually requires downtime or brief disruption. Horizontal scaling (scale out) adds more identical instances behind a load balancer. Stateless web APIs and workers are the classic horizontal targets; monolithic databases often scale vertically first, then shard or replicate.
Modern autoscaling almost always means adjusting horizontal replica count for stateless tiers, optionally combined with node scaling at the infrastructure layer. The two dimensions are complementary: cluster autoscaler provisions more (or differently sized) nodes, HPA sets how many pods should run, and the scheduler places those pods on the available nodes.
Reactive vs predictive autoscaling
Reactive (threshold-based) scaling watches live metrics and adds capacity when a signal crosses a target for a sustained window. It is easy to reason about but lags demand — by the time CPU hits 80%, users may already see latency spikes. Predictive scaling uses historical patterns (same hour yesterday, known campaign schedules) to pre-warm capacity before the spike. AWS Auto Scaling scheduled actions, Google Cloud predictive autoscaling, and custom cron-based replica bumps are common implementations. Production stacks often blend both: predictive baseline plus reactive headroom.
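The blend of a predictive baseline and reactive headroom can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any cloud provider's algorithm: the hourly baseline table, the function name, and the 70% CPU target are all hypothetical.

```python
import math

# Hypothetical baseline learned from historical traffic: hour of day -> replicas.
BASELINE_BY_HOUR = {9: 6, 12: 10, 20: 8}
DEFAULT_BASELINE = 3


def blended_replicas(hour: int, current_replicas: int, cpu_pct: float,
                     cpu_target_pct: float = 70.0) -> int:
    """Return the larger of the scheduled baseline and the reactive estimate."""
    # Predictive part: capacity pre-warmed for this hour, regardless of live load.
    predictive = BASELINE_BY_HOUR.get(hour, DEFAULT_BASELINE)
    # Reactive part: replicas needed to bring average CPU back to the target.
    reactive = math.ceil(current_replicas * cpu_pct / cpu_target_pct)
    return max(predictive, reactive)


if __name__ == "__main__":
    print(blended_replicas(hour=12, current_replicas=4, cpu_pct=35))  # baseline wins
    print(blended_replicas(hour=3, current_replicas=4, cpu_pct=90))   # reactive wins
```

Taking the maximum means the schedule sets a floor for known events while live metrics can still push capacity above it.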
What to scale on: metrics that actually work
The wrong metric causes either thrashing (rapid scale-up/down loops) or under-provisioning (latency rises before the scaler reacts). Pick signals tied to user-visible pain or backlog growth.
CPU and memory utilization
CPU average across pods is the default Kubernetes HPA metric. It works when work is CPU-bound and requests/limits are set honestly. It fails when pods are I/O-bound (waiting on databases): CPU stays low while queues grow. Memory-based HPA is supported but risky: many runtimes do not release memory when load drops, so replicas rarely scale back in, and a pod near its memory limit can be OOM-killed before new replicas arrive. Always set resource requests so the scheduler and autoscaler share the same picture of capacity.
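The arithmetic behind utilization-based scaling is short enough to write out. The sketch below follows the formula in the Kubernetes HPA documentation (desired replicas equal the ceiling of current replicas times the ratio of current to target metric, skipped when the ratio is within a tolerance that defaults to 10%). The function names are illustrative, and the real controller also handles unready pods, missing metrics, and min/max bounds.

```python
import math


def cpu_utilization_pct(usage_millicores: float, request_millicores: float) -> float:
    """Utilization as HPA sees it: usage relative to the pod's CPU request."""
    return 100.0 * usage_millicores / request_millicores


def hpa_desired_replicas(current_replicas: int, current_pct: float,
                         target_pct: float, tolerance: float = 0.10) -> int:
    """Simplified HPA core: ceil(current * currentMetric / targetMetric)."""
    ratio = current_pct / target_pct
    # Inside the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # CPU-bound pod: 425m used of a 500m request is 85% utilization.
    busy = cpu_utilization_pct(425, 500)
    print(hpa_desired_replicas(3, busy, 65))
    # I/O-bound pod: 100m used of 500m is 20%, so the formula asks for fewer pods
    # even if requests are queuing behind a slow database.
    idle = cpu_utilization_pct(100, 500)
    print(hpa_desired_replicas(3, idle, 65))
```

The second call shows the I/O-bound failure mode in numbers: low CPU produces a scale-down recommendation regardless of how long requests are waiting.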
Requests per second and latency
For HTTP services, scaling on RPS or custom metrics like p95 latency tracks demand more directly than CPU. You need a metrics pipeline (Prometheus plus adapters, or cloud-native metrics) feeding the autoscaler. Target “requests per pod” rather than global RPS so the math stays stable as replica count changes.
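A minimal sketch of the "requests per pod" idea, assuming a hypothetical target of 100 RPS per pod and illustrative bounds. The point is that total demand divided by a fixed per-pod target gives the same answer no matter how many replicas are currently running.

```python
import math


def replicas_for_rps(total_rps: float, target_rps_per_pod: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Size the fleet so each pod serves roughly target_rps_per_pod."""
    needed = math.ceil(total_rps / target_rps_per_pod)
    # Clamp to the configured floor (availability) and ceiling (cost).
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    # 1,200 RPS at a 100 RPS per-pod target needs 12 pods.
    print(replicas_for_rps(1200, 100, min_replicas=2, max_replicas=30))
```

Because the per-pod average falls as replicas are added, the recommendation converges instead of chasing a global number that never changes with fleet size.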
Queue depth and consumer lag
Background workers should scale on queue depth, oldest-message age, or Kafka consumer lag — not CPU. A backlog of 10,000 jobs with idle CPU means you need more consumers, not bigger machines. Pair queue-based scaling with max concurrency per pod so new replicas actually drain work.
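Backlog-based sizing can be sketched the same way. The throughput per worker and the drain deadline below are assumptions you would measure or choose for your own queue; the function name is illustrative.

```python
import math


def workers_for_backlog(backlog_jobs: int, jobs_per_sec_per_worker: float,
                        drain_target_sec: float, min_workers: int,
                        max_workers: int) -> int:
    """Workers needed to drain the current backlog within drain_target_sec."""
    # One worker clears this many jobs before the deadline.
    jobs_per_worker = jobs_per_sec_per_worker * drain_target_sec
    needed = math.ceil(backlog_jobs / jobs_per_worker)
    return max(min_workers, min(max_workers, needed))


if __name__ == "__main__":
    # 10,000 queued jobs, 5 jobs/s per worker, drain within 2 minutes.
    print(workers_for_backlog(10_000, 5, 120, min_workers=1, max_workers=50))
```

CPU never appears in the calculation, which is the point: the backlog and the per-worker throughput decide the count.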
External and custom metrics
Cloud load balancers expose active connection counts; serverless platforms expose concurrent executions. Custom metrics (GPU utilization, embedding batch size, cache hit ratio) belong in HPA v2 via the metrics API. Document the SLO each metric protects so on-call engineers know why replica count moved.
Kubernetes autoscaling in practice
On Kubernetes, three layers often run together:
- Horizontal Pod Autoscaler (HPA) — adjusts Deployment replica count from metrics. A common starting target is around 70% average CPU utilization. Supports min/max bounds, scale-down stabilization windows, and multiple metrics (the highest recommended replica count wins).
- Vertical Pod Autoscaler (VPA) — recommends or mutates CPU/memory requests. Rarely combined with HPA on the same workload without careful tuning; many teams use VPA in recommendation-only mode.
- Cluster Autoscaler — adds or removes worker nodes when pods cannot schedule (pending due to insufficient CPU/memory) or when nodes sit underutilized. It does not replace HPA; it supplies the node capacity that HPA's replicas consume.
A typical scale-out path: traffic rises, per-pod CPU crosses the HPA target, HPA requests more replicas, new pods pend if the cluster lacks capacity, cluster autoscaler provisions a node, pods schedule, the service endpoints update. Scale-in reverses with delays — HPA waits before removing pods; cluster autoscaler waits longer before draining nodes to avoid flapping.
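The hand-off from HPA to cluster autoscaler in that path comes down to a packing question: how many nodes do the pending pods need? The sketch below is a rough estimate that considers CPU requests only and assumes identical nodes; the real cluster autoscaler simulates scheduling and also accounts for memory, DaemonSets, and affinity rules. The 1900m allocatable figure is an assumed value for a 2-vCPU node.

```python
import math


def nodes_needed(pending_pods: int, pod_cpu_request_m: int,
                 node_allocatable_m: int) -> int:
    """Estimate new nodes required to schedule pending pods (CPU only)."""
    pods_per_node = node_allocatable_m // pod_cpu_request_m
    if pods_per_node == 0:
        raise ValueError("pod CPU request exceeds node allocatable CPU")
    return math.ceil(pending_pods / pods_per_node)


if __name__ == "__main__":
    # Four pending pods at 500m each; an assumed 1900m allocatable fits three per node.
    print(nodes_needed(pending_pods=4, pod_cpu_request_m=500, node_allocatable_m=1900))
```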
Scale to zero and cold starts
Knative, KEDA, and some serverless platforms allow scale to zero replicas when demand is absent. That saves money on idle services but introduces cold starts: container pull, JVM warmup, connection pool init, and JIT compilation can add seconds of latency on the first request. Use scale-to-zero for internal tools and batch workers; keep minimum replicas ≥ 1 (often 2 for HA) on user-facing APIs with tight latency SLOs. Pre-warm images on nodes, use readiness probes so Kubernetes does not route traffic before the app is ready, and add startup probes for slow-starting apps so liveness checks do not restart them mid-boot.
Hysteresis, cooldowns, and cost caps
Without damping, autoscalers oscillate: scale up at 70% CPU, load spreads thin, CPU drops to 30%, scale down, load concentrates, spike again. Fix this with:
- Asymmetric thresholds — scale up at 70% CPU, scale down only below 40%.
- Stabilization windows — HPA behavior.scaleDown.stabilizationWindowSeconds ignores brief dips.
- Cooldown periods — cloud ASGs enforce minimum time between activities; mirror this in custom scalers.
- Max replica caps — hard ceiling prevents runaway bills during DDoS or retry storms.
- Scale-down limits — remove at most N pods per minute so draining connections finish gracefully.
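The damping techniques in the list above can be combined in one small custom scaler. This is a toy sketch, not the HPA controller: the class name and defaults are hypothetical, the scale-down limit is applied per decision rather than per minute, and the stabilization logic only mirrors the idea of acting on the highest recommendation seen inside the window.

```python
import math
from collections import deque


class DampedScaler:
    """Toy reactive scaler with a dead band, a scale-down window, and caps."""

    def __init__(self, min_replicas, max_replicas, up_pct=70, down_pct=40,
                 window_sec=300, max_down_step=2):
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.up_pct = up_pct
        self.down_pct = down_pct
        self.window_sec = window_sec
        self.max_down_step = max_down_step
        self._recent = deque()  # (timestamp_sec, raw recommendation)

    def decide(self, now, current, cpu_pct):
        # Asymmetric thresholds: only react outside the down_pct..up_pct dead band.
        if cpu_pct > self.up_pct or cpu_pct < self.down_pct:
            raw = math.ceil(current * cpu_pct / self.up_pct)
        else:
            raw = current

        # Keep only recommendations made inside the stabilization window.
        self._recent.append((now, raw))
        while self._recent[0][0] < now - self.window_sec:
            self._recent.popleft()

        if raw > current:
            target = raw  # scale up immediately
        else:
            # Scale down only as far as the highest recent recommendation,
            # and by at most max_down_step replicas per decision.
            highest_recent = max(r for _, r in self._recent)
            target = max(min(current, highest_recent), current - self.max_down_step)

        # Hard floor and ceiling: availability and cost guardrails.
        return max(self.min_replicas, min(self.max_replicas, target))


if __name__ == "__main__":
    scaler = DampedScaler(min_replicas=2, max_replicas=20)
    print(scaler.decide(now=0, current=10, cpu_pct=85))    # spike: scale up
    print(scaler.decide(now=60, current=13, cpu_pct=30))   # dip inside window: hold
    print(scaler.decide(now=400, current=13, cpu_pct=30))  # window expired: step down
```

Scaling up ignores the window on purpose: under-provisioning hurts users immediately, while holding extra replicas for a few minutes only costs money.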
Tag autoscaling groups and namespaces with cost-center labels. Alert when replica count or node count exceeds budget thresholds for more than an hour — autoscaling should not be a silent budget leak.
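The "over budget for more than an hour" rule is a sustained-breach check. In practice it would usually live in your alerting system; the Python sketch below only illustrates the logic, with a hypothetical function name and a sample format of (unix timestamp, replica count) pairs sorted by time.

```python
def over_budget_for(samples, budget_replicas, min_duration_sec=3600):
    """True if replicas stayed above budget continuously for more than min_duration_sec."""
    breach_start = None
    for ts, replicas in samples:
        if replicas > budget_replicas:
            if breach_start is None:
                breach_start = ts  # first sample of this breach
            if ts - breach_start > min_duration_sec:
                return True
        else:
            breach_start = None  # any dip under budget resets the clock
    return False


if __name__ == "__main__":
    # One sample every 10 minutes, 25 replicas against a budget of 20.
    sustained = [(t, 25) for t in range(0, 4201, 600)]
    print(over_budget_for(sustained, budget_replicas=20))
```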
Worked example: Harbor Fleet order API
Harbor Fleet runs a stateless order API on Kubernetes behind an AWS Application Load Balancer. Baseline: 3 replicas, 500m CPU request, 1 CPU limit, HPA min 3 / max 30, target 65% CPU, scale-down stabilization 300 seconds. Prometheus exports http_requests_per_second per pod via a custom metrics adapter; HPA also watches p95 latency > 250ms as a secondary metric (max of the two replica calculations wins).
During a flash sale, RPS jumps 8x in two minutes. CPU per pod hits 85%; HPA adds 6 replicas within 90 seconds. Four new pods pend, so cluster autoscaler adds two m6i.large nodes in the node pool. p95 latency peaks at 310ms then falls to 120ms as endpoints register. After the sale, RPS drops but stabilization prevents scale-down for five minutes, avoiding thrash from checkout retries. Ops reviews the event in Grafana: replica count, ALB target health, and HPA decisions on one dashboard. Takeaway: dual metrics (CPU + latency) caught an I/O-heavy spike that CPU alone would have under-provisioned; min replicas = 3 preserved HA during node bootstrap.
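For reference, the Harbor Fleet settings could be expressed as an autoscaling/v2 HorizontalPodAutoscaler manifest. The sketch below builds it as a Python dict that can be dumped to JSON and applied with kubectl. The field names follow the autoscaling/v2 API; the resource name order-api and the latency metric name http_request_duration_p95_ms are hypothetical, and a per-pod average of a p95 value is only an approximation of fleet-wide p95.

```python
import json


def harbor_fleet_hpa() -> dict:
    """Build an autoscaling/v2 HPA manifest matching the worked example."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": "order-api"},  # hypothetical name
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": "order-api",  # hypothetical name
            },
            "minReplicas": 3,
            "maxReplicas": 30,
            "metrics": [
                {   # Primary metric: 65% average CPU utilization.
                    "type": "Resource",
                    "resource": {
                        "name": "cpu",
                        "target": {"type": "Utilization", "averageUtilization": 65},
                    },
                },
                {   # Secondary metric: p95 latency, served by a custom metrics adapter.
                    "type": "Pods",
                    "pods": {
                        "metric": {"name": "http_request_duration_p95_ms"},  # hypothetical
                        "target": {"type": "AverageValue", "averageValue": "250"},
                    },
                },
            ],
            # Hold the highest recent recommendation for 5 minutes before scaling in.
            "behavior": {"scaleDown": {"stabilizationWindowSeconds": 300}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(harbor_fleet_hpa(), indent=2))
```

With several entries under metrics, HPA computes a replica count for each and uses the largest, which is the "max of the two calculations wins" behavior described above.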
Strategy decision table
| Workload type | Scale dimension | Primary metric | Notes |
|---|---|---|---|
| Stateless REST/GraphQL API | Horizontal (HPA) | RPS per pod or p95 latency | Min replicas ≥ 2; avoid scale-to-zero on hot path |
| CPU-bound batch workers | Horizontal | Queue depth or job age | Scale on backlog, not CPU |
| WebSocket / long-poll | Horizontal + sticky sessions | Active connections per pod | Draining matters on scale-down |
| Monolithic JVM service | Vertical first, then horizontal | Heap pressure + GC pause time | Cold JVM warmup hurts fast scale-out |
| GPU inference | Horizontal on GPU pool | Queue wait time or GPU util | Expensive nodes — tight max cap |
| Nightly ETL | Scheduled + reactive | Time window + CPU | Predictive cron pre-warms; reactive handles overruns |
| Multi-tenant SaaS | Horizontal per service | Per-tenant rate limits + global CPU | Noisy neighbor isolation via separate HPAs |
Common pitfalls
- Missing or wrong resource requests — HPA divides usage by requests; unset requests make scaling math meaningless.
- Scaling on CPU for I/O-bound apps — pods look idle while latency explodes; add queue or latency metrics.
- No readiness probes — traffic hits starting pods; users see errors during scale-out.
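- Too-aggressive scale-down — connection drops and cache loss; use stabilization windows and graceful termination for replica removal, and PodDisruptionBudgets (PDBs) to protect pods during node drains.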
- Cluster autoscaler without headroom — HPA requests pods that cannot schedule; pending pods are not serving traffic.
- Ignoring scale-up lead time — new nodes take minutes; predictive scaling or over-provisioned min replicas cover known events.
- Retry storms amplifying load — clients retry on 503s, doubling RPS; add exponential backoff with jitter on the client side and use circuit breakers.
- Unbounded max replicas — autoscaling during attacks or bugs can 10x cloud spend in an hour.
Production checklist
- Define SLOs (latency, error rate, queue age) before choosing scale metrics.
- Set CPU/memory requests and limits on every pod HPA manages.
- Configure min replicas for HA; max replicas for cost control.
- Add readiness and liveness probes; verify new pods pass readiness before receiving traffic.
- Enable scale-down stabilization and asymmetric scale-up vs scale-down policies.
- Ensure cluster autoscaler node pools have room for max HPA replica count.
- Export HPA decisions and replica count to your metrics stack; dashboard beside latency.
- Load-test scale-out path (including node provisioning time) quarterly.
- Document runbooks for “stuck at max replicas” and “pending pods” alerts.
- Review autoscaling bills monthly; adjust targets if average utilization is consistently low.
Key takeaways
- Horizontal autoscaling adds replicas; vertical scaling grows each instance — most web tiers scale out.
- Pick metrics tied to user pain or backlog, not CPU alone.
- Kubernetes HPA and cluster autoscaler solve different problems and must be tuned together.
- Hysteresis and cooldowns prevent thrashing; max caps prevent bill shocks.
- Scale to zero saves money on idle workloads but trades latency on cold start.
Related reading
- Kubernetes fundamentals explained — Deployments, services, and the control plane HPA plugs into
- Load balancing explained — how traffic reaches scaled replicas
- Grafana explained — dashboards for replica count, latency, and HPA behavior
- Prometheus monitoring explained — custom metrics adapters for smarter scaling