Kubernetes is a popular way to deploy cloud-native applications in the cloud or on-premises. One of its biggest advantages is the ability to scale an application automatically based on demand, which is especially helpful when the load an application encounters is variable. Kubernetes supports three types of scaling: cluster scaling, vertical scaling, and horizontal scaling.
Cluster scaling is mostly useful in a cloud environment, where additional hosts can be added or removed via an API. For instance, AWS provides a way for their managed Kubernetes to automatically scale the underlying infrastructure so there are more resources available for pods.
Vertical scaling increases or decreases the resources for a pod based on the load it sees, restarting the pod with updated resource requirements. A vertical autoscaler is most useful when you are unsure what size to make a pod and more replicas won't help. It makes its recommendation based on the pod's observed usage.
Finally, horizontal scaling increases or decreases the number of replicas of a pod based on resource usage. It is the most commonly used way to scale stateless applications, and the easiest: the pods' total resources can be spread across the cluster, whereas a vertical scaler can only make a pod as big as the host it runs on.
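To make the horizontal case concrete, here is a minimal sketch of the replica calculation described in the Kubernetes documentation: the desired count is the current count multiplied by the ratio of the observed metric to its target, rounded up, with no change when the ratio is within a tolerance of 1.0 (0.1 by default). The function name is our own, and the sketch deliberately ignores min/max bounds, stabilization windows, and pods that are not ready.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Simplified version of the documented HPA formula (illustrative only)."""
    ratio = current_value / target_value
    # Inside the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # Otherwise scale proportionally to the ratio and round up.
    return math.ceil(current_replicas * ratio)
```

For example, four replicas seeing twice the target load would scale to eight, while four replicas at 5% over target would stay at four.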
Kubernetes Horizontal Pod Autoscaling Best Practices
The default way of using horizontal or vertical scaling is to install the metrics server. This allows the Kubernetes horizontal pod autoscaler to retrieve the CPU or memory usage of a deployment's pods. This solves a lot of problems quite well, but sometimes your pods are not bound by CPU or memory.
At a recent client, phData needed to horizontally scale a custom service based on the number of HTTPS requests. The normal pattern would be to install a tool like Prometheus to collect the metrics exposed by the application and make them available to the horizontal pod autoscaler for scaling decisions. However, in our case, the client already had a logging and resource-usage infrastructure in place, so this wasn't an option. We looked at tying into their existing infrastructure, but that would have added complexity, as none of it was based on Kubernetes.
At this point, we decided that building our own custom metrics pod and service would be the simplest way to achieve our goal of scaling on custom metrics. To start, we created an endpoint in our microservices that returned the specific metrics we needed as a JSON object. For us, that was the number of requests received in the last minute and the average request time of those requests. It worked much like a Prometheus endpoint in Spring Boot Actuator, except that the aggregation happened in the application instead of in Prometheus.
Once we got the endpoint running, we needed a way for Kubernetes to interact with the endpoint and process the data. The custom metrics server is the piece that acts as the in-between for the endpoint and the Kubernetes horizontal pod autoscaler. Once the server's provider.go defines the new metric and includes a process to query the service for its values, those values can be returned to the horizontal pod autoscaler as needed.
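As a rough illustration of that in-application aggregation, the sketch below keeps a sliding window of request timings and produces the two numbers mentioned above. It is not the client's code: the class name, the JSON field names, and the 60-second default are assumptions made for the example.

```python
import time
from collections import deque


class RequestWindow:
    """Tracks request durations seen within the last `window_seconds` seconds."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self._samples = deque()  # (timestamp, duration_seconds), oldest first

    def record(self, duration_seconds: float, now: float = None) -> None:
        """Call once per handled request."""
        now = time.time() if now is None else now
        self._samples.append((now, duration_seconds))

    def snapshot(self, now: float = None) -> dict:
        """Return the metrics a JSON endpoint could serve."""
        now = time.time() if now is None else now
        cutoff = now - self.window_seconds
        # Drop samples that have aged out of the window.
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()
        count = len(self._samples)
        avg = sum(d for _, d in self._samples) / count if count else 0.0
        # Field names are hypothetical.
        return {"request_count": count, "avg_request_seconds": avg}
```

A request handler would call `record()` after each request, and the metrics endpoint would serialize `snapshot()` with `json.dumps`.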
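The translation step can be sketched as follows. The real provider is written in Go; this Python version only illustrates the data shape, converting the endpoint's JSON body into roughly the `MetricValueList` structure that the `custom.metrics.k8s.io/v1beta1` API returns. The function name and the assumption that the endpoint's JSON field is named after the metric are ours.

```python
import json
from datetime import datetime, timezone


def to_metric_value_list(endpoint_body: str, metric_name: str,
                         namespace: str, service: str, now: datetime = None) -> dict:
    """Wrap one value from the app's JSON endpoint in a MetricValueList-shaped dict."""
    payload = json.loads(endpoint_body)
    now = now or datetime.now(timezone.utc)
    return {
        "kind": "MetricValueList",
        "apiVersion": "custom.metrics.k8s.io/v1beta1",
        "metadata": {},
        "items": [{
            "describedObject": {
                "kind": "Service",
                "namespace": namespace,
                "name": service,
                "apiVersion": "v1",
            },
            "metricName": metric_name,
            "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
            # Values are serialized as Kubernetes quantity strings.
            # Assumption: the endpoint field shares the metric's name.
            "value": str(int(payload[metric_name])),
        }],
    }
```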
Horizontal Pod Autoscaling in Action
To demonstrate this, there is a sample application (k8s-scaling-app) that returns a random number in a JSON object when its endpoint is called. If you want to run the application locally, use Minikube and follow the instructions in the readme to build the Docker image.
Once the application is running, you can open a shell to a container, install curl (apk add curl), and query the service URL. It should look like this:
The app will keep returning the same number of replicas (2 in this example) for 180 seconds, at which point it will generate a new random number. This stands in for a real metric value.
With that working, build the Docker image, then deploy the required permissions and the pod for autoscaler-api (instructions in the readme). Once the pod is running, you can confirm that the API is working by running the query below and checking the result:
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/service/sample-metrics-app/replicas | jq
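For the autoscaler to act on that metric, a HorizontalPodAutoscaler object has to reference it. The sample's readme covers the actual setup; the sketch below only shows what such an object could look like, built as a Python dict and printed as JSON. The deployment name, replica bounds, and target are hypothetical, and `autoscaling/v2` assumes a reasonably recent cluster. An `AverageValue` target of 1 is chosen here so that the desired replica count should roughly track the number the service reports.

```python
import json


def build_hpa(deployment: str, service: str, metric: str,
              min_replicas: int = 1, max_replicas: int = 10) -> dict:
    """Build an HPA manifest that scales a Deployment on a Service-level custom metric."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{deployment}-hpa"},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Object",
                "object": {
                    # The object whose metric is read via the custom metrics API.
                    "describedObject": {"apiVersion": "v1", "kind": "Service", "name": service},
                    "metric": {"name": metric},
                    # Hypothetical target; tune for a real metric.
                    "target": {"type": "AverageValue", "averageValue": "1"},
                },
            }],
        },
    }


if __name__ == "__main__":
    # "k8s-scaling-app" as the deployment name is an assumption.
    print(json.dumps(build_hpa("k8s-scaling-app", "sample-metrics-app", "replicas"), indent=2))
```

Because kubectl accepts JSON manifests, the printed output could be piped to `kubectl apply -f -`.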
Within 2 minutes the pods should start to scale up or down depending on the results from the random number generator. Below is a diagram showing how the Kubernetes horizontal pod autoscaling process works.
What to Watch Out For
I will say, this method isn't a silver bullet, and you need to be mindful of how you implement it. If you query the Kubernetes service only once, you get the response of just one pod. If the ingress isn't using round-robin, or the load is not evenly distributed across the pods, there could be spikes on individual pods that would not be visible to the horizontal pod autoscaler. In the example that is coded up, the horizontal pod autoscaler will get a different random value depending on which pod the request is routed to.
So if the load is distributed evenly across all pods, it will work as-is. If not, you can do what we did with our client and implement a process to query all of the pods and combine their values. In our case, we just averaged all pods together, but taking the max or a percentile could also be helpful.
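The combining step itself is small. Here is a minimal sketch, assuming the per-pod values have already been collected (for instance by querying each pod behind the service directly); the function name and the nearest-rank percentile method are our choices for illustration, not what the client implementation necessarily used.

```python
import math
import statistics


def aggregate(per_pod_values, mode: str = "avg", percentile: float = 90):
    """Combine one metric value per pod into a single value for the autoscaler."""
    if not per_pod_values:
        raise ValueError("no pod responses to aggregate")
    if mode == "avg":
        return statistics.fmean(per_pod_values)
    if mode == "max":
        # Reacts to the single busiest pod.
        return max(per_pod_values)
    if mode == "percentile":
        # Nearest-rank percentile: less sensitive to one outlier than max.
        ordered = sorted(per_pod_values)
        rank = max(1, math.ceil(percentile / 100 * len(ordered)))
        return ordered[rank - 1]
    raise ValueError(f"unknown mode: {mode}")
```

With pods reporting 10, 20, and 90, the average is 40 while the max is 90, which shows how averaging can hide a hot pod when the load is uneven.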