Overview
For the current Kubernetes clusters, I reviewed several industry best practices and went with a worker node configuration of 4 CPUs and 16 GB of RAM per node. In addition, I maintain a spreadsheet describing all the nodes in each cluster so I can track resource requirements and decide when to extend a cluster or provision larger worker nodes.
This document describes the Pros and Cons of sizing worker nodes to help in understanding the decision behind the worker node design I went with.
Note that I’ve expanded my original document: the article linked at the end of this post prompted me to re-examine my decision and confirmed my reasoning behind my choices. In rewriting the document, I left out the cloud considerations from the original article, since the clusters I’m designing are on-prem and not hosted in the cloud at this time. We may revisit the article at a later date should we migrate to the cloud.
Considerations
Probably the key piece of information you’ll need, and the one that should guide your decision, is the set of resource requirements of the microservices that will be hosted on the clusters. If a monster “microservice” requires 2 vCPUs, then a 2 vCPU worker node won’t do the trick. Even a 4 vCPU node might be a bit problematic, since other microservices will likely need to run on the same workers, not to mention any replication requirements or autoscaling of the product. These are things you’ll want to be aware of when deciding on a worker node size; a sketch of how such requirements are declared follows.
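To make that concrete, here’s a minimal sketch, with hypothetical names, of a Deployment whose container requests the 2 vCPUs from the example above. The scheduler only places a pod on a node with that much unallocated CPU, which is why node size has to be weighed against your largest service’s requests.

```yaml
# Minimal sketch; the service name and image are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: monster-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: monster-service
  template:
    metadata:
      labels:
        app: monster-service
    spec:
      containers:
        - name: app
          image: registry.example.com/monster-service:1.0  # hypothetical
          resources:
            requests:
              cpu: "2"       # the 2 vCPU "monster" from above
              memory: 8Gi
            limits:
              cpu: "2"
              memory: 8Gi
```

On a 4 vCPU node, this one pod claims half the CPU before the OS reservation and any other services are accounted for.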
Large Worker Nodes
Let’s consider the Pros and Cons of a large worker node.
Pro: Fewer Servers
Simple enough: there are fewer servers to manage. Fewer accounts, fewer agents, and less overhead overall in managing the environment, since there are fewer servers that need care and feeding.
Pro: Resource Requirements
The control plane resource needs are lower with fewer worker nodes. A 4 worker node cluster with 10 CPUs per node gives you a 40 CPU cluster, or 40,000 millicpus. If the operating system and associated services reserve 5% of CPU per worker node, that’s 500 millicpus per node, so a 4 worker node cluster loses about 2,000 millicpus, or 2 CPUs, leaving 38 CPUs for the hosted microservices. In addition, the control plane has less work to do: every worker node is registered with the control plane, so fewer workers means fewer networking requirements and a lighter load on the control plane. The per-node reservation itself is set in the kubelet configuration, as sketched below.
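Here’s a minimal sketch of a KubeletConfiguration reserving roughly 5% of a 10 CPU node; the exact split between system daemons and Kubernetes daemons is an assumption for illustration.

```yaml
# Sketch only: reserves ~500m (5% of 10 CPUs) per node, split between the
# OS and the Kubernetes daemons. Tune the values to your environment.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  cpu: 250m        # OS and system daemons
  memory: 500Mi
kubeReserved:
  cpu: 250m        # kubelet, container runtime, etc.
  memory: 500Mi
```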
Pro: Cost Savings
If you’re running a managed Kubernetes distribution such as Rancher or OpenShift, larger nodes might be more cost efficient depending on the size of the smaller alternatives, since the number of CPUs in the cluster determines the cost. Smaller nodes give you more flexibility, but they could increase the total CPU requirements and therefore cost more in the long term. On the other hand, adding a fifth 10 CPU node might be less cost effective, especially if you only need 2 or 4 more CPUs.
Con: Lots of Pods
Kubernetes has a default limit of 110 pods per node, roughly the point beyond which a node starts having issues. With large nodes, the likelihood of too many pods on a worker node increases, which can affect the hosted services. The agents on the worker node also have a lot more work to do: Docker is busier, kubelet is busier, and in general the worker node works harder. Don’t forget the various probes performed by kubelet, such as liveness, readiness, and startup probes. The pod ceiling is a kubelet setting, shown below.
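For reference, the ceiling is kubelet’s maxPods setting; a minimal sketch:

```yaml
# maxPods defaults to 110; it can be raised on big nodes, but doing so
# increases the load on kubelet and the container runtime accordingly.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 110
```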
Con: Replication
If you’re using Horizontal Pod Autoscaling (HPA) or just configure the deployment to have multiple pods for replication purposes, fewer nodes means your effective replication is capped at the number of nodes. If you have a product with 4 replicas and only a 3 node cluster, you effectively only have 3 replicas: there are 4 running pods, but if the node hosting two of them fails, you’ve just lost 50% of your service. You can ask the scheduler to spread replicas across nodes, as sketched below, but it can’t put 4 replicas on 3 nodes without doubling up somewhere.
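One way to make the spread explicit is a topology spread constraint on the deployment. A minimal sketch, with hypothetical names, that asks the scheduler to keep the per-node replica counts as even as possible:

```yaml
# Sketch only; the name and label are hypothetical. maxSkew: 1 with the
# hostname topology key keeps per-node replica counts within 1 of each other.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
spec:
  replicas: 4
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: example-service
      containers:
        - name: app
          image: registry.example.com/example-service:1.0  # hypothetical
```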
Con: Bigger Impact
A pretty simple note here: with fewer nodes, if a node does fail, more pods are affected and the potential service impact is greater.
Smaller Worker Nodes
In reading the above Pros and Cons, you can probably figure out the Pros and Cons of smaller worker nodes, but let’s list them anyway.
Pro: Smaller Impact
With more, smaller nodes, fewer pods and services are impacted if a node goes away. This can be important when we manage nodes deliberately as well, such as when draining them for patching and upgrades.
Pro: Replication
This assumes you run a high number of pods for a service. More, smaller nodes ensure you can both have multiple pods and use HPA to automatically scale out further if needed. And with smaller nodes, one node failing doesn’t significantly affect a service. A sketch of an HPA follows.
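For illustration, here’s a minimal HPA sketch, with hypothetical names, that scales a deployment between 4 and 10 replicas based on CPU utilization:

```yaml
# Sketch only; the target name and thresholds are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-service
  minReplicas: 4
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```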
Con: More Servers
Yes, there are more servers to manage: more accounts, more agents, more resource usage. In an environment where Infrastructure as Code is the rule, though, more nodes shouldn’t have the same impact they would if you were managing the nodes manually.
Con: Resource Availability
With more nodes, the control plane overhead increases, and the usable cluster size can shrink. With my 4 CPU worker node design, for example, a 10 node cluster uses the same resources as the 4 node, 10 CPU cluster above: 40 CPUs total, with the 5% reservation still costing about 2 CPUs across the cluster. What grows is the cluster management overhead as worker nodes are added, including more networking, since every node talks to every node.
Con: Pod Limits
Again, this is related to how many resources a product uses. With smaller nodes, there’s a chance of enough resource fragmentation that pods can’t be scheduled if the node is too small relative to the microservice requirements. If the microservices are truly “micro”, then smaller nodes could be fine, but if the microservices are moderately sized, nodes that are too small will strand resources and services may not start; the sketch below shows the failure mode.
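To make the fragmentation failure concrete, this hypothetical pod requests 2 CPUs. On a cluster of 4 CPU nodes that each have only about 1.5 CPUs unallocated, it cannot be scheduled anywhere, even though the cluster’s total free CPU exceeds the request:

```yaml
# Illustrative only. The request fits no single node, so the pod stays
# Pending despite spare CPU being available across the cluster.
apiVersion: v1
kind: Pod
metadata:
  name: moderately-sized-service   # hypothetical
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # hypothetical
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
```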
Conclusion
As noted initially, knowing what your microservices need gives you the best guideline for worker node size. Consider other factors as well, such as the cost per CPU of nodes when spread out versus consolidated. And don’t forget, you can grow a cluster that starts out too small by adding or enlarging worker nodes, especially if they’re virtual machines: power down a node, increase its CPU and RAM allocation, and bring it back up.