Currently Kubernetes is not configured to treat any pod as more or less important than any other pod with the exception of critical Kubernetes pods such as the kube-apiserver, kube-scheduler, and kube-controller-manager.
Multiple products with different Service Class requirements are hosted on Kubernetes but there is no configuration that provides any prioritization of these products.
The research goal is to identify a process or configuration that lets the Applications and Operations teams identify their products and ensure those products have priority when using cluster resources, for example during an unintentional failure such as a worker node failure, or an intentional disruption such as removing a worker node from a cluster pool for maintenance.
A secondary goal is to determine if overcommitting the Kubernetes clusters is a viable solution to resource availability.
As always, this is a summary that generally applies to my environment. For full details, links to documents are provided at the end of this document.
Service Class is used to define service availability. It applies not to the individual components of a product but to the overall service. The Service Class definitions are listed here.
- Mission Critical Service (MCS) – 99.999% up-time.
- Business Critical Service (BCS) – 99.9% up-time.
- Business Essential Service (BES) – 99% up-time.
- Business Support Service (BSS) – 98% up-time.
- Unsupported Business Service (UBS) – No guaranteed service up-time.
- LAB – No guaranteed service up-time.
Note that the PriorityClass design does not ensure that a hosted Product satisfies its contracted Service Class. PriorityClass Objects ensure that resources are available to more critical Products when resources are exhausted due to overcommitment or worker node failure.
As of version 1.14, PriorityClass Objects are generally available in Kubernetes. A PriorityClass assigns a priority value to a pod, and a higher value lets the pod jump ahead in the scheduling queue. The following values are predefined or reserved.
- 2,000,001,000 – This is used for critical pods running on Kubernetes nodes (system-node-critical).
- 2,000,000,000 – This is used for critical pods which manage Kubernetes clusters (system-cluster-critical).
- 1,000,000,000 – This level and lower is available for any product to use.
- 0 – This is the default level for all non-critical pods.
Linux:cschelin@lnmt1cuomtool11$ kubectl get priorityclasses -A
NAME                      VALUE        GLOBAL-DEFAULT   AGE
system-cluster-critical   2000000000   false            22d
system-node-critical      2000001000   false            22d
The following pods are assigned to the system-node-critical Object.
The following pods are assigned to the system-cluster-critical Object.
A PriorityClass Object lets us define a set of values which applications can use to ensure availability based on Service Class. The values below are recommended for the Kubernetes environments.
- 7,000,000 – Critical Infrastructure Service
- 6,000,000 – Mission Critical Service
- 5,000,000 – Infrastructure Service
- 4,000,000 – Business Critical Plus Service (a product that requires 99.99% up-time)
- 3,000,000 – Business Critical Service
- 2,000,000 – Business Essential Service
- 1,000,000 – Business Support Service
- 500,000 – Unsupported Business Service and LAB Services (global default)
Most of the items in the list are well-known Service Class definitions. Additional details follow for the ones that I’ve added.
Critical Infrastructure Service
Any pod that is used by any or all other pods in the cluster, especially if the pod is used by an MCS product.
Infrastructure Service
Standard infrastructure pods such as kube-state-metrics and metrics-server. This also includes other services such as Prometheus and Filebeat.
Business Critical Plus Service
Currently there is no four-nines (99.99%) Service Class defined; however, some products have been deployed as requiring four-nines support. For this reason, a PriorityClass Object was created to satisfy that Service Class request.
- MCS pods in a deployment will run as long as resources are available.
- If there are not enough resources for the lower PriorityClass deployments, pods will be started until resources are exhausted. Remaining pods will be put in a Pending state.
- If additional MCS pods need to start, lower PriorityClass pods will be terminated to free resources. Replacement pods for the terminated ones will be created and remain in a Pending state.
- Once the additional MCS pods are no longer needed, they will be deleted and any Pending pods will start.
- For multiple MCS deployments, there is no ordering within the same PriorityClass. If there are insufficient resources for all MCS pods to start, any remaining MCS pods will be put in a Pending state.
- If a lower PriorityClass pod has sufficient resources to start where a higher PriorityClass pod is unable to start, the lower PriorityClass pod will start.
A PriorityClass option called preemptionPolicy was made available in Kubernetes 1.15. Setting it to Never configures a PriorityClass so that it does not evict pods of a lower PriorityClass. Pods using it still move up in the scheduling queue, but they do not evict other pods if cluster resources are running low.
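As a minimal sketch of the non-evicting behavior described above, the PriorityClass below sets preemptionPolicy to Never. The name business-critical-nonpreempting and its value are hypothetical and are not part of the recommendations in this document. In Kubernetes 1.15 the field was an alpha feature behind the NonPreemptingPriority feature gate, so whether it is usable depends on the cluster version and configuration.

```yaml
# Hypothetical example: the name and value are illustrative only.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical-nonpreempting
value: 3000000
# Never: pods are placed ahead of lower-priority pods in the scheduling queue
# but do not evict running pods. The default is PreemptLowerPriority.
preemptionPolicy: Never
globalDefault: false
description: "Illustrative non-preempting priority class."
```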
A PodDisruptionBudget Object lets you specify the number of pods that must remain running. However, in testing this doesn’t appear to apply to PriorityClass evictions. If there are insufficient resources, pods in a lower PriorityClass will be evicted regardless of this setting. It will block a voluntary disruption, such as draining a worker node, if there wouldn’t be enough pods remaining.
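Assuming the Object in question is a PodDisruptionBudget (the description above matches its behavior), a minimal sketch could look like the following. The name example-product-pdb, the app label, and the minAvailable value are all hypothetical.

```yaml
# Hypothetical example: the name, label, and minAvailable value are illustrative.
apiVersion: policy/v1        # clusters older than 1.21 use policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: example-product-pdb
spec:
  minAvailable: 2            # at least 2 matching pods must stay running during voluntary disruptions
  selector:
    matchLabels:
      app: example-product   # must match the labels on the pods to protect
```

With a budget like this in place, a node drain that would leave fewer than two matching pods running is refused until replacement pods are available elsewhere.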
For Deployments, add one of the names defined below to the pod template as spec.template.spec.priorityClassName: [name].
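To show where the setting sits in a manifest, here is a minimal sketch of a Deployment that requests the business-critical PriorityClass. The Deployment name, labels, image, and resource requests are hypothetical. Note that the field belongs to the pod template, not to the top-level Deployment spec.

```yaml
# Hypothetical Deployment: name, labels, image, and requests are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-product
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-product
  template:
    metadata:
      labels:
        app: example-product
    spec:
      # Must match the name of an existing PriorityClass Object.
      priorityClassName: business-critical
      containers:
        - name: example-product
          image: registry.example.com/example-product:1.0
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
```

Once the pod is created, the numeric value of the PriorityClass is copied into the pod's spec.priority field, which is a quick way to confirm that the class was applied.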
The following configurations are recommended for the environment.
Critical Infrastructure Service
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-infrastructure
value: 7000000
globalDefault: false
description: "This priority class is reserved for infrastructure services that all pods use."
Mission Critical Service
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: mission-critical
value: 6000000
globalDefault: false
description: "This priority class is reserved for services that require 99.999% uptime."
Infrastructure Service
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: infrastructure
value: 5000000
globalDefault: false
description: "This priority class is reserved for infrastructure services."
Business Critical Plus Service
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical-plus
value: 4000000
globalDefault: false
description: "This priority class is reserved for services that require 99.99% uptime."
Business Critical Service
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical
value: 3000000
globalDefault: false
description: "This priority class is reserved for services that require 99.9% uptime."
Business Essential Service
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-essential
value: 2000000
globalDefault: false
description: "This priority class is reserved for services that require 99% uptime."
Business Support Service
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-support
value: 1000000
globalDefault: false
description: "This priority class is reserved for services that require 98% uptime."
Unsupported Business Service
Note that globalDefault is set to true here, so any pod whose Deployment does not set a PriorityClass is assigned this one.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: unsupported-business
value: 500000
globalDefault: true
description: "This priority class is reserved for services that have no uptime requirements."
PriorityClass Object Table
Linux:cschelin@lnmt1cuomtool11$ kubectl get pc -A
NAME                      VALUE        GLOBAL-DEFAULT   AGE
business-critical         3000000      false            3d9h
business-critical-plus    4000000      false            3d9h
business-essential        2000000      false            3d9h
business-support          1000000      false            3d9h
critical-infrastructure   7000000      false            3s
infrastructure            5000000      false            6s
mission-critical          6000000      false            14s
system-cluster-critical   2000000000   false            25d
system-node-critical      2000001000   false            25d
unsupported-business      500000       true             3d9h
The above recommendations provide a reliable way of ensuring critical products that are deployed to Kubernetes will have the necessary resources to respond appropriately to requests.
To prevent service disruption, ensure that no deployed product consumes so many resources that the cluster can no longer supply the minimum required by all deployed products.
This might also permit overcommitting resources in the clusters.