Kubernetes Pod Schedule Prioritization

Introduction

Currently, Kubernetes is not configured to treat any pod as more or less important than any other pod, with the exception of critical Kubernetes pods such as the kube-apiserver, kube-scheduler, and kube-controller-manager.

Multiple products with different Service Class requirements are hosted on Kubernetes, but there is no configuration that prioritizes these products.

The research goal is to identify a process or configuration that lets the Applications and Operations teams identify their products and ensure those products have priority when using cluster resources, for example in the event of an unintentional failure such as a worker node failure, or an intentional disruption such as removing a worker node from a cluster pool for maintenance.

A secondary goal is to determine if overcommitting the Kubernetes clusters is a viable solution to resource availability.

As always, this is a summary that generally applies to my environment. For full details, links to documents are provided at the end of this document.

Service Class

Service Class is used to define service availability. It applies to the overall service rather than to the individual components of a product. The Service Class definitions are listed below.

  • Mission Critical Service (MCS) – 99.999% up-time.
  • Business Critical Service (BCS) – 99.9% up-time.
  • Business Essential Service (BES) – 99% up-time.
  • Business Support Service (BSS) – 98% up-time.
  • Unsupported Business Service (UBS) – No guaranteed service up-time.
  • LAB – No guaranteed service up-time.

Note that the PriorityClass design does not ensure the hosted Product satisfies the contracted Service Class. PriorityClass Objects ensure that resources are available to more critical Products should resources be exhausted due to overcommitment or worker node failure.

PriorityClass Objects

As of version 1.14, Kubernetes provides PriorityClass Objects. A PriorityClass Object lets us assign a priority value to a pod, and pods with a higher value jump ahead of lower-value pods in the scheduling queue. The values fall into the following ranges.

  • 2,000,001,000 – This is used for critical pods running on Kubernetes nodes (system-node-critical).
  • 2,000,000,000 – This is used for critical pods which manage Kubernetes clusters (system-cluster-critical).
  • 1,000,000,000 – This level and lower are available for any product to use.
  • 0 – This is the default level for all non-critical pods.

Linux:cschelin@lnmt1cuomtool11$ kubectl get priorityclasses -A
NAME                      VALUE        GLOBAL-DEFAULT   AGE
system-cluster-critical   2000000000   false            22d
system-node-critical      2000001000   false            22d

system-node-critical Object

The following pods are assigned to the system-node-critical Object.

  • calico-node
  • kube-proxy

system-cluster-critical Object

The following pods are assigned to the system-cluster-critical Object.

  • calico-kube-controllers
  • coredns
  • etcd
  • kube-apiserver
  • kube-controller-manager
  • kube-scheduler

PriorityClass Definitions

A PriorityClass Object lets us define a set of values that applications can use to ensure availability based on Service Class. The following values are recommended for the Kubernetes environments.

  • 7,000,000 – Critical Infrastructure Service
  • 6,000,000 – Mission Critical Service
  • 5,000,000 – Infrastructure Service
  • 4,000,000 – Business Critical Plus Service (a product that requires 99.99% up-time)
  • 3,000,000 – Business Critical Service
  • 2,000,000 – Business Essential Service
  • 1,000,000 – Business Support Service
  • 500,000 – Unsupported Business Service and LAB Services (global default)

Most of the items in the list are well-known Service Class definitions. For the ones that I’ve added, additional details follow.

Critical Infrastructure Service

This class covers any pod that is used by any or all other pods in the cluster, especially if the pod is used by an MCS product.

Infrastructure Service

This class covers standard infrastructure pods such as the kube-state-metrics and metrics-server pods. It also includes other services such as Prometheus and Filebeat.

Business Critical Plus Service

Currently there is no 4 9’s Service Class defined; however, some products have been deployed as requiring 4 9’s support. For this reason, a PriorityClass Object was created to satisfy that Service Class request.

Testing

In testing:

  1. MCS pods in a deployment will run as long as resources are available.
  2. If there are not enough resources for the lower PriorityClass deployments, pods will be started until resources are exhausted. Remaining pods will be put in a Pending state.
  3. If additional MCS pods need to start, lower PriorityClass pods will be Terminated. Replacement pods for the terminated ones will be created and remain in a Pending state.
  4. Once the additional MCS pods are not needed, they will be deleted and any Pending pods will start.
  5. Multiple MCS deployments have no priority relative to each other. If there are insufficient resources for all MCS pods to start, then any remaining MCS pods will be put in a Pending state.
  6. If a lower PriorityClass pod has sufficient resources to start where a higher PriorityClass pod is unable to start, the lower PriorityClass pod will start.
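
The manifests used for the tests are not included here, so the following is only a sketch of how the lower PriorityClass side of such a test could be set up. The Deployment name, label, replica count, image, and request sizes are all assumptions for illustration. The scheduler places pods based on their resource requests, so a Deployment like this fills the worker nodes until the remaining replicas go Pending.

```yaml
# Hypothetical "filler" Deployment; every name and size below is an assumption.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: filler-low-priority
spec:
  replicas: 20                      # chosen to exceed what the test cluster can hold
  selector:
    matchLabels:
      app: filler-low-priority
  template:
    metadata:
      labels:
        app: filler-low-priority
    spec:
      # Lowest class from the PriorityClass Definitions list.
      priorityClassName: unsupported-business
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:               # scheduling decisions are based on requests
              cpu: "500m"
              memory: "256Mi"
```

Scaling up a second Deployment that uses a higher PriorityClass with similar requests should then reproduce the Terminated and Pending behavior described in the list above.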

Pod Preemption

There is a PriorityClass option called preemptionPolicy, which became available in Kubernetes 1.15. Setting it to Never lets you configure a PriorityClass that does not evict pods of a lower PriorityClass. Pods in such a class still move up in the scheduling queue, but they do not evict other pods when cluster resources are running low.
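
As a minimal sketch, a non-preempting class could look like the following. The name and value are hypothetical and are not part of the recommended set in this document. Note that in Kubernetes 1.15 the field was an alpha feature behind the NonPreemptingPriority feature gate, so it may need to be enabled on older clusters.

```yaml
# Hypothetical non-preempting PriorityClass; name and value are assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical-nonpreempting
value: 3000000
globalDefault: false
# Never: pods are queued ahead of lower-priority pods but do not evict them.
# The default, PreemptLowerPriority, allows eviction.
preemptionPolicy: Never
description: "Queue ahead of lower priority pods without evicting them."
```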

PodDisruptionBudget

This Object lets you specify the number of pods that must remain running. However, in testing this doesn’t appear to apply to PriorityClass evictions. If there are insufficient resources, pods in a lower PriorityClass will be evicted regardless of this setting. It does block a voluntary disruption, such as draining a worker node, when too few pods would remain running.
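
For illustration, a PodDisruptionBudget that keeps at least two pods of an application running during voluntary disruptions might look like this. The name, label, and minAvailable value are assumptions. The policy/v1 API shown here is available from Kubernetes 1.21; older clusters use policy/v1beta1 instead.

```yaml
# Hypothetical PodDisruptionBudget; name, label, and count are assumptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-app-pdb
spec:
  minAvailable: 2          # a node drain is blocked if it would leave fewer than 2
  selector:
    matchLabels:
      app: example-app     # must match the labels on the pods being protected
```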

Configuration Settings

For Deployments, add the name defined below to the pod template as spec.template.spec.priorityClassName: [name]. For a bare Pod, the field is spec.priorityClassName.
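
A minimal sketch of where the field sits in a Deployment follows. The application name, label, replica count, and image are hypothetical. The field belongs to the pod spec, so in a Deployment its full path is spec.template.spec.priorityClassName.

```yaml
# Hypothetical Deployment; only priorityClassName refers to an object from this document.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      # Must match metadata.name of an existing PriorityClass Object.
      priorityClassName: mission-critical
      containers:
        - name: example-app
          image: registry.example.com/example-app:1.0   # placeholder image
```

If the named PriorityClass does not exist, the pods are rejected at admission, so the PriorityClass Objects should be created before any Deployment references them.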

The following configurations are recommended for the environment.

Critical Infrastructure Service

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-infrastructure
value: 7000000
globalDefault: false
description: "This priority class is reserved for infrastructure services that all pods use."

Mission Critical Service

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: mission-critical
value: 6000000
globalDefault: false
description: "This priority class is reserved for services that require 99.999% uptime."

Infrastructure Service

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: infrastructure
value: 5000000
globalDefault: false
description: "This priority class is reserved for infrastructure services."

Business Critical Plus Service

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical-plus
value: 4000000
globalDefault: false
description: "This priority class is reserved for services that require 99.99% uptime."

Business Critical Service

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical
value: 3000000
globalDefault: false
description: "This priority class is reserved for services that require 99.9% uptime."

Business Essential Service

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-essential
value: 2000000
globalDefault: false
description: "This priority class is reserved for services that require 99% uptime."

Business Support Service

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-support
value: 1000000
globalDefault: false
description: "This priority class is reserved for services that require 98% uptime."

Unsupported Business Service

Note the globalDefault setting here. It makes this the PriorityClass for any pod whose Deployment does not set a PriorityClass.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: unsupported-business
value: 500000
globalDefault: true
description: "This priority class is reserved for services that have no uptime requirements."

PriorityClass Object Table

Linux:cschelin@lnmt1cuomtool11$ kubectl get pc -A
NAME                              VALUE        GLOBAL-DEFAULT   AGE
business-critical                 3000000      false            3d9h
business-critical-plus            4000000      false            3d9h
business-essential                2000000      false            3d9h
business-support                  1000000      false            3d9h
critical-infrastructure           7000000      false            3s
infrastructure                    5000000      false            6s
mission-critical                  6000000      false            14s
system-cluster-critical           2000000000   false            25d
system-node-critical              2000001000   false            25d
unsupported-business              500000       true             3d9h

Conclusion

The above recommendations provide a reliable way of ensuring critical products that are deployed to Kubernetes will have the necessary resources to respond appropriately to requests.

In order to prevent service disruption, ensure that no deployed product consumes so many resources that the minimum requirements of the other deployed products cannot be met.

This might also permit overcommitting resources in the clusters.

References
