Kubernetes’ scheduling magic revealed
Understanding how the Kubernetes scheduler makes scheduling decisions is critical to ensure consistent performance and optimal resource utilization.
Kubernetes is an industry-changing technology that allows massive scale and simplicity for the orchestration of containers. Most of us happily push thousands of deployments and pods to Kubernetes every day. Have you ever wondered what sorcery is at play in Kubernetes to determine where all those pods will be created in the Kubernetes cluster? All of this is made possible by the kube-scheduler.
Understanding how the Kubernetes scheduler makes scheduling decisions is critical to ensuring consistent performance and optimal resource utilization. All scheduling in Kubernetes is based on a few key pieces of information. First, the scheduler uses information about each worker node to determine the node's total capacity. Running kubectl describe node <node> shows you everything you need to understand how the scheduler sees the world.
Capacity:
  cpu:                4
  ephemeral-storage:  103079200Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16427940Ki
  pods:               110
Allocatable:
  cpu:                3600m
  ephemeral-storage:  98127962034
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             14932524020
  pods:               110
Here we see what the scheduler considers to be the total capacity of the worker node, as well as its allocatable capacity. The allocatable numbers are what remains of the capacity after the kubelet settings for Kubernetes-reserved and system-reserved resources are taken into account. Allocatable represents the total space the scheduler has to work with for a given node.
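To make the relationship between the two sections concrete, here is a minimal sketch that converts the quantities from the output above into plain numbers and computes how much is held back on this node. The helper names (parse_cpu_millicores, parse_bytes) are hypothetical, and the parser handles only a small subset of Kubernetes quantity suffixes.

```python
# Hypothetical helpers: they cover only a few quantity suffixes,
# enough for the node output shown above.
_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
             "k": 10**3, "M": 10**6, "G": 10**9}


def parse_cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '4' or '3600m' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def parse_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '16427940Ki' to bytes."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)  # no suffix: already in bytes


# Values copied from the node output above.
capacity = {"cpu": "4", "memory": "16427940Ki"}
allocatable = {"cpu": "3600m", "memory": "14932524020"}

reserved_cpu_m = (parse_cpu_millicores(capacity["cpu"])
                  - parse_cpu_millicores(allocatable["cpu"]))
reserved_mem_bytes = (parse_bytes(capacity["memory"])
                      - parse_bytes(allocatable["memory"]))

print(f"CPU held back:    {reserved_cpu_m}m")
print(f"Memory held back: {reserved_mem_bytes} bytes")
```

On this node, the difference works out to 400m of CPU and a little under 2GB of memory that the scheduler will never hand out to pods.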
Next, we need to look at how we instruct the scheduler about our workload. It is important to note that the scheduler does not consider the actual CPU and memory utilization of a workload. It factors in only the resource descriptions provided by the developer or operator. Here is an example from a pod object definition:
resources:
  limits:
    cpu: 100m
    memory: 170Mi
  requests:
    cpu: 100m
    memory: 170Mi
These specifications are provided at the container level. The developer must provide resource requests and limits per container, not per pod. What do these specifications mean? The limits are considered only by the kubelet and are not a factor during scheduling. In this example, the container's cgroup will be set to limit CPU utilization to 10% of a single CPU core, and if memory utilization exceeds 170Mi, the process will be killed and restarted; there is no “soft” memory limit in Kubernetes’ use of cgroups. The requests are used by the scheduler to determine the best worker node on which to place the workload. Note that the scheduler sums the resource requests of all containers in the pod to determine where to place it, while the kubelet enforces limits on a per-container basis.
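The split between per-pod scheduling and per-container enforcement can be sketched as follows. This is an illustration only: the Container class, the second "sidecar" container, and the helper names are hypothetical, and the sketch ignores init containers and pod overhead. The CPU limit conversion assumes the default CFS period of 100ms.

```python
from dataclasses import dataclass

MI = 2**20
CFS_PERIOD_US = 100_000  # assumed default CFS period: 100 ms


@dataclass
class Container:
    name: str
    cpu_request_m: int
    mem_request_bytes: int
    cpu_limit_m: int
    mem_limit_bytes: int


def pod_requests(containers):
    """What the scheduler looks at: requests summed over all containers."""
    cpu = sum(c.cpu_request_m for c in containers)
    mem = sum(c.mem_request_bytes for c in containers)
    return cpu, mem


def cfs_quota_us(cpu_limit_m, period_us=CFS_PERIOD_US):
    """What the kubelet enforces per container: CPU time allowed per period."""
    return cpu_limit_m * period_us // 1000


# Two containers, each using the resources block from the example above.
# The second container is a hypothetical sidecar added for illustration.
pod = [
    Container("app", 100, 170 * MI, 100, 170 * MI),
    Container("sidecar", 100, 170 * MI, 100, 170 * MI),
]

total_cpu_m, total_mem_bytes = pod_requests(pod)
print(f"scheduler sees: {total_cpu_m}m CPU, {total_mem_bytes // MI}Mi memory")
for c in pod:
    print(f"{c.name}: {cfs_quota_us(c.cpu_limit_m)}us of CPU per "
          f"{CFS_PERIOD_US}us period, killed above {c.mem_limit_bytes // MI}Mi")
```

A 100m limit becomes 10,000 microseconds of CPU time in every 100,000-microsecond period, which is the 10% of a core described above.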
We now have enough information to understand the basic resource-based scheduling logic that Kubernetes uses. When a new pod is created, the scheduler looks at the total resource requests of the pod and then attempts to find the worker node that has the most available resources. The scheduler tracks this for each node, as seen in the output of kubectl describe node:
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits   Memory Requests  Memory Limits
  ------------  ----------   ---------------  -------------
  1333m (37%)   2138m (59%)  1033593344 (6%)  1514539264 (10%)
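The percentages in this output are relative to the node's allocatable values, not its raw capacity. A minimal sketch of the arithmetic, using the allocatable numbers shown earlier (3600m CPU and 14932524020 bytes of memory) and assuming the percentage is truncated to a whole number, which matches the figures shown:

```python
def percent_of_allocatable(amount: int, allocatable: int) -> int:
    """Share of allocatable, truncated to a whole percent (assumed rounding)."""
    return amount * 100 // allocatable


ALLOCATABLE_CPU_M = 3600
ALLOCATABLE_MEM_BYTES = 14932524020

cpu_requests_pct = percent_of_allocatable(1333, ALLOCATABLE_CPU_M)
cpu_limits_pct = percent_of_allocatable(2138, ALLOCATABLE_CPU_M)
mem_requests_pct = percent_of_allocatable(1033593344, ALLOCATABLE_MEM_BYTES)
mem_limits_pct = percent_of_allocatable(1514539264, ALLOCATABLE_MEM_BYTES)

print(cpu_requests_pct, cpu_limits_pct, mem_requests_pct, mem_limits_pct)
```

Only the requests columns matter for placement; the limits columns can legitimately add up to more than 100 percent.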
You can investigate the exact details of the Kubernetes scheduler via the source code. Scheduling happens in two passes. On the first pass, the scheduler filters the nodes down to those capable of running a given pod, based on resource requests and other scheduling requirements. On the second pass, the scheduler weighs the eligible nodes based on their absolute and relative resource utilization and other factors. The highest-weighted eligible node is selected to run the pod.
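The two passes can be illustrated with a deliberately simplified sketch. This is not the kube-scheduler's implementation: the Node class, the node names, and the scoring formula (average fraction of CPU and memory left free after placement) are assumptions chosen to show the filter-then-weigh shape, and the real scheduler considers many more factors.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    alloc_cpu_m: int  # allocatable CPU in millicores
    alloc_mem: int    # allocatable memory in bytes
    req_cpu_m: int    # CPU requests of pods already placed
    req_mem: int      # memory requests of pods already placed


def fits(node, cpu_m, mem):
    """Pass 1 (filter): do the pod's requests fit in what is left?"""
    return (node.req_cpu_m + cpu_m <= node.alloc_cpu_m
            and node.req_mem + mem <= node.alloc_mem)


def score(node, cpu_m, mem):
    """Pass 2 (weigh): assumed formula favoring the emptiest node."""
    cpu_free = (node.alloc_cpu_m - node.req_cpu_m - cpu_m) / node.alloc_cpu_m
    mem_free = (node.alloc_mem - node.req_mem - mem) / node.alloc_mem
    return (cpu_free + mem_free) / 2


def schedule(nodes, cpu_m, mem):
    """Return the best eligible node, or None if the pod fits nowhere."""
    eligible = [n for n in nodes if fits(n, cpu_m, mem)]
    if not eligible:
        return None
    return max(eligible, key=lambda n: score(n, cpu_m, mem))


# Hypothetical three-node cluster; node-a mirrors the node shown above.
nodes = [
    Node("node-a", 3600, 14932524020, 1333, 1033593344),
    Node("node-b", 3600, 14932524020, 3550, 2 * 2**30),  # almost out of CPU
    Node("node-c", 3600, 14932524020, 2000, 8 * 2**30),
]

chosen = schedule(nodes, cpu_m=100, mem=170 * 2**20)
print(chosen.name if chosen else "unschedulable")
```

Here node-b is filtered out because its remaining CPU cannot cover a 100m request, and node-a outscores node-c because it has more unrequested CPU and memory.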
This post is part of a collaboration between O’Reilly and IBM. See our statement of editorial independence.