Chapter 4. Use Cases, Benefits, and Limitations

Kubernetes brings agility and velocity to an organization’s development environment, but it also adds complexity. Cluster API not only helps tame the complexity of Kubernetes; it also helps drive use cases on top of the Kubernetes platform and provides additional benefits.

Managing the Cluster Lifecycle

In the journey to adopting Kubernetes, an organization often starts by thinking about the initial design and development of a single cluster. Although it’s true that Cluster API makes it easy to stand up one cluster, its real goal is lifecycle management of multiple clusters, from Day 0 (design and creation) through Day 2 (ongoing operation until end of life). This means that Cluster API simplifies operations no matter where the organization is in its journey.

Day 2 operations include scaling clusters up and down in response to demand (including potential expansion into new environments), upgrading Kubernetes, and more. Because Cluster API brings consistent, declarative control to Kubernetes clusters on different types of infrastructure, you can easily give both IT and development teams the ability to provision clusters themselves.

Cluster API makes it simple to deploy and manage multiple clusters, automating cluster lifecycle management in a repeatable manner and providing centralized visibility through the management cluster. Much of Cluster API’s Day 2 capability comes from its provider plug-in architecture, which lets infrastructure providers supply tooling that implements best practices for the kinds of clusters they expect to host.
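To make the declarative model concrete, here is a minimal sketch of the Cluster resource at the heart of Cluster API. The names are illustrative, the sketch assumes the AWS provider, and exact fields vary by provider and Cluster API version:

    apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    metadata:
      name: demo-cluster              # illustrative name
      namespace: default
    spec:
      clusterNetwork:
        pods:
          cidrBlocks: ["192.168.0.0/16"]
      controlPlaneRef:                # delegates control plane lifecycle
        apiVersion: controlplane.cluster.x-k8s.io/v1beta1
        kind: KubeadmControlPlane
        name: demo-control-plane
      infrastructureRef:              # provider-specific infrastructure
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSCluster              # assumes the AWS provider is installed
        name: demo-cluster

Applying this manifest to the management cluster, along with the referenced control plane and infrastructure objects, is all it takes to request a new workload cluster; deleting it tears the cluster down again.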

Managing Clusters with GitOps

GitOps is a DevOps methodology for automating IT infrastructure by treating it as code. In GitOps, the desired state of the cluster is stored in a Git repository. This provides a single, auditable source of truth with versioning and rollback, and it makes it easier to reproduce cluster infrastructure after a cluster has been decommissioned or in the event of a disaster.

When you use Cluster API and GitOps together, you can perform operations on clusters with Git pull requests. This makes it safer to give engineering teams the power to provision and maintain their own clusters because Git makes cluster operations auditable: you know everything that’s happened, who did it, and when.
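As a sketch of what this looks like in practice, assuming Flux as the GitOps tool (Argo CD works similarly), a Kustomization object on the management cluster can continuously apply whatever Cluster API manifests are merged into a Git repository. The repository and path names here are hypothetical:

    apiVersion: kustomize.toolkit.fluxcd.io/v1
    kind: Kustomization
    metadata:
      name: workload-clusters         # hypothetical name
      namespace: flux-system
    spec:
      interval: 10m                   # reconcile against Git every 10 minutes
      path: ./clusters                # directory of Cluster API manifests
      prune: true                     # remove resources deleted from Git
      sourceRef:
        kind: GitRepository
        name: infrastructure-repo     # hypothetical GitRepository object

With this in place, merging a pull request that edits a manifest under ./clusters is what actually changes the clusters, and the Git history is the audit log.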

Upgrading Clusters

The traditional upgrade technique, called an inline upgrade, involves upgrading Kubernetes components in place. This approach has a number of risks, many of which boil down to the upgrade failing on a node, necessitating the unexpected movement of pods and apps to other nodes. If other nodes are in similar condition, a cascade of upgrade failures can bring a cluster to its knees. If the upgrade failures are related to a long series of patches and manual configurations that have been applied over time, it can be very difficult to debug the problems and get back to a good state.

Cluster API makes upgrades safer and reduces their impact on cluster capacity. Cluster API performs a rolling upgrade, which consists of provisioning new, upgraded nodes one by one and moving pods to them from older nodes. This approach keeps as much cluster capacity as possible available during an upgrade. In the worst case, where a new node isn’t successfully provisioned and added, there’s no impact on running workloads, because pods are moved only after the new node joins the cluster. Figure 4-1 shows the process of performing a rolling upgrade.

Figure 4-1. Rolling upgrade
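Declaratively, a rolling upgrade is typically triggered by bumping the version field on the control plane (and, similarly, on the machine deployments for workers). A minimal sketch, with illustrative names and most required fields omitted for brevity:

    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    metadata:
      name: demo-control-plane
    spec:
      replicas: 3
      version: v1.28.4                # bumping this starts a rolling upgrade
      machineTemplate:
        infrastructureRef:
          apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
          kind: AWSMachineTemplate    # assumes the AWS provider
          name: demo-cp-v1-28         # new template with the upgraded image
      kubeadmConfigSpec: {}           # bootstrap settings omitted for brevity

When the node image itself changes, you point machineTemplate.infrastructureRef at a new machine template rather than editing the old one, since templates are treated as immutable.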

By making it easier and safer to upgrade a Kubernetes cluster, Cluster API encourages more frequent updates. This keeps Kubernetes current and more secure, and it reduces the risk of configuration drift: the situation where the real state of the cluster diverges from the desired state codified in the manifest because of incremental changes made directly on the cluster.

Scaling

Cluster API makes it easy to scale clusters up and down as workloads change. For worker nodes, the main task is to ensure that just enough hardware is provisioned for current demand, and no more. For control plane nodes, the concern is mainly redundancy: if a control plane node fails, another can take its place seamlessly.

The KubeadmControlPlane (KCP) lets you declaratively scale your Kubernetes control plane for availability, spreading control plane nodes across failure domains (such as availability zones) to minimize the likelihood that a single failure takes out more than one of them.
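Concretely, scaling the control plane is a one-field change; a sketch with illustrative names and other required fields omitted:

    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    metadata:
      name: demo-control-plane
    spec:
      replicas: 3     # changing 1 -> 3 adds two control plane machines,
                      # spread across the failure domains the provider reports

Odd replica counts (3, 5) are the norm so that etcd keeps a quorum when a node is lost.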

For worker nodes, it’s as simple as specifying the new desired number of worker nodes; Cluster API takes care of provisioning new machines and adding them to the cluster. You can also use the Cluster Autoscaler to adjust the number of worker nodes automatically to match what your workloads need. The Autoscaler uses metrics like application load or average CPU usage per node to scale the cluster up and down, relying on Cluster API providers to manage the underlying infrastructure. Cluster API can also run workers with a different hardware configuration (instance type) in the same cluster by defining an additional machine deployment with different machine resources.
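A worker pool is represented by a MachineDeployment: edit replicas to scale manually, or, if you run the Cluster Autoscaler with its Cluster API provider, annotate the pool with size bounds and let the autoscaler manage replicas. A sketch under those assumptions, with illustrative names:

    apiVersion: cluster.x-k8s.io/v1beta1
    kind: MachineDeployment
    metadata:
      name: demo-workers
      annotations:
        # bounds read by the Cluster Autoscaler's clusterapi provider
        cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "1"
        cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "10"
    spec:
      clusterName: demo-cluster
      replicas: 3                     # managed by the autoscaler when enabled
      selector:
        matchLabels: {}               # Cluster API manages the selector labels
      template:
        spec:
          clusterName: demo-cluster
          version: v1.28.4
          bootstrap:
            configRef:
              apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
              kind: KubeadmConfigTemplate
              name: demo-workers
          infrastructureRef:
            apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
            kind: AWSMachineTemplate
            name: demo-workers        # a second pool referencing a different
                                      # template gives you a different
                                      # instance type in the same cluster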

Self-Healing

Cluster API gives Kubernetes clusters the ability to self-heal by provisioning new infrastructure. When a node fails, Kubernetes can spin up new instances of the pods on a new node, but Kubernetes has no native ability to provision new infrastructure. If enough machines fail, Kubernetes can eventually run out of resources. Because Cluster API manages the infrastructure and Kubernetes together, it can automatically provision more infrastructure in a cloud or data center environment when nodes fail.

Cluster API uses the MachineHealthCheck controller to monitor the condition of the control plane and worker nodes, making sure they are healthy. This includes ensuring they are reachable over the network and aren’t running out of disk space, pod capacity, or other resources. When a node fails, runs out of resources, or becomes unreachable, Cluster API provisions a new node and adds it to the cluster. Once a new node is provisioned, Kubernetes will attempt to reschedule pending pods from the failed node.
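A sketch of a MachineHealthCheck that remediates workers whose Ready condition has been False or Unknown for five minutes (labels and names are illustrative):

    apiVersion: cluster.x-k8s.io/v1beta1
    kind: MachineHealthCheck
    metadata:
      name: demo-workers-unhealthy-5m
    spec:
      clusterName: demo-cluster
      maxUnhealthy: 40%               # pause remediation if too many fail at once
      nodeStartupTimeout: 10m         # how long a new node may take to join
      selector:
        matchLabels:
          nodepool: demo-workers      # illustrative label on the Machines
      unhealthyConditions:
        - type: Ready
          status: Unknown
          timeout: 300s
        - type: Ready
          status: "False"
          timeout: 300s

Machines that match an unhealthy condition are deleted and replaced by their owning MachineDeployment, which is what gives the cluster its self-healing behavior.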

Managing Multiple Clusters

As the organization begins to expand its Kubernetes deployment to include multiple clusters, often in different cloud environments, Cluster API provides a consistent interface for operations across different providers with different infrastructure and APIs. This is especially important for companies with a presence in diverse environments that require multiple providers.

By abstracting the different deployment mechanisms and APIs offered by varying infrastructure providers and vendors, Cluster API makes it possible to standardize tooling across entire deployments regardless of where they are in the world, whether in a public cloud, in a virtualized or bare metal data center, or at the edge. This gives cluster administrators more control over the configuration and installed software, a standardized approach to cluster lifecycle management, and the ability to reuse existing components across multiple clusters.
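This abstraction is visible in the manifests themselves: moving the same cluster definition to different infrastructure largely comes down to swapping the provider-specific reference, while everything built around it stays the same. A sketch assuming the AWS and vSphere providers (kinds and API versions vary by provider release):

    # In a public cloud, the Cluster's infrastructureRef might be:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
      kind: AWSCluster
      name: demo-cluster

    # ...while in a virtualized data center it might be:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: VSphereCluster
      name: demo-cluster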

Limitations and Challenges

Cluster API provides a lot of tools that make managing the lifecycles of multiple clusters easier, but it is not without its limitations. At the time of this writing, Cluster API is still a beta project, which might make some companies hesitate to adopt it in production environments. Upgrades of Cluster API’s own control plane components can occasionally introduce bugs or other problems, which in severe cases can disrupt workload clusters.

Because the management cluster holds the credentials for the target environments where the workload clusters are deployed, it is a potential security target. Obtaining access to the management cluster could give an attacker access to all the workload clusters in turn, which is especially troublesome if the workload clusters are owned by separate tenants.

While Cluster API provides declarative management for Kubernetes clusters, some foundational capabilities are currently out of scope, such as identity and authentication, backup and restore, logging and monitoring, and lifecycle management of add-on packages and the integrations an application requires (including the application itself).

Cluster API is limited in the number of clusters and nodes it can manage. As a rule of thumb, a single Cluster API management cluster can handle approximately one hundred clusters, depending on the number of nodes in each.

The management cluster is potentially a single point of failure. If the management cluster fails, the workload clusters continue to operate, but you can no longer manage them through Cluster API. If you have a backup of the Cluster API custom resources and related objects (in a Git repo, for example), you can create a new management cluster and regain central control of the workload clusters.

A single management cluster might not represent sufficient separation for different tenants of workload clusters. You can overcome this problem by bringing up different management clusters for different groups of tenants, but this negates the benefit of being able to manage multiple clusters through one interface.
