Kubernetes Performance Measurements and Roadmap

Thursday, September 10, 2015

No matter how flexible and reliable your container orchestration system is, ultimately, you have some work to be done, and you want it completed quickly. For big problems, a common answer is to just throw more machines at the problem. After all, more compute = faster, right?

Interestingly, adding more nodes is a little like the tyranny of the rocket equation: in some systems, adding more machines can actually make your processing slower. However, unlike the rocket equation, we can do better. Kubernetes v1.0 supports clusters with up to 100 nodes, and our goal is to support 10x that number by the end of 2015. This blog post covers where we are and how we intend to achieve the next level of performance.

What do we measure?

The first question we need to answer is: “what does it mean that Kubernetes can manage an N-node cluster?” Users expect that it will handle all operations “reasonably quickly,” but we need a precise definition of that. We decided to define performance and scalability goals based on the following two metrics:

1. “API-responsiveness”: 99% of all our API calls return in less than 1 second
2. “Pod startup time”: 99% of pods (with pre-pulled images) start within 5 seconds

Note that for “pod startup time” we explicitly assume that all images necessary to run a pod are already pre-pulled on the machine where it will be running. In our experiments, there is a high degree of variability between images (network throughput, image size, etc.), and these variations have little to do with Kubernetes’ overall performance.
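Both goals are 99th-percentile thresholds over a set of latency samples. As a minimal sketch of how such a check could be expressed (the function names and the nearest-rank percentile method are assumptions for illustration, not the code used by the Kubernetes tests):

```python
import math

# Thresholds taken from the two goals stated above.
API_SLO_SECONDS = 1.0
POD_STARTUP_SLO_SECONDS = 5.0


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_slo(samples, threshold_seconds, pct=99):
    """True if the pct-th percentile latency is below the threshold."""
    return percentile(samples, pct) < threshold_seconds
```

With this definition, a single slow call out of 100 does not break the API goal, but two slow calls out of 100 do, because the 99th-ranked sample is then one of the slow ones.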

The decision to choose those metrics was made based on our experience spinning up 2 billion containers a week at Google. We explicitly want to measure the latency of user-facing flows since that’s what customers will actually care about.

How do we measure?

To monitor performance improvements and detect regressions we set up a continuous testing infrastructure. Every 2-3 hours we create a 100-node cluster from HEAD and run our scalability tests on it. We use a GCE n1-standard-4 (4 cores, 15GB of RAM) machine as a master and n1-standard-1 (1 core, 3.75GB of RAM) machines for nodes.

In scalability tests, we explicitly focus only on the full-cluster case (a full N-node cluster is one with 30 * N pods running in it), which is the most demanding scenario from a performance point of view. To reproduce what a customer might actually do, we run through the following steps:

1. Populate pods and replication controllers to fill the cluster
2. Generate some load (create/delete additional pods and/or replication controllers, scale the existing ones, etc.) and record performance metrics
3. Stop all running pods and replication controllers
4. Scrape the metrics and check whether they match our expectations

It is worth emphasizing that the main parts of the test are done on full clusters (30 pods per node, 100 nodes). Starting a pod in an empty cluster, even one with 100 nodes, will be much faster.
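The four test phases can be sketched as a small harness. This is only an illustration of the flow: `FakeCluster` is a hypothetical in-memory stand-in (not a real Kubernetes client), the churn of 100 extra pods is an arbitrary choice, and the timings it records are those of local set operations rather than real API calls.

```python
import math
import time

PODS_PER_NODE = 30  # density of a "full" cluster, as defined above


class FakeCluster:
    """Hypothetical in-memory cluster that records per-call latency."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.pods = set()
        self.latencies = []  # seconds, one entry per simulated API call

    def _timed(self, operation):
        start = time.perf_counter()
        operation()
        self.latencies.append(time.perf_counter() - start)

    def create_pod(self, name):
        self._timed(lambda: self.pods.add(name))

    def delete_pod(self, name):
        self._timed(lambda: self.pods.discard(name))


def run_scalability_test(nodes, api_slo_seconds=1.0, churn=100):
    cluster = FakeCluster(nodes)

    # 1. Populate pods to fill the cluster.
    for i in range(PODS_PER_NODE * nodes):
        cluster.create_pod(f"pod-{i}")
    pods_when_full = len(cluster.pods)

    # 2. Generate some load on the full cluster.
    for i in range(churn):
        cluster.create_pod(f"extra-{i}")
        cluster.delete_pod(f"extra-{i}")

    # 3. Stop all running pods.
    for name in list(cluster.pods):
        cluster.delete_pod(name)

    # 4. Scrape the metrics and compare them with the expectation.
    ordered = sorted(cluster.latencies)
    p99 = ordered[math.ceil(99 * len(ordered) / 100) - 1]
    return {
        "pods_when_full": pods_when_full,
        "pods_after_teardown": len(cluster.pods),
        "api_calls": len(ordered),
        "p99_ok": p99 < api_slo_seconds,
    }
```

Calling `run_scalability_test(100)` fills the fake cluster with 3000 pods before the load phase begins, which mirrors the point that measurements are taken on a full cluster rather than an empty one.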

To measure pod startup latency we are using very simple pods with just a single container running the “gcr.io/google_containers/pause:go” image, which starts and then sleeps forever. The container is guaranteed to be already pre-pulled on nodes (we use it as the so-called pod-infra-container).

Performance data

The following table contains percentiles (50th, 90th and 99th) of pod startup time in 100-node clusters which are 10%, 25%, 50% and 100% full.

Percentile	10%-full	25%-full	50%-full	100%-full
50th percentile	0.90s	1.08s	1.33s	1.94s
90th percentile	1.29s	1.49s	1.72s	2.50s
99th percentile	1.59s	1.86s	2.56s	4.32s
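To make the comparison with the 5-second pod startup goal concrete, the table can be written down as data and the remaining headroom computed per fill level. This is a small sketch; the dictionary layout and the `headroom` helper are assumptions for illustration, while the numbers are copied from the table.

```python
POD_STARTUP_SLO_SECONDS = 5.0

# Pod startup time in seconds, keyed by fill level and then percentile.
STARTUP_SECONDS = {
    "10%-full": {50: 0.90, 90: 1.29, 99: 1.59},
    "25%-full": {50: 1.08, 90: 1.49, 99: 1.86},
    "50%-full": {50: 1.33, 90: 1.72, 99: 2.56},
    "100%-full": {50: 1.94, 90: 2.50, 99: 4.32},
}


def headroom(fill_level):
    """Fraction of the 5-second budget left unused at the 99th percentile."""
    p99 = STARTUP_SECONDS[fill_level][99]
    return (POD_STARTUP_SLO_SECONDS - p99) / POD_STARTUP_SLO_SECONDS


if __name__ == "__main__":
    for fill_level, row in STARTUP_SECONDS.items():
        print(f"{fill_level}: p99={row[99]:.2f}s headroom={headroom(fill_level):.0%}")
```

For the fully packed cluster this gives (5 - 4.32) / 5 = 0.136, or roughly 14% of the budget to spare.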

As for API-responsiveness, the following graphs present 50th, 90th and 99th percentiles of latencies of API calls grouped by kind of operation and resource type. Note, however, that this also includes internal system API calls, not just those issued by users (in this case, by the test itself).

Some resources only appear on certain graphs, based on what was running during that operation (e.g. no namespace was PUT at that time).

As you can see in the results, we are ahead of target for our 100-node cluster: even in a fully packed cluster, the 99th percentile of pod startup time is 14% below the 5-second goal. It’s interesting to point out that LISTing pods is significantly slower than any other operation. This makes sense: in a full cluster there are 3000 pods and each pod is roughly a few kilobytes of data, meaning megabytes of data that need to be processed for each LIST.
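The size of a pod LIST response can be estimated with simple arithmetic. The sketch below assumes 2 KiB per pod purely as an example of "a few kilobytes"; the real size depends on the pod spec and was not measured here.

```python
PODS_PER_NODE = 30  # density of a full cluster in these tests


def pods_in_full_cluster(nodes):
    """Number of pods in a full N-node cluster (30 * N)."""
    return PODS_PER_NODE * nodes


def list_payload_mib(nodes, kib_per_pod):
    """Approximate size in MiB of one LIST of all pods in a full cluster.

    kib_per_pod is an assumed average object size, not a measured value.
    """
    return pods_in_full_cluster(nodes) * kib_per_pod / 1024


if __name__ == "__main__":
    for nodes in (100, 1000):
        size = list_payload_mib(nodes, kib_per_pod=2)
        print(f"{nodes} nodes: {pods_in_full_cluster(nodes)} pods, ~{size:.1f} MiB per LIST")
```

Under that assumption a single LIST in a full 100-node cluster moves close to 6 MiB, and the same request in a full 1000-node cluster would be ten times larger, which is why LIST-heavy code paths matter for the scaling goal.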

Work done and some future plans

The initial performance work to make 100-node clusters stable enough to run any tests on them involved a lot of small fixes and tuning, including increasing the limit for file descriptors in the apiserver and reusing TCP connections between different requests to etcd.

However, building a stable performance test was just step one toward increasing the number of nodes our cluster supports tenfold. As a result of this work, we have already put significant effort into removing future bottlenecks, including:

Rewriting controllers to be watch-based: Previously they were relisting objects of a given type every few seconds, which generated a huge load on the apiserver.
Using code generators to produce conversions and deep-copy functions: Although the default implementations using Go reflection are very convenient, they proved to be extremely slow, as much as 10x slower than the generated code.
Adding a cache to apiserver to avoid deserialization of the same data read from etcd multiple times
Reducing frequency of updating statuses: Given the slow-changing nature of statuses, it only makes sense to update pod status on change and node status every 10 seconds.
Implementing watch at the apiserver instead of redirecting the requests to etcd: We would prefer to avoid watching for the same data from etcd multiple times, since, in many cases, it was filtered out in apiserver anyway.
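To illustrate why moving controllers from periodic relisting to watches reduces load, here is a toy comparison. The `Store` class and `simulate` function are hypothetical and only count how many objects are sent to a client; they are not the Kubernetes apiserver or its client library.

```python
class Store:
    """Toy object store that counts how many objects it sends to clients."""

    def __init__(self):
        self.objects = {}
        self.watchers = []
        self.objects_sent = 0

    def list(self):
        # A LIST returns every object, whether or not it has changed.
        self.objects_sent += len(self.objects)
        return dict(self.objects)

    def watch(self, callback):
        self.watchers.append(callback)

    def put(self, key, value):
        self.objects[key] = value
        # A watch delivers only the object that changed.
        for callback in self.watchers:
            self.objects_sent += 1
            callback(key, value)


def simulate(pods, ticks, changes_per_tick, use_watch):
    """Return (objects sent to the controller, whether its cache is current)."""
    store = Store()
    for i in range(pods):
        store.put(f"pod-{i}", "Running")

    cache = {}
    if use_watch:
        cache.update(store.list())  # one initial LIST, then incremental events
        store.watch(lambda key, value: cache.__setitem__(key, value))

    for tick in range(ticks):
        for i in range(changes_per_tick):
            store.put(f"pod-{i}", f"status-{tick}")
        if not use_watch:
            cache = store.list()  # relist everything on every tick

    return store.objects_sent, cache == store.objects
```

With 3000 pods, 10 ticks and 5 changes per tick, the relisting controller receives 30000 objects while the watch-based one receives 3050, and both end up with the same view of the store.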

Looking further out to our 1000-node cluster goal, proposed improvements include:

Moving events out of etcd: They are more like system logs and are neither part of system state nor crucial for Kubernetes to work correctly.
Using better JSON parsers: The default parser implemented in Go is very slow as it is based on reflection.
Rewriting the scheduler to make it more efficient and concurrent
Improving efficiency of communication between apiserver and Kubelets: In particular, we plan to reduce the size of data being sent on every update of node status.

This is by no means an exhaustive list. We will be adding new elements (or removing existing ones) based on the observed bottlenecks while running the existing scalability tests and newly-created ones. If there are particular use cases or scenarios that you’d like to see us address, please join in!

We have weekly meetings for our Kubernetes Scale Special Interest Group on Thursdays at 11am PST where we discuss ongoing issues and plans for performance tracking and improvements.
If you have specific performance or scalability questions before then, please join our scalability special interest group on Slack: https://kubernetes.slack.com/messages/sig-scale
General questions? Feel free to join our Kubernetes community on Slack: https://kubernetes.slack.com/messages/kubernetes-users/
Submit a pull request or file an issue! You can do this in our GitHub repository. Everyone is also enthusiastically encouraged to contribute their own experiments (and their results) or PRs improving Kubernetes.

- Wojciech Tyczynski, Google Software Engineer
