Using sysctls in a Kubernetes Cluster

FEATURE STATE: Kubernetes v1.12 beta

This feature is currently in a beta state, meaning:

The version names contain beta (e.g. v2beta3).
Code is well tested. Enabling the feature is considered safe. Enabled by default.
Support for the overall feature will not be dropped, though details may change.
The schema and/or semantics of objects may change in incompatible ways in a subsequent beta or stable release. When this happens, we will provide instructions for migrating to the next version. This may require deleting, editing, and re-creating API objects. The editing process may require some thought. This may require downtime for applications that rely on the feature.
Recommended for only non-business-critical uses because of potential for incompatible changes in subsequent releases. If you have multiple clusters that can be upgraded independently, you may be able to relax this restriction.
Please do try our beta features and give feedback on them! After they exit beta, it may not be practical for us to make more changes.

This document describes how to configure and use kernel parameters within a Kubernetes cluster using the sysctl interface.

Before you begin
Listing all Sysctl Parameters
Enabling Unsafe Sysctls
Setting Sysctls for a Pod
PodSecurityPolicy

Before you begin

You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. If you do not already have a cluster, you can create one by using Minikube, or you can use one of these Kubernetes playgrounds:

To check the version, enter kubectl version.

Listing all Sysctl Parameters

In Linux, the sysctl interface allows an administrator to modify kernel parameters at runtime. Parameters are available via the /proc/sys/ virtual process file system. The parameters cover various subsystems such as:

kernel (common prefix: kernel.)
networking (common prefix: net.)
virtual memory (common prefix: vm.)
MDADM (common prefix: dev.)
More subsystems are described in Kernel docs.

To get a list of all parameters, you can run

sudo sysctl -a

Enabling Unsafe Sysctls

Sysctls are grouped into safe and unsafe sysctls. In addition to proper namespacing a safe sysctl must be properly isolated between pods on the same node. This means that setting a safe sysctl for one pod

must not have any influence on any other pod on the node
must not allow to harm the node’s health
must not allow to gain CPU or memory resources outside of the resource limits of a pod.

By far, most of the namespaced sysctls are not necessarily considered safe. The following sysctls are supported in the safe set:

kernel.shm_rmid_forced,
net.ipv4.ip_local_port_range,
net.ipv4.tcp_syncookies.

Note: The example net.ipv4.tcp_syncookies is not namespaced on Linux kernel version 4.4 or lower.

This list will be extended in future Kubernetes versions when the kubelet supports better isolation mechanisms.

All safe sysctls are enabled by default.

All unsafe sysctls are disabled by default and must be allowed manually by the cluster admin on a per-node basis. Pods with disabled unsafe sysctls will be scheduled, but will fail to launch.

With the warning above in mind, the cluster admin can allow certain unsafe sysctls for very special situations like e.g. high-performance or real-time application tuning. Unsafe sysctls are enabled on a node-by-node basis with a flag of the kubelet, e.g.:

kubelet --allowed-unsafe-sysctls \
  'kernel.msg*,net.ipv4.route.min_pmtu' ...

For minikube, this can be done via the extra-config flag:

minikube start --extra-config="kubelet.allowed-unsafe-sysctls=kernel.msg*,net.ipv4.route.min_pmtu"...

Only namespaced sysctls can be enabled this way.

Setting Sysctls for a Pod

A number of sysctls are namespaced in today’s Linux kernels. This means that they can be set independently for each pod on a node. Only namespaced sysctls are configurable via the pod securityContext within Kubernetes.

The following sysctls are known to be namespaced. This list could change in future versions of the Linux kernel.

kernel.shm*,
kernel.msg*,
kernel.sem,
fs.mqueue.*,
net.*.

Sysctls with no namespace are called node-level sysctls. If you need to set them, you must manually configure them on each node’s operating system, or by using a DaemonSet with privileged containers.

Use the pod securityContext to configure namespaced sysctls. The securityContext applies to all containers in the same pod.

This example uses the pod securityContext to set a safe sysctl kernel.shm_rmid_forced and two unsafe sysctls net.ipv4.route.min_pmtu and kernel.msgmax There is no distinction between safe and unsafe sysctls in the specification.

Warning: Only modify sysctl parameters after you understand their effects, to avoid destabilizing your operating system.

apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
spec:
  securityContext:
    sysctls:
    - name: kernel.shm_rmid_forced
      value: "0"
    - name: net.ipv4.route.min_pmtu
      value: "552"
    - name: kernel.msgmax
      value: "65536"
  ...

Warning: Due to their nature of being unsafe, the use of unsafe sysctls is at-your-own-risk and can lead to severe problems like wrong behavior of containers, resource shortage or complete breakage of a node.

It is good practice to consider nodes with special sysctl settings as tainted within a cluster, and only schedule pods onto them which need those sysctl settings. It is suggested to use the Kubernetes taints and toleration feature to implement this.

A pod with the unsafe sysctls will fail to launch on any node which has not enabled those two unsafe sysctls explicitly. As with node-level sysctls it is recommended to use taints and toleration feature or taints on nodes to schedule those pods onto the right nodes.

PodSecurityPolicy

You can further control which sysctls can be set in pods by specifying lists of sysctls or sysctl patterns in the forbiddenSysctls and/or allowedUnsafeSysctls fields of the PodSecurityPolicy. A sysctl pattern ends with a * character, such as kernel.*. A * character on its own matches all sysctls.

By default, all safe sysctls are allowed.

Both forbiddenSysctls and allowedUnsafeSysctls are lists of plain sysctl names or sysctl patterns (which end with *). The string * matches all sysctls.

The forbiddenSysctls field excludes specific sysctls. You can forbid a combination of safe and unsafe sysctls in the list. To forbid setting any sysctls, use * on its own.

If you specify any unsafe sysctl in the allowedUnsafeSysctls field and it is not present in the forbiddenSysctls field, that sysctl can be used in Pods using this PodSecurityPolicy. To allow all unsafe sysctls in the PodSecurityPolicy to be set, use * on its own.

Do not configure these two fields such that there is overlap, meaning that a given sysctl is both allowed and forbidden.

Warning: If you whitelist unsafe sysctls via the allowedUnsafeSysctls field in a PodSecurityPolicy, any pod using such a sysctl will fail to start if the sysctl is not whitelisted via the --allowed-unsafe-sysctls kubelet flag as well on that node.

This example allows unsafe sysctls prefixed with kernel.msg to be set and disallows setting of the kernel.shm_rmid_forced sysctl.

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: sysctl-psp
spec:
  allowedUnsafeSysctls:
  - kernel.msg*
  forbiddenSysctls:
  - kernel.shm_rmid_forced
 ...

Feedback

Was this page helpful?

Thanks for the feedback. If you have a specific, answerable question about how to use Kubernetes, ask it on Stack Overflow. Open an issue in the GitHub repo if you want to report a problem or suggest an improvement.

Tasks