NodeFilesystemSpaceFillingUp

Meaning

This alert is based on an extrapolation of the space used in a filesystem. It fires if both the current usage is above a certain threshold and the extrapolation predicts that the filesystem will run out of space within a certain time. It is a warning-level alert if that time is less than 24h, and a critical alert if that time is less than 4h.
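To make the extrapolation concrete, here is a minimal sketch that draws a straight line through just two free-space samples and reports when it reaches zero. The function name `seconds_until_full` and the sample values are hypothetical, and the alert itself fits its prediction over a window of samples rather than two points, so treat this as an illustration of the idea only.

```shell
# Estimate the seconds left until a filesystem is full, using a straight
# line through two free-space samples (illustration only).
# Arguments: <free_bytes_earlier> <free_bytes_now> <seconds_between_samples>
seconds_until_full() {
  awk -v prev="$1" -v curr="$2" -v dt="$3" 'BEGIN {
    used = prev - curr                     # bytes consumed between the samples
    if (used <= 0) { print "inf"; exit }   # free space is stable or growing
    printf "%d\n", curr * dt / used        # time for the remaining bytes to run out
  }'
}

# Hypothetical samples: 50 GiB free one hour ago, 48 GiB free now.
seconds_until_full 53687091200 51539607552 3600   # prints 86400 (24 hours)
```

At a steady 2 GiB per hour, the remaining 48 GiB would last 24 hours, which is right at the boundary of the warning-level window described above.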

Full context

The filesystem on Kubernetes nodes mainly consists of the operating system, container ephemeral storage, container images, and container logs. Because the kubelet automatically cleans up old logs and deletes unused images, container ephemeral storage is a common cause of this alert. However, the alert may also fire before the kubelet’s garbage collection kicks in.

Impact

A filesystem that runs full is a serious problem for any process that needs to write to it. Even before the filesystem is completely full, performance usually degrades.

Diagnosis

Study the recent trend of filesystem usage on a dashboard. A periodic pattern of writing and cleaning up can sometimes trick the linear prediction into a false alert. Use the usual OS tools to find which directories are the largest or fastest-growing offenders. Determine whether this is an irregular condition, such as a process that fails to clean up after itself, or organic growth. If monitoring is enabled, the following metric can be queried in PromQL:

node_filesystem_free_bytes
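The bare metric returns one series per filesystem on every node, so it usually helps to narrow it to the filesystem named in the alert. The sketch below builds such an expression and sends it to the Prometheus HTTP API. The helper names, the `PROM_URL` variable, its default address, and the example label values are all assumptions; `node_filesystem_size_bytes` is the companion node exporter metric used to turn free bytes into a percentage.

```shell
# Build a PromQL expression for the percentage of free space on one filesystem.
# Arguments: <instance label from the alert> <mountpoint label from the alert>
free_percent_query() {
  printf '100 * node_filesystem_free_bytes{instance="%s",mountpoint="%s"} / node_filesystem_size_bytes{instance="%s",mountpoint="%s"}\n' \
    "$1" "$2" "$1" "$2"
}

# Send an expression to the Prometheus instant-query endpoint.
# PROM_URL is an assumption; point it at your own Prometheus server.
query_prometheus() {
  curl -sS --get "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Example with hypothetical label values (needs a reachable Prometheus):
#   query_prometheus "$(free_percent_query node-1:9100 /var)"
```

The same expression can be pasted directly into the Prometheus expression browser or a dashboard panel instead of going through `curl`.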

Check the alert’s mountpoint label.
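Once the mountpoint is known, the usual OS tools can narrow down where the space went. A minimal sketch, assuming `du`, `sort` and `head` are available on the node; the helper name `largest_dirs` is hypothetical.

```shell
# List the largest immediate subdirectories of a path, biggest first.
# Arguments: <path> [count]   (count defaults to 10)
# -x keeps du on a single filesystem, -k reports sizes in KiB,
# -d 1 limits the listing to one directory level.
# The first line of output is the total for <path> itself.
largest_dirs() {
  du -xk -d 1 "$1" 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Example, run on the node with the mountpoint from the alert:
#   df -h /var
#   largest_dirs /var 5
```

Running the helper again on the biggest entry walks down the tree one level at a time. Repeating it a few minutes apart gives a rough sense of which directory is still growing.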

Mitigation

If the mountpoint label is /, /sysroot or /var, removing unused images resolves the issue:

Debug the node by accessing the node filesystem:

$ NODE_NAME=<instance label from alert>
$ kubectl -n default debug node/$NODE_NAME
$ chroot /host

Remove dangling images:

# TODO: Command needed

Remove unused images:

# TODO: Command needed

Exit debug:

$ exit
$ exit