AWS VPC CNI plugin random livenessProbe failures after upgrading to Kubernetes 1.20

Problem

After upgrading EKS from 1.19 to 1.20, we found that some pods were failing their readiness and liveness probes without any reason message. Take the AWS VPC CNI plugin aws-node pod as an example:

# Pod event
...
Normal   Killing    85m                    kubelet  Container aws-node failed liveness probe, will be restarted
Normal   Pulling    85m (x2 over 106m)     kubelet  Pulling image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni:v1.8.0-eksbuild.1"
Normal   Created    85m (x2 over 106m)     kubelet  Created container aws-node
Normal   Pulled     85m                    kubelet  Successfully pulled image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni:v1.8.0-eksbuild.1" in 158.21763ms
Normal   Started    85m (x2 over 106m)     kubelet  Started container aws-node
Warning  Unhealthy  11m (x31 over 106m)    kubelet  Readiness probe failed:
Warning  Unhealthy  4m57s (x28 over 100m)  kubelet  Liveness probe failed:

What changed in Kubernetes 1.20

According to the release announcement Kubernetes 1.20: The Raddest Release[1]:

A longstanding bug regarding exec probe timeouts that may impact existing pod definitions has been fixed. Prior to this fix, the field timeoutSeconds was not respected for exec probes. Instead, probes would run indefinitely, even past their configured deadline, until a result was returned. With this change, the default value of 1 second will be applied if a value is not specified and existing pod definitions may no longer be sufficient if a probe takes longer than one second.

This behavior change comes from a bug fix in Kubernetes 1.20, described in Fixing Kubelet Exec Probe Timeouts[2] and KEP-1972: kubelet exec probe timeouts[3]. The default timeoutSeconds of 1 second is now enforced for exec probes, but it is occasionally too short for the aws-node probes, so they fail intermittently and the container is restarted.
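
For reference, the aws-node container is health-checked with an exec probe along the following lines (the command and port are an assumption based on the public v1.8.0 manifest, shown here only as an illustrative sketch). Because timeoutSeconds is not set, Kubernetes 1.20 applies the 1-second default:

# aws-node livenessProbe, illustrative excerpt (command/port are assumptions)
livenessProbe:
  exec:
    command:
      - /app/grpc-health-probe
      - -addr=:50051
  initialDelaySeconds: 60
  # timeoutSeconds is omitted, so the kubelet now enforces the 1s default
  # and reports the probe as failed whenever the command runs longer.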

Workaround

Here are two methods to mitigate the issue:

  1. Disable the ExecProbeTimeout feature gate on the kubelet: As a cluster administrator, you can set the feature gate ExecProbeTimeout to false on each kubelet to restore the pre-1.20 behavior, then remove that override once every exec probe in the cluster has an explicit timeoutSeconds value[4]. A configuration sketch follows this list.
  2. Increase timeoutSeconds to a suitable value: If you have pods that are impacted by the default 1-second timeout, update their probe timeouts so that you are ready for the eventual removal of that feature gate. An example for aws-node is shown at the end of this section.
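
For method 1, the ExecProbeTimeout gate can be turned off in the kubelet configuration (or with --feature-gates=ExecProbeTimeout=false on the kubelet command line). A minimal KubeletConfiguration sketch, assuming your nodes load a kubelet config file:

# KubeletConfiguration sketch: restore the pre-1.20 exec probe behavior
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # While false, exec probes are not killed at timeoutSeconds.
  # Remove this override once every exec probe sets an explicit timeoutSeconds.
  ExecProbeTimeout: false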

For the VPC CNI plugin, however, issue #1425[5] is still considered a bug and needs follow-up.
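
Until the aws-node manifest ships an explicit timeout, method 2 can be applied directly to the DaemonSet, for example with kubectl -n kube-system edit daemonset aws-node. A sketch of the probe section with a longer timeout (the 10-second value is an assumption; pick whatever fits your nodes):

# aws-node probe excerpt with explicit timeouts (sketch; values are assumptions)
livenessProbe:
  exec:
    command:
      - /app/grpc-health-probe
      - -addr=:50051
  initialDelaySeconds: 60
  timeoutSeconds: 10
readinessProbe:
  exec:
    command:
      - /app/grpc-health-probe
      - -addr=:50051
  initialDelaySeconds: 1
  timeoutSeconds: 10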

Reference

  1. Kubernetes 1.20: The Raddest Release
  2. Fixing Kubelet Exec Probe Timeouts
  3. KEP-1972: kubelet exec probe timeouts
  4. Configure Liveness, Readiness and Startup Probes
  5. aws-node is restarting (Crashing, exiting on 137) sporadically which causes all pods on that node to stuck on ContainerCreating state. #1425