AWS VPC CNI plugin random livenessProbe failures after upgrading to Kubernetes 1.20
Problem
After upgrading EKS from 1.19 to 1.20, we found that some pods were failing their readiness and liveness probes without any reason message. Take the AWS VPC CNI plugin aws-node pod as an example below:
# Pod event
...
Normal Killing 85m kubelet Container aws-node failed liveness probe, will be restarted
Normal Pulling 85m (x2 over 106m) kubelet Pulling image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni:v1.8.0-eksbuild.1"
Normal Created 85m (x2 over 106m) kubelet Created container aws-node
Normal Pulled 85m kubelet Successfully pulled image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni:v1.8.0-eksbuild.1" in 158.21763ms
Normal Started 85m (x2 over 106m) kubelet Started container aws-node
Warning Unhealthy 11m (x31 over 106m) kubelet Readiness probe failed:
Warning Unhealthy 4m57s (x28 over 100m) kubelet Liveness probe failed:
What changed in Kubernetes 1.20
According to the Kubernetes 1.20: The Raddest Release[1]:
A longstanding bug regarding exec probe timeouts that may impact existing pod definitions has been fixed. Prior to this fix, the field `timeoutSeconds` was not respected for exec probes. Instead, probes would run indefinitely, even past their configured deadline, until a result was returned. With this change, the default value of 1 second will be applied if a value is not specified, and existing pod definitions may no longer be sufficient if a probe takes longer than one second.
This refers to a bug fix described in Kubernetes 1.20: Fixing Kubelet Exec Probe Timeouts[2] and KEP-1972: kubelet exec probe timeouts[3]. The default timeout of 1 second is now respected, but it is sometimes too short, causing probes to fail and pods to restart.
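For illustration, a probe shaped like the one below (the command and timings are hypothetical, not copied from the actual aws-node manifest) used to run until it returned a result; on Kubernetes 1.20 it is cut off after the implicit one-second deadline:

```yaml
# Hypothetical exec probe. On Kubernetes <= 1.19 this could run past
# its deadline; on 1.20 it fails whenever the command takes longer
# than 1 second, because the default timeoutSeconds: 1 is now enforced.
livenessProbe:
  exec:
    command:
      - /app/grpc-health-probe   # illustrative health-check binary
      - -addr=:50051
  initialDelaySeconds: 60
  periodSeconds: 10
  # timeoutSeconds is not set, so it defaults to 1 second
```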
Workaround
Here are 2 methods to mitigate the issue:
- Disable the feature gate `ExecProbeTimeout` on kubelet: As a cluster administrator, we can disable the feature gate `ExecProbeTimeout` (set it to false) on each kubelet to restore the behavior from older versions, then remove that override once all the exec probes in the cluster have a `timeoutSeconds` value set[4].
- Increase `timeoutSeconds` to a proper value: If you have pods that are impacted by the default 1-second timeout, you should update their probe timeouts so that you are ready for the eventual removal of that feature gate.
For the VPC CNI plugin, however, issue #1425[5] is still considered a bug and needs to be followed up.