After upgrading EKS from 1.19 to 1.20, we found that some pods started failing their readiness and liveness probes with no reason message. Take the aws-node pod of the AWS VPC CNI plugin as an example:
# Pod event
Normal Killing 85m kubelet Container aws-node failed liveness probe, will be restarted
Normal Pulling 85m (x2 over 106m) kubelet Pulling image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni:v1.8.0-eksbuild.1"
Normal Created 85m (x2 over 106m) kubelet Created container aws-node
Normal Pulled 85m kubelet Successfully pulled image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni:v1.8.0-eksbuild.1" in 158.21763ms
Normal Started 85m (x2 over 106m) kubelet Started container aws-node
Warning Unhealthy 11m (x31 over 106m) kubelet Readiness probe failed:
Warning Unhealthy 4m57s (x28 over 100m) kubelet Liveness probe failed:
What changes in Kubernetes 1.20
According to the Kubernetes 1.20: The Raddest Release:
A longstanding bug regarding exec probe timeouts that may impact existing pod definitions has been fixed. Prior to this fix, the field timeoutSeconds was not respected for exec probes. Instead, probes would run indefinitely, even past their configured deadline, until a result was returned. With this change, the default value of 1 second will be applied if a value is not specified and existing pod definitions may no longer be sufficient if a probe takes longer than one second.
This is the bug fix described in Kubernetes 1.20: Fixing Kubelet Exec Probe Timeouts and KEP-1972: kubelet exec probe timeouts. The default timeout of 1s is now enforced, but it is sometimes too short for the probe command to finish, so the probe fails and the pod is restarted.
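To make the behavior change concrete, here is a rough Python sketch that mimics an exec probe with and without an enforced deadline. It is not kubelet code: the function name run_exec_probe and the enforce_timeout flag are hypothetical, with enforce_timeout standing in for the ExecProbeTimeout feature gate.

```python
import subprocess
import sys


def run_exec_probe(command, timeout_seconds=1, enforce_timeout=True):
    """Return True if the probe command exits with code 0 within the deadline.

    enforce_timeout=False imitates the pre-1.20 behavior, where the
    command was allowed to run until it returned on its own.
    """
    try:
        result = subprocess.run(
            command,
            timeout=timeout_seconds if enforce_timeout else None,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # Deadline exceeded: the probe counts as failed.
        return False
    return result.returncode == 0


# A probe command that takes about 2 seconds, longer than the 1s default.
slow_probe = [sys.executable, "-c", "import time; time.sleep(2)"]
```

With these assumptions, run_exec_probe(slow_probe, enforce_timeout=False) succeeds as it would have before the fix, run_exec_probe(slow_probe) fails against the 1 second default, and run_exec_probe(slow_probe, timeout_seconds=5) succeeds again.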
Here are two methods to mitigate the issue:
- Disable the feature gate on the kubelet: As a cluster administrator, you can disable the feature gate ExecProbeTimeout (set it to false) on each kubelet to restore the behavior from older versions, then remove that override once all the exec probes in the cluster have a timeoutSeconds value set.
- Increase timeoutSeconds to a proper value: If you have pods that are impacted by the default 1 second timeout, you should update their probe timeout so that you’re ready for the eventual removal of that feature gate.
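For the second method, it helps to know which pods are affected. The following is a minimal sketch, assuming pod manifests are available as parsed JSON (for example the entries under items in the output of kubectl get pods -A -o json). The helper name find_short_exec_probes and the example pod are hypothetical.

```python
import json

PROBE_FIELDS = ("livenessProbe", "readinessProbe", "startupProbe")


def find_short_exec_probes(pod, max_timeout=1):
    """List (container name, probe field, timeout) for exec probes whose
    timeoutSeconds is unset or no greater than max_timeout."""
    findings = []
    for container in pod.get("spec", {}).get("containers", []):
        for field in PROBE_FIELDS:
            probe = container.get(field)
            if not probe or "exec" not in probe:
                continue  # only exec probes are affected by this change
            timeout = probe.get("timeoutSeconds", 1)  # unset means the 1s default
            if timeout <= max_timeout:
                findings.append((container["name"], field, timeout))
    return findings


# Hypothetical pod manifest, trimmed to the fields that matter here.
example_pod = json.loads("""
{
  "metadata": {"name": "example"},
  "spec": {"containers": [{
    "name": "app",
    "livenessProbe": {"exec": {"command": ["/bin/check"]}},
    "readinessProbe": {"exec": {"command": ["/bin/check"]}, "timeoutSeconds": 5},
    "startupProbe": {"httpGet": {"path": "/healthz", "port": 8080}}
  }]}
}
""")

print(find_short_exec_probes(example_pod))  # [('app', 'livenessProbe', 1)]
```

The check compares against a threshold instead of only looking for a missing field, because a manifest read back from the cluster will typically already show the 1 second default filled in. A flagged probe is only a candidate: whether 1 second is really too short depends on how long the probe command takes on your nodes.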
For the VPC CNI plugin, however, issue #1425 is still considered a bug and needs follow-up.