Decoding the Self-Healing Kubernetes: Step by Step
Prologue
A business application that fails to operate 24/7 is considered inefficient in the market. Applications are expected to run uninterrupted despite a technical glitch, a feature update, or a natural disaster. In today’s heterogeneous environments, where infrastructure is intricately layered, self-healing is what makes this continuous operation possible.
Kubernetes, a container orchestration tool, keeps applications working smoothly by abstracting away the physical machines. Moreover, the pods and containers in Kubernetes can self-heal.
Captain America asked Bruce Banner in Avengers to get angry to transform into ‘The Hulk’. Bruce replied, “That’s my secret Captain. I’m always angry.”
You must have understood the analogy here. Let’s simplify – Kubernetes will self-heal organically, whenever the system is affected.
Kubernetes’s self-healing property keeps the cluster as close as possible to its desired state. Kubernetes tracks the health of two types of object – PodStatus and ContainerStatus. Its orchestration capabilities monitor unhealthy containers and replace them as per the desired configuration. Likewise, Kubernetes can fix pods, which are the smallest deployable units and encompass one or more containers.
The three container states are:
1. Waiting – created but not running. A container in the Waiting state is still performing the operations it needs in order to start, such as pulling images or applying secrets. To check the status of a Waiting container, use the command below.
kubectl describe pod [POD_NAME]
Along with this state, a message and reason about the state are displayed to provide more information.
...
    State:          Waiting
      Reason:       ErrImagePull
...
2. Running – the container is executing without issues. If a postStart hook is configured, it has already run and finished by the time the container is reported as Running.
A Running container displays the time at which it entered the state.
...
    State:          Running
      Started:      Wed, 30 Jan 2019 16:46:38 +0530
...
3. Terminated – the container has either run to completion or failed. If a preStop hook is configured, it runs before the container enters the Terminated state.
A Terminated container displays the reason, the exit code, and the start and finish times of its execution.
...
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 30 Jan 2019 11:45:26 +0530
      Finished:     Wed, 30 Jan 2019 11:45:26 +0530
...
Kubernetes’ self-healing concepts – pod phase, probes, and restart policy.
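The pod phase in Kubernetes summarizes where a pod is in its lifecycle. A pod can be in one of five phases:
- Pending – the pod has been accepted by the cluster, but one or more of its containers is not yet running
- Running – the pod is bound to a node, all containers have been created, and at least one is running, starting, or restarting
- Succeeded – all containers have terminated successfully and will not be restarted
- Failed – all containers have terminated, and at least one of them terminated in failure
- Unknown – the state of the pod could not be obtained, typically because the node hosting it cannot be reached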
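To make the relationship between container states and pod phases more concrete, here is a minimal Python sketch. It is a deliberately simplified model, not the logic Kubernetes itself uses: the `ContainerState` class and the `pod_phase` function are hypothetical names introduced for illustration, and the sketch ignores restarts and restart policy entirely.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ContainerState:
    """Hypothetical, simplified view of one container's state."""
    state: str                       # "waiting", "running", or "terminated"
    exit_code: Optional[int] = None  # only meaningful when terminated


def pod_phase(containers: Optional[List[ContainerState]]) -> str:
    """Approximate the pod phase from container states (ignores restarts)."""
    if containers is None:
        # Nothing could be learned about the pod, e.g. the node is unreachable.
        return "Unknown"
    if not containers or any(c.state == "waiting" for c in containers):
        # At least one container has not been set up and started yet.
        return "Pending"
    if any(c.state == "running" for c in containers):
        return "Running"
    # Every container has terminated at this point.
    if all(c.exit_code == 0 for c in containers):
        return "Succeeded"
    return "Failed"
```

For example, `pod_phase([ContainerState("terminated", 0), ContainerState("terminated", 1)])` evaluates to `"Failed"`, because every container has terminated and one of them exited with a non-zero code.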
Kubernetes executes liveness and readiness probes on the containers of a pod to check whether they function as per the desired state. The liveness probe checks whether a container is running. If a container fails the probe, Kubernetes terminates it and creates a new container in accordance with the restart policy. The readiness probe checks whether a container is ready to serve requests. If a container fails the probe, Kubernetes removes the IP address of the related pod from the endpoints of every Service that matches the pod, so no traffic is routed to it.
Liveness probe example.
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-http
spec:
  containers:
  - args:
    - /server
    image: k8s.gcr.io/liveness
    livenessProbe:
      httpGet:
        # when "host" is not defined, "PodIP" will be used
        # host: my-host
        # when "scheme" is not defined, "HTTP" scheme will be used. Only "HTTP" and "HTTPS" are allowed
        # scheme: HTTPS
        path: /healthz
        port: 8080
        httpHeaders:
        - name: X-Custom-Header
          value: Awesome
      initialDelaySeconds: 15
      timeoutSeconds: 1
    name: liveness
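A probe uses one of three handlers:
- ExecAction – executes a command inside the container. The probe succeeds if the command exits with status code 0.
- TCPSocketAction – performs a TCP check against the IP address of the container on a specified port. The probe succeeds if the port is open.
- HTTPGetAction – performs an HTTP GET request against the IP address of the container on a specified port and path. The probe succeeds if the response status code is at least 200 and below 400.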
Each probe gives one of three results:
- Success: The Container passed the diagnostic.
- Failure: The Container failed the diagnostic.
- Unknown: The diagnostic failed, so no action should be taken.
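The way probe results and the restart policy combine can be sketched in a few lines of Python. This is an illustrative model only, not Kubernetes code: the function names are hypothetical, the default of three consecutive failures mirrors the default value of the probe's `failureThreshold` field, and treating an Unknown result as "leave the failure counter unchanged" is an assumption of this sketch.

```python
def http_probe_result(status_code):
    """An HTTP GET probe succeeds for any status code from 200 up to 399."""
    return "Success" if 200 <= status_code < 400 else "Failure"


def liveness_kills_container(results, failure_threshold=3):
    """Return True once `failure_threshold` consecutive probes have failed."""
    consecutive_failures = 0
    for result in results:
        if result == "Failure":
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return True
        elif result == "Success":
            consecutive_failures = 0
        # Assumption: "Unknown" triggers no action and leaves the counter as is.
    return False


def should_restart(restart_policy, exit_code):
    """Decide whether a terminated container is started again."""
    if restart_policy == "Always":
        return True
    if restart_policy == "OnFailure":
        return exit_code != 0
    return False  # "Never"
```

With these helpers, `liveness_kills_container(["Failure", "Failure", "Success", "Failure"])` is `False`, because the single success resets the streak before the threshold is reached, while `should_restart("OnFailure", 0)` is `False`, because a container that exits cleanly is not restarted under that policy.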
Demo description of Self-Healing Kubernetes – Example 1
To trigger the self-healing capability of Kubernetes, we need to declare how many replicas of the application should run.
Let’s see an example of an Nginx Deployment file.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-sample
spec:
  selector:
    matchLabels:
      app: nginx
  # desired number of pods; Kubernetes keeps this many running
  replicas: 4
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
In the above code, we see that the total number of pods across the cluster must be 4.
Let’s now deploy the file, assuming it is saved as nginx-deployment-sample.yaml.
kubectl apply -f nginx-deployment-sample.yaml
Let’s list the pods, using
kubectl get pods -l app=nginx
Here is the output.
NAME                                   READY   STATUS    RESTARTS   AGE
nginx-deployment-test-83586599-r299i   1/1     Running   0          5s
nginx-deployment-test-83586599-f299h   1/1     Running   0          5s
nginx-deployment-test-83586599-a534k   1/1     Running   0          5s
nginx-deployment-test-83586599-v389d   1/1     Running   0          5s
As you see above, we have created 4 pods.
Let’s delete one of the pods.
kubectl delete pod nginx-deployment-test-83586599-r299i
The pod is now deleted. We get the following output.
pod "nginx-deployment-test-83586599-r299i" deleted
Now again, list the pods.
kubectl get pods -l app=nginx
We get the following output.
NAME                                   READY   STATUS    RESTARTS   AGE
nginx-deployment-test-83586599-u992j   1/1     Running   0          5s
nginx-deployment-test-83586599-f299h   1/1     Running   0          5s
nginx-deployment-test-83586599-a534k   1/1     Running   0          5s
nginx-deployment-test-83586599-v389d   1/1     Running   0          5s
We have 4 pods again, despite deleting one.
Kubernetes has self-healed by creating a new pod to keep the count at 4.
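The behaviour in this demo can be pictured as a reconciliation loop that keeps comparing the actual number of pods with the desired number. The Python sketch below is a toy model of that idea, not the real Kubernetes controller: the `ReplicaController` class and the `nginx-N` pod names are hypothetical and exist only for illustration.

```python
import itertools


class ReplicaController:
    """Toy stand-in for a controller; not the real Kubernetes implementation."""

    def __init__(self, desired_replicas):
        self.desired_replicas = desired_replicas
        self.pods = set()
        self._ids = itertools.count(1)

    def reconcile(self):
        """Create or remove pods until the actual count matches the desired count."""
        while len(self.pods) < self.desired_replicas:
            self.pods.add(f"nginx-{next(self._ids)}")  # hypothetical pod names
        while len(self.pods) > self.desired_replicas:
            self.pods.pop()

    def delete_pod(self, name):
        """Simulate a pod disappearing, e.g. after `kubectl delete pod`."""
        self.pods.discard(name)


controller = ReplicaController(desired_replicas=4)
controller.reconcile()             # initial rollout creates nginx-1 .. nginx-4
controller.delete_pod("nginx-1")   # one pod is deleted
controller.reconcile()             # the gap is noticed and a replacement is created
print(sorted(controller.pods))     # nginx-1 is gone, nginx-5 has taken its place
```

The design point is that nobody asks for the replacement pod explicitly. The controller only knows the desired count and corrects any drift it observes, which is why deleting a pod by hand is enough to trigger the healing.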
Demo description of Self-Healing Kubernetes – Example 2
Get pod details
$ kubectl get pods -o wide
Get first nginx pod and delete it – one of the nginx pods should be in ‘Terminating’ status
$ NGINX_POD=$(kubectl get pods -l app=nginx --output=jsonpath="{.items[0].metadata.name}")
$ kubectl delete pod $NGINX_POD; kubectl get pods -l app=nginx -o wide
$ sleep 10
Get pod details – one nginx pod should be freshly started
$ kubectl get pods -l app=nginx -o wide
Get deployment details and check the events for recent changes
$ kubectl describe deployment nginx-deployment
Halt one of the nodes (node2)
$ vagrant halt node2
$ sleep 30
Get node details – node2 Status=NotReady
$ kubectl get nodes
Get pod details – everything still looks fine, because eviction only starts after 5 minutes
$ kubectl get pods -o wide
A pod is not evicted until its node has been NotReady or unreachable for 5 minutes (see Tolerations in ‘describe pod’). This prevents Kubernetes from spinning up new containers when it is not necessary
$ NGINX_POD=$(kubectl get pods -l app=nginx --output=jsonpath="{.items[0].metadata.name}")
$ kubectl describe pod $NGINX_POD | grep -A1 Tolerations
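The 5-minute delay corresponds to a toleration of 300 seconds on the pod. A rough way to picture the rule is the Python sketch below. It is an illustration under stated assumptions, not the actual eviction code: the `should_evict` function and the timestamps are hypothetical, and the 300-second default is taken from the 5 minutes mentioned in this walkthrough.

```python
from datetime import datetime, timedelta


def should_evict(node_unreachable_since, now, toleration_seconds=300):
    """Return True once the pod has tolerated the failed node for long enough."""
    if node_unreachable_since is None:
        return False  # the node is healthy, so there is nothing to tolerate
    return now - node_unreachable_since >= timedelta(seconds=toleration_seconds)


halted_at = datetime(2019, 1, 30, 12, 0, 0)  # hypothetical time node2 was halted

# 30 seconds after the halt the pod is still tolerated.
print(should_evict(halted_at, halted_at + timedelta(seconds=30)))

# After the full 300 seconds the pod becomes eligible for eviction.
print(should_evict(halted_at, halted_at + timedelta(seconds=300)))
```

The waiting period is a trade-off: a node that recovers within the window keeps its pods, while a longer outage leads to replacements being started on the remaining nodes.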
Sleeping for 5 minutes
$ sleep 300
Get pod details – Status=Unknown/NodeLost and a new container was started
$ kubectl get pods -o wide
Get deployment details – again AVAILABLE=3/3
$ kubectl get deployments -o wide
Power on the node2 node
$ vagrant up node2
$ sleep 70
Get node details – node2 should be Ready again
$ kubectl get nodes
Get pods details – ‘Unknown’ pods were removed
$ kubectl get pods -o wide
Source: GitHub. Author: Petr Ruzicka
Conclusion
Kubernetes can self-heal applications and containers, but what about healing itself when the nodes are down? For Kubernetes to continue self-healing, it needs a dedicated set of infrastructure with healthy nodes available at all times. The infrastructure must be driven by automation and powered by predictive analytics to preempt issues before they affect services. The bottom line is that, at any given point in time, the infrastructure must maintain the required node count for uninterrupted services.
Reference: kubernetes.io, GitHub