"Chaos engineering is the discipline of experimenting on a software system in production to build confidence in the system's capability to withstand turbulent and unexpected conditions".
Chaos engineering is an exciting discipline whose goal is to surface evidence of weaknesses in a system before those weaknesses become critical issues.
Through these experiments, you gain useful insight into how your system will respond to the kinds of turbulent conditions that occur in production, which makes the practice effective at preventing downtime and production outages before they happen.
Litmus Chaos
Litmus is an open-source chaos engineering platform that helps SREs and developers find weaknesses in both Kubernetes and non-Kubernetes platforms and applications by providing a complete chaos engineering framework and an associated library of chaos experiments.
Getting Started with Litmus Chaos Tests in Kubernetes
In this blog post, we will use a GKE cluster and run a memory chaos experiment against it at the node level.
Prerequisites
You must have a Kubernetes cluster (1.17 or later) up and running
You must have kubectl access to your Kubernetes cluster
You must have Helm installed
We assume you have all the prerequisites in place; you can quickly verify them with the commands below before moving on to the installation.
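Check that the cluster is reachable and that Helm 3 is installed:
$ kubectl get nodes
$ helm version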
Litmus Installation
Once your Kubernetes cluster is ready, install Litmus using its Helm chart.
Add the Litmus Helm repository
$ helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
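Then refresh your local chart index so Helm picks up the latest Litmus charts:
$ helm repo update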
Create a namespace
$ kubectl create ns litmuschaos
Install the Litmus Chaos Center
$ helm install chaos litmuschaos/litmus --namespace=litmuschaos --set portal.frontend.service.type=NodePort
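The portal.frontend.service.type=NodePort value exposes the Chaos Center UI on a node port. As an optional alternative on GKE, you could expose it behind a dedicated external IP with a LoadBalancer service instead:
$ helm install chaos litmuschaos/litmus --namespace=litmuschaos --set portal.frontend.service.type=LoadBalancer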
Verify your installation
$ kubectl get po -n litmuschaos
NAME                                        READY   STATUS    RESTARTS
chaos-litmus-auth-server-65bbc55ddf-ssckq   1/1     Running   0
chaos-litmus-frontend-588f4c66cf-4zbw8      1/1     Running   0
chaos-litmus-server-7fb7954696-cwvhl        1/1     Running   0
chaos-mongodb-5db7f895dd-xqlzf              1/1     Running   0
$ kubectl get svc -n litmuschaos
NAME                               TYPE        CLUSTER-IP     PORT(S)
chaos-litmus-auth-server-service   ClusterIP   10.64.4.23     9003/TCP,3030/TCP
chaos-litmus-frontend-service      NodePort    10.64.5.219    9091:30213/TCP
chaos-litmus-server-service        ClusterIP   10.64.11.232   9002/TCP,8000/TCP
chaos-mongodb                      ClusterIP   10.64.6.22     27017/TCP
Accessing the Chaos Center
To log in to the Chaos Center, note the node port assigned to chaos-litmus-frontend-service (30213 in the output above) and open <NODE-IP:PORT> in your browser. The default credentials are below, and you will be asked to change the password on first login.
Username: admin
Password: litmus
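You can look up a node's external IP with kubectl. If your VPC firewall blocks the node port, or your nodes have no public IPs, port-forwarding the frontend service (service port 9091 in the output above) and browsing to localhost:9091 works just as well:
$ kubectl get nodes -o wide
$ kubectl port-forward svc/chaos-litmus-frontend-service -n litmuschaos 9091:9091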
Once you have logged in and a project has been created, Litmus automatically registers the Self Chaos Delegate. To verify, run the command below.
$ kubectl get po -n litmuschaos
NAME                                        READY   STATUS    RESTARTS   AGE
chaos-exporter-d767fcf5-cjr5w               1/1     Running   0          32m
chaos-litmus-auth-server-65bbc55ddf-ssckq   1/1     Running   0          88m
chaos-litmus-frontend-588f4c66cf-4zbw8      1/1     Running   0          88m
chaos-litmus-server-7fb7954696-cwvhl        1/1     Running   0          88m
chaos-mongodb-5db7f895dd-xqlzf              1/1     Running   0          88m
chaos-operator-ce-7cf6cc79b4-99jgr          1/1     Running   0          32m
event-tracker-5ddb594676-q2sgk              1/1     Running   0          32m
subscriber-5cbcb4df94-kzvcl                 1/1     Running   0          32m
workflow-controller-8c548f686-h7ql4         1/1     Running   0          32m
Now you can see the self-agent in the Active state under Chaos Delegates.
Node-Memory-Hog Litmus Chaos Experiment on a GKE Node
This experiment helps us verify the resiliency of applications whose replicas may be evicted when a node becomes unschedulable due to memory pressure. We are going to check the reliability of a Kubernetes node by injecting memory stress into it.
Since our self-agent is in the Active state, the next step is to go to Chaos Scenarios and click Schedule a Chaos Scenario. On the scenario creation page, choose the self-agent as the delegate and click Next.
After that, you have to choose the source of the chaos scenario. You can pick from pre-defined scenarios, but for this blog we have chosen Chaos Hubs.
At this stage, you can define the scenario name and description.
Now it’s time to tune the chaos scenario; since we are testing a memory hog, select generic/node-memory-hog.
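Under the hood, the Chaos Center translates these choices into Litmus custom resources. For reference, here is a minimal hand-written sketch of an equivalent ChaosEngine; the engine name, service account, target node, and tunable values are illustrative assumptions you would adjust for your own cluster:
$ kubectl apply -f - <<EOF
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: node-memory-hog-demo            # hypothetical name
  namespace: litmuschaos
spec:
  engineState: active
  chaosServiceAccount: litmus-admin     # assumes this ServiceAccount/RBAC exists
  experiments:
    - name: node-memory-hog
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION          # duration of the stress, in seconds
              value: "120"
            - name: MEMORY_CONSUMPTION_PERCENTAGE # share of node memory to hog
              value: "90"
            - name: TARGET_NODES                  # replace with a real node name
              value: "gke-node-1"
EOF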
Adjusting the weight of an experiment in the chaos scenario assigns it an importance relative to the other experiments in your workflow. The resilience score is calculated from each experiment's weight and its probe success percentage, essentially as a weighted average: a single experiment with weight 10 and a probe success percentage of 100% yields (10 × 100%) / 10 = 100%.
Once you have verified and confirmed all the details, click Finish. We have now successfully scheduled the chaos experiment!
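While the scenario runs, you can watch memory pressure on the nodes from a second terminal (kubectl top relies on the cluster's metrics API, which GKE provides by default):
$ kubectl top nodes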
Once all steps have been completed successfully, the workflow graph looks like this:
The experiment is complete!
We can also check the Chaos Analytics result report for the experiment:
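If you prefer the CLI, the verdict is also recorded in ChaosResult custom resources on the delegate cluster; the resource name varies per run, so list them first:
$ kubectl get chaosresults -n litmuschaos
$ kubectl describe chaosresult <chaosresult-name> -n litmuschaos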
In this chaos experiment, the resilience score is 100% with the full 10 points for the memory stress step, which shows that the Kubernetes node has good resilience under the defined configuration.
In this blog post, we have seen how to perform the node-memory-hog chaos experiment using the LitmusChaos tooling. With the same tooling you can run many other experiments, such as azure-disk-loss, gcp-vm-instance-stop, container-kill, pod-network-latency, node-drain, pod-network-loss, pod-autoscaler, node-restart, and more.
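As a quick way to see which experiment definitions have been synced to your delegate's namespace, you can list the ChaosExperiment custom resources:
$ kubectl get chaosexperiments -n litmuschaos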
Comment below to learn more about LitmusChaos and how it can help you prevent downtime and production disruptions.