Chaos Mesh: Complete Guide for Cloud-Native Chaos Engineering

Introduction

Chaos Mesh is an open-source cloud-native chaos engineering platform built for Kubernetes environments. It helps engineers simulate various system failures at the pod, node, and network levels to improve application resilience and identify potential weaknesses before they impact users in production. By systematically injecting controlled chaos, teams can build confidence in their system’s ability to withstand turbulent conditions.

Core Concepts

Concept           | Description
Chaos Experiments | Individual fault injections targeting specific components
Chaos Workflows   | Ordered sequences of chaos experiments with defined scheduling
Chaos Dashboard   | Web UI for managing, monitoring, and visualizing experiments
Namespaced Scope  | Security boundary restricting chaos to specific namespaces
Custom Resources  | Kubernetes CRDs representing different chaos types
Chaos Controller  | Core component managing the lifecycle of chaos experiments

Setup and Installation

Prerequisites

  • Kubernetes cluster (v1.16+)
  • Helm v3 or kubectl
  • Permissions to create custom resources
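
The prerequisites above can be checked from a shell before installing. The following is a minimal bash sketch, not an official procedure: the helper name k8s_version_ok is hypothetical, and the usage lines assume that kubectl is already pointed at the target cluster and that jq is installed.

```shell
# Hypothetical helper: succeeds when a version string such as "v1.27.3"
# is at least the v1.16 minimum listed above.
k8s_version_ok() {
  local ver="${1#v}"            # drop the leading "v"
  local major="${ver%%.*}"      # text before the first dot
  local rest="${ver#*.}"
  local minor="${rest%%.*}"     # text between the first and second dot
  minor="${minor%%[!0-9]*}"     # drop suffixes such as the "+" in "27+"
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 16 ]; }
}

# Usage against a live cluster (assumes kubectl is configured and jq is installed):
#   server="$(kubectl version -o json | jq -r '.serverVersion.gitVersion')"
#   k8s_version_ok "$server" || echo "Kubernetes $server is older than v1.16"
#   kubectl auth can-i create customresourcedefinitions
```

The last usage line asks the API server whether the current user may create CRDs, which covers the permissions item in the list.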

Quick Installation (Helm)

# Add Chaos Mesh repository
helm repo add chaos-mesh https://charts.chaos-mesh.org

# Update repository
helm repo update

# Install Chaos Mesh
helm install chaos-mesh chaos-mesh/chaos-mesh -n chaos-testing --create-namespace

Installation with kubectl

# Create namespace
kubectl create ns chaos-testing

# Apply Chaos Mesh manifests
curl -sSL https://mirrors.chaos-mesh.org/latest/install.yaml | kubectl apply -f -

Verify Installation

kubectl get pods -n chaos-testing
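
Reading the pod list by eye works for a one-off install; in a script it helps to turn it into a pass/fail check. The sketch below is one possible approach: the helper name all_pods_ready is hypothetical, and it assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin and
# succeeds only if at least one pod is listed and every pod is Running with all
# of its containers ready (READY column of the form "n/n").
all_pods_ready() {
  awk '
    { split($2, r, "/"); seen++ }
    $3 != "Running" || r[1] != r[2] { bad++ }
    END { exit (seen > 0 && bad == 0) ? 0 : 1 }
  '
}

# Usage against a live cluster:
#   kubectl get pods -n chaos-testing --no-headers | all_pods_ready \
#     && echo "Chaos Mesh is up" || echo "Some pods are not ready yet"
```

If blocking until the pods are ready is acceptable, kubectl wait --for=condition=Ready pods --all -n chaos-testing --timeout=120s achieves a similar result without parsing text.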

Chaos Experiment Types

Pod Chaos

Type           | Description               | Key Parameters
PodChaos       | Disrupts Pod states       | action, mode, selector
Pod-Kill       | Terminates target Pods    | gracePeriod, count
Pod-Failure    | Injects Pod failures      | duration, count
Container-Kill | Kills specific containers | containerNames

Network Chaos

Type         | Description                 | Key Parameters
NetworkChaos | Disrupts network conditions | action, mode, target
Partition    | Creates network isolation   | direction, target
Loss         | Simulates packet loss       | loss percentage, correlation
Delay        | Adds network latency        | latency, jitter, correlation
Duplicate    | Duplicates packets          | duplicate percentage
Corrupt      | Corrupts packets            | corrupt percentage
Bandwidth    | Limits bandwidth            | rate, limit, buffer

IO Chaos

Type         | Description              | Key Parameters
IOChaos      | File system faults       | action, method, path
Latency      | Adds I/O latency         | delay, paths
Fault        | Injects I/O errors       | errno, path
AttrOverride | Modifies file attributes | attr

Time Chaos

Type      | Description             | Key Parameters
TimeChaos | Manipulates system time | timeOffset

Stress Testing

Type        | Description           | Key Parameters
StressChaos | CPU/Memory pressure   | stressors (cpu/memory)
CPU         | CPU stress testing    | workers, load, options
Memory      | Memory stress testing | workers, size, options
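
To show how the stressors parameters fit together, here is a small bash sketch that prints a StressChaos manifest. The function name render_stress_chaos, the resource name, and the chosen worker counts and sizes are illustrative assumptions, not recommended values.

```shell
# Hypothetical helper: prints a StressChaos manifest that applies CPU and memory
# pressure to one pod labelled app=<app> in the given namespace.
render_stress_chaos() {
  local app="$1" namespace="$2" cpu_load="$3" mem_size="$4" duration="$5"
  cat <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: ${app}-stress
  namespace: chaos-testing
spec:
  mode: one
  selector:
    namespaces:
      - ${namespace}
    labelSelectors:
      app: ${app}
  stressors:
    cpu:
      workers: 1
      load: ${cpu_load}
    memory:
      workers: 1
      size: "${mem_size}"
  duration: "${duration}"
EOF
}

# Usage: render_stress_chaos example default 50 256MB 60s | kubectl apply -f -
```

Printing the manifest instead of applying it directly makes it easy to review or diff the YAML before anything reaches the cluster.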

DNS Chaos

Type     | Description           | Key Parameters
DNSChaos | DNS resolution issues | patterns, action

JVM Chaos

Type     | Description             | Key Parameters
JVMChaos | Java application faults | action, class, method

Kernel Chaos

Type        | Description     | Key Parameters
KernelChaos | Kernel failures | failKernRequest

Creating Chaos Experiments

Basic Template

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos  # Change to required chaos type
metadata:
  name: pod-kill-example
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: example
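  # pod-kill is a one-shot action, so no duration is set here; duration applies
  # to lasting faults such as pod-failure. Chaos Mesh 2.x dropped the inline
  # scheduler field: recurring runs are declared with a separate Schedule resource.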
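
In Chaos Mesh 2.x (the release line that also provides Workflows), a recurring experiment is expressed as a Schedule resource that wraps the chaos spec. The bash sketch below prints such a manifest for a periodic pod-kill; the function name, resource name, and the historyLimit and concurrencyPolicy values are illustrative assumptions.

```shell
# Hypothetical helper: prints a Schedule manifest that runs a pod-kill against
# one pod labelled app=<app> on the given cron expression.
render_pod_kill_schedule() {
  local app="$1" namespace="$2" cron="$3"
  cat <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: ${app}-pod-kill-schedule
  namespace: chaos-testing
spec:
  schedule: "${cron}"
  historyLimit: 2
  concurrencyPolicy: Forbid
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - ${namespace}
      labelSelectors:
        app: ${app}
EOF
}

# Usage: render_pod_kill_schedule example default '*/5 * * * *' | kubectl apply -f -
```

With concurrencyPolicy set to Forbid, a new run is skipped while the previous one is still active, which keeps the blast radius predictable.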

Applying Experiments

# Create experiment
kubectl apply -f experiment.yaml

# View experiments
kubectl get podchaos -n chaos-testing

# Delete experiment
kubectl delete -f experiment.yaml
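
An experiment whose selector matches nothing can look applied while doing nothing useful, so a pre-flight check on the selector is worth a few lines. The sketch below is one way to do it in bash; count_targets and apply_if_targets_exist are hypothetical names, and the KUBECTL variable exists only so the command can be swapped out (for example by a stub).

```shell
# KUBECTL can be overridden, for example with a stub in tests.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: counts the pods a namespace + label selector would match.
count_targets() {
  local namespace="$1" selector="$2"
  "$KUBECTL" get pods -n "$namespace" -l "$selector" -o name | wc -l | tr -d ' '
}

# Hypothetical wrapper: applies the manifest only if the selector matches a pod.
apply_if_targets_exist() {
  local manifest="$1" namespace="$2" selector="$3"
  local n
  n="$(count_targets "$namespace" "$selector")"
  if [ "$n" -eq 0 ]; then
    echo "No pods match '$selector' in namespace '$namespace'; not applying $manifest" >&2
    return 1
  fi
  "$KUBECTL" apply -f "$manifest"
}

# Usage: apply_if_targets_exist experiment.yaml default app=example
```

The namespace and label passed here must mirror the selector inside the manifest; the wrapper does not parse the YAML itself.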

Chaos Workflows

Basic Workflow Example

apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: example-workflow
  namespace: chaos-testing
spec:
  entry: entry
  templates:
    - name: entry
      templateType: Serial
      deadline: 240s
      children:
        - pod-kill-example
        - network-delay-example
    - name: pod-kill-example
      templateType: PodChaos
      deadline: 60s
      podChaos:
        action: pod-kill
        mode: one
        selector:
          namespaces:
            - default
          labelSelectors:
            app: example
    - name: network-delay-example
      templateType: NetworkChaos
      deadline: 60s
      networkChaos:
        action: delay
        mode: all
        selector:
          namespaces:
            - default
          labelSelectors:
            app: example
        delay:
          latency: "90ms"
          correlation: "25"
          jitter: "10ms"

Working with Chaos Dashboard

Access Dashboard

# Port-forward the dashboard service
kubectl port-forward -n chaos-testing svc/chaos-dashboard 2333:2333

# Access via browser
# http://localhost:2333

Dashboard Features

  • Experiment creation wizard
  • Real-time experiment status monitoring
  • Workflow visualization
  • Event timeline
  • Experiment archiving
  • RBAC management

Monitoring Chaos Experiments

CLI Monitoring

# Get the most common chaos experiment kinds (add dnschaos, jvmchaos, etc. as needed)
kubectl get podchaos,networkchaos,iochaos,timechaos,kernelchaos,stresschaos -n chaos-testing

# Describe specific experiment
kubectl describe podchaos pod-kill-example -n chaos-testing

# Check events
kubectl get events -n chaos-testing
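
Rather than hard-coding the list of kinds, the installed CRDs can be discovered from the API server. The bash sketch below assumes the CRDs are registered under the chaos-mesh.org API group (the group used in the manifests in this guide); the helper names are hypothetical.

```shell
# KUBECTL can be overridden, for example with a stub in tests.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: lists every namespaced resource in the chaos-mesh.org
# API group as one comma-separated string suitable for `kubectl get`.
chaos_kinds() {
  "$KUBECTL" api-resources --api-group=chaos-mesh.org --namespaced=true -o name | paste -sd, -
}

# Hypothetical helper: shows every Chaos Mesh resource in a namespace.
list_all_chaos() {
  local namespace="$1" kinds
  kinds="$(chaos_kinds)"
  if [ -z "$kinds" ]; then
    echo "No chaos-mesh.org resources found; are the CRDs installed?" >&2
    return 1
  fi
  "$KUBECTL" get "$kinds" -n "$namespace"
}

# Usage: list_all_chaos chaos-testing
```

Because the list comes from the cluster, it also picks up non-experiment resources in the same group, such as workflows and schedules.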

Integration with Prometheus

# Prometheus scrape annotations for the controller-manager metrics Service
# (fragment: merge these annotations into the existing Service)
apiVersion: v1
kind: Service
metadata:
  name: chaos-mesh-controller-manager
  namespace: chaos-testing
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '10080'

Comparison with Other Chaos Engineering Tools

Feature            | Chaos Mesh         | Litmus         | Gremlin        | Chaos Toolkit
Focus              | Kubernetes-native  | Kubernetes     | Multi-platform | Platform-agnostic
Installation       | Helm/kubectl       | Helm/Operator  | SaaS + agents  | Python CLI
UI                 | Built-in dashboard | Litmus Portal  | Web UI         | CLI only
Architecture       | CRD-based          | Operator-based | Agent-based    | Execution driver
Learning Curve     | Moderate           | Moderate       | Low            | Moderate
Kubernetes Support | Extensive          | Extensive      | Basic          | Via plugins
License            | Open source        | Open source    | Commercial     | Open source
Community          | Active (CNCF)      | Active (CNCF)  | Commercial     | Moderate

Common Challenges and Solutions

Challenge              | Solution
Permissions Issues     | Ensure proper RBAC setup with cluster-admin for installation and appropriate roles for namespaces
Failed Experiments     | Check event logs with kubectl describe and ensure selector matches target pods
Resource Constraints   | Use resource quotas and tune stress parameters to avoid OOM kills
Blast Radius Control   | Use namespaced scope and selective pod targeting with specific labels
Dashboard Access       | Use port forwarding or ingress configuration for persistent access
Experiment Monitoring  | Integrate with Prometheus/Grafana for better visibility
Custom Resource Issues | Verify CRD installation and API version compatibility
Recovery Problems      | Implement proper finalizers and use time-bounded experiments

Best Practices

Planning and Implementation

  • Start with non-production environments before moving to production
  • Begin with simple experiments and gradually increase complexity
  • Use small blast radius initially and expand gradually
  • Document baseline system behavior before running chaos experiments
  • Implement automatic recovery mechanisms before testing

Execution Safety

  • Always set a reasonable timeout for experiments (avoid infinite chaos)
  • Use the Dashboard to monitor experiment status in real-time
  • Implement circuit breakers to automatically stop experiments if critical metrics are affected
  • Schedule experiments during low-traffic periods initially
  • Create a chaos engineering runbook for emergency procedures
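
The circuit-breaker item above can be sketched in a few lines of bash. This is an illustration under stated assumptions, not a built-in Chaos Mesh feature: the function names are hypothetical, the error rate is assumed to come from whatever monitoring is already in place, and stopping is done by deleting the experiment, which triggers recovery of the injected fault.

```shell
# KUBECTL can be overridden, for example with a stub in tests.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: succeeds when the observed value exceeds the threshold.
# awk handles the floating-point comparison that [ ] cannot.
exceeds_threshold() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit (v + 0 > t + 0) ? 0 : 1 }'
}

# Hypothetical circuit breaker: deletes the experiment when the metric is too
# high. How the error rate is obtained (Prometheus query, health endpoint, ...)
# is left to the caller.
abort_if_unhealthy() {
  local kind="$1" name="$2" namespace="$3" error_rate="$4" max_rate="$5"
  if exceeds_threshold "$error_rate" "$max_rate"; then
    echo "error rate $error_rate > $max_rate: stopping $kind/$name" >&2
    "$KUBECTL" delete "$kind" "$name" -n "$namespace"
  fi
}

# Usage: abort_if_unhealthy podchaos pod-kill-example chaos-testing 0.07 0.05
```

Run in a loop or from a monitoring hook, a check like this bounds how long a harmful experiment can keep running without a human watching it.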

Organization and Scaling

  • Use standardized labels for targeting consistent application components
  • Organize chaos experiments by service or failure type
  • Integrate chaos experiments into CI/CD pipelines for continuous resilience testing
  • Set up alerts for unexpected behavior during experiments
  • Maintain a chaos experiment catalog with results and learnings

Advanced Techniques

Using Templates and Annotations

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-template
  annotations:
    experiment.chaos-mesh.org/pause: "true"  # Paused initially
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: example
  # In Chaos Mesh 2.x, recurring runs are configured with a separate Schedule
  # resource rather than an inline scheduler field.
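
The pause annotation can also be toggled on a resource that already exists. A minimal bash sketch follows; the function names are hypothetical, and each takes the resource kind, name, and namespace as arguments.

```shell
# KUBECTL can be overridden, for example with a stub in tests.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: pauses an experiment by setting the pause annotation.
pause_experiment() {
  local kind="$1" name="$2" namespace="$3"
  "$KUBECTL" annotate "$kind" "$name" -n "$namespace" \
    experiment.chaos-mesh.org/pause=true --overwrite
}

# Hypothetical helper: resumes an experiment. A trailing "-" tells
# kubectl annotate to remove the annotation.
resume_experiment() {
  local kind="$1" name="$2" namespace="$3"
  "$KUBECTL" annotate "$kind" "$name" -n "$namespace" \
    experiment.chaos-mesh.org/pause-
}

# Usage:
#   pause_experiment podchaos pod-kill-template default
#   resume_experiment podchaos pod-kill-template default
```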

Customizing Chaos Mesh

# Build from source
git clone https://github.com/chaos-mesh/chaos-mesh.git
cd chaos-mesh
make

# Create custom chaos
kubectl apply -f examples/custom-template.yaml

Resources for Further Learning

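Official Documentation

  • Chaos Mesh documentation: https://chaos-mesh.org/docs/
  • Source repository: https://github.com/chaos-mesh/chaos-mesh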

Community Resources

  • CNCF Slack (#project-chaos-mesh channel)
  • Chaos Mesh Blog: https://chaos-mesh.org/blog/
  • Monthly community meetings

Related Reading

  • “Chaos Engineering: System Resiliency in Practice” (O’Reilly)
  • “Kubernetes Patterns” for resilient application design
  • “Site Reliability Engineering” by Google

Training and Certification

  • CNCF Chaos Engineering Certification (upcoming)
  • Chaos Mesh workshops at KubeCon and CloudNativeCon