Introduction
Chaos Mesh is an open-source, cloud-native chaos engineering platform for Kubernetes. It lets engineers inject faults at the Pod, node, and network levels to find weaknesses before they affect users in production. Injecting controlled faults systematically gives teams evidence of how their system behaves under turbulent conditions.
Core Concepts
Concept | Description |
---|---|
Chaos Experiments | Individual fault injections targeting specific components |
Chaos Workflows | Ordered sequences of chaos experiments with defined scheduling |
Chaos Dashboard | Web UI for managing, monitoring, and visualizing experiments |
Namespaced Scope | Security boundary restricting chaos to specific namespaces |
Custom Resources | Kubernetes CRDs representing different chaos types |
Chaos Controller | Core component managing the lifecycle of chaos experiments |
Setup and Installation
Prerequisites
- Kubernetes cluster (v1.16+)
- Helm v3 or kubectl
- Permissions to create custom resources
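The permission requirement can be checked before installing. The sketch below is a hypothetical helper (the name `check_prereqs` is not part of Chaos Mesh) built on `kubectl auth can-i`, which prints `yes` or `no` for the current user.

```shell
# Hypothetical pre-flight check: is kubectl available, and may the current
# user create CustomResourceDefinitions (needed to install the Chaos Mesh CRDs)?
check_prereqs() {
  command -v kubectl >/dev/null 2>&1 || {
    echo "kubectl not found on PATH" >&2
    return 1
  }
  # can-i prints "yes" or "no"; warnings about cluster-scoped resources go to stderr.
  if [ "$(kubectl auth can-i create customresourcedefinitions 2>/dev/null)" != "yes" ]; then
    echo "current user cannot create CustomResourceDefinitions" >&2
    return 1
  fi
  echo "prerequisites look OK"
}

# Usage (needs cluster access):
#   check_prereqs && echo "safe to install"
```

The check covers only CRD creation; an install also creates RBAC objects and a DaemonSet, so a passing result is a necessary condition rather than a guarantee.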
Quick Installation (Helm)
# Add Chaos Mesh repository
helm repo add chaos-mesh https://charts.chaos-mesh.org
# Update repository
helm repo update
# Install Chaos Mesh
helm install chaos-mesh chaos-mesh/chaos-mesh -n chaos-testing --create-namespace
Installation with kubectl
# Create namespace
kubectl create ns chaos-testing
# Apply Chaos Mesh manifests
curl -sSL https://mirrors.chaos-mesh.org/latest/install.yaml | kubectl apply -f -
Verify Installation
kubectl get pods -n chaos-testing
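Listing Pods shows their state at one moment only. A small wrapper, sketched here under the assumption that the release lives in `chaos-testing`, can block until the Pods are Ready and then confirm that the CRDs were registered. The function name `wait_for_chaos_mesh` is illustrative.

```shell
# Wait until every Pod in the Chaos Mesh namespace is Ready, then list the
# CRDs in the chaos-mesh.org group. Arguments: namespace, timeout (both optional).
wait_for_chaos_mesh() {
  ns="${1:-chaos-testing}"
  timeout="${2:-180s}"
  kubectl wait --for=condition=Ready pods --all -n "$ns" --timeout="$timeout" &&
    kubectl get crd -o name | grep 'chaos-mesh.org'
}

# Usage (needs cluster access):
#   wait_for_chaos_mesh chaos-testing 300s
```

The function returns non-zero if the wait times out or if no `chaos-mesh.org` CRD is found, which makes it usable as a gate in scripts.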
Chaos Experiment Types
Pod Chaos
Type | Description | Key Parameters |
---|---|---|
PodChaos | Disrupts Pod states | action, mode, selector |
Pod-Kill | Terminates target Pods | gracePeriod, count |
Pod-Failure | Injects Pod failures | duration, count |
Container-Kill | Kills specific containers | containerNames |
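As an illustration of the `containerNames` parameter, the following sketch prints a `container-kill` manifest. The function name, the resource name, and the label key `app` are assumptions chosen for the example; the `spec` fields are the ones listed in the table.

```shell
# Print a PodChaos manifest that kills one named container in one matching Pod.
# Arguments (illustrative): app label value, target namespace, container name.
render_container_kill() {
  app="${1:?app label required}"
  target_ns="${2:-default}"
  container="${3:?container name required}"
  cat <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: container-kill-${app}
  namespace: chaos-testing
spec:
  action: container-kill
  mode: one
  containerNames:
    - ${container}
  selector:
    namespaces:
      - ${target_ns}
    labelSelectors:
      app: ${app}
EOF
}

# Usage (needs cluster access):
#   render_container_kill example default nginx | kubectl apply -f -
```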
Network Chaos
Type | Description | Key Parameters |
---|---|---|
NetworkChaos | Disrupts network conditions | action, mode, target |
Partition | Creates network isolation | direction, target |
Loss | Simulates packet loss | loss percentage, correlation |
Delay | Adds network latency | latency, jitter, correlation |
Duplicate | Duplicates packets | duplicate percentage |
Corrupt | Corrupts packets | corrupt percentage |
Bandwidth | Limits bandwidth | rate, limit, buffer |
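The `direction` and `target` parameters of a partition are easier to see in a manifest. This is a minimal sketch that isolates two groups of Pods from each other; the function name and the `app` label values are hypothetical.

```shell
# Print a NetworkChaos manifest that partitions Pods labelled app=<src>
# from Pods labelled app=<dst> in both directions for a bounded time.
render_network_partition() {
  src="${1:?source app label required}"
  dst="${2:?target app label required}"
  target_ns="${3:-default}"
  duration="${4:-60s}"
  cat <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: partition-${src}-${dst}
  namespace: chaos-testing
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - ${target_ns}
    labelSelectors:
      app: ${src}
  direction: both
  target:
    mode: all
    selector:
      namespaces:
        - ${target_ns}
      labelSelectors:
        app: ${dst}
  duration: "${duration}"
EOF
}

# Usage (needs cluster access):
#   render_network_partition frontend backend default 2m | kubectl apply -f -
```

Setting `direction` to `to` or `from` instead of `both` would block traffic one way only.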
IO Chaos
Type | Description | Key Parameters |
---|---|---|
IOChaos | File system faults | action, method, path |
Latency | Adds I/O latency | delay, paths |
Fault | Injects I/O errors | errno, path |
AttrOverride | Modifies file attributes | attr |
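A latency fault can be sketched as follows. Besides the parameters in the table, IOChaos expects a `volumePath` that points at the mount point of a volume in the target container; the function name, label, and default values here are illustrative assumptions.

```shell
# Print an IOChaos manifest that delays file operations under a mounted volume.
# Arguments (illustrative): app label value, volume mount path, delay, duration.
render_io_latency() {
  app="${1:?app label required}"
  volume="${2:?volume mount path required}"
  delay="${3:-100ms}"
  duration="${4:-5m}"
  cat <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-latency-${app}
  namespace: chaos-testing
spec:
  action: latency
  mode: one
  selector:
    labelSelectors:
      app: ${app}
  volumePath: ${volume}
  path: '${volume}/**/*'
  delay: '${delay}'
  percent: 50
  duration: '${duration}'
EOF
}

# Usage (needs cluster access):
#   render_io_latency etcd /var/run/etcd 200ms 2m | kubectl apply -f -
```

With `percent: 50`, roughly half of the matching operations are delayed, which keeps the fault from stalling the workload completely.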
Time Chaos
Type | Description | Key Parameters |
---|---|---|
TimeChaos | Manipulates system time | timeOffset |
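A clock-skew experiment using `timeOffset` might look like the sketch below. The function name and label are hypothetical, and the offset uses a duration string such as `-10m`.

```shell
# Print a TimeChaos manifest that shifts the clock seen by one matching Pod.
# Arguments (illustrative): app label value, offset (e.g. -10m), duration.
render_time_offset() {
  app="${1:?app label required}"
  offset="${2:?offset required}"
  duration="${3:-60s}"
  cat <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: time-shift-${app}
  namespace: chaos-testing
spec:
  mode: one
  selector:
    labelSelectors:
      app: ${app}
  timeOffset: '${offset}'
  duration: '${duration}'
EOF
}

# Usage (needs cluster access):
#   render_time_offset example -10m 60s | kubectl apply -f -
```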
Stress Testing
Type | Description | Key Parameters |
---|---|---|
StressChaos | CPU/Memory pressure | stressors (cpu/memory) |
CPU | CPU stress testing | workers, load, options |
Memory | Memory stress testing | workers, size, options |
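The `stressors` block combines the CPU and memory parameters from the table. The following is a minimal sketch with a hypothetical function name and default values picked only for illustration.

```shell
# Print a StressChaos manifest that applies CPU and memory pressure to one Pod.
# Arguments (illustrative): app label value, CPU load percent, memory size, duration.
render_stress() {
  app="${1:?app label required}"
  cpu_load="${2:-50}"
  mem_size="${3:-256MB}"
  duration="${4:-60s}"
  cat <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: stress-${app}
  namespace: chaos-testing
spec:
  mode: one
  selector:
    labelSelectors:
      app: ${app}
  stressors:
    cpu:
      workers: 1
      load: ${cpu_load}
    memory:
      workers: 1
      size: '${mem_size}'
  duration: '${duration}'
EOF
}

# Usage (needs cluster access):
#   render_stress example 80 512MB 2m | kubectl apply -f -
```

Keeping `size` below the container's memory limit avoids turning the stress test into an OOM kill.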
DNS Chaos
Type | Description | Key Parameters |
---|---|---|
DNSChaos | DNS resolution issues | patterns, action |
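For `patterns` and `action`, a sketch of an `error` experiment is shown below. The function name and arguments are illustrative. DNSChaos also depends on the Chaos Mesh DNS server component being deployed, which a default install may not include.

```shell
# Print a DNSChaos manifest that makes lookups matching a pattern fail
# for every Pod in the target namespace.
# Arguments (illustrative): target namespace, domain pattern, duration.
render_dns_error() {
  target_ns="${1:?target namespace required}"
  pattern="${2:?domain pattern required}"
  duration="${3:-60s}"
  cat <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: dns-error-${target_ns}
  namespace: chaos-testing
spec:
  action: error
  mode: all
  patterns:
    - ${pattern}
  selector:
    namespaces:
      - ${target_ns}
  duration: '${duration}'
EOF
}

# Usage (needs cluster access):
#   render_dns_error default 'example.com' 2m | kubectl apply -f -
```

Switching `action` to `random` would return random addresses instead of errors.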
JVM Chaos
Type | Description | Key Parameters |
---|---|---|
JVMChaos | Java application faults | action, class, method |
Kernel Chaos
Type | Description | Key Parameters |
---|---|---|
KernelChaos | Kernel failures | failKernRequest |
Creating Chaos Experiments
Basic Template
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos # Change to required chaos type
metadata:
  name: pod-kill-example
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: example
  # scheduler is a Chaos Mesh 1.x field; 2.x moves recurring runs
  # to a separate Schedule resource
  scheduler:
    cron: "@every 5m"
  duration: "10s"
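The `scheduler` field in this template belongs to the Chaos Mesh 1.x API. In 2.x, recurring runs are expressed with a separate `Schedule` resource that wraps the experiment spec. The sketch below prints an equivalent for the pod-kill template; the function name, the resource name, and the `historyLimit` value are assumptions.

```shell
# Print a Chaos Mesh 2.x Schedule that runs a pod-kill on a cron schedule.
# Arguments (illustrative): app label value, target namespace, cron expression.
render_pod_kill_schedule() {
  app="${1:?app label required}"
  target_ns="${2:-default}"
  cron="${3:-*/5 * * * *}"
  cat <<EOF
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-kill-${app}
  namespace: chaos-testing
spec:
  schedule: '${cron}'
  historyLimit: 5
  concurrencyPolicy: Forbid
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - ${target_ns}
      labelSelectors:
        app: ${app}
EOF
}

# Usage (needs cluster access); quote the cron expression so the shell
# does not expand the asterisks:
#   render_pod_kill_schedule example default '*/10 * * * *' | kubectl apply -f -
```

`concurrencyPolicy: Forbid` prevents a new run from starting while the previous one is still active.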
Applying Experiments
# Create experiment
kubectl apply -f experiment.yaml
# View experiments
kubectl get podchaos -n chaos-testing
# Delete experiment
kubectl delete -f experiment.yaml
Chaos Workflows
Basic Workflow Example
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: example-workflow
  namespace: chaos-testing
spec:
  entry: entry
  templates:
    - name: entry
      templateType: Serial
      deadline: 240s
      children:
        - pod-kill-example
        - network-delay-example
    - name: pod-kill-example
      templateType: PodChaos
      deadline: 60s
      podChaos:
        action: pod-kill
        mode: one
        selector:
          namespaces:
            - default
          labelSelectors:
            app: example
    - name: network-delay-example
      templateType: NetworkChaos
      deadline: 60s
      networkChaos:
        action: delay
        mode: all
        selector:
          namespaces:
            - default
          labelSelectors:
            app: example
        delay:
          latency: "90ms"
          correlation: "25"
          jitter: "10ms"
Working with Chaos Dashboard
Access Dashboard
# Port-forward the dashboard service
kubectl port-forward -n chaos-testing svc/chaos-dashboard 2333:2333
# Access via browser
# http://localhost:2333
Dashboard Features
- Experiment creation wizard
- Real-time experiment status monitoring
- Workflow visualization
- Event timeline
- Experiment archiving
- RBAC management
Monitoring Chaos Experiments
CLI Monitoring
# List experiments of the most common chaos types (add dnschaos, jvmchaos, etc. as needed)
kubectl get podchaos,networkchaos,iochaos,timechaos,kernelchaos,stresschaos -n chaos-testing
# Describe specific experiment
kubectl describe podchaos pod-kill-example -n chaos-testing
# Check events
kubectl get events -n chaos-testing
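Typing out every chaos kind is easy to get wrong when new kinds are installed. One alternative, sketched here with a hypothetical function name, asks the API server which resources exist in the `chaos-mesh.org` group and queries all of them. The group also contains non-experiment resources such as workflows and schedules, so the output is broader than the hand-written list.

```shell
# List every namespaced resource in the chaos-mesh.org API group for a namespace.
list_all_chaos() {
  ns="${1:-chaos-testing}"
  # api-resources prints one name per line; join them with commas for kubectl get.
  kinds=$(kubectl api-resources --api-group=chaos-mesh.org --namespaced=true -o name | paste -sd, -)
  if [ -z "$kinds" ]; then
    echo "no chaos-mesh.org resources found; are the CRDs installed?" >&2
    return 1
  fi
  kubectl get "$kinds" -n "$ns"
}

# Usage (needs cluster access):
#   list_all_chaos chaos-testing
```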
Integration with Prometheus
# Add Prometheus annotations to metrics service
# (metadata fragment; merge into the existing Service)
apiVersion: v1
kind: Service
metadata:
  name: chaos-mesh-controller-manager
  namespace: chaos-testing
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '10080'
Comparison with Other Chaos Engineering Tools
Feature | Chaos Mesh | Litmus | Gremlin | Chaos Toolkit |
---|---|---|---|---|
Focus | Kubernetes-native | Kubernetes | Multi-platform | Platform-agnostic |
Installation | Helm/kubectl | Helm/Operator | SaaS + agents | Python CLI |
UI | Built-in dashboard | Litmus Portal | Web UI | CLI only |
Architecture | CRD-based | Operator-based | Agent-based | Execution driver |
Learning Curve | Moderate | Moderate | Low | Moderate |
Kubernetes Support | Extensive | Extensive | Basic | Via plugins |
License | Open source | Open source | Commercial | Open source |
Community | Active (CNCF) | Active (CNCF) | Commercial | Moderate |
Common Challenges and Solutions
Challenge | Solution |
---|---|
Permissions Issues | Ensure proper RBAC setup with cluster-admin for installation and appropriate roles for namespaces |
Failed Experiments | Check event logs with kubectl describe and ensure selector matches target pods |
Resource Constraints | Use resource quotas and tune stress parameters to avoid OOM kills |
Blast Radius Control | Use namespaced scope and selective pod targeting with specific labels |
Dashboard Access | Use port forwarding or ingress configuration for persistent access |
Experiment Monitoring | Integrate with Prometheus/Grafana for better visibility |
Custom Resource Issues | Verify CRD installation and API version compatibility |
Recovery Problems | Implement proper finalizers and use time-bounded experiments |
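For blast radius control, namespace scoping can be made opt-in. The sketch below assumes Chaos Mesh was installed with the Helm value `controllerManager.enableFilterNamespace=true`, in which case only namespaces carrying the `chaos-mesh.org/inject=enabled` annotation accept experiments. The function names are hypothetical.

```shell
# Opt a namespace in to chaos injection.
# Only has an effect when namespace filtering is enabled in the install.
allow_chaos_in_namespace() {
  ns="${1:?namespace required}"
  kubectl annotate ns "$ns" chaos-mesh.org/inject=enabled --overwrite
}

# Opt the namespace out again by removing the annotation (trailing dash).
forbid_chaos_in_namespace() {
  ns="${1:?namespace required}"
  kubectl annotate ns "$ns" chaos-mesh.org/inject-
}

# Usage (needs cluster access):
#   allow_chaos_in_namespace staging
#   forbid_chaos_in_namespace staging
```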
Best Practices
Planning and Implementation
- Start with non-production environments before moving to production
- Begin with simple experiments and gradually increase complexity
- Use small blast radius initially and expand gradually
- Document baseline system behavior before running chaos experiments
- Implement automatic recovery mechanisms before testing
Execution Safety
- Always set a reasonable timeout for experiments (avoid infinite chaos)
- Use the Dashboard to monitor experiment status in real-time
- Implement circuit breakers to automatically stop experiments if critical metrics are affected
- Schedule experiments during low-traffic periods initially
- Create a chaos engineering runbook for emergency procedures
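The circuit-breaker idea can be approximated with a small watchdog. This sketch polls a health URL and pauses the experiment on the first failed probe by setting the `experiment.chaos-mesh.org/pause` annotation. The function name, the URL, and the probe counts are assumptions; a production setup would more likely key off alerting metrics.

```shell
# Poll a health endpoint while an experiment runs; pause the experiment
# as soon as one probe fails.
# Arguments: kind, name, namespace, health URL, number of probes, seconds between probes.
chaos_guard() {
  kind="${1:?kind required}"
  name="${2:?name required}"
  ns="${3:?namespace required}"
  url="${4:?health URL required}"
  checks="${5:-30}"
  interval="${6:-10}"
  i=0
  while [ "$i" -lt "$checks" ]; do
    if ! curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "probe failed, pausing $kind/$name" >&2
      kubectl annotate "$kind" "$name" -n "$ns" \
        experiment.chaos-mesh.org/pause=true --overwrite
      return 1
    fi
    i=$((i + 1))
    sleep "$interval"
  done
}

# Usage (needs cluster access and a reachable health endpoint):
#   chaos_guard networkchaos network-delay-example chaos-testing http://localhost:8080/healthz 30 10
```

Pausing matters most for duration-based faults such as network delay or stress; a one-shot `pod-kill` has already happened by the time a probe fails.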
Organization and Scaling
- Use standardized labels for targeting consistent application components
- Organize chaos experiments by service or failure type
- Integrate chaos experiments into CI/CD pipelines for continuous resilience testing
- Set up alerts for unexpected behavior during experiments
- Maintain a chaos experiment catalog with results and learnings
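A CI/CD step for resilience testing can be as small as "apply, wait, probe, clean up". The sketch below is one hypothetical shape for such a step. The function name is an assumption, and the probe is whatever command the pipeline already uses to verify steady state.

```shell
# Apply an experiment, hold for a while, run a steady-state probe,
# then always delete the experiment. Returns the probe's exit status.
# Arguments: manifest path, seconds to hold, probe command and its arguments.
run_chaos_check() {
  manifest="${1:?manifest path required}"
  hold="${2:?hold time in seconds required}"
  shift 2
  kubectl apply -f "$manifest" || return 1
  sleep "$hold"
  status=0
  "$@" || status=$?
  # Clean up even when the probe failed, so the fault does not outlive the job.
  kubectl delete -f "$manifest" --ignore-not-found
  return "$status"
}

# Usage (needs cluster access):
#   run_chaos_check experiment.yaml 30 curl -fsS http://localhost:8080/healthz
```

Returning the probe status lets the pipeline fail the build when the system does not hold steady under the fault.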
Advanced Techniques
Using Templates and Annotations
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-template
  annotations:
    experiment.chaos-mesh.org/pause: "true" # Paused initially
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: example
  scheduler: # Chaos Mesh 1.x field
    cron: "@every 5m"
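The pause annotation can also be toggled on a live experiment. A minimal sketch with hypothetical wrapper names follows; the annotation key is the one used in the template above, and the trailing dash is the standard `kubectl annotate` syntax for removing a key.

```shell
# Pause a running experiment by setting the pause annotation.
# Arguments: kind, name, namespace (optional).
pause_chaos() {
  kubectl annotate "${1:?kind required}" "${2:?name required}" \
    -n "${3:-chaos-testing}" experiment.chaos-mesh.org/pause=true --overwrite
}

# Resume it by removing the annotation.
resume_chaos() {
  kubectl annotate "${1:?kind required}" "${2:?name required}" \
    -n "${3:-chaos-testing}" experiment.chaos-mesh.org/pause-
}

# Usage (needs cluster access):
#   pause_chaos podchaos pod-kill-template
#   resume_chaos podchaos pod-kill-template
```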
Customizing Chaos Mesh
# Build from source
git clone https://github.com/chaos-mesh/chaos-mesh.git
cd chaos-mesh
make
# Create custom chaos
kubectl apply -f examples/custom-template.yaml
Resources for Further Learning
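Official Documentation
- Chaos Mesh documentation: https://chaos-mesh.org/docs/
- Source repository: https://github.com/chaos-mesh/chaos-mesh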
Community Resources
- CNCF Slack (#chaos-mesh channel)
- Chaos Mesh Blog: https://chaos-mesh.org/blog/
- Monthly community meetings
Related Reading
- “Chaos Engineering: System Resiliency in Practice” (O’Reilly)
- “Kubernetes Patterns” for resilient application design
- “Site Reliability Engineering” by Google
Training and Certification
- CNCF Chaos Engineering Certification (upcoming)
- Chaos Mesh workshops at KubeCon and CloudNativeCon