Modern software systems are incredibly complex. They're spread across massive networks with countless moving parts. Because of this complexity, unexpected failures are inevitable. Servers crash. Networks slow down. Dependencies fail. Instead of waiting for something to break at 3 AM and scrambling to fix it, there's a smarter approach: find and fix weaknesses before they cause real problems. That's what Chaos Engineering is all about. As Dr. Werner Vogels, Amazon's CTO, famously said, "Everything fails, all the time." This simple truth is the foundation of an entire field dedicated to preparing systems for inevitable failures.
Chaos Engineering is essentially a way to test how strong your system is by intentionally introducing faults. The goal? Find weak spots and fix them before they lead to real outages. It's not about creating chaos for chaos's sake. It's a careful, scientific method that uses controlled experiments to identify and prevent problems before they happen. This approach fundamentally changes how companies think about system reliability. The old way was reactive: wait for something to fail, then fix it. That's expensive in terms of downtime, reputation damage, and wasted resources. Chaos Engineering flips this around. It's proactive. You're preventing failures instead of just recovering from them. This shift represents a more mature engineering culture where prevention is valued over firefighting.
What Exactly is Chaos Engineering?
Chaos Engineering helps you understand how failures happen in complex systems and gives you practical ways to prevent or reduce them. At its core, it's about running controlled experiments to uncover hidden weaknesses. Think of it like a vaccination for your software: you're introducing a small, controlled problem to build immunity against bigger, real-world disasters.
Here's the key difference between Chaos Engineering and traditional testing. Traditional testing verifies that your system works as expected. It's checking known features and confirming requirements. You're testing what you know. Chaos Engineering, on the other hand, looks for unknown weaknesses before they become problems. It's proactive, not reactive.
This shift matters more than you might think. Modern systems run in unpredictable environments with countless variables. Relying only on traditional testing leaves blind spots. You won't know how robust your system really is until it's under real pressure. By intentionally introducing controlled failures, you'll discover which parts of your system are solid and which need work.
This organized approach helps companies build stronger products, which directly impacts their bottom line and customer satisfaction. But there's more to it than just technical benefits. When systems aren't resilient enough, the hidden costs add up fast. You're not just losing money during outages. You're also slowly losing customer trust, burning out your operations team with constant incident response, exhausting developers who are always firefighting, and missing opportunities for innovation because your resources are tied up maintaining basic stability. Chaos Engineering tackles these problems early, creating a more sustainable and productive engineering environment.
Why is System Resiliency So Crucial Today?
When we talk about resiliency in software, we're talking about your system's ability to keep working even when things go wrong. It might be running in a degraded state, but it's still keeping core functions alive. More importantly, it's about how quickly you can get back to normal operations. Real resilience means you can anticipate problems, absorb the impact, adapt to changes, and recover quickly from whatever the environment throws at you.
In today's hyper-connected world, even brief outages have massive consequences. Think about e-commerce during Black Friday, banking systems processing millions of transactions, or healthcare systems managing patient data. When these systems fail, the damage is immediate and severe. You're looking at revenue losses, brand damage, frustrated customers, and potential regulatory penalties. The Digital Operational Resilience Act (DORA), for example, requires regular resiliency testing to identify weaknesses. This regulatory pressure isn't going away. Building resilient systems isn't optional anymore. It's essential for protecting your business, your customers, and your reputation.
The strategic importance of resilience goes way beyond technical features. It's become a competitive differentiator and, increasingly, a regulatory requirement. System uptime and reliability directly drive customer satisfaction, revenue, and compliance. Companies that proactively invest in resilience using practices like Chaos Engineering are better positioned to meet strict regulations and outpace competitors by offering more stable, trustworthy services. And the hidden costs of fragility described above (eroded trust, constant incident response, developer burnout, stalled innovation) compound over time; proactively identifying and fixing those weaknesses substantially reduces these long-term, often unseen, expenses and contributes to a more sustainable, productive engineering organization.
What are the Guiding Principles of Chaos Engineering?
Chaos Engineering isn't about randomly breaking things and seeing what happens. It's a disciplined, scientific approach to understanding how systems behave under stress. These core principles guide how you design and run experiments, ensuring you get valuable insights instead of causing accidental damage.
- Build a Hypothesis Around Steady State Behavior: Before introducing any disruption, you need to understand what normal looks like. Your steady state is the system's baseline behavior, measured by key metrics like throughput, error rates, and response times. Once you've established this baseline, you form a hypothesis about how the system should behave when you introduce a specific fault. For example: "Even if our payment microservice fails, users can still browse products and add items to their cart." This baseline is crucial for measuring the actual impact of your chaos experiments.
- Mimic Real-World Problems: Your experiments should simulate actual failures that happen in production environments. This ensures your insights are practical and actionable. Real-world problems include server crashes, network latency, database slowdowns, sudden traffic spikes, or third-party API failures. The more realistic your simulations, the more valuable your learnings.
- Test in Production (or Production-like Environments): Systems behave differently under real load with real traffic patterns. For the most accurate results, you should run experiments in production or in environments that closely mirror it. Yes, this sounds risky, but you'll do it with strict safety controls to limit the blast radius. Start small in staging environments to build confidence, then gradually move to production with tight monitoring. The blast radius concept is critical here. By carefully controlling the scope of your experiments, you can isolate variables, observe specific impacts, and learn how your system reacts to particular faults without causing widespread damage.
- Automate Your Chaos Tests: Running experiments manually is time-consuming and doesn't scale. Automation ensures tests run consistently and reliably. The best approach? Integrate chaos testing directly into your CI/CD pipeline. This way, you're catching problems early during development and deployment, not after they reach production. (A minimal sketch of such an automated check follows this list.)
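To make the hypothesis and automation principles concrete, here is a minimal sketch of a steady-state check that could run as a CI/CD step. The metrics endpoint, the JSON shape it returns, and the thresholds are assumptions for illustration only; in practice you would query your own observability stack (Prometheus, CloudWatch, Datadog, and so on) and trigger the fault with whichever chaos tool you use.

```python
"""Minimal sketch of a steady-state hypothesis check, suitable for a CI/CD step.

Assumes a hypothetical metrics endpoint (METRICS_URL) returning JSON like
{"error_rate": 0.004, "p99_latency_ms": 180}; swap in your own observability API.
"""
import json
import sys
import time
import urllib.request

METRICS_URL = "https://metrics.example.com/api/checkout"  # hypothetical endpoint

# Hypothesis: even while the experiment runs, error rate stays under 1%
# and p99 latency stays under 500 ms.
MAX_ERROR_RATE = 0.01
MAX_P99_LATENCY_MS = 500


def fetch_metrics() -> dict:
    """Fetch current service metrics as a dict."""
    with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
        return json.load(resp)


def steady_state_ok(metrics: dict) -> bool:
    """Return True if the system is within its hypothesised steady state."""
    return (metrics["error_rate"] <= MAX_ERROR_RATE
            and metrics["p99_latency_ms"] <= MAX_P99_LATENCY_MS)


def main() -> int:
    # 1. Verify the steady state *before* injecting any fault.
    if not steady_state_ok(fetch_metrics()):
        print("Baseline already unhealthy; aborting experiment.")
        return 1

    # 2. Start the fault here (e.g. trigger a Gremlin or FIS experiment).
    #    Omitted: the call is tool-specific.

    # 3. Re-check the hypothesis while the fault is active.
    for _ in range(12):            # observe for roughly one minute
        if not steady_state_ok(fetch_metrics()):
            print("Hypothesis violated: steady state lost during the experiment.")
            return 1               # non-zero exit fails the CI job
        time.sleep(5)

    print("Hypothesis held: steady state maintained under fault.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```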
One thing you absolutely can't skip: strong monitoring and observability. Without good monitoring, you can't define your steady state, measure the impact of experiments, or detect when something goes wrong. Investing in comprehensive monitoring and logging is both a prerequisite and an ongoing requirement for successful Chaos Engineering.
There's another benefit that's often overlooked. Chaos Engineering makes your team operationally stronger. By regularly simulating failures, teams practice incident response, test their alerting systems, and sharpen their debugging skills. This regular exposure to stress builds confidence and creates a more mature, resilient engineering team. You're not just finding system bugs. You're also uncovering and fixing gaps in your operational processes and team readiness.
Chaos Monkey: The Original Primate of Production Chaos
Chaos Monkey was born out of necessity at Netflix. As they pioneered cloud-native architecture, they faced a critical challenge: keeping their streaming service available while running on thousands of cloud servers. Their solution? Build a tool that randomly shuts down instances in production. Sounds crazy, right? But this seemingly destructive action had a powerful purpose. It forced engineers to design services that could handle instance failures from day one. The tool exposed engineers to failures frequently, encouraging them to build naturally resilient services.
Chaos Monkey's job is simple: randomly terminate virtual machine instances and containers during specific time windows. While its main action is random termination, you can customize its behavior through configuration files and integration with Spinnaker (Netflix's continuous delivery platform). It includes an outage checker that prevents it from running during existing incidents, so it won't make ongoing problems worse.
Here's a basic example of a Chaos Monkey configuration:

```yaml
# Chaos Monkey configuration for AWS
accounts:
  - name: production
    enabled: true

# Only run during business hours (PST)
schedule:
  enabled: true
  startHour: 9
  endHour: 17
  timezone: America/Los_Angeles

terminationStrategy:
  # Randomly terminate instances
  randomSelection: true
  # Probability of termination (10% chance)
  probability: 0.1
  # Maximum number of instances to terminate per run
  maxTerminationsPerDay: 5

# Exclude critical services
exceptions:
  - serviceName: auth-service
  - serviceName: payment-processor

# Notification settings
notifications:
  email:
    - devops@example.com
  slack:
    channel: '#chaos-engineering'
```

Despite its historical significance, Chaos Monkey has some serious limitations. Its biggest constraint is limited attack types. It only does one thing: random instance termination. This seriously limits the kinds of failure scenarios you can simulate. The unpredictable, completely random nature means you have limited control over the blast radius. This lack of precision can cause more harm than good if your system isn't ready.
Chaos Monkey also has major dependencies. It needs Spinnaker and MySQL for full integration. A big downside? Netflix no longer actively develops or maintains it, making it less practical for teams looking for ongoing support and new features. It also lacks built-in recovery or rollback mechanisms. Any fault tolerance or outage detection requires custom code.
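Because Chaos Monkey ships no such safeguards, teams typically bolt on their own pre-flight checks. The sketch below shows one possible shape for such a guard; the health and incident endpoints are placeholders rather than a real API, and a real check would query whatever alerting or incident-management system you actually operate.

```python
"""Rough sketch of the kind of custom outage check Chaos Monkey leaves to you.

Run it before an automated termination: it skips the chaos run if the service
looks unhealthy or an incident is already open. The endpoints below are
placeholders, not a real API.
"""
import sys
import urllib.error
import urllib.request

HEALTH_URL = "https://myservice.example.com/healthz"            # hypothetical
INCIDENT_URL = "https://alerts.example.com/api/open-incidents"  # hypothetical


def service_healthy() -> bool:
    """Treat any non-200 response (or an unreachable endpoint) as unhealthy."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=3) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False


def incident_in_progress() -> bool:
    """Check a placeholder alerting API for open incidents."""
    try:
        with urllib.request.urlopen(INCIDENT_URL, timeout=3) as resp:
            return len(resp.read()) > 2   # anything beyond "[]" counts as open
    except (urllib.error.URLError, TimeoutError):
        return True  # if the alerting system is unreachable, play it safe


if __name__ == "__main__":
    if not service_healthy() or incident_in_progress():
        print("Skipping chaos run: system unhealthy or incident already open.")
        sys.exit(1)
    print("Preconditions met: safe to let the termination proceed.")
```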
Chaos Monkey works with environments that Spinnaker supports: AWS, Google Compute Engine (GCE), Azure, and Kubernetes. It's been specifically tested with AWS, GCE, and Kubernetes. If your applications are managed through Spinnaker, you can set up Chaos Monkey to terminate instances within these cloud and container platforms.
The evolution from Chaos Monkey to modern tools shows a shift from forced resilience to controlled learning. Chaos Monkey's original idea was revolutionary in its simplicity. By randomly terminating instances, it forced engineers to build resilience into their services. But its limitations (just one random fault type and no ongoing maintenance) show that while random disruption can uncover weaknesses, lasting resilience needs a more controlled, varied, and analytical approach. The industry has moved from a raw "break it to see what happens" mindset to a more mature, controlled, and data-driven experimental science.
Gremlin: The Modern Platform for Controlled Chaos
Gremlin stands out as a leading cloud-native platform built specifically to make Chaos Engineering safe, easy, and secure. Its main goal? Improve system uptime, validate reliability, and help companies build a strong reliability culture. Unlike Chaos Monkey's random approach, Gremlin gives you precise control over fault injection.
Gremlin provides an extensive fault injection library that lets you simulate real-world failures across different system layers:
Resource Attacks: These test how your system handles resource constraints.
- CPU attacks stress test high-demand scenarios
- Memory attacks check for leaks or resource-heavy applications
- I/O attacks create read/write pressure to test storage performance
- GPU attacks stress AI, LLM, and video encoding workloads
Here's a simple example of running a CPU attack with Gremlin:

```bash
#!/bin/bash

# Attack all cores on a specific container for 60 seconds
gremlin attack-container \
  --container-id abc123 \
  --type cpu \
  --cores 0 \
  --length 60
```

Network Attacks: These simulate network problems.
- Blackhole attacks drop all network traffic to simulate complete outages
- Latency attacks inject delays to test responsiveness under slow networks
- Packet Loss attacks drop or corrupt traffic to mimic poor network conditions
- DNS attacks block DNS access to test fallback mechanisms
Here's how to inject network latency:

```bash
#!/bin/bash

# Add 100ms latency to all egress traffic for 2 minutes
gremlin attack-container \
  --container-id abc123 \
  --type latency \
  --delay 100 \
  --length 120
```

State Attacks: These test application and system state changes.
- Process Killer stops specific processes to simulate application crashes
- Shutdown attacks restart the host OS to test host failure recovery
- Time Travel attacks change system time to test for clock drift or certificate expiry
- Certificate checks verify certificate chains for expiration
Here's a process killer example:

```bash
#!/bin/bash

# Kill all nginx processes and repeat every 5 seconds for 2 minutes
gremlin attack-container \
  --container-id abc123 \
  --type process_killer \
  --process nginx \
  --interval 5 \
  --length 120
```

You can combine these attack types to create hundreds of pre-built and custom scenarios for very targeted, complex simulations.
Gremlin's platform supports multi-environment deployments. It's truly cloud-native and runs almost anywhere: all major public clouds (AWS, Azure, GCP), Linux, Windows, containerized environments like Kubernetes, and even on-premise with Gremlin Private Edition. This wide compatibility makes it versatile for companies with diverse infrastructure.
Beyond fault injection, Gremlin offers features that enhance your Chaos Engineering practice. Its GameDay Manager helps organize reliability events, cutting down prep and execution time. The platform automatically analyzes and stores experiment results, so teams can review outcomes and turn data into real improvements. It integrates with Jira for efficient action item tracking. Gremlin also provides reliability scoring and continuous risk monitoring, helping you define, measure, and track service reliability across your organization. It can automatically discover and test system dependencies, giving you deeper insights into system weaknesses.
Gremlin's extensive feature set (from diverse fault types and multi-cloud support to GameDay management and Jira integration) shows that Chaos Engineering has matured into a complete managed service. This evolution reflects the growing complexity of modern systems and the increasing need for advanced reliability management.
What makes Gremlin particularly valuable is its focus on safety and control. By positioning itself as making Chaos Engineering safe, easy, and secure, Gremlin directly addresses common concerns about the practice. Many teams worry about causing more harm than good. By offering precise control over fault injection, automatic halt and rollback features, and a user-friendly interface, Gremlin transforms a potentially risky practice into a controlled, value-generating activity. This changes how willing businesses are to adopt these methods, especially in sensitive production environments.
Chaos Monkey vs. Gremlin: Which Tool is Right for You?
Choosing between Chaos Monkey and Gremlin depends on your company's specific needs, budget, and chaos engineering maturity. While both tools aim to make systems more resilient, their approaches and capabilities are quite different.
Here's a side-by-side comparison:
| Feature/Aspect | Chaos Monkey | Gremlin |
|---|---|---|
| Origin & Maintenance | Developed by Netflix, historically significant. No longer actively developed or maintained. | Commercial product, actively developed and maintained. |
| Control Over Faults | Injects faults randomly, giving a more realistic test environment for broad resilience. Limited control over blast radius and execution. | Offers precise control over fault injection, allowing targeted experiments. Provides automatic halt and rollback mechanisms. |
| Types of Faults | Primarily one attack type: random instance termination (shutdown). | Wide range of fault types: CPU, Memory, Disk, I/O, Blackhole, Latency, Packet Loss, Process Killer, Shutdown, DNS, Time Travel, Certificate Expiry, GPU. |
| Cloud/Environment Support | Tightly integrated with Netflix OSS and AWS. Works with AWS, GCE, and Kubernetes via Spinnaker. Limited multi-cloud or hybrid cloud support. | Cloud-native platform supporting all public clouds (AWS, Azure, GCP), Linux, Windows, Kubernetes, and on-premise environments. |
| Ease of Use | Easy to set up and use for its specific function. Requires Spinnaker and MySQL for full integration. | User-friendly web interface and CLI. May require more initial setup and configuration than Chaos Monkey. |
| Cost | Free and open-source. | Requires payment for advanced features and enterprise use. |
| Reporting & Analytics | No built-in detailed reporting or analysis; requires custom code for outage detection and fault tolerance. | Offers rich analytics and visualization tools, automatic analysis, and storage of results. Integrates with Jira for action item tracking. |
| Safety & Risk Mitigation | May cause system downtime and false positives. High risk if unprepared due to randomness. | Designed for safety and security. Allows starting small and scaling experiments, with features to confidently recreate incidents. |
| Additional Features | Basic functionality for instance termination. | GameDay Manager, scenario sharing, scheduled scenarios, reliability scoring, dependency discovery, failure flags, private edition. |
This table gives you a clear, easy-to-scan comparison for quickly grasping the main differences and trade-offs.
When choosing between these tools, consider your priorities. If cost is your main concern and you only need basic random instance termination, Chaos Monkey (or similar open-source tools like Pumba for Docker/Kubernetes) could be a good starting point. However, if you need precise control over fault injection, diverse fault types, full multi-cloud support, advanced features like GameDay management, and detailed analytics, Gremlin is the stronger choice for enterprise-level reliability efforts.
Ultimately, your decision should factor in your team's chaos engineering maturity, available budget, and the complexity of the systems you're testing. The clear difference between Chaos Monkey and Gremlin reflects how Chaos Engineering has become a commercial and professional discipline. What started as an internal Netflix experiment has grown into a dedicated industry with advanced platforms, showing the growing recognition that reliability is a core business function.
How Do We Inject Chaos in AWS Environments?
In AWS environments, Chaos Engineering follows a structured framework designed to find resilience gaps in your workloads. It's not about randomly breaking production systems. It's a valuable tool for understanding how your workloads behave under simulated failure conditions.
The most common approach uses the AWS Fault Injection Simulator (FIS), a managed service built specifically for running chaos engineering experiments on AWS services. Here's the typical workflow:
1. Define Steady State: First, define the normal operating condition for your systems. This baseline lets you measure what happens when you inject chaos. Collect and analyze data during stable conditions to set performance baselines and identify normal behavior patterns.
2. Design Chaos Tests: Plan controlled chaos experiments to simulate different failure scenarios within that steady state. Identify specific components or services to target and determine the experiment's scope and severity. AWS FIS is perfect for this.
3. Execute Experiments: Set up the necessary infrastructure for running tests, including test environments, monitoring, and logging systems. Then run your defined experiments.
4. Analyze and Fix: During experiments, continuously monitor system behavior. Collect and analyze data to assess performance, stability, and resilience impacts, comparing against your baselines.
5. Iterate and Improve: Repeat these steps periodically to ensure your system remains resilient over time.
Here's a practical example using AWS FIS to terminate EC2 instances:

```json
{
  "description": "Test application resilience by terminating EC2 instances",
  "targets": {
    "ec2-instances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {
        "Environment": "staging",
        "ChaosReady": "true"
      },
      "selectionMode": "COUNT(2)"
    }
  },
  "actions": {
    "terminate-instances": {
      "actionId": "aws:ec2:terminate-instances",
      "parameters": {},
      "targets": {
        "Instances": "ec2-instances"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate"
    }
  ],
  "roleArn": "arn:aws:iam::123456789012:role/FISExperimentRole",
  "tags": {
    "Name": "EC2-Termination-Test",
    "Team": "Platform"
  }
}
```

To run this experiment:
```bash
#!/bin/bash

# Create the experiment template
TEMPLATE_ID=$(aws fis create-experiment-template \
  --cli-input-json file://aws-fis-ec2-termination.json \
  --query 'experimentTemplate.id' \
  --output text)

echo "Created experiment template: $TEMPLATE_ID"

# Start the experiment
EXPERIMENT_ID=$(aws fis start-experiment \
  --experiment-template-id "$TEMPLATE_ID" \
  --query 'experiment.id' \
  --output text)

echo "Started experiment: $EXPERIMENT_ID"

# Monitor experiment status
aws fis get-experiment \
  --id "$EXPERIMENT_ID" \
  --query 'experiment.state.status' \
  --output text
```

Examples of Chaos Experiments in AWS Services:
- Amazon Aurora: Simulate network latency between Aurora instances, introduce failures in replica instances, or test increased load on read/write capacity.
- Amazon Kinesis: Simulate higher data ingestion rates to test stream scaling.
- Amazon EC2: Test Spot Instance interruptions to verify application handling of sudden terminations (see the sketch after this list).
- Amazon DynamoDB: Deny traffic to/from regional endpoints to test failover mechanisms.
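As an illustration of the Spot Instance case above, the boto3 sketch below creates and starts an FIS experiment that sends interruption notices to tagged Spot Instances. The action ID, parameter names, target key, tag filter, role ARN, and account IDs are assumptions for illustration; verify them against the current FIS documentation before running anything like this.

```python
"""Hedged sketch: creating and starting an AWS FIS experiment that sends Spot
Instance interruption notices via boto3. Action ID, parameter names, target
keys, tag filters, and ARNs are assumptions to check against the FIS docs.
"""
import uuid

import boto3

fis = boto3.client("fis", region_name="us-east-1")

template = fis.create_experiment_template(
    clientToken=str(uuid.uuid4()),
    description="Interrupt one tagged Spot Instance to test interruption handling",
    roleArn="arn:aws:iam::123456789012:role/FISExperimentRole",  # placeholder account/role
    targets={
        "spot-instances": {
            "resourceType": "aws:ec2:spot-instance",       # assumed resource type
            "resourceTags": {"ChaosReady": "true"},          # assumed tagging scheme
            "selectionMode": "COUNT(1)",                     # keep the blast radius small
        }
    },
    actions={
        "send-interruption": {
            "actionId": "aws:ec2:send-spot-instance-interruptions",  # assumed action ID
            "parameters": {"durationBeforeInterruption": "PT2M"},     # 2-minute warning
            "targets": {"SpotInstances": "spot-instances"},
        }
    },
    stopConditions=[
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate",
        }
    ],
)

experiment = fis.start_experiment(
    clientToken=str(uuid.uuid4()),
    experimentTemplateId=template["experimentTemplate"]["id"],
)
print("Started experiment:", experiment["experiment"]["id"])
```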
A unique challenge arises with serverless environments like AWS Lambda. You don't control or access the underlying infrastructure, making traditional fault injection difficult. Here are two approaches:
1. Using a Library in Lambda Code:
You can inject faults directly within the Lambda function using a library:
```javascript
import {DynamoDBClient, PutItemCommand} from '@aws-sdk/client-dynamodb';
import {S3Client, GetObjectCommand} from '@aws-sdk/client-s3';

// Initialize AWS SDK v3 clients
const dynamoClient = new DynamoDBClient({region: process.env.AWS_REGION});
const s3Client = new S3Client({region: process.env.AWS_REGION});

/**
 * Chaos injection library for Lambda functions
 * Supports error injection, latency injection, and custom failure scenarios
 */
class ChaosInjector {
  constructor(config = {}) {
    this.errorRate = parseFloat(config.errorRate) || 0;
    this.latencyMs = parseInt(config.latencyMs, 10) || 0;
    this.failureTypes = config.failureTypes || ['generic_error'];
    this.enabled = config.enabled !== false;
  }

  /**
   * Inject artificial latency
   */
  async injectLatency() {
    if (this.latencyMs > 0) {
      console.log(`[Chaos] Injecting ${this.latencyMs}ms latency`);
      await new Promise((resolve) => setTimeout(resolve, this.latencyMs));
    }
  }

  /**
   * Inject random errors based on error rate
   */
  async injectError() {
    if (Math.random() < this.errorRate) {
      const failureType =
        this.failureTypes[Math.floor(Math.random() * this.failureTypes.length)];

      console.error(`[Chaos] Injecting failure: ${failureType}`);

      switch (failureType) {
        case 'timeout_error':
          throw new Error('Chaos: Simulated timeout error');
        case 'throttle_error': {
          const error = new Error('Chaos: Simulated throttling');
          error.code = 'ThrottlingException';
          throw error;
        }
        case 'service_unavailable': {
          const unavailableError = new Error('Chaos: Service unavailable');
          unavailableError.statusCode = 503;
          throw unavailableError;
        }
        default:
          throw new Error('Chaos: Simulated generic failure');
      }
    }
  }

  /**
   * Execute chaos injection
   */
  async inject() {
    if (!this.enabled) {
      return;
    }

    await this.injectLatency();
    await this.injectError();
  }
}

// Initialize chaos injector from environment variables
const chaos = new ChaosInjector({
  errorRate: process.env.CHAOS_ERROR_RATE || 0,
  latencyMs: process.env.CHAOS_LATENCY_MS || 0,
  failureTypes: process.env.CHAOS_FAILURE_TYPES?.split(',') || ['generic_error'],
  enabled: process.env.CHAOS_ENABLED !== 'false',
});

/**
 * Lambda handler with chaos engineering
 */
export const handler = async (event, context) => {
  console.log('Processing request:', {requestId: context.awsRequestId});

  try {
    // Inject chaos before processing
    await chaos.inject();

    // Your actual business logic
    const result = await processRequest(event);

    return {
      statusCode: 200,
      headers: {
        'Content-Type': 'application/json',
        'X-Request-Id': context.awsRequestId,
      },
      body: JSON.stringify(result),
    };
  } catch (error) {
    console.error('Error processing request:', {
      error: error.message,
      stack: error.stack,
      requestId: context.awsRequestId,
    });

    return {
      statusCode: error.statusCode || 500,
      headers: {
        'Content-Type': 'application/json',
        'X-Request-Id': context.awsRequestId,
      },
      body: JSON.stringify({
        error: error.message,
        requestId: context.awsRequestId,
      }),
    };
  }
};

/**
 * Example business logic using AWS SDK v3
 */
async function processRequest(event) {
  const {userId, action} = JSON.parse(event.body || '{}');

  // Example: Write to DynamoDB using AWS SDK v3
  if (action === 'save') {
    const command = new PutItemCommand({
      TableName: process.env.TABLE_NAME,
      Item: {
        userId: {S: userId},
        timestamp: {N: Date.now().toString()},
        data: {S: JSON.stringify(event.body)},
      },
    });

    await dynamoClient.send(command);
  }

  // Example: Read from S3 using AWS SDK v3
  if (action === 'fetch') {
    const command = new GetObjectCommand({
      Bucket: process.env.BUCKET_NAME,
      Key: `users/${userId}/data.json`,
    });

    const response = await s3Client.send(command);
    const data = await response.Body.transformToString();
    return {data: JSON.parse(data)};
  }

  return {message: 'Success', userId, action};
}
```

2. Using a Lambda Extension:
You can deploy a Lambda layer that injects failures without changing your main function code:
```python
#!/usr/bin/env python3
"""Lambda Extension for Chaos Engineering

This extension intercepts Lambda invocations and injects controlled failures
to test system resilience without modifying application code.
"""

import os
import sys
import json
import random
import time
import signal
import requests
from typing import Dict, Optional, Tuple
from datetime import datetime

# Lambda Extensions API endpoint
EXTENSION_API = f"http://{os.getenv('AWS_LAMBDA_RUNTIME_API')}/2020-01-01/extension"


class ChaosExtension:
    """Chaos injection engine for Lambda functions"""

    def __init__(self):
        self.extension_id: Optional[str] = None
        self.error_rate = float(os.getenv('CHAOS_ERROR_RATE', '0.0'))
        self.latency_ms = int(os.getenv('CHAOS_LATENCY_MS', '0'))
        self.max_latency_ms = int(os.getenv('CHAOS_MAX_LATENCY_MS', '5000'))
        self.enabled = os.getenv('CHAOS_ENABLED', 'true').lower() == 'true'

        try:
            self.failure_types = json.loads(
                os.getenv('CHAOS_FAILURE_TYPES', '["http_error"]')
            )
        except json.JSONDecodeError:
            self.failure_types = ['http_error']
            print('[chaos-extension] Warning: Invalid CHAOS_FAILURE_TYPES, using default')

        # Validate configuration
        self._validate_config()

    def _validate_config(self) -> None:
        """Validate chaos configuration parameters"""
        if not 0 <= self.error_rate <= 1:
            print(f'[chaos-extension] Warning: Invalid error_rate {self.error_rate}, clamping to [0,1]')
            self.error_rate = max(0, min(1, self.error_rate))

        if self.latency_ms < 0:
            print(f'[chaos-extension] Warning: Negative latency {self.latency_ms}ms, setting to 0')
            self.latency_ms = 0

        if self.latency_ms > self.max_latency_ms:
            print(f'[chaos-extension] Warning: Latency {self.latency_ms}ms exceeds max {self.max_latency_ms}ms')
            self.latency_ms = self.max_latency_ms

    def register(self) -> str:
        """Register extension with Lambda Extensions API"""
        try:
            response = requests.post(
                f'{EXTENSION_API}/register',
                json={'events': ['INVOKE', 'SHUTDOWN']},
                headers={'Lambda-Extension-Name': 'chaos-extension'},
                timeout=5
            )
            response.raise_for_status()
            self.extension_id = response.headers['Lambda-Extension-Identifier']
            print(f'[chaos-extension] Registered with ID: {self.extension_id}')
            return self.extension_id
        except Exception as e:
            print(f'[chaos-extension] Failed to register: {e}')
            sys.exit(1)

    def next_event(self) -> Dict:
        """Wait for next Lambda event"""
        try:
            response = requests.get(
                f'{EXTENSION_API}/event/next',
                headers={'Lambda-Extension-Identifier': self.extension_id},
                timeout=None
            )
            response.raise_for_status()
            return response.json()
        except Exception as e:
            print(f'[chaos-extension] Error getting next event: {e}')
            sys.exit(1)

    def should_inject_chaos(self) -> bool:
        """Determine if chaos should be injected for this invocation"""
        if not self.enabled:
            return False
        return random.random() < self.error_rate

    def inject_latency(self) -> None:
        """Inject artificial latency"""
        if self.latency_ms > 0:
            actual_latency = random.randint(self.latency_ms // 2, self.latency_ms)
            print(f'[chaos-extension] Injecting {actual_latency}ms latency')
            time.sleep(actual_latency / 1000.0)

    def inject_error(self) -> Optional[Tuple[int, Dict]]:
        """Inject simulated error based on failure type"""
        if not self.should_inject_chaos():
            return None

        failure_type = random.choice(self.failure_types)
        timestamp = datetime.utcnow().isoformat()

        print(f'[chaos-extension] Injecting failure: {failure_type} at {timestamp}')

        error_scenarios = {
            'http_error': (500, {
                'error': 'Chaos: Simulated HTTP 500 Internal Server Error',
                'type': 'InternalServerError',
                'timestamp': timestamp
            }),
            'timeout': (408, {
                'error': 'Chaos: Simulated request timeout',
                'type': 'RequestTimeout',
                'timestamp': timestamp
            }),
            'throttle': (429, {
                'error': 'Chaos: Simulated throttling',
                'type': 'ThrottlingException',
                'timestamp': timestamp
            }),
            'service_unavailable': (503, {
                'error': 'Chaos: Service temporarily unavailable',
                'type': 'ServiceUnavailable',
                'timestamp': timestamp
            }),
            'bad_gateway': (502, {
                'error': 'Chaos: Bad gateway response',
                'type': 'BadGateway',
                'timestamp': timestamp
            })
        }

        if failure_type == 'timeout':
            # Simulate timeout with actual delay
            timeout_duration = random.randint(1, 5)
            print(f'[chaos-extension] Simulating {timeout_duration}s timeout')
            time.sleep(timeout_duration)

        return error_scenarios.get(
            failure_type,
            (500, {'error': 'Chaos: Unknown failure type', 'timestamp': timestamp})
        )

    def process_invoke(self, event: Dict) -> None:
        """Process INVOKE event"""
        request_id = event.get('requestId', 'unknown')
        print(f'[chaos-extension] Processing invocation: {request_id}')

        # Inject latency before function execution
        self.inject_latency()

        # Check if error should be injected
        error = self.inject_error()
        if error:
            status_code, error_body = error
            print(f'[chaos-extension] Chaos injected: {status_code} - {error_body["error"]}')

    def run(self) -> None:
        """Main extension loop"""
        print('[chaos-extension] Starting chaos extension')
        print('[chaos-extension] Configuration:')
        print(f'  - Enabled: {self.enabled}')
        print(f'  - Error rate: {self.error_rate * 100:.1f}%')
        print(f'  - Latency: {self.latency_ms}ms (max: {self.max_latency_ms}ms)')
        print(f'  - Failure types: {self.failure_types}')

        # Register extension
        self.register()

        # Main event loop
        while True:
            event = self.next_event()
            event_type = event.get('eventType')

            if event_type == 'INVOKE':
                self.process_invoke(event)
            elif event_type == 'SHUTDOWN':
                print('[chaos-extension] Shutdown event received')
                break
            else:
                print(f'[chaos-extension] Unknown event type: {event_type}')


def signal_handler(signum, frame):
    """Handle shutdown signals gracefully"""
    print(f'[chaos-extension] Received signal {signum}, shutting down')
    sys.exit(0)


if __name__ == '__main__':
    # Register signal handlers
    signal.signal(signal.SIGTERM, signal_handler)
    signal.signal(signal.SIGINT, signal_handler)

    try:
        chaos = ChaosExtension()
        chaos.run()
    except Exception as e:
        print(f'[chaos-extension] Fatal error: {e}')
        sys.exit(1)
```