Simulating Failures in Redis Software: A Resiliency Testing Guide

Redis Software is designed for high availability and fault tolerance. This guide gives engineers structured steps to simulate failure scenarios (node crashes, shard termination, and network isolation) and to validate that the cluster behaves as expected under stress. It covers cluster components, test prerequisites, detailed failure scenarios, and best practices for verifying automatic failover, shard promotion, client reconnection, and system recovery.

Cluster Components

Node
A physical or virtual machine that runs Redis Software and hosts one or more shards. A node can host master shards, replica shards, or both.

Shard
A Redis process storing and managing a subset of database data.

Master shards handle live operations
Replica shards enable high availability and failover

Database
Logical grouping of shards that serve client applications. Databases support replication and expose one or more connection endpoints.

Endpoint
A DNS-based address clients use to connect to a Redis database.
Format: redis-XXXXX.<cluster FQDN>:<port>
Use rladmin status endpoints (or rladmin status databases) to retrieve endpoint info.

Cluster
A group of coordinated nodes that run Redis Software and manage data distribution, replication, failover, and scaling.

Prerequisites

Cluster Requirements

At least 3 nodes (odd number for quorum)
Replication enabled for all databases
Master and replica shards placed on separate nodes (dense/sparse policy)

Client and Network Setup

Clients must connect via DNS-based endpoints, not node IPs
Client libraries should support retry and auto-reconnection
DNS resolution must be verified across the cluster
Avoid unsupported tools like VMware vMotion or live migration
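
The DNS check above can be scripted so that it runs from every client host before a test. The following is a minimal sketch using only the Python standard library; the function name resolve_endpoint is hypothetical, and the hostname and port in the usage note are placeholders, not values from a real cluster.

```python
import socket


def resolve_endpoint(hostname, port):
    """Return the sorted, de-duplicated IP addresses a hostname resolves to.

    Raises socket.gaierror if the name cannot be resolved.
    """
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})
```

For example, resolve_endpoint("redis-12000.cluster.example.com", 12000) (a placeholder endpoint) would return the addresses currently published for that database endpoint. Comparing the result before and after a failover shows whether the endpoint moved to another node.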

Monitoring & Baseline Health

Use rladmin status, Redis Insight, or Prometheus
Capture pre-test metrics and shard location
Ensure monitoring and alerting are functional
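
One way to capture the pre-test baseline is to save the output of a status command to a timestamped file, so that shard placement can be compared after each scenario. This is a sketch under assumptions: the helper name capture_snapshot, the output directory, and the file naming scheme are illustrative choices, not part of Redis Software.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def capture_snapshot(command, out_dir, label):
    """Run a status command and save its stdout to a UTC-timestamped file.

    command is an argument list such as ["rladmin", "status", "shards"].
    Returns the path of the file that was written.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # check=True raises CalledProcessError if the command exits non-zero.
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    out_path = Path(out_dir) / f"{label}-{stamp}.txt"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(result.stdout)
    return out_path
```

Run on a cluster node, capture_snapshot(["rladmin", "status", "shards"], "baseline", "shards") would store the current shard layout under a baseline directory.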

Backup and Environment Readiness

Complete backups of critical data
Run tests during a maintenance window or in a staging environment

Testing Scenarios and Methods

1. Master Shard Process Kill or Node Reboot

Goal: Test automatic promotion of replica when master shard fails.

Steps:

Use kill -9 <redis-server PID> or reboot the node
Validate failover via: rladmin status shards
Measure downtime using app-side monitoring or CLI timestamps
Confirm the replica shard is promoted to master and clients reconnect
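
Downtime can be estimated from the client side by probing the endpoint at a fixed interval and taking the longest run of failed probes. The sketch below uses only the standard library and assumes a database without a password or TLS, since it sends a plain inline PING; with authentication enabled, a client library would be needed instead. The function names are hypothetical.

```python
import socket
import time


def ping(host, port, timeout=1.0):
    """Return True if the server answers an inline PING with +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"PING\r\n")
            return sock.recv(16).startswith(b"+PONG")
    except OSError:  # refused, unreachable, reset, or timed out
        return False


def probe(host, port, duration, interval=0.2):
    """Collect (monotonic_timestamp, ok) samples for `duration` seconds."""
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), ping(host, port)))
        time.sleep(interval)
    return samples


def longest_outage(samples):
    """Longest span in seconds from a first failed probe to the next success.

    An outage still in progress at the last sample is measured up to
    that last sample.
    """
    longest = 0.0
    outage_start = None
    for ts, ok in samples:
        if not ok and outage_start is None:
            outage_start = ts
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None
    if outage_start is not None and samples:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest
```

Start probe() shortly before killing the shard process, then pass its samples to longest_outage(). The result is only as precise as the probe interval and timeout, so treat it as an upper-bound estimate rather than an exact failover time.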

Expected Behavior:

Temporary replication or CCS (Cluster Configuration Store) errors
Replica becomes new master
Minimal client-facing disruption
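
Promotion can also be checked programmatically by comparing shard roles before and after the failure. The following sketch assumes that rladmin status shards prints a whitespace-separated table whose header row includes ID, NODE, and ROLE columns; the sample text is illustrative and not captured from a real cluster, and the exact columns and role labels may differ between versions.

```python
def parse_shards(text):
    """Map shard ID -> (node, role) from a whitespace-separated status table."""
    rows = {}
    header = None
    for line in text.splitlines():
        cols = line.split()
        if header is None:
            # Skip everything until the header row is found.
            if {"ID", "NODE", "ROLE"} <= set(cols):
                header = cols
            continue
        if len(cols) < len(header):
            continue  # blank or incomplete line
        record = dict(zip(header, cols))
        rows[record["ID"]] = (record["NODE"], record["ROLE"])
    return rows


def promoted(before, after):
    """Shard IDs that were replicas before and are masters after."""
    replica_labels = ("slave", "replica")  # assumed labels; may vary by version
    return sorted(
        shard for shard, (_, role) in after.items()
        if role == "master"
        and before.get(shard, (None, None))[1] in replica_labels
    )


# Illustrative sample output (assumed layout, not from a real cluster).
BEFORE = """SHARDS:
DB:ID  NAME  ID       NODE    ROLE    SLOTS    USED_MEMORY  STATUS
db:1   db1   redis:1  node:1  master  0-16383  2.5MB        OK
db:1   db1   redis:2  node:2  slave   0-16383  2.5MB        OK
"""
AFTER = """SHARDS:
DB:ID  NAME  ID       NODE    ROLE    SLOTS    USED_MEMORY  STATUS
db:1   db1   redis:1  node:1  slave   0-16383  2.5MB        OK
db:1   db1   redis:2  node:2  master  0-16383  2.5MB        OK
"""

print(promoted(parse_shards(BEFORE), parse_shards(AFTER)))
```

Feeding the function two saved copies of the command output, one taken before the kill and one after, gives a quick pass/fail signal for the promotion check.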

2. Replica Node Reboot or Process Kill

Goal: Confirm high availability is maintained when a replica fails.

Steps:

Reboot replica node or stop its Redis process
Monitor cluster for auto-rebalance

Expected Behavior:

No service disruption
Replica reappears after recovery
Temporary replication warning may be logged

3. Endpoint Migration and Shard Movement

Goal: Simulate shard migration and endpoint reassignment.

Steps:

Migrate shard: rladmin migrate shard <SHARD_ID> target_node <NODE_ID>
Migrate endpoint (optional): rladmin migrate endpoint_to_shards

Expected Behavior:

Redis resumes services on target node
Application connectivity remains intact

4. Network Partition or Full Node Isolation

Goal: Simulate network failure and validate reconnection logic.

Steps:

Use iptables to block traffic to and from the node, or supervisorctl to stop Redis Software services on it
Observe monitoring tools and client behavior

Expected Behavior:

Temporary degradation shown in monitoring
Automatic retries and recovery
Shards promoted or resynced as needed
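
The reconnection logic under test usually amounts to retrying a failed operation with increasing delays. Many client libraries provide this built in; the sketch below only illustrates the general technique (exponential backoff with full jitter) in plain Python. The function name and its default values are assumptions chosen for illustration.

```python
import random
import time


def call_with_retry(operation, attempts=5, base_delay=0.1, max_delay=2.0,
                    retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying on connection errors with backoff.

    The delay cap doubles after each failure up to max_delay, and the
    actual wait is a random value between 0 and that cap (full jitter).
    The last failure is re-raised once attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

During a partition test, wrapping each client command in a helper like this shows whether the application rides through the failover or surfaces errors to users. The jitter keeps many clients from reconnecting at the same instant once the endpoint is reachable again.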

5. Master Node Reboot Without Hosting Shards

Goal: Confirm that rebooting the cluster's master node, when it hosts no shards, has no negative impact.

Expected Behavior:

Temporary rladmin errors
No effect on data or client sessions

Best Practices

Always connect applications via DNS-based endpoints
Distribute master and replica shards across separate nodes
Ensure client libraries support retry/backoff logic
Monitor with rladmin, Redis Insight, or custom dashboards
Use memtier_benchmark to validate performance under test
After each test:
- Confirm database and endpoint locations
- Check system logs and metrics
- Document outcomes and update recovery procedures
- Revisit proxy policies (e.g., rladmin bind endpoint <ENDPOINT_ID> policy all-master-shards) if endpoint placement or routing changes

After completing your resiliency tests, make sure to validate cluster behavior and investigate logs.
→ See Redis Resiliency Test: How to Analyze Logs and Validate Failover for step-by-step diagnostics, CLI review, and monitoring insights.

Additional Resources

rladmin CLI Reference

Redis Insight

memtier_benchmark
