Redis Software is designed for high availability and fault tolerance. This guide provides engineers with structured steps to simulate and validate failure scenarios—including node crashes, shard termination, and network isolation—to ensure the cluster behaves as expected under stress. It covers cluster components, test prerequisites, detailed failure scenarios, and best practices for verifying automatic failover, shard promotion, client reconnection, and system recovery.
Cluster Components
Node
A physical or virtual machine that hosts Redis Software and one or more shards. It can serve as a master or replica node.
Shard
A Redis process storing and managing a subset of database data.
- Master shards handle live operations
- Replica shards enable high availability and failover
Database
Logical grouping of shards that serve client applications. Databases support replication and expose one or more connection endpoints.
Endpoint
A DNS-based address clients use to connect to a Redis database.
Format: redis-XXXXX.<cluster FQDN>:<port>
Use rladmin status databases (or rladmin status endpoints) to retrieve endpoint info.
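Before a test run, it can help to confirm that the endpoint string is well formed and resolves through DNS. The following is a minimal sketch, not part of Redis Software: the helper names parse_endpoint and resolve_endpoint are hypothetical, and the hostname in the usage note is a placeholder.

```python
import socket
from typing import List, Tuple


def parse_endpoint(endpoint: str) -> Tuple[str, int]:
    """Split 'host:port' into (host, port) and validate the port range."""
    host, sep, port_text = endpoint.rpartition(":")
    if not sep or not host:
        raise ValueError(f"expected host:port, got {endpoint!r}")
    port = int(port_text)  # raises ValueError if the port is not numeric
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return host, port


def resolve_endpoint(endpoint: str) -> List[str]:
    """Return the sorted, de-duplicated IP addresses the endpoint resolves to."""
    host, port = parse_endpoint(endpoint)
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})
```

For example, resolve_endpoint("redis-12000.cluster.example.com:12000") would return the addresses currently behind that (placeholder) endpoint. Running it before and after a failover shows whether the DNS answer changed.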
Cluster
A group of coordinated nodes that run Redis Software and manage data distribution, replication, failover, and scaling.
Prerequisites
Cluster Requirements
- At least 3 nodes (odd number for quorum)
- Replication enabled for all databases
- Master and replica shards placed on separate nodes (dense/sparse policy)
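The odd-node recommendation follows from majority arithmetic. As a rough illustration, assuming quorum means a strict majority of nodes (the function names here are hypothetical):

```python
def quorum_size(node_count: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if node_count < 1:
        raise ValueError("node_count must be positive")
    return node_count // 2 + 1


def tolerated_failures(node_count: int) -> int:
    """Number of nodes that can fail while a majority stays reachable."""
    return node_count - quorum_size(node_count)


for n in (3, 4, 5):
    # Prints: node count, quorum size, tolerated failures
    print(n, quorum_size(n), tolerated_failures(n))
```

Under that assumption, a 4-node cluster tolerates the same single failure as a 3-node cluster, which is why the extra even node adds no resilience.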
Client and Network Setup
- Clients must connect via DNS-based endpoints, not node IPs
- Client libraries should support retry and auto-reconnection
- DNS resolution must be verified across the cluster
- Avoid unsupported tools like VMware vMotion or live migration
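If a client library lacks built-in retry support, a thin wrapper can supply it. The sketch below is a generic illustration of exponential backoff with jitter. It is not tied to any particular Redis client, and the name call_with_backoff and its default values are assumptions chosen for the example.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation(), retrying on the given exceptions with jittered backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error to the caller
            # Double the delay ceiling on each attempt, capped at max_delay,
            # then sleep a random fraction of it to avoid synchronized retries.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

Pass your client library's own connection and timeout exception classes in retry_on, since they may not derive from Python's built-in ConnectionError. The sleep parameter exists so tests can run without real delays.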
Monitoring & Baseline Health
- Use rladmin status, Redis Insight, or Prometheus
- Capture pre-test metrics and shard location
- Ensure monitoring and alerting are functional
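One way to make the pre-test baseline useful is to record shard placement as data and diff it after each test. This is a minimal sketch: the Placement structure and the shard and node labels are hypothetical, and the values would be filled in by hand or by your own tooling from rladmin status shards output.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class Placement(NamedTuple):
    node: str  # node that hosts the shard
    role: str  # "master" or "replica"


def placement_changes(
    before: Dict[str, Placement], after: Dict[str, Placement]
) -> Dict[str, Tuple[Optional[Placement], Optional[Placement]]]:
    """Return {shard_id: (before, after)} for shards that moved, changed role,
    appeared, or disappeared. None marks a shard missing from that snapshot."""
    changes = {}
    for shard_id in sorted(before.keys() | after.keys()):
        old, new = before.get(shard_id), after.get(shard_id)
        if old != new:
            changes[shard_id] = (old, new)
    return changes


# Hypothetical snapshots taken before and after a master shard failure.
baseline = {
    "redis:1": Placement("node:1", "master"),
    "redis:2": Placement("node:2", "replica"),
}
post_test = {
    "redis:1": Placement("node:1", "replica"),
    "redis:2": Placement("node:2", "master"),
}
for shard, (old, new) in placement_changes(baseline, post_test).items():
    print(shard, old, "->", new)
```

An empty result means placement is unchanged. After a failover test, the expected result is a role swap like the one in this example.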
Backup and Environment Readiness
- Complete backups of critical data
- Run tests during a maintenance window or in a staging environment
Testing Scenarios and Methods
1. Master Shard Process Kill or Node Reboot
Goal: Test automatic promotion of replica when master shard fails.
Steps:
- Use kill -9 <redis-server PID> or reboot the node
- Validate failover via: rladmin status shards
- Measure downtime using app-side monitoring or CLI timestamps
- Confirm the replica shard is promoted to master and clients reconnect
Expected Behavior:
- Temporary replication or CCS errors
- Replica becomes new master
- Minimal client-facing disruption
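To measure downtime from app-side timestamps, one approach is to probe the endpoint at a fixed interval, log (timestamp, success) pairs, and compute the outage windows afterwards. The sketch below covers only that computation; the function names are hypothetical, and the probe itself (for example a PING through your client) is left to the test harness.

```python
from typing import Iterable, List, Tuple


def outage_windows(probes: Iterable[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Return (start, end) pairs, where start is the first failed probe and
    end is the next successful probe (or the last probe if still failing)."""
    windows = []
    outage_start = None
    last_seen = None
    for timestamp, ok in sorted(probes):
        last_seen = timestamp
        if not ok and outage_start is None:
            outage_start = timestamp  # outage begins
        elif ok and outage_start is not None:
            windows.append((outage_start, timestamp))  # outage ends
            outage_start = None
    if outage_start is not None and last_seen is not None:
        windows.append((outage_start, last_seen))  # never recovered in the log
    return windows


def longest_outage(probes: Iterable[Tuple[float, bool]]) -> float:
    """Duration in seconds of the longest observed outage, or 0.0 if none."""
    return max((end - start for start, end in outage_windows(probes)), default=0.0)
```

The result is only as precise as the probe interval: with one probe per second, the measured outage can differ from the real one by up to about a second at each edge.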
2. Replica Node Reboot or Process Kill
Goal: Confirm high availability is maintained when a replica fails.
Steps:
- Reboot replica node or stop its Redis process
- Monitor cluster for auto-rebalance
Expected Behavior:
- No service disruption
- Replica reappears after recovery
- Temporary replication warning may be logged
3. Endpoint Migration and Shard Movement
Goal: Simulate shard migration and endpoint reassignment.
Steps:
- Migrate shard: rladmin migrate shard <SHARD_ID> target_node <NODE_ID>
- Migrate endpoint (optional): rladmin migrate endpoint_to_shards
Expected Behavior:
- Redis resumes services on target node
- Application connectivity remains intact
4. Network Partition or Full Node Isolation
Goal: Simulate network failure and validate reconnection logic.
Steps:
- Use tools like iptables or supervisorctl to block traffic
- Observe monitoring tools and client behavior
Expected Behavior:
- Temporary degradation shown in monitoring
- Automatic retries and recovery
- Shards promoted or resynced as needed
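When using iptables for isolation, it helps to generate the block and unblock commands together so the rollback always mirrors the rules that were added. The following is a sketch under stated assumptions: IPv4 peers only, DROP rules on the INPUT and OUTPUT chains, and hypothetical helper names. It only builds the commands; review them before running anything as root on the node being isolated.

```python
import ipaddress
import shlex
from typing import List


def _rules(peer_ips: List[str]) -> List[List[str]]:
    """Build rule specs that drop traffic from and to each peer address."""
    rules = []
    for ip in peer_ips:
        ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
        rules.append(["INPUT", "-s", ip, "-j", "DROP"])
        rules.append(["OUTPUT", "-d", ip, "-j", "DROP"])
    return rules


def block_commands(peer_ips: List[str]) -> List[List[str]]:
    """Commands that append (-A) the isolation rules."""
    return [["iptables", "-A", *rule] for rule in _rules(peer_ips)]


def unblock_commands(peer_ips: List[str]) -> List[List[str]]:
    """Commands that delete (-D) the same rules to end the partition."""
    return [["iptables", "-D", *rule] for rule in _rules(peer_ips)]


# Dry run with placeholder peer addresses: print the commands for review.
for cmd in block_commands(["10.0.0.2", "10.0.0.3"]):
    print(shlex.join(cmd))
```

Listing only the other cluster nodes as peers keeps management access (for example SSH from a jump host) open, so the partition can be reverted with unblock_commands.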
5. Master Node Reboot Without Hosting Shards
Goal: Confirm that rebooting a non-shard-serving master node has no negative impact.
Expected Behavior:
- Temporary rladmin errors
- No effect on data or client sessions
Best Practices
- Always connect applications via DNS-based endpoints
- Distribute master and replica shards across separate nodes
- Ensure client libraries support retry/backoff logic
- Monitor with rladmin, Redis Insight, or custom dashboards
- Use memtier_benchmark to validate performance under test
- After each test:
  - Confirm database and endpoint locations
  - Check system logs and metrics
  - Document outcomes and update recovery procedures
- Revisit proxy policies (e.g., rladmin bind endpoint policy all-master-shards) if endpoint placement or routing changes
After completing your resiliency tests, make sure to validate cluster behavior and investigate logs.
→ See Redis Resiliency Test: How to Analyze Logs and Validate Failover for step-by-step diagnostics, CLI review, and monitoring insights.