A failover test in Redis Cloud simulates a controlled disruption such as an endpoint migration, node failure, or cluster outage to confirm that your applications can reconnect, recover, and continue operating without data loss. These tests are a critical part of validating high-availability and disaster recovery, ensuring that applications meet recovery time and recovery point objectives.
This article explains the types of failover tests available in Redis Cloud, the prerequisites and checklist to prepare, the step-by-step process for scheduling and executing a test, what to expect during the test, and how to troubleshoot issues and apply best practices for application resiliency.
For more background on Redis Cloud HA and DR, see Develop highly available apps with Redis Cloud.
Quick Fix Table
| Problem | Likely Cause | Quick Fix |
|---|---|---|
| App doesn’t reconnect after failover | DNS caching or no retry logic | Disable long DNS TTLs; configure retries/timeouts |
| DB didn’t survive failover | Replication not enabled | Enable HA before testing |
| Failover took longer than expected | Node failure test or DR scenario | Confirm replication and persistence; test in staging first |
Failover Test Types in Redis Cloud
Redis Cloud supports several failover scenarios, which you arrange with your Technical Account Manager or with Redis Customer Support:
- Shard Failover
  - Goal: Simulate a shard migration from one node to another without rebinding the database endpoint.
  - Requirements: Available only for databases with replication enabled.
  - Expected Impact: Minimal added latency during the shard failover.
  - Duration: 30-minute window.
  - Note: Highly recommended for customers before go-live.
- Node Failure Test
  - Goal: Validate application behavior during a simulated node failure and recovery.
  - Requirements: Available only for databases with replication enabled.
  - Expected Impact: Minimal added latency during the failover process and temporary disconnections during the endpoint rebind. Properly configured clients experience a short disconnection followed by automatic reconnection.
  - Duration: 60-minute window.
  - Note: Highly recommended for customers before go-live.
- Full-Cluster Disaster Recovery Drill
  - Goal: Validate full cluster restoration from persistence files in a simulated disaster scenario.
  - Requirements: Available only for databases with persistence enabled.
  - Expected Impact: Complete outage during recovery; the process can last up to 3 hours.
  - Duration: 2-3 hours.
  - Note: To learn more about Redis Cloud backup and recovery procedures, see Backup and restore procedures.
Important: Failover testing is available only on Pro plans and must be scheduled with Redis Support at least 1 week in advance. A customer representative must be present in a chat room for all tests.
Pre-Test Checklist
| Check | Why It Matters | Action |
|---|---|---|
| Replication enabled | Non-replicated DBs cannot fail over | Enable HA replication |
| Persistence/backups enabled | Required for DR test recovery | Verify RDB/AOF and backups |
| Test environment | Reduces production risk | Prefer QA/staging first |
| Database details | Redis Support requires IDs & endpoints | Record DB names/IDs in advance |
| Schedule approval | Needed for coordination | Open Zendesk ticket 1+ week prior |
| Timezone clarity | Avoids scheduling errors | Always use UTC |
| Other DBs on same node | Node tests impact all co-resident DBs | Verify placement with Support |
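
The lead-time and UTC items in the checklist can be sanity-checked before you open the ticket. The following is a minimal sketch, assuming a hypothetical helper named `validate_test_window`; it is not part of any Redis tooling. The 7-day minimum reflects the scheduling requirement stated above, and the output format is only an illustration.

```python
from datetime import datetime, timedelta, timezone

# Minimum notice required before a failover test ("at least 1 week in advance").
MIN_LEAD_TIME = timedelta(days=7)


def validate_test_window(start, now=None):
    """Return the proposed start time as a UTC string, or raise ValueError.

    `start` must be a timezone-aware datetime. Hypothetical helper for
    illustration only.
    """
    if start.tzinfo is None or start.utcoffset() is None:
        raise ValueError("start must be timezone-aware; state the window in UTC")
    now = now or datetime.now(timezone.utc)
    start_utc = start.astimezone(timezone.utc)
    if start_utc - now < MIN_LEAD_TIME:
        raise ValueError("test window must be at least 1 week in the future")
    return start_utc.strftime("%Y-%m-%dT%H:%MZ")
```

Converting every proposed window to UTC in one place avoids ambiguity when the people coordinating the test work in different time zones.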
Schedule and Execute a Failover Test
- Request a Test
  - Open a Redis Support ticket. For support ticket instructions, visit the Redis Cloud Support Portal.
  - Provide DB IDs, environment, test type (shard failover, node failure, or DR drill), and UTC time windows.
- Coordinate Test
  - Redis Support confirms the schedule and shares an RLchat link ~30 minutes before the test.
  - Customer presence on RLchat is required throughout the test.
- Execute the Test
  - Shard failover test: Shards are failed over to another node.
  - Node test: The node is forcibly stopped; master shards and endpoints fail over to replicas.
  - DR drill: A cluster outage is simulated; recovery is from persistence/backup.
- Monitor During Test
  - Keep the workload running (continuous reads/writes).
  - Observe app logs for disconnects and reconnects.
  - Track recovery time for RTO compliance.
- Validate After Test
  - Redis Support confirms completion.
  - Verify that app traffic resumes and performance normalizes.
  - Collect logs/metrics as evidence.
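
The monitoring steps above (continuous workload, watching for disconnects, tracking recovery time) can be approximated with a small probe loop. This is a sketch under stated assumptions: `measure_outages` and its parameters are hypothetical names, and `probe` stands in for whatever read/write your application issues through its own Redis client. Any exception raised by `probe` is treated as "unavailable".

```python
import time


def measure_outages(probe, duration_s, interval_s=0.5,
                    clock=time.monotonic, sleep=time.sleep):
    """Call `probe()` repeatedly and return observed outage durations (seconds).

    An outage starts at the first failed probe and ends at the next
    successful one. `clock` and `sleep` are injectable for testing.
    """
    outages = []
    outage_start = None
    deadline = clock() + duration_s
    while clock() < deadline:
        try:
            probe()            # e.g. a write followed by a read of the same key
            succeeded = True
        except Exception:      # any client error counts as "unavailable" here
            succeeded = False
        now = clock()
        if not succeeded and outage_start is None:
            outage_start = now                    # outage begins
        elif succeeded and outage_start is not None:
            outages.append(now - outage_start)    # outage ends
            outage_start = None
        sleep(interval_s)
    if outage_start is not None:                  # still down at the end
        outages.append(clock() - outage_start)
    return outages
```

Comparing `max(outages, default=0)` against your RTO gives a simple pass/fail signal to attach to the test evidence. The measurement resolution is limited by `interval_s`, so choose an interval well below the RTO you are validating.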
What to Expect During the Test
The following outcomes apply to well-configured clients:
Shard failover: Elevated latency during the failover, which may last until the shard fails back to the original node.
Node failure test: Disruption of up to ~1 minute; may include a second failover when the node returns.
DR drill: Longer downtime; only databases with HA and persistence enabled will recover.
For client connection best practices, see Connect to a Redis Cloud database.
Troubleshooting
| Symptom | Cause | Resolution |
|---|---|---|
| App fails to reconnect | DNS cached too long | Reduce DNS TTL; configure the JVM DNS cache TTL |
| App hangs during failover | No retry/reconnect logic | Add retry with exponential backoff |
| Failover takes too long | Client-side timeouts too long | Use 1–2s connect/socket timeouts |
| Load balancer issues | Proxy or LB doesn’t refresh DNS | Align LB timeouts with client |
| DB lost after failover | Replication not enabled | Enable HA; non-replicated DBs cannot survive |
| Client errors persist | Old library bugs | Upgrade to the latest Lettuce/Jedis/Redisson release |
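
The "retry with exponential backoff" resolution in the table can be sketched as follows. This is an illustration only, not the behavior of any particular Redis client: `retry_with_backoff` and its parameters are hypothetical, and many client libraries ship their own retry configuration that you should prefer where available. The default `retry_on` tuple uses Python's built-in exception types; substitute the connection and timeout exceptions your client library actually raises.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay_s=0.1,
                       max_delay_s=2.0,
                       retry_on=(ConnectionError, TimeoutError),
                       sleep=time.sleep):
    """Run `operation()`, retrying transient failures with capped backoff.

    The delay ceiling doubles after each failed attempt up to `max_delay_s`;
    the actual wait is a random value below that ceiling (jitter), so many
    clients do not reconnect at the same instant after a failover.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            ceiling = min(max_delay_s, base_delay_s * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
```

Only errors listed in `retry_on` are retried; anything else propagates immediately, which keeps genuine application bugs from being masked as transient failover errors.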
Best Practices & Application Preparation
Use connection pooling for efficiency.
Implement retry + backoff for transient errors.
Prefer hostnames over static IPs for endpoints.
Set sensible timeouts (1–2s).
Enable monitoring with Redis Insight, Prometheus, or Grafana.
Test with live traffic (not idle connections).
Run regular failover/DR drills in staging and production.
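
Two of the practices above, preferring hostnames and setting short timeouts, can be illustrated at the socket level. This is a minimal sketch using only the Python standard library; `open_connection` is a hypothetical helper, and in practice your Redis client library exposes equivalent connect and socket timeout settings. The operating system's resolver may still cache DNS answers, so this does not replace a short DNS TTL.

```python
import socket


def open_connection(host, port, timeout_s=2.0):
    """Open a TCP connection to `host`, resolving the name on every call.

    Resolving at connect time, rather than storing an IP address at startup,
    lets a reconnect pick up the endpoint's new address after a failover.
    The timeout bounds the connect attempt and stays set on the returned
    socket, so later reads and writes are bounded as well.
    """
    return socket.create_connection((host, port), timeout=timeout_s)
```

Calling `open_connection` again from inside a retry loop after a disconnect triggers a fresh name lookup instead of reusing a stale address.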
Disaster Recovery
Replication (HA): Required for node failover.
Persistence: Required for DR drills; validate restore procedures.
For detailed recovery runbooks and DR drill checklists, see Disaster recovery and backup guidance.
Backups: Configure daily/hourly backups for point-in-time recovery.
DR planning: Include failback, rollback, and client-side validation in DR runbooks.