After simulating node, shard, or network failures in Redis Software, it’s critical to verify correct failover behavior, client reconnection, and system recovery. This guide covers diagnostic tools, common test issues, and post-test validation steps to ensure failover tests are accurate and actionable.
If you haven’t already run resiliency tests, start with:
→ Simulating Failures in Redis Software: A Resiliency Testing Guide
Diagnostic Tools
rladmin CLI
- rladmin status: check overall cluster state
- rladmin status shards: inspect shard-level status
- rladmin status issues_only: list only unresolved issues
- rladmin failover, migrate, bind, and endpoint_to_shards: simulate and validate recovery behavior
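The read-only status checks can be captured in one pass so that output from before and after a test can be compared. This is a minimal sketch, assuming it runs on a cluster node where rladmin is on the PATH; the names collect_status and run_command are illustrative helpers, not part of Redis Software.

```python
import subprocess

# Read-only rladmin invocations used for status checks.
STATUS_COMMANDS = [
    ["rladmin", "status"],
    ["rladmin", "status", "shards"],
    ["rladmin", "status", "issues_only"],
]


def run_command(argv):
    """Run one command and return its stdout as text (raises on non-zero exit)."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


def collect_status(run=run_command):
    """Return {command string: output} for each status command.

    The `run` parameter is injectable so the helper can be exercised
    without a cluster (hypothetical design choice for this sketch).
    """
    return {" ".join(argv): run(argv) for argv in STATUS_COMMANDS}
```

Calling collect_status() before the failure and again after recovery gives two dictionaries whose values can be saved or diffed.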
Log Files
- Located by default at: /var/opt/redislabs/log/
- Check for replication link events, failover actions, shard status transitions, and latency anomalies
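A quick way to narrow down the log review is to filter for lines that mention the events of interest. The sketch below assumes the log files end in .log and that the search terms shown appear in the messages; both are assumptions to adjust against your own logs, and scan_logs is an illustrative name.

```python
from pathlib import Path

# Assumed search terms; the exact wording in your log files may differ.
KEYWORDS = ("failover", "replication", "latency")


def scan_logs(log_dir="/var/opt/redislabs/log", keywords=KEYWORDS):
    """Yield (file name, line number, line) for lines containing any keyword."""
    lowered = [k.lower() for k in keywords]
    # The "*.log" pattern is an assumption about file naming.
    for path in sorted(Path(log_dir).glob("*.log")):
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if any(k in line.lower() for k in lowered):
                    yield path.name, number, line.rstrip("\n")
```

Iterating over scan_logs() and printing each tuple gives a compact list of candidate lines to read in context.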
Monitoring Tools
- RedisInsight: Visual dashboards and timeline view of failover events
- Prometheus/Grafana: Time series metrics on node health, replication lag, memory, and client connections
- Application logs: Capture connection errors, retry loops, or write failures
Benchmarking
- memtier_benchmark: Use to validate cluster performance before, during, and after failure simulation
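To make the before, during, and after runs comparable, it helps to build the memtier_benchmark command once and reuse it. This is a sketch under stated assumptions: the host, port, and load values are placeholders, and memtier_command is an illustrative helper; only commonly used memtier_benchmark options are passed.

```python
def memtier_command(host, port, seconds=60, threads=4, clients=50,
                    ratio="1:10", password=None):
    """Build an argv list for a fixed-duration memtier_benchmark run."""
    argv = [
        "memtier_benchmark",
        f"--server={host}",
        f"--port={port}",
        f"--test-time={seconds}",   # run for a fixed number of seconds
        f"--threads={threads}",
        f"--clients={clients}",     # connections per thread
        f"--ratio={ratio}",         # SET:GET ratio
    ]
    if password:
        argv.append(f"--authenticate={password}")
    return argv
```

The returned list can be passed to subprocess.run(). Running the identical command in each phase keeps throughput and latency figures directly comparable.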
Common Issues and Resolutions
Replica Doesn’t Promote to Master
- Cause: Replica and master were placed on the same failure domain
- Fix: Review shard placement policies (dense vs. sparse); relocate shards across multiple nodes
Data Unavailability After Test
- Check node state and Redis processes with ps or rladmin status
- Investigate rladmin status issues_only for unresolved replication or endpoint problems
- Verify clients connect via endpoint, not static IPs
Performance Drops During Recovery
- Ensure RAM usage stays under 80%
- Check for slow queries using SLOWLOG and monitor system metrics (CPU, swap, disk I/O)
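The 80% guideline can be checked programmatically on each node. The sketch below assumes the threshold refers to host RAM on Linux and reads /proc/meminfo-style text; if you track database memory limits instead, substitute those figures. The function name is illustrative.

```python
def memory_used_fraction(meminfo_text):
    """Return used RAM as a fraction of total, from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[name.strip()] = int(fields[0])  # values are reported in kB
    total = values["MemTotal"]
    available = values["MemAvailable"]
    return (total - available) / total
```

On a node, pass it the contents of /proc/meminfo and compare the result against 0.80 while the cluster is recovering.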
Client Fails to Reconnect Automatically
- Cause: Application does not support reconnect logic or uses incorrect endpoint
- Fix: Update client libraries to support DNS-based failover; test reconnection under failure
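Where the client library does not retry on its own, reconnect logic can be wrapped around each operation. This is a minimal sketch of retry with exponential backoff, not any particular library's API: call_with_retry and its parameters are illustrative, and the default exception types are Python built-ins, so pass your client library's own connection and timeout exception classes through retry_on.

```python
import time


def call_with_retry(operation, attempts=5, base_delay=0.2, max_delay=5.0,
                    retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(); on a connection-type error, wait and try again."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay = min(delay * 2, max_delay)  # double the wait, up to a cap
```

For DNS-based failover to take effect, the operation should open its connection by endpoint hostname on each attempt rather than reuse a cached IP address.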
Manual Recovery Is Needed
- If automatic failover fails, refer to the Cluster Recovery Guide
Post-Test Validation
Check Cluster Status
- Use rladmin status and ensure all nodes and shards are marked Optimal
Verify Endpoint Placement
- Use rladmin status database to confirm that endpoints are colocated with master shards when possible
Log Review
- Review Redis logs (/var/opt/redislabs/log/) for:
  - Replication state changes
  - CCS error recovery
  - Endpoint migrations
Analyze Monitoring Metrics
- RedisInsight/Prometheus should show consistent performance and client connection recovery after failover
Share Test Results
- For deeper analysis, use the Redis Admin Console to generate a support package containing logs, metrics, and configuration data. This package can be shared internally or with Redis for further review.