Redis Resiliency Test: How to Analyze Logs and Validate Failover – Redis Knowledge Base

After simulating node, shard, or network failures in Redis Software, it’s critical to verify correct failover behavior, client reconnection, and system recovery. This guide covers diagnostic tools, common test issues, and post-test validation steps to ensure failover tests are accurate and actionable.

If you haven’t already run resiliency tests, start with:
→ Simulating Failures in Redis Software: A Resiliency Testing Guide

Diagnostic Tools

rladmin CLI

rladmin status: check overall cluster state
rladmin status shards: inspect shard-level status
rladmin status issues_only: list only the components with unresolved issues
rladmin failover, rladmin migrate (including its endpoint_to_shards option), and rladmin bind: simulate failover and validate recovery behavior
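
One way to make these checks repeatable is to capture the output of the read-only status commands at each phase of a test. The following is a minimal sketch, assuming it runs on a cluster node where rladmin is on the PATH and the current user is allowed to run it. The function names and the snapshot layout are illustrative and are not part of Redis Software.

```python
import subprocess
from datetime import datetime, timezone

# Read-only status commands named in this guide.
STATUS_COMMANDS = [
    ["rladmin", "status"],
    ["rladmin", "status", "shards"],
    ["rladmin", "status", "issues_only"],
]


def capture(command, runner=subprocess.run):
    """Run one command and return a dict with its output and exit code."""
    result = runner(command, capture_output=True, text=True, check=False)
    return {
        "command": " ".join(command),
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "returncode": result.returncode,
        "output": result.stdout,
    }


def snapshot(label, runner=subprocess.run):
    """Capture every status command under one label such as 'before'."""
    return {"label": label, "results": [capture(c, runner) for c in STATUS_COMMANDS]}


if __name__ == "__main__":
    for entry in snapshot("before-failover")["results"]:
        print(f"### {entry['command']} (exit {entry['returncode']})")
        print(entry["output"])
```

Taking one snapshot before the failure, one during it, and one after recovery gives three text captures that can be compared side by side. The runner parameter exists only so the functions can be exercised without a cluster.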

Log Files

Located by default at: /var/opt/redislabs/log/
Check for replication link events, failover actions, shard status transitions, and latency anomalies
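
To narrow a review to the files that changed during a test, it can help to filter the log directory by modification time. The sketch below assumes the default log location given above and read access to it. The function name and the 30-minute window are illustrative choices.

```python
from datetime import datetime, timedelta
from pathlib import Path

LOG_DIR = Path("/var/opt/redislabs/log")  # default location per this guide


def logs_modified_between(start, end, log_dir=LOG_DIR):
    """Return log files whose last-modified time falls in [start, end]."""
    matches = []
    for path in sorted(log_dir.iterdir()):
        if not path.is_file():
            continue
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        if start <= modified <= end:
            matches.append(path)
    return matches


if __name__ == "__main__":
    # Example window: the last 30 minutes (local time).
    end = datetime.now()
    for path in logs_modified_between(end - timedelta(minutes=30), end):
        print(path)
```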

Monitoring Tools

RedisInsight: Visual dashboards and timeline view of failover events
Prometheus/Grafana: Time series metrics on node health, replication lag, memory, and client connections
Application logs: Capture connection errors, retry loops, or write failures

Benchmarking

memtier_benchmark: Use to validate cluster performance before, during, and after failure simulation
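
A consistent command line across the three phases makes the results comparable. The following sketch builds a memtier_benchmark invocation per phase. The host name, port, thread and client counts, and output file naming are assumptions for illustration and should be replaced with values that match your database and workload.

```python
import shlex


def memtier_command(host, port, phase, seconds=60, threads=4, clients=50):
    """Build a memtier_benchmark argument list for one test phase."""
    return [
        "memtier_benchmark",
        "--server", host,
        "--port", str(port),
        "--threads", str(threads),
        "--clients", str(clients),
        "--ratio", "1:10",  # SET:GET ratio
        "--test-time", str(seconds),
        "--json-out-file", f"memtier-{phase}.json",
    ]


if __name__ == "__main__":
    # Hypothetical endpoint; replace with your database endpoint.
    for phase in ("before", "during", "after"):
        print(shlex.join(memtier_command("redis-12000.cluster.example.com", 12000, phase)))
```

The script only prints the commands. Running them with identical settings and comparing the three JSON result files shows how throughput and latency changed around the failure.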

Common Issues and Resolutions

Replica Doesn’t Promote to Master

Cause: Replica and master were placed in the same failure domain
Fix: Review shard placement policies (dense vs. sparse); relocate shards across multiple nodes

Data Unavailability After Test

Check node state and Redis processes with ps or rladmin status
Run rladmin status issues_only to find unresolved replication or endpoint problems
Verify clients connect via endpoint, not static IPs
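
The last check can be partly automated by scanning client configuration for IP literals. This is a minimal sketch: it assumes you can collect the configured Redis hosts into a list, and the host values shown are hypothetical.

```python
import ipaddress


def is_static_ip(host):
    """Return True when host is an IP literal rather than a DNS name."""
    try:
        ipaddress.ip_address(host.strip("[]"))
        return True
    except ValueError:
        return False


def audit_hosts(hosts):
    """Return the configured hosts that bypass the database endpoint name."""
    return [h for h in hosts if is_static_ip(h)]


if __name__ == "__main__":
    # Hypothetical values gathered from application configuration.
    configured = ["10.0.0.5", "redis-12000.cluster.example.com"]
    for host in audit_hosts(configured):
        print(f"static IP in client config: {host}")
```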

Performance Drops During Recovery

Ensure RAM usage stays under 80%
Check for slow queries using SLOWLOG and monitor system metrics (CPU, swap, disk I/O)
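
Both checks can be scripted. The sketch below makes two assumptions: that the 80% guideline is read as node-level memory on a Linux host (measured from /proc/meminfo), and that slow log entries are fetched with the redis-py client, whose slowlog_get() returns a list of dicts with a duration in microseconds. The 10 ms threshold is an arbitrary illustration.

```python
def ram_usage_ratio(meminfo_text):
    """Compute used/total RAM from the text of /proc/meminfo (Linux)."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # values are in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return (total - available) / total


def slow_entries(entries, min_microseconds=10_000):
    """Keep slow log entries at or above a duration threshold.

    `entries` is the list of dicts returned by redis-py's slowlog_get().
    """
    return [e for e in entries if e["duration"] >= min_microseconds]


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:
        ratio = ram_usage_ratio(handle.read())
    print(f"RAM in use: {ratio:.0%} (target: under 80%)")
```

With redis-py installed, the entries can come from a call such as redis.Redis(host=..., port=...).slowlog_get(128), passed straight to slow_entries().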

Client Fails to Reconnect Automatically

Cause: Application does not support reconnect logic or uses incorrect endpoint
Fix: Update client libraries to support DNS-based failover; test reconnection under failure
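
The reconnect behavior described above can be sketched as a retry loop with exponential backoff. This example uses a plain TCP connection from the Python standard library rather than a specific Redis client, and the attempt count, delays, and timeout are illustrative. Because the host name is passed to the connect call on every attempt, it is resolved again each time, subject to any caching in the operating system resolver.

```python
import socket
import time


def connect_with_retry(host, port, attempts=5, base_delay=0.5,
                       connect=socket.create_connection, sleep=time.sleep):
    """Try to connect, backing off exponentially between attempts."""
    last_error = None
    for attempt in range(attempts):
        try:
            # Passing the name (not a cached IP) triggers a fresh lookup.
            return connect((host, port), timeout=2)
        except OSError as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error


if __name__ == "__main__":
    # Hypothetical endpoint; replace with your database endpoint.
    sock = connect_with_retry("redis-12000.cluster.example.com", 12000)
    print("connected to", sock.getpeername())
    sock.close()
```

Most Redis client libraries offer their own retry and backoff settings, which are preferable in production. The loop here is mainly useful as a probe during a failover test.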

Manual Recovery Is Needed

If automatic failover fails, refer to the Cluster Recovery Guide

Post-Test Validation

Check Cluster Status

Run rladmin status and confirm that every node and shard reports a healthy state (OK in the STATUS column)

Verify Endpoint Placement

Use rladmin status endpoints together with rladmin status shards to confirm that endpoints are colocated with master shards when possible

Log Review

Review Redis logs (/var/opt/redislabs/log/) for:
- Replication state changes
- CCS error recovery
- Endpoint migrations
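
A first pass over the logs can be a simple count of lines per category. The sketch below is illustrative only: the search terms are assumptions, not the exact wording Redis Software writes to its logs, and the *.log file pattern may need adjusting after you inspect the directory.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed search terms; adjust after inspecting your own log files.
PATTERNS = {
    "replication": re.compile(r"replic", re.IGNORECASE),
    "ccs": re.compile(r"\bccs\b", re.IGNORECASE),
    "endpoint": re.compile(r"endpoint", re.IGNORECASE),
}


def count_matches(log_dir, glob="*.log"):
    """Count matching lines per category across log files in log_dir."""
    counts = Counter()
    for path in sorted(Path(log_dir).glob(glob)):
        with open(path, errors="replace") as handle:
            for line in handle:
                for name, pattern in PATTERNS.items():
                    if pattern.search(line):
                        counts[name] += 1
    return counts


if __name__ == "__main__":
    for name, total in count_matches("/var/opt/redislabs/log").items():
        print(f"{name}: {total} matching lines")
```

A category with an unexpectedly high count is a cue to open the matching files and read the entries around the time of the simulated failure.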

Analyze Monitoring Metrics

RedisInsight/Prometheus should show consistent performance and client connection recovery after failover
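
"Consistent performance" is easier to judge with a number. One option is to export a metric such as operations per second or connected clients as (timestamp, value) pairs and measure how long it took to return to its pre-failure level. The function below is a sketch of that calculation. The 10% tolerance and the sample format are assumptions and are not tied to any particular monitoring tool.

```python
def recovery_seconds(samples, failure_time, baseline, tolerance=0.1):
    """Return seconds from failure_time until the metric is back near baseline.

    samples: iterable of (timestamp_seconds, value) pairs in time order.
    Returns 0.0 if the metric never dropped below the tolerance band,
    and None if it dropped and did not recover within the samples.
    """
    floor = baseline * (1 - tolerance)
    dipped = False
    for timestamp, value in samples:
        if timestamp < failure_time:
            continue
        if value < floor:
            dipped = True
        elif dipped:
            return timestamp - failure_time
    return None if dipped else 0.0


if __name__ == "__main__":
    # Hypothetical ops/sec samples around a failure injected at t=10.
    ops = [(0, 100), (10, 100), (15, 20), (20, 40), (25, 95), (30, 100)]
    print(recovery_seconds(ops, failure_time=10, baseline=100))
```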

Share Test Results

For deeper analysis, use the Redis Admin Console to generate a support package containing logs, metrics, and configuration data. This package can be shared internally or with Redis for further review.
