Redis Software clusters operating in distributed environments can experience issues related to performance, connectivity, availability, and synchronization. This guide provides a comprehensive approach to diagnosing and resolving such problems. Topics include Investigating System-Level Issues, Gathering Diagnostics and Support Packages, and Troubleshooting Connectivity, Latency, CRDB Sync Lag, Shard Imbalance, and Node Failures.
Investigating System-Level Issues
Monitor Cluster Health
- Use Prometheus, Grafana, or Redis Insight to track key metrics:
  - Latency, memory usage, CPU utilization, connection counts, replication lag.
- Node resource checks:
  - Ensure RAM/CPU < 80%, swap is disabled, time is synchronized, disk space is sufficient.
  - Confirm the redislabs user has correct umask settings and log storage isn’t full.
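The node resource checks above can be sketched as a short script. This is a minimal sketch, assuming a Linux node with /proc/meminfo, df, and awk available. The 80% threshold comes from the guidance above; the function names are hypothetical, and CPU and time-synchronization checks are left out for brevity.

```shell
#!/usr/bin/env bash
# Illustrative node resource check; function names are hypothetical.

THRESHOLD=80   # percent, per the guidance above

# Print WARN if an integer percentage reaches the threshold, OK otherwise.
check_pct() {
  local label=$1 pct=$2
  if (( pct >= THRESHOLD )); then
    echo "WARN ${label}=${pct}%"
  else
    echo "OK ${label}=${pct}%"
  fi
}

# Percentage of RAM in use, derived from MemTotal and MemAvailable.
mem_used_pct() {
  awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2} END {printf "%d\n", (t-a)*100/t}' /proc/meminfo
}

# Percentage of disk space used on the filesystem that holds a path.
disk_used_pct() {
  df -P "$1" | awk 'NR==2 {gsub("%","",$5); print $5}'
}

# Total swap in kB; 0 means swap is disabled.
swap_total_kb() {
  awk '/^SwapTotal:/ {print $2}' /proc/meminfo
}

check_pct ram "$(mem_used_pct)"
check_pct disk "$(disk_used_pct /)"   # on a real node, point this at the log path instead of /
if (( $(swap_total_kb) == 0 )); then echo "OK swap disabled"; else echo "WARN swap enabled"; fi
```

Running it on each node gives a quick pass/fail view before digging into cluster-level metrics.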
Additional Insight: Our cluster watchdog logs can help identify frequent disconnects or inter-node communication issues.
Use Key Logs for Analysis
Log files are found in /var/opt/redislabs/log/:
- event_log.log – CPU, latency, failovers, endpoint changes.
- cluster_wd.log – node watchdog states and failures.
- dmcproxy.log – connection negotiation and certificate failures.
- cnm_exec.log – shard movement, failovers, migrations.
- redis-ID.log – individual shard errors and usage patterns.
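A first pass over these logs can be scripted. The sketch below counts matching lines per file; the search patterns are illustrative guesses based on the topics listed above, not documented message formats, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative log scan; patterns are assumptions, not documented log formats.

LOG_DIR=${LOG_DIR:-/var/opt/redislabs/log}

# Count case-insensitive matches of an extended regex in one log file.
# Prints "<file>: <count>", or a note if the file cannot be read.
count_matches() {
  local file=$1 pattern=$2
  if [[ -r $file ]]; then
    echo "$(basename "$file"): $(grep -ciE -- "$pattern" "$file")"
  else
    echo "$(basename "$file"): not found"
  fi
}

count_matches "$LOG_DIR/event_log.log"  'failover|latency'
count_matches "$LOG_DIR/cluster_wd.log" 'dead|fail'
count_matches "$LOG_DIR/dmcproxy.log"   'certificate|refused'
count_matches "$LOG_DIR/cnm_exec.log"   'migrat|failover'
```

A file with an unusually high count is a reasonable place to start reading in detail.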
Run Health Check Commands
- rladmin status
- rladmin status shards
- rladmin status issues_only
- rladmin status extra all
- supervisorctl status – confirms Redis processes are running.
- rlcheck – performs cluster-wide health validation.
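To keep a record of these checks for later comparison, the commands above can be wrapped in a small script. This is a sketch only: the output directory and the run_check helper are hypothetical, and it assumes the commands are on the PATH of the user running it.

```shell
#!/usr/bin/env bash
# Illustrative wrapper: run each health check and save its output.

OUT_DIR=${OUT_DIR:-/tmp/redis-health-$(date +%Y%m%d-%H%M%S)}
mkdir -p "$OUT_DIR"

# Run a command if its binary is available; save stdout/stderr under OUT_DIR
# and report the exit status.
run_check() {
  local name=$1; shift
  if command -v "$1" >/dev/null 2>&1; then
    "$@" >"$OUT_DIR/$name.txt" 2>&1
    echo "$name: exit $?"
  else
    echo "$name: skipped ($1 not found)"
  fi
}

run_check status      rladmin status
run_check shards      rladmin status shards
run_check issues_only rladmin status issues_only
run_check extra_all   rladmin status extra all
run_check supervisor  supervisorctl status
run_check rlcheck     rlcheck
```

Saving the output per run makes it easier to compare cluster state before and after an incident.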
Gathering Diagnostics and Support Packages
- UI method:
  Go to Cluster Manager > Generate Support Package → Choose "Full package including all nodes and databases".
- CLI method:
  Use commands documented in the Redis Software CLI reference.
  For a full support package: rladmin cluster debug_info
  For a node debuginfo package (if support package generation fails): /opt/redislabs/bin/debuginfo
Contents of Support Package
- System and Redis logs
- Cluster config metadata
- Metrics and performance statistics
(No customer data or key contents are included.)
Tip: If the file exceeds upload limits, split it per Redis Support instructions. On Linux, the split command breaks the .tar.gz file into smaller files; its second argument is the prefix used for the output file names. For example, to split the support package into 50 MB files:
split --bytes=50M <support_package_filename> <support_package_filename>
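The round trip of splitting and reassembling can be checked locally before uploading. The sketch below uses a throwaway random file as a stand-in for the support package and a ".part_" prefix; both are assumptions made for the demo, so substitute the real file name and whatever naming Redis Support requests.

```shell
#!/usr/bin/env bash
# Illustrative split/reassemble round trip with a checksum comparison.

PKG=$(mktemp /tmp/support_package_demo.XXXXXX)
head -c 3000000 /dev/urandom > "$PKG"                 # stand-in for the real .tar.gz

sha256sum "$PKG" | awk '{print $1}' > "$PKG.sha256"   # checksum before splitting
split --bytes=1M "$PKG" "$PKG.part_"                  # writes .part_aa, .part_ab, ...

# Receiving side: concatenate the parts in name order and compare checksums.
cat "$PKG".part_* > "$PKG.rejoined"
if [[ "$(sha256sum "$PKG.rejoined" | awk '{print $1}')" == "$(cat "$PKG.sha256")" ]]; then
  echo "checksum match"
else
  echo "checksum mismatch"
fi
```

Because split names the pieces in sorted order, a plain shell glob is enough to reassemble them.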
Troubleshooting Steps (by Issue Type)
Connectivity Issues
- Use redis-cli to verify endpoint availability. If redis-cli can’t connect, it usually returns an error message with additional details to help identify the issue.
- Inspect event_log.log and dmcproxy.log for:
  - Refused connections
  - Resource exceeded errors (e.g., buffer limits)
  - TLS errors (expired certs, trust mismatches)
- Confirm endpoint resolution with:
  - rladmin status endpoints
  - DNS resolution tools (dig, nslookup)
- Rebind endpoints if needed after node restarts or IP changes.
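The connectivity checks above can be combined into a few helper functions. This is a sketch under stated assumptions: the host, port, and CA file in the usage comments are placeholders, the function names are hypothetical, and the error fragments matched in classify_error are typical redis-cli messages that may vary by version.

```shell
#!/usr/bin/env bash
# Illustrative endpoint checks; nothing here is specific to one cluster.

# Resolve an endpoint name with dig.
resolve() {
  dig +short "$1"
}

# Send PING to an endpoint. Extra redis-cli arguments (for example
# --tls --cacert <file>, or -a <password>) may follow the port.
ping_endpoint() {
  local host=$1 port=$2; shift 2
  redis-cli -h "$host" -p "$port" "$@" PING 2>&1
}

# For TLS endpoints: print the expiry date of the presented certificate.
cert_enddate() {
  openssl s_client -connect "$1:$2" -servername "$1" </dev/null 2>/dev/null |
    openssl x509 -noout -enddate
}

# Map redis-cli output to a likely cause (message fragments are assumptions).
classify_error() {
  case $1 in
    *PONG*)                        echo "ok" ;;
    *"Connection refused"*)        echo "refused: check endpoint binding and firewall" ;;
    *"Name or service not known"*) echo "dns: check endpoint resolution" ;;
    *NOAUTH*|*WRONGPASS*)          echo "auth: check credentials" ;;
    *SSL*|*TLS*|*certificate*)     echo "tls: check certificates and trust chain" ;;
    *)                             echo "other: $1" ;;
  esac
}

# Usage (placeholders):
#   resolve redis-12000.cluster.example.com
#   classify_error "$(ping_endpoint redis-12000.cluster.example.com 12000)"
#   cert_enddate redis-12000.cluster.example.com 12000
```

The classification is only a starting hint; event_log.log and dmcproxy.log remain the source of detail.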
Latency Problems
- Monitor latency metrics in Grafana/Prometheus.
- Review Redis SLOWLOG and long command traces.
- Use Redis Insight to detect inefficient command patterns or hot keys.
- Investigate client storming or burst traffic issues.
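When dashboards are unavailable, a rough client-side view of latency can be taken from the command line. The sketch below times repeated PING calls and reports a nearest-rank percentile. It assumes GNU date (for nanosecond timestamps); the function names are hypothetical, and because each sample includes redis-cli process startup, the numbers are only useful for comparison, not as absolute latency.

```shell
#!/usr/bin/env bash
# Illustrative latency sampling; function names are hypothetical.

# Print one round-trip time in milliseconds per PING (includes process startup).
sample_ping_ms() {
  local host=$1 port=$2 n=${3:-20} i start end
  for ((i = 0; i < n; i++)); do
    start=$(date +%s%N)
    redis-cli -h "$host" -p "$port" PING >/dev/null 2>&1 || return 1
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
  done
}

# Nearest-rank percentile of numbers read from stdin, one per line.
percentile() {
  local p=$1
  sort -n | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      i = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (i < 1) i = 1
      print v[i]
    }'
}

# Usage (host and port are placeholders):
#   sample_ping_ms redis-12000.cluster.example.com 12000 50 | percentile 95
#   redis-cli -h redis-12000.cluster.example.com -p 12000 SLOWLOG GET 10
```

Comparing the p95 value against SLOWLOG entries helps separate slow commands from network or client-side delay.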
CRDB Sync Lag
- Monitor CRDB replication health via Admin UI.
- Look for lag warnings in logs.
- Validate:
  - Inter-cluster reachability
  - TLS/cert validity
  - Protocol compatibility
- Causes include network issues, restarts, or replication misconfiguration.
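Inter-cluster reachability can be spot-checked from a node with a plain TCP probe. This is a minimal sketch: the peer host names are placeholders, the function names are hypothetical, and the ports shown (9443 and a database port of 12000) are assumptions to replace with the ports your participating clusters actually use.

```shell
#!/usr/bin/env bash
# Illustrative TCP reachability probe between participating clusters.

# Succeed if a TCP connection to host:port opens within 3 seconds.
port_open() {
  local host=$1 port=$2
  timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Print a one-line verdict for a peer host and port.
check_peer() {
  if port_open "$1" "$2"; then
    echo "$1:$2 reachable"
  else
    echo "$1:$2 unreachable"
  fi
}

# Usage (placeholders; run from each cluster toward every other cluster):
#   for peer in cluster-a.example.com cluster-b.example.com; do
#     check_peer "$peer" 9443
#     check_peer "$peer" 12000
#   done
```

A successful TCP connection only rules out routing and firewall problems; certificate and protocol checks still apply.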
Shard Imbalance
- Identify via:
  - Shard size
  - Operations/sec
- Rebalance with Redis resharding tools.
- Mitigate with:
  - Better key hashing strategies
  - Removal of hot keys
  - Reducing or modifying the use of hash tags ({}) that force multiple keys onto a single hash slot
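To see why hash tags concentrate keys, it helps to compute the slot for a few key names. The sketch below assumes the database uses the standard hashing policy, which follows the open source Redis Cluster scheme (CRC16 of the key, or of the text inside the first non-empty {...}, modulo 16384). The function names are hypothetical and the loop handles ASCII key names only.

```shell
#!/usr/bin/env bash
# Illustrative hash-slot calculation (CRC16/XMODEM mod 16384), ASCII keys only.

# CRC16 with polynomial 0x1021 and initial value 0.
crc16() {
  local s=$1 crc=0 i j byte
  for ((i = 0; i < ${#s}; i++)); do
    printf -v byte '%d' "'${s:i:1}"
    crc=$(( crc ^ (byte << 8) ))
    for ((j = 0; j < 8; j++)); do
      if (( crc & 0x8000 )); then
        crc=$(( ((crc << 1) ^ 0x1021) & 0xFFFF ))
      else
        crc=$(( (crc << 1) & 0xFFFF ))
      fi
    done
  done
  echo "$crc"
}

# Return the text between the first "{" and the next "}" if it is non-empty;
# otherwise return the whole key.
hash_tag() {
  local key=$1 rest tag
  if [[ $key == *"{"* ]]; then
    rest=${key#*\{}
    if [[ $rest == *"}"* ]]; then
      tag=${rest%%\}*}
      if [[ -n $tag ]]; then echo "$tag"; return; fi
    fi
  fi
  echo "$key"
}

key_slot() {
  echo $(( $(crc16 "$(hash_tag "$1")") % 16384 ))
}

for k in '{user1000}.following' '{user1000}.followers' 'user1000.following' 'user1001.following'; do
  echo "$k -> slot $(key_slot "$k")"
done
```

The two tagged keys print the same slot, while the untagged keys are hashed on their full names and spread independently. A tag shared by many busy keys therefore pins all of their traffic to one shard.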
Node Failures
- Check cluster_wd.log and event_log.log for nodes being “declared dead”.
- Replace or repair nodes using recovery best practices.
Additional Cluster-Wide Recommendations
Distributed Synchronization Setup
- For Active-Active or Replica Of databases:
  - Use distributed syncer mode (note that this is enabled by default for new databases in Redis Enterprise 7.6 and later):
    rladmin tune db db:<ID> syncer_mode distributed
  - Adjust proxy policies to reflect replication strategy.
Related Issues Covered in Separate Articles
The following distributed system issues are handled in more detail in their respective troubleshooting articles:
- Troubleshooting TLS Connection Failures Caused by Certificate Expiration
  Expired or mismatched TLS certificates can block connectivity, especially during failovers or node restarts.
- Troubleshooting CRDB Sync Stalls After Cluster Reboot
  Walks through syncer-mode settings, lag monitoring, and inter-cluster connection problems.
- Hot Key Imbalance Overloading a Shard
  Explains how large or frequently accessed keys skew traffic and cause shard overload.
- Incomplete Node Maintenance Causing Split-Brain States
  Covers best practices to avoid partial upgrades or improper node restarts that desynchronize the cluster.
- Diagnosing and Resolving Endpoint Flapping
  Focuses on DNS, endpoint binding, and client resolution issues after failover or cluster reconfiguration.
2 comments
Tanrieek Ahm
Is there any metrics or monitoring attribute which gives me shard imbalance stats?
Natalie Patterson
Hi Tanrieek, the Hot Key Imbalance Overloading a Shard article is the best match for explaining how to monitor shard-level metrics that reflect imbalance.