Troubleshooting Distributed System Issues in Redis Software

Redis Software clusters operating in distributed environments can experience issues related to performance, connectivity, availability, and synchronization. This guide provides a comprehensive approach to diagnosing and resolving such problems. Topics include Investigating System-Level Issues, Gathering Diagnostics and Support Packages, and Troubleshooting Connectivity, Latency, CRDB Sync Lag, Shard Imbalance, and Node Failures.

Investigating System-Level Issues

Monitor Cluster Health

Use Prometheus, Grafana, or Redis Insight to track key metrics:
- Latency, memory usage, CPU utilization, connection counts, replication lag.
Node resource checks:
- Ensure RAM and CPU utilization each stay below 80%, swap is disabled, node clocks are synchronized, and disk space is sufficient.
- Confirm the redislabs user has correct umask settings and log storage isn’t full.
Additional insight: The cluster watchdog log (cluster_wd.log) can help identify frequent disconnects or inter-node communication issues.
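The node resource checks above can be scripted. The following is a minimal sketch, assuming a Linux node where /proc/meminfo is available; the function names (mem_used_pct, swap_total_kb, check_below) are illustrative and not part of Redis Software.

```shell
# Print memory usage as a whole-number percentage, reading /proc/meminfo-style
# text on stdin (MemTotal and MemAvailable lines, values in kB).
mem_used_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total <= 0) exit 1
            printf "%d\n", (total - avail) * 100 / total
        }'
}

# Print the configured swap size in kB (0 means swap is disabled).
swap_total_kb() {
    awk '$1 == "SwapTotal:" { print $2; found = 1 } END { if (!found) exit 1 }'
}

# Compare a measured percentage against a limit and print OK or WARN.
# $1 = label, $2 = measured percent, $3 = limit percent
check_below() {
    if [ "$2" -lt "$3" ]; then
        echo "OK   $1 ${2}% (< ${3}%)"
    else
        echo "WARN $1 ${2}% (>= ${3}%)"
        return 1
    fi
}

# Example usage on a node:
#   check_below RAM "$(mem_used_pct < /proc/meminfo)" 80
#   [ "$(swap_total_kb < /proc/meminfo)" -eq 0 ] || echo "WARN swap is enabled"
```

Reading the values from stdin keeps the functions easy to try against a saved copy of /proc/meminfo from another node.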

Use Key Logs for Analysis

Log files are found in /var/opt/redislabs/log/:

event_log.log – CPU, latency, failovers, endpoint changes.
cluster_wd.log – node watchdog states and failures.
dmcproxy.log – connection negotiation and certificate failures.
cnm_exec.log – shard movement, failovers, migrations.
redis-ID.log – individual shard errors and usage patterns.
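To see quickly which of these logs mention a given symptom, a small helper can count matching lines per file. This is a sketch only: the function name count_log_matches and the search patterns in the usage comment are assumptions for illustration, not message formats taken from Redis Software.

```shell
# Default to the log directory named above; override with LOG_DIR if needed.
LOG_DIR=${LOG_DIR:-/var/opt/redislabs/log}

# Count lines matching a case-insensitive extended regex in each *.log file.
# $1 = pattern, $2 = directory (optional, defaults to LOG_DIR)
# Prints "<count><TAB><file name>" for every file with at least one match.
count_log_matches() {
    dir=${2:-$LOG_DIR}
    for f in "$dir"/*.log; do
        [ -f "$f" ] || continue
        n=$(grep -ciE -- "$1" "$f")
        [ "$n" -gt 0 ] && printf '%s\t%s\n' "$n" "${f##*/}"
    done
    return 0
}

# Example usage (patterns are hypothetical; adjust to what you are chasing):
#   count_log_matches 'failover'
#   count_log_matches 'refused|certificate'
```

A file that stands out with a high count is usually the best place to start reading in detail.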

Run Health Check Commands

rladmin status
rladmin status shards
rladmin status issues_only
rladmin status extra all
supervisorctl status – confirms Redis processes are running.
rlcheck – performs cluster-wide health validation.
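When an issue is intermittent, it can help to capture the output of all of these commands at the same moment. The sketch below does that with two illustrative helper functions (capture and snapshot_health are hypothetical names); it assumes the commands are on PATH for the user running it and records a "skipped" note for any that are not.

```shell
# Run a command and save its combined stdout/stderr to <outdir>/<name>.txt.
# $1 = output directory, $2 = short name, remaining args = command to run
capture() {
    outdir=$1; name=$2; shift 2
    mkdir -p "$outdir" || return 1
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >"$outdir/$name.txt" 2>&1
    else
        echo "skipped: $1 not found on PATH" >"$outdir/$name.txt"
    fi
}

# Snapshot the health check commands listed above into one directory.
# $1 = output directory (optional, defaults to a timestamped folder)
snapshot_health() {
    out=${1:-./health-$(date +%Y%m%d-%H%M%S)}
    capture "$out" status        rladmin status
    capture "$out" status_shards rladmin status shards
    capture "$out" status_issues rladmin status issues_only
    capture "$out" status_extra  rladmin status extra all
    capture "$out" supervisor    supervisorctl status
    capture "$out" rlcheck       rlcheck
    echo "$out"
}

# Example usage:
#   snapshot_health /tmp/health-before-change
```

Taking one snapshot before and one after a change makes it easy to compare the two directories with diff.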

Gathering Diagnostics and Support Packages

Generate a Support Package

UI method:
Go to Cluster Manager > Generate Support Package → Choose "Full package including all nodes and databases".
CLI method:
Use the commands documented in the Redis Software CLI reference:
- For a full support package: rladmin cluster debug_info
- For a single-node debuginfo package (if support package generation fails): /opt/redislabs/bin/debuginfo

Contents of Support Package

System and Redis logs
Cluster config metadata
Metrics and performance statistics
(No customer data or key contents are included.)

Tip: If the file exceeds upload limits, split it per Redis Support instructions. On Linux, the split command can break the .tar.gz file into smaller pieces; the second argument is the prefix used for the output file names. For example, to split the support package into 50 MB files:
split --bytes=50M <support_package_filename> <support_package_filename>
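Before uploading the pieces, it is worth confirming that they rejoin into the original file. The following sketch uses the portable -b form of split and a ".part-" suffix on the prefix; both the suffix and the function names (split_package, verify_pieces) are choices made for this example.

```shell
# Split a file into fixed-size pieces named <file>.part-aa, <file>.part-ab, ...
# $1 = file to split, $2 = piece size as accepted by split -b
split_package() {
    split -b "$2" "$1" "$1.part-"
}

# Rejoin the pieces in name order and compare them byte for byte with the
# original. Returns 0 only when the rejoined data is identical.
# $1 = original file
verify_pieces() {
    cat "$1".part-* | cmp -s - "$1"
}

# Example usage:
#   split_package <support_package_filename> 50M
#   verify_pieces <support_package_filename> && echo "pieces are complete"
```

The recipient can restore the package the same way, by concatenating the pieces in name order with cat.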

Troubleshooting Steps (by Issue Type)

Connectivity Issues

Use redis-cli to verify endpoint availability.
- If redis-cli can’t connect, it usually returns an error message with additional details that help identify the issue.
Inspect event_log.log and dmcproxy.log for:
- Refused connections
- Resource exceeded errors (e.g., buffer limits)
- TLS errors (expired certs, trust mismatches)
Confirm endpoint resolution with:
- rladmin status endpoints
- DNS resolution tools (dig, nslookup)
Rebind endpoints if needed after node restarts or IP changes.
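The redis-cli check above can be wrapped so the result is easy to read or to run against several endpoints in a loop. This is a minimal sketch: ping_endpoint and the REDIS_CLI override variable are names invented for this example, and the host, port, and certificate file in the usage comment are placeholders.

```shell
# Path to the client binary; override REDIS_CLI if redis-cli is not on PATH.
REDIS_CLI=${REDIS_CLI:-redis-cli}

# Send PING to an endpoint. Returns 0 on PONG; otherwise prints whatever
# error text the client produced and returns 1.
# $1 = host, $2 = port, further arguments are passed through to redis-cli
ping_endpoint() {
    host=$1; port=$2; shift 2
    reply=$("$REDIS_CLI" -h "$host" -p "$port" "$@" ping 2>&1)
    if [ "$reply" = "PONG" ]; then
        echo "OK   $host:$port"
    else
        echo "FAIL $host:$port: $reply"
        return 1
    fi
}

# Example usage (placeholder values):
#   ping_endpoint <endpoint_host> <endpoint_port>
#   ping_endpoint <endpoint_host> <endpoint_port> --tls --cacert <ca_cert.pem>
```

A FAIL line that carries an authentication error still shows that the endpoint is reachable, which narrows the problem to credentials rather than networking.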

Latency Problems

Monitor latency metrics in Grafana/Prometheus.
Review Redis SLOWLOG and long command traces.
Use Redis Insight to detect inefficient command patterns or hot keys.
Investigate client storming or burst traffic issues.

CRDB Sync Lag

Monitor CRDB replication health via Admin UI.
Look for lag warnings in logs.
Validate:
- Inter-cluster reachability
- TLS/cert validity
- Protocol compatibility
Common causes include network issues, node or process restarts, and replication misconfiguration.
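Inter-cluster reachability and certificate validity can be checked together from a node of one participating cluster against an endpoint of the other. The sketch below relies on the openssl command-line tool being installed; the function names are illustrative, and the host and port to test are left as placeholders because they depend on your deployment.

```shell
# Return 0 if the PEM certificate stays valid for at least N more days.
# $1 = certificate file, $2 = number of days
cert_valid_for_days() {
    openssl x509 -in "$1" -noout -checkend "$(( $2 * 86400 ))" >/dev/null
}

# Connect to a TLS endpoint and save the certificate it presents as PEM.
# A failure here points at reachability or TLS negotiation.
# $1 = host, $2 = port, $3 = output file
fetch_peer_cert() {
    openssl s_client -connect "$1:$2" -servername "$1" </dev/null 2>/dev/null \
        | openssl x509 -outform PEM >"$3"
}

# Example usage (placeholder values):
#   fetch_peer_cert <peer_cluster_host> <port> /tmp/peer.pem \
#       && cert_valid_for_days /tmp/peer.pem 30 \
#       || echo "WARN unreachable, or certificate expires within 30 days"
```

Checking against a window such as 30 days, rather than only "not yet expired", gives time to rotate certificates before replication is affected.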

Shard Imbalance

Identify via:
- Shard size
- Operations/sec
Rebalance with Redis resharding tools.
Mitigate with:
- Better key hashing strategies
- Removal of hot keys
- Reducing or changing the use of hash tags ({}), which force multiple keys onto a single hash slot
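To judge whether hash tags are contributing to an imbalance, you can count how many keys in a sample share each tag. The sketch below applies the open-source Redis Cluster rule (the tag is the text between the first "{" and the next "}", ignored when empty); this is an assumption, since a database with a custom hashing policy may extract tags differently. The function name hashtag_histogram is illustrative.

```shell
# Read key names on stdin, one per line, and print "<count><TAB><tag>",
# most frequent tag first. Keys without a usable tag are grouped as "(none)".
hashtag_histogram() {
    awk '
    {
        tag = "(none)"
        lb = index($0, "{")                 # position of the first "{"
        if (lb > 0) {
            rest = substr($0, lb + 1)
            rb = index(rest, "}")           # first "}" after that "{"
            if (rb > 1) tag = substr(rest, 1, rb - 1)
        }
        count[tag]++
    }
    END { for (t in count) printf "%d\t%s\n", count[t], t }
    ' | sort -k1,1nr -k2
}

# Example usage (placeholder values; --scan iterates keys without blocking):
#   redis-cli -h <endpoint_host> -p <endpoint_port> --scan --pattern '*' \
#       | hashtag_histogram | head -n 20
```

A single tag that accounts for a large share of the sample is a candidate for splitting into several tags.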

Node Failures

Check cluster_wd.log and event_log.log for nodes being “declared dead”.
Replace or repair nodes using recovery best practices.

Additional Cluster-Wide Recommendations

Distributed Synchronization Setup

For Active-Active or Replica Of databases:
- Use distributed syncer mode (note that this is enabled by default for new databases from Redis Enterprise 7.6 onward):
```
rladmin tune db db:<ID> syncer_mode distributed
```
- Adjust proxy policies to reflect replication strategy.
