Tools for Identifying Failures in Clusters – Redis Knowledge Base

Identifying failures in a Redis cluster requires a combination of monitoring, inspection, and logging tools. This article outlines the most effective diagnostic tools available, grouped by function and use case. These tools help detect and diagnose issues related to cluster health, node failures, configuration errors, resource limits, and application behavior.

Prometheus & Grafana

Real-time and historical monitoring tools for Redis metrics

Prometheus collects cluster metrics such as latency, memory usage, throughput, and errors
Grafana visualizes these metrics and supports alerting to quickly surface anomalies
Ideal for long-term observability and proactive failure detection

Redis Insight (with Copilot)

Visual interface for performance and anomaly detection

Tracks command patterns, memory usage, and throughput
Includes Copilot, which provides automated troubleshooting suggestions
Useful for identifying performance bottlenecks and misbehaving clients

Redis Admin UI

Web-based UI for quick status checks

Shows the current status of nodes, databases, and cluster components
Elements in warning or error states are highlighted with yellow or red indicators
Alerts may appear on both the node and database views
Logged events can be reviewed in the Logs section of the UI

Command-Line Tools

rladmin

Primary CLI for administering a Redis cluster
rladmin status extra all shows cluster-wide status including nodes, shards, and endpoints

This command displays the status of all cluster elements, including node and shard roles, current versions, and whether each component is in an OK state or showing as Missing or in error.

To see only the elements that are in an error state, run:
```
rladmin status extra all errors_only
```
See the rladmin documentation for the full list of available commands.

Inside the rladmin CLI, you can also type “?” or “help” to list the remaining commands, and use Tab for command completion.

rlcheck

Runs a suite of health and configuration checks on the cluster

supervisorctl

Monitors Redis internal service processes
Useful for checking if any core management components have failed
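A failed management component can be spotted by filtering the status listing for anything that is not running. The following is a minimal sketch, assuming the usual supervisorctl status layout of process name, state, then details; the helper name not_running is illustrative and not part of any Redis tooling.

```shell
# Print supervisor-managed processes that are not in the RUNNING state.
# Reads "supervisorctl status" output on stdin; each line is expected to
# hold the process name, its state, and then free-form details.
not_running() {
    awk 'NF >= 2 && $2 != "RUNNING" { print $1, $2 }'
}

# Usage on a cluster node (may require root privileges):
#   supervisorctl status | not_running
```

Empty output means every listed process reported RUNNING.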

redis-cli

Direct database interaction tool
Common commands: INFO, PING, SLOWLOG, MONITOR, and key inspection (--bigkeys, --memkeys)
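The commands above can be combined into a quick database check. This is a sketch under stated assumptions: the host and port defaults are placeholders for your own database endpoint, and the helper names info_field and db_quick_check are hypothetical, not part of redis-cli.

```shell
# Extract one field from INFO output read on stdin.
# INFO lines have the form "name:value" and end in CRLF, so strip the CR first.
info_field() {
    tr -d '\r' | awk -F: -v key="$1" \
        '$1 == key { print substr($0, length(key) + 2); exit }'
}

# Hypothetical helper: liveness, memory, and slow-command check for one database.
# Host and port are placeholders; pass your own endpoint as arguments.
db_quick_check() {
    qc_host="${1:-127.0.0.1}"
    qc_port="${2:-6379}"
    # Stop early if PING does not succeed.
    redis-cli -h "$qc_host" -p "$qc_port" PING || return 1
    # Human-readable memory used by the database.
    redis-cli -h "$qc_host" -p "$qc_port" INFO memory | info_field used_memory_human
    # The ten most recent slow log entries.
    redis-cli -h "$qc_host" -p "$qc_port" SLOWLOG GET 10
}
```

For key inspection, redis-cli -h <host> -p <port> --bigkeys scans the keyspace and reports the largest key of each type. MONITOR streams every command the server processes, so it is best run briefly.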

Operating System Utilities

Standard system tools for node-level diagnostics

df: Check disk usage
free: Check memory availability
top or htop: Monitor CPU usage and load average
dig: Verify DNS resolution (a failed lookup can also point to a network connectivity problem)
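These utilities can be run together as a quick node check. The sketch below is illustrative only: the 80% disk threshold, the helper names disk_used_pct and node_quick_check, and the hostname argument are assumptions, and the free and top options shown are those of the common Linux procps versions.

```shell
# Print the used-space percentage (without the % sign) for the filesystem
# holding the given path. -P forces the portable one-line-per-filesystem format.
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical node check. Arguments: a path to check, then a hostname to resolve.
node_quick_check() {
    nc_pct=$(disk_used_pct "${1:-/}")
    # The 80% threshold is a placeholder; choose one that suits your nodes.
    [ "$nc_pct" -lt 80 ] || echo "WARNING: disk usage at ${nc_pct}%"
    free -m                    # memory and swap, in MiB
    top -b -n 1 | head -n 5    # one batch-mode snapshot: load average and CPU summary
    if [ -n "$2" ]; then
        dig +short "$2"        # prints resolved records; empty output suggests a DNS problem
    fi
}
```

For example, node_quick_check /var/opt/redislabs cluster.example.com checks the filesystem holding that path and resolves the given name; both arguments here are placeholders.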

Log Files

Review logs for detailed error messages and service events

Key log files:
- event_log.log
- cluster_wd.log (watchdog)
- supervisord.log
- dmcproxy.log
- resource_mgr.log
- Shard logs (e.g., redis-<id>.log)
Default log directory: /var/opt/redislabs/log
Critical for identifying failure chains, process crashes, or sync issues
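A first pass over the logs can be done with grep. This is a minimal sketch: the helper name recent_errors is hypothetical, and the search pattern is an assumption, since each log file has its own format and severity labels.

```shell
# Show the most recent log lines that mention an error, across all .log files
# in a directory. Arguments: the log directory (defaults to the standard
# location) and the number of lines to show (defaults to 20).
recent_errors() {
    re_dir="${1:-/var/opt/redislabs/log}"
    # -H prefixes each match with its file name; the pattern is only a starting point.
    grep -iEH 'error|critical|fail' "$re_dir"/*.log | tail -n "${2:-20}"
}
```

Matches are listed file by file rather than in time order, so compare timestamps across files when reconstructing a failure chain.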

Support Packages

Comprehensive diagnostic bundles for deep-dive troubleshooting

Can be generated from the Redis Admin UI or the CLI
Includes configuration files, logs, system stats, and health reports
Useful for Redis Support or internal post-incident analysis

Best Practice

Use these tools in combination to investigate from multiple angles:

Monitor with Prometheus/Grafana
Investigate with Redis Insight, rladmin, and log files
Validate infrastructure health with OS commands
Collect a Support Package for complex or unclear failures

This multi-layered approach ensures a thorough and systematic troubleshooting process across Redis deployments.
