Cluster and Database Recovery – Redis Knowledge Base

This article provides an overview of the recovery process for a Redis Enterprise Software cluster and its databases after a failure. It outlines the general steps for cluster recovery, database recovery, and node maintenance, and includes a troubleshooting table for common recovery problems. The article only summarizes the process; detailed step-by-step instructions are available in the Redis documentation referenced at the end of each section.

Note: The recovery process described here assumes you are recovering to new, clean nodes with clean persistent storage. If you reuse original nodes or drives, make sure to follow the additional steps in the official documentation to back up and prepare the environment.

Prerequisites

- A recent Redis cluster configuration backup (`/ccs/ccs-redis.rdb`)
- Accessible persistence files (AOF/RDB or backups) for each database
- Target nodes with no running Redis processes and matching software version
- Cluster name and FQDN unchanged from original installation
- Clean persistent storage drives for the new cluster nodes

Warning: During recovery or planned maintenance operations, never place a majority of cluster nodes into Maintenance Mode or reboot them simultaneously. Doing so can cause quorum loss, database unavailability, or potential data loss.
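
The quorum rule in the warning above can be checked with simple arithmetic before any nodes are taken offline. The following is a minimal sketch, assuming quorum requires a strict majority of all cluster nodes to stay online; `quorum_safe` is a hypothetical helper name, not part of any Redis tooling.

```shell
# Succeeds (exit 0) only if taking $2 of $1 total nodes offline
# still leaves a strict majority of nodes online.
# Assumption: quorum means more than half of all nodes are up.
quorum_safe() {
  total=$1
  offline=$2
  online=$(( total - offline ))
  [ $(( online * 2 )) -gt "$total" ]
}

# In a 3-node cluster, one node may be offline at a time, two may not.
quorum_safe 3 1 && echo "3 nodes, 1 offline: quorum kept"
quorum_safe 3 2 || echo "3 nodes, 2 offline: quorum lost"
```

Under this assumption an even-sized cluster is stricter than it may look: with 4 nodes, taking 2 offline leaves no majority, even though 2 is not itself a majority of the cluster.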

Cluster Recovery

1. Install Redis Software on the clean nodes.
2. Mount persistent storage containing:
   - The cluster configuration file (`/ccs/ccs-redis.rdb`)
   - All database persistence files (AOF/RDB/backups, typically under `/var/opt/redislabs/persist/`)
3. Recover the cluster configuration on the first node using `rladmin cluster recover`.
4. Join the remaining nodes back into the cluster with `rladmin cluster join`.
5. Verify the cluster is healthy and update DNS records if necessary.

Full instructions: Recover a failed cluster
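
As a rough illustration of the recover and join steps above, the sketch below prints the `rladmin` commands by default rather than running them (set `RLADMIN=rladmin` to execute). The backup mount path, node address, credentials, and node ID are hypothetical placeholders, and the argument forms shown (`filename`, `nodes`, `username`, `password`, `replace_node`) are assumptions that should be confirmed against the full instructions for your software version.

```shell
# By default, print each rladmin command instead of executing it.
# Set RLADMIN=rladmin in the environment to run the commands for real.
RLADMIN="${RLADMIN:-echo rladmin}"

# Hypothetical location of the mounted cluster configuration backup.
CCS_BACKUP="${CCS_BACKUP:-/mnt/persist/ccs/ccs-redis.rdb}"

# Run on the first node: restore the cluster configuration from the backup.
recover_cluster_config() {
  $RLADMIN cluster recover filename "$CCS_BACKUP"
}

# Run on each remaining node: rejoin the recovered cluster.
#   $1 address of the first node   $2 admin user
#   $3 admin password              $4 ID of the node being replaced
join_recovered_cluster() {
  $RLADMIN cluster join nodes "$1" username "$2" password "$3" replace_node "$4"
}

# Example with placeholder values (prints the commands only).
recover_cluster_config
join_recovered_cluster 10.0.0.1 admin@example.com 'example-password' 2
```

Printing first makes it easy to review the exact command lines before anything touches the cluster.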

Database Recovery

1. Confirm that the cluster is healthy (`rladmin status`).
2. Ensure database persistence files are accessible and correctly mounted.
3. Recover databases using the `rladmin recover` commands:
   - `rladmin recover all` – recover all databases
   - `rladmin recover db db:<id>` – recover a single database by ID
   - `rladmin recover db <name>` – recover a single database by name
   - `rladmin recover db <name> only_configuration` – recover configuration only
4. If databases use modules, make sure the required module versions are installed before recovery.
5. Verify that databases are active and clients can connect.

Full instructions: Recover a failed database.
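
The database recovery sequence above (check status, recover, verify) could be wrapped in a small helper along these lines. This is only a sketch: `recover_databases` is a hypothetical function name, and by default the commands are printed rather than executed (set `RLADMIN=rladmin` to run them).

```shell
# By default, print each rladmin command instead of executing it.
RLADMIN="${RLADMIN:-echo rladmin}"

# Recover databases once the cluster itself is healthy.
#   $1: "all" (default), a database name, or db:<id>
recover_databases() {
  target="${1:-all}"
  # Review the cluster state first; the output still needs a human look.
  $RLADMIN status || return 1
  if [ "$target" = "all" ]; then
    $RLADMIN recover all || return 1
  else
    $RLADMIN recover db "$target" || return 1
  fi
  # Check afterwards that the databases report as active.
  $RLADMIN status
}

# Example: recover a single database by ID (prints the commands only).
recover_databases db:3
```

The helper does not interpret the `rladmin status` output, so the health check and final verification remain manual steps.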

Node Maintenance and OS Patching

To patch or reboot cluster nodes safely:

1. Enable maintenance mode: `rladmin node <id> maintenance_mode on`
2. Perform patching or reboot.
3. Disable maintenance mode: `rladmin node <id> maintenance_mode off`

Full instructions: Maintenance mode for cluster nodes

Important: Planned OS patching and node reboots should always be performed one node at a time. Confirm the cluster remains healthy and maintains quorum before proceeding to the next node.
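
The one-node-at-a-time procedure can be sketched as a simple loop. In this illustration the node IDs `1 2 3` and the `patch_node` helper are hypothetical, the patching step is a placeholder, and commands are printed rather than executed unless `RLADMIN=rladmin` is set.

```shell
# By default, print each rladmin command instead of executing it.
RLADMIN="${RLADMIN:-echo rladmin}"

# Patch a single node inside a maintenance-mode window.
patch_node() {
  node_id=$1
  $RLADMIN node "$node_id" maintenance_mode on || return 1
  # Placeholder: the actual OS patching or reboot happens here.
  echo "patch or reboot node $node_id here"
  $RLADMIN node "$node_id" maintenance_mode off || return 1
  # Show outstanding errors before moving on to the next node.
  $RLADMIN status extra all errors_only
}

# Process nodes strictly one after another; stop at the first failure.
for id in 1 2 3; do
  patch_node "$id" || break
done
```

When run for real, a pause or manual confirmation between nodes is advisable, since the status output has to be read by an operator to confirm the cluster is healthy and has quorum.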

Troubleshooting Scenarios

| Symptom | Possible Cause | Recommended Fix |
| --- | --- | --- |
| Cluster doesn’t recover | Missing or mismatched `/ccs/ccs-redis.rdb` | Verify you are using the correct cluster configuration file from persistent storage. Recheck recovery file paths and versions. |
| Databases stuck in “pending recovery” or not showing in `rladmin` | Persistence files missing or not mounted | Ensure persistence files are mounted on all nodes. Set the recovery path with `rladmin node <id> recovery_path set`. Then run the appropriate `rladmin recover` command. |
| Database marked as “missing files” | Persistence files missing, misplaced, or inaccessible | Confirm the recovery path is set correctly on all nodes. Move files to the correct location if misplaced. Ensure file ownership/permissions are `redislabs:redislabs` with 640 permissions. |
| Database marked as “missing files” | Corrupted persistence files | Replace corrupted files with valid copies. If none are available, contact Redis Support. |
| `rladmin` shows nodes but no databases | Database restore skipped | Restore manually with `rladmin recover db <db_name>` or enable auto-recovery. |
| Port 53 errors | DNS server conflict | Ensure PowerDNS is bound and available. See DNS configuration guidance. |
| High CPU, RAM, or network usage after recovery | Shard migration or rebalancing in progress | Monitor until load stabilizes. If resource usage remains high, contact Redis Support. |
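
For the “missing files” case, the ownership and permission requirement from the table can be checked with a short script. This is a sketch that assumes GNU `stat` (as found on typical Linux hosts); `check_owner_mode` is a hypothetical helper, and the glob in the usage comment is only an example of how persistence files might be laid out.

```shell
# Report files whose owner, group, or mode differ from what is expected.
#   $1: expected "owner:group mode" string, e.g. "redislabs:redislabs 640"
#   remaining arguments: files to check
check_owner_mode() {
  expected="$1"; shift
  rc=0
  for f in "$@"; do
    # GNU stat: %U owner name, %G group name, %a octal permission bits.
    actual="$(stat -c '%U:%G %a' "$f")" || { rc=1; continue; }
    if [ "$actual" != "$expected" ]; then
      echo "$f: found '$actual', expected '$expected'"
      rc=1
    fi
  done
  return "$rc"
}

# Usage (hypothetical file layout under the persistence directory):
#   check_owner_mode "redislabs:redislabs 640" /var/opt/redislabs/persist/*.rdb
```

The function returns a non-zero status if any file is unreadable or does not match, so it can gate a later `rladmin recover` call in a script.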

Helpful Commands

- Cluster status: `rladmin status extra all`
- Errors only: `rladmin status extra all errors_only`
- Supervisor check: `supervisorctl status`
- Node diagnostics: `rlcheck`
