A disk failure on a Redis Software node typically shows up as I/O errors and/or persistence (AOF) write/fsync failures, and can also cause missing or incomplete logs if Redis paths were on the affected disk. In these cases, the safest recovery path is operational: move shards off the node first, then repair/rebuild the host, then reintroduce and rebalance in a controlled manner.
This article provides a Quick Fix Table and a step-by-step procedure, followed by validation checks and operational guardrails.
Quick Fix Table
| Symptom or Objective | Action |
|---|---|
| OS shows Input/output error (for example, ls: cannot open directory '.': Input/output error) | Treat this as a disk failure. Immediately prioritize evacuating shards from the node (maintenance mode or shard evacuation) before attempting any in-place repairs. |
| AOF errors such as “Fail to fsync” or “Error writing to the AOF file” | Assume disk instability. Evacuate shards from the node. If shard movement fails because AOF cannot be enabled or written on the failing disk, use the persistence workaround described in Step 2 (advanced procedure; Support guidance is strongly recommended). |
| Disk replaced and filesystem rebuilt (node is effectively empty) | Treat the node as new hardware. Remove or replace the node in the cluster as appropriate, then reinstall Redis Software and re-add the node to the cluster. |
| Installer fails with an upgrade-style error (for example, “socket path cannot be set during upgrade”) | This typically indicates remnants of a previous installation. Remove residual Redis Software packages and directories, then perform a clean installation. |
| No application impact yet, but disk shows signs of degradation | Proactively evacuate shards. Persistence activity (such as AOF fsync operations) often surfaces disk degradation early through slow sync times or intermittent write errors. |
Prerequisites
Before beginning:
Another node (or nodes) with sufficient capacity to host the shards being evacuated.
Cluster admin access (Cluster Manager UI and/or rladmin).
A maintenance window, if large syncs or migrations are expected (shard moves can be resource-intensive).
Step-by-Step Procedure
Step 1: Confirm failure and isolate the node
Goal: Confirm it’s a storage issue and stop the cluster from placing more work on the node.
Look for indicators such as:
OS-level Input/output error.
Redis persistence errors (AOF fsync/write failures).
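One way to check a log file or a saved kernel message capture for the indicators above is a small shell helper. This is a minimal sketch: `scan_disk_errors` is a hypothetical name, not a Redis Software tool, and the patterns are simply the error strings quoted in this article.

```shell
# Hypothetical helper: print how many lines of a file match the
# disk-failure signatures named in this article.
# $1 = path to a log file or a saved dmesg capture.
scan_disk_errors() {
  # -c counts matching lines, -i ignores case, -E enables alternation.
  grep -ciE 'input/output error|fail to fsync|error writing to the aof file' "$1"
}
```

For example, `dmesg > /tmp/dmesg.txt` followed by `scan_disk_errors /tmp/dmesg.txt` prints the number of matching lines. A nonzero count is a reason to start evacuation. It is not a full diagnosis.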
If the node is reachable, prefer putting it into maintenance mode immediately. Maintenance mode is designed to prevent data loss during hardware/OS maintenance; when it’s activated, shards move off of the node (when space is available), and the node is marked so shards won’t migrate back to it.
Note: Maintenance mode performs quorum checks and will not activate if taking the node offline would cause quorum loss.
Step 2: Evacuate (vacate) shards from the failing node
Goal: Ensure the failing node hosts no shards so it can be safely repaired/rebuilt.
Preferred approach: maintenance mode evacuation
Activate maintenance mode for the node, and monitor until all shards and endpoints have migrated off the node.
Verify maintenance mode via rladmin status.
If you cannot evacuate everything due to capacity constraints (read carefully)
Maintenance mode includes an option to avoid evicting replica shards when the cluster doesn’t have enough resources to move everything off the node. This reduces safety (replicas remain on the maintenance node). Use only if you understand the implications for availability and durability.
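The two evacuation variants in this step can be sketched as a single wrapper around rladmin. The function name `maintenance_on` and its `keep-replicas` argument are hypothetical. The sketch assumes the `keep_slave_shards` option of `rladmin node ... maintenance_mode on` is available in your version, so check the rladmin reference for your release before relying on it.

```shell
# Hypothetical helper: turn maintenance mode on for one node.
# $1 = node ID
# $2 = "keep-replicas" to leave replica shards on the node (optional)
maintenance_on() {
  if [ "${2:-}" = "keep-replicas" ]; then
    # Reduced safety: replica shards stay on the node under maintenance.
    rladmin node "$1" maintenance_mode on keep_slave_shards
  else
    # Preferred: move all shards and endpoints off the node.
    rladmin node "$1" maintenance_mode on
  fi
}
```

Run `maintenance_on 3` for the preferred path, or `maintenance_on 3 keep-replicas` when capacity is short, then confirm the result with rladmin status.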
Step 3: Repair or rebuild the host (outside Redis scope)
Once shards are evacuated:
Replace the failed disk and rebuild/remount the filesystem with your infrastructure team.
If Redis Software directories/persistence paths were on the failed disk, assume data on that disk is not trustworthy; proceed as if the node is empty.
Step 4: Reinstall and rejoin the node (recommended when Redis paths were impacted)
If the disk failure affected the Redis installation/persistence paths, a clean reinstall and re-add are often the simplest and safest approach.
4A) Remove/replace the node (cluster side)
Remove the node via the Cluster Manager UI (Nodes screen → Remove node). Redis Software migrates resources off the node during removal.
4B) Reinstall and add back
Install Redis Software on a clean/supported OS, then join the cluster (“Add a node” flow).
Verify node health (UI “Verify node” or rlcheck).
Common pitfall: if the installer acts like an upgrade and errors (for example, “socket path cannot be set during upgrade”), it usually means old packages/directories are still present and must be cleaned before reinstall.
Step 5: Rebalance shard placement
After the node is stable and rejoined:
Rebalance database shards to distribute shards across nodes according to the placement policy (REST API: PUT /v1/bdbs/{uid}/actions/rebalance).
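A call to the rebalance endpoint might look like the following sketch. It assumes the REST API listens on port 9443 and accepts basic authentication. The function name `rebalance_db` and the `RS_USER` / `RS_PASSWORD` environment variables are hypothetical names chosen for illustration.

```shell
# Hypothetical helper: request a shard rebalance for one database.
# $1 = cluster host name, $2 = database uid
# Assumes RS_USER and RS_PASSWORD hold cluster admin credentials.
# Clusters with self-signed certificates also need curl's --cacert option.
rebalance_db() {
  curl -sS -u "${RS_USER}:${RS_PASSWORD}" -X PUT \
    "https://$1:9443/v1/bdbs/$2/actions/rebalance"
}
```

For example, `rebalance_db cluster.example.com 1` sends the request for database 1. Repeat it per database, since the endpoint is scoped to a single database uid.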
Validation and Monitoring
After recovery, validate:
Cluster/node health
Run rlcheck and ensure a healthy result (“ALL TESTS PASSED” is shown as the expected output for healthy nodes in the OS upgrade procedure).
Run rladmin status extra all and confirm the overall OK status for cluster/nodes/endpoints/shards.
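The rlcheck step can be scripted for a runbook. This is a minimal sketch: `node_is_healthy` is a hypothetical name, and it assumes the healthy output contains the string “ALL TESTS PASSED” quoted above.

```shell
# Hypothetical helper: succeed only if rlcheck reports a healthy node.
# Run it on the node being validated.
node_is_healthy() {
  # grep -q prints nothing; its exit status becomes the function's status.
  rlcheck | grep -q 'ALL TESTS PASSED'
}
```

A runbook could use it as `node_is_healthy || echo "rlcheck did not pass; review its output"`. It does not replace reading the rladmin status extra all output.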
Persistence and disk symptoms
Confirm the original AOF write/fsync errors are no longer occurring.
Operational Guardrails
Avoid quorum loss: maintenance mode checks quorum, but it does not account for other nodes already in maintenance; do not put the majority of nodes into maintenance simultaneously.
Avoid “maintenance mode stacking”: if you activate maintenance mode multiple times, you must deactivate it the same number of times.
DNS updates: if the cluster uses DNS, update DNS records when nodes are added/replaced.
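The stacking guardrail means each activation needs its own deactivation. The sketch below repeats the `maintenance_mode off` command a given number of times. `maintenance_off_times` is a hypothetical helper, and you need to know how many times the mode was activated, because the sketch does not detect it.

```shell
# Hypothetical helper: deactivate maintenance mode repeatedly.
# $1 = node ID, $2 = number of earlier activations to undo
maintenance_off_times() {
  count=0
  while [ "$count" -lt "$2" ]; do
    # Stop at the first failure so the remaining count stays visible.
    rladmin node "$1" maintenance_mode off || return 1
    count=$((count + 1))
  done
}
```

For example, `maintenance_off_times 3 2` undoes two activations on node 3. Check rladmin status afterward to confirm the node is no longer in maintenance.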
Optional: CLI-based workflow (reference)
Prefer the Cluster Manager UI unless you have an established operations runbook. CLI steps vary by version and deployment standards.
Put a node into maintenance mode (evacuate shards):
rladmin node <node_id> maintenance_mode on overwrite_snapshot
Verify it:
rladmin status
Remove a node (if rebuilding/reinstalling):
rladmin node <node_id> remove