Redis Enterprise for Kubernetes relies on quorum to maintain cluster leadership and management operations. During upgrades (Operator, Kubernetes, or OpenShift), quorum can be lost if a majority of nodes become unavailable or out of sync. When this happens, the cluster cannot elect a master and remains degraded. This article explains how to identify quorum loss, safely recover the cluster, and validate system health, along with troubleshooting and prevention guidance.
Quick Fix Table
| Issue | Resolution |
|---|---|
| Cluster shows no master / lost quorum | Run cluster recovery from a healthy redis-enterprise pod |
| REC stuck in degraded state | Verify infra health, then execute quorum recovery |
| Pods running but no leader elected | Use rladmin recovery workflow from a stable node |
| Cluster unhealthy after upgrade | Restore quorum, then allow Operator reconciliation |
| Databases not active | Validate database state after cluster recovery |
Prerequisites
Access and permissions
kubectl or oc access
Ability to exec into redis-enterprise pods
Access to Operator logs and REC resource
Baseline health checks
Operator pod is running
CRDs are healthy
Nodes are Ready and schedulable
PVCs are Bound and attached
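The baseline checks above can be run as a short sequence of read-only commands. The following is a minimal sketch, not an official tool: the function name `baseline_checks` is hypothetical, and it assumes the Operator Deployment is named `redis-enterprise-operator` and that the CRDs use the `app.redislabs.com` group. Verify both against your installation. Node and PVC checks are covered in the recovery steps.

```shell
# baseline_checks NAMESPACE: run read-only checks before attempting recovery.
# Assumption: the Operator Deployment and CRD names match a default install.
baseline_checks() {
  local ns="$1"
  # Operator Deployment exists and reports ready replicas.
  kubectl get deployment redis-enterprise-operator -n "$ns"
  # CRDs for clusters (REC) and databases (REDB) are installed.
  kubectl get crd redisenterpriseclusters.app.redislabs.com \
                  redisenterprisedatabases.app.redislabs.com
  # Current user may exec into pods in this namespace (prints yes or no).
  kubectl auth can-i create pods --subresource=exec -n "$ns"
}

# Usage: baseline_checks <namespace>
```

On OpenShift, `oc` can generally be substituted for `kubectl` in these commands.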
Understanding Quorum (Critical)
Quorum = majority of nodes: floor(N/2) + 1, where N is the total number of nodes in the cluster
Example:
3 nodes → quorum = 2
5 nodes → quorum = 3
If fewer than a majority of nodes are available and in a consistent state, cluster management halts.
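The majority rule can be checked with a few lines of shell arithmetic. This is an illustrative sketch; the function names `quorum_size` and `has_quorum` are hypothetical helpers, not part of any Redis tooling.

```shell
# quorum_size N: print the minimum number of nodes that forms a majority.
quorum_size() {
  local nodes="$1"
  # Shell arithmetic uses integer division, so this is floor(N/2) + 1.
  echo $(( nodes / 2 + 1 ))
}

# has_quorum TOTAL AVAILABLE: succeed if AVAILABLE nodes form a majority of TOTAL.
has_quorum() {
  local total="$1" available="$2"
  [ "$available" -ge "$(quorum_size "$total")" ]
}

quorum_size 3   # prints 2
quorum_size 5   # prints 3
quorum_size 4   # prints 3
```

Note that a 4-node cluster needs 3 nodes for quorum, so it tolerates only one failure, the same as a 3-node cluster. This is why odd node counts are the usual choice.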
Step-by-Step: Recover Cluster Quorum
1. Verify cluster state
Run:
kubectl get pods -n <namespace>
kubectl get rec -n <namespace>
Confirm:
redis-enterprise pods exist
REC shows degraded / no master
2. Validate infrastructure health
Check:
kubectl get nodes
kubectl get pvc,pv -n <namespace>
Ensure:
Nodes are Ready
PVCs are Bound
No scheduling or storage failures
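On larger clusters it is easy to miss a single unhealthy node or claim in the full listing. The sketch below filters `kubectl` output down to the entries that need attention. The function names are hypothetical, and the filters assume the default column layout of `kubectl get ... --no-headers`, where the first column is NAME and the second is STATUS.

```shell
# Both filters read `--no-headers` output on stdin, where column 1 is NAME
# and column 2 is STATUS.

# not_ready_nodes: print nodes whose STATUS is anything other than "Ready"
# (for example NotReady or Ready,SchedulingDisabled).
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 " (" $2 ")" }'
}

# unbound_pvcs: print PVCs whose STATUS is anything other than "Bound".
unbound_pvcs() {
  awk '$2 != "Bound" { print $1 " (" $2 ")" }'
}

# Usage against a live cluster (empty output means nothing was flagged):
#   kubectl get nodes --no-headers | not_ready_nodes
#   kubectl get pvc -n <namespace> --no-headers | unbound_pvcs
```

A Bound PVC is not necessarily attached to a running pod, so storage errors in `kubectl describe pod` output are still worth reviewing.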
3. Identify the correct recovery node
Choose a stable redis-enterprise pod with:
No crash loops
Healthy attached storage
The most recent cluster state (not a newly recreated pod)
Avoid:
Recently restarted pods
Pods stuck initializing
Nodes with missing volumes
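The selection criteria above can be approximated with a first-pass filter over the pod listing. This is a rough sketch under stated assumptions: `recovery_candidates` is a hypothetical helper, it assumes the default `kubectl get pods --no-headers` columns (NAME, READY, STATUS, RESTARTS, AGE), and it deliberately ignores the READY column because readiness may be incomplete while quorum is lost. It does not check storage, so treat its output as a shortlist to verify by hand.

```shell
# recovery_candidates: read `kubectl get pods --no-headers` output on stdin and
# print the name and age of pods that are Running and have never restarted.
recovery_candidates() {
  awk '$3 == "Running" && $4 == 0 { print $1, $5 }'
}

# Usage against a live cluster:
#   kubectl get pods -n <namespace> --no-headers | recovery_candidates
```

Among the pods it prints, prefer one with a large age, since a very young pod was probably recreated recently.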
4. Execute cluster recovery (core step)
Exec into the selected pod:
kubectl exec -it <redis-enterprise-pod> -n <namespace> -- bash
Run cluster recovery using rladmin. Use the official recovery command for your version (example pattern; the exact arguments depend on your version):
rladmin cluster recover
What this does:
Re-establishes quorum
Elects a new cluster master
Restores cluster management
If required:
Remove unreachable or stale nodes using rladmin
Allow them to rejoin later via Operator
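It can help to capture node status before and after the recovery command without opening an interactive shell. The wrapper below is a sketch: `rec_rladmin` is a hypothetical name, and the container name `redis-enterprise-node` is an assumption about the pod layout that should be checked with `kubectl describe pod`. It only forwards arguments to `rladmin`, so any subcommand valid for your version can be passed through.

```shell
# rec_rladmin NAMESPACE POD ARGS...: run rladmin inside a redis-enterprise pod.
# Assumption: the Redis Enterprise container is named redis-enterprise-node.
rec_rladmin() {
  local ns="$1" pod="$2"
  shift 2
  kubectl exec -n "$ns" "$pod" -c redis-enterprise-node -- rladmin "$@"
}

# Usage: show which nodes the cluster currently sees.
#   rec_rladmin <namespace> <redis-enterprise-pod> status nodes
```

Comparing the `status nodes` output from before and after recovery gives a quick record of which nodes rejoined and which were removed.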
5. Allow Operator reconciliation
Do NOT intervene manually at this stage. The Operator will:
Reconcile StatefulSet
Restart/rejoin nodes
Restore full cluster state
Monitor:
kubectl get rec -n <namespace>
kubectl get pods -n <namespace>
6. Validate recovery
Cluster validation
Admin Console shows healthy cluster
All nodes present and connected
Database validation
Databases are Active
Modules loaded correctly
Reads/writes succeed
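Rather than re-running the monitoring commands by hand, the wait for reconciliation can be scripted. The loop below is a sketch under assumptions: `wait_for_rec` is a hypothetical helper, and it assumes the REC resource reports its state in `.status.state` with the value `Running` when healthy. Confirm the field and value with `kubectl get rec <name> -o yaml` on your version.

```shell
# wait_for_rec NAMESPACE REC [TRIES] [DELAY_SECONDS]
# Poll the REC state until it reports Running or the attempts run out.
# Assumption: a healthy cluster has .status.state equal to "Running".
wait_for_rec() {
  local ns="$1" rec="$2" tries="${3:-30}" delay="${4:-10}" state
  while [ "$tries" -gt 0 ]; do
    state="$(kubectl get rec "$rec" -n "$ns" -o jsonpath='{.status.state}')"
    echo "REC $rec state: ${state:-unknown}"
    [ "$state" = "Running" ] && return 0
    tries=$(( tries - 1 ))
    sleep "$delay"
  done
  return 1
}

# Usage: wait up to 5 minutes (30 tries, 10 seconds apart).
#   wait_for_rec <namespace> <rec-name>
```

A healthy REC state does not prove that databases are serving traffic, so the database checks above still apply. If databases are managed as REDB resources, `kubectl get redb -n <namespace>` lists them.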
Critical Guardrails (Do NOT Skip)
Do NOT do the following:
Delete redis-enterprise pods blindly
Scale down StatefulSet
Recreate the REC resource
Restart all nodes simultaneously
These actions can:
Prolong downtime
Cause data inconsistency
Create split-brain scenarios
Troubleshooting
Cluster still cannot form quorum
Verify enough nodes exist to reach a majority
Ensure the selected recovery node is valid
Split-brain risk
Ensure only one recovery attempt is active
Confirm that separate groups of nodes are not independently forming clusters
Pods running but cluster unhealthy
Check Operator logs
Inspect redis-enterprise pod logs
Storage issues
Confirm PVCs are Bound and correctly attached
Recovery command fails
Verify permissions inside the pod
Confirm the correct command for your version
OpenShift-Specific Considerations
Security
Validate SCC (Security Context Constraints)
Ensure Operator permissions remain intact
Node scheduling
If nodes were drained:
Uncordon nodes
Ensure pods can reschedule
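After a drain, cordoned nodes can be found by their `SchedulingDisabled` status. The sketch below prints the uncordon commands for review instead of running them, so nothing changes until an operator decides it should. `uncordon_commands` is a hypothetical helper that assumes the default `kubectl get nodes --no-headers` column layout.

```shell
# uncordon_commands: read `kubectl get nodes --no-headers` output on stdin and
# print one uncordon command per node that is marked SchedulingDisabled.
uncordon_commands() {
  awk '$2 ~ /SchedulingDisabled/ { print "kubectl uncordon " $1 }'
}

# Usage: review the printed commands, then run the ones you want.
#   kubectl get nodes --no-headers | uncordon_commands
```

On OpenShift the equivalent command is `oc adm uncordon <node>`.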
Prevention Best Practices
Plan upgrades carefully
Follow supported upgrade paths
Avoid combining infrastructure and Redis upgrades in the same maintenance window
Maintain quorum
Never take down a majority of nodes at the same time
Stagger upgrades
Avoid concurrent changes
Do not reshard or scale during upgrades
Monitor actively
Replication health
Pod lifecycle
Cluster status
When to Contact Support
Engage Redis Support if:
Quorum cannot be restored
Recovery repeatedly fails
Cluster state is inconsistent
Provide:
Operator logs
redis-enterprise pod logs
REC YAML and events
Timeline of actions taken
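The first three items can be gathered into one directory with a small script. This is a sketch, not a replacement for any collection tooling Redis Support may ask you to run. `collect_support_data` is a hypothetical name, and the Operator Deployment name `redis-enterprise-operator` and the pod label `app=redis-enterprise` are assumptions to verify in your namespace.

```shell
# collect_support_data NAMESPACE REC_NAME [OUTPUT_DIR]
# Save Operator logs, redis-enterprise pod logs, the REC YAML, and events.
collect_support_data() {
  local ns="$1" rec="$2" out="${3:-rec-support-data}" pod
  mkdir -p "$out"
  # Operator logs (Deployment name is an assumption).
  kubectl logs deployment/redis-enterprise-operator -n "$ns" --all-containers \
    > "$out/operator.log" 2>&1
  # Logs from every redis-enterprise pod (label selector is an assumption).
  for pod in $(kubectl get pods -n "$ns" -l app=redis-enterprise -o name); do
    kubectl logs "$pod" -n "$ns" --all-containers > "$out/${pod#pod/}.log" 2>&1
  done
  # REC YAML and namespace events, oldest first.
  kubectl get rec "$rec" -n "$ns" -o yaml > "$out/rec.yaml"
  kubectl get events -n "$ns" --sort-by=.lastTimestamp > "$out/events.txt"
  echo "Collected support data in $out"
}

# Usage: collect_support_data <namespace> <rec-name> ./support-data
```

The timeline of actions still has to be written by hand. Recording each command and its time as you go makes that much easier.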