Redis Software pods in OpenShift can restart repeatedly or remain stuck in Terminating because of resource pressure, storage issues, configuration errors, or incomplete recovery tasks. This article explains how to diagnose common causes, gather logs, safely handle stuck pods, and apply best practices for OpenShift and Redis lifecycle management to reduce downtime and data loss.
Covered in this guide: Common Causes, Step-by-Step Troubleshooting, Cluster Recovery and Stuck Pods, Persistent Volume Management, Best Practices, and a Quick Reference Table.
Prerequisites
Access to the OpenShift CLI (oc) with cluster admin permissions.
Redis Software operator and REC service account permissions.
log_collector.py script available in the cluster.
Understanding of OpenShift Security Context Constraints (SCC) and RBAC.
Common Causes of Redis Pod Restarts
Resource pressure: OOMKilled (exit code 137), CPU/memory exhaustion, node contention.
Storage issues: Detached or stuck PVCs, misconfigured persistent storage, multi-attach errors.
Cluster operations: Multiple nodes drained at once, forced pod deletion leaving ghost containers, deleted PVCs still in use.
Cluster health/configuration: Loss of quorum, missing CCS files, issues with SCC or RBAC.
Liveness/readiness probe failures: Slow API responses, misconfigured network policies.
Step-by-Step Troubleshooting
Collect Logs and Diagnostics
Run the log_collector script.
If the script takes a long time to run, add --skip_support_package to skip collecting the REC support package.
Use -m all to gather a complete set of diagnostics.
In Kubernetes environments, specify the namespace with -n <NAMESPACE> so the script does not collect from the wrong environment.
Example
python log_collector.py -m all -n <NAMESPACE>
# e.g.
python log_collector.py -m all -n redis
For all available flags and modes, see the log_collector options page or run:
python log_collector.py -h
Check Pod and Node Health
Inspect pods:
oc get pods, oc describe pod <pod> → look for restart counts, OOMKilled, probe errors.
Inspect nodes:
oc get nodes, oc describe node <node> → identify CPU, memory, or disk resource exhaustion.
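As a rough sketch of the pod inspection step, the helper below filters the default table printed by oc get pods and keeps only pods whose restart count reaches a threshold. The function name high_restart_pods and the threshold value are illustrative assumptions, not part of any Redis or OpenShift tooling.

```shell
# Hypothetical helper: list pods whose RESTARTS column is at or above a threshold.
# Reads the default table printed by `oc get pods` on stdin
# (columns: NAME READY STATUS RESTARTS AGE).
high_restart_pods() {
  threshold="${1:-1}"
  # NR > 1 skips the header row; $4 + 0 forces a numeric comparison.
  awk -v t="$threshold" 'NR > 1 && $4 + 0 >= t { print $1, $3, $4 }'
}

# Usage (needs cluster access, so it is left commented out):
#   oc get pods -n <NAMESPACE> | high_restart_pods 3
```

Pods that show up here are good candidates for oc describe pod <pod>, where the last termination reason (for example OOMKilled) is reported.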
Inspect Storage
Check PVCs:
oc get pvc, oc describe pvc → verify PVCs are not stuck in Terminating or detached.
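The PVC check can be narrowed in a similar way. This minimal sketch assumes the default oc get pvc table, where the second column is STATUS, and prints every claim that is not Bound. The name unbound_pvcs is hypothetical.

```shell
# Hypothetical helper: print PVCs whose STATUS column is anything other than Bound.
# Reads the default table printed by `oc get pvc` on stdin.
unbound_pvcs() {
  # NR > 1 skips the header row; $2 is the STATUS column.
  awk 'NR > 1 && $2 != "Bound" { print $1, $2 }'
}

# Usage (needs cluster access, so it is left commented out):
#   oc get pvc -n <NAMESPACE> | unbound_pvcs
```

Any claim listed as Pending, Lost, or Terminating is worth a closer look with oc describe pvc.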
Look for Orphaned Resources
Check for ghost containers (on worker nodes):
Docker:
docker ps --filter label!=io.kubernetes.pod.namespace
CRI-O:
crictl ps -a --no-trunc | grep redis
Cluster Recovery
If quorum is lost:
Follow the Cluster Recovery documentation
Recovery requires valid PVCs. Missing files may require redeployment.
Pods Stuck in Terminating
Common causes include:
Unfinished enslave_node tasks
Stale or stuck cluster tasks
Insufficient resources
Checks:
Confirm each REC pod has enough Provisional RAM using:
rladmin status extra all
Look for stuck tasks:
rladmin cluster running_actions
If tasks are stuck, contact Redis Support for assistance clearing them.
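The rladmin checks above run inside a REC pod. One way to invoke them from a workstation is a small wrapper around oc exec, sketched below. The wrapper name rec_rladmin is hypothetical, and the container name redis-enterprise-node is an assumption; confirm it with oc describe pod <pod> before relying on it.

```shell
# Hypothetical wrapper: run an rladmin subcommand inside a REC pod.
# Assumption: the Redis container is named redis-enterprise-node.
rec_rladmin() {
  ns="$1"
  pod="$2"
  shift 2
  # Everything after the pod name is passed to rladmin unchanged.
  oc exec -n "$ns" "$pod" -c redis-enterprise-node -- rladmin "$@"
}

# Usage (needs cluster access, so it is left commented out):
#   rec_rladmin <NAMESPACE> <pod> status extra all
#   rec_rladmin <NAMESPACE> <pod> cluster running_actions
```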
Validate SCC, RBAC, and Operator Configuration
Check SCC bindings:
Ensure redis-enterprise-scc-v2 is applied and bound to the REC service account.
For Redis Enterprise ≥ 7.22, binding is only required if automatic resource adjustment is enabled (it is disabled by default).
For versions < 7.22, binding is required in all cases.
Check RBAC permissions:
Confirm the REC service account has the necessary RBAC roles.
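For the SCC and RBAC checks, both oc adm policy add-scc-to-user and oc auth can-i accept a service account in the form system:serviceaccount:<namespace>:<name>. The sketch below builds that string; the helper name rec_sa_subject is hypothetical, and list pods is only an example permission to probe, not a statement of which roles the REC service account requires.

```shell
# Hypothetical helper: build the subject string OpenShift uses for a service account.
# Arguments: namespace, service account name.
rec_sa_subject() {
  echo "system:serviceaccount:$1:$2"
}

# Usage (needs cluster access, so it is left commented out):
#   # Bind the SCC to the REC service account
#   oc adm policy add-scc-to-user redis-enterprise-scc-v2 "$(rec_sa_subject <NAMESPACE> <service-account>)"
#   # Spot-check one RBAC permission by impersonating the service account
#   oc auth can-i list pods -n <NAMESPACE> --as "$(rec_sa_subject <NAMESPACE> <service-account>)"
```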
Maintenance Procedures
Pod maintenance best practices:
Drain or delete one pod at a time.
Never delete a majority of REC pods unless intentionally decommissioning.
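To make the majority rule concrete: a cluster of N pods keeps a strict majority only while at most (N - 1) / 2 pods, rounded down, are unavailable. The helper below is an illustrative calculation only; the name max_pods_down is hypothetical.

```shell
# Hypothetical helper: how many pods of an N-pod cluster can be down
# while a strict majority of pods is still up.
max_pods_down() {
  n="$1"
  # Integer division rounds down, so 3 -> 1, 4 -> 1, 5 -> 2.
  echo $(( (n - 1) / 2 ))
}

# Usage:
#   max_pods_down 3
```

Even when the result is greater than one, draining or deleting a single pod at a time leaves a margin for unrelated failures during maintenance.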
Deep-Dive: Stuck Pods & Cluster Recovery
Scenario: Two of three pods stuck in Terminating after maintenance.
Root cause: Failed enslave_node tasks, PVC finalizers, or storage mount issues.
Resolution: Clean tasks with ccs-cli; if persistence is lost, redeploy REC and restore from backups.
Cluster recovery workflow:
kubectl patch rec <cluster-name> --type merge --patch '{"spec":{"clusterRecovery":true}}'
Wait for REC to return to Running, then restore databases from persistence files.
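Waiting for the REC to return to Running can be scripted with a simple polling loop. This is a minimal sketch that assumes the REC custom resource reports its state in .status.state; verify the field with oc get rec <cluster-name> -o yaml first. The function name, retry count, and delay are all illustrative.

```shell
# Hypothetical helper: poll the REC until it reports Running or the retries run out.
# Arguments: namespace, REC name, max attempts (default 60), delay in seconds (default 10).
# Assumption: the REC state is exposed at .status.state.
wait_for_rec_running() {
  ns="$1"
  rec="$2"
  tries="${3:-60}"
  delay="${4:-10}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    state=$(oc get rec "$rec" -n "$ns" -o jsonpath='{.status.state}')
    echo "REC $rec state: $state"
    if [ "$state" = "Running" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  # The REC never reached Running within the allotted attempts.
  return 1
}

# Usage (needs cluster access, so it is left commented out):
#   wait_for_rec_running <NAMESPACE> <cluster-name>
```

A non-zero return means the REC did not reach Running in time, which is the point to re-check the PVCs or contact Redis Support rather than retry the patch.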
Persistent Volume Management (PVC/PV)
Use block storage only with EXT4/XFS. NFS is not supported.
PVCs can only be expanded, not shrunk. For details, see the PVC Expansion Guide.
Delete PVCs only when pods are fully terminated and no Redis processes are active.
Best Practices for Pod Lifecycle
Always enable persistence in production clusters.
When upgrading, follow the supported procedure described in Upgrade Redis Enterprise Cluster on OpenShift CLI.
Avoid force-deleting pods; prefer supported recovery steps.
Troubleshooting Quick Reference
| Issue | Symptoms | Likely Cause | Fix |
|---|---|---|---|
| CrashLoop (OOMKilled, exit 137) | Frequent restarts, probe failures | Resource shortage, storage misconfig | Increase limits, check PVCs |
| Pods stuck in Terminating | >30 minutes in Terminating | Unfinished preStop, worker node that hosts pod is down/missing | Clean stuck tasks by reaching out to Redis Support |
| Cluster recovery fails | REC stuck in RecoveringFirstPod | Missing PVCs or files, PVC still mounted to another worker node | Validate PVCs, reach out to Redis Support |
| No pods after REC creation | Operator logs SCC errors | Missing SCC/RBAC | Apply redis-enterprise-scc-v2, bind to REC service account |
| Database inaccessible | DBs stuck in running/active-change-pending | Stuck/running state machine | Reach out to Redis Support, or delete and recreate the database (if acceptable) |