Troubleshooting Redis Software Pod Restarts in OpenShift – Redis Knowledge Base

Redis Software pods in OpenShift can restart repeatedly or remain stuck in Terminating because of resource pressure, storage issues, configuration errors, or incomplete recovery tasks. This article explains how to diagnose common causes, gather logs, safely handle stuck pods, and apply best practices for OpenShift and Redis lifecycle management to reduce downtime and data loss.

Covered in this guide: Common Causes, Step-by-Step Troubleshooting, Cluster Recovery and Stuck Pods, Persistent Volume Management, Best Practices, and a Quick Reference Table.

Prerequisites

Access to the OpenShift CLI (oc) with cluster admin permissions.
Redis Software operator and REC service account permissions.
log_collector.py script available in the cluster.
Understanding of OpenShift Security Context Constraints (SCC) and RBAC.

Common Causes of Redis Pod Restarts

Resource pressure: OOMKilled (exit code 137), CPU/memory exhaustion, node contention.
Storage issues: Detached or stuck PVCs, misconfigured persistent storage, multi-attach errors.
Cluster operations: Multiple nodes drained at once, forced pod deletion leaving ghost containers, deleted PVCs still in use.
Cluster health/configuration: Loss of quorum, missing CCS files, issues with SCC or RBAC.
Liveness/readiness probe failures: Slow API responses, misconfigured network policies.

Step-by-Step Troubleshooting


Collect Logs and Diagnostics

Run the log_collector script.

If the script takes a long time to run, add --skip_support_package to skip collecting the REC support package.
Use -m all to gather a complete set of diagnostics.
In Kubernetes environments, specify the namespace with -n <NAMESPACE> so the script does not collect from the wrong environment.

Example

python log_collector.py -m all -n <NAMESPACE>
# e.g.
python log_collector.py -m all -n redis

For all available flags and modes, see the log_collector options page or run:

python log_collector.py -h

Check Pod and Node Health

Inspect pods:

Run oc get pods and oc describe pod <pod>, then look for high restart counts, OOMKilled terminations, and probe errors.
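The restart and OOMKilled checks can also be scripted against the JSON form of the same command. The following is a minimal sketch, assuming the input is the parsed output of oc get pods -n <NAMESPACE> -o json. The function name find_restart_reasons and the sample data (pod and container names included) are hypothetical and not part of any Redis tooling.

```python
import json

def find_restart_reasons(pod_list):
    """List containers that have restarted, with the reason for the last exit.

    pod_list is the parsed JSON from `oc get pods -o json`.
    Returns tuples of (pod, container, restart_count, reason, exit_code).
    """
    findings = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            if cs.get("restartCount", 0) == 0:
                continue
            # lastState.terminated describes the previous run of the container.
            last = cs.get("lastState", {}).get("terminated", {})
            findings.append((pod_name, cs["name"], cs["restartCount"],
                             last.get("reason"), last.get("exitCode")))
    return findings

# Hypothetical sample shaped like `oc get pods -o json` output (not real cluster data).
SAMPLE_PODS = json.loads("""
{"items": [
  {"metadata": {"name": "rec-0"},
   "status": {"containerStatuses": [
     {"name": "redis-enterprise-node", "restartCount": 4,
      "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}]}},
  {"metadata": {"name": "rec-1"},
   "status": {"containerStatuses": [
     {"name": "redis-enterprise-node", "restartCount": 0, "lastState": {}}]}}
]}
""")

for row in find_restart_reasons(SAMPLE_PODS):
    print(row)
```

Against a live cluster, the same function could be fed a file saved with oc get pods -n <NAMESPACE> -o json > pods.json and loaded with json.load.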

Inspect nodes:

Run oc get nodes and oc describe node <node> to identify CPU, memory, or disk exhaustion.
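Node pressure shows up in each node's status conditions, which oc describe node prints in its Conditions section. As a rough sketch, assuming the input is the parsed output of oc get nodes -o json, the helper below reports nodes with a pressure condition set or with Ready not equal to True. The function name and sample node names are hypothetical.

```python
# Standard Kubernetes node condition types that signal resource pressure.
PRESSURE_TYPES = {"MemoryPressure", "DiskPressure", "PIDPressure"}

def unhealthy_node_conditions(node_list):
    """Map node name -> list of problem conditions.

    node_list is the parsed JSON from `oc get nodes -o json`.
    Nodes with no problems are omitted from the result.
    """
    problems = {}
    for node in node_list.get("items", []):
        bad = []
        for cond in node.get("status", {}).get("conditions", []):
            ctype, status = cond.get("type"), cond.get("status")
            if ctype in PRESSURE_TYPES and status == "True":
                bad.append(ctype)
            elif ctype == "Ready" and status != "True":
                # Ready can be "False" or "Unknown"; both need attention.
                bad.append("NotReady")
        if bad:
            problems[node["metadata"]["name"]] = bad
    return problems

# Hypothetical sample (not real cluster data).
SAMPLE_NODES = {"items": [
    {"metadata": {"name": "worker-1"},
     "status": {"conditions": [
         {"type": "MemoryPressure", "status": "True"},
         {"type": "DiskPressure", "status": "False"},
         {"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "worker-2"},
     "status": {"conditions": [
         {"type": "MemoryPressure", "status": "False"},
         {"type": "Ready", "status": "True"}]}},
]}

print(unhealthy_node_conditions(SAMPLE_NODES))
```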

Inspect Storage

Check PVCs:

Run oc get pvc and oc describe pvc <pvc> to verify that no PVC is stuck in Terminating or detached from its pod.
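A PVC that is being deleted carries a deletionTimestamp in its metadata, and a healthy claim reports the Bound phase. The sketch below, which assumes parsed output from oc get pvc -n <NAMESPACE> -o json, flags claims in either problem state and returns their finalizers, since a remaining finalizer is what keeps a PVC in Terminating. It does not detect detached volumes, which still need oc describe and the pod events. The function name and sample claim names are hypothetical.

```python
def problem_pvcs(pvc_list):
    """Return (name, state, finalizers) for PVCs that are Terminating or not Bound.

    pvc_list is the parsed JSON from `oc get pvc -o json`.
    """
    problems = []
    for pvc in pvc_list.get("items", []):
        meta = pvc["metadata"]
        phase = pvc.get("status", {}).get("phase")
        finalizers = meta.get("finalizers", [])
        if meta.get("deletionTimestamp"):
            # Deletion was requested but a finalizer is still holding the object.
            problems.append((meta["name"], "Terminating", finalizers))
        elif phase != "Bound":
            problems.append((meta["name"], phase, finalizers))
    return problems

# Hypothetical sample (not real cluster data).
SAMPLE_PVCS = {"items": [
    {"metadata": {"name": "redis-enterprise-storage-rec-0",
                  "finalizers": ["kubernetes.io/pvc-protection"]},
     "status": {"phase": "Bound"}},
    {"metadata": {"name": "redis-enterprise-storage-rec-1",
                  "deletionTimestamp": "2024-01-01T10:00:00Z",
                  "finalizers": ["kubernetes.io/pvc-protection"]},
     "status": {"phase": "Bound"}},
    {"metadata": {"name": "redis-enterprise-storage-rec-2"},
     "status": {"phase": "Pending"}},
]}

for item in problem_pvcs(SAMPLE_PVCS):
    print(item)
```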

Look for Orphaned Resources

Check for ghost containers (on worker nodes):

Docker:

docker ps --filter label!=io.kubernetes.pod.namespace

CRI-O:
```
crictl ps -a --no-trunc | grep redis
```

Cluster Recovery

If quorum is lost:

Follow the Cluster Recovery documentation.
Recovery requires valid PVCs. Missing files may require redeployment.

Pods Stuck in Terminating

Common causes include:

Unfinished enslave_node tasks
Stale or stuck cluster tasks
Insufficient resources

Checks:

Confirm each REC pod has enough Provisional RAM using:
```
rladmin status extra all
```
Look for stuck tasks:
```
rladmin cluster running_actions
```
If tasks are stuck, contact Redis Support for assistance clearing them.
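To decide whether a pod counts as stuck, it helps to measure how long it has been in Terminating. The quick reference in this article uses 30 minutes as the symptom threshold. The sketch below assumes parsed output from oc get pods -n <NAMESPACE> -o json and relies on two standard metadata fields: deletionTimestamp, which marks when the grace period expires, and deletionGracePeriodSeconds. The function name and sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def stuck_terminating(pod_list, now, threshold=timedelta(minutes=30)):
    """Return (pod_name, minutes_terminating) for pods terminating longer than threshold.

    pod_list is the parsed JSON from `oc get pods -o json`; now is a
    timezone-aware datetime in UTC.
    """
    stuck = []
    for pod in pod_list.get("items", []):
        meta = pod["metadata"]
        ts = meta.get("deletionTimestamp")
        if not ts:
            continue  # pod is not being deleted
        deadline = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        # deletionTimestamp is request time plus the grace period, so subtract
        # the grace period to approximate when deletion was requested.
        grace = timedelta(seconds=meta.get("deletionGracePeriodSeconds", 0))
        elapsed = now - (deadline - grace)
        if elapsed > threshold:
            stuck.append((meta["name"], int(elapsed.total_seconds() // 60)))
    return stuck

# Hypothetical sample (not real cluster data).
SAMPLE_TERMINATING = {"items": [
    {"metadata": {"name": "rec-1", "deletionTimestamp": "2024-01-01T10:00:30Z",
                  "deletionGracePeriodSeconds": 30}},
    {"metadata": {"name": "rec-2", "deletionTimestamp": "2024-01-01T10:50:00Z",
                  "deletionGracePeriodSeconds": 30}},
    {"metadata": {"name": "rec-0"}},
]}

check_time = datetime(2024, 1, 1, 11, 0, 0, tzinfo=timezone.utc)
print(stuck_terminating(SAMPLE_TERMINATING, check_time))
```

In practice, now would be datetime.now(timezone.utc). A pod reported here is a candidate for the rladmin checks above, not for forced deletion.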

Validate SCC, RBAC, and Operator Configuration

Check SCC bindings:

Ensure redis-enterprise-scc-v2 is applied and bound to the REC service account.
For Redis Enterprise ≥ 7.22, binding is only required if automatic resource adjustment is enabled (it is disabled by default).
For versions < 7.22, binding is required in all cases.
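The version rule above can be written as a small decision function. This is only an illustration of the logic as stated in this article. The function name, the flag name auto_resource_adjustment, and the assumption that the version is a dotted string such as 7.22.0 are all hypothetical, not an operator API.

```python
def scc_binding_required(version, auto_resource_adjustment=False):
    """Apply the rule from this article: SCC binding is always required
    below 7.22; from 7.22 on, only when automatic resource adjustment
    is enabled (it is disabled by default).

    version is assumed to be a dotted string such as "7.22.0".
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    if (major, minor) < (7, 22):
        return True
    return auto_resource_adjustment

print(scc_binding_required("7.4.6"))
print(scc_binding_required("7.22.0"))
print(scc_binding_required("7.22.0", auto_resource_adjustment=True))
```

Comparing (major, minor) as a tuple of integers avoids the string-comparison trap where "7.4" would sort after "7.22".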

Check RBAC permissions:

Confirm the REC service account has the necessary RBAC roles.

Maintenance Procedures

Pod maintenance best practices:

Drain or delete one pod at a time.
Never delete a majority of REC pods unless intentionally decommissioning.
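The one-at-a-time rule follows from quorum arithmetic: the cluster needs a strict majority of its nodes available. The sketch below works through that arithmetic. The function names are illustrative only.

```python
def max_pods_safely_down(total_pods):
    """Largest number of pods that can be unavailable while a strict
    majority of the cluster remains."""
    if total_pods < 1:
        raise ValueError("total_pods must be at least 1")
    return (total_pods - 1) // 2

def keeps_quorum(total_pods, pods_down):
    """True if the remaining pods are more than half of the cluster."""
    return (total_pods - pods_down) * 2 > total_pods

for size in (3, 4, 5):
    print(size, max_pods_safely_down(size))
```

For a three-pod REC the margin is a single pod, so draining two nodes at once, or deleting a second pod before the first has rejoined, loses quorum.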

Deep-Dive: Stuck Pods & Cluster Recovery

Scenario: Two of three pods stuck in Terminating after maintenance.

Root cause: Failed enslave_node tasks, PVC finalizers, or storage mount issues.
Resolution: Clean stuck tasks with ccs-cli, with guidance from Redis Support; if persistence is lost, redeploy the REC and restore from backups.
Cluster recovery workflow:
```
kubectl patch rec <cluster-name> --type merge --patch '{"spec":{"clusterRecovery":true}}'
```
Wait for REC to return to Running, then restore databases from persistence files.

Persistent Volume Management (PVC/PV)

Use block storage only with EXT4/XFS. NFS is not supported.
PVCs can only be expanded, not shrunk. For details, see the PVC Expansion Guide.
Delete PVCs only when pods are fully terminated and no Redis processes are active.

Best Practices for Pod Lifecycle

Always enable persistence in production clusters.
When upgrading, follow the supported procedure in Upgrade Redis Enterprise Cluster on OpenShift CLI.
Avoid force-deleting pods; prefer supported recovery steps.

Troubleshooting Quick Reference

Issue	Symptoms	Likely Cause	Fix
CrashLoop (OOMKilled, exit 137)	Frequent restarts, probe failures	Resource shortage, storage misconfig	Increase limits, check PVCs
Pods stuck in Terminating	>30 minutes in Terminating	Unfinished preStop hook, or the worker node hosting the pod is down or missing	Contact Redis Support to clear stuck tasks
Cluster recovery fails	REC stuck in RecoveringFirstPod	Missing PVCs or files, or a PVC still mounted on another worker node	Validate PVCs; contact Redis Support
No pods after REC creation	Operator logs SCC errors	Missing SCC/RBAC	Apply redis-enterprise-scc-v2 and bind it to the REC service account
Database inaccessible	DBs stuck in running/active-change-pending	Stuck or still-running state machine	Contact Redis Support; delete and recreate the database (if acceptable)
