Understanding Cluster Convergence and Endpoint Rebinding – Redis Knowledge Base

Cluster convergence and endpoint rebinding are essential concepts in maintaining high availability and resilience in Redis environments. This article explains how each process works, how to monitor and manage them during topology changes such as failover or maintenance, and how to avoid downtime through proper client configuration. It includes step-by-step instructions, recommendations for configuring applications for robustness, and troubleshooting guidance for common issues.

Cluster Convergence

Cluster convergence is the process by which Redis nodes reach a consistent, updated view of the cluster topology after events like failover, maintenance, or topology changes. During convergence, shards and endpoints may be reassigned to maintain performance and availability.

How to Monitor Cluster Convergence:

Run rladmin status to verify shard and node health.
Run rladmin cluster running_actions to check if convergence tasks (failovers, migrations) are still running.
Use rlcheck or external monitoring tools (Prometheus, Grafana) to confirm that no node is CPU- or memory-constrained (above 80% usage).
Note: with OSS APIs or older Redis versions, expect brief endpoint unavailability during these events. Redis delays removal of the old endpoint to mitigate this.
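
The monitoring steps above can be automated with a small polling loop. The following is a minimal sketch, not an official tool: run_rladmin simply shells out to rladmin on a cluster node, and wait_for_convergence is a hypothetical helper that repeatedly calls a check you supply. The output format of rladmin cluster running_actions is not described in this article, so the check itself is left to the caller rather than guessed here.

```python
import subprocess
import time
from typing import Callable


def run_rladmin(*args: str) -> str:
    """Run an rladmin subcommand on a cluster node and return its stdout."""
    result = subprocess.run(
        ["rladmin", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def wait_for_convergence(
    is_converged: Callable[[], bool],
    timeout_s: float = 300.0,   # illustrative default, not a Redis recommendation
    interval_s: float = 5.0,    # illustrative default
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_converged() until it returns True or the timeout elapses.

    is_converged is supplied by the caller, for example a function that
    inspects run_rladmin("cluster", "running_actions") and decides whether
    any failover or migration task is still listed.
    """
    deadline = clock() + timeout_s
    while True:
        if is_converged():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Injecting sleep and clock keeps the loop easy to exercise without a live cluster; in production the defaults are usually sufficient.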

Endpoint Rebinding

Endpoint rebinding refers to DNS/IP updates for Redis database endpoints after changes like node migration or failover. Clients must resolve the updated endpoint to reconnect successfully.

How to Identify Endpoint Rebinding:

From the client: run dig <endpoint> to test DNS resolution.
From a cluster node: run dig @localhost <endpoint> to determine whether a DNS problem is external or internal to the cluster.
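
The same check can be made from application code. The sketch below uses only the Python standard library and assumes the endpoint is an ordinary DNS name; resolve_endpoint and has_rebound are hypothetical helper names. Note that socket.getaddrinfo goes through the operating system resolver, so a local DNS cache can still return an older answer than dig would.

```python
import socket
from typing import Set


def resolve_endpoint(hostname: str, port: int) -> Set[str]:
    """Return the set of IP addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def has_rebound(previous: Set[str], current: Set[str]) -> bool:
    """True when the endpoint now resolves to a different set of addresses."""
    return previous != current
```

Calling resolve_endpoint before and after a maintenance window and comparing the results with has_rebound gives a quick indication of whether the endpoint moved.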

Confirm connectivity with:

redis-cli -h <endpoint> -p <port> -a <password> INFO
redis-cli -h <endpoint> -p <port> -a <password> PING

Add --tls, --cacert, and related options if the database requires TLS.
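
For environments where redis-cli is not installed, the PING check can be approximated with a raw socket. This is a minimal sketch under stated assumptions: it speaks the Redis serialization protocol (RESP) directly, supports only plain TCP (no TLS), and authenticates with a password-only AUTH. The names encode_command and ping are hypothetical helpers, not part of any Redis client library.

```python
import socket
from typing import Optional


def encode_command(*parts: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    payload = b"*" + str(len(parts)).encode() + b"\r\n"
    for part in parts:
        data = part.encode()
        payload += b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n"
    return payload


def _roundtrip(sock_file, *parts: str) -> str:
    """Send one command and return the first line of the reply."""
    sock_file.write(encode_command(*parts))
    sock_file.flush()
    return sock_file.readline().decode().rstrip("\r\n")


def ping(host: str, port: int, password: Optional[str] = None,
         timeout_s: float = 5.0) -> bool:
    """Return True if the endpoint answers PING with +PONG."""
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        with sock.makefile("rwb") as sock_file:
            if password is not None:
                if _roundtrip(sock_file, "AUTH", password) != "+OK":
                    return False
            return _roundtrip(sock_file, "PING") == "+PONG"
```

A connection error or timeout raises OSError, which distinguishes "endpoint unreachable" from "endpoint reachable but rejecting the command" (a False return).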

Clients must be configured to re-resolve DNS upon disconnect.

Application Configuration for Robustness

To minimize impact during endpoint rebinding:

Timeouts: Increase connectionTimeout and readTimeout to tolerate short outages.
Idle Checks: Enable connection-pool settings such as testWhileIdle and minEvictableIdleTimeMillis so that stale connections are detected and refreshed.
Cluster Support: Use cluster-aware clients (e.g., JedisCluster, UnifiedJedis) to auto-refresh topology.
DNS-First: Avoid hardcoding IPs; use FQDNs to maintain failover compatibility.
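
The recommendations above (timeouts, retries, and connecting by hostname) can be combined in one reconnect routine. The following is a minimal sketch using a plain TCP socket rather than a real Redis client; connect_with_retry is a hypothetical helper and every default value is illustrative, not a Redis recommendation. Passing the hostname to socket.create_connection on each attempt means the name is looked up again every time, although an operating-system DNS cache may still return a stale address until its TTL expires.

```python
import socket
import time


def connect_with_retry(
    hostname: str,
    port: int,
    attempts: int = 5,
    connect_timeout_s: float = 2.0,
    read_timeout_s: float = 5.0,
    base_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
    connect=socket.create_connection,
    sleep=time.sleep,
):
    """Connect by hostname, retrying with exponential backoff on failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            # The hostname (not a cached IP) is passed on every attempt,
            # so a rebound endpoint is picked up once DNS has updated.
            sock = connect((hostname, port), timeout=connect_timeout_s)
            sock.settimeout(read_timeout_s)  # applies to subsequent reads
            return sock
        except OSError as exc:  # covers refused, timed out, and DNS errors
            last_error = exc
            if attempt < attempts - 1:
                sleep(min(max_delay_s, base_delay_s * 2 ** attempt))
    raise ConnectionError(
        f"could not connect to {hostname}:{port} after {attempts} attempts"
    ) from last_error
```

Most Redis client libraries expose equivalent settings, so in practice this logic is usually configured rather than hand-written; the sketch only shows what those settings do.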

Troubleshooting Common Issues

Clients Cannot Connect After Maintenance or Failover

Verify endpoint DNS on the client (dig <endpoint>).
Confirm Redis connectivity using redis-cli.
Review client-side network and firewall rules if the failure is limited to the client.
Validate TLS configuration if applicable.

Unexpected Client Disconnects

Rebinding causes disconnects by design, so clients must retry.
Ensure reconnection and retry logic is enabled.
Confirm that the client supports automatic topology refresh.

Endpoint Unavailability During Upgrades

Some OSS clusters may see delays of 30–60 seconds during endpoint removal.
Redis defers removing old endpoints to allow DNS TTLs to expire.
Coordinate maintenance during low-traffic windows.
