Cluster convergence and endpoint rebinding are essential concepts in maintaining high availability and resilience in Redis environments. This article explains how each process works, how to monitor and manage them during topology changes like failover or maintenance, and how to avoid downtime through proper client configuration. It includes step-by-step instructions, recommendations for configuring applications for robustness, and troubleshooting guidance for common issues.
Cluster Convergence
Cluster convergence is the process by which Redis nodes reach a consistent, updated view of the cluster topology after events like failover, maintenance, or topology changes. During convergence, shards and endpoints may be reassigned to maintain performance and availability.
How to Monitor Cluster Convergence:
- Run `rladmin status` to verify shard and node health.
- Run `rladmin cluster running_actions` to check if convergence tasks (failovers, migrations) are still running.
- Use `rlcheck` or external monitoring tools (Prometheus, Grafana) to confirm nodes aren't CPU or memory constrained (>80% usage).
- In OSS APIs or older Redis versions, expect brief endpoint unavailability during these events. Redis delays old endpoint removal to mitigate this.
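As a rough illustration of the 80% check above, the sketch below flags nodes whose CPU or memory usage exceeds the threshold. The `usage` dictionary, its keys, and the function name are hypothetical; in practice the numbers would come from whatever your monitoring stack (for example Prometheus) exports.

```python
# Hypothetical helper: flag nodes whose resource usage exceeds a threshold.
# The input shape ({node: {resource: percent}}) is an assumption made for
# this example, not a format produced by rladmin or rlcheck.

THRESHOLD_PERCENT = 80.0


def constrained_nodes(usage, threshold=THRESHOLD_PERCENT):
    """Return {node: [resources above threshold]} for constrained nodes only."""
    flagged = {}
    for node, metrics in usage.items():
        over = sorted(name for name, pct in metrics.items() if pct > threshold)
        if over:
            flagged[node] = over
    return flagged


if __name__ == "__main__":
    sample = {
        "node:1": {"cpu": 42.0, "memory": 85.5},
        "node:2": {"cpu": 30.0, "memory": 50.0},
    }
    print(constrained_nodes(sample))
```

A node that appears in the result is a candidate for slow convergence and is worth checking before or during maintenance.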
Endpoint Rebinding
Endpoint rebinding refers to DNS/IP updates for Redis database endpoints after changes like node migration or failover. Clients must resolve the updated endpoint to reconnect successfully.
How to Identify Endpoint Rebinding:
- From client: run `dig <endpoint>` to test DNS resolution.
- From cluster node: run `dig @localhost <endpoint>` to isolate external vs. internal DNS issues.
- Confirm connectivity with:
    redis-cli -h <endpoint> -p <port> -a <password> INFO
    redis-cli -h <endpoint> -p <port> -a <password> PING
  Add `--tls`, `--cacert`, etc. if applicable.
- Clients must be configured to re-resolve DNS upon disconnect.
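The DNS checks above can also be scripted. The following is a minimal sketch, using only the Python standard library, that resolves an endpoint and compares the result with a previously recorded address set to notice a rebind. The function names are illustrative and not part of any Redis tooling.

```python
import socket


def resolve(endpoint, port):
    """Return the set of IP addresses the endpoint currently resolves to."""
    infos = socket.getaddrinfo(endpoint, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def rebind_detected(previous, current):
    """True when a previously known address set no longer matches."""
    return bool(previous) and previous != current


# Example usage (placeholders, replace with your own endpoint and port):
#   before = resolve("<endpoint>", 12000)
#   ... failover or maintenance happens ...
#   after = resolve("<endpoint>", 12000)
#   if rebind_detected(before, after):
#       print("endpoint moved:", before, "->", after)
```

Note that the result reflects whatever resolver the host uses, so local DNS caching can delay the point at which a rebind becomes visible.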
Application Configuration for Robustness
To minimize impact during endpoint rebinding:
- Timeouts: Increase `connectionTimeout` and `readTimeout` to tolerate short outages.
- Idle Checks: Enable `testWhileIdle`, `minEvictableIdleTimeMillis`, etc. to refresh stale connections.
- Cluster Support: Use cluster-aware clients (e.g., `JedisCluster`, `UnifiedJedis`) to auto-refresh topology.
- DNS-First: Avoid hardcoding IPs; use FQDNs to maintain failover compatibility.
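To illustrate the ideas behind these settings without tying them to a specific client library, here is a minimal Python sketch of a connection wrapper. It is an analogy under stated assumptions, not how Jedis implements these options: `connect_timeout` and `read_timeout` stand in for `connectionTimeout` and `readTimeout`, `max_idle` loosely mirrors idle eviction, and the class name is hypothetical. It connects by hostname so that every reconnect triggers a fresh DNS lookup.

```python
import socket
import time


class EndpointConnection:
    """Hypothetical wrapper: timeouts, idle eviction, reconnect by FQDN."""

    def __init__(self, host, port, connect_timeout=5.0, read_timeout=5.0,
                 max_idle=30.0):
        self.host = host  # keep the FQDN, never a cached IP
        self.port = port
        self.connect_timeout = connect_timeout
        self.read_timeout = read_timeout
        self.max_idle = max_idle
        self._sock = None
        self._last_used = 0.0

    def _connect(self):
        # create_connection resolves the hostname on every call, so a
        # reconnect picks up the address published after a rebind.
        sock = socket.create_connection((self.host, self.port),
                                        timeout=self.connect_timeout)
        sock.settimeout(self.read_timeout)
        self._sock = sock
        self._last_used = time.monotonic()

    def _is_stale(self):
        if self._sock is None:
            return True
        return time.monotonic() - self._last_used > self.max_idle

    def close(self):
        if self._sock is not None:
            try:
                self._sock.close()
            finally:
                self._sock = None

    def ping(self):
        """Send an inline PING and return the raw reply bytes."""
        if self._is_stale():
            self.close()
            self._connect()
        self._sock.sendall(b"PING\r\n")
        # Without AUTH, a password-protected database replies with an error
        # line instead of +PONG; this sketch returns whatever comes back.
        reply = self._sock.recv(64)
        self._last_used = time.monotonic()
        return reply
```

A production client would also handle authentication, TLS, and partial reads; the point here is only that timeouts, idle eviction, and hostname-based reconnects work together to survive a rebind.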
Troubleshooting Common Issues
Clients Cannot Connect After Maintenance or Failover
- Verify endpoint DNS on the client (`dig <endpoint>`).
- Confirm Redis connectivity using `redis-cli`.
- Review client-side network/firewall rules if only the client is failing.
- Validate TLS configuration if applicable.
Unexpected Client Disconnects
- Rebinding causes disconnects by design, so clients must retry.
- Ensure reconnection and retry logic is enabled.
- Confirm client supports auto-refresh of topology.
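Where a client library does not provide retry logic, it can be approximated in application code. The sketch below is one possible shape, assuming that transient failures surface as `OSError` subclasses (connection resets, timeouts, DNS errors); the function name, attempt count, and delays are illustrative choices, not recommended values from Redis.

```python
import time


def call_with_retry(operation, attempts=5, base_delay=0.1, max_delay=2.0,
                    sleep=time.sleep):
    """Run operation(), retrying on network errors with exponential backoff.

    The last failure is re-raised once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except OSError:
            # ConnectionError, TimeoutError and socket.gaierror are all
            # OSError subclasses, so one except clause covers them.
            if attempt == attempts - 1:
                raise
            sleep(min(max_delay, base_delay * 2 ** attempt))


# Example usage with a hypothetical client object:
#   value = call_with_retry(lambda: client.get("key"))
```

Capping the delay keeps the total wait bounded, while the growing interval avoids hammering an endpoint that is still being rebound.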
Endpoint Unavailability During Upgrades
- Some OSS clusters may have 30–60s delays during endpoint removal.
- Redis defers removing old endpoints to allow DNS TTLs to expire.
- Coordinate maintenance during low-traffic windows.
Additional Resources
- DNS Setup for Redis Cluster Access
- Troubleshooting Redis in Distributed Environments
- Troubleshoot Endpoint Flapping