Endpoint flapping, where Redis database endpoints rapidly alternate between available and unavailable states, typically indicates DNS misconfiguration, premature rebinding, or networking issues during topology changes or maintenance. These disruptions can degrade client connectivity and overall cluster stability. This article walks through diagnostic steps, common root causes and mitigations, rladmin command examples, and what to include in a support request.
Step-by-Step Diagnosis and Resolution
1. Verify Endpoint Existence and Database Health
- Run
rladmin status endpoints
• Confirms that the endpoint is registered and visible in the cluster.
- In the Admin UI or via rladmin, validate the endpoint status:
• Ensure the endpoint is active.
• Confirm the associated database is healthy and in active state.
2. Test DNS Resolution
To validate DNS resolution for a database or cluster endpoint:
Identify the correct endpoint to test:
- Run
rladmin status databases
• Use the value from the ENDPOINT column for the database in question.
- If testing the cluster’s FQDN, substitute the fully qualified domain (e.g., cluster-name.redislabs.com).
From the client machine:
dig <DATABASE_ENDPOINT>
- If this fails, try the same from a node in the Redis cluster.
From a cluster node:
dig <DATABASE_ENDPOINT>
- If the node resolves it but the client cannot, this indicates a client-side DNS resolver issue.
Bypass local DNS resolvers using public DNS:
dig @8.8.8.8 <DATABASE_ENDPOINT> # Google DNS
dig @1.1.1.1 <DATABASE_ENDPOINT> # Cloudflare DNS
dig @9.9.9.9 <DATABASE_ENDPOINT> # Quad9 DNS
- These queries check DNS resolution using trusted public resolvers.
- Useful for isolating local DNS misconfigurations if the cluster/database is publicly accessible.
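The resolver comparison above can be scripted so that differences between the system resolver and public resolvers show up side by side. This is a minimal sketch, assuming `dig` is installed; the function names `normalize_answers`, `query_resolver`, and `compare_resolvers` are hypothetical helpers, not part of any Redis tooling.

```shell
# Normalize `dig +short` output so answer sets compare cleanly:
# drop blank lines, sort, and de-duplicate.
normalize_answers() {
  grep -v '^[[:space:]]*$' | sort -u
}

# Query one resolver for A records. An empty resolver argument means
# "use the system default resolver".
query_resolver() {
  local name="$1" resolver="$2"
  if [ -n "$resolver" ]; then
    dig +short "@${resolver}" "$name" A
  else
    dig +short "$name" A
  fi
}

# Print one line per resolver: the resolver label, then its answers
# joined with commas, so mismatches stand out.
compare_resolvers() {
  local name="$1"
  shift
  local resolver answers
  for resolver in "" "$@"; do
    answers=$(query_resolver "$name" "$resolver" | normalize_answers | paste -sd, -)
    printf '%-10s %s\n' "${resolver:-system}" "${answers:-<no answer>}"
  done
}
```

A call such as `compare_resolvers <DATABASE_ENDPOINT> 8.8.8.8 1.1.1.1 9.9.9.9` prints the system resolver's answers first, then each public resolver's. Running it several times a few seconds apart can show whether the answers themselves are changing, which would point at the records rather than at a single resolver.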
3. Assess DNS Server Mapping and Configuration
- Ensure all DNS records (internal or external) reflect current cluster topology and are not stale.
For custom FQDNs:
- Confirm the following:
• NS (Name Server) records are correctly delegated.
• A records, CNAME, or ALIAS entries are accurately pointing to the intended endpoint.
• TTL values are appropriate (not too long during change windows).
- Use external tools like dig or nslookup to trace resolution paths.
Example:
dig +trace <your-fqdn>
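To check the TTL point above, the answer section of `dig` can be filtered for records whose TTL exceeds a limit. This is a sketch only: `flag_long_ttls` is a hypothetical helper, and the 60-second limit in the usage line is an assumed value, not a Redis recommendation.

```shell
# Read `dig +noall +answer` lines on stdin and print records whose TTL
# (second column) is above the given limit in seconds.
# Comment lines (starting with ";") and malformed lines are skipped.
flag_long_ttls() {
  local max_ttl="$1"
  awk -v max="$max_ttl" '
    $1 !~ /^;/ && $2 ~ /^[0-9]+$/ && $2 + 0 > max + 0 {
      printf "%s %s TTL=%s exceeds %s\n", $1, $4, $2, max
    }'
}
```

Usage: `dig +noall +answer <your-fqdn> | flag_long_ttls 60`. Note that a caching resolver reports the remaining TTL, which counts down between queries; to see the configured TTL, query the zone's authoritative name server directly with `dig @<authoritative-ns> <your-fqdn>`.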
4. Isolate Network and Firewall Issues
Check for environmental factors that may block DNS or Redis traffic:
- Review firewall rules:
• Allow outbound and inbound traffic on port 53 (DNS), and Redis ports like 6379, 12000+ (depending on configuration). Read more on Port Configurations.
- Inspect load balancers, NAT, or proxies:
• Misconfigured network devices can interrupt DNS resolution or prevent packets from reaching Redis endpoints.
Fallback test using direct IP:
redis-cli -h <IP_ADDRESS> -p <PORT>
- If the IP-based connection works but FQDN does not, this confirms a DNS resolution failure.
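The FQDN-versus-IP comparison can be sketched as two probes and a small decision table. The helper names `redis_reachable` and `classify_probe` are hypothetical. The sketch assumes a non-TLS endpoint (a TLS endpoint would need `redis-cli --tls` and the relevant certificate options) and treats a `NOAUTH` reply as reachable, since it still proves the network path works.

```shell
# Return 0 if a Redis server answers on host:port.
# PONG means an unauthenticated ping worked; NOAUTH means the server
# answered but requires credentials. Both prove connectivity.
redis_reachable() {
  local host="$1" port="$2" reply
  reply=$(redis-cli -h "$host" -p "$port" ping 2>&1)
  case "$reply" in
    *PONG*|*NOAUTH*) return 0 ;;
    *) return 1 ;;
  esac
}

# Map the two probe results ("ok" or "fail") to a likely diagnosis.
classify_probe() {
  local fqdn_result="$1" ip_result="$2"
  case "${fqdn_result}/${ip_result}" in
    ok/ok)   echo "both paths work: DNS and network look healthy right now" ;;
    fail/ok) echo "IP works but FQDN fails: suspect DNS resolution" ;;
    ok/fail) echo "FQDN works but IP fails: the IP under test may be stale or wrong" ;;
    *)       echo "both paths fail: suspect network, firewall, or the endpoint itself" ;;
  esac
}
```

A possible invocation: `redis_reachable <DATABASE_ENDPOINT> <PORT> && f=ok || f=fail; redis_reachable <IP_ADDRESS> <PORT> && i=ok || i=fail; classify_probe "$f" "$i"`. Because flapping is intermittent, a single run only describes the moment it was taken; repeating the probe over a few minutes gives a more reliable picture.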
5. Address Flapping During Maintenance or Failover
- Common causes:
- Premature removal of endpoints before DNS TTL expiry.
- Clients caching stale records or failing to refresh.
- Mitigations:
- Increase the rebind grace period to allow for DNS TTL expiration and cache clearing.
- Validate certificate chains on mTLS endpoints (check for expired or missing intermediates).
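As a rough way to size the grace period mentioned above, the worst-case staleness of a DNS answer can be estimated by adding the record TTL to any client-side cache lifetime. This is a rule-of-thumb sketch, not a Redis-documented formula: `min_grace_seconds` is a hypothetical helper, and the default 30-second safety margin is an assumption.

```shell
# Estimate the minimum time to keep an old endpoint reachable.
# Worst case, a client caches an answer just before the resolver's copy
# expires, so staleness can approach record TTL + client cache lifetime.
# The third argument is an optional safety margin (assumed default: 30).
min_grace_seconds() {
  local record_ttl="$1" client_cache_ttl="$2" margin="${3:-30}"
  echo $(( record_ttl + client_cache_ttl + margin ))
}
```

For example, `min_grace_seconds 60 30` combines a 60-second record TTL with a 30-second client cache and the default margin. Clients that cache DNS indefinitely cannot be covered by any grace period and need a restart or a lower cache setting instead.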
Common Issues and Troubleshooting
| Issue | Likely Cause | Resolution |
|---|---|---|
| DNS resolution fails temporarily | Short TTL or DNS propagation delays | Check rladmin status endpoints to confirm the endpoint is registered and active. Then use dig or redis-cli to test DNS resolution and connectivity. |
| Stale/incorrect DNS mapping | Topology change not fully reflected | Compare DNS records to current cluster layout |
| Firewall/network interference | Port blocks or packet loss | Temporarily disable firewall; audit access control policies |
| Endpoint removed too early | DNS TTL not expired before deletion | Adjust TTLs; increase rebind grace period |
| Client DNS caching | Java or other runtimes cache DNS aggressively | Flush local DNS cache or lower client TTL settings |
| Use of IP instead of FQDN | Workaround for DNS failure | Only use temporarily; bypasses Redis failover features |
Additional Engineering Considerations
Proxy Policy Validation
- Check database endpoint proxy policies:
rladmin status endpoints
- Rebind as needed:
rladmin bind db:<db_id> endpoint <endpoint_id> policy <all-master-shards|all-nodes>
Persistent Flapping or Connection Failures
- Check for:
• TLS handshake or certificate validation errors.
• Output buffer or slave buffer errors.
- Adjust buffers if needed:
rladmin tune db <db_name> slave_buffer <size>
Automated Detection and Logging
- Watch for repeated connect/disconnect entries in dmcproxy.log and event_log.log.
- Use log analysis or monitoring tools to flag excessive endpoint churn.
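One simple form of churn detection is to count matching log lines per minute and report the minutes that exceed a threshold. The sketch below makes two loud assumptions: that each log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp, and that the search pattern passed in matches the real connect/disconnect messages. Neither is confirmed for dmcproxy.log or event_log.log, so check a sample of the actual log first. `flag_churn` is a hypothetical helper name.

```shell
# Read log lines on stdin, count lines per minute that match a pattern,
# and print the minutes whose count is above a threshold.
# ASSUMPTION: each line starts with "YYYY-MM-DD HH:MM:SS", so the first
# 16 characters identify the minute. Adjust substr() for other layouts.
flag_churn() {
  local pattern="$1" threshold="$2"
  awk -v pat="$pattern" -v max="$threshold" '
    $0 ~ pat { count[substr($0, 1, 16)]++ }
    END {
      for (minute in count)
        if (count[minute] + 0 > max + 0)
          printf "%s %d events\n", minute, count[minute]
    }' | sort
}
```

Usage: `flag_churn '<pattern>' 20 < <path-to>/dmcproxy.log`, where the pattern and the threshold of 20 are placeholders to tune against normal traffic. Minutes that appear in the output line up with candidate flapping windows and can be cross-checked against maintenance or failover times.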
Support Diagnostics
- Collect full support packages from all nodes involved (especially for CRDB or geo-replication).
- Include convergence history and endpoint policy bindings in support requests.