Diagnosing and Resolving Endpoint Flapping in Redis Software

Endpoint flapping, where Redis database endpoints rapidly alternate between available and unavailable states, typically indicates DNS misconfiguration, premature rebinding, or networking issues during topology changes or maintenance. These disruptions can degrade client connectivity and overall cluster stability. This article provides diagnostic steps, common root causes with their mitigations, troubleshooting guidance, rladmin command examples, and reference resources.

Step-by-Step Diagnosis and Resolution

1. Verify Endpoint Existence and Database Health

Run rladmin status endpoints
• Confirms that the endpoint is registered and visible in the cluster.
In the Admin UI or via rladmin, validate the endpoint status:
• Ensure the endpoint is active.
• Confirm the associated database is healthy and in the active state.

2. Test DNS Resolution

To validate DNS resolution for a database or cluster endpoint:

Identify the correct endpoint to test:

Run rladmin status databases
• Use the value from the ENDPOINT column for the database in question.
If testing the cluster’s FQDN, substitute the fully qualified domain (e.g., cluster-name.redislabs.com).

From the client machine:

dig <DATABASE_ENDPOINT>

If this fails, try the same from a node in the Redis cluster.

From a cluster node:

dig <DATABASE_ENDPOINT>

If the node resolves it but the client cannot, this indicates a client-side DNS resolver issue.

Bypass local DNS resolvers using public DNS:

dig @8.8.8.8 <DATABASE_ENDPOINT>   # Google DNS  
dig @1.1.1.1 <DATABASE_ENDPOINT>   # Cloudflare DNS  
dig @9.9.9.9 <DATABASE_ENDPOINT>   # Quad9 DNS

These queries bypass the local resolver and ask trusted public resolvers directly.
They help isolate local DNS misconfigurations, but only if the cluster or database name is publicly resolvable.
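
A single dig can miss intermittent failures. The following is a minimal Python sketch, not part of any Redis tooling, that polls the system resolver and counts how often the answer changes or disappears. The function names (resolve, poll, count_transitions) and the default attempt count and interval are illustrative assumptions.

```python
import socket
import time
from typing import Callable, List, Optional, Tuple

# A resolution result: sorted unique addresses, or None if the lookup failed.
Answer = Optional[Tuple[str, ...]]


def resolve(hostname: str) -> Answer:
    """Resolve hostname with the system resolver."""
    try:
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return None
    return tuple(sorted({info[4][0] for info in infos}))


def poll(
    hostname: str,
    attempts: int = 10,
    interval: float = 1.0,
    resolver: Callable[[str], Answer] = resolve,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Answer]:
    """Resolve hostname `attempts` times, pausing `interval` seconds between tries."""
    samples: List[Answer] = []
    for i in range(attempts):
        samples.append(resolver(hostname))
        if i < attempts - 1:
            sleep(interval)
    return samples


def count_transitions(samples: List[Answer]) -> int:
    """Count how many consecutive samples differ (address change or failure)."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)
```

For example, count_transitions(poll("<DATABASE_ENDPOINT>", attempts=30)) returning a value above zero during a quiet period suggests the name is flapping at the resolver level. Running the same script on the client machine and on a cluster node mirrors the comparison described above.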

3. Assess DNS Server Mapping and Configuration

Ensure all DNS records (internal or external) reflect current cluster topology and are not stale.

For custom FQDNs:

Confirm the following:
• NS (Name Server) records are correctly delegated.
• A records, CNAME, or ALIAS entries are accurately pointing to the intended endpoint.
• TTL values are appropriate (not too long during change windows).
Use external tools like dig or nslookup to trace resolution paths.

Example:

dig +trace <your-fqdn>
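
To review TTL values before a change window, the answer section printed by dig +noall +answer <your-fqdn> can be parsed with a short script. This is a sketch under assumptions: the sample text is made up for illustration (documentation-range IP, placeholder hostnames), and the helper names parse_dig_answer and records_over are hypothetical.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    name: str
    ttl: int      # seconds
    rtype: str    # A, CNAME, NS, ...
    value: str


def parse_dig_answer(text: str) -> List[Record]:
    """Parse answer lines of the form: NAME TTL IN TYPE VALUE."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and dig comments
        parts = line.split(None, 4)
        if len(parts) == 5 and parts[1].isdigit() and parts[2] == "IN":
            records.append(Record(parts[0], int(parts[1]), parts[3], parts[4]))
    return records


def records_over(records: List[Record], limit_seconds: int) -> List[Record]:
    """Return records whose TTL exceeds limit_seconds."""
    return [r for r in records if r.ttl > limit_seconds]


# Hypothetical sample, not real output from any cluster.
SAMPLE = """\
; made-up answer section for illustration
redis-12000.cluster.example.com. 300 IN CNAME node1.cluster.example.com.
node1.cluster.example.com. 30 IN A 192.0.2.10
"""
```

Calling records_over(parse_dig_answer(SAMPLE), 60) returns only the CNAME record, which would be the one to shorten ahead of a planned topology change. The 60-second limit is an arbitrary example, not a recommended value.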

4. Isolate Network and Firewall Issues

Check for environmental factors that may block DNS or Redis traffic:

Review firewall rules:
• Allow outbound and inbound traffic on port 53 (DNS, both UDP and TCP) and on the Redis database ports, such as 6379 or 12000 and above, depending on configuration. See Network port configurations under Additional Resources.
Inspect load balancers, NAT, or proxies:
• Misconfigured network devices can interrupt DNS resolution or prevent packets from reaching Redis endpoints.

Fallback test using direct IP:

redis-cli -h <IP_ADDRESS> -p <PORT>

If the IP-based connection works but the FQDN-based connection does not, the problem lies in DNS resolution rather than in the endpoint itself.
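
The same comparison can be scripted. The sketch below uses only a plain TCP connect, so it checks reachability and not Redis authentication or TLS. The function names and the result labels ("ok", "dns", "check-ip", "network") are illustrative assumptions.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # includes resolution failures, refusals, and timeouts
        return False


def diagnose(fqdn_ok: bool, ip_ok: bool) -> str:
    """Classify the outcome of the FQDN test and the direct-IP test."""
    if fqdn_ok and ip_ok:
        return "ok"
    if ip_ok:
        return "dns"       # IP works, name does not: DNS resolution problem
    if fqdn_ok:
        return "check-ip"  # name works, IP does not: the tested IP is likely stale
    return "network"       # neither works: firewall, routing, or endpoint down
```

A typical call is diagnose(can_connect("<DATABASE_ENDPOINT>", <PORT>), can_connect("<IP_ADDRESS>", <PORT>)), with the placeholders replaced by the values used in the redis-cli test.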

5. Address Flapping During Maintenance or Failover

Common causes:
- Endpoints removed before the DNS TTL has expired.
- Clients caching stale records or failing to refresh them.
Mitigations:
- Increase the rebind grace period to allow for DNS TTL expiration and cache clearing.
- Validate certificate chains on mTLS endpoints (check for expired or missing intermediates).
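
As a rough way to size the grace period, consider the worst case: a client runtime refreshes its own DNS cache just before the upstream resolver's record expires, so the stale address can survive for about the record TTL plus the client cache TTL. The sketch below encodes that heuristic. It is an assumption-based estimate, not a documented Redis formula, and the function name and default margin are hypothetical.

```python
def min_rebind_grace_seconds(dns_ttl: int, client_cache_ttl: int, margin: int = 5) -> int:
    """Estimate the shortest grace period that outlasts stale DNS answers.

    dns_ttl: TTL of the endpoint's DNS record, in seconds.
    client_cache_ttl: how long the client runtime caches lookups, in seconds.
    margin: extra safety buffer, in seconds (arbitrary default).
    """
    if min(dns_ttl, client_cache_ttl, margin) < 0:
        raise ValueError("durations must be non-negative")
    return dns_ttl + client_cache_ttl + margin
```

For a 60-second record TTL and a client that caches lookups for 30 seconds, min_rebind_grace_seconds(60, 30) gives 95 seconds.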

Common Issues and Troubleshooting

| Issue | Likely Cause | Resolution |
| --- | --- | --- |
| DNS resolution fails temporarily | Short TTL or DNS propagation delays | Check `rladmin status endpoints` to confirm the endpoint is registered and active. Then use `dig` or `redis-cli` to test DNS resolution and connectivity. |
| Stale/incorrect DNS mapping | Topology change not fully reflected | Compare DNS records to current cluster layout |
| Firewall/network interference | Port blocks or packet loss | Temporarily disable firewall; audit access control policies |
| Endpoint removed too early | DNS TTL not expired before deletion | Adjust TTLs; increase rebind grace period |
| Client DNS caching | Java or other runtimes cache DNS aggressively | Flush local DNS cache or lower client TTL settings |
| Use of IP instead of FQDN | Workaround for DNS failure | Only use temporarily; bypasses Redis failover features |

Additional Engineering Considerations

Proxy Policy Validation

Check database endpoint proxy policies:
```
rladmin status endpoints
```

Rebind as needed:

rladmin bind db db:<db_id> endpoint <endpoint_id> policy <single|all-master-shards|all-nodes>

Persistent Flapping or Connection Failures

- Check for:
  - TLS handshake or certificate validation errors.
  - Output buffer or replica (slave) buffer errors.
- Adjust buffers if needed:
```
rladmin tune db <db_name> slave_buffer <size>
```

Automated Detection and Logging

- Watch for repeated connect/disconnect entries in:
  - dmcproxy.log
  - event_log.log
- Use log analysis or monitoring tools to flag excessive endpoint churn.
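
One way to flag excessive churn is a sliding-window count over the timestamps of connect/disconnect events. The sketch below deliberately takes already-extracted timestamps as input, because the exact log line format is not covered here. The function name, the one-minute window, and the threshold of five events are illustrative assumptions.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Iterable, List


def find_churn(
    timestamps: Iterable[datetime],
    window: timedelta = timedelta(minutes=1),
    threshold: int = 5,
) -> List[datetime]:
    """Return each event time at which at least `threshold` events
    (including that one) fall within the trailing `window`."""
    recent = deque()
    alerts = []
    for ts in sorted(timestamps):
        recent.append(ts)
        # Drop events that are older than the window relative to ts.
        while ts - recent[0] > window:
            recent.popleft()
        if len(recent) >= threshold:
            alerts.append(ts)
    return alerts
```

Feeding it the event times extracted from the proxy or event logs yields the moments when churn crossed the threshold, which can then be correlated with maintenance or failover activity.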

Support Diagnostics

- Collect full support packages from all nodes involved (especially for CRDB or geo-replication).
- Include convergence history and endpoint policy bindings in support requests.

Additional Resources

DNS Setup for Redis Cluster Access

Test client connection

Network port configurations
