Active-Active (CRDB) databases in Redis Software rely on syncer processes over TLS connections to replicate data between clusters. When a participant becomes unreachable, replication can stall, links may show as disconnected, and commands such as crdb-cli crdb health-report may hang or return errors.
In practice, these issues are almost always caused by network, DNS, TLS, or endpoint configuration problems rather than Redis data problems. This guide explains how to diagnose and restore connectivity using Step-by-Step Instructions and Common Scenarios.
For background on Active-Active and the syncer process, see:
Quick Fix
| What you see | What to do |
|---|---|
| crdb-cli crdb health-report hangs or times out | Identify which participant is unreachable and test connectivity |
| Participant shows disconnected or error state | Validate DNS, network, and endpoint reachability |
| TLS or connection errors (SSL_connect failed) | Verify certificates and trust chain |
| “no listener on port” errors | Check database endpoint and listener on target cluster |
| Replication stops across all clusters (3+ participants) | Check for multi-participant syncer behavior and consult Redis Support if needed |
Prerequisites
Access to all participating clusters (CLI/SSH and UI as needed).
Tools available on each cluster node: crdb-cli, rladmin, openssl.
CRDB GUID and database IDs.
A supported Redis Software version where Active-Active is available.
Step-by-Step Instructions
1. Identify the unreachable participant
Run from any participating cluster:
crdb-cli crdb health-report --crdb-guid <CRDB_GUID>
Look for links in disconnected, trying, or error state.
Note which cluster name/instance ID is failing.
See also: crdb-cli crdb health-report reference.
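Because the health report can hang when a participant is unreachable, it can help to bound how long you wait for it. The following is a minimal sketch, assuming GNU coreutils `timeout` is installed on the node. `run_limited` is a hypothetical helper name, not part of Redis Software, and the 60-second limit in the example is an arbitrary choice.

```shell
# run_limited SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND, but gives up after SECONDS.
# Assumes GNU coreutils `timeout`, which exits with status 124 on timeout.
run_limited() {
  _limit=$1
  shift
  _rc=0
  timeout "$_limit" "$@" || _rc=$?
  if [ "$_rc" -eq 124 ]; then
    echo "timed out after ${_limit}s: $*" >&2
  fi
  return "$_rc"
}

# Example (the 60-second limit is an assumption; adjust to your environment):
# run_limited 60 crdb-cli crdb health-report --crdb-guid <CRDB_GUID>
```

An exit status of 124 then tells you the report hung rather than failed outright, which usually points at an unreachable participant.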
2. Test connectivity to the participant
From each other cluster that should talk to the failing participant, test TLS connectivity to the CRDB endpoint (or load balancer in front of it):
openssl s_client -connect <endpoint>:<port> -servername <endpoint>
Successful handshake (you see a certificate and either Verify return code: 0 (ok) or only a self-signed warning): basic network + TLS path is working.
Failure or timeout (no connect, handshake errors, or SSL_connect failed): investigate:
Network reachability (firewall, security groups, routing).
DNS resolution of <endpoint>.
TLS configuration (certificates, SNI, cipher/cipher-suite compatibility).
Use -showcerts to see the full chain when needed:
openssl s_client -connect <endpoint>:<port> -servername <endpoint> -showcerts
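When you repeat this test from several clusters, reducing the s_client output to a single verdict can make comparison easier. The sketch below is one way to do that and is not an official tool. `classify_handshake` and its three labels (`ok`, `untrusted`, `failed`) are hypothetical names. It treats "no peer certificate available" as a failure, because s_client can still print `Verify return code: 0 (ok)` when no certificate was received.

```shell
# classify_handshake
# Hypothetical helper: reads `openssl s_client` output on stdin and prints
#   ok         handshake completed and the chain verified
#   untrusted  handshake completed, verification failed (self-signed, expired, ...)
#   failed     no certificate received or no verification result at all
classify_handshake() {
  awk '
    /no peer certificate available/ { nocert = 1 }
    /Verify return code: 0 \(ok\)/  { result = "ok" }
    /Verify return code: [1-9]/     { result = "untrusted" }
    END { print ((nocert || result == "") ? "failed" : result) }
  '
}

# Example (redirecting stdin from /dev/null makes s_client exit after the handshake):
# openssl s_client -connect "$host:$port" -servername "$host" </dev/null 2>&1 | classify_handshake
```

A result of `untrusted` still shows that the network and TLS path work, which matches the self-signed case described above; `failed` points back to network, DNS, or TLS configuration.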
3. Verify DNS and routing
From all participating clusters:
Ensure endpoint hostnames resolve:
dig <endpoint>
# or
nslookup <endpoint>
Confirm bidirectional connectivity between all participants (each cluster should be able to open a TLS connection to every other participant's CRDB endpoint).
If a load balancer is used:
Verify the VIP forwards traffic to the correct Redis nodes.
Confirm the CRDB listener port is included in the LB pool.
Check for health-check or SSL/TLS profile misconfigurations that could cause connection resets.
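To run the DNS check for several endpoints at once, a small loop around dig is usually enough. This is a sketch under assumptions: `resolved_addresses` and `check_endpoints` are hypothetical helper names, and the hostnames in the example are placeholders. The filter keeps only lines that look like IP addresses, so a CNAME that never resolves to an address is reported as a failure.

```shell
# resolved_addresses
# Reads `dig +short` output on stdin and keeps only lines that look like
# IPv4 or IPv6 addresses, dropping CNAME targets such as "lb.example.com.".
resolved_addresses() {
  grep -E '^([0-9]{1,3}\.){3}[0-9]{1,3}$|^[0-9A-Fa-f:]*:[0-9A-Fa-f:]*$' || true
}

# check_endpoints HOSTNAME [HOSTNAME...]
# Prints one status line per endpoint hostname.
check_endpoints() {
  for _ep in "$@"; do
    if dig +short "$_ep" | resolved_addresses | grep -q .; then
      echo "OK       $_ep"
    else
      echo "NO ADDR  $_ep"
    fi
  done
}

# Example with hypothetical hostnames:
# check_endpoints crdb.cluster-a.example.com crdb.cluster-b.example.com
```

Running the same list from every participating cluster helps reveal cases where a name resolves in one region but not in another.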
4. Verify listener and database state
On the target cluster (where the problematic endpoint lives):
rladmin status db all
rladmin status extra all
Check that:
The CRDB database is running (not failed, not in recovery).
The expected port is listening.
There are no shard or state-machine errors (for example, shards stuck in STALE or DOWN).
If you see “no listener on <port>” in logs or health-report:
Confirm the database endpoint exists and is bound to the expected port.
If a load balancer is used, verify that the LB maps to this listener and not to a different service/port.
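If you have shell access to the node that hosts the endpoint, you can also confirm at the operating-system level that something is listening on the port. The sketch below assumes the `ss` utility from iproute2 is available. `port_is_listening` is a hypothetical helper name, and the check only covers the node where you run it.

```shell
# port_is_listening PORT
# Hypothetical helper: reads `ss -ltn` output on stdin and succeeds if a
# LISTEN socket is bound to PORT. Column 4 of `ss -ltn` is "Local Address:Port".
port_is_listening() {
  awk -v port="$1" '
    $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Example:
# ss -ltn | port_is_listening <port> && echo "listening" || echo "no listener"
```

The match is anchored to the end of the address column, so checking port 1200 does not falsely match a socket on port 12000.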
For more on replication backlogs and Active-Active replication, see: Database replication.
5. Validate TLS configuration
On the clusters where sync is failing:
Check syncer logs for TLS-related errors, for example:
SSL_connect failed
encryption error
certificate or SNI mismatch messages
Confirm that:
Certificates are valid and not expired.
The CA trust chain is consistent across clusters.
Client/server certs and keys match.
If you recently rotated proxy or syncer certificates, refresh the CRDB configuration across all participants:
crdb-cli crdb update --crdb-guid <CRDB_GUID> --force
This forces the CRDB to pick up the updated certificate configuration on all instances without changing other settings.
For TLS configuration details, see: Enable TLS and Update certificates.
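Two of the checks above, expiry and certificate/key match, can be scripted with standard openssl subcommands. The following is a minimal sketch, not a Redis-provided tool. `cert_expires_within` and `cert_matches_key` are hypothetical helper names, and the file paths in the example are placeholders for wherever your PEM files actually live.

```shell
# cert_expires_within CERT_PEM DAYS
# Succeeds if the certificate expires within DAYS days (or has already expired).
# Note: an unreadable file is also reported as expiring; openssl prints the error.
cert_expires_within() {
  ! openssl x509 -in "$1" -noout -checkend "$(( $2 * 86400 ))" >/dev/null
}

# cert_matches_key CERT_PEM KEY_PEM
# Succeeds if the certificate and the private key carry the same public key.
cert_matches_key() {
  _cert_pub=$(openssl x509 -in "$1" -noout -pubkey) || return 1
  _key_pub=$(openssl pkey -in "$2" -pubout) || return 1
  [ "$_cert_pub" = "$_key_pub" ]
}

# Example with placeholder paths:
# cert_expires_within /path/to/cert.pem 30 && echo "expires within 30 days"
# cert_matches_key /path/to/cert.pem /path/to/key.pem || echo "certificate and key do not match"
```

Comparing the extracted public keys avoids printing any private key material while still detecting a mismatched pair.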
6. Consider multi-participant sync behavior (3+ participants)
For CRDB topologies with three or more participants, a single unreachable site can sometimes cause replication to halt across all regions, depending on how the syncer is implemented and configured in your product version.
In modern Redis Software releases, the syncer is designed to be more resilient so that, when one region is down, the remaining healthy participants can continue to replicate. In older versions, or if the cluster was created before newer behavior was introduced, this may not be the case.
If you have a 3+ participant CRDB and you observe that:
One region is unreachable, and
Replication has also stopped between the healthy regions,
Then, after completing all network, DNS, listener, and TLS checks, it is appropriate to:
Open a case with Redis Support,
Include:
CRDB GUID,
Redis Software version for each cluster,
A recent crdb-cli crdb health-report output and syncer logs,
And explicitly note that you observe multi-participant replication halting when one site is down.
Support can then:
Confirm which syncer behavior/feature set is available in your specific version.
Advise whether any configuration changes (including robust syncer or equivalent mechanisms) are applicable and safe for your deployment.
Propose alternatives when the feature is not available in your version.
7. Validate recovery
Once network, DNS, listener, and TLS issues have been resolved (and any syncer-related adjustments appropriate for your version have been applied), rerun the health report:
crdb-cli crdb health-report --crdb-guid <CRDB_GUID>
All links between participants should report connected.
Any prior replication backlog should drain, and replication should resume automatically.
Use your monitoring (for example, CRDB replication lag metrics or dashboards) to confirm that:
Sync lag returns to normal.
There are no recurring syncer TLS or connectivity errors.
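Recovery is not always immediate, so you may want to poll a check rather than run it once. The helper below is a generic sketch. `wait_until` is a hypothetical name, and the check command you pass in is up to you (for example, a TLS probe of the previously failing endpoint). The hostname, port, attempt count, and delay in the example are assumptions.

```shell
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND up to ATTEMPTS times, sleeping
# DELAY_SECONDS between tries, and succeeds as soon as COMMAND succeeds.
wait_until() {
  _attempts=$1
  _delay=$2
  shift 2
  _i=1
  while [ "$_i" -le "$_attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$_i" -lt "$_attempts" ]; then
      sleep "$_delay"
    fi
    _i=$((_i + 1))
  done
  return 1
}

# Example with a hypothetical endpoint, retrying a TLS probe 10 times, 30 s apart:
# probe() { openssl s_client -connect crdb.cluster-b.example.com:12000 \
#             -servername crdb.cluster-b.example.com </dev/null >/dev/null 2>&1; }
# wait_until 10 30 probe && echo "endpoint reachable again"
```

A successful probe only shows that the endpoint is reachable; the health report and your replication lag metrics remain the source of truth for whether replication has caught up.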
Common Scenarios
CRDB health-report hangs or times out
Cause: One participant is unreachable (network, DNS, TLS, or listener).
Fix: Use crdb-cli crdb health-report to identify the failing link, then test that link with openssl s_client. Restore connectivity or fix the endpoint/listener configuration before rerunning the report.
TLS errors (SSL_connect failed)
Cause: Certificate mismatch, invalid or expired certs, SNI mismatch, or trust chain issues between participants.
Fix:
Validate certificates (dates, chain, SNI) with:
openssl s_client -connect <endpoint>:<port> -servername <endpoint> -showcerts
openssl x509 -in <cert_file> -noout -text
If certs were rotated or changed, run the following so that all participants pick up the updated proxy/syncer configuration:
crdb-cli crdb update --crdb-guid <CRDB_GUID> --force
“no listener on <port>” errors
Cause: Database endpoint is not listening, misconfigured, or the load balancer points to a non-Redis target.
Fix:
Use rladmin status db all and rladmin status extra all to ensure the CRDB database is running and bound to the expected port.
Verify that any intermediate load balancer forwards traffic to that listener and not elsewhere.
Replication stops across all clusters (3+ participants)
Cause: In a multi-participant CRDB (three or more regions), one unreachable site can, in some versions or configurations, cause replication to halt across all participants instead of allowing the healthy regions to continue.
Fix:
First, complete all checks for network, DNS, listener, and TLS.
If the only remaining symptom is that healthy regions stop replicating when one site is down:
Open a case with Redis Support,
Provide the CRDB GUID, Redis Software versions, and recent logs,
Mention that you are seeing a global replication halt in a 3+ participant CRDB when one region is unreachable.
Support will:
Confirm which syncer behavior is expected for your version,
Determine whether any configuration or feature changes (including robust syncer or its successors) are available and appropriate,
Or advise alternatives if that capability is not present in your release.
Connectivity fails for one region only
Cause: Load balancer, DNS, or firewall issue affecting only that region.
Fix: From each other region:
Run openssl s_client against that region’s CRDB endpoint.
Correct DNS records, firewall rules, or LB mapping so all clusters can reach each other consistently.
Credential mismatch after UI update
Cause: CRDB admin credentials or database password changed in the UI on one cluster but not propagated to all CRDB participants.
Fix: Update credentials consistently on all clusters, or run the following so that the CRDB configuration uses a coherent set of credentials across all instances:
crdb-cli crdb update --crdb-guid <CRDB_GUID> --credentials id=<instance_id>,username=<user>,password=<password>
See also: Manage Active-Active databases.
When to contact Redis Support
Contact Redis Support (and provide support packages from all participating clusters) if:
All network, DNS, listener, and TLS checks pass but links remain disconnected.
crdb-cli crdb health-report consistently times out or fails even after you restore connectivity.
You see persistent shard, CCS, or state-machine errors associated with the CRDB.
You suspect a configuration mismatch that cannot be safely corrected with documented tools.
You need to safely remove or re-add participants in a complex production CRDB topology.
You run a 3+ participant CRDB and observe a global replication halt when one region is unreachable, after ruling out all connectivity issues.
Key takeaways
Most CRDB issues in this category are network, DNS, TLS, or endpoint-related, not data-corruption problems.
Always validate in order:
Network
DNS
Listener/endpoint
TLS & certificates
Syncer behavior (especially in multi-participant topologies)
In 3+ participant CRDBs, it is important that the syncer behavior for your version allows healthy regions to continue replicating when one region is down; if in doubt, work with Redis Support.
Once connectivity and configuration are correct for your version, CRDB replication will normally resume automatically and converge without data loss, provided the replication backlogs have not been exhausted.