Troubleshoot Unreachable Active-Active (CRDB) Participants – Redis Knowledge Base

Active-Active (CRDB) databases in Redis Software rely on syncer processes over TLS connections to replicate data between clusters. When a participant becomes unreachable, replication can stall, links may show as disconnected, and commands such as crdb-cli crdb health-report may hang or return errors.

In practice, these issues are almost always caused by network, DNS, TLS, or endpoint configuration problems rather than Redis data problems. This guide explains how to diagnose and restore connectivity: Step-by-Step Instructions walks through the checks in order, and Common Scenarios maps typical symptoms to their fixes.

For background on Active-Active and the syncer process, see:

- Active-Active geo-distributed Redis
- Syncer process

Quick Fix

| What you see | What to do |
| --- | --- |
| `crdb-cli crdb health-report` hangs or times out | Identify which participant is unreachable and test connectivity |
| Participant shows disconnected or error state | Validate DNS, network, and endpoint reachability |
| TLS or connection errors (`SSL_connect failed`) | Verify certificates and trust chain |
| “no listener on port” errors | Check database endpoint and listener on target cluster |
| Replication stops across all clusters (3+ participants) | Check for multi-participant syncer behavior and consult Redis Support if needed |

Prerequisites

- Access to all participating clusters (CLI/SSH and UI as needed).
- Tools available on each cluster node:
  - crdb-cli
  - rladmin
  - openssl
- CRDB GUID and database IDs.
- A supported Redis Software version where Active-Active is available.

Step-by-Step Instructions

1. Identify the unreachable participant

Run from any participating cluster:

crdb-cli crdb health-report --crdb-guid <CRDB_GUID>

Look for links in disconnected, trying, or error state.
Note which cluster name/instance ID is failing.

See also: crdb-cli crdb health-report reference.
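
Because the report itself can hang when a participant is unreachable, it can help to run it under a hard timeout. The following is a minimal Python sketch, not an official tool: the helper name `run_with_timeout` and the 60-second default are illustrative assumptions, and the output is returned unparsed because its exact format is not covered in this article.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=60):
    """Run cmd and return (status, output).

    status is 'ok' (exit code 0), 'failed' (non-zero exit code),
    or 'timeout' (the command did not finish within timeout_s seconds).
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return "timeout", ""
    status = "ok" if proc.returncode == 0 else "failed"
    return status, proc.stdout + proc.stderr


# Example call (requires crdb-cli on the node; replace the GUID placeholder):
# status, output = run_with_timeout(
#     ["crdb-cli", "crdb", "health-report", "--crdb-guid", "<CRDB_GUID>"]
# )
```

A `timeout` status is a hint that at least one participant is not answering, which is the cue to move on to the connectivity tests in the next step.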

2. Test connectivity to the participant

From each other cluster that should talk to the failing participant, test TLS connectivity to the CRDB endpoint (or load balancer in front of it):

openssl s_client -connect <endpoint>:<port> -servername <endpoint>

Successful handshake (you see a certificate and either Verify return code: 0 (ok) or only a self-signed warning): the basic network and TLS path is working.
Failure or timeout (no connection, handshake errors, or SSL_connect failed): investigate the following.
- Network reachability (firewall, security groups, routing).
- DNS resolution of <endpoint>.
- TLS configuration (certificates, SNI, cipher/cipher-suite compatibility).

Use -showcerts to see the full chain when needed:
openssl s_client -connect <endpoint>:<port> -servername <endpoint> -showcerts
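
When the same test has to be repeated from several clusters, it may be useful to know which stage failed first. The sketch below is one possible way to do that with Python's standard library; `probe_endpoint` and its return labels are hypothetical names chosen for illustration. Like a bare `openssl s_client`, it does not enforce certificate verification and only checks that a handshake with SNI completes.

```python
import socket
import ssl


def probe_endpoint(host, port, timeout_s=5.0):
    """Return 'dns', 'network', or 'tls' for the first stage that fails, else 'ok'."""
    # Stage 1: name resolution.
    try:
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns"

    # Stage 2: plain TCP connection (firewalls, routing, closed ports).
    try:
        sock = socket.create_connection((host, port), timeout=timeout_s)
    except OSError:
        return "network"

    # Stage 3: TLS handshake with SNI, without certificate verification.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with ctx.wrap_socket(sock, server_hostname=host):
            return "ok"
    except (ssl.SSLError, OSError):
        # Includes handshake errors and handshake timeouts.
        return "tls"
    finally:
        sock.close()
```

A `tls` result still warrants a follow-up with `openssl s_client -showcerts`, since this sketch does not report why the handshake failed.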

3. Verify DNS and routing

From all participating clusters:

Ensure endpoint hostnames resolve:
```
dig <endpoint>
# or
nslookup <endpoint>
```
Confirm bidirectional connectivity between all participants (both directions should be able to open TLS connections to the CRDB endpoint).
If a load balancer is used:
- Verify the VIP forwards traffic to the correct Redis nodes.
- Confirm the CRDB listener port is included in the LB pool.
- Check for health-check or SSL/TLS profile misconfigurations that could cause connection resets.
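
Bidirectional connectivity means every cluster must be tested against every other cluster's endpoint, which is easy to get wrong by hand once there are three or more participants. As a rough sketch, the checks to run can be enumerated as ordered pairs. The cluster names, hostnames, and port below are made-up placeholders, and `connectivity_checks` and `s_client_command` are hypothetical helper names.

```python
from itertools import permutations


def connectivity_checks(endpoints):
    """Yield (source, host, port) for every ordered pair of participants.

    endpoints maps a cluster name to the (host, port) of its CRDB endpoint.
    """
    for source, target in permutations(endpoints, 2):
        host, port = endpoints[target]
        yield source, host, port


def s_client_command(host, port):
    """Format the openssl command from step 2 for one target endpoint."""
    return f"openssl s_client -connect {host}:{port} -servername {host}"


# Hypothetical participants; replace with your own cluster names and endpoints.
example = {
    "cluster-a": ("crdb.a.example.com", 12000),
    "cluster-b": ("crdb.b.example.com", 12000),
    "cluster-c": ("crdb.c.example.com", 12000),
}
for source, host, port in connectivity_checks(example):
    print(f"from {source}: {s_client_command(host, port)}")
```

For three participants this lists six checks, two per pair of clusters, one in each direction.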

4. Verify listener and database state

On the target cluster (where the problematic endpoint lives):

rladmin status db all
rladmin status extra all

Check that:

The CRDB database is running (not failed, not in recovery).
The expected port is listening.
There are no shard or state-machine errors (for example, shards stuck in STALE or DOWN).

If you see “no listener on <port>” in logs or health-report:

Confirm the database endpoint exists and is bound to the expected port.
If a load balancer is used, verify that the LB maps to this listener and not to a different service/port.

For more on replication backlogs and Active-Active replication, see:
Database replication.

5. Validate TLS configuration

On the clusters where sync is failing:

Check syncer logs for TLS-related errors, for example:
- SSL_connect failed
- encryption error
- certificate or SNI mismatch messages
Confirm that:
- Certificates are valid and not expired.
- The CA trust chain is consistent across clusters.
- Client/server certs and keys match.
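
To check expiry across many certificates, the end date printed by `openssl x509 -in <cert_file> -noout -enddate` (a line of the form `notAfter=...`) can be converted into days remaining. This is a small sketch under that assumption; `days_until_expiry` is a hypothetical helper name, and the date in the comment is only an example value.

```python
import ssl
import time


def days_until_expiry(enddate_line, now=None):
    """Return days until expiry for a line like 'notAfter=Jan  5 09:34:43 2018 GMT'.

    A negative result means the certificate has already expired.
    now is a Unix timestamp; it defaults to the current time.
    """
    not_after = enddate_line.strip().split("=", 1)[-1]
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400
```

Running this against the certificates on every participating cluster gives a quick view of which ones are expired or close to expiry before digging into trust-chain issues.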

If you recently rotated proxy or syncer certificates, refresh the CRDB configuration across all participants:

crdb-cli crdb update --crdb-guid <CRDB_GUID> --force

This forces the CRDB to pick up the updated certificate configuration on all instances without changing other settings.

For TLS configuration details, see:
Enable TLS and
Update certificates.

6. Consider multi-participant sync behavior (3+ participants)

For CRDB topologies with three or more participants, a single unreachable site can sometimes cause replication to halt across all regions, depending on how the syncer is implemented and configured in your product version.

In modern Redis Software releases, the syncer is designed to be more resilient so that, when one region is down, the remaining healthy participants can continue to replicate. In older versions, or if the cluster was created before newer behavior was introduced, this may not be the case.

If you have a 3+ participant CRDB and you observe both of the following:
- One region is unreachable.
- Replication has also stopped between the healthy regions.

Then, after completing all network, DNS, listener, and TLS checks, open a case with Redis Support and include:
- The CRDB GUID.
- The Redis Software version of each cluster.
- A recent crdb-cli crdb health-report output and syncer logs.
- An explicit note that multi-participant replication halts when one site is down.

Support can then:

Confirm which syncer behavior/feature set is available in your specific version.
Advise whether any configuration changes (including robust syncer or equivalent mechanisms) are applicable and safe for your deployment.
Propose alternatives when the feature is not available in your version.

7. Validate recovery

Once the network, DNS, listener, and TLS issues have been resolved (and any syncer-related adjustments appropriate for your version have been applied), run the report again:

crdb-cli crdb health-report --crdb-guid <CRDB_GUID>

All links between participants should report connected.
Any prior replication backlog should drain, and replication should resume automatically.

Use your monitoring (for example, CRDB replication lag metrics or dashboards) to confirm that:

Sync lag returns to normal.
There are no recurring syncer TLS or connectivity errors.

Common Scenarios

CRDB health-report hangs or times out

Cause: One participant is unreachable (network, DNS, TLS, or listener).
Fix: Use crdb-cli crdb health-report to identify the failing link, then test that link with openssl s_client. Restore connectivity or endpoint/listener configuration before re-running the report.

TLS errors (`SSL_connect failed`)

Cause: Certificate mismatch, invalid or expired certs, SNI mismatch, or trust chain issues between participants.
Fix:
- Validate certificates (dates, chain, SNI) with:
  - openssl s_client -connect <endpoint>:<port> -servername <endpoint> -showcerts
  - openssl x509 -in <cert_file> -noout -text
- If certs were rotated or changed, run the following so that all participants pick up the updated proxy/syncer configuration:
```
crdb-cli crdb update --crdb-guid <CRDB_GUID> --force
```

“no listener on <port>” errors

Cause: Database endpoint is not listening, misconfigured, or the load balancer points to a non-Redis target.
Fix:
- Use rladmin status db all and rladmin status extra all to ensure the CRDB database is running and bound to the expected port.
- Verify that any intermediate load balancer forwards traffic to that listener and not elsewhere.

Replication stops across all clusters (3+ participants)

Cause: In a multi-participant CRDB (three or more regions), one unreachable site can, in some versions or configurations, cause replication to halt across all participants instead of allowing the healthy regions to continue.
Fix:
- First, complete all checks for network, DNS, listener, and TLS.
- If the only remaining symptom is that healthy regions stop replicating when one site is down:
  - Open a case with Redis Support,
  - Provide the CRDB GUID, Redis Software versions, and recent logs,
  - Mention that you are seeing a global replication halt in a 3+ participant CRDB when one region is unreachable.
- Support will:
  - Confirm which syncer behavior is expected for your version,
  - Determine whether any configuration or feature changes (including robust syncer or its successors) are available and appropriate,
  - Or advise alternatives if that capability is not present in your release.

Connectivity fails for one region only

Cause: Load balancer, DNS, or firewall issue affecting only that region.
Fix: From each other region:
- Run openssl s_client against that region’s CRDB endpoint.
- Correct DNS records, firewall rules, or LB mapping so all clusters can reach each other consistently.

Credential mismatch after UI update

Cause: CRDB admin credentials or database password changed in the UI on one cluster but not propagated to all CRDB participants.
Fix: Update credentials consistently on all clusters, or run the following so that the CRDB configuration uses a coherent set of credentials across all instances:
```
crdb-cli crdb update --crdb-guid <CRDB_GUID> --credentials id=<instance_id>,username=<user>,password=<password>
```

See also: Manage Active-Active databases.

When to contact Redis Support

Contact Redis Support (and provide support packages from all participating clusters) if:

- All network, DNS, listener, and TLS checks pass but links remain disconnected.
- crdb-cli crdb health-report consistently times out or fails even after you restore connectivity.
- You see persistent shard, CCS, or state-machine errors associated with the CRDB.
- You suspect a configuration mismatch that cannot be safely corrected with documented tools.
- You need to safely remove or re-add participants in a complex production CRDB topology.
- You run a 3+ participant CRDB and observe global replication stop when one region is unreachable, after ruling out all connectivity issues.

Key takeaways

Most CRDB issues in this category are network, DNS, TLS, or endpoint-related, not data-corruption problems.
Always validate in order:
1. Network
2. DNS
3. Listener/endpoint
4. TLS & certificates
5. Syncer behavior (especially in multi-participant topologies)
In 3+ participant CRDBs, confirm that the syncer behavior in your version lets healthy regions keep replicating when one region is down; if in doubt, work with Redis Support.
Once connectivity and configuration are correct for your version, CRDB replication will normally resume automatically and converge without data loss, provided the replication backlogs have not been exhausted.
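
The validation order listed above can be expressed as a simple runner that stops at the first failing layer, so that a TLS problem is not investigated before the network path is known to work. This is only an illustrative sketch: `first_failure` is a hypothetical helper, and the lambdas are placeholders standing in for real probes.

```python
def first_failure(checks):
    """Run (name, check) pairs in order and return the first failing name.

    Each check is a zero-argument callable returning True on success.
    Returns None when every check passes; later checks are not run
    once one has failed.
    """
    for name, check in checks:
        if not check():
            return name
    return None


# Placeholder checks; in practice each callable would wrap a real probe
# (for example a DNS lookup or an openssl s_client run).
checks = [
    ("network", lambda: True),
    ("dns", lambda: True),
    ("listener/endpoint", lambda: False),
    ("tls", lambda: True),
]
print(first_failure(checks))
```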
