Keycloak fail recovery: what breaks and how to fix it
A practical guide to Keycloak recovery scenarios: split-brain in Infinispan, corrupted database sequences, stuck migration, dangling sessions, and broken trust stores.
- Keycloak
- Security
- Operations
- Reliability
Everyone writes about how to configure Keycloak HA. Few write about what happens when the HA itself breaks. The cluster splits, the database sequence jumps, a migration gets stuck mid-flight, or certificates expire overnight and nobody notices until users cannot log in.
This is the pragmatic recovery guide — six real scenarios I have encountered, what they look like in the logs, and how to fix them without panicking.
Scenario 1: split-brain in Infinispan
A split-brain happens when cluster nodes lose network connectivity between themselves but remain reachable to clients. JGroups detects the partition and forms separate clusters, each believing itself to be the authoritative group.
Symptoms:
- Logs show "JGroups: MERGE event" or "failed to receive response from" followed by node addresses.
- Different nodes report different cluster sizes in org.jgroups.protocols logs.
- Users report intermittent authentication failures: sometimes it works, sometimes it does not, depending on which “cluster” handles the request.
Recovery:
- Stop client traffic at the load balancer. Block all nodes from serving new requests.
- Pick one node as the “source of truth.” This is usually the node with the most recent session data, but because the database holds the definitive state, any node works.
- Restart the remaining nodes one by one, allowing each to rejoin the first node cleanly via JGroups TCPPING.
- Monitor viewAccepted logs until all nodes report the same view ID and member count.
- Restore traffic.
If sessions feel stale after recovery, clear the Infinispan cache via the Keycloak admin console → Server Info → Cache → Clear. This forces all nodes to reload from the database. Do this during a maintenance window.
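The view comparison in the recovery steps can be scripted rather than eyeballed. The following is a minimal sketch, assuming each node's log contains Infinispan's view line in the shape `Received new cluster view for channel ISPN: [node-a|5] (3) [node-a, node-b, node-c]`. The wording can differ between versions, so treat the regular expression and the helper names `latest_view` and `views_agree` as assumptions to adapt.

```python
import re

# Assumed log shape (adjust to your version's wording):
# "... cluster view for channel ISPN: [node-a|5] (3) [node-a, node-b, node-c]"
VIEW_RE = re.compile(r"cluster view for channel \S+: \[([^|\]]+)\|(\d+)\] \((\d+)\)")


def latest_view(log_text):
    """Return (coordinator, view_id, member_count) from the last view line, or None."""
    matches = VIEW_RE.findall(log_text)
    if not matches:
        return None
    coordinator, view_id, count = matches[-1]
    return coordinator, int(view_id), int(count)


def views_agree(logs_by_node):
    """True when every node's most recent view is identical and none is missing."""
    views = {node: latest_view(text) for node, text in logs_by_node.items()}
    distinct = set(views.values())
    return None not in distinct and len(distinct) == 1
```

Feed it a dictionary of node name to log text collected from each node. A `False` result means at least one node still sees a different view, or has logged no view at all, so traffic should stay blocked.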
Scenario 2: corrupted database sequences
Keycloak uses database sequences for EVENT_ENTITY, USER_SESSION, and other high-traffic tables. If a sequence gets out of sync — typically during a failed migration or an unclean node restart mid-transaction — inserts start failing with duplicate key value violates unique constraint.
Symptoms:
- Errors in the log like ERROR: duplicate key value violates unique constraint "constraint_xxx" during login or token refresh.
- The affected table has a gap or duplicate in its primary key sequence.
Recovery:
Find the current max ID and reset the sequence. For Postgres:
SELECT setval('event_entity_seq', COALESCE((SELECT MAX(id) FROM event_entity), 1));
Run this for every sequence that shows constraint violations. The Keycloak migration scripts create sequences named xxx_seq for most tables. Check event_entity_seq, user_session_seq, and resource_server_seq as starting points.
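Typing the same statement for several sequences invites copy-paste mistakes, so it can help to generate them. This is a small sketch that builds the Postgres statement for each sequence and table pair. The pairs are the starting points named above and may not all exist in your schema; `setval_statement` and `PAIRS` are illustrative names, and the column is assumed to be `id`.

```python
import re

# Only plain lowercase identifiers are accepted, since the names are
# interpolated into SQL text rather than passed as bind parameters.
IDENT_RE = re.compile(r"^[a-z_][a-z0-9_]*$")


def setval_statement(sequence, table, column="id"):
    """Build a Postgres statement that moves `sequence` to MAX(column) of `table`."""
    for name in (sequence, table, column):
        if not IDENT_RE.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    return (
        f"SELECT setval('{sequence}', "
        f"COALESCE((SELECT MAX({column}) FROM {table}), 1));"
    )


# Assumed pairs. Confirm each sequence exists before running anything, e.g.:
#   SELECT sequencename, last_value FROM pg_sequences;
PAIRS = [
    ("event_entity_seq", "event_entity"),
    ("user_session_seq", "user_session"),
    ("resource_server_seq", "resource_server"),
]

if __name__ == "__main__":
    for seq, tbl in PAIRS:
        print(setval_statement(seq, tbl))
```

The script only prints statements. Review the output and run it through your usual SQL client with Keycloak stopped or traffic blocked.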
Scenario 3: stuck migration
Keycloak uses Liquibase for database migrations. In a multi-node cluster, only one node should run migrations. The rest wait on the changelog lock. If the migrating node crashes mid-migration, or if all nodes start simultaneously and race on that lock, the migration can get stuck.
Symptoms:
- Nodes hang or fail at startup with Liquibase errors such as a LockException (Could not acquire change log lock) or a ValidationFailedException reporting a checksum mismatch.
- The DATABASECHANGELOGLOCK table still shows the lock as held by a node that no longer exists, or the schema contains objects from a change set that has no row in DATABASECHANGELOG.
Recovery:
- Check the DATABASECHANGELOG table: SELECT id, author, filename, dateexecuted, exectype FROM databasechangelog ORDER BY orderexecuted; and the lock table: SELECT * FROM databasechangeloglock;
- If the lock is held by a node that is gone, stop all Keycloak nodes and release it: UPDATE databasechangeloglock SET locked = FALSE, lockgranted = NULL, lockedby = NULL WHERE locked = TRUE;
- If a change set was partially applied, you may need to revert the incomplete changes manually (check the Liquibase changelog in the Keycloak JAR under META-INF/jpa-changelog/). Liquibase records a change set in DATABASECHANGELOG only after it completes, so a half-applied one is retried on the next start.
- Restart a single node and verify in its log that the migration passes. Then bring up the remaining nodes.
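The last step, one node first and the rest only after it is ready, is easy to get wrong under pressure. Here is a minimal sketch of that ordering. `start` and `is_ready` are hypothetical callables that you would wire to your own tooling (systemd, kubectl, a health check); nothing here is a Keycloak API.

```python
import time


def rolling_start(nodes, start, is_ready, timeout_s=300.0, poll_s=5.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Start nodes strictly one at a time, waiting for each to become ready."""
    started = []
    for node in nodes:
        start(node)
        deadline = clock() + timeout_s
        # Do not touch the next node until this one reports ready.
        while not is_ready(node):
            if clock() >= deadline:
                raise TimeoutError(
                    f"{node} not ready after {timeout_s}s; started so far: {started}"
                )
            sleep(poll_s)
        started.append(node)
    return started
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting. If the first node times out, the function stops before any other node is started, which is the behaviour you want while a migration is in doubt.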
Scenario 4: dangling sessions and orphaned offline tokens
When a node dies ungracefully, its cached sessions are lost. The database still holds records for those sessions, but they point to a node that no longer exists. Users who try to log out or refresh may get errors.
Symptoms:
- Users report “session expired” or “logout failed” errors even with valid tokens.
- The OFFLINE_USER_SESSION or USER_SESSION table has rows with BROKER_SESSION_ID pointing to a dead node.
Recovery:
- Use the Keycloak admin API to purge stale sessions: DELETE /admin/realms/{realm}/sessions?type=offline
- For a deeper clean, query the database directly. Remove orphaned sessions from OFFLINE_USER_SESSION where the associated node is no longer part of the cluster.
- After cleanup, clear the Infinispan cache via the admin console.
- Users with valid tokens will need to re-authenticate the next time they make a request that triggers a session check. This is normal and expected after an unclean failure.
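If you already have a list of stale session IDs, the API cleanup can be scripted. The sketch below targets the per-session admin endpoint, DELETE /admin/realms/{realm}/sessions/{session}. The base URL, the realm, and the way you obtain the admin access token are assumptions, and offline sessions may need version-specific handling, so check the Admin REST API reference for your Keycloak version first.

```python
import urllib.parse
import urllib.request


def delete_session_request(base_url, realm, session_id, access_token):
    """Build (but do not send) the admin request that removes one user session."""
    path = "/admin/realms/{}/sessions/{}".format(
        urllib.parse.quote(realm, safe=""),
        urllib.parse.quote(session_id, safe=""),
    )
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        method="DELETE",
        headers={"Authorization": f"Bearer {access_token}"},
    )


def purge_sessions(base_url, realm, session_ids, access_token,
                   opener=urllib.request.urlopen):
    """Send one DELETE per session ID and return the IDs that failed."""
    failed = []
    for session_id in session_ids:
        request = delete_session_request(base_url, realm, session_id, access_token)
        try:
            with opener(request, timeout=10):
                pass
        except OSError:  # urllib's URLError and HTTPError are OSError subclasses
            failed.append(session_id)
    return failed
```

Returning the failed IDs instead of raising on the first error lets one run clear as much as possible, and the leftovers can be inspected by hand.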
Scenario 5: broken trust store or keystore
Keycloak is opinionated about TLS. If a certificate expires, or if a keystore is rotated incorrectly, the node may fail to start or may reject otherwise valid connections.
Symptoms:
- Node fails to start with java.security.KeyStoreException or javax.net.ssl.SSLHandshakeException.
- Identity provider federation stops working with SSLHandshakeException: PKIX path building failed.
Recovery:
- Verify the keystore: keytool -list -v -keystore keycloak.jks. Check expiry dates and alias names (the -v flag is what prints the validity period).
- If the certificate expired, generate a new one and update the keystore. Keycloak must be restarted to pick up keystore changes.
- For trusted third-party certificates (identity providers, LDAP), import them into the Java trust store: keytool -import -alias idp -keystore cacerts -file idp-cert.pem.
- Test with openssl s_client -connect idp.example.com:443 from the Keycloak host to rule out network-level TLS issues before blaming Keycloak.
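Expiry is the one failure here that can be seen coming. This is a minimal sketch of an expiry check using Python's standard ssl module. The host name and any alert threshold are up to you, and `days_until_expiry` and `fetch_not_after` are illustrative names.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter timestamp (negative if expired)."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400.0


def fetch_not_after(host, port=443, timeout_s=5.0):
    """Return the notAfter string of the certificate presented by host:port.

    Uses the default trust settings, so an expired or untrusted certificate
    raises ssl.SSLCertVerificationError instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout_s) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Run from the Keycloak host, `days_until_expiry(fetch_not_after("idp.example.com"))` gives the remaining lifetime of the identity provider's certificate. A verification error from `fetch_not_after` is itself a useful signal that the chain is already broken.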
Scenario 6: the cluster simply will not form
This is the most frustrating scenario. You start the second node, check the logs, and it just sits there. No view change, no state transfer, no errors — nothing.
Common causes:
- Wrong bind address. JGroups binds to the first available interface. If that is 127.0.0.1, the nodes cannot see each other. Set jboss.bind.address.private to the actual network interface IP.
- Firewall blocking JGroups ports. The JGroups TCP transport defaults to port 7800, and the legacy WildFly distribution binds jgroups-tcp to 7600. In cloud environments, security groups often block these.
- Multicast disabled. Many cloud providers do not support multicast at all. If you are using UDP multicast discovery and it does not work, switch to TCPPING or DNS_PING.
- Infinispan configuration mismatch. If nodes use different cache configurations (e.g., different distributed vs replicated cache settings), they will not form a cluster. Compare standalone-ha.xml across nodes.
Diagnostic tools:
- jgroups-diag: a standalone JGroups diagnostic tool that tests discovery and membership outside of Keycloak.
- Infinispan CLI: connect to an Infinispan endpoint via JMX and inspect the cluster view.
- Keycloak admin API → Server Info → System → shows the current cluster members.
- Kubernetes: use kubectl exec into each pod and check /proc/net/tcp for listening ports.
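Before reaching for heavier tools, a plain TCP probe from one node to the others rules out the firewall case quickly. The sketch below assumes the TCP transport on port 7800 mentioned above; it says nothing about UDP or multicast discovery. `port_open` and `unreachable_peers` are illustrative names.

```python
import socket


def port_open(host, port, timeout_s=2.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def unreachable_peers(peers, port=7800, timeout_s=2.0):
    """Return the peers whose JGroups TCP port cannot be reached from this host."""
    return [host for host in peers if not port_open(host, port, timeout_s)]
```

Run it on each node with the addresses of the other nodes. An empty list everywhere means the transport port is reachable, and the problem is more likely the bind address or the discovery protocol.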
Prevention: the recovery runbook
The most valuable artefact you can create for your Keycloak cluster is not another monitoring dashboard. It is a recovery runbook that covers every scenario above, with exact commands, SQL queries, and API calls.
Things to include:
- Database credentials and connection strings for direct SQL access.
- The exact UPDATE and SELECT queries for sequence repair and session cleanup.
- Keystore paths and passwords.
- Load balancer admin access for traffic cut-off.
- The restart order and expected view size after each restart.
Keycloak HA is not magic. It is understanding its state model and having a plan for when that model breaks. The nodes will fail, the network will partition, and certificates will expire. The difference between a bad day and a catastrophic one is whether you have tested the recovery path before you need it.