Keycloak failure recovery: what breaks and how to fix it

A practical guide to Keycloak recovery scenarios: split-brain in Infinispan, corrupted database sequences, stuck migration, dangling sessions, and broken trust stores.

Published May 13, 2026 · 10 min read
  • Keycloak
  • Security
  • Operations
  • Reliability

Everyone writes about how to configure Keycloak HA. Few write about what happens when the HA itself breaks. The cluster splits, the database sequence jumps, a migration gets stuck mid-flight, or certificates expire overnight and nobody notices until users cannot log in.

This is the pragmatic recovery guide — six real scenarios I have encountered, what they look like in the logs, and how to fix them without panicking.

Scenario 1: split-brain in Infinispan

A split-brain happens when cluster nodes lose network connectivity to each other but remain reachable by clients. JGroups detects the partition and each side forms its own cluster, believing itself to be the authoritative group.

Symptoms:

  • Logs show a JGroups MERGE event, or “failed to receive response from” followed by node addresses.
  • Different nodes report different cluster sizes in org.jgroups.protocols logs.
  • Users report intermittent authentication failures — sometimes it works, sometimes it does not, depending on which “cluster” handles the request.

Recovery:

  1. Stop client traffic at the load balancer. Block all nodes from serving new requests.
  2. Pick one node as the “source of truth.” Ideally this is the node with the most recent session data, but because the database holds the definitive state, any healthy node works.
  3. Restart the remaining nodes one by one, allowing each to rejoin the first node cleanly via JGroups TCPPING.
  4. Monitor viewAccepted logs until all nodes report the same view ID and member count.
  5. Restore traffic.
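
Step 4 can be checked mechanically. The following is a minimal sketch, assuming each node logs its cluster view in the form `[coordinator|viewId] (count) [member, member, ...]`, as Infinispan's "Received new cluster view" messages usually do. The node names and the `views_agree` helper are hypothetical; adjust the pattern to whatever your logs actually contain.

```python
import re
from typing import Dict, NamedTuple, Optional, Tuple


class ClusterView(NamedTuple):
    coordinator: str
    view_id: int
    members: Tuple[str, ...]


# Matches e.g. "[kc-1|5] (3) [kc-1, kc-2, kc-3]" (assumed log format).
VIEW_RE = re.compile(
    r"\[(?P<coord>[^|\]]+)\|(?P<view_id>\d+)\]\s+\(\d+\)\s+\[(?P<members>[^\]]*)\]"
)


def parse_view(line: str) -> Optional[ClusterView]:
    """Extract the cluster view from one log line, or None if there is none."""
    match = VIEW_RE.search(line)
    if match is None:
        return None
    members = tuple(
        sorted(m.strip() for m in match.group("members").split(",") if m.strip())
    )
    return ClusterView(match.group("coord"), int(match.group("view_id")), members)


def views_agree(latest_line_per_node: Dict[str, str]) -> bool:
    """True only if every node reports the same coordinator, view ID and members."""
    views = {parse_view(line) for line in latest_line_per_node.values()}
    return len(views) == 1 and None not in views
```

Feed it the most recent view line from each node's log; restore traffic only once it returns True for the full set of nodes.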

If sessions feel stale after recovery, clear the Infinispan cache via the Keycloak admin console → Server Info → Cache → Clear. This forces all nodes to reload from the database. Do this during a maintenance window.

Scenario 2: corrupted database sequences

Keycloak uses database sequences for EVENT_ENTITY, USER_SESSION, and other high-traffic tables. If a sequence gets out of sync — typically during a failed migration or an unclean node restart mid-transaction — inserts start failing with duplicate key value violates unique constraint.

Symptoms:

  • Errors in the log like ERROR: duplicate key value violates unique constraint "constraint_xxx" during login or token refresh.
  • The affected table has a gap or duplicate in its primary key sequence.

Recovery:

Find the current max ID and reset the sequence. For Postgres:

SELECT setval('event_entity_seq', COALESCE((SELECT MAX(id) FROM event_entity), 1));

Run this for every sequence that shows constraint violations. The Keycloak migration scripts create sequences named xxx_seq for most tables. Check event_entity_seq, user_session_seq, and resource_server_seq as starting points.
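
When several sequences need the same treatment, generating the statements is less error-prone than typing them. This is a small sketch that only builds the SQL text and does not connect to a database. The sequence and table pairs are taken from the names mentioned above and are assumptions, so confirm them against your own schema before running anything.

```python
import re

# Only plain lowercase identifiers are accepted, so the names can be
# interpolated into SQL text safely.
_IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")


def setval_statement(sequence: str, table: str, id_column: str = "id") -> str:
    """Build the Postgres statement that moves `sequence` to MAX(id_column) of `table`."""
    for name in (sequence, table, id_column):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    return (
        f"SELECT setval('{sequence}', "
        f"COALESCE((SELECT MAX({id_column}) FROM {table}), 1));"
    )


# Hypothetical pairs for illustration; verify that they exist in your schema.
PAIRS = [
    ("event_entity_seq", "event_entity"),
    ("user_session_seq", "user_session"),
]

for sequence, table in PAIRS:
    print(setval_statement(sequence, table))
```

Review the printed statements, then run them in a single psql session while the nodes are stopped.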

Scenario 3: stuck migration

Keycloak uses Liquibase for database migrations. Applied changesets are recorded in the DATABASECHANGELOG table, and a lock row in DATABASECHANGELOGLOCK ensures that only one node migrates at a time. In a multi-node cluster, one node runs the migration and the rest wait. If the migrating node crashes mid-migration, the lock can stay held and the other nodes wait on it indefinitely.

Symptoms:

  • Nodes hang or fail at startup with messages such as Waiting for changelog lock or Could not acquire change log lock, or with a changeset checksum validation failure.
  • The DATABASECHANGELOGLOCK table has a row with LOCKED set to true although no node is migrating, or DATABASECHANGELOG is missing changesets that the target version expects.

Recovery:

  1. Stop all nodes so that nothing is migrating, then inspect the state: SELECT id, author, filename, dateexecuted, exectype FROM databasechangelog ORDER BY orderexecuted DESC; and SELECT * FROM databasechangeloglock;
  2. If the lock is still held by a node that no longer exists, release it: UPDATE databasechangeloglock SET locked = FALSE, lockgranted = NULL, lockedby = NULL WHERE locked = TRUE;
  3. If a changeset was partially applied, you may need to revert the incomplete changes manually. Liquibase records a changeset only after it completes, so a half-applied changeset is retried on the next start and can fail on objects that already exist. Compare the schema against the jpa-changelog XML files under META-INF/ in the Keycloak JAR.
  4. Restart a single node and verify in its startup log that the migration completes. Then bring up the remaining nodes.

Scenario 4: dangling sessions and orphaned offline tokens

When a node dies ungracefully, its cached sessions are lost. The database still holds records for those sessions, but they point to a node that no longer exists. Users who try to log out or refresh may get errors.

Symptoms:

  • Users report “session expired” or “logout failed” errors even with valid tokens.
  • The OFFLINE_USER_SESSION or USER_SESSION table has rows with BROKER_SESSION_ID pointing to a dead node.

Recovery:

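  1. Use the Keycloak admin API to remove stale sessions. DELETE /admin/realms/{realm}/sessions/{session} removes a single session by ID, and POST /admin/realms/{realm}/logout-all removes all user sessions in the realm. Check the Admin REST API reference for your Keycloak version before scripting either call.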
  2. For a deeper clean, query the database directly. Remove orphaned sessions from OFFLINE_USER_SESSION where the associated node is no longer part of the cluster.
  3. After cleanup, clear the Infinispan cache via the admin console.
  4. Users with valid tokens will need to re-authenticate the next time they make a request that triggers a session check. This is normal and expected after an unclean failure.
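
For scripted cleanup, the admin API call can be prepared like this. It is a sketch under assumptions: it targets the per-session endpoint DELETE /admin/realms/{realm}/sessions/{session} of the Admin REST API, the `isOffline` query parameter is only available on versions that expose it, and the base URL, realm, session ID and token are placeholders. The function only builds the request, so it can be inspected before anything is sent.

```python
import urllib.request
from urllib.parse import quote


def delete_session_request(
    base_url: str,
    realm: str,
    session_id: str,
    token: str,
    offline: bool = False,
) -> urllib.request.Request:
    """Build (but do not send) a DELETE request for one user session."""
    path = "/admin/realms/{}/sessions/{}".format(
        quote(realm, safe=""), quote(session_id, safe="")
    )
    url = base_url.rstrip("/") + path
    if offline:
        # Assumption: the server version supports the isOffline parameter.
        url += "?isOffline=true"
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"Authorization": "Bearer " + token},
    )


# Hypothetical values; sending requires a valid admin access token:
# request = delete_session_request("https://kc.example.com", "demo", "abc-123", token)
# urllib.request.urlopen(request, timeout=10)
```

Older distributions serve the admin API under an /auth prefix, in which case that prefix belongs in the base URL.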

Scenario 5: broken trust store or keystore

Keycloak is opinionated about TLS. If a certificate expires, or if a keystore is rotated incorrectly, the node may fail to start or may reject otherwise valid connections.

Symptoms:

  • Node fails to start with java.security.KeyStoreException or javax.net.ssl.SSLHandshakeException.
  • Identity provider federation stops working with SSLHandshakeException: PKIX path building failed.

Recovery:

  1. Verify the keystore: keytool -list -v -keystore keycloak.jks. The -v flag prints the validity period of each entry. Check expiry dates and alias names.
  2. If the certificate expired, generate a new one and update the keystore. Keycloak must be restarted to pick up keystore changes.
  3. For trusted third-party certificates (identity providers, LDAP), import them into the Java trust store: keytool -import -alias idp -keystore cacerts -file idp-cert.pem.
  4. Test with openssl s_client -connect idp.example.com:443 from the Keycloak host to rule out network-level TLS issues before blaming Keycloak.
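
Expiry is easier to catch before it causes an outage. Below is a minimal sketch using only Python's standard `ssl` and `socket` modules to report how many days remain on a remote certificate. The host name is a placeholder, and the 30-day threshold in the usage comment is an arbitrary assumption.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days between `now` (epoch seconds, default: current time) and a
    certificate's notAfter field, e.g. "Jan  5 09:34:43 2018 GMT"."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate presented by host:port.

    The default context verifies the chain, so an already expired or
    untrusted certificate raises ssl.SSLCertVerificationError here, which
    is itself a useful signal.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Hypothetical usage from the Keycloak host:
# remaining = days_until_expiry(fetch_not_after("idp.example.com"))
# if remaining < 30:
#     print(f"certificate expires in {remaining:.0f} days")
```

Run on a schedule, a check like this covers the identity provider and LDAP endpoints that Keycloak depends on but does not manage.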

Scenario 6: the cluster simply will not form

This is the most frustrating scenario. You start the second node, check the logs, and it just sits there. No view change, no state transfer, no errors — nothing.

Common causes:

  • Wrong bind address. JGroups binds to the first available interface. If that is 127.0.0.1, the nodes cannot see each other. Set jboss.bind.address.private to the actual network interface IP.
  • Firewall blocking JGroups ports. The default TCP stack typically binds to port 7800 (7600 in the older WildFly-based distribution), and UDP stacks use separate multicast ports. In cloud environments, security groups often block these.
  • Multicast disabled. Many cloud providers do not support multicast at all. If you are using UDP multicast discovery and it does not work, switch to TCPPING or DNS_PING.
  • Infinispan configuration mismatch. If nodes use different cache configurations (e.g., different distributed vs replicated cache settings), they will not form a cluster. Compare standalone-ha.xml across nodes.
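
The firewall cause above is quick to rule out with a plain TCP probe run from one node against another. This is a sketch using only the standard library; the peer address in the usage comment is a placeholder, and port 7800 is an assumption that applies only if your stack uses TCP on that port.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Hypothetical usage from node A, probing node B's JGroups port:
# print(tcp_port_open("10.0.1.12", 7800))
```

A False result points at the network path or the bind address rather than at Infinispan. A True result only proves that the port is reachable, not that discovery is configured correctly.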

Diagnostic tools:

  • jgroups-diag — a standalone JGroups diagnostic tool that tests discovery and membership outside of Keycloak.
  • Infinispan CLI — connect to an Infinispan endpoint via JMX and inspect the cluster view.
  • Keycloak admin API → Server Info → System → shows the current cluster members.
  • Kubernetes: use kubectl exec into each pod and check /proc/net/tcp for listening ports.

Prevention: the recovery runbook

The most valuable artefact you can create for your Keycloak cluster is not another monitoring dashboard. It is a recovery runbook that covers every scenario above, with exact commands, SQL queries, and API calls.

Things to include:

  • Database credentials and connection strings for direct SQL access.
  • The exact UPDATE and SELECT queries for sequence repair and session cleanup.
  • Keystore paths and passwords.
  • Load balancer admin access for traffic cut-off.
  • The restart order and expected view size after each restart.

Keycloak HA is not magic. It is understanding its state model and having a plan for when that model breaks. The nodes will fail, the network will partition, and certificates will expire. The difference between a bad day and a catastrophic one is whether you have tested the recovery path before you need it.
