Keycloak high availability: patterns that survive a production outage

Production patterns for running Keycloak in HA mode: database replication, Infinispan rebalancing, load balancer config, and what happens when a node disappears.

Published May 10, 2026 · 8 min read

Keycloak
Security
Operations
Reliability

Keycloak is not just another microservice. It is the gateway to your entire platform — the service that decides who gets in, how long they stay, and what they can do. When Keycloak goes down, nobody logs in. Not your users, not your admins, not your CI pipeline. Everything stops.

So you run it in high availability mode. Two nodes, three nodes, a cluster behind a load balancer. But Keycloak is not stateless. It has a distributed cache (Infinispan), persistent sessions, offline tokens, and async event processing. HA here is more subtle than “just add another pod.”

This article covers the patterns I have used to keep Keycloak clusters healthy in production, and what actually matters when a node disappears.

The real SPOF: the database

Every Keycloak node connects to the same database. That database is your single point of failure. If the database goes down, the cluster is blind: no new logins, no token refresh, no user lookups. The Infinispan cache buys you seconds, not minutes.

The standard answer is Postgres with Patroni or repmgr, or a managed service like RDS Multi-AZ or Cloud SQL. That is the right answer, but you need to test the failure path:

What happens during a Patroni failover? Keycloak pools its database connections (current Quarkus-based releases use Agroal, not HikariCP), and connections that were open to the old primary fail with SQLException until the pool replaces them. Give the pool a short connection acquisition timeout (about 3 seconds) and connection validation (SELECT 1 or the driver's own validity check). Without those, requests can block for the full acquisition timeout during a failover, which is 30 seconds with some pool defaults.
Read replicas are tempting but dangerous. Keycloak writes sessions and events on every login and token refresh. A stale read can return “session expired” for an active user. If you use replicas, pin all writes and critical reads to the primary via a datasource that routes by statement type, or do not use replicas at all.
Connection pool sizing: each Keycloak node can use 50–200 connections depending on load, so two nodes at the upper bound means up to 400 connections at peak. Make sure your database max_connections covers every node that can exist in the cluster at once (including extra nodes during a rolling upgrade), plus headroom for administrative connections.
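
The sizing rule in the last point is plain arithmetic, and a rough sketch makes the headroom explicit. The function name, the 20-connection administrative reserve, and the example figures are illustrative assumptions, not Keycloak or Postgres defaults.

```python
def required_max_connections(max_nodes: int, pool_max_per_node: int,
                             admin_reserve: int = 20) -> int:
    """Return the lowest safe database max_connections for the cluster.

    max_nodes counts every node that could exist at the same time,
    including extra nodes present during a rolling upgrade or scale-out.
    admin_reserve is a hypothetical allowance for psql sessions,
    migrations, and monitoring agents.
    """
    if max_nodes < 1 or pool_max_per_node < 1 or admin_reserve < 0:
        raise ValueError("node and pool counts must be positive")
    return max_nodes * pool_max_per_node + admin_reserve


# Two nodes at the upper bound of 200 connections each, plus the reserve.
print(required_max_connections(max_nodes=2, pool_max_per_node=200))  # 420
```

If the result exceeds what the database can comfortably hold, lower the per-node pool maximum or put a pooler in front of the database rather than raising max_connections blindly.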

Infinispan and JGroups: the distributed cache

Keycloak uses Infinispan as its second-level cache, replicated across nodes via JGroups. This cache holds authentication sessions, user sessions, client sessions, and offline sessions. It is the reason your users do not hit the database on every page load.

The default transport is UDP multicast, which works well on a LAN but is blocked or unreliable in most cloud environments. In the cloud, switch to a TCP transport with TCPPING discovery and a static list of initial hosts:

<stack>TCPPING(initial_hosts=node1[7800],node2[7800])</stack>

For Kubernetes, use DNS_PING, which resolves a headless service to discover peers (here the headless service is assumed to be named cluster-name in my-namespace):

<stack>DNS_PING(dns_query=cluster-name.my-namespace.svc.cluster.local)</stack>

Key things to monitor in Infinispan:

View changes — every time a node joins or leaves, JGroups emits a new view. Monitor org.jgroups.protocols logs for viewAccepted messages. A view change mid-request can cause a transient authentication failure. That is normal. Multiple view changes in a minute are not.
State transfer — when a node rejoins, Infinispan transfers cached state from existing members. This can cause CPU spikes on all nodes. Plan for it: do not restart all nodes simultaneously.
Cache rebalancing — if you use a distributed cache mode (not replicated), keys are spread across nodes. A node leaving triggers rebalancing. During rebalancing, lookups for keys that lived on the departed node will miss and fall through to the database. Your database needs to handle that surge.
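
The "multiple view changes in a minute" rule from the first point can be turned into an alert condition. This is a minimal sketch that assumes you have already extracted the timestamps of view-change log lines; the function name and the threshold of two are my own choices, and the log parsing itself is left out.

```python
from datetime import datetime, timedelta


def view_change_bursts(timestamps, window=timedelta(minutes=1), threshold=2):
    """Return the timestamps at which `threshold` or more view changes
    fall inside one sliding window ending at that timestamp."""
    ordered = sorted(timestamps)
    alerts = []
    start = 0
    for end, current in enumerate(ordered):
        # Advance the left edge until the span fits inside the window.
        while current - ordered[start] > window:
            start += 1
        if end - start + 1 >= threshold:
            alerts.append(current)
    return alerts


t0 = datetime(2026, 5, 10, 10, 0, 0)
changes = [t0, t0 + timedelta(seconds=20), t0 + timedelta(minutes=5)]
# Only the second change is within a minute of another one.
print(view_change_bursts(changes))
```

A single isolated view change produces no alert, which matches the expectation that one join or leave is normal.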

Load balancer and sticky sessions

Keycloak sets a cookie called AUTH_SESSION_ID. This is the key to its local Infinispan cache. If the load balancer routes a request to a node that does not hold that session in its local cache, the node has to fetch it from the database or from another cluster member.

Using sticky sessions (session affinity) is the pragmatic choice. It avoids the cross-node state transfer on every request and keeps latency predictable. Most load balancers support it natively: Nginx ip_hash, the HAProxy cookie directive, AWS ALB stickiness.

But sticky sessions introduce a failure mode: if your stickiness is too aggressive, a single node can accumulate too many sessions and become hot. Set a reasonable max-age on the cookie (4–8 hours) so long-lived clients rebalance naturally.

Health checks deserve care. Keycloak exposes /auth/health (or /realms/master/.well-known/openid-configuration for older versions); the exact path depends on the version and on the configured relative path. Do not stop at HTTP 200: verify that the node can reach the database and its cache peers. A node that is “running” but disconnected from the cluster will still pass a TCP check or a bare HTTP 200 check, and it will serve stale sessions. Use a scripted check that hits an authenticated endpoint.
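
One way to structure such a scripted check is to combine several probes and fail the node if any of them fails. The sketch below is only the aggregation logic: the probe names are hypothetical, and each probe is a callable you would supply yourself (an HTTP call to an authenticated endpoint, a database ping, a cluster-size comparison).

```python
from typing import Callable, List, Mapping, Tuple


def deep_health(probes: Mapping[str, Callable[[], bool]]) -> Tuple[bool, List[str]]:
    """Run every probe and report which ones failed.

    A probe that raises is treated as failed, so a broken dependency
    cannot make the node look healthy.
    """
    failed = []
    for name, probe in probes.items():
        try:
            ok = bool(probe())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return (not failed, failed)


healthy, failed = deep_health({
    "http": lambda: True,            # stand-in for a real HTTP probe
    "database": lambda: True,        # stand-in for a real DB ping
    "cluster_size": lambda: 2 >= 3,  # node sees 2 members, expects 3
})
print(healthy, failed)  # False ['cluster_size']
```

Returning the list of failed probes, not just a boolean, makes the load balancer's decision explainable when you read the logs after an incident.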

What happens during a node failure

When one Keycloak node goes down:

Active sessions that were pinned to that node are lost from the local cache. The remaining nodes must fetch them from the database. Users may see a brief re-authentication (1–3 seconds) depending on database latency.
Tokens already issued remain valid until expiry. A JWT access token is validated by its signature, not against a cache, so it works whether the issuing node is alive or not.
Logins in flight during the failure will fail. The client (user or service) must retry. Make sure your clients have retry logic with exponential backoff. For a simple client, a hard-coded 3-second timeout and one retry covers most cases.
Offline tokens are stored in the database by default, so they survive node failures. They are reloaded into cache on first use after the failure.
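
For the client-side retry mentioned above, a minimal sketch of exponential backoff with jitter could look like this. The function and parameter names are my own, and catching every exception is a simplification: a real client should retry only on transient errors such as timeouts and 5xx responses.

```python
import random
import time


def retry_with_backoff(operation, attempts=3, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    The upper bound of the wait doubles after each failure
    (base_delay, 2x, 4x, ...), is capped at max_delay, and the actual
    wait is randomised so that clients do not all retry in step.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

The sleep function is injectable so the behaviour can be tested without waiting. Pair this with a per-request timeout (the 3 seconds mentioned above) inside operation itself.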

The cluster should stabilise within 5–15 seconds after a single node loss, assuming the database can handle the cache-miss surge.

Blue-green deployments and rolling restarts

Keycloak supports blue-green deployments, but the transition must be managed:

Drain the node first: set the load balancer health check to failing. Wait for all in-flight requests to complete (typically 30–60 seconds). Then restart or replace the node.
Do not restart all nodes at once. The cluster needs at least one surviving node to serve cache state to new members. If every node restarts simultaneously, the cache is cold and every request hits the database.
Rolling restarts are safe with 3+ nodes. Restart one, let it rejoin and rebalance, then restart the next. With 2 nodes, you will have a brief window of degraded cache performance during the restart of each node.
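
The drain, restart, wait-for-rejoin sequence can be sketched as a small orchestration loop. Everything cluster-specific is passed in as a callable, because how you drain a node, restart it, and read the current member count depends on your load balancer and platform; those callables and the timeout values are assumptions, not Keycloak APIs.

```python
import time


def rolling_restart(nodes, drain, restart, cluster_size,
                    settle_timeout=120.0, poll_interval=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Restart nodes one at a time, waiting for full membership in between."""
    expected = len(nodes)
    for node in nodes:
        drain(node)    # fail the LB health check and let requests finish
        restart(node)  # restart or replace the node
        deadline = clock() + settle_timeout
        # Do not touch the next node until the cluster is whole again.
        while cluster_size() < expected:
            if clock() >= deadline:
                raise TimeoutError(
                    f"{node} did not rejoin within {settle_timeout}s")
            sleep(poll_interval)
```

Aborting on timeout is deliberate: if a node does not rejoin, continuing the rollout would take a second node out of an already degraded cluster.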

Split-brain prevention

Infinispan uses JGroups for cluster membership. If the network partitions, JGroups may form two separate clusters. This is rare with TCP-based discovery on a reliable network, but possible.

Mitigation strategies:

TCPPING with a witness. If your infrastructure allows, add a third “witness” node that participates in cluster membership but serves no traffic. With an odd number of members, one side of a partition always holds a majority, which breaks ties.
Monitor membership changes. Alert on any viewAccepted with fewer members than expected. A cluster that drops from 3 to 2 members should be investigated immediately.
Database as source of truth. If you suspect a split-brain, the database holds the definitive session state. Restart all nodes cleanly and let them reload from the database.
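
The membership alert and the tie-breaking argument both come down to comparing the visible view size with the expected cluster size. A rough sketch of that classification, with labels and a function name of my own choosing:

```python
def partition_status(visible_members: int, expected_members: int) -> str:
    """Classify a cluster view by its size against the expected size.

    "ok"       - every expected member is visible
    "degraded" - members are missing, but this side holds a majority
    "minority" - this side does not hold a majority and may be the
                 wrong half of a split
    """
    if expected_members < 1 or visible_members < 0:
        raise ValueError("expected_members must be >= 1, visible_members >= 0")
    if visible_members >= expected_members:
        return "ok"
    if visible_members > expected_members // 2:
        return "degraded"
    return "minority"


print(partition_status(2, 3))  # degraded
print(partition_status(1, 2))  # minority
```

The second call shows why two-node clusters are awkward: when they split, neither side holds a majority, which is the gap the witness node is meant to close.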

Keycloak HA is not difficult, but it is not automatic either. The most common incidents I have seen were not caused by Keycloak itself. They were caused by misconfigured infrastructure around it: a connection pool too small, a health check too shallow, a JGroups discovery that worked in staging but failed in production.

Understand its state model. Test your failure paths. And keep a runbook for when the cluster does not form.
