High Availability on Kubernetes
The chart's defaults run one replica of each component — fine for small teams and labs. Three HA paths are wired in:
- Management + signal share a broker mode toggle (cluster.mode) for distributed coordination
- Relay has its own multi-pod fabric independent of the broker mode (per ADR-0014)
- Signal terminates TLS on the binary itself when running under ingress-nginx (community), bypassing a known streaming bug
Cluster broker — the three modes
The cluster.mode switch controls how management and signal pods
coordinate state across replicas. Three options:
cluster.mode: disabled ← single replica, no broker (default)
cluster.mode: embedded ← NATS+JetStream embedded in each pod
cluster.mode: external ← bring-your-own NATS or Redis
Mode 1 — disabled (lab / dev)
No broker. Single replica per component. Smallest footprint.
cluster:
  mode: disabled
management:
  replicaCount: 1
signal:
  replicaCount: 1
Everything talks to itself in-process. Fine for labs and single-user meshes, but not recommended once you have more than a handful of peers.
Mode 2 — embedded (recommended for HA)
Each pod runs an embedded NATS+JetStream server, and the pods together form one cluster. The chart switches
management and signal to StatefulSet, renders a Headless Service
for stable DNS, and wires NATS routes via OPENZRO_CLUSTER_PEERS
(FQDN-resolved to handle GKE VPC-native search domains correctly).
cluster:
  mode: embedded
  embedded:
    clientPort: 4222
    clusterPort: 6222
    jetstream:
      storage: memory # locks bucket only, no PVC needed
management:
  replicaCount: 3 # 3 is JetStream meta-leader quorum minimum
signal:
  replicaCount: 3
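To make the FQDN wiring more concrete, the sketch below builds the per-pod DNS names that Kubernetes gives to StatefulSet pods behind a Headless Service. The StatefulSet name, Service name and the helper itself are illustrative assumptions; the exact value format the chart renders into OPENZRO_CLUSTER_PEERS is not shown here.

```python
def statefulset_pod_fqdns(statefulset, headless_service, namespace,
                          replicas, cluster_domain="cluster.local"):
    """Return the stable DNS name Kubernetes assigns to each StatefulSet pod.

    Pod N of a StatefulSet governed by a Headless Service resolves as
    <statefulset>-<N>.<service>.<namespace>.svc.<cluster-domain>.
    """
    return [
        f"{statefulset}-{ordinal}.{headless_service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


# Hypothetical names: a 3-replica management StatefulSet in the "openzro" namespace.
peers = statefulset_pod_fqdns(
    statefulset="openzro-management",
    headless_service="openzro-management-headless",
    namespace="openzro",
    replicas=3,
)
for fqdn in peers:
    print(fqdn)
```

Spelling out the full name, rather than a short one such as openzro-management-0, keeps resolution from depending on whatever search domains the pod's resolver happens to have configured.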
Trade-offs:
- ✅ Zero external infra dependency
- ✅ Auto-wired by chart, JetStream cluster forms automatically
- ✅ Signal cross-pod dispatch via local NATS (OPENZRO_SIGNAL_DISPATCHER=nats)
- ⚠️ JetStream cluster needs an odd number ≥ 3 for quorum (don't run with 2 replicas: losing either pod breaks quorum and stalls leader election)
- ⚠️ All replicas must boot within ~60s for first leader election (the management coordinator retries bucket creation with backoff)
This is the default recommendation. Operators don't need to manage external NATS or Redis.
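The replica-count rule follows from majority quorum arithmetic. A minimal sketch, assuming the usual Raft majority rule that JetStream clustering relies on; the function names are illustrative, not part of openzro or NATS:

```python
def quorum(replicas: int) -> int:
    """Smallest majority of a group with `replicas` voting members."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return replicas - quorum(replicas)


for n in (1, 2, 3, 4, 5):
    print(f"{n} replicas: quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")
```

Two replicas tolerate no failures at all, and four tolerate no more than three do, which is why the replica count should be 3 or 5 rather than an even number.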
Mode 3a — external with bundled NATS subchart
The chart bundles the upstream nats-io/nats subchart. Enable it
to run a separate NATS deployment in the same release, decoupled
from openzro pods:
cluster:
  mode: external
  # external.url left empty: chart auto-derives nats://<release>-nats:4222
nats:
  enabled: true
  config:
    cluster:
      enabled: true
      replicas: 3
    jetstream:
      enabled: true
      memoryStore:
        enabled: true
        maxSize: 256Mi
      fileStore:
        enabled: false # change if you want persistent jetstream
management:
  replicaCount: 3
signal:
  replicaCount: 3
Trade-offs:
- ✅ Decouples broker from openzro pods (independent scaling, restarts)
- ✅ Standard NATS deployment with its own monitoring / observability
- ✅ Same quorum semantics as embedded
- ⚠️ More resources (separate pods)
- ⚠️ Two charts to keep updated (NATS subchart + openzro chart)
Use this when you want the broker as a separate concern but don't have an existing NATS cluster.
Mode 3b — external with operator-managed NATS
When you already operate NATS at scale (shared with other workloads, managed cloud NATS, NATS Operator), point the chart at it:
cluster:
  mode: external
  external:
    url: nats://my-nats.shared.svc.cluster.local:4222
    # or with auth:
    # url: nats://user:password@nats.example.com:4222
nats:
  enabled: false # don't deploy the subchart
management:
  replicaCount: 3
signal:
  replicaCount: 3
Trade-offs:
- ✅ Reuses existing NATS infra (cost, monitoring, ops)
- ✅ Allows cross-workload subject namespacing
- ⚠️ openzro publishes under oz.cluster.* and oz.signal.*; ensure these subjects don't collide with other tenants
- ⚠️ JetStream KV bucket openzro_locks is created on first start if absent (you can pre-create it for stricter access control)
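One way to audit a shared NATS deployment for subject collisions is to test other tenants' subjects against the openzro patterns. The sketch below assumes the patterns above are read as standard NATS wildcards ('*' matches exactly one token, '>' matches one or more trailing tokens); whether openzro publishes deeper subject hierarchies is not stated here, so the broader oz.> check is included as a conservative option. The helper names and sample subjects are hypothetical.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style subject match on dot-separated tokens."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' is only valid as the last token and needs at least one token to match
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


OPENZRO_PATTERNS = ("oz.cluster.*", "oz.signal.*")


def collides(subject: str, patterns=OPENZRO_PATTERNS) -> bool:
    """True if another tenant's subject falls inside an openzro pattern."""
    return any(subject_matches(p, subject) for p in patterns)


# Hypothetical subjects used by other workloads on the same NATS cluster.
for subject in ("oz.cluster.locks", "orders.cluster.locks", "oz.signal.peer"):
    print(subject, collides(subject))
```

Passing patterns=("oz.>",) reserves the whole oz prefix, which is the safer choice when the exact subject depth is unknown.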
Mode 3c — external with Redis
The openzro core supports Redis as an alternative coordinator backend (signal pub/sub + KV locks). The chart does not yet auto-wire Redis like it does NATS; operators set the env var manually and disable the chart's NATS plumbing.
# Disable the chart's cluster.mode flow; we'll wire Redis directly.
cluster:
  mode: disabled
management:
  replicaCount: 3
  envRaw:
    - name: OPENZRO_REDIS_URL
      value: "redis://my-redis.shared.svc.cluster.local:6379"
    # If using Sentinel / Redis Cluster:
    # - name: OPENZRO_REDIS_URL
    #   value: "redis://default:password@redis-sentinel:26379?master_name=mymaster"
signal:
  replicaCount: 3
  envRaw:
    - name: OPENZRO_REDIS_URL
      value: "redis://my-redis.shared.svc.cluster.local:6379"
    - name: OPENZRO_SIGNAL_DISPATCHER
      value: "redis"
The openzro factory at cluster/factory/factory.go reads
OPENZRO_REDIS_URL directly — no chart configuration knob beyond
the env var. The signal binary needs OPENZRO_SIGNAL_DISPATCHER=redis
to use Redis pub/sub (vs the in-process default).
Trade-offs:
- ✅ Reuse existing Redis (Memorystore, ElastiCache, in-cluster Redis)
- ✅ Often cheaper to run than NATS at small scale
- ⚠️ No first-class chart support: cluster.mode must be disabled, envRaw is the escape hatch
- ⚠️ Signal pub/sub via Redis is fan-out (no JetStream-equivalent ordering guarantees), which is fine for openzro's signaling pattern
- ⚠️ KV locks via Redis use SET NX EX (TTL-based), no Raft consensus
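To illustrate what a TTL-based lock means in practice, here is a minimal sketch of the SET NX EX pattern. It runs against a small in-memory stand-in rather than a real Redis client, so it is self-contained; the class, the helper functions and the key name are all hypothetical and are not openzro's implementation.

```python
import time
import uuid


class InMemoryKV:
    """Tiny stand-in for the Redis commands used below (not a Redis client)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def _live(self, key):
        entry = self._data.get(key)
        if entry is not None and entry[1] <= self._clock():
            del self._data[key]  # expired: behave as if the key never existed
            return None
        return entry

    def set_nx_ex(self, key, value, ttl_seconds):
        """Equivalent of `SET key value NX EX ttl`: succeeds only if key is absent."""
        if self._live(key) is not None:
            return False
        self._data[key] = (value, self._clock() + ttl_seconds)
        return True

    def get(self, key):
        entry = self._live(key)
        return entry[0] if entry is not None else None

    def delete(self, key):
        self._data.pop(key, None)


def acquire(kv, lock_key, ttl_seconds):
    """Try to take the lock; return an owner token on success, else None."""
    token = uuid.uuid4().hex
    return token if kv.set_nx_ex(lock_key, token, ttl_seconds) else None


def release(kv, lock_key, token):
    """Release only if we still own the lock (real Redis needs this to be atomic)."""
    if kv.get(lock_key) == token:
        kv.delete(lock_key)
        return True
    return False


now = [0.0]                                    # fake clock so the demo is deterministic
kv = InMemoryKV(clock=lambda: now[0])
first = acquire(kv, "demo:lock", ttl_seconds=30)
second = acquire(kv, "demo:lock", ttl_seconds=30)   # lock is held, so this fails
now[0] += 31                                   # the first holder stalls past the TTL
third = acquire(kv, "demo:lock", ttl_seconds=30)    # lock expired, a new owner gets it
stale_release = release(kv, "demo:lock", first)     # the stalled holder no longer owns it
```

The last two steps show the trade-off named above: safety rests on the TTL and on holders finishing (or renewing) in time, not on a consensus round among replicas.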
Track openzro/helms#X
for the proposal to add cluster.mode: external-redis as a first-class
option in a future alpha.
Relay multi-pod fabric (ADR-0014)
The relay has its own multi-pod fabric, independent of
cluster.mode. At relay.replicaCount > 1 the chart auto-wires:
- A Headless Service (<release>-relay-internal) that resolves to every relay pod's IP, used for inter-pod discovery
- A second container port (relay.cluster.port, default 7090) for inter-pod TCP traffic
- Downward API env vars (POD_IP, POD_NAME)
- An HMAC-SHA256 secret (auto-generated on first install, preserved across upgrades) that authenticates inter-pod HELLO frames
relay:
  replicaCount: 3
  cluster:
    enabled: true # null (default) = auto when replicaCount > 1
    port: 7090
    authSecret:
      value: "" # leave empty for chart auto-gen
      # value: "your-32-char-secret" # OR pin a literal
      # existingSecret: "my-relay-secret" # OR point at your own
      # existingSecretKey: "authSecret" # ...with this key name
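For readers unfamiliar with HMAC gating, the sketch below shows the general idea of authenticating a HELLO with a shared secret. The payload layout (pod name plus a nonce) and the function names are assumptions made for illustration; the relay's actual frame format is not described here.

```python
import hashlib
import hmac


def sign_hello(secret: bytes, pod_name: str, nonce: bytes) -> bytes:
    """HMAC-SHA256 tag over a hypothetical HELLO payload: pod name, NUL, nonce."""
    payload = pod_name.encode("utf-8") + b"\x00" + nonce
    return hmac.new(secret, payload, hashlib.sha256).digest()


def verify_hello(secret: bytes, pod_name: str, nonce: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = sign_hello(secret, pod_name, nonce)
    return hmac.compare_digest(expected, tag)


shared_secret = b"0123456789abcdef0123456789abcdef"   # placeholder 32-byte secret
nonce = bytes(range(16))                              # fixed here; random in practice
tag = sign_hello(shared_secret, "openzro-relay-1", nonce)

accepted = verify_hello(shared_secret, "openzro-relay-1", nonce, tag)
rejected = verify_hello(b"some-other-secret", "openzro-relay-1", nonce, tag)
print(accepted, rejected)
```

A pod holding a different secret computes a different tag, so its HELLO is rejected regardless of which network path it arrived on.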
What's actually shared between relay pods
Each pod owns its connected peers — there's no shared state between pods. What flows across the inter-pod fabric is:
- A small routing oracle populated lazily on cache miss (peer ID → pod address, 5min TTL)
- The forwarded packets themselves when peers on different pods need to communicate
No conntrack replication, no per-flow state sync. When a relay pod dies, peers reconnect (sub-second WireGuard handshake) and the load balancer may land them on a different pod — the locator caches in the surviving pods invalidate lazily on the next failed forward.
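The lazy-populate and lazy-invalidate behaviour can be sketched as a small TTL cache. This is an illustration of the pattern only: the class, the who_has callback and the send function are hypothetical stand-ins, and the addresses are made up.

```python
import time


class PeerLocator:
    """Peer ID -> pod address cache, filled on miss and expired after a TTL."""

    def __init__(self, who_has, ttl_seconds=300.0, clock=time.monotonic):
        self._who_has = who_has      # asks the other pods; returns an address or None
        self._ttl = ttl_seconds      # 5 minutes, matching the TTL quoted above
        self._clock = clock
        self._cache = {}             # peer_id -> (pod_address, expires_at)

    def lookup(self, peer_id):
        entry = self._cache.get(peer_id)
        if entry is not None and entry[1] > self._clock():
            return entry[0]                      # cache hit, no inter-pod traffic
        address = self._who_has(peer_id)         # cache miss: ask the fabric
        if address is None:
            self._cache.pop(peer_id, None)
            return None
        self._cache[peer_id] = (address, self._clock() + self._ttl)
        return address

    def invalidate(self, peer_id):
        self._cache.pop(peer_id, None)


def forward(locator, send, peer_id, packet):
    """Forward one packet; drop the cache entry if the owning pod is unreachable."""
    address = locator.lookup(peer_id)
    if address is None:
        return False
    if send(address, packet):
        return True
    locator.invalidate(peer_id)   # stale entry: the next packet triggers a new lookup
    return False


owners = {"peer-a": "10.0.0.11:7090"}
alive = {"10.0.0.11:7090"}
calls = []


def who_has(peer_id):
    calls.append(peer_id)
    return owners.get(peer_id)


def send(address, packet):
    return address in alive


locator = PeerLocator(who_has)
r1 = forward(locator, send, "peer-a", b"x")   # miss, then delivered
r2 = forward(locator, send, "peer-a", b"x")   # hit, delivered
alive.clear()
alive.add("10.0.0.12:7090")                   # old pod dies, peer reconnects elsewhere
owners["peer-a"] = "10.0.0.12:7090"
r3 = forward(locator, send, "peer-a", b"x")   # stale hit fails, entry invalidated
r4 = forward(locator, send, "peer-a", b"x")   # fresh lookup finds the new pod
```

Only one forward fails after the pod dies; no pod has to be told about the failure in advance.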
NetworkPolicy guidance
Operators with strict pod-to-pod NetworkPolicy must allow TCP/7090
between pods labeled app.kubernetes.io/name: openzro-relay. The
HMAC gate authenticates HELLO either way — NetworkPolicy is
defense-in-depth, not the primary trust boundary.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: relay-internal
  namespace: openzro
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: openzro-relay
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: openzro-relay
      ports:
        - protocol: TCP
          port: 7090
Signal HA — Ingress vs LoadBalancer
Signal has two exposure paths, with significant operational implications:
| Path | When to use | Status |
|---|---|---|
| signal.ingress | Envoy / Traefik / NGINX-Plus controllers | OK |
| signal.tls + LoadBalancer | ingress-nginx (community) | Recommended |
Why the LoadBalancer path
ingress-nginx#5366
documents a long-standing bug: gRPC server-streaming initial
metadata is not flushed until the first response body chunk.
Signal's ConnectStream sends only the x-wiretrustee-peer-registered
header on the metadata path — no body until peers start exchanging
candidates — so peer clients hang on registration and time out at
60s.
The community ingress-nginx project was archived in March 2026 with no fix planned. Operators on community ingress-nginx must bypass it for signal:
signal:
  publicHostname: signal.example.com # DNS for peer client
  containerPort: 443
  service:
    type: LoadBalancer
    port: 443
  tls:
    enabled: true
    certManager:
      enabled: true
      issuerRef:
        kind: ClusterIssuer
        name: letsencrypt-prod
  ingress:
    enabled: false # bypass nginx-ingress
Same pattern as the relay. Cert-manager renews via DNS-01 (or whatever
challenge your ClusterIssuer uses). DNS for signal.publicHostname
must point at the Service LoadBalancer's external IP (allocated by
the cloud provider after first sync).
Why keep the Ingress option
Operators on alternative controllers don't need this workaround:
- Envoy Gateway — handles HTTP/2 trailers and streaming metadata per the spec
- Traefik — same
- NGINX-Plus (F5 commercial fork) — fixed independently in 2023
For those operators, signal.ingress.enabled: true is the simpler
path.
Management gRPC — bidi streaming and the FlowService trap
The signal story above concerns a server-streaming gRPC RPC: the
first thing a peer waits for is the x-wiretrustee-peer-registered
header in the initial response metadata. The same ingress-nginx bug
affects bidi-streaming RPCs even more aggressively. The
canonical example in the chart is flow.FlowService.Events, the
RPC peers use to push per-flow telemetry to the management when
Settings → Network Traffic Logs is enabled.
The shape that breaks:
Peer (gRPC client)                     Management (gRPC server)
──────────────────                     ─────────────────────
open Events stream  ──────────────►    Events handler runs
call stream.Header()                   stream.Recv() ──┐
(blocks on initial                                     │ blocks until
 response metadata)                                    │ first peer event
                                                       │
         ↑                                             ▼
ingress-nginx upstream buffer holds    (no DATA frame
the response HEADERS frame waiting      emitted yet)
for a DATA frame to come along
Result: peers hang in WaitToBeOnlineAndSubscribe indefinitely
and log the now-famous
flow receiver sent no headers
receive failed: check header: should have headers
Confirmed by tcpdump on the management pod (HEADERS frame is sent
by the binary but never lands at the gRPC client) and by the
control test of pointing Flow.URI at the in-cluster
openzro-management-grpc.openzro.svc.cluster.local:33073
(plain HTTP/2 without ingress-nginx in the path) — events flow
end-to-end as expected.
proxy-buffering: off doesn't fix it. This was verified on the production
admin cluster: the annotation lands on the Ingress and is reflected
in the rendered nginx config, but the failure mode persists. The
upstream request body buffer isn't the only buffer that matters;
HTTP/2 frame-level batching at the upstream connection still
clobbers the HEADERS frame ordering.
The fix: management.grpcProxy
Same shape as signal.tls and relay.tls: a dedicated nginx
Deployment in front of the management gRPC port, with
proxy_buffering off + proxy_request_buffering off set
locally (the cluster ingress-nginx can't safely flip these
globally without breaking other workloads). A LoadBalancer Service
sits in front of the proxy, and the proxy terminates TLS itself via
cert-manager. Peers route there for FlowService.Events, while
zt.example.com (the main host) keeps serving Login / Sync /
dashboard / Dex / /api through the cluster ingress unchanged.
management:
  grpcProxy:
    enabled: true
    publicHostname: mgmt-grpc.example.com # new DNS, peer-facing
    replicas: 2
    service:
      type: LoadBalancer
      port: 443
    tls:
      certManager:
        enabled: true
        issuerRef:
          kind: ClusterIssuer
          name: letsencrypt-prod
After helm upgrade:
- Get the LB IP allocated by the cloud:
  kubectl -n openzro get svc <release>-openzro-management-grpc-proxy
- Create the DNS A record mgmt-grpc.example.com → <IP>.
- Wait for cert-manager to issue the certificate (Ready: True on the Certificate CRD).
- The chart auto-derives management.config.flow.uri to mgmt-grpc.example.com:443 and embeds it in management.json.
- Peers receive the new Flow.URI via their next Sync update and open the FlowService stream against the proxy.
Why peers don't need to be reconfigured: the management hands the
Flow URL down to peers in every Sync response (FlowConfig.URL),
and the peer's openzro client uses it for the flow gRPC dial.
There's no per-peer config of --flow-url. Operators changing
the Flow URI just redeploy the chart; peers pick it up
automatically on their next Sync.
Why not enable management.tls directly on the binary?
The chart does support TLS on the management binary itself
(management.tls.enabled: true, management.publicGrpcHostname),
mirroring the relay/signal pattern even more closely. Most
operators should still pick grpcProxy instead because:
- Single shared Ingress. The bundled ingress.openzro Ingress serves /, /dex and (optionally) /api on the same hostname. ingress-nginx applies backend-protocol per-Ingress, not per-path, so flipping the management binary to TLS would require setting backend-protocol: HTTPS on the bundled Ingress, breaking dashboard and Dex (which don't terminate TLS at their backends).
- Backwards compatibility. Existing peers stay registered with --management-url=https://example.com. They reach Login/Sync through the cluster ingress as today; only the flow data path moves out.
- Simpler debugging. A standalone nginx pod with one location block is trivial to kubectl exec into and inspect. The same is harder when management is doing TLS termination + content negotiation + gRPC + HTTP API on the same listener.
management.tls is documented as the alternative for operators
who run with split Ingresses (one Ingress per service) and don't
care about the dashboard footgun. Most should not pick it.
Why this exists at all
The trade-off is deliberate. The cluster ingress-nginx serves
many other workloads with their own buffering / streaming
characteristics. Flipping proxy-buffering: off cluster-wide
hurts every HTTP/1.1 webhook / REST API in the cluster that
benefits from buffered responses. The dedicated proxy serves
one route, so it can be tuned aggressively without
collateral damage.
Alternatives the chart does not bundle
If your cluster already runs one of these, point peers at it
instead and skip grpcProxy.enabled: true entirely:
| Alternative | Notes |
|---|---|
| HAProxy ingress controller | gRPC streaming works out of the box. Common pattern in shops that already use HAProxy. |
| Envoy Gateway | Modern Gateway API + GRPCRoute. Cleanest if you're greenfield. |
| Contour (Envoy-based) | Same Envoy lineage, packaged for VMware Tanzu / Bitnami stacks. |
| GKE Gateway API + HTTPS LB | Native to GCP. Works well for GKE-only deploys; ties you to Google. |
The grpcProxy path is the lowest-friction option for operators
already running ingress-nginx and not looking to introduce
another controller for one route.
Operations — verifying HA
Cluster broker is up
Embedded NATS:
kubectl logs -n openzro openzro-management-0 \
| grep "JetStream cluster new metadata leader"
# Expected: JetStream cluster new metadata leader: openzro-management-1/openzro
External NATS via subchart:
kubectl logs -n openzro <release>-nats-0 | tail
External Redis (mode 3c):
kubectl exec -n openzro openzro-management-0 -- env | grep OPENZRO_REDIS_URL
# Expected: OPENZRO_REDIS_URL=redis://...
Relay fabric is healthy
kubectl logs -n openzro -l app.kubernetes.io/name=openzro-relay \
| grep "cluster discovery"
# Expected: cluster discovery: openzro-relay-internal reconciled — +N peers
Prometheus metrics on the cluster fabric (when
metrics.serviceMonitor.enabled: true):
| Metric | Type | Healthy steady-state |
|---|---|---|
| relay_cluster_forwards_total{result} | counter | OK rate dominates |
| relay_cluster_lookup_duration_seconds | histogram | p99 < 100ms |
| relay_cluster_hello_rejects_total{reason} | counter | ~0 (any non-zero is a security signal) |
| relay_cluster_streams | gauge | matches peer count |
A sustained hmac_mismatch rate across the fleet means a pod has
the wrong secret — check the auth secret rotation.
The multi-pod relay fabric scales linearly to ~30 pods under the
current broadcast-on-miss design. Beyond that, the WHO_HAS
broadcast amplifies and lookup latency starts to climb.
Monitoring relay_cluster_lookup_duration_seconds p99 lets you
trigger optimization work before scale becomes a real problem.
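A rough back-of-the-envelope model shows why broadcast-on-miss gets expensive as the fleet grows. It assumes every cache miss sends one WHO_HAS to each other pod and that each pod sees the same miss rate; the miss rate used below is an arbitrary illustrative number, not a measurement.

```python
def who_has_messages_per_second(pods: int, misses_per_pod_per_second: float) -> float:
    """Fleet-wide WHO_HAS messages when each miss is broadcast to all other pods."""
    return pods * misses_per_pod_per_second * (pods - 1)


def who_has_received_per_pod(pods: int, misses_per_pod_per_second: float) -> float:
    """WHO_HAS messages each individual pod has to answer per second."""
    return misses_per_pod_per_second * (pods - 1)


for pods in (3, 10, 30, 60):
    total = who_has_messages_per_second(pods, misses_per_pod_per_second=5.0)
    per_pod = who_has_received_per_pod(pods, misses_per_pod_per_second=5.0)
    print(f"{pods} pods: {total:.0f} msgs/s fleet-wide, {per_pod:.0f} msgs/s per pod")
```

Under these assumptions the per-pod load grows linearly with the pod count while the fleet-wide total grows quadratically, which is consistent with lookup latency being the first metric to degrade.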
Storage HA
Independent of the cluster broker, both management and Dex need a shared SQL backend for HA. SQLite is single-node only; see Storage Backends for the PostgreSQL / MySQL / external DB patterns.