High Availability on Kubernetes

The chart's defaults run one replica of each component — fine for small teams and labs. Three HA paths are wired in:

  • Management + signal share a broker mode toggle (cluster.mode) for distributed coordination
  • Relay has its own multi-pod fabric independent of the broker mode (per ADR-0014)
  • Signal terminates TLS on the binary itself when running under ingress-nginx (community), bypassing a known streaming bug

Cluster broker — the three modes

The cluster.mode switch controls how management and signal pods coordinate state across replicas. Three options:

cluster.mode: disabled   ← single replica, no broker (default)
cluster.mode: embedded   ← NATS+JetStream embedded in each pod
cluster.mode: external   ← bring-your-own NATS or Redis

Mode 1 — disabled (lab / dev)

No broker. Single replica per component. Smallest footprint.

cluster:
  mode: disabled

management:
  replicaCount: 1
signal:
  replicaCount: 1

Everything talks to itself in-process. Fine for labs, single-user mesh. Not recommended once you have more than a handful of peers.

Mode 2 — embedded

Each pod runs an embedded NATS+JetStream cluster. The chart switches management and signal to StatefulSet, renders a Headless Service for stable DNS, and wires NATS routes via OPENZRO_CLUSTER_PEERS (FQDN-resolved to handle GKE VPC-native search domains correctly).

cluster:
  mode: embedded
  embedded:
    clientPort: 4222
    clusterPort: 6222
    jetstream:
      storage: memory   # locks bucket only — no PVC needed

management:
  replicaCount: 3   # 3 is JetStream meta-leader quorum minimum
signal:
  replicaCount: 3
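
To make the "stable DNS" point concrete, here is a minimal sketch of how per-pod FQDNs follow from a StatefulSet plus a Headless Service. The `<pod>.<service>.<namespace>.svc.<cluster-domain>` shape is standard Kubernetes behaviour; the resource names and the comma-separated `nats://host:port` format for OPENZRO_CLUSTER_PEERS are assumptions made for illustration, not taken from the chart.

```python
from typing import List


def statefulset_pod_fqdns(
    statefulset: str,
    headless_service: str,
    namespace: str,
    replicas: int,
    cluster_domain: str = "cluster.local",
) -> List[str]:
    """Return the stable DNS name of every pod in a StatefulSet.

    StatefulSet pods are named <statefulset>-<ordinal>, and a Headless
    Service gives each of them its own DNS record.
    """
    return [
        f"{statefulset}-{i}.{headless_service}.{namespace}.svc.{cluster_domain}"
        for i in range(replicas)
    ]


def nats_routes(fqdns: List[str], cluster_port: int = 6222) -> str:
    """Join the FQDNs into a route list (ASSUMED format for the env var)."""
    return ",".join(f"nats://{host}:{cluster_port}" for host in fqdns)


if __name__ == "__main__":
    # Hypothetical resource names; check the rendered manifests for the real ones.
    peers = statefulset_pod_fqdns(
        "openzro-management", "openzro-management-headless", "openzro", 3
    )
    print(nats_routes(peers))
```

Using fully qualified names avoids any dependence on the pod's DNS search path, which is the concern the FQDN note above refers to.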

Trade-offs:

  • ✅ Zero external infra dependency
  • ✅ Auto-wired by chart, JetStream cluster forms automatically
  • ✅ Signal cross-pod dispatch via local NATS (OPENZRO_SIGNAL_DISPATCHER=nats)
  • ⚠️ JetStream cluster needs an odd number of replicas ≥ 3 for quorum (don't run with 2 replicas: quorum is then 2 of 2, so losing either pod leaves the cluster without a leader and there is no fault tolerance)
  • ⚠️ All replicas must boot within ~60s for first leader election (the management coordinator retries bucket creation with backoff)

This is the recommended mode for HA. Operators don't need to manage external NATS or Redis.
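
The replica-count warning in the trade-offs above comes down to majority arithmetic. The sketch below uses the generic Raft-style majority rule, not code from openzro or NATS, and shows why 2 replicas tolerate no failures and why an even count adds nothing over the odd count below it.

```python
def quorum(replicas: int) -> int:
    """Smallest majority of a consensus group with `replicas` voters."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many voters can be lost while a majority is still reachable."""
    return replicas - quorum(replicas)


if __name__ == "__main__":
    for n in range(1, 6):
        print(
            f"replicas={n} quorum={quorum(n)} "
            f"tolerated_failures={tolerated_failures(n)}"
        )
```

With 3 replicas one pod can be lost; 4 replicas still tolerate only one loss, which is why odd counts are preferred.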

Mode 3a — external with bundled NATS subchart

The chart bundles the upstream nats-io/nats subchart. Enable it to run a separate NATS deployment in the same release, decoupled from openzro pods:

cluster:
  mode: external
  # external.url left empty — chart auto-derives nats://<release>-nats:4222

nats:
  enabled: true
  config:
    cluster:
      enabled: true
      replicas: 3
    jetstream:
      enabled: true
      memoryStore:
        enabled: true
        maxSize: 256Mi
      fileStore:
        enabled: false   # change if you want persistent jetstream

management:
  replicaCount: 3
signal:
  replicaCount: 3

Trade-offs:

  • ✅ Decouples broker from openzro pods (independent scaling, restarts)
  • ✅ Standard NATS deployment with its own monitoring / observability
  • ✅ Same quorum semantics as embedded
  • ⚠️ More resources (separate pods)
  • ⚠️ Two charts to keep updated (NATS subchart + openzro chart)

Use this when you want the broker as a separate concern but don't have an existing NATS cluster.

Mode 3b — external with operator-managed NATS

When you already operate NATS at scale (shared with other workloads, managed cloud NATS, NATS Operator), point the chart at it:

cluster:
  mode: external
  external:
    url: nats://my-nats.shared.svc.cluster.local:4222
    # or with auth:
    # url: nats://user:password@nats.example.com:4222

nats:
  enabled: false   # don't deploy the subchart

management:
  replicaCount: 3
signal:
  replicaCount: 3

Trade-offs:

  • ✅ Reuses existing NATS infra (cost, monitoring, ops)
  • ✅ Allows cross-workload subject namespacing
  • ⚠️ openzro publishes under oz.cluster.* and oz.signal.* — ensure these subjects don't collide with other tenants
  • ⚠️ JetStream KV bucket openzro_locks is created on first start if absent (you can pre-create it for stricter access control)
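
When auditing a shared NATS cluster for subject collisions, it can help to test other tenants' subjects against the openzro prefixes. The sketch below implements the standard NATS wildcard rules (`*` matches exactly one token, `>` matches one or more trailing tokens). It assumes the `oz.cluster.*` and `oz.signal.*` patterns above are meant as NATS wildcards, and the tenant subjects in the usage are hypothetical.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a concrete NATS subject matches a wildcard pattern."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, token in enumerate(p_tokens):
        if token == ">":
            # '>' is only valid as the last token and needs at least
            # one remaining subject token to match.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if token != "*" and token != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


OPENZRO_PATTERNS = ("oz.cluster.*", "oz.signal.*")


def collides_with_openzro(subject: str) -> bool:
    """Check a tenant subject against the prefixes listed above."""
    return any(subject_matches(p, subject) for p in OPENZRO_PATTERNS)


if __name__ == "__main__":
    # Hypothetical subjects used by other workloads on the same cluster.
    for candidate in ("oz.cluster.locks", "orders.created", "oz.signal.peer42"):
        print(candidate, collides_with_openzro(candidate))
```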

Mode 3c — external with Redis

The openzro core supports Redis as an alternative coordinator backend (signal pub/sub + KV locks). The chart does not yet auto-wire Redis like it does NATS — operators set the env var manually and disable the chart's NATS plumbing.

# Disable the chart's cluster.mode flow — we'll wire Redis directly.
cluster:
  mode: disabled

management:
  replicaCount: 3
  envRaw:
    - name: OPENZRO_REDIS_URL
      value: "redis://my-redis.shared.svc.cluster.local:6379"
    # If using Sentinel / Redis Cluster:
    # - name: OPENZRO_REDIS_URL
    #   value: "redis://default:password@redis-sentinel:26379?master_name=mymaster"

signal:
  replicaCount: 3
  envRaw:
    - name: OPENZRO_REDIS_URL
      value: "redis://my-redis.shared.svc.cluster.local:6379"
    - name: OPENZRO_SIGNAL_DISPATCHER
      value: "redis"

The openzro factory at cluster/factory/factory.go reads OPENZRO_REDIS_URL directly — no chart configuration knob beyond the env var. The signal binary needs OPENZRO_SIGNAL_DISPATCHER=redis to use Redis pub/sub (vs the in-process default).

Trade-offs:

  • ✅ Reuse existing Redis (Memorystore, ElastiCache, in-cluster Redis)
  • ✅ Often cheaper to run than NATS at small scale
  • ⚠️ No first-class chart support — cluster.mode must be disabled, envRaw is the escape hatch
  • ⚠️ Signal pub/sub via Redis is fan-out (no JetStream-equivalent ordering guarantees) — fine for openzro's signaling pattern
  • ⚠️ KV locks via Redis use SET NX EX (TTL-based), no Raft consensus
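
To illustrate what "TTL-based, no Raft consensus" means in practice, here is a small in-memory model of the `SET key value NX EX seconds` pattern. It is a sketch of the general Redis idiom, not openzro's coordinator code; the key name and pod names are hypothetical, and the clock is passed in explicitly so the behaviour is deterministic.

```python
from typing import Dict, Optional, Tuple


class FakeRedis:
    """In-memory stand-in that models only SET key value NX EX seconds."""

    def __init__(self) -> None:
        # key -> (value, absolute expiry time in seconds)
        self._data: Dict[str, Tuple[str, float]] = {}

    def set_nx_ex(self, key: str, value: str, ttl_seconds: float, now: float) -> bool:
        """Set the key only if it is absent or expired; return True on success."""
        current = self._data.get(key)
        if current is not None and current[1] > now:
            return False  # someone else still holds the lock
        self._data[key] = (value, now + ttl_seconds)
        return True

    def holder(self, key: str, now: float) -> Optional[str]:
        """Return the current lock holder, or None if the lock has expired."""
        current = self._data.get(key)
        if current is None or current[1] <= now:
            return None
        return current[0]


if __name__ == "__main__":
    redis = FakeRedis()
    key = "openzro:lock:example"  # hypothetical key name
    print(redis.set_nx_ex(key, "management-0", 10, now=0.0))   # first writer wins
    print(redis.set_nx_ex(key, "management-1", 10, now=5.0))   # rejected while TTL is live
    print(redis.set_nx_ex(key, "management-1", 10, now=10.0))  # succeeds once the TTL expired
```

The lock changes hands purely because time passed: a holder that stalls longer than its TTL can lose the lock without noticing, which is the trade-off against a consensus-backed KV.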

Track openzro/helms#X for the proposal to add cluster.mode: external-redis as a first-class option in a future alpha.

Relay multi-pod fabric (ADR-0014)

The relay has its own multi-pod fabric, independent of cluster.mode. At relay.replicaCount > 1 the chart auto-wires:

  • A Headless Service (<release>-relay-internal) that resolves to every relay pod's IP — used for inter-pod discovery
  • A second container port (relay.cluster.port, default 7090) for inter-pod TCP traffic
  • Downward API env vars (POD_IP, POD_NAME)
  • An HMAC-SHA256 secret (auto-generated on first install, preserved across upgrades) that authenticates inter-pod HELLO frames

relay:
  replicaCount: 3
  cluster:
    enabled: true        # null (default) = auto when replicaCount > 1
    port: 7090
    authSecret:
      value: ""              # leave empty for chart auto-gen
      # value: "your-32-char-secret"        # OR pin a literal
      # existingSecret: "my-relay-secret"   # OR point at your own
      # existingSecretKey: "authSecret"     # ...with this key name
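
For readers unfamiliar with HMAC-gated handshakes, the sketch below shows the general technique: the sender tags a HELLO payload with HMAC-SHA256 over a shared secret, and the receiver recomputes the tag and compares it in constant time. The payload layout (pod name, pod IP, nonce) is an assumption for illustration only; the actual relay frame format is not described here.

```python
import hashlib
import hmac


def sign_hello(secret: bytes, pod_name: str, pod_ip: str, nonce: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over a HELLO payload (hypothetical layout)."""
    payload = pod_name.encode() + b"\x00" + pod_ip.encode() + b"\x00" + nonce
    return hmac.new(secret, payload, hashlib.sha256).digest()


def verify_hello(
    secret: bytes, pod_name: str, pod_ip: str, nonce: bytes, tag: bytes
) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = sign_hello(secret, pod_name, pod_ip, nonce)
    return hmac.compare_digest(expected, tag)


if __name__ == "__main__":
    shared = b"0123456789abcdef0123456789abcdef"  # stand-in for the chart secret
    tag = sign_hello(shared, "relay-0", "10.0.0.11", b"nonce-1")
    print(verify_hello(shared, "relay-0", "10.0.0.11", b"nonce-1", tag))
    print(verify_hello(b"wrong-secret", "relay-0", "10.0.0.11", b"nonce-1", tag))
```

A pod holding a different secret fails verification, which is the situation a sustained reject count on the fabric would point to.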

What's actually shared between relay pods

Each pod owns its connected peers — there's no shared state between pods. What flows across the inter-pod fabric is:

  • A small routing oracle populated lazily on cache miss (peer ID → pod address, 5min TTL)
  • The forwarded packets themselves when peers on different pods need to communicate

No conntrack replication, no per-flow state sync. When a relay pod dies, peers reconnect (sub-second WireGuard handshake) and the load balancer may land them on a different pod. The routing-oracle (locator) caches in the surviving pods invalidate lazily on the next failed forward.
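
The lazy cache behaviour described above can be sketched as a small TTL map: populate on miss, serve from cache until the TTL lapses, and drop an entry when a forward to it fails. This is a minimal model under stated assumptions, not the relay's implementation; the class name, the lookup callback and the addresses are hypothetical, and only the 5 minute TTL comes from the text.

```python
from typing import Callable, Dict, Optional, Tuple


class PeerLocator:
    """Peer ID -> pod address cache, filled lazily on a miss."""

    def __init__(
        self, lookup: Callable[[str], Optional[str]], ttl_seconds: float = 300.0
    ) -> None:
        self._lookup = lookup      # asks the other pods who owns the peer
        self._ttl = ttl_seconds    # 5 minutes, as stated above
        self._cache: Dict[str, Tuple[str, float]] = {}

    def locate(self, peer_id: str, now: float) -> Optional[str]:
        """Return the owning pod, consulting the fabric only on a miss."""
        entry = self._cache.get(peer_id)
        if entry is not None and entry[1] > now:
            return entry[0]
        pod = self._lookup(peer_id)
        if pod is None:
            self._cache.pop(peer_id, None)
            return None
        self._cache[peer_id] = (pod, now + self._ttl)
        return pod

    def forward_failed(self, peer_id: str) -> None:
        """Invalidate lazily: drop the entry after a failed forward."""
        self._cache.pop(peer_id, None)


if __name__ == "__main__":
    owners = {"peer-a": "10.0.0.1:7090"}
    locator = PeerLocator(owners.get)
    print(locator.locate("peer-a", now=0.0))
    owners["peer-a"] = "10.0.0.2:7090"   # peer reconnected to another pod
    print(locator.locate("peer-a", now=1.0))   # still the cached, stale address
    locator.forward_failed("peer-a")
    print(locator.locate("peer-a", now=2.0))   # fresh lookup after invalidation
```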

NetworkPolicy guidance

Operators with strict pod-to-pod NetworkPolicy must allow TCP/7090 between pods labeled app.kubernetes.io/name: openzro-relay. The HMAC gate authenticates HELLO either way — NetworkPolicy is defense-in-depth, not the primary trust boundary.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: relay-internal
  namespace: openzro
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: openzro-relay
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: openzro-relay
      ports:
        - protocol: TCP
          port: 7090

Signal HA — Ingress vs LoadBalancer

Signal has two exposure paths, with significant operational implications:

| Path | When to use | Status |
|---|---|---|
| signal.ingress | Envoy / Traefik / NGINX-Plus controllers | OK |
| signal.tls + LoadBalancer | ingress-nginx (community) | Recommended |

Why the LoadBalancer path

ingress-nginx#5366 documents a long-standing bug: gRPC server-streaming initial metadata is not flushed until the first response body chunk. Signal's ConnectStream sends only the x-wiretrustee-peer-registered header on the metadata path — no body until peers start exchanging candidates — so peer clients hang on registration and time out at 60s.

The community ingress-nginx project was archived in March 2026 with no fix planned. Operators on community ingress-nginx must bypass it for signal:

signal:
  publicHostname: signal.example.com   # DNS for peer client
  containerPort: 443
  service:
    type: LoadBalancer
    port: 443
  tls:
    enabled: true
    certManager:
      enabled: true
      issuerRef:
        kind: ClusterIssuer
        name: letsencrypt-prod
  ingress:
    enabled: false   # bypass ingress-nginx

Same pattern as the relay. Cert-manager renews via DNS-01 (or whatever challenge your ClusterIssuer uses). DNS for signal.publicHostname must point at the Service LoadBalancer's external IP (allocated by the cloud provider after first sync).

Why keep the Ingress option

Operators on alternative controllers don't need this workaround:

  • Envoy Gateway — handles HTTP/2 trailers and streaming metadata per the spec
  • Traefik — same
  • NGINX-Plus (F5 commercial fork) — fixed independently in 2023

For those operators, signal.ingress.enabled: true is the simpler path.

Management gRPC — bidi streaming and the FlowService trap

The signal story above involves a server-streaming gRPC RPC: the first thing a peer expects is the initial response metadata carrying x-wiretrustee-peer-registered. The same ingress-nginx bug affects bidi-streaming RPCs even more aggressively. The canonical example in the chart is flow.FlowService.Events, the RPC peers use to push per-flow telemetry to management when Settings → Network Traffic Logs is enabled.

The shape that breaks:

Peer (gRPC client)                    Management (gRPC server)
  ──────────────────                    ─────────────────────
  open Events stream  ──────────────►   Events handler runs
  call stream.Header()                  stream.Recv()  ──┐
  (blocks on initial                                     │ blocks until
   response metadata)                                    │ first peer event
                                                         │
  ↑                                                      ▼
  ingress-nginx upstream buffer holds                 (no DATA frame
  the response HEADERS frame waiting                  emitted yet)
  for a DATA frame to come along

Result: peers hang in WaitToBeOnlineAndSubscribe indefinitely and log the now-famous

flow receiver sent no headers
receive failed: check header: should have headers

Confirmed by tcpdump on the management pod (HEADERS frame is sent by the binary but never lands at the gRPC client) and by the control test of pointing Flow.URI at the in-cluster openzro-management-grpc.openzro.svc.cluster.local:33073 (plain HTTP/2 without ingress-nginx in the path) — events flow end-to-end as expected.

proxy-buffering: off doesn't fix it. Verified on the production admin cluster: the annotation lands on the Ingress and is reflected in the rendered nginx config, but the failure mode persists. The proxy response buffer isn't the only buffer that matters; HTTP/2 frame-level batching on the upstream connection still holds back the HEADERS frame.
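
The failure shape can be reduced to a toy model: a proxy that holds the leading HEADERS frame until some later frame arrives, compared with one that forwards frames immediately. This is only an illustration of the deadlock described above, not a model of nginx internals; the frame names are plain strings and both functions are hypothetical.

```python
from typing import Iterable, List


def passthrough_proxy(upstream_frames: Iterable[str]) -> List[str]:
    """Forward every frame to the client as soon as it arrives."""
    return list(upstream_frames)


def headers_held_proxy(upstream_frames: Iterable[str]) -> List[str]:
    """Hold a leading HEADERS frame until a later frame flushes it."""
    delivered: List[str] = []
    held: List[str] = []
    for index, frame in enumerate(upstream_frames):
        if index == 0 and frame == "HEADERS":
            held.append(frame)  # waits for something to follow
            continue
        delivered.extend(held)
        held.clear()
        delivered.append(frame)
    return delivered


if __name__ == "__main__":
    # The server has sent its response HEADERS and is now blocked in Recv(),
    # so no DATA frame follows yet.
    server_so_far = ["HEADERS"]
    print(passthrough_proxy(server_so_far))   # client sees the headers
    print(headers_held_proxy(server_so_far))  # client sees nothing and keeps waiting
```

Because the client will not send events until it has seen the headers, and the server emits no DATA until it has received an event, the held frame is never flushed.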

The fix: management.grpcProxy

Same shape as signal.tls and relay.tls: a dedicated nginx Deployment in front of the management gRPC port, with proxy_buffering off and proxy_request_buffering off set locally (the cluster ingress-nginx can't safely flip these globally without breaking other workloads). A LoadBalancer Service sits in front of the proxy, and the proxy terminates TLS itself with a cert-manager certificate. Peers route there for FlowService.Events, while zt.example.com (the main host) keeps serving Login / Sync / dashboard / Dex / /api through the cluster ingress unchanged.

management:
  grpcProxy:
    enabled: true
    publicHostname: mgmt-grpc.example.com   # new DNS, peer-facing
    replicas: 2
    service:
      type: LoadBalancer
      port: 443
    tls:
      certManager:
        enabled: true
        issuerRef:
          kind: ClusterIssuer
          name: letsencrypt-prod

After helm upgrade:

  1. Get the LB IP allocated by the cloud:
    kubectl -n openzro get svc <release>-openzro-management-grpc-proxy
    
  2. Create the DNS A record mgmt-grpc.example.com → <IP>.
  3. Wait for cert-manager to issue the certificate (Ready: True on the Certificate CRD).
  4. The chart auto-derives management.config.flow.uri to mgmt-grpc.example.com:443 and embeds it in management.json.
  5. Peers receive the new Flow.URI via their next Sync update and open the FlowService stream against the proxy.

Why peers don't need to be reconfigured: the management hands the Flow URL down to peers in every Sync response (FlowConfig.URL), and the peer's openzro client uses it for the flow gRPC dial. There's no per-peer config of --flow-url. Operators changing the Flow URI just redeploy the chart; peers pick it up automatically on their next Sync.

Why not enable management.tls directly on the binary?

The chart does support TLS on the management binary itself (management.tls.enabled: true, management.publicGrpcHostname), mirroring the relay/signal pattern even more closely. Most operators should still pick grpcProxy instead because:

  • Single shared Ingress. The bundled ingress.openzro Ingress serves /, /dex and (optionally) /api on the same hostname. ingress-nginx applies backend-protocol per-Ingress, not per-path — so flipping the management binary to TLS would require setting backend-protocol: HTTPS on the bundled Ingress, breaking dashboard and Dex (which don't terminate TLS at their backends).
  • Backwards compatibility. Existing peers stay registered with --management-url=https://example.com. They reach Login / Sync through the cluster ingress as today; only the flow data path moves out.
  • Simpler debugging. A standalone nginx pod with one location block is trivial to kubectl exec into and inspect. The same is harder when management is doing TLS termination + content negotiation + gRPC + HTTP API on the same listener.

management.tls is documented as the alternative for operators who run with split Ingresses (one Ingress per service) and don't care about the dashboard footgun. Most should not pick it.

Why this exists at all

The trade-off is deliberate. The cluster ingress-nginx serves many other workloads with their own buffering / streaming characteristics. Flipping proxy-buffering: off cluster-wide hurts every HTTP/1.1 webhook / REST API in the cluster that benefits from buffered responses. The dedicated proxy serves one route, so it can be tuned aggressively without collateral damage.

Alternatives the chart does not bundle

If your cluster already runs one of these, point peers at it instead and skip grpcProxy.enabled: true entirely:

| Alternative | Notes |
|---|---|
| HAProxy ingress controller | gRPC streaming works out of the box. Common pattern in shops that already use HAProxy. |
| Envoy Gateway | Modern Gateway API + GRPCRoute. Cleanest if you're greenfield. |
| Contour (Envoy-based) | Same Envoy lineage, packaged for VMware Tanzu / Bitnami stacks. |
| GKE Gateway API + HTTPS LB | Native to GCP. Works well for GKE-only deploys; ties you to Google. |

The grpcProxy path is the lowest-friction option for operators already running ingress-nginx and not looking to introduce another controller for one route.

Operations — verifying HA

Cluster broker is up

Embedded NATS:

kubectl logs -n openzro openzro-management-0 \
  | grep "JetStream cluster new metadata leader"
# Expected: JetStream cluster new metadata leader: openzro-management-1/openzro

External NATS via subchart:

kubectl logs -n openzro <release>-nats-0 | tail

External Redis (mode 3c):

kubectl exec -n openzro openzro-management-0 -- env | grep OPENZRO_REDIS_URL
# Expected: OPENZRO_REDIS_URL=redis://...

Relay fabric is healthy

kubectl logs -n openzro -l app.kubernetes.io/name=openzro-relay \
  | grep "cluster discovery"
# Expected: cluster discovery: openzro-relay-internal reconciled — +N peers

Prometheus metrics on the cluster fabric (when metrics.serviceMonitor.enabled: true):

| Metric | Type | Healthy steady-state |
|---|---|---|
| relay_cluster_forwards_total{result} | counter | OK rate dominates |
| relay_cluster_lookup_duration_seconds | histogram | p99 < 100ms |
| relay_cluster_hello_rejects_total{reason} | counter | ~0 (any non-zero is a security signal) |
| relay_cluster_streams | gauge | matches peer count |

A sustained hmac_mismatch rate across the fleet means a pod has the wrong secret — check the auth secret rotation.
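
The steady-state expectations above can be turned into a simple check once the values have been pulled from Prometheus. The sketch below is one possible reading of those thresholds: it assumes the healthy `result` label value is `ok` and interprets "dominates" as more than half of all forwards. Both are assumptions, as is the function name.

```python
from typing import Dict, List


def relay_fabric_warnings(
    forwards_by_result: Dict[str, float],
    lookup_p99_seconds: float,
    hello_rejects_by_reason: Dict[str, float],
) -> List[str]:
    """Return a list of warnings; an empty list means the fabric looks healthy."""
    warnings: List[str] = []

    total = sum(forwards_by_result.values())
    ok = forwards_by_result.get("ok", 0)  # ASSUMED label value
    if total > 0 and ok * 2 <= total:
        warnings.append("forwards: ok results do not dominate")

    if lookup_p99_seconds >= 0.1:  # p99 < 100ms per the table above
        warnings.append(f"lookup p99 too high: {lookup_p99_seconds:.3f}s")

    for reason, count in sorted(hello_rejects_by_reason.items()):
        if count > 0:  # any non-zero reject is a security signal
            warnings.append(f"hello rejects: {reason}={count}")

    return warnings


if __name__ == "__main__":
    print(relay_fabric_warnings({"ok": 990, "error": 10}, 0.02, {}))
    print(relay_fabric_warnings({"ok": 1, "error": 9}, 0.25, {"hmac_mismatch": 3}))
```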

Storage HA

Independent of the cluster broker, both management and Dex need a shared SQL backend for HA. SQLite is single-node only; see Storage Backends for the PostgreSQL / MySQL / external DB patterns.