Flow Archive — Google Cloud Storage (native)

The S3-style flow archive sink already covers GCS via its Interoperability mode — you mint an HMAC key in the GCP Console, point the S3 endpoint at https://storage.googleapis.com, and it works. This page is about the native GCS sink: the one that uses Google's own SDK and authenticates the way GCP itself recommends.

When to pick native over S3-Interop

| Concern | S3 Interop | GCS Native |
|---|---|---|
| Auth mechanisms | HMAC keys only | Service Account JSON, Workload Identity, ADC |
| Workload Identity (GKE) | not supported | yes (IAM bound to the workload, no creds in the pod) |
| Customer-Managed Encryption Keys | no (header path the S3 SDK doesn't pass) | yes |
| Resumable uploads with lifecycle | partial | full |
| Cost / latency | identical | identical (same backend) |

Self-hosted deployments outside GCP usually want S3 Interop: it is one HMAC key to provision. Self-hosted deployments inside GCP (on GKE, GCE, Cloud Run) should always pick native: drop the credential file from the pod, bind IAM to the workload identity, and let ADC do the rest.

Configuration

Via the dashboard

Settings → Integrations → Flow Exports → Add destination → type Google Cloud Storage native (cold archive).

| Field | Notes |
|---|---|
| Bucket | Required. The destination bucket name (no gs:// prefix). |
| Prefix | Optional. Prepended to every object key. |
| Project ID | Optional. Informational; the bucket pins the project on Google's side. |
| Authentication | One of: ADC (default), Service Account JSON file, inline JSON. |

For GKE / GCE / Cloud Run, leave the auth mode at Application Default Credentials and ensure the workload's service account has roles/storage.objectCreator on the bucket. Outside GCP, mount a Service Account JSON and pick file path or paste the JSON inline.

Via env vars (boot-time baseline)

For self-host operators who manage configuration in env files:

export OPENZRO_FLOW_ARCHIVE_GCS_BUCKET="openzro-flow-archive"
export OPENZRO_FLOW_ARCHIVE_GCS_PREFIX="prod"

# pick ONE of these:
# (a) ADC — recommended on GKE / GCE / Cloud Run
#     no env var needed, the SDK reads the runtime credentials.

# (b) Service account JSON file
export OPENZRO_FLOW_ARCHIVE_GCS_CREDENTIALS_FILE="/etc/openzro/sa.json"

# (c) Service account JSON inline (constrained runners)
export OPENZRO_FLOW_ARCHIVE_GCS_CREDENTIALS_JSON='{"type":"service_account",...}'

# optional knobs
export OPENZRO_FLOW_ARCHIVE_GCS_PROJECT_ID="my-gcp-project"
export OPENZRO_FLOW_ARCHIVE_GCS_FLUSH_INTERVAL="15m"
export OPENZRO_FLOW_ARCHIVE_GCS_MAX_EVENTS_PER_FILE="100000"
export OPENZRO_FLOW_ARCHIVE_GCS_BUFFER_SIZE="50000"

The management server logs

flow archive enabled: GCS bucket "openzro-flow-archive" (auth=adc)

at boot when configured.
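
Before restarting the server with a new env file, a small preflight check can catch the two easiest mistakes: a missing bucket and more than one credential source. The sketch below is not part of openZro. The function name and the returned labels "file" and "json" are hypothetical (only "adc" appears in the boot log above), and it assumes the "no gs:// prefix" rule from the dashboard field also applies to the env var.

```python
import os

ENV_PREFIX = "OPENZRO_FLOW_ARCHIVE_GCS_"


def check_gcs_env(env=None):
    """Validate the documented env vars and return the auth mode in use.

    Hypothetical helper: returns "file", "json", or "adc". Raises
    ValueError when the bucket is missing or both credential vars are set.
    """
    env = os.environ if env is None else env

    bucket = env.get(ENV_PREFIX + "BUCKET", "")
    if not bucket:
        raise ValueError(ENV_PREFIX + "BUCKET is not set")
    # Assumption: the env var follows the same rule as the dashboard field.
    if bucket.startswith("gs://"):
        raise ValueError("bucket must be a bare name, without gs://")

    has_file = bool(env.get(ENV_PREFIX + "CREDENTIALS_FILE"))
    has_json = bool(env.get(ENV_PREFIX + "CREDENTIALS_JSON"))
    if has_file and has_json:
        raise ValueError("set only one of CREDENTIALS_FILE / CREDENTIALS_JSON")

    if has_file:
        return "file"
    if has_json:
        return "json"
    return "adc"  # neither credential var set: the SDK uses runtime credentials
```

Calling `check_gcs_env()` with no argument inspects the current process environment; passing a dict makes it easy to check a parsed env file instead.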

Object layout

Identical to the S3 sink — that's deliberate. An operator can move between the two (or run both side-by-side) without changing downstream tooling.

<prefix>/year=2026/month=04/day=26/account=<account-id>/<flush-id>.ndjson.gz
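
For downstream tooling that needs to predict or match keys, the layout above can be reproduced with a few lines of Python. This is an illustrative sketch, not the sink's code: the function name and the `flush_id` parameter are hypothetical, and it assumes the prefix is joined without a leading slash, as the gs:// URIs elsewhere on this page suggest.

```python
from datetime import date


def archive_object_key(prefix, day, account_id, flush_id):
    """Build a Hive-style key: <prefix>/year=/month=/day=/account=/<id>.ndjson.gz."""
    parts = [
        f"year={day.year:04d}",
        f"month={day.month:02d}",  # zero-padded, as in month=04
        f"day={day.day:02d}",
        f"account={account_id}",
        f"{flush_id}.ndjson.gz",
    ]
    clean_prefix = prefix.strip("/")
    if clean_prefix:  # the prefix is optional
        parts.insert(0, clean_prefix)
    return "/".join(parts)


print(archive_object_key("prod", date(2026, 4, 26), "acct-1", "example-id"))
# prod/year=2026/month=04/day=26/account=acct-1/example-id.ndjson.gz
```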

Each object is gzip-compressed NDJSON: one JSON event per line. BigQuery, DuckDB, ClickHouse, and Athena all read this format natively. For BigQuery specifically:

-- External table over the archive (run once)
CREATE EXTERNAL TABLE openzro_flow.archive
WITH PARTITION COLUMNS (year INT64, month INT64, day INT64, account STRING)
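OPTIONS (
  format = 'JSON',
  compression = 'GZIP',  -- objects are gzip-compressed NDJSON
  -- BigQuery accepts one wildcard per URI; partition keys are read from the path.
  uris = ['gs://openzro-flow-archive/prod/*'],
  hive_partition_uri_prefix = 'gs://openzro-flow-archive/prod/'
);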

-- Sample query
SELECT count(*)
FROM openzro_flow.archive
WHERE year = 2026 AND month = 4 AND account = 'acct-1' AND type = 'drop';

IAM minimum viable role

For the workload identity (GKE) or service account (file/JSON):

# Bucket-level binding, not project-level. The sink only writes;
# never reads or deletes.
roles:
  - roles/storage.objectCreator

Do not grant roles/storage.admin — the sink does not need it and the principle of least privilege says no.

How the data gets to the bucket

This is not a synchronous write per event. Understanding the buffering / flush sequence saves a lot of "the bucket is empty, something's broken" support tickets:

  1. Per-event ingest. A peer's openzro client streams a flow.FlowEvent to the management server's FlowService.Events bidi-stream. The Events handler enqueues the event onto an in-memory channel (default capacity 10 000) and immediately acks the peer. There is no per-event blocking write to GCS.

  2. Worker tick. A background goroutine (runWorker) ticks every FLUSH_INTERVAL (default 5 seconds) and also flushes whenever the in-memory batch reaches BATCH_SIZE (default 500 events) — whichever comes first.

  3. Per-sink fan-out. When a flush fires, the management server resolves peer identity for every event in the batch (cached, sub-millisecond) and hands the resolved batch to every configured sink in parallel. Sinks include the hot store (flow_events Postgres table), any streaming exporter (Datadog, Elastic, HTTP), and the cold archive (S3 or GCS).

  4. Cold archive object write. The GCS sink groups the batch by (account_id, day) and writes one object per group. Object keys follow the year=…/month=…/day=…/account=…/<flush-id>.ndjson.gz layout described above. Each object is gzipped NDJSON, streamed via the Google SDK with resumable uploads on by default.
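
Step 4 can be pictured with the following sketch, which groups a batch by (account, day) and gzips each group as NDJSON. It is a simplified model rather than the sink's implementation: the event field names `account_id` and `timestamp` (Unix seconds) are assumptions, as is the choice to bucket days in UTC, and the real sink streams the bytes to GCS instead of returning them.

```python
import gzip
import json
from collections import defaultdict
from datetime import datetime, timezone


def group_and_encode(events):
    """Return {(account_id, day): gzipped NDJSON bytes} for one flushed batch."""
    groups = defaultdict(list)
    for event in events:
        # Assumption: "timestamp" is Unix seconds and days are cut in UTC.
        day = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc).date()
        groups[(event["account_id"], day)].append(event)

    encoded = {}
    for key, group in groups.items():
        # NDJSON: one JSON document per line, newline-terminated.
        ndjson = "".join(json.dumps(ev, sort_keys=True) + "\n" for ev in group)
        encoded[key] = gzip.compress(ndjson.encode("utf-8"))
    return encoded
```

Each dictionary entry corresponds to one object in the bucket, so a batch that spans two accounts on the same day produces two objects.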

The first object lands in the bucket as soon as BATCH_SIZE events (default 500) have accumulated or the FLUSH_INTERVAL ticker (default 5 seconds) fires after the first event, whichever comes first.
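
The "whichever comes first" rule is easier to see in code. The class below is a minimal model of the two flush triggers, written for illustration only: the class and method names are hypothetical, and the caller is assumed to invoke `on_tick()` once per FLUSH_INTERVAL.

```python
class FlushBatcher:
    """Model of a batch that flushes on size or on a periodic tick."""

    def __init__(self, batch_size=500):
        self.batch_size = batch_size
        self.pending = []

    def add(self, event):
        """Queue one event; return the batch if the size trigger fired."""
        self.pending.append(event)
        if len(self.pending) >= self.batch_size:
            return self._drain()
        return None

    def on_tick(self):
        """Ticker callback; flush whatever is pending, or nothing if empty."""
        return self._drain() if self.pending else None

    def _drain(self):
        batch, self.pending = self.pending, []
        return batch
```

At low event rates `add()` never reaches the size threshold, so every flush comes from `on_tick()`; under load the size trigger fires well before the ticker does.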

Verifying the archive

# List the ten objects that sort last by key (replace placeholders).
# Keys begin with the date partition, so these come from the most recent day.
gcloud storage ls \
  --project="$PROJECT_ID" \
  "gs://openzro-flow-archive/prod/**" | tail -n 10

# Download one and inspect.
gcloud storage cp \
  gs://openzro-flow-archive/prod/year=2026/month=04/day=26/account=acct-1/<id>.ndjson.gz \
  /tmp/sample.ndjson.gz
zcat /tmp/sample.ndjson.gz | head -3 | jq .
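
Where jq is not available, the same inspection can be done with the Python standard library. This is a small sketch; the function name is hypothetical and it makes no assumption about the fields inside each event.

```python
import gzip
import json


def read_ndjson_gz(path, limit=3):
    """Return up to `limit` decoded events from a gzipped NDJSON file."""
    events = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if len(events) >= limit:
                break
            if line.strip():  # tolerate blank lines
                events.append(json.loads(line))
    return events


if __name__ == "__main__":
    import sys

    # Usage: python inspect_archive.py /tmp/sample.ndjson.gz
    if len(sys.argv) > 1:
        for event in read_ndjson_gz(sys.argv[1]):
            print(json.dumps(event, indent=2))
```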

"The bucket is empty"

Walk the path top-down before assuming the sink is broken:

  1. Are events even reaching the management server? Check that /api/network-traffic-events?limit=10 returns rows. If it is empty, the problem is upstream of any sink: peers aren't delivering. The most common cause is the bidi-streaming bug with cluster ingress-nginx; peers see "flow receiver sent no headers" in their logs and never push events. The fix is management.grpcProxy.enabled on Kubernetes, or a non-buggy ingress controller.
  2. Is the GCS sink loaded? kubectl logs … | grep "flow sink GCS:" on the management pod should show one flow sink GCS: bucket=… prefix=… line at boot per replica. If not, the flow_exports row didn't land or the bucket field is empty — re-save the destination from the dashboard.
  3. Is IAM right? A wrong binding produces flow sink GCS: write failed: googleapi: Error 403 once per batch. Fix the binding (roles/storage.objectCreator on the bucket, not the project) and a future flush will succeed.
  4. Time window. If you only just enabled flow logging, give it a few minutes — at low event rates the first flush waits for the 5 s ticker to fire after the first event arrives.

Trade-offs

  • Best-effort. A failed flush logs loudly and drops the batch. Pair the cold archive with a streaming destination (Datadog, Elastic) when durability of every record matters.
  • No CMEK by default. GCS Native supports it; the openZro config does not yet expose the key name. Open an issue if you need it — it is a one-line config addition.
  • No retention lifecycle managed by openZro. Configure bucket-level lifecycle policies (Coldline / Archive transitions, age-based delete) directly in GCS.