Flow Archive — Google Cloud Storage (native)
The S3-style flow archive sink already covers GCS via its Interoperability mode: you mint an HMAC key in the GCP Console, point the S3 endpoint at https://storage.googleapis.com, and it works. This page is about the native GCS sink: the one that uses Google's own SDK and authenticates the way GCP itself recommends.
When to pick native over S3-Interop
| Concern | S3 Interop | GCS Native |
|---|---|---|
| Auth mechanisms | HMAC keys only | Service Account JSON, Workload Identity, ADC |
| Workload Identity (GKE) | not supported | yes — IAM bound to the workload, no creds in the pod |
| Customer-Managed Encryption Keys | no (header path the S3 SDK doesn't pass) | yes |
| Resumable uploads with lifecycle | partial | full |
| Cost / latency | identical | identical (same backend) |
Self-hosted deployments outside GCP usually want S3 Interop: it's one HMAC key to provision. Self-hosted deployments inside GCP (on GKE, GCE, Cloud Run) should always pick native: drop the credential file from the pod, bind IAM to the workload identity, and let ADC do the rest.
Configuration
Via the dashboard
Settings → Integrations → Flow Exports → Add destination
→ type Google Cloud Storage native (cold archive).
| Field | Notes |
|---|---|
| Bucket | Required. The destination bucket name (no gs:// prefix). |
| Prefix | Optional. Prepended to every object key. |
| Project ID | Optional. Informational; the bucket pins the project on Google's side. |
| Authentication | One of: ADC (default), Service Account JSON file, inline JSON |
For GKE / GCE / Cloud Run, leave the auth mode at Application
Default Credentials and ensure the workload's service account
has roles/storage.objectCreator on the bucket. Outside GCP,
mount a Service Account JSON and pick file path or paste the
JSON inline.
Via env vars (boot-time baseline)
For self-host operators who manage configuration in env files:
export OPENZRO_FLOW_ARCHIVE_GCS_BUCKET="openzro-flow-archive"
export OPENZRO_FLOW_ARCHIVE_GCS_PREFIX="prod"
# pick ONE of these:
# (a) ADC — recommended on GKE / GCE / Cloud Run
# no env var needed, the SDK reads the runtime credentials.
# (b) Service account JSON file
export OPENZRO_FLOW_ARCHIVE_GCS_CREDENTIALS_FILE="/etc/openzro/sa.json"
# (c) Service account JSON inline (constrained runners); uncomment instead of (b)
# export OPENZRO_FLOW_ARCHIVE_GCS_CREDENTIALS_JSON='{"type":"service_account",...}'
# optional knobs
export OPENZRO_FLOW_ARCHIVE_GCS_PROJECT_ID="my-gcp-project"
export OPENZRO_FLOW_ARCHIVE_GCS_FLUSH_INTERVAL="15m"
export OPENZRO_FLOW_ARCHIVE_GCS_MAX_EVENTS_PER_FILE="100000"
export OPENZRO_FLOW_ARCHIVE_GCS_BUFFER_SIZE="50000"
The management server logs
flow archive enabled: GCS bucket "openzro-flow-archive" (auth=adc)
at boot when configured.
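Because the three auth options are mutually exclusive, a small pre-flight check in the env file's wrapper script can catch a copy-paste mistake before boot. The sketch below is a hypothetical helper, not part of openZro: the function name and the mode labels "file" and "json" are assumptions (only "adc" appears in the log line above), and how the server itself behaves when both credential variables are set is not stated here.

```shell
# Hypothetical pre-flight helper (not shipped with openZro).
# Prints which auth mode the environment selects; fails if both
# credential variables are set at the same time.
gcs_auth_mode() {
  if [ -n "${OPENZRO_FLOW_ARCHIVE_GCS_CREDENTIALS_FILE:-}" ] &&
     [ -n "${OPENZRO_FLOW_ARCHIVE_GCS_CREDENTIALS_JSON:-}" ]; then
    echo "error: set only one of CREDENTIALS_FILE or CREDENTIALS_JSON" >&2
    return 1
  elif [ -n "${OPENZRO_FLOW_ARCHIVE_GCS_CREDENTIALS_FILE:-}" ]; then
    echo "file"
  elif [ -n "${OPENZRO_FLOW_ARCHIVE_GCS_CREDENTIALS_JSON:-}" ]; then
    echo "json"
  else
    echo "adc"   # neither variable set: the SDK reads the runtime credentials
  fi
}
```

Calling gcs_auth_mode after sourcing the env file prints one label on success and returns non-zero when the configuration is ambiguous.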
Object layout
Identical to the S3 sink — that's deliberate. An operator can move between the two (or run both side-by-side) without changing downstream tooling.
<prefix>/year=2026/month=04/day=26/account=<account_id>/<flush-id>.ndjson.gz
Each object is gzip-compressed NDJSON: one JSON event per line. BigQuery, DuckDB, ClickHouse, and Athena all read this format natively. For BigQuery specifically:
-- External table over the archive (run once)
CREATE EXTERNAL TABLE openzro_flow.archive
WITH PARTITION COLUMNS (year INT64, month INT64, day INT64, account STRING)
OPTIONS (
format = 'JSON',
compression = 'GZIP',
-- BigQuery accepts only one * per URI; partition values come from the hive prefix.
uris = ['gs://openzro-flow-archive/prod/*'],
hive_partition_uri_prefix = 'gs://openzro-flow-archive/prod/'
);
-- Sample query
SELECT count(*)
FROM openzro_flow.archive
WHERE year = 2026 AND month = 4 AND account = 'acct-1' AND type = 'drop';
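When browsing the bucket by hand it helps to build the partition prefix for one account and one day rather than typing it out. The helper below is a hypothetical sketch: the function name is made up, and treating an empty prefix as "no leading path segment" is an assumption about how the sink behaves when Prefix is left blank.

```shell
# Hypothetical helper: key prefix for one account on one day.
# Arguments: <prefix> <account-id> <YYYY-MM-DD>
archive_day_prefix() {
  prefix=$1
  account=$2
  day=$3
  year=${day%%-*}    # 2026-04-26 -> 2026
  rest=${day#*-}     # 2026-04-26 -> 04-26
  month=${rest%%-*}  # 04-26 -> 04
  dom=${rest#*-}     # 04-26 -> 26
  # ${prefix:+...} drops the leading segment when the prefix is empty.
  printf '%syear=%s/month=%s/day=%s/account=%s/\n' \
    "${prefix:+$prefix/}" "$year" "$month" "$dom" "$account"
}

archive_day_prefix prod acct-1 2026-04-26
# prints: prod/year=2026/month=04/day=26/account=acct-1/
```

The output can be appended to gs://openzro-flow-archive/ to list or copy a single partition.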
IAM minimum viable role
For the workload identity (GKE) or service account (file/JSON):
# Bucket-level binding, not project-level. The sink only writes;
# never reads or deletes.
roles:
- roles/storage.objectCreator
Do not grant roles/storage.admin — the sink does not need it
and the principle of least privilege says no.
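Granting the role at bucket level is a single gcloud call. The sketch below only prints the command so it can be reviewed before it is run; the function name and the service account email in the example are placeholders, not values from openZro.

```shell
# Print (do not run) the bucket-level binding for review.
# Arguments: <bucket> <service-account-email>
iam_binding_cmd() {
  printf '%s gs://%s --member=serviceAccount:%s --role=%s\n' \
    'gcloud storage buckets add-iam-policy-binding' "$1" "$2" \
    'roles/storage.objectCreator'
}

# Placeholder service account; substitute the one your workload runs as.
iam_binding_cmd openzro-flow-archive \
  flow-archive@my-gcp-project.iam.gserviceaccount.com
```

Paste the printed line into a shell once the bucket and account look right. Binding on the bucket rather than the project keeps the grant scoped to the archive.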
How the data gets to the bucket
This is not a synchronous write per event. Understanding the buffering / flush sequence saves a lot of "the bucket is empty, something's broken" support tickets:
- Per-event ingest. A peer's openzro client streams a flow.FlowEvent to the management's FlowService.Events bidi-stream. The Events handler enqueues the event onto an in-memory channel (default capacity 10 000) and immediately acks the peer. There is no per-event blocking write to GCS.
- Worker tick. A background goroutine (runWorker) ticks every FLUSH_INTERVAL (default 5 seconds) and also flushes whenever the in-memory batch reaches BATCH_SIZE (default 500 events), whichever comes first.
- Per-sink fan-out. When a flush fires, the management resolves peer identity for every event in the batch (cached, sub-millisecond) and hands the resolved batch to every configured sink in parallel. Sinks include the hot store (flow_events Postgres table), any streaming exporter (Datadog, Elastic, HTTP), and the cold archive (S3 or GCS).
- Cold archive object write. The GCS sink groups the batch by (account_id, day) and writes one object per group. Object keys follow the year=…/month=…/day=…/account=…/<flush-id>.ndjson.gz layout described above. Each object is gzipped NDJSON, streamed via the Google SDK with resumable uploads on by default.
The first object lands in the bucket once one batch's worth of
events (BATCH_SIZE) has arrived or 5 seconds (FLUSH_INTERVAL)
after the first event, whichever comes sooner.
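To make the "whichever comes first" rule concrete, the arithmetic can be worked through for a steady event rate. This is only an estimate under the defaults quoted above (BATCH_SIZE 500, FLUSH_INTERVAL 5 seconds): it treats the interval as an upper bound measured from the first event and ignores upload time. The function name is hypothetical.

```shell
# Estimated milliseconds until the first flush fires.
# Argument: steady event rate in events per second (must be > 0).
first_flush_ms() {
  rate=$1
  batch_size=500      # default BATCH_SIZE
  interval_ms=5000    # default FLUSH_INTERVAL
  fill_ms=$(( batch_size * 1000 / rate ))  # time to collect a full batch
  if [ "$fill_ms" -lt "$interval_ms" ]; then
    echo "$fill_ms"       # batch-size trigger wins
  else
    echo "$interval_ms"   # ticker wins
  fi
}

first_flush_ms 10     # prints 5000: a batch would take 50 s, so the ticker fires first
first_flush_ms 1000   # prints 500: the batch fills in half a second
```

At anything below 100 events per second the ticker decides the timing, which is why a quiet deployment still sees objects appear within seconds rather than waiting for 500 events.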
Verifying the archive
# List the last 10 objects by key (keys sort by date partition; replace placeholders).
gcloud storage ls \
--project="$PROJECT_ID" \
"gs://openzro-flow-archive/prod/**" | tail -n 10
# Download one and inspect.
gcloud storage cp \
gs://openzro-flow-archive/prod/year=2026/month=04/day=26/account=acct-1/<id>.ndjson.gz \
/tmp/sample.ndjson.gz
gzip -dc /tmp/sample.ndjson.gz | head -n 3 | jq .
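Since every line of an object is one event, a line count is a quick sanity check on a downloaded file. The sketch below is self-contained: it builds a tiny sample object locally instead of touching the bucket, and the event fields in that sample are made up for illustration rather than taken from the real schema.

```shell
# Count events in a gzipped NDJSON object (one line per event).
count_events() {
  gzip -dc < "$1" | awk 'END { print NR }'
}

# Local demo with two illustrative events; field names are assumptions.
sample=$(mktemp)
printf '%s\n' \
  '{"type":"drop","account":"acct-1"}' \
  '{"type":"accept","account":"acct-1"}' | gzip > "$sample"

count_events "$sample"   # prints 2
```

Running count_events /tmp/sample.ndjson.gz on the object downloaded above gives the number of events that flush wrote for that account and day.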
"The bucket is empty"
Walk the path top-down before assuming the sink is broken:
- Are events even reaching the management? Check that /api/network-traffic-events?limit=10 returns rows. If empty, the problem is upstream of any sink: peers aren't delivering. Most common cause is the bidi-streaming bug with cluster ingress-nginx; peers see "flow receiver sent no headers" in their logs and never push events. The fix is management.grpcProxy.enabled on Kubernetes, or a non-buggy ingress controller.
- Is the GCS sink loaded? kubectl logs … | grep "flow sink GCS:" on the management pod should show one flow sink GCS: bucket=… prefix=… line at boot per replica. If not, the flow_exports row didn't land or the bucket field is empty; re-save the destination from the dashboard.
- Is IAM right? A wrong binding produces flow sink GCS: write failed: googleapi: Error 403 once per batch. Fix the binding (roles/storage.objectCreator on the bucket, not the project) and a future flush will succeed.
- Time window. If you only just enabled flow logging, give it a few minutes; at low event rates the first flush waits for the 5 s ticker to fire after the first event arrives.
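The two log-based checks in the list above can be folded into one filter. The sketch below is a hypothetical triage helper that reads management logs on stdin and matches only the log strings quoted in the list; the function name and the three labels it prints are assumptions.

```shell
# Classify management logs read from stdin.
# Usage: kubectl logs … | triage_gcs_logs
triage_gcs_logs() {
  logs=$(cat)
  case $logs in
    *"flow sink GCS: write failed: googleapi: Error 403"*)
      echo "iam-denied" ;;        # sink is loaded but the binding is wrong
    *"flow sink GCS:"*)
      echo "sink-loaded" ;;       # boot line present, no 403 seen
    *)
      echo "sink-not-loaded" ;;   # re-save the destination from the dashboard
  esac
}
```

A "sink-loaded" result with an empty bucket points back at the first and last checks: either no events are arriving, or the first flush has not fired yet.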
Trade-offs
- Best-effort. A failed flush logs an error and drops the batch. Pair the cold archive with a streaming destination (Datadog, Elastic) when durability of every record matters.
- No CMEK by default. GCS Native supports it; the openZro config does not yet expose the key name. Open an issue if you need it — it is a one-line config addition.
- No retention lifecycle managed by openZro. Configure bucket-level lifecycle policies (Coldline / Archive transitions, age-based delete) directly in GCS.
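
As a starting point for the bucket-level lifecycle mentioned in the last item, the sketch below writes a policy file that moves objects to Coldline and later deletes them. The 30-day and 365-day ages are placeholders, and the function name is hypothetical; pick values that match your own retention requirements.

```shell
# Write a GCS lifecycle policy to the given path.
# Ages are placeholders: Coldline after 30 days, delete after 365.
write_lifecycle_policy() {
  cat > "$1" <<'EOF'
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
      "condition": {"age": 30}
    },
    {
      "action": {"type": "Delete"},
      "condition": {"age": 365}
    }
  ]
}
EOF
}

write_lifecycle_policy lifecycle.json
# Apply it with:
#   gcloud storage buckets update gs://openzro-flow-archive --lifecycle-file=lifecycle.json
```

The policy applies to the whole bucket, so if the bucket holds anything besides the flow archive, add a matchesPrefix condition before applying it.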