Active-active S3 ingest

S3-manifest ingest direction and active-active conditional-create constraints.

Publication Boundary

This page is a partial design and reference note, not a finished multi-region operations guide. Implemented object-store behavior is constrained by the object-storage plan items and limitations page.

Status

Draft proposal.

Summary

Content-addressed S3 objects can make raw batch storage idempotent across multiple collector processes, but they do not by themselves make the system active-active. The useful primitive is an S3 conditional create on an identity manifest. That manifest becomes the shared acceptance record for (agent_id, boot_id, seq_start, seq_end).

This gives the project active-active ingest without adding a database for the raw acceptance path, but only inside one S3 acceptance authority. All writers must conditionally create identity manifests in the same authoritative bucket and prefix. Multi-Region Access Points, cross-region replication, and independent regional buckets do not provide a global conditional-create lock.

The design does not give active-active live reads, rollup ownership, cache coherence, or complete query fanout for free.

Goals

  • Allow multiple collector processes behind a load balancer to accept agent uploads.
  • Keep raw batch acceptance replay-safe and idempotent across collectors.
  • Preserve the existing at-least-once agent contract and duplicate response shape.
  • Avoid introducing a database in the raw ingest acceptance path.
  • Keep S3 as the durable system of record.

Non-Goals

  • Exactly-once delivery.
  • Cross-collector in-memory live-state consistency.
  • Multi-writer mutable rollup files.
  • Public query serving from every collector with complete live data.
  • Accepting ingest while S3 is unavailable in active-active mode.
  • Cross-region active-active acceptance without a separate global coordinator.

Acceptance Authority

An acceptance authority is one S3 bucket plus one configured prefix namespace. Every collector participating in the same active-active deployment must write accepted/v1/ manifests through that same authority.

Do not treat replicated buckets as independent authorities. Cross-region replication is asynchronous, so two regional buckets can both accept the same identity before replication converges. Do not treat an S3 Multi-Region Access Point as a lock unless the deployment can prove every conditional manifest write for this prefix is routed to the same backing bucket. MRAP routing is an availability and latency feature, not object-key-based coordination.

If the deployment needs multi-region active-active ingest with regional write survivability, add a separate global coordinator for accepted identities. The S3 manifest protocol remains useful as the regional durable record, but it is not the global linearization point in that topology.

Required S3 Semantics

The design relies on these S3 behaviors:

  • Conditional writes with If-None-Match: * fail when the target key already has a current object.
  • S3 supports conditional writes for PutObject and CompleteMultipartUpload.
  • Bucket policies can require conditional headers for writes under selected prefixes.
  • S3 provides strong read-after-write consistency for object PUT, GET, HEAD, and LIST operations.

If bucket versioning is enabled, conditional writes apply to the current version of the key. The acceptance prefixes should deny deletes, or use Object Lock if the deployment requires stronger protection, so a delete marker cannot reopen an accepted identity key.

The protocol should not use CopyObject into acceptance prefixes. AWS documents conditional writes for CopyObject, but enforced conditional-write bucket policies can reject copy operations even when conditional headers are present. The ingest path only needs direct object creation.

Object Layout

Use immutable content blobs, immutable identity manifests, quarantine manifests, and derived indexes.

raw-blobs/v1/sha256/<hh>/<hh>/<sha256>.lmbatch
raw-blobs-by-time/v1/date=YYYY-MM-DD/hour=HH/sha256=<sha256>.json
accepted/v1/agent=<agent-hex>/boot=<boot-hex>/<seq-start>-<seq-end>.json
accepted-by-time/v1/date=YYYY-MM-DD/hour=HH/<manifest-id>.json
accepted-by-blob/v1/sha256/<hh>/<hh>/<sha256>/agent=<agent-hex>/boot=<boot-hex>/<seq-start>-<seq-end>.json
quarantine/v1/identity-conflict/agent=<agent-hex>/boot=<boot-hex>/<seq-start>-<seq-end>/<sha256>.json
quarantine-by-blob/v1/sha256/<hh>/<hh>/<sha256>/agent=<agent-hex>/boot=<boot-hex>/<seq-start>-<seq-end>.json
manifests/v1/date=YYYY-MM-DD/hour=HH/<manifest-id>.json
rollup/v1/window=<window>/date=YYYY-MM-DD/hour=HH/metric=<metric-name>/<chunk-id>.lmrollup.zst
index/v1/date=YYYY-MM-DD/hour=HH/<chunk-id>.lmindex.zst

agent-hex and boot-hex should use the same hex component encoding as the local spool filenames. seq-start and seq-end should stay fixed-width decimal so lexicographic ordering matches sequence ordering.

The content hash is sha256 over the exact validated Lightmetrics frame bytes that will be stored. HTTP Content-Encoding stays forbidden; compression remains inside the frame flags. If a future protocol accepts multiple wire encodings for the same logical batch, this must be revisited and the hash should move to a canonical stored representation.

accepted/v1/ is identity-keyed because the ingest path must be able to construct the exact dedupe key before writing. accepted-by-time/v1/ and raw-blobs-by-time/v1/ are derived indexes for reconciliation and garbage collection; they are not acceptance records.
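
As a sketch, the blob and identity keys above can be derived with two small helpers. The function names are illustrative, and the 20-digit sequence width is an assumption taken from the fixed-width example keys elsewhere in this note:

```python
import hashlib

def raw_blob_key(frame_bytes: bytes) -> str:
    """Content-addressed key: two hex fan-out levels, then the full digest."""
    digest = hashlib.sha256(frame_bytes).hexdigest()
    return f"raw-blobs/v1/sha256/{digest[:2]}/{digest[2:4]}/{digest}.lmbatch"

def accepted_key(agent_hex: str, boot_hex: str, seq_start: int, seq_end: int) -> str:
    """Identity-keyed manifest key; fixed-width decimal keeps lexicographic
    LIST order identical to sequence order."""
    return (
        f"accepted/v1/agent={agent_hex}/boot={boot_hex}/"
        f"{seq_start:020d}-{seq_end:020d}.json"
    )
```

Because the blob key depends only on the frame bytes, every collector computes the same key for the same batch, which is what makes the blob write idempotent.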

Acceptance Manifest

The identity manifest is the cluster-wide dedupe record:

{
  "schema": "lightmetrics.accepted_batch.v1",
  "agent_id": "agent-a",
  "boot_id": "boot-uuid",
  "seq_start": 10,
  "seq_end": 20,
  "frame_sha256": "hex",
  "frame_bytes": 12345,
  "raw_blob_key": "raw-blobs/v1/sha256/ab/cd/hex.lmbatch",
  "accepted_at_unix_ns": 1760000000000000000,
  "collector_id": "collector-1",
  "wire_version": 1,
  "frame_version": 1
}

Only the first collector to create this object has accepted the batch. Every later collector must treat the object as the source of truth.

Ingest Flow

  1. Authenticate the request and verify that the authenticated principal may write the claimed agent_id.
  2. Enforce body size limits before decode.
  3. Decode and validate the frame and batch.
  4. Derive the batch identity from the decoded batch: (agent_id, boot_id, seq_start, seq_end).
  5. Compute frame_sha256 over the exact validated frame bytes.
  6. Write the raw blob to raw-blobs/v1/.../<frame_sha256>.lmbatch with If-None-Match: *.
  7. If the raw blob already exists, continue. The content address makes this case idempotent.
  8. Create the identity manifest at accepted/v1/... with If-None-Match: *.
  9. If manifest creation succeeds, the batch is accepted. The collector may opportunistically write reverse references, update local live state, and enqueue rollup/index work before or after returning duplicate=false; the reconciler is responsible for repairing any derived side effect that does not complete.
  10. If manifest creation fails because the key exists, read the existing manifest. If frame_sha256 matches, return duplicate=true and do not update live state or enqueue work. If it differs, return an identity-conflict error and optionally quarantine the submitted frame.

The identity manifest creation is the only point where a batch becomes accepted. All side effects that must happen once per accepted batch must occur after that conditional create succeeds.
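
The steps above can be sketched against an in-memory stand-in for one S3 acceptance authority. `ConditionalStore` and `accept_batch` are illustrative names, not the implementation; a real collector would issue `PutObject` with `If-None-Match: *` and map 412 to the lose-the-race branch:

```python
import json

class ConditionalStore:
    """In-memory stand-in for S3 conditional creates. create() models
    PutObject with If-None-Match: * -- it returns False instead of
    412 Precondition Failed when the key already holds a current object."""
    def __init__(self):
        self.objects = {}

    def create(self, key: str, body: bytes) -> bool:
        if key in self.objects:
            return False          # 412: key already has a current object
        self.objects[key] = body  # 200: created
        return True

    def get(self, key: str) -> bytes:
        return self.objects[key]

def accept_batch(store, blob_key, identity_key, frame_bytes, manifest):
    # Steps 6-7: content-addressed blob write; an existing blob is idempotent.
    store.create(blob_key, frame_bytes)
    # Step 8: the conditional manifest create is the single acceptance point.
    if store.create(identity_key, json.dumps(manifest).encode()):
        return {"duplicate": False}
    # Steps 9-10: lost the race; the existing manifest is the source of truth.
    existing = json.loads(store.get(identity_key))
    if existing["frame_sha256"] == manifest["frame_sha256"]:
        return {"duplicate": True}
    return {"error": "identity_conflict"}
```

Note that the blob write result is deliberately ignored: whether the blob was created or already existed, acceptance is decided only by the manifest create.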

S3 Error Handling

Conditional-create handling must distinguish durable existence from retryable S3 races:

  • 200 or 201 from a raw blob write means the blob was created.
  • 412 Precondition Failed from a raw blob write means the content-addressed blob already exists; continue to manifest creation.
  • 409 Conflict from a raw PutObject conditional write is retryable with backoff. It is not evidence that the object exists.
  • 409 Conflict from CompleteMultipartUpload requires starting a new multipart upload before retrying.
  • 412 Precondition Failed from the accepted-manifest write means the identity key already exists; read the existing manifest and compare frame_sha256.
  • 409 Conflict, timeout, or unknown outcome from accepted-manifest creation must be resolved by retrying the conditional create or reading the manifest key. Do not classify it as duplicate unless the existing manifest is observed and its hash matches.

Accepted manifests should be small single-PutObject writes. They should not use multipart upload.
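
The classification above can be expressed as a small pure function. The action names are illustrative; the status mapping follows the rules in the list:

```python
def classify_conditional_write(target: str, status) -> str:
    """Map an S3 conditional-create outcome to the next action.

    target is "raw_blob" or "manifest"; status is the HTTP status code,
    or None for a timeout / unknown outcome.
    """
    if status in (200, 201):
        return "created"
    if status == 412:
        # Durable existence of a current object at this key.
        return "continue" if target == "raw_blob" else "read_existing_manifest"
    if status == 409:
        # Concurrent-request race: retryable, not evidence the object exists.
        return "retry_with_backoff"
    return "resolve_by_read_or_retry"  # timeout or unknown outcome
```

The key asymmetry is that only 412 proves existence; 409 and timeouts must be resolved by retrying the conditional create or reading the key.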

Response Semantics

  • New accepted batch: 200, duplicate=false.
  • Same identity and same hash already accepted: 200, duplicate=true.
  • Same identity with a different hash: 409 identity_conflict; the agent must not delete the queued batch based on this response.
  • S3 unavailable before manifest creation: 503; the agent must retry.
  • Blob upload succeeds but manifest creation does not complete: retry is safe. The raw blob may be orphaned until garbage collection.

For identity_conflict, the collector should write a quarantine manifest before returning when possible:

{
  "schema": "lightmetrics.identity_conflict.v1",
  "agent_id": "agent-a",
  "boot_id": "boot-uuid",
  "seq_start": 10,
  "seq_end": 20,
  "accepted_manifest_key": "accepted/v1/agent=.../boot=.../00000000000000000010-00000000000000000020.json",
  "accepted_raw_blob_key": "raw-blobs/v1/sha256/ab/cd/accepted-hex.lmbatch",
  "accepted_frame_sha256": "accepted-hex",
  "submitted_raw_blob_key": "raw-blobs/v1/sha256/de/ad/submitted-hex.lmbatch",
  "submitted_frame_sha256": "submitted-hex",
  "observed_at_unix_ns": 1760000000000000000,
  "collector_id": "collector-2"
}

The agent should park the conflicting batch outside the normal retry order instead of retrying it forever, keep the queue entry for operator inspection, continue later uploads when the protocol does not require contiguous acknowledgement, and emit a local alert. The query or admin API should expose conflicts with both hashes, object keys, agent identity, sequence range, and first-seen time. Operator resolution is explicit: either accept the existing manifest as authoritative and drop the parked agent queue entry, or quarantine the agent/boot identity for investigation.

Local Spool Role

In single-collector mode, the local disk spool can remain the durable acceptance point before asynchronous object-store upload. The collector exposes that as ingest.acceptance = "local_spool" plus a configured object store; accepted spool batches are uploaded on ingest.object_landing_interval_ms, while private query remains backed by the local accepted spool before landing.

In active-active mode, the local spool cannot be the acceptance point because each collector has a different spool. There are two defensible options:

  • Strict active-active: acknowledge only after the S3 identity manifest is created. Local disk is only a cache and crash-recovery aid.
  • Single-writer degradation: if S3 is down, route all ingest for an agent to one collector with sticky routing and local dedupe. This is more complex and should not be the default.

The recommended active-active MVP is strict active-active. If S3 is unavailable, return 503 instead of accepting locally.

Reconciliation

Active-active mode requires a reconciler. The accept path may update live state and enqueue rollup/index work after manifest creation, but correctness cannot depend on those in-process side effects. A collector can crash after creating the accepted manifest and before updating memory, local cache, reverse indexes, or rollup queues.

The reconciler must repeatedly discover accepted manifests and idempotently repair derived state:

  • verify the referenced raw blob exists and matches frame_sha256
  • backfill accepted-by-time/v1/ entries
  • backfill accepted-by-blob/v1/ reverse references
  • backfill raw-blobs-by-time/v1/ entries for referenced blobs
  • populate or refresh local cache and bounded live state
  • enqueue missing rollup and index work
  • mark corrupt or missing raw blobs for operator-visible repair

Discovery cannot use a bounded time-range LIST directly over accepted/v1/, because that prefix is keyed by agent, boot, and sequence. The primary hot path should consume S3 object events for accepted/v1/ and write the accepted-by-time/v1/ derived index. The steady-state scanner can then LIST accepted-by-time/v1/date=.../hour=.../ from a durable checkpoint.

Because object events and derived indexes can be missing after a crash or misconfiguration, cold reconciliation must have a complete fallback. For MVP, that fallback is either an S3 Inventory report over accepted/v1/ or a full scan of all accepted/v1/ identity partitions. A deployment with an agent registry can narrow the full scan to known (agent_id, boot_id) prefixes, but the design must still treat a cold scan as potentially unbounded operational work.

Startup recovery must scan from the last durable checkpoint far enough back to cover event loss, clock skew, retry windows, and previous unclean shutdowns. Periodic audit scans using S3 Inventory or identity-prefix listing should compare accepted/v1/ against the derived indexes and repair missing entries.
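
A sketch of the derived-index key the event consumer would write, assuming UTC date and hour partitions; `accepted_by_time_key` is an illustrative name:

```python
from datetime import datetime, timezone

def accepted_by_time_key(manifest_id: str, accepted_at_unix_ns: int) -> str:
    """Partition by UTC date and hour so the steady-state scanner can
    LIST bounded time windows from a durable checkpoint."""
    t = datetime.fromtimestamp(accepted_at_unix_ns // 1_000_000_000,
                               tz=timezone.utc)
    return f"accepted-by-time/v1/date={t:%Y-%m-%d}/hour={t:%H}/{manifest_id}.json"
```

Writing this key is idempotent: re-deriving it from the same manifest always yields the same object key, so replayed events and audit repairs are safe.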

Live State

The winning collector can update its local live state immediately after creating the identity manifest. Other collectors will not see that in-memory state unless they poll accepted manifests or receive a replication event.

For v1, keep the claim narrow:

  • active-active ingest is supported
  • complete live reads are served by a designated query collector, or query responses are allowed to be explicitly partial
  • each collector can rebuild recent live state by scanning accepted manifests and fetching raw blobs

If complete low-latency reads from every collector become a requirement, add a shared live-state layer, object-event fanout, or a query coordinator. Content addressing does not solve that problem.

Rollups and Indexes

Rollup and index generation must also be idempotent. The simplest v1 rule is to run rollups on a single designated collector. That keeps the raw ingest path active-active while avoiding distributed rollup ownership.

If rollups need to run on multiple collectors, use one of these patterns:

  • deterministic shard ownership from a configured collector set
  • idempotent deterministic rollup output keys with conditional creates
  • an external lease/claim system

Do not let multiple collectors mutate the same rollup object. Rollup files should remain immutable chunks plus manifests.
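
The first pattern, deterministic shard ownership, can be sketched as a pure function; `rollup_owner` is a hypothetical name and the hash choice is illustrative. Every collector with the same configured set computes the same owner without coordination:

```python
import hashlib

def rollup_owner(agent_id: str, collectors: list[str]) -> str:
    """Deterministic shard ownership from a configured collector set.
    Sorting first makes the result independent of config ordering."""
    ordered = sorted(collectors)
    h = int.from_bytes(hashlib.sha256(agent_id.encode()).digest()[:8], "big")
    return ordered[h % len(ordered)]
```

This is static assignment, not failover: if the owning collector is down, its shards stall until the configured set changes, which is why an external lease/claim system is the alternative pattern.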

Bucket Policy Requirements

Use bucket policy to protect the acceptance protocol:

  • require If-None-Match on writes under accepted/v1/
  • deny DeleteObject and DeleteObjectVersion under accepted/v1/
  • configure no lifecycle expiration, noncurrent-version expiration, or delete-marker creation for accepted/v1/
  • forbid CopyObject into accepted/v1/
  • deny DeleteObject and DeleteObjectVersion under raw-blobs/v1/ for all roles except the dedicated GC role
  • configure no lifecycle expiration or noncurrent-version expiration for raw-blobs/v1/; lifecycle may only abort incomplete multipart uploads
  • require the GC role to use the GC procedure below before deleting from raw-blobs/v1/
  • require server-side encryption according to deployment policy
  • restrict collectors to the expected prefixes
  • optionally require If-None-Match under raw-blobs/v1/
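
A hedged sketch of the first two bullets as bucket policy statements, assuming the s3:if-none-match condition key that AWS documents for enforcing conditional writes; EXAMPLE-BUCKET is a placeholder and the statement ids are illustrative:

```json
[
  {
    "Sid": "RequireConditionalCreateUnderAccepted",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:PutObject",
    "Resource": "arn:aws:s3:::EXAMPLE-BUCKET/accepted/v1/*",
    "Condition": { "Null": { "s3:if-none-match": "true" } }
  },
  {
    "Sid": "DenyDeleteUnderAccepted",
    "Effect": "Deny",
    "Principal": "*",
    "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
    "Resource": "arn:aws:s3:::EXAMPLE-BUCKET/accepted/v1/*"
  }
]
```

The Null condition denies any PutObject under accepted/v1/ that does not carry an If-None-Match header, which is what turns the acceptance prefix into a create-only namespace.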

For multipart uploads, policy needs to account for the fact that only CompleteMultipartUpload carries the final conditional create check.

If versioning is enabled, deleting the current accepted manifest can create a delete marker, and If-None-Match: * can then accept a new object at the same key. The acceptance prefix must be treated as append-only metadata, not as ordinary mutable object storage.

Raw blobs are not acceptance records, but accepted manifests depend on them. Manual deletes, broad lifecycle expiration, or broad GC under raw-blobs/v1/ can leave an accepted manifest pointing at missing data. Treat referenced raw blobs as retained data, and allow deletion only through a narrow GC role after reverse-reference proof.

Garbage Collection

Orphan raw blobs are expected:

  • a collector can upload the blob and crash before creating the identity manifest
  • a collector can lose the race to create the identity manifest
  • a request can time out after blob upload and before manifest creation

Garbage collection should delete raw blobs only when all of these are true:

  • the blob is older than the maximum retry horizon
  • no accepted manifest references it
  • no quarantine manifest references it
  • the reconciler or GC job has completed an S3 Inventory or full identity-prefix audit for the accepted and quarantine prefixes through the candidate blob’s age horizon

Acceptance manifests are durable records and should not be deleted by routine garbage collection.

Candidate discovery cannot use a bounded time-range LIST directly over raw-blobs/v1/, because that prefix is hash-keyed. The preferred input is S3 Inventory over raw-blobs/v1/, filtered by object age. The hot-path optimization is the derived raw-blobs-by-time/v1/ index. For MVP, GC may also be opportunistic and full-scan only; in that mode it is acceptable for orphan blobs to remain longer than the retry horizon.

The GC path should not scan all accepted manifests for every candidate blob. Use the accepted-by-blob/v1/ and quarantine-by-blob/v1/ reverse-reference prefixes as the fast proof. Those reverse references are derived state, so GC may trust their absence only after one of these has happened:

  • the reconciler has completed an S3 Inventory or full identity-prefix audit for the accepted and quarantine prefixes through the candidate blob’s age horizon
  • the GC job performs that bounded audit itself before deletion

If neither condition is met, skip the delete. Retaining an orphan blob is cheaper than deleting data referenced by an accepted manifest.
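
The delete gate above reduces to a conjunction. `may_delete_raw_blob` is an illustrative name; a real GC job would derive these booleans from the reverse-reference prefixes and the audit checkpoint:

```python
def may_delete_raw_blob(age_s: float, retry_horizon_s: float,
                        has_accepted_ref: bool, has_quarantine_ref: bool,
                        audit_complete_through_horizon: bool) -> bool:
    """Delete only when every condition holds; otherwise keep the orphan.
    Retaining an orphan is cheaper than deleting referenced data."""
    return (age_s > retry_horizon_s
            and not has_accepted_ref
            and not has_quarantine_ref
            and audit_complete_through_horizon)
```

Any uncertain input should be passed in its conservative direction (reference present, audit incomplete), so that doubt always resolves to keeping the blob.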

Migration Path

  1. Keep the existing local spool behavior for single-collector mode.
  2. Add object-store writer support for raw blobs and accepted manifests.
  3. Add a config flag such as ingest.acceptance = "local_spool" | "s3_manifest".
  4. In s3_manifest mode, acknowledge only after the manifest conditional create succeeds.
  5. Add conflict handling tests with two collector instances racing on the same identity.
  6. Add S3 Inventory or equivalent full-scan fallback for accepted/v1/ audits.
  7. Add the reconciler before enabling active-active mode in production.
  8. Keep rollups single-owner until raw ingest behavior is stable.

Risks

  • S3 latency is now on the ingest acknowledgement path in active-active mode.
  • S3 request cost increases because every accepted batch needs at least one blob write and one manifest write.
  • Existing local-spool outage behavior changes; active-active mode cannot safely acknowledge while S3 is unavailable.
  • Live query behavior can become misleading if every collector serves private query requests without shared state or explicit partial markers.
  • Misconfigured bucket policies can silently weaken the protocol by allowing overwrites or deletes in the acceptance prefix.
  • A second authoritative bucket or MRAP-routed write path can split the acceptance record and admit duplicate identities.
  • Missing or stalled reconciliation can leave accepted batches out of live state, local cache, rollups, or indexes.
  • Broad raw-blob deletes or lifecycle expiration can turn durable acceptance records into unrecoverable missing data.
  • Identity-keyed acceptance and hash-keyed raw blobs make cold reconciliation and GC operationally expensive without derived indexes or S3 Inventory.

Decision

Adopt content-addressed raw blobs plus conditional identity manifests for active-active ingest. Treat this as active-active ingest only, not active-active query or rollup execution. The first implementation should use strict S3-backed acceptance and single-owner rollups.