Publication Boundary
This page is a partial design and reference note, not a finished multi-region operations guide.
Partial public document
S3-manifest ingest direction and active-active conditional-create constraints.
Implemented object-store behavior is constrained by the object-storage plan items and limitations page.
Draft proposal.
Content-addressed S3 objects can make raw batch storage idempotent across multiple
collector processes, but they do not by themselves make the system active-active.
The useful primitive is an S3 conditional create on an identity manifest. That
manifest becomes the shared acceptance record for (agent_id, boot_id, seq_start, seq_end).
This gives the project active-active ingest without adding a database for the raw acceptance path, but only inside one S3 acceptance authority. All writers must conditionally create identity manifests in the same authoritative bucket and prefix. Multi-Region Access Points, cross-region replication, and independent regional buckets do not provide a global conditional-create lock.
The design does not give active-active live reads, rollup ownership, cache coherence, or complete query fanout for free.
An acceptance authority is one S3 bucket plus one configured prefix namespace. Every
collector participating in the same active-active deployment must write
accepted/v1/ manifests through that same authority.
Do not treat replicated buckets as independent authorities. Cross-region replication is asynchronous, so two regional buckets can both accept the same identity before replication converges. Do not treat an S3 Multi-Region Access Point as a lock unless the deployment can prove every conditional manifest write for this prefix is routed to the same backing bucket. MRAP routing is an availability and latency feature, not object-key-based coordination.
If the deployment needs multi-region active-active ingest with regional write survivability, add a separate global coordinator for accepted identities. The S3 manifest protocol remains useful as the regional durable record, but it is not the global linearization point in that topology.
The design relies on these S3 behaviors:
Conditional writes with If-None-Match: * fail when the target key already has a current object.
The conditional check applies to PutObject and CompleteMultipartUpload.
If bucket versioning is enabled, conditional writes apply to the current version of the key. The acceptance prefixes should deny deletes, or use Object Lock if the deployment requires stronger protection, so a delete marker cannot reopen an accepted identity key.
The protocol should not use CopyObject into acceptance prefixes. AWS documents
conditional writes for CopyObject, but enforced conditional-write bucket policies
can reject copy operations even when conditional headers are present. The ingest
path only needs direct object creation.
Use immutable content blobs, immutable identity manifests, quarantine manifests, and derived indexes.
raw-blobs/v1/sha256/<hh>/<hh>/<sha256>.lmbatch
raw-blobs-by-time/v1/date=YYYY-MM-DD/hour=HH/sha256=<sha256>.json
accepted/v1/agent=<agent-hex>/boot=<boot-hex>/<seq-start>-<seq-end>.json
accepted-by-time/v1/date=YYYY-MM-DD/hour=HH/<manifest-id>.json
accepted-by-blob/v1/sha256/<hh>/<hh>/<sha256>/agent=<agent-hex>/boot=<boot-hex>/<seq-start>-<seq-end>.json
quarantine/v1/identity-conflict/agent=<agent-hex>/boot=<boot-hex>/<seq-start>-<seq-end>/<sha256>.json
quarantine-by-blob/v1/sha256/<hh>/<hh>/<sha256>/agent=<agent-hex>/boot=<boot-hex>/<seq-start>-<seq-end>.json
manifests/v1/date=YYYY-MM-DD/hour=HH/<manifest-id>.json
rollup/v1/window=<window>/date=YYYY-MM-DD/hour=HH/metric=<metric-name>/<chunk-id>.lmrollup.zst
index/v1/date=YYYY-MM-DD/hour=HH/<chunk-id>.lmindex.zst
agent-hex and boot-hex should use the same hex component encoding as the local
spool filenames. seq-start and seq-end should stay fixed-width decimal so
lexicographic ordering matches sequence ordering.
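As an illustration of the fixed-width encoding, a minimal key builder (the function and constant names are assumptions, not part of the spec):

```python
# Illustrative only: helper and constant names are not from the spec.
SEQ_WIDTH = 20  # fixed-width decimal so lexicographic order == numeric order

def accepted_manifest_key(agent_hex: str, boot_hex: str,
                          seq_start: int, seq_end: int) -> str:
    """Build the accepted/v1/ identity key for one batch."""
    return (
        f"accepted/v1/agent={agent_hex}/boot={boot_hex}/"
        f"{seq_start:0{SEQ_WIDTH}d}-{seq_end:0{SEQ_WIDTH}d}.json"
    )
```

With this padding, listing a boot prefix returns manifests in sequence order.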
The content hash is sha256 over the exact validated Lightmetrics frame bytes that
will be stored. HTTP Content-Encoding stays forbidden; compression remains inside
the frame flags. If a future protocol accepts multiple wire encodings for the same
logical batch, this must be revisited and the hash should move to a canonical stored
representation.
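A sketch of the hash and blob-key derivation under those rules (the helper name is an assumption):

```python
import hashlib

def raw_blob_key(frame: bytes) -> tuple[str, str]:
    """Return (frame_sha256, raw-blobs/v1/ key) for validated frame bytes.

    The hash covers the exact stored frame bytes; compression stays inside
    the frame flags, never in HTTP Content-Encoding.
    """
    digest = hashlib.sha256(frame).hexdigest()
    # Two 2-hex-char fanout levels: raw-blobs/v1/sha256/<hh>/<hh>/<sha256>.lmbatch
    key = f"raw-blobs/v1/sha256/{digest[:2]}/{digest[2:4]}/{digest}.lmbatch"
    return digest, key
```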
accepted/v1/ is identity-keyed because the ingest path must be able to construct
the exact dedupe key before writing. accepted-by-time/v1/ and
raw-blobs-by-time/v1/ are derived indexes for reconciliation and garbage
collection; they are not acceptance records.
The identity manifest is the cluster-wide dedupe record:
{
"schema": "lightmetrics.accepted_batch.v1",
"agent_id": "agent-a",
"boot_id": "boot-uuid",
"seq_start": 10,
"seq_end": 20,
"frame_sha256": "hex",
"frame_bytes": 12345,
"raw_blob_key": "raw-blobs/v1/sha256/ab/cd/hex.lmbatch",
"accepted_at_unix_ns": 1760000000000000000,
"collector_id": "collector-1",
"wire_version": 1,
"frame_version": 1
}
Only the first collector to create this object has accepted the batch. Every later collector must treat the object as the source of truth.
The accept path is:
Validate the frame and resolve agent_id.
Build the dedupe identity (agent_id, boot_id, seq_start, seq_end).
Compute frame_sha256 over the exact validated frame bytes.
Write raw-blobs/v1/.../<frame_sha256>.lmbatch with If-None-Match: *.
Conditionally create accepted/v1/... with If-None-Match: *.
On success, return duplicate=false; the reconciler is responsible for repairing any derived side effect that does not complete.
On 412 Precondition Failed, read the existing manifest. If frame_sha256 matches, return duplicate=true and do not update live state or enqueue work. If it differs, return an identity-conflict error and optionally quarantine the submitted frame.
The identity manifest creation is the only point where a batch becomes accepted. All side effects that must happen once per accepted batch must occur after that conditional create succeeds.
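The accept path can be exercised against an in-memory stand-in for S3's If-None-Match: * semantics. Everything here except the documented key shapes and manifest fields is illustrative:

```python
import hashlib, json, time

class ConditionalStore:
    """In-memory stand-in for an S3 bucket with If-None-Match: * semantics."""
    def __init__(self):
        self.objects = {}

    def put_if_absent(self, key: str, body: bytes) -> bool:
        """Create key only if no current object exists (If-None-Match: *)."""
        if key in self.objects:
            return False  # would surface as 412 Precondition Failed
        self.objects[key] = body
        return True

    def get(self, key: str) -> bytes:
        return self.objects[key]

def accept_batch(store, collector_id, agent, boot, seq_start, seq_end, frame):
    """Accept path: blob write, then conditional identity-manifest create."""
    digest = hashlib.sha256(frame).hexdigest()
    blob_key = f"raw-blobs/v1/sha256/{digest[:2]}/{digest[2:4]}/{digest}.lmbatch"
    store.put_if_absent(blob_key, frame)  # 412 here just means content exists

    manifest_key = (f"accepted/v1/agent={agent}/boot={boot}/"
                    f"{seq_start:020d}-{seq_end:020d}.json")
    manifest = {
        "schema": "lightmetrics.accepted_batch.v1",
        "agent_id": agent, "boot_id": boot,
        "seq_start": seq_start, "seq_end": seq_end,
        "frame_sha256": digest, "frame_bytes": len(frame),
        "raw_blob_key": blob_key,
        "accepted_at_unix_ns": time.time_ns(),
        "collector_id": collector_id,
        "wire_version": 1, "frame_version": 1,
    }
    if store.put_if_absent(manifest_key, json.dumps(manifest).encode()):
        return {"duplicate": False}          # this collector accepted the batch
    existing = json.loads(store.get(manifest_key))
    if existing["frame_sha256"] == digest:
        return {"duplicate": True}           # same bytes already accepted
    return {"error": "identity_conflict"}    # same identity, different bytes
```

Running two collectors against the same store shows the dedupe and conflict branches: the second identical upload is a duplicate, and a different frame under the same identity is a conflict.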
Conditional-create handling must distinguish durable existence from retryable S3 races:
200 or 201 from a raw blob write means the blob was created.
412 Precondition Failed from a raw blob write means the content-addressed blob already exists; continue to manifest creation.
409 Conflict from a raw PutObject conditional write is retryable with backoff. It is not evidence that the object exists.
409 Conflict from CompleteMultipartUpload requires starting a new multipart upload before retrying.
412 Precondition Failed from the accepted-manifest write means the identity key already exists; read the existing manifest and compare frame_sha256.
409 Conflict, timeout, or unknown outcome from accepted-manifest creation must be resolved by retrying the conditional create or reading the manifest key. Do not classify it as duplicate unless the existing manifest is observed and its hash matches.
Accepted manifests should be small single-PutObject writes. They should not use multipart upload.
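The outcome handling can be written as a pure classification table (the operation and action names are illustrative, not protocol strings):

```python
def classify(op: str, outcome: str) -> str:
    """Map an S3 write outcome to the next protocol action.

    op: "blob_put", "blob_complete_multipart", "manifest_put"
    outcome: "200", "201", "409", "412", "timeout", "unknown"
    """
    if op == "blob_put":
        if outcome in ("200", "201"):
            return "blob_created"
        if outcome == "412":
            return "blob_exists_continue"   # content-addressed: same bytes
        if outcome == "409":
            return "retry_with_backoff"     # race, not proof of existence
    if op == "blob_complete_multipart" and outcome == "409":
        return "restart_multipart_upload"
    if op == "manifest_put":
        if outcome in ("200", "201"):
            return "accepted"
        if outcome == "412":
            return "read_manifest_compare_hash"
        # 409, timeout, unknown: never classify as duplicate blind
        return "retry_create_or_read_manifest"
    return "unhandled"
```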
The collector returns one of:
200, duplicate=false for a newly accepted batch.
200, duplicate=true for an already-accepted identical batch.
409 identity_conflict; the agent must not delete the queued batch based on this response.
503 when acceptance is unavailable; the agent must retry.
For identity_conflict, the collector should write a quarantine manifest before returning when possible:
{
"schema": "lightmetrics.identity_conflict.v1",
"agent_id": "agent-a",
"boot_id": "boot-uuid",
"seq_start": 10,
"seq_end": 20,
"accepted_manifest_key": "accepted/v1/agent=.../boot=.../00000000000000000010-00000000000000000020.json",
"accepted_raw_blob_key": "raw-blobs/v1/sha256/ab/cd/accepted-hex.lmbatch",
"accepted_frame_sha256": "accepted-hex",
"submitted_raw_blob_key": "raw-blobs/v1/sha256/de/ad/submitted-hex.lmbatch",
"submitted_frame_sha256": "submitted-hex",
"observed_at_unix_ns": 1760000000000000000,
"collector_id": "collector-2"
}
The agent should park the conflicting batch outside the normal retry order instead of retrying it forever, keep the queue entry for operator inspection, continue later uploads when the protocol does not require contiguous acknowledgement, and emit a local alert. The query or admin API should expose conflicts with both hashes, object keys, agent identity, sequence range, and first-seen time. Operator resolution is explicit: either accept the existing manifest as authoritative and drop the parked agent queue entry, or quarantine the agent/boot identity for investigation.
In single-collector mode, the local disk spool can remain the durable acceptance
point before asynchronous object-store upload. The collector exposes that as
ingest.acceptance = "local_spool" plus a configured object store; accepted
spool batches are uploaded on ingest.object_landing_interval_ms, while private
query remains backed by the local accepted spool before landing.
In active-active mode, the local spool cannot be the acceptance point because each collector has a different spool. There are two defensible options: strict S3-backed acceptance, or local acceptance with asynchronous reconciliation into the shared authority.
The recommended active-active MVP is strict active-active. If S3 is unavailable,
return 503 instead of accepting locally.
Active-active mode requires a reconciler. The accept path may update live state and enqueue rollup/index work after manifest creation, but correctness cannot depend on those in-process side effects. A collector can crash after creating the accepted manifest and before updating memory, local cache, reverse indexes, or rollup queues.
The reconciler must repeatedly discover accepted manifests and idempotently repair derived state:
Verification that the referenced raw blob exists and matches frame_sha256.
accepted-by-time/v1/ entries.
accepted-by-blob/v1/ reverse references.
raw-blobs-by-time/v1/ entries for referenced blobs.
Discovery cannot use bounded time LIST directly over accepted/v1/, because that prefix is keyed by agent, boot, and sequence. The primary hot path should consume S3 object events for accepted/v1/ and write the accepted-by-time/v1/ derived index. The steady-state scanner can then LIST accepted-by-time/v1/date=.../hour=.../ from a durable checkpoint.
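A sketch of the derived-key computation the hot path performs for one accepted manifest (the function name and manifest-id handling are assumptions):

```python
import datetime

def derived_index_keys(manifest: dict, manifest_id: str) -> dict:
    """Derive the reconciler's index keys for one accepted manifest."""
    ts = datetime.datetime.fromtimestamp(
        manifest["accepted_at_unix_ns"] // 10**9, tz=datetime.timezone.utc)
    date, hour = ts.strftime("%Y-%m-%d"), ts.strftime("%H")
    sha = manifest["frame_sha256"]
    ident = (f"agent={manifest['agent_id']}/boot={manifest['boot_id']}/"
             f"{manifest['seq_start']:020d}-{manifest['seq_end']:020d}")
    return {
        "accepted_by_time":
            f"accepted-by-time/v1/date={date}/hour={hour}/{manifest_id}.json",
        "accepted_by_blob":
            f"accepted-by-blob/v1/sha256/{sha[:2]}/{sha[2:4]}/{sha}/{ident}.json",
    }
```

Both keys are pure functions of the manifest, so the reconciler can recompute and re-put them idempotently after any crash.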
Because object events and derived indexes can be missing after a crash or
misconfiguration, cold reconciliation must have a complete fallback. For MVP, that
fallback is either an S3 Inventory report over accepted/v1/ or a full scan of all
accepted/v1/ identity partitions. A deployment with an agent registry can narrow
the full scan to known (agent_id, boot_id) prefixes, but the design must still
treat a cold scan as potentially unbounded operational work.
Startup recovery must scan from the last durable checkpoint far enough back to cover
event loss, clock skew, retry windows, and previous unclean shutdowns. Periodic
audit scans using S3 Inventory or identity-prefix listing should compare
accepted/v1/ against the derived indexes and repair missing entries.
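A minimal sketch of the startup scan window, assuming hour-granularity prefixes and a configurable scan-back margin (names are illustrative):

```python
import datetime

def audit_hours(checkpoint_unix: int, scan_back_s: int, now_unix: int):
    """Enumerate accepted-by-time/v1/ hour prefixes to re-list on startup.

    Rewinds the durable checkpoint by scan_back_s to cover event loss,
    clock skew, retry windows, and previous unclean shutdowns.
    """
    start = (checkpoint_unix - scan_back_s) // 3600 * 3600
    prefixes = []
    t = start
    while t <= now_unix:
        ts = datetime.datetime.fromtimestamp(t, tz=datetime.timezone.utc)
        prefixes.append(f"accepted-by-time/v1/date={ts:%Y-%m-%d}/hour={ts:%H}/")
        t += 3600
    return prefixes
```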
The winning collector can update its local live state immediately after creating the identity manifest. Other collectors will not see that in-memory state unless they poll accepted manifests or receive a replication event.
For v1, keep the claim narrow: ingest is active-active and deduplicated, but live reads from any single collector may be stale or incomplete until reconciliation and derived indexes catch up.
If complete low-latency reads from every collector become a requirement, add a shared live-state layer, object-event fanout, or a query coordinator. Content addressing does not solve that problem.
Rollup and index generation must also be idempotent. The simplest v1 rule is to run rollups on a single designated collector. That keeps the raw ingest path active-active while avoiding distributed rollup ownership.
If rollups need to run on multiple collectors, use one of these patterns: partition rollup windows deterministically across collectors, or have each collector claim a rollup chunk with a conditional create on its manifest key.
Do not let multiple collectors mutate the same rollup object. Rollup files should remain immutable chunks plus manifests.
Use bucket policy to protect the acceptance protocol:
Require If-None-Match on writes under accepted/v1/.
Deny DeleteObject and DeleteObjectVersion under accepted/v1/.
Deny lifecycle expiration under accepted/v1/.
Deny CopyObject into accepted/v1/.
Deny DeleteObject and DeleteObjectVersion under raw-blobs/v1/ for all roles except the dedicated GC role.
Deny lifecycle expiration under raw-blobs/v1/; lifecycle may only abort incomplete multipart uploads.
Deny CopyObject into raw-blobs/v1/.
Require If-None-Match under raw-blobs/v1/.
For multipart uploads, policy needs to account for the fact that only CompleteMultipartUpload carries the final conditional create check.
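As a sketch of two of these statements, assuming the s3:if-none-match bucket-policy condition key that AWS documents for enforcing conditional writes (the bucket name is a placeholder; the remaining denies follow the same shape and should be verified against current AWS documentation):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireConditionalCreateOnAcceptance",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::EXAMPLE-BUCKET/accepted/v1/*",
      "Condition": { "StringNotEquals": { "s3:if-none-match": "*" } }
    },
    {
      "Sid": "DenyDeletesOnAcceptance",
      "Effect": "Deny",
      "Principal": "*",
      "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
      "Resource": "arn:aws:s3:::EXAMPLE-BUCKET/accepted/v1/*"
    }
  ]
}
```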
If versioning is enabled, deleting the current accepted manifest can create a delete
marker, and If-None-Match: * can then accept a new object at the same key. The
acceptance prefix must be treated as append-only metadata, not as ordinary mutable
object storage.
Raw blobs are not acceptance records, but accepted manifests depend on them. Manual
deletes, broad lifecycle expiration, or broad GC under raw-blobs/v1/ can leave an
accepted manifest pointing at missing data. Treat referenced raw blobs as retained
data, and allow deletion only through a narrow GC role after reverse-reference proof.
Orphan raw blobs are expected: a collector can crash after the content-addressed blob write and before the manifest conditional create, and an identity-conflict submission can leave a blob unreferenced if its quarantine manifest is never written.
Garbage collection should delete raw blobs only when all of these are true: the blob is older than the upload retry horizon, no accepted-by-blob/v1/ reverse reference exists, and no quarantine-by-blob/v1/ reverse reference exists.
Acceptance manifests are durable records and should not be deleted by routine garbage collection.
Candidate discovery cannot use bounded time LIST directly over raw-blobs/v1/,
because that prefix is hash-keyed. The preferred input is S3 Inventory over
raw-blobs/v1/, filtered by object age. The hot-path optimization is the derived
raw-blobs-by-time/v1/ index. For MVP, GC may also be opportunistic and full-scan
only; in that mode it is acceptable for orphan blobs to remain longer than the retry
horizon.
The GC path should not scan all accepted manifests for every candidate blob. Use the
accepted-by-blob/v1/ and quarantine-by-blob/v1/ reverse-reference prefixes as
the fast proof. Those reverse references are derived state, so GC may trust their
absence only after one of these has happened: the reconciler's durable checkpoint has advanced past the candidate blob's creation time, or a cold reconciliation pass over accepted/v1/ has completed since the blob was written.
If neither condition is met, skip the delete. Retaining an orphan blob is cheaper than deleting data referenced by an accepted manifest.
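The delete decision reduces to a small predicate (parameter names are illustrative):

```python
def may_delete_blob(age_s: float, retry_horizon_s: float,
                    has_accepted_ref: bool, has_quarantine_ref: bool,
                    reverse_index_trusted: bool) -> bool:
    """GC predicate for one raw-blob candidate.

    reverse_index_trusted must only be True after the reconciler has either
    advanced its durable checkpoint past this blob's creation time or
    completed a cold reconciliation pass since the blob was written.
    """
    if age_s <= retry_horizon_s:
        return False   # collectors may still be retrying this upload
    if has_accepted_ref or has_quarantine_ref:
        return False   # referenced data is retained data
    if not reverse_index_trusted:
        return False   # absence of derived state proves nothing yet
    return True
```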
Add ingest.acceptance = "local_spool" | "s3_manifest".
In s3_manifest mode, acknowledge only after the manifest conditional create succeeds.
Run periodic accepted/v1/ audits.
Adopt content-addressed raw blobs plus conditional identity manifests for active-active ingest. Treat this as active-active ingest only, not active-active query or rollup execution. The first implementation should use strict S3-backed acceptance and single-owner rollups.