
Current limitations

Explicit and implicit implementation limits by component.

Publication boundary: this is the public status boundary to check before treating design notes or plan rows as implemented behavior.

Unsupported behavior: unsupported, partial, and gated behavior is intentionally visible rather than softened into product claims.

This file is the current implementation limitations registry. It describes known boundaries that are enforced by code, required by configuration, or implied by unfinished roadmap work. Target architecture remains in docs/design.md, operational backlog status remains in docs/PLAN.md, and completed-slice history remains in docs/worklog/.

System Status

  • Lightmetrics is still pre-production. The repository contains working ingest, query, object-store, demo, Grafana-smoke, and console pieces, but several target components remain planned or partial.
  • Public ingest and private query/UI surfaces are separate. The ingest listener does not serve query, UI, admin, token, or dashboard-management endpoints.
  • Runtime configuration is TOML. CUE files are validation/generation sketches, not a runtime dependency or an alternate configuration source.
  • Secrets are referenced by file path. Runtime config should not embed bearer tokens, certificate private keys, or other secret values.
  • The local demo is a development harness. It is not a hardened deployment topology, and it writes state under /tmp/lightmetrics-demo by default.
  • Demo-generated token files, MinIO credentials, and TLS private keys are created with restrictive permissions (0600) independent of ambient umask. Restart preserves existing secrets without widening permissions.
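
For illustration only, the sketch below shows how a demo harness could create a secret file with 0600 permissions that do not depend on the ambient umask and that are left untouched on restart. The helper name and error handling are assumptions, not the demo's actual code.

```rust
use std::fs::{self, OpenOptions, Permissions};
use std::io::Write;
use std::os::unix::fs::{OpenOptionsExt, PermissionsExt};
use std::path::Path;

/// Illustrative helper: create a secret file with 0600 permissions and keep an
/// existing file (and its contents/permissions) untouched across restarts.
fn write_secret_if_absent(path: &Path, contents: &[u8]) -> std::io::Result<()> {
    match OpenOptions::new()
        .write(true)
        .create_new(true) // fail instead of overwriting an existing secret
        .mode(0o600)      // requested mode at creation (still filtered by umask)
        .open(path)
    {
        Ok(mut file) => {
            file.write_all(contents)?;
            // Re-apply 0600 explicitly so the result is independent of the ambient umask.
            fs::set_permissions(path, Permissions::from_mode(0o600))
        }
        Err(e) if e.kind() == std::io::ErrorKind::AlreadyExists => Ok(()),
        Err(e) => Err(e),
    }
}
```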

Agent

  • lm-agent --daemon runs repeated heartbeat and configured log_inputs collection through the same durable queue and upload path as --once. Production service units, graceful signal handling, and dynamic config reload are not implemented yet.
  • Agent telemetry now includes lmagent_up, aggregate lightmetrics_host_cpu_seconds_total counters, capped per-core node_cpu_seconds_total counters for CPU IDs 0 through 7, node_memory_MemTotal_bytes, node_memory_MemAvailable_bytes, node_load1, node_load5, node_load15, lmagent_queue_filesystem_size_bytes, lmagent_queue_filesystem_avail_bytes, node_network_receive_bytes_total, node_network_transmit_bytes_total, lmagent_process_cpu_seconds_total, and lmagent_process_resident_memory_bytes. It also emits capped Linux process-enumeration metrics lightmetrics_host_process_cpu_seconds_total, lightmetrics_host_process_resident_memory_bytes, and lightmetrics_host_processes_omitted for up to eight /proc processes. Enumerated processes are sorted by resident memory bytes, then CPU ticks, then PID; metric identity labels are pid, start_time, and comm; command-line arguments are not read. Non-Linux host APIs and persisted CPU-rate delta calculation remain planned.
  • Cumulative CPU and network counters carry stable sample start times for reset detection: host CPU and network counters use Linux boot time, and process CPU counters use each process start time. If those Linux /proc boundaries cannot be parsed, the corresponding best-effort counters are omitted for that scrape.
  • The built-in Linux agent heartbeat currently emits up to 107 scalar metric series; CPU and network metric series use up to two labels, process-enumeration metric series use up to three labels, and generated log records use up to four attributes. Custom collector configs must keep ingest.max_series_per_batch >= 107 and ingest.max_labels_per_series >= 4, with ingest.max_label_key_bytes >= 14 and ingest.max_label_value_bytes >= 64 for built-in metric labels and log attributes; lower values are rejected by collector config validation. The sample collector config defaults of 2048 series, 32 labels, 128 key bytes, and 512 value bytes satisfy this floor.
  • Log input offsets are tracked by configured input name and path, current file device/inode identity, and a short fingerprint of the content before the committed offset. Common uncompressed rename and copy-truncate rotations in the same directory are drained before the agent switches to the new active file. Rotated files that are compressed, deleted, moved outside the configured file’s directory, or renamed to something other than the configured basename plus a . or - suffix may not be found before unread complete lines become unavailable. Trailing partial lines left in a rotated file are not queued.
  • Missing configured log files are nonfatal. A configured path that exists but is not a regular file is an error.
  • Only complete newline-terminated log lines are queued. A trailing partial line is held until a later run sees its newline.
  • Log tail reads are bounded per queued batch. A single --once run or daemon scrape can queue multiple ordered batches when more complete lines are available than one batch can carry.
  • Oversized log lines are not uploaded as full messages. The agent emits a bounded omission marker and advances past the line after durable queueing when the source line exceeds max_log_message_bytes, exceeds the batch budget after encoding overhead, or cannot be read within the bounded line window.
  • Invalid UTF-8 log bytes are converted with lossy UTF-8 replacement and marked with encoding=utf8_lossy.
  • Agent max_log_message_bytes must be configured no higher than the collector’s ingest.max_log_message_bytes. The agent cannot discover the collector limit dynamically.
  • Agent max_logs_per_batch must be configured no higher than the collector’s ingest.max_logs_per_batch. One configured log slot is used by the heartbeat log record, so file tailing collects at most max_logs_per_batch - 1 records per queued batch. If agent-side log limits diverge upward from collector limits, the collector can reject queued batches after source offsets have advanced.
  • Agent max_alerts_per_batch defaults to one local alert slot for queue-pressure drop reporting and must be configured no higher than the collector’s ingest.max_alerts_per_batch. Set it to 0 only when the collector cannot accept alert records; queue-pressure log drops can then occur without an uploaded local alert.
  • Offset commits have crash recovery for batches that are durably queued but not yet uploaded. This is not a general exactly-once guarantee; delivery remains at-least-once and collector dedupe is still required.
  • The local disk queue is FIFO by sequence. When the current scrape cannot enqueue a batch with logs because the queue is full, the agent retries the same scrape batch by dropping the newest collected log records first, keeps the heartbeat and host metrics when they still fit, and emits one agent.queue_dropped alert when max_alerts_per_batch and queue space allow it (see the sketch below). If even the metrics-only batch cannot fit, collection fails without advancing log offsets.
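
The queue-pressure retry in the last bullet can be summarized with the following sketch. The batch and queue types are illustrative (the real agent encodes batches differently, and alert queue-space accounting is omitted); it only shows the drop-newest-logs-first order, the metrics-only fallback, and the single local drop alert.

```rust
/// Illustrative batch shape for the queue-pressure sketch, not the agent's actual types.
struct ScrapeBatch {
    metrics_bytes: usize,       // heartbeat + host metrics, encoded size
    log_records: Vec<Vec<u8>>,  // encoded log records, oldest first
    dropped_logs: usize,
}

fn shrink_for_enqueue(
    mut batch: ScrapeBatch,
    queue_space_bytes: usize,
    max_alerts_per_batch: usize,
) -> Result<(ScrapeBatch, Option<&'static str>), &'static str> {
    loop {
        let log_bytes: usize = batch.log_records.iter().map(Vec::len).sum();
        if batch.metrics_bytes + log_bytes <= queue_space_bytes {
            // Report at most one local drop alert, and only when an alert slot exists.
            let alert = (batch.dropped_logs > 0 && max_alerts_per_batch > 0)
                .then_some("agent.queue_dropped");
            return Ok((batch, alert));
        }
        // Drop the newest collected log records first; offsets for dropped
        // records must not be committed as uploaded.
        if batch.log_records.pop().is_some() {
            batch.dropped_logs += 1;
        } else {
            // Even the metrics-only batch cannot fit.
            return Err("queue full: fail collection without advancing log offsets");
        }
    }
}
```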

Protocol And Upload

  • The wire protocol is versioned framed Cap’n Proto over HTTPS. Unknown frame versions, unknown flags, CRC failures, and malformed payloads are rejected (see the frame-check sketch after this list).
  • HTTP Content-Encoding is not supported for batch uploads. Compression is reserved for frame flags.
  • The zstd frame flag is reserved but not implemented; frames with zstd enabled are currently rejected.
  • Ingest remains at-least-once. The agent deletes queue entries only after a collector success response that matches the queued batch identity.
  • The custom-CA HTTPS upload path uses a small rustls HTTP/1.1 client and builds the request head from a restricted collector URL form. It rejects control bytes, whitespace, userinfo, fragments, unsupported IPv6 authority forms, and network-path-style request targets before opening the TLS connection.
  • TLS 1.3 0-RTT is only acceptable for replay-safe ingest. Query, UI, admin, and token-style endpoints must reject early data.
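
A minimal sketch of the frame checks above follows. The version constant, flag bit layout, and CRC variant are illustrative stand-ins rather than the actual wire format; Cap’n Proto payload decoding (and malformed-payload rejection) happens after these checks.

```rust
const SUPPORTED_FRAME_VERSION: u8 = 1;     // assumed value for illustration
const FLAG_ZSTD: u8 = 0b0000_0001;         // reserved flag: currently rejected when set

#[derive(Debug)]
enum FrameError {
    UnknownVersion(u8),
    UnknownFlags(u8),
    ZstdNotImplemented,
    CrcMismatch,
}

fn check_frame(version: u8, flags: u8, payload: &[u8], declared_crc: u32) -> Result<(), FrameError> {
    if version != SUPPORTED_FRAME_VERSION {
        return Err(FrameError::UnknownVersion(version));
    }
    if flags & !FLAG_ZSTD != 0 {
        return Err(FrameError::UnknownFlags(flags));
    }
    if flags & FLAG_ZSTD != 0 {
        return Err(FrameError::ZstdNotImplemented);
    }
    if crc32_ieee(payload) != declared_crc {
        return Err(FrameError::CrcMismatch);
    }
    Ok(())
}

fn crc32_ieee(bytes: &[u8]) -> u32 {
    // Bitwise CRC-32 (IEEE polynomial), used here only as a stand-in checksum.
    let mut crc = 0xFFFF_FFFFu32;
    for &b in bytes {
        crc ^= b as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}
```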

Collector Ingest

  • local_spool and s3_manifest are the implemented ingest acceptance modes. local_spool acknowledges after durable local spool acceptance and is a single-collector durability mode, not active-active acceptance.
  • When local_spool has a configured object store, accepted batches are landed asynchronously on ingest.object_landing_interval_ms. Object-store landing failures are logged and retried by the next background pass, but they do not reject already accepted local-spool ingest, and queries continue to read the accepted spool before landing.
  • s3_manifest acknowledges only after object-store accepted-manifest conditional create succeeds. Object-store outages in this mode should become 503 backpressure rather than local acceptance.
  • Duplicate accepted batches are replay-safe, keyed by (agent_id, boot_id, seq_start, seq_end). A batch with the same identity but different bytes is an identity conflict and is not acknowledged as accepted (see the sketch after this list).
  • Identity-conflict quarantine is implemented, but there is no production admin HTTP API or operator resolution workflow for conflicts yet.
  • Public ingest limits are configured on the collector and enforced during frame and Cap’n Proto decode. The agent must be configured to stay within those limits for records it constructs locally, including log message length and log record count.
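
The replay/duplicate/conflict decision can be sketched as follows. The digest type and the in-memory map are illustrative; only the identity tuple and the same-identity-different-bytes rule come from the documented behavior.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

#[derive(Clone, PartialEq, Eq, Hash)]
struct BatchIdentity {
    agent_id: String,
    boot_id: String,
    seq_start: u64,
    seq_end: u64,
}

enum Acceptance {
    Accepted,         // first acceptance of this identity
    DuplicateReplay,  // same identity and same bytes: safe to acknowledge again
    IdentityConflict, // same identity but different bytes: quarantine, do not ack
}

fn classify(
    seen: &mut HashMap<BatchIdentity, u64>, // identity -> digest of the accepted bytes
    identity: BatchIdentity,
    payload_digest: u64,
) -> Acceptance {
    match seen.entry(identity) {
        Entry::Vacant(slot) => {
            slot.insert(payload_digest);
            Acceptance::Accepted
        }
        Entry::Occupied(slot) if *slot.get() == payload_digest => Acceptance::DuplicateReplay,
        Entry::Occupied(_) => Acceptance::IdentityConflict,
    }
}
```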

Object Storage

  • Filesystem and S3-compatible object-store backends are implemented. gcs is parsed by config but rejected at runtime.
  • Active-active ingest requires one authoritative bucket/prefix namespace with conditional accepted-manifest writes. Independent regional buckets, asynchronous replication, or MRAP-routed writes are not an acceptance lock.
  • Accepted manifests are the durable acceptance records. Derived indexes such as accepted-by-time, accepted-by-blob, and raw-blobs-by-time are repairable side effects.
  • Derived index write failures are logged and do not reverse a completed acceptance decision (see the sketch after this list).
  • Startup reconciliation in s3_manifest mode and fixed-interval in-process background reconciliation for configured object stores run accepted-manifest repair passes for listing-capable backends. The private query listener exposes the latest reconciliation attempt at GET /api/v1/object-store/reconciliation. Configurable schedules, cross-process leader election, S3 Inventory fallback, full cold-scan scheduling, and rollup enqueueing remain planned. The query path does not maintain a separate persistent local query cache; derived object indexes are the rebuildable query-planning side effects, and local .object-landed markers remain the landed-spool tail boundary.
  • Object-store garbage collection is not implemented. Routine deletion of accepted manifests is not allowed, and raw blob deletion needs reverse-reference proof plus an audit horizon.
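
The split between the authoritative accepted manifest and the repairable derived indexes can be sketched as below. The trait, method names, and key parameters are assumptions, and real acceptance also has to compare bytes when a manifest for the same identity already exists.

```rust
trait ObjectStore {
    /// Create-only write (conditional create); Ok(false) means the key already exists.
    fn put_if_absent(&self, key: &str, bytes: &[u8]) -> Result<bool, String>;
    fn put(&self, key: &str, bytes: &[u8]) -> Result<(), String>;
}

fn land_accepted_batch(
    store: &dyn ObjectStore,
    manifest_key: &str,
    manifest: &[u8],
    index_key: &str,
    index_entry: &[u8],
) -> Result<(), String> {
    // The accepted manifest is the durable acceptance record: a failed
    // conditional create fails acceptance (503 backpressure in s3_manifest
    // mode rather than silent local acceptance).
    let _newly_created = store.put_if_absent(manifest_key, manifest)?;
    // Derived indexes (accepted-by-time, accepted-by-blob, raw-blobs-by-time)
    // are repairable side effects: a failed write is logged and left for
    // reconciliation, and never reverses the completed acceptance.
    if let Err(err) = store.put(index_key, index_entry) {
        eprintln!("derived index write failed; deferring to reconciliation: {err}");
    }
    Ok(())
}
```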

Query APIs

  • Queries must treat live memory as cache/fanout state only. Correctness comes from accepted local spool/landing data and object storage.
  • If object storage is not configured, or a configured backend cannot list accepted manifests, successful metric/log/alert query responses include explicit warnings and are limited to local accepted or landed spool data.
  • Closed time-bounded local-only query_range, logs, and alerts requests add an explicit warning when the requested range extends outside the loaded local sample, log, or alert horizon, or when no local horizon can be established.
  • Object-store read/list failures return an explicit object_store_unavailable query error instead of silently falling back to local data.
  • For closed time-bounded query_range, logs, and alerts requests, object-store loading can use accepted-by-time derived indexes to bound accepted-manifest discovery before reading authoritative accepted manifests and raw blobs. The index is keyed by acceptance time, not per-sample/log/alert event time; requests without a closed time range, empty index scans, index read failures, and very broad index ranges fall back to accepted-manifest listing. Indexed responses include explicit partial warnings because individual derived index entries can be missing until reconciliation repairs them.
  • Prometheus compatibility is intentionally narrow. Supported forms include direct selectors, exact/negative/regex label matchers, rate() over range selectors, sum(...) by (...), histogram bucket virtual series, and histogram_quantile() in the supported shape.
  • Label values and series endpoints honor repeated Prometheus match[] filters for supported series selectors, including metric-name selectors and label-only selectors with exact, negative, regex, and negative-regex matchers. Malformed or unsupported match[] values return Prometheus-style bad_data.
  • Instant query?time=... evaluates supported direct selectors and rate-derived functions at the requested finite timestamp with the fixed five-minute lookback for direct selectors. Without time, instant vector queries retain the existing latest-sample behavior.
  • Unsupported PromQL syntax returns Prometheus-style bad_data responses. Binary operators, joins, subqueries, offset, logical/set operators, arbitrary nested functions, recording-rule semantics, and broad Prometheus staleness behavior are not implemented.
  • query_range evaluates samples at start + n * step timestamps with a fixed five-minute lookback (see the evaluation sketch after this list). Configured object-store-backed long range queries may use immutable gauge/counter rollup chunks and compatible histogram virtual-series rollup chunks when a rollup window fits the query step and lookback. Exact object-store coverage metadata is not implemented.
  • Logs and alerts APIs are bounded JSON APIs. They support documented filters, ordering, limits, cursors, and warning metadata over accepted data. GET /api/v1/logs?contains=... decodes the query parameter using standard form query rules, including + as space and percent-decoded UTF-8, then applies a case-sensitive substring match to the log message field.
  • GET /api/v1/logs/tail is a private SSE endpoint that emits bounded log_tail_snapshot, log_tail_update, and log_tail_gap events loaded from the same accepted-log visibility source as GET /api/v1/logs. It supports the existing log time/agent/boot/severity/target filters, a bounded limit, and an opaque after cursor for reconnect or lag repair when the request is scoped to the matching agent_id and boot_id. It is not a durable event journal or unbounded log stream.
  • Full-text log search, regex matching, case-insensitive matching, indexed text search, and contains filtering on GET /api/v1/logs/tail are not implemented.
  • The server does not evaluate alert rules, send notifications, silence alerts, or expose an Alertmanager-compatible API.
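
The fixed-lookback query_range evaluation can be sketched as follows: one output point per start + n * step timestamp, taking the most recent sample within a five-minute lookback window. The flat (timestamp, value) sample slice is an assumption made for brevity.

```rust
const LOOKBACK_MS: i64 = 5 * 60 * 1000; // fixed five-minute lookback

/// `samples` must be sorted by timestamp (milliseconds); `step_ms` must be positive.
fn evaluate_range(samples: &[(i64, f64)], start_ms: i64, end_ms: i64, step_ms: i64) -> Vec<(i64, f64)> {
    assert!(step_ms > 0, "step must be positive");
    let mut points = Vec::new();
    let mut t = start_ms;
    while t <= end_ms {
        // Most recent sample at or before t that is still inside the lookback window.
        let hit = samples
            .iter()
            .rev()
            .find(|(ts, _)| *ts <= t)
            .filter(|(ts, _)| t - *ts <= LOOKBACK_MS);
        if let Some(&(_, value)) = hit {
            points.push((t, value));
        }
        t += step_ms;
    }
    points
}
```

Steps within the lookback of a stale sample repeat that sample's value; once the lookback expires, the step produces no point, which is why range responses can carry horizon warnings rather than fabricated data.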

Live Updates

  • GET /api/v1/events is a private SSE stream of accepted-batch notices, not a correctness source. Clients must repair missed ranges by querying persisted data.
  • Live update payloads are bounded summaries. They are not an unbounded metrics stream, unbounded log tail, or durable event journal.
  • The embedded console detects accepted-batch sequence gaps per agent/boot and backfills the focused metric trend with query_range over the affected timestamp interval. Broader multi-series dashboard backfill is not implemented yet, and the embedded console does not consume GET /api/v1/logs/tail yet.
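
The per-agent/boot gap detection and backfill trigger can be sketched as below, written in Rust for consistency with the other sketches even though the embedded console runs in the browser; the field names are illustrative, not the SSE payload schema.

```rust
use std::collections::HashMap;

struct BatchNotice {
    agent_id: String,
    boot_id: String,
    seq_start: u64,
    seq_end: u64,
    time_end_ms: i64,
}

/// Returns Some((from_ms, to_ms)) when a sequence gap is detected, i.e. the
/// timestamp interval to repair with a query_range backfill of persisted data.
fn observe(
    state: &mut HashMap<(String, String), (u64, i64)>, // (agent, boot) -> (last seq_end, last time_end_ms)
    notice: &BatchNotice,
) -> Option<(i64, i64)> {
    let key = (notice.agent_id.clone(), notice.boot_id.clone());
    let gap = match state.get(&key) {
        Some(&(prev_seq_end, prev_time_end_ms)) if notice.seq_start > prev_seq_end + 1 => {
            Some((prev_time_end_ms, notice.time_end_ms))
        }
        _ => None,
    };
    state.insert(key, (notice.seq_end, notice.time_end_ms));
    gap
}
```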

Rollups

  • Configured rollup windows write immutable per-accepted-batch metric rollup chunks and rollup manifests during object landing, and accepted-manifest reconciliation backfills missing rollup objects. Gauge chunks include count/min/max/sum/last summaries; counter chunks include first/last and reset-aware delta/rate fields when enough same-start samples are present (see the delta sketch after this list); and histogram chunks merge compatible bucket/count/sum deltas.
  • Query planning can choose gauge/counter rollups and compatible histogram virtual-series rollups for long object-store-backed query_range requests, and falls back to raw accepted data when a selected rollup read returns no chunks or fails validation. Instant queries, labels, series, logs, and alerts do not use rollups.
  • Rollup-backed histogram _bucket, _sum, and _count range results expose synthetic cumulative window-boundary points derived from merged window deltas so rate() and histogram_quantile() can compute bounded approximations.
  • Rollup objects are not a compact cross-batch query index yet. Exact coverage metadata and rollup retention enforcement remain planned work.
  • Cumulative histogram rollup conversion is limited to samples available inside the same accepted batch/window. Cross-batch baseline lookup remains planned.
  • Histogram quantiles are computed from available bucket data or compatible histogram rollup bucket approximations; rollup quantile accuracy is bounded by bucket width and rollup aggregation rules.
  • Multi-writer rollup ownership is not implemented. The active-active design assumes single-owner or explicitly coordinated rollup generation.
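
The reset-aware counter delta mentioned for counter rollup chunks can be sketched as follows over the cumulative samples available in one window. The sample layout is illustrative, and the real chunks also carry first/last values and rate fields.

```rust
/// Reset-aware delta over one window's cumulative counter samples: decreases
/// are treated as counter resets, and the post-reset value is counted from zero.
fn reset_aware_delta(window_samples: &[f64]) -> Option<f64> {
    if window_samples.len() < 2 {
        return None; // not enough same-start samples to compute a delta
    }
    let mut delta = 0.0;
    let mut prev = window_samples[0];
    for &value in &window_samples[1..] {
        if value >= prev {
            delta += value - prev;
        } else {
            // Counter reset: count the post-reset value as the increase since the reset.
            delta += value;
        }
        prev = value;
    }
    Some(delta)
}
```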

Private UI And Dashboards

  • The embedded console uses real API/SSE boundaries, and its topbar, dashboard tabs, tab overflow, and dashboard settings metadata follow the Claude Design v2 shell. Fleet, Live metrics, Logs and alerts, Ingest/storage, and Metrics query now expose the UI-10 v2 data-surface controls over existing APIs, but they remain bounded operational views rather than a complete observability product. The UI-05 parity review intentionally keeps dashboard definitions as normal top-level tabs rather than a standalone Dashboards product section, and rejects visible prototype-only wording or data.
  • Built-in dashboard definitions are typed TOML files loaded by the collector, and configured custom dashboard TOML files or directories can add dashboards or explicitly replace/hide built-ins. GET /api/v1/dashboards is a private query-token API that exposes the canonical dashboard definition list. Duplicate custom dashboard IDs and implicit built-in overrides are rejected at config check/startup. The console loads dashboard tabs/settings from that API after query-token connection, and configured Prometheus dashboard panels issue bounded /api/v1/query_range requests with loading, empty, partial-warning, stale, and error states. Dashboard panel rendering and querying are capped per dashboard in the browser to bound private query load. The deeper Fleet, Live metrics, Logs, Alerts, Ingest/storage, and Query utility views are not fully definition-backed. The Object horizon surface additionally fetches GET /api/v1/object-store/reconciliation and maps the returned object-store reconciliation state into disabled, partial, or failed UI, but exact object-store horizon metadata remains unavailable.
  • The embedded console keeps the query bearer token in browser session storage by default. The Remember control is the only path that persists the token in localStorage, and Disconnect clears both browser storage scopes. There is no server-side session management for the private console.
  • Fleet overview triage is derived from accepted-batch SSE summaries and the metric/log/alert values those summaries carry. Resource health is therefore limited to metrics present in live summaries, and missing CPU, memory, disk, network, process, spool, duplicate, conflict, retry, and S3-lag fields are shown as unavailable rather than fabricated. A bounded collector detail/history API remains planned.
  • Several console surfaces still depend on missing backend contracts: host detail remains shallow, collector spool maxima, duplicate/conflict/retry history, object-store lag, rollup owner, and exact object-store horizon metadata are shown as unavailable, and dashboard action paths remain future work. Logs/alerts filters, ingest/storage health panels, query/debug helpers, source-tier states, and 1m/5m rollup comparison now exist over the current bounded APIs. The Live metrics view has focused realtime-plus-query chart reconciliation, but it is not yet a multi-series dashboard surface. The UI-05 screenshot audit covers desktop and mobile utility surfaces plus configured dashboard tabs, but it does not make missing backend contracts available.
  • UI tests may use mock data only through mocked HTTP/SSE boundaries with the same shape as production contracts. Production UI must not hard-code fake telemetry.
  • Arrow/Perspective table endpoints are post-MVP and should remain feature-gated if added.

Container Deployment

  • The repository does not ship official Lightmetrics container images, Dockerfiles, production Compose files, or image publishing automation. docs/docker.md is an operator-owned container recipe, not a supported packaged artifact.
  • Containerized collectors still require persistent mounted state for the spool, cache, and optional filesystem object-store bucket. Removing those mounts can remove accepted data or query-visible object-store data.
  • Containerized agents still require persistent mounted queue state. Removing the queue can drop not-yet-acknowledged uploads, sequence state, and log-tail offsets.
  • An agent running inside a container observes the container’s process, filesystem, network, and cgroup view unless host paths or namespaces are mounted intentionally. Host log collection requires explicit read-only mounts and container-path log_inputs.
  • The private query/UI listener must remain private even when its container port binds to 0.0.0.0 inside the container. Restrict the host-side published port or limit exposure to a private network.
  • S3-compatible object-store containers still use the same runtime contract as host installs: credentials come from AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optional AWS_SESSION_TOKEN; config points at the endpoint, bucket, prefix, and region.

Public Website

  • site/ is a static Astro scaffold for the public website. It uses npm, builds static output to site/dist, and does not use a Cloudflare adapter, SSR, runtime API calls, bearer tokens, or private agent procedure content.
  • The site prebuild step reads only selected public repository inputs: README.md, docs/README.md, docs/quickstart.md, docs/install.md, docs/docker.md, docs/limitations.md, docs/design.md, docs/active-active-s3-ingest.md, docs/design-artifacts.md, docs/claude-design-public-site-task.md, docs/PLAN.md, and docs/gantt-data.json. It writes ignored build metadata and generated Markdown route inputs under site/src/generated/; that generated data is not an authoritative replacement for the source files.
  • Public docs routes render only the allowlisted docs with status labels, source links, generated heading navigation, and explicit partial/planned/unsupported callouts. Incomplete user, admin, configuration, API, development, troubleshooting, and release/upgrade guides are listed as planned rather than published as complete. AGENTS.md, REVIEW.md, docs/worklog/, and the local docs/gantt-data.js wrapper are excluded from public docs routes.
  • The current public pages are a maintainable WEB-02 scaffold with WEB-03 landing page content, WEB-04 public docs publishing, and WEB-05 public roadmap/Gantt integration. The roadmap page consumes the canonical docs/gantt-data.json payload through the site prebuild metadata step, renders filters, task details, and a static Gantt timeline, and excludes the local file:// docs/gantt-data.js wrapper from the public site boundary. Cloudflare Pages preview configuration is codified through the site/ package’s build settings, wrangler.toml, _headers, _redirects, and deploy contract check. Actual Cloudflare project/account operations are operator-dependent, and DNS/custom-domain launch remains gated under WEB-07.

Grafana

  • Grafana support means stock Grafana can query Lightmetrics through the Prometheus-compatible API. There is no custom Grafana data source plugin.
  • The current smoke target verifies datasource health, label, instant, range, rate(), grouped sum(rate(...)), and histogram_quantile() paths against the local demo. The rate() and histogram dashboard panels use smoke-only metric data accepted through the real ingest/query path; broader PromQL parity remains out of scope.
  • Dashboard JSON export is low priority and export-only when implemented. Grafana JSON import into Lightmetrics is unsupported.
  • Logs and alerts are not Grafana-native in v1. A Loki-compatible surface would be a later feature if Grafana log browsing becomes important.

Integration And CI

  • MinIO and Grafana smoke tests are available locally, but full integration gating in CI is still planned.
  • Integration tests depend on local service availability and isolated demo directories; they are not a replacement for production deployment validation.

Documentation Rule

When a change adds a new runtime boundary, unsupported behavior, partial response mode, durability tradeoff, or planned-but-not-implemented feature, update this file in the same slice. If the limitation is tied to a tracked backlog item, also update docs/PLAN.md. If the limitation is discovered or changed by a completed slice, record the completed-slice evidence in the matching docs/worklog/<TASK-ID>.jsonl record.