Deprecated — Consolidated into OBS-0001 on 2026-05-02 per ADR-0047. This source file is retained as a reference; the canonical content is in OBS-0001.

ADR-0038 — IT Observability Data Plane

| Status | Accepted |
| Date | 2026-04-22 |
| Author | Ben Peries |
| Phases | 2, 3 |
| ADO WI | WI-257 |
| Related | New Relic (external hosted service, secondary telemetry backend) |

Context

Governance gap

The Archon Platform runs its observability components as ambient tribal knowledge rather than a governed architecture. An audit of the ADR corpus on 2026-04-22 confirmed no ADR formally adopts Prometheus, Grafana, Loki, Alertmanager, or Uptime Kuma for IT-side monitoring. ADR-0016 pre-declares a monitoring namespace containing Prometheus, but that is a namespace design decision, not a stack adoption decision.

Uptime Kuma was adopted without an ADR and has been the de facto IT infrastructure monitor.

Triage finding: Uptime Kuma is a status-page tool, not a monitoring platform

On 2026-04-22 an investigation triggered by eight red monitors on Uptime Kuma v2.2.1 surfaced two independent failures:

  1. Host-level DNS breakage took out 80% of monitors. Tailscale overwrote /etc/resolv.conf on caneast-site1-node2 with a broken MagicDNS resolver. Docker's embedded DNS forwards to the host, so every Uptime Kuma monitor using a hostname returned ENOTFOUND. The only monitors that survived were those using direct IPs or the dns monitor type (which queries 1.1.1.1 directly, bypassing the system resolver). Uptime Kuma has no abstraction over the host network stack — a single OS-level DNS break silently degraded the entire platform's observability.

  2. A known Uptime Kuma v2 bug in the conditions SQLite column for monitors created via the Python API affects MQTT and related monitor types. Fixable, but emblematic: monitor configuration was drifting outside the platform's intended API surface.

Neither failure indicates Uptime Kuma itself is broken. Both indicate it is being asked to do work outside its design envelope. Uptime Kuma is a lightweight status-page tool: it answers "is the thing responding" for a small number of endpoints, ideally public ones. It is not an observability platform — it has no metric TSDB, no query language, no alert routing, no structured log ingestion, no integration surface for AI operations.

The Archon arc requires a real observability substrate

The Archon Platform's stated purpose is to demonstrate the transition from traditional IT operations to agentic AI-driven operations. Agents — whether advisory, diagnostic, or (eventually) autonomous — operate against machine-readable, queryable, structured telemetry. That substrate does not exist today. ADR-0039 establishes the first AI operations primitives on top of it; this ADR establishes the substrate itself.

The observability data plane is therefore not a retrofit of a broken Uptime Kuma setup. It is the deliberate foundation for everything that follows.

Decision

Four-component IT observability data plane

Adopt the following stack, deployed to caneast-site1-node4 in the archon-monitoring k3s namespace (per ADR-0016):

| Component | Role | Data shape |
| Prometheus | IT infrastructure metrics TSDB | Host metrics, container state, service up/down, synthetic checks |
| Loki | Log aggregation | Structured and unstructured logs across the platform |
| Alertmanager | Alert routing and grouping | Notifications to Telegram, ADO, cmms_alarm_bridge |
| Grafana | Unified visualization | Dashboards over Prometheus, Loki, and existing InfluxDB |

Supporting exporters and agents:

  • node_exporter on all IT nodes (caneast-site1-node1-5, CanEast AI Node)
  • cAdvisor on all container hosts
  • blackbox_exporter for synthetic HTTP/TCP/ICMP/DNS checks
  • mqtt_exporter scraping caneast-site1-mqtt1 for broker and message-rate metrics
  • Netdata as a per-host secondary observability agent with built-in ML anomaly detection

Grafana migration is in scope

The existing Grafana instance on caneast-site1-node2:[REDACTED] will be migrated to caneast-site1-node4, preserving:

  • All dashboard UIDs (for existing links and documentation references)
  • The dfhk1c9eh80zkc REDACTED-influxdb datasource UID (per ADR-OT-0006)
  • All alert rules and contact points
  • The cmms_alarm_bridge webhook integration path that feeds ALM-REDACTED through ALM-REDACTED into Atlas CMMS (verified end-to-end; ADR-OT-0009)

Data-plane separation is explicit

  • caneast-site1-node2 retains: InfluxDB for OT telemetry (per ADR-OT-0005, OT-0010), user-facing Docker services (Homepage, Atlas CMMS, NRP, etc.)
  • caneast-site1-node4 gains: Prometheus, Loki, Alertmanager, Grafana, k3s control plane (per ADR-0034)

InfluxDB and Prometheus are not redundant. They store different data with different shapes, retention profiles, and query patterns:

  • InfluxDB: OT telemetry (process data, sensor readings, alarm setpoints, calibration-sensitive values). High-resolution, long retention, domain-specific queries. Governed by ADR-OT-0005 and OT-0010.
  • Prometheus: IT infrastructure metrics (host health, container state, service reachability). Short retention, label-based queries, standard exporter ecosystem. Governed by this ADR.

Uptime Kuma scope narrows, not removed

Uptime Kuma retains a legitimate role: external status-page for public-facing endpoints. This includes internet reachability canaries, public site uptime (docs.peries.ca, peries.ca, future archon.peries.ca), and eventually a public status.peries.ca page.

Uptime Kuma is removed from:

  • Host-level infrastructure monitoring (moves to Prometheus + node_exporter)
  • Container state monitoring (moves to cAdvisor → Prometheus)
  • Internal service reachability (moves to blackbox_exporter)
  • MQTT broker health (moves to mqtt_exporter)
  • OT sensor liveness (moves to Prometheus absent-message rules)

MQTT monitoring moves to a proper primitive

The failing Uptime Kuma MQTT monitor (#18, topic caneast/ot-zone/snr01/status, targeting the live sump pit sensor) is deprecated. It is replaced by mqtt_exporter, which reads the caneast/ot-zone/snr01/status topic from caneast-site1-mqtt1 and exposes it to Prometheus, where absent-metric alerting rules detect a silent sensor.
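A minimal sketch of what such an absent-metric rule could look like, written as a Python dictionary that mirrors the Prometheus rule-file layout (groups, rules, alert, expr, for, labels, annotations). The metric name mqtt_snr01_status, the group and alert names, the 5m hold period, and the severity label are all assumptions for illustration; the real metric name depends on how mqtt_exporter is configured.

```python
import json

# Hypothetical metric name; the real one depends on the mqtt_exporter configuration.
METRIC = "mqtt_snr01_status"


def absent_rule(metric: str, hold: str = "5m") -> dict:
    """Build a Prometheus rule group that fires when `metric` has no series."""
    return {
        "groups": [
            {
                "name": "ot-sensor-liveness",  # assumed group name
                "rules": [
                    {
                        "alert": "Snr01StatusAbsent",  # assumed alert name
                        "expr": f"absent({metric})",
                        "for": hold,  # condition must hold this long before firing
                        "labels": {"severity": "warning"},  # assumed label
                        "annotations": {
                            "summary": f"{metric} has been absent for {hold}",
                        },
                    }
                ],
            }
        ]
    }


if __name__ == "__main__":
    # Printed as JSON for inspection; the rule file itself would be written in YAML.
    print(json.dumps(absent_rule(METRIC), indent=2))
```

One caveat worth checking during implementation: absent() only returns a result when no series with that name exists. If the exporter keeps exposing the last received value after the sensor goes quiet, a rule based on the age of the last message would be needed instead.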

Dual-backend strategy: self-hosted primary, New Relic secondary

Observability telemetry is dual-shipped. The self-hosted stack on caneast-site1-node4 (Prometheus, Loki, Alertmanager, Grafana) is the primary, authoritative backend. A parallel copy of the same telemetry is shipped to New Relic (free tier, 100 GB/month ingest ceiling) as a hosted secondary backend.

Data paths:

  • Prometheus remote_write → New Relic metrics endpoint
  • Vector or Fluent Bit log forwarder splits logs: Loki (primary) and New Relic Logs (secondary)
  • New Relic Infrastructure Agent runs alongside node_exporter on each IT node for agent-native APM and host metrics
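The dual-sink log path can be sketched in a few lines. This is an illustration of the fan-out policy only, not Vector or Fluent Bit configuration: the DualShipper class, the callable sinks, and the choice to swallow secondary errors are assumptions made for the example, based on the ADR's statement that the self-hosted backend is authoritative.

```python
from typing import Callable, List

Sink = Callable[[str], None]


class DualShipper:
    """Send each log line to a primary sink, then best-effort to a secondary."""

    def __init__(self, primary: Sink, secondary: Sink) -> None:
        self.primary = primary
        self.secondary = secondary
        self.secondary_failures = 0

    def ship(self, line: str) -> None:
        # Primary (Loki in the ADR) is authoritative: let its errors propagate.
        self.primary(line)
        # Secondary (New Relic in the ADR) must never fail the primary path.
        try:
            self.secondary(line)
        except Exception:
            self.secondary_failures += 1


if __name__ == "__main__":
    loki: List[str] = []

    def failing_secondary(line: str) -> None:
        raise ConnectionError("secondary backend unreachable")

    shipper = DualShipper(loki.append, failing_secondary)
    shipper.ship('level=info msg="node_exporter up"')
    print(len(loki), shipper.secondary_failures)  # prints "1 1"
```

Counting secondary failures rather than ignoring them keeps divergence between the two backends visible, which is the debugging concern the risks section raises.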

UI roles:

  • Grafana on caneast-site1-node4 remains the primary operator UI and dashboard authority
  • New Relic UI is available for cross-validation, evaluation, NRQL-based ad-hoc queries, and fallback visibility when on-premise observability is degraded or unavailable

Backend choice rationale: New Relic was selected over Datadog on free-tier fit. The platform fleet (7 hosts: caneast-site1-node1-5, caneast-site1-mqtt1, CanEast AI Node) exceeds Datadog's 5-host free limit, making Datadog uneconomical without paid uplift. New Relic's free tier counts data ingest (100 GB/month) rather than host count; fleet telemetry volume is projected at 30-60 GB/month, fitting within the ceiling with headroom.
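The free-tier comparison reduces to two checks, sketched below. The figures are the ones stated in this ADR (7 hosts, a 5-host limit, a 100 GB/month ceiling, a 30-60 GB/month projection); the function names are hypothetical.

```python
FLEET_HOSTS = 7                  # node1-5, mqtt1, AI Node (from the ADR)
DATADOG_FREE_HOST_LIMIT = 5      # as stated in the ADR
NEW_RELIC_FREE_INGEST_GB = 100   # as stated in the ADR
PROJECTED_INGEST_GB = (30, 60)   # projected monthly range from the ADR


def fits_host_limit(hosts: int, limit: int) -> bool:
    """A host-counted free tier fits only if the fleet is within the limit."""
    return hosts <= limit


def ingest_headroom_gb(projected_gb: float, ceiling_gb: float) -> float:
    """Remaining monthly ingest under an ingest-counted free tier."""
    return ceiling_gb - projected_gb


if __name__ == "__main__":
    print(fits_host_limit(FLEET_HOSTS, DATADOG_FREE_HOST_LIMIT))  # prints "False"
    low, high = PROJECTED_INGEST_GB
    # Worst case first, then best case.
    print(ingest_headroom_gb(high, NEW_RELIC_FREE_INGEST_GB))  # prints "40"
    print(ingest_headroom_gb(low, NEW_RELIC_FREE_INGEST_GB))   # prints "70"
```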

Consequences

  • Prometheus, Loki, Alertmanager, and Grafana are adopted as the authoritative IT observability stack; deployment is scoped to the archon-monitoring namespace on caneast-site1-node4
  • node_exporter, cAdvisor, blackbox_exporter, mqtt_exporter, and Netdata are adopted as standard platform agents; onboarding new IT nodes requires deploying node_exporter and cAdvisor as part of the node baseline
  • Grafana migrates from caneast-site1-node2:[REDACTED] to caneast-site1-node4; all dashboard UIDs, the InfluxDB datasource UID (dfhk1c9eh80zkc), alert rules, and the cmms_alarm_bridge webhook path are preserved
  • Uptime Kuma retains a narrowed scope: public-facing endpoint status page only; all internal infrastructure monitors are deprecated
  • Uptime Kuma MQTT monitor #18 (caneast/ot-zone/snr01/status) is deprecated; snr01 liveness monitoring is owned by Prometheus absent-metric rules via mqtt_exporter
  • The IT/OT data plane boundary is formally documented: InfluxDB on caneast-site1-node2 for OT process telemetry; Prometheus on caneast-site1-node4 for IT infrastructure metrics
  • ADR-0039 (AI operations agent plane) depends on this substrate; this ADR must be accepted before ADR-0039 implementation begins
  • New Relic (free tier) is introduced as a hosted secondary backend; all platform metrics and logs are dual-shipped to New Relic alongside the self-hosted primary

Negative / risks

  • Dual-shipping configuration complexity. Each telemetry source (Prometheus scrape target, log source, node agent) requires configuration for both backends. Divergence between the two views is a debugging concern requiring a troubleshooting runbook.
  • New Relic free-tier ceiling is a hard budget limit. 100 GB/month ingest is a non-negotiable boundary; overage triggers per-GB charges. Ingest monitoring and budget alerting must be configured during deployment. A circuit breaker that pauses secondary-ship when approaching 90% of monthly quota is required.
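The 90% circuit breaker described above can be sketched as a single threshold check. The IngestBreaker class and its method are hypothetical; how month-to-date ingest is obtained is left out and would depend on what New Relic exposes for usage reporting.

```python
from dataclasses import dataclass


@dataclass
class IngestBreaker:
    quota_gb: float = 100.0   # New Relic free-tier ceiling from the ADR
    trip_ratio: float = 0.90  # pause threshold from the ADR

    def should_ship(self, ingested_gb: float) -> bool:
        """Return False once month-to-date ingest reaches the trip threshold."""
        return ingested_gb < self.quota_gb * self.trip_ratio


if __name__ == "__main__":
    breaker = IngestBreaker()
    print(breaker.should_ship(45.0))  # prints "True"
    print(breaker.should_ship(91.0))  # prints "False"
```

The breaker gates only the secondary path; the self-hosted primary keeps receiving telemetry regardless of quota state.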

Rationale

Why dual-shipped, not pure self-hosted

Self-hosted observability has a structural failure mode: when the observability node itself is impaired, operators lose visibility into the platform precisely when they most need it. A hosted secondary backend provides an independent signal path that survives on-premise failures. New Relic's free tier (100 GB/month ingest) covers the projected fleet telemetry volume of 30-60 GB/month with headroom, so dual-shipping adds no licence cost.

The architecture also demonstrates familiarity with both self-hosted and cloud-native observability tooling — expected competencies at the director/CIO level. A pure self-hosted posture would signal depth at the cost of breadth; a pure cloud posture would signal breadth at the cost of sovereignty. Dual-shipped demonstrates both while keeping the sovereign data plane (ADR-0039's precondition) intact.

Out of Scope

  • Datadog evaluation is deferred. Fleet size (7 hosts) exceeds Datadog's free-tier host limit (5 hosts). Datadog reconsideration is a future decision contingent on either platform scope change or budget allocation for paid observability tooling.

Implementation Tracking

  1. Prometheus deployment to archon-monitoring namespace on caneast-site1-node4
  2. Loki deployment to archon-monitoring namespace on caneast-site1-node4
  3. Alertmanager deployment with Telegram notification path (ADR-0009)
  4. Grafana migration from caneast-site1-node2:[REDACTED] to caneast-site1-node4 (preserve all UIDs, datasources, alert rules, cmms_alarm_bridge webhook)
  5. node_exporter rollout to all IT nodes (caneast-site1-node1–5, CanEast AI Node) via AWX
  6. cAdvisor deployment on all container hosts (caneast-site1-node2, caneast-site1-node3)
  7. blackbox_exporter deployment for synthetic HTTP/TCP/ICMP/DNS checks
  8. mqtt_exporter deployment targeting caneast-site1-mqtt1
  9. Netdata deployment on all IT nodes as per-host secondary observability agent
  10. Prometheus absent-metric alerting rules for snr01 sensor liveness
  11. Uptime Kuma internal monitor deprecation (retain external/public scope only)
  12. End-to-end validation: ALM-REDACTED/002/003 through cmms_alarm_bridge to Atlas CMMS
  13. New Relic account provisioning and ingest license key in Infisical (archon-platform project)
  14. Prometheus remote_write configuration targeting New Relic metrics endpoint
  15. Log forwarder (Vector or Fluent Bit) deployment with dual-sink configuration: Loki primary, New Relic secondary
  16. New Relic Infrastructure Agent rollout to all IT nodes via AWX
  17. New Relic ingest budget monitoring and 90%-quota circuit breaker
  18. Dual-backend divergence troubleshooting runbook

Alternatives Considered

Keep Uptime Kuma as primary infrastructure monitor: rejected. The 2026-04-22 triage confirmed that Uptime Kuma lacks a metric TSDB, a query language, structured log ingestion, and any AI integration surface. Closing those gaps would amount to replacing it entirely.

VictoriaMetrics instead of Prometheus: deferred, not rejected. VictoriaMetrics is compatible with the Prometheus query API and remote_write ingestion, and handles high-cardinality workloads more efficiently at scale. It is not warranted at current fleet scale (five nodes); re-evaluate if label cardinality or ingestion rate becomes a constraint.

Grafana Cloud for Prometheus and Loki: rejected. The OT telemetry mandate (ADR-OT-0005) requires on-premises data storage; running a split observability plane (cloud + on-prem) adds operational complexity with no benefit at homelab scale. Full self-hosted is consistent with the Infisical and k3s posture.

Extend Uptime Kuma to cover all monitoring: rejected. The 2026-04-22 triage confirmed the design limit: every Uptime Kuma check depends on the host network stack, so a single DNS failure takes out every hostname-based monitor at once. A proper observability stack uses multiple independent signal paths (scrape, push, synthetic).

References

  • ADR-0016 — k3s namespace design (pre-declares archon-monitoring namespace)
  • ADR-0034 — k3s control plane migration to caneast-site1-node4
  • ADR-0009 — Telegram alerting (Alertmanager notification path)
  • ADR-OT-0005 — InfluxDB historian retention (OT data plane)
  • ADR-OT-0006 — ISA-18.2 alarm rationalization (Grafana datasource UID)
  • ADR-OT-0009 — Grafana dashboard taxonomy
  • ADR-OT-0010 — Historian retention architecture
  • ADR-0039 — AI operations agent plane (depends on this ADR; forward reference)
  • WI-257 — Epic E1: Platform Infrastructure & Ops

Addendum 2026-04-26: Helm approach and Grafana topology

Two implementation decisions were made before Sprint N WI creation. Both are recorded here rather than as separate ADRs because they refine ADR-0038 rather than overturn it.

Decision 1: kube-prometheus-stack Helm chart

The Prometheus substrate will be deployed using the prometheus-community kube-prometheus-stack Helm chart as a single bundled release. This brings up Prometheus, Alertmanager, Grafana, kube-state-metrics, and node-exporter together with the Prometheus Operator and the ServiceMonitor, PodMonitor, and PrometheusRule CRDs.

Rationale: the ServiceMonitor pattern is the dominant convention in the Kubernetes monitoring ecosystem. Choosing component-by-component deployment would isolate this platform from community Helm charts, k8sgpt integrations, Grafana community dashboards, and Loki ecosystem tooling, all of which assume the operator pattern. Component versions in the bundle are tested together.

The "tighter coupling" tradeoff is mitigated by per-component value overrides in the chart values file. Components can be disabled, replaced, or independently configured.

Loki is deployed separately via the official Loki Helm chart, not the kube-prometheus-stack bundle.

Decision 2: Two separate Grafana instances

The platform will run two independent Grafana instances:

  1. OT Grafana on caneast-site1-node2:3002. Stays in place, untouched. Datasource: InfluxDB. Continues to serve ALM-REDACTED/002/003 alert webhooks through cmms_alarm_bridge to Atlas CMMS. Owns OT and HMI dashboards. Production-stable.
  2. Platform Grafana in the archon-monitoring namespace, deployed as part of the kube-prometheus-stack bundle on caneast-site1-node4 k3s. Datasources: Prometheus and Loki. Owns Kubernetes, fleet health, and platform observability dashboards. Internal-only.

Rationale: the two domains have different audiences, change cadences, uptime requirements, and risk profiles. OT Grafana is production for ALM alerting; touching it for platform reasons creates unnecessary risk. Platform Grafana will iterate fast as the substrate matures, and that iteration must not affect OT dashboards. The separation also reflects the IT and OT convergence pattern this platform demonstrates: federation where appropriate, not unification.

Impact on ADR-0038 implementation plan

  • Item 4 (Grafana migration from caneast-site1-node2:[REDACTED] to caneast-site1-node4) is descoped from the active plan. The migration is not abandoned; it remains a future option if a high-availability requirement emerges or caneast-site1-node2 becomes a constraint.
  • Items 10, 11, 12 re-anchor against the existing OT Grafana on caneast-site1-node2 plus the new Platform Grafana, depending on the alert source. Specifics will be worked out during Sprint N+1 planning.
  • Sprint N+1 risk profile drops from High to Medium with the L-effort migration removed.

Future re-evaluation criteria

The two-Grafana topology should be revisited if any of the following becomes true:

  • High-availability requirement emerges for OT alerting (k3s gives HA primitives that a single host on caneast-site1-node2 does not).
  • caneast-site1-node2 is decommissioned or repurposed.
  • A unified single-pane-of-glass requirement comes from a stakeholder beyond the current operator (Ben).
  • OT and platform dashboard divergence creates duplication burden in practice.

Hardware constraint note

caneast-site1-node4 (OptiPlex 3070) is at its hardware RAM ceiling: 32 GB across two populated DIMM slots, with no upgrade path short of replacing the host. Projected peak substrate consumption is approximately 5.5 GB (Prometheus, Alertmanager, Grafana, kube-state-metrics, node-exporter, Loki, blackbox and mqtt exporters, log forwarder, New Relic agent). Headroom is adequate at current fleet size. Re-evaluate if the fleet grows beyond approximately 15 nodes or if Loki log volume materially increases.
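As a quick check of the headroom claim, the arithmetic can be written out. The figures are the ones given in this note; the helper name is hypothetical, and the calculation deliberately ignores other consumers on the host (operating system, k3s control plane), so real headroom is smaller than the number printed.

```python
from typing import Tuple


def ram_headroom(total_gb: float, projected_gb: float) -> Tuple[float, float]:
    """Return (free GB, utilisation as a fraction) for a projected workload."""
    return total_gb - projected_gb, projected_gb / total_gb


if __name__ == "__main__":
    free_gb, used = ram_headroom(32.0, 5.5)
    print(f"{free_gb:.1f} GB free, {used:.1%} used")  # prints "26.5 GB free, 17.2% used"
```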

Addendum -- 2026-04-27

Loki deployed as archon-loki (WI-328)

Implementation plan item 2 (Loki deployment) is complete.

Release: Helm release archon-loki installed in archon-monitoring namespace, chart grafana/loki, SingleBinary mode, filesystem backend, 50 Gi local-path PVC, 7-day retention (168 h), tsdb schema v13.

Datasource: Loki wired into Platform Grafana via kube-prometheus-stack additionalDataSources. Service URL: http://archon-loki:3100. Datasource confirmed present in Grafana Connections.

Log forwarder deferred: No Promtail, Alloy, Vector, or Fluent Bit deployed in this release. Loki is ready to receive logs; the forwarder decision is deferred to Sprint N+2 per the original implementation plan.

Operational notes:

  • kube-prometheus-stack 84.3.0 chart drops CAP_CHOWN in the Grafana init-chown-data init container. Fixed by grafana.initChownData.enabled: false. Root cause: the chart's busybox init container lost its runAsUser: 0 override; the default securityContext now prevents the chown call. PVC ownership was correct from the initial install so the step is safely skipped.
  • Grafana Deployment strategy set to Recreate to prevent RWO PVC deadlock during rolling updates.
  • Hardware headroom on caneast-site1-node4 remains adequate. Loki pod observed at steady state with minimal CPU, ~200 MB RSS.

Addendum — 2026-04-27: Flux Query Conventions for Grafana Panels

Originating incident: WI-346 — "Data is missing a time field" error on the MQTT Publish Rate panel (Panel 11) of ot-eng-snr01-diag. Root cause: group(columns: ["_time"]) |> sum() strips the _time column from result frames, leaving Grafana with a timeless table. A secondary anti-pattern (filtering by _measurement rather than the durable topic tag) was identified at the same time.

Mandatory rules for InfluxDB Flux queries on OT Grafana

a. aggregateWindow is required on all timeseries panels. Raw filter() |> count() or filter() |> sum() without aggregateWindow() emits a single scalar with no time axis. Use aggregateWindow(every: $__interval, fn: <fn>, createEmpty: true/false) to preserve time buckets.

b. Never call group(columns: ["_time"]). This transformation makes _time the group key, so every distinct timestamp becomes its own table. The subsequent aggregate reduces each table to a single row, and Grafana no longer finds a usable time field in the result. Use group() only to ungroup tables completely, and avoid it entirely unless a multi-table join explicitly requires it.

c. Use keep(columns: [...]) before yield(). Flux result frames inherit all tag columns from upstream. Keeping only the columns the panel needs (_time, _value, and any legend dimension) prevents multi-cell stat panel bugs and reduces frame size.

d. Filter by the topic tag in preference to _measurement. The _measurement name is an artifact of the Telegraf name_override setting and changes when Telegraf is reconfigured. The topic tag carries the canonical MQTT address and is durable.

e. Anchor all OT topic filters to the cae/ prefix. Use the regex form r["topic"] =~ /^cae\// to prevent accidental cross-platform matches if a non-cae broker message reaches the same bucket.

f. Filter _field == "value" explicitly. Omitting the field filter can return multiple frames (one per field) when Telegraf writes additional metadata fields alongside the primary value, causing multi-series artefacts in stat panels.

Canonical publish-rate query (reference implementation)

from(bucket: "homelab")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["host"] == "<telegraf-host-id>")
  |> filter(fn: (r) => r["topic"] =~ /^cae\/ot-zone\/snr01\//)
  |> filter(fn: (r) => r["_field"] == "value")
  |> aggregateWindow(every: 1m, fn: count, createEmpty: true)
  |> keep(columns: ["_time", "_value", "topic"])
  |> yield(name: "publish_rate")

Substitute <telegraf-host-id> with the Telegraf container's MAC-derived hostname (e.g., 2cbe476e81d9). Panel 11 on ot-eng-snr01-diag uses this pattern as of version 6.

AP-1: group(columns: ["_time"]) |> sum() — strips time axis

// BROKEN — Grafana reports "Data is missing a time field"
from(bucket: "homelab")
  |> range(start: -5m)
  |> filter(fn: (r) => r._measurement =~ /^snr01_/)
  |> group(columns: ["_time"])
  |> sum()

group(columns: ["_time"]) makes _time the group key, so every distinct timestamp becomes its own table. The subsequent sum() reduces each table to a single row, and Grafana no longer finds a usable time field in the result.

Fix: Use aggregateWindow() to bucket values while preserving _time:

from(bucket: "homelab")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["topic"] =~ /^cae\/ot-zone\/snr01\//)
  |> filter(fn: (r) => r["_field"] == "value")
  |> aggregateWindow(every: 1m, fn: count, createEmpty: true)
  |> keep(columns: ["_time", "_value", "topic"])
  |> yield(name: "publish_rate")

AP-2: Bare count() without windowing — emits a scalar

// BROKEN — returns a single integer, no time series
from(bucket: "homelab")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "snr01_flood")
  |> count()

count() collapses all rows per group into a single aggregate row. With the default group key (all tag columns) each measurement produces one row; the time column is set to the range start, not actual data points. The panel renders as a single flat line rather than a timeseries.

Fix:

from(bucket: "homelab")
  |> range(start: -1h)
  |> filter(fn: (r) => r["topic"] =~ /^cae\/ot-zone\/snr01\/flood/)
  |> filter(fn: (r) => r["_field"] == "value")
  |> aggregateWindow(every: 1m, fn: count, createEmpty: true)
  |> keep(columns: ["_time", "_value"])
  |> yield(name: "flood_rate")

Applicability

These rules apply to all InfluxDB Flux queries written for OT Grafana (caneast-site1-node2:[REDACTED]) panels. Stat panels displaying a single last() value are exempt from rules a–b but must still follow c–f.
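Rules a–f lend themselves to a mechanical pre-commit check. The sketch below is a hypothetical text-based linter, not an existing tool: it uses simple substring and regex matching on the query text, so it can miss violations or flag unusual but valid formatting, and it treats the stat-panel exemption as a flag supplied by the caller.

```python
import re
from typing import List

# Matches group(columns: ["_time"]) with flexible whitespace (rule b).
GROUP_BY_TIME = re.compile(r'group\s*\(\s*columns\s*:\s*\[\s*"_time"\s*\]')
# Matches r["_field"] == "value" or r._field == "value" (rule f).
FIELD_FILTER = re.compile(r'(r\["_field"\]|r\._field)\s*==\s*"value"')


def lint_flux(query: str, stat_panel: bool = False) -> List[str]:
    """Return the letters of the rules (a-f) that `query` appears to violate."""
    problems = []
    if not stat_panel:  # stat panels using last() are exempt from rules a-b
        if "aggregateWindow(" not in query:
            problems.append("a")
        if GROUP_BY_TIME.search(query):
            problems.append("b")
    if "keep(" not in query:
        problems.append("c")
    if 'r["topic"]' not in query and "r.topic" not in query:
        problems.append("d")
    if "/^cae\\/" not in query:  # looks for the anchored /^cae\/ regex prefix
        problems.append("e")
    if not FIELD_FILTER.search(query):
        problems.append("f")
    return problems


if __name__ == "__main__":
    broken = """from(bucket: "homelab")
  |> range(start: -5m)
  |> filter(fn: (r) => r._measurement =~ /^snr01_/)
  |> group(columns: ["_time"])
  |> sum()"""
    print(lint_flux(broken))  # prints "['a', 'b', 'c', 'd', 'e', 'f']"
```

When passing queries that contain the cae/ regex, use raw strings so the backslashes in the Flux pattern survive unchanged.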