Consolidated from ADR-0009, ADR-0038, and ADR-0045 on 2026-05-02 per ADR-0047. Source files retained with deprecation banners at
docs/adr/0009-telegram-over-whatsapp.md, docs/adr/0038-it-observability-data-plane.md, and docs/adr/0045-headlamp-kubernetes-dashboard.md.
OBS-0001 - IT Observability Stack: Data Plane, Alerting, and Kubernetes UI
| Field | Value |
|---|---|
| Status | Accepted |
| Date | 2026-05-01 (latest source) |
| Author | Ben Peries |
| Sources | ADR-0009 (Telegram alerting), ADR-0038 (IT observability data plane), ADR-0045 (Headlamp Kubernetes dashboard) |
Context
The Archon Platform required a governed IT observability architecture across three separate decisions that were developed together and belong on the same data plane:
- Alert delivery channel (ADR-0009): Push alerting requires a supported, cost-free channel with native integrations for Grafana, Uptime Kuma, Diun, and a future SIEM.
- IT observability data plane (ADR-0038): An audit on 2026-04-22 confirmed no ADR formally adopted Prometheus, Grafana, Loki, or Alertmanager for IT monitoring. Uptime Kuma had been the de facto monitor, but the 2026-04-22 triage confirmed it is a status-page tool, not a monitoring platform. A DNS breakage caused by Tailscale overwriting /etc/resolv.conf took out 80% of Uptime Kuma monitors simultaneously, exposing the structural flaw: no metric TSDB, no query language, no AI integration surface. The observability data plane is not a retrofit; it is the deliberate foundation for AI operations (see LLMOPS-0002).
- Kubernetes-native UI layer (ADR-0045): The k3s cluster lacked a graphical dashboard. CRD visibility (cert-manager Certificates, IngressRoutes, ExternalSecrets) required the kubectl CLI. A lightweight web UI was needed without significant resource overhead.
Decision
Alert delivery channel: Telegram (from ADR-0009)
Telegram via @caneast-alertbot using the official Bot API is the primary platform alerting channel.
- Free, no business account, no per-message cost
- Bot creation via @BotFather; works behind NAT
- WhatsApp requires Meta Business verification plus the paid Cloud API, or unofficial libraries that violate the ToS
- Native integrations in Grafana, Uptime Kuma, Diun, Wazuh
- Bot token stored in Infisical at caneast/prod/telegram/bot-token
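As a rough illustration of what a push through the Bot API involves, the sketch below builds (but does not send) a `sendMessage` request. It assumes the token has already been read from Infisical into a variable; the token, chat ID, and message values are placeholders, and `build_send_message_request` is a hypothetical helper name, not part of the platform code.

```python
import json
import urllib.request

TELEGRAM_API = "https://api.telegram.org"


def build_send_message_request(bot_token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build a Bot API sendMessage request without sending it."""
    url = f"{TELEGRAM_API}/bot{bot_token}/sendMessage"
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values only; the real token is stored in Infisical.
req = build_send_message_request("123456:EXAMPLE", "42", "node4: disk usage above 90%")
# urllib.request.urlopen(req, timeout=10) would deliver the message.
```

In practice Grafana, Alertmanager, and the other listed tools make this call through their own Telegram integrations; the sketch only shows the shape of the underlying request.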
Forward-looking: WhatsApp will be implemented as a family-facing alert channel for critical notifications (flood, power, security) once the audience extends beyond the platform operator. FreePBX/Asterisk SMS is planned as a carrier-level fallback.
Bot architecture: Alerting vs ChatOps (Supplement 2026-05-12)
Two Telegram bots serve distinct roles. These roles must not cross.
SentinelBot (@caneast-alertbot): ALERTING only
- Receives push notifications from Alertmanager via the Telegram receiver
- One-way: platform pushes to the operator. No command handling.
- Routing: Alertmanager sends to SentinelBot only. No other service sends to SentinelBot.

ArchonBot / archonagent: CHATOPS only
- Receives natural language queries from the operator
- Routes to OpenClaw on CanEast AI Node WSL at port 18789
- One-way: operator queries the platform. No alert delivery via this bot.
Routing rule: Alertmanager sends only to SentinelBot. archonagent queries only OpenClaw. No crossover between these roles.
Future bots: Any new Telegram bot introduced to the platform must declare its role (ALERTING or CHATOPS) in its ADR and must not perform functions belonging to the other role.
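The role rule above can be expressed as a small check, for example in a CI lint over bot configuration. This is a minimal sketch under stated assumptions: the registries `BOT_ROLES` and `SERVICE_ROLES` and the function `route_is_allowed` are hypothetical names introduced here for illustration, not existing platform code.

```python
from enum import Enum


class BotRole(Enum):
    ALERTING = "alerting"
    CHATOPS = "chatops"


# Declared role per bot, mirroring the two bots described above.
BOT_ROLES = {
    "SentinelBot": BotRole.ALERTING,
    "ArchonBot": BotRole.CHATOPS,
}

# Role each peer service is permitted to interact with (hypothetical registry).
SERVICE_ROLES = {
    "alertmanager": BotRole.ALERTING,
    "openclaw": BotRole.CHATOPS,
}


def route_is_allowed(service: str, bot: str) -> bool:
    """Return True only when the service and the bot share the same declared role."""
    service_role = SERVICE_ROLES.get(service)
    bot_role = BOT_ROLES.get(bot)
    # Unknown services or bots are rejected rather than allowed by default.
    return service_role is not None and service_role == bot_role


print(route_is_allowed("alertmanager", "SentinelBot"))
print(route_is_allowed("alertmanager", "ArchonBot"))
```

A new bot would add one entry to the bot registry with its declared role, and any route that crosses roles would then fail the check.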
IT observability data plane: Four-component stack (from ADR-0038)
Adopted stack, deployed to caneast-site1-node4 in the archon-monitoring k3s namespace:
| Component | Role | Data shape |
|---|---|---|
| Prometheus | IT infrastructure metrics TSDB | Host metrics, container state, service up/down, synthetic checks |
| Loki | Log aggregation | Structured and unstructured logs across the platform |
| Alertmanager | Alert routing and grouping | Notifications to Telegram (SentinelBot), ADO, cmms_alarm_bridge |
| Grafana | Metrics and logs UI layer | Dashboards over Prometheus, Loki, and existing InfluxDB |
Supporting exporters and agents:
- node_exporter on all IT nodes (caneast-site1-node1–5, CanEast AI Node)
- cAdvisor on all container hosts
- blackbox_exporter for synthetic HTTP/TCP/ICMP/DNS checks
- mqtt_exporter scraping caneast-site1-mqtt1 for broker and message-rate metrics
- Netdata as a per-host secondary observability agent with built-in ML anomaly detection
Helm deployment: kube-prometheus-stack
The Prometheus substrate is deployed using the prometheus-community/kube-prometheus-stack
Helm chart (pinned: 84.3.0) as a single bundled release — Prometheus, Alertmanager, Grafana,
kube-state-metrics, and node-exporter together with the Prometheus Operator and ServiceMonitor,
PodMonitor, and PrometheusRule CRDs.
The ServiceMonitor pattern is the dominant convention in the Kubernetes monitoring ecosystem. Loki is deployed separately via the official Loki Helm chart, not the kube-prometheus-stack bundle.
Two Grafana instances (intentional separation)
The platform runs two independent Grafana instances:
- OT Grafana on caneast-site1-node2:[REDACTED] (production-stable); datasource: InfluxDB; owns OT and HMI dashboards; serves ALM-REDACTED/002/003 alert webhooks through cmms_alarm_bridge to Atlas CMMS. Untouched by this ADR.
- Platform Grafana in the archon-monitoring namespace on caneast-site1-node4 k3s (via the kube-prometheus-stack bundle); datasources: Prometheus and Loki; owns Kubernetes, fleet health, and platform observability dashboards.
The two instances have different audiences, change cadences, uptime requirements, and risk profiles. OT Grafana is production for ALM alerting; platform Grafana iterates fast as the substrate matures. The separation reflects the IT/OT convergence pattern: federation where appropriate, not unification.
Data-plane separation
- caneast-site1-node2 retains: InfluxDB for OT telemetry (OT-0003); user-facing Docker services
- caneast-site1-node4 gains: Prometheus, Loki, Alertmanager, Platform Grafana, k3s control plane (PLAT-0003)
InfluxDB and Prometheus are not redundant. They store different data with different shapes, retention profiles, and query patterns:
- InfluxDB: OT process data, sensor readings, alarm setpoints, calibration-sensitive values. High-resolution, long retention, domain-specific Flux queries. Governed by OT-0003.
- Prometheus: IT infrastructure metrics (host health, container state, service reachability). Short retention, label-based queries, standard exporter ecosystem.
Uptime Kuma scope narrowed
Uptime Kuma retains a legitimate role: external status-page for public-facing endpoints (internet reachability canaries, public site uptime, eventually status.peries.ca).
Uptime Kuma is removed from: host-level infrastructure monitoring, container state monitoring,
internal service reachability, MQTT broker health, and OT sensor liveness. The failing Uptime
Kuma MQTT monitor (#18, topic caneast/ot-zone/snr01/status) is deprecated; replaced by mqtt_exporter
via Prometheus absent-metric alerting rules.
Dual-backend strategy: self-hosted primary, New Relic secondary
Observability telemetry is dual-shipped. The self-hosted stack on caneast-site1-node4 is the primary, authoritative backend. A parallel copy of the same telemetry is shipped to New Relic (free tier, 100 GB/month ingest ceiling) as a hosted secondary backend.
Data paths:
- Prometheus remote_write → New Relic metrics endpoint
- Vector or Fluent Bit log forwarder splits logs: Loki (primary) and New Relic Logs (secondary)
- New Relic Infrastructure Agent runs alongside node_exporter on each IT node
Free-tier ceiling: 100 GB/month is a non-negotiable boundary; overage triggers per-GB charges. A circuit breaker that pauses secondary-ship when approaching 90% of monthly quota is required. Platform fleet (7 hosts) projected at 30–60 GB/month.
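The circuit-breaker requirement reduces to a simple threshold comparison. The sketch below is one possible shape for it, not the platform's implementation: `QuotaBreaker` and its method names are hypothetical, the quota is assumed to be metered in decimal gigabytes, and how month-to-date ingest is obtained is left out.

```python
from dataclasses import dataclass

GB = 1_000_000_000  # assumption: quota is metered in decimal gigabytes


@dataclass(frozen=True)
class QuotaBreaker:
    """Pause secondary shipping once ingest reaches a share of the monthly quota."""

    monthly_quota_bytes: int = 100 * GB  # free-tier ceiling from the text
    trip_percent: int = 90               # pause threshold from the text

    def should_pause(self, ingested_bytes: int) -> bool:
        # Integer comparison avoids floating-point rounding at the boundary.
        return ingested_bytes * 100 >= self.trip_percent * self.monthly_quota_bytes

    def projected_month_end(self, ingested_bytes: int, day_of_month: int, days_in_month: int) -> float:
        # Naive linear projection from the month-to-date ingest.
        return ingested_bytes / day_of_month * days_in_month


breaker = QuotaBreaker()
print(breaker.should_pause(60 * GB))  # upper end of the projected fleet volume
print(breaker.should_pause(90 * GB))  # exactly at the 90% threshold
```

With the projected 30–60 GB/month the breaker would stay closed in normal operation; it exists for the abnormal month, such as a log storm, where ingest runs well ahead of projection.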
Why New Relic over Datadog: Datadog's free tier is limited to 5 hosts; the fleet (7 hosts) exceeds it. New Relic counts data ingest, not host count.
Why dual-shipped: When the observability node itself is impaired, a hosted secondary backend provides an independent signal path that survives on-premise failures. Dual-shipping also demonstrates both self-hosted and cloud-native observability competency — expected at director/CIO level.
Kubernetes-native UI layer: Headlamp (from ADR-0045)
Deploy Headlamp (chart headlamp/headlamp v0.41.0) into the archon-infra namespace
as the Kubernetes-native UI layer.
Three options were evaluated:
| Option | Decision |
|---|---|
| Kubernetes Dashboard (official CNCF) | Rejected — heavy, separate auth proxy, less CRD-aware |
| Portainer (caneast-site1-node2) | Rejected — Docker-centric; k3s support limited, not cluster-native |
| Headlamp | Selected — CRD-aware, plugin store, lightweight (~80 MB), active development |
Configuration:
- inCluster: true with context name archon
- ClusterRoleBinding to cluster-admin (read-only use expected; cluster is single-tenant lab)
- Plugin directory at /headlamp/plugins for Plugin Store access
- Exposed at https://headlamp-platform.peries.ca via Traefik IngressRoute + cert-manager TLS
- Resource allocation: 50m/64Mi requests, 200m/128Mi limits
Helm release: archon-headlamp
Namespace: archon-infra
Chart: headlamp/headlamp 0.41.0
Repo: https://kubernetes-sigs.github.io/headlamp/
URL: https://headlamp-platform.peries.ca
Certificate: headlamp-platform-tls (letsencrypt-prod, DNS-01)
Manifests: k8s/ingress/headlamp/{certificate,middleware-redirect-https,ingressroute}.yaml
Values: kubernetes/archon-infra/headlamp/values.yaml
Post-install: install cert-manager plugin from Headlamp UI → Plugin Store → search "cert-manager".
Accepted tradeoff: cluster-admin RBAC is broad; acceptable for single-tenant homelab;
revisit if multi-tenant. Headlamp has no built-in Prometheus integration — Grafana dashboards
for Headlamp resource usage require a separate ServiceMonitor (deferred).
Addendum (2026-04-27): Loki deployed as archon-loki
Implementation plan item 2 (Loki deployment to archon-monitoring) is complete.
- Release: Helm release archon-loki, chart grafana/loki, SingleBinary mode, filesystem backend, 50 Gi local-path PVC, 7-day retention (168 h), tsdb schema v13.
- Datasource: Loki wired into Platform Grafana via kube-prometheus-stack additionalDataSources. Service URL: http://archon-loki:3100. Datasource confirmed present in Grafana Connections.
- Log forwarder deferred: No Promtail, Alloy, Vector, or Fluent Bit deployed in this release. Loki is ready to receive logs; the forwarder decision is deferred (tracked in WI-387).
Operational notes:
- kube-prometheus-stack 84.3.0 drops CAP_CHOWN in the Grafana init-chown-data init
container. Fixed by grafana.initChownData.enabled: false.
- Grafana Deployment strategy set to Recreate to prevent RWO PVC deadlock during rolling updates.
Addendum (2026-04-27): Flux Query Conventions for OT Grafana Panels
Originating incident: WI-346 — "Data is missing a time field" error on MQTT Publish Rate
panel. Root cause: group(columns: ["_time"]) |> sum() strips the _time column.
Mandatory rules for InfluxDB Flux queries on OT Grafana
a. aggregateWindow is required on all timeseries panels. Raw filter() |> count() or
filter() |> sum() without aggregateWindow() emits a single scalar with no time axis.
Use aggregateWindow(every: $__interval, fn: <fn>, createEmpty: true/false).
b. Never call group(columns: ["_time"]). It regroups the stream so that _time becomes the group key;
the subsequent aggregate then collapses each timestamp into its own single-row table, and the panel no longer receives a usable time field.
c. Use keep(columns: [...]) before yield(). Prevents multi-cell stat panel bugs and
reduces frame size.
d. Filter by the topic tag in preference to _measurement. The topic tag carries the
canonical MQTT address and is durable; _measurement changes when Telegraf is reconfigured.
e. Anchor all OT topic filters to the cae/ prefix. Use r["topic"] =~ /^cae\//.
f. Filter _field == "value" explicitly. Omitting the field filter can return multiple
frames when Telegraf writes additional metadata fields.
Canonical publish-rate query
from(bucket: "homelab")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["host"] == "<telegraf-host-id>")
  |> filter(fn: (r) => r["topic"] =~ /^cae\/ot-zone\/snr01\//)
  |> filter(fn: (r) => r["_field"] == "value")
  |> aggregateWindow(every: 1m, fn: count, createEmpty: true)
  |> keep(columns: ["_time", "_value", "topic"])
  |> yield(name: "publish_rate")
Rules a–f apply to all InfluxDB Flux queries for OT Grafana panels. Stat panels displaying
a single last() value are exempt from rules a–b but must still follow c–f.
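Rules a–f lend themselves to a mechanical pre-commit check on panel queries. The sketch below is a simple text-based approximation, not a Flux parser: `lint_flux` is a hypothetical helper, it matches only the spellings used in this document (for example `r["topic"]` rather than `r.topic`), and it does not verify that keep() actually precedes yield().

```python
import re


def lint_flux(query: str, timeseries: bool = True) -> list[str]:
    """Return the letters of the rules (a-f) that the query appears to violate."""
    violations = []
    # Rules a and b apply to timeseries panels only; stat panels are exempt.
    if timeseries and "aggregateWindow(" not in query:
        violations.append("a")
    if timeseries and re.search(r'group\(\s*columns:\s*\[\s*"_time"\s*\]', query):
        violations.append("b")
    if "keep(columns:" not in query:
        violations.append("c")
    if 'r["topic"]' not in query:
        violations.append("d")
    elif not re.search(r'r\["topic"\]\s*=~\s*/\^cae\\/', query):
        # A topic filter exists but is not anchored to the cae/ prefix.
        violations.append("e")
    if not re.search(r'r\["_field"\]\s*==\s*"value"', query):
        violations.append("f")
    return violations


CANONICAL = r'''from(bucket: "homelab")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["host"] == "<telegraf-host-id>")
  |> filter(fn: (r) => r["topic"] =~ /^cae\/ot-zone\/snr01\//)
  |> filter(fn: (r) => r["_field"] == "value")
  |> aggregateWindow(every: 1m, fn: count, createEmpty: true)
  |> keep(columns: ["_time", "_value", "topic"])
  |> yield(name: "publish_rate")'''

print(lint_flux(CANONICAL))
```

Passing `timeseries=False` models the stat-panel exemption: rules a and b are skipped while c–f are still enforced.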
Consequences
- Prometheus, Loki, Alertmanager, and Platform Grafana (kube-prometheus-stack 84.3.0) are the authoritative IT observability stack on caneast-site1-node4, archon-monitoring namespace
- OT Grafana (caneast-site1-node2:[REDACTED], InfluxDB) is retained unchanged; cmms_alarm_bridge ALM webhooks are unaffected
- Headlamp (archon-infra namespace) is the Kubernetes-native UI layer at headlamp-platform.peries.ca
- node_exporter, cAdvisor, blackbox_exporter, mqtt_exporter, and Netdata are standard IT node agents
- Uptime Kuma retains public-facing endpoint scope only; all internal infrastructure monitors are deprecated
- Telegram SentinelBot = ALERTING only; ArchonBot/archonagent = CHATOPS only; roles must not cross
- New Relic free tier (100 GB/month) is the hosted secondary backend; 90% quota circuit breaker required
- Log forwarder (Promtail/Fluent Bit/Alloy) decision deferred to WI-387
- LLMOPS-0002 (AI operations agent plane) depends on this substrate
References
- PLAT-0003: k3s control plane migration to caneast-site1-node4
- IAM-0001: Infisical for secrets (Telegram bot token, New Relic ingest key)
- LLMOPS-0002: AI operations agent plane (depends on this ADR; forward reference)
- OT-0003: Historian retention (OT data plane)
- OT-0004: Alarm rationalization and CMMS integration
- OT-0005: Dashboard taxonomy
- ansible/roles/common/tasks/: node_exporter deployment
- kubernetes/archon-monitoring/: kube-prometheus-stack + archon-loki Helm values
- kubernetes/archon-infra/headlamp/: Headlamp Helm values
- WI-257: Epic E1 (Platform Infrastructure & Ops)
- WI-321: k8sgpt AI operations agent plane
- WI-387: Log collector ADR (Promtail vs Fluent Bit vs Alloy)