Consolidated from ADR-0009, ADR-0038, and ADR-0045 on 2026-05-02 per ADR-0047. Source files retained with deprecation banners at docs/adr/0009-telegram-over-whatsapp.md, docs/adr/0038-it-observability-data-plane.md, and docs/adr/0045-headlamp-kubernetes-dashboard.md.

OBS-0001 — IT Observability Stack: Data Plane, Alerting, and Kubernetes UI

| Field | Value |
|---|---|
| Status | Accepted |
| Date | 2026-05-01 (latest source) |
| Author | Ben Peries |
| Sources | ADR-0009 (Telegram alerting), ADR-0038 (IT observability data plane), ADR-0045 (Headlamp Kubernetes dashboard) |

Context

The Archon Platform required a governed IT observability architecture across three separate decisions that were developed together and belong on the same data plane:

  1. Alert delivery channel (ADR-0009): Push alerting requires a supported, cost-free channel with native integrations for Grafana, Uptime Kuma, Diun, and future SIEM.

  2. IT observability data plane (ADR-0038): An audit on 2026-04-22 confirmed no ADR formally adopted Prometheus, Grafana, Loki, or Alertmanager for IT monitoring. Uptime Kuma had been the de facto monitor — but the 2026-04-22 triage confirmed it is a status-page tool, not a monitoring platform. A DNS breakage caused by Tailscale overwriting /etc/resolv.conf took out 80% of Uptime Kuma monitors simultaneously, exposing the structural flaw: no metric TSDB, no query language, no AI integration surface. The observability data plane is not a retrofit — it is the deliberate foundation for AI operations (see LLMOPS-0002).

  3. Kubernetes-native UI layer (ADR-0045): The k3s cluster lacked a graphical dashboard. CRD visibility (cert-manager Certificates, IngressRoutes, ExternalSecrets) required kubectl CLI. A lightweight web UI was needed without significant resource overhead.

Decision

Alert delivery channel — Telegram (from ADR-0009)

Telegram via @caneast-alertbot using the official Bot API is the primary platform alerting channel.

  • Free, no business account, no per-message cost
  • Bot creation via @BotFather; works behind NAT
  • WhatsApp requires Meta Business verification + paid Cloud API or unofficial libraries that violate ToS
  • Native integrations in Grafana, Uptime Kuma, Diun, Wazuh
  • Bot token stored in Infisical at caneast/prod/telegram/bot-token
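
For illustration, the Bot API call underlying this channel can be sketched as below. This is a minimal sketch, not the platform's integration code: the token and chat ID are placeholders, the helper name build_send_message is hypothetical, and retrieving the token from Infisical is left out. The function only builds the request so it can be inspected without network access.

```python
import json
import urllib.request

TELEGRAM_API = "https://api.telegram.org"


def build_send_message(bot_token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a Bot API sendMessage request.

    bot_token and chat_id are supplied by the caller; in this sketch they
    are placeholders, not real platform values.
    """
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{TELEGRAM_API}/bot{bot_token}/sendMessage",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values for illustration only.
request = build_send_message("123456:EXAMPLE", "-1000000000", "node4: disk usage high")
```

Sending is a single outbound HTTPS call, for example urllib.request.urlopen(request, timeout=10), which is why the channel works behind NAT without any inbound port.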

Forward-looking: WhatsApp will be implemented as a family-facing alert channel for critical notifications (flood, power, security) once the audience extends beyond the platform operator. FreePBX/Asterisk SMS is planned as a carrier-level fallback.

Bot architecture — Alerting vs ChatOps (Supplement 2026-05-12)

Two Telegram bots serve distinct roles. These roles must not cross.

SentinelBot (@caneast-alertbot): ALERTING only

  • Receives push notifications from Alertmanager via the Telegram receiver
  • One-way: the platform pushes to the operator. No command handling.
  • Routing: Alertmanager sends to SentinelBot only. No other service sends to SentinelBot.

ArchonBot / archonagent: CHATOPS only

  • Receives natural language queries from the operator
  • Routes to OpenClaw on CanEast AI Node WSL at port 18789
  • One-way: the operator queries the platform. No alert delivery via this bot.

Routing rule: Alertmanager sends only to SentinelBot. archonagent queries only OpenClaw. No crossover between these roles.

Future bots: Any new Telegram bot introduced to the platform must declare its role (ALERTING or CHATOPS) in its ADR and must not perform functions belonging to the other role.
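
The role separation above lends itself to an automated check. The following is a sketch under assumptions: the BOTS table, the source labels ("alertmanager", "operator"), and the function name violations are all hypothetical and chosen for illustration. They do not describe an existing platform component.

```python
from enum import Enum


class Role(Enum):
    ALERTING = "alerting"
    CHATOPS = "chatops"


# Hypothetical declaration table: each bot's role, the traffic it may carry,
# and the only sources allowed to talk to it.
BOTS = {
    "SentinelBot": {"role": Role.ALERTING, "traffic": "alert", "sources": {"alertmanager"}},
    "ArchonBot": {"role": Role.CHATOPS, "traffic": "query", "sources": {"operator"}},
}


def violations(routes):
    """Return one message per (source, bot, traffic) route that breaks the rule."""
    problems = []
    for source, bot, traffic in routes:
        spec = BOTS.get(bot)
        if spec is None:
            # New bots must declare a role before they carry any traffic.
            problems.append(f"{bot}: no declared role")
        elif traffic != spec["traffic"]:
            problems.append(f"{source} -> {bot}: {traffic} not allowed for {spec['role'].name}")
        elif source not in spec["sources"]:
            problems.append(f"{source} -> {bot}: source not permitted")
    return problems


ok = [("alertmanager", "SentinelBot", "alert"), ("operator", "ArchonBot", "query")]
crossed = [("alertmanager", "ArchonBot", "alert"), ("diun", "SentinelBot", "alert")]
print(violations(ok))       # []
print(len(violations(crossed)))  # 2
```

A check of this shape could run in CI against whatever file records the declared routes, so a crossover is caught at review time instead of in production.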

IT observability data plane — Four-component stack (from ADR-0038)

Adopted stack, deployed to caneast-site1-node4 in the archon-monitoring k3s namespace:

| Component | Role | Data shape |
|---|---|---|
| Prometheus | IT infrastructure metrics TSDB | Host metrics, container state, service up/down, synthetic checks |
| Loki | Log aggregation | Structured and unstructured logs across the platform |
| Alertmanager | Alert routing and grouping | Notifications to Telegram (SentinelBot), ADO, cmms_alarm_bridge |
| Grafana | Metrics/logs UI layer | Dashboards over Prometheus, Loki, and existing InfluxDB |

Supporting exporters and agents:

  • node_exporter on all IT nodes (caneast-site1-node1–5, CanEast AI Node)
  • cAdvisor on all container hosts
  • blackbox_exporter for synthetic HTTP/TCP/ICMP/DNS checks
  • mqtt_exporter scraping caneast-site1-mqtt1 for broker and message-rate metrics
  • Netdata as a per-host secondary observability agent with built-in ML anomaly detection

Helm deployment: kube-prometheus-stack

The Prometheus substrate is deployed using the prometheus-community/kube-prometheus-stack Helm chart (pinned: 84.3.0) as a single bundled release — Prometheus, Alertmanager, Grafana, kube-state-metrics, and node-exporter together with the Prometheus Operator and ServiceMonitor, PodMonitor, and PrometheusRule CRDs.

The ServiceMonitor pattern is the dominant convention in the Kubernetes monitoring ecosystem. Loki is deployed separately via the official Loki Helm chart, not the kube-prometheus-stack bundle.

Two Grafana instances (intentional separation)

The platform runs two independent Grafana instances:

  1. OT Grafana on caneast-site1-node2:[REDACTED] (production-stable). Datasource: InfluxDB. Owns OT and HMI dashboards; serves ALM-REDACTED/002/003 alert webhooks through cmms_alarm_bridge to Atlas CMMS. Untouched by this ADR.
  2. Platform Grafana in the archon-monitoring namespace on caneast-site1-node4 k3s (via the kube-prometheus-stack bundle). Datasources: Prometheus and Loki. Owns Kubernetes, fleet health, and platform observability dashboards.

The two instances have different audiences, change cadences, uptime requirements, and risk profiles. OT Grafana is production for ALM alerting; platform Grafana iterates fast as the substrate matures. The separation reflects the IT/OT convergence pattern: federation where appropriate, not unification.

Data-plane separation

  • caneast-site1-node2 retains: InfluxDB for OT telemetry (OT-0003); user-facing Docker services
  • caneast-site1-node4 gains: Prometheus, Loki, Alertmanager, Platform Grafana, k3s control plane (PLAT-0003)

InfluxDB and Prometheus are not redundant. They store different data with different shapes, retention profiles, and query patterns:

  • InfluxDB: OT process data, sensor readings, alarm setpoints, calibration-sensitive values. High-resolution, long retention, domain-specific Flux queries. Governed by OT-0003.
  • Prometheus: IT infrastructure metrics (host health, container state, service reachability). Short retention, label-based queries, standard exporter ecosystem.

Uptime Kuma scope narrowed

Uptime Kuma retains a legitimate role: external status-page for public-facing endpoints (internet reachability canaries, public site uptime, eventually status.peries.ca).

Uptime Kuma is removed from: host-level infrastructure monitoring, container state monitoring, internal service reachability, MQTT broker health, and OT sensor liveness. The failing Uptime Kuma MQTT monitor (#18, topic caneast/ot-zone/snr01/status) is deprecated; replaced by mqtt_exporter via Prometheus absent-metric alerting rules.

Dual-backend strategy: self-hosted primary, New Relic secondary

Observability telemetry is dual-shipped. Self-hosted stack on caneast-site1-node4 is the primary, authoritative backend. A parallel copy of the same telemetry is shipped to New Relic (free tier, 100 GB/month ingest ceiling) as a hosted secondary backend.

Data paths:

  • Prometheus remote_write → New Relic metrics endpoint
  • Vector or Fluent Bit log forwarder splits logs: Loki (primary) and New Relic Logs (secondary)
  • New Relic Infrastructure Agent runs alongside node_exporter on each IT node

Free-tier ceiling: 100 GB/month is a non-negotiable boundary; overage triggers per-GB charges. A circuit breaker that pauses secondary-ship when approaching 90% of monthly quota is required. Platform fleet (7 hosts) projected at 30–60 GB/month.
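
The required circuit breaker can be sketched as a small latching check. This is an illustration only: the class name QuotaBreaker is hypothetical, and how the month-to-date ingest figure is obtained (for example from New Relic usage data) is an assumption left to the caller. Only the 100 GB ceiling and the 90% trip point come from this ADR.

```python
from dataclasses import dataclass

QUOTA_GB = 100.0       # free-tier monthly ingest ceiling (from this ADR)
TRIP_FRACTION = 0.90   # pause secondary shipping at 90% of quota (from this ADR)


@dataclass
class QuotaBreaker:
    quota_gb: float = QUOTA_GB
    trip_fraction: float = TRIP_FRACTION
    tripped: bool = False  # True means secondary shipping is paused

    def update(self, ingested_gb_this_month: float) -> bool:
        """Record a month-to-date reading; return True if shipping may continue."""
        if ingested_gb_this_month >= self.quota_gb * self.trip_fraction:
            self.tripped = True  # latch: stays paused until the monthly reset
        return not self.tripped

    def reset_for_new_month(self) -> None:
        """Re-enable secondary shipping when the billing month rolls over."""
        self.tripped = False


breaker = QuotaBreaker()
print(breaker.update(60.0))  # True: top of the projected 30-60 GB range
print(breaker.update(91.0))  # False: past the 90 GB trip point
```

The breaker latches on purpose. Month-to-date ingest only grows, so once the trip point is crossed the secondary path stays paused until the reset, while the self-hosted primary path is unaffected.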

Why New Relic over Datadog: Datadog's free tier is limited to 5 hosts; the fleet (7 hosts) exceeds it. New Relic counts data ingest, not host count.

Why dual-shipped: When the observability node itself is impaired, a hosted secondary backend provides an independent signal path that survives on-premise failures. Dual-shipping also demonstrates both self-hosted and cloud-native observability competency — expected at director/CIO level.

Kubernetes-native UI layer — Headlamp (from ADR-0045)

Deploy Headlamp (chart headlamp/headlamp v0.41.0) into the archon-infra namespace as the Kubernetes-native UI layer.

Three options were evaluated:

| Option | Decision |
|---|---|
| Kubernetes Dashboard (official CNCF) | Rejected: heavy, separate auth proxy, less CRD-aware |
| Portainer (caneast-site1-node2) | Rejected: Docker-centric; k3s support limited, not cluster-native |
| Headlamp | Selected: CRD-aware, plugin store, lightweight (~80 MB), active development |

Configuration:

  • inCluster: true with context name archon
  • ClusterRoleBinding to cluster-admin (read-only use expected; cluster is single-tenant lab)
  • Plugin directory at /headlamp/plugins for Plugin Store access
  • Exposed at https://headlamp-platform.peries.ca via Traefik IngressRoute + cert-manager TLS
  • Resource allocation: 50m/64Mi requests, 200m/128Mi limits

Helm release:  archon-headlamp
Namespace:     archon-infra
Chart:         headlamp/headlamp 0.41.0
Repo:          https://kubernetes-sigs.github.io/headlamp/
URL:           https://headlamp-platform.peries.ca
Certificate:   headlamp-platform-tls (letsencrypt-prod, DNS-01)
Manifests:     k8s/ingress/headlamp/{certificate,middleware-redirect-https,ingressroute}.yaml
Values:        kubernetes/archon-infra/headlamp/values.yaml

Post-install: install cert-manager plugin from Headlamp UI → Plugin Store → search "cert-manager".

Accepted tradeoff: cluster-admin RBAC is broad; acceptable for single-tenant homelab; revisit if multi-tenant. Headlamp has no built-in Prometheus integration — Grafana dashboards for Headlamp resource usage require a separate ServiceMonitor (deferred).

Addendum — 2026-04-27: Loki deployed as archon-loki

Implementation plan item 2 (Loki deployment to archon-monitoring) is complete.

  • Release: Helm release archon-loki, chart grafana/loki, SingleBinary mode, filesystem backend, 50 Gi local-path PVC, 7-day retention (168 h), tsdb schema v13.
  • Datasource: Loki wired into Platform Grafana via kube-prometheus-stack additionalDataSources. Service URL: http://archon-loki:3100. Datasource confirmed present in Grafana Connections.
  • Log forwarder deferred: No Promtail, Alloy, Vector, or Fluent Bit deployed in this release. Loki is ready to receive logs; the forwarder decision is deferred (tracked WI-387).

Operational notes:

  • kube-prometheus-stack 84.3.0 drops CAP_CHOWN in the Grafana init-chown-data init container. Fixed by grafana.initChownData.enabled: false.
  • Grafana Deployment strategy set to Recreate to prevent RWO PVC deadlock during rolling updates.

Addendum — 2026-04-27: Flux Query Conventions for OT Grafana Panels

Originating incident: WI-346 — "Data is missing a time field" error on MQTT Publish Rate panel. Root cause: group(columns: ["_time"]) |> sum() strips the _time column.

Mandatory rules for InfluxDB Flux queries on OT Grafana

a. aggregateWindow is required on all timeseries panels. Raw filter() |> count() or filter() |> sum() without aggregateWindow() emits a single scalar with no time axis. Use aggregateWindow(every: $__interval, fn: <fn>, createEmpty: true/false).

b. Never call group(columns: ["_time"]). Removes _time from every row's group key; the subsequent aggregate reduces each group to a row with no time column.

c. Use keep(columns: [...]) before yield(). Prevents multi-cell stat panel bugs and reduces frame size.

d. Filter by the topic tag in preference to _measurement. The topic tag carries the canonical MQTT address and is durable; _measurement changes when Telegraf is reconfigured.

e. Anchor all OT topic filters to the cae/ prefix. Use r["topic"] =~ /^cae\//.

f. Filter _field == "value" explicitly. Omitting the field filter can return multiple frames when Telegraf writes additional metadata fields.
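
To make rule a concrete, the sketch below models the difference between a bare count and a windowed count in plain Python. It is a simplified model, not Flux itself: windows are aligned to multiples of the interval and each row is stamped with its window's stop time, which is an approximation of aggregateWindow's default behaviour. The function names and sample timestamps are invented for illustration.

```python
def bare_count(timestamps):
    """Model of filter() |> count(): a single number with no time axis."""
    return len(timestamps)


def windowed_count(timestamps, start, stop, every, create_empty=True):
    """Model of aggregateWindow(every, fn: count): one (time, count) row per window."""
    rows = []
    window_start = start - (start % every)  # align windows to multiples of `every`
    while window_start < stop:
        window_stop = window_start + every
        # Clip the window to the queried range.
        lo, hi = max(window_start, start), min(window_stop, stop)
        n = sum(1 for t in timestamps if lo <= t < hi)
        if n or create_empty:
            rows.append((hi, n))  # row is stamped with the window's stop time
        window_start = window_stop
    return rows


# Message arrival times in seconds (made-up sample data).
arrivals = [5, 20, 70, 200]
print(bare_count(arrivals))                  # 4
print(windowed_count(arrivals, 0, 240, 60))  # [(60, 2), (120, 1), (180, 0), (240, 1)]
```

The bare count gives a panel nothing to plot against, while the windowed form yields one timestamped point per interval. With create_empty=False the empty window disappears from the output, which is the trade-off behind choosing createEmpty: true or false per panel.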

Canonical publish-rate query

from(bucket: "homelab")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["host"] == "<telegraf-host-id>")
  |> filter(fn: (r) => r["topic"] =~ /^cae\/ot-zone\/snr01\//)
  |> filter(fn: (r) => r["_field"] == "value")
  |> aggregateWindow(every: 1m, fn: count, createEmpty: true)
  |> keep(columns: ["_time", "_value", "topic"])
  |> yield(name: "publish_rate")

Rules a–f apply to all InfluxDB Flux queries for OT Grafana panels. Stat panels displaying a single last() value are exempt from rules a–b but must still follow c–f.
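
Rules a–f are mechanical enough to check with a text-level lint before a panel is saved. The following is a rough sketch and not an existing platform tool: lint_flux is a hypothetical helper that uses simple string and regex matching, so it can be fooled by comments or unusual formatting. It is intended as a first-pass guard and does not parse Flux.

```python
import re


def lint_flux(query: str, stat_panel_last: bool = False) -> list[str]:
    """Return the letters of rules a-f that `query` appears to violate."""
    q = re.sub(r"\s+", " ", query)  # collapse whitespace for simpler matching
    failed = []
    if not stat_panel_last:  # single last() stat panels are exempt from a and b
        if "aggregateWindow(" not in q:
            failed.append("a")
        if re.search(r'group\(\s*columns:\s*\[[^\]]*"_time"', q):
            failed.append("b")
    keep_pos, yield_pos = q.find("keep("), q.find("yield(")
    if keep_pos == -1 or (yield_pos != -1 and keep_pos > yield_pos):
        failed.append("c")
    if 'r["topic"]' not in q and "r.topic" not in q:
        failed.append("d")
    if not re.search(r"=~\s*/\^cae\\/", q):
        failed.append("e")
    if not re.search(r'r(\["_field"\]|\._field)\s*==\s*"value"', q):
        failed.append("f")
    return failed


CANONICAL = r'''
from(bucket: "homelab")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["host"] == "<telegraf-host-id>")
  |> filter(fn: (r) => r["topic"] =~ /^cae\/ot-zone\/snr01\//)
  |> filter(fn: (r) => r["_field"] == "value")
  |> aggregateWindow(every: 1m, fn: count, createEmpty: true)
  |> keep(columns: ["_time", "_value", "topic"])
  |> yield(name: "publish_rate")
'''

# A query shaped like the WI-346 failure (illustrative, not the original panel query).
BROKEN = r'''
from(bucket: "homelab")
  |> range(start: -1h)
  |> filter(fn: (r) => r["_measurement"] == "mqtt_consumer")
  |> group(columns: ["_time"])
  |> sum()
'''

print(lint_flux(CANONICAL))  # []
print(lint_flux(BROKEN))     # ['a', 'b', 'c', 'd', 'e', 'f']
```

Passing stat_panel_last=True skips rules a and b, matching the exemption for stat panels that display a single last() value.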

Consequences

  • Prometheus, Loki, Alertmanager, and Platform Grafana (kube-prometheus-stack 84.3.0) are the authoritative IT observability stack on caneast-site1-node4, archon-monitoring namespace
  • OT Grafana (caneast-site1-node2:[REDACTED], InfluxDB) is retained unchanged; cmms_alarm_bridge ALM webhooks are unaffected
  • Headlamp (archon-infra namespace) is the Kubernetes-native UI layer at headlamp-platform.peries.ca
  • node_exporter, cAdvisor, blackbox_exporter, mqtt_exporter, and Netdata are standard IT node agents
  • Uptime Kuma retains public-facing endpoint scope only; all internal infrastructure monitors are deprecated
  • Telegram SentinelBot = ALERTING only; ArchonBot/archonagent = CHATOPS only; roles must not cross
  • New Relic free tier (100 GB/month) is the hosted secondary backend; 90% quota circuit breaker required
  • Log forwarder (Promtail/Fluent Bit/Alloy) decision deferred to WI-387
  • LLMOPS-0002 (AI operations agent plane) depends on this substrate

References

  • PLAT-0003 — k3s control plane migration to caneast-site1-node4
  • IAM-0001 — Infisical for secrets (Telegram bot token, New Relic ingest key)
  • LLMOPS-0002 — AI operations agent plane (depends on this ADR; forward reference)
  • OT-0003 — Historian retention (OT data plane)
  • OT-0004 — Alarm rationalization and CMMS integration
  • OT-0005 — Dashboard taxonomy
  • ansible/roles/common/tasks/ — node_exporter deployment
  • kubernetes/archon-monitoring/ — kube-prometheus-stack + archon-loki Helm values
  • kubernetes/archon-infra/headlamp/ — Headlamp Helm values
  • WI-257 — Epic E1: Platform Infrastructure & Ops
  • WI-321 — k8sgpt AI operations agent plane
  • WI-387 — Log collector ADR (Promtail vs Fluent Bit vs Alloy)