Observability Stack Architecture

Overview

Platform observability is built on the cloud-native trio of Prometheus, Loki, and Grafana. Metrics, logs, and alerts flow through a unified stack deployed in the platform Kubernetes cluster. A separate OT Grafana instance serves OT dashboards, keeping the IT and OT observability surfaces operationally separate.

Metrics: Prometheus

Prometheus scrapes metrics from all platform nodes and Kubernetes workloads via kube-prometheus-stack. Node Exporter provides host-level metrics; cAdvisor provides container metrics; kube-state-metrics provides Kubernetes object state. Additional scrape targets can be added for out-of-cluster services.
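
The text above does not say how out-of-cluster targets are registered. One common route with kube-prometheus-stack is the chart's `prometheus.prometheusSpec.additionalScrapeConfigs` value, which takes ordinary Prometheus scrape jobs. The following is a minimal sketch under that assumption; the job name, host name, and label are hypothetical, and the helper functions are illustrative rather than part of the platform.

```python
import json


def static_scrape_job(job_name, targets, labels=None, scrape_interval="30s"):
    """Build one Prometheus scrape job for a fixed list of host:port targets."""
    if not targets:
        raise ValueError("at least one target is required")
    for target in targets:
        # static_configs targets are plain host:port strings, without a URL scheme.
        if "://" in target or ":" not in target:
            raise ValueError(f"expected host:port, got {target!r}")
    static_config = {"targets": list(targets)}
    if labels:
        # Labels set here are attached to every series scraped from these targets.
        static_config["labels"] = dict(labels)
    return {
        "job_name": job_name,
        "scrape_interval": scrape_interval,
        "static_configs": [static_config],
    }


def helm_values_fragment(jobs):
    """Nest scrape jobs under the kube-prometheus-stack values key (assumed route)."""
    return {"prometheus": {"prometheusSpec": {"additionalScrapeConfigs": list(jobs)}}}


if __name__ == "__main__":
    # Hypothetical out-of-cluster host running Node Exporter on its default port.
    job = static_scrape_job(
        "external-node-exporter",
        ["nas.example.internal:9100"],
        labels={"site": "lab"},
    )
    print(json.dumps(helm_values_fragment([job]), indent=2))
```

The printed structure maps one-to-one onto a Helm values fragment, so it can be transcribed into the chart's values file or reviewed before a change is made there.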

The observability stack architecture, including the decision to use kube-prometheus-stack as the umbrella Helm chart, is documented in OBS-0001.

Logs: Loki

Loki is the platform log aggregation backend. It is deployed in the same cluster as Prometheus and Grafana and is queried through the Grafana Logs Drilldown plugin. A log collector to ship node and container logs into Loki is a tracked backlog item; until one is deployed, Loki receives no log streams. The choice of log collector will be recorded in a separate ADR.
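
Whichever collector is eventually selected, what reaches Loki is a set of labelled streams with timestamped lines. As a rough illustration, the sketch below builds the JSON body accepted by Loki's HTTP push endpoint (`POST /loki/api/v1/push`). It does not represent the platform's collector, which has not been chosen; the label names and log lines are hypothetical.

```python
import json
import time


def loki_push_body(labels, lines, start_ns=None):
    """Build the JSON body for POST /loki/api/v1/push containing a single stream."""
    if not labels:
        # A stream is identified by its label set, so an empty set is rejected here.
        raise ValueError("at least one label is required per stream")
    if start_ns is None:
        start_ns = time.time_ns()
    # Each entry is [timestamp, line]; the timestamp is Unix epoch nanoseconds as a string.
    # Consecutive lines get increasing timestamps so their order is preserved.
    values = [[str(start_ns + offset), line] for offset, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


if __name__ == "__main__":
    body = loki_push_body(
        {"job": "demo", "node": "node-1"},  # hypothetical labels
        ["service started", "listening on :8080"],
    )
    print(json.dumps(body, indent=2))
```

Until some component sends payloads of this shape, queries in Grafana return no results, which is why the collector is a prerequisite for Loki being useful.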

Dashboards and UI: Grafana

Platform Grafana is the primary dashboard and alerting interface. It is served behind Traefik with TLS and is reachable at its public DNS name.

Two-Grafana Topology

The platform intentionally runs two Grafana instances:

Platform Grafana (in-cluster): serves IT infrastructure dashboards, Prometheus metrics, Loki logs, and alert rules for the k3s cluster and nodes.
OT Grafana (Docker, separate host): serves OT sensor dashboards and OT alert rules. This instance has direct access to the InfluxDB historian used by OT sensor firmware.

The two-Grafana topology reflects IT/OT separation at the observability layer. OT operators do not need access to platform Grafana; platform engineers do not need access to OT Grafana. The separation is by design and is maintained across upgrades.

Kubernetes-Native UI: Headlamp

Headlamp provides a read-only Kubernetes dashboard for cluster state inspection. It is deployed as a Kubernetes workload and served via Traefik with TLS. Headlamp complements kubectl access rather than replacing it; it is intended for rapid visual inspection of pod state, resource health, and events.

Alerting: Alertmanager and Telegram

Alertmanager receives alerts from Prometheus rule evaluations and routes them to notification channels. The primary notification channel is Telegram, chosen for reliability and mobile accessibility. Alert rules follow two ISA-18.2-inspired severity tiers: Warning and Critical. The Telegram bot token is stored in Infisical and injected at runtime; it is never hardcoded in values files.
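
The two constraints described above, a fixed set of severity tiers and a bot token supplied only at runtime, can be sketched as follows. This is an illustration under assumptions, not the platform's configuration: the lowercase `severity` label values, the environment variable name `TELEGRAM_BOT_TOKEN`, the example rule, and the chat ID are all hypothetical. The dictionaries mirror the shape of a Prometheus alerting rule and an Alertmanager receiver using `telegram_configs`.

```python
import json
import os

# Assumed label values for the Warning and Critical tiers.
SEVERITIES = ("warning", "critical")


def alert_rule(name, expr, severity, hold_for="5m", summary=""):
    """Build a Prometheus alerting rule whose severity is one of the allowed tiers."""
    if severity not in SEVERITIES:
        raise ValueError(f"severity must be one of {SEVERITIES}, got {severity!r}")
    return {
        "alert": name,
        "expr": expr,
        "for": hold_for,  # the condition must hold this long before the alert fires
        "labels": {"severity": severity},
        "annotations": {"summary": summary or name},
    }


def telegram_receiver(chat_id, environ=None, token_var="TELEGRAM_BOT_TOKEN"):
    """Build an Alertmanager receiver, reading the bot token from the environment."""
    environ = os.environ if environ is None else environ
    token = environ.get(token_var)
    if not token:
        # Fail loudly instead of falling back to a token embedded in code or values.
        raise RuntimeError(f"{token_var} is not set")
    return {
        "name": "telegram",
        "telegram_configs": [{"bot_token": token, "chat_id": int(chat_id)}],
    }


if __name__ == "__main__":
    rule = alert_rule(
        "NodeExporterDown",  # hypothetical rule name
        'up{job="node-exporter"} == 0',
        "critical",
        summary="Node Exporter target is down",
    )
    # A placeholder mapping stands in for the runtime-injected environment.
    receiver = telegram_receiver(123456789, environ={"TELEGRAM_BOT_TOKEN": "placeholder"})
    # Print the receiver name only, so the token never appears in output.
    print(json.dumps({"rule": rule, "receiver": receiver["name"]}, indent=2))
```

Rejecting unknown severities at build time keeps the rule set aligned with the two tiers, and reading the token from the environment keeps it out of anything that is committed to version control.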

Key Properties

Prometheus + Loki + Grafana as the unified observability stack
Two-Grafana topology enforces IT/OT observability separation
Headlamp provides lightweight Kubernetes-native UI without replacing CLI access
Alertmanager routes to Telegram -- no pager dependency, single notification channel
Log collector not yet deployed -- tracked backlog item before Loki is useful for log queries