OT-0004 Alarm Rationalization and CMMS Integration

| Field | Value |
| --- | --- |
| ID | OT-0004 |
| Date | 2026-05-02 |
| Status | Accepted |
| Deciders | Ben Peries |
| Consolidates | OT-0006 (2026-04-07), OT-0007 (2026-04-09) |

Context

The OT zone 1 sump-pit sensor (caneast-site1-ot1-snr01) generates telemetry that must raise alarms and create maintenance work orders when threshold conditions are met. Two earlier ADRs addressed these concerns separately: OT-0006 defined the ISA-18.2 alarm design, and OT-0007 defined the CMMS integration. This ADR consolidates them because the two decisions are operationally inseparable.


Section 1 — ISA-18.2 Alarm Rationalization

Source: OT-0006 (2026-04-07)

Decision

Three alarms are defined for caneast-site1-ot1-snr01 following ISA-18.2 alarm management principles:

| Alarm ID | Condition | Delay | Severity | Action |
| --- | --- | --- | --- | --- |
| CAE1-OT1-ALM-REDACTED | level > 300 mm sustained 2 min | 2 min | Warning | Telegram notification |
| CAE1-OT1-ALM-REDACTED | level > 450 mm OR flood == 1 | Immediate | Critical | Telegram + CMMS WO |
| CAE1-OT1-ALM-REDACTED | vibration_stddev > threshold | | Warning | Telegram notification |
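
The two level alarms in the table can be illustrated with a minimal Python sketch. This is not the Grafana rule definition; it only models the semantics stated above (a 2-minute sustain for the warning, no delay for the critical condition). The `Sample` and `LevelAlarmEvaluator` names are hypothetical, and the vibration alarm is left out because its threshold is not specified here.

```python
from dataclasses import dataclass
from typing import Optional

LEVEL_WARNING_MM = 300.0   # from the table: level > 300 mm
LEVEL_CRITICAL_MM = 450.0  # from the table: level > 450 mm
WARNING_SUSTAIN_S = 120.0  # from the table: sustained 2 min


@dataclass
class Sample:
    """One telemetry reading (hypothetical shape, for illustration only)."""
    ts: float        # timestamp in seconds
    level_mm: float  # sump-pit level in millimetres
    flood: int       # 1 when the flood contact is active, else 0


class LevelAlarmEvaluator:
    """Tracks how long the level has stayed above the warning threshold."""

    def __init__(self) -> None:
        self._above_since: Optional[float] = None

    def evaluate(self, s: Sample) -> Optional[str]:
        # Maintain the sustain timer: start it on the first sample above
        # the warning threshold, reset it as soon as the level drops back.
        if s.level_mm > LEVEL_WARNING_MM:
            if self._above_since is None:
                self._above_since = s.ts
        else:
            self._above_since = None

        # Critical fires immediately, with no delay.
        if s.level_mm > LEVEL_CRITICAL_MM or s.flood == 1:
            return "critical"

        # Warning fires only after the level has been high for 2 minutes.
        if (self._above_since is not None
                and s.ts - self._above_since >= WARNING_SUSTAIN_S):
            return "warning"
        return None
```

Resetting the timer whenever the level drops back to 300 mm or below is what keeps short excursions during a normal pump cycle from raising the warning.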

Implementation

Alarms are implemented in Grafana Unified Alerting with the following UIDs:

| Alarm ID | Grafana Alert UID |
| --- | --- |
| CAE1-OT1-ALM-REDACTED | sumppit-level-warning |
| CAE1-OT1-ALM-REDACTED | sumppit-level-critical |
| CAE1-OT1-ALM-REDACTED | sumppit-vibration-warning |

Notification channel: Grafana Contact Point → Telegram. TELEGRAM_BOT_TOKEN is sourced from Infisical (IAM-0001) and is never hardcoded.

ISA-18.2 Alignment

  • Alarm rationalization: Only three alarms defined; no nuisance-alarm flood from minor level fluctuations.
  • On-delay: ALM-REDACTED uses a 2-minute sustain period (a time delay rather than a value deadband) to prevent chatter during normal pump cycles.
  • Priority: Warning (P3) for ALM-REDACTED/003; Critical (P1) for ALM-REDACTED.

Section 2 — CMMS Integration

Source: OT-0007 (2026-04-09), addendum 2026-04-13

Decision

Atlas CMMS (intelloop/grash Docker image) runs on caneast-site1-node2 and serves as the maintenance management system. An alarm-bridge Flask service (also on caneast-site1-node2) mediates between Grafana Unified Alerting and Atlas CMMS.

Alarm-Bridge Architecture

Grafana alert fires
    → POST /webhook (alarm-bridge Flask, caneast-site1-node2)
    → Atlas CMMS REST API → create Work Order
    → Grafana: annotate panel with fingerprint correlation ID
    → ADO: create Bug WI (if Critical)
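
The flow above can be sketched as a pure function that turns a Grafana webhook payload into an ordered list of steps. This is a minimal sketch, not the alarm-bridge source: `plan_actions` and the step names are hypothetical, and the use of a `severity` label to mark Critical alarms is an assumption. It also assumes that which alerts reach the webhook is decided by Grafana's notification routing, so every firing alert that arrives gets a work order. The `alerts`, `status`, `labels`, and `fingerprint` keys follow the Grafana webhook payload format.

```python
from typing import Any, Dict, List


def plan_actions(payload: Dict[str, Any]) -> List[Dict[str, str]]:
    """Return the bridge steps for each firing alert in a webhook payload."""
    actions: List[Dict[str, str]] = []
    for alert in payload.get("alerts", []):
        # Resolved notifications do not open new work orders in this sketch.
        if alert.get("status") != "firing":
            continue

        # The Grafana alert fingerprint is used as the correlation ID.
        fingerprint = alert.get("fingerprint", "")
        # Assumption: alert rules carry a "severity" label.
        severity = alert.get("labels", {}).get("severity", "").lower()

        actions.append({"step": "cmms_work_order", "correlation_id": fingerprint})
        actions.append({"step": "grafana_annotation", "correlation_id": fingerprint})
        if severity == "critical":
            actions.append({"step": "ado_bug", "correlation_id": fingerprint})
    return actions
```

Keeping the decision logic separate from the HTTP handler makes the ordering (work order, then annotation, then ADO bug) easy to unit-test without a running Flask app.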

Service Configuration

| Parameter | Value |
| --- | --- |
| Atlas CMMS image | intelloop/grash |
| Atlas CMMS deps | PostgreSQL, MinIO |
| CMMS_BASE_URL | http://REDACTED:8080 (internal; NRP bypass) |
| alarm-bridge | Flask service on caneast-site1-node2, /opt/cae/tools/alarm-bridge/ |
| alarm-bridge image | cae1-alarm-harness:latest |

Addendum — 2026-04-13

CMMS_BASE_URL was updated from an NRP proxy hostname to the direct internal IP http://REDACTED:8080. The NRP proxy host for Atlas CMMS was not configured at that time; direct IP bypasses the proxy and avoids TLS overhead for internal service communication.
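
Because the base URL has already changed once, it may help to validate it at startup. The sketch below assumes CMMS_BASE_URL is supplied as an environment variable, which the name suggests but this ADR does not state; `load_cmms_base_url` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlparse


def load_cmms_base_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Read CMMS_BASE_URL and reject values that are not absolute http(s) URLs."""
    env = os.environ if env is None else env
    # Strip a trailing slash so callers can append paths consistently.
    url = env.get("CMMS_BASE_URL", "").strip().rstrip("/")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError("CMMS_BASE_URL must be an absolute http(s) URL")
    return url
```

Accepting both `http` and `https` means a later move back behind the proxy would not require a code change.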

ADO Integration

For Critical alarms (CAE1-OT1-ALM-REDACTED), the alarm-bridge creates an ADO Bug work item in archon-platform\OT with the Grafana alert fingerprint as the correlation ID. This provides a full audit trail: Grafana alert → CMMS work order → ADO bug.
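
Azure DevOps creates work items from a JSON Patch document, so the Bug described above could be built roughly as follows. This is a sketch under assumptions: `build_ado_bug_patch` is a hypothetical helper, `archon-platform\OT` is treated as the area path, and storing the fingerprint as a tag is one possible way to carry the correlation ID (the ADR does not say which field is used).

```python
from typing import Dict, List


def build_ado_bug_patch(
    title: str,
    fingerprint: str,
    area_path: str = "archon-platform\\OT",  # assumption: this is the area path
) -> List[Dict[str, str]]:
    """Build the JSON Patch body for creating an ADO Bug work item."""
    return [
        {"op": "add", "path": "/fields/System.Title", "value": title},
        {"op": "add", "path": "/fields/System.AreaPath", "value": area_path},
        # Assumption: the Grafana fingerprint is recorded as a tag so the
        # bug can be traced back to the alert and the CMMS work order.
        {"op": "add", "path": "/fields/System.Tags",
         "value": f"grafana-fingerprint:{fingerprint}"},
    ]
```

The resulting list would be sent with the `application/json-patch+json` content type to the work item creation endpoint for the `Bug` type; authentication and the organization URL are outside this sketch.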


Rationale

Combining alarm rationalization and CMMS integration into a single ADR reflects operational reality: alarms without CMMS integration have no maintenance workflow, and CMMS integration without alarm rationalization has no trigger source. The ISA-18.2 three-alarm design prevents nuisance-alarm overload while ensuring critical conditions generate actionable maintenance records.

Consequences

  • TELEGRAM_BOT_TOKEN must be present in Infisical at runtime; alarm-bridge will fail silently if missing.
  • Atlas CMMS PostgreSQL and MinIO volumes must be maintained and backed up; losing them means losing work-order history.
  • ADO Bug WIs created by alarm-bridge are in archon-platform\OT; manual triage required to assign to sprint.
  • Future alarm additions (e.g., for caneast-site1-ot1-snr02) follow the same three-alarm rationalization pattern.
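
The silent-failure risk in the first consequence can be reduced with a startup check that fails loudly. A minimal sketch, assuming the secret is injected from Infisical as an environment variable; `require_secret` and `MissingSecretError` are hypothetical names.

```python
import os
from typing import Mapping, Optional


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent or empty."""


def require_secret(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the named secret, or raise instead of continuing without it."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        # Report the variable name only; never log secret values.
        raise MissingSecretError(f"{name} is not set; refusing to start")
    return value
```

Calling `require_secret("TELEGRAM_BOT_TOKEN")` once at service start would turn a missing token into an immediate, visible startup error rather than a notification that never arrives.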

Sources

| Original ADR | Date | Content |
| --- | --- | --- |
| OT-0006 ISA-18.2 Alarm Rationalization | 2026-04-07 | Alarm definitions, Grafana UIDs, ISA-18.2 alignment |
| OT-0007 CMMS Integration | 2026-04-09 (addendum 2026-04-13) | alarm-bridge, Atlas CMMS, ADO integration |

References