OT-0004 Alarm Rationalization and CMMS Integration
| Field | Value |
|---|---|
| ID | OT-0004 |
| Date | 2026-05-02 |
| Status | Accepted |
| Deciders | Ben Peries |
| Consolidates | OT-0006 (2026-04-07), OT-0007 (2026-04-09) |
Context
The OT zone 1 sump-pit sensor (caneast-site1-ot1-snr01) generates telemetry that must trigger
alarms and create maintenance work orders when threshold conditions are met. Two earlier
ADRs addressed these concerns separately: OT-0006 defined the ISA-18.2 alarm design,
and OT-0007 defined the CMMS integration. This ADR consolidates them because the two
decisions are operationally inseparable.
Section 1: ISA-18.2 Alarm Rationalization
Source: OT-0006 (2026-04-07)
Decision
Three alarms are defined for caneast-site1-ot1-snr01 following ISA-18.2 alarm management
principles:
| Alarm ID | Condition | Delay | Severity | Action |
|---|---|---|---|---|
| CAE1-OT1-ALM-REDACTED | level > 300 mm sustained 2 min | 2 min | Warning | Telegram notification |
| CAE1-OT1-ALM-REDACTED | level > 450 mm OR flood == 1 | Immediate | Critical | Telegram + CMMS WO |
| CAE1-OT1-ALM-REDACTED | vibration_stddev > threshold | - | Warning | Telegram notification |
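The three conditions in the table can be sketched as a small evaluator. This is a minimal illustration only, not the production rule set: the real alarms live in Grafana Unified Alerting, where the sustain period would normally be expressed through the rule's pending period. The `Sample` and `SumpPitAlarms` names are hypothetical, the vibration threshold is left to the caller because the source does not state it, and the returned keys reuse the Grafana alert UIDs listed under Implementation.

```python
from dataclasses import dataclass
from typing import Optional

LEVEL_WARNING_MM = 300.0
LEVEL_CRITICAL_MM = 450.0
WARNING_SUSTAIN_S = 120.0  # the 2-minute sustain from the alarm table


@dataclass
class Sample:
    """One telemetry reading (hypothetical shape, for illustration)."""
    ts: float                # seconds since an arbitrary epoch
    level_mm: float
    flood: int               # 1 when the flood input is active
    vibration_stddev: float


class SumpPitAlarms:
    """Evaluates the three alarm conditions sample by sample."""

    def __init__(self, vibration_threshold: float):
        # The source does not give the vibration threshold; the caller supplies it.
        self.vibration_threshold = vibration_threshold
        self._above_since: Optional[float] = None

    def evaluate(self, s: Sample) -> set:
        active = set()

        # Level warning: the level must stay above 300 mm for the full sustain period.
        if s.level_mm > LEVEL_WARNING_MM:
            if self._above_since is None:
                self._above_since = s.ts
            if s.ts - self._above_since >= WARNING_SUSTAIN_S:
                active.add("sumppit-level-warning")
        else:
            self._above_since = None  # any dip below the limit resets the timer

        # Level critical: immediate, no delay.
        if s.level_mm > LEVEL_CRITICAL_MM or s.flood == 1:
            active.add("sumppit-level-critical")

        # Vibration warning: no delay is specified in the table.
        if s.vibration_stddev > self.vibration_threshold:
            active.add("sumppit-vibration-warning")

        return active
```

Resetting the timer on any dip is what keeps short excursions during a normal pump cycle from raising the warning.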
Implementation
Alarms are implemented in Grafana Unified Alerting with the following UIDs:
| Alarm ID | Grafana Alert UID |
|---|---|
| CAE1-OT1-ALM-REDACTED | sumppit-level-warning |
| CAE1-OT1-ALM-REDACTED | sumppit-level-critical |
| CAE1-OT1-ALM-REDACTED | sumppit-vibration-warning |
Notification channel: Grafana Contact Point → Telegram.
TELEGRAM_BOT_TOKEN is sourced from Infisical (IAM-0001) and is never hardcoded.
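One way to keep a missing token from going unnoticed is to check for it at startup and raise a clear error. The sketch below assumes the secret reaches the consuming process as an environment variable (for example, injected by Infisical at launch); that delivery mechanism, the `require_secret` helper, and the `MissingSecretError` class are illustrative assumptions, not part of the recorded decision.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent or empty."""


def require_secret(name: str, env=os.environ) -> str:
    """Return the named secret, or raise instead of continuing without it.

    Assumption: secrets are exposed to the process as environment variables.
    """
    value = env.get(name, "").strip()
    if not value:
        raise MissingSecretError(f"{name} is not set; check that it exists in Infisical")
    return value
```

Called once at process start, for example `require_secret("TELEGRAM_BOT_TOKEN")`, this turns a missing secret into an immediate, visible failure.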
ISA-18.2 Alignment
- Alarm rationalization: Only three alarms defined; no nuisance-alarm flood from minor level fluctuations.
- Deadband: ALM-REDACTED uses a 2-minute sustain period (an on-delay in ISA-18.2 terms) to prevent chatter during normal pump cycles.
- Priority: Warning (P3) for ALM-REDACTED/003; Critical (P1) for ALM-REDACTED.
Section 2: CMMS Integration
Source: OT-0007 (2026-04-09), addendum 2026-04-13
Decision
Atlas CMMS (intelloop/grash Docker image) runs on caneast-site1-node2 and serves as the
maintenance management system. An alarm-bridge Flask service (also on caneast-site1-node2)
mediates between Grafana Unified Alerting and Atlas CMMS.
Alarm-Bridge Architecture
Grafana alert fires
→ POST /webhook (alarm-bridge Flask, caneast-site1-node2)
→ Atlas CMMS REST API → create Work Order
→ Grafana: annotate panel with fingerprint correlation ID
→ ADO: create Bug WI (if Critical)
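The flow above can be sketched as a plain function that a Flask `/webhook` route would call with the parsed JSON body. This is an assumption-laden illustration rather than the actual alarm-bridge code: it relies on the `alerts`, `status`, `labels`, and `fingerprint` fields of the Grafana webhook payload, assumes each rule carries a `severity` label (labels are user-defined, so this is hypothetical), and takes the three downstream actions as injected callables so that no Atlas CMMS or ADO endpoints have to be guessed.

```python
def handle_grafana_webhook(payload, create_work_order, annotate_panel, create_ado_bug):
    """Process one Grafana webhook body and return a record per firing alert.

    The three callables are hypothetical stand-ins for the Atlas CMMS,
    Grafana annotation, and ADO clients.
    """
    handled = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved notifications create nothing in this sketch

        fingerprint = alert.get("fingerprint", "")
        labels = alert.get("labels", {})
        name = labels.get("alertname", "unknown")
        # Assumption: each rule carries a "severity" label.
        severity = labels.get("severity", "").lower()

        record = {"fingerprint": fingerprint, "work_order": None, "ado_bug": None}
        record["work_order"] = create_work_order(name, fingerprint)
        annotate_panel(fingerprint)  # fingerprint doubles as the correlation ID
        if severity == "critical":
            record["ado_bug"] = create_ado_bug(name, fingerprint)
        handled.append(record)
    return handled
```

Which alerts reach the webhook at all is assumed to be decided upstream by Grafana notification routing; the function acts on whatever it receives.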
Service Configuration
| Parameter | Value |
|---|---|
| Atlas CMMS image | intelloop/grash |
| Atlas CMMS deps | PostgreSQL, MinIO |
| CMMS_BASE_URL | http://REDACTED:8080 (internal; NRP bypass) |
| alarm-bridge | Flask service on caneast-site1-node2, /opt/cae/tools/alarm-bridge/ |
| alarm-bridge image | cae1-alarm-harness:latest |
Addendum: 2026-04-13
CMMS_BASE_URL was updated from an NRP proxy hostname to the direct internal IP
http://REDACTED:8080. The NRP proxy host for Atlas CMMS was not configured at
that time; direct IP bypasses the proxy and avoids TLS overhead for internal service
communication.
ADO Integration
For Critical alarms (CAE1-OT1-ALM-REDACTED), the alarm-bridge creates an ADO Bug work
item in archon-platform\OT with the Grafana alert fingerprint as the correlation ID.
This provides a full audit trail: Grafana alert → CMMS work order → ADO bug.
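As a rough illustration of the Bug creation step, the sketch below builds a JSON Patch document of the kind the Azure DevOps work item API accepts. It assumes that ADO refers to Azure DevOps, that archon-platform\OT maps to the work item's area path, and that the fingerprint is carried in the title and a tag; none of these details are confirmed by this ADR, and `build_ado_bug_patch` is a hypothetical helper. No request is sent, so organization, project, and authentication are deliberately left out.

```python
import json


def build_ado_bug_patch(alarm_id: str, fingerprint: str,
                        area_path: str = "archon-platform\\OT") -> list:
    """Build the JSON Patch operations for a new Bug work item.

    Assumption: the Grafana fingerprint is stored in the title and in a tag
    so the Bug can be matched to the alert and the CMMS work order.
    """
    return [
        {"op": "add", "path": "/fields/System.Title",
         "value": f"[{alarm_id}] Critical alarm (fingerprint {fingerprint})"},
        {"op": "add", "path": "/fields/System.AreaPath", "value": area_path},
        {"op": "add", "path": "/fields/System.Tags",
         "value": f"grafana-fingerprint:{fingerprint}"},
    ]


# Serialized body, as it would be sent with Content-Type application/json-patch+json.
body = json.dumps(build_ado_bug_patch("CAE1-OT1-ALM-REDACTED", "abc123"))
```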
Rationale
Combining alarm rationalization and CMMS integration into a single ADR reflects operational reality: alarms without CMMS integration have no maintenance workflow, and CMMS integration without alarm rationalization has no trigger source. The ISA-18.2 three-alarm design prevents nuisance-alarm overload while ensuring critical conditions generate actionable maintenance records.
Consequences
- TELEGRAM_BOT_TOKEN must be present in Infisical at runtime; alarm-bridge will fail silently if missing.
- Atlas CMMS PostgreSQL and MinIO volumes must be maintained; data loss = lost WO history.
- ADO Bug WIs created by alarm-bridge are in archon-platform\OT; manual triage required to assign to sprint.
- Future alarm additions (e.g., for caneast-site1-ot1-snr02) follow the same three-alarm rationalization pattern.
Sources
| Original ADR | Date | Content |
|---|---|---|
| OT-0006 ISA-18.2 Alarm Rationalization | 2026-04-07 | Alarm definitions, Grafana UIDs, ISA-18.2 alignment |
| OT-0007 CMMS Integration | 2026-04-09 (addendum 2026-04-13) | alarm-bridge, Atlas CMMS, ADO integration |