Note: This ADR is Proposed and pending implementation. For the current operational two-tier baseline, see OT-0003.
OT-0010 - OT Historian Retention Architecture with Event Preservation
Status
Proposed
Context
ADR-OT-0005 established a single InfluxDB bucket (homelab, 30-day retention) for all OT sensor data. As the platform matures toward AI-driven predictive maintenance and capital planning, a single flat bucket is insufficient for three reasons:
- Retention mismatch: Operators need 7 days of raw data for shift review. Lifecycle planning requires 7 years of trend data. A single bucket must either waste storage on fine-grained old data or lose the detail that makes real-time SCADA useful.
- Naive downsampling loses critical events: Standard mean/max aggregation destroys alarm transitions, vibration spikes, and flood state changes at the moment they are downsampled. Predictive maintenance models require the original anomaly timestamps to train effectively.
- Industry standards require tiered retention: ISA-18.2 alarm management, ISO 14224 equipment reliability, NIST SP 800-82 ICS security, and SOX/regulatory audit trails each have distinct retention requirements that cannot be satisfied by a single policy.
The sensor node CAE1OT1SNR01 currently publishes event-driven data at the edge (5 mm level threshold, flood state transitions, IMU deadband triggers). This edge intelligence is preserved by the event-preservation rules defined in this ADR.
Decision
Four-Bucket Hierarchy
| Bucket | Retention | Resolution | Purpose | Standard |
|---|---|---|---|---|
| homelab | 7 days | Native event-driven | Real-time SCADA, shift review | Operator HMI |
| homelab-1m | 90 days | 1 minute plus significant events | Monthly and quarterly analysis | ISA-18.2, ISO 14224 |
| homelab-1h | 2 years | 1 hour plus alarm transitions | Annual and YoY analysis | NIST SP 800-82 |
| homelab-archive | 7 years | 1 hour plus event preservation | Lifecycle, capital planning, audit | SOX, regulatory |
Note: homelab retention is 30 days at ADR adoption. It will be reduced to 7 days immediately following backfill completion into homelab-1m.
Event Preservation Rules
Every downsampled tier preserves the following at original timestamps in addition to the regular 1-minute or 1-hour aggregate:
Alarm transitions (all tiers):
- All ISA-18.2 alarm state changes (HH, H, L, LL, RTN) preserved verbatim
- Flood sensor digital state changes (0 to 1 and 1 to 0) preserved
- Rain sensor state changes preserved
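The alarm-transition rule amounts to keeping only the samples where the state changes. The sketch below is illustrative, not the planned Flux implementation: the function name `state_transitions` and the `(timestamp, state)` tuple shape are hypothetical, and it assumes the first sample is kept to establish the initial state.

```python
from typing import Hashable, Iterable, List, Tuple

# (timestamp_seconds, state); state is e.g. "H", "HH", "RTN" or a digital 0/1.
# This tuple layout is an assumption made for the sketch.
Sample = Tuple[float, Hashable]


def state_transitions(samples: Iterable[Sample]) -> List[Sample]:
    """Keep the first sample and every sample whose state differs from the one before it."""
    kept: List[Sample] = []
    previous = None
    first = True
    for ts, state in samples:
        if first or state != previous:
            kept.append((ts, state))  # original timestamp is preserved verbatim
        previous = state
        first = False
    return kept


flood = [(0, 0), (10, 0), (20, 1), (30, 1), (40, 0)]
print(state_transitions(flood))
```

Repeated samples in the same state are dropped, so a flood sensor that reports 1 for an hour contributes two points: the 0 to 1 edge and the 1 to 0 edge.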
Deadband-triggered points (homelab-1m and homelab-1h):
- Level: any point where |delta| > 10 mm from previous preserved point
- RSSI: any point where |delta| > 3 dBm from previous preserved point
- IMU accel: any axis reading where |delta| > 0.5 m/s^2
- IMU gyro: any axis reading where |delta| > 2 deg/s
- IMU temperature: any reading where |delta| > 0.5 C
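A minimal sketch of the deadband rule follows. The thresholds are taken from the list above; the dictionary keys, the function name `deadband_filter`, and the point layout are hypothetical. It assumes the delta is always measured against the previous preserved point, which the rules state explicitly only for level and RSSI, and that the first point of a series is always kept.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (timestamp_seconds, value); layout is an assumption

# Thresholds from the event preservation rules; key names are hypothetical.
DEADBAND = {
    "level_mm": 10.0,
    "rssi_dbm": 3.0,
    "imu_accel_ms2": 0.5,
    "imu_gyro_dps": 2.0,
    "imu_temp_c": 0.5,
}


def deadband_filter(points: List[Point], threshold: float) -> List[Point]:
    """Keep points whose value moved more than `threshold` from the last kept point."""
    if not points:
        return []
    kept = [points[0]]  # assumption: the first point seeds the comparison
    for ts, value in points[1:]:
        if abs(value - kept[-1][1]) > threshold:
            kept.append((ts, value))
    return kept


level = [(0, 100.0), (1, 105.0), (2, 111.0), (3, 120.0), (4, 122.0), (5, 109.0)]
print(deadband_filter(level, DEADBAND["level_mm"]))
```

Comparing against the last kept point rather than the immediately preceding sample means a slow drift is still captured once it accumulates past the threshold.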
Gap markers (all tiers):
- First and last point of any gap exceeding 1 hour preserved to maintain contiguous time series in trend dashboards
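The gap-marker rule can be sketched as follows. This reads "first and last point of any gap" as the last point before the gap and the first point after it, which is an interpretation rather than something the rule spells out. The function name `gap_markers` is hypothetical, and timestamps are assumed to be sorted and expressed in seconds.

```python
from typing import List


def gap_markers(timestamps: List[float], max_gap_s: float = 3600.0) -> List[float]:
    """Return the timestamps that bound every gap longer than `max_gap_s`."""
    markers: List[float] = []
    for before, after in zip(timestamps, timestamps[1:]):
        if after - before > max_gap_s:
            # Skip `before` if it already closed the previous gap.
            if not markers or markers[-1] != before:
                markers.append(before)
            markers.append(after)
    return markers


print(gap_markers([0, 60, 120, 4000, 4060, 9000]))
```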
Downsampling Flow
homelab (7d, native)
|-- every 1 minute --> homelab-1m (90d, 1-min + events)
|-- every 1 hour --> homelab-1h (2y, 1-hour + events)
|-- every 1 hour --> homelab-archive (7y, 1-hour + events)
Flux tasks run on InfluxDB on caneast-site1-node2. Implementation pending.
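Because the Flux tasks are not yet written, the following Python sketch only illustrates the intended shape of one downsampling step: regular window aggregates merged with event-preserved points at their original timestamps. The function name `downsample_with_events`, the use of the mean as the aggregate, and stamping each aggregate at its window start are all assumptions made for the sketch.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Point = Tuple[int, float]  # (timestamp_seconds, value); layout is an assumption


def downsample_with_events(
    points: Iterable[Point],
    preserved: Iterable[Point],
    window_s: int = 60,
) -> List[Point]:
    """Aggregate `points` per window, then merge in the preserved event points."""
    windows: Dict[int, List[float]] = defaultdict(list)
    for ts, value in points:
        window_start = int(ts // window_s) * window_s
        windows[window_start].append(value)

    # One mean per window, stamped at the window start (assumption).
    aggregates = {(start, mean(values)) for start, values in windows.items()}

    # Preserved events keep their original timestamps and values.
    return sorted(aggregates | set(preserved))


raw = [(0, 1.0), (30, 3.0), (61, 10.0), (90, 20.0)]
events = [(30, 3.0)]
print(downsample_with_events(raw, events))
```

Passing `window_s=3600` gives the 1-hour variant used for homelab-1h and homelab-archive.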
Dashboard-to-Bucket Mapping
| Dashboard UID | Bucket | Default range | Max range |
|---|---|---|---|
| ot-ops-snr01-scada | homelab | 1 hour | 7 days |
| ot-eng-snr01-diag | homelab | 24 hours | 7 days |
| ot-inf-snr01-node | homelab | 24 hours | 7 days |
| ot-hist-snr01-quarterly | homelab-1m | 30 days | 90 days |
| ot-hist-snr01-annual | homelab-1h | 1 year | 2 years |
| ot-hist-snr01-lifecycle | homelab-archive | 2 years | 7 years |
Default range is the starting view on dashboard load. Max range equals the full retention of the mapped bucket. Per OT-0005, the time range picker is never capped.
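The invariant that max range equals bucket retention can be checked mechanically. The sketch below transcribes the two tables into dictionaries and reports any dashboard whose max range disagrees with its bucket. The names `BUCKET_RETENTION_DAYS`, `DASHBOARDS`, and `mismatched_dashboards` are hypothetical, and years are approximated as 365 days.

```python
from typing import Dict, List, Tuple

DAYS_PER_YEAR = 365  # assumption: leap days ignored

# Bucket retention from the four-bucket hierarchy, in days.
BUCKET_RETENTION_DAYS: Dict[str, int] = {
    "homelab": 7,
    "homelab-1m": 90,
    "homelab-1h": 2 * DAYS_PER_YEAR,
    "homelab-archive": 7 * DAYS_PER_YEAR,
}

# Dashboard UID -> (bucket, max range in days), from the mapping table.
DASHBOARDS: Dict[str, Tuple[str, int]] = {
    "ot-ops-snr01-scada": ("homelab", 7),
    "ot-eng-snr01-diag": ("homelab", 7),
    "ot-inf-snr01-node": ("homelab", 7),
    "ot-hist-snr01-quarterly": ("homelab-1m", 90),
    "ot-hist-snr01-annual": ("homelab-1h", 2 * DAYS_PER_YEAR),
    "ot-hist-snr01-lifecycle": ("homelab-archive", 7 * DAYS_PER_YEAR),
}


def mismatched_dashboards(
    dashboards: Dict[str, Tuple[str, int]],
    retention_days: Dict[str, int],
) -> List[str]:
    """Return UIDs whose max range differs from the retention of their bucket."""
    return sorted(
        uid
        for uid, (bucket, max_days) in dashboards.items()
        if retention_days.get(bucket) != max_days
    )


print(mismatched_dashboards(DASHBOARDS, BUCKET_RETENTION_DAYS))
```

A check like this could run whenever a dashboard or bucket definition changes, so the two tables do not drift apart as nodes are added.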
Storage Projections
At steady state with all six dashboards and three downsampling tiers active, CAE1OT1SNR01 generates approximately:
- homelab: ~200k points/7d (native event-driven rate, ~1 point/3s average across 7 measurements)
- homelab-1m: ~90k points/90d (1-min aggregates plus ~5% event-preserved extras)
- homelab-1h: ~15k points/2y (1-hour aggregates plus alarm transitions)
- homelab-archive: ~8k points/7y (1-hour aggregates, compressed)
Total across all buckets per node at steady state: approximately 313k points (the sum of the four estimates above).
Consequences
- The implementation will deploy Flux downsampling tasks and backfill homelab-1m and homelab-1h from existing homelab data.
- After backfill completes, homelab retention is reduced from 30 days to 7 days.
- The live Flux task downsample-sumppit-to-archive (ID: 108cb726d43c7000) references the deleted esp1_* measurements and the REDACTED org by name. It must be replaced as part of WI-309 (InfluxDB org rename) and the downsampling implementation.
- Each additional sensor node adds one set of dashboards and contributes to the same bucket set. Buckets are shared across all OT nodes; there are no per-node buckets.
- The homelab-archive bucket satisfies SOX 7-year retention and CISSP audit trail requirements for OT data without requiring a separate archival system at the current scale.
Alternatives Considered
Naive aggregation (mean/max only): Rejected. Destroys alarm transients and vibration anomalies at the moment of downsampling. Predictive maintenance models trained on this data cannot detect the signature patterns that precede equipment failure.
Single bucket with tags for tier: Rejected. No retention separation is possible within a single bucket. All data would share the most permissive retention policy, making storage costs unpredictable as the fleet grows.
Per-node buckets: Rejected. Adds operational overhead without benefit at the current scale (1-5 nodes). Each additional node would require its own Flux task set and Grafana datasource. Revisit if the fleet grows beyond 20 nodes.
References
- OT-0003: Historian Retention (two-tier baseline; superseded for retention tiers by OT-0010)
- OT-0005: OT Grafana Dashboard Taxonomy
- OT-0004: Alarm Rationalization and CMMS Integration
- ISA-18.2: Management of Alarm Systems for the Process Industries
- ISO 14224: Petroleum, petrochemical, and natural gas industries -- Collection and exchange of reliability and maintenance data for equipment
- NIST SP 800-82: Guide to Industrial Control Systems (ICS) Security
- WI-312: ADR-OT-0010 OT Historian Retention Architecture
- WI-313: Create InfluxDB tiered historian buckets