ADR-0018: Sanitization verification strategy (two-layer DLP for public docs)

Status

Accepted — 2026-04-03

Context

Public documentation on peries.ca is derived from internal docs via sanitize.py (see ADR-0017). The sanitization engine applies replacement rules to strip hostnames, IPs, service account names, hardware details, and credentials before MkDocs builds the public site.

A single-layer approach (sanitize only) is insufficient — new internal strings added to docs may not be covered by existing rules, and rule regressions are invisible until the public site is already deployed.

Decision

Two-layer approach: sanitize then verify.

Layer 1: sanitize.py (replacement engine)

Parses docs/_index.md inventory at runtime for IP and port mappings
Applies static string replacements (hardware, service accounts, personal names)
Applies regex replacements with word boundaries (prevents partial matches)
Applies node name pattern transforms (ADR-0008 CanEast naming)
Applies credential patterns (tokens, keys, passwords, emails)
Applies an RFC1918 IP catch-all as the final safety net
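
The rule ordering above can be sketched as follows. This is a minimal illustration and not the actual sanitize.py: the rule tables, the placeholder strings, and the `sanitize_text` name are all hypothetical, and the inventory parsing, node name transforms, and credential patterns are left out.

```python
import re

# Hypothetical static rules; the real ones live in sanitize.py.
STATIC_REPLACEMENTS = {
    "svc-backup": "SERVICE_ACCOUNT",
}

# Hypothetical regex rules. The \b word boundaries keep a rule for
# "nas01" from rewriting part of a longer token such as "nas012".
REGEX_REPLACEMENTS = [
    (re.compile(r"\bnas01\b"), "storage-node"),
]

# Catch-all for the three RFC 1918 ranges: 10/8, 172.16/12, 192.168/16.
RFC1918 = re.compile(
    r"\b(?:10\.\d{1,3}\.\d{1,3}\.\d{1,3}"
    r"|172\.(?:1[6-9]|2\d|3[01])\.\d{1,3}\.\d{1,3}"
    r"|192\.168\.\d{1,3}\.\d{1,3})\b"
)


def sanitize_text(text: str) -> str:
    """Apply static rules, then regex rules, then the IP catch-all."""
    for old, new in STATIC_REPLACEMENTS.items():
        text = text.replace(old, new)
    for pattern, new in REGEX_REPLACEMENTS:
        text = pattern.sub(new, text)
    # Runs last so it catches any private address the earlier rules missed.
    return RFC1918.sub("PRIVATE_IP", text)
```

Running the catch-all last means a private address still gets masked even when no specific rule covers it, which is the safety-net role described in the list.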

Layer 2: verify-sanitization.py (DLP scanner)

Scans all .md files in build/docs/ after sanitization
Checks for known internal string patterns:
  RFC1918 IPs (192.168.x.x)
  Internal node naming patterns
  Service account names
  Personal identifiers (usernames, emails, real names)
  Hardware make/model identifiers
  OS version specifics
Prints file path and line number for each leak
Exits with code 1 if any leak found — pipeline halts, nothing deploys
Prints PASS if clean
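
A scanner with the behaviour listed above might look roughly like this. It is a sketch under assumptions, not the real verify-sanitization.py: the `LEAK_PATTERNS` table, the `scan` and `main` names, and the output format are hypothetical, and only two of the listed categories are represented.

```python
import re
from pathlib import Path

# Hypothetical deny-list: patterns that must NOT appear in sanitized output.
LEAK_PATTERNS = {
    "rfc1918-ip": re.compile(r"\b192\.168\.\d{1,3}\.\d{1,3}\b"),
    "service-account": re.compile(r"\bsvc-[a-z0-9-]+\b"),
}


def scan(root: Path) -> list[tuple[Path, int, str]]:
    """Return (file, line number, pattern name) for every match under root."""
    leaks = []
    for path in sorted(root.rglob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            for name, pattern in LEAK_PATTERNS.items():
                if pattern.search(line):
                    leaks.append((path, lineno, name))
    return leaks


def main(root: str = "build/docs") -> int:
    """Print each leak with its location; return 1 on any leak, else 0."""
    leaks = scan(Path(root))
    for path, lineno, name in leaks:
        print(f"LEAK {path}:{lineno} [{name}]")
    if leaks:
        return 1
    print("PASS")
    return 0
```

A script built this way would end with `sys.exit(main())` so that the return value becomes the process exit code the pipeline gate depends on.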

Pipeline integration

sanitize.py → verify-sanitization.py → mike deploy → wrangler deploy

If verify fails, mike and wrangler never run. The public site is not updated.
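
The gate is ordinary fail-fast sequencing. The sketch below shows that behaviour with a hypothetical `run_pipeline` helper; it is not taken from the actual pipeline definition, and the real arguments to each stage are not specified in this ADR.

```python
import subprocess
import sys


def run_pipeline(steps: list[list[str]]) -> int:
    """Run each command in order; stop at the first non-zero exit code."""
    for cmd in steps:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # Later steps (the deploys) are never started.
            print(f"halted at: {' '.join(cmd)}", file=sys.stderr)
            return result.returncode
    return 0
```

With the four stages passed in order, a non-zero exit from the verify step returns before the mike and wrangler commands are started. A shell script using `set -e`, or a CI system's default step behaviour, gives the same guarantee.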

Rationale

Two independent layers reduce the risk of a single rule regression leaking internal data
verify-sanitization.py is pattern-based (what should NOT appear), independent of sanitize.py's replacement logic (what SHOULD be replaced) — different failure modes
The pipeline gate enforces zero leak tolerance: content that fails verification is never deployed silently
Simple Python scripts with no dependencies — no external tool installation required

Future: Layer 3 (planned)

truffleHog or gitleaks: credential scanning as a third verification layer
Would catch API keys, tokens, and secrets that neither sanitize.py nor verify-sanitization.py is designed to detect
Planned for Phase 3 when ADO pipelines are fully operational

Alternatives considered

Manual review before publishing: does not scale and will eventually be forgotten
Single sanitize-only approach: no feedback when rules are incomplete — leaks are silent
truffleHog only: credential-focused, does not catch hostname/IP/hardware leaks

Consequences

Every new internal string type requires both a sanitize rule AND a verify pattern
verify-sanitization.py must be updated when new categories of internal data are introduced
Pipeline runs are slightly longer (verify scan adds ~1 second)
False positives in verify are possible — must be triaged, not suppressed
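
One way to keep the two layers in step is a small self-check that feeds a sample string per category through both a sanitize rule and a verify pattern. The following is only a sketch of that idea: the `CATEGORIES` table, its sample strings, and the `uncovered` helper are hypothetical and do not exist in either script.

```python
import re

# Hypothetical table pairing each category's sample internal string with
# the sanitize rule and the verify pattern meant to cover it.
CATEGORIES = {
    "service-account": {
        "sample": "svc-backup",
        "sanitize": (re.compile(r"\bsvc-[a-z0-9-]+\b"), "SERVICE_ACCOUNT"),
        "verify": re.compile(r"\bsvc-[a-z0-9-]+\b"),
    },
    "hardware": {
        "sample": "ExampleVendor X100",
        "sanitize": (re.compile(r"\bExampleVendor X100\b"), "SERVER_MODEL"),
        "verify": None,  # deliberately missing, to show a reported gap
    },
}


def uncovered(categories: dict) -> list[str]:
    """Return categories whose sample is not both replaced and detected."""
    problems = []
    for name, cat in categories.items():
        pattern, replacement = cat["sanitize"]
        replaced = pattern.sub(replacement, cat["sample"]) != cat["sample"]
        verify = cat["verify"]
        detected = verify is not None and bool(verify.search(cat["sample"]))
        if not (replaced and detected):
            problems.append(name)
    return problems
```

Here `uncovered(CATEGORIES)` reports the `hardware` entry because it has a sanitize rule but no verify pattern, which is the kind of drift the first consequence warns about.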

References

ADR-0008: CanEast naming convention
ADR-0017: Public documentation pipeline