ADR-0018: Sanitization verification strategy — two-layer DLP for public docs
Status
Accepted — 2026-04-03
Context
Public documentation on peries.ca is derived from internal docs via sanitize.py (see ADR-0017). The sanitization engine applies replacement rules to strip hostnames, IPs, service account names, hardware details, and credentials before MkDocs builds the public site.
A single-layer approach (sanitize only) is insufficient — new internal strings added to docs may not be covered by existing rules, and rule regressions are invisible until the public site is already deployed.
Decision
Two-layer approach: sanitize then verify.
Layer 1: sanitize.py (replacement engine)
- Parses the docs/_index.md inventory at runtime for IP and port mappings
- Applies static string replacements (hardware, service accounts, personal names)
- Applies regex replacements with word boundaries (prevents partial matches)
- Applies node name pattern transforms (ADR-0008 CanEast naming)
- Applies credential patterns (tokens, keys, passwords, emails)
- Applies an RFC1918 IP catch-all as a final safety net
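The ordering of these steps can be illustrated with a minimal sketch. It is not the actual sanitize.py: the rule tables, the `sanitize_text` function, the `ip_map` argument (standing in for mappings parsed from the inventory), and the `[REDACTED-IP]` placeholder are all assumptions made for illustration. It shows specific replacements running first, word boundaries preventing partial matches, and the RFC1918 catch-all running last.

```python
import re

# Hypothetical rules for illustration; the real rule set lives in sanitize.py.
STATIC_REPLACEMENTS = {
    "svc-backup": "service-account",
}
REGEX_REPLACEMENTS = [
    # \b keeps "node1" from matching inside "node10".
    (re.compile(r"\bnode1\b"), "host-a"),
]
# RFC 1918 private ranges: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16.
RFC1918 = re.compile(
    r"\b(?:10(?:\.\d{1,3}){3}"
    r"|172\.(?:1[6-9]|2\d|3[01])(?:\.\d{1,3}){2}"
    r"|192\.168(?:\.\d{1,3}){2})\b"
)


def sanitize_text(text, ip_map):
    """Apply replacements from most specific to most general."""
    # 1. Inventory-derived IP mappings (ip_map is an assumed input shape).
    for internal_ip, public_label in ip_map.items():
        text = re.sub(rf"\b{re.escape(internal_ip)}\b", public_label, text)
    # 2. Static string replacements.
    for old, new in STATIC_REPLACEMENTS.items():
        text = text.replace(old, new)
    # 3. Regex replacements with word boundaries.
    for pattern, new in REGEX_REPLACEMENTS:
        text = pattern.sub(new, text)
    # 4. Catch-all: any private IP that survived the earlier steps.
    return RFC1918.sub("[REDACTED-IP]", text)
```

Running the catch-all last means a mapped address keeps its intended public label, while any unmapped private address is still redacted.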
Layer 2: verify-sanitization.py (DLP scanner)
- Scans all .md files in build/docs/ after sanitization
- Checks for known internal string patterns:
  - RFC1918 IPs (192.168.x.x)
  - Internal node naming patterns
  - Service account names
  - Personal identifiers (usernames, emails, real names)
  - Hardware make/model identifiers
  - OS version specifics
- Prints file path and line number for each leak
- Exits with code 1 if any leak found — pipeline halts, nothing deploys
- Prints PASS if clean
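The scanner behaviour listed above can be sketched as follows. This is an illustrative outline under stated assumptions, not the real verify-sanitization.py: the `LEAK_PATTERNS` table, the `svc-` prefix, and the function names `find_leaks` and `main` are hypothetical. The sketch walks every .md file, reports the file path and line number of each match, and returns exit code 1 on any leak.

```python
import re
from pathlib import Path

# Hypothetical leak patterns; the real list is maintained in verify-sanitization.py.
LEAK_PATTERNS = {
    "RFC1918 IP": re.compile(r"\b192\.168\.\d{1,3}\.\d{1,3}\b"),
    "service account": re.compile(r"\bsvc-[a-z0-9-]+\b"),
}


def find_leaks(root):
    """Return (path, line number, label) for every pattern hit under root."""
    leaks = []
    for path in sorted(Path(root).rglob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            for label, pattern in LEAK_PATTERNS.items():
                if pattern.search(line):
                    leaks.append((path, lineno, label))
    return leaks


def main(root="build/docs"):
    """Print each leak and return the process exit code (1 = leak, 0 = clean)."""
    leaks = find_leaks(root)
    for path, lineno, label in leaks:
        print(f"LEAK {path}:{lineno}: {label}")
    if leaks:
        return 1
    print("PASS")
    return 0

# A script entry point would pass the result on, e.g. sys.exit(main()).
```

Because the scanner only knows what must not appear, it needs no knowledge of how the replacement rules are written.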
Pipeline integration
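The gate can be pictured as a fail-fast sequence of steps. The sketch below is an assumption-laden illustration rather than the actual pipeline definition: `run_pipeline`, `placeholder`, and `EXAMPLE_STEPS` are hypothetical names, and the placeholder commands only print a message instead of invoking the real tools.

```python
import subprocess
import sys


def run_pipeline(steps):
    """Run (name, command) steps in order and stop at the first failure.

    Returns (succeeded, names_of_completed_steps).
    """
    completed = []
    for name, command in steps:
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"{name} exited with code {result.returncode}; pipeline halted")
            return False, completed
        completed.append(name)
    return True, completed


def placeholder(message):
    """Stand-in command that only prints; real steps would call the actual tools."""
    return [sys.executable, "-c", f"print({message!r})"]


# Hypothetical step list. Only the ordering matters here:
# verification sits between sanitization and anything that publishes.
EXAMPLE_STEPS = [
    ("sanitize", placeholder("sanitize.py would run here")),
    ("verify", placeholder("verify-sanitization.py would run here")),
    ("publish", placeholder("mike and wrangler would run here")),
]
```
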
If verify fails, mike and wrangler never run. The public site is not updated.
Rationale
- Two independent layers reduce the risk of a single rule regression leaking internal data
- verify-sanitization.py is pattern-based (what should NOT appear), independent of sanitize.py's replacement logic (what SHOULD be replaced) — different failure modes
- Pipeline gate ensures zero-leak-tolerance — no silent deployment of unsanitized content
- Simple Python scripts with no dependencies — no external tool installation required
Future: Layer 3 (planned)
- truffleHog or gitleaks: credential scanning as a third verification layer
- Would catch API keys, tokens, and secrets that neither sanitize.py nor verify-sanitization.py is designed to detect
- Planned for Phase 3 when ADO pipelines are fully operational
Alternatives considered
- Manual review before publish: does not scale, will be forgotten
- Single sanitize-only approach: no feedback when rules are incomplete — leaks are silent
- truffleHog only: credential-focused, does not catch hostname/IP/hardware leaks
Consequences
- Every new internal string type requires both a sanitize rule AND a verify pattern
- verify-sanitization.py must be updated when new categories of internal data are introduced
- Pipeline runs are slightly longer (verify scan adds ~1 second)
- False positives in verify are possible — must be triaged, not suppressed
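One way to keep the first consequence from being forgotten is a small consistency check over category names. The sketch below assumes each script could expose the set of categories it covers; the category names and the `coverage_gaps` function are hypothetical and not part of either script.

```python
# Hypothetical category names; neither script is known to expose such sets.
SANITIZE_CATEGORIES = {"rfc1918_ip", "node_name", "service_account"}
VERIFY_CATEGORIES = {"rfc1918_ip", "node_name", "service_account", "hardware"}


def coverage_gaps(sanitize_categories, verify_categories):
    """List categories that are covered by one layer but not the other."""
    return {
        "missing_sanitize_rule": sorted(verify_categories - sanitize_categories),
        "missing_verify_pattern": sorted(sanitize_categories - verify_categories),
    }
```

Sharing only category names, not rule logic, keeps the two layers independent while still flagging a category that was added to one side only.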
References
- ADR-0008: CanEast naming convention
- ADR-0017: Public documentation pipeline