ADR-0018: Sanitization verification strategy — two-layer DLP for public docs

Status

Accepted — 2026-04-03

Context

Public documentation on peries.ca is derived from internal docs via sanitize.py (see ADR-0017). The sanitization engine applies replacement rules to strip hostnames, IPs, service account names, hardware details, and credentials before MkDocs builds the public site.

A single-layer approach (sanitize only) is insufficient — new internal strings added to docs may not be covered by existing rules, and rule regressions are invisible until the public site is already deployed.

Decision

Two-layer approach: sanitize then verify.

Layer 1 — sanitize.py (replacement engine)

  • Parses docs/_index.md inventory at runtime for IP and port mappings
  • Applies static string replacements (hardware, service accounts, personal names)
  • Applies regex replacements with word boundaries (prevents partial matches)
  • Applies node name pattern transforms (ADR-0008 CanEast naming)
  • Applies credential patterns (tokens, keys, passwords, emails)
  • RFC1918 IP catch-all as final safety net
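
The ordering of the replacement passes above can be sketched roughly as follows. This is a minimal illustration, not the actual sanitize.py: the rule tables, the placeholder strings, and the function name `sanitize` are all hypothetical, and the inventory parsing, node name transforms, and credential patterns are left out.

```python
import re

# Hypothetical rules for illustration only; the real rules live in sanitize.py.
STATIC_REPLACEMENTS = {
    "svc-backup": "service-account",
    "Dell R740": "server-hardware",
}

# Word boundaries keep "node1" from matching inside "node10".
REGEX_REPLACEMENTS = [
    (re.compile(r"\bnode1\b"), "host-a"),
]

# RFC1918 catch-all: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16.
RFC1918 = re.compile(
    r"\b(?:10\.\d{1,3}\.\d{1,3}\.\d{1,3}"
    r"|172\.(?:1[6-9]|2\d|3[01])\.\d{1,3}\.\d{1,3}"
    r"|192\.168\.\d{1,3}\.\d{1,3})\b"
)


def sanitize(text: str) -> str:
    """Apply static rules, then regex rules, then the IP catch-all."""
    for needle, replacement in STATIC_REPLACEMENTS.items():
        text = text.replace(needle, replacement)
    for pattern, replacement in REGEX_REPLACEMENTS:
        text = pattern.sub(replacement, text)
    # The catch-all runs last so it only sees IPs no earlier rule handled.
    return RFC1918.sub("x.x.x.x", text)
```

Running the catch-all last means any private address that slipped past the specific mappings is still masked, at the cost of losing the per-host placeholder.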

Layer 2 — verify-sanitization.py (DLP scanner)

  • Scans all .md files in build/docs/ after sanitization
  • Checks for known internal string patterns:
      • RFC1918 IPs (192.168.x.x)
      • Internal node naming patterns
      • Service account names
      • Personal identifiers (usernames, emails, real names)
      • Hardware make/model identifiers
      • OS version specifics
  • Prints file path and line number for each leak
  • Exits with code 1 if any leak found — pipeline halts, nothing deploys
  • Prints PASS if clean
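
A scanner with the behaviour listed above might look something like this. It is a sketch under assumptions, not the real verify-sanitization.py: the two patterns, the `scan` and `main` names, and the report format are hypothetical. Only the build/docs/ path, the file and line reporting, the PASS message, and the exit code of 1 come from this ADR.

```python
import re
from pathlib import Path

# Hypothetical leak patterns; the real list lives in verify-sanitization.py.
LEAK_PATTERNS = {
    "RFC1918 IP": re.compile(r"\b192\.168\.\d{1,3}\.\d{1,3}\b"),
    "service account": re.compile(r"\bsvc-[a-z0-9-]+\b"),
}


def scan(root: Path) -> list[str]:
    """Return one 'path:line: label' entry per suspected leak under root."""
    leaks = []
    for path in sorted(root.rglob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            for label, pattern in LEAK_PATTERNS.items():
                if pattern.search(line):
                    leaks.append(f"{path}:{lineno}: {label}")
    return leaks


def main(root: str = "build/docs") -> int:
    """Print every leak and return 1, or print PASS and return 0."""
    leaks = scan(Path(root))
    for leak in leaks:
        print(leak)
    if leaks:
        return 1
    print("PASS")
    return 0

# As a script entry point, the return value would be passed to sys.exit().
```

Returning the exit code from `main` rather than calling `sys.exit` inside it keeps the scanner easy to call from a test.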

Pipeline integration

sanitize.py → verify-sanitization.py → mike deploy → wrangler deploy
If verify fails, mike and wrangler never run. The public site is not updated.
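
The gate amounts to running the steps in order and stopping at the first non-zero exit code. A minimal sketch of that behaviour follows, assuming a hypothetical `run_pipeline` helper. The commands in `STEPS` mirror the chain above with their arguments deliberately omitted, so they are not complete invocations of mike or wrangler.

```python
import subprocess
import sys


def run_pipeline(steps: list[list[str]]) -> int:
    """Run each command in order; stop at the first non-zero exit code."""
    for command in steps:
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"halted at: {' '.join(command)}", file=sys.stderr)
            return result.returncode
    return 0


# Hypothetical step list; real arguments for each tool are not shown here.
STEPS = [
    [sys.executable, "sanitize.py"],
    [sys.executable, "verify-sanitization.py"],
    ["mike", "deploy"],
    ["wrangler", "deploy"],
]
```

In a CI system the same effect usually comes from separate pipeline steps that fail fast, so a wrapper like this is only one way to express the gate.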

Rationale

  • Two independent layers reduce the risk of a single rule regression leaking internal data
  • verify-sanitization.py is pattern-based (what should NOT appear) and does not share sanitize.py's replacement logic (what SHOULD be replaced), so the two layers have different failure modes
  • Pipeline gate enforces zero tolerance for detected leaks: content that fails verification is never deployed
  • Simple Python scripts with no dependencies — no external tool installation required

Future — Layer 3 (planned)

  • truffleHog or gitleaks: credential scanning as a third verification layer
  • Would catch API keys, tokens, and secrets that neither sanitize.py nor verify-sanitization.py is designed to detect
  • Planned for Phase 3 when ADO pipelines are fully operational

Alternatives considered

  • Manual review before publish: does not scale and will eventually be skipped
  • Single sanitize-only approach: no feedback when rules are incomplete — leaks are silent
  • truffleHog only: credential-focused, does not catch hostname/IP/hardware leaks

Consequences

  • Every new internal string type requires both a sanitize rule AND a verify pattern
  • verify-sanitization.py must be updated when new categories of internal data are introduced
  • Pipeline runs are slightly longer (verify scan adds ~1 second)
  • False positives in verify are possible — must be triaged, not suppressed
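
One way to keep the paired-rule requirement from drifting is a small consistency check that exercises both layers with one raw sample per category. The sketch below is an assumption about how such a check could look, not part of the current scripts. The rule tables, samples, and the `check_pairs` name are hypothetical.

```python
import re

# Hypothetical rule pairs: each category has a sanitize rule and a verify pattern.
SANITIZE_RULES = {
    "service account": (re.compile(r"\bsvc-[a-z0-9-]+\b"), "service-account"),
    "RFC1918 IP": (re.compile(r"\b192\.168\.\d{1,3}\.\d{1,3}\b"), "x.x.x.x"),
}
VERIFY_PATTERNS = {
    "service account": re.compile(r"\bsvc-"),
    "RFC1918 IP": re.compile(r"\b192\.168\."),
}
# One raw sample per category, used to exercise both layers.
SAMPLES = {
    "service account": "runs as svc-backup",
    "RFC1918 IP": "listens on 192.168.1.5",
}


def check_pairs() -> list[str]:
    """Return a list of problems; empty means every category is paired and effective."""
    problems = []
    for category, sample in SAMPLES.items():
        rule = SANITIZE_RULES.get(category)
        pattern = VERIFY_PATTERNS.get(category)
        if rule is None or pattern is None:
            problems.append(f"{category}: missing sanitize rule or verify pattern")
            continue
        if not pattern.search(sample):
            problems.append(f"{category}: verify pattern misses the raw sample")
        regex, replacement = rule
        if pattern.search(regex.sub(replacement, sample)):
            problems.append(f"{category}: sanitized output still trips verify")
    return problems
```

Adding a sample for a new category without both rules makes the check report it, which turns the "both a rule AND a pattern" consequence into something a test run can catch.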

References

  • ADR-0008: CanEast naming convention
  • ADR-0017: Public documentation pipeline