
Migration note: Consolidated from ADR-0017, ADR-0018, and ADR-0033 on 2026-05-02 per ADR-0047. Original files retained at docs/adr/0017-public-docs-pipeline.md, docs/adr/0018-sanitization-verification-strategy.md, and docs/adr/0033-public-docs-security-controls.md with deprecation banners.

APPSEC-0002: Public Documentation DLP Controls

Sources

  • ADR-0017: Public documentation pipeline — CanEast sanitization via ADO CI/CD (2026-04-03)
  • ADR-0018: Sanitization verification strategy — two-layer DLP for public docs (2026-04-03)
  • ADR-0033: Public docs security controls — content classification, exclusion enforcement, and DLP hardening (2026-04-13)

| Field | Value |
|---|---|
| ID | APPSEC-0002 |
| Date | 2026-04-13 |
| Status | Accepted |
| Author | Ben Peries |
| Class | security/APPSEC |

Context

archon-docs contains internal documentation with real hostnames (caneast-site1-node2, caneast-site1-node3), IP addresses (192.168.2.x), ports, and credential paths. The platform needs a public-facing version on peries.ca for portfolio purposes without exposing internal infrastructure details.

GOV-0005 established the CanEast naming convention. This ADR defines how sanitized public docs are built, deployed automatically, and protected by multi-layer DLP controls.

UUID Hyphenated-Bypass Gap (ADR-0033 incident — 2026-04-12)

A 2026-04-12 security review found that hyphenated UUIDs, such as Infisical project IDs and machine identity IDs (for example REDACTED), were not being matched by the 32-char alphanumeric catch-all in sanitize.py CREDENTIAL_PATTERNS (\b[A-Za-z0-9+_=-]{32,}\b). The review attributed the miss to hyphens interfering with word-boundary matching in that pattern. These UUIDs were present in docs being copied to build/docs/.

The same review identified that docs/_context.md (the AI portability file containing full node inventory, real IPs, UUIDs, SSH ports, and Infisical project credentials) was being copied to build/docs/ on every pipeline run — the highest-severity item in the review.

Decision

Stage 1 — Sanitize (ADR-0017)

sanitize.py reads all markdown files and applies CanEast substitutions before build:

| Internal | Public (CanEast) |
|---|---|
| caneast-site1-node2, caneast-site1-node3 | compute-node-01, compute-node-02 |
| caneast-site1-node1 | vpn-node-01 |
| caneast-site1-mqtt1 | mqtt-broker-01 |
| caneast-site1-ot1-snr01 | ot-sensor-01 |
| caneast-site1-fw1 | firewall-01 |
| caneast-site1-jmp1 | jumpbox-01 |
| 192.168.2.x | 10.x.x.0/24 (network/mask only) |
| Real port numbers | Functional descriptions or omitted |
| caneast/prod/telegram/bot-token | secrets/prod/alerting/bot-token |
| REDACTED | REDACTED |
| dev.azure.com/caneast-platform | dev.azure.com/caneast-platform |
| CAE prefix in naming | CanEast prefix |

The script operates on a copy in a _build/ staging directory — the source docs/ directory is never modified.

shutil.copytree excludes classified paths explicitly:

shutil.copytree(DOCS_SRC, BUILD_DIR, ignore=shutil.ignore_patterns("_index.md", "_context.md", "internal"))
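
The exclusion behaviour can be checked locally with a throwaway tree. The following is a minimal sketch, not the actual sanitize.py: the function names `stage_docs` and `demo` and the sample file layout are assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path

# Names excluded from the public build, as listed in the ADR.
EXCLUDED = ("_index.md", "_context.md", "internal")


def stage_docs(docs_src: Path, build_dir: Path) -> None:
    """Copy docs_src to build_dir, skipping classified files and directories."""
    # ignore_patterns is applied to the names in every directory visited,
    # so a nested "_index.md" or "internal" is skipped as well.
    shutil.copytree(docs_src, build_dir, ignore=shutil.ignore_patterns(*EXCLUDED))


def demo() -> list[str]:
    """Stage a hypothetical docs tree and return the relative paths that were copied."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "docs"
        (src / "internal").mkdir(parents=True)
        (src / "guides").mkdir()
        (src / "_context.md").write_text("classified")
        (src / "_index.md").write_text("inventory")
        (src / "internal" / "secret.md").write_text("classified")
        (src / "guides" / "intro.md").write_text("public")
        build = Path(tmp) / "build" / "docs"
        stage_docs(src, build)
        return sorted(
            p.relative_to(build).as_posix() for p in build.rglob("*") if p.is_file()
        )
```

In this sketch only `guides/intro.md` reaches the staging directory. Because the patterns match names at any depth, a public page that happens to be called `_index.md` in a subdirectory would be dropped too.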

UUID DLP: new UUID pattern added to CREDENTIAL_PATTERNS before the alphanumeric catch-all, using lookarounds to avoid partial matches:

(re.compile(r"(?<![a-zA-Z0-9_-])[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(?![a-zA-Z0-9_-])", re.IGNORECASE), "REDACTED"),
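
A short sketch of how the lookarounds behave, using the pattern exactly as quoted above. The helper name `redact_uuids` and the sample UUID are made up for illustration and do not come from sanitize.py.

```python
import re

# Pattern copied from the ADR: a canonical 8-4-4-4-12 hex UUID that is not
# glued to a neighbouring letter, digit, underscore or hyphen.
UUID_PATTERN = re.compile(
    r"(?<![a-zA-Z0-9_-])"
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
    r"(?![a-zA-Z0-9_-])",
    re.IGNORECASE,
)


def redact_uuids(text: str) -> str:
    """Replace every standalone UUID in text with the literal REDACTED."""
    return UUID_PATTERN.sub("REDACTED", text)


SAMPLE = "123e4567-e89b-12d3-a456-426614174000"  # made-up example value

standalone = redact_uuids(f"project id: {SAMPLE}.")
uppercase = redact_uuids(SAMPLE.upper())
# A UUID-shaped run inside a longer token is left alone by the lookarounds.
embedded = redact_uuids(f"build-{SAMPLE}-artifact")
```

Here `standalone` and `uppercase` are redacted, while `embedded` is returned unchanged because the hyphens on either side fail the lookbehind and lookahead.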

Layer 1/2/3 DLP Strategy (ADR-0018)

Layer 1: sanitize.py (replacement engine)

  • Parses docs/_index.md inventory at runtime for IP and port mappings
  • Applies static string replacements (hardware, service accounts, personal names)
  • Applies regex replacements with word boundaries (prevents partial matches)
  • Applies node name pattern transforms (GOV-0005 CanEast naming)
  • Applies credential patterns (tokens, keys, passwords, emails)
  • RFC1918 IP catch-all as final safety net
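
The ordering of Layer 1 (specific replacements first, RFC1918 catch-all last) can be sketched as follows. This is an illustrative reduction, not the real script: the two hostname mappings are taken from the substitution table, while `IP_PLACEHOLDER`, `HOSTNAME_MAP` and `sanitize` are assumed names, and the runtime parsing of docs/_index.md is omitted.

```python
import re

# Two rows from the substitution table; the full mapping lives in sanitize.py.
HOSTNAME_MAP = {
    "caneast-site1-fw1": "firewall-01",
    "caneast-site1-jmp1": "jumpbox-01",
}

# Private IPv4 ranges: 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16.
RFC1918 = re.compile(
    r"\b(?:10\.\d{1,3}\.\d{1,3}\.\d{1,3}"
    r"|172\.(?:1[6-9]|2\d|3[01])\.\d{1,3}\.\d{1,3}"
    r"|192\.168\.\d{1,3}\.\d{1,3})\b"
)
IP_PLACEHOLDER = "10.x.x.x"  # assumed placeholder for the catch-all


def sanitize(text: str) -> str:
    """Apply word-bounded hostname replacements, then the RFC1918 catch-all."""
    # Longest names first, so a short name never rewrites part of a longer one.
    for internal in sorted(HOSTNAME_MAP, key=len, reverse=True):
        pattern = re.compile(rf"\b{re.escape(internal)}\b")
        text = pattern.sub(HOSTNAME_MAP[internal], text)
    # Final safety net: any private address the specific rules did not cover.
    return RFC1918.sub(IP_PLACEHOLDER, text)
```

The word boundary means `caneast-site1-fw10` is not rewritten as `firewall-010`, and public addresses such as `8.8.8.8` pass through untouched.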

Layer 2: verify-sanitization.py (DLP scanner)

  • Scans all .md files in build/docs/ after sanitization
  • Checks for known internal string patterns: RFC1918 IPs, internal node naming patterns, service account names, personal identifiers, hardware identifiers, OS version specifics
  • Prints file path and line number for each leak
  • Exits with code 1 if any leak is found, so the pipeline halts and nothing deploys
  • UUID pattern added (matching the sanitize.py pattern above) to catch hyphenated UUIDs that slip through
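
A scanner with the behaviour described for Layer 2 might look like the sketch below. It is an assumption-laden illustration rather than verify-sanitization.py itself: only two of the listed pattern families are included, and the names `LEAK_PATTERNS`, `scan` and `main` are hypothetical.

```python
import re
from pathlib import Path

# Illustrative subset of leak patterns; the real scanner checks more families.
LEAK_PATTERNS = {
    "rfc1918-ip": re.compile(
        r"\b(?:10\.\d{1,3}\.\d{1,3}\.\d{1,3}"
        r"|172\.(?:1[6-9]|2\d|3[01])\.\d{1,3}\.\d{1,3}"
        r"|192\.168\.\d{1,3}\.\d{1,3})\b"
    ),
    "uuid": re.compile(
        r"(?<![a-zA-Z0-9_-])[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-"
        r"[0-9a-f]{4}-[0-9a-f]{12}(?![a-zA-Z0-9_-])",
        re.IGNORECASE,
    ),
}


def scan(build_docs: Path) -> list[tuple[str, int, str]]:
    """Return (path, line number, pattern name) for every suspected leak."""
    findings = []
    for md in sorted(build_docs.rglob("*.md")):
        lines = md.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            for name, pattern in LEAK_PATTERNS.items():
                if pattern.search(line):
                    findings.append((md.as_posix(), lineno, name))
    return findings


def main(build_docs="build/docs") -> int:
    """Print each finding and return the exit code the pipeline should use."""
    findings = scan(Path(build_docs))
    for path, lineno, name in findings:
        print(f"{path}:{lineno}: possible leak ({name})")
    return 1 if findings else 0

# A pipeline entry point would end with: raise SystemExit(main())
```

Returning the exit code from `main` keeps the scan testable, and the non-zero status is what stops the later deploy steps.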

Layer 3: Planned (truffleHog or gitleaks)

  • Credential scanning as a third verification layer
  • Would catch API keys, tokens, and secrets that neither sanitize.py nor verify-sanitization.py are designed to detect
  • Planned for Phase 3 when ADO pipelines are fully operational

Pipeline integration

sanitize.py → assert-exclusions → verify-sanitization.py → mike deploy → wrangler deploy

If verify fails, mike and wrangler never run. The public site is not updated.
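
The fail-fast ordering can be modelled in a few lines for local dry runs. This is only a sketch of the control flow: `run_pipeline` and the `Step` alias are hypothetical, and each step is a stand-in callable returning an exit code rather than the real command.

```python
from typing import Callable, Sequence

# A step is a display name plus a callable that returns a process-style exit code.
Step = tuple[str, Callable[[], int]]


def run_pipeline(steps: Sequence[Step]) -> list[str]:
    """Run steps in order and stop after the first one that returns non-zero.

    Returns the names of the steps that were actually executed.
    """
    executed = []
    for name, action in steps:
        executed.append(name)
        if action() != 0:
            break  # later steps, including both deploys, never run
    return executed


# Simulate a run in which the DLP scan finds a leak.
ran = run_pipeline([
    ("sanitize.py", lambda: 0),
    ("assert-exclusions", lambda: 0),
    ("verify-sanitization.py", lambda: 1),
    ("mike deploy", lambda: 0),
    ("wrangler deploy", lambda: 0),
])
```

In the simulated run, `ran` stops at `verify-sanitization.py`, which matches the guarantee that the public site is not updated when verification fails.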

Stage 2 — Build

MkDocs Material builds the sanitized copy into static HTML via a separate mkdocs-public.yml:

  • site_url: https://peries.ca
  • Theme: Material default (no custom palette)
  • Plugins: search, mike
  • No internal-only pages (e.g., _index.md excluded from nav)

Stage 3 — Version

mike manages version snapshots on a gh-pages-public branch, mirroring the internal version history but with sanitized content.

Stage 4 — Deploy

Cloudflare Pages deployment via Wrangler CLI:

  • DNS managed by Terraform in archon-cloud (Cloudflare provider)
  • Deployment triggered by pipeline on successful build
  • CrowdSec bouncer active on Cloudflare

Content Classification: docs/internal/ (ADR-0033)

Files with machine-identity-level sensitivity are moved to docs/internal/, excluded from shutil.copytree:

| File | Reason |
|---|---|
| docs/reference/infisical-navigation.md → docs/internal/ | Contains machine identity UUIDs (alienware-wsl, caneast-site1-node2-terraform-runner) and Infisical project IDs |
| docs/devices/alienware-wsl.md → docs/internal/ | Contains real IP (REDACTED), SSH port REDACTED, hardware specs, MCP server UUID |
| docs/_context.md | AI portability file; excluded by ignore_patterns, never reaches build/docs/ |

Pipeline Gate: Exclusion Assertion (ADR-0033)

Independent check added to azure-pipelines.yml after sanitize.py and before verify-sanitization.py:

- script: |
    FAIL=0
    if [ -f "build/docs/_context.md" ]; then FAIL=1; fi
    if [ -d "build/docs/internal" ]; then FAIL=1; fi
    if [ -f "build/docs/_index.md" ]; then FAIL=1; fi
    if [ $FAIL -eq 1 ]; then exit 1; fi
  displayName: "Assert excluded files absent from build"

This catches ignore_patterns misconfigurations before the DLP content scan runs.
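
For running the same check outside the pipeline, a Python equivalent could look like the sketch below. It mirrors the three shell tests above and is not part of azure-pipelines.yml; the name `find_exclusion_violations` and the idea of returning the offending paths are assumptions.

```python
from pathlib import Path

# Same paths the pipeline step tests for, relative to build/docs/.
EXCLUDED_FILES = ("_context.md", "_index.md")
EXCLUDED_DIRS = ("internal",)


def find_exclusion_violations(build_docs: Path) -> list[str]:
    """Return the classified paths that are present in the staged build."""
    violations = []
    for name in EXCLUDED_FILES:
        if (build_docs / name).is_file():
            violations.append(name)
    for name in EXCLUDED_DIRS:
        if (build_docs / name).is_dir():
            violations.append(name + "/")
    return violations
```

A caller would exit non-zero when the returned list is non-empty. Unlike the shell version, this variant also reports which path tripped the gate, which may help when diagnosing an ignore_patterns regression.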

Defence in Depth Summary (ADR-0033)

Four independent layers — any one catches a leak the others miss:

  1. Exclusion: sanitize.py copytree ignore_patterns blocks classified files from entering build/docs/
  2. Content replacement: CREDENTIAL_PATTERNS (including UUID lookaround pattern) sanitizes any leaking content
  3. Pipeline assertion: independent step verifies classified files are absent before DLP scan
  4. DLP scan: verify-sanitization.py scans for any remaining internal string patterns

Alternatives Considered

Manual sanitization — rejected; error-prone, does not scale, will be forgotten.

Separate public repo — rejected; content drift between internal and public copies.

Single sanitize-only approach — rejected; no feedback when rules are incomplete, leaks are silent.

Sanitize UUIDs with the 32-char catch-all: rejected; it would require stripping hyphens before matching, which is brittle and would break UUID format in contexts where sanitization is not appropriate.

Remove _context.md from the repo entirely — rejected; context file is required for AI agent portability across sessions; exclusion from build is the correct control, not removal.

Consequences

  • sanitize.py must be maintained as internal naming evolves — new hostnames/IPs require new substitution rules
  • Every new internal string type requires both a sanitize rule AND a verify pattern
  • Every new document containing machine-identity-level data must be placed in docs/internal/ before merging to main
  • infisical-navigation.md and alienware-wsl.md are no longer in the public docs nav
  • UUID pattern in CREDENTIAL_PATTERNS will redact legitimate CanEast-scoped UUIDs if they appear in docs — acceptable trade-off for a public docs layer
  • Pipeline requires Cloudflare API token stored in Infisical at caneast/prod/cloudflare/api-token
  • mkdocs-public.yml is a separate config from the internal mkdocs.yml

References

  • GOV-0005: CanEast public naming convention
  • APPSEC-0001: Supply chain scanning (Grype + Aikido)
  • 2026-04-12 security incident log: docs/internal/secret-rotation.md