Migration note: Consolidated from ADR-0017, ADR-0018, and ADR-0033 on 2026-05-02 per ADR-0047. Original files retained at
docs/adr/0017-public-docs-pipeline.md, docs/adr/0018-sanitization-verification-strategy.md, and docs/adr/0033-public-docs-security-controls.md with deprecation banners.
APPSEC-0002: Public Documentation DLP Controls
Sources
- ADR-0017: Public documentation pipeline — CanEast sanitization via ADO CI/CD (2026-04-03)
- ADR-0018: Sanitization verification strategy — two-layer DLP for public docs (2026-04-03)
- ADR-0033: Public docs security controls — content classification, exclusion enforcement, and DLP hardening (2026-04-13)
| Field | Value |
|---|---|
| ID | APPSEC-0002 |
| Date | 2026-04-13 |
| Status | Accepted |
| Author | Ben Peries |
| Class | security/APPSEC |
Context
archon-docs contains internal documentation with real hostnames (caneast-site1-node2, caneast-site1-node3), IP addresses (192.168.2.x), ports, and credential paths. The platform needs a public-facing version on peries.ca for portfolio purposes without exposing internal infrastructure details.
GOV-0005 established the CanEast naming convention. This ADR defines how sanitized public docs are built, deployed automatically, and protected by multi-layer DLP controls.
UUID Hyphenated-Bypass Gap (ADR-0033 incident — 2026-04-12)
A 2026-04-12 security review found that the 32-character alphanumeric catch-all in sanitize.py CREDENTIAL_PATTERNS (\b[A-Za-z0-9+_=-]{32,}\b) was not redacting hyphenated UUIDs: the review attributed this to hyphens preventing word-boundary matching in the existing pattern. UUIDs such as REDACTED (Infisical project IDs and machine identity IDs) were present in docs being copied to build/docs/ and passed through unredacted.
The same review identified that docs/_context.md (the AI portability file containing full node inventory, real IPs, UUIDs, SSH ports, and Infisical project credentials) was being copied to build/docs/ on every pipeline run — the highest-severity item in the review.
Decision
Stage 1 — Sanitize (ADR-0017)
sanitize.py reads all markdown files and applies CanEast substitutions before build:
| Internal | Public (CanEast) |
|---|---|
| caneast-site1-node2, caneast-site1-node3 | compute-node-01, compute-node-02 |
| caneast-site1-node1 | vpn-node-01 |
| caneast-site1-mqtt1 | mqtt-broker-01 |
| caneast-site1-ot1-snr01 | ot-sensor-01 |
| caneast-site1-fw1 | firewall-01 |
| caneast-site1-jmp1 | jumpbox-01 |
| 192.168.2.x | 10.x.x.0/24 (network/mask only) |
| Real port numbers | Functional descriptions or omitted |
| caneast/prod/telegram/bot-token | secrets/prod/alerting/bot-token |
| REDACTED | REDACTED |
| dev.azure.com/caneast-platform | dev.azure.com/caneast-platform |
| CAE prefix in naming | CanEast prefix |
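As a rough illustration of how the hostname rows of this table could be applied, the sketch below builds a single word-bounded alternation from a mapping. The names `HOSTNAME_MAP` and `substitute_hostnames` are hypothetical and chosen for this example; the actual rule structure inside sanitize.py is not shown in this ADR and may differ.

```python
import re

# Hostname rows copied from the substitution table above.
# HOSTNAME_MAP is a hypothetical name used only in this sketch.
HOSTNAME_MAP = {
    "caneast-site1-node2": "compute-node-01",
    "caneast-site1-node3": "compute-node-02",
    "caneast-site1-node1": "vpn-node-01",
    "caneast-site1-mqtt1": "mqtt-broker-01",
    "caneast-site1-ot1-snr01": "ot-sensor-01",
    "caneast-site1-fw1": "firewall-01",
    "caneast-site1-jmp1": "jumpbox-01",
}

# One alternation, longest names first, wrapped in word boundaries so that
# "caneast-site1-node1" does not match inside "caneast-site1-node10".
_HOSTNAME_RE = re.compile(
    r"\b("
    + "|".join(re.escape(name) for name in sorted(HOSTNAME_MAP, key=len, reverse=True))
    + r")\b"
)


def substitute_hostnames(text: str) -> str:
    """Replace every known internal hostname with its public equivalent."""
    return _HOSTNAME_RE.sub(lambda m: HOSTNAME_MAP[m.group(1)], text)
```

A name that no rule covers is left untouched by this function, which is why a separate verification pass is still needed.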
The script operates on a copy in a _build/ staging directory — the source docs/ directory is never modified.
shutil.copytree excludes classified paths explicitly:
shutil.copytree(DOCS_SRC, BUILD_DIR, ignore=shutil.ignore_patterns("_index.md", "_context.md", "internal"))
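The following is a minimal, self-contained sketch of that exclusion behaviour against a throwaway directory tree. The helper names `stage_docs` and `demo` and the sample file names are assumptions made for illustration; only the `shutil.copytree` call with `shutil.ignore_patterns` mirrors the line above.

```python
import shutil
import tempfile
from pathlib import Path

# Names excluded from the staging copy, as listed in the call above.
EXCLUDED = ("_index.md", "_context.md", "internal")


def stage_docs(docs_src: Path, build_dir: Path) -> None:
    """Copy docs_src to build_dir, skipping excluded names at every depth."""
    if build_dir.exists():
        shutil.rmtree(build_dir)  # copytree expects a fresh target by default
    shutil.copytree(docs_src, build_dir, ignore=shutil.ignore_patterns(*EXCLUDED))


def demo() -> list[str]:
    """Build a sample tree (hypothetical files) and list what gets staged."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "docs"
        (src / "internal").mkdir(parents=True)
        (src / "adr").mkdir()
        for rel in ("_context.md", "_index.md", "internal/secret.md",
                    "adr/0001.md", "index.md"):
            (src / rel).write_text("placeholder\n")
        build = Path(tmp) / "build" / "docs"
        stage_docs(src, build)
        return sorted(
            p.relative_to(build).as_posix() for p in build.rglob("*") if p.is_file()
        )
```

Note that `ignore_patterns` matches entry names in every directory it visits, so a directory named `internal` is skipped wherever it appears in the tree, not only at the top level.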
UUID DLP: new UUID pattern added to CREDENTIAL_PATTERNS before the alphanumeric catch-all, using lookarounds to avoid partial matches:
(re.compile(r"(?<![a-zA-Z0-9_-])[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(?![a-zA-Z0-9_-])", re.IGNORECASE), "REDACTED"),
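To show what the lookarounds do in practice, here is a small sketch that applies the same pattern on its own. The function name `redact_uuids` is hypothetical, and the UUID in the comments is a common documentation example value, not one from this platform.

```python
import re

# Same expression as the CREDENTIAL_PATTERNS entry above.
UUID_RE = re.compile(
    r"(?<![a-zA-Z0-9_-])"
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
    r"(?![a-zA-Z0-9_-])",
    re.IGNORECASE,
)


def redact_uuids(text: str) -> str:
    """Replace standalone hyphenated UUIDs with the literal REDACTED."""
    return UUID_RE.sub("REDACTED", text)


# Standalone UUID (any case) is redacted:
#   redact_uuids("id=123e4567-e89b-12d3-a456-426614174000.")
# A UUID-shaped run embedded in a longer token is left alone, because the
# lookbehind/lookahead reject neighbouring letters, digits, "_" or "-":
#   redact_uuids("123e4567-e89b-12d3-a456-426614174000-suffix")
```

Because embedded matches are deliberately skipped, a longer identifier that merely contains a UUID would have to be caught by another rule or by the verification scan.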
Layer 1/2/3 DLP Strategy (ADR-0018)
Layer 1 — sanitize.py (replacement engine)
- Parses docs/_index.md inventory at runtime for IP and port mappings
- Applies static string replacements (hardware, service accounts, personal names)
- Applies regex replacements with word boundaries (prevents partial matches)
- Applies node name pattern transforms (GOV-0005 CanEast naming)
- Applies credential patterns (tokens, keys, passwords, emails)
- RFC1918 IP catch-all as final safety net
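The ordering of these passes can be sketched as below: specific replacements run first and the RFC1918 catch-all runs last. This is an illustration under stated assumptions, not the real sanitize.py; the rule tables, the `sanitize_text` name, the sample service-account string, and the `10.x.x.x` placeholder are all hypothetical.

```python
import re

# Hypothetical rule tables; the real ones are built inside sanitize.py.
STATIC_REPLACEMENTS = {"example-service-account": "svc-account"}
REGEX_REPLACEMENTS = [
    (re.compile(r"\bcaneast-site1-fw1\b"), "firewall-01"),
]

# Catch-all for the three RFC 1918 private ranges:
# 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16.
RFC1918_RE = re.compile(
    r"\b(?:10\.\d{1,3}\.\d{1,3}\.\d{1,3}"
    r"|172\.(?:1[6-9]|2\d|3[01])\.\d{1,3}\.\d{1,3}"
    r"|192\.168\.\d{1,3}\.\d{1,3})\b"
)


def sanitize_text(text: str) -> str:
    """Apply specific rules first, then the private-IP safety net."""
    for old, new in STATIC_REPLACEMENTS.items():
        text = text.replace(old, new)
    for pattern, new in REGEX_REPLACEMENTS:
        text = pattern.sub(new, text)
    # Final pass: any private address the earlier rules did not map.
    return RFC1918_RE.sub("10.x.x.x", text)
```

Running the catch-all last means an address with an explicit mapping gets its intended public form, while unmapped private addresses still do not reach the build output.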
Layer 2 — verify-sanitization.py (DLP scanner)
- Scans all .md files in build/docs/ after sanitization
- Checks for known internal string patterns: RFC1918 IPs, internal node naming patterns, service account names, personal identifiers, hardware identifiers, OS version specifics
- Prints file path and line number for each leak
- Exits with code 1 if any leak found — pipeline halts, nothing deploys
- UUID pattern added (matching the sanitize.py pattern above) to catch hyphenated UUIDs that slip through
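A scanner with this behaviour could look roughly like the sketch below. The pattern set is a small assumed subset (the node-name expression in particular is a guess at the naming shape), and `find_leaks` and `scan_tree` are hypothetical names; the real verify-sanitization.py checks more categories than shown here.

```python
import re
from pathlib import Path

# Assumed subset of leak patterns, for illustration only.
LEAK_PATTERNS = {
    "rfc1918-ip": re.compile(
        r"\b(?:10\.\d{1,3}\.\d{1,3}\.\d{1,3}"
        r"|172\.(?:1[6-9]|2\d|3[01])\.\d{1,3}\.\d{1,3}"
        r"|192\.168\.\d{1,3}\.\d{1,3})\b"
    ),
    "internal-node-name": re.compile(r"\bcaneast-site\d+-[a-z0-9]+\b"),
    "uuid": re.compile(
        r"(?<![a-zA-Z0-9_-])"
        r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
        r"(?![a-zA-Z0-9_-])",
        re.IGNORECASE,
    ),
}


def find_leaks(text: str) -> list[tuple[int, str, str]]:
    """Return (line number, pattern label, matched text) for every hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in LEAK_PATTERNS.items():
            for match in pattern.finditer(line):
                hits.append((lineno, label, match.group(0)))
    return hits


def scan_tree(root: Path) -> int:
    """Print each leak as path:line and return 1 if any was found, else 0."""
    leaked = False
    for path in sorted(root.rglob("*.md")):
        for lineno, label, value in find_leaks(path.read_text(encoding="utf-8")):
            print(f"{path}:{lineno}: {label}: {value}")
            leaked = True
    return 1 if leaked else 0
```

In a pipeline step the return value would be passed to the process exit, for example `raise SystemExit(scan_tree(Path("build/docs")))`, so that a non-zero code stops later stages.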
Layer 3 — Planned (truffleHog or gitleaks)
- Credential scanning as a third verification layer
- Would catch API keys, tokens, and secrets that neither sanitize.py nor verify-sanitization.py are designed to detect
- Planned for Phase 3 when ADO pipelines are fully operational
Pipeline integration
If verify fails, mike and wrangler never run. The public site is not updated.
Stage 2 — Build
MkDocs Material builds the sanitized copy into static HTML via a separate mkdocs-public.yml:
- site_url: https://peries.ca
- Theme: Material default (no custom palette)
- Plugins: search, mike
- No internal-only pages (e.g., _index.md excluded from nav)
Stage 3 — Version
mike manages version snapshots on a gh-pages-public branch, mirroring the internal version history but with sanitized content.
Stage 4 — Deploy
Cloudflare Pages deployment via Wrangler CLI:
- DNS managed by Terraform in archon-cloud (Cloudflare provider)
- Deployment triggered by pipeline on successful build
- CrowdSec bouncer active on Cloudflare
Content Classification: docs/internal/ (ADR-0033)
Files with machine-identity-level sensitivity are moved to docs/internal/, excluded from shutil.copytree:
| File | Reason |
|---|---|
| docs/reference/infisical-navigation.md → docs/internal/ | Contains machine identity UUIDs (alienware-wsl, caneast-site1-node2-terraform-runner) and Infisical project IDs |
| docs/devices/alienware-wsl.md → docs/internal/ | Contains real IP (REDACTED), SSH port REDACTED, hardware specs, MCP server UUID |
| docs/_context.md | AI portability file — excluded by ignore_patterns, never reaches build/docs/ |
Pipeline Gate: Exclusion Assertion (ADR-0033)
Independent check added to azure-pipelines.yml after sanitize.py and before verify-sanitization.py:
- script: |
    FAIL=0
    if [ -f "build/docs/_context.md" ]; then FAIL=1; fi
    if [ -d "build/docs/internal" ]; then FAIL=1; fi
    if [ -f "build/docs/_index.md" ]; then FAIL=1; fi
    if [ $FAIL -eq 1 ]; then exit 1; fi
  displayName: "Assert excluded files absent from build"
This catches ignore_patterns misconfigurations before the DLP content scan runs.
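For readers who prefer to reason about the gate outside the pipeline YAML, the same check can be expressed as a short Python sketch. This is an equivalent written for illustration, not something the pipeline is stated to run; `find_forbidden` and `main` are hypothetical names.

```python
from pathlib import Path

# Same three entries the pipeline step tests for.
FORBIDDEN_FILES = ("_context.md", "_index.md")
FORBIDDEN_DIRS = ("internal",)


def find_forbidden(build_docs: Path) -> list[str]:
    """Return excluded entries that nevertheless exist under build_docs."""
    hits = [name for name in FORBIDDEN_FILES if (build_docs / name).is_file()]
    hits += [name for name in FORBIDDEN_DIRS if (build_docs / name).is_dir()]
    return hits


def main(build_docs: Path = Path("build/docs")) -> int:
    """Report any forbidden entry; return 1 on failure, 0 otherwise."""
    hits = find_forbidden(build_docs)
    for name in hits:
        print(f"excluded entry present in build: {build_docs / name}")
    return 1 if hits else 0
```

Like the shell version, this only inspects the top level of build/docs/. It would be invoked with something like `raise SystemExit(main())` so that a hit fails the step.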
Defence in Depth Summary (ADR-0033)
Four independent layers — any one catches a leak the others miss:
- Exclusion — sanitize.py copytree ignore_patterns blocks classified files from entering build/docs/
- Content replacement — CREDENTIAL_PATTERNS (including UUID lookaround pattern) sanitizes any leaking content
- Pipeline assertion — independent step verifies classified files are absent before DLP scan
- DLP scan — verify-sanitization.py scans for any remaining internal string patterns
Alternatives Considered
Manual sanitization — rejected; error-prone, does not scale, will be forgotten.
Separate public repo — rejected; content drift between internal and public copies.
Single sanitize-only approach — rejected; no feedback when rules are incomplete, leaks are silent.
Sanitize UUIDs with the 32-char catch-all — rejected; requires removing hyphens before matching — brittle and would break UUID format in contexts where sanitization is not appropriate.
Remove _context.md from the repo entirely — rejected; context file is required for AI agent portability across sessions; exclusion from build is the correct control, not removal.
Consequences
- sanitize.py must be maintained as internal naming evolves — new hostnames/IPs require new substitution rules
- Every new internal string type requires both a sanitize rule AND a verify pattern
- Every new document containing machine-identity-level data must be placed in docs/internal/ before merging to main
- infisical-navigation.md and alienware-wsl.md are no longer in the public docs nav
- UUID pattern in CREDENTIAL_PATTERNS will redact legitimate CanEast-scoped UUIDs if they appear in docs — acceptable trade-off for a public docs layer
- Pipeline requires Cloudflare API token stored in Infisical at caneast/prod/cloudflare/api-token
- mkdocs-public.yml is a separate config from the internal mkdocs.yml
References
- GOV-0005: CanEast public naming convention
- APPSEC-0001: Supply chain scanning (Grype + Aikido)
- 2026-04-12 security incident log: docs/internal/secret-rotation.md