Deprecated — Consolidated into LLMOPS-0002 on 2026-05-02 per ADR-0047. This source file is retained as a reference; the canonical content is in LLMOPS-0002.
ADR-0039 — AI Operations Agent Plane
| Field | Value |
|---|---|
| Status | Accepted |
| Date | 2026-04-22 |
| Author | Ben Peries |
| Phases | 2, 3 |
| ADO WI | WI-321 |
| Supersedes | None |
| Related | ADR-0038 (IT observability data plane), ADR-0034 (k3s CP migration to caneast-site1-node4), ADR-0016 (k3s namespace design), ADR-0031 (CanEast AI Node workstation-as-code) |
| Epic | E3 — AI Apps / OpenClaw & OT-AI (WI-259) |
Context
The Archon Platform's stated arc is the transition from traditional IT operations to agentic AI-driven operations. ADR-0038 established the observability data plane — the machine-readable telemetry substrate required for AI agents to reason about platform state. This ADR establishes the first architectural primitive on top of that substrate: the agent plane.
The agent plane is the set of AI-driven components that consume platform telemetry, reason about platform state, and provide operational intelligence to human operators. At this stage, the agent plane is advisory only — agents explain and propose, they do not execute. Autonomous remediation is a future decision (a future ADR) that requires failure-mode analysis, blast-radius controls, and audit mechanisms not justified for a five-node platform in Phase 1.
The platform has access to multiple LLM backends already. Ollama runs on the CanEast AI Node workstation (per ADR-0031) for local inference. DashScope is configured for the public translation worker (DASHSCOPE_API_KEY in Infisical). Additional hosted backends (Groq, Anthropic) are available but not yet provisioned. A per-backend routing policy is required so each AI operations workload picks the right privacy, latency, and cost tradeoff.
Decision
Primary: adopt k8sgpt-operator as the first AI operations agent
Deploy k8sgpt-operator to the archon-monitoring namespace on caneast-site1-node4's k3s cluster. k8sgpt performs continuous AI-driven diagnostics of k3s cluster state: failing pods, misconfigured resources, PVC issues, image pull failures, and related conditions. Results are published as k8s custom resources and surfaced in Grafana (per ADR-0038) and Alertmanager routing.
Secondary: multi-backend LLM strategy
The agent plane does not commit to a single LLM provider. Different AI operations workloads have different privacy, latency, and cost profiles. The platform adopts a per-use-case backend routing policy.
Available backends:
| Backend | Type | Authentication |
|---|---|---|
| Ollama (local) | Self-hosted, sovereign | None (network-local) |
| Alibaba DashScope (qwen-turbo, qwen-max) | Hosted | DASHSCOPE_API_KEY in Infisical |
| Groq (Llama 3.x, Mixtral) | Hosted, high-speed | API key in Infisical (to be provisioned) |
| Anthropic API (Claude) | Hosted | API key in Infisical (to be provisioned) |
Routing policy:
| Use case | Backend | Rationale |
|---|---|---|
| k8sgpt cluster diagnostics | Ollama | Cluster topology is sovereign; data does not leave environment |
| OT-AI agents (future) | Ollama (MANDATORY) | OT telemetry is regulated-equivalent data; hard rule |
| Archy Infra natural-language queries (future) | Ollama default, Groq optional for speed | IT metadata is sensitive; local default, hosted only for explicit low-sensitivity queries |
| Public-facing translation (peries.ca) | DashScope qwen-turbo | Already decided; public content; no sovereignty concern |
| Platform documentation drafting | Anthropic API or hosted | Public platform docs, not operational data |
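The routing table above could be encoded as data so that every caller resolves its backend the same way. The following is a minimal sketch, not an existing platform component: the use-case keys, the backend labels, and the `pick_backend` helper are all hypothetical names chosen for illustration.

```python
from typing import NamedTuple, Optional


class Route(NamedTuple):
    default: str                    # backend used unless the caller opts in to the alternative
    optional: Optional[str] = None  # hosted alternative for explicitly low-sensitivity input
    mandatory: bool = False         # True: the default may never be overridden


# Hypothetical identifiers mirroring the rows of the routing policy table.
ROUTING_POLICY = {
    "k8sgpt-diagnostics": Route(default="ollama"),
    "ot-ai-agents": Route(default="ollama", mandatory=True),
    "archy-infra-queries": Route(default="ollama", optional="groq"),
    "public-translation": Route(default="dashscope"),
    "docs-drafting": Route(default="anthropic"),
}


def pick_backend(use_case: str, low_sensitivity: bool = False) -> str:
    """Return the backend label for a use case.

    Unknown use cases raise KeyError, so an unlisted workload fails closed
    instead of silently reaching a hosted backend.
    """
    route = ROUTING_POLICY[use_case]
    if route.optional and low_sensitivity and not route.mandatory:
        return route.optional
    return route.default
```

Under these assumptions, `pick_backend("archy-infra-queries")` resolves to the local default, and the hosted alternative is returned only when the caller explicitly passes `low_sensitivity=True`.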
Tertiary: sovereignty rule
Any AI operations workload that consumes OT telemetry, cluster internals, secrets metadata, or identifiable platform topology MUST use a local backend (Ollama). Hosted backends are permitted only for workloads where the input data is public or explicitly sanitized.
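One way to make the sovereignty rule enforceable rather than advisory is a guard that runs before any request leaves the caller. The sketch below assumes inputs are tagged with data-class labels; the label names, `LOCAL_BACKENDS`, and `check_backend` are hypothetical and not part of any existing tool. Sanitized input is assumed to be re-tagged (for example as `public`) by whatever performs the sanitization.

```python
from typing import Iterable

# Assumption: only Ollama counts as a local backend.
LOCAL_BACKENDS = {"ollama"}

# Hypothetical labels for the data classes named in the sovereignty rule.
SOVEREIGN_DATA_CLASSES = {
    "ot-telemetry",
    "cluster-internals",
    "secrets-metadata",
    "platform-topology",
}


class SovereigntyViolation(Exception):
    """Raised when sovereign data would be sent to a non-local backend."""


def check_backend(backend: str, data_classes: Iterable[str]) -> None:
    """Raise SovereigntyViolation if any sovereign data class targets a hosted backend."""
    sensitive = SOVEREIGN_DATA_CLASSES & set(data_classes)
    if sensitive and backend not in LOCAL_BACKENDS:
        raise SovereigntyViolation(
            f"backend {backend!r} is not local; blocked data classes: {sorted(sensitive)}"
        )
```

Raising an exception instead of silently rerouting keeps the violation visible to the operator, which matches the intent of a hard rule.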
Quaternary: Ollama placement (interim)
k8sgpt's interim LLM backend is the Ollama instance on the CanEast AI Node workstation at REDACTED:11434. This is a known-degraded dependency:
- CanEast AI Node is not a 24/7 server. When suspended, k8sgpt analyses requiring LLM explanation will fail.
- GPU inference is fast when CanEast AI Node is awake, but unreliable by the standards of a platform service.
Pre-flight requirement: A 2026-04-22 review identified that Ollama on REDACTED:[REDACTED] is currently not reachable from caneast-site1-node2 (Uptime Kuma monitor #15 timing out). Before k8sgpt implementation begins, WI-318 (Ollama reachability audit from caneast-site1-node4, PRE-FLIGHT) must resolve:
- Confirm Ollama is listening on 0.0.0.0 (not 127.0.0.1 only)
- Confirm the WSL network path exposes the port to the LAN
- Confirm firewall permits caneast-site1-node4 → CanEast AI Node:[REDACTED]
If any of these fail, the interim placement is invalid and a follow-up Ollama placement ADR must be written before k8sgpt implementation proceeds.
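A plain TCP connect from caneast-site1-node4 gives a quick end-to-end signal for the three pre-flight checks: it succeeds only if Ollama is listening on a LAN-visible address, the WSL path exposes the port, and the firewall permits the connection. The sketch below is an illustration rather than the WI-318 procedure. The host and port are passed in by the operator because the address is redacted in this document, and a failed probe does not indicate which of the three checks failed.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, DNS failure, and timeout.
        return False
```

The probe has to be run on caneast-site1-node4 itself, for example from a Python shell with the workstation address and Ollama port as arguments, because reachability from another node says nothing about this network path.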
Acceptable degraded state: when Ollama is unreachable (suspended CanEast AI Node or network break), k8sgpt falls back to non-LLM analysis mode; it still reports detected problems, just without natural-language explanation. The platform remains functional. This is documented in an operational runbook.
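The degraded state can be pictured as a report that always carries the detected problem and only sometimes carries an explanation. The sketch below illustrates that pattern for components the platform builds itself. It does not describe k8sgpt's internals, and `Report`, `build_report`, and the `explain` callable are hypothetical names.

```python
from typing import Callable, NamedTuple, Optional


class Report(NamedTuple):
    problem: str                # the detected condition, always present
    explanation: Optional[str]  # natural-language explanation, None when degraded
    degraded: bool              # True when the LLM backend could not be reached


def build_report(problem: str, explain: Callable[[str], str]) -> Report:
    """Attach an LLM explanation when possible; otherwise pass the finding through."""
    try:
        return Report(problem, explain(problem), degraded=False)
    except OSError:
        # ConnectionError and TimeoutError are subclasses of OSError, so a
        # suspended workstation or a network break both land here.
        return Report(problem, None, degraded=True)
```

Keeping an explicit `degraded` flag lets dashboards and runbooks distinguish "no explanation available" from "nothing to explain".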
A dedicated Ollama deployment decision is deferred to a follow-up ADR. Candidate targets include:
- caneast-site1-node4 CPU inference — always-on, slow, uses small models (phi3:mini, qwen2.5:3b)
- Dedicated GPU node — hardware acquisition required
- Status quo with monitoring — treat Ollama unreachability as a non-critical degraded state
Quinary: agent plane boundaries
The agent plane is advisory only at this stage. Agents surface findings, explain state, and propose remediation. They do NOT execute changes. Autonomous remediation, self-healing, and closed-loop control require a future ADR with explicit scope, safety rails, and audit mechanisms.
Rationale
Why k8sgpt, not a custom build
k8sgpt is an established open-source project with native Ollama support, built-in k8s analyzers for common failure modes, PII filtering before LLM submission, and an operator pattern for continuous scanning. Building equivalent capability from scratch would consume Phase 2 entirely without differentiation. Adopting k8sgpt lets platform development focus on Archy Infra — the natural-language query layer over the broader observability stack — which is the genuinely novel piece.
Why multi-backend, not single-backend
Committing the platform to a single LLM provider creates migration-risk coupling. The multi-backend policy lets each use case pick the right tradeoff independently and treats LLMs as interchangeable infrastructure rather than a vendor choice. It also aligns with the platform's broader principle of AI portability.
Why advisory-only at this stage
Autonomous remediation in a platform with live OT telemetry and real infrastructure is a categorically different decision from diagnostic advice. It requires failure-mode analysis, blast-radius controls, and operator approval workflows that are not justified for a five-node fleet in Phase 1. The platform's credibility as a portfolio piece depends on demonstrating judgment about autonomy boundaries, not on maximizing autonomy.
Why interim Ollama on CanEast AI Node is acceptable, conditional
k8sgpt's fallback to non-LLM mode means the platform degrades gracefully rather than breaking when the LLM backend is unreachable. This makes the interim dependency acceptable in principle. It is acceptable in practice only if the pre-flight reachability audit (WI-318) confirms Ollama is actually reachable from caneast-site1-node4. If not, the interim plan is invalid and the follow-up placement ADR is required first.
Why the sovereignty rule is a hard rule
The Archon Platform includes OT telemetry that represents real operational data. The portfolio story depends on demonstrating that the platform can handle regulated-equivalent data with appropriate controls. A soft "prefer local" guideline would not be credible. A hard rule that sovereignty-sensitive workloads use local inference is both the right technical choice and the right portfolio signal.
Consequences
Positive
- First AI operations primitive deployed on the substrate ADR-0038 provides.
- Multi-backend policy documented; future AI work has clear routing guidance.
- Sovereignty rule established before any agent consumes OT data.
- Advisory-only boundary makes autonomy expansion a deliberate future decision, not a drift.
- Strong portfolio narrative: data plane (ADR-0038) + agent plane (this ADR) as a matched architectural pair demonstrating the Archon transition-to-agentic-AI arc.
Negative / risks
- Ollama-on-CanEast AI Node is a conditional dependency. Interim placement is only valid if WI-318 reachability audit passes. If it fails, k8sgpt implementation is blocked until a follow-up placement ADR exists.
- Multi-backend implementation complexity. Each backend requires credential management (Infisical), health monitoring, and a routing mechanism. Future Archy Infra must be built with this abstraction from day one.
- k8sgpt requires populated k3s. Until ADR-0034's k3s migration completes and workloads are deployed to k3s, k8sgpt has minimal surface area to analyze. Initial deployment may appear underwhelming; value increases as k3s workloads grow.
- LLM inference cost is non-zero, even local. Ollama GPU time on CanEast AI Node is effectively free in direct spend but still non-trivial as a resource. Hosted backends (Groq, DashScope, Anthropic) have real per-call costs. A cost monitoring runbook is required before hosted backends see production use.
Out of scope
- Autonomous remediation. Deferred to a future ADR.
- Archy Infra implementation. Separate work, tracked in E3 epic (WI-259). The LLM routing policy in this ADR applies when Archy Infra is built.
- OT-AI agents. The sovereignty rule applies when they are built; implementation is a future ADR.
- Dedicated Ollama hardware decision. Deferred to follow-up ADR if WI-318 invalidates the interim plan or if operational experience shows the CanEast AI Node dependency is too fragile.
Implementation tracking
Parent epic: E3 (WI-259).
Work items to be created when implementation is sequenced (NOT in this session):
- Ollama reachability audit from caneast-site1-node4 (WI-318, PRE-FLIGHT — blocks k8sgpt deployment)
- k8sgpt-operator Helm deployment to archon-monitoring namespace
- k8sgpt configuration pointing at confirmed Ollama endpoint
- Groq API key provisioning in Infisical (archon-platform project)
- Anthropic API key provisioning in Infisical (archon-platform project)
- LLM backend routing policy documentation as a platform runbook
- Ollama-unavailable runbook: expected k8sgpt behavior in degraded state
- k8sgpt Results CR surfacing in Grafana (ADR-0038 Grafana instance)
- k8sgpt Results routing through Alertmanager for critical findings
- PII filter configuration verification for k8sgpt
- Cost monitoring runbook for hosted LLM backends