Deprecated — Consolidated into LLMOPS-0002 on 2026-05-02 per ADR-0047. This source file is retained as a reference; the canonical content is in LLMOPS-0002.

ADR-0039 — AI Operations Agent Plane

| Field | Value |
| --- | --- |
| Status | Accepted |
| Date | 2026-04-22 |
| Author | Ben Peries |
| Phases | 2, 3 |
| ADO WI | WI-321 |
| Supersedes | None |
| Related | ADR-0038 (IT observability data plane), ADR-0034 (k3s CP migration to caneast-site1-node4), ADR-0016 (k3s namespace design), ADR-0031 (CanEast AI Node workstation-as-code) |
| Epic | E3 — AI Apps / OpenClaw & OT-AI (WI-259) |

Context

The Archon Platform's stated arc is the transition from traditional IT operations to agentic AI-driven operations. ADR-0038 established the observability data plane — the machine-readable telemetry substrate required for AI agents to reason about platform state. This ADR establishes the first architectural primitive on top of that substrate: the agent plane.

The agent plane is the set of AI-driven components that consume platform telemetry, reason about platform state, and provide operational intelligence to human operators. At this stage, the agent plane is advisory only: agents explain and propose, they do not execute. Autonomous remediation is deferred to a future ADR because it requires failure-mode analysis, blast-radius controls, and audit mechanisms that are not justified for a five-node platform in Phase 1.

The platform already has access to multiple LLM backends. Ollama runs on the CanEast AI Node workstation (per ADR-0031) for local inference. DashScope is configured for the public translation worker (DASHSCOPE_API_KEY in Infisical). Additional hosted backends (Groq, Anthropic) are available but not yet provisioned. A per-use-case routing policy is required so each AI operations workload picks the right privacy, latency, and cost tradeoff.

Decision

Primary: adopt k8sgpt-operator as the first AI operations agent

Deploy k8sgpt-operator to the archon-monitoring namespace on caneast-site1-node4's k3s cluster. k8sgpt performs continuous AI-driven diagnostics of k3s cluster state — failing pods, misconfigured resources, PVC issues, image pull failures, and related conditions. Results are published as k8s custom resources and surfaced in Grafana (per ADR-0038) and Alertmanager routing.

Secondary: multi-backend LLM strategy

The agent plane does not commit to a single LLM provider. Different AI operations workloads have different privacy, latency, and cost profiles. The platform adopts a per-use-case backend routing policy.

Available backends:

| Backend | Type | Authentication |
| --- | --- | --- |
| Ollama (local) | Self-hosted, sovereign | None (network-local) |
| Alibaba DashScope (qwen-turbo, qwen-max) | Hosted | DASHSCOPE_API_KEY in Infisical |
| Groq (Llama 3.x, Mixtral) | Hosted, high-speed | API key in Infisical (to be provisioned) |
| Anthropic API (Claude) | Hosted | API key in Infisical (to be provisioned) |

Routing policy:

| Use case | Backend | Rationale |
| --- | --- | --- |
| k8sgpt cluster diagnostics | Ollama | Cluster topology is sovereign; data does not leave the environment |
| OT-AI agents (future) | Ollama (MANDATORY) | OT telemetry is regulated-equivalent data; hard rule |
| Archy Infra natural-language queries (future) | Ollama default, Groq optional for speed | IT metadata is sensitive; local by default, hosted only for explicitly low-sensitivity queries |
| Public-facing translation (peries.ca) | DashScope qwen-turbo | Already decided; public content; no sovereignty concern |
| Platform documentation drafting | Anthropic API or other hosted backend | Public platform docs, not operational data |
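
The routing policy above can be expressed as a small lookup table. The following is a minimal sketch, not an existing platform component: the use-case keys, the `Route` type, and the `select_backend` function are hypothetical names introduced for illustration, and treating DashScope and Groq as the "other hosted" options for documentation drafting is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Route:
    """One row of the routing policy (hypothetical type)."""
    default: str                    # backend used when no preference is given
    optional: Tuple[str, ...] = ()  # backends the caller may opt into


# Keys are illustrative identifiers for the use cases in the policy table.
ROUTING_POLICY = {
    "k8sgpt-diagnostics": Route("ollama"),
    "ot-ai-agents": Route("ollama"),  # MANDATORY: no optional backends
    "archy-infra-queries": Route("ollama", optional=("groq",)),
    "public-translation": Route("dashscope"),
    # Assumption: "other hosted backend" means the remaining hosted options.
    "docs-drafting": Route("anthropic", optional=("dashscope", "groq")),
}


def select_backend(use_case: str, prefer: Optional[str] = None) -> str:
    """Return the backend for a use case, failing closed on anything unlisted."""
    route = ROUTING_POLICY[use_case]  # unknown use case raises KeyError
    if prefer is None or prefer == route.default:
        return route.default
    if prefer in route.optional:
        return prefer
    raise ValueError(f"backend {prefer!r} is not permitted for {use_case!r}")
```

Under this sketch, `select_backend("archy-infra-queries", prefer="groq")` returns `"groq"`, while `select_backend("ot-ai-agents", prefer="groq")` raises `ValueError`. Keeping the policy as data rather than branching logic means a new use case is a one-line addition.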

Tertiary: sovereignty rule

Any AI operations workload that consumes OT telemetry, cluster internals, secrets metadata, or identifiable platform topology MUST use a local backend (Ollama). Hosted backends are permitted only for workloads where the input data is public or explicitly sanitized.
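
One way to make the sovereignty rule enforceable rather than advisory is a guard that runs before any request leaves the environment. This is a sketch under stated assumptions: the data-class labels, the `SovereigntyViolation` exception, and `enforce_sovereignty` are hypothetical, and the guard takes the conservative reading that the four sensitive classes always require a local backend, even when the caller claims the input was sanitized.

```python
from typing import Iterable


class SovereigntyViolation(Exception):
    """Raised when a workload would send restricted data to a hosted backend."""


# Hypothetical labels for the data classes named in the sovereignty rule.
SENSITIVE_CLASSES = frozenset(
    {"ot-telemetry", "cluster-internals", "secrets-metadata", "platform-topology"}
)
PUBLIC_CLASSES = frozenset({"public"})
LOCAL_BACKENDS = frozenset({"ollama"})


def enforce_sovereignty(
    backend: str, data_classes: Iterable[str], sanitized: bool = False
) -> None:
    """Return silently if the call is allowed; raise SovereigntyViolation if not."""
    classes = set(data_classes)
    if backend in LOCAL_BACKENDS:
        return  # local inference is always permitted
    restricted = classes & SENSITIVE_CLASSES
    if restricted:
        raise SovereigntyViolation(
            f"{sorted(restricted)} must use a local backend, not {backend!r}"
        )
    # Hosted backends: input must be entirely public or explicitly sanitized.
    # An empty classification is treated as unknown and fails closed.
    if classes and (classes <= PUBLIC_CLASSES or sanitized):
        return
    raise SovereigntyViolation(
        f"input for {backend!r} is neither public nor explicitly sanitized"
    )
```

For example, `enforce_sovereignty("dashscope", {"public"})` passes, while `enforce_sovereignty("groq", {"cluster-internals"}, sanitized=True)` raises. Failing closed on unclassified input keeps a missing label from silently becoming a hosted call.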

Quaternary: Ollama placement (interim)

k8sgpt's interim LLM backend is the Ollama instance on the CanEast AI Node workstation at REDACTED:11434. This is a known-degraded dependency:

  • CanEast AI Node is not a 24/7 server. When suspended, k8sgpt analyses requiring LLM explanation will fail.
  • GPU inference is fast when CanEast AI Node is awake, but unreliable by the standards of a platform service.

Pre-flight requirement: A 2026-04-22 review identified that Ollama on REDACTED:[REDACTED] is currently not reachable from caneast-site1-node2 (Uptime Kuma monitor #15 timing out). Before k8sgpt implementation begins, WI-318 (Ollama reachability audit from caneast-site1-node4, PRE-FLIGHT) must resolve:

  1. Confirm Ollama is listening on 0.0.0.0 (not 127.0.0.1 only)
  2. Confirm the WSL network path exposes the port to the LAN
  3. Confirm firewall permits caneast-site1-node4 → CanEast AI Node:[REDACTED]

If any of these fail, the interim placement is invalid and a follow-up Ollama placement ADR must be written before k8sgpt implementation proceeds.
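
The three pre-flight checks share one observable outcome: a TCP connection from caneast-site1-node4 to the Ollama port either succeeds or it does not. The sketch below is one possible probe for the audit, not the WI-318 procedure itself; `tcp_reachable` is a hypothetical helper, and the host must be supplied by the operator because the address is redacted in this document.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Run from the node that will host k8sgpt. A success implies the service is
    listening on a reachable interface, the port is exposed to the LAN, and
    the firewall permits the path. A failure does not say which check failed.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and routing errors
        return False
```

A call such as `tcp_reachable("<ollama-host>", 11434)` returns a boolean that can be recorded in the audit. Because a `False` result is ambiguous, the individual checks (bind address, WSL port exposure, firewall rule) still need to be inspected on the workstation when the probe fails.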

Acceptable degraded state: when Ollama is unreachable (suspended CanEast AI Node or network break), k8sgpt falls back to non-LLM analysis mode. It still reports detected problems, just without natural-language explanation. The platform remains functional. This behavior will be documented in an operational runbook.
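
The degraded-state contract can be illustrated with a short sketch. This is not k8sgpt's implementation; it only models the behavior the runbook should expect, namely that findings are always reported and the explanation is simply absent when the LLM backend is unreachable. The `Finding` type and `explain_findings` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Finding:
    resource: str                      # e.g. "pod/example"
    problem: str                       # detected condition, always present
    explanation: Optional[str] = None  # natural-language text, LLM-dependent


def explain_findings(
    findings: List[Finding], explain: Callable[[str], str]
) -> List[Finding]:
    """Attach explanations where possible; never drop a finding.

    `explain` stands in for a call to the LLM backend and is expected to
    raise OSError (connection refused, timeout) when it is unreachable.
    """
    results = []
    llm_available = True
    for finding in findings:
        text = None
        if llm_available:
            try:
                text = explain(finding.problem)
            except OSError:
                llm_available = False  # stop retrying for the rest of the batch
        results.append(Finding(finding.resource, finding.problem, text))
    return results
```

With a backend that raises `ConnectionRefusedError`, every input finding is still returned with `explanation` set to `None`, which is the state the degraded-mode runbook needs to describe.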

A dedicated Ollama deployment decision is deferred to a follow-up ADR. Candidate targets include:

  1. caneast-site1-node4 CPU inference — always-on, slow, uses small models (phi3:mini, qwen2.5:3b)
  2. Dedicated GPU node — hardware acquisition required
  3. Status quo with monitoring — treat Ollama unreachability as a non-critical degraded state

Quinary: agent plane boundaries

The agent plane is advisory only at this stage. Agents surface findings, explain state, and propose remediation. They do NOT execute changes. Autonomous remediation, self-healing, and closed-loop control require a future ADR with explicit scope, safety rails, and audit mechanisms.

Rationale

Why k8sgpt, not a custom build

k8sgpt is an established open-source project with native Ollama support, built-in k8s analyzers for common failure modes, PII filtering before LLM submission, and an operator pattern for continuous scanning. Building equivalent capability from scratch would consume Phase 2 entirely without differentiation. Adopting k8sgpt lets platform development focus on Archy Infra — the natural-language query layer over the broader observability stack — which is the genuinely novel piece.

Why multi-backend, not single-backend

Committing the platform to a single LLM provider creates migration-risk coupling. The multi-backend policy lets each use case pick the right tradeoff independently and treats LLMs as interchangeable infrastructure rather than a vendor choice. It also aligns with the platform's broader principle of AI portability.

Why advisory-only at this stage

Autonomous remediation in a platform with live OT telemetry and real infrastructure is a categorically different decision from diagnostic advice. It requires failure-mode analysis, blast-radius controls, and operator approval workflows that are not justified for a five-node fleet in Phase 1. The platform's credibility as a portfolio piece depends on demonstrating judgment about autonomy boundaries, not on maximizing autonomy.

Why interim Ollama on CanEast AI Node is conditionally acceptable

k8sgpt's fallback to non-LLM mode means the platform degrades gracefully rather than breaking when the LLM backend is unreachable. This makes the interim dependency acceptable in principle. It is acceptable in practice only if the pre-flight reachability audit (WI-318) confirms Ollama is actually reachable from caneast-site1-node4. If not, the interim plan is invalid and the follow-up placement ADR is required first.

Why the sovereignty rule is a hard rule

The Archon Platform includes OT telemetry that represents real operational data. The portfolio story depends on demonstrating that the platform can handle regulated-equivalent data with appropriate controls. A soft "prefer local" guideline would not be credible. A hard rule that sovereignty-sensitive workloads use local inference is both the right technical choice and the right portfolio signal.

Consequences

Positive

  • First AI operations primitive deployed on the substrate ADR-0038 provides.
  • Multi-backend policy documented; future AI work has clear routing guidance.
  • Sovereignty rule established before any agent consumes OT data.
  • Advisory-only boundary makes autonomy expansion a deliberate future decision, not a drift.
  • Strong portfolio narrative: data plane (ADR-0038) + agent plane (this ADR) as a matched architectural pair demonstrating the Archon transition-to-agentic-AI arc.

Negative / risks

  • Ollama-on-CanEast AI Node is a conditional dependency. Interim placement is only valid if WI-318 reachability audit passes. If it fails, k8sgpt implementation is blocked until a follow-up placement ADR exists.
  • Multi-backend implementation complexity. Each backend requires credential management (Infisical), health monitoring, and a routing mechanism. Future Archy Infra must be built with this abstraction from day one.
  • k8sgpt requires populated k3s. Until ADR-0034's k3s migration completes and workloads are deployed to k3s, k8sgpt has minimal surface area to analyze. Initial deployment may appear underwhelming; value increases as k3s workloads grow.
  • LLM inference cost is non-zero, even for local inference. Ollama GPU time on CanEast AI Node incurs no per-call charge, but it is not cost-free. Hosted backends (Groq, DashScope, Anthropic) have real per-call costs. A cost monitoring runbook is required before hosted backends see production use.

Out of scope

  • Autonomous remediation. Deferred to a future ADR.
  • Archy Infra implementation. Separate work, tracked in E3 epic (WI-259). The LLM routing policy in this ADR applies when Archy Infra is built.
  • OT-AI agents. The sovereignty rule applies when they are built; implementation is a future ADR.
  • Dedicated Ollama hardware decision. Deferred to follow-up ADR if WI-318 invalidates the interim plan or if operational experience shows the CanEast AI Node dependency is too fragile.

Implementation tracking

Parent epic: E3 (WI-259).

Work items to be created when implementation is sequenced (NOT in this session):

  1. Ollama reachability audit from caneast-site1-node4 (WI-318, PRE-FLIGHT — blocks k8sgpt deployment)
  2. k8sgpt-operator Helm deployment to archon-monitoring namespace
  3. k8sgpt configuration pointing at confirmed Ollama endpoint
  4. Groq API key provisioning in Infisical (archon-platform project)
  5. Anthropic API key provisioning in Infisical (archon-platform project)
  6. LLM backend routing policy documentation as a platform runbook
  7. Ollama-unavailable runbook: expected k8sgpt behavior in degraded state
  8. k8sgpt Results CR surfacing in Grafana (ADR-0038 Grafana instance)
  9. k8sgpt Results routing through Alertmanager for critical findings
  10. PII filter configuration verification for k8sgpt
  11. Cost monitoring runbook for hosted LLM backends