k8sgpt-operator Runbook

ADR: ADR-0039 — AI Operations Agent Plane

k8sgpt-operator performs continuous AI-driven diagnostics of the k3s cluster. Results are written as Result CRs in the archon-monitoring namespace. The operator is advisory-only — it does not execute remediation.

Component overview

Component      Location
Operator       archon-monitoring/release-k8sgpt-operator-controller-manager
Analysis pod   archon-monitoring/k8sgpt-ollama-*
K8sGPT CR      archon-monitoring/k8sgpt-ollama
Helm release   archon-monitoring/release-k8sgpt-operator
Manifest       archon-platform: k8s/k8sgpt/k8sgpt-ollama.yaml
LLM backend    Ollama at REDACTED:11434, model qwen3:4b

Check operator health

# Operator and analysis pod status
kubectl -n archon-monitoring get pods -l app.kubernetes.io/name=k8sgpt-operator
kubectl -n archon-monitoring get pods -l app.kubernetes.io/name=k8sgpt

# K8sGPT CR reconcile status
kubectl -n archon-monitoring get k8sgpt k8sgpt-ollama -o yaml | grep -A5 'status:'

# Operator logs (recent activity)
kubectl -n archon-monitoring logs deploy/release-k8sgpt-operator-controller-manager \
  -c manager --tail=50

Expected healthy state: operator 2/2 Running, analysis pod 1/1 Running. Each reconcile cycle: InitStep → FinalizerStep → ConfigureStep → PreAnalysisStep → AnalysisStep → ResultStatusStep → RemediationStep (skipped).
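
The expected state above can also be checked mechanically. The following is a minimal sketch, assuming the input is the parsed output of `kubectl -n archon-monitoring get pods -o json`; the function name `pod_ready_summary` is hypothetical and not part of any k8sgpt tooling.

```python
def pod_ready_summary(pod_list):
    """Map each pod name to (ready_containers, total_containers, phase).

    `pod_list` is the dict produced by parsing `kubectl get pods -o json`.
    """
    summary = {}
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        # containerStatuses may be absent while a pod is still being scheduled.
        ready = sum(1 for c in status.get("containerStatuses", []) if c.get("ready"))
        total = len(pod.get("spec", {}).get("containers", []))
        summary[pod["metadata"]["name"]] = (ready, total, status.get("phase"))
    return summary


def is_healthy(entry):
    """A pod counts as healthy when it is Running and every container is ready."""
    ready, total, phase = entry
    return phase == "Running" and total > 0 and ready == total
```

Passing `json.load(sys.stdin)` from the piped kubectl output into `pod_ready_summary` and then filtering with `is_healthy` should flag anything that deviates from 2/2 (operator) or 1/1 (analysis pod).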

Query results

# List all current results
kubectl -n archon-monitoring get results

# Count results
kubectl -n archon-monitoring get results --no-headers | wc -l

# Show results with AI explanations
kubectl -n archon-monitoring get results -o json | python3 -c "
import json, sys
for r in json.load(sys.stdin)['items']:
    d = r.get('spec', {}).get('details', '')
    if d:
        print(r['metadata']['name'], ':', d[:120])
"

# Full detail for a specific result
kubectl -n archon-monitoring get result <result-name> -o json | \
  python3 -c "import json,sys; s=json.load(sys.stdin)['spec']; print('Error:', s.get('error')); print('Details:', s.get('details'))"

Result CR fields

Field               Description
spec.error[].text   Raw k8s analyzer finding (always populated)
spec.details        AI-generated explanation from Ollama (populated when LLM is reachable)
spec.backend        Confirms which backend generated the explanation (localai = Ollama)
spec.kind           Kubernetes resource kind analyzed
spec.name           namespace/resource-name of the affected resource

details is empty when Ollama is unreachable or when the analysis cycle completed before the LLM responded. error is always populated — results without details are still actionable.
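
To see at a glance how many Results carry an explanation, the list can be partitioned on `spec.details`. This is a small sketch, assuming the input is the parsed output of `kubectl -n archon-monitoring get results -o json`; `split_by_details` is a hypothetical helper name.

```python
def split_by_details(result_list):
    """Partition Result CR names by whether spec.details is populated.

    Returns (with_details, without_details), each a list of Result names.
    """
    with_details, without_details = [], []
    for item in result_list.get("items", []):
        name = item["metadata"]["name"]
        # An empty string or a missing key both count as "no explanation".
        if item.get("spec", {}).get("details"):
            with_details.append(name)
        else:
            without_details.append(name)
    return with_details, without_details
```

If every Result lands in the second list across several cycles, that points at the LLM backend rather than at the analyzers, since `error` is populated independently of it.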

Normal analysis behavior

  • Cycle interval: ~30 seconds between reconcile attempts; AnalysisStep runs each cycle.
  • Analysis duration: varies with cluster size and LLM availability.
      • Without Ollama calls: 5–6 minutes for ~20 issues.
      • With Ollama calls (qwen3:4b, 52 issues): ~29 minutes.
  • First-run behavior: the first AnalysisStep after pod startup may produce empty details while the model warms up in Ollama VRAM. Subsequent runs populate details.
  • Result persistence: Results are updated in-place when issues persist; new Results are created for new issues.
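
The two duration figures above can be normalized to a per-issue cost for rough comparison. This is only arithmetic on the numbers quoted in this runbook; the two observations come from runs with different issue counts, so treating the cost as linear in the number of issues is an assumption.

```python
def seconds_per_issue(minutes, issues):
    """Average wall-clock seconds spent per issue in one analysis run."""
    return minutes * 60 / issues


# Figures quoted in this runbook.
with_llm = seconds_per_issue(29, 52)        # about 33.5 s per issue
without_llm_low = seconds_per_issue(5, 20)  # 15.0 s per issue
without_llm_high = seconds_per_issue(6, 20) # 18.0 s per issue
```

On these numbers an Ollama-backed run costs roughly twice as much per issue as a run without LLM calls.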

Degraded state — Ollama unreachable (CanEast AI Node suspended)

When REDACTED:11434 is unreachable:

  • k8sgpt continues running and produces Results with error populated.
  • details field is empty (no LLM explanation generated).
  • The operator logs Finished Reconciling k8sGPT with RequeueTime: 30s rather than an error — this is expected degraded behavior per ADR-0039.
  • Cluster health detection is unaffected; only natural-language explanation is lost.

To confirm Ollama reachability from within the cluster:

kubectl run ollama-check --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://REDACTED:[REDACTED]/api/tags | python3 -c "import json,sys; print([m['name'] for m in json.load(sys.stdin).get('models',[])])"
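
Reachability alone does not confirm that the configured model is present. The sketch below checks a parsed `/api/tags` response for a model name; `model_is_pulled` is a hypothetical helper, and the `:latest` handling reflects how Ollama lists models pulled without an explicit tag.

```python
def model_is_pulled(tags_payload, model):
    """Return True if `model` appears in a parsed Ollama /api/tags response."""
    names = [m.get("name", "") for m in tags_payload.get("models", [])]
    # Ollama reports untagged pulls as "<model>:latest", so accept that form too.
    return model in names or f"{model}:latest" in names
```

For example, `model_is_pulled(payload, "qwen3:4b")` should be true before the K8sGPT CR is pointed at that model.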

Update LLM model

Edit k8s/k8sgpt/k8sgpt-ollama.yaml in archon-platform, change spec.ai.model, then re-apply:

kubectl apply -f k8s/k8sgpt/k8sgpt-ollama.yaml

Ensure the target model is pulled on CanEast AI Node before applying:

# From CanEast AI Node WSL (or any host with Docker)
curl -X POST http://REDACTED:[REDACTED]/api/pull -d '{"name":"<model-name>"}'

Avoid glm-4.7-flash (29.9B) — too slow for k8sgpt analysis cycles. Recommended: qwen3:4b or llama3.2.

Rotate backend

To switch to a different backend (e.g., Groq), update spec.ai in the K8sGPT CR:

spec:
  ai:
    backend: openai          # or groq, amazonbedrock, etc.
    model: llama-3.1-8b-instant
    baseUrl: https://api.groq.com/openai/v1
    secret:
      name: k8sgpt-groq-secret
      key: GROQ_API_KEY

Sovereignty rule (ADR-0039): cluster internals are sovereign data. Any backend that sends cluster topology off-premises violates the sovereignty rule. Hosted backends are permitted only if anonymized: true is set AND the input data is confirmed non-sensitive. The current configuration sets anonymized: true, which redacts names and namespaces before LLM submission.
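
The rule can be expressed as a pre-apply check on `spec.ai`. The following is a sketch of the rule as paraphrased in this runbook, not ADR-0039's authoritative wording; `sovereignty_ok`, the `on_prem_hosts` set, and the `data_confirmed_non_sensitive` flag are all hypothetical, and deciding which hosts count as on-premises is left to the operator.

```python
from urllib.parse import urlparse


def sovereignty_ok(ai_spec, on_prem_hosts, data_confirmed_non_sensitive=False):
    """Return True if a spec.ai mapping satisfies the sovereignty rule.

    ai_spec: dict form of spec.ai from the K8sGPT CR.
    on_prem_hosts: set of hostnames/IPs considered on-premises (assumption).
    """
    host = urlparse(ai_spec.get("baseUrl", "")).hostname
    if host in on_prem_hosts:
        return True  # traffic stays on-premises
    # Hosted (or unknown) backend: both conditions from the rule must hold.
    return bool(ai_spec.get("anonymized")) and data_confirmed_non_sensitive
```

A spec with no `baseUrl` is treated as hosted, which is the conservative default.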

RBAC note

The operator's manager ClusterRole (release-k8sgpt-operator-controller-manager) grants apiGroups: ['*'], resources: ['*'] with write verbs (effectively cluster-admin scope). This is the upstream default for k8sgpt-operator and is a known over-provisioning. Do not narrow it without testing: the operator requires broad read access for cluster analysis and write access to manage Result CRs and the k8sgpt-ollama Deployment. Track narrowing as a future hardening task if the cluster grows beyond dev/lab use.
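
When narrowing is eventually attempted, it helps to list exactly which rules carry the wildcard grant. This is a minimal sketch, assuming the input is the parsed output of `kubectl get clusterrole <name> -o json`; `wildcard_write_rules` and the `WRITE_VERBS` set are hypothetical names chosen for illustration.

```python
# Verbs that modify state; "*" implies all of them.
WRITE_VERBS = {"create", "update", "patch", "delete", "deletecollection", "*"}


def wildcard_write_rules(cluster_role):
    """Return the rules that grant write verbs on all apiGroups and resources."""
    hits = []
    for rule in cluster_role.get("rules", []):
        all_groups = "*" in rule.get("apiGroups", [])
        all_resources = "*" in rule.get("resources", [])
        can_write = bool(WRITE_VERBS & set(rule.get("verbs", [])))
        if all_groups and all_resources and can_write:
            hits.append(rule)
    return hits
```

An empty return value after a change would indicate that the wildcard write grant is gone; whether the operator still functions must be verified separately.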

Helm upgrade

helm -n archon-monitoring upgrade release-k8sgpt-operator \
  k8sgpt-operator/k8sgpt-operator \
  --version <new-version>

Current deployed version: 0.2.27 (appVersion 0.2.25). The version: latest field in the K8sGPT CR controls the analysis pod image tag, not the operator version.

Clean up test namespaces

After testing image pull failures or other injected faults:

kubectl delete ns k8sgpt-test

The corresponding Results in archon-monitoring will be cleaned up on the next AnalysisStep.
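
To confirm the cleanup happened, Results can be filtered by the namespace prefix of `spec.name`, which this runbook describes as `namespace/resource-name`. A small sketch, assuming the parsed output of `kubectl -n archon-monitoring get results -o json`; `results_for_namespace` is a hypothetical helper.

```python
def results_for_namespace(result_list, namespace):
    """Return names of Result CRs whose spec.name targets `namespace`."""
    prefix = namespace + "/"
    return [
        item["metadata"]["name"]
        for item in result_list.get("items", [])
        if item.get("spec", {}).get("name", "").startswith(prefix)
    ]
```

After the next AnalysisStep, `results_for_namespace(data, "k8sgpt-test")` should return an empty list.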