k8sgpt-operator Runbook

ADR: ADR-0039 — AI Operations Agent Plane

k8sgpt-operator performs continuous AI-driven diagnostics of the k3s cluster. Results are written as Result CRs in the archon-monitoring namespace. The operator is advisory-only — it does not execute remediation.

Component overview

Component	Location
Operator	`archon-monitoring/release-k8sgpt-operator-controller-manager`
Analysis pod	`archon-monitoring/k8sgpt-ollama-*`
K8sGPT CR	`archon-monitoring/k8sgpt-ollama`
Helm release	`archon-monitoring/release-k8sgpt-operator`
Manifest	`archon-platform: k8s/k8sgpt/k8sgpt-ollama.yaml`
LLM backend	Ollama at `REDACTED:11434`, model `qwen3:4b`

Check operator health

# Operator and analysis pod status
kubectl -n archon-monitoring get pods -l app.kubernetes.io/name=k8sgpt-operator
kubectl -n archon-monitoring get pods -l app.kubernetes.io/name=k8sgpt

# K8sGPT CR reconcile status
kubectl -n archon-monitoring get k8sgpt k8sgpt-ollama -o yaml | grep -A5 'status:'

# Operator logs (recent activity)
kubectl -n archon-monitoring logs deploy/release-k8sgpt-operator-controller-manager \
  -c manager --tail=50

Expected healthy state: operator 2/2 Running, analysis pod 1/1 Running. Each reconcile cycle runs these steps in order: InitStep → FinalizerStep → ConfigureStep → PreAnalysisStep → AnalysisStep → ResultStatusStep → RemediationStep. RemediationStep is skipped because the operator is advisory-only.
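The healthy-state check can also be scripted. The following is a minimal sketch, assuming the input is the JSON printed by `kubectl -n archon-monitoring get pods -o json`; the function names and the sample pod names are hypothetical and not part of the operator.

```python
def ready_summary(pod_list):
    """Map each pod name to (ready containers, total containers, phase)."""
    summary = {}
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        ready = sum(1 for c in containers if c.get("ready"))
        summary[pod["metadata"]["name"]] = (ready, len(containers), status.get("phase"))
    return summary


def is_healthy(entry):
    """True when the pod is Running and every container reports ready."""
    ready, total, phase = entry
    return phase == "Running" and total > 0 and ready == total


# Hypothetical sample shaped like `kubectl get pods -o json` output.
sample = {
    "items": [
        {"metadata": {"name": "operator-example"},
         "status": {"phase": "Running",
                    "containerStatuses": [{"ready": True}, {"ready": True}]}},
        {"metadata": {"name": "analysis-example"},
         "status": {"phase": "Running",
                    "containerStatuses": [{"ready": False}]}},
    ]
}

for name, entry in ready_summary(sample).items():
    flag = "OK" if is_healthy(entry) else "CHECK"
    print(f"{name}: {entry[0]}/{entry[1]} {entry[2]} {flag}")
```

To use it against the cluster, replace `sample` with the parsed output of the `kubectl` command, for example via `json.load(sys.stdin)`.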

Query results

# List all current results
kubectl -n archon-monitoring get results

# Count results
kubectl -n archon-monitoring get results --no-headers | wc -l

# Show results with AI explanations
kubectl -n archon-monitoring get results -o json | python3 -c "
import json, sys
for r in json.load(sys.stdin)['items']:
    d = r.get('spec', {}).get('details', '')
    if d:
        print(r['metadata']['name'], ':', d[:120])
"

# Full detail for a specific result
kubectl -n archon-monitoring get result <result-name> -o json | \
  python3 -c "import json,sys; s=json.load(sys.stdin)['spec']; print('Error:', s.get('error')); print('Details:', s.get('details'))"

Result CR fields

Field	Description
`spec.error[].text`	Raw k8s analyzer finding (always populated)
`spec.details`	AI-generated explanation from Ollama (populated when LLM is reachable)
`spec.backend`	Confirms which backend generated the explanation (`localai` = Ollama)
`spec.kind`	Kubernetes resource kind analyzed
`spec.name`	`namespace/resource-name` of the affected resource

`spec.details` is empty when Ollama is unreachable or when the analysis cycle completed before the LLM responded. `spec.error` is always populated, so results without `spec.details` are still actionable.
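Because a Result can carry a finding without an explanation, it can help to separate the two groups when triaging. This is a sketch only, assuming the input is the JSON from `kubectl -n archon-monitoring get results -o json` and that the fields are shaped as in the table above; the helper names and sample values are hypothetical.

```python
def split_by_explanation(result_list):
    """Return (explained, unexplained) Result names based on spec.details."""
    explained, unexplained = [], []
    for item in result_list.get("items", []):
        name = item["metadata"]["name"]
        if item.get("spec", {}).get("details"):
            explained.append(name)
        else:
            unexplained.append(name)
    return explained, unexplained


def error_texts(result):
    """Collect the raw analyzer findings from spec.error[].text."""
    return [e.get("text", "") for e in result.get("spec", {}).get("error", [])]


# Hypothetical sample shaped like `kubectl get results -o json` output.
sample = {
    "items": [
        {"metadata": {"name": "result-a"},
         "spec": {"error": [{"text": "example finding A"}],
                  "details": "example explanation"}},
        {"metadata": {"name": "result-b"},
         "spec": {"error": [{"text": "example finding B"}], "details": ""}},
    ]
}

explained, unexplained = split_by_explanation(sample)
print("with explanation:", explained)
print("finding only:", unexplained)
for item in sample["items"]:
    print(item["metadata"]["name"], error_texts(item))
```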

Normal analysis behavior

Cycle interval: ~30 seconds between reconcile attempts; AnalysisStep runs each cycle.
Analysis duration: varies with cluster size and LLM availability.
    Without Ollama calls: 5–6 minutes for ~20 issues.
    With Ollama calls (qwen3:4b, 52 issues): ~29 minutes.
First-run behavior: the first AnalysisStep after pod startup may produce empty details while the model warms up in Ollama VRAM. Subsequent runs populate details.
Result persistence: Results are updated in-place when issues persist; new Results are created for new issues.
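As a rough aid for judging whether a cycle is running slow, the durations listed above can be converted to seconds per issue. This is back-of-envelope arithmetic only: the two measurements come from different issue counts, so they are not a controlled comparison, and the helper name is hypothetical.

```python
def seconds_per_issue(minutes, issues):
    """Average wall-clock seconds spent per reported issue."""
    return minutes * 60 / issues


# Without Ollama calls: 5 to 6 minutes for about 20 issues.
low = seconds_per_issue(5, 20)
high = seconds_per_issue(6, 20)

# With Ollama calls (qwen3:4b): about 29 minutes for 52 issues.
with_llm = seconds_per_issue(29, 52)

print(f"without LLM: {low:.0f}-{high:.0f} s per issue")
print(f"with LLM:    {with_llm:.1f} s per issue")
```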

Degraded state: Ollama unreachable (CanEast AI Node suspended)

When REDACTED:11434 is unreachable:

k8sgpt continues running and produces Results with error populated.
The details field is empty (no LLM explanation is generated).
The operator logs Finished Reconciling k8sGPT with RequeueTime: 30s rather than an error. This is the expected degraded behavior per ADR-0039.
Cluster health detection is unaffected; only natural-language explanation is lost.

To confirm Ollama reachability from within the cluster:

kubectl run ollama-check --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://REDACTED:[REDACTED]/api/tags | python3 -c "import json,sys; print([m['name'] for m in json.load(sys.stdin).get('models',[])])"
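The same reachability probe can be written in Python for hosts where a throwaway pod is not convenient. This is a sketch under assumptions: `base_url` is a placeholder for the Ollama endpoint, the function names are hypothetical, and the only API used is the `/api/tags` listing already queried above.

```python
import json
import urllib.request


def model_names(tags_payload):
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]


def fetch_model_names(base_url, timeout=5):
    """Return the model names Ollama reports, or None when it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (OSError, ValueError):
        # Connection errors and timeouts are OSError subclasses;
        # a malformed URL or a non-JSON body raises ValueError.
        return None


# Offline demonstration with a payload shaped like the /api/tags response.
print(model_names({"models": [{"name": "qwen3:4b"}]}))
```

A return value of `None` from `fetch_model_names` corresponds to the degraded state described in this section; an empty list means Ollama answered but has no models pulled.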

Update LLM model

Edit k8s/k8sgpt/k8sgpt-ollama.yaml in archon-platform, change spec.ai.model, then re-apply:

kubectl apply -f k8s/k8sgpt/k8sgpt-ollama.yaml

Ensure the target model is pulled on CanEast AI Node before applying:

# From CanEast AI Node WSL (or any host with Docker)
curl -X POST http://REDACTED:[REDACTED]/api/pull -d '{"name":"<model-name>"}'

Avoid glm-4.7-flash (29.9B) — too slow for k8sgpt analysis cycles. Recommended: qwen3:4b or llama3.2.
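Before applying a model change, it may be worth confirming that the target model appears in the list Ollama reports. The sketch below assumes the list of names was obtained from `/api/tags` and that a name without a tag refers to the `latest` tag; the function names are hypothetical, and names that include a registry host with a port are not handled.

```python
def normalize(model):
    """Append the default tag when the model name has none."""
    return model if ":" in model else f"{model}:latest"


def is_pulled(model, available):
    """True when the requested model is among the names Ollama reports."""
    return normalize(model) in {normalize(name) for name in available}


# Hypothetical list of names as returned by /api/tags.
available = ["qwen3:4b", "llama3.2:latest"]

for candidate in ("qwen3:4b", "llama3.2", "qwen3:8b"):
    state = "pulled" if is_pulled(candidate, available) else "missing, pull it first"
    print(f"{candidate}: {state}")
```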

Rotate backend

To switch to a different backend (e.g., Groq), update spec.ai in the K8sGPT CR:

spec:
  ai:
    backend: openai          # or groq, amazonbedrock, etc.
    model: llama-3.1-8b-instant
    baseUrl: https://api.groq.com/openai/v1
    secret:
      name: k8sgpt-groq-secret
      key: GROQ_API_KEY

Sovereignty rule (ADR-0039): cluster internals are sovereign data. Any backend that sends cluster topology off-premise violates the sovereignty rule. Hosted backends are permitted only if anonymized: true is set AND the input data is confirmed non-sensitive. The current configuration sets anonymized: true, which redacts names and namespaces before LLM submission.
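Part of the sovereignty rule can be checked mechanically before a backend change is applied. The following is a sketch only: it assumes `anonymized` lives under `spec.ai` in the K8sGPT CR, it treats `localai` as the only on-premise backend, and it does not inspect `baseUrl`, so a `localai` backend pointed at an external host would slip through. The function name and constants are hypothetical, and the data-sensitivity condition still needs a human decision.

```python
# Assumption: only the Ollama path (backend "localai") stays on-premise.
LOCAL_BACKENDS = {"localai"}


def sovereignty_findings(k8sgpt_cr):
    """List the points to resolve before a K8sGPT CR is applied."""
    ai = k8sgpt_cr.get("spec", {}).get("ai", {})
    findings = []
    if ai.get("backend") not in LOCAL_BACKENDS:
        if not ai.get("anonymized", False):
            findings.append("hosted backend without anonymized: true")
        # This condition cannot be verified from the manifest alone.
        findings.append("hosted backend: confirm input data is non-sensitive")
    return findings


# Hypothetical manifests, parsed into dictionaries.
local = {"spec": {"ai": {"backend": "localai", "anonymized": True}}}
hosted = {"spec": {"ai": {"backend": "openai", "anonymized": False}}}

print(sovereignty_findings(local))
print(sovereignty_findings(hosted))
```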

RBAC note

The operator's manager ClusterRole (release-k8sgpt-operator-controller-manager) grants apiGroups: ['*'], resources: ['*'] with write verbs (effectively cluster-admin scope). This is the upstream default for k8sgpt-operator and is a known over-provisioning. Do not narrow it without testing: the operator requires broad read access for cluster analysis and write access to manage Result CRs and the k8sgpt-ollama Deployment. Track narrowing as a future hardening task if the cluster grows beyond dev/lab use.
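When the hardening task is picked up, a first step could be listing which rules are actually wildcard-with-write. This is a minimal sketch, assuming the input is the JSON from `kubectl get clusterrole release-k8sgpt-operator-controller-manager -o json`; the function name and the sample rules are hypothetical.

```python
WRITE_VERBS = {"create", "update", "patch", "delete", "deletecollection", "*"}


def wildcard_write_rules(cluster_role):
    """Return rules that grant write verbs on every API group and resource."""
    hits = []
    for rule in cluster_role.get("rules", []):
        all_groups = "*" in rule.get("apiGroups", [])
        all_resources = "*" in rule.get("resources", [])
        can_write = bool(WRITE_VERBS & set(rule.get("verbs", [])))
        if all_groups and all_resources and can_write:
            hits.append(rule)
    return hits


# Hypothetical sample shaped like a ClusterRole object.
sample_role = {
    "rules": [
        {"apiGroups": ["*"], "resources": ["*"],
         "verbs": ["get", "list", "watch", "create", "update", "patch", "delete"]},
        {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list"]},
    ]
}

for rule in wildcard_write_rules(sample_role):
    print("over-broad rule:", rule["verbs"])
```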

Helm upgrade

helm -n archon-monitoring upgrade release-k8sgpt-operator \
  k8sgpt-operator/k8sgpt-operator \
  --version <new-version>

Current deployed version: 0.2.27 (appVersion 0.2.25). The version: latest field in the K8sGPT CR controls the analysis pod image tag, not the operator version.
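To confirm which chart and app version are deployed before and after an upgrade, the output of `helm -n archon-monitoring list -o json` can be parsed. The sketch below assumes that output is a list of objects with `name`, `chart` and `app_version` keys and that `chart` has the form `<chart-name>-<version>`; the function name is hypothetical.

```python
def chart_version(releases, release_name, chart_name):
    """Return (chart version, app version) for a release, or None if absent."""
    prefix = f"{chart_name}-"
    for release in releases:
        chart = release.get("chart", "")
        if release.get("name") == release_name and chart.startswith(prefix):
            return chart[len(prefix):], release.get("app_version")
    return None


# Sample shaped like `helm list -o json`, using the versions stated above.
releases = [
    {"name": "release-k8sgpt-operator",
     "namespace": "archon-monitoring",
     "chart": "k8sgpt-operator-0.2.27",
     "app_version": "0.2.25"},
]

print(chart_version(releases, "release-k8sgpt-operator", "k8sgpt-operator"))
```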

Clean up test namespaces

After testing image pull failures or other injected faults:

kubectl delete ns k8sgpt-test

The corresponding Results in archon-monitoring will be cleaned up on the next AnalysisStep.
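To verify that the cleanup happened, the remaining Results can be filtered by the namespace prefix of `spec.name`. This is a sketch, assuming `spec.name` follows the `namespace/resource-name` form described in the field table and that the input is the JSON from `kubectl -n archon-monitoring get results -o json`; the function name and sample values are hypothetical.

```python
def results_for_namespace(result_list, namespace):
    """Return names of Results whose spec.name points into the namespace."""
    prefix = f"{namespace}/"
    return [
        item["metadata"]["name"]
        for item in result_list.get("items", [])
        if item.get("spec", {}).get("name", "").startswith(prefix)
    ]


# Hypothetical sample shaped like `kubectl get results -o json` output.
sample = {
    "items": [
        {"metadata": {"name": "result-a"}, "spec": {"name": "k8sgpt-test/bad-image"}},
        {"metadata": {"name": "result-b"}, "spec": {"name": "archon-monitoring/other"}},
    ]
}

leftover = results_for_namespace(sample, "k8sgpt-test")
print("still present:", leftover)
```

An empty list after the next AnalysisStep indicates that the Results for the deleted namespace are gone.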