k8sgpt-operator Runbook¶
ADR: ADR-0039 — AI Operations Agent Plane
k8sgpt-operator performs continuous AI-driven diagnostics of the k3s cluster.
Results are written as Result CRs in the archon-monitoring namespace.
The operator is advisory-only — it does not execute remediation.
Component overview¶
| Component | Location |
|---|---|
| Operator | archon-monitoring/release-k8sgpt-operator-controller-manager |
| Analysis pod | archon-monitoring/k8sgpt-ollama-* |
| K8sGPT CR | archon-monitoring/k8sgpt-ollama |
| Helm release | archon-monitoring/release-k8sgpt-operator |
| Manifest | archon-platform: k8s/k8sgpt/k8sgpt-ollama.yaml |
| LLM backend | Ollama at REDACTED:11434, model qwen3:4b |
Check operator health¶
# Operator and analysis pod status
kubectl -n archon-monitoring get pods -l app.kubernetes.io/name=k8sgpt-operator
kubectl -n archon-monitoring get pods -l app.kubernetes.io/name=k8sgpt
# K8sGPT CR reconcile status
kubectl -n archon-monitoring get k8sgpt k8sgpt-ollama -o yaml | grep -A5 'status:'
# Operator logs (recent activity)
kubectl -n archon-monitoring logs deploy/release-k8sgpt-operator-controller-manager \
-c manager --tail=50
Expected healthy state: operator 2/2 Running, analysis pod 1/1 Running.
Each reconcile cycle: InitStep → FinalizerStep → ConfigureStep → PreAnalysisStep → AnalysisStep → ResultStatusStep → RemediationStep (skipped).
Query results¶
# List all current results
kubectl -n archon-monitoring get results
# Count results
kubectl -n archon-monitoring get results --no-headers | wc -l
# Show results with AI explanations
kubectl -n archon-monitoring get results -o json | python3 -c "
import json, sys
for r in json.load(sys.stdin)['items']:
    d = r.get('spec', {}).get('details', '')
    if d:
        print(r['metadata']['name'], ':', d[:120])
"
# Full detail for a specific result
kubectl -n archon-monitoring get result <result-name> -o json | \
python3 -c "import json,sys; s=json.load(sys.stdin)['spec']; print('Error:', s.get('error')); print('Details:', s.get('details'))"
Result CR fields¶
| Field | Description |
|---|---|
| spec.error[].text | Raw k8s analyzer finding (always populated) |
| spec.details | AI-generated explanation from Ollama (populated when LLM is reachable) |
| spec.backend | Confirms which backend generated the explanation (localai = Ollama) |
| spec.kind | Kubernetes resource kind analyzed |
| spec.name | namespace/resource-name of the affected resource |
details is empty when Ollama is unreachable or when the analysis cycle completed before the LLM responded.
error is always populated — results without details are still actionable.
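The split between Results that carry an AI explanation and Results that only carry the raw finding can be sketched in Python. This is a minimal sketch, assuming input shaped like `kubectl -n archon-monitoring get results -o json`; the sample items and the helper name `split_by_explanation` are hypothetical, and only the field names come from the table above.

```python
def split_by_explanation(result_list):
    """Return (explained, unexplained) lists of Result names.

    A Result counts as explained when spec.details is non-empty.
    """
    explained, unexplained = [], []
    for item in result_list.get("items", []):
        name = item["metadata"]["name"]
        details = item.get("spec", {}).get("details", "")
        (explained if details else unexplained).append(name)
    return explained, unexplained


# Hypothetical sample data; real Result names and findings will differ.
SAMPLE_RESULTS = {
    "items": [
        {
            "metadata": {"name": "demowebpod"},
            "spec": {
                "kind": "Pod",
                "name": "demo/web",
                "backend": "localai",
                "error": [{"text": "Back-off pulling image"}],
                "details": "The image tag could not be pulled.",
            },
        },
        {
            "metadata": {"name": "demodbpod"},
            "spec": {
                "kind": "Pod",
                "name": "demo/db",
                "backend": "localai",
                "error": [{"text": "Container is in CrashLoopBackOff"}],
                "details": "",
            },
        },
    ]
}

explained, unexplained = split_by_explanation(SAMPLE_RESULTS)
print("with details:", explained)
print("error only:", unexplained)
```

Results in the second list still carry the analyzer finding in error, so they can be triaged without waiting for the LLM.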
Normal analysis behavior¶
- Cycle interval: ~30 seconds between reconcile attempts; AnalysisStep runs each cycle.
- Analysis duration: varies with cluster size and LLM availability.
  - Without Ollama calls: 5–6 minutes for ~20 issues.
  - With Ollama calls (qwen3:4b, 52 issues): ~29 minutes.
- First-run behavior: the first AnalysisStep after pod startup may produce empty details while the model warms up in Ollama VRAM. Subsequent runs populate details.
- Result persistence: Results are updated in-place when issues persist; new Results are created for new issues.
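The durations quoted above translate into rough per-issue averages. The arithmetic below is only a back-of-the-envelope sketch: it assumes analysis time scales linearly with issue count, which the measurements do not confirm, and the helper name `seconds_per_issue` is made up for illustration.

```python
def seconds_per_issue(minutes, issues):
    """Average wall-clock seconds spent per issue in one analysis run."""
    return minutes * 60 / issues


# Figures taken from the list above.
without_llm_low = seconds_per_issue(5, 20)
without_llm_high = seconds_per_issue(6, 20)
with_llm = seconds_per_issue(29, 52)

print(f"without Ollama: {without_llm_low:.0f}-{without_llm_high:.0f} s per issue")
print(f"with Ollama (qwen3:4b): {with_llm:.1f} s per issue")
```

On these numbers, an analysis with LLM calls costs roughly twice as much per issue as one without, which is worth keeping in mind when judging whether a slow cycle is abnormal.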
Degraded state — Ollama unreachable (CanEast AI Node suspended)¶
When REDACTED:11434 is unreachable:
- k8sgpt continues running and produces Results with error populated.
- The details field is empty (no LLM explanation generated).
- The operator logs Finished Reconciling k8sGPT with RequeueTime: 30s rather than an error; this is expected degraded behavior per ADR-0039.
- Cluster health detection is unaffected; only natural-language explanation is lost.
To confirm Ollama reachability from within the cluster:
kubectl run ollama-check --rm -it --restart=Never --image=curlimages/curl -- \
curl -s http://REDACTED:[REDACTED]/api/tags | python3 -c "import json,sys; print([m['name'] for m in json.load(sys.stdin).get('models',[])])"
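Beyond reachability, it can help to confirm that the configured model is actually present in the tag list. The sketch below parses a payload shaped like the `/api/tags` response used above (a `models` array whose entries have a `name`). The sample payload and the helper names are hypothetical; the handling of untagged names assumes Ollama's convention that a model without a tag resolves to `:latest`.

```python
def model_names(tags_payload):
    """Extract model names from an /api/tags style payload."""
    return [m["name"] for m in tags_payload.get("models", [])]


def has_model(tags_payload, wanted):
    """True when the wanted model is listed; untagged names match ':latest'."""
    if ":" not in wanted:
        wanted = wanted + ":latest"
    return wanted in model_names(tags_payload)


# Hypothetical sample payload; a real node will list different models.
SAMPLE_TAGS = {"models": [{"name": "qwen3:4b"}, {"name": "llama3.2:latest"}]}

for candidate in ("qwen3:4b", "llama3.2", "glm-4.7-flash"):
    print(candidate, "available:", has_model(SAMPLE_TAGS, candidate))
```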
Update LLM model¶
Edit k8s/k8sgpt/k8sgpt-ollama.yaml in archon-platform, change spec.ai.model, then re-apply the manifest.
Ensure the target model is pulled on CanEast AI Node before applying:
# From CanEast AI Node WSL (or any host with Docker)
curl -X POST http://REDACTED:[REDACTED]/api/pull -d '{"name":"<model-name>"}'
Avoid glm-4.7-flash (29.9B) — too slow for k8sgpt analysis cycles.
Recommended: qwen3:4b or llama3.2.
Rotate backend¶
To switch to a different backend (e.g., Groq), update spec.ai in the K8sGPT CR:
spec:
  ai:
    backend: openai # or groq, amazonbedrock, etc.
    model: llama-3.1-8b-instant
    baseUrl: https://api.groq.com/openai/v1
    secret:
      REDACTED k8sgpt-groq-secret
      key: GROQ_API_KEY
Sovereignty rule (ADR-0039): cluster internals are sovereign data.
Any backend that sends cluster topology off-premise violates the sovereignty rule.
Hosted backends are only permitted if anonymized: true AND the input data is confirmed non-sensitive.
Current configuration uses anonymized: true which redacts names/namespaces before LLM submission.
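The sovereignty rule can be expressed as a pre-flight check on a proposed spec.ai block. This is a sketch under stated assumptions, not an ADR-0039 enforcement mechanism: the on-premise host allowlist `ON_PREM_HOSTS` is hypothetical, `anonymized` is assumed to sit directly under spec.ai, and whether the input data is non-sensitive has to be confirmed by a person and passed in as a flag.

```python
from urllib.parse import urlparse

# Hypothetical allowlist of on-premise LLM hosts; replace with real hostnames.
ON_PREM_HOSTS = {"ollama.internal.example"}


def sovereignty_problems(ai_spec, data_confirmed_non_sensitive=False):
    """Return a list of rule violations for a proposed spec.ai mapping.

    A backend whose baseUrl host is not in ON_PREM_HOSTS (or is missing)
    is treated as hosted, which is the conservative reading.
    """
    host = urlparse(ai_spec.get("baseUrl", "")).hostname
    if host in ON_PREM_HOSTS:
        return []
    problems = []
    if not ai_spec.get("anonymized", False):
        problems.append("hosted backend without anonymized: true")
    if not data_confirmed_non_sensitive:
        problems.append("input data not confirmed non-sensitive")
    return problems


groq_spec = {
    "backend": "openai",
    "model": "llama-3.1-8b-instant",
    "baseUrl": "https://api.groq.com/openai/v1",
}
print(sovereignty_problems(groq_spec))
```

An empty list means both conditions of the rule are met or the backend is on-premise. Any entry in the list is a reason not to apply the change.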
RBAC note¶
The operator's manager ClusterRole (release-k8sgpt-operator-controller-manager) grants
apiGroups: ['*'], resources: ['*'] with write verbs (effectively cluster-admin scope).
This is the upstream default for k8sgpt-operator and is a known over-provisioning.
Do not narrow this without testing: the operator requires broad read access for cluster analysis
and write access to manage Result CRs and the k8sgpt-ollama Deployment.
Track narrowing as a future hardening task if the cluster grows beyond dev/lab use.
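When the hardening task is picked up, a first step is to list which rules are actually wildcarded. The sketch below assumes input shaped like `kubectl get clusterrole <name> -o json` (a top-level `rules` list with `apiGroups`, `resources`, and `verbs`). The sample role is hypothetical and does not reproduce the real ClusterRole.

```python
def wildcard_rules(cluster_role):
    """Return the rules that grant '*' on both apiGroups and resources."""
    flagged = []
    for rule in cluster_role.get("rules", []):
        if "*" in rule.get("apiGroups", []) and "*" in rule.get("resources", []):
            flagged.append(rule)
    return flagged


# Hypothetical sample; the deployed ClusterRole may contain different rules.
SAMPLE_ROLE = {
    "rules": [
        {
            "apiGroups": ["*"],
            "resources": ["*"],
            "verbs": ["get", "list", "watch", "create", "update", "patch", "delete"],
        },
        {
            "apiGroups": ["core.k8sgpt.ai"],
            "resources": ["results"],
            "verbs": ["get", "list"],
        },
    ]
}

for rule in wildcard_rules(SAMPLE_ROLE):
    print("wildcard rule with verbs:", ", ".join(rule["verbs"]))
```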
Helm upgrade¶
helm -n archon-monitoring upgrade release-k8sgpt-operator \
k8sgpt-operator/k8sgpt-operator \
--version <new-version>
Current deployed version: 0.2.27 (appVersion 0.2.25).
The version: latest field in the K8sGPT CR controls the analysis pod image tag,
not the operator version.
Clean up test namespaces¶
After testing image pull failures or other injected faults, delete the test namespaces that were created for the test.
The corresponding Results in archon-monitoring will be cleaned up on the next AnalysisStep.