Operations Audit - 2026-02-22
Scope
Live environment audit and remediation executed on 2026-02-22 against:
- k3s cluster (kubectl from control node)
- k3s nodes via SSH (K8-Master, k3s-worker-01, k3s-worker-02, k3s-worker-03)
- TrueNAS relay host (Truenas)
- Docker edge host (192.168.30.117)
Findings
- Authentik instability from OOM kills
    - `authentik` and `authentik-worker` had repeated restarts with `Last State: OOMKilled`.
    - Worker kernel logs confirmed memory cgroup kills for Authentik worker processes.
- Manifest drift versus live cluster PVCs
    - `kubectl apply` failed for several apps because the immutable PVC `storageClassName` in manifests did not match live PVCs.
    - Affected PVCs:
        - `authentik-postgres-data`
        - `code-server-workspace`
        - `n8n-data`
- Historical control-plane/runtime turbulence (resolved at audit time)
    - Previous logs showed transient kubelet/API timeouts and lease renewal failures that explain historical restarts in `metrics-server` and `nfs-provisioner`.
- Edge relay and proxy health
    - TrueNAS relay config test passed (`nginx -t`).
    - Docker host services were healthy.
    - TrueNAS relay errors in logs were historical (mostly from 2026-02-21), with no new matching error lines during this audit window.
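The OOM finding above can be rechecked from pod status. The following is a minimal sketch, assuming the output of `kubectl get pods -A -o json` has already been parsed into a dict. The function name `find_oom_killed` is hypothetical and is not part of this repository's tooling.

```python
def find_oom_killed(pod_list):
    """Return (namespace, pod, container, restart_count) tuples for every
    container whose last termination reason was OOMKilled.

    pod_list: parsed JSON from `kubectl get pods -A -o json` (assumed input).
    """
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        # containerStatuses is absent on pods that have not started yet.
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for cs in statuses:
            terminated = (cs.get("lastState") or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                hits.append(
                    (
                        meta.get("namespace"),
                        meta.get("name"),
                        cs.get("name"),
                        cs.get("restartCount", 0),
                    )
                )
    return hits
```

An empty result only means no container currently reports `OOMKilled` as its last state; it does not replace the kernel log check described above.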
Fixes Applied
- Increased Authentik runtime resources
    - Updated `manifests/apps/authentik/authentik.yml`:
        - `authentik`:
            - requests: `cpu: 500m`, `memory: 512Mi`
            - limits: `cpu: 1000m`, `memory: 1Gi`
        - `authentik-worker`:
            - requests: `cpu: 250m`, `memory: 384Mi`
            - limits: `cpu: 1000m`, `memory: 1Gi`
    - Rolled out deployments and verified readiness with zero restarts on new pods.
- Reconciled manifest storage classes to live immutable PVCs
    - Updated manifests:
        - `manifests/apps/authentik/authentik.yml` -> `local-path`
        - `manifests/apps/code-server/code-server.yml` -> `local-path`
        - `manifests/apps/n8n/n8n.yml` -> `local-path`
    - Result: server-side dry-run and apply are now consistent with live cluster state.
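To catch this kind of storage class drift before `kubectl apply` fails, the declared and live values can be compared directly. This is a rough sketch under stated assumptions: `manifest_classes` is a hand-built mapping of PVC name to the `storageClassName` declared in the manifests, `live_pvc_list` is the parsed output of `kubectl get pvc -A -o json`, and PVC names are unique across namespaces. The function name `storage_class_drift` is hypothetical.

```python
def storage_class_drift(manifest_classes, live_pvc_list):
    """Return (pvc_name, manifest_class, live_class) for every PVC whose
    declared storageClassName differs from the live one.

    Assumes PVC names are unique across namespaces (illustrative shortcut).
    """
    live = {
        item["metadata"]["name"]: item.get("spec", {}).get("storageClassName")
        for item in live_pvc_list.get("items", [])
    }
    drift = []
    for name, wanted in sorted(manifest_classes.items()):
        # PVCs that do not exist yet cannot conflict, so they are skipped.
        if name in live and live[name] != wanted:
            drift.append((name, wanted, live[name]))
    return drift
```

Because `storageClassName` is immutable on an existing PVC, any entry returned here has to be resolved either by editing the manifest (as done in this audit) or by a planned data migration.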
Validation Snapshot (post-fix)
- `kubectl get nodes -o wide`: all nodes `Ready`
- `kubectl get pods -A`: all workloads in `Running` state at check time
- `kubectl top pods -A`: Authentik usage stabilized below new limits at audit time
- `kubectl get events -A --field-selector type=Warning`:
    - only residual warning from pre-fix Authentik worker pod restart history
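The node readiness check in this snapshot could be automated along the following lines. This is a sketch only, assuming the parsed output of `kubectl get nodes -o json` as input; `not_ready_nodes` is a hypothetical helper name.

```python
def not_ready_nodes(node_list):
    """Return the names of nodes that do not report a Ready=True condition.

    node_list: parsed JSON from `kubectl get nodes -o json` (assumed input).
    """
    failing = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions") or []
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            failing.append(node.get("metadata", {}).get("name"))
    return failing
```

An empty list corresponds to the "all nodes `Ready`" result recorded above.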
Recommended Follow-up
- Plan a controlled storage migration for selected workloads from `local-path` to `truenas-nfs` if cross-node failover is required.
- Add periodic restart/event trend checks (daily) to catch repeated transient failures earlier.
- Keep a dated audit log in `docs/` for every production incident/remediation window.
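One possible shape for the daily restart trend check is to diff container restart counts between two saved snapshots. The sketch below assumes each snapshot is the parsed output of `kubectl get pods -A -o json`; the helper names `restart_counts` and `restart_deltas` are hypothetical, and how snapshots are stored and scheduled is left open.

```python
def restart_counts(pod_list):
    """Map (namespace, pod, container) to its restartCount."""
    counts = {}
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for cs in statuses:
            key = (meta.get("namespace"), meta.get("name"), cs.get("name"))
            counts[key] = cs.get("restartCount", 0)
    return counts


def restart_deltas(previous, current):
    """Return containers whose restart count grew since the previous snapshot.

    Containers missing from the previous snapshot are compared against 0.
    """
    deltas = {}
    for key, count in current.items():
        grown = count - previous.get(key, 0)
        if grown > 0:
            deltas[key] = grown
    return deltas
```

Note that a pod recreated by its controller gets a new name, so its earlier restarts are not carried over; a check keyed on pod names will undercount in that case.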