Operations Audit - 2026-02-22

Scope

Live environment audit and remediation executed on 2026-02-22 against:

  • k3s cluster (kubectl from control node)
  • k3s nodes via SSH (K8-Master, k3s-worker-01, k3s-worker-02, k3s-worker-03)
  • TrueNAS relay host (Truenas)
  • Docker edge host (192.168.30.117)

Findings

  1. Authentik instability from OOM kills
     • authentik and authentik-worker pods had repeated restarts with Last State: OOMKilled.
     • Worker-node kernel logs confirmed memory cgroup kills for Authentik worker processes.

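The OOM finding above can be reproduced with two checks: the last termination reason on the pods, and the kernel log on the worker. This is a sketch; the authentik namespace and the worker hostname are assumptions, substitute your own.

```shell
# Assumed namespace: authentik. Print each pod's last termination reason;
# OOMKilled here confirms the restarts were memory-cgroup kills.
kubectl -n authentik get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'

# On the affected worker (hostname assumed), look for cgroup OOM kills
# in the kernel ring buffer.
ssh k3s-worker-01 'dmesg -T | grep -i "memory cgroup"'
```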
  2. Manifest drift versus live cluster PVCs
     • kubectl apply failed for several apps because the immutable PVC storageClassName in the manifests did not match the live PVCs.
     • Affected PVCs:
       • authentik-postgres-data
       • code-server-workspace
       • n8n-data

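Manifest-versus-live drift on an immutable field can be surfaced before any apply. A minimal sketch, assuming the authentik namespace and the manifest path from this audit:

```shell
# Read the storageClassName the live (immutable) PVC actually has.
kubectl -n authentik get pvc authentik-postgres-data \
  -o jsonpath='{.spec.storageClassName}{"\n"}'

# A server-side dry run surfaces the immutability conflict
# without touching the cluster.
kubectl apply --dry-run=server -f manifests/apps/authentik/authentik.yml
```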
  3. Historical control-plane/runtime turbulence (resolved at audit time)
     • Earlier logs showed transient kubelet/API timeouts and lease-renewal failures that explain historical restarts of metrics-server and nfs-provisioner.

  4. Edge relay and proxy health
     • TrueNAS relay config test passed (nginx -t).
     • Docker host services were healthy.
     • TrueNAS relay log errors were historical (mostly from 2026-02-21), with no new matching error lines during this audit window.
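The edge checks above can be run non-invasively from the control node. A sketch under assumptions: SSH access by the hostnames/IPs listed in Scope, and nginx plus Docker reachable on those hosts.

```shell
# Validate the relay's nginx config without reloading it.
ssh Truenas 'nginx -t'

# Container status overview on the Docker edge host;
# "Up ... (healthy)" indicates passing healthchecks where defined.
ssh 192.168.30.117 'docker ps --format "table {{.Names}}\t{{.Status}}"'
```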

Fixes Applied

  1. Increased Authentik runtime resources
     • Updated manifests/apps/authentik/authentik.yml:
       • authentik:
         • requests: cpu: 500m, memory: 512Mi
         • limits: cpu: 1000m, memory: 1Gi
       • authentik-worker:
         • requests: cpu: 250m, memory: 384Mi
         • limits: cpu: 1000m, memory: 1Gi
     • Rolled out the deployments and verified readiness with zero restarts on the new pods.

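The actual fix was made by editing the manifest, but the same resource change can be expressed imperatively, which is useful for verifying the values before committing them. Namespace and deployment names (authentik, authentik-worker) are assumptions based on the pod names in Findings.

```shell
# Imperative equivalent of the manifest change (assumed namespace: authentik).
kubectl -n authentik set resources deployment/authentik \
  --requests=cpu=500m,memory=512Mi --limits=cpu=1000m,memory=1Gi
kubectl -n authentik set resources deployment/authentik-worker \
  --requests=cpu=250m,memory=384Mi --limits=cpu=1000m,memory=1Gi

# Wait for the rollout, then confirm the new pods show zero restarts.
kubectl -n authentik rollout status deployment/authentik
kubectl -n authentik rollout status deployment/authentik-worker
kubectl -n authentik get pods
```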
  2. Reconciled manifest storage classes to the live immutable PVCs
     • Updated manifests:
       • manifests/apps/authentik/authentik.yml -> local-path
       • manifests/apps/code-server/code-server.yml -> local-path
       • manifests/apps/n8n/n8n.yml -> local-path
     • Result: server-side dry-run and apply are now consistent with live cluster state.
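The reconciliation result can be re-verified at any time by dry-running each touched manifest against the API server; a clean exit for all three means no drift remains.

```shell
# Server-side dry run of each reconciled manifest; any immutable-field
# conflict would fail here before a real apply.
for f in manifests/apps/authentik/authentik.yml \
         manifests/apps/code-server/code-server.yml \
         manifests/apps/n8n/n8n.yml; do
  kubectl apply --dry-run=server -f "$f" || echo "DRIFT: $f"
done
```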

Validation Snapshot (post-fix)

  • kubectl get nodes -o wide: all nodes Ready
  • kubectl get pods -A: all workloads in Running state at check time
  • kubectl top pods -A: Authentik usage stabilized below new limits at audit time
  • kubectl get events -A --field-selector type=Warning: only residual warnings from the pre-fix Authentik worker pod restart history
Recommendations

  1. Plan a controlled storage migration for selected workloads from local-path to truenas-nfs if cross-node failover is required.
  2. Add periodic restart/event trend checks (daily) to catch repeated transient failures earlier.
  3. Keep a dated audit log in docs/ for every production incident/remediation window.
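The daily restart/event trend check in recommendation 2 can be a short cron-able script; a sketch, not the audited tooling:

```shell
# Pods sorted by restart count of their first container; the tail shows
# the most-restarted workloads cluster-wide.
kubectl get pods -A \
  --sort-by='.status.containerStatuses[0].restartCount' \
  -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount' \
  | tail -n 10

# Warning events ordered by recency, to catch repeated transient failures.
kubectl get events -A --field-selector type=Warning \
  --sort-by='.lastTimestamp'
```

Appending dated output of this script to the docs/ audit log (recommendation 3) gives a simple trend line between incidents.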