Backup & Restore Runbook¶

Last updated: 2026-02-28

1. What to Back Up¶

Asset	Location	Method
App data (PVCs on NFS)	TrueNAS `/mnt/strange/NSF_Prox/k3s/`	ZFS snapshots + replication
App data (local-path PVCs)	Worker node local disk	Manual copy or node snapshot
k3s etcd	k3s-master-01 `/var/lib/rancher/k3s/server/db/`	`k3s etcd-snapshot`
Terraform state	`terraform/terraform.tfstate` (local, gitignored)	Copy to secure off-host location
Terraform vars	`terraform/terraform.tfvars` (local, gitignored)	Copy to secure off-host location
kubeconfig	`~/.kube/config` or `kubeconfig-raw.yml` (gitignored)	Regenerated on cluster rebuild
Runtime secrets	Kubernetes Secrets (in-cluster only)	Export or recreate from password manager
Ansible inventory	`ansible/group_vars/all.yml` (tracked in git)	Git repo
App manifests	`manifests/` (tracked in git)	Git repo

local-path PVCs (not on TrueNAS)¶

These PVCs use the local-path storage class and live on a worker node's local disk, not on TrueNAS NFS:

PVC	Namespace
`authentik-postgres-data`	authentik
`code-server-workspace`	apps
`n8n-data`	n8n

Back these up separately -- they are not covered by TrueNAS ZFS snapshots.

# Find which node hosts a local-path PV
kubectl get pv -o custom-columns='PV:.metadata.name,NODE:.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0],PATH:.spec.hostPath.path'

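# Copy data off-node (example for n8n)
# Note: this archives the entire local-path storage directory, i.e. every local-path PV on that node
ssh debian@192.168.100.111 "sudo tar czf /tmp/n8n-data.tar.gz -C /var/lib/rancher/k3s/storage/ ."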
scp debian@192.168.100.111:/tmp/n8n-data.tar.gz ./backups/
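To archive one PVC at a time instead of the whole storage directory, a minimal sketch follows. It assumes the local-path provisioner's default directory naming, `<pv-name>_<namespace>_<pvc-name>`, under `/var/lib/rancher/k3s/storage/`. The helper names `find_pvc_dir` and `archive_pvc_dir` are hypothetical, and the functions are meant to run on the worker node itself with enough privileges to read the storage directory.

```shell
# Print the directory backing one PVC under a local-path storage root.
# Assumption: default provisioner naming <pv-name>_<namespace>_<pvc-name>.
find_pvc_dir() {
  local root=$1 ns=$2 pvc=$3 d
  for d in "$root"/*_"$ns"_"$pvc"; do
    if [ -d "$d" ]; then
      printf '%s\n' "$d"
      return 0
    fi
  done
  return 1
}

# Archive one PVC directory into a timestamped tarball and print its path.
archive_pvc_dir() {
  local dir=$1 out_dir=$2 out
  out="$out_dir/$(basename "$dir")-$(date +%Y%m%d-%H%M%S).tar.gz"
  tar czf "$out" -C "$dir" . && printf '%s\n' "$out"
}
```

Example on the worker: `archive_pvc_dir "$(find_pvc_dir /var/lib/rancher/k3s/storage n8n n8n-data)" /tmp`, then `scp` the resulting tarball off the node as shown above.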

2. TrueNAS NFS Snapshots (ZFS)¶

All truenas-nfs PVC data lives under the ZFS dataset backing /mnt/strange/NSF_Prox/k3s/.

Create a snapshot¶

ssh ray@192.168.13.69

# Snapshot the k3s NFS dataset
zfs snapshot strange/NSF_Prox/k3s@backup-$(date +%Y%m%d-%H%M%S)

# Or snapshot the parent dataset (includes all children)
zfs snapshot -r strange/NSF_Prox@backup-$(date +%Y%m%d-%H%M%S)

List snapshots¶

ssh ray@192.168.13.69 "zfs list -t snapshot -r strange/NSF_Prox/k3s"

Restore from snapshot¶

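# Rollback to a snapshot (destroys changes after snapshot).
# Plain rollback only works for the most recent snapshot; for an older one add -r,
# which also destroys every snapshot taken after it.
ssh ray@192.168.13.69 "zfs rollback strange/NSF_Prox/k3s@backup-20260228-120000"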

# Or clone to a separate mount for selective recovery
ssh ray@192.168.13.69 "zfs clone strange/NSF_Prox/k3s@backup-20260228-120000 strange/k3s-recovery"
# Copy what you need from /mnt/strange/k3s-recovery/
ssh ray@192.168.13.69 "zfs destroy strange/k3s-recovery"

Replicate to another pool or host¶

# Send snapshot to a file (off-site backup)
ssh ray@192.168.13.69 "zfs send strange/NSF_Prox/k3s@backup-20260228-120000 > /mnt/backup-drive/k3s-snapshot.zfs"

# Send to remote TrueNAS
ssh ray@192.168.13.69 "zfs send strange/NSF_Prox/k3s@backup-20260228-120000 | ssh backup-host zfs recv tank/k3s-backup"

Automate via TrueNAS UI¶

TrueNAS web UI > Data Protection > Periodic Snapshot Tasks
Dataset: strange/NSF_Prox/k3s
Schedule: daily, retain 7 days
Optionally add a Replication Task to push snapshots off-box
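Snapshots taken by hand with the `backup-` prefix are not covered by the periodic task's retention, so they accumulate. A minimal sketch of count-based pruning follows. `prune_candidates` is a hypothetical helper that reads snapshot names on stdin, oldest first, and prints all but the newest N. Note that it keeps a number of snapshots, not a number of days.

```shell
# Read snapshot names (oldest first) on stdin and print every name
# except the newest $1 entries. Prints nothing if there are <= $1 names.
prune_candidates() {
  local keep=$1
  awk -v keep="$keep" '
    { names[NR] = $0 }
    END { for (i = 1; i <= NR - keep; i++) print names[i] }
  '
}
```

On TrueNAS, something like `zfs list -H -o name -t snapshot -s creation -d 1 strange/NSF_Prox/k3s | grep '@backup-' | prune_candidates 7` would list the candidates. Review the output before passing any name to `zfs destroy`.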

3. k3s etcd Snapshots¶

k3s uses embedded etcd on the master node. Snapshots capture all cluster state (deployments, services, configmaps, secrets, etc.) but not PVC data. Because Secrets are included, treat snapshot files as sensitive.

Save a snapshot¶

ssh debian@192.168.100.110

# On-demand snapshot
sudo k3s etcd-snapshot save --name manual-backup

# Default location: /var/lib/rancher/k3s/server/db/snapshots/

List snapshots¶

ssh debian@192.168.100.110 "sudo k3s etcd-snapshot list"

Copy snapshots off-node¶

scp debian@192.168.100.110:/var/lib/rancher/k3s/server/db/snapshots/manual-backup* ./backups/

Restore from snapshot¶

ssh debian@192.168.100.110

# Stop k3s
sudo systemctl stop k3s

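# Restore (use the full snapshot file name; k3s appends the node name and a timestamp to the --name value)
sudo k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/manual-backup-<node>-<timestamp>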

# Start k3s
sudo systemctl start k3s
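k3s appends the node name and a timestamp to the `--name` value, so the file on disk is named like `manual-backup-<node>-<timestamp>` rather than plain `manual-backup`. A small sketch for picking the newest matching file to pass to `--cluster-reset-restore-path` follows. `latest_snapshot` is a hypothetical helper; it relies on file modification times and needs read access to the snapshot directory.

```shell
# Print the most recently modified file in $1 whose name starts with $2.
# Returns non-zero if nothing matches.
latest_snapshot() {
  local dir=$1 prefix=$2 f newest=
  for f in "$dir"/"$prefix"*; do
    [ -e "$f" ] || continue
    if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
      newest=$f
    fi
  done
  [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

Example, run as root on the master: `latest_snapshot /var/lib/rancher/k3s/server/db/snapshots manual-backup`.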

Built-in automatic snapshots¶

k3s saves etcd snapshots every 12 hours by default, retaining 5. Verify:

ssh debian@192.168.100.110 "sudo k3s etcd-snapshot list"

4. Terraform State Backup¶

Terraform state is stored locally (no remote backend) and is gitignored.

File	Purpose
`terraform/terraform.tfstate`	Current infrastructure state
`terraform/terraform.tfstate.backup`	Previous state
`terraform/terraform.tfvars`	Variables (contains Proxmox credentials)

Back up¶

cp terraform/terraform.tfstate backups/terraform.tfstate.$(date +%Y%m%d)
cp terraform/terraform.tfvars backups/terraform.tfvars.$(date +%Y%m%d)
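A slightly more defensive variant of the copies above: copy the file, restrict it to the owner (these files contain credentials), and verify the copy byte for byte. `backup_file` is a hypothetical helper, sketched under the assumption that the `backups/` directory sits on the same machine.

```shell
# Copy $1 into directory $2 with a date suffix, make it owner-only,
# and print the destination path only if the copy matches the source.
backup_file() {
  local src=$1 dest_dir=$2 dest
  dest="$dest_dir/$(basename "$src").$(date +%Y%m%d)"
  mkdir -p "$dest_dir" || return 1
  cp -p "$src" "$dest" || return 1
  chmod 600 "$dest" || return 1
  cmp -s "$src" "$dest" && printf '%s\n' "$dest"
}
```

Example: `backup_file terraform/terraform.tfstate backups` and `backup_file terraform/terraform.tfvars backups`.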

Store copies in your password manager or encrypted off-site storage. These files contain infrastructure credentials.

Recover without state¶

If state is lost, the VMs still exist. Re-import:

cd terraform
terraform init
terraform import proxmox_vm_qemu.k3s_master proxmox-pve1/qemu/110
terraform import 'proxmox_vm_qemu.k3s_worker[0]' proxmox-pve1/qemu/111
terraform import 'proxmox_vm_qemu.k3s_worker[1]' proxmox-pve1/qemu/112
terraform import 'proxmox_vm_qemu.k3s_worker[2]' proxmox-pve1/qemu/113

Exact resource names depend on your .tf definitions. Check terraform/main.tf.

5. Secret Recovery¶

Runtime secrets are never committed to git. Recreate them from your password manager using the commands in manifests/secrets/README.md.

Export existing secrets (before disaster)¶

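# Dump all secrets to a local file, then encrypt it (plaintext exists only briefly)
umask 077                      # keep the plaintext dump readable by the owner only
: > /tmp/all-secrets.yml       # start from an empty file so reruns do not append duplicates
for ns in apps authentik n8n vaultwarden discourse; do
  echo '---' >> /tmp/all-secrets.yml   # separate each namespace's YAML document
  kubectl get secrets -n "$ns" -o yaml >> /tmp/all-secrets.yml
done
# Encrypt and store securely -- never commit this file
gpg -c /tmp/all-secrets.yml
rm /tmp/all-secrets.yml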

Recreate from scratch¶

Replace all CHANGE_ME values with real credentials from your password manager:

# apps namespace
kubectl create secret generic code-server-secret -n apps \
  --from-literal=PASSWORD='CHANGE_ME' \
  --dry-run=client -o yaml | kubectl apply -f -

kubectl create secret generic pangolin-secret -n apps \
  --from-literal=PANGOLIN_APP_SECRET='CHANGE_ME_LONG_RANDOM_SECRET' \
  --dry-run=client -o yaml | kubectl apply -f -

kubectl create secret generic notify-channel-secrets -n apps \
  --from-literal=TELEGRAM_BOT_TOKEN='CHANGE_ME' \
  --from-literal=TELEGRAM_CHAT_ID='CHANGE_ME' \
  --from-literal=TWILIO_ACCOUNT_SID='CHANGE_ME' \
  --from-literal=TWILIO_AUTH_TOKEN='CHANGE_ME' \
  --from-literal=TWILIO_WHATSAPP_FROM='whatsapp:+14155238886' \
  --from-literal=TWILIO_WHATSAPP_TO='whatsapp:+10000000000' \
  --from-literal=ALERT_SMTP_HOST='smtp.gmail.com' \
  --from-literal=ALERT_SMTP_PORT='587' \
  --from-literal=ALERT_EMAIL_USER='CHANGE_ME' \
  --from-literal=ALERT_EMAIL_PASSWORD='CHANGE_ME' \
  --from-literal=ALERT_FROM_EMAIL='notify@kwe2.org' \
  --from-literal=ALERT_TO_EMAIL_1='you@example.com' \
  --dry-run=client -o yaml | kubectl apply -f -

# authentik namespace
kubectl create secret generic authentik-secret -n authentik \
  --from-literal=AUTHENTIK_SECRET_KEY='CHANGE_ME_LONG_RANDOM_SECRET' \
  --from-literal=POSTGRES_DB='authentik' \
  --from-literal=POSTGRES_USER='authentik' \
  --from-literal=POSTGRES_PASSWORD='CHANGE_ME' \
  --dry-run=client -o yaml | kubectl apply -f -

# n8n namespace
kubectl create secret generic n8n-secret -n n8n \
  --from-literal=DB_TYPE='sqlite' \
  --from-literal=N8N_ENCRYPTION_KEY='CHANGE_ME_LONG_RANDOM_SECRET' \
  --from-literal=N8N_USER_MANAGEMENT_JWT_SECRET='CHANGE_ME_LONG_RANDOM_SECRET' \
  --from-literal=WEBHOOK_URL='https://n8n.smartmur.ca' \
  --dry-run=client -o yaml | kubectl apply -f -

# vaultwarden namespace
kubectl create secret generic vaultwarden-secret -n vaultwarden \
  --from-literal=ADMIN_TOKEN='CHANGE_ME_LONG_RANDOM_SECRET' \
  --from-literal=DOMAIN='https://vault.smartmur.ca' \
  --dry-run=client -o yaml | kubectl apply -f -

# discourse namespace
kubectl create secret generic discourse-secret -n discourse \
  --from-literal=POSTGRESQL_PASSWORD='CHANGE_ME' \
  --from-literal=DISCOURSE_EMAIL='admin@example.com' \
  --from-literal=DISCOURSE_PASSWORD='CHANGE_ME' \
  --dry-run=client -o yaml | kubectl apply -f -
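Every command above ships with `CHANGE_ME` placeholders, so it is easy to apply one unedited. A minimal guard follows, assuming you keep your filled-in commands in a local, untracked script (the file name `recreate-secrets.sh` and the helper `check_no_placeholders` are hypothetical).

```shell
# Fail (and show the offending lines on stderr) if the file still
# contains any CHANGE_ME placeholder; succeed otherwise.
check_no_placeholders() {
  local file=$1
  if grep -n 'CHANGE_ME' "$file" >&2; then
    echo "error: placeholders remain in $file" >&2
    return 1
  fi
  return 0
}
```

Example: `check_no_placeholders recreate-secrets.sh && bash recreate-secrets.sh`. Delete the script afterwards, since it contains real credentials.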

6. Full Cluster Restore (Bare Metal)¶

This rebuilds the entire cluster from scratch using the repo's bootstrap scripts.

Prerequisites¶

Proxmox hosts online (PVE1 at 192.168.100.100)
TrueNAS online with NFS exports intact at 192.168.13.69
Control node has: bash, git, terraform, ansible, kubectl, helm
Repo cloned: git clone <repo-url> && cd k3s-cluster
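A quick preflight for the tooling prerequisite can be sketched as follows. `missing_tools` is a hypothetical helper that prints every listed command not found on `PATH`.

```shell
# Print each named tool that is not available on PATH, one per line.
# Prints nothing when everything is installed.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Example: `missing_tools bash git terraform ansible kubectl helm`. Empty output means the control node has everything listed above.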

Step-by-step¶

# 1. Install tooling on control node
bash scripts/00-install-tools.sh

# 2. Create Proxmox VM template (skip if template already exists)
bash scripts/00-create-proxmox-template.sh

# 3. Restore terraform.tfvars from backup
cp backups/terraform.tfvars terraform/terraform.tfvars
# Or recreate from example:
cp terraform/terraform.tfvars.example terraform/terraform.tfvars
# Edit with real Proxmox credentials

# 4. Provision VMs
bash scripts/01-provision.sh

# 5. Bootstrap k3s cluster
export K3S_TOKEN="your-token-min-20-chars"
bash scripts/02-cluster-setup.sh

# 6. Set up storage (NFS provisioner, MetalLB, Traefik)
bash scripts/03-storage-setup.sh

# 7. Recreate all runtime secrets (Section 5 above)
# ... run all kubectl create secret commands ...

# 8. Deploy all apps
bash scripts/04-deploy-apps.sh

# 9. (Optional) Restore etcd snapshot instead of fresh deploy
# If you have an etcd snapshot, use it after step 5 instead of steps 6-8:
#   sudo k3s server --cluster-reset --cluster-reset-restore-path=/path/to/snapshot
#   Then recreate any missing secrets.

# 10. Restore PVC data from TrueNAS ZFS snapshots
# If NFS data survived, PVCs auto-bind. If not:
ssh ray@192.168.13.69 "zfs rollback strange/NSF_Prox/k3s@<snapshot-name>"

# 11. Restore local-path PVC data
# Copy backed-up tarballs to the correct worker node paths

# 12. Sync Dockhand contexts
bash scripts/05-sync-dockhand-contexts.sh

Restoring kubeconfig¶

After step 5, 02-cluster-setup.sh generates a new kubeconfig. If you need it manually:

scp debian@192.168.100.110:~/.kube/config ./kubeconfig-raw.yml
# Update server address if using SSH tunnel
# (sed -i '' is the macOS/BSD form; with GNU sed on Linux use plain sed -i)
sed -i '' 's|https://127.0.0.1:6443|https://127.0.0.1:7443|' ./kubeconfig-raw.yml
export KUBECONFIG=$(pwd)/kubeconfig-raw.yml
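In-place `sed` flags differ between GNU and BSD sed, so a portable alternative is to write to a temporary file and move it into place. A minimal sketch follows; `rewrite_server_port` is a hypothetical helper and assumes the kubeconfig points at `https://127.0.0.1:<port>`.

```shell
# Replace the API server port in a kubeconfig that targets 127.0.0.1.
# Works with both GNU and BSD sed because it avoids the -i flag.
rewrite_server_port() {
  local file=$1 from=$2 to=$3 tmp
  tmp=$(mktemp) || return 1
  sed "s|https://127.0.0.1:${from}|https://127.0.0.1:${to}|" "$file" > "$tmp" || return 1
  mv "$tmp" "$file" || return 1
  chmod 600 "$file"   # kubeconfig holds cluster credentials
}
```

Example: `rewrite_server_port ./kubeconfig-raw.yml 6443 7443`.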

7. Single App Restore¶

Use this procedure to redeploy one app without rebuilding the cluster.

Example: restore vaultwarden¶

# 1. Recreate the secret
kubectl create secret generic vaultwarden-secret -n vaultwarden \
  --from-literal=ADMIN_TOKEN='real-token-from-password-manager' \
  --from-literal=DOMAIN='https://vault.smartmur.ca' \
  --dry-run=client -o yaml | kubectl apply -f -

# 2. Apply the manifest
kubectl apply -f manifests/apps/vaultwarden/vaultwarden.yml

# 3. Verify
kubectl get pods -n vaultwarden -w

If PVC data is lost¶

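# Restore from ZFS snapshot (NFS-backed PVCs); run these commands on TrueNAS
ssh ray@192.168.13.69
# Snapshots cover the whole dataset, so pick one by date rather than by app name
zfs list -t snapshot -r strange/NSF_Prox/k3s
zfs clone strange/NSF_Prox/k3s@<snapshot> strange/k3s-vw-recovery
# List the clone to locate <vaultwarden-pvc-dir>
ls /mnt/strange/k3s-vw-recovery/
# Copying "dir/." includes hidden files, which "dir/*" would skip
cp -a /mnt/strange/k3s-vw-recovery/<vaultwarden-pvc-dir>/. /mnt/strange/NSF_Prox/k3s/<vaultwarden-pvc-dir>/
zfs destroy strange/k3s-vw-recovery
exit

# Back on the control node: restart the pod to pick up restored data
kubectl rollout restart deployment/vaultwarden -n vaultwarden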

General pattern for any app¶

Recreate the app's secret (see Section 5 or manifests/secrets/README.md).
Apply the manifest: kubectl apply -f manifests/apps/<app>/<app>.yml
Restore PVC data from ZFS snapshot if needed.
Verify pod is running and healthy.

8. Validation¶

Run after any restore to confirm the cluster is healthy.

Cluster health¶

kubectl get nodes
kubectl get pods -A
kubectl get pv
kubectl get pvc -A

Storage¶

# NFS provisioner running
kubectl get pods -n storage

# PVCs bound
kubectl get pvc -A --no-headers | grep -v Bound
# (should return nothing)

# TrueNAS NFS accessible from nodes
ssh debian@192.168.100.111 "showmount -e 192.168.13.69"
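For scripted validation, the PVC check above can return an exit code instead of relying on visual inspection. A minimal sketch follows; `unbound_pvcs` is a hypothetical helper that expects the output of `kubectl get pvc -A --no-headers` on stdin, where the third column is the status.

```shell
# Read `kubectl get pvc -A --no-headers` output on stdin. Print
# "namespace/name status" for every PVC that is not Bound and exit
# non-zero if at least one such PVC was found.
unbound_pvcs() {
  awk '
    BEGIN { bad = 0 }
    NF && $3 != "Bound" { print $1 "/" $2 " " $3; bad = 1 }
    END { exit bad }
  '
}
```

Example: `kubectl get pvc -A --no-headers | unbound_pvcs && echo "all PVCs bound"`.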

Networking¶

# MetalLB and Traefik
kubectl get svc traefik -n traefik
# Should show EXTERNAL-IP 192.168.100.120

# IngressRoutes
kubectl get ingressroute -A

App-level¶

# Check each app responds
for url in vault.smartmur.ca n8n.smartmur.ca home.smartmur.ca auth.smartmur.ca code.smartmur.ca blog.smartmur.ca k8s.smartmur.ca; do
  echo -n "$url: "; curl -sk -o /dev/null -w "%{http_code}" https://$url; echo
done
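The loop above prints raw status codes and leaves interpretation to the reader. A sketch that turns them into pass/fail follows. Treating any 2xx or 3xx as healthy is an assumption (apps behind a login may answer with a redirect), and `is_healthy_code` and `check_url` are hypothetical helper names.

```shell
# Succeed for HTTP status codes that show the app answered (2xx or 3xx).
# curl reports 000 when no HTTP response was received at all.
is_healthy_code() {
  case "$1" in
    2[0-9][0-9] | 3[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Fetch https://<host> with a 10 second limit and report OK or FAIL.
check_url() {
  local host=$1 code
  code=$(curl -sk -o /dev/null -m 10 -w '%{http_code}' "https://$host") || true
  if is_healthy_code "$code"; then
    echo "OK   $host ($code)"
  else
    echo "FAIL $host ($code)"
    return 1
  fi
}
```

Example: `for h in vault.smartmur.ca n8n.smartmur.ca; do check_url "$h"; done`.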

Secrets present¶

kubectl get secret vaultwarden-secret -n vaultwarden
kubectl get secret n8n-secret -n n8n
kubectl get secret authentik-secret -n authentik
kubectl get secret code-server-secret -n apps
kubectl get secret discourse-secret -n discourse
kubectl get secret pangolin-secret -n apps
kubectl get secret notify-channel-secrets -n apps

etcd health¶

ssh debian@192.168.100.110 "sudo k3s etcd-snapshot list"