Backup & Restore Runbook
Last updated: 2026-02-28
1. What to Back Up
| Asset | Location | Method |
|---|---|---|
| App data (PVCs on NFS) | TrueNAS /mnt/strange/NSF_Prox/k3s/ | ZFS snapshots + replication |
| App data (local-path PVCs) | Worker node local disk | Manual copy or node snapshot |
| k3s etcd | k3s-master-01 /var/lib/rancher/k3s/server/db/ | k3s etcd-snapshot |
| Terraform state | terraform/terraform.tfstate (local, gitignored) | Copy to secure off-host location |
| Terraform vars | terraform/terraform.tfvars (local, gitignored) | Copy to secure off-host location |
| kubeconfig | ~/.kube/config or kubeconfig-raw.yml (gitignored) | Regenerated on cluster rebuild |
| Runtime secrets | Kubernetes Secrets (in-cluster only) | Export or recreate from password manager |
| Ansible inventory | ansible/group_vars/all.yml (tracked in git) | Git repo |
| App manifests | manifests/ (tracked in git) | Git repo |
local-path PVCs (not on TrueNAS)
These PVCs use local-path and live on the worker node, not TrueNAS NFS:
| PVC | Namespace |
|---|---|
| authentik-postgres-data | authentik |
| code-server-workspace | apps |
| n8n-data | n8n |
Back these up separately -- they are not covered by TrueNAS ZFS snapshots.
# Find which node hosts a local-path PV
kubectl get pv -o custom-columns='PV:.metadata.name,NODE:.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0],PATH:.spec.hostPath.path'
# Copy data off-node (example for n8n; note that this archives every local-path volume on that node)
ssh debian@192.168.100.111 "sudo tar czf /tmp/n8n-data.tar.gz -C /var/lib/rancher/k3s/storage/ ."
scp debian@192.168.100.111:/tmp/n8n-data.tar.gz ./backups/
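To archive a single volume rather than the whole storage directory, the host path behind a PVC can be looked up by name. This is a minimal sketch, assuming the PVs expose `.spec.hostPath.path` as in the `kubectl get pv` query above. The function name `pvc_host_path` is hypothetical.

```shell
# Print the host directory backing a local-path PVC.
# Usage: pvc_host_path <namespace> <pvc-name>
pvc_host_path() {
  local ns="$1" pvc="$2" pv
  # The bound PV name is recorded on the PVC itself.
  pv="$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.spec.volumeName}')" || return 1
  [ -n "$pv" ] || { echo "PVC $ns/$pvc is not bound" >&2; return 1; }
  # Assumption: the PV uses hostPath, as in the custom-columns query above.
  kubectl get pv "$pv" -o jsonpath='{.spec.hostPath.path}'
}

# Example (not run here): pvc_host_path n8n n8n-data
```

The printed path can then be passed to `tar -C` in place of the storage root.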
2. TrueNAS NFS Snapshots (ZFS)
All truenas-nfs PVC data lives under the ZFS dataset backing /mnt/strange/NSF_Prox/k3s/.
Create a snapshot
ssh ray@192.168.13.69
# Snapshot the k3s NFS dataset
zfs snapshot strange/NSF_Prox/k3s@backup-$(date +%Y%m%d-%H%M%S)
# Or snapshot the parent dataset (includes all children)
zfs snapshot -r strange/NSF_Prox@backup-$(date +%Y%m%d-%H%M%S)
List snapshots
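A minimal sketch of this step, reusing the `zfs list -t snapshot` form that appears in Section 7. The wrapper name `list_k3s_snapshots` and the `TRUENAS` variable are hypothetical conveniences.

```shell
# Assumption: the same TrueNAS login used elsewhere in this runbook.
TRUENAS="${TRUENAS:-ray@192.168.13.69}"

# List snapshots of the k3s dataset, oldest first, with creation time and space used.
list_k3s_snapshots() {
  ssh "$TRUENAS" "zfs list -t snapshot -r -o name,creation,used -s creation strange/NSF_Prox/k3s"
}

# Example (not run here): list_k3s_snapshots
```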
Restore from snapshot
# Rollback to a snapshot (destroys changes after snapshot)
ssh ray@192.168.13.69 "zfs rollback strange/NSF_Prox/k3s@backup-20260228-120000"
# Or clone to a separate mount for selective recovery
ssh ray@192.168.13.69 "zfs clone strange/NSF_Prox/k3s@backup-20260228-120000 strange/k3s-recovery"
# Copy what you need from /mnt/strange/k3s-recovery/
ssh ray@192.168.13.69 "zfs destroy strange/k3s-recovery"
Replicate to another pool or host
# Send snapshot to a file (off-site backup)
ssh ray@192.168.13.69 "zfs send strange/NSF_Prox/k3s@backup-20260228-120000 > /mnt/backup-drive/k3s-snapshot.zfs"
# Send to remote TrueNAS
ssh ray@192.168.13.69 "zfs send strange/NSF_Prox/k3s@backup-20260228-120000 | ssh backup-host zfs recv tank/k3s-backup"
Automate via TrueNAS UI
- TrueNAS web UI > Data Protection > Periodic Snapshot Tasks
- Dataset: strange/NSF_Prox/k3s
- Schedule: daily, retain 7 days
- Optionally add a Replication Task to push snapshots off-box
3. k3s etcd Snapshots
k3s uses embedded etcd on the master node. Snapshots capture all cluster state (deployments, services, configmaps, etc.) but not PVC data.
Save a snapshot
ssh debian@192.168.100.110
# On-demand snapshot
sudo k3s etcd-snapshot save --name manual-backup
# Default location: /var/lib/rancher/k3s/server/db/snapshots/
List snapshots
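A minimal sketch of this step using the `ls` subcommand of `k3s etcd-snapshot`. The wrapper name `list_etcd_snapshots` and the `K3S_MASTER` variable are hypothetical.

```shell
# Assumption: the same master login used elsewhere in this runbook.
K3S_MASTER="${K3S_MASTER:-debian@192.168.100.110}"

# List the etcd snapshots that k3s knows about on the master node.
list_etcd_snapshots() {
  ssh "$K3S_MASTER" "sudo k3s etcd-snapshot ls"
}

# Example (not run here): list_etcd_snapshots
```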
Copy snapshots off-node
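One way to do this, sketched under the assumption that the snapshot directory is readable only by root and that the `debian` user has passwordless sudo (as the `sudo tar` example in Section 1 implies). The function name `fetch_etcd_snapshots` and the output file naming are hypothetical.

```shell
K3S_MASTER="${K3S_MASTER:-debian@192.168.100.110}"
SNAP_DIR="/var/lib/rancher/k3s/server/db/snapshots"

# Stream a tarball of the snapshot directory into a dated local file.
# Usage: fetch_etcd_snapshots [destination-dir]   (defaults to ./backups)
fetch_etcd_snapshots() {
  local dest_dir="${1:-./backups}" out
  mkdir -p "$dest_dir"
  out="$dest_dir/etcd-snapshots-$(date +%Y%m%d-%H%M%S).tar.gz"
  # tar writes to stdout on the master; the redirect stores it locally.
  ssh "$K3S_MASTER" "sudo tar czf - -C $SNAP_DIR ." > "$out" || return 1
  echo "$out"
}

# Example (not run here): fetch_etcd_snapshots ./backups
```

Streaming through ssh avoids leaving a temporary copy of the snapshots in /tmp on the master.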
Restore from snapshot
ssh debian@192.168.100.110
# Stop k3s
sudo systemctl stop k3s
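# Restore. Use the actual file name from the snapshots directory:
# k3s appends the node name and a timestamp to the --name value.
sudo k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/<manual-backup-file>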
# Start k3s
sudo systemctl start k3s
Built-in automatic snapshots
k3s saves etcd snapshots every 12 hours by default, retaining 5. Verify:
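A minimal sketch of one way to verify, assuming scheduled snapshots keep the default `etcd-snapshot-` file name prefix. The function name `count_scheduled_snapshots` is hypothetical.

```shell
K3S_MASTER="${K3S_MASTER:-debian@192.168.100.110}"

# Count scheduled snapshot files on the master.
# Assumption: scheduled snapshots use the default "etcd-snapshot-" name prefix,
# so on-demand snapshots such as manual-backup-* are not counted.
count_scheduled_snapshots() {
  ssh "$K3S_MASTER" "sudo ls -1 /var/lib/rancher/k3s/server/db/snapshots/" \
    | grep -c '^etcd-snapshot-'
}

# Example (not run here): count_scheduled_snapshots
```

With the default retention, the count should settle at 5 or fewer once the cluster has been up for a few days. A count of 0 suggests that scheduled snapshots are disabled or use a different name.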
4. Terraform State Backup
Terraform state is stored locally (no remote backend) and is gitignored.
| File | Purpose |
|---|---|
| terraform/terraform.tfstate | Current infrastructure state |
| terraform/terraform.tfstate.backup | Previous state |
| terraform/terraform.tfvars | Variables (contains Proxmox credentials) |
Back up
cp terraform/terraform.tfstate backups/terraform.tfstate.$(date +%Y%m%d)
cp terraform/terraform.tfvars backups/terraform.tfvars.$(date +%Y%m%d)
Store copies in your password manager or encrypted off-site storage. These files contain infrastructure credentials.
Recover without state
If state is lost, the VMs still exist. Re-import:
cd terraform
terraform init
terraform import proxmox_vm_qemu.k3s_master proxmox-pve1/qemu/110
# Quote indexed addresses so the shell does not treat [0] as a glob pattern
terraform import 'proxmox_vm_qemu.k3s_worker[0]' proxmox-pve1/qemu/111
terraform import 'proxmox_vm_qemu.k3s_worker[1]' proxmox-pve1/qemu/112
terraform import 'proxmox_vm_qemu.k3s_worker[2]' proxmox-pve1/qemu/113
Exact resource names depend on your .tf definitions. Check terraform/main.tf.
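If the worker count changes, the import commands can be generated rather than typed by hand. This is a sketch only: it assumes worker `i` has VMID `111 + i`, as in the list above, and the function name `print_import_cmds` is hypothetical. It prints the commands and does not run them.

```shell
# Print the import commands for the master and N workers.
# Assumption: worker i has VMID 111+i, matching the list above.
print_import_cmds() {
  local workers="${1:-3}" i
  echo "terraform import proxmox_vm_qemu.k3s_master proxmox-pve1/qemu/110"
  i=0
  while [ "$i" -lt "$workers" ]; do
    # Single quotes keep the shell from globbing the [index].
    echo "terraform import 'proxmox_vm_qemu.k3s_worker[$i]' proxmox-pve1/qemu/$((111 + i))"
    i=$((i + 1))
  done
}

print_import_cmds 3
```

Review the printed lines against terraform/main.tf before running them.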
5. Secret Recovery
Runtime secrets are never committed to git. Recreate them from your password manager using the commands in manifests/secrets/README.md.
Export existing secrets (before disaster)
# Dump all secrets to a temporary file, then encrypt it and remove the plaintext copy
for ns in apps authentik n8n vaultwarden discourse; do
  # Start a new YAML document per namespace so the concatenated file stays valid
  echo "---" >> /tmp/all-secrets.yml
  kubectl get secrets -n "$ns" -o yaml >> /tmp/all-secrets.yml
done
# Encrypt and store securely -- never commit this file
gpg -c /tmp/all-secrets.yml
rm /tmp/all-secrets.yml
Recreate from scratch
Replace all CHANGE_ME values with real credentials from your password manager:
# apps namespace
kubectl create secret generic code-server-secret -n apps \
--from-literal=PASSWORD='CHANGE_ME' \
--dry-run=client -o yaml | kubectl apply -f -
kubectl create secret generic pangolin-secret -n apps \
--from-literal=PANGOLIN_APP_SECRET='CHANGE_ME_LONG_RANDOM_SECRET' \
--dry-run=client -o yaml | kubectl apply -f -
kubectl create secret generic notify-channel-secrets -n apps \
--from-literal=TELEGRAM_BOT_TOKEN='CHANGE_ME' \
--from-literal=TELEGRAM_CHAT_ID='CHANGE_ME' \
--from-literal=TWILIO_ACCOUNT_SID='CHANGE_ME' \
--from-literal=TWILIO_AUTH_TOKEN='CHANGE_ME' \
--from-literal=TWILIO_WHATSAPP_FROM='whatsapp:+14155238886' \
--from-literal=TWILIO_WHATSAPP_TO='whatsapp:+10000000000' \
--from-literal=ALERT_SMTP_HOST='smtp.gmail.com' \
--from-literal=ALERT_SMTP_PORT='587' \
--from-literal=ALERT_EMAIL_USER='CHANGE_ME' \
--from-literal=ALERT_EMAIL_PASSWORD='CHANGE_ME' \
--from-literal=ALERT_FROM_EMAIL='notify@kwe2.org' \
--from-literal=ALERT_TO_EMAIL_1='you@example.com' \
--dry-run=client -o yaml | kubectl apply -f -
# authentik namespace
kubectl create secret generic authentik-secret -n authentik \
--from-literal=AUTHENTIK_SECRET_KEY='CHANGE_ME_LONG_RANDOM_SECRET' \
--from-literal=POSTGRES_DB='authentik' \
--from-literal=POSTGRES_USER='authentik' \
--from-literal=POSTGRES_PASSWORD='CHANGE_ME' \
--dry-run=client -o yaml | kubectl apply -f -
# n8n namespace
kubectl create secret generic n8n-secret -n n8n \
--from-literal=DB_TYPE='sqlite' \
--from-literal=N8N_ENCRYPTION_KEY='CHANGE_ME_LONG_RANDOM_SECRET' \
--from-literal=N8N_USER_MANAGEMENT_JWT_SECRET='CHANGE_ME_LONG_RANDOM_SECRET' \
--from-literal=WEBHOOK_URL='https://n8n.smartmur.ca' \
--dry-run=client -o yaml | kubectl apply -f -
# vaultwarden namespace
kubectl create secret generic vaultwarden-secret -n vaultwarden \
--from-literal=ADMIN_TOKEN='CHANGE_ME_LONG_RANDOM_SECRET' \
--from-literal=DOMAIN='https://vault.smartmur.ca' \
--dry-run=client -o yaml | kubectl apply -f -
# discourse namespace
kubectl create secret generic discourse-secret -n discourse \
--from-literal=POSTGRESQL_PASSWORD='CHANGE_ME' \
--from-literal=DISCOURSE_EMAIL='admin@example.com' \
--from-literal=DISCOURSE_PASSWORD='CHANGE_ME' \
--dry-run=client -o yaml | kubectl apply -f -
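When a `CHANGE_ME_LONG_RANDOM_SECRET` value has to be created fresh rather than recovered, a random string can be generated locally. This is a minimal sketch: the helper name `gen_secret` is hypothetical, and the 48-byte length is an arbitrary choice, not a requirement stated by any of the apps.

```shell
# Print a 64-character random string (48 random bytes, base64-encoded).
gen_secret() {
  head -c 48 /dev/urandom | base64 | tr -d '\n'
}

# Example: capture a value for one of the *_SECRET literals above.
N8N_ENCRYPTION_KEY="$(gen_secret)"
echo "generated ${#N8N_ENCRYPTION_KEY} characters"
```

Save any newly generated value in the password manager before applying it. Keys such as `N8N_ENCRYPTION_KEY` must match the value the existing data was encrypted with, so generate a new one only for a genuinely fresh install.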
6. Full Cluster Restore (Bare Metal)
This rebuilds the entire cluster from scratch using the repo's bootstrap scripts.
Prerequisites
- Proxmox hosts online (PVE1 at 192.168.100.100)
- TrueNAS online with NFS exports intact at 192.168.13.69
- Control node has: bash, git, terraform, ansible, kubectl, helm
- Repo cloned: git clone <repo-url> && cd k3s-cluster
Step-by-step
# 1. Install tooling on control node
bash scripts/00-install-tools.sh
# 2. Create Proxmox VM template (skip if template already exists)
bash scripts/00-create-proxmox-template.sh
# 3. Restore terraform.tfvars from backup
cp backups/terraform.tfvars terraform/terraform.tfvars
# Or recreate from example:
cp terraform/terraform.tfvars.example terraform/terraform.tfvars
# Edit with real Proxmox credentials
# 4. Provision VMs
bash scripts/01-provision.sh
# 5. Bootstrap k3s cluster
export K3S_TOKEN="your-token-min-20-chars"
bash scripts/02-cluster-setup.sh
# 6. Set up storage (NFS provisioner, MetalLB, Traefik)
bash scripts/03-storage-setup.sh
# 7. Recreate all runtime secrets (Section 5 above)
# ... run all kubectl create secret commands ...
# 8. Deploy all apps
bash scripts/04-deploy-apps.sh
# 9. (Optional) Restore etcd snapshot instead of fresh deploy
# If you have an etcd snapshot, use it after step 5 instead of steps 6-8:
# sudo k3s server --cluster-reset --cluster-reset-restore-path=/path/to/snapshot
# Then recreate any missing secrets.
# 10. Restore PVC data from TrueNAS ZFS snapshots
# If NFS data survived, PVCs auto-bind. If not:
ssh ray@192.168.13.69 "zfs rollback strange/NSF_Prox/k3s@<snapshot-name>"
# 11. Restore local-path PVC data
# Copy backed-up tarballs to the correct worker node paths
# 12. Sync Dockhand contexts
bash scripts/05-sync-dockhand-contexts.sh
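Step 11 can be sketched as the reverse of the tarball backup in Section 1. This assumes passwordless sudo on the worker and an archive created with `tar czf ... -C /var/lib/rancher/k3s/storage/ .`, so that paths inside it are relative to the storage root. The function name `restore_local_path_tarball` is hypothetical.

```shell
# Upload a tarball and unpack it into the local-path storage root on a worker.
# Usage: restore_local_path_tarball <user@node> <tarball>
restore_local_path_tarball() {
  local node="$1" tarball="$2"
  [ -f "$tarball" ] || { echo "missing $tarball" >&2; return 1; }
  scp "$tarball" "$node:/tmp/local-path-restore.tar.gz" || return 1
  # Unpack as root, then remove the uploaded copy.
  ssh "$node" "sudo tar xzf /tmp/local-path-restore.tar.gz -C /var/lib/rancher/k3s/storage/ && rm /tmp/local-path-restore.tar.gz"
}

# Example (not run here):
#   restore_local_path_tarball debian@192.168.100.111 ./backups/n8n-data.tar.gz
```

On a rebuilt cluster the provisioner may create new PV directory names, so the restored contents may need to be moved into the directory the new PV points at. Scale the app down before copying and restart it afterwards.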
Restoring kubeconfig
After step 5, 02-cluster-setup.sh generates a new kubeconfig. If you need it manually:
scp debian@192.168.100.110:~/.kube/config ./kubeconfig-raw.yml
# Update server address if using SSH tunnel
# (sed -i '' is the macOS/BSD form; with GNU sed on Linux, drop the empty '' argument)
sed -i '' 's|https://127.0.0.1:6443|https://127.0.0.1:7443|' ./kubeconfig-raw.yml
export KUBECONFIG=$(pwd)/kubeconfig-raw.yml
7. Single App Restore
Use this procedure to redeploy one app without rebuilding the cluster.
Example: restore vaultwarden
# 1. Recreate the secret
kubectl create secret generic vaultwarden-secret -n vaultwarden \
--from-literal=ADMIN_TOKEN='real-token-from-password-manager' \
--from-literal=DOMAIN='https://vault.smartmur.ca' \
--dry-run=client -o yaml | kubectl apply -f -
# 2. Apply the manifest
kubectl apply -f manifests/apps/vaultwarden/vaultwarden.yml
# 3. Verify
kubectl get pods -n vaultwarden -w
If PVC data is lost
# Restore from ZFS snapshot (NFS-backed PVCs)
ssh ray@192.168.13.69
zfs list -t snapshot -r strange/NSF_Prox/k3s | grep vaultwarden
zfs clone strange/NSF_Prox/k3s@<snapshot> strange/k3s-vw-recovery
# The trailing /. copies hidden files as well, which a * glob would skip
cp -a /mnt/strange/k3s-vw-recovery/<vaultwarden-pvc-dir>/. /mnt/strange/NSF_Prox/k3s/<vaultwarden-pvc-dir>/
zfs destroy strange/k3s-vw-recovery
# Restart the pod to pick up restored data
kubectl rollout restart deployment/vaultwarden -n vaultwarden
General pattern for any app
- Recreate the app's secret (see Section 5 or manifests/secrets/README.md).
- Apply the manifest: kubectl apply -f manifests/apps/<app>/<app>.yml
- Restore PVC data from ZFS snapshot if needed.
- Verify pod is running and healthy.
8. Validation
Run after any restore to confirm the cluster is healthy.
Cluster health
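A minimal sketch of a node health check, assuming the default `kubectl get nodes` column layout (NAME, STATUS, ...). The function name `not_ready_nodes` is hypothetical.

```shell
# Print nodes whose STATUS column is anything other than exactly "Ready".
# Prints nothing when every node is healthy.
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1 " (" $2 ")" }'
}

# Example (not run here):
#   kubectl get nodes -o wide
#   kubectl get pods -A
#   not_ready_nodes
```

A cordoned node reports `Ready,SchedulingDisabled` and is therefore also listed, which is usually worth knowing after a restore.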
Storage
# NFS provisioner running
kubectl get pods -n storage
# PVCs bound
kubectl get pvc -A --no-headers | grep -v Bound
# (should return nothing)
# TrueNAS NFS accessible from nodes
ssh debian@192.168.100.111 "showmount -e 192.168.13.69"
Networking
# MetalLB and Traefik
kubectl get svc traefik -n traefik
# Should show EXTERNAL-IP 192.168.100.120
# IngressRoutes
kubectl get ingressroute -A
App-level
# Check each app responds
for url in vault.smartmur.ca n8n.smartmur.ca home.smartmur.ca auth.smartmur.ca code.smartmur.ca blog.smartmur.ca k8s.smartmur.ca; do
echo -n "$url: "; curl -sk -o /dev/null -w "%{http_code}" https://$url; echo
done
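The status codes printed by the loop above can be turned into a pass/fail verdict. This is a sketch under one assumption: 2xx and 3xx responses, including redirects to a login page, count as reachable. The helper name `classify_http_code` is hypothetical.

```shell
# Classify a curl %{http_code} value.
# 2xx/3xx count as reachable; 000 means curl received no HTTP response at all.
classify_http_code() {
  case "$1" in
    2??|3??) echo "ok" ;;
    000)     echo "unreachable" ;;
    *)       echo "error" ;;
  esac
}

# Demonstration with sample codes.
for code in 200 302 000 503; do
  echo "$code -> $(classify_http_code "$code")"
done
```

In the URL loop, wrap the curl call as `classify_http_code "$(curl -sk -o /dev/null -w '%{http_code}' "https://$url")"` to print a verdict instead of the raw code.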
Secrets present
kubectl get secret vaultwarden-secret -n vaultwarden
kubectl get secret n8n-secret -n n8n
kubectl get secret authentik-secret -n authentik
kubectl get secret code-server-secret -n apps
kubectl get secret discourse-secret -n discourse
kubectl get secret pangolin-secret -n apps
kubectl get secret notify-channel-secrets -n apps
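The same checks can be run as one loop that reports only what is missing. A minimal sketch: the namespace/name pairs are copied from the commands above, and the names `EXPECTED_SECRETS` and `missing_secrets` are hypothetical.

```shell
# namespace/name pairs taken from the list above.
EXPECTED_SECRETS="vaultwarden/vaultwarden-secret n8n/n8n-secret authentik/authentik-secret
apps/code-server-secret discourse/discourse-secret apps/pangolin-secret apps/notify-channel-secrets"

# Print each expected secret that is absent; return non-zero if any is missing.
missing_secrets() {
  local pair ns name rc=0
  for pair in $EXPECTED_SECRETS; do
    ns="${pair%%/*}"     # text before the slash
    name="${pair#*/}"    # text after the slash
    if ! kubectl get secret "$name" -n "$ns" >/dev/null 2>&1; then
      echo "MISSING: $ns/$name"
      rc=1
    fi
  done
  return "$rc"
}

# Example (not run here): missing_secrets && echo "all secrets present"
```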