
This document describes the homelab's backup strategy.

Overview

Noah's Ark is the homelab's overarching backup and data-protection strategy. The name reflects the core principle: before any disaster strikes, everything critical must already be safely aboard. This document defines storage media, backup services, tier assignments, schedules, and retention policies.


Storage Media

Three physical storage media are used across the backup infrastructure. Each has a defined role and they are not interchangeable.

NVMe (Ephemeral / Fast Scratch)

NVMe drives are used exclusively for operating system volumes, VM boot disks, and short-lived cache. They are explicitly not a backup target. TrueNAS and other storage VMs may have NVMe visibility for performance purposes, but no long-term backup data is written here.

  • ✅ OS & boot volumes
  • ✅ Cache and write-intent logs
  • ❌ Long-term backup data
  • ❌ Replicated datasets

HDD (Primary Backup Storage)

Spinning hard drives are the primary medium for all backup workloads. The current deployment is a 2×4 TB external HDD enclosure attached to the Proxmox Backup Server VM.

  • ✅ PBS datastores
  • ✅ TrueNAS bulk datasets
  • ✅ Cold archive copies
  • Higher capacity-to-cost ratio makes this the default for retention-heavy tiers

S3 (Offsite Object Storage)

S3 is the offsite layer of the backup strategy. It provides geographical separation from the physical homelab and holds encrypted, versioned remote copies synced from the local backup services.

  • ✅ Offsite DR copy (Gold and Platinum tiers)
  • ✅ Immutable/versioned bucket policies for ransomware resistance
  • ✅ Lifecycle rules to tier cold data automatically
  • ❌ Not a primary backup destination — always a sync target from local
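
A rough sketch of the bucket setup implied by the versioning and object-lock bullets above, using the standard AWS CLI. The bucket name homelab-backups, the region, and the 30-day compliance window are assumptions (the Platinum tier would raise Days to 90); Object Lock has to be enabled when the bucket is created.

# Create a versioned, object-locked bucket (names, region, and retention are placeholders)
# Enabling Object Lock at creation time also enables versioning automatically.
aws s3api create-bucket --bucket homelab-backups --region eu-west-1 \
  --create-bucket-configuration LocationConstraint=eu-west-1 \
  --object-lock-enabled-for-bucket
aws s3api put-object-lock-configuration --bucket homelab-backups \
  --object-lock-configuration '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Days":30}}}'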

Storage Services

Proxmox Backup Server (PBS)

PBS runs as a VM on Proxmox, managed with HA between pve-1 and pve-2 via Ceph. The VM has:

  • 32 GB NVMe (OS/config disk — Ceph-backed for HA migration)
  • 2×4 TB HDD via external enclosure (backup datastore — direct-attached to pve-2)

⚠️ Known HA Limitation — Datastore Availability After Failover

Problem: The HDD enclosure is physically connected to pve-2. If pve-2 fails and the PBS VM migrates to pve-1, the VM starts successfully but the /datastore mount is unavailable because the drives are no longer accessible. PBS enters a degraded state.

Assessed options:

Option | Feasibility | Notes
Accept manual recovery | ✅ Practical | Reconnect the drives to pve-1 or manually restart pve-2; suitable if RTO > 1 hour
Expose drives via iSCSI from pve-2 | ✅ Recommended | pve-2 runs a lightweight iSCSI target (e.g. tgt or scst); the PBS VM mounts the datastore over the network, regardless of which hypervisor runs it
Move datastore to Ceph RBD | ⚠️ With caveats | Solves availability but writes backup data into the same Ceph cluster being backed up; defeats the independence principle
Dedicated physical backup node | ✅ Long-term ideal | Removes the HA dependency entirely; PBS runs bare-metal on a dedicated machine with local drives

Current recommendation: Implement the iSCSI approach as a medium-term fix. Target the HDD enclosure from pve-2 using tgt, connect the PBS VM via an iSCSI initiator, and mount as a block device. This makes the datastore network-addressable and removes the physical attachment constraint.
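
A minimal sketch of that wiring, assuming the enclosure appears as /dev/sdb on pve-2 and the PBS VM's address is 10.0.0.50; the IQN, device path, and hostnames are all placeholders.

# On pve-2: expose the enclosure as an iSCSI LUN with tgt
# (runtime config only; for persistence the same target would go into /etc/tgt/conf.d/)
apt install tgt
tgtadm --lld iscsi --op new --mode target --tid 1 -T iqn.2025-01.lab.pve2:pbs-datastore
tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/sdb
tgtadm --lld iscsi --op bind --mode target --tid 1 -I 10.0.0.50   # restrict access to the PBS VM

# On the PBS VM: discover, log in, and make the session persist across reboots
apt install open-iscsi
iscsiadm -m discovery -t sendtargets -p pve-2.lab
iscsiadm -m node -T iqn.2025-01.lab.pve2:pbs-datastore -p pve-2.lab --login
iscsiadm -m node -T iqn.2025-01.lab.pve2:pbs-datastore --op update -n node.startup -v automatic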

TODO — Validate iSCSI failover

  • Set up tgt on pve-2 exposing /dev/sdX as an iSCSI LUN
  • Configure PBS VM with open-iscsi initiator
  • Test PBS VM live migration to pve-1 with iSCSI session persistence
  • Confirm datastore mounts cleanly post-migration
  • Document reconnection runbook for the manual-recovery fallback

PBS Datastore Layout (Planned)

/mnt/datastore/
├── primary/        # Main backup target for all tiers
└── archive/        # Cold copies, infrequently pruned
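
Once the mount exists, the two datastores would be registered on PBS roughly like this (run on the PBS VM; names and paths match the planned layout above):

# Register the planned datastores with PBS
proxmox-backup-manager datastore create primary /mnt/datastore/primary
proxmox-backup-manager datastore create archive /mnt/datastore/archive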

TrueNAS (VM on Proxmox)

TrueNAS runs as a VM and provides ZFS-backed network storage for the homelab. Its role in the backup strategy is:

  • SMB/NFS share target for file-level backups from workstations and containers
  • ZFS snapshot source — periodic snapshots replicated to PBS or S3
  • Staging area for large datasets before S3 upload

TrueNAS is itself a backup source as well as a storage service — its ZFS datasets must be included in backup tier assignments.
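
In TrueNAS the snapshot and replication steps are normally configured as periodic snapshot plus replication tasks in the UI; the manual equivalent of one replication cycle looks roughly like this. The pool, dataset, snapshot names, and target host are assumptions.

# Snapshot a dataset and push an incremental stream to another box over SSH (illustrative only)
zfs snapshot -r tank/documents@daily-2025-01-02
zfs send -R -i tank/documents@daily-2025-01-01 tank/documents@daily-2025-01-02 \
  | ssh backup-host zfs receive -F backup/documents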

Note

The TrueNAS VM uses NVMe only for cache devices (L2ARC). Pool vdevs use HDD.


Backup Tier System

The tier system assigns a protection level to each service, VM, or dataset. Higher tiers mean more frequent backups, longer retention, more redundancy, and verified restores. Assign tiers based on criticality and acceptable data-loss window.


🥉 Bronze — Local Snapshot, Best-Effort

Use for: Non-critical VMs, dev/test environments, disposable workloads.
RPO (Recovery Point Objective): Up to 7 days
RTO (Recovery Time Objective): Best-effort, no SLA
Storage: PBS local datastore only
Offsite: None

Setting | Value
Schedule | Weekly (Sunday 02:00)
Retention: keep-last | 2
Retention: keep-weekly | 4
Retention: keep-monthly | 1
Encryption | No
Verify job | No
Restore tested | No

Suitable for:

  • Scratch VMs and containers
  • Build/CI agents that can be reprovisioned from code
  • Dev databases with no production data
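
For reference, the settings table above maps onto vzdump's prune options roughly like this; the VMID and storage name are placeholders, and in practice the schedule would live in a PVE backup job rather than a one-off command.

# Hypothetical one-shot equivalent of the Bronze policy (VMID 9001 and storage 'pbs' are placeholders)
vzdump 9001 --storage pbs --mode snapshot \
  --prune-backups keep-last=2,keep-weekly=4,keep-monthly=1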

🥈 Silver — Daily Local Backup, Short Retention

Use for: Standard homelab services — important but recoverable within a day.
RPO: 24 hours
RTO: < 4 hours
Storage: PBS local datastore + TrueNAS ZFS snapshots
Offsite: None

Setting | Value
Schedule | Daily (03:00)
Retention: keep-last | 7
Retention: keep-weekly | 4
Retention: keep-monthly | 2
Encryption | Optional
Verify job | Weekly
Restore tested | Quarterly (manual spot check)

Suitable for:

  • Standard self-hosted services (dashboards, monitoring stacks)
  • Personal media servers
  • Non-production databases
  • TrueNAS datasets containing media/documents

🥇 Gold — Daily Local + S3 Offsite, Extended Retention

Use for: Important services with irreplaceable or hard-to-recreate data.
RPO: 24 hours
RTO: < 2 hours (local); < 8 hours (from S3)
Storage: PBS local datastore + S3 sync
Offsite: ✅ S3 (encrypted, versioned)

Setting | Value
Schedule | Daily (03:30)
Retention: keep-last | 14
Retention: keep-weekly | 8
Retention: keep-monthly | 6
Retention: keep-yearly | 1
Encryption | Required (PBS client-side encryption)
S3 sync | Daily, after the backup job completes
S3 bucket policy | Versioning enabled, 30-day object lock
Verify job | Weekly
Restore tested | Monthly (automated or manual)

S3 Sync Method:

# Using proxmox-backup-client or rclone for datastore sync
rclone sync /mnt/datastore/primary s3:homelab-backups/primary \
  --s3-server-side-encryption AES256 \
  --transfers 4 \
  --log-file /var/log/rclone-sync.log
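
One possible way to schedule the sync is a plain cron entry; the path, user, and timing below are assumptions (05:00 leaves headroom after the 03:30 backup window), and the outstanding TODO still covers choosing between cron and a systemd timer.

# /etc/cron.d/rclone-s3-sync (assumed path); runs daily after the Gold backup window
0 5 * * * root rclone sync /mnt/datastore/primary s3:homelab-backups/primary --log-file /var/log/rclone-sync.log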

Suitable for:

  • PBS VM itself (meta-backup of the backup service)
  • TrueNAS configuration and critical datasets
  • Identity/auth services (e.g. Authentik, LLDAP)
  • Home automation state (Home Assistant)
  • Network config (MikroTik export backups)

💎 Platinum — Frequent Local + Offsite + Verified, Maximum Retention

Use for: Critical services where data loss or extended downtime is unacceptable.
RPO: 4 hours
RTO: < 1 hour (local)
Storage: PBS local datastore + TrueNAS ZFS replication + S3 offsite
Offsite: ✅ S3 (encrypted, immutable object lock)

Setting | Value
Schedule | Every 6 hours (00:00 / 06:00 / 12:00 / 18:00)
Retention: keep-last | 28
Retention: keep-weekly | 12
Retention: keep-monthly | 12
Retention: keep-yearly | 3
Encryption | Required (PBS client-side, AES-256)
S3 sync | After every backup job
S3 bucket policy | Versioning + Object Lock (Compliance mode, 90 days)
Verify job | After every sync
Restore tested | Monthly automated restore drill

Restore Drill Procedure:

  1. Spin up an isolated VLAN or test namespace on Proxmox
  2. Restore latest backup from PBS to test VM
  3. Validate service health (HTTP check, DB query, or equivalent)
  4. Log result in homelab runbook with timestamp
  5. Destroy test VM and clean up
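
A scripted sketch of that drill, assuming the PBS storage is registered on the PVE node as "pbs", jq is installed, and the node name matches the hostname. The VMIDs, target storage, bridge, and health check are all placeholders, not the final drill script.

#!/usr/bin/env bash
# Hypothetical Platinum restore drill; adapt IDs, storage names, and the health check.
set -euo pipefail

SRC_VMID=100        # VM being drilled
TEST_VMID=9100      # throwaway VMID for the restored copy
PBS_STORAGE=pbs     # PBS storage name as configured on the PVE node

# Pick the newest backup of the source VM from the PBS storage
LATEST=$(pvesh get "/nodes/$(hostname)/storage/${PBS_STORAGE}/content" \
           --content backup --output-format json \
         | jq -r "[.[] | select(.volid | contains(\"vm/${SRC_VMID}/\"))] | sort_by(.ctime) | last | .volid")

# Restore into an isolated test VM and move it onto a test-only bridge
qmrestore "${LATEST}" "${TEST_VMID}" --storage local-lvm --unique 1
qm set "${TEST_VMID}" --net0 virtio,bridge=vmbr99
qm start "${TEST_VMID}"

# Placeholder health check: replace with an HTTP probe or DB query for the real service
sleep 60
qm status "${TEST_VMID}"

# Log the result and clean up
echo "$(date -Is) restore drill for VM ${SRC_VMID} from ${LATEST}" >> /var/log/restore-drills.log
qm stop "${TEST_VMID}"
qm destroy "${TEST_VMID}" --purge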

Suitable for:

  • Password manager (Vaultwarden)
  • Certificate authority and PKI data
  • Any VM storing financial records or irreplaceable personal documents
  • DNS/DHCP config that underpins the entire network
  • Wedding and personal photo archives

Tier Assignment Register

Work in progress

Populate this table as services are onboarded. Each service must have an explicit tier — unassigned means unprotected.

Service / Dataset | Type | Tier | Notes
Vaultwarden | VM | 💎 Platinum | Password manager; zero tolerance for loss
Home Assistant | VM | 🥇 Gold | State and automations are irreplaceable
Authentik | VM | 🥇 Gold | Auth provider for all services
TrueNAS (config) | VM | 🥇 Gold | Pool config and dataset structure
TrueNAS (media) | Dataset | 🥈 Silver | Re-downloadable, low urgency
MikroTik exports | File | 🥇 Gold | Network would be unrecoverable without them
DNS (AdGuard/etc.) | VM | 🥈 Silver | Quickly reconfigurable
PBS VM itself | VM | 🥇 Gold | Back up the backup server
Dev/scratch VMs | VM | 🥉 Bronze | Disposable
Monitoring stack | VM | 🥈 Silver | Recoverable from config-as-code

3-2-1 Compliance by Tier

A useful sanity check — the classic 3-2-1 rule states: 3 copies of data, on 2 different media types, with 1 offsite.

Tier | Copies | Media Types | Offsite | 3-2-1 Compliant
🥉 Bronze | 1 | HDD | ❌ | ❌
🥈 Silver | 2 (PBS + ZFS snap) | HDD | ❌ | Partial
🥇 Gold | 3 (PBS + ZFS + S3) | HDD + Object | ✅ | ✅
💎 Platinum | 3+ (PBS + ZFS + S3) | HDD + Object | ✅ | ✅

Note on Bronze

Bronze is intentionally non-compliant. It is only appropriate for truly disposable workloads. If a Bronze service becomes important, it must be re-tiered.


Key Outstanding TODOs

  • Validate iSCSI datastore approach for PBS HA failover (see PBS section)
  • Assign tiers to all running VMs and containers
  • Configure S3 bucket with versioning + object lock for Gold/Platinum
  • Implement rclone sync job and schedule via systemd timer or cron
  • Write automated restore drill script for Platinum tier
  • Document manual recovery runbook for PBS datastore unavailability
  • Review PBS encryption key backup — keys must be stored independently of the PBS datastore (e.g. printed, in Vaultwarden, and in S3)
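
For the last item, the client-side key can be generated and rendered into a printable copy roughly like this; the paths are placeholders, and the paperkey output is what would be printed and copied into Vaultwarden and S3.

# Create the PBS client encryption key and render a printable backup of it
proxmox-backup-client key create /root/pbs-encryption.json
proxmox-backup-client key paperkey /root/pbs-encryption.json > /root/pbs-key-paper.txt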