
This document describes the homelab's backup strategy.

Overview

Noah's Ark is the homelab's overarching backup and data-protection strategy. The name reflects the core principle: before any disaster strikes, everything critical must already be safely aboard. This document defines storage media, backup services, tier assignments, schedules, and retention policies.


Storage Media

Three physical storage media are used across the backup infrastructure. Each has a defined role and they are not interchangeable.

NVMe (Ephemeral / Fast Scratch)

NVMe drives are used exclusively for operating system volumes, VM boot disks, and short-lived cache. They are explicitly not a backup target. TrueNAS and other storage VMs may have NVMe visibility for performance purposes, but no long-term backup data is written here.

  • ✅ OS & boot volumes
  • ✅ Cache and write-intent logs
  • ❌ Long-term backup data
  • ❌ Replicated datasets

HDD (Primary Backup Storage)

Spinning hard drives are the primary medium for all backup workloads. The current deployment is a 2×4 TB external HDD enclosure attached to the Proxmox Backup Server VM.

  • ✅ PBS datastores
  • ✅ TrueNAS bulk datasets
  • ✅ Cold archive copies
  • Higher capacity-to-cost ratio makes this the default for retention-heavy tiers

S3 (Offsite Object Storage)

S3 is the offsite layer of the backup strategy. It provides geographical separation from the physical homelab and holds encrypted, versioned remote copies synced from the local backup services.

  • ✅ Offsite DR copy (Gold and Platinum tiers)
  • ✅ Immutable/versioned bucket policies for ransomware resistance
  • ✅ Lifecycle rules to tier cold data automatically
  • ❌ Not a primary backup destination — always a sync target from local
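
A rough sketch of the bucket setup implied by the versioning and object-lock bullets above, using the standard AWS CLI. The bucket name homelab-backups, the region, and the 30-day compliance window are assumptions (the Platinum tier would raise Days to 90); Object Lock has to be enabled when the bucket is created.

# Create a versioned, object-locked bucket (names, region, and retention are placeholders)
# Enabling Object Lock at creation time also enables versioning automatically.
aws s3api create-bucket --bucket homelab-backups --region eu-west-1 \
  --create-bucket-configuration LocationConstraint=eu-west-1 \
  --object-lock-enabled-for-bucket
aws s3api put-object-lock-configuration --bucket homelab-backups \
  --object-lock-configuration '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Days":30}}}'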

Storage Services

Proxmox Backup Server (PBS)

PBS runs as a VM on Proxmox, managed with HA between pve-1 and pve-2 via Ceph. The VM has:

  • 32 GB NVMe (OS/config disk — Ceph-backed for HA migration)
  • 2×4 TB HDD via external enclosure (backup datastore — direct-attached to pve-2)

⚠️ Known HA Limitation — Datastore Availability After Failover

Problem: The HDD enclosure is physically connected to pve-2. If pve-2 fails and the PBS VM migrates to pve-1, the VM starts successfully but the /datastore mount is unavailable because the drives are no longer accessible. PBS enters a degraded state.

Assessed options:

Option | Feasibility | Notes
Accept manual recovery | ✅ Practical | Reconnect the drives to pve-1 or manually restart pve-2; suitable if RTO > 1 hour
Expose drives via iSCSI from pve-2 | ✅ Recommended | pve-2 runs a lightweight iSCSI target (e.g. tgt or scst); the PBS VM mounts the datastore over the network, regardless of which hypervisor runs it
Move datastore to Ceph RBD | ⚠️ With caveats | Solves availability but writes backup data into the same Ceph cluster being backed up; defeats the independence principle
Dedicated physical backup node | ✅ Long-term ideal | Removes the HA dependency entirely; PBS runs bare-metal on a dedicated machine with local drives

Current recommendation: Implement the iSCSI approach as a medium-term fix. Target the HDD enclosure from pve-2 using tgt, connect the PBS VM via an iSCSI initiator, and mount as a block device. This makes the datastore network-addressable and removes the physical attachment constraint.
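
A minimal sketch of that wiring, assuming the enclosure appears as /dev/sdb on pve-2 and the PBS VM's address is 10.0.0.50; the IQN, device path, and hostnames are all placeholders.

# On pve-2: expose the enclosure as an iSCSI LUN with tgt
# (runtime config only; for persistence the same target would go into /etc/tgt/conf.d/)
apt install tgt
tgtadm --lld iscsi --op new --mode target --tid 1 -T iqn.2025-01.lab.pve2:pbs-datastore
tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/sdb
tgtadm --lld iscsi --op bind --mode target --tid 1 -I 10.0.0.50   # restrict access to the PBS VM

# On the PBS VM: discover, log in, and make the session persist across reboots
apt install open-iscsi
iscsiadm -m discovery -t sendtargets -p pve-2.lab
iscsiadm -m node -T iqn.2025-01.lab.pve2:pbs-datastore -p pve-2.lab --login
iscsiadm -m node -T iqn.2025-01.lab.pve2:pbs-datastore --op update -n node.startup -v automatic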

TODO — Validate iSCSI failover

  • Set up tgt on pve-2 exposing /dev/sdX as an iSCSI LUN
  • Configure PBS VM with open-iscsi initiator
  • Test PBS VM live migration to pve-1 with iSCSI session persistence
  • Confirm datastore mounts cleanly post-migration
  • Document reconnection runbook for the manual-recovery fallback

PBS Datastore Layout (Planned)

/mnt/datastore/
├── primary/        # Main backup target for all tiers
└── archive/        # Cold copies, infrequently pruned
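
Once the mount exists, the two datastores would be registered on PBS roughly like this (run on the PBS VM; names and paths match the planned layout above):

# Register the planned datastores with PBS
proxmox-backup-manager datastore create primary /mnt/datastore/primary
proxmox-backup-manager datastore create archive /mnt/datastore/archive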

TrueNAS (VM on Proxmox)

TrueNAS runs as a VM and provides ZFS-backed network storage for the homelab. Its role in the backup strategy is:

  • SMB/NFS share target for file-level backups from workstations and containers
  • ZFS snapshot source — periodic snapshots replicated to PBS or S3
  • Staging area for large datasets before S3 upload

TrueNAS is itself a backup source as well as a storage service — its ZFS datasets must be included in backup tier assignments.
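
In TrueNAS the snapshot and replication steps are normally configured as periodic snapshot plus replication tasks in the UI; the manual equivalent of one replication cycle looks roughly like this. The pool, dataset, snapshot names, and target host are assumptions.

# Snapshot a dataset and push an incremental stream to another box over SSH (illustrative only)
zfs snapshot -r tank/documents@daily-2025-01-02
zfs send -R -i tank/documents@daily-2025-01-01 tank/documents@daily-2025-01-02 \
  | ssh backup-host zfs receive -F backup/documents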

Note

The TrueNAS VM uses NVMe only for cache devices (L2ARC). Pool vdevs use HDD.


Backup Tier System

The tier system assigns a protection level to each service, VM, or dataset. Higher tiers mean more frequent backups, longer retention, more redundancy, and verified restores. Assign tiers based on criticality and acceptable data-loss window.


🥉 Bronze — Local Snapshot, Best-Effort

Use for: Non-critical VMs, dev/test environments, disposable workloads.
RPO (Recovery Point Objective): Up to 7 days
RTO (Recovery Time Objective): Best-effort, no SLA
Storage: PBS local datastore only
Offsite: None

Setting | Value
Schedule | Weekly (Sunday 02:00)
Retention: keep-last | 2
Retention: keep-weekly | 4
Retention: keep-monthly | 1
Encryption | No
Verify job | No
Restore tested | No

Suitable for:

  • Scratch VMs and containers
  • Build/CI agents that can be reprovisioned from code
  • Dev databases with no production data
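
For reference, the settings table above maps onto vzdump's prune options roughly like this; the VMID and storage name are placeholders, and in practice the schedule would live in a PVE backup job rather than a one-off command.

# Hypothetical one-shot equivalent of the Bronze policy (VMID 9001 and storage 'pbs' are placeholders)
vzdump 9001 --storage pbs --mode snapshot \
  --prune-backups keep-last=2,keep-weekly=4,keep-monthly=1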

🥈 Silver — Daily Local Backup, Short Retention

Use for: Standard homelab services — important but recoverable within a day.
RPO: 24 hours
RTO: < 4 hours
Storage: PBS local datastore + TrueNAS ZFS snapshots
Offsite: None

Setting | Value
Schedule | Daily (03:00)
Retention: keep-last | 7
Retention: keep-weekly | 4
Retention: keep-monthly | 2
Encryption | Optional
Verify job | Weekly
Restore tested | Quarterly (manual spot check)

Suitable for:

  • Standard self-hosted services (dashboards, monitoring stacks)
  • Personal media servers
  • Non-production databases
  • TrueNAS datasets containing media/documents

🥇 Gold — Daily Local + S3 Offsite, Extended Retention

Use for: Important services with irreplaceable or hard-to-recreate data.
RPO: 24 hours
RTO: < 2 hours (local); < 8 hours (from S3)
Storage: PBS local datastore + S3 sync
Offsite: ✅ S3 (encrypted, versioned)

Setting | Value
Schedule | Daily (03:30)
Retention: keep-last | 14
Retention: keep-weekly | 8
Retention: keep-monthly | 6
Retention: keep-yearly | 1
Encryption | Required (PBS client-side encryption)
S3 sync | Daily, after the backup job completes
S3 bucket policy | Versioning enabled, 30-day object lock
Verify job | Weekly
Restore tested | Monthly (automated or manual)

S3 Sync Method:

# Using proxmox-backup-client or rclone for datastore sync
rclone sync /mnt/datastore/primary s3:homelab-backups/primary \
  --s3-server-side-encryption AES256 \
  --transfers 4 \
  --log-file /var/log/rclone-sync.log
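
One possible way to schedule the sync is a plain cron entry; the path, user, and timing below are assumptions (05:00 leaves headroom after the 03:30 backup window), and the outstanding TODO still covers choosing between cron and a systemd timer.

# /etc/cron.d/rclone-s3-sync (assumed path); runs daily after the Gold backup window
0 5 * * * root rclone sync /mnt/datastore/primary s3:homelab-backups/primary --log-file /var/log/rclone-sync.log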

Suitable for:

  • PBS VM itself (meta-backup of the backup service)
  • TrueNAS configuration and critical datasets
  • Identity/auth services (e.g. Authentik, LLDAP)
  • Home automation state (Home Assistant)
  • Network config (MikroTik export backups)

💎 Platinum — Frequent Local + Offsite + Verified, Maximum Retention

Use for: Critical services where data loss or extended downtime is unacceptable.
RPO: 4 hours
RTO: < 1 hour (local)
Storage: PBS local datastore + TrueNAS ZFS replication + S3 offsite
Offsite: ✅ S3 (encrypted, immutable object lock)

Setting | Value
Schedule | Every 6 hours (00:00 / 06:00 / 12:00 / 18:00)
Retention: keep-last | 28
Retention: keep-weekly | 12
Retention: keep-monthly | 12
Retention: keep-yearly | 3
Encryption | Required (PBS client-side, AES-256)
S3 sync | After every backup job
S3 bucket policy | Versioning + Object Lock (Compliance mode, 90 days)
Verify job | After every sync
Restore tested | Monthly automated restore drill

Restore Drill Procedure:

  1. Spin up an isolated VLAN or test namespace on Proxmox
  2. Restore latest backup from PBS to test VM
  3. Validate service health (HTTP check, DB query, or equivalent)
  4. Log result in homelab runbook with timestamp
  5. Destroy test VM and clean up
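
A scripted sketch of that drill, assuming the PBS storage is registered on the PVE node as "pbs", jq is installed, and the node name matches the hostname. The VMIDs, target storage, bridge, and health check are all placeholders, not the final drill script.

#!/usr/bin/env bash
# Hypothetical Platinum restore drill; adapt IDs, storage names, and the health check.
set -euo pipefail

SRC_VMID=100        # VM being drilled
TEST_VMID=9100      # throwaway VMID for the restored copy
PBS_STORAGE=pbs     # PBS storage name as configured on the PVE node

# Pick the newest backup of the source VM from the PBS storage
LATEST=$(pvesh get "/nodes/$(hostname)/storage/${PBS_STORAGE}/content" \
           --content backup --output-format json \
         | jq -r "[.[] | select(.volid | contains(\"vm/${SRC_VMID}/\"))] | sort_by(.ctime) | last | .volid")

# Restore into an isolated test VM and move it onto a test-only bridge
qmrestore "${LATEST}" "${TEST_VMID}" --storage local-lvm --unique 1
qm set "${TEST_VMID}" --net0 virtio,bridge=vmbr99
qm start "${TEST_VMID}"

# Placeholder health check: replace with an HTTP probe or DB query for the real service
sleep 60
qm status "${TEST_VMID}"

# Log the result and clean up
echo "$(date -Is) restore drill for VM ${SRC_VMID} from ${LATEST}" >> /var/log/restore-drills.log
qm stop "${TEST_VMID}"
qm destroy "${TEST_VMID}" --purge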

Suitable for:

  • Password manager (Vaultwarden)
  • Certificate authority and PKI data
  • Any VM storing financial records or irreplaceable personal documents
  • DNS/DHCP config that underpins the entire network
  • Wedding and personal photo archives

Tier Assignment Register

Work in progress

Populate this table as services are onboarded. Each service must have an explicit tier — unassigned means unprotected.

Service / Dataset | Type | Tier | Notes
Vaultwarden | VM | 💎 Platinum | Password manager; zero tolerance for loss
Home Assistant | VM | 🥇 Gold | State and automations are irreplaceable
Authentik | VM | 🥇 Gold | Auth provider for all services
TrueNAS (config) | VM | 🥇 Gold | Pool config and dataset structure
TrueNAS (media) | Dataset | 🥈 Silver | Re-downloadable, low urgency
MikroTik exports | File | 🥇 Gold | Network would be unrecoverable without them
DNS (AdGuard/etc.) | VM | 🥈 Silver | Quickly reconfigurable
PBS VM itself | VM | 🥇 Gold | Back up the backup server
Dev/scratch VMs | VM | 🥉 Bronze | Disposable
Monitoring stack | VM | 🥈 Silver | Recoverable from config-as-code

3-2-1 Compliance by Tier

A useful sanity check — the classic 3-2-1 rule states: 3 copies of data, on 2 different media types, with 1 offsite.

Tier | Copies | Media Types | Offsite | 3-2-1 Compliant
🥉 Bronze | 1 | HDD | ❌ | ❌
🥈 Silver | 2 (PBS + ZFS snap) | HDD | ❌ | Partial
🥇 Gold | 3 (PBS + ZFS + S3) | HDD + Object | ✅ | ✅
💎 Platinum | 3+ (PBS + ZFS + S3) | HDD + Object | ✅ | ✅

Note on Bronze

Bronze is intentionally non-compliant. It is only appropriate for truly disposable workloads. If a Bronze service becomes important, it must be re-tiered.


Key Outstanding TODOs

  • Validate iSCSI datastore approach for PBS HA failover (see PBS section)
  • Assign tiers to all running VMs and containers
  • Configure S3 bucket with versioning + object lock for Gold/Platinum
  • Implement rclone sync job and schedule via systemd timer or cron
  • Write automated restore drill script for Platinum tier
  • Document manual recovery runbook for PBS datastore unavailability
  • Review PBS encryption key backup — keys must be stored independently of the PBS datastore (e.g. printed, in Vaultwarden, and in S3)
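
For the last item, the client-side key can be generated and rendered into a printable copy roughly like this; the paths are placeholders, and the paperkey output is what would be printed and copied into Vaultwarden and S3.

# Create the PBS client encryption key and render a printable backup of it
proxmox-backup-client key create /root/pbs-encryption.json
proxmox-backup-client key paperkey /root/pbs-encryption.json > /root/pbs-key-paper.txt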