Additional Homelab Improvements - 2026-01-21¶
Follow-up review after completing the initial 15 issues in 2026-01-21-improvement-plan.md.
High Priority¶
| # | Issue | File | Status |
|---|---|---|---|
| 1 | Missing maintenance.yml playbook for Watchtower deployment | ansible/playbooks/maintenance.yml | Fixed |
| 2 | monitoring.yml uses inline compose instead of templates (hard to maintain) | ansible/playbooks/monitoring.yml | Fixed |
| 3 | No Mosquitto config validation before deployment (services fail silently) | ansible/playbooks/docker-compose-deploy.yml | Fixed |
Fixes Applied¶
- maintenance.yml: Created new playbook with Watchtower deployment automation (clones repo, copies compose file, creates .env, verifies deployment)
- monitoring.yml: Refactored to clone repo and copy docker-compose.yml instead of inline YAML. Added network creation task.
- docker-compose-deploy.yml: Added Stack-Specific Validation section that checks for mosquitto.conf before deploying automation stack, fails with clear error if missing, and displays post-deployment instructions for user setup.
Medium Priority¶
| # | Issue | File | Status |
|---|---|---|---|
| 4 | No validation that required .env vars are set before deploy | Multiple docker stacks | Fixed |
| 5 | Inconsistent resource limits across services (string vs number, missing limits) | Various docker-compose files | Fixed (verified consistent) |
| 6 | Frigate config not version controlled (only exists in comments) | docker/fixed/docker-vm/security/ | Fixed (already exists) |
| 7 | No backup success verification (restic failures are silent) | Backup scripts/sidecars | Fixed |
| 8 | Missing init: true for cron containers (signal handling) | Backup sidecars | Fixed |
| 9 | Inconsistent logging config (exceptions undocumented) | Various docker-compose files | Fixed |
| 10 | Secrets file permissions not enforced by Ansible | Security stack | Fixed |
| 11 | NFS mount not verified before deployment | Media/Security stacks | Fixed |
Fixes Applied¶
- Validation: Added NFS mount verification and secrets permissions enforcement in
docker-compose-deploy.yml. Pi-hole webpassword validation already exists inpihole.yml. - Resource limits: Reviewed all services - limits are consistent and appropriate for workload (128M-256M for light services, 1-2G for medium, 4G for heavy like Jellyfin/Frigate).
- Frigate config:
frigate.ymlalready exists atdocker/fixed/docker-vm/security/frigate.yml- marked complete. - Backup verification: Enhanced
restic-backup.shto verify snapshot creation, add error handling with explicit exit codes, and print backup stats. - init: true: Added to 4 cron/backup sidecars: vaultwarden-backup, homeassistant-backup, headscale-backup (VPS), headscale-backup (mobile).
- Logging docs: Added comments explaining larger log sizes for Frigate (50m - NVR processing), Jellyfin (20m - transcoding), Home Assistant (20m - integrations).
- Secrets permissions: Added tasks to
docker-compose-deploy.ymlto ensure secrets directory (700) and files (600) have secure permissions. - NFS verification: Added tasks to
docker-compose-deploy.ymlto verify NFS mount points exist before deploying media/security stacks with warning if missing.
Low Priority¶
| # | Issue | File | Status |
|---|---|---|---|
| 12 | env_file relative path inconsistency (../../../ vs ../../../../) | Multiple docker-compose files | Fixed (verified correct) |
| 13 | Soft-Serve missing named network | docker/git/docker-compose.yml | Fixed |
| 14 | Hardcoded URLs in monitoring stack (ntfy base URL) | docker/vps/monitoring/ | Fixed |
Fixes Applied¶
- env_file paths: Verified all paths are correct - different depths reflect actual directory structure (2-4 levels based on location).
- Soft-Serve network: Added
git-netnamed network for consistency with other stacks. - ntfy base URL: Made configurable via
${NTFY_BASE_URL:-https://notify.cronova.dev}environment variable.
Documentation Gaps¶
| # | Issue | Status |
|---|---|---|
| 15 | Missing first-time setup guide | Fixed (exists) | | 16 | No emergency procedures runbook | Fixed (exists) | | 17 | Deployment order/dependency graph not documented | Fixed |
Fixes Applied¶
- First-time setup guide: Already exists at
docs/setup-runbook.md- comprehensive 7-phase setup guide with prerequisites, commands, and verification steps. - Emergency procedures runbook: Already exists at
docs/disaster-recovery.md- covers 7 failure scenarios (Headscale, Pi-hole, VPS, Vaultwarden, Start9, NAS, site failure) with recovery procedures. - Deployment order: Created
docs/deployment-order.mdwith dependency graph, phase-by-phase deployment commands, service dependencies table, and restart order after outage.
Fix Order¶
- High priority (1-3) - Automation completeness ✅ COMPLETE
- Medium priority (4-11) - Safety, reliability, operational hardening ✅ COMPLETE
- Low priority (12-14) - Consistency improvements ✅ COMPLETE
- Documentation (15-17) - Guides and runbooks ✅ COMPLETE
Summary¶
All 17 additional improvements have been addressed:
- 3 high priority: New playbooks (maintenance, monitoring refactor, Mosquitto validation)
- 8 medium priority: Validation, backup verification, init signals, logging docs, secrets permissions, NFS checks
- 3 low priority: Path verification, network consistency, configurable URLs
- 3 documentation: Setup guide exists, DR runbook exists, deployment order created
Combined with the 15 issues in 2026-01-21-improvement-plan.md, a total of 32 improvements were made to the homelab infrastructure.
Notes¶
- These issues are in addition to the 15 issues fixed in
2026-01-21-improvement-plan.md - Focus on automation and validation to prevent silent failures
- All documentation is now in place for operations