Improvement Opportunities - 2026-01-16
Post-security audit improvement opportunities identified after completing 24/24 documentation items and 21/22 security fixes.
Summary
| Priority | Count | Effort |
|---|---|---|
| Quick Wins | 6 | < 1 hour each |
| Medium | 8 | 1-2 hours each |
| Lower | 5 | 2+ hours each |
Quick Wins (High Impact, Low Effort)
1. Add Health Checks to Missing Services
Effort: 30 min | Impact: High
10 compose files lack healthchecks, so silent failures go undetected and cannot trigger an automatic restart.
| File | Services |
|---|---|
| docker/fixed/nas/storage/docker-compose.yml | Samba, Syncthing |
| docker/fixed/nas/backup/docker-compose.yml | Restic REST |
| docker/vps/backup/docker-compose.yml | Restic REST |
| docker/vps/scraping/docker-compose.yml | changedetection |
| docker/vps/networking/pihole/docker-compose.yml | Pi-hole |
| docker/fixed/docker-vm/networking/pihole/docker-compose.yml | Pi-hole |
| docker/mobile/rpi5/networking/pihole/docker-compose.yml | Pi-hole |
| docker/vps/networking/derp/docker-compose.yml | DERP relay |
| docker/fixed/docker-vm/networking/caddy/docker-compose.yml | Caddy |
| docker/vps/networking/caddy/docker-compose.yml | Caddy |
Template

    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:PORT/health"]
      interval: 30s
      timeout: 10s
      retries: 3

2. Add Service Dependencies (depends_on)
Effort: 20 min | Impact: High
Services may start out of order, causing initialization failures.
| Service | Depends On |
|---|---|
| Frigate | Mosquitto |
| Home Assistant | Mosquitto |
| Jellyfin | NFS mount |
| Sonarr | Prowlarr |
| Radarr | Prowlarr |
Template
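A minimal sketch for the Frigate and Mosquitto pair, assuming both services live in the same compose file. The image tags and the Mosquitto healthcheck command are illustrative assumptions, not taken from the repository.

```yaml
services:
  mosquitto:
    image: eclipse-mosquitto:2        # assumed image tag
    healthcheck:
      # Assumed check: read one message from the broker's $SYS tree.
      # "$$" escapes "$" so Compose does not treat it as a variable.
      # Adjust host/credentials if the broker requires authentication.
      test: ["CMD", "mosquitto_sub", "-h", "localhost", "-t", "$$SYS/broker/uptime", "-C", "1", "-W", "5"]
      interval: 30s
      timeout: 10s
      retries: 3

  frigate:
    image: ghcr.io/blakeblackshear/frigate:stable   # assumed image tag
    depends_on:
      mosquitto:
        condition: service_healthy    # wait for the healthcheck, not just container start
```

depends_on only orders services within one compose project. The Jellyfin dependency on the NFS mount is a host-level mount rather than a service, so it would need a different mechanism.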
3. Add Logging Limits
Effort: 15 min | Impact: High
Unbounded container log growth can fill the disk. No compose file has a logging configuration.
Template (add to all services)
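One possible shape for this template, using the json-file driver's rotation options and a YAML anchor so the block is defined once per compose file. The size and file-count values, and the Caddy service used for illustration, are assumptions to tune per host.

```yaml
# Shared logging settings, defined once via a YAML anchor.
x-logging: &default-logging
  driver: json-file
  options:
    max-size: "10m"   # assumed cap per log file
    max-file: "3"     # assumed number of rotated files to keep

services:
  caddy:
    image: caddy:2    # assumed image tag
    logging: *default-logging   # repeat this line on every service
```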
4. Add Container Labels
Effort: 30 min | Impact: Medium
No compose file defines labels. Labels enable filtering, organization, and automation.
Template

    labels:
      - "com.cronova.environment=vps"        # vps|fixed|mobile
      - "com.cronova.category=networking"    # networking|media|security|automation|backup
      - "com.cronova.critical=true"          # for critical services
      - "com.cronova.backup=true"            # services to back up

5. Create Per-Service READMEs
Effort: 45 min | Impact: Medium
8 docker directories lack setup documentation.
| Directory | Missing README |
|---|---|
| docker/fixed/docker-vm/media/ | *arr stack integration, setup |
| docker/fixed/docker-vm/automation/ | MQTT setup, HA config |
| docker/fixed/docker-vm/security/ | Frigate setup, camera config |
| docker/fixed/nas/storage/ | Samba/Syncthing setup |
| docker/fixed/nas/backup/ | Restic REST setup |
| docker/vps/scraping/ | changedetection setup |
| docker/vps/backup/ | Restic REST setup |
| docker/vps/networking/ | Headscale/DERP/Caddy setup |
6. Document Network Topology
Effort: 45 min | Impact: Medium
docker/README.md discusses network strategy but lacks:
- Visualization of how networks connect between compose files
- Explicit list of which services talk to which networks
- Documented ports/interfaces for inter-service communication
Create: docs/network-topology.md with diagram
Medium Priority
7. Missing Ansible Playbooks
Effort: 1-2 hours | Impact: High
Current playbooks: common.yml, docker.yml, tailscale.yml
Missing
| Playbook | Purpose |
|---|---|
| backup.yml | Deploy restic, configure cronjobs |
| monitoring.yml | Deploy Uptime Kuma, ntfy, configure monitors |
| update.yml | System and container update automation |
| pihole.yml | Deploy Pi-hole to all hosts |
| caddy.yml | Deploy Caddy, manage certificates |
| nfs-server.yml | Configure NFS exports on NAS |
| docker-compose-deploy.yml | Deploy compose files to hosts |
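As a rough starting point, docker-compose-deploy.yml could look like the sketch below. The inventory group, the target directory, and the choice of the Caddy stack are hypothetical. It also assumes the community.docker collection (3.6 or later, for docker_compose_v2) is installed on the control node and that the Docker Compose v2 plugin is present on the host.

```yaml
# ansible/playbooks/docker-compose-deploy.yml (sketch)
- name: Deploy compose files to hosts
  hosts: vps                      # hypothetical inventory group
  become: true
  vars:
    # Source directory in this repository; adjust per stack.
    compose_src: "{{ playbook_dir }}/../../docker/vps/networking/caddy"
    compose_dest: /opt/docker/caddy   # hypothetical target directory
  tasks:
    - name: Ensure target directory exists
      ansible.builtin.file:
        path: "{{ compose_dest }}"
        state: directory
        mode: "0755"

    - name: Copy compose file
      ansible.builtin.copy:
        src: "{{ compose_src }}/docker-compose.yml"
        dest: "{{ compose_dest }}/docker-compose.yml"
        mode: "0644"

    - name: Start or update the stack
      community.docker.docker_compose_v2:
        project_src: "{{ compose_dest }}"
        state: present
```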
8. Service Startup Checklist
Effort: 1 hour | Impact: High
No end-to-end workflow documented for starting services across all 3 environments.
Missing
- Service startup order across environments
- Health check commands for each service
- Verification steps (curl endpoints, port checks)
- Rollback procedures if something fails
Create: docs/service-startup-guide.md
9. Backup Configuration Gaps
Effort: 1 hour | Impact: Medium
docs/disaster-recovery.md exists but is missing:
- Backup verification cron job
- Automated backup scheduling
- Backup encryption key storage procedure
- Offsite backup verification automation
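The scheduling and verification items could be covered by cron entries managed from Ansible, for example in the planned backup.yml. This is only a sketch: the host group, schedule, backup path, and the restic_repo and restic_password_file variables are hypothetical and would need to match the real repository layout.

```yaml
# Sketch of tasks for ansible/playbooks/backup.yml
- name: Schedule restic backups and verification
  hosts: nas                      # hypothetical inventory group
  become: true
  vars:
    restic_repo: "rest:http://localhost:8000/"        # hypothetical repository URL
    restic_password_file: /root/.restic-password      # hypothetical path
  tasks:
    - name: Nightly backup
      ansible.builtin.cron:
        name: restic-backup
        minute: "0"
        hour: "2"
        job: "restic -r {{ restic_repo }} --password-file {{ restic_password_file }} backup /srv/data"

    - name: Weekly repository verification
      ansible.builtin.cron:
        name: restic-check
        minute: "30"
        hour: "3"
        weekday: "0"              # Sunday
        job: "restic -r {{ restic_repo }} --password-file {{ restic_password_file }} check"
```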
10. Image Pull Policies
Effort: 30 min | Impact: Medium
No compose files define pull_policy. Stale images persist on restart.
Recommendation: Add pull_policy: always to critical services (Headscale, Vaultwarden, Pi-hole).
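A minimal illustration for one of the critical services; the image reference and restart policy are assumptions.

```yaml
services:
  vaultwarden:
    image: vaultwarden/server:latest   # assumed image reference
    pull_policy: always                # check the registry on every `docker compose up`
    restart: unless-stopped            # assumed restart policy
```

pull_policy takes effect when Compose creates or recreates the container. A restart triggered by the Docker restart policy reuses the existing container and image.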
11. Restic Backup Sidecar for Backup Infrastructure
Effort: 1 hour | Impact: Medium
Headscale already has a backup sidecar, which is a good model. The Restic REST instances have no automated backup of their own.
12. Service Lifecycle Documentation
Effort: 1 hour | Impact: Medium
Missing:
- Blue-green deployment strategy
- Update verification procedure
- Zero-downtime update guidance
- Rollback procedures
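13. Compose File Version Strategy
Effort: 15 min | Impact: Low
Compose files do not declare a version consistently. Note that Docker Compose v2 follows the Compose Specification, where the top-level version key is obsolete and ignored and depends_on with condition: is supported without it; the legacy 3.x file formats did not support condition:. If version: "3.9" is added for consistency, treat it as informational only.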
14. Incomplete Docker Network Documentation
Effort: 1 hour | Impact: Medium
Missing:
- Network isolation strategy (external networks?)
- Inter-service communication matrix
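If external networks turn out to be the chosen isolation strategy, each compose file would attach to a network created outside of it. A sketch, with a hypothetical network name and an assumed image tag:

```yaml
# In the Caddy compose file: attach to a pre-created shared network.
services:
  caddy:
    image: caddy:2          # assumed image tag
    networks:
      - proxy               # hypothetical shared network name

networks:
  proxy:
    external: true          # must already exist, e.g. via `docker network create proxy`
```

Other compose files that declare the same external network can then reach Caddy by service name, which is the kind of relationship the communication matrix should record.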
Lower Priority
15. Prometheus/Grafana Monitoring Stack
Effort: 2+ hours | Impact: Medium
Monitoring currently consists only of Uptime Kuma (endpoint monitoring). Metrics collection (CPU, memory, disk, network) is missing.
Benefits
- Historical performance metrics
- Capacity planning data
- Alert thresholds on resource utilization
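A possible starting point for such a stack is sketched below. Image tags, published ports, and volume names are assumptions, and the Prometheus scrape configuration (prometheus.yml) is not shown.

```yaml
services:
  prometheus:
    image: prom/prometheus:latest          # assumed tag
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    ports:
      - "127.0.0.1:9090:9090"              # bind to localhost only

  node-exporter:
    image: prom/node-exporter:latest       # assumed tag
    pid: host
    volumes:
      - /:/host:ro,rslave                  # read-only view of the host filesystem
    command:
      - "--path.rootfs=/host"

  grafana:
    image: grafana/grafana:latest          # assumed tag
    volumes:
      - grafana-data:/var/lib/grafana
    ports:
      - "127.0.0.1:3000:3000"              # bind to localhost only

volumes:
  prometheus-data:
  grafana-data:
```

Without host networking, node-exporter reports network metrics for its own container namespace, so host-level network metrics would need network_mode: host on that service.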
16. Disaster Recovery Testing Automation
Effort: 2+ hours | Impact: Medium
Missing:
- Automated monthly restore test
- ntfy notification of test results
- Test environment documentation
17. Service Upgrade Strategy
Effort: 2 hours | Impact: Low
Document for each service:
- Release notes source
- Update frequency
- Test procedure
- Rollback procedure
Create: docs/service-upgrade-strategy.md
18. Capacity Planning
Effort: 2 hours | Impact: Low
Missing:
- CPU/memory requirements per service
- Expected disk usage (Frigate recordings, media)
- Network bandwidth estimates
- Growth projections
- Upgrade triggers
Create: docs/capacity-planning.md
19. Remaining Security Fix (#20)
Status: Acceptable risk
Issue: Containers running as root
Note: Many containers require root. Would need per-service analysis and testing.
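For services where the per-service analysis shows root is not required, the hardening could look roughly like this. The service name, image, and UID/GID are placeholders, and whether a given image tolerates these settings has to be tested individually.

```yaml
services:
  example-service:                    # placeholder service name
    image: example/image:tag          # placeholder image
    user: "1000:1000"                 # assumed UID:GID that owns the data volume
    cap_drop:
      - ALL                           # drop all Linux capabilities
    security_opt:
      - no-new-privileges:true        # block privilege escalation via setuid binaries
```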
Recommended Order
- Health checks + depends_on - Critical for reliability
- Logging limits - Prevent disk issues
- Labels - Organization
- Per-service READMEs - Knowledge transfer
- Service startup checklist - Deployment confidence
- Ansible playbooks - Automation
- Network diagram - System understanding
Files to Create
| File | Purpose |
|---|---|
| docs/network-topology.md | Network visualization |
| docs/service-startup-guide.md | Deployment workflow |
| docs/service-upgrade-strategy.md | Update procedures |
| docs/capacity-planning.md | Resource planning |
| ansible/playbooks/backup.yml | Backup automation |
| ansible/playbooks/monitoring.yml | Monitoring deployment |
| docker/*/README.md (8 files) | Per-service documentation |
Related Documents
- docs/sessions/2026-01-16.md - Session summary
- docs/sessions/improvements-2026-01-16.md - Completed improvements (24/24)
- docs/sessions/security-fixes-2026-01-16.md - Security audit (21/22 complete)
- docs/security-hardening.md - Security best practices