
Monitoring Strategy

How the homelab is monitored and observed, and how alerts are raised.

Monitoring Stack Overview

| Component | Location | Purpose |
|---|---|---|
| Uptime Kuma | VPS | External health checks (HTTP, TCP, ping) |
| ntfy | VPS | Push notifications (alerts, backup status) |
| VictoriaMetrics (Papa) | Docker VM | Time-series metrics database |
| vmagent | Docker VM | Prometheus-compatible metrics scraper |
| Grafana (Papa) | Docker VM | Dashboards and visualization |
| Dozzle (Ysyry) | Docker VM | Real-time container log viewer |
| Glances | NAS | System resource monitor (HA integration) |
| Watchtower | Docker VM | Auto-update monitoring (Sunday 4 AM) |

Metrics Pipeline

node_exporter (Docker VM :9100) ──┐
node_exporter (NAS :9100) ────────┤
VictoriaMetrics (:8428) ──────────┤
vmagent (:8429) ──────────────────┤
Grafana (:3000) ──────────────────┼──► vmagent ──► VictoriaMetrics ──► Grafana
vmalert (:8880) ──────────────────┤    (30s scrape)    (90d retention)    (papa.cronova.dev)
Alertmanager (:9093) ─────────────┤                         │
cAdvisor (:8080) ─────────────────┤                    vmalert ──► Alertmanager ──► ntfy
Home Assistant (/api/prometheus) ──┘                   (30s eval)    (group/dedup)    (push)

Config: docker/fixed/docker-vm/monitoring/prometheus.yml
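
The alerting leg of the pipeline is driven by vmalert, which evaluates Prometheus-style rule files every 30s and forwards firing alerts to Alertmanager. A minimal sketch of one such rule, assuming a basic instance-down check (the rule file name, alert name, and thresholds are illustrative, not taken from the repo):

```yaml
# alerts.yml (illustrative) -- loaded by vmalert via its -rule flag
groups:
  - name: availability
    interval: 30s                 # matches the 30s eval shown in the diagram
    rules:
      - alert: InstanceDown
        expr: up == 0             # any scrape target vmagent can no longer reach
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Scrape target {{ $labels.instance }} is down"
```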

Scrape Targets

| Job | Target | Labels |
|---|---|---|
| node-docker-vm | host.docker.internal:9100 | instance: docker-vm |
| node-nas | 100.82.77.97:9100 | instance: nas |
| victoriametrics | victoriametrics:8428 | instance: victoriametrics |
| vmagent | vmagent:8429 | instance: vmagent |
| grafana | grafana:3000 | instance: grafana |
| cadvisor | cadvisor:8080 | instance: docker-vm |
| vmalert | vmalert:8880 | instance: vmalert |
| alertmanager | alertmanager:9093 | instance: alertmanager |
| home-assistant | host.docker.internal:8123/api/prometheus | instance: home-assistant |
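
Each table row corresponds to one job in the prometheus.yml referenced above. A partial sketch of how those jobs are likely declared (the Home Assistant token handling and file paths are assumptions; the actual file may differ):

```yaml
global:
  scrape_interval: 30s            # matches the pipeline diagram

scrape_configs:
  - job_name: node-docker-vm
    static_configs:
      - targets: ["host.docker.internal:9100"]
        labels:
          instance: docker-vm

  - job_name: node-nas
    static_configs:
      - targets: ["100.82.77.97:9100"]
        labels:
          instance: nas

  - job_name: home-assistant
    metrics_path: /api/prometheus
    authorization:
      credentials_file: /etc/vmagent/ha-token   # long-lived HA token; path is assumed
    static_configs:
      - targets: ["host.docker.internal:8123"]
        labels:
          instance: home-assistant
```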

VictoriaMetrics

  • Image: victoriametrics/victoria-metrics:latest
  • Port: 8428 (localhost only)
  • Retention: 90 days
  • Memory limit: 1GB
  • Data volume: vm-data
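
A compose sketch that would match the settings above (the actual service definition lives in the Docker VM compose files and may differ):

```yaml
services:
  victoriametrics:
    image: victoriametrics/victoria-metrics:latest
    command:
      - "-retentionPeriod=90d"          # 90-day retention
      - "-storageDataPath=/storage"
    ports:
      - "127.0.0.1:8428:8428"           # localhost only
    volumes:
      - vm-data:/storage
    mem_limit: 1g

volumes:
  vm-data:
```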

Grafana

  • Image: grafana/grafana:latest
  • Port: 3000 (localhost only, behind Caddy + Authelia)
  • URL: https://papa.cronova.dev
  • Plugin: victoriametrics-metrics-datasource
  • Dashboards provisioned via grafana/provisioning/dashboards/json/
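
Datasources can be provisioned alongside the dashboards. A sketch of a datasource provisioning file for the plugin above (file name, datasource name, and URL are assumptions):

```yaml
# grafana/provisioning/datasources/victoriametrics.yml (assumed path)
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: victoriametrics-metrics-datasource   # plugin listed above
    access: proxy
    url: http://victoriametrics:8428           # container-to-container, not the localhost binding
    isDefault: true
```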

Grafana Dashboards

| Dashboard | Grafana ID | Source | What it shows |
|---|---|---|---|
| Node Exporter Full | 1860 | node-docker-vm, node-nas | Host CPU, RAM, disk, network |
| VictoriaMetrics Single | 10229 | victoriametrics | TSDB health, ingestion rate, storage |
| cAdvisor Docker | 19792 | cadvisor | Per-container CPU, memory, network |
| vmagent | 12683 | vmagent | Scrape stats, target health, remote write |
| Grafana Internals | 3590 | grafana | API response times, sessions, memory |
| Homelab Overview | — (custom) | all targets | Host health, containers, network, monitoring health |

Uptime Kuma Monitors

Uptime Kuma runs on the VPS and monitors all services over the Tailscale mesh. All 35 monitors are managed via scripts/setup-uptime-kuma.py (the single source of truth). Alerts route to ntfy topics by priority tier.

Critical (60s interval, ntfy urgent)

| Monitor | Type | Target |
|---|---|---|
| Headscale | HTTP | https://hs.cronova.dev/health |
| Vaultwarden | HTTP | https://vault.cronova.dev/alive |
| Pi-hole DNS | TCP | 100.68.63.168:53 |
| Caddy (Docker VM) | HTTP | https://cronova.dev |
| OPNsense Gateway | Ping | 192.168.0.1 |
| Uptime Kuma | HTTP | https://status.cronova.dev |
| ntfy | HTTP | https://notify.cronova.dev |
| Caddy (VPS) | TCP | 100.77.172.46:443 |
| VPS Pi-hole | TCP | 127.0.0.1:53 |
| cronova.dev | HTTP | https://cronova.dev (900s interval) |

Warning (60-300s interval, ntfy high)

| Monitor | Type | Target |
|---|---|---|
| Home Assistant (Jara) | HTTP | https://jara.cronova.dev (300s) |
| Frigate (Taguato) | HTTP | https://taguato.cronova.dev/api/version (60s) |
| Forgejo | HTTP | http://100.82.77.97:3000 (60s) |
| NAS Samba | TCP | 100.82.77.97:445 (300s) |
| Restic REST | Keyword | http://100.82.77.97:8000 (keyword: "Unauthorized", expect 401) |
| Coolify (Tajy) | HTTP | https://tajy.cronova.dev (300s) |
| Authelia (Okẽ) | HTTP | https://auth.cronova.dev (300s) |
| Javya | HTTP | https://javya.cronova.dev (60s) |
| Javya API | HTTP | https://javya-api.cronova.dev/health (60s) |
| NAS | Ping | 100.82.77.97 (60s) |
| Docker VM | Ping | 100.68.63.168 (300s) |
| Watchtower | Ping | 100.68.63.168 (60s) |

Info (300-900s interval, ntfy default)

| Monitor | Type | Target |
|---|---|---|
| Jellyfin (Yrasema) | HTTP | https://yrasema.cronova.dev/health |
| Grafana (Papa) | HTTP | https://papa.cronova.dev |
| Immich (Vera) | HTTP | https://vera.cronova.dev |
| Syncthing | HTTP | http://100.82.77.97:8384/rest/noauth/health |
| Glances | Keyword | http://100.82.77.97:61208/api/4/cpu (keyword: "total") |
| Pi-hole Fixed | TCP | 100.68.63.168:53 (300s) |
| DNS - cronova.dev | DNS | cronova.dev via 1.1.1.1 |
| Beryl AX | Ping | 100.102.244.131 (120s, may be offline) |
| Beryl AX - Admin | TCP | 100.102.244.131:80 (120s) |
| hermosilla.me | HTTP | https://hermosilla.me/ (900s) |

ntfy Notification Architecture

URL: https://notify.cronova.dev (VPS, Caddy reverse proxy)

Topics

| Topic | Purpose | Priority |
|---|---|---|
| cronova-critical | Service down, data loss risk | Urgent (wakes phone) |
| cronova-warning | Degraded performance | High |
| cronova-info | Backups completed, maintenance | Default (silent) |
| cronova-test | Testing notifications | Low |

Auth

  • Anonymous access: deny-all
  • Service tokens for automation (backup sidecars, scripts)
  • User augusto has full read/write on all topics
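
The deny-all default maps to a handful of keys in ntfy's server config. A sketch, assuming the default /etc/ntfy/server.yml location (user, token, and topic grants are created with the ntfy CLI, not in this file):

```yaml
# /etc/ntfy/server.yml (sketch; paths are the ntfy defaults, not confirmed here)
base-url: https://notify.cronova.dev
auth-file: /var/lib/ntfy/user.db
auth-default-access: "deny-all"     # anonymous access denied, as noted above
```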

Integration Points

| Source | Topic | Trigger |
|---|---|---|
| Uptime Kuma | cronova-critical / cronova-warning | Service down/degraded |
| Backup sidecars | cronova-critical / cronova-info | Backup failure / success |
| scripts/backup-notify.sh | Per-service routing | Backup event notifications |

Subscribe on Phone

Android/iOS ntfy app → Subscribe to:
  https://notify.cronova.dev/cronova-critical
  https://notify.cronova.dev/cronova-warning

Container Log Monitoring — Dozzle (Ysyry)

  • URL: https://ysyry.cronova.dev (Caddy + Authelia)
  • Real-time Docker log viewer for all containers on Docker VM
  • No persistent storage — live view only
  • Useful for debugging container startup issues, watching Frigate detections, checking backup logs
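
A compose sketch of how a Dozzle deployment like this is typically wired up (image tag and port mapping are assumptions; only the read-only Docker socket mount is essential):

```yaml
services:
  dozzle:
    image: amir20/dozzle:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro   # logs streamed from the socket, never stored
    ports:
      - "127.0.0.1:8081:8080"        # assumed host port; users reach it through Caddy + Authelia
```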

Auto-Update Monitoring — Watchtower

  • Schedule: Sunday 4:00 AM (label-enabled, opt-in via com.centurylinklabs.watchtower.enable=true)
  • Image: nicholas-fedor/watchtower:1.14.2 (maintained fork; the official containrrr image is unmaintained and incompatible with Docker 29+)
  • Behavior: Rolling restarts, old image cleanup
  • Excluded from auto-update (manual only): vaultwarden, frigate, headscale
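
A compose sketch matching the behavior above (environment variable names are standard Watchtower options; exact values in the repo may differ):

```yaml
services:
  watchtower:
    image: nicholas-fedor/watchtower:1.14.2
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      WATCHTOWER_SCHEDULE: "0 0 4 * * 0"   # Sunday 04:00 (6-field cron: sec min hour dom mon dow)
      WATCHTOWER_LABEL_ENABLE: "true"      # opt-in only: update containers carrying the enable label
      WATCHTOWER_CLEANUP: "true"           # remove superseded images
      WATCHTOWER_ROLLING_RESTART: "true"   # restart containers one at a time

  grafana:                                 # example of an opted-in service
    image: grafana/grafana:latest
    labels:
      com.centurylinklabs.watchtower.enable: "true"
```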

Home Assistant Integrations

| Integration | Source | What It Monitors |
|---|---|---|
| System Monitor | Docker VM | CPU, RAM, disk usage |
| Glances | NAS (100.82.77.97:61208) | NAS system metrics |
| Proxmox VE (HACS) | Oga (100.78.12.241) | Host and VM status |
| Frigate | MQTT (mqtt-net) | Camera events, detection counts |

Monitoring Checklist

Weekly

  • [ ] Check Uptime Kuma dashboard — all monitors green
  • [ ] Review ntfy alert history — any unexpected alerts
  • [ ] Spot-check Dozzle for container error logs

Monthly (1st Sunday)

  • [ ] Spot-check NAS restic repos for snapshot freshness (until the alerting plan — docs/plans/backup-success-alerting-2026-04-22.md — lands and this becomes automatic)
  • [ ] Check Grafana dashboards — disk usage trends, RAM pressure
  • [ ] Verify vmagent scrape targets are all up (/targets endpoint)
  • [ ] Review Watchtower update logs
  • [ ] Check NAS Purple 2TB usage (97% — monitor closely)

Quarterly

  • [ ] Manual backup restore drill per docs/guides/backup-test-procedure.md (pending Task #18 to automate)
  • [ ] Review and update Uptime Kuma monitors for new/removed services
  • [ ] Test ntfy notification delivery (all priority levels)
  • [ ] Review VictoriaMetrics retention and disk usage

References