
# GPU Monitoring Setup — 2026-03-13

## Components

| Component | Host | Port | Notes |
|---|---|---|---|
| nvidia_gpu_exporter | Grace VM (192.168.20.142) | 9835 | Reads nvidia-smi, exposes Prometheus metrics |
| Prometheus | LXC VMID 119 (192.168.20.119) | 9090 | Scrapes GPU exporter on a 5m interval |
| Grafana | LXC VMID 120 (192.168.20.120) | 3000 | Visualizes GPU metrics |
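The scrape job from the table can be sketched in `prometheus.yml`. The job name is illustrative; the target, interval, and timeout follow the table and the nvidia-smi latency note in the exporter section:

```yaml
# /etc/prometheus/prometheus.yml (fragment) — job_name is illustrative
scrape_configs:
  - job_name: "nvidia_gpu"
    scrape_interval: 5m   # nvidia-smi is slow under CUDA MPS (~3 min per query)
    scrape_timeout: 4m    # must stay below the interval
    static_configs:
      - targets: ["192.168.20.142:9835"]   # exporter on the Grace VM
```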

## Why the exporter is on the Grace VM, not the Proxmox host

The GPUs are bound to vfio-pci for passthrough to VM 108 (Grace). The Proxmox host has no NVIDIA driver loaded — VFIO takes exclusive control of the device. nvidia-smi only works inside the VM where the driver is loaded. The exporter must therefore run inside the Grace VM.
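For reference, a minimal sketch of the passthrough wiring. The PCI address and vendor:device IDs below are placeholders, not the actual values on this host:

```
# /etc/pve/qemu-server/108.conf (fragment) — 0000:01:00 is a placeholder address
hostpci0: 0000:01:00,pcie=1

# /etc/modprobe.d/vfio.conf — IDs are placeholders; use the host's actual GPU IDs
options vfio-pci ids=10de:xxxx,10de:xxxx
```

With vfio-pci claiming the device at boot, the host never loads the NVIDIA driver, which is why nvidia-smi (and hence the exporter) only works inside the VM.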

Sensor limitations: Fan speed and power draw show ERR! in the VM because the GPU management interface (i2c/SMBus) is not forwarded through VFIO passthrough. Temperature and utilization readings are accurate. This is a known VFIO limitation for consumer GPUs.

## nvidia_gpu_exporter service

- File: `/etc/systemd/system/nvidia-gpu-exporter.service`
- Binary: `/usr/local/bin/nvidia_gpu_exporter` v1.4.1

Known issue: nvidia-smi on Pascal GPUs takes ~3 minutes per query while CUDA MPS is active. The Prometheus scrape timeout is set to 4m and the interval to 5m to accommodate this.
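A minimal sketch of the unit file above, assuming the exporter's defaults (listen on :9835) are acceptable; verify options against the installed v1.4.1 binary:

```ini
# /etc/systemd/system/nvidia-gpu-exporter.service — a sketch; adjust to taste
[Unit]
Description=NVIDIA GPU Prometheus exporter
After=network-online.target

[Service]
# Defaults assumed; the invoking user needs access to nvidia-smi / /dev/nvidia*
ExecStart=/usr/local/bin/nvidia_gpu_exporter
Restart=always

[Install]
WantedBy=multi-user.target
```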

## Grafana setup

Import dashboard ID 14574 from the public Grafana dashboards catalog for NVIDIA GPU panels. Datasource: Prometheus at http://192.168.20.119:9090, configured as the default.
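If the datasource is provisioned from a file rather than created in the UI, a sketch using Grafana's provisioning format (file path and datasource name are illustrative):

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml — name/path illustrative
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://192.168.20.119:9090
    isDefault: true   # matches the "configured as default" note above
```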