
# GPU Monitoring Setup — 2026-03-13

## Components

| Component | Host | Port | Notes |
|---|---|---|---|
| nvidia_gpu_exporter | Grace VM (192.168.20.142) | 9835 | Reads nvidia-smi, exposes Prometheus metrics |
| Prometheus | LXC VMID 119 (192.168.20.119) | 9090 | Scrapes GPU exporter on a 5m interval |
| Grafana | LXC VMID 120 (192.168.20.120) | 3000 | Visualizes GPU metrics |
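The scrape job from the table can be sketched in `prometheus.yml`. The job name is illustrative; the target, interval, and timeout follow the table and the nvidia-smi latency note in the exporter section:

```yaml
# /etc/prometheus/prometheus.yml (fragment) — job_name is illustrative
scrape_configs:
  - job_name: "nvidia_gpu"
    scrape_interval: 5m   # nvidia-smi is slow under CUDA MPS (~3 min per query)
    scrape_timeout: 4m    # must stay below the interval
    static_configs:
      - targets: ["192.168.20.142:9835"]   # exporter on the Grace VM
```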

## Why the exporter is on the Grace VM, not the Proxmox host

The GPUs are bound to vfio-pci for passthrough to VM 108 (Grace). The Proxmox host has no NVIDIA driver loaded — VFIO takes exclusive control of the device. nvidia-smi only works inside the VM where the driver is loaded. The exporter must therefore run inside the Grace VM.
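For reference, a minimal sketch of the passthrough wiring. The PCI address and vendor:device IDs below are placeholders, not the actual values on this host:

```
# /etc/pve/qemu-server/108.conf (fragment) — 0000:01:00 is a placeholder address
hostpci0: 0000:01:00,pcie=1

# /etc/modprobe.d/vfio.conf — IDs are placeholders; use the host's actual GPU IDs
options vfio-pci ids=10de:xxxx,10de:xxxx
```

With vfio-pci claiming the device at boot, the host never loads the NVIDIA driver, which is why nvidia-smi (and hence the exporter) only works inside the VM.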

Sensor limitations: Fan speed and power draw show ERR! in the VM because the GPU management interface (i2c/SMBus) is not forwarded through VFIO passthrough. Temperature and utilization readings are accurate. This is a known VFIO limitation for consumer GPUs.

## nvidia_gpu_exporter service

- File: `/etc/systemd/system/nvidia-gpu-exporter.service`
- Binary: `/usr/local/bin/nvidia_gpu_exporter` v1.4.1

Known issue: nvidia-smi on Pascal GPUs takes ~3 minutes per query while CUDA MPS is active. The Prometheus scrape timeout is set to 4m and the interval to 5m to accommodate this.
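A minimal sketch of the unit file above, assuming the exporter's defaults (listen on :9835) are acceptable; verify options against the installed v1.4.1 binary:

```ini
# /etc/systemd/system/nvidia-gpu-exporter.service — a sketch; adjust to taste
[Unit]
Description=NVIDIA GPU Prometheus exporter
After=network-online.target

[Service]
# Defaults assumed; the invoking user needs access to nvidia-smi / /dev/nvidia*
ExecStart=/usr/local/bin/nvidia_gpu_exporter
Restart=always

[Install]
WantedBy=multi-user.target
```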

## Grafana setup

Import dashboard ID 14574 from the public Grafana dashboards catalog for NVIDIA GPU panels. Datasource: Prometheus at http://192.168.20.119:9090, configured as the default.
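If the datasource is provisioned from a file rather than created in the UI, a sketch using Grafana's provisioning format (file path and datasource name are illustrative):

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml — name/path illustrative
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://192.168.20.119:9090
    isDefault: true   # matches the "configured as default" note above
```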