From d997a3df8bc31221fba5ca8f1828c0c65e6dd0e5 Mon Sep 17 00:00:00 2001
From: Grace
Date: Fri, 13 Mar 2026 15:27:27 -0700
Subject: [PATCH] docs: add GPU monitoring setup and two-container architecture plan

Co-Authored-By: Claude Sonnet 4.6
---
 docs/gpu-monitoring-setup-2026-03-13.md | 28 +++++++++++++++++++++++++
 1 file changed, 28 insertions(+)
 create mode 100644 docs/gpu-monitoring-setup-2026-03-13.md

diff --git a/docs/gpu-monitoring-setup-2026-03-13.md b/docs/gpu-monitoring-setup-2026-03-13.md
new file mode 100644
index 0000000..7098831
--- /dev/null
+++ b/docs/gpu-monitoring-setup-2026-03-13.md
@@ -0,0 +1,28 @@
+# GPU Monitoring Setup — 2026-03-13
+
+## Components
+
+| Component | Host | Port | Notes |
+|---|---|---|---|
+| nvidia_gpu_exporter | Grace VM (192.168.20.142) | 9835 | Reads nvidia-smi, exposes Prometheus metrics |
+| Prometheus | LXC VMID 119 (192.168.20.119) | 9090 | Scrapes GPU exporter, 5m interval |
+| Grafana | LXC VMID 120 (192.168.20.120) | 3000 | Visualizes GPU metrics |
+
+## Why exporter is on Grace VM, not Proxmox host
+
+The GPUs are bound to vfio-pci for passthrough to VM 108 (Grace). The Proxmox host has no NVIDIA driver loaded — VFIO takes exclusive control of the device. nvidia-smi only works inside the VM where the driver is loaded. The exporter must therefore run inside the Grace VM.
+
+**Sensor limitations:** Fan speed and power draw show ERR! in the VM because the GPU management interface (i2c/SMBus) is not forwarded through VFIO passthrough. Temperature and utilization readings are accurate. This is a known VFIO limitation for consumer GPUs.
+
+## nvidia_gpu_exporter service
+
+File: /etc/systemd/system/nvidia-gpu-exporter.service
+Binary: /usr/local/bin/nvidia_gpu_exporter v1.4.1
+
+Known issue: nvidia-smi on Pascal GPUs with active CUDA MPS takes ~3 minutes per query.
+Prometheus scrape timeout set to 4m, interval 5m to accommodate.
+
+## Grafana setup
+
+Import dashboard ID 14574 from Grafana marketplace for NVIDIA GPU panels.
+Datasource: Prometheus at http://192.168.20.119:9090 (configured as default).
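
Note on the scrape timing described in the patch: the doc states a 4m timeout and 5m interval to absorb a roughly 3 minute nvidia-smi query, but does not show the Prometheus job itself. The sketch below is one way to sanity-check those numbers and render a matching `scrape_configs` entry. It is a minimal illustration, not the configuration actually deployed: the job name `nvidia_gpu` and all helper function names are hypothetical, while the target address, interval, and timeout come from the document.

```python
# Sketch only: the job name and helper names are hypothetical, not from the patch.
from datetime import timedelta

# Values taken from the document in the patch.
EXPORTER_TARGET = "192.168.20.142:9835"
SCRAPE_INTERVAL = timedelta(minutes=5)
SCRAPE_TIMEOUT = timedelta(minutes=4)
OBSERVED_QUERY_TIME = timedelta(minutes=3)  # approximate nvidia-smi latency


def check_timing(interval, timeout, query_time):
    """Return a list of problems with the scrape timing; empty means OK."""
    problems = []
    if timeout > interval:
        # Prometheus rejects a scrape_timeout larger than scrape_interval.
        problems.append("scrape_timeout exceeds scrape_interval")
    if query_time >= timeout:
        problems.append("nvidia-smi query would not finish before the timeout")
    return problems


def to_prom_duration(delta):
    """Format a whole-minute timedelta as a Prometheus duration such as '5m'."""
    minutes, seconds = divmod(int(delta.total_seconds()), 60)
    if seconds:
        raise ValueError("only whole minutes are supported in this sketch")
    return f"{minutes}m"


def render_scrape_job(job_name, target, interval, timeout):
    """Render one scrape_configs entry as YAML text."""
    return "\n".join([
        "scrape_configs:",
        f"  - job_name: {job_name}",
        f"    scrape_interval: {to_prom_duration(interval)}",
        f"    scrape_timeout: {to_prom_duration(timeout)}",
        "    static_configs:",
        f"      - targets: ['{target}']",
    ])


if __name__ == "__main__":
    assert not check_timing(SCRAPE_INTERVAL, SCRAPE_TIMEOUT, OBSERVED_QUERY_TIME)
    print(render_scrape_job("nvidia_gpu", EXPORTER_TARGET,
                            SCRAPE_INTERVAL, SCRAPE_TIMEOUT))
```

With the documented values the check passes, leaving about one minute of headroom between the observed query time and the timeout. If the nvidia-smi latency grows past 4m, both the timeout and the interval would need to be raised together, since the timeout cannot exceed the interval.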