docs: add GPU monitoring setup and two-container architecture plan
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docs/gpu-monitoring-setup-2026-03-13.md (new file, 28 lines added)

# GPU Monitoring Setup — 2026-03-13

## Components

| Component | Host | Port | Notes |
| nvidia_gpu_exporter | Grace VM (192.168.20.142) | 9835 | Reads nvidia-smi, exposes Prometheus metrics |
| Prometheus | LXC VMID 119 (192.168.20.119) | 9090 | Scrapes GPU exporter, 5m interval |
| Grafana | LXC VMID 120 (192.168.20.120) | 3000 | Visualizes GPU metrics |

## Why exporter is on Grace VM, not Proxmox host

The GPUs are bound to vfio-pci for passthrough to VM 108 (Grace). The Proxmox host has no NVIDIA driver loaded; VFIO takes exclusive control of the device. nvidia-smi only works inside the VM where the driver is loaded. The exporter must therefore run inside the Grace VM.
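A quick way to verify this on the Proxmox host is to check which kernel driver owns the GPU. The snippet below greps a sample of `lspci -nnk` output; the PCI address and device IDs are hypothetical, not captured from this host:

```shell
# Sample of what `lspci -nnk` prints for the GPU on the Proxmox host once
# vfio-pci owns it (illustrative lines, hypothetical device IDs).
sample='01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP102 [10de:1b06]
	Kernel driver in use: vfio-pci
	Kernel modules: nvidiafb, nouveau'

# "Kernel driver in use: vfio-pci" is the tell: the host driver stack
# (and therefore nvidia-smi) cannot see the card at all.
printf '%s\n' "$sample" | grep 'Kernel driver in use'
```

On the real host, drop the sample and run `lspci -nnk` directly; any driver other than vfio-pci on that line would mean passthrough is not in effect.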

**Sensor limitations:** Fan speed and power draw show ERR! in the VM because the GPU management interface (i2c/SMBus) is not forwarded through VFIO passthrough. Temperature and utilization readings are accurate. This is a known VFIO limitation for consumer GPUs.

## nvidia_gpu_exporter service

File: /etc/systemd/system/nvidia-gpu-exporter.service
Binary: /usr/local/bin/nvidia_gpu_exporter v1.4.1
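The doc records only the unit path and binary version, so as a minimal sketch, the unit might look like the following (the ExecStart flag, user, and restart policy are assumptions, not the deployed config):

```ini
# Sketch of /etc/systemd/system/nvidia-gpu-exporter.service.
# Assumed defaults; only the binary path and the port (from the
# Components table) come from this document.
[Unit]
Description=Prometheus exporter for NVIDIA GPU metrics (nvidia-smi based)
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/nvidia_gpu_exporter --web.listen-address=:9835
Restart=on-failure
# Needs access to nvidia-smi and the NVIDIA device nodes inside the VM.
User=root

[Install]
WantedBy=multi-user.target
```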

Known issue: with CUDA MPS active, nvidia-smi on Pascal GPUs takes ~3 minutes per query. The Prometheus scrape timeout is therefore set to 4m and the scrape interval to 5m.
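In prometheus.yml the scrape job described above would look roughly like this (the job name is an assumption; the target, interval, and timeout are from this doc). Prometheus requires scrape_timeout to not exceed scrape_interval, which 4m/5m satisfies:

```yaml
scrape_configs:
  - job_name: nvidia_gpu          # hypothetical name
    scrape_interval: 5m           # slow nvidia-smi under MPS (see above)
    scrape_timeout: 4m            # must not exceed scrape_interval
    static_configs:
      - targets: ['192.168.20.142:9835']   # exporter on the Grace VM
```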

## Grafana setup

Import dashboard ID 14574 from the Grafana dashboards catalog for NVIDIA GPU panels.
Datasource: Prometheus at http://192.168.20.119:9090 (configured as default).
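If the datasource is file-provisioned rather than created in the UI, it would look roughly like this (the file path and name are assumptions; the URL and default flag are from this doc):

```yaml
# Sketch: /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://192.168.20.119:9090
    isDefault: true
```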