feat: two-container GPU-isolated AI architecture #1

Open
opened 2026-03-13 15:27:53 -07:00 by Grace · 0 comments

Approved Architecture Plan

Overview

Two Docker containers on the Grace VM (192.168.20.142), replacing the current native processes.

Container 1 — Assistant LLM (GPU 0, GTX 1080 Ti)

  • Image: ghcr.io/ggerganov/llama.cpp:server-cuda
  • Model: Qwen3-8B-Q4_K_M.gguf (bind mount from /home/grace/models/)
  • Port: 8000 (no OpenClaw config change required)
  • Flags: --n-gpu-layers 99 --ctx-size 32768 --flash-attn --reasoning-format deepseek
  • Replaces: native llama-server process
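The Container 1 settings above could be captured in docker-compose.yml roughly as follows. This is a sketch: the service and mount names are assumptions, while the image, port, model path, and flags come from the plan.

```yaml
# Hypothetical compose service for the assistant LLM (names are assumptions).
services:
  assistant-llm:
    image: ghcr.io/ggerganov/llama.cpp:server-cuda
    ports:
      - "8000:8000"
    volumes:
      - /home/grace/models:/models:ro   # bind mount for the GGUF model
    command: >
      -m /models/Qwen3-8B-Q4_K_M.gguf
      --host 0.0.0.0 --port 8000
      --n-gpu-layers 99 --ctx-size 32768
      --flash-attn --reasoning-format deepseek
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]        # GPU 0, GTX 1080 Ti
              capabilities: [gpu]
```

Pinning via `device_ids` requires nvidia-container-toolkit (listed under Prerequisites).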

Container 2 — Memory Engine (GPU 1, GTX 1080)

  • Image: ollama/ollama
  • Models: phi3:mini (fact extraction), nomic-embed-text (embeddings)
  • Port: 11434 (replaces the native Ollama service, which is currently CPU-only)
  • CUDA_VISIBLE_DEVICES=1
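A matching compose sketch for Container 2, again with assumed service/volume names; the image, port, and `CUDA_VISIBLE_DEVICES=1` pin come from the plan:

```yaml
# Hypothetical compose service for the memory engine (names are assumptions).
services:
  memory-engine:
    image: ollama/ollama
    ports:
      - "11434:11434"
    environment:
      - CUDA_VISIBLE_DEVICES=1         # pin Ollama to GPU 1 (GTX 1080)
    volumes:
      - ollama-data:/root/.ollama      # persist pulled models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all               # expose both GPUs; env var selects GPU 1
              capabilities: [gpu]
volumes:
  ollama-data:
```

The models would then be pulled into the running container, e.g. `docker exec memory-engine ollama pull phi3:mini` and likewise for nomic-embed-text.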

Vector Database

  • Qdrant already running on 192.168.20.82:6333
  • Collections present: grace_memories, memories
  • Data at /mnt/ai-storage/qdrant/storage
  • No deployment needed

Smart Deduplication Pipeline (Phase 2)

Implemented in openclaw-mem0 plugin (Node.js/TypeScript):

  • New fact → embed → Qdrant similarity search (threshold 0.85)
  • If similar found: call phi3:mini LLM to merge/deduplicate
  • If no similar: store directly (no LLM call)
  • Expected: 80-90% reduction in LLM calls
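The decision logic above can be sketched in TypeScript. `decideAction()` mirrors the threshold rule; `embed()` and `searchQdrant()` are hypothetical stand-ins for the plugin's real Ollama and Qdrant calls, not its actual API.

```typescript
// Sketch of the Phase 2 dedup flow (helper names are assumptions).

const SIMILARITY_THRESHOLD = 0.85; // from the plan

type Action = "merge" | "store";

// Pure decision step: only a hit at or above the threshold triggers an LLM call.
function decideAction(topScore: number | undefined): Action {
  return topScore !== undefined && topScore >= SIMILARITY_THRESHOLD
    ? "merge"   // near-duplicate exists -> phi3:mini merges/deduplicates
    : "store";  // nothing similar -> write directly, no LLM call
}

// Hypothetical stubs for the plugin's network calls.
async function embed(text: string): Promise<number[]> {
  return [0]; // would call nomic-embed-text via Ollama on :11434
}
async function searchQdrant(
  vector: number[],
  limit: number
): Promise<{ score: number }[]> {
  return []; // would query Qdrant at 192.168.20.82:6333
}

// End-to-end flow for one new fact.
async function addFact(fact: string): Promise<Action> {
  const vector = await embed(fact);
  const hits = await searchQdrant(vector, 1);
  return decideAction(hits[0]?.score);
}
```

Because most new facts have no near-duplicate, the `store` branch (no LLM call) dominates, which is where the expected 80-90% reduction comes from.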

Prerequisites

  • nvidia-container-toolkit installed on Grace VM
  • docker-compose.yml created
  • phi3:mini pulled into Container 2
  • open-webui re-pointed to Container 1 (port 8000)
  • openclaw-mem0 plugin's oss.llm.config.model changed from qwen3:1.7b to phi3:mini
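The last item would look roughly like this, assuming the plugin takes a JSON config with nested `oss.llm.config` keys; the exact schema and any URL key are assumptions:

```json
{
  "oss": {
    "llm": {
      "config": {
        "model": "phi3:mini"
      }
    }
  }
}
```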

GPU Monitoring

  • nvidia_gpu_exporter running on Grace VM port 9835
  • Prometheus LXC: VMID 119, 192.168.20.119:9090
  • Grafana LXC: VMID 120, 192.168.20.120:3000
  • Dashboard ID to import: 14574 (NVIDIA GPU Metrics)
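For Prometheus on the LXC to pick up the exporter, a scrape job along these lines would be added to prometheus.yml (the job name is an assumption; the target comes from the plan):

```yaml
# Hypothetical scrape job for nvidia_gpu_exporter on the Grace VM.
scrape_configs:
  - job_name: "nvidia-gpu"
    static_configs:
      - targets: ["192.168.20.142:9835"]
```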

Notes

  • LiteLLM removed from architecture — OpenClaw talks directly to llama-server, confirmed working
  • mem0_server.py (Python) is superseded by openclaw-mem0 Node.js plugin
  • Native Ollama service to be STOPPED after Container 2 is running
Reference: Grace/homelab-ai-agent#1