feat: two-container GPU-isolated AI architecture #1

Open
opened 2026-03-13 15:27:53 -07:00 by Grace · 0 comments

Approved Architecture Plan

Overview

Two Docker containers on the Grace VM (192.168.20.142), replacing the current native processes.

Container 1 — Assistant LLM (GPU 0, GTX 1080 Ti)

  • Image: ghcr.io/ggerganov/llama.cpp:server-cuda
  • Model: Qwen3-8B-Q4_K_M.gguf (bind mount from /home/grace/models/)
  • Port: 8000 (no OpenClaw config change required)
  • Flags: --n-gpu-layers 99 --ctx-size 32768 --flash-attn --reasoning-format deepseek
  • Replaces: native llama-server process
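The Container 1 settings above could be captured in docker-compose.yml roughly as follows. This is a sketch: the service and mount names are assumptions, while the image, port, model path, and flags come from the plan.

```yaml
# Hypothetical compose service for the assistant LLM (names are assumptions).
services:
  assistant-llm:
    image: ghcr.io/ggerganov/llama.cpp:server-cuda
    ports:
      - "8000:8000"
    volumes:
      - /home/grace/models:/models:ro   # bind mount for the GGUF model
    command: >
      -m /models/Qwen3-8B-Q4_K_M.gguf
      --host 0.0.0.0 --port 8000
      --n-gpu-layers 99 --ctx-size 32768
      --flash-attn --reasoning-format deepseek
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]        # GPU 0, GTX 1080 Ti
              capabilities: [gpu]
```

Pinning via `device_ids` requires nvidia-container-toolkit (listed under Prerequisites).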

Container 2 — Memory Engine (GPU 1, GTX 1080)

  • Image: ollama/ollama
  • Models: phi3:mini (fact extraction), nomic-embed-text (embeddings)
  • Port: 11434 (replaces the native Ollama service, which is currently CPU-only)
  • CUDA_VISIBLE_DEVICES=1
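A matching compose sketch for Container 2, again with assumed service/volume names; the image, port, and `CUDA_VISIBLE_DEVICES=1` pin come from the plan:

```yaml
# Hypothetical compose service for the memory engine (names are assumptions).
services:
  memory-engine:
    image: ollama/ollama
    ports:
      - "11434:11434"
    environment:
      - CUDA_VISIBLE_DEVICES=1         # pin Ollama to GPU 1 (GTX 1080)
    volumes:
      - ollama-data:/root/.ollama      # persist pulled models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all               # expose both GPUs; env var selects GPU 1
              capabilities: [gpu]
volumes:
  ollama-data:
```

The models would then be pulled into the running container, e.g. `docker exec memory-engine ollama pull phi3:mini` and likewise for nomic-embed-text.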

Vector Database

  • Qdrant already running on 192.168.20.82:6333
  • Collections present: grace_memories, memories
  • Data at /mnt/ai-storage/qdrant/storage
  • No deployment needed

Smart Deduplication Pipeline (Phase 2)

Implemented in openclaw-mem0 plugin (Node.js/TypeScript):

  • New fact → embed → Qdrant similarity search (threshold 0.85)
  • If similar found: call phi3:mini LLM to merge/deduplicate
  • If no similar: store directly (no LLM call)
  • Expected: 80-90% reduction in LLM calls
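The decision logic above can be sketched in TypeScript. `decideAction()` mirrors the threshold rule; `embed()` and `searchQdrant()` are hypothetical stand-ins for the plugin's real Ollama and Qdrant calls, not its actual API.

```typescript
// Sketch of the Phase 2 dedup flow (helper names are assumptions).

const SIMILARITY_THRESHOLD = 0.85; // from the plan

type Action = "merge" | "store";

// Pure decision step: only a hit at or above the threshold triggers an LLM call.
function decideAction(topScore: number | undefined): Action {
  return topScore !== undefined && topScore >= SIMILARITY_THRESHOLD
    ? "merge"   // near-duplicate exists -> phi3:mini merges/deduplicates
    : "store";  // nothing similar -> write directly, no LLM call
}

// Hypothetical stubs for the plugin's network calls.
async function embed(text: string): Promise<number[]> {
  return [0]; // would call nomic-embed-text via Ollama on :11434
}
async function searchQdrant(
  vector: number[],
  limit: number
): Promise<{ score: number }[]> {
  return []; // would query Qdrant at 192.168.20.82:6333
}

// End-to-end flow for one new fact.
async function addFact(fact: string): Promise<Action> {
  const vector = await embed(fact);
  const hits = await searchQdrant(vector, 1);
  return decideAction(hits[0]?.score);
}
```

Because most new facts have no near-duplicate, the `store` branch (no LLM call) dominates, which is where the expected 80-90% reduction comes from.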

Prerequisites

  • nvidia-container-toolkit installed on Grace VM
  • docker-compose.yml created
  • phi3:mini pulled into Container 2
  • open-webui re-pointed to Container 1 (port 8000)
  • openclaw-mem0 plugin's oss.llm.config.model changed from qwen3:1.7b to phi3:mini
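The last item would look roughly like this, assuming the plugin takes a JSON config with nested `oss.llm.config` keys; the exact schema and any URL key are assumptions:

```json
{
  "oss": {
    "llm": {
      "config": {
        "model": "phi3:mini"
      }
    }
  }
}
```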

GPU Monitoring

  • nvidia_gpu_exporter running on Grace VM port 9835
  • Prometheus LXC: VMID 119, 192.168.20.119:9090
  • Grafana LXC: VMID 120, 192.168.20.120:3000
  • Dashboard ID to import: 14574 (NVIDIA GPU Metrics)
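For Prometheus on the LXC to pick up the exporter, a scrape job along these lines would be added to prometheus.yml (the job name is an assumption; the target comes from the plan):

```yaml
# Hypothetical scrape job for nvidia_gpu_exporter on the Grace VM.
scrape_configs:
  - job_name: "nvidia-gpu"
    static_configs:
      - targets: ["192.168.20.142:9835"]
```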

Notes

  • LiteLLM removed from architecture — OpenClaw talks directly to llama-server, confirmed working
  • mem0_server.py (Python) is superseded by openclaw-mem0 Node.js plugin
  • Native Ollama service to be STOPPED after Container 2 is running
Reference: Grace/homelab-ai-agent#1