IGNY8 Phase 0: Self-Hosted AI Infrastructure (00F)
Status: Ready for Implementation
Version: 1.1
Priority: High (cost savings critical for unit economics)
Duration: 5-7 days
Dependencies: 00B (VPS provisioning) must be complete first
Source of Truth: Codebase at /data/app/igny8/
Cost: ~$200/month GPU rental + $0 software (open source)
1. Current State
Existing AI Integration
- External providers (verified from `IntegrationProvider` model): OpenAI (GPT-4, GPT-3.5), Anthropic (Claude), Runware (image gen)
- Storage: API keys stored in the `IntegrationProvider` model (table: `igny8_integration_providers`) with per-account overrides in `IntegrationSettings` (table: `igny8_integration_settings`); global defaults in `GlobalIntegrationSettings`
- Provider types in codebase: `ai`, `payment`, `email`, `storage` (from `PROVIDER_TYPE_CHOICES`)
- Existing provider_ids: `openai`, `runware`, `stripe`, `paypal`, `resend`
- Architecture: multi-provider AI engine with model selection capability
- Current AI functions: `auto_cluster`, `generate_ideas`, `generate_content`, `generate_images`, `generate_image_prompts`, `optimize_content`, `generate_site_structure`
- Async handling: Celery workers process long-running AI tasks
- Cost impact: external APIs constitute 15-30% of monthly operational costs
Problem
- External API costs scale linearly with subscriber growth
- No cost leverage at scale (pay-as-you-go model)
- API rate limits require careful orchestration
- Privacy concerns with offloading content generation
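The cost problem can be made concrete with a break-even sketch. The flat rental figure comes from this plan; the blended per-token API price is an illustrative assumption, not a measured IGNY8 number:

```python
# Break-even sketch: flat GPU rental vs. pay-as-you-go API pricing.
# GPU_RENTAL_PER_MONTH is this plan's Vast.ai estimate;
# API_COST_PER_1K_TOKENS is an assumed blended external price.

GPU_RENTAL_PER_MONTH = 200.00
API_COST_PER_1K_TOKENS = 0.03

def external_api_cost(tokens_per_month: float) -> float:
    """Pay-as-you-go cost scales linearly with usage."""
    return tokens_per_month / 1000 * API_COST_PER_1K_TOKENS

def break_even_tokens() -> float:
    """Monthly token volume above which self-hosting is cheaper."""
    return GPU_RENTAL_PER_MONTH / API_COST_PER_1K_TOKENS * 1000

# At these assumed prices, self-hosting wins above roughly 6.7M tokens/month.
print(f"break-even: {break_even_tokens():,.0f} tokens/month")
```

Below that volume the pay-as-you-go APIs are still cheaper, which is why the fallback chain keeps them in place rather than removing them.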
2. What to Build
Infrastructure Stack
┌─────────────────────────────────────────────────────────────┐
│ IGNY8 Backend (on VPS) │
│ - Send requests to LiteLLM proxy (localhost:8000) │
│ - Fallback to OpenAI/Anthropic if self-hosted unavailable │
└──────────────┬──────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ LiteLLM Proxy (on VPS, port 8000) │
│ - OpenAI-compatible API gateway │
│ - Routes requests to local Ollama and ComfyUI (via tunnel) │
│ - Load balancing & model selection │
│ - Fallback configuration for external APIs │
└──────────────┬───────────────────────────────────────────────┘
│
┌────────┴──────────────────┐
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────────┐
│ SSH Tunnel │ │ ComfyUI Tunnel │
│ (autossh) │ │ (autossh) │
│ Port 11434-11435 │ │ Port 8188 │
│ │ │ (image generation) │
└────────┬─────────┘ └──────────┬───────────┘
│ │
▼ ▼
┌────────────────────────────────────────────────────────┐
│ Vast.ai GPU Server (2x RTX 3090, 48GB VRAM) │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Ollama Container │ │
│ │ - Qwen3-32B (reasoning) │ │
│ │ - Qwen3-30B-A3B (MoE, efficient) │ │
│ │ - Qwen3-14B (general purpose) │ │
│ │ - Qwen3-8B (fast inference) │ │
│ │ Listening on 0.0.0.0:11434 │ │
│ └──────────────────────────────────────────────────┘ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ ComfyUI Container │ │
│ │ - FLUX.1 (image gen) │ │
│ │ - Stable Diffusion 3.5 (image gen) │ │
│ │ - SDXL-Lightning (fast generation) │ │
│ │ Listening on 0.0.0.0:8188 │ │
│ └──────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────┘
Components to Deploy
1. Vast.ai GPU Rental
- Machine: 2x NVIDIA RTX 3090 (48GB total VRAM)
- Estimated cost: $180-220/month
- Auto-bid setup for cost optimization
- Persistence: restore from snapshot between rentals
2. Ollama (Text LLM Server)
- Container-based deployment on GPU
- Models: Qwen3 series (32B, 30B-A3B, 14B, 8B)
- API: OpenAI-compatible `/v1/chat/completions`
- Port: 11434 (tunneled via SSH)
3. ComfyUI (Image Generation)
- Container-based deployment on GPU
- Models: FLUX.1, Stable Diffusion 3.5, SDXL-Lightning
- API: REST endpoints for image generation
- Port: 8188 (tunneled via SSH)
4. SSH Tunnel (autossh)
- Persistent connection from VPS to GPU server
- Systemd service with auto-restart
- Ports: 11434/11435 (Ollama), 8188 (ComfyUI)
- Handles network interruptions automatically
5. LiteLLM Proxy
- Runs on IGNY8 VPS
- Acts as OpenAI-compatible API gateway
- Configurable routing based on model/task type
- Fallback to OpenAI/Anthropic if self-hosted unavailable
- Port: 8000 (local access only)
6. IGNY8 Backend Integration
- Add self-hosted LiteLLM as a new `IntegrationProvider`
- Update AI request logic to check availability
- Implement fallback chain: self-hosted → OpenAI → Anthropic
- Cost tracking per provider
3. Data Models / APIs
Database Models (Minimal Schema Changes)
Use existing IntegrationProvider model — add a new row with provider_type='ai':
# New IntegrationProvider row (NO new provider_type needed)
# provider_type='ai' already exists in PROVIDER_TYPE_CHOICES
# Create via admin or migration:
IntegrationProvider.objects.create(
provider_id='self_hosted_ai',
display_name='Self-Hosted AI (LiteLLM)',
provider_type='ai',
api_key='', # LiteLLM doesn't require auth (internal)
api_endpoint='http://localhost:8000',
is_active=True,
is_sandbox=False,
config={
"priority": 10, # Try self-hosted first
"models": {
"text_generation": "qwen3:32b",
"text_generation_fast": "qwen3:8b",
"image_generation": "flux.1-dev",
"image_generation_fast": "sdxl-lightning"
},
"timeout": 300, # 5 minute timeout for slow models
"fallback_to": "openai" # Fallback provider if self-hosted fails
}
)
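The `models` map in the `config` JSON above can be resolved per task with a small helper; `resolve_model` below is a hypothetical illustration, not an existing IGNY8 function:

```python
# Hypothetical helper: pick the configured model for a task from the
# `config` JSON stored on the IntegrationProvider row above.
from typing import Optional

def resolve_model(config: dict, task: str, fast: bool = False) -> Optional[str]:
    """Return the model for a task, preferring the `<task>_fast` variant if asked."""
    models = config.get("models", {})
    if fast and models.get(f"{task}_fast"):
        return models[f"{task}_fast"]
    return models.get(task)

config = {
    "models": {
        "text_generation": "qwen3:32b",
        "text_generation_fast": "qwen3:8b",
        "image_generation": "flux.1-dev",
        "image_generation_fast": "sdxl-lightning",
    }
}

print(resolve_model(config, "text_generation"))             # qwen3:32b
print(resolve_model(config, "text_generation", fast=True))  # qwen3:8b
```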
LiteLLM API Endpoints
Text Generation (Compatible with OpenAI API)
# Request
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ollama/qwen3:32b",
"messages": [{"role": "user", "content": "Write an article about..."}],
"temperature": 0.7,
"max_tokens": 2000
}'
# Response (identical to OpenAI)
{
"id": "chatcmpl-...",
"object": "chat.completion",
"created": 1234567890,
"model": "ollama/qwen3:32b",
"choices": [{
"index": 0,
"message": {"role": "assistant", "content": "Article text..."},
"finish_reason": "stop"
}],
"usage": {
"prompt_tokens": 50,
"completion_tokens": 500,
"total_tokens": 550
}
}
Image Generation (ComfyUI via LiteLLM)
# Request
curl -X POST http://localhost:8000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "comfyui/flux.1-dev",
"prompt": "A professional product photo of...",
"size": "1024x1024",
"n": 1,
"quality": "hd"
}'
# Response
{
"created": 1234567890,
"data": [{
"url": "data:image/png;base64,...",
"revised_prompt": "A professional product photo of..."
}]
}
Model Routing Configuration
LiteLLM Config (see section 4.4)
- Routes `gpt-4` requests → `ollama/qwen3:32b`
- Routes `gpt-3.5-turbo` requests → `ollama/qwen3:14b`
- Routes DALL-E requests → `comfyui/flux.1-dev`
- Includes fallback to OpenAI for unavailable models
- Respects timeout and retry limits
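The routing rules amount to an alias lookup plus an ordered fallback chain. A minimal sketch (the mapping follows the config in section 4.4; the function name is illustrative):

```python
# Minimal sketch of alias -> backend routing with fallback chains,
# mirroring the LiteLLM config in section 4.4.

ROUTES = {
    "gpt-4": "ollama/qwen3:32b",
    "gpt-3.5-turbo": "ollama/qwen3:14b",
    "dall-e-3": "comfyui/flux.1-dev",
}
FALLBACKS = {
    "gpt-4": ["gpt-4-fallback", "gpt-3.5-turbo"],
    "gpt-3.5-turbo": ["gpt-3.5-turbo-fallback"],
    "dall-e-3": ["dall-e-2"],
}

def route(alias: str) -> list[str]:
    """Return the backend model plus fallback aliases, in try order."""
    chain = [ROUTES.get(alias, alias)]  # unknown aliases pass through unchanged
    chain.extend(FALLBACKS.get(alias, []))
    return chain

print(route("gpt-4"))  # ['ollama/qwen3:32b', 'gpt-4-fallback', 'gpt-3.5-turbo']
```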
4. Implementation Steps
Phase 1: GPU Infrastructure Setup (Days 1-2)
4.1 Vast.ai Account & GPU Rental
Step 1: Create Vast.ai Account
# Navigate to https://www.vast.ai
# Sign up with email
# Verify account via email
# Add payment method (credit card or crypto)
Step 2: Rent GPU Instance
Requirements:
- 2x NVIDIA RTX 3090 (or 1x RTX 4090) = 48GB+ VRAM
- Ubuntu 24.04 LTS (or later) base image
- Minimum bandwidth: 100 Mbps
- SSH port 22 open
Setup via Vast.ai dashboard:
- Go to "Browse" → Filter by:
- GPU: 2x RTX 3090 or RTX 4090
- Min VRAM: 48GB
- OS: Ubuntu 24.04 LTS (or later)
- Price: Sort by lowest $/hr
- Click "Rent" on selected instance
- Choose:
- Disk size: 500GB (includes models)
- Secure Cloud: No (to access port 22)
- Wait for machine to start (2-5 minutes)
- Record SSH credentials from dashboard
Step 3: Test SSH Access
# From your local machine
ssh root@<vast_ai_ip> -i ~/.ssh/vast_key
# Update system
apt update && apt upgrade -y
Step 4: Set Up Snapshot for Persistence
# After first-time setup, create snapshot in Vast.ai dashboard
# Future rentals: select snapshot to restore previous state
4.2 Vast.ai: Docker & Base Containers
Step 1: Install Docker
# SSH into Vast.ai machine
ssh root@<vast_ai_ip>
# Install Docker
curl https://get.docker.com -sSfL | sh
systemctl enable docker
systemctl start docker
# Verify
docker --version
Step 2: Set Up Storage
# Create persistent directory for models
mkdir -p /mnt/models
mkdir -p /mnt/ollama-cache
mkdir -p /mnt/comfyui-models
chmod 777 /mnt/*
# Create docker network for inter-container communication
docker network create ai-network
Step 3: Deploy Ollama Container
docker run -d \
--name ollama \
--network ai-network \
--gpus all \
-e OLLAMA_MODELS=/root/.ollama/models \
-v /mnt/ollama-cache:/root/.ollama \
-p 0.0.0.0:11434:11434 \
ollama/ollama:latest
Step 4: Pull Qwen3 Models
# Wait for ollama to be ready
sleep 10
# Pull models (will take 30-60 minutes depending on speed)
# Order by priority (largest first)
docker exec ollama ollama pull qwen3:32b # ~20GB
docker exec ollama ollama pull qwen3:30b-a3b # ~18GB
docker exec ollama ollama pull qwen3:14b # ~9GB
docker exec ollama ollama pull qwen3:8b # ~5GB
# Verify models are loaded
docker exec ollama ollama list
# Output should show all models with their sizes
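The same verification can be scripted against Ollama's `/api/tags` JSON (shape: `{"models": [{"name": ...}, ...]}`); this is a generic sketch, not existing IGNY8 code:

```python
# Sketch: check that all required Qwen3 models appear in the
# /api/tags response ({"models": [{"name": ...}, ...]}).

REQUIRED = {"qwen3:32b", "qwen3:30b-a3b", "qwen3:14b", "qwen3:8b"}

def missing_models(tags_json: dict) -> set:
    """Return the required models not yet reported by Ollama."""
    present = {m["name"] for m in tags_json.get("models", [])}
    return REQUIRED - present

# Example with two models still downloading:
sample = {"models": [{"name": "qwen3:32b"}, {"name": "qwen3:8b"}]}
print(missing_models(sample))  # the two 14b/30b-a3b models (set order varies)
```

An empty set means all four pulls completed; anything else should block moving on to Step 5.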
Step 5: Deploy ComfyUI Container
# Clone ComfyUI repository
cd /opt
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
# Use a Docker image with CUDA support
# (`comfyui-docker:latest` is a placeholder — build or pull your own image;
# there is no official ComfyUI image)
docker run -d \
--name comfyui \
--network ai-network \
--gpus all \
-e CUDA_VISIBLE_DEVICES=0,1 \
-v /mnt/comfyui-models:/ComfyUI/models \
-v /mnt/comfyui-output:/ComfyUI/output \
-p 0.0.0.0:8188:8188 \
comfyui-docker:latest
# Alternative: run from source inside a CUDA base image
docker run -d \
--name comfyui \
--network ai-network \
--gpus all \
-v /opt/ComfyUI:/ComfyUI \
-v /mnt/comfyui-models:/ComfyUI/models \
-v /mnt/comfyui-output:/ComfyUI/output \
-p 0.0.0.0:8188:8188 \
-w /ComfyUI \
nvidia/cuda:11.8.0-runtime-ubuntu22.04 \
bash -c "apt-get update && apt-get install -y python3 python3-pip && \
pip3 install -r requirements.txt && \
python3 main.py --listen 0.0.0.0 --port 8188"
Step 6: Download Image Generation Models
# Download models to ComfyUI
# FLUX.1 (recommended for quality)
# Note: FLUX.1-dev is a gated Hugging Face repo — accept the license and
# authenticate (e.g. wget --header="Authorization: Bearer $HF_TOKEN") first
cd /mnt/comfyui-models/checkpoints
wget -O flux1-dev.safetensors \
"https://huggingface.co/black-forest-labs/FLUX.1-dev/resolve/main/flux1-dev.safetensors"
# Stable Diffusion 3.5 (alternative; also gated)
wget -O sd3.5_large.safetensors \
"https://huggingface.co/stabilityai/stable-diffusion-3.5-large/resolve/main/sd3.5_large.safetensors"
# SDXL-Lightning (fast, lower quality but acceptable)
wget -O sdxl_lightning_4step.safetensors \
"https://huggingface.co/ByteDance/SDXL-Lightning/resolve/main/sdxl_lightning_4step.safetensors"
# VAE (for all models)
cd /mnt/comfyui-models/vae
wget -O ae.safetensors \
"https://huggingface.co/black-forest-labs/FLUX.1-dev/resolve/main/ae.safetensors"
Step 7: Verify Services
# Check Ollama API
curl http://localhost:11434/api/tags
# Should return: {"models": [{"name": "qwen3:32b", "size": ...}, ...]}
# Check ComfyUI
curl http://localhost:8188/system_stats
# Should return GPU/memory stats
Phase 2: VPS Tunnel & LiteLLM Setup (Days 2-3)
4.3 IGNY8 VPS: SSH Tunnel Configuration
Prerequisites: VPS must be provisioned (see 00B)
VPS Environment:
- Ubuntu 24.04 LTS
- Docker 29.x
- GPU server is deployed separately on Vast.ai (not on the VPS)
- The VPS maintains SSH tunnels to the Vast.ai GPU server for accessing Ollama and ComfyUI
DNS Note: During initial setup, before DNS flip, the IGNY8 backend connects to the LiteLLM proxy via localhost:8000 within the VPS environment. This uses the internal Docker network and local port forwarding, so external DNS configuration does not affect this connection. DNS considerations only apply to external client connections to the IGNY8 API.
Step 1: Generate SSH Key Pair
# On VPS
ssh-keygen -t rsa -b 4096 -f /root/.ssh/vast_ai -N ""
# Copy the public key to the Vast.ai machine: paste /root/.ssh/vast_ai.pub
# into the Vast.ai dashboard, or run ssh-copy-id from a machine with access
ssh-copy-id -i /root/.ssh/vast_ai.pub root@<vast_ai_ip>
Step 2: Install & Configure autossh
# On VPS
apt install autossh -y
# Create dedicated user for tunnel
useradd -m -s /bin/bash tunnel-user
mkdir -p /home/tunnel-user/.ssh
cp /root/.ssh/vast_ai* /home/tunnel-user/.ssh/
chown -R tunnel-user:tunnel-user /home/tunnel-user/.ssh
chmod 600 /home/tunnel-user/.ssh/vast_ai
Step 3: Create autossh Systemd Service
File: /etc/systemd/system/tunnel-vast-ai.service
[Unit]
Description=SSH Tunnel to Vast.ai GPU Server
After=network.target
Wants=network-online.target
[Service]
Type=simple
User=tunnel-user
ExecStart=/usr/bin/autossh \
-M 20000 \
-N \
-o "ServerAliveInterval=30" \
-o "ServerAliveCountMax=3" \
-o "ExitOnForwardFailure=no" \
-o "ConnectTimeout=10" \
-o "StrictHostKeyChecking=accept-new" \
-i /home/tunnel-user/.ssh/vast_ai \
-L 11434:localhost:11434 \
-L 11435:localhost:11435 \
-L 8188:localhost:8188 \
root@<vast_ai_ip>
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
Step 4: Start Tunnel Service
# Reload systemd
systemctl daemon-reload
# Start service
systemctl start tunnel-vast-ai
# Enable on boot
systemctl enable tunnel-vast-ai
# Verify tunnel is up
systemctl status tunnel-vast-ai
# Check logs
journalctl -u tunnel-vast-ai -f
Step 5: Test Tunnel Connectivity
# On VPS, verify ports are open
netstat -tlnp | grep -E '(11434|8188)'
# Should show: 127.0.0.1:11434 LISTEN
# 127.0.0.1:8188 LISTEN
# Test Ollama through tunnel
curl http://localhost:11434/api/tags
# Should return model list from remote Vast.ai machine
# Test ComfyUI through tunnel
curl http://localhost:8188/system_stats
# Should return GPU stats
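Tunnel liveness can also be probed from Python with a plain TCP connect, which is useful inside health checks before issuing a real request. A generic sketch, not existing IGNY8 code:

```python
# Sketch: check whether a forwarded tunnel port accepts TCP connections.
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Probe the Ollama and ComfyUI tunnel endpoints on the VPS loopback:
for port in (11434, 8188):
    status = "up" if port_open("127.0.0.1", port) else "DOWN"
    print(f"tunnel port {port}: {status}")
```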
4.4 LiteLLM Installation & Configuration
Step 1: Install LiteLLM
# On VPS
pip install litellm fastapi uvicorn python-dotenv requests
# Verify installation
python -c "import litellm; print(litellm.__version__)"
Step 2: Create LiteLLM Configuration
File: /opt/litellm/config.yaml
# LiteLLM Configuration for IGNY8
model_list:
# Text Generation Models (Ollama via SSH tunnel)
- model_name: gpt-4
litellm_params:
model: ollama/qwen3:32b
api_base: http://localhost:11434
timeout: 300
max_tokens: 8000
- model_name: gpt-4-turbo
litellm_params:
model: ollama/qwen3:30b-a3b
api_base: http://localhost:11434
timeout: 300
max_tokens: 8000
- model_name: gpt-3.5-turbo
litellm_params:
model: ollama/qwen3:14b
api_base: http://localhost:11434
timeout: 180
max_tokens: 4000
- model_name: gpt-3.5-turbo-fast
litellm_params:
model: ollama/qwen3:8b
api_base: http://localhost:11434
timeout: 120
max_tokens: 2048
# Fallback to OpenAI for redundancy
- model_name: gpt-4-fallback
litellm_params:
model: gpt-4
api_key: os.environ/OPENAI_API_KEY
timeout: 60
- model_name: gpt-3.5-turbo-fallback
litellm_params:
model: gpt-3.5-turbo
api_key: os.environ/OPENAI_API_KEY
timeout: 60
# Image Generation (ComfyUI via tunnel)
- model_name: dall-e-3
litellm_params:
model: comfyui/flux.1-dev
api_base: http://localhost:8188
timeout: 120
- model_name: dall-e-2
litellm_params:
model: comfyui/sdxl-lightning
api_base: http://localhost:8188
timeout: 60
# Router configuration for model selection
router_settings:
routing_strategy: "simple-shuffle" # Load balancing
allowed_model_region: null
# Logging, caching and fallback configuration
litellm_settings:
  set_verbose: true
  cache: true
  # Fallbacks: if a self-hosted model errors, retry with these aliases in order
  fallbacks:
    - {"gpt-4": ["gpt-4-fallback", "gpt-3.5-turbo"]}
    - {"gpt-3.5-turbo": ["gpt-3.5-turbo-fallback"]}
    - {"dall-e-3": ["dall-e-2"]}
Step 3: Create Environment File
File: /opt/litellm/.env
# OpenAI (for fallback)
OPENAI_API_KEY=sk-your-key-here
# Anthropic (optional fallback)
ANTHROPIC_API_KEY=sk-ant-your-key-here
# LiteLLM settings
LITELLM_LOG_LEVEL=INFO
LITELLM_CACHE=true
# Service settings
PORT=8000
HOST=127.0.0.1
Step 4: Create LiteLLM Startup Script
File: /opt/litellm/start.sh
#!/bin/bash
set -e
cd /opt/litellm
# Load environment variables
source .env
# Start LiteLLM proxy (the `litellm` CLI reads the config file)
litellm \
--config config.yaml \
--host 127.0.0.1 \
--port 8000 \
--num_workers 4
chmod +x /opt/litellm/start.sh
Step 5: Create Systemd Service for LiteLLM
File: /etc/systemd/system/litellm.service
[Unit]
Description=LiteLLM AI Proxy Gateway
After=network.target tunnel-vast-ai.service
Wants=tunnel-vast-ai.service
[Service]
Type=simple
User=root
WorkingDirectory=/opt/litellm
ExecStart=/opt/litellm/start.sh
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/local/python/bin"
[Install]
WantedBy=multi-user.target
Step 6: Start LiteLLM Service
systemctl daemon-reload
systemctl start litellm
systemctl enable litellm
# Verify service
systemctl status litellm
# Check logs
journalctl -u litellm -f
Step 7: Test LiteLLM API
# Test text generation with self-hosted model
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer test-key" \
-d '{
"model": "gpt-4",
"messages": [
{"role": "user", "content": "Say hello in one sentence"}
],
"temperature": 0.7,
"max_tokens": 100
}'
# Should return a response from Qwen3:32B
# Test fallback (disconnect tunnel first to test fallback logic)
# curl should eventually fall back to OpenAI after timeout
Phase 3: IGNY8 Backend Integration (Days 3-4)
4.5 Add Self-Hosted Provider to IGNY8
Step 1: Update GlobalIntegrationSettings Model
File: backend/models/integration.py
# Add to IntegrationProvider enum
class IntegrationProvider(models.TextChoices):
OPENAI = "openai", "OpenAI"
ANTHROPIC = "anthropic", "Anthropic"
RUNWARE = "runware", "Runware"
BRIA = "bria", "Bria"
SELF_HOSTED = "self_hosted_ai", "Self-Hosted AI (LiteLLM)" # NEW
# Example settings structure
SELF_HOSTED_SETTINGS = {
"provider": "self_hosted_ai",
"name": "Self-Hosted AI (LiteLLM)",
"base_url": "http://localhost:8000",
"api_key": "not_required",
"enabled": True,
"priority": 10, # Try first
"models": {
"text_generation": "gpt-4", # Maps to qwen3:32b
"text_generation_fast": "gpt-3.5-turbo", # Maps to qwen3:8b
"image_generation": "dall-e-3", # Maps to flux.1-dev
"image_generation_fast": "dall-e-2" # Maps to sdxl-lightning
},
"timeout": 300,
"fallback_to": "openai"
}
Step 2: Add Self-Hosted Settings to Database
File: backend/management/commands/init_integrations.py
from backend.models.integration import GlobalIntegrationSettings, IntegrationProvider
def add_self_hosted_integration():
"""Initialize self-hosted AI integration"""
self_hosted_config = {
"provider": IntegrationProvider.SELF_HOSTED,
"name": "Self-Hosted AI (LiteLLM)",
"base_url": "http://localhost:8000",
"api_key": "", # Not required for local proxy
"enabled": True,
"priority": 10, # Higher priority = try first
"models": {
"text_generation": "gpt-4",
"text_generation_fast": "gpt-3.5-turbo",
"image_generation": "dall-e-3",
"image_generation_fast": "dall-e-2"
},
"timeout": 300,
"max_retries": 2,
"fallback_provider": IntegrationProvider.OPENAI
}
integration, created = GlobalIntegrationSettings.objects.update_or_create(
provider=IntegrationProvider.SELF_HOSTED,
defaults=self_hosted_config
)
if created:
print(f"✓ Created {IntegrationProvider.SELF_HOSTED} integration")
else:
print(f"✓ Updated {IntegrationProvider.SELF_HOSTED} integration")
# Run in management command initialization
Step 3: Update AI Request Router
File: backend/services/ai_engine.py
import requests
import logging
from typing import Optional, List, Dict, Any
from backend.models.integration import GlobalIntegrationSettings, IntegrationProvider
logger = logging.getLogger(__name__)
class AIEngineRouter:
"""Routes AI requests to appropriate provider with fallback chain"""
PROVIDER_PRIORITY = {
IntegrationProvider.SELF_HOSTED: 10, # Try first
IntegrationProvider.OPENAI: 5,
IntegrationProvider.ANTHROPIC: 4,
}
def __init__(self):
self.providers = self._load_providers()
def _load_providers(self) -> List[Dict[str, Any]]:
"""Load enabled providers from database"""
configs = GlobalIntegrationSettings.objects.filter(
enabled=True
).values()
# Sort by priority (highest first)
sorted_configs = sorted(
configs,
key=lambda x: self.PROVIDER_PRIORITY.get(x['provider'], 0),
reverse=True
)
return sorted_configs
def generate_text(
self,
prompt: str,
model: str = "gpt-4",
max_tokens: int = 2000,
temperature: float = 0.7,
timeout: Optional[int] = None
) -> Dict[str, Any]:
"""Generate text using available provider with fallback"""
for provider_config in self.providers:
try:
result = self._call_provider(
provider_config,
"text",
prompt=prompt,
model=model,
max_tokens=max_tokens,
temperature=temperature,
timeout=timeout or provider_config.get('timeout', 300)
)
return {
"success": True,
"provider": provider_config['provider'],
"text": result['content'],
"usage": result.get('usage', {}),
"model": result.get('model', model)
}
except Exception as e:
logger.warning(
f"Provider {provider_config['provider']} failed: {str(e)}"
)
continue
# All providers failed
raise Exception("All AI providers exhausted. No response available.")
def generate_image(
self,
prompt: str,
model: str = "dall-e-3",
size: str = "1024x1024",
quality: str = "hd",
timeout: Optional[int] = None
) -> Dict[str, Any]:
"""Generate image using available provider with fallback"""
for provider_config in self.providers:
try:
result = self._call_provider(
provider_config,
"image",
prompt=prompt,
model=model,
size=size,
quality=quality,
timeout=timeout or provider_config.get('timeout', 120)
)
return {
"success": True,
"provider": provider_config['provider'],
"image_url": result['url'],
"revised_prompt": result.get('revised_prompt', prompt),
"model": result.get('model', model)
}
except Exception as e:
logger.warning(
f"Provider {provider_config['provider']} failed: {str(e)}"
)
continue
# All providers failed
raise Exception("All image generation providers exhausted.")
def _call_provider(
self,
provider_config: Dict[str, Any],
task_type: str, # "text" or "image"
**kwargs
) -> Dict[str, Any]:
"""Call specific provider based on type"""
provider = provider_config['provider']
if provider == IntegrationProvider.SELF_HOSTED:
return self._call_litellm(provider_config, task_type, **kwargs)
elif provider == IntegrationProvider.OPENAI:
return self._call_openai(provider_config, task_type, **kwargs)
elif provider == IntegrationProvider.ANTHROPIC:
return self._call_anthropic(provider_config, task_type, **kwargs)
else:
raise ValueError(f"Unknown provider: {provider}")
def _call_litellm(
self,
provider_config: Dict[str, Any],
task_type: str,
**kwargs
) -> Dict[str, Any]:
"""Call LiteLLM proxy on localhost"""
base_url = provider_config['base_url']
timeout = kwargs.pop('timeout', 300)
if task_type == "text":
# Chat completion endpoint
endpoint = f"{base_url}/v1/chat/completions"
payload = {
"model": kwargs.get('model', 'gpt-4'),
"messages": [
{"role": "user", "content": kwargs['prompt']}
],
"temperature": kwargs.get('temperature', 0.7),
"max_tokens": kwargs.get('max_tokens', 2000)
}
elif task_type == "image":
# Image generation endpoint
endpoint = f"{base_url}/v1/images/generations"
payload = {
"model": kwargs.get('model', 'dall-e-3'),
"prompt": kwargs['prompt'],
"size": kwargs.get('size', '1024x1024'),
"n": 1,
"quality": kwargs.get('quality', 'hd')
}
else:
raise ValueError(f"Unknown task type: {task_type}")
try:
response = requests.post(
endpoint,
json=payload,
timeout=timeout,
headers={"Authorization": "Bearer test"}
)
response.raise_for_status()
data = response.json()
if task_type == "text":
return {
"content": data['choices'][0]['message']['content'],
"usage": data.get('usage', {}),
"model": data.get('model', kwargs.get('model'))
}
else: # image
return {
"url": data['data'][0]['url'],
"revised_prompt": data['data'][0].get('revised_prompt'),
"model": kwargs.get('model')
}
except requests.exceptions.Timeout:
logger.error(f"LiteLLM timeout after {timeout}s")
raise
except requests.exceptions.ConnectionError:
logger.error("Cannot connect to LiteLLM proxy - tunnel may be down")
raise
except Exception as e:
logger.error(f"LiteLLM request failed: {str(e)}")
raise
def _call_openai(self, provider_config, task_type, **kwargs):
"""Existing OpenAI implementation"""
# Use existing OpenAI integration code
pass
def _call_anthropic(self, provider_config, task_type, **kwargs):
"""Existing Anthropic implementation"""
# Use existing Anthropic integration code
pass
# Initialize global instance
ai_router = AIEngineRouter()
Step 4: Update Content Generation Celery Tasks
File: backend/tasks/content_generation.py
from celery import shared_task
from backend.services.ai_engine import ai_router
import logging
logger = logging.getLogger(__name__)
@shared_task
def generate_article_content(user_id: int, article_id: int):
"""Generate article content using AI router (tries self-hosted first)"""
try:
# Get article from database
article = Article.objects.get(id=article_id, user_id=user_id)
# Generate content
result = ai_router.generate_text(
prompt=f"Write a detailed article about: {article.topic}",
model="gpt-4",
max_tokens=3000,
temperature=0.7
)
# Save result
article.content = result['text']
article.ai_provider = result['provider']
article.save()
logger.info(
f"Generated article {article_id} using {result['provider']}"
)
return {
"success": True,
"article_id": article_id,
"provider": result['provider']
}
except Exception as e:
logger.error(f"Article generation failed: {str(e)}")
raise
@shared_task
def generate_product_images(user_id: int, product_id: int):
"""Generate product images using AI router"""
try:
product = Product.objects.get(id=product_id, user_id=user_id)
# Try to generate with self-hosted first (faster)
result = ai_router.generate_image(
prompt=f"Professional product photo of: {product.description}",
model="dall-e-3",
size="1024x1024",
quality="hd"
)
product.image_url = result['image_url']
product.ai_provider = result['provider']
product.save()
logger.info(f"Generated image for product {product_id} using {result['provider']}")
return {
"success": True,
"product_id": product_id,
"provider": result['provider'],
"image_url": result['image_url']
}
except Exception as e:
logger.error(f"Image generation failed: {str(e)}")
raise
Step 5: Add AI Provider Tracking
File: backend/models/content.py
from django.db import models
from backend.models.integration import IntegrationProvider
class Article(models.Model):
# ... existing fields ...
# Track which AI provider generated content
ai_provider = models.CharField(
max_length=50,
choices=IntegrationProvider.choices,
default=IntegrationProvider.OPENAI,
help_text="Which AI provider generated this content"
)
ai_cost = models.DecimalField(
max_digits=10,
decimal_places=6,
default=0,
help_text="Cost to generate via AI provider"
)
ai_generation_time = models.DurationField(
null=True,
blank=True,
help_text="Time taken to generate content"
)
class Product(models.Model):
# ... existing fields ...
ai_provider = models.CharField(
max_length=50,
choices=IntegrationProvider.choices,
default=IntegrationProvider.OPENAI,
help_text="Which AI provider generated the image"
)
ai_image_cost = models.DecimalField(
max_digits=10,
decimal_places=6,
default=0,
help_text="Cost to generate image"
)
Phase 4: Monitoring & Fallback (Days 4-5)
4.6 Health Check & Failover System
Step 1: Create Health Check Service
File: backend/services/ai_health_check.py
import requests
import time
import logging
from typing import Dict, Any, Tuple
from datetime import datetime, timedelta
logger = logging.getLogger(__name__)
class AIHealthMonitor:
"""Monitor health of self-hosted AI infrastructure"""
OLLAMA_ENDPOINT = "http://localhost:11434/api/tags"
COMFYUI_ENDPOINT = "http://localhost:8188/system_stats"
LITELLM_ENDPOINT = "http://localhost:8000/health"
HEALTH_CHECK_INTERVAL = 60 # seconds
FAILURE_THRESHOLD = 3 # Mark unhealthy after 3 failures
def __init__(self):
self.last_check = None
self.failure_count = {
'ollama': 0,
'comfyui': 0,
'litellm': 0
}
self.is_healthy = {
'ollama': True,
'comfyui': True,
'litellm': True
}
def check_all(self) -> Dict[str, Any]:
"""Run all health checks"""
results = {
'timestamp': datetime.now().isoformat(),
'overall_healthy': True,
'services': {}
}
# Check Ollama
ollama_healthy = self._check_ollama()
results['services']['ollama'] = {
'healthy': ollama_healthy,
'endpoint': self.OLLAMA_ENDPOINT
}
if not ollama_healthy:
results['overall_healthy'] = False
# Check ComfyUI
comfyui_healthy = self._check_comfyui()
results['services']['comfyui'] = {
'healthy': comfyui_healthy,
'endpoint': self.COMFYUI_ENDPOINT
}
if not comfyui_healthy:
results['overall_healthy'] = False
# Check LiteLLM
litellm_healthy = self._check_litellm()
results['services']['litellm'] = {
'healthy': litellm_healthy,
'endpoint': self.LITELLM_ENDPOINT
}
if not litellm_healthy:
results['overall_healthy'] = False
self.last_check = results
# Log status change if needed
if self.is_healthy['ollama'] != ollama_healthy:
level = logging.WARNING if not ollama_healthy else logging.INFO
logger.log(level, f"Ollama service {'down' if not ollama_healthy else 'recovered'}")
if self.is_healthy['comfyui'] != comfyui_healthy:
level = logging.WARNING if not comfyui_healthy else logging.INFO
logger.log(level, f"ComfyUI service {'down' if not comfyui_healthy else 'recovered'}")
if self.is_healthy['litellm'] != litellm_healthy:
level = logging.WARNING if not litellm_healthy else logging.INFO
logger.log(level, f"LiteLLM service {'down' if not litellm_healthy else 'recovered'}")
# Update internal state
self.is_healthy['ollama'] = ollama_healthy
self.is_healthy['comfyui'] = comfyui_healthy
self.is_healthy['litellm'] = litellm_healthy
return results
def _check_ollama(self) -> bool:
"""Check if Ollama is responding"""
try:
response = requests.get(self.OLLAMA_ENDPOINT, timeout=5)
if response.status_code == 200:
self.failure_count['ollama'] = 0
return True
except Exception as e:
logger.debug(f"Ollama health check failed: {str(e)}")
self.failure_count['ollama'] += 1
return self.failure_count['ollama'] < self.FAILURE_THRESHOLD
def _check_comfyui(self) -> bool:
"""Check if ComfyUI is responding"""
try:
response = requests.get(self.COMFYUI_ENDPOINT, timeout=5)
if response.status_code == 200:
self.failure_count['comfyui'] = 0
return True
except Exception as e:
logger.debug(f"ComfyUI health check failed: {str(e)}")
self.failure_count['comfyui'] += 1
return self.failure_count['comfyui'] < self.FAILURE_THRESHOLD
def _check_litellm(self) -> bool:
"""Check if LiteLLM is responding"""
try:
response = requests.get(self.LITELLM_ENDPOINT, timeout=5)
if response.status_code == 200:
self.failure_count['litellm'] = 0
return True
except Exception as e:
logger.debug(f"LiteLLM health check failed: {str(e)}")
self.failure_count['litellm'] += 1
return self.failure_count['litellm'] < self.FAILURE_THRESHOLD
def is_self_hosted_available(self) -> bool:
"""Check if self-hosted AI is fully available"""
return all([
self.is_healthy['ollama'],
self.is_healthy['comfyui'],
self.is_healthy['litellm']
])
# Create global instance
health_monitor = AIHealthMonitor()
Step 2: Create Health Check Celery Task
File: backend/tasks/health_checks.py
from celery import shared_task
from backend.services.ai_health_check import health_monitor
from backend.models.monitoring import ServiceHealthLog
import logging
logger = logging.getLogger(__name__)
@shared_task
def check_ai_health():
"""Run AI infrastructure health checks every minute"""
results = health_monitor.check_all()
# Log to database
ServiceHealthLog.objects.create(
service='self_hosted_ai',
is_healthy=results['overall_healthy'],
details=results
)
# Alert if services are down
if not results['overall_healthy']:
down_services = [
service for service, status in results['services'].items()
if not status['healthy']
]
logger.error(
f"AI services down: {', '.join(down_services)}. "
f"Falling back to external APIs."
)
return results
# Add to celery beat schedule
CELERY_BEAT_SCHEDULE = {
'check-ai-health': {
'task': 'backend.tasks.health_checks.check_ai_health',
'schedule': 60.0, # Every 60 seconds
},
}
Step 3: Create Monitoring Model
File: backend/models/monitoring.py
from django.db import models


class ServiceHealthLog(models.Model):
    """Log of service health checks"""

    SERVICE_CHOICES = [
        ('self_hosted_ai', 'Self-Hosted AI'),
        ('tunnel', 'SSH Tunnel'),
        ('litellm', 'LiteLLM Proxy'),
    ]

    service = models.CharField(max_length=50, choices=SERVICE_CHOICES)
    is_healthy = models.BooleanField()
    details = models.JSONField(default=dict)
    checked_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        ordering = ['-checked_at']
        indexes = [
            models.Index(fields=['-checked_at']),
            models.Index(fields=['service', '-checked_at']),
        ]

    def __str__(self):
        status = "✓ Healthy" if self.is_healthy else "✗ Down"
        return f"{self.service} {status} @ {self.checked_at}"


class AIUsageLog(models.Model):
    """Track AI provider usage and costs"""

    PROVIDER_CHOICES = [
        ('self_hosted_ai', 'Self-Hosted AI'),
        ('openai', 'OpenAI'),
        ('anthropic', 'Anthropic'),
    ]
    TASK_TYPE_CHOICES = [
        ('text_generation', 'Text Generation'),
        ('image_generation', 'Image Generation'),
        ('keyword_research', 'Keyword Research'),
    ]

    user = models.ForeignKey('User', on_delete=models.CASCADE)
    provider = models.CharField(max_length=50, choices=PROVIDER_CHOICES)
    task_type = models.CharField(max_length=50, choices=TASK_TYPE_CHOICES)
    model_used = models.CharField(max_length=100)
    input_tokens = models.IntegerField(default=0)
    output_tokens = models.IntegerField(default=0)
    cost = models.DecimalField(max_digits=10, decimal_places=6, default=0)
    duration_ms = models.IntegerField()  # Milliseconds
    success = models.BooleanField(default=True)
    error_message = models.TextField(blank=True)
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        ordering = ['-created_at']
        indexes = [
            models.Index(fields=['user', '-created_at']),
            models.Index(fields=['provider', '-created_at']),
        ]

    def __str__(self):
        return f"{self.provider} - {self.task_type} - ${self.cost:.4f}"
Phase 5: Cost Tracking & Optimization (Days 5-6)
4.7 Cost Calculation & Dashboard
Step 1: Create Cost Calculator
File: backend/services/cost_calculator.py
from decimal import Decimal
from typing import Dict, Any


class AICostCalculator:
    """Calculate AI generation costs by provider"""

    # Self-hosted cost (Vast.ai GPU rental amortized)
    # $200/month ÷ 30 days ÷ 24 hours = $0.278/hour
    # Assuming 70% utilization = $0.1945/hour
    SELF_HOSTED_COST_PER_HOUR = Decimal('0.20')  # Conservative estimate

    # OpenAI pricing (as of 2026), per token
    OPENAI_PRICING = {
        'gpt-4': {
            'input': Decimal('0.00003'),    # $0.03 / 1K tokens
            'output': Decimal('0.00006'),   # $0.06 / 1K tokens
        },
        'gpt-3.5-turbo': {
            'input': Decimal('0.0000005'),  # $0.0005 / 1K tokens
            'output': Decimal('0.0000015'), # $0.0015 / 1K tokens
        },
        'dall-e-3': Decimal('0.04'),  # per image
    }

    # Anthropic pricing, per token
    ANTHROPIC_PRICING = {
        'claude-3-opus': {
            'input': Decimal('0.000015'),
            'output': Decimal('0.000075'),
        },
        'claude-3-sonnet': {
            'input': Decimal('0.000003'),
            'output': Decimal('0.000015'),
        },
    }

    @classmethod
    def calculate_text_generation_cost(
        cls,
        provider: str,
        model: str,
        input_tokens: int,
        output_tokens: int,
        duration_ms: int = 0
    ) -> Decimal:
        """Calculate cost for text generation"""
        if provider == 'self_hosted_ai':
            # Cost based on compute time (rough estimate)
            duration_hours = duration_ms / (1000 * 3600)
            return cls.SELF_HOSTED_COST_PER_HOUR * Decimal(str(duration_hours))
        elif provider == 'openai':
            pricing = cls.OPENAI_PRICING.get(model, {})
            input_cost = Decimal(input_tokens) * pricing.get('input', Decimal(0))
            output_cost = Decimal(output_tokens) * pricing.get('output', Decimal(0))
            return input_cost + output_cost
        elif provider == 'anthropic':
            pricing = cls.ANTHROPIC_PRICING.get(model, {})
            input_cost = Decimal(input_tokens) * pricing.get('input', Decimal(0))
            output_cost = Decimal(output_tokens) * pricing.get('output', Decimal(0))
            return input_cost + output_cost
        return Decimal(0)

    @classmethod
    def calculate_image_generation_cost(
        cls,
        provider: str,
        model: str,
        duration_ms: int = 0
    ) -> Decimal:
        """Calculate cost for image generation"""
        if provider == 'self_hosted_ai':
            # Cost based on compute time
            duration_hours = duration_ms / (1000 * 3600)
            return cls.SELF_HOSTED_COST_PER_HOUR * Decimal(str(duration_hours))
        elif provider == 'openai' and 'dall-e' in model:
            return cls.OPENAI_PRICING.get('dall-e-3', Decimal('0.04'))
        return Decimal(0)

    @classmethod
    def monthly_cost_analysis(cls) -> Dict[str, Any]:
        """Analyze projected monthly costs"""
        from backend.models.monitoring import AIUsageLog
        from django.utils import timezone
        from datetime import timedelta

        # Get last 30 days of usage
        thirty_days_ago = timezone.now() - timedelta(days=30)
        usage_logs = AIUsageLog.objects.filter(created_at__gte=thirty_days_ago)

        cost_by_provider = {}
        total_cost = Decimal(0)
        for log in usage_logs:
            if log.provider not in cost_by_provider:
                cost_by_provider[log.provider] = {
                    'count': 0,
                    'total_cost': Decimal(0),
                }
            cost_by_provider[log.provider]['count'] += 1
            cost_by_provider[log.provider]['total_cost'] += log.cost
            total_cost += log.cost

        # Calculate what OpenAI would have charged for the self-hosted usage
        self_hosted_usage = usage_logs.filter(provider='self_hosted_ai')
        openai_equivalent_cost = Decimal(0)
        for log in self_hosted_usage:
            if log.task_type == 'text_generation':
                openai_cost = cls.calculate_text_generation_cost(
                    'openai', 'gpt-4', log.input_tokens, log.output_tokens
                )
            else:
                openai_cost = cls.calculate_image_generation_cost('openai', 'dall-e-3')
            openai_equivalent_cost += openai_cost

        self_hosted_cost = cost_by_provider.get('self_hosted_ai', {}).get('total_cost', Decimal(0))
        return {
            'cost_by_provider': cost_by_provider,
            'total_cost': total_cost,
            'savings_vs_openai': openai_equivalent_cost - self_hosted_cost,
            'roi_vs_gpu_cost': openai_equivalent_cost - Decimal(200),  # $200 = 1 month GPU rental
        }
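To sanity-check the pricing tables, a short worked example using the per-token GPT-4 rates and the self-hosted hourly rate from the calculator above (a 30-second generation is an assumed workload, not a measured figure):

```python
from decimal import Decimal

# GPT-4: 1,000 input tokens at $0.00003 + 500 output tokens at $0.00006
input_cost = Decimal(1000) * Decimal('0.00003')   # $0.03
output_cost = Decimal(500) * Decimal('0.00006')   # $0.03
total = input_cost + output_cost
print(total)  # 0.06000

# Self-hosted: 30 seconds (30,000 ms) of GPU time at $0.20/hour
self_hosted_per_call = Decimal('0.20') * (Decimal(30000) / Decimal(3600000))
# ≈ $0.0017 per call — roughly 35x cheaper than the GPT-4 call above
```

The gap per call is what drives the savings projections in the Cost Analysis section.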
5. Acceptance Criteria
Infrastructure Ready
- Vast.ai GPU instance rented and running (2x RTX 3090 or better)
- SSH access confirmed from IGNY8 VPS
- Ollama container running with all Qwen3 models downloaded
- ComfyUI container running with FLUX.1 and Stable Diffusion 3.5 models
- Models tested via direct API calls (curl tests all pass)
Network Tunnel Operational
- autossh service running on IGNY8 VPS
- SSH tunnel persists through network interruptions
- Ports 11434, 11435, 8188 accessible on localhost from VPS
- Tunnel auto-reconnects within 60 seconds of disconnect
- Systemd service enables on boot
LiteLLM Proxy Functional
- LiteLLM service running on VPS port 8000
- OpenAI-compatible API endpoints working
- Text generation requests route to Ollama
- Image generation requests route to ComfyUI
- Fallback to OpenAI works when self-hosted unavailable
- Config includes all model variants
- Timeout values appropriate for each model
IGNY8 Backend Integration Complete
- Self-hosted provider added to GlobalIntegrationSettings
- AIEngineRouter tries self-hosted before external APIs
- Celery tasks log which provider was used
- Content includes ai_provider tracking field
- Fallback chain works (self-hosted → OpenAI → Anthropic)
- Unit tests pass for all provider calls
Health Check System Operational
- Health check task runs every 60 seconds
- ServiceHealthLog table populated
- Alerts generated when services down
- System continues working with degraded services
- Dashboard shows service status
Cost Tracking Implemented
- AIUsageLog records all AI requests
- Cost calculation accurate per provider
- Monthly cost analysis working
- Cost comparison shows self-hosted savings
- Dashboard displays cost breakdown
Documentation & Runbooks
- This build document complete and accurate
- Troubleshooting guide for common issues
- Runbook for GPU rental renewal
- Cost monitoring dashboard updated
- Team trained on fallback procedures
6. Claude Code Instructions
Prerequisites
# Ensure VPS provisioned (see 00B)
# Have Vast.ai account created
# Have IGNY8 codebase cloned locally
Build Execution
Step 1: GPU Infrastructure (Operator)
# Manual: Set up Vast.ai account, rent GPU, note IP
# This requires manual interaction with Vast.ai dashboard
# Once IP obtained, proceed to step 2
Step 2: Vast.ai Setup (Automated)
# Run on Vast.ai GPU server
VAST_AI_IP="<your-gpu-ip>"
ssh -i ~/.ssh/vast_key root@$VAST_AI_IP << 'EOF'
# Update system
apt update && apt upgrade -y
# Install Docker
curl https://get.docker.com -sSfL | sh
systemctl enable docker && systemctl start docker
# Create storage directories
mkdir -p /mnt/{models,ollama-cache,comfyui-models,comfyui-output}
chmod 777 /mnt/*
# Create docker network
docker network create ai-network
# Deploy Ollama
docker run -d \
--name ollama \
--network ai-network \
--gpus all \
-e OLLAMA_MODELS=/mnt/ollama-cache \
-v /mnt/ollama-cache:/root/.ollama \
-p 0.0.0.0:11434:11434 \
ollama/ollama:latest
sleep 30
# Pull models (takes 1-2 hours)
docker exec ollama ollama pull qwen3:32b
docker exec ollama ollama pull qwen3:30b-a3b
docker exec ollama ollama pull qwen3:14b
docker exec ollama ollama pull qwen3:8b
# Deploy ComfyUI
docker run -d \
--name comfyui \
--network ai-network \
--gpus all \
-v /mnt/comfyui-models:/ComfyUI/models \
-v /mnt/comfyui-output:/ComfyUI/output \
-p 0.0.0.0:8188:8188 \
comfyui-docker:latest
# Download image models
mkdir -p /mnt/comfyui-models/checkpoints
cd /mnt/comfyui-models/checkpoints
# Note: both repos are gated on Hugging Face; an accepted license and access token may be required
wget https://huggingface.co/black-forest-labs/FLUX.1-dev/resolve/main/flux1-dev.safetensors -O flux1-dev.safetensors
wget https://huggingface.co/stabilityai/stable-diffusion-3.5-large/resolve/main/sd3.5_large.safetensors -O sd3.5-large.safetensors
echo "✓ Vast.ai setup complete"
EOF
Step 3: VPS Tunnel Setup (Automated)
# Run on IGNY8 VPS
VAST_AI_IP="<your-gpu-ip>"
# Install autossh
apt install autossh -y
# Create tunnel user
useradd -m -s /bin/bash tunnel-user
mkdir -p /home/tunnel-user/.ssh
# Copy SSH key (paste private key content)
cat > /home/tunnel-user/.ssh/vast_ai << 'KEY'
-----BEGIN RSA PRIVATE KEY-----
<paste-private-key-here>
-----END RSA PRIVATE KEY-----
KEY
chmod 600 /home/tunnel-user/.ssh/vast_ai
chown -R tunnel-user:tunnel-user /home/tunnel-user/.ssh
# Create systemd service
cat > /etc/systemd/system/tunnel-vast-ai.service << 'SERVICE'
[Unit]
Description=SSH Tunnel to Vast.ai GPU Server
After=network.target
Wants=network-online.target
[Service]
Type=simple
User=tunnel-user
ExecStart=/usr/bin/autossh \
-M 20000 \
-N \
-o "ServerAliveInterval=30" \
-o "ServerAliveCountMax=3" \
-o "ExitOnForwardFailure=yes" \
-o "StrictHostKeyChecking=accept-new" \
-i /home/tunnel-user/.ssh/vast_ai \
-L 11434:localhost:11434 \
-L 11435:localhost:11435 \
-L 8188:localhost:8188 \
root@VAST_AI_IP
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
SERVICE
# Update IP in service file
sed -i "s/VAST_AI_IP/$VAST_AI_IP/g" /etc/systemd/system/tunnel-vast-ai.service
# Start tunnel
systemctl daemon-reload
systemctl start tunnel-vast-ai
systemctl enable tunnel-vast-ai
# Wait and verify
sleep 5
netstat -tlnp | grep -E '(11434|11435|8188)'
echo "✓ SSH tunnel operational"
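Beyond `netstat` (which only shows listeners), a small Python check can confirm that the forwarded ports actually accept connections through the tunnel. This helper is an illustrative sketch, not part of the codebase:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (tunnel forwarding works)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# On the VPS, all three forwarded ports should report True once the tunnel is up
for port in (11434, 11435, 8188):
    print(port, port_open("127.0.0.1", port))
```

This is also a useful building block for the Phase 4 health checks, since a listener on the local port does not guarantee the remote end is reachable.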
Step 4: LiteLLM Installation (Automated)
# Run on IGNY8 VPS
# Install LiteLLM
pip install litellm fastapi uvicorn python-dotenv requests
# Create directories
mkdir -p /opt/litellm
# Create config file
cat > /opt/litellm/config.yaml << 'CONFIG'
model_list:
- model_name: gpt-4
litellm_params:
model: ollama/qwen3:32b
api_base: http://localhost:11434
timeout: 300
max_tokens: 8000
- model_name: gpt-3.5-turbo
litellm_params:
model: ollama/qwen3:8b
api_base: http://localhost:11434
timeout: 120
max_tokens: 2048
- model_name: dall-e-3
litellm_params:
model: comfyui/flux.1-dev
api_base: http://localhost:8188
timeout: 120
litellm_settings:
verbose: true
log_level: INFO
cache_responses: true
CONFIG
# Create .env file
cat > /opt/litellm/.env << 'ENV'
OPENAI_API_KEY=your-openai-key
PORT=8000
HOST=127.0.0.1
ENV
# Create start script
cat > /opt/litellm/start.sh << 'SCRIPT'
#!/bin/bash
cd /opt/litellm
set -a; source .env; set +a  # export variables from .env to the litellm process
exec litellm --config config.yaml --host 127.0.0.1 --port 8000 --num_workers 4
SCRIPT
chmod +x /opt/litellm/start.sh
# Create systemd service
cat > /etc/systemd/system/litellm.service << 'SERVICE'
[Unit]
Description=LiteLLM AI Proxy Gateway
After=network.target tunnel-vast-ai.service
Wants=tunnel-vast-ai.service
[Service]
Type=simple
User=root
WorkingDirectory=/opt/litellm
ExecStart=/opt/litellm/start.sh
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
SERVICE
# Start LiteLLM
systemctl daemon-reload
systemctl start litellm
systemctl enable litellm
# Verify
sleep 5
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer test" \
-d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}'
echo "✓ LiteLLM operational"
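The same request can be issued from Python, since LiteLLM exposes the standard OpenAI chat-completions format. The helper below is an illustrative sketch (function names and the `Bearer test` key are placeholders, not from the codebase):

```python
import json
import urllib.request

def chat_payload(prompt: str, model: str = "gpt-4", max_tokens: int = 100) -> dict:
    """Build an OpenAI-compatible chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def call_litellm(prompt: str, base: str = "http://localhost:8000/v1") -> str:
    """POST to the local LiteLLM proxy and return the generated text."""
    req = urllib.request.Request(
        f"{base}/chat/completions",
        data=json.dumps(chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json", "Authorization": "Bearer test"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the payload shape matches OpenAI's, the backend's existing OpenAI client code can be pointed at `http://localhost:8000/v1` with minimal changes.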
Step 5: IGNY8 Backend Integration (Developer)
# In IGNY8 codebase
# 1. Add to IntegrationProvider enum (backend/models/integration.py)
# 2. Update management command to initialize self-hosted settings
# 3. Implement AIEngineRouter with fallback logic
# 4. Update Celery tasks to use router
# 5. Add database fields for provider tracking
# 6. Run migrations
# 7. Create health check monitoring
python manage.py makemigrations
python manage.py migrate
# Initialize self-hosted integration
python manage.py init_integrations
Step 6: Verification (Automated)
# Test full chain
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Write a 100-word article about clouds"}],
"max_tokens": 200
}'
# Expected response: Article from Qwen3:32B model
# Test fallback by stopping tunnel
systemctl stop tunnel-vast-ai
# Wait 10 seconds
# Retry request - should now use OpenAI instead
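The fallback behavior exercised above can be sketched as a small retry wrapper — names and the stub callable are hypothetical, for illustration only:

```python
def generate_with_fallback(prompt, providers, call):
    """Try providers in order; return (provider, result) from the first that succeeds."""
    last_error = None
    for provider in providers:
        try:
            return provider, call(provider, prompt)
        except Exception as exc:  # in production, catch specific transport/timeout errors
            last_error = exc
    raise RuntimeError(f"All providers failed: {last_error}")

# Stub simulating the tunnel being down, as in the test above:
def fake_call(provider, prompt):
    if provider == "self_hosted_ai":
        raise ConnectionError("tunnel down")
    return f"{provider}: ok"

print(generate_with_fallback("Hi", ["self_hosted_ai", "openai"], fake_call))
# ('openai', 'openai: ok')
```

The real AIEngineRouter would additionally log an AIUsageLog entry recording which provider ultimately served the request.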
Timeline & Resource Allocation
| Phase | Days | Task | Owner | Status |
|---|---|---|---|---|
| 1.1 | 1 | Vast.ai account & GPU rental | Operator | Ready |
| 1.2 | 1 | Docker & Ollama setup | DevOps | Ready |
| 1.3 | 1 | Model pulling & ComfyUI | DevOps | Ready |
| 2.1 | 0.5 | VPS tunnel infrastructure | DevOps | Ready |
| 2.2 | 0.5 | autossh systemd service | DevOps | Ready |
| 2.3 | 1 | LiteLLM installation & config | DevOps | Ready |
| 3.1 | 1 | Backend integration scaffolding | Developer | Ready |
| 3.2 | 1 | AI router & fallback logic | Developer | Ready |
| 3.3 | 1 | Celery task updates | Developer | Ready |
| 4.1 | 1 | Health check system | DevOps | Ready |
| 5.1 | 1 | Cost tracking & dashboard | Developer | Ready |
| **Total** | **7** | | | |
Cost Analysis
Monthly GPU Rental
- Vast.ai 2x RTX 3090: $180-220/month (auto-bid recommended)
- Fixed cost: $200/month (conservative)
Monthly API Costs (Current)
Estimated current external API costs (before optimization):
- OpenAI (GPT-4/3.5): $800-1,200/month
- Anthropic (Claude): $200-400/month
- Image generation (Runware/Bria): $300-500/month
- Total: $1,300-2,100/month
Monthly API Costs (After)
With self-hosted supplementing external:
- Self-hosted cost: $200/month (amortized GPU)
- External APIs (fallback only): $200-300/month
- Total: $400-500/month
Savings & ROI
- Monthly savings: $800-1,700
- Break-even: 12-24 days (1 GPU rental cost)
- Annual savings: $9,600-20,400
Cost Per Subscriber
- Before: $26-42/subscriber/month (on $49/month tier)
- After: $8-10/subscriber/month
- Improvement: 65-76% cost reduction
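The savings range above follows directly from the before/after monthly totals (a quick arithmetic check):

```python
# Monthly spend ranges from the Cost Analysis section, in USD
before = (1300, 2100)  # current external API spend
after = (400, 500)     # projected spend with self-hosting (incl. $200 GPU)

# Best case: high end of "before" minus low end of "after"; worst case: the reverse
savings = (before[0] - after[1], before[1] - after[0])
print(savings)  # (800, 1700)
```

Annualized, that is the $9,600-20,400 range quoted under Savings & ROI.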
Troubleshooting Guide
SSH Tunnel Not Connecting
# Check service status
systemctl status tunnel-vast-ai
# View detailed logs
journalctl -u tunnel-vast-ai -n 100 -f
# Test SSH manually
ssh -v -i /home/tunnel-user/.ssh/vast_ai root@<vast_ai_ip>
# Ensure Vast.ai machine still running and has bandwidth
Ollama Not Responding
# Check container
docker ps | grep ollama
# View logs
docker logs -f ollama
# Test directly
docker exec ollama curl http://localhost:11434/api/tags
# Restart if needed
docker restart ollama
ComfyUI Port Not Accessible
# Check container
docker ps | grep comfyui
# Test through tunnel
curl http://localhost:8188/system_stats
# Restart if needed
docker restart comfyui
LiteLLM Timeouts
# Check LiteLLM logs
journalctl -u litellm -n 100
# Increase timeout in config.yaml
# Test simple request
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 10}'
Fallback to External APIs Not Working
# Verify OpenAI API key in /opt/litellm/.env
# Test OpenAI directly (disable tunnel)
systemctl stop tunnel-vast-ai
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "gpt-3.5-turbo-fallback", "messages": [{"role": "user", "content": "Hi"}]}'
Cross-References
- Dependency: 00B VPS Provisioning & Infrastructure
- Related: 00A Project Planning
- Related: 00C Database & Schema
- Related: 00D Authentication & Security
Document Version
| Version | Date | Changes |
|---|---|---|
| 1.0 | 2026-03-23 | Initial comprehensive build document |
Status: Ready for implementation
Last Updated: 2026-03-23
Next Step: Execute Phase 1 GPU infrastructure setup after 00B completion