igny8/v2/V2-Execution-Docs/00F-self-hosted-ai-infra.md
IGNY8 VPS (Salman) e78a41f11c v2-exece-docs
2026-03-23 10:30:51 +00:00


IGNY8 Phase 0: Self-Hosted AI Infrastructure (00F)

Status: Ready for Implementation
Version: 1.1
Priority: High (cost savings critical for unit economics)
Duration: 5-7 days
Dependencies: 00B (VPS provisioning) must be complete first
Source of Truth: Codebase at /data/app/igny8/
Cost: ~$200/month GPU rental + $0 software (open source)


1. Current State

Existing AI Integration

  • External providers (verified from IntegrationProvider model): OpenAI (GPT-4, GPT-3.5), Anthropic (Claude), Runware (image gen)
  • Storage: API keys stored in IntegrationProvider model (table: igny8_integration_providers) with per-account overrides in IntegrationSettings (table: igny8_integration_settings). Global defaults in GlobalIntegrationSettings.
  • Provider types in codebase: ai, payment, email, storage (from PROVIDER_TYPE_CHOICES)
  • Existing provider_ids: openai, runware, stripe, paypal, resend
  • Architecture: Multi-provider AI engine with model selection capability
  • Current AI functions: auto_cluster, generate_ideas, generate_content, generate_images, generate_image_prompts, optimize_content, generate_site_structure
  • Async handling: Celery workers process long-running AI tasks
  • Cost impact: External APIs constitute 15-30% of monthly operational costs

Problem

  • External API costs scale linearly with subscriber growth
  • No cost leverage at scale (pay-as-you-go model)
  • API rate limits require careful orchestration
  • Privacy concerns with offloading content generation
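
The cost argument above can be sketched with simple break-even arithmetic. The per-1K-token price and tokens-per-article figures below are illustrative assumptions, not measured IGNY8 numbers:

```python
# Break-even sketch: fixed-cost GPU rental vs. pay-per-token external APIs.
# Prices and article sizes are assumptions for illustration only.

GPU_RENTAL_MONTHLY = 200.00         # ~$200/month Vast.ai rental (from this doc)
EXTERNAL_COST_PER_1K_TOKENS = 0.03  # assumed blended external API price
TOKENS_PER_ARTICLE = 3_000          # assumed average generation size

def breakeven_articles_per_month() -> int:
    """Articles/month at which the fixed rental becomes cheaper than the API."""
    cost_per_article = (TOKENS_PER_ARTICLE / 1000) * EXTERNAL_COST_PER_1K_TOKENS
    return int(GPU_RENTAL_MONTHLY / cost_per_article)

# At these assumed prices the break-even point is ~2,222 articles/month;
# beyond it every additional generation has zero marginal cost.
```

This is the core of the "no cost leverage" point: external spend grows linearly with volume, while the rental is flat.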

2. What to Build

Infrastructure Stack

┌─────────────────────────────────────────────────────────────┐
│ IGNY8 Backend (on VPS)                                      │
│ - Send requests to LiteLLM proxy (localhost:8000)           │
│ - Fallback to OpenAI/Anthropic if self-hosted unavailable   │
└──────────────┬──────────────────────────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────────────────────────┐
│ LiteLLM Proxy (on VPS, port 8000)                            │
│ - OpenAI-compatible API gateway                              │
│ - Routes requests to local Ollama and ComfyUI (via tunnel)   │
│ - Load balancing & model selection                           │
│ - Fallback configuration for external APIs                   │
└──────────────┬───────────────────────────────────────────────┘
               │
      ┌────────┴──────────────────┐
      │                           │
      ▼                           ▼
┌──────────────────┐    ┌──────────────────────┐
│ SSH Tunnel       │    │ ComfyUI Tunnel       │
│ (autossh)        │    │ (autossh)            │
│ Port 11434-11435 │    │ Port 8188            │
│                  │    │ (image generation)   │
└────────┬─────────┘    └──────────┬───────────┘
         │                         │
         ▼                         ▼
┌────────────────────────────────────────────────────────┐
│ Vast.ai GPU Server (2x RTX 3090, 48GB VRAM)            │
│ ┌──────────────────────────────────────────────────┐   │
│ │ Ollama Container                                 │   │
│ │ - Qwen3-32B (reasoning)                          │   │
│ │ - Qwen3-30B-A3B (MoE)                            │   │
│ │ - Qwen3-14B (general purpose)                    │   │
│ │ - Qwen3-8B (fast inference)                      │   │
│ │ Listening on 0.0.0.0:11434                       │   │
│ └──────────────────────────────────────────────────┘   │
│ ┌──────────────────────────────────────────────────┐   │
│ │ ComfyUI Container                                │   │
│ │ - FLUX.1 (image gen)                             │   │
│ │ - Stable Diffusion 3.5 (image gen)               │   │
│ │ - SDXL-Lightning (fast generation)               │   │
│ │ Listening on 0.0.0.0:8188                        │   │
│ └──────────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────────┘

Components to Deploy

  1. Vast.ai GPU Rental

    • Machine: 2x NVIDIA RTX 3090 (48GB total VRAM)
    • Estimated cost: $180-220/month
    • Auto-bid setup for cost optimization
    • Persistence: Restore from snapshot between rentals
  2. Ollama (Text LLM Server)

    • Container-based deployment on GPU
    • Models: Qwen3 series (32B, 30B-A3B, 14B, 8B)
    • API: OpenAI-compatible /v1/chat/completions
    • Port: 11434 (tunneled via SSH)
  3. ComfyUI (Image Generation)

    • Container-based deployment on GPU
    • Models: FLUX.1, Stable Diffusion 3.5, SDXL-Lightning
    • API: REST endpoints for image generation
    • Port: 8188 (tunneled via SSH)
  4. SSH Tunnel (autossh)

    • Persistent connection from VPS to GPU server
    • Systemd service with auto-restart
    • Ports: 11434/11435 (Ollama), 8188 (ComfyUI)
    • Handles network interruptions automatically
  5. LiteLLM Proxy

    • Runs on IGNY8 VPS
    • Acts as OpenAI-compatible API gateway
    • Configurable routing based on model/task type
    • Fallback to OpenAI/Anthropic if self-hosted unavailable
    • Port: 8000 (local access only)
  6. IGNY8 Backend Integration

    • Add self-hosted LiteLLM as new IntegrationProvider
    • Update AI request logic to check availability
    • Implement fallback chain: self-hosted → OpenAI → Anthropic
    • Cost tracking per provider
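
The fallback chain in item 6 can be sketched as a priority-ordered loop where the first successful provider wins. Names mirror this document; the `call` function and error handling are simplified placeholders (the full implementation is in section 4.5):

```python
# Minimal sketch of the fallback chain: self-hosted -> OpenAI -> Anthropic.
from typing import Callable, List, Tuple

FALLBACK_CHAIN = ["self_hosted_ai", "openai", "anthropic"]

def generate_with_fallback(
    call: Callable[[str], str],  # call(provider_id) -> text, raises on failure
    chain: List[str] = FALLBACK_CHAIN,
) -> Tuple[str, str]:
    """Return (provider_used, result); raise only if every provider fails."""
    errors = {}
    for provider in chain:
        try:
            return provider, call(provider)
        except Exception as e:  # each failure triggers the next provider
            errors[provider] = str(e)
    raise RuntimeError(f"All providers failed: {errors}")

# Example: self-hosted down, OpenAI answers
def fake_call(provider: str) -> str:
    if provider == "self_hosted_ai":
        raise ConnectionError("tunnel down")
    return f"ok:{provider}"

# generate_with_fallback(fake_call) -> ("openai", "ok:openai")
```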

3. Data Models / APIs

Database Models (Minimal Schema Changes)

Use existing IntegrationProvider model — add a new row with provider_type='ai':

# New IntegrationProvider row (NO new provider_type needed)
# provider_type='ai' already exists in PROVIDER_TYPE_CHOICES

# Create via admin or migration:
IntegrationProvider.objects.create(
    provider_id='self_hosted_ai',
    display_name='Self-Hosted AI (LiteLLM)',
    provider_type='ai',
    api_key='',  # LiteLLM doesn't require auth (internal)
    api_endpoint='http://localhost:8000',
    is_active=True,
    is_sandbox=False,
    config={
        "priority": 10,  # Try self-hosted first
        "models": {
            "text_generation": "qwen3:32b",
            "text_generation_fast": "qwen3:8b",
            "image_generation": "flux.1-dev",
            "image_generation_fast": "sdxl-lightning"
        },
        "timeout": 300,  # 5 minute timeout for slow models
        "fallback_to": "openai"  # Fallback provider if self-hosted fails
    }
)
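
The `config.models` map above can be consumed with a small lookup helper that prefers the fast variant when requested. The helper name is illustrative, not part of the codebase:

```python
# Resolve the concrete model for a task type from the IntegrationProvider
# config row; `config` mirrors the dict stored on that row.
config = {
    "models": {
        "text_generation": "qwen3:32b",
        "text_generation_fast": "qwen3:8b",
        "image_generation": "flux.1-dev",
        "image_generation_fast": "sdxl-lightning",
    }
}

def resolve_model(task: str, fast: bool = False) -> str:
    """Pick the configured model for a task, preferring the fast variant."""
    key = f"{task}_fast" if fast else task
    models = config["models"]
    return models.get(key) or models[task]  # fall back to the quality model

# resolve_model("text_generation")             -> "qwen3:32b"
# resolve_model("image_generation", fast=True) -> "sdxl-lightning"
```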

LiteLLM API Endpoints

Text Generation (Compatible with OpenAI API)

# Request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/qwen3:32b",
    "messages": [{"role": "user", "content": "Write an article about..."}],
    "temperature": 0.7,
    "max_tokens": 2000
  }'

# Response (identical to OpenAI)
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "ollama/qwen3:32b",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "Article text..."},
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 50,
    "completion_tokens": 500,
    "total_tokens": 550
  }
}

Image Generation (ComfyUI via LiteLLM)

# Request
curl -X POST http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "comfyui/flux.1-dev",
    "prompt": "A professional product photo of...",
    "size": "1024x1024",
    "n": 1,
    "quality": "hd"
  }'

# Response
{
  "created": 1234567890,
  "data": [{
    "url": "data:image/png;base64,...",
    "revised_prompt": "A professional product photo of..."
  }]
}

Model Routing Configuration

LiteLLM Config (see section 4.2)

  • Routes gpt-4 requests → ollama/qwen3:32b
  • Routes gpt-3.5-turbo requests → ollama/qwen3:14b (gpt-3.5-turbo-fast → ollama/qwen3:8b)
  • Routes DALL-E requests → comfyui/flux.1-dev
  • Includes fallback to OpenAI for unavailable models
  • Respects timeout and retry limits
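
Because routing is alias-based, the backend keeps sending OpenAI-style model names and the proxy maps them to local backends. A pure-Python sketch of that mapping (mirroring the section 4.2 config; not LiteLLM's internal code):

```python
# Alias routing sketch: OpenAI-style names map to local backends;
# unmapped models fall through to the external API.
ROUTES = {
    "gpt-4": "ollama/qwen3:32b",
    "gpt-3.5-turbo": "ollama/qwen3:14b",
    "dall-e-3": "comfyui/flux.1-dev",
}
FALLBACK_BACKEND = "openai"  # external provider for unmapped models

def route(model: str) -> str:
    """Map an OpenAI-style alias to its local backend, else go external."""
    return ROUTES.get(model, FALLBACK_BACKEND)

# route("gpt-4")   -> "ollama/qwen3:32b"
# route("o1-mini") -> "openai"
```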

4. Implementation Steps

Phase 1: GPU Infrastructure Setup (Days 1-2)

4.1 Vast.ai Account & GPU Rental

Step 1: Create Vast.ai Account

# Navigate to https://www.vast.ai
# Sign up with email
# Verify account via email
# Add payment method (credit card or crypto)

Step 2: Rent GPU Instance

Requirements:

  • 2x NVIDIA RTX 3090 or 2x RTX 4090 = 48GB+ total VRAM (a single RTX 4090 provides only 24GB)
  • Ubuntu 24.04 LTS base image or later
  • Minimum bandwidth: 100 Mbps
  • SSH port 22 open

Setup via Vast.ai dashboard:

  1. Go to "Browse" → Filter by:
    • GPU: 2x RTX 3090 or 2x RTX 4090
    • Min VRAM: 48GB
    • OS: Ubuntu 24.04 LTS (or later)
    • Price: Sort by lowest $/hr
  2. Click "Rent" on selected instance
  3. Choose:
    • Disk size: 500GB (includes models)
    • Secure Cloud: No (to access port 22)
  4. Wait for machine to start (2-5 minutes)
  5. Record SSH credentials from dashboard

Step 3: Test SSH Access

# From your local machine
ssh root@<vast_ai_ip> -i ~/.ssh/vast_key
# Update system
apt update && apt upgrade -y

Step 4: Set Up Snapshot for Persistence

# After first-time setup, create snapshot in Vast.ai dashboard
# Future rentals: select snapshot to restore previous state
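
The ~$180-220/month estimate follows from the hourly bid. The rates below are assumptions based on typical 2x RTX 3090 listings, not quoted Vast.ai prices:

```python
# Rough monthly-cost sketch for a 24/7 Vast.ai rental.
HOURS_PER_MONTH = 730  # average hours in a month

def monthly_cost(hourly_rate: float) -> float:
    """Projected cost of a full month of rental at a given $/hr bid."""
    return round(hourly_rate * HOURS_PER_MONTH, 2)

# A $0.25-0.30/hr bid lands inside the $180-220/month estimate:
low, high = monthly_cost(0.25), monthly_cost(0.30)
# low == 182.5, high == 219.0
```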

4.2 Vast.ai: Docker & Base Containers

Step 1: Install Docker

# SSH into Vast.ai machine
ssh root@<vast_ai_ip>

# Install Docker
curl https://get.docker.com -sSfL | sh
systemctl enable docker
systemctl start docker

# Verify
docker --version

Step 2: Set Up Storage

# Create persistent directory for models
mkdir -p /mnt/models
mkdir -p /mnt/ollama-cache
mkdir -p /mnt/comfyui-models
chmod 777 /mnt/*

# Create docker network for inter-container communication
docker network create ai-network

Step 3: Deploy Ollama Container

docker run -d \
  --name ollama \
  --network ai-network \
  --gpus all \
  -v /mnt/ollama-cache:/root/.ollama \
  -p 0.0.0.0:11434:11434 \
  ollama/ollama:latest
# Models land in the mounted volume (/root/.ollama inside the container),
# so they survive container restarts

Step 4: Pull Qwen3 Models

# Wait for ollama to be ready
sleep 10

# Pull models (will take 30-60 minutes depending on speed)
# Order by priority (largest first)
docker exec ollama ollama pull qwen3:32b       # ~20GB
docker exec ollama ollama pull qwen3:30b-a3b   # ~18GB
docker exec ollama ollama pull qwen3:14b       # ~9GB
docker exec ollama ollama pull qwen3:8b        # ~5GB

# Verify models are loaded
docker exec ollama ollama list
# Output should show all models with their sizes

Step 5: Deploy ComfyUI Container

# Clone ComfyUI repository
cd /opt
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

# Option A: use a prebuilt ComfyUI image with CUDA support
# (there is no official image — build one locally or pull a community image)
docker run -d \
  --name comfyui \
  --network ai-network \
  --gpus all \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -v /mnt/comfyui-models:/ComfyUI/models \
  -v /mnt/comfyui-output:/ComfyUI/output \
  -p 0.0.0.0:8188:8188 \
  comfyui-docker:latest

# Option B: run from the cloned source inside a CUDA base container
# (the runtime image ships without Python — install it, then launch ComfyUI itself)
docker run -d \
  --name comfyui \
  --network ai-network \
  --gpus all \
  -v /opt/ComfyUI:/ComfyUI \
  -v /mnt/comfyui-models:/ComfyUI/models \
  -v /mnt/comfyui-output:/ComfyUI/output \
  -p 0.0.0.0:8188:8188 \
  -w /ComfyUI \
  nvidia/cuda:11.8.0-runtime-ubuntu22.04 \
  bash -c "apt update && apt install -y python3 python3-pip && pip3 install -r requirements.txt && python3 main.py --listen 0.0.0.0 --port 8188"

Step 6: Download Image Generation Models

# Download models to ComfyUI
# Note: the FLUX.1 and SD 3.5 repos are gated on Hugging Face — accept the
# license on the model page and pass an access token ($HF_TOKEN) with each request

# FLUX.1-dev (recommended for quality)
cd /mnt/comfyui-models/checkpoints
wget --header="Authorization: Bearer $HF_TOKEN" -O flux1-dev.safetensors \
  "https://huggingface.co/black-forest-labs/FLUX.1-dev/resolve/main/flux1-dev.safetensors"

# Stable Diffusion 3.5 Large (alternative)
wget --header="Authorization: Bearer $HF_TOKEN" -O sd3.5_large.safetensors \
  "https://huggingface.co/stabilityai/stable-diffusion-3.5-large/resolve/main/sd3.5_large.safetensors"

# SDXL-Lightning (fast, lower quality but acceptable)
wget -O sdxl_lightning_4step.safetensors \
  "https://huggingface.co/ByteDance/SDXL-Lightning/resolve/main/sdxl_lightning_4step.safetensors"

# VAE (for FLUX)
cd /mnt/comfyui-models/vae
wget --header="Authorization: Bearer $HF_TOKEN" -O ae.safetensors \
  "https://huggingface.co/black-forest-labs/FLUX.1-dev/resolve/main/ae.safetensors"

Step 7: Verify Services

# Check Ollama API
curl http://localhost:11434/api/tags
# Should return: {"models": [{"name": "qwen3:32b", "size": ...}, ...]}

# Check ComfyUI
curl http://localhost:8188/system_stats
# Should return GPU/memory stats

Phase 2: VPS Tunnel & LiteLLM Setup (Days 2-3)

4.3 IGNY8 VPS: SSH Tunnel Configuration

Prerequisites: VPS must be provisioned (see 00B)

VPS Environment:

  • Ubuntu 24.04 LTS
  • Docker 29.x
  • GPU server is deployed separately on Vast.ai (not on the VPS)
  • The VPS maintains SSH tunnels to the Vast.ai GPU server for accessing Ollama and ComfyUI

DNS Note: During initial setup, before DNS flip, the IGNY8 backend connects to the LiteLLM proxy via localhost:8000 within the VPS environment. This uses the internal Docker network and local port forwarding, so external DNS configuration does not affect this connection. DNS considerations only apply to external client connections to the IGNY8 API.

Step 1: Generate SSH Key Pair

# On VPS
ssh-keygen -t rsa -b 4096 -f /root/.ssh/vast_ai -N ""

# Still on the VPS, copy the public key to the Vast.ai machine
ssh-copy-id -i /root/.ssh/vast_ai.pub root@<vast_ai_ip>

Step 2: Install & Configure autossh

# On VPS
apt install autossh -y

# Create dedicated user for tunnel
useradd -m -s /bin/bash tunnel-user
mkdir -p /home/tunnel-user/.ssh
cp /root/.ssh/vast_ai* /home/tunnel-user/.ssh/
chown -R tunnel-user:tunnel-user /home/tunnel-user/.ssh
chmod 600 /home/tunnel-user/.ssh/vast_ai

Step 3: Create autossh Systemd Service

File: /etc/systemd/system/tunnel-vast-ai.service

[Unit]
Description=SSH Tunnel to Vast.ai GPU Server
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=tunnel-user
ExecStart=/usr/bin/autossh \
  -M 20000 \
  -N \
  -o "ServerAliveInterval=30" \
  -o "ServerAliveCountMax=3" \
  -o "ExitOnForwardFailure=yes" \
  -o "ConnectTimeout=10" \
  -o "StrictHostKeyChecking=accept-new" \
  -i /home/tunnel-user/.ssh/vast_ai \
  -L 11434:localhost:11434 \
  -L 11435:localhost:11435 \
  -L 8188:localhost:8188 \
  root@<vast_ai_ip>

Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Step 4: Start Tunnel Service

# Reload systemd
systemctl daemon-reload

# Start service
systemctl start tunnel-vast-ai

# Enable on boot
systemctl enable tunnel-vast-ai

# Verify tunnel is up
systemctl status tunnel-vast-ai

# Check logs
journalctl -u tunnel-vast-ai -f

Step 5: Test Tunnel Connectivity

# On VPS, verify ports are open
netstat -tlnp | grep -E '(11434|8188)'
# Should show: 127.0.0.1:11434 LISTEN
#              127.0.0.1:8188 LISTEN

# Test Ollama through tunnel
curl http://localhost:11434/api/tags
# Should return model list from remote Vast.ai machine

# Test ComfyUI through tunnel
curl http://localhost:8188/system_stats
# Should return GPU stats

4.4 LiteLLM Installation & Configuration

Step 1: Install LiteLLM

# On VPS
pip install 'litellm[proxy]' python-dotenv requests

# Verify installation
python -c "import litellm; print(litellm.__version__)"

Step 2: Create LiteLLM Configuration

File: /opt/litellm/config.yaml

# LiteLLM Configuration for IGNY8
model_list:
  # Text Generation Models (Ollama via SSH tunnel)
  - model_name: gpt-4
    litellm_params:
      model: ollama/qwen3:32b
      api_base: http://localhost:11434
      timeout: 300
      max_tokens: 8000

  - model_name: gpt-4-turbo
    litellm_params:
      model: ollama/qwen3:30b-a3b
      api_base: http://localhost:11434
      timeout: 300
      max_tokens: 8000

  - model_name: gpt-3.5-turbo
    litellm_params:
      model: ollama/qwen3:14b
      api_base: http://localhost:11434
      timeout: 180
      max_tokens: 4000

  - model_name: gpt-3.5-turbo-fast
    litellm_params:
      model: ollama/qwen3:8b
      api_base: http://localhost:11434
      timeout: 120
      max_tokens: 2048

  # Fallback to OpenAI for redundancy
  - model_name: gpt-4-fallback
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_API_KEY
      timeout: 60

  - model_name: gpt-3.5-turbo-fallback
    litellm_params:
      model: gpt-3.5-turbo
      api_key: os.environ/OPENAI_API_KEY
      timeout: 60

  # Image Generation (ComfyUI via tunnel)
  - model_name: dall-e-3
    litellm_params:
      model: comfyui/flux.1-dev
      api_base: http://localhost:8188
      timeout: 120

  - model_name: dall-e-2
    litellm_params:
      model: comfyui/sdxl-lightning
      api_base: http://localhost:8188
      timeout: 60

# Router configuration for model selection
router_settings:
  routing_strategy: "simple-shuffle"  # Load balancing
  allowed_model_region: null
  # Fallback behavior: if a model errors, try the listed alternatives in order
  fallbacks:
    - gpt-4: ["gpt-4-fallback", "gpt-3.5-turbo"]
    - gpt-3.5-turbo: ["gpt-3.5-turbo-fallback"]
    - dall-e-3: ["dall-e-2"]

# Logging & caching configuration
litellm_settings:
  set_verbose: true
  cache: true

Step 3: Create Environment File

File: /opt/litellm/.env

# OpenAI (for fallback)
OPENAI_API_KEY=sk-your-key-here

# Anthropic (optional fallback)
ANTHROPIC_API_KEY=sk-ant-your-key-here

# LiteLLM settings
LITELLM_LOG_LEVEL=INFO
LITELLM_CACHE=true

# Service settings
PORT=8000
HOST=127.0.0.1

Step 4: Create LiteLLM Startup Script

File: /opt/litellm/start.sh

#!/bin/bash
set -e

cd /opt/litellm

# Export variables from .env into the environment
set -a
source .env
set +a

# Start the LiteLLM proxy via its CLI entry point
exec litellm \
  --config config.yaml \
  --host 127.0.0.1 \
  --port 8000 \
  --num_workers 4

Make the script executable:

chmod +x /opt/litellm/start.sh

Step 5: Create Systemd Service for LiteLLM

File: /etc/systemd/system/litellm.service

[Unit]
Description=LiteLLM AI Proxy Gateway
After=network.target tunnel-vast-ai.service
Wants=tunnel-vast-ai.service

[Service]
Type=simple
User=root
WorkingDirectory=/opt/litellm
ExecStart=/opt/litellm/start.sh
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/local/python/bin"

[Install]
WantedBy=multi-user.target

Step 6: Start LiteLLM Service

systemctl daemon-reload
systemctl start litellm
systemctl enable litellm

# Verify service
systemctl status litellm

# Check logs
journalctl -u litellm -f

Step 7: Test LiteLLM API

# Test text generation with self-hosted model
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer test-key" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {"role": "user", "content": "Say hello in one sentence"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

# Should return a response from Qwen3:32B

# Test fallback (disconnect tunnel first to test fallback logic)
# curl should eventually fall back to OpenAI after timeout

Phase 3: IGNY8 Backend Integration (Days 3-4)

4.5 Add Self-Hosted Provider to IGNY8

Step 1: Update GlobalIntegrationSettings Model

File: backend/models/integration.py

# Add to IntegrationProvider enum
class IntegrationProvider(models.TextChoices):
    OPENAI = "openai", "OpenAI"
    ANTHROPIC = "anthropic", "Anthropic"
    RUNWARE = "runware", "Runware"
    BRIA = "bria", "Bria"
    SELF_HOSTED = "self_hosted_ai", "Self-Hosted AI (LiteLLM)"  # NEW

# Example settings structure
SELF_HOSTED_SETTINGS = {
    "provider": "self_hosted_ai",
    "name": "Self-Hosted AI (LiteLLM)",
    "base_url": "http://localhost:8000",
    "api_key": "not_required",
    "enabled": True,
    "priority": 10,  # Try first
    "models": {
        "text_generation": "gpt-4",  # Maps to qwen3:32b
        "text_generation_fast": "gpt-3.5-turbo",  # Maps to qwen3:8b
        "image_generation": "dall-e-3",  # Maps to flux.1-dev
        "image_generation_fast": "dall-e-2"  # Maps to sdxl-lightning
    },
    "timeout": 300,
    "fallback_to": "openai"
}

Step 2: Add Self-Hosted Settings to Database

File: backend/management/commands/init_integrations.py

from backend.models.integration import GlobalIntegrationSettings, IntegrationProvider

def add_self_hosted_integration():
    """Initialize self-hosted AI integration"""
    self_hosted_config = {
        "provider": IntegrationProvider.SELF_HOSTED,
        "name": "Self-Hosted AI (LiteLLM)",
        "base_url": "http://localhost:8000",
        "api_key": "",  # Not required for local proxy
        "enabled": True,
        "priority": 10,  # Higher priority = try first
        "models": {
            "text_generation": "gpt-4",
            "text_generation_fast": "gpt-3.5-turbo",
            "image_generation": "dall-e-3",
            "image_generation_fast": "dall-e-2"
        },
        "timeout": 300,
        "max_retries": 2,
        "fallback_provider": IntegrationProvider.OPENAI
    }

    integration, created = GlobalIntegrationSettings.objects.update_or_create(
        provider=IntegrationProvider.SELF_HOSTED,
        defaults=self_hosted_config
    )

    if created:
        print(f"✓ Created {IntegrationProvider.SELF_HOSTED} integration")
    else:
        print(f"✓ Updated {IntegrationProvider.SELF_HOSTED} integration")

# Run in management command initialization

Step 3: Update AI Request Router

File: backend/services/ai_engine.py

import requests
import logging
from typing import Optional, List, Dict, Any
from backend.models.integration import GlobalIntegrationSettings, IntegrationProvider

logger = logging.getLogger(__name__)

class AIEngineRouter:
    """Routes AI requests to appropriate provider with fallback chain"""

    PROVIDER_PRIORITY = {
        IntegrationProvider.SELF_HOSTED: 10,  # Try first
        IntegrationProvider.OPENAI: 5,
        IntegrationProvider.ANTHROPIC: 4,
    }

    def __init__(self):
        self.providers = self._load_providers()

    def _load_providers(self) -> List[Dict[str, Any]]:
        """Load enabled providers from database"""
        configs = GlobalIntegrationSettings.objects.filter(
            enabled=True
        ).values()

        # Sort by priority (highest first)
        sorted_configs = sorted(
            configs,
            key=lambda x: self.PROVIDER_PRIORITY.get(x['provider'], 0),
            reverse=True
        )

        return sorted_configs

    def generate_text(
        self,
        prompt: str,
        model: str = "gpt-4",
        max_tokens: int = 2000,
        temperature: float = 0.7,
        timeout: Optional[int] = None
    ) -> Dict[str, Any]:
        """Generate text using available provider with fallback"""

        for provider_config in self.providers:
            try:
                result = self._call_provider(
                    provider_config,
                    "text",
                    prompt=prompt,
                    model=model,
                    max_tokens=max_tokens,
                    temperature=temperature,
                    timeout=timeout or provider_config.get('timeout', 300)
                )
                return {
                    "success": True,
                    "provider": provider_config['provider'],
                    "text": result['content'],
                    "usage": result.get('usage', {}),
                    "model": result.get('model', model)
                }
            except Exception as e:
                logger.warning(
                    f"Provider {provider_config['provider']} failed: {str(e)}"
                )
                continue

        # All providers failed
        raise Exception("All AI providers exhausted. No response available.")

    def generate_image(
        self,
        prompt: str,
        model: str = "dall-e-3",
        size: str = "1024x1024",
        quality: str = "hd",
        timeout: Optional[int] = None
    ) -> Dict[str, Any]:
        """Generate image using available provider with fallback"""

        for provider_config in self.providers:
            try:
                result = self._call_provider(
                    provider_config,
                    "image",
                    prompt=prompt,
                    model=model,
                    size=size,
                    quality=quality,
                    timeout=timeout or provider_config.get('timeout', 120)
                )
                return {
                    "success": True,
                    "provider": provider_config['provider'],
                    "image_url": result['url'],
                    "revised_prompt": result.get('revised_prompt', prompt),
                    "model": result.get('model', model)
                }
            except Exception as e:
                logger.warning(
                    f"Provider {provider_config['provider']} failed: {str(e)}"
                )
                continue

        # All providers failed
        raise Exception("All image generation providers exhausted.")

    def _call_provider(
        self,
        provider_config: Dict[str, Any],
        task_type: str,  # "text" or "image"
        **kwargs
    ) -> Dict[str, Any]:
        """Call specific provider based on type"""

        provider = provider_config['provider']

        if provider == IntegrationProvider.SELF_HOSTED:
            return self._call_litellm(provider_config, task_type, **kwargs)
        elif provider == IntegrationProvider.OPENAI:
            return self._call_openai(provider_config, task_type, **kwargs)
        elif provider == IntegrationProvider.ANTHROPIC:
            return self._call_anthropic(provider_config, task_type, **kwargs)
        else:
            raise ValueError(f"Unknown provider: {provider}")

    def _call_litellm(
        self,
        provider_config: Dict[str, Any],
        task_type: str,
        **kwargs
    ) -> Dict[str, Any]:
        """Call LiteLLM proxy on localhost"""

        base_url = provider_config['base_url']
        timeout = kwargs.pop('timeout', 300)

        if task_type == "text":
            # Chat completion endpoint
            endpoint = f"{base_url}/v1/chat/completions"
            payload = {
                "model": kwargs.get('model', 'gpt-4'),
                "messages": [
                    {"role": "user", "content": kwargs['prompt']}
                ],
                "temperature": kwargs.get('temperature', 0.7),
                "max_tokens": kwargs.get('max_tokens', 2000)
            }
        elif task_type == "image":
            # Image generation endpoint
            endpoint = f"{base_url}/v1/images/generations"
            payload = {
                "model": kwargs.get('model', 'dall-e-3'),
                "prompt": kwargs['prompt'],
                "size": kwargs.get('size', '1024x1024'),
                "n": 1,
                "quality": kwargs.get('quality', 'hd')
            }
        else:
            raise ValueError(f"Unknown task type: {task_type}")

        try:
            response = requests.post(
                endpoint,
                json=payload,
                timeout=timeout,
                headers={"Authorization": "Bearer test"}
            )
            response.raise_for_status()

            data = response.json()

            if task_type == "text":
                return {
                    "content": data['choices'][0]['message']['content'],
                    "usage": data.get('usage', {}),
                    "model": data.get('model', kwargs.get('model'))
                }
            else:  # image
                return {
                    "url": data['data'][0]['url'],
                    "revised_prompt": data['data'][0].get('revised_prompt'),
                    "model": kwargs.get('model')
                }

        except requests.exceptions.Timeout:
            logger.error(f"LiteLLM timeout after {timeout}s")
            raise
        except requests.exceptions.ConnectionError:
            logger.error("Cannot connect to LiteLLM proxy - tunnel may be down")
            raise
        except Exception as e:
            logger.error(f"LiteLLM request failed: {str(e)}")
            raise

    def _call_openai(self, provider_config, task_type, **kwargs):
        """Existing OpenAI implementation"""
        # Use existing OpenAI integration code
        pass

    def _call_anthropic(self, provider_config, task_type, **kwargs):
        """Existing Anthropic implementation"""
        # Use existing Anthropic integration code
        pass


# Initialize global instance
ai_router = AIEngineRouter()

Step 4: Update Content Generation Celery Tasks

File: backend/tasks/content_generation.py

from celery import shared_task
from backend.services.ai_engine import ai_router
import logging

logger = logging.getLogger(__name__)

@shared_task
def generate_article_content(user_id: int, article_id: int):
    """Generate article content using AI router (tries self-hosted first)"""
    try:
        # Get article from database
        article = Article.objects.get(id=article_id, user_id=user_id)

        # Generate content
        result = ai_router.generate_text(
            prompt=f"Write a detailed article about: {article.topic}",
            model="gpt-4",
            max_tokens=3000,
            temperature=0.7
        )

        # Save result
        article.content = result['text']
        article.ai_provider = result['provider']
        article.save()

        logger.info(
            f"Generated article {article_id} using {result['provider']}"
        )

        return {
            "success": True,
            "article_id": article_id,
            "provider": result['provider']
        }

    except Exception as e:
        logger.error(f"Article generation failed: {str(e)}")
        raise

@shared_task
def generate_product_images(user_id: int, product_id: int):
    """Generate product images using AI router"""
    try:
        product = Product.objects.get(id=product_id, user_id=user_id)

        # Try to generate with self-hosted first (faster)
        result = ai_router.generate_image(
            prompt=f"Professional product photo of: {product.description}",
            model="dall-e-3",
            size="1024x1024",
            quality="hd"
        )

        product.image_url = result['image_url']
        product.ai_provider = result['provider']
        product.save()

        logger.info(f"Generated image for product {product_id} using {result['provider']}")

        return {
            "success": True,
            "product_id": product_id,
            "provider": result['provider'],
            "image_url": result['image_url']
        }

    except Exception as e:
        logger.error(f"Image generation failed: {str(e)}")
        raise

Step 5: Add AI Provider Tracking

File: backend/models/content.py

from django.db import models
from backend.models.integration import IntegrationProvider

class Article(models.Model):
    # ... existing fields ...

    # Track which AI provider generated content
    ai_provider = models.CharField(
        max_length=50,
        choices=IntegrationProvider.choices,
        default=IntegrationProvider.OPENAI,
        help_text="Which AI provider generated this content"
    )
    ai_cost = models.DecimalField(
        max_digits=10,
        decimal_places=6,
        default=0,
        help_text="Cost to generate via AI provider"
    )
    ai_generation_time = models.DurationField(
        null=True,
        blank=True,
        help_text="Time taken to generate content"
    )

class Product(models.Model):
    # ... existing fields ...

    ai_provider = models.CharField(
        max_length=50,
        choices=IntegrationProvider.choices,
        default=IntegrationProvider.OPENAI,
        help_text="Which AI provider generated the image"
    )
    ai_image_cost = models.DecimalField(
        max_digits=10,
        decimal_places=6,
        default=0,
        help_text="Cost to generate image"
    )
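
A sketch of how `ai_cost` might be populated from an OpenAI-style `usage` block. The per-1K-token prices below are illustrative placeholders; self-hosted requests record 0 since the GPU rental is a fixed cost:

```python
from decimal import Decimal

# Assumed example prices per 1K tokens as (prompt, completion) pairs
PRICES = {
    "openai": (Decimal("0.03"), Decimal("0.06")),
    "self_hosted_ai": (Decimal("0"), Decimal("0")),  # fixed-cost rental
}

def usage_cost(provider: str, usage: dict) -> Decimal:
    """Compute per-request cost from an OpenAI-style `usage` block."""
    prompt_price, completion_price = PRICES[provider]
    return (
        Decimal(usage["prompt_tokens"]) / 1000 * prompt_price
        + Decimal(usage["completion_tokens"]) / 1000 * completion_price
    )

# usage_cost("openai", {"prompt_tokens": 50, "completion_tokens": 500})
#   -> Decimal("0.0315")
```

The returned `Decimal` can be written directly to the `ai_cost` / `ai_image_cost` fields above.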

Phase 4: Monitoring & Fallback (Days 4-5)

4.6 Health Check & Failover System

Step 1: Create Health Check Service

File: backend/services/ai_health_check.py

import requests
import time
import logging
from typing import Dict, Any, Tuple
from datetime import datetime, timedelta

logger = logging.getLogger(__name__)

class AIHealthMonitor:
    """Monitor health of self-hosted AI infrastructure"""

    OLLAMA_ENDPOINT = "http://localhost:11434/api/tags"
    COMFYUI_ENDPOINT = "http://localhost:8188/system_stats"
    LITELLM_ENDPOINT = "http://localhost:8000/health"

    HEALTH_CHECK_INTERVAL = 60  # seconds
    FAILURE_THRESHOLD = 3  # Mark unhealthy after 3 failures

    def __init__(self):
        self.last_check = None
        self.failure_count = {
            'ollama': 0,
            'comfyui': 0,
            'litellm': 0
        }
        self.is_healthy = {
            'ollama': True,
            'comfyui': True,
            'litellm': True
        }

    def check_all(self) -> Dict[str, Any]:
        """Run all health checks"""

        results = {
            'timestamp': datetime.now().isoformat(),
            'overall_healthy': True,
            'services': {}
        }

        # Check Ollama
        ollama_healthy = self._check_ollama()
        results['services']['ollama'] = {
            'healthy': ollama_healthy,
            'endpoint': self.OLLAMA_ENDPOINT
        }
        if not ollama_healthy:
            results['overall_healthy'] = False

        # Check ComfyUI
        comfyui_healthy = self._check_comfyui()
        results['services']['comfyui'] = {
            'healthy': comfyui_healthy,
            'endpoint': self.COMFYUI_ENDPOINT
        }
        if not comfyui_healthy:
            results['overall_healthy'] = False

        # Check LiteLLM
        litellm_healthy = self._check_litellm()
        results['services']['litellm'] = {
            'healthy': litellm_healthy,
            'endpoint': self.LITELLM_ENDPOINT
        }
        if not litellm_healthy:
            results['overall_healthy'] = False

        self.last_check = results

        # Log status changes
        for name, healthy_now in (('ollama', ollama_healthy),
                                  ('comfyui', comfyui_healthy),
                                  ('litellm', litellm_healthy)):
            if self.is_healthy[name] != healthy_now:
                level = logging.INFO if healthy_now else logging.WARNING
                logger.log(level, f"{name} service {'recovered' if healthy_now else 'down'}")

        # Update internal state
        self.is_healthy['ollama'] = ollama_healthy
        self.is_healthy['comfyui'] = comfyui_healthy
        self.is_healthy['litellm'] = litellm_healthy

        return results

    def _check_ollama(self) -> bool:
        """Check if Ollama is responding"""
        try:
            response = requests.get(self.OLLAMA_ENDPOINT, timeout=5)
            if response.status_code == 200:
                self.failure_count['ollama'] = 0
                return True
        except Exception as e:
            logger.debug(f"Ollama health check failed: {str(e)}")

        self.failure_count['ollama'] += 1
        return self.failure_count['ollama'] < self.FAILURE_THRESHOLD

    def _check_comfyui(self) -> bool:
        """Check if ComfyUI is responding"""
        try:
            response = requests.get(self.COMFYUI_ENDPOINT, timeout=5)
            if response.status_code == 200:
                self.failure_count['comfyui'] = 0
                return True
        except Exception as e:
            logger.debug(f"ComfyUI health check failed: {str(e)}")

        self.failure_count['comfyui'] += 1
        return self.failure_count['comfyui'] < self.FAILURE_THRESHOLD

    def _check_litellm(self) -> bool:
        """Check if LiteLLM is responding"""
        try:
            response = requests.get(self.LITELLM_ENDPOINT, timeout=5)
            if response.status_code == 200:
                self.failure_count['litellm'] = 0
                return True
        except Exception as e:
            logger.debug(f"LiteLLM health check failed: {str(e)}")

        self.failure_count['litellm'] += 1
        return self.failure_count['litellm'] < self.FAILURE_THRESHOLD

    def is_self_hosted_available(self) -> bool:
        """Check if self-hosted AI is fully available"""
        return all([
            self.is_healthy['ollama'],
            self.is_healthy['comfyui'],
            self.is_healthy['litellm']
        ])


# Create global instance
health_monitor = AIHealthMonitor()
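
The failure-threshold behaviour above can be isolated into a few lines. This standalone sketch (class and method names are illustrative, not from the codebase) shows how a service stays "healthy" until `FAILURE_THRESHOLD` consecutive probe failures, and how a single success resets the counter:

```python
class FailureGate:
    """Illustrative model of AIHealthMonitor's 3-strikes rule."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result; return the health status to report."""
        if probe_ok:
            self.failures = 0          # any success resets the streak
            return True
        self.failures += 1
        return self.failures < self.threshold  # unhealthy only at threshold
```

Two transient failures are tolerated; the third flips the reported status, mirroring `_check_ollama` and its siblings.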

Step 2: Create Health Check Celery Task

File: backend/tasks/health_checks.py

from celery import shared_task
from backend.services.ai_health_check import health_monitor
from backend.models.monitoring import ServiceHealthLog
import logging

logger = logging.getLogger(__name__)

@shared_task
def check_ai_health():
    """Run AI infrastructure health checks every minute"""

    results = health_monitor.check_all()

    # Log to database
    ServiceHealthLog.objects.create(
        service='self_hosted_ai',
        is_healthy=results['overall_healthy'],
        details=results
    )

    # Alert if services are down
    if not results['overall_healthy']:
        down_services = [
            service for service, status in results['services'].items()
            if not status['healthy']
        ]

        logger.error(
            f"AI services down: {', '.join(down_services)}. "
            f"Falling back to external APIs."
        )

    return results


# Add to the Celery beat schedule (belongs in settings.py / Celery config, not this module)
CELERY_BEAT_SCHEDULE = {
    'check-ai-health': {
        'task': 'backend.tasks.health_checks.check_ai_health',
        'schedule': 60.0,  # Every 60 seconds
    },
}
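
The alerting branch above derives the list of failed services from the health-check payload. A minimal standalone version of that extraction (the sample dict shape follows `check_all()`):

```python
def down_services(results: dict) -> list:
    """Names of services whose last probe reported unhealthy."""
    return [
        name for name, status in results.get('services', {}).items()
        if not status['healthy']
    ]
```

Feeding it the result of `health_monitor.check_all()` yields exactly the names joined into the alert message.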

Step 3: Create Monitoring Model

File: backend/models/monitoring.py

from django.db import models

class ServiceHealthLog(models.Model):
    """Log of service health checks"""

    SERVICE_CHOICES = [
        ('self_hosted_ai', 'Self-Hosted AI'),
        ('tunnel', 'SSH Tunnel'),
        ('litellm', 'LiteLLM Proxy'),
    ]

    service = models.CharField(max_length=50, choices=SERVICE_CHOICES)
    is_healthy = models.BooleanField()
    details = models.JSONField(default=dict)
    checked_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        ordering = ['-checked_at']
        indexes = [
            models.Index(fields=['-checked_at']),
            models.Index(fields=['service', '-checked_at']),
        ]

    def __str__(self):
        status = "✓ Healthy" if self.is_healthy else "✗ Down"
        return f"{self.service} {status} @ {self.checked_at}"


class AIUsageLog(models.Model):
    """Track AI provider usage and costs"""

    PROVIDER_CHOICES = [
        ('self_hosted_ai', 'Self-Hosted AI'),
        ('openai', 'OpenAI'),
        ('anthropic', 'Anthropic'),
    ]

    TASK_TYPE_CHOICES = [
        ('text_generation', 'Text Generation'),
        ('image_generation', 'Image Generation'),
        ('keyword_research', 'Keyword Research'),
    ]

    user = models.ForeignKey('auth.User', on_delete=models.CASCADE)  # or settings.AUTH_USER_MODEL if a custom user model is used
    provider = models.CharField(max_length=50, choices=PROVIDER_CHOICES)
    task_type = models.CharField(max_length=50, choices=TASK_TYPE_CHOICES)
    model_used = models.CharField(max_length=100)

    input_tokens = models.IntegerField(default=0)
    output_tokens = models.IntegerField(default=0)

    cost = models.DecimalField(max_digits=10, decimal_places=6, default=0)
    duration_ms = models.IntegerField()  # Milliseconds

    success = models.BooleanField(default=True)
    error_message = models.TextField(blank=True)

    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        ordering = ['-created_at']
        indexes = [
            models.Index(fields=['user', '-created_at']),
            models.Index(fields=['provider', '-created_at']),
        ]

    def __str__(self):
        return f"{self.provider} - {self.task_type} - ${self.cost:.4f}"

Phase 5: Cost Tracking & Optimization (Days 5-6)

4.7 Cost Calculation & Dashboard

Step 1: Create Cost Calculator

File: backend/services/cost_calculator.py

from decimal import Decimal
from typing import Dict, Any

class AICostCalculator:
    """Calculate AI generation costs by provider"""

    # Self-hosted cost (Vast.ai GPU rental amortized)
    # $200/month ÷ 30 days ÷ 24 hours ≈ $0.278 per rented hour.
    # Note: at 70% utilization the cost per *utilized* hour is ≈ $0.278 / 0.7 ≈ $0.40
    # (divide by utilization, not multiply), so the figure below is optimistic;
    # tune it to measured utilization.
    SELF_HOSTED_COST_PER_HOUR = Decimal('0.20')

    # OpenAI pricing (as of 2026)
    OPENAI_PRICING = {
        'gpt-4': {
            'input': Decimal('0.00003'),    # per token
            'output': Decimal('0.00006'),
        },
        'gpt-3.5-turbo': {
            'input': Decimal('0.0000005'),   # per token ($0.50 / 1M)
            'output': Decimal('0.0000015'),  # per token ($1.50 / 1M)
        },
        'dall-e-3': Decimal('0.04'),  # per image
    }

    # Anthropic pricing
    ANTHROPIC_PRICING = {
        'claude-3-opus': {
            'input': Decimal('0.000015'),
            'output': Decimal('0.000075'),
        },
        'claude-3-sonnet': {
            'input': Decimal('0.000003'),
            'output': Decimal('0.000015'),
        },
    }

    @classmethod
    def calculate_text_generation_cost(
        cls,
        provider: str,
        model: str,
        input_tokens: int,
        output_tokens: int,
        duration_ms: int = 0
    ) -> Decimal:
        """Calculate cost for text generation"""

        if provider == 'self_hosted_ai':
            # Cost based on compute time (rough estimate)
            # Decimal-safe: avoid constructing Decimal from a float
            duration_hours = Decimal(duration_ms) / Decimal(3_600_000)
            return cls.SELF_HOSTED_COST_PER_HOUR * duration_hours

        elif provider == 'openai':
            pricing = cls.OPENAI_PRICING.get(model, {})
            input_cost = Decimal(input_tokens) * pricing.get('input', Decimal(0))
            output_cost = Decimal(output_tokens) * pricing.get('output', Decimal(0))
            return input_cost + output_cost

        elif provider == 'anthropic':
            pricing = cls.ANTHROPIC_PRICING.get(model, {})
            input_cost = Decimal(input_tokens) * pricing.get('input', Decimal(0))
            output_cost = Decimal(output_tokens) * pricing.get('output', Decimal(0))
            return input_cost + output_cost

        return Decimal(0)

    @classmethod
    def calculate_image_generation_cost(
        cls,
        provider: str,
        model: str,
        duration_ms: int = 0
    ) -> Decimal:
        """Calculate cost for image generation"""

        if provider == 'self_hosted_ai':
            # Cost based on compute time (Decimal-safe division)
            duration_hours = Decimal(duration_ms) / Decimal(3_600_000)
            return cls.SELF_HOSTED_COST_PER_HOUR * duration_hours

        elif provider == 'openai':
            if 'dall-e' in model:
                return cls.OPENAI_PRICING.get('dall-e-3', Decimal('0.04'))

        return Decimal(0)

    @classmethod
    def monthly_cost_analysis(cls) -> Dict[str, Any]:
        """Analyze projected monthly costs"""

        from backend.models.monitoring import AIUsageLog
        from django.utils import timezone
        from datetime import timedelta

        # Get last 30 days of usage
        thirty_days_ago = timezone.now() - timedelta(days=30)
        usage_logs = AIUsageLog.objects.filter(
            created_at__gte=thirty_days_ago
        )

        cost_by_provider = {}
        total_cost = Decimal(0)

        for log in usage_logs:
            if log.provider not in cost_by_provider:
                cost_by_provider[log.provider] = {
                    'count': 0,
                    'total_cost': Decimal(0),
                    'saved_vs_openai': Decimal(0)
                }

            cost_by_provider[log.provider]['count'] += 1
            cost_by_provider[log.provider]['total_cost'] += log.cost
            total_cost += log.cost

        # Calculate savings
        self_hosted_usage = usage_logs.filter(provider='self_hosted_ai')
        openai_equivalent_cost = Decimal(0)

        for log in self_hosted_usage:
            # What OpenAI would have charged for the same request
            if log.task_type == 'text_generation':
                openai_cost = cls.calculate_text_generation_cost(
                    'openai', 'gpt-4', log.input_tokens, log.output_tokens
                )
            else:
                openai_cost = cls.calculate_image_generation_cost(
                    'openai', 'dall-e-3'
                )
            openai_equivalent_cost += openai_cost

        return {
            'cost_by_provider': cost_by_provider,
            'total_cost': total_cost,
            'savings_vs_openai': openai_equivalent_cost - cost_by_provider.get('self_hosted_ai', {}).get('total_cost', Decimal(0)),
            'roi_vs_gpu_cost': openai_equivalent_cost - Decimal(200),  # $200 = 1 month GPU
        }
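
As a sanity check on the per-token rates above: a GPT-4 call with 1,000 input and 500 output tokens costs 1,000 × $0.00003 + 500 × $0.00006 = $0.06. The same arithmetic in code (constants copied from `OPENAI_PRICING`; the helper name is illustrative):

```python
from decimal import Decimal

GPT4_INPUT = Decimal('0.00003')   # $ per input token
GPT4_OUTPUT = Decimal('0.00006')  # $ per output token

def gpt4_text_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Mirror of the gpt-4 branch in calculate_text_generation_cost."""
    return input_tokens * GPT4_INPUT + output_tokens * GPT4_OUTPUT
```

At these rates, roughly 3,300 such calls cost $200 — the monthly GPU rental, which is where the break-even intuition in the cost analysis comes from.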

5. Acceptance Criteria

Infrastructure Ready

  • Vast.ai GPU instance rented and running (2x RTX 3090 or better)
  • SSH access confirmed from IGNY8 VPS
  • Ollama container running with all Qwen3 models downloaded
  • ComfyUI container running with FLUX.1 and Stable Diffusion 3.5 models
  • Models tested via direct API calls (curl tests all pass)

Network Tunnel Operational

  • autossh service running on IGNY8 VPS
  • SSH tunnel persists through network interruptions
  • Ports 11434, 11435, 8188 accessible on localhost from VPS
  • Tunnel auto-reconnects within 60 seconds of disconnect
  • Systemd service enables on boot

LiteLLM Proxy Functional

  • LiteLLM service running on VPS port 8000
  • OpenAI-compatible API endpoints working
  • Text generation requests route to Ollama
  • Image generation requests route to ComfyUI
  • Fallback to OpenAI works when self-hosted unavailable
  • Config includes all model variants
  • Timeout values appropriate for each model

IGNY8 Backend Integration Complete

  • Self-hosted provider added to GlobalIntegrationSettings
  • AIEngineRouter tries self-hosted before external APIs
  • Celery tasks log which provider was used
  • Content includes ai_provider tracking field
  • Fallback chain works (self-hosted → OpenAI → Anthropic)
  • Unit tests pass for all provider calls
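
The fallback chain in the criteria above reduces to a small ordering decision. A standalone sketch (function name is illustrative) of how a router might build the chain from the health monitor's status:

```python
def provider_order(self_hosted_healthy: bool) -> list:
    """Preferred provider order: self-hosted first when healthy."""
    chain = ['openai', 'anthropic']          # external fallbacks
    if self_hosted_healthy:
        chain.insert(0, 'self_hosted_ai')    # cheap path goes first
    return chain
```

The router then tries each provider in order until one succeeds, so a tunnel outage degrades to external APIs without code changes.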

Health Check System Operational

  • Health check task runs every 60 seconds
  • ServiceHealthLog table populated
  • Alerts generated when services down
  • System continues working with degraded services
  • Dashboard shows service status

Cost Tracking Implemented

  • AIUsageLog records all AI requests
  • Cost calculation accurate per provider
  • Monthly cost analysis working
  • Cost comparison shows self-hosted savings
  • Dashboard displays cost breakdown

Documentation & Runbooks

  • This build document complete and accurate
  • Troubleshooting guide for common issues
  • Runbook for GPU rental renewal
  • Cost monitoring dashboard updated
  • Team trained on fallback procedures

6. Claude Code Instructions

Prerequisites

# Ensure VPS provisioned (see 00B)
# Have Vast.ai account created
# Have IGNY8 codebase cloned locally

Build Execution

Step 1: GPU Infrastructure (Operator)

# Manual: Set up Vast.ai account, rent GPU, note IP
# This requires manual interaction with Vast.ai dashboard
# Once IP obtained, proceed to step 2

Step 2: Vast.ai Setup (Automated)

# Run on Vast.ai GPU server
VAST_AI_IP="<your-gpu-ip>"

ssh -i ~/.ssh/vast_key root@$VAST_AI_IP << 'EOF'

# Update system
apt update && apt upgrade -y

# Install Docker
curl https://get.docker.com -sSfL | sh
systemctl enable docker && systemctl start docker

# Create storage directories
mkdir -p /mnt/{models,ollama-cache,comfyui-models,comfyui-output}
chmod 777 /mnt/*

# Create docker network
docker network create ai-network

# Deploy Ollama (models persist via the volume mount; no OLLAMA_MODELS
# override needed — the container's default /root/.ollama is mounted from the host)
docker run -d \
  --name ollama \
  --network ai-network \
  --gpus all \
  -v /mnt/ollama-cache:/root/.ollama \
  -p 0.0.0.0:11434:11434 \
  ollama/ollama:latest

sleep 30

# Pull models (takes 1-2 hours)
docker exec ollama ollama pull qwen3:32b
docker exec ollama ollama pull qwen3:30b-a3b
docker exec ollama ollama pull qwen3:14b
docker exec ollama ollama pull qwen3:8b

# Deploy ComfyUI (there is no official ComfyUI image —
# substitute the image you build or pull)
docker run -d \
  --name comfyui \
  --network ai-network \
  --gpus all \
  -v /mnt/comfyui-models:/ComfyUI/models \
  -v /mnt/comfyui-output:/ComfyUI/output \
  -p 0.0.0.0:8188:8188 \
  comfyui-docker:latest

# Download image models
# NOTE: both Hugging Face repos are gated — export HF_TOKEN and pass it,
# e.g. wget --header="Authorization: Bearer $HF_TOKEN" ...
mkdir -p /mnt/comfyui-models/checkpoints
cd /mnt/comfyui-models/checkpoints
wget https://huggingface.co/black-forest-labs/FLUX.1-dev/resolve/main/flux1-dev.safetensors -O flux1-dev.safetensors
wget https://huggingface.co/stabilityai/stable-diffusion-3.5-large/resolve/main/sd3.5_large.safetensors -O sd3.5-large.safetensors

echo "✓ Vast.ai setup complete"
EOF
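
To confirm all four Qwen3 models actually landed, the Ollama `/api/tags` response can be checked programmatically. This sketch assumes the documented response shape — a JSON object with a `models` list whose items carry a `name` field:

```python
def installed_models(tags_json: dict) -> set:
    """Model names reported by Ollama's /api/tags endpoint."""
    return {m['name'] for m in tags_json.get('models', [])}

def missing_models(tags_json: dict, required: set) -> set:
    """Required models not yet present on the server."""
    return required - installed_models(tags_json)
```

Feed it `requests.get('http://localhost:11434/api/tags').json()` and compare against `{'qwen3:32b', 'qwen3:30b-a3b', 'qwen3:14b', 'qwen3:8b'}`; an empty result from `missing_models` means the pulls completed.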

Step 3: VPS Tunnel Setup (Automated)

# Run on IGNY8 VPS
VAST_AI_IP="<your-gpu-ip>"

# Install autossh
apt install autossh -y

# Create tunnel user
useradd -m -s /bin/bash tunnel-user
mkdir -p /home/tunnel-user/.ssh

# Copy SSH key (paste private key content)
cat > /home/tunnel-user/.ssh/vast_ai << 'KEY'
-----BEGIN RSA PRIVATE KEY-----
<paste-private-key-here>
-----END RSA PRIVATE KEY-----
KEY

chmod 600 /home/tunnel-user/.ssh/vast_ai
chown -R tunnel-user:tunnel-user /home/tunnel-user/.ssh

# Create systemd service
cat > /etc/systemd/system/tunnel-vast-ai.service << 'SERVICE'
[Unit]
Description=SSH Tunnel to Vast.ai GPU Server
After=network.target
Wants=network-online.target

[Service]
Type=simple
User=tunnel-user
ExecStart=/usr/bin/autossh \
  -M 20000 \
  -N \
  -o "ServerAliveInterval=30" \
  -o "ServerAliveCountMax=3" \
  -o "ExitOnForwardFailure=yes" \
  -o "StrictHostKeyChecking=accept-new" \
  -i /home/tunnel-user/.ssh/vast_ai \
  -L 11434:localhost:11434 \
  -L 11435:localhost:11435 \
  -L 8188:localhost:8188 \
  root@VAST_AI_IP

Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
SERVICE

# Update IP in service file
sed -i "s/VAST_AI_IP/$VAST_AI_IP/g" /etc/systemd/system/tunnel-vast-ai.service

# Start tunnel
systemctl daemon-reload
systemctl start tunnel-vast-ai
systemctl enable tunnel-vast-ai

# Wait and verify all three forwarded ports are listening
sleep 5
ss -tlnp | grep -E '(11434|11435|8188)'

echo "✓ SSH tunnel operational"
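
Beyond grepping for listeners, a quick way to verify the forwarded ports actually accept connections from the VPS side is a generic TCP probe (hypothetical helper, standard library only):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

After the tunnel starts, `port_open('127.0.0.1', p)` should return True for each of 11434, 11435, and 8188.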

Step 4: LiteLLM Installation (Automated)

# Run on IGNY8 VPS

# Install LiteLLM with its proxy-server extras
pip install 'litellm[proxy]' python-dotenv requests

# Create directories
mkdir -p /opt/litellm

# Create config file
cat > /opt/litellm/config.yaml << 'CONFIG'
model_list:
  - model_name: gpt-4
    litellm_params:
      model: ollama/qwen3:32b
      api_base: http://localhost:11434
      timeout: 300
      max_tokens: 8000

  - model_name: gpt-3.5-turbo
    litellm_params:
      model: ollama/qwen3:8b
      api_base: http://localhost:11434
      timeout: 120
      max_tokens: 2048

  # NOTE: "comfyui/" is not a built-in LiteLLM provider — this route
  # assumes a custom handler that translates image requests to ComfyUI
  - model_name: dall-e-3
    litellm_params:
      model: comfyui/flux.1-dev
      api_base: http://localhost:8188
      timeout: 120

litellm_settings:
  set_verbose: true   # request/response logging
  cache: true         # cache identical completion requests
CONFIG

# Create .env file
cat > /opt/litellm/.env << 'ENV'
OPENAI_API_KEY=your-openai-key
PORT=8000
HOST=127.0.0.1
ENV

# Create start script
cat > /opt/litellm/start.sh << 'SCRIPT'
#!/bin/bash
cd /opt/litellm
set -a; source .env; set +a   # export vars from .env (plain `source` does not export)
exec litellm --config config.yaml --host 127.0.0.1 --port 8000 --num_workers 4
SCRIPT

chmod +x /opt/litellm/start.sh

# Create systemd service
cat > /etc/systemd/system/litellm.service << 'SERVICE'
[Unit]
Description=LiteLLM AI Proxy Gateway
After=network.target tunnel-vast-ai.service
Wants=tunnel-vast-ai.service

[Service]
Type=simple
User=root
WorkingDirectory=/opt/litellm
ExecStart=/opt/litellm/start.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
SERVICE

# Start LiteLLM
systemctl daemon-reload
systemctl start litellm
systemctl enable litellm

# Verify
sleep 5
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer test" \
  -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}'

echo "✓ LiteLLM operational"
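
Because the proxy speaks the OpenAI chat-completions schema, the backend only ever builds one request shape regardless of where LiteLLM routes it. A sketch of that body (helper name is illustrative; actually sending it requires the proxy running on `localhost:8000`, e.g. via `requests.post`):

```python
def build_chat_payload(model: str, prompt: str, max_tokens: int = 200) -> dict:
    """OpenAI-compatible body for POST /v1/chat/completions."""
    return {
        'model': model,
        'messages': [{'role': 'user', 'content': prompt}],
        'max_tokens': max_tokens,
    }
```

The same payload works whether the request lands on Ollama via the tunnel or falls back to OpenAI — that uniformity is the point of fronting everything with LiteLLM.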

Step 5: IGNY8 Backend Integration (Developer)

# In IGNY8 codebase

# 1. Add to IntegrationProvider enum (backend/models/integration.py)
# 2. Update management command to initialize self-hosted settings
# 3. Implement AIEngineRouter with fallback logic
# 4. Update Celery tasks to use router
# 5. Add database fields for provider tracking
# 6. Run migrations
# 7. Create health check monitoring

python manage.py makemigrations
python manage.py migrate

# Initialize self-hosted integration
python manage.py init_integrations

Step 6: Verification (Automated)

# Test full chain
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Write a 100-word article about clouds"}],
    "max_tokens": 200
  }'

# Expected response: Article from Qwen3:32B model

# Test fallback by stopping tunnel
systemctl stop tunnel-vast-ai
# Wait 10 seconds
# Retry request - should now use OpenAI instead

Timeline & Resource Allocation

| Phase | Days | Task | Owner | Status |
|-------|------|------|-------|--------|
| 1.1 | 1 | Vast.ai account & GPU rental | Operator | Ready |
| 1.2 | 1 | Docker & Ollama setup | DevOps | Ready |
| 1.3 | 1 | Model pulling & ComfyUI | DevOps | Ready |
| 2.1 | 0.5 | VPS tunnel infrastructure | DevOps | Ready |
| 2.2 | 0.5 | autossh systemd service | DevOps | Ready |
| 2.3 | 1 | LiteLLM installation & config | DevOps | Ready |
| 3.1 | 1 | Backend integration scaffolding | Developer | Ready |
| 3.2 | 1 | AI router & fallback logic | Developer | Ready |
| 3.3 | 1 | Celery task updates | Developer | Ready |
| 4.1 | 1 | Health check system | DevOps | Ready |
| 5.1 | 1 | Cost tracking & dashboard | Developer | Ready |
| **Total** | **7** | | | |

Cost Analysis

Monthly GPU Rental

  • Vast.ai 2x RTX 3090: $180-220/month (auto-bid recommended)
  • Fixed cost: $200/month (conservative)

Monthly API Costs (Current)

Estimated current external API costs (before optimization):

  • OpenAI (GPT-4/3.5): $800-1,200/month
  • Anthropic (Claude): $200-400/month
  • Image generation (Runware/Bria): $300-500/month
  • Total: $1,300-2,100/month

Monthly API Costs (After)

With self-hosted supplementing external:

  • Self-hosted cost: $200/month (amortized GPU)
  • External APIs (fallback only): $200-300/month
  • Total: $400-500/month

Savings & ROI

  • Monthly savings: $800-1,700
  • Break-even: at $800-1,700/month in savings, the $200 GPU rental is recovered in roughly 4-8 days
  • Annual savings: $9,600-20,400

Cost Per Subscriber

  • Before: $26-42/subscriber/month (on $49/month tier)
  • After: $8-10/subscriber/month
  • Improvement: 65-76% cost reduction

Troubleshooting Guide

SSH Tunnel Not Connecting

# Check service status
systemctl status tunnel-vast-ai

# View detailed logs
journalctl -u tunnel-vast-ai -n 100 -f

# Test SSH manually
ssh -v -i /home/tunnel-user/.ssh/vast_ai root@<vast_ai_ip>

# Ensure Vast.ai machine still running and has bandwidth

Ollama Not Responding

# Check container
docker ps | grep ollama

# View logs
docker logs -f ollama

# Test directly
docker exec ollama curl http://localhost:11434/api/tags

# Restart if needed
docker restart ollama

ComfyUI Port Not Accessible

# Check container
docker ps | grep comfyui

# Test through tunnel
curl http://localhost:8188/system_stats

# Restart if needed
docker restart comfyui

LiteLLM Timeouts

# Check LiteLLM logs
journalctl -u litellm -n 100

# Increase timeout in config.yaml
# Test simple request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 10}'

Fallback to External APIs Not Working

# Verify OpenAI API key in /opt/litellm/.env
# Test fallback directly (disable tunnel first)
# NOTE: "gpt-3.5-turbo-fallback" assumes a fallback alias is defined in
# config.yaml; the sample config above does not include one — add it first
systemctl stop tunnel-vast-ai
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-3.5-turbo-fallback", "messages": [{"role": "user", "content": "Hi"}]}'

Cross-References

Dependency: 00B VPS Provisioning & Infrastructure
Related: 00A Project Planning
Related: 00C Database & Schema
Related: 00D Authentication & Security


Document Version

| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2026-03-23 | Initial comprehensive build document |

Status: Ready for implementation
Last Updated: 2026-03-23
Next Step: Execute Phase 1 GPU infrastructure setup after 00B completion