# IGNY8 Phase 0: Self-Hosted AI Infrastructure (00F)

**Status:** Ready for Implementation  
**Version:** 1.1  
**Priority:** High (cost savings critical for unit economics)  
**Duration:** 5-7 days  
**Dependencies:** 00B (VPS provisioning) must be complete first  
**Source of Truth:** Codebase at `/data/app/igny8/`  
**Cost:** ~$200/month GPU rental + $0 software (open source)

---

## 1. Current State

### Existing AI Integration

- **External providers (verified from `IntegrationProvider` model):** OpenAI (GPT-4, GPT-3.5), Anthropic (Claude), Runware (image generation)
- **Storage:** API keys live in the `IntegrationProvider` model (table: `igny8_integration_providers`), with per-account overrides in `IntegrationSettings` (table: `igny8_integration_settings`) and global defaults in `GlobalIntegrationSettings`
- **Provider types in codebase:** `ai`, `payment`, `email`, `storage` (from `PROVIDER_TYPE_CHOICES`)
- **Existing provider_ids:** `openai`, `runware`, `stripe`, `paypal`, `resend`
- **Architecture:** multi-provider AI engine with model-selection capability
- **Current AI functions:** `auto_cluster`, `generate_ideas`, `generate_content`, `generate_images`, `generate_image_prompts`, `optimize_content`, `generate_site_structure`
- **Async handling:** Celery workers process long-running AI tasks
- **Cost impact:** external APIs account for 15-30% of monthly operational costs

### Problem

- External API costs scale linearly with subscriber growth
- No cost leverage at scale (pay-as-you-go pricing)
- API rate limits require careful orchestration
- Privacy concerns from offloading content generation to third parties

---

## 2. What to Build

### Infrastructure Stack

```
┌─────────────────────────────────────────────────────────────┐
│ IGNY8 Backend (on VPS)                                      │
│ - Sends requests to LiteLLM proxy (localhost:8000)          │
│ - Falls back to OpenAI/Anthropic if self-hosted unavailable │
└──────────────┬──────────────────────────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────────────────────────┐
│ LiteLLM Proxy (on VPS, port 8000)                            │
│ - OpenAI-compatible API gateway                              │
│ - Routes requests to remote Ollama and ComfyUI (via tunnel)  │
│ - Load balancing & model selection                           │
│ - Fallback configuration for external APIs                   │
└──────────────┬───────────────────────────────────────────────┘
               │
      ┌────────┴──────────────────┐
      │                           │
      ▼                           ▼
┌──────────────────┐    ┌──────────────────────┐
│ SSH Tunnel       │    │ ComfyUI Tunnel       │
│ (autossh)        │    │ (autossh)            │
│ Ports 11434-11435│    │ Port 8188            │
│                  │    │ (image generation)   │
└────────┬─────────┘    └──────────┬───────────┘
         │                         │
         ▼                         ▼
┌────────────────────────────────────────────────────────┐
│ Vast.ai GPU Server (2x RTX 3090, 48GB VRAM)            │
│ ┌──────────────────────────────────────────────────┐   │
│ │ Ollama Container                                 │   │
│ │ - Qwen3-32B (reasoning)                          │   │
│ │ - Qwen3-30B-A3B (MoE, efficient)                 │   │
│ │ - Qwen3-14B (general purpose)                    │   │
│ │ - Qwen3-8B (fast inference)                      │   │
│ │ Listening on 0.0.0.0:11434                       │   │
│ └──────────────────────────────────────────────────┘   │
│ ┌──────────────────────────────────────────────────┐   │
│ │ ComfyUI Container                                │   │
│ │ - FLUX.1 (image gen)                             │   │
│ │ - Stable Diffusion 3.5 (image gen)               │   │
│ │ - SDXL-Lightning (fast generation)               │   │
│ │ Listening on 0.0.0.0:8188                        │   │
│ └──────────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────────┘
```

### Components to Deploy

1. **Vast.ai GPU Rental**
   - Machine: 2x NVIDIA RTX 3090 (48GB total VRAM)
   - Estimated cost: $180-220/month
   - Auto-bid setup for cost optimization
   - Persistence: restore from snapshot between rentals

2. **Ollama (Text LLM Server)**
   - Container-based deployment on the GPU server
   - Models: Qwen3 series (32B, 30B-A3B, 14B, 8B)
   - API: OpenAI-compatible `/v1/chat/completions`
   - Port: 11434 (tunneled via SSH)

3. **ComfyUI (Image Generation)**
   - Container-based deployment on the GPU server
   - Models: FLUX.1, Stable Diffusion 3.5, SDXL-Lightning
   - API: REST endpoints for image generation
   - Port: 8188 (tunneled via SSH)

4. **SSH Tunnel (autossh)**
   - Persistent connection from VPS to GPU server
   - Systemd service with auto-restart
   - Ports: 11434/11435 (Ollama), 8188 (ComfyUI)
   - Handles network interruptions automatically

5. **LiteLLM Proxy**
   - Runs on the IGNY8 VPS
   - Acts as an OpenAI-compatible API gateway
   - Configurable routing based on model/task type
   - Fallback to OpenAI/Anthropic if self-hosted is unavailable
   - Port: 8000 (local access only)

6. **IGNY8 Backend Integration**
   - Add self-hosted LiteLLM as a new `IntegrationProvider`
   - Update AI request logic to check availability
   - Implement fallback chain: self-hosted → OpenAI → Anthropic
   - Cost tracking per provider

---

## 3. Data Models / APIs

### Database Models (Minimal Schema Changes)

Use the existing `IntegrationProvider` model: add a new row with `provider_type='ai'`:

```python
# New IntegrationProvider row (NO new provider_type needed)
# provider_type='ai' already exists in PROVIDER_TYPE_CHOICES

# Create via admin or migration:
IntegrationProvider.objects.create(
    provider_id='self_hosted_ai',
    display_name='Self-Hosted AI (LiteLLM)',
    provider_type='ai',
    api_key='',  # LiteLLM doesn't require auth (internal only)
    api_endpoint='http://localhost:8000',
    is_active=True,
    is_sandbox=False,
    config={
        "priority": 10,  # Try self-hosted first
        "models": {
            "text_generation": "qwen3:32b",
            "text_generation_fast": "qwen3:8b",
            "image_generation": "flux.1-dev",
            "image_generation_fast": "sdxl-lightning"
        },
        "timeout": 300,  # 5-minute timeout for slow models
        "fallback_to": "openai"  # Fallback provider if self-hosted fails
    }
)
```

### LiteLLM API Endpoints

**Text Generation (compatible with the OpenAI API)**

```bash
# Request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/qwen3:32b",
    "messages": [{"role": "user", "content": "Write an article about..."}],
    "temperature": 0.7,
    "max_tokens": 2000
  }'

# Response (same shape as OpenAI)
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "ollama/qwen3:32b",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "Article text..."},
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 50,
    "completion_tokens": 500,
    "total_tokens": 550
  }
}
```

**Image Generation (ComfyUI via LiteLLM)**

```bash
# Request
curl -X POST http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "comfyui/flux.1-dev",
    "prompt": "A professional product photo of...",
    "size": "1024x1024",
    "n": 1,
    "quality": "hd"
  }'

# Response
{
  "created": 1234567890,
  "data": [{
    "url": "data:image/png;base64,...",
    "revised_prompt": "A professional product photo of..."
  }]
}
```

### Model Routing Configuration

**LiteLLM config (see section 4.4)**

- Routes `gpt-4` requests → `ollama/qwen3:32b`
- Routes `gpt-3.5-turbo` requests → `ollama/qwen3:14b`
- Routes DALL-E requests → `comfyui/flux.1-dev`
- Includes fallback to OpenAI for unavailable models
- Respects timeout and retry limits

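Because the proxy speaks the OpenAI wire format, the backend only needs a plain HTTP POST against it. A minimal stdlib sketch of building that request (the function name and defaults are illustrative, not part of the codebase):

```python
import json
import urllib.request

def chat_request(model: str, prompt: str,
                 base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request aimed at the LiteLLM proxy."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_request("gpt-4", "Say hello")
# urllib.request.urlopen(req) would send it once the proxy is running
```

Sending the same request with `model="gpt-3.5-turbo"` exercises the second route without any client-side change; that is the point of keeping the gateway OpenAI-compatible.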
---

## 4. Implementation Steps

### Phase 1: GPU Infrastructure Setup (Days 1-2)

#### 4.1 Vast.ai Account & GPU Rental

**Step 1: Create Vast.ai Account**

```bash
# Navigate to https://www.vast.ai
# Sign up with email
# Verify account via email
# Add payment method (credit card or crypto)
```

**Step 2: Rent GPU Instance**

Requirements:

- 2x NVIDIA RTX 3090 (24GB each) or 2x RTX 4090 = 48GB+ total VRAM
- Ubuntu 24.04 LTS base image (preferred) or later
- Minimum bandwidth: 100 Mbps
- SSH port 22 open

Setup via the Vast.ai dashboard:

1. Go to "Browse" → filter by:
   - GPU: 2x RTX 3090 or 2x RTX 4090
   - Min VRAM: 48GB total
   - OS: Ubuntu 24.04 LTS (or later)
   - Price: sort by lowest $/hr
2. Click "Rent" on the selected instance
3. Choose:
   - Disk size: 500GB (includes models)
   - Secure Cloud: No (to allow direct SSH on port 22)
4. Wait for the machine to start (2-5 minutes)
5. Record SSH credentials from the dashboard

**Step 3: Test SSH Access**

```bash
# From your local machine
ssh root@<vast_ai_ip> -i ~/.ssh/vast_key

# Update system
apt update && apt upgrade -y
```

**Step 4: Set Up Snapshot for Persistence**

```bash
# After first-time setup, create a snapshot in the Vast.ai dashboard
# Future rentals: select the snapshot to restore the previous state
```

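The $180-220/month estimate follows directly from hourly GPU pricing; a quick sanity check (the $/hr figure below is an assumed example, not a Vast.ai quote):

```python
def monthly_cost(rate_per_gpu_hr: float, gpus: int = 2, hours: int = 24 * 30) -> float:
    """Projected monthly rental cost for an always-on instance."""
    return round(rate_per_gpu_hr * gpus * hours, 2)

# At an assumed $0.14/hr per RTX 3090, two GPUs around the clock:
cost = monthly_cost(0.14)  # ~$201.60/month, inside the $180-220 band
```

Pausing the instance overnight (the snapshot workflow above makes this cheap) scales the `hours` term down proportionally.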

---

#### 4.2 Vast.ai: Docker & Base Containers

**Step 1: Install Docker**

```bash
# SSH into the Vast.ai machine
ssh root@<vast_ai_ip>

# Install Docker
curl -fsSL https://get.docker.com | sh
systemctl enable docker
systemctl start docker

# Verify
docker --version
```

**Step 2: Set Up Storage**

```bash
# Create persistent directories for models
mkdir -p /mnt/models /mnt/ollama-cache /mnt/comfyui-models
chmod 777 /mnt/*

# Create a docker network for inter-container communication
docker network create ai-network
```

**Step 3: Deploy Ollama Container**

```bash
# Model storage persists via the /root/.ollama volume
docker run -d \
  --name ollama \
  --network ai-network \
  --gpus all \
  -v /mnt/ollama-cache:/root/.ollama \
  -p 0.0.0.0:11434:11434 \
  ollama/ollama:latest
```


**Step 4: Pull Qwen3 Models**

```bash
# Wait for ollama to be ready
sleep 10

# Pull models (30-60 minutes depending on bandwidth)
# Ordered by priority (largest first)
docker exec ollama ollama pull qwen3:32b      # ~20GB
docker exec ollama ollama pull qwen3:30b-a3b  # ~18GB
docker exec ollama ollama pull qwen3:14b      # ~9GB
docker exec ollama ollama pull qwen3:8b       # ~5GB

# Verify models are loaded
docker exec ollama ollama list
# Output should list all four models with their sizes
```

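Rather than eyeballing `ollama list`, a provisioning script can verify the pulls against Ollama's `/api/tags` endpoint. A minimal stdlib sketch that checks a tags response against the four pulls above (the helper name is illustrative):

```python
import json

REQUIRED_MODELS = ["qwen3:32b", "qwen3:30b-a3b", "qwen3:14b", "qwen3:8b"]

def missing_models(tags_response: str, required=REQUIRED_MODELS):
    """Given the JSON body of GET /api/tags, return required models not yet pulled."""
    present = {m["name"] for m in json.loads(tags_response).get("models", [])}
    return [m for m in required if m not in present]

# Example: only one model pulled so far
body = '{"models": [{"name": "qwen3:8b", "size": 5000000000}]}'
print(missing_models(body))  # ['qwen3:32b', 'qwen3:30b-a3b', 'qwen3:14b']
```

Wiring this to `urllib.request.urlopen("http://localhost:11434/api/tags")` in a retry loop gives a clean readiness gate for the snapshot step.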

**Step 5: Deploy ComfyUI Container**

```bash
# Clone the ComfyUI repository
cd /opt
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

# Option A: use a prebuilt ComfyUI image with CUDA support
# (the image name below is a placeholder; substitute an image you build or trust)
docker run -d \
  --name comfyui \
  --network ai-network \
  --gpus all \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -v /mnt/comfyui-models:/ComfyUI/models \
  -v /mnt/comfyui-output:/ComfyUI/output \
  -p 0.0.0.0:8188:8188 \
  comfyui-docker:latest

# Option B: run from source inside a CUDA container
docker run -d \
  --name comfyui \
  --network ai-network \
  --gpus all \
  -v /opt/ComfyUI:/ComfyUI \
  -v /mnt/comfyui-models:/ComfyUI/models \
  -v /mnt/comfyui-output:/ComfyUI/output \
  -p 0.0.0.0:8188:8188 \
  -w /ComfyUI \
  nvidia/cuda:11.8.0-runtime-ubuntu22.04 \
  bash -c "apt-get update && apt-get install -y python3 python3-pip && \
           pip3 install -r requirements.txt && \
           python3 main.py --listen 0.0.0.0 --port 8188"
```


**Step 6: Download Image Generation Models**

```bash
# Download checkpoints into ComfyUI's model directory.
# NOTE: verify the exact filenames on each Hugging Face repo before running;
# FLUX.1-dev and SD 3.5 are gated repos that require accepting the license
# and passing a token (add: --header="Authorization: Bearer $HF_TOKEN" to wget).

# FLUX.1 (recommended for quality)
cd /mnt/comfyui-models/checkpoints
wget -O flux1-dev.safetensors \
  "https://huggingface.co/black-forest-labs/FLUX.1-dev/resolve/main/flux1-dev.safetensors"

# Stable Diffusion 3.5 (alternative)
wget -O sd3.5-large.safetensors \
  "https://huggingface.co/stabilityai/stable-diffusion-3.5-large/resolve/main/sd3.5_large.safetensors"

# SDXL-Lightning (fast; lower quality but acceptable)
wget -O sdxl-lightning.safetensors \
  "https://huggingface.co/ByteDance/SDXL-Lightning/resolve/main/sdxl_lightning_4step.safetensors"

# VAE (for FLUX)
cd /mnt/comfyui-models/vae
wget -O ae.safetensors \
  "https://huggingface.co/black-forest-labs/FLUX.1-dev/resolve/main/ae.safetensors"
```


**Step 7: Verify Services**

```bash
# Check Ollama API
curl http://localhost:11434/api/tags
# Should return: {"models": [{"name": "qwen3:32b", "size": ...}, ...]}

# Check ComfyUI
curl http://localhost:8188/system_stats
# Should return GPU/memory stats
```


---

### Phase 2: VPS Tunnel & LiteLLM Setup (Days 2-3)

#### 4.3 IGNY8 VPS: SSH Tunnel Configuration

**Prerequisites:** VPS must be provisioned (see 00B)

**VPS environment:**

- Ubuntu 24.04 LTS
- Docker 29.x
- The GPU server is deployed separately on Vast.ai (not on the VPS)
- The VPS maintains SSH tunnels to the Vast.ai GPU server for access to Ollama and ComfyUI

**DNS note:** During initial setup, before the DNS flip, the IGNY8 backend reaches the LiteLLM proxy at `localhost:8000` inside the VPS. This uses the internal Docker network and local port forwarding, so external DNS configuration does not affect the connection. DNS only matters for external client connections to the IGNY8 API.

**Step 1: Generate SSH Key Pair**

```bash
# On the VPS
ssh-keygen -t rsa -b 4096 -f /root/.ssh/vast_ai -N ""

# Copy the public key to the Vast.ai machine (run from the VPS;
# alternatively paste the key into the instance via the Vast.ai dashboard)
ssh-copy-id -i /root/.ssh/vast_ai.pub root@<vast_ai_ip>
```


**Step 2: Install & Configure autossh**

```bash
# On the VPS
apt install autossh -y

# Create a dedicated user for the tunnel
useradd -m -s /bin/bash tunnel-user
mkdir -p /home/tunnel-user/.ssh
cp /root/.ssh/vast_ai* /home/tunnel-user/.ssh/
chown -R tunnel-user:tunnel-user /home/tunnel-user/.ssh
chmod 600 /home/tunnel-user/.ssh/vast_ai
```


**Step 3: Create autossh Systemd Service**

File: `/etc/systemd/system/tunnel-vast-ai.service`

```ini
[Unit]
Description=SSH Tunnel to Vast.ai GPU Server
After=network.target
Wants=network-online.target

[Service]
Type=simple
User=tunnel-user
ExecStart=/usr/bin/autossh \
  -M 20000 \
  -N \
  -o "ServerAliveInterval=30" \
  -o "ServerAliveCountMax=3" \
  -o "ExitOnForwardFailure=yes" \
  -o "ConnectTimeout=10" \
  -o "StrictHostKeyChecking=accept-new" \
  -i /home/tunnel-user/.ssh/vast_ai \
  -L 11434:localhost:11434 \
  -L 11435:localhost:11435 \
  -L 8188:localhost:8188 \
  root@<vast_ai_ip>
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
```

Note: `ExitOnForwardFailure=yes` makes ssh exit when a port forward cannot be established, so autossh (and systemd) restart the tunnel instead of keeping a half-working connection alive.


**Step 4: Start Tunnel Service**

```bash
# Reload systemd
systemctl daemon-reload

# Start the service
systemctl start tunnel-vast-ai

# Enable on boot
systemctl enable tunnel-vast-ai

# Verify the tunnel is up
systemctl status tunnel-vast-ai

# Check logs
journalctl -u tunnel-vast-ai -f
```


**Step 5: Test Tunnel Connectivity**

```bash
# On the VPS, verify the ports are open
netstat -tlnp | grep -E '(11434|8188)'
# Should show: 127.0.0.1:11434 LISTEN
#              127.0.0.1:8188  LISTEN

# Test Ollama through the tunnel
curl http://localhost:11434/api/tags
# Should return the model list from the remote Vast.ai machine

# Test ComfyUI through the tunnel
curl http://localhost:8188/system_stats
# Should return GPU stats
```

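The same connectivity check can run from application code, for example before routing a request to the self-hosted provider. A minimal stdlib sketch (the helper name is illustrative):

```python
import socket

def tunnel_up(port: int, host: str = "127.0.0.1", timeout: float = 2.0) -> bool:
    """True if something is accepting connections on the forwarded port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Probe both tunnel endpoints
status = {port: tunnel_up(port) for port in (11434, 8188)}
```

A TCP connect only proves the forward is bound on the VPS side; the `curl` checks above remain the end-to-end test that the remote services actually answer.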

---

#### 4.4 LiteLLM Installation & Configuration

**Step 1: Install LiteLLM**

```bash
# On the VPS (the [proxy] extra pulls in the proxy server dependencies,
# including FastAPI and uvicorn)
pip install 'litellm[proxy]' python-dotenv requests

# Verify installation
python -c "import litellm; print(litellm.__version__)"
```


**Step 2: Create LiteLLM Configuration**

File: `/opt/litellm/config.yaml`

```yaml
# LiteLLM configuration for IGNY8
model_list:
  # Text generation models (Ollama via SSH tunnel)
  - model_name: gpt-4
    litellm_params:
      model: ollama/qwen3:32b
      api_base: http://localhost:11434
      timeout: 300
      max_tokens: 8000

  - model_name: gpt-4-turbo
    litellm_params:
      model: ollama/qwen3:30b-a3b
      api_base: http://localhost:11434
      timeout: 300
      max_tokens: 8000

  - model_name: gpt-3.5-turbo
    litellm_params:
      model: ollama/qwen3:14b
      api_base: http://localhost:11434
      timeout: 180
      max_tokens: 4000

  - model_name: gpt-3.5-turbo-fast
    litellm_params:
      model: ollama/qwen3:8b
      api_base: http://localhost:11434
      timeout: 120
      max_tokens: 2048

  # Fallback to OpenAI for redundancy
  - model_name: gpt-4-fallback
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_API_KEY  # LiteLLM env-var reference syntax
      timeout: 60

  - model_name: gpt-3.5-turbo-fallback
    litellm_params:
      model: gpt-3.5-turbo
      api_key: os.environ/OPENAI_API_KEY
      timeout: 60

  # Image generation (ComfyUI via tunnel)
  - model_name: dall-e-3
    litellm_params:
      model: comfyui/flux.1-dev
      api_base: http://localhost:8188
      timeout: 120

  - model_name: dall-e-2
    litellm_params:
      model: comfyui/sdxl-lightning
      api_base: http://localhost:8188
      timeout: 60

# Router configuration: load balancing plus per-model fallback chains
router_settings:
  routing_strategy: "simple-shuffle"  # Load balancing
  fallbacks:
    - gpt-4: ["gpt-4-fallback", "gpt-3.5-turbo"]
    - gpt-3.5-turbo: ["gpt-3.5-turbo-fallback"]
    - dall-e-3: ["dall-e-2"]

# Logging and caching
litellm_settings:
  set_verbose: true
  cache: true  # cache identical requests (keyed on model/messages/params)
```

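The alias table above can be mirrored in application code wherever logging or cost tracking needs to know which physical model served a logical name. A small sketch (the dict restates `config.yaml` by hand; it is not read from LiteLLM):

```python
# Logical (OpenAI-style) name -> self-hosted backend model, mirroring config.yaml
MODEL_ROUTES = {
    "gpt-4": "ollama/qwen3:32b",
    "gpt-4-turbo": "ollama/qwen3:30b-a3b",
    "gpt-3.5-turbo": "ollama/qwen3:14b",
    "gpt-3.5-turbo-fast": "ollama/qwen3:8b",
    "dall-e-3": "comfyui/flux.1-dev",
    "dall-e-2": "comfyui/sdxl-lightning",
}

def resolve_backend(public_model: str) -> str:
    """Map a logical model name to its self-hosted backend; raise for unknown names."""
    try:
        return MODEL_ROUTES[public_model]
    except KeyError:
        raise ValueError(f"no route configured for {public_model!r}") from None
```

Keeping this table in one place (or generating it from the YAML) avoids drift between what LiteLLM routes and what the backend records in its cost-tracking fields.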

**Step 3: Create Environment File**

File: `/opt/litellm/.env`

```bash
# OpenAI (for fallback)
OPENAI_API_KEY=sk-your-key-here

# Anthropic (optional fallback)
ANTHROPIC_API_KEY=sk-ant-your-key-here

# LiteLLM settings
LITELLM_LOG_LEVEL=INFO
LITELLM_CACHE=true

# Service settings
PORT=8000
HOST=127.0.0.1
```


**Step 4: Create LiteLLM Startup Script**

File: `/opt/litellm/start.sh`

```bash
#!/bin/bash
set -e

cd /opt/litellm

# Load and export environment variables
set -a
source .env
set +a

# Start the LiteLLM proxy server
litellm \
  --config config.yaml \
  --host 127.0.0.1 \
  --port 8000 \
  --num_workers 4
```

```bash
chmod +x /opt/litellm/start.sh
```


**Step 5: Create Systemd Service for LiteLLM**

File: `/etc/systemd/system/litellm.service`

```ini
[Unit]
Description=LiteLLM AI Proxy Gateway
After=network.target tunnel-vast-ai.service
Wants=tunnel-vast-ai.service

[Service]
Type=simple
User=root
WorkingDirectory=/opt/litellm
ExecStart=/opt/litellm/start.sh
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/local/python/bin"

[Install]
WantedBy=multi-user.target
```


**Step 6: Start LiteLLM Service**

```bash
systemctl daemon-reload
systemctl start litellm
systemctl enable litellm

# Verify the service
systemctl status litellm

# Check logs
journalctl -u litellm -f
```


**Step 7: Test LiteLLM API**

```bash
# Test text generation with a self-hosted model
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer test-key" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {"role": "user", "content": "Say hello in one sentence"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

# Should return a response generated by qwen3:32b

# Test fallback (stop the tunnel service first to exercise fallback logic)
# The request should fall back to OpenAI after the self-hosted timeout
```

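The curl response unpacks the same way in Python. A small helper whose shape follows the OpenAI-style response shown in section 3 (the function name is illustrative):

```python
def parse_completion(resp: dict) -> tuple:
    """Extract assistant text and total token usage from an OpenAI-style response."""
    content = resp["choices"][0]["message"]["content"]
    total_tokens = resp.get("usage", {}).get("total_tokens", 0)
    return content, total_tokens

example = {
    "choices": [{"index": 0,
                 "message": {"role": "assistant", "content": "Hello!"},
                 "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15},
}
print(parse_completion(example))  # ('Hello!', 15)
```

Because the self-hosted and fallback providers return the same shape through LiteLLM, one parser covers both paths.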

---

### Phase 3: IGNY8 Backend Integration (Days 3-4)

#### 4.5 Add Self-Hosted Provider to IGNY8

**Step 1: Update GlobalIntegrationSettings Model**

File: `backend/models/integration.py`

```python
from django.db import models

# Add to the IntegrationProvider enum
class IntegrationProvider(models.TextChoices):
    OPENAI = "openai", "OpenAI"
    ANTHROPIC = "anthropic", "Anthropic"
    RUNWARE = "runware", "Runware"
    BRIA = "bria", "Bria"
    SELF_HOSTED = "self_hosted_ai", "Self-Hosted AI (LiteLLM)"  # NEW

# Example settings structure
SELF_HOSTED_SETTINGS = {
    "provider": "self_hosted_ai",
    "name": "Self-Hosted AI (LiteLLM)",
    "base_url": "http://localhost:8000",
    "api_key": "not_required",
    "enabled": True,
    "priority": 10,  # Try first
    "models": {
        "text_generation": "gpt-4",               # Maps to qwen3:32b
        "text_generation_fast": "gpt-3.5-turbo",  # Maps to qwen3:14b
        "image_generation": "dall-e-3",           # Maps to flux.1-dev
        "image_generation_fast": "dall-e-2"       # Maps to sdxl-lightning
    },
    "timeout": 300,
    "fallback_to": "openai"
}
```


**Step 2: Add Self-Hosted Settings to Database**

File: `backend/management/commands/init_integrations.py`

```python
from backend.models.integration import GlobalIntegrationSettings, IntegrationProvider


def add_self_hosted_integration():
    """Initialize the self-hosted AI integration"""
    self_hosted_config = {
        "name": "Self-Hosted AI (LiteLLM)",
        "base_url": "http://localhost:8000",
        "api_key": "",  # Not required for the local proxy
        "enabled": True,
        "priority": 10,  # Higher priority = tried first
        "models": {
            "text_generation": "gpt-4",
            "text_generation_fast": "gpt-3.5-turbo",
            "image_generation": "dall-e-3",
            "image_generation_fast": "dall-e-2"
        },
        "timeout": 300,
        "max_retries": 2,
        "fallback_provider": IntegrationProvider.OPENAI
    }

    # provider is the lookup key, so it is not repeated in defaults
    integration, created = GlobalIntegrationSettings.objects.update_or_create(
        provider=IntegrationProvider.SELF_HOSTED,
        defaults=self_hosted_config
    )

    if created:
        print(f"✓ Created {IntegrationProvider.SELF_HOSTED} integration")
    else:
        print(f"✓ Updated {IntegrationProvider.SELF_HOSTED} integration")

# Call from the management command's handle()
```


**Step 3: Update AI Request Router**

File: `backend/services/ai_engine.py`

```python
import logging
from typing import Optional, List, Dict, Any

import requests

from backend.models.integration import GlobalIntegrationSettings, IntegrationProvider

logger = logging.getLogger(__name__)


class AIEngineRouter:
    """Routes AI requests to the appropriate provider with a fallback chain"""

    PROVIDER_PRIORITY = {
        IntegrationProvider.SELF_HOSTED: 10,  # Try first
        IntegrationProvider.OPENAI: 5,
        IntegrationProvider.ANTHROPIC: 4,
    }

    def __init__(self):
        self.providers = self._load_providers()

    def _load_providers(self) -> List[Dict[str, Any]]:
        """Load enabled providers from the database, highest priority first"""
        configs = GlobalIntegrationSettings.objects.filter(enabled=True).values()
        return sorted(
            configs,
            key=lambda c: self.PROVIDER_PRIORITY.get(c['provider'], 0),
            reverse=True
        )

    def generate_text(
        self,
        prompt: str,
        model: str = "gpt-4",
        max_tokens: int = 2000,
        temperature: float = 0.7,
        timeout: Optional[int] = None
    ) -> Dict[str, Any]:
        """Generate text using the first available provider, falling back on failure"""
        for provider_config in self.providers:
            try:
                result = self._call_provider(
                    provider_config,
                    "text",
                    prompt=prompt,
                    model=model,
                    max_tokens=max_tokens,
                    temperature=temperature,
                    timeout=timeout or provider_config.get('timeout', 300)
                )
                return {
                    "success": True,
                    "provider": provider_config['provider'],
                    "text": result['content'],
                    "usage": result.get('usage', {}),
                    "model": result.get('model', model)
                }
            except Exception as e:
                logger.warning(
                    f"Provider {provider_config['provider']} failed: {e}"
                )
                continue

        # All providers failed
        raise Exception("All AI providers exhausted. No response available.")

    def generate_image(
        self,
        prompt: str,
        model: str = "dall-e-3",
        size: str = "1024x1024",
        quality: str = "hd",
        timeout: Optional[int] = None
    ) -> Dict[str, Any]:
        """Generate an image using the first available provider, falling back on failure"""
        for provider_config in self.providers:
            try:
                result = self._call_provider(
                    provider_config,
                    "image",
                    prompt=prompt,
                    model=model,
                    size=size,
                    quality=quality,
                    timeout=timeout or provider_config.get('timeout', 120)
                )
                return {
                    "success": True,
                    "provider": provider_config['provider'],
                    "image_url": result['url'],
                    "revised_prompt": result.get('revised_prompt', prompt),
                    "model": result.get('model', model)
                }
            except Exception as e:
                logger.warning(
                    f"Provider {provider_config['provider']} failed: {e}"
                )
                continue

        # All providers failed
        raise Exception("All image generation providers exhausted.")

    def _call_provider(
        self,
        provider_config: Dict[str, Any],
        task_type: str,  # "text" or "image"
        **kwargs
    ) -> Dict[str, Any]:
        """Dispatch to the provider-specific implementation"""
        provider = provider_config['provider']

        if provider == IntegrationProvider.SELF_HOSTED:
            return self._call_litellm(provider_config, task_type, **kwargs)
        elif provider == IntegrationProvider.OPENAI:
            return self._call_openai(provider_config, task_type, **kwargs)
        elif provider == IntegrationProvider.ANTHROPIC:
            return self._call_anthropic(provider_config, task_type, **kwargs)
        else:
            raise ValueError(f"Unknown provider: {provider}")

    def _call_litellm(
        self,
        provider_config: Dict[str, Any],
        task_type: str,
        **kwargs
    ) -> Dict[str, Any]:
        """Call the LiteLLM proxy on localhost"""
        base_url = provider_config['base_url']
        timeout = kwargs.pop('timeout', 300)

        if task_type == "text":
            # Chat completion endpoint
            endpoint = f"{base_url}/v1/chat/completions"
            payload = {
                "model": kwargs.get('model', 'gpt-4'),
                "messages": [
                    {"role": "user", "content": kwargs['prompt']}
                ],
                "temperature": kwargs.get('temperature', 0.7),
                "max_tokens": kwargs.get('max_tokens', 2000)
            }
        elif task_type == "image":
            # Image generation endpoint
            endpoint = f"{base_url}/v1/images/generations"
            payload = {
                "model": kwargs.get('model', 'dall-e-3'),
                "prompt": kwargs['prompt'],
                "size": kwargs.get('size', '1024x1024'),
                "n": 1,
                "quality": kwargs.get('quality', 'hd')
            }
        else:
            raise ValueError(f"Unknown task type: {task_type}")

        try:
            response = requests.post(
                endpoint,
                json=payload,
                timeout=timeout,
                headers={"Authorization": "Bearer test"}  # proxy runs without real auth
            )
            response.raise_for_status()
            data = response.json()

            if task_type == "text":
                return {
                    "content": data['choices'][0]['message']['content'],
                    "usage": data.get('usage', {}),
                    "model": data.get('model', kwargs.get('model'))
                }
            else:  # image
                return {
                    "url": data['data'][0]['url'],
                    "revised_prompt": data['data'][0].get('revised_prompt'),
                    "model": kwargs.get('model')
                }

        except requests.exceptions.Timeout:
            logger.error(f"LiteLLM timeout after {timeout}s")
            raise
        except requests.exceptions.ConnectionError:
            logger.error("Cannot connect to LiteLLM proxy - tunnel may be down")
            raise
        except Exception as e:
            logger.error(f"LiteLLM request failed: {e}")
            raise

    def _call_openai(self, provider_config, task_type, **kwargs):
        """Existing OpenAI implementation"""
        # Use existing OpenAI integration code
        pass

    def _call_anthropic(self, provider_config, task_type, **kwargs):
        """Existing Anthropic implementation"""
        # Use existing Anthropic integration code
        pass


# Initialize global instance
ai_router = AIEngineRouter()
```

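The fallback chain in `generate_text`/`generate_image` reduces to "return the result of the first provider that succeeds". Isolated from Django for testing, the pattern looks like this (all names are illustrative, not part of the codebase):

```python
def first_success(providers):
    """Try (name, zero-arg callable) pairs in priority order.

    Returns (name, result) from the first callable that does not raise;
    raises RuntimeError if every provider fails.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call()
        except Exception as exc:
            errors.append(f"{name}: {exc}")  # record and fall through to the next
    raise RuntimeError("All providers exhausted: " + "; ".join(errors))

def broken():
    raise ConnectionError("tunnel down")

# Self-hosted fails, so the chain continues to the external provider
name, result = first_success([
    ("self_hosted_ai", broken),
    ("openai", lambda: "generated text"),
])
print(name, result)  # openai generated text
```

Keeping the loop this dumb is deliberate: health state, priorities, and model aliasing all live in configuration, so the control flow itself never needs to change.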

**Step 4: Update Content Generation Celery Tasks**

File: `backend/tasks/content_generation.py`

```python
import logging

from celery import shared_task

from backend.models.content import Article, Product
from backend.services.ai_engine import ai_router

logger = logging.getLogger(__name__)


@shared_task
def generate_article_content(user_id: int, article_id: int):
    """Generate article content via the AI router (tries self-hosted first)"""
    try:
        # Get the article from the database
        article = Article.objects.get(id=article_id, user_id=user_id)

        # Generate content
        result = ai_router.generate_text(
            prompt=f"Write a detailed article about: {article.topic}",
            model="gpt-4",
            max_tokens=3000,
            temperature=0.7
        )

        # Save the result
        article.content = result['text']
        article.ai_provider = result['provider']
        article.save()

        logger.info(
            f"Generated article {article_id} using {result['provider']}"
        )

        return {
            "success": True,
            "article_id": article_id,
            "provider": result['provider']
        }

    except Exception as e:
        logger.error(f"Article generation failed: {e}")
        raise


@shared_task
def generate_product_images(user_id: int, product_id: int):
    """Generate product images via the AI router"""
    try:
        product = Product.objects.get(id=product_id, user_id=user_id)

        # Self-hosted is tried first (cheaper); external providers are automatic fallbacks
        result = ai_router.generate_image(
            prompt=f"Professional product photo of: {product.description}",
            model="dall-e-3",
            size="1024x1024",
            quality="hd"
        )

        product.image_url = result['image_url']
        product.ai_provider = result['provider']
        product.save()

        logger.info(f"Generated image for product {product_id} using {result['provider']}")

        return {
            "success": True,
            "product_id": product_id,
            "provider": result['provider'],
            "image_url": result['image_url']
        }

    except Exception as e:
        logger.error(f"Image generation failed: {e}")
        raise
```


**Step 5: Add AI Provider Tracking**

File: `backend/models/content.py`

```python
from django.db import models

from backend.models.integration import IntegrationProvider


class Article(models.Model):
    # ... existing fields ...

    # Track which AI provider generated the content
    ai_provider = models.CharField(
        max_length=50,
        choices=IntegrationProvider.choices,
        default=IntegrationProvider.OPENAI,
        help_text="Which AI provider generated this content"
    )
    ai_cost = models.DecimalField(
        max_digits=10,
        decimal_places=6,
        default=0,
        help_text="Cost to generate via AI provider"
    )
    ai_generation_time = models.DurationField(
        null=True,
        blank=True,
        help_text="Time taken to generate content"
    )


class Product(models.Model):
    # ... existing fields ...

    ai_provider = models.CharField(
        max_length=50,
        choices=IntegrationProvider.choices,
        default=IntegrationProvider.OPENAI,
        help_text="Which AI provider generated the image"
    )
    ai_image_cost = models.DecimalField(
        max_digits=10,
        decimal_places=6,
        default=0,
        help_text="Cost to generate image"
    )
```


---

### Phase 4: Monitoring & Fallback (Days 4-5)

#### 4.6 Health Check & Failover System

**Step 1: Create Health Check Service**

File: `backend/services/ai_health_check.py`

```python
import requests
import logging
from typing import Dict, Any
from datetime import datetime

logger = logging.getLogger(__name__)


class AIHealthMonitor:
    """Monitor health of self-hosted AI infrastructure"""

    OLLAMA_ENDPOINT = "http://localhost:11434/api/tags"
    COMFYUI_ENDPOINT = "http://localhost:8188/system_stats"
    LITELLM_ENDPOINT = "http://localhost:8000/health"

    HEALTH_CHECK_INTERVAL = 60  # seconds
    FAILURE_THRESHOLD = 3  # Mark unhealthy after 3 consecutive failures

    def __init__(self):
        self.last_check = None
        self.endpoints = {
            'ollama': self.OLLAMA_ENDPOINT,
            'comfyui': self.COMFYUI_ENDPOINT,
            'litellm': self.LITELLM_ENDPOINT,
        }
        self.failure_count = {name: 0 for name in self.endpoints}
        self.is_healthy = {name: True for name in self.endpoints}

    def check_all(self) -> Dict[str, Any]:
        """Run all health checks and log any status transitions"""
        results = {
            'timestamp': datetime.now().isoformat(),
            'overall_healthy': True,
            'services': {}
        }

        for name, endpoint in self.endpoints.items():
            healthy = self._check_service(name, endpoint)
            results['services'][name] = {
                'healthy': healthy,
                'endpoint': endpoint
            }
            if not healthy:
                results['overall_healthy'] = False

            # Log only when a service changes state
            if self.is_healthy[name] != healthy:
                level = logging.INFO if healthy else logging.WARNING
                logger.log(level, f"{name} service {'recovered' if healthy else 'down'}")

            self.is_healthy[name] = healthy

        self.last_check = results
        return results

    def _check_service(self, name: str, endpoint: str) -> bool:
        """Probe one endpoint; tolerate up to FAILURE_THRESHOLD consecutive failures"""
        try:
            response = requests.get(endpoint, timeout=5)
            if response.status_code == 200:
                self.failure_count[name] = 0
                return True
        except Exception as e:
            logger.debug(f"{name} health check failed: {str(e)}")

        self.failure_count[name] += 1
        return self.failure_count[name] < self.FAILURE_THRESHOLD

    def is_self_hosted_available(self) -> bool:
        """Check if self-hosted AI is fully available"""
        return all(self.is_healthy.values())


# Create global instance
health_monitor = AIHealthMonitor()
```

**Step 2: Create Health Check Celery Task**

File: `backend/tasks/health_checks.py`

```python
from celery import shared_task
from backend.services.ai_health_check import health_monitor
from backend.models.monitoring import ServiceHealthLog
import logging

logger = logging.getLogger(__name__)


@shared_task
def check_ai_health():
    """Run AI infrastructure health checks every minute"""
    results = health_monitor.check_all()

    # Log to database
    ServiceHealthLog.objects.create(
        service='self_hosted_ai',
        is_healthy=results['overall_healthy'],
        details=results
    )

    # Alert if services are down
    if not results['overall_healthy']:
        down_services = [
            service for service, status in results['services'].items()
            if not status['healthy']
        ]
        logger.error(
            f"AI services down: {', '.join(down_services)}. "
            f"Falling back to external APIs."
        )

    return results


# Add to CELERY_BEAT_SCHEDULE in Django settings
CELERY_BEAT_SCHEDULE = {
    'check-ai-health': {
        'task': 'backend.tasks.health_checks.check_ai_health',
        'schedule': 60.0,  # Every 60 seconds
    },
}
```

**Step 3: Create Monitoring Model**

File: `backend/models/monitoring.py`

```python
from django.conf import settings
from django.db import models


class ServiceHealthLog(models.Model):
    """Log of service health checks"""

    SERVICE_CHOICES = [
        ('self_hosted_ai', 'Self-Hosted AI'),
        ('tunnel', 'SSH Tunnel'),
        ('litellm', 'LiteLLM Proxy'),
    ]

    service = models.CharField(max_length=50, choices=SERVICE_CHOICES)
    is_healthy = models.BooleanField()
    details = models.JSONField(default=dict)
    checked_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        ordering = ['-checked_at']
        indexes = [
            models.Index(fields=['-checked_at']),
            models.Index(fields=['service', '-checked_at']),
        ]

    def __str__(self):
        status = "✓ Healthy" if self.is_healthy else "✗ Down"
        return f"{self.service} {status} @ {self.checked_at}"


class AIUsageLog(models.Model):
    """Track AI provider usage and costs"""

    PROVIDER_CHOICES = [
        ('self_hosted_ai', 'Self-Hosted AI'),
        ('openai', 'OpenAI'),
        ('anthropic', 'Anthropic'),
    ]

    TASK_TYPE_CHOICES = [
        ('text_generation', 'Text Generation'),
        ('image_generation', 'Image Generation'),
        ('keyword_research', 'Keyword Research'),
    ]

    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    provider = models.CharField(max_length=50, choices=PROVIDER_CHOICES)
    task_type = models.CharField(max_length=50, choices=TASK_TYPE_CHOICES)
    model_used = models.CharField(max_length=100)

    input_tokens = models.IntegerField(default=0)
    output_tokens = models.IntegerField(default=0)

    cost = models.DecimalField(max_digits=10, decimal_places=6, default=0)
    duration_ms = models.IntegerField()  # Milliseconds

    success = models.BooleanField(default=True)
    error_message = models.TextField(blank=True)

    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        ordering = ['-created_at']
        indexes = [
            models.Index(fields=['user', '-created_at']),
            models.Index(fields=['provider', '-created_at']),
        ]

    def __str__(self):
        return f"{self.provider} - {self.task_type} - ${self.cost:.4f}"
```
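
These log rows can feed a simple uptime figure on the dashboard. A pure-Python sketch of the aggregation (the production version would query `ServiceHealthLog` via the ORM; `uptime_pct` is an illustrative helper, not existing code):

```python
def uptime_pct(checks: list) -> float:
    """Percentage of health checks that passed (list of booleans, newest last)."""
    if not checks:
        return 0.0
    return 100.0 * sum(1 for ok in checks if ok) / len(checks)


# e.g. the last ten checks, with a single failure
print(uptime_pct([True] * 9 + [False]))  # 90.0
```

With one check per minute, a 30-day window holds ~43,200 rows per service, so the real query should aggregate in the database rather than in Python.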

---

### Phase 5: Cost Tracking & Optimization (Days 5-6)

#### 4.7 Cost Calculation & Dashboard

**Step 1: Create Cost Calculator**

File: `backend/services/cost_calculator.py`

```python
from decimal import Decimal
from typing import Dict, Any


class AICostCalculator:
    """Calculate AI generation costs by provider"""

    # Self-hosted cost (Vast.ai GPU rental amortized):
    # $200/month ÷ 30 days ÷ 24 hours ≈ $0.28 per wall-clock hour.
    # Concurrent requests share the GPU, so the effective cost per
    # request-hour is estimated at $0.20 (conservative).
    SELF_HOSTED_COST_PER_HOUR = Decimal('0.20')

    # OpenAI pricing, per token (verify against the current price list)
    OPENAI_PRICING = {
        'gpt-4': {
            'input': Decimal('0.00003'),   # $30 per 1M tokens
            'output': Decimal('0.00006'),  # $60 per 1M tokens
        },
        'gpt-3.5-turbo': {
            'input': Decimal('0.0000005'),   # $0.50 per 1M tokens
            'output': Decimal('0.0000015'),  # $1.50 per 1M tokens
        },
        'dall-e-3': Decimal('0.04'),  # per image
    }

    # Anthropic pricing, per token
    ANTHROPIC_PRICING = {
        'claude-3-opus': {
            'input': Decimal('0.000015'),
            'output': Decimal('0.000075'),
        },
        'claude-3-sonnet': {
            'input': Decimal('0.000003'),
            'output': Decimal('0.000015'),
        },
    }

    @classmethod
    def calculate_text_generation_cost(
        cls,
        provider: str,
        model: str,
        input_tokens: int,
        output_tokens: int,
        duration_ms: int = 0
    ) -> Decimal:
        """Calculate cost for text generation"""

        if provider == 'self_hosted_ai':
            # Cost based on compute time (rough estimate)
            duration_hours = Decimal(duration_ms) / Decimal(3_600_000)
            return cls.SELF_HOSTED_COST_PER_HOUR * duration_hours

        if provider == 'openai':
            pricing = cls.OPENAI_PRICING.get(model, {})
        elif provider == 'anthropic':
            pricing = cls.ANTHROPIC_PRICING.get(model, {})
        else:
            return Decimal(0)

        input_cost = Decimal(input_tokens) * pricing.get('input', Decimal(0))
        output_cost = Decimal(output_tokens) * pricing.get('output', Decimal(0))
        return input_cost + output_cost

    @classmethod
    def calculate_image_generation_cost(
        cls,
        provider: str,
        model: str,
        duration_ms: int = 0
    ) -> Decimal:
        """Calculate cost for image generation"""

        if provider == 'self_hosted_ai':
            # Cost based on compute time
            duration_hours = Decimal(duration_ms) / Decimal(3_600_000)
            return cls.SELF_HOSTED_COST_PER_HOUR * duration_hours

        if provider == 'openai' and 'dall-e' in model:
            return cls.OPENAI_PRICING.get('dall-e-3', Decimal('0.04'))

        return Decimal(0)

    @classmethod
    def monthly_cost_analysis(cls) -> Dict[str, Any]:
        """Analyze the last 30 days of usage and the savings vs. OpenAI"""

        from backend.models.monitoring import AIUsageLog
        from django.utils import timezone
        from datetime import timedelta

        thirty_days_ago = timezone.now() - timedelta(days=30)
        usage_logs = AIUsageLog.objects.filter(created_at__gte=thirty_days_ago)

        cost_by_provider = {}
        total_cost = Decimal(0)

        for log in usage_logs:
            entry = cost_by_provider.setdefault(log.provider, {
                'count': 0,
                'total_cost': Decimal(0),
            })
            entry['count'] += 1
            entry['total_cost'] += log.cost
            total_cost += log.cost

        # What OpenAI would have charged for the self-hosted workload
        openai_equivalent_cost = Decimal(0)
        for log in usage_logs.filter(provider='self_hosted_ai'):
            if log.task_type == 'text_generation':
                openai_equivalent_cost += cls.calculate_text_generation_cost(
                    'openai', 'gpt-4', log.input_tokens, log.output_tokens
                )
            else:
                openai_equivalent_cost += cls.calculate_image_generation_cost(
                    'openai', 'dall-e-3'
                )

        self_hosted_cost = cost_by_provider.get(
            'self_hosted_ai', {}
        ).get('total_cost', Decimal(0))

        return {
            'cost_by_provider': cost_by_provider,
            'total_cost': total_cost,
            'savings_vs_openai': openai_equivalent_cost - self_hosted_cost,
            'roi_vs_gpu_cost': openai_equivalent_cost - Decimal(200),  # $200 = 1 month GPU
        }
```
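
As a sanity check on the per-token arithmetic, the sketch below mirrors the calculator's constants for a typical 1,500-word article; the token counts and the 90-second GPU time are illustrative assumptions, not measurements:

```python
from decimal import Decimal

# Mirrors AICostCalculator's constants
GPT4_INPUT = Decimal('0.00003')    # $30 per 1M input tokens
GPT4_OUTPUT = Decimal('0.00006')   # $60 per 1M output tokens
SELF_HOSTED_PER_HOUR = Decimal('0.20')


def gpt4_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Per-request GPT-4 cost from the pricing table."""
    return Decimal(input_tokens) * GPT4_INPUT + Decimal(output_tokens) * GPT4_OUTPUT


def self_hosted_cost(duration_ms: int) -> Decimal:
    """Amortized GPU cost for a request of the given duration."""
    return SELF_HOSTED_PER_HOUR * Decimal(duration_ms) / Decimal(3_600_000)


# ~800 prompt tokens in, ~2,000 tokens out; 90 s of GPU time
print(gpt4_cost(800, 2_000))     # 0.14400 -> $0.144 per article on GPT-4
print(self_hosted_cost(90_000))  # 0.005   -> ~$0.005 per article self-hosted
```

At these assumptions a self-hosted article costs roughly 30x less than GPT-4, which is where the savings projections later in this document come from.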

---

## 5. Acceptance Criteria

### Infrastructure Ready
- [ ] Vast.ai GPU instance rented and running (2x RTX 3090 or better)
- [ ] SSH access confirmed from IGNY8 VPS
- [ ] Ollama container running with all Qwen3 models downloaded
- [ ] ComfyUI container running with FLUX.1 and Stable Diffusion 3.5 models
- [ ] Models tested via direct API calls (curl tests all pass)

### Network Tunnel Operational
- [ ] autossh service running on IGNY8 VPS
- [ ] SSH tunnel persists through network interruptions
- [ ] Ports 11434, 11435, 8188 accessible on localhost from VPS
- [ ] Tunnel auto-reconnects within 60 seconds of a disconnect
- [ ] Systemd service enabled on boot

### LiteLLM Proxy Functional
- [ ] LiteLLM service running on VPS port 8000
- [ ] OpenAI-compatible API endpoints working
- [ ] Text generation requests route to Ollama
- [ ] Image generation requests route to ComfyUI
- [ ] Fallback to OpenAI works when self-hosted is unavailable
- [ ] Config includes all model variants
- [ ] Timeout values appropriate for each model

### IGNY8 Backend Integration Complete
- [ ] Self-hosted provider added to GlobalIntegrationSettings
- [ ] AIEngineRouter tries self-hosted before external APIs
- [ ] Celery tasks log which provider was used
- [ ] Content includes ai_provider tracking field
- [ ] Fallback chain works (self-hosted → OpenAI → Anthropic)
- [ ] Unit tests pass for all provider calls

### Health Check System Operational
- [ ] Health check task runs every 60 seconds
- [ ] ServiceHealthLog table populated
- [ ] Alerts generated when services are down
- [ ] System continues working with degraded services
- [ ] Dashboard shows service status

### Cost Tracking Implemented
- [ ] AIUsageLog records all AI requests
- [ ] Cost calculation accurate per provider
- [ ] Monthly cost analysis working
- [ ] Cost comparison shows self-hosted savings
- [ ] Dashboard displays cost breakdown

### Documentation & Runbooks
- [ ] This build document complete and accurate
- [ ] Troubleshooting guide for common issues
- [ ] Runbook for GPU rental renewal
- [ ] Cost monitoring dashboard updated
- [ ] Team trained on fallback procedures

---

## 6. Claude Code Instructions

### Prerequisites
```bash
# Ensure VPS provisioned (see 00B)
# Have Vast.ai account created
# Have IGNY8 codebase cloned locally
```

### Build Execution

**Step 1: GPU Infrastructure (Operator)**
```bash
# Manual: Set up Vast.ai account, rent GPU, note IP
# This requires manual interaction with the Vast.ai dashboard
# Once the IP is obtained, proceed to Step 2
```

**Step 2: Vast.ai Setup (Automated)**
```bash
# Run on Vast.ai GPU server
VAST_AI_IP="<your-gpu-ip>"

ssh -i ~/.ssh/vast_key root@$VAST_AI_IP << 'EOF'

# Update system
apt update && apt upgrade -y

# Install Docker
curl -sSfL https://get.docker.com | sh
systemctl enable docker && systemctl start docker

# Create storage directories
mkdir -p /mnt/{models,ollama-cache,comfyui-models,comfyui-output}
chmod 777 /mnt/*

# Create docker network
docker network create ai-network

# Deploy Ollama
docker run -d \
  --name ollama \
  --network ai-network \
  --gpus all \
  -e OLLAMA_MODELS=/mnt/ollama-cache \
  -v /mnt/ollama-cache:/root/.ollama \
  -p 0.0.0.0:11434:11434 \
  ollama/ollama:latest

sleep 30

# Pull models (takes 1-2 hours)
docker exec ollama ollama pull qwen3:32b
docker exec ollama ollama pull qwen3:30b-a3b
docker exec ollama ollama pull qwen3:14b
docker exec ollama ollama pull qwen3:8b

# Deploy ComfyUI (substitute your own ComfyUI image; there is no official one)
docker run -d \
  --name comfyui \
  --network ai-network \
  --gpus all \
  -v /mnt/comfyui-models:/ComfyUI/models \
  -v /mnt/comfyui-output:/ComfyUI/output \
  -p 0.0.0.0:8188:8188 \
  comfyui-docker:latest

# Download image models. Both repos are gated on Hugging Face, so an
# HF access token is required; verify exact filenames on the model pages.
mkdir -p /mnt/comfyui-models/checkpoints
cd /mnt/comfyui-models/checkpoints
wget https://huggingface.co/black-forest-labs/FLUX.1-dev/resolve/main/flux1-dev.safetensors -O flux1-dev.safetensors
wget https://huggingface.co/stabilityai/stable-diffusion-3.5-large/resolve/main/sd3.5_large.safetensors -O sd3.5-large.safetensors

echo "✓ Vast.ai setup complete"
EOF
```

**Step 3: VPS Tunnel Setup (Automated)**
```bash
# Run on IGNY8 VPS
VAST_AI_IP="<your-gpu-ip>"

# Install autossh
apt install autossh -y

# Create tunnel user
useradd -m -s /bin/bash tunnel-user
mkdir -p /home/tunnel-user/.ssh

# Copy SSH key (paste private key content)
cat > /home/tunnel-user/.ssh/vast_ai << 'KEY'
-----BEGIN RSA PRIVATE KEY-----
<paste-private-key-here>
-----END RSA PRIVATE KEY-----
KEY

chmod 600 /home/tunnel-user/.ssh/vast_ai
chown -R tunnel-user:tunnel-user /home/tunnel-user/.ssh

# Create systemd service
# ExitOnForwardFailure=yes makes autossh exit (and systemd restart it)
# when a port forward dies, instead of leaving a dead tunnel running.
cat > /etc/systemd/system/tunnel-vast-ai.service << 'SERVICE'
[Unit]
Description=SSH Tunnel to Vast.ai GPU Server
After=network.target
Wants=network-online.target

[Service]
Type=simple
User=tunnel-user
ExecStart=/usr/bin/autossh \
  -M 20000 \
  -N \
  -o "ServerAliveInterval=30" \
  -o "ServerAliveCountMax=3" \
  -o "ExitOnForwardFailure=yes" \
  -o "StrictHostKeyChecking=accept-new" \
  -i /home/tunnel-user/.ssh/vast_ai \
  -L 11434:localhost:11434 \
  -L 11435:localhost:11435 \
  -L 8188:localhost:8188 \
  root@VAST_AI_IP

Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
SERVICE

# Update IP in service file
sed -i "s/VAST_AI_IP/$VAST_AI_IP/g" /etc/systemd/system/tunnel-vast-ai.service

# Start tunnel
systemctl daemon-reload
systemctl start tunnel-vast-ai
systemctl enable tunnel-vast-ai

# Wait and verify
sleep 5
netstat -tlnp | grep -E '(11434|8188)'

echo "✓ SSH tunnel operational"
```
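
Besides netstat, the forwarded ports can be confirmed from code, which is also essentially what the health monitor relies on. A small stdlib sketch (`port_open` is an illustrative helper; the port numbers are the tunnel forwards above):

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # The three tunnel-forwarded ports on the VPS
    for port in (11434, 11435, 8188):
        print(port, "open" if port_open("localhost", port) else "closed")
```

A closed port fails fast (connection refused), so this check is cheap enough to run from a cron job or a pre-flight step in deploy scripts.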

**Step 4: LiteLLM Installation (Automated)**
```bash
# Run on IGNY8 VPS

# Install LiteLLM with the proxy extra
pip install 'litellm[proxy]' python-dotenv requests

# Create directories
mkdir -p /opt/litellm

# Create config file
cat > /opt/litellm/config.yaml << 'CONFIG'
model_list:
  - model_name: gpt-4
    litellm_params:
      model: ollama/qwen3:32b
      api_base: http://localhost:11434
      timeout: 300
      max_tokens: 8000

  - model_name: gpt-3.5-turbo
    litellm_params:
      model: ollama/qwen3:8b
      api_base: http://localhost:11434
      timeout: 120
      max_tokens: 2048

  # ComfyUI is not a built-in LiteLLM provider; this route assumes an
  # OpenAI-compatible shim (or custom handler) in front of ComfyUI on 8188
  - model_name: dall-e-3
    litellm_params:
      model: comfyui/flux.1-dev
      api_base: http://localhost:8188
      timeout: 120

litellm_settings:
  verbose: true
  log_level: INFO
  cache_responses: true
CONFIG

# Create .env file
cat > /opt/litellm/.env << 'ENV'
OPENAI_API_KEY=your-openai-key
PORT=8000
HOST=127.0.0.1
ENV

# Create start script (set -a exports the .env vars to the proxy process)
cat > /opt/litellm/start.sh << 'SCRIPT'
#!/bin/bash
cd /opt/litellm
set -a; source .env; set +a
litellm --config config.yaml --host 127.0.0.1 --port 8000 --num_workers 4
SCRIPT

chmod +x /opt/litellm/start.sh

# Create systemd service
cat > /etc/systemd/system/litellm.service << 'SERVICE'
[Unit]
Description=LiteLLM AI Proxy Gateway
After=network.target tunnel-vast-ai.service
Wants=tunnel-vast-ai.service

[Service]
Type=simple
User=root
WorkingDirectory=/opt/litellm
ExecStart=/opt/litellm/start.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
SERVICE

# Start LiteLLM
systemctl daemon-reload
systemctl start litellm
systemctl enable litellm

# Verify
sleep 5
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer test" \
  -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}'

echo "✓ LiteLLM operational"
```
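
Because the proxy is OpenAI-compatible, the backend only needs to point standard chat-completions calls at `localhost:8000`. A hedged stdlib sketch, where the `build_payload`/`generate_text` names and their split are illustrative, not existing code:

```python
import json
import urllib.request


def build_payload(prompt: str, model: str = "gpt-4", max_tokens: int = 2000) -> dict:
    """OpenAI-style chat payload; 'gpt-4' is an alias LiteLLM routes to ollama/qwen3:32b."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def generate_text(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """Send a chat-completions request through the LiteLLM proxy."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The same function works unchanged whether the request lands on Qwen3 over the tunnel or on the OpenAI fallback, which is the point of keeping the model aliases OpenAI-shaped.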

**Step 5: IGNY8 Backend Integration (Developer)**
```bash
# In IGNY8 codebase

# 1. Add the self-hosted provider to IntegrationProvider (backend/models/integration.py)
# 2. Update the management command to initialize self-hosted settings
# 3. Implement AIEngineRouter with fallback logic
# 4. Update Celery tasks to use the router
# 5. Add database fields for provider tracking
# 6. Run migrations
# 7. Create health check monitoring

python manage.py makemigrations
python manage.py migrate

# Initialize self-hosted integration
python manage.py init_integrations
```

**Step 6: Verification (Automated)**
```bash
# Test full chain
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Write a 100-word article about clouds"}],
    "max_tokens": 200
  }'

# Expected response: article text from the Qwen3 32B model

# Test fallback by stopping the tunnel
systemctl stop tunnel-vast-ai
# Wait 10 seconds, then retry the request - it should now use OpenAI instead
```

---

## Timeline & Resource Allocation

| Phase | Days | Task | Owner | Status |
|-------|------|------|-------|--------|
| 1.1 | 1 | Vast.ai account & GPU rental | Operator | Ready |
| 1.2 | 1 | Docker & Ollama setup | DevOps | Ready |
| 1.3 | 1 | Model pulling & ComfyUI | DevOps | Ready |
| 2.1 | 0.5 | VPS tunnel infrastructure | DevOps | Ready |
| 2.2 | 0.5 | autossh systemd service | DevOps | Ready |
| 2.3 | 1 | LiteLLM installation & config | DevOps | Ready |
| 3.1 | 1 | Backend integration scaffolding | Developer | Ready |
| 3.2 | 1 | AI router & fallback logic | Developer | Ready |
| 3.3 | 1 | Celery task updates | Developer | Ready |
| 4.1 | 1 | Health check system | DevOps | Ready |
| 5.1 | 1 | Cost tracking & dashboard | Developer | Ready |
| **Total** | **~7** | 10 task-days across parallel DevOps/Developer tracks | | |

---

## Cost Analysis

### Monthly GPU Rental
- **Vast.ai 2x RTX 3090:** $180-220/month (auto-bid recommended)
- **Fixed cost:** $200/month (conservative)

### Monthly API Costs (Current)
Estimated current external API costs (before optimization):
- **OpenAI (GPT-4/3.5):** $800-1,200/month
- **Anthropic (Claude):** $200-400/month
- **Image generation (Runware/Bria):** $300-500/month
- **Total:** $1,300-2,100/month

### Monthly API Costs (After)
With self-hosted carrying most of the load and external APIs as fallback:
- **Self-hosted cost:** $200/month (amortized GPU)
- **External APIs (fallback only):** $200-300/month
- **Total:** $400-500/month

### Savings & ROI
- **Monthly savings:** $800-1,700
- **Break-even:** ~4-8 days of savings cover one month's GPU rental
- **Annual savings:** $9,600-20,400

### Cost Per Subscriber
- **Before:** $26-42 in AI costs per subscriber/month (on the $49/month tier)
- **After:** $8-10/subscriber/month
- **Improvement:** 65-76% cost reduction

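The break-even figure is simple arithmetic over the ranges above; a minimal sketch, using this section's $200/month GPU estimate and savings range as inputs:

```python
GPU_MONTHLY = 200  # $/month, fixed GPU rental (estimate from this section)


def breakeven_days(monthly_savings: float) -> float:
    """Days of API savings needed to cover one month of GPU rental."""
    return GPU_MONTHLY / (monthly_savings / 30)


print(round(breakeven_days(800), 1))    # 7.5  (low-savings case)
print(round(breakeven_days(1_700), 1))  # 3.5  (high-savings case)
```

In other words, even at the pessimistic end of the savings range the GPU pays for itself in about a week.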
---

## Troubleshooting Guide

### SSH Tunnel Not Connecting
```bash
# Check service status
systemctl status tunnel-vast-ai

# View detailed logs
journalctl -u tunnel-vast-ai -n 100 -f

# Test SSH manually
ssh -v -i /home/tunnel-user/.ssh/vast_ai root@<vast_ai_ip>

# Ensure the Vast.ai machine is still running and has bandwidth
```

### Ollama Not Responding
```bash
# Check container
docker ps | grep ollama

# View logs
docker logs -f ollama

# Test directly
docker exec ollama curl http://localhost:11434/api/tags

# Restart if needed
docker restart ollama
```

### ComfyUI Port Not Accessible
```bash
# Check container
docker ps | grep comfyui

# Test through tunnel
curl http://localhost:8188/system_stats

# Restart if needed
docker restart comfyui
```

### LiteLLM Timeouts
```bash
# Check LiteLLM logs
journalctl -u litellm -n 100

# Increase the timeout in config.yaml if requests are slow but succeeding

# Test a simple request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 10}'
```

### Fallback to External APIs Not Working
```bash
# Verify the OpenAI API key in /opt/litellm/.env
# Test the fallback path by disabling the tunnel
systemctl stop tunnel-vast-ai
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-3.5-turbo-fallback", "messages": [{"role": "user", "content": "Hi"}]}'
```

---

## Cross-References

**Dependency:** [00B VPS Provisioning & Infrastructure](./00B-vps-provisioning.md)
**Related:** [00A Project Planning](./00A-project-planning.md)
**Related:** [00C Database & Schema](./00C-database-schema.md)
**Related:** [00D Authentication & Security](./00D-auth-security.md)

---

## Document Version

| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2026-03-23 | Initial comprehensive build document |

---

**Status:** Ready for implementation
**Last Updated:** 2026-03-23
**Next Step:** Execute Phase 1 GPU infrastructure setup after 00B completion