# IGNY8 Phase 0: Self-Hosted AI Infrastructure (00F) **Status:** Ready for Implementation **Version:** 1.1 **Priority:** High (cost savings critical for unit economics) **Duration:** 5-7 days **Dependencies:** 00B (VPS provisioning) must be complete first **Source of Truth:** Codebase at `/data/app/igny8/` **Cost:** ~$200/month GPU rental + $0 software (open source) --- ## 1. Current State ### Existing AI Integration - **External providers (verified from `IntegrationProvider` model):** OpenAI (GPT-4, GPT-3.5), Anthropic (Claude), Runware (image gen) - **Storage:** API keys stored in `IntegrationProvider` model (table: `igny8_integration_providers`) with per-account overrides in `IntegrationSettings` (table: `igny8_integration_settings`). Global defaults in `GlobalIntegrationSettings`. - **Provider types in codebase:** `ai`, `payment`, `email`, `storage` (from `PROVIDER_TYPE_CHOICES`) - **Existing provider_ids:** `openai`, `runware`, `stripe`, `paypal`, `resend` - **Architecture:** Multi-provider AI engine with model selection capability - **Current AI functions:** `auto_cluster`, `generate_ideas`, `generate_content`, `generate_images`, `generate_image_prompts`, `optimize_content`, `generate_site_structure` - **Async handling:** Celery workers process long-running AI tasks - **Cost impact:** External APIs constitute 15-30% of monthly operational costs ### Problem - External API costs scale linearly with subscriber growth - No cost leverage at scale (pay-as-you-go model) - API rate limits require careful orchestration - Privacy concerns with offloading content generation --- ## 2. 
What to Build ### Infrastructure Stack ``` ┌─────────────────────────────────────────────────────────────┐ │ IGNY8 Backend (on VPS) │ │ - Send requests to LiteLLM proxy (local localhost:8000) │ │ - Fallback to OpenAI/Anthropic if self-hosted unavailable │ └──────────────┬──────────────────────────────────────────────┘ │ ▼ ┌──────────────────────────────────────────────────────────────┐ │ LiteLLM Proxy (on VPS, port 8000) │ │ - OpenAI-compatible API gateway │ │ - Routes requests to local Ollama and ComfyUI (via tunnel) │ │ - Load balancing & model selection │ │ - Fallback configuration for external APIs │ └──────────────┬───────────────────────────────────────────────┘ │ ┌────────┴──────────────────┐ │ │ ▼ ▼ ┌──────────────────┐ ┌──────────────────────┐ │ SSH Tunnel │ │ ComfyUI Tunnel │ │ (autossh) │ │ (autossh) │ │ Port 11434-11435 │ │ Port 8188 │ │ │ │ (image generation) │ └────────┬─────────┘ └──────────┬───────────┘ │ │ ▼ ▼ ┌────────────────────────────────────────────────────────┐ │ Vast.ai GPU Server (2x RTX 3090, 48GB VRAM) │ │ ┌──────────────────────────────────────────────────┐ │ │ │ Ollama Container │ │ │ │ - Qwen3-32B (reasoning) │ │ │ │ - Qwen3-30B-A3B (multimodal) │ │ │ │ - Qwen3-14B (general purpose) │ │ │ │ - Qwen3-8B (fast inference) │ │ │ │ Listening on 0.0.0.0:11434 │ │ │ └──────────────────────────────────────────────────┘ │ │ ┌──────────────────────────────────────────────────┐ │ │ │ ComfyUI Container │ │ │ │ - FLUX.1 (image gen) │ │ │ │ - Stable Diffusion 3.5 (image gen) │ │ │ │ - SDXL-Lightning (fast generation) │ │ │ │ Listening on 0.0.0.0:8188 │ │ │ └──────────────────────────────────────────────────┘ │ └────────────────────────────────────────────────────────┘ ``` ### Components to Deploy 1. **Vast.ai GPU Rental** - Machine: 2x NVIDIA RTX 3090 (48GB total VRAM) - Estimated cost: $180-220/month - Auto-bid setup for cost optimization - Persistence: Restore from snapshot between rentals 2. 
**Ollama (Text LLM Server)** - Container-based deployment on GPU - Models: Qwen3 series (32B, 30B-A3B, 14B, 8B) - API: OpenAI-compatible `/v1/chat/completions` - Port: 11434 (tunneled via SSH) 3. **ComfyUI (Image Generation)** - Container-based deployment on GPU - Models: FLUX.1, Stable Diffusion 3.5, SDXL-Lightning - API: REST endpoints for image generation - Port: 8188 (tunneled via SSH) 4. **SSH Tunnel (autossh)** - Persistent connection from VPS to GPU server - Systemd service with auto-restart - Ports: 11434/11435 (Ollama), 8188 (ComfyUI) - Handles network interruptions automatically 5. **LiteLLM Proxy** - Runs on IGNY8 VPS - Acts as OpenAI-compatible API gateway - Configurable routing based on model/task type - Fallback to OpenAI/Anthropic if self-hosted unavailable - Port: 8000 (local access only) 6. **IGNY8 Backend Integration** - Add self-hosted LiteLLM as new `IntegrationProvider` - Update AI request logic to check availability - Implement fallback chain: self-hosted → OpenAI → Anthropic - Cost tracking per provider --- ## 3. 
Data Models / APIs ### Database Models (Minimal Schema Changes) Use existing `IntegrationProvider` model — add a new row with `provider_type='ai'`: ```python # New IntegrationProvider row (NO new provider_type needed) # provider_type='ai' already exists in PROVIDER_TYPE_CHOICES # Create via admin or migration: IntegrationProvider.objects.create( provider_id='self_hosted_ai', display_name='Self-Hosted AI (LiteLLM)', provider_type='ai', api_key='', # LiteLLM doesn't require auth (internal) api_endpoint='http://localhost:8000', is_active=True, is_sandbox=False, config={ "priority": 10, # Try self-hosted first "models": { "text_generation": "qwen3:32b", "text_generation_fast": "qwen3:8b", "image_generation": "flux.1-dev", "image_generation_fast": "sdxl-lightning" }, "timeout": 300, # 5 minute timeout for slow models "fallback_to": "openai" # Fallback provider if self-hosted fails } ) ``` ### LiteLLM API Endpoints **Text Generation (Compatible with OpenAI API)** ```bash # Request curl -X POST http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "ollama/qwen3:32b", "messages": [{"role": "user", "content": "Write an article about..."}], "temperature": 0.7, "max_tokens": 2000 }' # Response (identical to OpenAI) { "id": "chatcmpl-...", "object": "chat.completion", "created": 1234567890, "model": "ollama/qwen3:32b", "choices": [{ "index": 0, "message": {"role": "assistant", "content": "Article text..."}, "finish_reason": "stop" }], "usage": { "prompt_tokens": 50, "completion_tokens": 500, "total_tokens": 550 } } ``` **Image Generation (ComfyUI via LiteLLM)** ```bash # Request curl -X POST http://localhost:8000/v1/images/generations \ -H "Content-Type: application/json" \ -d '{ "model": "comfyui/flux.1-dev", "prompt": "A professional product photo of...", "size": "1024x1024", "n": 1, "quality": "hd" }' # Response { "created": 1234567890, "data": [{ "url": "data:image/png;base64,...", "revised_prompt": "A professional product photo 
of..." }] } ``` ### Model Routing Configuration **LiteLLM Config (see section 4.2)** - Routes `gpt-4` requests → `ollama/qwen3:32b` - Routes `gpt-3.5-turbo` requests → `ollama/qwen3:8b` - Routes DALL-E requests → `comfyui/flux.1-dev` - Includes fallback to OpenAI for unavailable models - Respects timeout and retry limits --- ## 4. Implementation Steps ### Phase 1: GPU Infrastructure Setup (Days 1-2) #### 4.1 Vast.ai Account & GPU Rental **Step 1: Create Vast.ai Account** ```bash # Navigate to https://www.vast.ai # Sign up with email # Verify account via email # Add payment method (credit card or crypto) ``` **Step 2: Rent GPU Instance** Requirements: - 2x NVIDIA RTX 3090 (or 1x RTX 4090) = 48GB+ VRAM - Ubuntu 24.04 LTS base image (preferred) or later - Minimum bandwidth: 100 Mbps - SSH port 22 open Setup via Vast.ai dashboard: 1. Go to "Browse" → Filter by: - GPU: 2x RTX 3090 or RTX 4090 - Min VRAM: 48GB - OS: Ubuntu 24.04 LTS (or later) - Price: Sort by lowest $/hr 2. Click "Rent" on selected instance 3. Choose: - Disk size: 500GB (includes models) - Secure Cloud: No (to access port 22) 4. Wait for machine to start (2-5 minutes) 5. 
Record SSH credentials from dashboard

**Step 3: Test SSH Access**

```bash
# From your local machine
ssh root@ -i ~/.ssh/vast_key

# Update system
apt update && apt upgrade -y
```

**Step 4: Set Up Snapshot for Persistence**

```bash
# After first-time setup, create snapshot in Vast.ai dashboard
# Future rentals: select snapshot to restore previous state
```

---

#### 4.2 Vast.ai: Docker & Base Containers

**Step 1: Install Docker**

```bash
# SSH into Vast.ai machine
ssh root@

# Install Docker
curl -fsSL https://get.docker.com | sh
systemctl enable docker
systemctl start docker

# Verify
docker --version
```

**Step 2: Set Up Storage**

```bash
# Create persistent directories for models
mkdir -p /mnt/models /mnt/ollama-cache /mnt/comfyui-models
chmod 777 /mnt/models /mnt/ollama-cache /mnt/comfyui-models

# Create docker network for inter-container communication
docker network create ai-network
```

**Step 3: Deploy Ollama Container**

```bash
# Models persist on the host in /mnt/ollama-cache via the volume mount,
# which covers Ollama's default model directory (/root/.ollama) inside
# the container — no OLLAMA_MODELS override needed
docker run -d \
  --name ollama \
  --network ai-network \
  --gpus all \
  -v /mnt/ollama-cache:/root/.ollama \
  -p 0.0.0.0:11434:11434 \
  ollama/ollama:latest
```

**Step 4: Pull Qwen3 Models**

```bash
# Wait for ollama to be ready
sleep 10

# Pull models (will take 30-60 minutes depending on speed)
# Order by priority (largest first)
docker exec ollama ollama pull qwen3:32b      # ~20GB
docker exec ollama ollama pull qwen3:30b-a3b  # ~18GB
docker exec ollama ollama pull qwen3:14b     # ~9GB
docker exec ollama ollama pull qwen3:8b      # ~5GB

# Verify models are loaded
docker exec ollama ollama list
# Output should show all models with their sizes
```

**Step 5: Deploy ComfyUI Container**

```bash
# Clone ComfyUI repository
cd /opt
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

# Option A: use a prebuilt ComfyUI image with CUDA support
# (there is no official ComfyUI image — substitute an image you have
#  built or vetted yourself)
docker run -d \
  --name comfyui \
  --network ai-network \
  --gpus all \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -v /mnt/comfyui-models:/ComfyUI/models \
  -v /mnt/comfyui-output:/ComfyUI/output \
  -p 0.0.0.0:8188:8188 \
  comfyui-docker:latest
# Option B: run from source inside a CUDA runtime image
# (the runtime image has no Python preinstalled — install it, then start
#  ComfyUI's own server with --listen/--port rather than a static file server)
docker run -d \
  --name comfyui \
  --network ai-network \
  --gpus all \
  -v /opt/ComfyUI:/ComfyUI \
  -v /mnt/comfyui-models:/ComfyUI/models \
  -v /mnt/comfyui-output:/ComfyUI/output \
  -p 0.0.0.0:8188:8188 \
  -w /ComfyUI \
  nvidia/cuda:11.8.0-runtime-ubuntu22.04 \
  bash -c "apt-get update && apt-get install -y python3 python3-pip && \
           pip3 install -r requirements.txt && \
           python3 main.py --listen 0.0.0.0 --port 8188"
```

**Step 6: Download Image Generation Models**

```bash
# Download models to ComfyUI.
# Note: several of these Hugging Face repos are gated — accept the license
# on the model page and pass an access token, e.g.:
#   --header="Authorization: Bearer $HF_TOKEN"
# Verify the exact filenames on each model page before downloading.

# FLUX.1 (recommended for quality)
cd /mnt/comfyui-models/checkpoints
wget -O flux1-dev.safetensors \
  "https://huggingface.co/black-forest-labs/FLUX.1-dev/resolve/main/flux1-dev.safetensors"

# Stable Diffusion 3.5 (alternative)
wget -O sd3.5-large.safetensors \
  "https://huggingface.co/stabilityai/stable-diffusion-3.5-large/resolve/main/sd3.5_large.safetensors"

# SDXL-Lightning (fast, lower quality but acceptable)
wget -O sdxl-lightning.safetensors \
  "https://huggingface.co/ByteDance/SDXL-Lightning/resolve/main/sdxl_lightning_4step.safetensors"

# VAE (for FLUX)
cd /mnt/comfyui-models/vae
wget -O ae.safetensors \
  "https://huggingface.co/black-forest-labs/FLUX.1-dev/resolve/main/ae.safetensors"
```

**Step 7: Verify Services**

```bash
# Check Ollama API
curl http://localhost:11434/api/tags
# Should return: {"models": [{"name": "qwen3:32b", "size": ...}, ...]}

# Check ComfyUI
curl http://localhost:8188/system_stats
# Should return GPU/memory stats
```

---

### Phase 2: VPS Tunnel & LiteLLM Setup (Days 2-3)

#### 4.3 IGNY8 VPS: SSH Tunnel Configuration

**Prerequisites:** VPS must be provisioned (see 00B)

**VPS Environment:**
- Ubuntu 24.04 LTS
- Docker 29.x
- GPU server is deployed separately on Vast.ai (not on the VPS)
- The VPS maintains SSH tunnels to the Vast.ai GPU server for accessing Ollama and ComfyUI

**DNS Note:** During initial setup, before DNS flip, the IGNY8 backend connects to the LiteLLM proxy via `localhost:8000` within the VPS environment.
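The endpoint selection this implies — prefer the local LiteLLM proxy, fall back to the external API when it is unreachable — can be sketched in a few lines of Python. This is an illustrative sketch, not the actual IGNY8 code; all names and URLs here are hypothetical, and the reachability probe is injected so the logic runs without live services:

```python
from typing import Callable, List, Tuple

# Candidate endpoints in priority order: local LiteLLM proxy first,
# then the external OpenAI API. (Illustrative values only.)
ENDPOINTS: List[Tuple[str, str]] = [
    ("self_hosted_ai", "http://localhost:8000"),
    ("openai", "https://api.openai.com"),
]


def resolve_ai_base_url(is_reachable: Callable[[str], bool]) -> Tuple[str, str]:
    """Return (provider, base_url) for the first reachable endpoint.

    `is_reachable` is passed in so the selection logic is testable
    offline; in production it would be a short HTTP health probe.
    """
    for provider, base_url in ENDPOINTS:
        if is_reachable(base_url):
            return provider, base_url
    raise RuntimeError("No AI endpoint reachable")


if __name__ == "__main__":
    # Simulate the tunnel being down: the localhost probe fails,
    # so selection falls through to the external provider.
    provider, url = resolve_ai_base_url(lambda u: "localhost" not in u)
    print(provider)  # openai
```

The real router in section 4.5 layers retries, per-provider timeouts, and cost tracking on top of this same priority-ordered walk.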
This uses the internal Docker network and local port forwarding, so external DNS configuration does not affect this connection. DNS considerations only apply to external client connections to the IGNY8 API.

**Step 1: Generate SSH Key Pair**

```bash
# On VPS
ssh-keygen -t rsa -b 4096 -f /root/.ssh/vast_ai -N ""

# On local machine, copy public key to Vast.ai machine
ssh-copy-id -i /root/.ssh/vast_ai.pub root@
```

**Step 2: Install & Configure autossh**

```bash
# On VPS
apt install autossh -y

# Create dedicated user for tunnel
useradd -m -s /bin/bash tunnel-user
mkdir -p /home/tunnel-user/.ssh
cp /root/.ssh/vast_ai* /home/tunnel-user/.ssh/
chown -R tunnel-user:tunnel-user /home/tunnel-user/.ssh
chmod 600 /home/tunnel-user/.ssh/vast_ai
```

**Step 3: Create autossh Systemd Service**

File: `/etc/systemd/system/tunnel-vast-ai.service`

```ini
[Unit]
Description=SSH Tunnel to Vast.ai GPU Server
After=network.target
Wants=network-online.target

[Service]
Type=simple
User=tunnel-user
# ExitOnForwardFailure=yes makes ssh exit when a port forward dies,
# so autossh (with Restart=always) can re-establish the tunnel
ExecStart=/usr/bin/autossh \
  -M 20000 \
  -N \
  -o "ServerAliveInterval=30" \
  -o "ServerAliveCountMax=3" \
  -o "ExitOnForwardFailure=yes" \
  -o "ConnectTimeout=10" \
  -o "StrictHostKeyChecking=accept-new" \
  -i /home/tunnel-user/.ssh/vast_ai \
  -L 11434:localhost:11434 \
  -L 11435:localhost:11435 \
  -L 8188:localhost:8188 \
  root@
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
```

**Step 4: Start Tunnel Service**

```bash
# Reload systemd
systemctl daemon-reload

# Start service
systemctl start tunnel-vast-ai

# Enable on boot
systemctl enable tunnel-vast-ai

# Verify tunnel is up
systemctl status tunnel-vast-ai

# Check logs
journalctl -u tunnel-vast-ai -f
```

**Step 5: Test Tunnel Connectivity**

```bash
# On VPS, verify ports are open
netstat -tlnp | grep -E '(11434|8188)'
# Should show: 127.0.0.1:11434 LISTEN
#              127.0.0.1:8188  LISTEN

# Test Ollama through tunnel
curl http://localhost:11434/api/tags
# Should return model list from remote Vast.ai
machine

# Test ComfyUI through tunnel
curl http://localhost:8188/system_stats
# Should return GPU stats
```

---

#### 4.4 LiteLLM Installation & Configuration

**Step 1: Install LiteLLM**

```bash
# On VPS — the [proxy] extra pulls in the gateway dependencies (fastapi, uvicorn, etc.)
pip install 'litellm[proxy]' python-dotenv requests

# Verify installation
python -c "import litellm; print(litellm.__version__)"
```

**Step 2: Create LiteLLM Configuration**

File: `/opt/litellm/config.yaml`

```yaml
# LiteLLM Configuration for IGNY8
model_list:
  # Text Generation Models (Ollama via SSH tunnel)
  - model_name: gpt-4
    litellm_params:
      model: ollama/qwen3:32b
      api_base: http://localhost:11434
      timeout: 300
      max_tokens: 8000

  - model_name: gpt-4-turbo
    litellm_params:
      model: ollama/qwen3:30b-a3b
      api_base: http://localhost:11434
      timeout: 300
      max_tokens: 8000

  - model_name: gpt-3.5-turbo
    litellm_params:
      model: ollama/qwen3:14b
      api_base: http://localhost:11434
      timeout: 180
      max_tokens: 4000

  - model_name: gpt-3.5-turbo-fast
    litellm_params:
      model: ollama/qwen3:8b
      api_base: http://localhost:11434
      timeout: 120
      max_tokens: 2048

  # Fallback to OpenAI for redundancy
  # (os.environ/VAR is LiteLLM's syntax for reading an environment variable)
  - model_name: gpt-4-fallback
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_API_KEY
      timeout: 60

  - model_name: gpt-3.5-turbo-fallback
    litellm_params:
      model: gpt-3.5-turbo
      api_key: os.environ/OPENAI_API_KEY
      timeout: 60

  # Image Generation (ComfyUI via tunnel)
  # Note: ComfyUI does not speak the OpenAI API natively — these routes
  # assume a thin adapter (or custom LiteLLM provider) in front of
  # ComfyUI's native REST API
  - model_name: dall-e-3
    litellm_params:
      model: comfyui/flux.1-dev
      api_base: http://localhost:8188
      timeout: 120

  - model_name: dall-e-2
    litellm_params:
      model: comfyui/sdxl-lightning
      api_base: http://localhost:8188
      timeout: 60

# Router configuration for model selection
router_settings:
  routing_strategy: "simple-shuffle"  # Load balancing across deployments

# Logging, caching, and fallback behavior
litellm_settings:
  set_verbose: true
  cache: true
  # If a primary model fails, retry with the listed models in order
  fallbacks:
    - {"gpt-4": ["gpt-4-fallback", "gpt-3.5-turbo"]}
    - {"gpt-3.5-turbo": ["gpt-3.5-turbo-fallback"]}
    - {"dall-e-3": ["dall-e-2"]}
```

**Step 3: Create Environment File**

File: `/opt/litellm/.env`

```bash
# OpenAI (for fallback)
OPENAI_API_KEY=sk-your-key-here

# Anthropic (optional fallback)
ANTHROPIC_API_KEY=sk-ant-your-key-here

# LiteLLM settings
LITELLM_LOG_LEVEL=INFO
LITELLM_CACHE=true

# Service settings
PORT=8000
HOST=127.0.0.1
```

**Step 4: Create LiteLLM Startup Script**

File: `/opt/litellm/start.sh`

```bash
#!/bin/bash
set -e
cd /opt/litellm

# Load (and export) environment variables
set -a
source .env
set +a

# Start the LiteLLM proxy via its CLI
# (per-model timeouts are configured in config.yaml)
litellm \
  --config config.yaml \
  --host 127.0.0.1 \
  --port 8000 \
  --num_workers 4
```

```bash
chmod +x /opt/litellm/start.sh
```

**Step 5: Create Systemd Service for LiteLLM**

File: `/etc/systemd/system/litellm.service`

```ini
[Unit]
Description=LiteLLM AI Proxy Gateway
After=network.target tunnel-vast-ai.service
Wants=tunnel-vast-ai.service

[Service]
Type=simple
User=root
WorkingDirectory=/opt/litellm
ExecStart=/opt/litellm/start.sh
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/local/python/bin"

[Install]
WantedBy=multi-user.target
```

**Step 6: Start LiteLLM Service**

```bash
systemctl daemon-reload
systemctl start litellm
systemctl enable litellm

# Verify service
systemctl status litellm

# Check logs
journalctl -u litellm -f
```

**Step 7: Test LiteLLM API**

```bash
# Test text generation with self-hosted model
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer test-key" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {"role": "user", "content": "Say hello in one sentence"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

# Should return a response from Qwen3-32B

# Test fallback (disconnect tunnel first to test fallback logic)
# curl should eventually fall back to OpenAI after timeout
```

---

### Phase 3: IGNY8
Backend Integration (Days 3-4) #### 4.5 Add Self-Hosted Provider to IGNY8 **Step 1: Update GlobalIntegrationSettings Model** File: `backend/models/integration.py` ```python # Add to IntegrationProvider enum class IntegrationProvider(models.TextChoices): OPENAI = "openai", "OpenAI" ANTHROPIC = "anthropic", "Anthropic" RUNWARE = "runware", "Runware" BRIA = "bria", "Bria" SELF_HOSTED = "self_hosted_ai", "Self-Hosted AI (LiteLLM)" # NEW # Example settings structure SELF_HOSTED_SETTINGS = { "provider": "self_hosted_ai", "name": "Self-Hosted AI (LiteLLM)", "base_url": "http://localhost:8000", "api_key": "not_required", "enabled": True, "priority": 10, # Try first "models": { "text_generation": "gpt-4", # Maps to qwen3:32b "text_generation_fast": "gpt-3.5-turbo", # Maps to qwen3:8b "image_generation": "dall-e-3", # Maps to flux.1-dev "image_generation_fast": "dall-e-2" # Maps to sdxl-lightning }, "timeout": 300, "fallback_to": "openai" } ``` **Step 2: Add Self-Hosted Settings to Database** File: `backend/management/commands/init_integrations.py` ```python from backend.models.integration import GlobalIntegrationSettings, IntegrationProvider def add_self_hosted_integration(): """Initialize self-hosted AI integration""" self_hosted_config = { "provider": IntegrationProvider.SELF_HOSTED, "name": "Self-Hosted AI (LiteLLM)", "base_url": "http://localhost:8000", "api_key": "", # Not required for local proxy "enabled": True, "priority": 10, # Higher priority = try first "models": { "text_generation": "gpt-4", "text_generation_fast": "gpt-3.5-turbo", "image_generation": "dall-e-3", "image_generation_fast": "dall-e-2" }, "timeout": 300, "max_retries": 2, "fallback_provider": IntegrationProvider.OPENAI } integration, created = GlobalIntegrationSettings.objects.update_or_create( provider=IntegrationProvider.SELF_HOSTED, defaults=self_hosted_config ) if created: print(f"✓ Created {IntegrationProvider.SELF_HOSTED} integration") else: print(f"✓ Updated {IntegrationProvider.SELF_HOSTED} 
integration") # Run in management command initialization ``` **Step 3: Update AI Request Router** File: `backend/services/ai_engine.py` ```python import requests import logging from typing import Optional, List, Dict, Any from backend.models.integration import GlobalIntegrationSettings, IntegrationProvider logger = logging.getLogger(__name__) class AIEngineRouter: """Routes AI requests to appropriate provider with fallback chain""" PROVIDER_PRIORITY = { IntegrationProvider.SELF_HOSTED: 10, # Try first IntegrationProvider.OPENAI: 5, IntegrationProvider.ANTHROPIC: 4, } def __init__(self): self.providers = self._load_providers() def _load_providers(self) -> List[Dict[str, Any]]: """Load enabled providers from database""" configs = GlobalIntegrationSettings.objects.filter( enabled=True ).values() # Sort by priority (highest first) sorted_configs = sorted( configs, key=lambda x: self.PROVIDER_PRIORITY.get(x['provider'], 0), reverse=True ) return sorted_configs def generate_text( self, prompt: str, model: str = "gpt-4", max_tokens: int = 2000, temperature: float = 0.7, timeout: Optional[int] = None ) -> Dict[str, Any]: """Generate text using available provider with fallback""" for provider_config in self.providers: try: result = self._call_provider( provider_config, "text", prompt=prompt, model=model, max_tokens=max_tokens, temperature=temperature, timeout=timeout or provider_config.get('timeout', 300) ) return { "success": True, "provider": provider_config['provider'], "text": result['content'], "usage": result.get('usage', {}), "model": result.get('model', model) } except Exception as e: logger.warning( f"Provider {provider_config['provider']} failed: {str(e)}" ) continue # All providers failed raise Exception("All AI providers exhausted. 
No response available.") def generate_image( self, prompt: str, model: str = "dall-e-3", size: str = "1024x1024", quality: str = "hd", timeout: Optional[int] = None ) -> Dict[str, Any]: """Generate image using available provider with fallback""" for provider_config in self.providers: try: result = self._call_provider( provider_config, "image", prompt=prompt, model=model, size=size, quality=quality, timeout=timeout or provider_config.get('timeout', 120) ) return { "success": True, "provider": provider_config['provider'], "image_url": result['url'], "revised_prompt": result.get('revised_prompt', prompt), "model": result.get('model', model) } except Exception as e: logger.warning( f"Provider {provider_config['provider']} failed: {str(e)}" ) continue # All providers failed raise Exception("All image generation providers exhausted.") def _call_provider( self, provider_config: Dict[str, Any], task_type: str, # "text" or "image" **kwargs ) -> Dict[str, Any]: """Call specific provider based on type""" provider = provider_config['provider'] if provider == IntegrationProvider.SELF_HOSTED: return self._call_litellm(provider_config, task_type, **kwargs) elif provider == IntegrationProvider.OPENAI: return self._call_openai(provider_config, task_type, **kwargs) elif provider == IntegrationProvider.ANTHROPIC: return self._call_anthropic(provider_config, task_type, **kwargs) else: raise ValueError(f"Unknown provider: {provider}") def _call_litellm( self, provider_config: Dict[str, Any], task_type: str, **kwargs ) -> Dict[str, Any]: """Call LiteLLM proxy on localhost""" base_url = provider_config['base_url'] timeout = kwargs.pop('timeout', 300) if task_type == "text": # Chat completion endpoint endpoint = f"{base_url}/v1/chat/completions" payload = { "model": kwargs.get('model', 'gpt-4'), "messages": [ {"role": "user", "content": kwargs['prompt']} ], "temperature": kwargs.get('temperature', 0.7), "max_tokens": kwargs.get('max_tokens', 2000) } elif task_type == "image": # Image 
generation endpoint endpoint = f"{base_url}/v1/images/generations" payload = { "model": kwargs.get('model', 'dall-e-3'), "prompt": kwargs['prompt'], "size": kwargs.get('size', '1024x1024'), "n": 1, "quality": kwargs.get('quality', 'hd') } else: raise ValueError(f"Unknown task type: {task_type}") try: response = requests.post( endpoint, json=payload, timeout=timeout, headers={"Authorization": "Bearer test"} ) response.raise_for_status() data = response.json() if task_type == "text": return { "content": data['choices'][0]['message']['content'], "usage": data.get('usage', {}), "model": data.get('model', kwargs.get('model')) } else: # image return { "url": data['data'][0]['url'], "revised_prompt": data['data'][0].get('revised_prompt'), "model": kwargs.get('model') } except requests.exceptions.Timeout: logger.error(f"LiteLLM timeout after {timeout}s") raise except requests.exceptions.ConnectionError: logger.error("Cannot connect to LiteLLM proxy - tunnel may be down") raise except Exception as e: logger.error(f"LiteLLM request failed: {str(e)}") raise def _call_openai(self, provider_config, task_type, **kwargs): """Existing OpenAI implementation""" # Use existing OpenAI integration code pass def _call_anthropic(self, provider_config, task_type, **kwargs): """Existing Anthropic implementation""" # Use existing Anthropic integration code pass # Initialize global instance ai_router = AIEngineRouter() ``` **Step 4: Update Content Generation Celery Tasks** File: `backend/tasks/content_generation.py` ```python from celery import shared_task from backend.services.ai_engine import ai_router import logging logger = logging.getLogger(__name__) @shared_task def generate_article_content(user_id: int, article_id: int): """Generate article content using AI router (tries self-hosted first)""" try: # Get article from database article = Article.objects.get(id=article_id, user_id=user_id) # Generate content result = ai_router.generate_text( prompt=f"Write a detailed article about: 
{article.topic}", model="gpt-4", max_tokens=3000, temperature=0.7 ) # Save result article.content = result['text'] article.ai_provider = result['provider'] article.save() logger.info( f"Generated article {article_id} using {result['provider']}" ) return { "success": True, "article_id": article_id, "provider": result['provider'] } except Exception as e: logger.error(f"Article generation failed: {str(e)}") raise @shared_task def generate_product_images(user_id: int, product_id: int): """Generate product images using AI router""" try: product = Product.objects.get(id=product_id, user_id=user_id) # Try to generate with self-hosted first (faster) result = ai_router.generate_image( prompt=f"Professional product photo of: {product.description}", model="dall-e-3", size="1024x1024", quality="hd" ) product.image_url = result['image_url'] product.ai_provider = result['provider'] product.save() logger.info(f"Generated image for product {product_id} using {result['provider']}") return { "success": True, "product_id": product_id, "provider": result['provider'], "image_url": result['image_url'] } except Exception as e: logger.error(f"Image generation failed: {str(e)}") raise ``` **Step 5: Add AI Provider Tracking** File: `backend/models/content.py` ```python from django.db import models from backend.models.integration import IntegrationProvider class Article(models.Model): # ... existing fields ... # Track which AI provider generated content ai_provider = models.CharField( max_length=50, choices=IntegrationProvider.choices, default=IntegrationProvider.OPENAI, help_text="Which AI provider generated this content" ) ai_cost = models.DecimalField( max_digits=10, decimal_places=6, default=0, help_text="Cost to generate via AI provider" ) ai_generation_time = models.DurationField( null=True, blank=True, help_text="Time taken to generate content" ) class Product(models.Model): # ... existing fields ... 
ai_provider = models.CharField( max_length=50, choices=IntegrationProvider.choices, default=IntegrationProvider.OPENAI, help_text="Which AI provider generated the image" ) ai_image_cost = models.DecimalField( max_digits=10, decimal_places=6, default=0, help_text="Cost to generate image" ) ``` --- ### Phase 4: Monitoring & Fallback (Days 4-5) #### 4.6 Health Check & Failover System **Step 1: Create Health Check Service** File: `backend/services/ai_health_check.py` ```python import requests import time import logging from typing import Dict, Any, Tuple from datetime import datetime, timedelta logger = logging.getLogger(__name__) class AIHealthMonitor: """Monitor health of self-hosted AI infrastructure""" OLLAMA_ENDPOINT = "http://localhost:11434/api/tags" COMFYUI_ENDPOINT = "http://localhost:8188/system_stats" LITELLM_ENDPOINT = "http://localhost:8000/health" HEALTH_CHECK_INTERVAL = 60 # seconds FAILURE_THRESHOLD = 3 # Mark unhealthy after 3 failures def __init__(self): self.last_check = None self.failure_count = { 'ollama': 0, 'comfyui': 0, 'litellm': 0 } self.is_healthy = { 'ollama': True, 'comfyui': True, 'litellm': True } def check_all(self) -> Dict[str, Any]: """Run all health checks""" results = { 'timestamp': datetime.now().isoformat(), 'overall_healthy': True, 'services': {} } # Check Ollama ollama_healthy = self._check_ollama() results['services']['ollama'] = { 'healthy': ollama_healthy, 'endpoint': self.OLLAMA_ENDPOINT } if not ollama_healthy: results['overall_healthy'] = False # Check ComfyUI comfyui_healthy = self._check_comfyui() results['services']['comfyui'] = { 'healthy': comfyui_healthy, 'endpoint': self.COMFYUI_ENDPOINT } if not comfyui_healthy: results['overall_healthy'] = False # Check LiteLLM litellm_healthy = self._check_litellm() results['services']['litellm'] = { 'healthy': litellm_healthy, 'endpoint': self.LITELLM_ENDPOINT } if not litellm_healthy: results['overall_healthy'] = False self.last_check = results # Log status change if needed if 
if self.is_healthy['ollama'] != ollama_healthy:
            level = logging.WARNING if not ollama_healthy else logging.INFO
            logger.log(level, f"Ollama service {'down' if not ollama_healthy else 'recovered'}")
        if self.is_healthy['comfyui'] != comfyui_healthy:
            level = logging.WARNING if not comfyui_healthy else logging.INFO
            logger.log(level, f"ComfyUI service {'down' if not comfyui_healthy else 'recovered'}")
        if self.is_healthy['litellm'] != litellm_healthy:
            level = logging.WARNING if not litellm_healthy else logging.INFO
            logger.log(level, f"LiteLLM service {'down' if not litellm_healthy else 'recovered'}")

        # Update internal state
        self.is_healthy['ollama'] = ollama_healthy
        self.is_healthy['comfyui'] = comfyui_healthy
        self.is_healthy['litellm'] = litellm_healthy

        return results

    def _check_ollama(self) -> bool:
        """Check if Ollama is responding"""
        try:
            response = requests.get(self.OLLAMA_ENDPOINT, timeout=5)
            if response.status_code == 200:
                self.failure_count['ollama'] = 0
                return True
        except Exception as e:
            logger.debug(f"Ollama health check failed: {str(e)}")
        # Tolerate transient failures: report healthy until the threshold is hit
        self.failure_count['ollama'] += 1
        return self.failure_count['ollama'] < self.FAILURE_THRESHOLD

    def _check_comfyui(self) -> bool:
        """Check if ComfyUI is responding"""
        try:
            response = requests.get(self.COMFYUI_ENDPOINT, timeout=5)
            if response.status_code == 200:
                self.failure_count['comfyui'] = 0
                return True
        except Exception as e:
            logger.debug(f"ComfyUI health check failed: {str(e)}")
        self.failure_count['comfyui'] += 1
        return self.failure_count['comfyui'] < self.FAILURE_THRESHOLD

    def _check_litellm(self) -> bool:
        """Check if LiteLLM is responding"""
        try:
            response = requests.get(self.LITELLM_ENDPOINT, timeout=5)
            if response.status_code == 200:
                self.failure_count['litellm'] = 0
                return True
        except Exception as e:
            logger.debug(f"LiteLLM health check failed: {str(e)}")
        self.failure_count['litellm'] += 1
        return self.failure_count['litellm'] < self.FAILURE_THRESHOLD

    def is_self_hosted_available(self) -> bool:
        """Check if self-hosted AI is fully available"""
        return all([
            self.is_healthy['ollama'],
            self.is_healthy['comfyui'],
            self.is_healthy['litellm'],
        ])


# Create global instance
health_monitor = AIHealthMonitor()
```

**Step 2: Create Health Check Celery Task**

File: `backend/tasks/health_checks.py`

```python
import logging

from celery import shared_task

from backend.models.monitoring import ServiceHealthLog
from backend.services.ai_health_check import health_monitor

logger = logging.getLogger(__name__)


@shared_task
def check_ai_health():
    """Run AI infrastructure health checks every minute"""
    results = health_monitor.check_all()

    # Log to database
    ServiceHealthLog.objects.create(
        service='self_hosted_ai',
        is_healthy=results['overall_healthy'],
        details=results
    )

    # Alert if services are down
    if not results['overall_healthy']:
        down_services = [
            service for service, status in results['services'].items()
            if not status['healthy']
        ]
        logger.error(
            f"AI services down: {', '.join(down_services)}. "
            f"Falling back to external APIs."
        )

    return results


# Add to the Celery beat schedule (in settings)
CELERY_BEAT_SCHEDULE = {
    'check-ai-health': {
        'task': 'backend.tasks.health_checks.check_ai_health',
        'schedule': 60.0,  # Every 60 seconds
    },
}
```

**Step 3: Create Monitoring Model**

File: `backend/models/monitoring.py`

```python
from django.db import models


class ServiceHealthLog(models.Model):
    """Log of service health checks"""

    SERVICE_CHOICES = [
        ('self_hosted_ai', 'Self-Hosted AI'),
        ('tunnel', 'SSH Tunnel'),
        ('litellm', 'LiteLLM Proxy'),
    ]

    service = models.CharField(max_length=50, choices=SERVICE_CHOICES)
    is_healthy = models.BooleanField()
    details = models.JSONField(default=dict)
    checked_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        ordering = ['-checked_at']
        indexes = [
            models.Index(fields=['-checked_at']),
            models.Index(fields=['service', '-checked_at']),
        ]

    def __str__(self):
        status = "✓ Healthy" if self.is_healthy else "✗ Down"
        return f"{self.service} {status} @ {self.checked_at}"


class AIUsageLog(models.Model):
    """Track AI provider usage and costs"""

    PROVIDER_CHOICES = [
        ('self_hosted_ai', 'Self-Hosted AI'),
        ('openai', 'OpenAI'),
        ('anthropic', 'Anthropic'),
    ]
    TASK_TYPE_CHOICES = [
        ('text_generation', 'Text Generation'),
        ('image_generation', 'Image Generation'),
        ('keyword_research', 'Keyword Research'),
    ]

    user = models.ForeignKey('User', on_delete=models.CASCADE)
    provider = models.CharField(max_length=50, choices=PROVIDER_CHOICES)
    task_type = models.CharField(max_length=50, choices=TASK_TYPE_CHOICES)
    model_used = models.CharField(max_length=100)
    input_tokens = models.IntegerField(default=0)
    output_tokens = models.IntegerField(default=0)
    cost = models.DecimalField(max_digits=10, decimal_places=6, default=0)
    duration_ms = models.IntegerField()  # Milliseconds
    success = models.BooleanField(default=True)
    error_message = models.TextField(blank=True)
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        ordering = ['-created_at']
        indexes = [
            models.Index(fields=['user', '-created_at']),
            models.Index(fields=['provider', '-created_at']),
        ]

    def __str__(self):
        return f"{self.provider} - {self.task_type} - ${self.cost:.4f}"
```

---

### Phase 5: Cost Tracking & Optimization (Days 5-6)

#### 4.7 Cost Calculation & Dashboard

**Step 1: Create Cost Calculator**

File: `backend/services/cost_calculator.py`

```python
from decimal import Decimal
from typing import Any, Dict


class AICostCalculator:
    """Calculate AI generation costs by provider"""

    # Self-hosted cost (Vast.ai GPU rental amortized):
    # $200/month ÷ 30 days ÷ 24 hours ≈ $0.278 per rented hour.
    # Only a fraction of rented time is attributed to any one request,
    # so a flat rate per compute-hour is used as a rough estimate.
    SELF_HOSTED_COST_PER_HOUR = Decimal('0.20')  # Conservative estimate

    # OpenAI pricing (as of 2026), per token
    OPENAI_PRICING = {
        'gpt-4': {
            'input': Decimal('0.00003'),
            'output': Decimal('0.00006'),
        },
        'gpt-3.5-turbo': {
            'input': Decimal('0.0000005'),   # $0.0005 per 1K tokens
            'output': Decimal('0.0000015'),  # $0.0015 per 1K tokens
        },
        'dall-e-3': Decimal('0.04'),  # per image
    }

    # Anthropic pricing, per token
    ANTHROPIC_PRICING = {
        'claude-3-opus': {
            'input': Decimal('0.000015'),
            'output': Decimal('0.000075'),
        },
        'claude-3-sonnet': {
            'input': Decimal('0.000003'),
            'output': Decimal('0.000015'),
        },
    }

    @classmethod
    def calculate_text_generation_cost(
        cls,
        provider: str,
        model: str,
        input_tokens: int,
        output_tokens: int,
        duration_ms: int = 0
    ) -> Decimal:
        """Calculate cost for text generation"""
        if provider == 'self_hosted_ai':
            # Cost based on compute time (rough estimate)
            duration_hours = Decimal(duration_ms) / Decimal(1000 * 3600)
            return cls.SELF_HOSTED_COST_PER_HOUR * duration_hours
        elif provider == 'openai':
            pricing = cls.OPENAI_PRICING.get(model, {})
            input_cost = Decimal(input_tokens) * pricing.get('input', Decimal(0))
            output_cost = Decimal(output_tokens) * pricing.get('output', Decimal(0))
            return input_cost + output_cost
        elif provider == 'anthropic':
            pricing = cls.ANTHROPIC_PRICING.get(model, {})
            input_cost = Decimal(input_tokens) * pricing.get('input', Decimal(0))
            output_cost = Decimal(output_tokens) * pricing.get('output', Decimal(0))
            return input_cost + output_cost
        return Decimal(0)

    @classmethod
    def calculate_image_generation_cost(
        cls,
        provider: str,
        model: str,
        duration_ms: int = 0
    ) -> Decimal:
        """Calculate cost for image generation"""
        if provider == 'self_hosted_ai':
            # Cost based on compute time
            duration_hours = Decimal(duration_ms) / Decimal(1000 * 3600)
            return cls.SELF_HOSTED_COST_PER_HOUR * duration_hours
        elif provider == 'openai':
            if 'dall-e' in model:
                return cls.OPENAI_PRICING.get('dall-e-3', Decimal('0.04'))
        return Decimal(0)

    @classmethod
    def monthly_cost_analysis(cls) -> Dict[str, Any]:
        """Analyze projected monthly costs"""
        from datetime import timedelta

        from django.utils import timezone

        from backend.models.monitoring import AIUsageLog

        # Get last 30 days of usage
        thirty_days_ago = timezone.now() - timedelta(days=30)
        usage_logs = AIUsageLog.objects.filter(created_at__gte=thirty_days_ago)

        cost_by_provider = {}
        total_cost = Decimal(0)
        for log in usage_logs:
            if log.provider not in cost_by_provider:
                cost_by_provider[log.provider] = {
                    'count': 0,
                    'total_cost': Decimal(0),
                    'saved_vs_openai': Decimal(0),
                }
            cost_by_provider[log.provider]['count'] += 1
            cost_by_provider[log.provider]['total_cost'] += log.cost
            total_cost += log.cost

        # Calculate savings: what the self-hosted workload would have cost on OpenAI
        self_hosted_usage = usage_logs.filter(provider='self_hosted_ai')
        openai_equivalent_cost = Decimal(0)
        for log in self_hosted_usage:
            if log.task_type == 'text_generation':
                openai_cost = cls.calculate_text_generation_cost(
                    'openai', 'gpt-4', log.input_tokens, log.output_tokens
                )
            else:
                openai_cost = cls.calculate_image_generation_cost('openai', 'dall-e-3')
            openai_equivalent_cost += openai_cost

        return {
            'cost_by_provider': cost_by_provider,
            'total_cost': total_cost,
            'savings_vs_openai': (
                openai_equivalent_cost
                - cost_by_provider.get('self_hosted_ai', {}).get('total_cost', Decimal(0))
            ),
            'roi_vs_gpu_cost': openai_equivalent_cost - Decimal(200),  # $200 = 1 month GPU
        }
```
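The per-token arithmetic in the calculator can be sanity-checked standalone. A minimal sketch, using the document's own GPT-4 rates and the $0.20/hour amortized GPU figure (illustrative estimates, not live prices):

```python
from decimal import Decimal

# Rates mirrored from AICostCalculator above (illustrative, not live prices)
GPT4_INPUT = Decimal('0.00003')   # per input token
GPT4_OUTPUT = Decimal('0.00006')  # per output token
SELF_HOSTED_PER_HOUR = Decimal('0.20')

def openai_text_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """What a GPT-4 call of this size would cost on the external API."""
    return Decimal(input_tokens) * GPT4_INPUT + Decimal(output_tokens) * GPT4_OUTPUT

def self_hosted_text_cost(duration_ms: int) -> Decimal:
    """Amortized GPU cost for the same call, attributed by wall-clock time."""
    return SELF_HOSTED_PER_HOUR * Decimal(duration_ms) / Decimal(1000 * 3600)

# A typical article-generation call: 1,500 prompt tokens, 1,000 completion
# tokens, taking ~20 s on the self-hosted stack.
print(openai_text_cost(1500, 1000))   # → 0.10500 (i.e. $0.105 per call)
print(self_hosted_text_cost(20_000))  # ≈ $0.0011 — roughly 100x cheaper
```

This is the gap the `savings_vs_openai` figure in `monthly_cost_analysis()` aggregates over a 30-day window.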
---

## 5. Acceptance Criteria

### Infrastructure Ready
- [ ] Vast.ai GPU instance rented and running (2x RTX 3090 or better)
- [ ] SSH access confirmed from IGNY8 VPS
- [ ] Ollama container running with all Qwen3 models downloaded
- [ ] ComfyUI container running with FLUX.1 and Stable Diffusion 3.5 models
- [ ] Models tested via direct API calls (curl tests all pass)

### Network Tunnel Operational
- [ ] autossh service running on IGNY8 VPS
- [ ] SSH tunnel persists through network interruptions
- [ ] Ports 11434, 11435, 8188 accessible on localhost from VPS
- [ ] Tunnel auto-reconnects within 60 seconds of disconnect
- [ ] Systemd service enabled on boot

### LiteLLM Proxy Functional
- [ ] LiteLLM service running on VPS port 8000
- [ ] OpenAI-compatible API endpoints working
- [ ] Text generation requests route to Ollama
- [ ] Image generation requests route to ComfyUI
- [ ] Fallback to OpenAI works when self-hosted unavailable
- [ ] Config includes all model variants
- [ ] Timeout values appropriate for each model

### IGNY8 Backend Integration Complete
- [ ] Self-hosted provider added to GlobalIntegrationSettings
- [ ] AIEngineRouter tries self-hosted before external APIs
- [ ] Celery tasks log which provider was used
- [ ] Content includes ai_provider tracking field
- [ ] Fallback chain works (self-hosted → OpenAI → Anthropic)
- [ ] Unit tests pass for all provider calls

### Health Check System Operational
- [ ] Health check task runs every 60 seconds
- [ ] ServiceHealthLog table populated
- [ ] Alerts generated when services are down
- [ ] System continues working with degraded services
- [ ] Dashboard shows service status

### Cost Tracking Implemented
- [ ] AIUsageLog records all AI requests
- [ ] Cost calculation accurate per provider
- [ ] Monthly cost analysis working
- [ ] Cost comparison shows self-hosted savings
- [ ] Dashboard displays cost breakdown

### Documentation & Runbooks
- [ ] This build document complete and accurate
- [ ] Troubleshooting guide for common issues
- [ ] Runbook for GPU rental renewal
- [ ] Cost monitoring dashboard updated
- [ ] Team trained on fallback procedures

---

## 6. Claude Code Instructions

### Prerequisites

```bash
# Ensure VPS provisioned (see 00B)
# Have Vast.ai account created
# Have IGNY8 codebase cloned locally
```

### Build Execution

**Step 1: GPU Infrastructure (Operator)**

```bash
# Manual: Set up Vast.ai account, rent GPU, note IP
# This requires manual interaction with the Vast.ai dashboard
# Once the IP is obtained, proceed to step 2
```

**Step 2: Vast.ai Setup (Automated)**

```bash
# Run from a machine with SSH access to the Vast.ai GPU server
VAST_AI_IP=""

ssh -i ~/.ssh/vast_key root@$VAST_AI_IP << 'EOF'
# Update system
apt update && apt upgrade -y

# Install Docker
curl -fsSL https://get.docker.com | sh
systemctl enable docker && systemctl start docker

# Create storage directories
mkdir -p /mnt/{models,ollama-cache,comfyui-models,comfyui-output}
chmod 777 /mnt/*

# Create docker network
docker network create ai-network

# Deploy Ollama
docker run -d \
  --name ollama \
  --network ai-network \
  --gpus all \
  -e OLLAMA_MODELS=/mnt/ollama-cache \
  -v /mnt/ollama-cache:/root/.ollama \
  -p 0.0.0.0:11434:11434 \
  ollama/ollama:latest

sleep 30

# Pull models (takes 1-2 hours)
docker exec ollama ollama pull qwen3:32b
docker exec ollama ollama pull qwen3:30b-a3b
docker exec ollama ollama pull qwen3:14b
docker exec ollama ollama pull qwen3:8b

# Deploy ComfyUI
# NOTE: "comfyui-docker:latest" is not an official image; assumes a
# ComfyUI image built locally or pulled from a private registry
docker run -d \
  --name comfyui \
  --network ai-network \
  --gpus all \
  -v /mnt/comfyui-models:/ComfyUI/models \
  -v /mnt/comfyui-output:/ComfyUI/output \
  -p 0.0.0.0:8188:8188 \
  comfyui-docker:latest

# Download image models
# NOTE: both Hugging Face repos are gated; verify the exact checkpoint
# filenames and supply an access token if the downloads fail
mkdir -p /mnt/comfyui-models/checkpoints
cd /mnt/comfyui-models/checkpoints
wget https://huggingface.co/black-forest-labs/FLUX.1-dev/resolve/main/flux1-dev-Q8_0.safetensors -O flux1-dev.safetensors
wget https://huggingface.co/stabilityai/stable-diffusion-3.5-large/resolve/main/sd_xl_base_1.0.safetensors -O sd3.5-large.safetensors

echo "✓ Vast.ai setup complete"
EOF
```
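Before wiring up the tunnel, it is worth confirming the containers actually serve what Step 2 pulled. A small verification sketch using only the standard library — the endpoints are Ollama's `/api/tags` and ComfyUI's `/system_stats` (both also used in the troubleshooting guide below), and the model list mirrors the pulls above:

```python
import json
from urllib.request import urlopen

# The four models pulled in Step 2
REQUIRED_MODELS = {"qwen3:32b", "qwen3:30b-a3b", "qwen3:14b", "qwen3:8b"}

def missing_models(tags_response: dict, required=frozenset(REQUIRED_MODELS)) -> set:
    """Given a parsed Ollama /api/tags response, return required models not present."""
    present = {entry.get("name", "") for entry in tags_response.get("models", [])}
    return set(required) - present

def verify_gpu_host(ollama_base: str = "http://localhost:11434",
                    comfyui_base: str = "http://localhost:8188") -> None:
    """Smoke-test both containers; raises on any gap.
    Run on the GPU host, or from the VPS once the tunnel (Step 3) is up."""
    with urlopen(f"{ollama_base}/api/tags", timeout=10) as resp:
        absent = missing_models(json.load(resp))
    if absent:
        raise RuntimeError(f"Ollama is missing models: {sorted(absent)}")
    with urlopen(f"{comfyui_base}/system_stats", timeout=10) as resp:
        if resp.status != 200:
            raise RuntimeError("ComfyUI /system_stats not responding")
```

Running `verify_gpu_host()` before tearing down SSH access catches a half-finished model pull early, which is cheaper than discovering it through LiteLLM timeouts later.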
**Step 3: VPS Tunnel Setup (Automated)**

```bash
# Run on IGNY8 VPS
VAST_AI_IP=""

# Install autossh
apt install autossh -y

# Create tunnel user
useradd -m -s /bin/bash tunnel-user
mkdir -p /home/tunnel-user/.ssh

# Copy SSH key (paste private key content between the markers)
cat > /home/tunnel-user/.ssh/vast_ai << 'KEY'
-----BEGIN RSA PRIVATE KEY-----
-----END RSA PRIVATE KEY-----
KEY

chmod 600 /home/tunnel-user/.ssh/vast_ai
chown -R tunnel-user:tunnel-user /home/tunnel-user/.ssh

# Create systemd service
# ExitOnForwardFailure=yes lets autossh restart the session when a
# port forward fails, instead of holding a broken tunnel open
cat > /etc/systemd/system/tunnel-vast-ai.service << 'SERVICE'
[Unit]
Description=SSH Tunnel to Vast.ai GPU Server
After=network.target
Wants=network-online.target

[Service]
Type=simple
User=tunnel-user
ExecStart=/usr/bin/autossh \
  -M 20000 \
  -N \
  -o "ServerAliveInterval=30" \
  -o "ServerAliveCountMax=3" \
  -o "ExitOnForwardFailure=yes" \
  -o "StrictHostKeyChecking=accept-new" \
  -i /home/tunnel-user/.ssh/vast_ai \
  -L 11434:localhost:11434 \
  -L 11435:localhost:11435 \
  -L 8188:localhost:8188 \
  root@VAST_AI_IP
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
SERVICE

# Update IP in service file
sed -i "s/VAST_AI_IP/$VAST_AI_IP/g" /etc/systemd/system/tunnel-vast-ai.service

# Start tunnel
systemctl daemon-reload
systemctl start tunnel-vast-ai
systemctl enable tunnel-vast-ai

# Wait and verify
sleep 5
netstat -tlnp | grep -E '(11434|8188)'
echo "✓ SSH tunnel operational"
```

**Step 4: LiteLLM Installation (Automated)**

```bash
# Run on IGNY8 VPS

# Install LiteLLM with proxy-server extras
pip install 'litellm[proxy]' python-dotenv requests

# Create directories
mkdir -p /opt/litellm

# Create config file
cat > /opt/litellm/config.yaml << 'CONFIG'
model_list:
  - model_name: gpt-4
    litellm_params:
      model: ollama/qwen3:32b
      api_base: http://localhost:11434
      timeout: 300
      max_tokens: 8000
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: ollama/qwen3:8b
      api_base: http://localhost:11434
      timeout: 120
      max_tokens: 2048
  # ComfyUI is not a built-in LiteLLM provider; this route assumes a
  # custom adapter exposing an OpenAI-style endpoint in front of ComfyUI
  - model_name: dall-e-3
    litellm_params:
      model: comfyui/flux.1-dev
      api_base: http://localhost:8188
      timeout: 120
  # External fallback route (referenced by the troubleshooting guide)
  - model_name: gpt-3.5-turbo-fallback
    litellm_params:
      model: openai/gpt-3.5-turbo
      api_key: os.environ/OPENAI_API_KEY

litellm_settings:
  verbose: true
  log_level: INFO
  cache_responses: true
CONFIG

# Create .env file
cat > /opt/litellm/.env << 'ENV'
OPENAI_API_KEY=your-openai-key
PORT=8000
HOST=127.0.0.1
ENV

# Create start script
cat > /opt/litellm/start.sh << 'SCRIPT'
#!/bin/bash
cd /opt/litellm
set -a; source .env; set +a  # export .env vars to the litellm process
litellm --config config.yaml --host 127.0.0.1 --port 8000 --num_workers 4
SCRIPT
chmod +x /opt/litellm/start.sh

# Create systemd service
cat > /etc/systemd/system/litellm.service << 'SERVICE'
[Unit]
Description=LiteLLM AI Proxy Gateway
After=network.target tunnel-vast-ai.service
Wants=tunnel-vast-ai.service

[Service]
Type=simple
User=root
WorkingDirectory=/opt/litellm
ExecStart=/opt/litellm/start.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
SERVICE

# Start LiteLLM
systemctl daemon-reload
systemctl start litellm
systemctl enable litellm

# Verify
sleep 5
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer test" \
  -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}'
echo "✓ LiteLLM operational"
```

**Step 5: IGNY8 Backend Integration (Developer)**

```bash
# In IGNY8 codebase:
# 1. Add to IntegrationProvider enum (backend/models/integration.py)
# 2. Update management command to initialize self-hosted settings
# 3. Implement AIEngineRouter with fallback logic
# 4. Update Celery tasks to use the router
# 5. Add database fields for provider tracking
# 6. Run migrations
# 7. Create health check monitoring

python manage.py makemigrations
python manage.py migrate

# Initialize self-hosted integration
python manage.py init_integrations
```

**Step 6: Verification (Automated)**

```bash
# Test full chain
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Write a 100-word article about clouds"}],
    "max_tokens": 200
  }'
# Expected response: article text from the Qwen3-32B model

# Test fallback by stopping the tunnel
systemctl stop tunnel-vast-ai
# Wait 10 seconds, then retry the request - it should now use OpenAI instead
```

---

## Timeline & Resource Allocation

| Phase | Days | Task | Owner | Status |
|-------|------|------|-------|--------|
| 1.1 | 1 | Vast.ai account & GPU rental | Operator | Ready |
| 1.2 | 1 | Docker & Ollama setup | DevOps | Ready |
| 1.3 | 1 | Model pulling & ComfyUI | DevOps | Ready |
| 2.1 | 0.5 | VPS tunnel infrastructure | DevOps | Ready |
| 2.2 | 0.5 | autossh systemd service | DevOps | Ready |
| 2.3 | 1 | LiteLLM installation & config | DevOps | Ready |
| 3.1 | 1 | Backend integration scaffolding | Developer | Ready |
| 3.2 | 1 | AI router & fallback logic | Developer | Ready |
| 3.3 | 1 | Celery task updates | Developer | Ready |
| 4.1 | 1 | Health check system | DevOps | Ready |
| 5.1 | 1 | Cost tracking & dashboard | Developer | Ready |
| **Total** | **10** | ~7 calendar days with DevOps and Developer phases run in parallel | | |

---

## Cost Analysis

### Monthly GPU Rental
- **Vast.ai 2x RTX 3090:** $180-220/month (auto-bid recommended)
- **Fixed cost:** $200/month (conservative)

### Monthly API Costs (Current)
Estimated current external API costs (before optimization):
- **OpenAI (GPT-4/3.5):** $800-1,200/month
- **Anthropic (Claude):** $200-400/month
- **Image generation (Runware/Bria):** $300-500/month
- **Total:** $1,300-2,100/month

### Monthly API Costs (After)
With self-hosted supplementing external:
- **Self-hosted cost:** $200/month (amortized GPU)
- **External APIs (fallback only):** $200-300/month
- **Total:** $400-500/month

### Savings & ROI
- **Monthly savings:** $800-1,700
- **Break-even:** ~4-8 days of savings covers the $200 monthly GPU rental
- **Annual savings:** $9,600-20,400

### Cost Per Subscriber
- **Before:** $26-42/subscriber/month (on $49/month tier)
- **After:** $8-10/subscriber/month
- **Improvement:** 65-76% cost reduction

---

## Troubleshooting Guide

### SSH Tunnel Not Connecting

```bash
# Check service status
systemctl status tunnel-vast-ai

# View detailed logs
journalctl -u tunnel-vast-ai -n 100 -f

# Test SSH manually
ssh -v -i /home/tunnel-user/.ssh/vast_ai root@

# Ensure the Vast.ai machine is still running and has bandwidth
```

### Ollama Not Responding

```bash
# Check container
docker ps | grep ollama

# View logs
docker logs -f ollama

# Test directly
docker exec ollama curl http://localhost:11434/api/tags

# Restart if needed
docker restart ollama
```

### ComfyUI Port Not Accessible

```bash
# Check container
docker ps | grep comfyui

# Test through tunnel
curl http://localhost:8188/system_stats

# Restart if needed
docker restart comfyui
```

### LiteLLM Timeouts

```bash
# Check LiteLLM logs
journalctl -u litellm -n 100

# Increase timeout in config.yaml, then test a simple request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 10}'
```

### Fallback to External APIs Not Working

```bash
# Verify OPENAI_API_KEY in /opt/litellm/.env

# Test the OpenAI route directly (with the tunnel disabled)
systemctl stop tunnel-vast-ai
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-3.5-turbo-fallback", "messages": [{"role": "user", "content": "Hi"}]}'
```

---

## Cross-References

**Dependency:** [00B VPS Provisioning & Infrastructure](./00B-vps-provisioning.md)
**Related:** [00A Project Planning](./00A-project-planning.md)
**Related:** [00C Database & Schema](./00C-database-schema.md)
**Related:** [00D Authentication & Security](./00D-auth-security.md)

---

## Document Version

| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2026-03-23 | Initial comprehensive build document |

---

**Status:** Ready for implementation
**Last Updated:** 2026-03-23
**Next Step:** Execute Phase 1 GPU infrastructure setup after 00B completion