# IGNY8 Phase 2: Content Optimizer (02F)
## Cluster-Aligned Content Optimization Engine

**Document Version:** 1.0
**Date:** 2026-03-23
**Phase:** IGNY8 Phase 2 (Feature Expansion)
**Status:** Build Ready
**Source of Truth:** Codebase at `/data/app/igny8/`
**Audience:** Claude Code, Backend Developers, Architects

---

## 1. CURRENT STATE

### Optimization App Today

The `optimization` Django app exists in `INSTALLED_APPS` but is **inactive** (behind a feature flag). The following exist:

- **`OptimizationTask` model**: exists with minimal fields (basic task tracking only)
- **`optimize_content` AI function**: registered in `igny8_core/ai/registry.py` as one of the 7 registered functions, but it only does basic content rewriting, without cluster awareness, keyword coverage analysis, or scoring
- **`optimization` app label**: app exists at `igny8_core/modules/optimization/`

### What Does Not Exist

- No cluster alignment during optimization
- No keyword coverage analysis against cluster keyword sets
- No heading restructure logic
- No intent-based content rewrite
- No schema gap detection
- No before/after scoring system (0-100)
- No batch optimization
- No integration with SAG data (01A) or taxonomy terms (02B)

### Foundation Available

- `Clusters` model (app_label=`planner`, db_table=`igny8_clusters`) with cluster keywords
- `Keywords` model (app_label=`planner`, db_table=`igny8_keywords`) linked to clusters
- `Content.schema_markup` JSONField, used by 02G for JSON-LD
- `Content.content_type` and `Content.content_structure`, used as routing context
- `Content.structured_data` JSONField (added by 02A)
- `ContentTaxonomy` cluster mapping (added by 02B) with `mapping_confidence`
- `GSCMetricsCache` (added by 02C), whose position data identifies pages needing optimization
- `SchemaValidationService` (added by 02G), reused for schema gap detection
- `BaseAIFunction` with `validate()`, `prepare()`, `build_prompt()`, `parse_response()`, `save_output()`

---

## 2. WHAT TO BUILD

### Overview

Extend the existing `OptimizationTask` model and `optimize_content` AI function into a full cluster-aligned optimization engine. The system analyzes content against its cluster's keyword set, scores quality on a 0-100 scale, and produces optimized content with tracked before/after metrics.

### 2.1 Cluster Matching (Auto-Assign Optimization Context)

When content has no cluster assignment, the optimizer auto-detects the best-fit cluster.

**Scoring Algorithm:**

- Keyword overlap (40%): count of cluster keywords found in content title + headings + body
- Semantic similarity (40%): AI-scored relevance between content topic and cluster theme
- Title match (20%): similarity between content title and cluster name/keywords

**Thresholds:**

- Confidence ≥ 0.6 → auto-assign cluster
- Confidence < 0.6 → flag for manual review, suggest top 3 candidates

This reuses the same scoring pattern as `ClusterMappingService` from 02B.

### 2.2 Keyword Coverage Analysis

For content with an assigned cluster:

1. Load all `Keywords` records belonging to that cluster
2. Scan `content_html` for each keyword: exact match, partial match (stemmed), semantic presence
3. Report per keyword: `{keyword, target_density, current_density, status: present|missing|low_density}`
4. Coverage targets:
   - Hub content (`cluster_hub`): 70%+ of cluster keywords covered
   - Supporting articles: 40%+ of cluster keywords covered
   - Product/service pages: 30%+ (focused on commercial keywords)

### 2.3 Heading Restructure

Analyze the H1/H2/H3 hierarchy against SEO best practices:

| Check | Rule | Fix |
|-------|------|-----|
| Single H1 | Content must have exactly one H1 | Merge or demote extra H1s |
| H2 keyword coverage | H2s should contain target keywords from cluster | AI rewrites H2s with keyword incorporation |
| Logical hierarchy | No skipped levels (H1 → H3 without H2) | Insert missing levels |
| H2 count | Minimum 3 H2s for content >1000 words | AI suggests additional H2 sections |
| Missing keyword themes | Cluster keywords not represented in any heading | AI suggests new H2/H3 sections for missing themes |

### 2.4 Content Rewrite (Intent-Aligned)

**Intent Classification:**

- **Informational**: expand explanations, add examples, increase depth, add definitions
- **Commercial**: add comparison tables, pros/cons, feature highlights, trust signals
- **Transactional**: strengthen CTAs, add urgency, streamline conversion path, add social proof

**Content Adjustments:**

- Expand thin content (<500 words) to the minimum viable length for the content structure
- Compress bloated content (detect and remove redundancy)
- Add missing sections identified by keyword coverage analysis
- Maintain existing tone and style while improving SEO alignment

### 2.5 Schema Gap Detection

Leverages `SchemaValidationService` from 02G:

1. Check existing `Content.schema_markup` against expected schemas for the content type
2. Expected schema by type: Article (post), Product (product), Service (service_page), FAQPage (if FAQ detected), BreadcrumbList (all), HowTo (if steps detected)
3. Identify missing required fields per schema type
4. Generate corrected/complete schema JSON-LD
5. Schema-only optimization mode available (no content rewrite, just schema fix)

### 2.6 Before/After Scoring

**Content Quality Score (0-100):**

| Factor | Weight | Score Criteria |
|--------|--------|----------------|
| Keyword Coverage | 30% | % of cluster keywords present vs target |
| Heading Structure | 20% | Single H1, keyword H2s, logical hierarchy, no skipped levels |
| Content Depth | 20% | Word count vs structure minimum, section completeness, detail level |
| Readability | 15% | Sentence length, paragraph length, Flesch-Kincaid approximation |
| Schema Completeness | 15% | Required schema fields present, validation passes |

Every optimization records `score_before` and `score_after`. Dashboard aggregates show average improvement across all optimizations.

### 2.7 Batch Optimization

- Select content by: cluster ID, score threshold (e.g., all content scoring < 50), content type, date range
- Queue as Celery tasks with priority ordering (lowest scores first)
- Concurrency: max 3 concurrent optimization tasks per account
- Progress tracking via the `OptimizationTask` status field
- Cancel capability: change status to `rejected` to stop processing

---

## 3. DATA MODELS & APIS

### 3.1 Modified Model: OptimizationTask (optimization app)

Extend the existing `OptimizationTask` model with 18 new fields:

```python
# Add to existing OptimizationTask model:

# Non-nullable FK: if igny8_optimization_tasks already has rows, the migration
# needs a one-off default or a temporary null=True.
content = models.ForeignKey(
    'writer.Content',
    on_delete=models.CASCADE,
    related_name='optimization_tasks'
)
primary_cluster = models.ForeignKey(
    'planner.Clusters',
    on_delete=models.SET_NULL,
    null=True,
    blank=True,
    related_name='optimization_tasks'
)
secondary_clusters = models.JSONField(
    default=list,
    blank=True,
    help_text='List of Clusters IDs for secondary relevance'
)
keyword_targets = models.JSONField(
    default=list,
    blank=True,
    help_text='[{keyword, target_density, current_density, status}]'
)
optimization_type = models.CharField(
    max_length=20,
    choices=[
        ('full_rewrite', 'Full Rewrite'),
        ('heading_only', 'Heading Only'),
        ('schema_only', 'Schema Only'),
        ('keyword_coverage', 'Keyword Coverage'),
        ('batch', 'Batch'),
    ],
    default='full_rewrite'
)
intent_classification = models.CharField(
    max_length=15,
    choices=[
        ('informational', 'Informational'),
        ('commercial', 'Commercial'),
        ('transactional', 'Transactional'),
    ],
    blank=True,
    default=''
)
score_before = models.FloatField(null=True, blank=True)
score_after = models.FloatField(null=True, blank=True)
content_before = models.TextField(
    blank=True,
    default='',
    help_text='Snapshot of original content_html'
)
content_after = models.TextField(
    blank=True,
    default='',
    help_text='Optimized HTML (empty until optimization completes)'
)
metadata_before = models.JSONField(
    default=dict,
    blank=True,
    help_text='{meta_title, meta_description, headings[]}'
)
metadata_after = models.JSONField(default=dict, blank=True)
schema_before = models.JSONField(default=dict, blank=True)
schema_after = models.JSONField(default=dict, blank=True)
structure_changes = models.JSONField(
    default=list,
    blank=True,
    help_text='[{change_type, description, before, after}]'
)
confidence_score = models.FloatField(
    null=True,
    blank=True,
    help_text='AI confidence in the quality of changes (0-1)'
)
applied = models.BooleanField(default=False)
applied_at = models.DateTimeField(null=True, blank=True)
```

**Update STATUS choices on OptimizationTask:**

```python
STATUS_CHOICES = [
    ('pending', 'Pending'),
    ('analyzing', 'Analyzing'),
    ('optimizing', 'Optimizing'),
    ('review', 'Ready for Review'),
    ('applied', 'Applied'),
    ('rejected', 'Rejected'),
]
```

**PK:** BigAutoField (integer), existing model
**Table:** existing `igny8_optimization_tasks` table (no rename needed)

### 3.2 Migration

Single migration in the optimization app (or igny8_core migrations):

```
igny8_core/migrations/XXXX_extend_optimization_task.py
```

**Operations:**

1. `AddField('OptimizationTask', 'content', ...)`: FK to Content
2. `AddField('OptimizationTask', 'primary_cluster', ...)`: FK to Clusters
3. `AddField('OptimizationTask', 'secondary_clusters', ...)`: JSONField
4. `AddField('OptimizationTask', 'keyword_targets', ...)`: JSONField
5. `AddField('OptimizationTask', 'optimization_type', ...)`: CharField
6. `AddField('OptimizationTask', 'intent_classification', ...)`: CharField
7. `AddField('OptimizationTask', 'score_before', ...)`: FloatField
8. `AddField('OptimizationTask', 'score_after', ...)`: FloatField
9. `AddField('OptimizationTask', 'content_before', ...)`: TextField
10. `AddField('OptimizationTask', 'content_after', ...)`: TextField
11. `AddField('OptimizationTask', 'metadata_before', ...)`: JSONField
12. `AddField('OptimizationTask', 'metadata_after', ...)`: JSONField
13. `AddField('OptimizationTask', 'schema_before', ...)`: JSONField
14. `AddField('OptimizationTask', 'schema_after', ...)`: JSONField
15. `AddField('OptimizationTask', 'structure_changes', ...)`: JSONField
16. `AddField('OptimizationTask', 'confidence_score', ...)`: FloatField
17. `AddField('OptimizationTask', 'applied', ...)`: BooleanField
18. `AddField('OptimizationTask', 'applied_at', ...)`: DateTimeField

### 3.3 API Endpoints

All endpoints under `/api/v1/optimizer/`:

| Method | Path | Description |
|--------|------|-------------|
| POST | `/api/v1/optimizer/analyze/` | Analyze single content piece. Body: `{content_id}`. Returns scores + keyword coverage + heading analysis + recommendations. Does NOT rewrite. |
| POST | `/api/v1/optimizer/optimize/` | Run full optimization. Body: `{content_id, optimization_type}`. Creates OptimizationTask, runs analysis + rewrite, returns preview. |
| POST | `/api/v1/optimizer/preview/` | Preview changes without creating task. Body: `{content_id}`. Returns diff-style output. |
| POST | `/api/v1/optimizer/apply/{id}/` | Apply optimized version. Copies `content_after` → `Content.content_html`, updates metadata, sets `applied=True`. |
| POST | `/api/v1/optimizer/reject/{id}/` | Reject optimization. Sets status=`rejected`, keeps original content. |
| POST | `/api/v1/optimizer/batch/` | Queue batch optimization. Body: `{site_id, cluster_id?, score_threshold?, content_type?, content_ids?}`. Returns batch task ID. |
| GET | `/api/v1/optimizer/tasks/?site_id=X` | List OptimizationTask records with filters (status, optimization_type, cluster_id, date range). |
| GET | `/api/v1/optimizer/tasks/{id}/` | Single optimization detail with full before/after data. |
| GET | `/api/v1/optimizer/tasks/{id}/diff/` | HTML diff view: visual comparison of content_before vs content_after. |
| GET | `/api/v1/optimizer/cluster-suggestions/?content_id=X` | Suggest best-fit cluster for unassigned content. Returns top 3 candidates with confidence scores. |
| POST | `/api/v1/optimizer/assign-cluster/` | Assign cluster to content. Body: `{content_id, cluster_id}`. Updates Content record. |
| GET | `/api/v1/optimizer/dashboard/?site_id=X` | Optimization stats: avg score improvement, count by status, top improved, lowest scoring content. |

**Permissions:** All endpoints use `SiteSectorModelViewSet` permission patterns.

### 3.4 AI Function: Enhanced optimize_content

Extend the existing registered `optimize_content` AI function:

**Registry key:** `optimize_content` (already registered: enhance, do not replace)
**Location:** `igny8_core/ai/functions/optimize_content.py` (existing file)

```python
class OptimizeContentFunction(BaseAIFunction):
    """
    Enhanced cluster-aligned content optimization.
    Extends existing optimize_content with keyword coverage,
    heading restructure, intent classification, and scoring.
    """
    function_name = 'optimize_content'

    def validate(self, content_id, optimization_type='full_rewrite', **kwargs):
        # Verify content exists, has content_html
        # Verify optimization_type is valid
        pass

    def prepare(self, content_id, optimization_type='full_rewrite', **kwargs):
        # Load Content record
        # Determine cluster (from Content or auto-match)
        # Load cluster Keywords
        # Analyze current keyword coverage
        # Parse heading structure
        # Classify intent
        # Calculate score_before
        # Snapshot content_before, metadata_before, schema_before
        pass

    def build_prompt(self):
        # Build type-specific optimization prompt:
        # - Include current content_html
        # - Include cluster keywords with coverage status
        # - Include heading analysis results
        # - Include intent classification
        # - Include optimization_type instructions:
        #     full_rewrite: all optimizations
        #     heading_only: heading restructure only
        #     schema_only: schema fix only (no content change)
        #     keyword_coverage: add missing keyword sections only
        pass

    def parse_response(self, response):
        # Parse optimized HTML
        # Parse updated metadata (meta_title, meta_description)
        # Parse structure_changes list
        # Parse confidence_score
        pass

    def save_output(self, parsed):
        # Create OptimizationTask with all before/after data
        # Calculate score_after
        # Set status='review'
        pass
```

### 3.5 Content Scoring Service

**Location:** `igny8_core/business/content_scoring.py`

```python
class ContentScoringService:
    """
    Calculates Content Quality Score (0-100) using 5 weighted factors.
    Used by optimizer for before/after and by dashboard for overview.
    """

    WEIGHTS = {
        'keyword_coverage': 0.30,
        'heading_structure': 0.20,
        'content_depth': 0.20,
        'readability': 0.15,
        'schema_completeness': 0.15,
    }

    def score(self, content_id, cluster_id=None):
        """
        Calculate composite score for a content record.
        Returns: {total: float, breakdown: {factor: score}}
        """
        pass

    def _score_keyword_coverage(self, content, cluster):
        """0-100: % of cluster keywords found in content."""
        pass

    def _score_heading_structure(self, content_html):
        """0-100: single H1, keyword H2s, no skipped levels, H2 count."""
        pass

    def _score_content_depth(self, content_html, content_structure):
        """0-100: word count vs minimum for structure type, section completeness."""
        pass

    def _score_readability(self, content_html):
        """0-100: avg sentence length, paragraph length, Flesch-Kincaid approx."""
        pass

    def _score_schema_completeness(self, content):
        """0-100: required schema fields present, from SchemaValidationService (02G)."""
        pass
```

### 3.6 Keyword Coverage Analyzer

**Location:** `igny8_core/business/keyword_coverage.py`

```python
class KeywordCoverageAnalyzer:
    """
    Analyzes content against cluster keyword set.
    Returns per-keyword presence and overall coverage percentage.
    """

    def analyze(self, content_id, cluster_id):
        """
        Returns {
            total_keywords: int,
            covered: int,
            missing: int,
            coverage_pct: float,
            keywords: [{keyword, target_density, current_density, status}]
        }
        """
        pass

    def _extract_text(self, content_html):
        """Strip HTML, return plain text for analysis."""
        pass

    def _check_keyword(self, keyword, text):
        """Check for exact, partial (stemmed), and semantic presence."""
        pass
```

### 3.7 Celery Tasks

**Location:** `igny8_core/tasks/optimization_tasks.py`

```python
from celery import shared_task


@shared_task(name='run_optimization')
def run_optimization(optimization_task_id):
    """Process a single OptimizationTask. Called by API endpoints."""
    pass


@shared_task(name='run_batch_optimization')
def run_batch_optimization(site_id, cluster_id=None, score_threshold=None,
                           content_type=None, content_ids=None, batch_size=10):
    """
    Process batch of content for optimization.
    Selects content matching filters, creates OptimizationTask per item,
    and dispatches them lowest score first, with at most 3 running
    concurrently per account.
    """
    pass


@shared_task(name='identify_optimization_candidates')
def identify_optimization_candidates(site_id, threshold=50):
    """
    Weekly scan: find content with quality score below threshold.
    Creates report, does NOT auto-optimize.
    """
    pass
```

**Beat Schedule Addition:**

| Task | Schedule | Notes |
|------|----------|-------|
| `identify_optimization_candidates` | Weekly (Monday 4:00 AM) | Scans all sites, identifies low-scoring content |

---

## 4. IMPLEMENTATION STEPS

### Step 1: Migration

1. Add 18 new fields to `OptimizationTask` model
2. Update STATUS_CHOICES on OptimizationTask
3. Run migration

### Step 2: Services

1. Implement `ContentScoringService` in `igny8_core/business/content_scoring.py`
2. Implement `KeywordCoverageAnalyzer` in `igny8_core/business/keyword_coverage.py`

### Step 3: AI Function Enhancement

1. Extend `OptimizeContentFunction` in `igny8_core/ai/functions/optimize_content.py`
2. Add cluster alignment, keyword coverage, heading analysis, intent classification, scoring
3. Maintain backward compatibility: existing `optimize_content` calls must still work

### Step 4: API Endpoints

1. Add optimizer endpoints to `igny8_core/urls/optimizer.py` (create the file if it does not exist)
2. Create views: `AnalyzeView`, `OptimizeView`, `PreviewView`, `ApplyView`, `RejectView`, `BatchView`
3. Create `ClusterSuggestionsView`, `AssignClusterView`, `DashboardView`, `DiffView`
4. Register URL patterns under `/api/v1/optimizer/`

### Step 5: Celery Tasks

1. Implement `run_optimization`, `run_batch_optimization`, `identify_optimization_candidates`
2. Add `identify_optimization_candidates` to Celery beat schedule

### Step 6: Serializers & Admin

1. Update DRF serializer for extended OptimizationTask (include all 18 new fields)
2. Create nested serializers for before/after views
3. Update Django admin registration

### Step 7: Credit Cost Configuration

Add to `CreditCostConfig` (billing app):

| operation_type | default_cost | description |
|----------------|--------------|-------------|
| `optimization_analysis` | 2 | Analyze single content (scoring + keyword coverage) |
| `optimization_full_rewrite` | 5-8 | Full rewrite optimization (varies by content length) |
| `optimization_schema_only` | 1 | Schema gap fix only |
| `optimization_batch` | 15-25 | Batch optimization for 10 items |

Credit deduction follows the existing `CreditUsageLog` pattern.

---

## 5. ACCEPTANCE CRITERIA

### Cluster Matching

- [ ] Content without cluster assignment gets auto-matched with confidence scoring
- [ ] Confidence ≥ 0.6 auto-assigns; < 0.6 flags for manual review with top 3 suggestions
- [ ] Cluster suggestions endpoint returns ranked candidates

### Keyword Coverage

- [ ] All cluster keywords analyzed for presence in content
- [ ] Coverage report includes exact match, partial match, and missing keywords
- [ ] Hub content targets 70%+, supporting articles 40%+, product/service 30%+

### Heading Restructure

- [ ] H1/H2/H3 hierarchy validated (single H1, no skipped levels)
- [ ] Missing keyword themes identified and new headings suggested
- [ ] AI rewrites headings incorporating target keywords while maintaining meaning

### Content Rewrite

- [ ] Intent classified correctly (informational/commercial/transactional)
- [ ] Rewrite adjusts content structure based on intent
- [ ] Thin content expanded, bloated content compressed
- [ ] Missing keyword sections added

### Scoring

- [ ] Score 0-100 calculated with 5 weighted factors
- [ ] score_before recorded before any changes
- [ ] score_after recorded after optimization
- [ ] Dashboard shows average improvement and distribution

### Before/After

- [ ] Full snapshot of original content preserved in content_before
- [ ] Optimized version stored in content_after without auto-applying
- [ ] Diff view provides visual HTML comparison
- [ ] Apply action copies content_after → Content.content_html
- [ ] Reject action preserves original, marks task rejected

### Batch

- [ ] Batch optimization selects content by cluster, score threshold, type, or explicit IDs
- [ ] Max 3 concurrent optimizations per account enforced
- [ ] Progress trackable via OptimizationTask status
- [ ] Weekly candidate identification runs without auto-optimizing

### Integration

- [ ] Schema gap detection leverages SchemaValidationService from 02G
- [ ] Credit costs deducted per CreditCostConfig entries
- [ ] All API endpoints respect account/site permission boundaries

---

## 6. CLAUDE CODE INSTRUCTIONS

### File Locations

```
igny8_core/
├── ai/
│   └── functions/
│       └── optimize_content.py      # Enhance existing function
├── business/
│   ├── content_scoring.py           # ContentScoringService
│   └── keyword_coverage.py          # KeywordCoverageAnalyzer
├── tasks/
│   └── optimization_tasks.py        # Celery tasks
├── urls/
│   └── optimizer.py                 # Optimizer endpoints
└── migrations/
    └── XXXX_extend_optimization_task.py
```

### Conventions

- **PKs:** BigAutoField (integer); do NOT use UUIDs
- **Table prefix:** `igny8_` (existing table `igny8_optimization_tasks`)
- **Celery app name:** `igny8_core`
- **URL pattern:** `/api/v1/optimizer/...`
- **Permissions:** Use `SiteSectorModelViewSet` permission pattern
- **AI functions:** Extend the existing `BaseAIFunction` subclass; do NOT create a new registration key, enhance the existing `optimize_content`
- **Frontend:** `.tsx` files with Zustand stores for state management

### Cross-References

| Doc | Relationship |
|-----|--------------|
| **02B** | Taxonomy terms get cluster context for optimization; ClusterMappingService scoring pattern reused |
| **02G** | SchemaValidationService used for schema gap detection; schema_only optimization triggers 02G schema generation |
| **02C** | GSC position data identifies pages needing optimization (high impressions, low clicks) |
| **02D** | Optimizer identifies internal link opportunities and feeds them to linker |
| **01E** | Blueprint-aware pipeline sets initial content quality; optimizer improves post-generation |
| **01A** | SAGBlueprint/SAGCluster data provides cluster context for optimization |
| **01G** | SAG health monitoring can incorporate content quality scores as a health factor |

### Key Decisions

1. **Extend, don't replace.** The existing `OptimizationTask` model and `optimize_content` AI function are enhanced, not replaced with new models
2. **Preview-first workflow.** Optimizations always produce a preview (status=`review`) before applying to Content
3. **Content snapshot.** Full HTML snapshot stored in `content_before` for rollback capability
4. **Score reuse.** `ContentScoringService` is a standalone service usable by other modules (02G schema audit, 01G health monitoring)
5. **Schema delegation.** Schema gap detection reuses 02G's `SchemaValidationService` rather than duplicating logic
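
The cluster matching rule in section 2.1 (40/40/20 weighting, 0.6 auto-assign threshold, top 3 suggestions otherwise) can be sketched as below. This is a minimal illustration, not the `ClusterMappingService` implementation: `ClusterCandidate`, `pick_cluster`, and the assumption that each sub-score is already normalized to 0-1 are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Weights and threshold come from section 2.1; all names here are hypothetical.
MATCH_WEIGHTS = {
    "keyword_overlap": 0.40,
    "semantic_similarity": 0.40,
    "title_match": 0.20,
}
AUTO_ASSIGN_THRESHOLD = 0.6


@dataclass
class ClusterCandidate:
    cluster_id: int
    keyword_overlap: float      # assumed normalized to 0-1
    semantic_similarity: float  # assumed normalized to 0-1
    title_match: float          # assumed normalized to 0-1

    @property
    def confidence(self) -> float:
        """Weighted sum of the three sub-scores, rounded to 4 decimals."""
        return round(
            MATCH_WEIGHTS["keyword_overlap"] * self.keyword_overlap
            + MATCH_WEIGHTS["semantic_similarity"] * self.semantic_similarity
            + MATCH_WEIGHTS["title_match"] * self.title_match,
            4,
        )


def pick_cluster(candidates: List[ClusterCandidate]) -> dict:
    """Auto-assign the best candidate at >= 0.6, else suggest the top 3."""
    ranked = sorted(candidates, key=lambda c: c.confidence, reverse=True)
    if ranked and ranked[0].confidence >= AUTO_ASSIGN_THRESHOLD:
        return {
            "action": "auto_assign",
            "cluster_id": ranked[0].cluster_id,
            "suggestions": [],
        }
    return {
        "action": "manual_review",
        "cluster_id": None,
        "suggestions": [c.cluster_id for c in ranked[:3]],
    }
```

Keeping the threshold decision separate from the score calculation would let the cluster-suggestions endpoint reuse the ranking without triggering an assignment.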
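
The Content Quality Score from sections 2.6 and 3.5 is a weighted sum of five factor scores. The sketch below shows only that combination step, assuming each factor has already been scored on a 0-100 scale; `composite_score` is a hypothetical helper name, and clamping out-of-range inputs is an added assumption rather than something the spec requires.

```python
# Weights copied from ContentScoringService.WEIGHTS in section 3.5.
WEIGHTS = {
    "keyword_coverage": 0.30,
    "heading_structure": 0.20,
    "content_depth": 0.20,
    "readability": 0.15,
    "schema_completeness": 0.15,
}


def composite_score(breakdown: dict) -> dict:
    """Combine per-factor scores (0-100 each) into one 0-100 total.

    Hypothetical helper: returns the same {total, breakdown} shape that
    ContentScoringService.score() documents.
    """
    missing = sorted(set(WEIGHTS) - set(breakdown))
    if missing:
        raise ValueError(f"missing factor scores: {missing}")
    # Clamp each factor to 0-100 so one bad input cannot skew the total.
    clamped = {
        name: min(max(float(breakdown[name]), 0.0), 100.0) for name in WEIGHTS
    }
    total = sum(WEIGHTS[name] * value for name, value in clamped.items())
    return {"total": round(total, 1), "breakdown": clamped}
```

For example, factor scores of 50, 80, 60, 70, and 40 (in the order of the table in section 2.6) combine to 15 + 16 + 12 + 10.5 + 6 = 59.5.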
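
Two of the heading checks in section 2.3 (exactly one H1, no skipped levels) are deterministic and need no AI call. A minimal sketch using only the standard library's `html.parser` follows; `check_headings` and the issue message format are hypothetical, and the keyword-related checks from the table are left out.

```python
from html.parser import HTMLParser


class _HeadingCollector(HTMLParser):
    """Collects (level, text) pairs for h1-h6 elements in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so "h1".."h6" is enough.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._buffer).strip()))
            self._level = None


def check_headings(content_html: str) -> dict:
    """Report H1 count problems and skipped levels (e.g. H1 -> H3)."""
    parser = _HeadingCollector()
    parser.feed(content_html)
    parser.close()
    levels = [level for level, _ in parser.headings]
    issues = []
    h1_count = levels.count(1)
    if h1_count != 1:
        issues.append(f"expected exactly one H1, found {h1_count}")
    for previous, current in zip(levels, levels[1:]):
        if current > previous + 1:
            issues.append(f"skipped level: H{previous} -> H{current}")
    return {"headings": parser.headings, "issues": issues}
```

An empty `issues` list could map to full marks for the structural part of the Heading Structure factor, with the keyword checks scored separately.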
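
The coverage targets in section 2.2 (70% for hubs, 40% for supporting articles, 30% for product/service pages) can be checked with a simple exact-match pass. The sketch below is deliberately narrower than `KeywordCoverageAnalyzer`: it covers only exact phrase matches, omits stemmed and semantic matching and density figures, and strips tags with a crude regex. The keys `supporting` and `product_service` and the function names are hypothetical; only `cluster_hub` appears in the spec.

```python
import re
from html import unescape

# Targets from section 2.2. Only "cluster_hub" is a name used in the spec;
# the other two keys are hypothetical labels for this sketch.
COVERAGE_TARGETS = {
    "cluster_hub": 0.70,
    "supporting": 0.40,
    "product_service": 0.30,
}


def extract_text(content_html: str) -> str:
    """Crude tag stripping for illustration; returns lowercased plain text."""
    without_tags = re.sub(r"<[^>]+>", " ", content_html)
    return re.sub(r"\s+", " ", unescape(without_tags)).strip().lower()


def coverage_report(content_html: str, keywords: list, content_kind: str) -> dict:
    """Exact-match coverage only; stemmed and semantic checks are omitted."""
    text = extract_text(content_html)
    rows = []
    for keyword in keywords:
        pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
        occurrences = len(re.findall(pattern, text))
        rows.append({
            "keyword": keyword,
            "occurrences": occurrences,
            "status": "present" if occurrences else "missing",
        })
    covered = sum(1 for row in rows if row["status"] == "present")
    coverage = covered / len(keywords) if keywords else 0.0
    target = COVERAGE_TARGETS[content_kind]
    return {
        "total_keywords": len(keywords),
        "covered": covered,
        "missing": len(keywords) - covered,
        "coverage_pct": round(coverage * 100, 1),
        "meets_target": coverage >= target,
        "keywords": rows,
    }
```

The returned dictionary mirrors the shape documented for `KeywordCoverageAnalyzer.analyze()`, with `meets_target` added as an assumption for this example.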