igny8/v2/V2-Execution-Docs/01C-cluster-formation-keyword-engine.md

# IGNY8 Phase 1: Cluster Formation & Keyword Engine (Doc 01C)

> **Version:** 1.1 (codebase-verified)
> **Source of Truth:** Codebase at `/data/app/igny8/backend/`
> **Last Verified:** 2025-07-14

**Document Version:** 1.1
**Date:** 2026-03-23
**Phase:** Phase 1 - Foundation & Intelligence
**Status:** Build Ready

---

## 1. Current State

### Existing Components
- **SAGBlueprint** (01A): Data model with status tracking, blueprint lifecycle management
- **SAGAttribute** & **SAGCluster** models (01A): Schema definitions for attributes and topic clusters
- **SectorAttributeTemplate** (01B): Pre-configured attribute framework with keyword templates per site_type
- **Setup Wizard** (01D): Collects sector, site_type, and populated attribute values from user
- **Blueprint Service** (01G - earlier iteration): Basic blueprint assembly, denormalization

### Current Limitations
- No automated cluster formation from attribute intersection logic
- No keyword generation from templates
- No conflict resolution for multi-cluster keyword assignments
- No cluster type classification (product, condition, feature, etc.)
- No validation of cluster viability (size, coherence, user demand)
- No hub title and supporting content plan generation

### Dependencies Ready
- ✅ Sector attribute templates loaded with keyword templates
- ✅ Setup wizard populates attributes
- ✅ Data models support cluster and keyword storage
- ✅ Blueprint lifecycle framework exists

---

## 2. What to Build

### 2.1 Cluster Formation AI Function
**File:** `sag/ai_functions/cluster_formation.py`
**Register Key:** `'form_clusters'`
**Triggering Context:** After user populates attributes in setup wizard; before keyword assignment

#### Input Contract
```python
{
    "populated_attributes": [
        {"name": "Target Audience", "values": ["Pet Owners", "Veterinarians"]},
        {"name": "Pet Type", "values": ["Dogs", "Cats"]},
        {"name": "Health Condition", "values": ["Allergies", "Arthritis", "Obesity"]}
    ],
    "sector_context": {
        "sector_id": int,  # FK to igny8_core_auth.Sector (BigAutoField PK)
        "site_type": "ecommerce|saas|blog|local_service",
        "sector_name": str
    },
    "constraints": {
        "max_clusters": 50,  # hard cap per sector
        "min_keywords_per_cluster": 5,
        "max_keywords_per_cluster": 20,
        "optimal_keywords_per_cluster": (7, 15)  # inclusive (min, max) range
    }
}
```

#### Output Contract
```python
{
    "clusters": [
        {
            "id": "cluster_001",
            "title": "Dog Arthritis Relief Solutions",
            "type": "product_category",  # or condition_problem, feature, brand, informational, comparison, life_stage
            "dimensions": {
                "primary": ["Pet Type: Dogs", "Health Condition: Arthritis"],
                "secondary": ["Target Audience: Pet Owners"]
            },
            "intersection_depth": 3,  # count of dimensional intersections
            "viability_score": 0.92,  # 0-1 based on coherence + demand assessment
            "hub_title": "Best Arthritis Treatments for Dogs",
            "supporting_content_plan": [
                "Senior Dog Arthritis: Causes & Prevention",
                "Dog Arthritis Medications: Complete Guide",
                "Physical Therapy Exercises for Dogs with Arthritis",
                "Diet Changes to Support Joint Health",
                "When to See a Vet About Dog Joint Pain"
            ],
            "keywords": [],  # populated in keyword generation phase
            "dimension_count": 3,
            "validation": {
                "is_real_topical_ecosystem": True,
                "has_search_demand": True,
                "can_support_content_plan": True,
                "sufficient_differentiation": True
            }
        },
        # ... more clusters
    ],
    "summary": {
        "total_clusters_formed": 12,
        "type_distribution": {
            "product_category": 6,
            "condition_problem": 4,
            "feature": 1,
            "brand": 0,
            "informational": 1,
            "comparison": 0
        },
        "avg_intersection_depth": 2.3,
        "clusters_below_viability_threshold": 0
    }
}
```

#### Algorithm (Pseudocode)

```
FUNCTION form_clusters(populated_attributes, sector_context, constraints):

    # STEP 1: Generate all 2-value intersections
    all_intersections = []
    for each attribute_pair in COMBINATIONS(populated_attributes, size=2):
        for value1 in attribute_pair[0].values:
            for value2 in attribute_pair[1].values:
                intersection = {
                    "dimensions": [value1, value2],
                    "attribute_names": [attribute_pair[0].name, attribute_pair[1].name]
                }
                all_intersections.append(intersection)

    # Also generate 3-value intersections for strong coherence
    for each attribute_triplet in COMBINATIONS(populated_attributes, size=3):
        for value1 in attribute_triplet[0].values:
            for value2 in attribute_triplet[1].values:
                for value3 in attribute_triplet[2].values:
                    intersection = {
                        "dimensions": [value1, value2, value3],
                        "attribute_names": [attribute_triplet[0].name, attribute_triplet[1].name, attribute_triplet[2].name]
                    }
                    all_intersections.append(intersection)

    # STEP 2: AI evaluates each intersection
    valid_clusters = []
    for intersection in all_intersections:
        evaluation = AI_EVALUATE_INTERSECTION(intersection, sector_context):
            - Is this a real topical ecosystem?
            - Would users search for this combination?
            - Can we build a hub + 3-10 supporting articles?
            - Is there sufficient differentiation from other clusters?
            - Does the combination make semantic sense?

        if evaluation.is_valid:
            # STEP 3: Classify cluster type
            cluster_type = AI_CLASSIFY_TYPE(intersection)
                → product_category, condition_problem, feature, brand,
                  informational, comparison

            # STEP 4: Generate hub title + supporting content plan
            hub_title = AI_GENERATE_HUB_TITLE(intersection, sector_context)
            supporting_titles = AI_GENERATE_SUPPORTING_TITLES(
                hub_title,
                intersection,
                count=5-8
            )

            # Create cluster object
            cluster = {
                "dimensions": intersection.dimensions,
                "type": cluster_type,
                "viability_score": evaluation.confidence_score,
                "hub_title": hub_title,
                "supporting_content_plan": supporting_titles,
                "validation": evaluation
            }
            valid_clusters.append(cluster)

    # STEP 5: Apply constraints & filtering
    sorted_clusters = SORT_BY_VIABILITY_SCORE(valid_clusters)  # descending
    final_clusters = sorted_clusters[0:max_clusters]  # max_clusters from input constraints (hard cap: 50)

    # STEP 6: Validate distribution & completeness
    distribution = CALCULATE_TYPE_DISTRIBUTION(final_clusters)

    # Flag if any type is severely under-represented
    if distribution.imbalance > THRESHOLD:
        LOG_WARNING("Type distribution may be suboptimal")

    # STEP 7: Return with summary (keys match the Output Contract)
    return {
        "clusters": final_clusters,
        "summary": {
            "total_clusters_formed": len(final_clusters),
            "type_distribution": distribution,
            "avg_intersection_depth": MEAN(intersection_depth of final_clusters),
            "clusters_below_viability_threshold": COUNT(final_clusters with viability_score < 0.70)
        }
    }

END FUNCTION
```
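
Step 1 of the pseudocode can be sketched in plain Python with `itertools`. This is a minimal illustration, not the production implementation: the helper name `generate_intersections` and the `"Attribute: Value"` label format (borrowed from the Output Contract) are assumptions.

```python
from itertools import combinations, product


def generate_intersections(populated_attributes, sizes=(2, 3)):
    """Enumerate every cross-attribute value combination of the given sizes.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    intersections = []
    for size in sizes:
        # Pick `size` distinct attributes, then take one value from each.
        for attrs in combinations(populated_attributes, size):
            for values in product(*(a["values"] for a in attrs)):
                intersections.append({
                    "dimensions": [f'{a["name"]}: {v}' for a, v in zip(attrs, values)],
                    "attribute_names": [a["name"] for a in attrs],
                })
    return intersections


populated_attributes = [
    {"name": "Target Audience", "values": ["Pet Owners", "Veterinarians"]},
    {"name": "Pet Type", "values": ["Dogs", "Cats"]},
    {"name": "Health Condition", "values": ["Allergies", "Arthritis", "Obesity"]},
]
candidates = generate_intersections(populated_attributes)
```

For the sample attributes this yields 16 two-value and 12 three-value candidates, each of which would then go through the AI evaluation in Step 2.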

#### AI Evaluation Criteria
For each intersection, the AI must answer:

1. **Real Topical Ecosystem?**
   - Do the dimensions naturally connect in user intent?
   - Is there an existing product/service/solution category?
   - Example: YES - "Dog Arthritis Relief" (real problem + real solutions)
   - Example: NO - "Vegetarian Chainsaw" (nonsensical combination)

2. **User Search Demand?**
   - Would users actively search for this combination?
   - Check: keyword templates, search volume patterns, user forums
   - Target: ≥500 monthly searches for hub keyword

3. **Content Support?**
   - Can we create 1 hub + 3-10 supporting articles?
   - Is there enough subtopic depth?
   - Example: YES - "Dog Arthritis" can have medication, exercise, diet, vet visits
   - Example: NO - "Red Dog Collar" (too niche, limited subtopics)

4. **Sufficient Differentiation?**
   - Does this cluster stand apart from others?
   - Avoid near-duplicate clusters (e.g., "Dog Joint Health" vs "Dog Arthritis")
   - Decision: merge or reject the weaker one

5. **Dimensional Clarity**
   - Do all dimensions contribute meaningfully?
   - Remove secondary dimensions that don't add coherence

#### Hard Constraints
- **Maximum Clusters:** 50 per sector (enforce in sorting/filtering)
- **Minimum Keywords per Cluster:** 5 (checked in keyword generation)
- **Maximum Keywords per Cluster:** 20 (checked in keyword generation)
- **Optimal Range:** 7-15 keywords per cluster
- **No Keyword Duplication:** Each keyword in exactly one cluster (enforced in conflict resolution)
- **Type Distribution Target:**
  - Product/Service Type: 40-50%
  - Condition/Problem: 20-30%
  - Feature: 10-15%
  - Brand: 5-10%
  - Life Stage/Audience: 5-10%
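
The type distribution target can be checked mechanically once clusters are formed. The sketch below is one possible check, assuming the target labels map onto the `cluster_type` codes used elsewhere in this document; `TYPE_TARGETS` and `check_type_distribution` are hypothetical names.

```python
from collections import Counter

# Target shares copied from the list above, keyed by cluster type code (assumed mapping).
TYPE_TARGETS = {
    "product_category": (0.40, 0.50),
    "condition_problem": (0.20, 0.30),
    "feature": (0.10, 0.15),
    "brand": (0.05, 0.10),
    "life_stage": (0.05, 0.10),
}


def check_type_distribution(clusters, targets=TYPE_TARGETS):
    """Report each type's share of all clusters and whether it is inside its target range."""
    counts = Counter(c["type"] for c in clusters)
    total = len(clusters)
    report = {}
    for cluster_type, (low, high) in targets.items():
        share = counts.get(cluster_type, 0) / total if total else 0.0
        report[cluster_type] = {
            "share": round(share, 2),
            "in_range": low <= share <= high,
        }
    return report


# Same counts as the example summary in the Output Contract (12 clusters).
sample = (
    [{"type": "product_category"}] * 6
    + [{"type": "condition_problem"}] * 4
    + [{"type": "feature"}]
    + [{"type": "informational"}]
)
report = check_type_distribution(sample)
```

With the example summary's counts, only `product_category` lands inside its range, which is the kind of imbalance the formation step is expected to log as a warning.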

---

### 2.2 Keyword Auto-Generation AI Function
**File:** `sag/ai_functions/keyword_generation.py`
**Register Key:** `'generate_keywords'`
**Triggering Context:** After cluster formation; before blueprint assembly

#### Input Contract
```python
{
    "clusters": [  # output from cluster_formation
        {
            "id": "cluster_001",
            "dimensions": ["Pet Type: Dogs", "Health Condition: Arthritis"],
            "hub_title": "Best Arthritis Treatments for Dogs",
            "supporting_content_plan": [...]
        }
    ],
    "sector_context": {
        "sector_id": int,  # FK to igny8_core_auth.Sector (BigAutoField PK)
        "site_type": "ecommerce|saas|blog|local_service",
        "site_intent": "sell|inform|book|download"
    },
    "keyword_templates": {  # loaded from SectorAttributeTemplate
        "template_001": "best {health_condition} treatment for {pet_type}",
        "template_002": "{pet_type} {health_condition} treatment",
        # ... more templates
    },
    "constraints": {
        "min_keywords_per_cluster": 10,
        "max_keywords_per_cluster": 25,
        "total_target": "300-500"
    }
}
```

#### Output Contract
```python
{
    "keywords_per_cluster": {
        "cluster_001": {
            "keywords": [
                {
                    "keyword": "best arthritis treatment for dogs",
                    "search_volume": 1200,
                    "difficulty": "medium",
                    "intent": "informational",
                    "generated_from": "template_001",
                    "variant_type": "long_tail"
                },
                {
                    "keyword": "dog arthritis remedies",
                    "search_volume": 800,
                    "difficulty": "easy",
                    "intent": "informational",
                    "generated_from": "template_002",
                    "variant_type": "base"
                },
                # ... 13-23 more keywords
            ],
            "keyword_count": 15,
            "primary_intent": "informational",
            "search_volume_total": 12500
        }
    },
    "deduplication": {
        "duplicates_removed": 8,
        "flagged_conflicts": 3  # keywords fitting multiple clusters
    },
    "summary": {
        "total_unique_keywords": 342,
        "per_cluster_avg": 14.25,
        "total_search_volume": 892000,
        "within_constraints": True
    }
}
```

#### Algorithm (Pseudocode)

```
FUNCTION generate_keywords(clusters, sector_context, keyword_templates):

    all_keywords = {}

    FOR EACH cluster IN clusters:

        # STEP 1: Extract attribute values from cluster dimensions
        attribute_values = EXTRACT_ATTRIBUTE_VALUES(cluster.dimensions)
        # Output: {"Pet Type": "Dogs", "Health Condition": "Arthritis", ...}

        cluster_keywords = []

        # STEP 2: Substitute values into templates
        FOR EACH template IN keyword_templates:

            # Check if template requires all attribute values present
            required_attrs = PARSE_TEMPLATE_VARIABLES(template)
            if ALL_ATTRS_AVAILABLE(required_attrs, attribute_values):

                # Substitute values
                base_keyword = SUBSTITUTE_VALUES(template, attribute_values)
                cluster_keywords.append({
                    "keyword": base_keyword,
                    "generated_from": template.id,
                    "variant_type": "base"
                })

        # STEP 3: Generate long-tail variants
        long_tail_variants = []
        base_keywords = COPY(cluster_keywords)  # snapshot: never append to a list while iterating it

        FOR EACH base IN base_keywords:

            # e.g. base.keyword = "arthritis treatment for dogs"
            variants = []

            # Variant: Add "best" (skip if the keyword already starts with it)
            if NOT STARTS_WITH(base.keyword, "best "):
                variants.append("best " + base.keyword)

            # Variant: Add "review"
            variants.append(base.keyword + " review")

            # Variant: Add "vs" (comparison)
            if cluster.type in [product_category, comparison]:
                variants.append(base.keyword + " vs alternatives")

            # Variant: Add "for" (audience)
            variants.append(base.keyword + " for seniors")

            # Variant: Add "how to"
            variants.append("how to " + base.keyword)

            # Variant: Add "cost" (ecommerce intent)
            if sector_context.site_intent == "sell":
                variants.append(base.keyword + " cost")

            FOR EACH variant IN variants:
                if NOT_DUPLICATE(variant, cluster_keywords + long_tail_variants):
                    long_tail_variants.append({
                        "keyword": variant,
                        "generated_from": base.generated_from,
                        "variant_type": "long_tail",
                        "parent": base.keyword
                    })

        cluster_keywords.extend(long_tail_variants)

        # STEP 4: Enrich keywords with metadata
        enriched_keywords = []
        FOR EACH kw IN cluster_keywords:
            enriched = {
                "keyword": kw.keyword,
                "search_volume": ESTIMATE_SEARCH_VOLUME(kw.keyword, sector_context),
                "difficulty": ESTIMATE_DIFFICULTY(kw.keyword, sector_context),
                "intent": CLASSIFY_INTENT(kw.keyword),  # informational, transactional, navigational
                "generated_from": kw.generated_from,
                "variant_type": kw.variant_type
            }
            enriched_keywords.append(enriched)

        # STEP 5: Filter & sort
        filtered_keywords = SORT_BY_SEARCH_VOLUME(enriched_keywords)

        # Keep top 10-25 per cluster
        cluster_keywords_final = filtered_keywords[0:25]

        # Validate minimum: top up to the minimum of 10
        if LEN(cluster_keywords_final) < 10:
            ADD_SUPPLEMENTARY_KEYWORDS(cluster_keywords_final, 10 - LEN(cluster_keywords_final))

        all_keywords[cluster.id] = {
            "keywords": cluster_keywords_final,
            "keyword_count": len(cluster_keywords_final),
            "primary_intent": MODE(intent from all keywords),
            "search_volume_total": SUM(all search volumes)
        }

    # STEP 6: Global deduplication
    all_keywords_flat = FLATTEN(all_keywords)
    duplicates = FIND_DUPLICATES(all_keywords_flat)

    FOR EACH duplicate_set IN duplicates:
        primary_cluster = PRIMARY_CLUSTER(duplicate_set)  # best fit by dimensions
        REASSIGN_DUPLICATES_TO_PRIMARY(duplicate_set, primary_cluster)

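    # STEP 7: Validate constraints (counts taken after reassignment in Step 6)
    total_keywords = SUM(LEN(entry.keywords) for each entry in all_keywords)

    validation = {
        "within_min_per_cluster": all clusters >= 10,
        "within_max_per_cluster": all clusters <= 25,
        "total_within_target": total_keywords between 300-500,
        # Re-check: `duplicates` still holds the sets found before reassignment
        "no_duplicates": LEN(FIND_DUPLICATES(FLATTEN(all_keywords))) == 0
    }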

    if NOT validation.all_true:
        LOG_WARNING("Keyword generation constraints not fully met")

    # STEP 8: Return results
    return {
        "keywords_per_cluster": all_keywords,
        "deduplication": {
            "duplicates_removed": len(duplicates),
            "flagged_conflicts": identify_multi_cluster_fits()
        },
        "summary": {
            "total_unique_keywords": total_keywords,
            "per_cluster_avg": total_keywords / len(clusters),
            "total_search_volume": sum of all volumes,
            "within_constraints": validation.all_true
        }
    }

END FUNCTION
```
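
The global deduplication in Step 6 can be illustrated as follows. The pseudocode leaves `PRIMARY_CLUSTER` unspecified beyond "best fit by dimensions", so the fit score used here (how many of a cluster's dimension values appear in the keyword text) is an assumption, as are the function and argument names.

```python
def dedupe_across_clusters(keywords_by_cluster, dimensions_by_cluster):
    """Keep each keyword only in the cluster whose dimension values it mentions most.

    Hypothetical sketch: ties go to the cluster that listed the keyword first.
    """
    def fit(cluster_id, keyword):
        # Assumed fit score: count of dimension values found in the keyword text.
        return sum(value.lower() in keyword for value in dimensions_by_cluster[cluster_id])

    owners = {}  # normalized keyword -> ids of the clusters that contain it
    for cluster_id, keywords in keywords_by_cluster.items():
        for kw in keywords:
            owners.setdefault(kw.lower(), []).append(cluster_id)

    result = {cluster_id: [] for cluster_id in keywords_by_cluster}
    removed = 0
    for kw, cluster_ids in owners.items():
        primary = max(cluster_ids, key=lambda cid: fit(cid, kw))
        result[primary].append(kw)
        removed += len(cluster_ids) - 1
    return result, removed


keywords_by_cluster = {
    "cluster_001": ["dog arthritis treatment", "arthritis relief for pets"],
    "cluster_002": ["cat arthritis treatment", "arthritis relief for pets", "dog arthritis treatment"],
}
dimensions_by_cluster = {
    "cluster_001": ["dog", "arthritis"],
    "cluster_002": ["cat", "arthritis"],
}
deduped, duplicates_removed = dedupe_across_clusters(keywords_by_cluster, dimensions_by_cluster)
```

A tie such as "arthritis relief for pets" is settled here by listing order only; in the full flow those cases would be counted under `flagged_conflicts` and passed to the conflict resolution rules in 2.4.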

#### Keyword Template Structure (from SectorAttributeTemplate, 01B)
```python
# Example for Pet Health ecommerce site
keyword_templates = {
    "site_type": "ecommerce",
    "templates": [
        {
            "id": "template_001",
            "pattern": "best {health_condition} treatment for {pet_type}",
            "weight": 5,  # prioritize this template
            "min_required_attrs": ["health_condition", "pet_type"]
        },
        {
            "id": "template_002",
            "pattern": "{pet_type} {health_condition} medication",
            "weight": 4,
            "min_required_attrs": ["pet_type", "health_condition"]
        },
        {
            "id": "template_003",
            "pattern": "affordable {health_condition} relief for {pet_type}",
            "weight": 3,
            "min_required_attrs": ["health_condition", "pet_type"]
        },
        # ... more templates
    ]
}
```
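
Template substitution (Step 2 of the algorithm) can be sketched with the standard library's `string.Formatter`, which parses `{placeholder}` names out of a pattern. The helper names are hypothetical, and ordering by `weight` is an assumption about how the priority field is meant to be used.

```python
from string import Formatter


def template_variables(pattern):
    """Return the placeholder names used in a pattern such as '{pet_type} food'."""
    return [field for _, field, _, _ in Formatter().parse(pattern) if field]


def render_templates(templates, attribute_values):
    """Fill every template whose placeholders are all available; skip the rest."""
    keywords = []
    # Assumption: a higher weight means the template is rendered (and listed) first.
    for template in sorted(templates, key=lambda t: t["weight"], reverse=True):
        needed = template_variables(template["pattern"])
        if all(name in attribute_values for name in needed):
            keywords.append({
                "keyword": template["pattern"].format(**attribute_values).lower(),
                "generated_from": template["id"],
                "variant_type": "base",
            })
    return keywords


templates = [
    {"id": "template_002", "pattern": "{pet_type} {health_condition} medication", "weight": 4},
    {"id": "template_001", "pattern": "best {health_condition} treatment for {pet_type}", "weight": 5},
    # Illustrative template that needs an attribute this cluster does not have.
    {"id": "template_x", "pattern": "{brand} {pet_type} supplements", "weight": 3},
]
rendered = render_templates(templates, {"pet_type": "Dogs", "health_condition": "Arthritis"})
```

Templates whose placeholders cannot all be filled are skipped rather than rendered with gaps, which mirrors the `ALL_ATTRS_AVAILABLE` check in the pseudocode.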

#### Long-tail Variant Rules

| Variant Type | Pattern | Use Case | Example |
|---|---|---|---|
| Base | {keyword} | All clusters | "dog arthritis relief" |
| Best/Top | best {keyword} | All clusters | "best dog arthritis relief" |
| Review | {keyword} review | Product clusters | "arthritis supplement for dogs review" |
| Comparison | {keyword} vs | Comparison intent | "arthritis medication vs supplement for dogs" |
| Audience | {keyword} for {audience} | Audience-specific | "dog arthritis relief for senior dogs" |
| How-to | how to {verb} {keyword} | Problem-solution | "how to manage dog arthritis" |
| Cost/Price | {keyword} cost | Ecommerce intent | "arthritis treatment for dogs cost" |
| Quick | fast {keyword} | Urgency-driven | "fast arthritis relief for dogs" |

---

### 2.3 Blueprint Assembly Service
**File:** `sag/services/blueprint_service.py`
**Primary Function:** `assemble_blueprint(site, sector, attributes, clusters, keywords)`
**Triggering Context:** After keyword generation; creates SAGBlueprint (status=draft)

#### Input Contract
```python
def assemble_blueprint(
    site: Site,  # igny8_core_auth.Site (integer PK)
    sector: Sector,  # igny8_core_auth.Sector (integer PK)
    attributes: List[Tuple[str, List[str]]],  # (name, values) pairs, user-populated
    clusters: List[Dict],  # from cluster_formation()
    keywords: Dict[str, List[Dict]],  # cluster_id -> keyword dicts, from generate_keywords()
) -> Dict:
    ...
```

#### Execution Steps

1. **Create SAGBlueprint Record**
   ```python
   blueprint = SAGBlueprint.objects.create(
       site=site,
       status='draft',
       phase='phase_1_foundation',
       sector=sector,
       created_by=current_user,
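       metadata={
           'version': '1.0',
           # ISO strings: JSONField's default encoder cannot serialize datetime objects
           'created_date': now().isoformat(),
           'last_modified': now().isoformat()
       }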
   )
   ```

2. **Create SAGAttribute Records**
   ```python
   for attribute_name, values in attributes:
       attribute = SAGAttribute.objects.create(
           blueprint=blueprint,
           name=attribute_name,
           values=values,  # stored as JSON array
           is_primary=DETERMINE_PRIMACY(attribute_name, site.site_type),
           source='user_input'
       )
   ```

3. **Create SAGCluster Records from Formed Clusters**
   ```python
   for cluster in clusters:
       db_cluster = SAGCluster.objects.create(
           blueprint=blueprint,
           cluster_key=cluster['id'],
           title=cluster['title'],
           description=GENERATE_CLUSTER_DESC(cluster),
           cluster_type=cluster['type'],
           dimensions=cluster['dimensions'],  # JSON
           intersection_depth=cluster['intersection_depth'],
           viability_score=cluster['viability_score'],
           hub_title=cluster['hub_title'],
           supporting_content_plan=cluster['supporting_content_plan'],  # JSON array
           status='draft',
           keyword_count=0  # updated in next step
       )
   ```

4. **Populate auto_generated_keywords on Each Cluster**
   ```python
   for cluster_id, keyword_list in keywords.items():
       # cluster_key is unique only within a blueprint, so scope the lookup
       cluster = SAGCluster.objects.get(blueprint=blueprint, cluster_key=cluster_id)

       keyword_records = []
       for kw_data in keyword_list:
           keyword = SAGKeyword.objects.create(
               cluster=cluster,
               keyword_text=kw_data['keyword'],
               search_volume=kw_data['search_volume'],
               difficulty=kw_data['difficulty'],
               intent=kw_data['intent'],
               generated_from=kw_data['generated_from'],
               variant_type=kw_data['variant_type'],
               source='auto_generated'
           )
           keyword_records.append(keyword)

       cluster.auto_generated_keywords.set(keyword_records)
       cluster.keyword_count = len(keyword_records)
       cluster.save()
   ```

5. **Generate Taxonomy Plan**
   ```python
   taxonomy_plan = {
       'wp_categories': [],
       'wp_tags': [],
       'hierarchy': {}
   }

   for attribute in blueprint.sagattribute_set.all():
       if attribute.is_primary:
           category = {
               'name': attribute.name,
               'slug': slugify(attribute.name),
               'description': f"Posts about {attribute.name}"
           }
           taxonomy_plan['wp_categories'].append(category)
       else:
           # One tag per value. primary_attr_name is not defined in this
           # snippet; it must name the primary attribute the tags are filed under.
           for v in attribute.values:
               tag = {
                   'name': v,
                   'slug': slugify(v),
                   'parent_category': primary_attr_name
               }
               taxonomy_plan['wp_tags'].append(tag)

   blueprint.taxonomy_plan = taxonomy_plan  # JSON field
   ```

6. **Generate Execution Priority (Phased Approach)**
   ```python
   execution_priority = {
       'phase': 'phase_1_hubs',
       'content_sequence': []
   }

   # Phase 1: Hub pages (1 per cluster)
   hub_items = []
   for cluster in blueprint.sagcluster_set.filter(status='draft'):
       hub_items.append({
           'type': 'hub_page',
           'cluster_id': cluster.id,
           'title': cluster.hub_title,
           'priority': 1,
           'estimated_effort': 'high',
           'SEO_impact': 'critical'
       })

   execution_priority['content_sequence'].extend(hub_items)

   # Phase 2: Supporting content (5-8 articles per cluster)
   supporting_items = []
   for cluster in blueprint.sagcluster_set.filter(status='draft'):
       for content_title in cluster.supporting_content_plan:
           supporting_items.append({
               'type': 'supporting_article',
               'cluster_id': cluster.id,
               'parent_hub': cluster.hub_title,
               'title': content_title,
               'priority': 2,
               'estimated_effort': 'medium',
               'SEO_impact': 'supporting'
           })

   execution_priority['content_sequence'].extend(supporting_items)

   # Phase 3: Term/pillar pages (keywords + long-tail)
   term_items = []
   for cluster in blueprint.sagcluster_set.filter(status='draft'):
       for keyword in cluster.auto_generated_keywords.all():
           term_items.append({
               'type': 'term_page',
               'cluster_id': cluster.id,
               'keyword': keyword.keyword_text,
               'priority': 3,
               'estimated_effort': 'low',
               'SEO_impact': 'supportive'
           })

   execution_priority['content_sequence'].extend(term_items)

   blueprint.execution_priority = execution_priority  # JSON field
   ```

7. **Populate Denormalized JSON Fields**
   ```python
   blueprint.attributes_json = {
       'total_attributes': blueprint.sagattribute_set.count(),
       'summary': [
           {
               'name': attr.name,
               'value_count': len(attr.values),
               'values': attr.values,
               'is_primary': attr.is_primary
           }
           for attr in blueprint.sagattribute_set.all()
       ]
   }

   blueprint.clusters_json = {
       'total_clusters': blueprint.sagcluster_set.count(),
       'summary': [
           {
               'id': cluster.cluster_key,
               'title': cluster.title,
               'type': cluster.cluster_type,
               'keyword_count': cluster.keyword_count,
               'viability_score': cluster.viability_score
           }
           for cluster in blueprint.sagcluster_set.all()
       ]
   }

   blueprint.save()
   ```

8. **Return Blueprint ID & Status**
   ```python
   return {
       'blueprint_id': blueprint.id,
       'status': 'draft',
       'created_at': blueprint.created_at,
       'summary': {
           'total_attributes': blueprint.sagattribute_set.count(),
           'total_clusters': blueprint.sagcluster_set.count(),
           'total_keywords': SAGKeyword.objects.filter(cluster__blueprint=blueprint).count(),
           'next_step': 'review blueprint in 01E (Pipeline Configuration)'
       }
   }
   ```
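
The phased ordering from step 6 can also be produced in a single pass over the clusters. The sketch below uses plain dictionaries instead of ORM objects so it runs standalone; `build_content_sequence` and the input shape are assumptions for illustration.

```python
def build_content_sequence(clusters):
    """Order work as hub pages first, then supporting articles, then term pages."""
    hubs, supporting, terms = [], [], []
    for cluster in clusters:
        hubs.append({
            "type": "hub_page",
            "cluster_id": cluster["id"],
            "title": cluster["hub_title"],
            "priority": 1,
        })
        for title in cluster["supporting_content_plan"]:
            supporting.append({
                "type": "supporting_article",
                "cluster_id": cluster["id"],
                "parent_hub": cluster["hub_title"],
                "title": title,
                "priority": 2,
            })
        for keyword in cluster["keywords"]:
            terms.append({
                "type": "term_page",
                "cluster_id": cluster["id"],
                "keyword": keyword,
                "priority": 3,
            })
    # Concatenating the buckets keeps every priority-1 item ahead of priority 2 and 3.
    return hubs + supporting + terms


sequence = build_content_sequence([
    {
        "id": "cluster_001",
        "hub_title": "Best Arthritis Treatments for Dogs",
        "supporting_content_plan": ["Dog Arthritis Medications: Complete Guide"],
        "keywords": ["dog arthritis remedies", "best arthritis treatment for dogs"],
    },
    {
        "id": "cluster_002",
        "hub_title": "Cat Allergy Relief",
        "supporting_content_plan": [],
        "keywords": ["cat allergy relief"],
    },
])
```

Collecting into three buckets and concatenating gives the same ordering as the three separate loops while touching each cluster once.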

---

### 2.4 Manual Keyword Supplementation (User Interface)

#### Feature: Add Keywords from Multiple Sources

1. **IGNY8 Library Integration**
   - Users browse pre-curated keyword library per site_type
   - Select keywords → auto-map to clusters by attribute match
   - Unmatched keywords → flagged for review

2. **Manual Entry**
   - Form field: paste or type keywords (comma-separated)
   - System deduplicates against existing
   - Prompts user to assign to cluster(s)

3. **CSV Import**
   - Upload CSV with columns: keyword, search_volume (optional), difficulty (optional)
   - Preview & validate before import
   - Bulk assign to clusters or mark for review

4. **Keyword API Integration** (optional in Phase 1)
   - Connect to SEMrush, Ahrefs, or similar
   - Fetch keyword suggestions for cluster dimensions
   - User approves additions
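
The CSV import path (item 3) might validate and deduplicate rows along these lines before the preview step. This is a sketch using only the standard `csv` module; the function name, the error messages, and the choice to lowercase keywords are assumptions.

```python
import csv
import io


def parse_keyword_csv(text, existing_keywords=()):
    """Parse keyword rows, skipping duplicates and collecting per-line errors."""
    seen = {kw.lower() for kw in existing_keywords}
    rows, errors = [], []
    reader = csv.DictReader(io.StringIO(text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        keyword = (row.get("keyword") or "").strip().lower()
        if not keyword:
            errors.append((line_no, "missing keyword"))
            continue
        if keyword in seen:
            continue  # duplicate of an existing or earlier keyword
        volume = (row.get("search_volume") or "").strip()
        if volume and not volume.isdigit():
            errors.append((line_no, "search_volume must be a whole number"))
            continue
        seen.add(keyword)
        rows.append({
            "keyword": keyword,
            "search_volume": int(volume) if volume else None,  # optional column
            "difficulty": (row.get("difficulty") or "").strip() or None,  # optional column
        })
    return rows, errors


sample_csv = (
    "keyword,search_volume,difficulty\n"
    "Dog Arthritis Relief,1200,medium\n"
    "dog arthritis relief,900,easy\n"
    "cat joint supplements,,\n"
    ",50,easy\n"
    "senior dog diet,lots,hard\n"
)
rows, errors = parse_keyword_csv(sample_csv)
```

Returning errors alongside the clean rows lets the preview screen show exactly which lines need fixing before anything is written to a cluster.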

#### Keyword Mapping Logic
```python
def map_keyword_to_clusters(new_keyword, clusters, threshold=0.70):
    matches = []

    for cluster in clusters:
        # Extract all attribute values from cluster dimensions
        cluster_attrs = EXTRACT_ATTRIBUTES(cluster.dimensions)

        # Calculate semantic similarity
        similarity = CALCULATE_SIMILARITY(new_keyword, cluster_attrs)

        if similarity > threshold:
            matches.append({
                'cluster_id': cluster.id,
                'cluster_title': cluster.title,
                'similarity_score': similarity
            })

    return matches  # May be 0, 1, or multiple matches
```
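
`CALCULATE_SIMILARITY` is not defined in this document. As a stand-in for experimentation, the sketch below scores a keyword by the share of a cluster's dimension values that appear in it; a production version would more likely use embeddings or another semantic measure. All names here are hypothetical.

```python
def extract_values(dimensions):
    """Turn 'Pet Type: Dogs' style labels into lowercase values such as 'dogs'."""
    return [d.split(":", 1)[-1].strip().lower() for d in dimensions]


def similarity(keyword, dimensions):
    """Assumed stand-in metric: fraction of dimension values present in the keyword."""
    tokens = set(keyword.lower().split())
    values = extract_values(dimensions)
    if not values:
        return 0.0
    # A multi-word value counts as a hit only when every one of its words is present.
    hits = sum(all(word in tokens for word in value.split()) for value in values)
    return hits / len(values)


def map_keyword(keyword, clusters, threshold=0.70):
    matches = []
    for cluster in clusters:
        score = similarity(keyword, cluster["dimensions"])
        if score > threshold:
            matches.append({"cluster_id": cluster["id"], "similarity_score": score})
    return matches


clusters = [
    {"id": "cluster_001", "dimensions": ["Pet Type: Dogs", "Health Condition: Arthritis"]},
    {"id": "cluster_002", "dimensions": ["Pet Type: Cats", "Health Condition: Arthritis"]},
]
dog_matches = map_keyword("arthritis relief for dogs", clusters)
generic_matches = map_keyword("arthritis relief for pets", clusters)
```

With this metric a generic keyword such as "arthritis relief for pets" scores 0.5 against both clusters and matches neither, so it would be flagged for review instead of being auto-assigned.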

#### Conflict Resolution: Multi-Cluster Keyword Assignment

**Problem:** A keyword fits multiple clusters (e.g., "arthritis relief for pets" fits both Dog Cluster and Cat Cluster)

**Resolution Algorithm:**

1. **Identify Multi-Fit Keywords**
   ```python
   potential_conflicts = []
   for new_keyword in keywords_to_add:
       matching_clusters = map_keyword_to_clusters(new_keyword, all_clusters)
       if len(matching_clusters) > 1:
           potential_conflicts.append({
               'keyword': new_keyword,
               'matching_clusters': matching_clusters
           })
   ```

2. **Apply Decision Criteria (in order)**
   - **Criterion 1: Dimensional Intersection Count**
     - Assign to cluster with MOST dimensional intersections
     - Example: "dog arthritis relief" → Dog cluster has 3 dimensions (pet type, condition, audience); Cat cluster has 2 → assign to Dog cluster

   - **Criterion 2: Specificity**
     - If tied on intersection count, assign to MORE SPECIFIC cluster
     - Example: "arthritis relief" (general) vs "dog arthritis relief" (specific) → assign to Dog cluster

   - **Criterion 3: Primary User Intent Match**
     - If still tied, assign to cluster whose hub_title best matches user intent
     - Example: Both Dog & Cat clusters have "arthritis relief" hub; Dog hub is "Best Arthritis Treatments for Dogs" → assign to Dog

   - **Criterion 4: Last Resort - Create New Cluster**
     - If keyword doesn't fit any cluster well, flag as "potential_new_cluster"
     - User reviews and decides: split existing cluster, merge, or create new

3. **Implementation**
   ```python
   def resolve_keyword_conflict(keyword, matching_clusters):
       # Expects at least two candidates (see "Identify Multi-Fit Keywords").
       candidates = list(matching_clusters)

       # Criteria are applied in order; after each one, only the clusters
       # tied for the top score stay in contention.
       criteria = [
           lambda c: c.intersection_depth,                     # Step 1: intersection depth
           lambda c: CALC_SPECIFICITY(c, keyword),             # Step 2: specificity
           lambda c: CALC_INTENT_MATCH(c.hub_title, keyword),  # Step 3: intent match
       ]

       for score in criteria:
           scores = [score(c) for c in candidates]
           best = max(scores)
           candidates = [c for c, s in zip(candidates, scores) if s == best]
           if len(candidates) == 1:
               return candidates[0]  # clear winner on this criterion

       # Step 4: still tied after all criteria, flag for user review
       return {
           'status': 'flagged_for_review',
           'keyword': keyword,
           'candidates': matching_clusters,
           'reason': 'ambiguous_assignment'
       }
   ```

---

## 3. Data Models / APIs

### 3.1 Database Models (Django ORM)

#### SAGBlueprint (existing from 01A, extended)
```python
# Inherits account, created_at, updated_at from AccountBaseModel
class SAGBlueprint(AccountBaseModel):
    STATUS_CHOICES = (
        ('draft', 'Draft'),
        ('cluster_formation_complete', 'Cluster Formation Complete'),
        ('keyword_generation_complete', 'Keyword Generation Complete'),
        ('keyword_supplemented', 'Keywords Supplemented'),
        ('ready_for_pipeline', 'Ready for Pipeline'),
        ('published', 'Published'),
    )

    site = models.ForeignKey('igny8_core_auth.Site', on_delete=models.CASCADE)
    status = models.CharField(max_length=50, choices=STATUS_CHOICES, default='draft')
    phase = models.CharField(max_length=50, default='phase_1_foundation')
    sector = models.ForeignKey('igny8_core_auth.Sector', on_delete=models.CASCADE)

    # Denormalized JSON for fast access
    attributes_json = models.JSONField(default=dict, blank=True)
    clusters_json = models.JSONField(default=dict, blank=True)
    taxonomy_plan = models.JSONField(default=dict, blank=True)
    execution_priority = models.JSONField(default=dict, blank=True)

    created_by = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.SET_NULL, null=True)
    # created_at, updated_at inherited from AccountBaseModel

    class Meta:
        db_table = 'sag_blueprint'
        ordering = ['-created_at']
```

#### SAGAttribute (existing from 01A, no changes required)
```python
# Inherits account, created_at, updated_at from AccountBaseModel
class SAGAttribute(AccountBaseModel):
    blueprint = models.ForeignKey(SAGBlueprint, on_delete=models.CASCADE)
    name = models.CharField(max_length=255)
    values = models.JSONField()  # array of strings
    is_primary = models.BooleanField(default=False)
    source = models.CharField(max_length=50)  # 'user_input', 'template', 'api'
    # created_at, updated_at inherited from AccountBaseModel

    class Meta:
        db_table = 'sag_attribute'
        unique_together = ('blueprint', 'name')
```

#### SAGCluster (existing from 01A, extended)
```python
# Inherits account, created_at, updated_at from AccountBaseModel
class SAGCluster(AccountBaseModel):
    TYPE_CHOICES = (
        ('product_category', 'Product/Service Category'),
        ('condition_problem', 'Condition/Problem'),
        ('feature', 'Feature'),
        ('brand', 'Brand'),
        ('informational', 'Informational'),
        ('comparison', 'Comparison'),
        ('life_stage', 'Life Stage/Audience'),
    )

    STATUS_CHOICES = (
        ('draft', 'Draft'),
        ('validated', 'Validated'),
        ('keyword_assigned', 'Keywords Assigned'),
        ('content_created', 'Content Created'),
    )

    blueprint = models.ForeignKey(SAGBlueprint, on_delete=models.CASCADE)
    cluster_key = models.CharField(max_length=100)  # unique ID from cluster formation
    title = models.CharField(max_length=255)
    description = models.TextField(blank=True)

    cluster_type = models.CharField(max_length=50, choices=TYPE_CHOICES)
    dimensions = models.JSONField()  # ["dimension1", "dimension2", ...]
    intersection_depth = models.IntegerField()  # count of intersecting dimensions
    viability_score = models.FloatField()  # 0-1

    hub_title = models.CharField(max_length=255)
    supporting_content_plan = models.JSONField()  # array of content titles

    auto_generated_keywords = models.ManyToManyField(
        'SAGKeyword',
        related_name='clusters_auto',
        blank=True
    )
    supplemented_keywords = models.ManyToManyField(
        'SAGKeyword',
        related_name='clusters_supplemented',
        blank=True
    )

    keyword_count = models.IntegerField(default=0)
    status = models.CharField(max_length=50, choices=STATUS_CHOICES, default='draft')
    # created_at, updated_at inherited from AccountBaseModel

    class Meta:
        db_table = 'sag_cluster'
        unique_together = ('blueprint', 'cluster_key')
        ordering = ['-viability_score']
```

#### SAGKeyword (new)
```python
# Inherits account, created_at, updated_at from AccountBaseModel
class SAGKeyword(AccountBaseModel):
    INTENT_CHOICES = (
        ('informational', 'Informational'),
        ('transactional', 'Transactional'),
        ('navigational', 'Navigational'),
        ('commercial', 'Commercial Intent'),
    )

    VARIANT_TYPES = (
        ('base', 'Base Keyword'),
        ('long_tail', 'Long-tail Variant'),
        ('brand', 'Brand Variant'),
        ('comparison', 'Comparison'),
        ('review', 'Review'),
        ('how_to', 'How-to'),
    )

    SOURCE_CHOICES = (
        ('auto_generated', 'Auto-Generated'),
        ('manual_entry', 'Manual Entry'),
        ('csv_import', 'CSV Import'),
        ('api_fetch', 'API Fetch'),
        ('library', 'IGNY8 Library'),
    )

    cluster = models.ForeignKey(
        SAGCluster,
        on_delete=models.CASCADE,
        related_name='all_keywords'
    )
    keyword_text = models.CharField(max_length=255)
    search_volume = models.IntegerField(null=True, blank=True)
    difficulty = models.CharField(max_length=50, blank=True)  # 'easy', 'medium', 'hard'
    intent = models.CharField(max_length=50, choices=INTENT_CHOICES)

    generated_from = models.CharField(max_length=100, blank=True)  # template ID or source
    variant_type = models.CharField(max_length=50, choices=VARIANT_TYPES)
    source = models.CharField(max_length=50, choices=SOURCE_CHOICES)

    cpc = models.FloatField(null=True, blank=True)  # if available from API
    competition = models.CharField(max_length=50, blank=True)  # 'low', 'medium', 'high'
    # created_at, updated_at inherited from AccountBaseModel

    class Meta:
        db_table = 'sag_keyword'
        unique_together = ('cluster', 'keyword_text')
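        # Keep keywords without volume data at the end; a plain '-search_volume'
        # sorts NULLs first on databases such as PostgreSQL.
        ordering = [models.F('search_volume').desc(nulls_last=True)]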
```

---

### 3.2 API Endpoints

#### POST /api/v1/blueprints/{blueprint_id}/clusters/form/
**Purpose:** Trigger cluster formation AI function
**Authentication:** Required (JWT)
**Input:**
```json
{
    "populated_attributes": [
        {"name": "Pet Type", "values": ["Dogs", "Cats"]},
        {"name": "Health Condition", "values": ["Allergies", "Arthritis"]}
    ],
    "max_clusters": 50
}
```

**Output:**
```json
{
    "clusters": [...],
    "summary": {
        "total_clusters_formed": 12,
        "type_distribution": {...}
    },
    "status": "success"
}
```

**Error Cases:**
- 400: Invalid attributes structure
- 403: Forbidden (authenticated user does not own the blueprint)
- 422: Insufficient attributes for cluster formation (< 2 dimensions)
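
The 400 and 422 cases above can be separated before the AI function is called. The following is a minimal sketch, not the project's actual validator: `validate_form_request` is a hypothetical helper name, and the 403 ownership check is omitted because it needs a database lookup.

```python
from typing import Any


def validate_form_request(payload: dict[str, Any]) -> tuple[int, str]:
    """Return (http_status, message) for a clusters/form/ request body.

    Hypothetical helper: structural problems map to 400, too few
    populated dimensions maps to 422, everything else passes as 200.
    """
    attrs = payload.get("populated_attributes")
    if not isinstance(attrs, list):
        return 400, "populated_attributes must be a list"
    for attr in attrs:
        if not isinstance(attr, dict):
            return 400, "each attribute must be an object"
        name, values = attr.get("name"), attr.get("values")
        if not isinstance(name, str) or not name.strip():
            return 400, "attribute name must be a non-empty string"
        if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
            return 400, "attribute values must be a list of strings"
    # A dimension only counts if it carries at least one value.
    populated = [a for a in attrs if a["values"]]
    if len(populated) < 2:
        return 422, "at least 2 populated dimensions are required"
    return 200, "ok"
```

Keeping the structural check (400) separate from the semantic check (422) lets the wizard show a different message for "malformed request" and "add more attributes".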

---

#### POST /api/v1/blueprints/{blueprint_id}/keywords/generate/
**Purpose:** Trigger keyword generation AI function
**Authentication:** Required
**Input:**
```json
{
    "use_cluster_ids": ["cluster_001", "cluster_002"],
    "target_keywords_per_cluster": 15,
    "include_long_tail_variants": true
}
```

**Output:**
```json
{
    "keywords_per_cluster": {...},
    "deduplication": {
        "duplicates_removed": 5
    },
    "summary": {
        "total_unique_keywords": 180,
        "within_constraints": true
    }
}
```

---

#### POST /api/v1/blueprints/{blueprint_id}/keywords/supplement/
**Purpose:** Add manual, CSV, library, or API-sourced keywords
**Authentication:** Required
**Input (Multiple Scenarios):**

**Scenario 1: Manual Entry**
```json
{
    "source": "manual_entry",
    "keywords": ["arthritis relief dogs", "joint pain dogs"],
    "cluster_id": "cluster_001"
}
```

**Scenario 2: CSV Import**
```json
{
    "source": "csv_import",
    "csv_url": "https://example.com/keywords.csv",
    "auto_cluster": true
}
```

**Scenario 3: Library Selection**
```json
{
    "source": "library",
    "library_keyword_ids": [123, 456, 789],
    "auto_cluster": true
}
```

**Output:**
```json
{
    "added_keywords": 10,
    "auto_clustered": 9,
    "flagged_for_review": 1,
    "conflicts_resolved": {
        "reassigned": 2,
        "deferred": 1
    }
}
```

---

#### POST /api/v1/blueprints/{blueprint_id}/assemble/
**Purpose:** Trigger blueprint assembly (create final SAGBlueprint with all records)
**Authentication:** Required
**Input:**
```json
{
    "finalize_keyword_review": true,
    "set_status": "ready_for_pipeline"
}
```

**Output:**
```json
{
    "blueprint_id": 42,
    "status": "ready_for_pipeline",
    "summary": {
        "total_attributes": 4,
        "total_clusters": 12,
        "total_keywords": 180,
        "execution_priority_phases": 3
    }
}
```

---

#### GET /api/v1/blueprints/{blueprint_id}/clusters/?status=draft&type=product_category
**Purpose:** List clusters with filtering
**Query Params:**
- `status`: draft, validated, keyword_assigned, content_created
- `type`: product_category, condition_problem, feature, brand, informational, comparison, life_stage
- `min_viability`: 0.70
- `limit`: 50, `offset`: 0

**Output:**
```json
{
    "results": [
        {
            "id": 1,
            "cluster_key": "cluster_001",
            "title": "Dog Arthritis Relief Solutions",
            "hub_title": "Best Arthritis Treatments for Dogs",
            "keyword_count": 15,
            "viability_score": 0.92,
            "type": "product_category"
        }
    ],
    "total_count": 12,
    "total_keywords": 180
}
```
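
The filter semantics of this endpoint can be sketched over plain dictionaries. This is an illustration under assumptions, not the view implementation: `filter_clusters` is a hypothetical name, each row is assumed to carry a `status` key (the sample response above does not show one), and results are ordered by viability score to mirror the model's default ordering.

```python
def filter_clusters(rows, *, status=None, cluster_type=None,
                    min_viability=None, limit=50, offset=0):
    """Apply the list endpoint's query params to in-memory cluster rows."""
    matched = [
        r for r in rows
        if (status is None or r.get("status") == status)
        and (cluster_type is None or r.get("type") == cluster_type)
        and (min_viability is None or r["viability_score"] >= min_viability)
    ]
    # Highest viability first, matching ordering = ['-viability_score'].
    matched.sort(key=lambda r: r["viability_score"], reverse=True)
    return {
        "results": matched[offset:offset + limit],
        "total_count": len(matched),  # count before pagination
    }
```

In a Django view the same conditions would normally become queryset filters; the dictionary version only pins down the expected behaviour for tests.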

---

#### GET /api/v1/blueprints/{blueprint_id}/keywords/?cluster_id=cluster_001&source=auto_generated
**Purpose:** List keywords for a cluster
**Query Params:**
- `cluster_id`: filter by cluster
- `source`: auto_generated, manual_entry, csv_import, api_fetch, library
- `intent`: informational, transactional, navigational, commercial
- `min_search_volume`: 100
- `order_by`: search_volume (DESC), difficulty, intent

**Output:**
```json
{
    "results": [
        {
            "id": 1,
            "keyword_text": "best arthritis treatment for dogs",
            "search_volume": 1200,
            "difficulty": "medium",
            "intent": "informational",
            "variant_type": "long_tail",
            "source": "auto_generated"
        }
    ],
    "total_count": 15
}
```

---

#### DELETE /api/v1/blueprints/{blueprint_id}/keywords/{keyword_id}/
**Purpose:** Remove a keyword (before assembly)
**Authentication:** Required
**Status:** Only available if blueprint.status='draft' or 'keyword_generation_complete'
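
The status rule above can be expressed as a small guard. A minimal sketch; `DELETABLE_STATUSES` and `can_delete_keyword` are hypothetical names, and the HTTP status returned when the guard fails is not specified in this document.

```python
# Statuses in which keyword removal is still allowed, per the rule above.
DELETABLE_STATUSES = frozenset({"draft", "keyword_generation_complete"})


def can_delete_keyword(blueprint_status: str) -> bool:
    """Return True if a keyword may still be removed from the blueprint."""
    return blueprint_status in DELETABLE_STATUSES
```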

---

## 4. Implementation Steps

### Phase 1: AI Functions Development (Week 1-2)

#### Step 1.1: Set up cluster_formation.py structure
- [ ] Create `sag/ai_functions/cluster_formation.py`
- [ ] Define input/output contracts
- [ ] Implement intersection generation logic (2-value, 3-value)
- [ ] Stub out AI evaluation function (ready for Claude integration)
- [ ] Implement constraint filtering & sorting
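
The intersection generation step can be sketched with `itertools`. This is a minimal illustration, assuming each candidate takes exactly one value from each of `depth` distinct attributes; `generate_intersections` is a hypothetical name, not the function in `cluster_formation.py`.

```python
from itertools import combinations, product


def generate_intersections(populated_attributes, depth=2):
    """Return candidate intersections as {attribute name: value} dicts.

    Attributes are combined `depth` at a time, and every combination of
    their values becomes one candidate for AI evaluation.
    """
    candidates = []
    for attr_group in combinations(populated_attributes, depth):
        names = [attr["name"] for attr in attr_group]
        for values in product(*(attr["values"] for attr in attr_group)):
            candidates.append(dict(zip(names, values)))
    return candidates
```

Because `combinations` never pairs an attribute with itself, values such as "Dogs" and "Cats" from the same attribute are never intersected with each other.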

#### Step 1.2: Implement cluster formation AI logic
- [ ] Integrate Claude AI API for cluster viability evaluation
  - Real topical ecosystem check
  - User search demand validation
  - Content support assessment
  - Differentiation evaluation
- [ ] Implement cluster type classification (using embeddings or rule-based logic)
- [ ] Implement hub title & supporting content plan generation
- [ ] Add viability scoring (0-1 scale)
- [ ] Implement distribution validation

#### Step 1.3: Unit tests for cluster formation
- [ ] Test intersection generation (2-value, 3-value)
- [ ] Test AI evaluation with mock responses
- [ ] Test constraint filtering (max 50 clusters)
- [ ] Test type distribution analysis
- [ ] Test handling of edge cases (0 intersections, all rejected, etc.)

#### Step 1.4: Create keyword_generation.py structure
- [ ] Create `sag/ai_functions/keyword_generation.py`
- [ ] Define input/output contracts
- [ ] Implement template substitution logic
- [ ] Implement long-tail variant generation
- [ ] Implement deduplication logic
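
Template substitution can be sketched with the standard `string.Formatter`. The helpers below are hypothetical, and the slot naming rule (lowercase attribute name with underscores) is an assumption inferred from template examples such as `{pet_type} {health_condition} treatment`. Templates whose slots cannot all be filled are skipped rather than raising.

```python
import string


def slot_name(attribute_name):
    """'Health Condition' -> 'health_condition' (assumed naming rule)."""
    return attribute_name.strip().lower().replace(" ", "_")


def fill_template(template, dimensions):
    """Fill one template, or return None if a required slot has no value."""
    slots = {slot_name(name): value.lower() for name, value in dimensions.items()}
    required = {field for _, field, _, _ in string.Formatter().parse(template) if field}
    if not required.issubset(slots):
        return None
    return template.format(**slots)


def base_keywords(templates, dimensions):
    """Fill every usable template, dropping duplicates but keeping order."""
    seen, result = set(), []
    for template in templates:
        keyword = fill_template(template, dimensions)
        if keyword and keyword not in seen:
            seen.add(keyword)
            result.append(keyword)
    return result
```

Long-tail variant generation would build on `base_keywords`; the exact variant patterns are left out here because they are not fixed in this section.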

#### Step 1.5: Implement keyword generation AI logic
- [ ] Integrate template loading from SectorAttributeTemplate (01B)
- [ ] Implement keyword enrichment (search volume, difficulty, intent)
- [ ] Implement filtering & sorting by search volume
- [ ] Implement constraint validation (10-25 per cluster, 300-500 total)
- [ ] Implement global deduplication & conflict resolution

#### Step 1.6: Unit tests for keyword generation
- [ ] Test template substitution with various attribute combinations
- [ ] Test long-tail variant generation
- [ ] Test deduplication across clusters
- [ ] Test constraint validation
- [ ] Test conflict resolution (multi-cluster keywords)

---

### Phase 2: Data Models & Service Layer (Week 2-3)

#### Step 2.1: Database migrations
- [ ] Create SAGKeyword model
- [ ] Add ManyToMany relations to SAGCluster (auto_generated_keywords, supplemented_keywords)
- [ ] Extend SAGBlueprint with denormalized JSON fields (attributes_json, clusters_json, taxonomy_plan, execution_priority)
- [ ] Extend SAGCluster with cluster_key, cluster_type, dimensions, intersection_depth, viability_score, hub_title, supporting_content_plan
- [ ] Run and test migrations on dev database

#### Step 2.2: Implement blueprint_service.py
- [ ] Create `sag/services/blueprint_service.py`
- [ ] Implement assemble_blueprint() function with 8 steps
- [ ] Implement SAGBlueprint creation & status management
- [ ] Implement SAGAttribute creation from user input
- [ ] Implement SAGCluster creation from cluster formation results
- [ ] Implement SAGKeyword creation & assignment
- [ ] Implement taxonomy_plan generation
- [ ] Implement execution_priority generation
- [ ] Implement denormalized JSON population

#### Step 2.3: Unit tests for blueprint_service
- [ ] Test blueprint creation & status transitions
- [ ] Test attribute record creation
- [ ] Test cluster record creation with all fields
- [ ] Test keyword assignment to clusters
- [ ] Test taxonomy plan generation
- [ ] Test execution priority generation
- [ ] Test denormalized JSON accuracy

---

### Phase 3: API Endpoints & Integration (Week 3-4)

#### Step 3.1: Implement cluster formation API endpoint
- [ ] Create POST /api/v1/blueprints/{blueprint_id}/clusters/form/
- [ ] Validate input attributes
- [ ] Call cluster_formation() AI function
- [ ] Return results with summary
- [ ] Error handling (400, 403, 422)

#### Step 3.2: Implement keyword generation API endpoint
- [ ] Create POST /api/v1/blueprints/{blueprint_id}/keywords/generate/
- [ ] Validate input & cluster availability
- [ ] Call keyword_generation() AI function
- [ ] Return results with deduplication summary
- [ ] Error handling

#### Step 3.3: Implement keyword supplementation API endpoint
- [ ] Create POST /api/v1/blueprints/{blueprint_id}/keywords/supplement/
- [ ] Support multiple input sources (manual, CSV, library, API)
- [ ] Implement auto-clustering via map_keyword_to_clusters()
- [ ] Implement conflict resolution via resolve_keyword_conflict()
- [ ] Return summary of added, clustered, flagged keywords

#### Step 3.4: Implement blueprint assembly API endpoint
- [ ] Create POST /api/v1/blueprints/{blueprint_id}/assemble/
- [ ] Call blueprint_service.assemble_blueprint()
- [ ] Manage status transitions
- [ ] Return blueprint summary with next steps

#### Step 3.5: Implement read endpoints
- [ ] Create GET /api/v1/blueprints/{blueprint_id}/clusters/?status=draft
- [ ] Create GET /api/v1/blueprints/{blueprint_id}/keywords/?cluster_id=...
- [ ] Implement filtering & pagination
- [ ] Add ordering options

#### Step 3.6: Implement keyword removal endpoint
- [ ] Create DELETE /api/v1/blueprints/{blueprint_id}/keywords/{keyword_id}/
- [ ] Validate blueprint status (only `draft` or `keyword_generation_complete`)
- [ ] Cascade delete as needed

---

### Phase 4: Integration with 01D & Testing (Week 4-5)

#### Step 4.1: Integrate with Setup Wizard (01D)
- [ ] Call cluster_formation() after user populates attributes
- [ ] Display clusters to user for review (optional: allow edits)
- [ ] Call keyword_generation() if user confirms clusters
- [ ] Display keywords for review
- [ ] Allow manual supplementation before final assembly

#### Step 4.2: End-to-end testing
- [ ] Test full flow: attributes → clusters → keywords → blueprint
- [ ] Test with various sector/site_type combinations
- [ ] Test constraint enforcement
- [ ] Test conflict resolution with real scenarios
- [ ] Performance test with large attribute sets (100+ values)

#### Step 4.3: Integration with 01E (Pipeline Configuration)
- [ ] Verify blueprint is available to pipeline service
- [ ] Test taxonomy plan usage in content generation
- [ ] Test execution_priority ordering in pipeline

---

## 5. Acceptance Criteria

### Cluster Formation AI Function (01C-CF)
- [ ] **CF-1:** Generates all 2-value intersections from populated attributes
- [ ] **CF-2:** Generates relevant 3-value intersections (at least 50% of possible combinations)
- [ ] **CF-3:** AI evaluates each intersection on 5 decision criteria (ecosystem, demand, content support, differentiation, clarity)
- [ ] **CF-4:** Classification assigns correct cluster type (product_category, condition_problem, feature, brand, informational, comparison, life_stage)
- [ ] **CF-5:** Hub titles are specific, actionable, and 5-12 words long
- [ ] **CF-6:** Supporting content plans contain 5-8 titles, semantically related to hub, covering different angles
- [ ] **CF-7:** Viability scores accurately reflect cluster strength (0-1 scale, with clear rationale)
- [ ] **CF-8:** Hard constraint enforced: max 50 clusters per sector, sorted by viability score
- [ ] **CF-9:** Type distribution meets targets: Product/Service 40-50%, Condition/Problem 20-30%, Feature 10-15%, Brand 5-10%, Life Stage 5-10%
- [ ] **CF-10:** Clusters have 3+ dimensional intersections for strong coherence
- [ ] **CF-11:** No duplicative clusters (semantic coherence check prevents near-duplicates like "Dog Joint Health" + "Dog Arthritis")
- [ ] **CF-12:** API response includes summary with cluster count, type distribution, avg intersection depth
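
The CF-9 targets can be checked mechanically once clusters are classified. A minimal sketch, assuming shares are computed over all formed clusters; `TARGET_SHARES` and `distribution_report` are hypothetical names, and types without a CF-9 target (informational, comparison) are simply not reported.

```python
from collections import Counter

# (min share, max share) per cluster type, copied from CF-9.
TARGET_SHARES = {
    "product_category": (0.40, 0.50),
    "condition_problem": (0.20, 0.30),
    "feature": (0.10, 0.15),
    "brand": (0.05, 0.10),
    "life_stage": (0.05, 0.10),
}


def distribution_report(cluster_types):
    """Compare the share of each cluster type with its CF-9 target range."""
    total = len(cluster_types)
    counts = Counter(cluster_types)
    report = {}
    for ctype, (low, high) in TARGET_SHARES.items():
        share = counts[ctype] / total if total else 0.0
        report[ctype] = {"share": round(share, 3), "in_range": low <= share <= high}
    return report
```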

### Keyword Generation AI Function (01C-KG)
- [ ] **KG-1:** Loads keyword templates from SectorAttributeTemplate for correct site_type
- [ ] **KG-2:** Substitutes attribute values into templates to generate base keywords
- [ ] **KG-3:** Generates long-tail variants (best, review, vs, for, how to) for each base keyword
- [ ] **KG-4:** Deduplicates keywords across all clusters (no keyword appears twice)
- [ ] **KG-5:** Global deduplication identifies multi-cluster keywords and reassigns via conflict resolution
- [ ] **KG-6:** Per-cluster keyword count: 10-25 keywords (soft target 15)
- [ ] **KG-7:** Total keyword count: 300-500+ for site (configurable per sector)
- [ ] **KG-8:** Keywords enriched with search volume, difficulty, intent classification
- [ ] **KG-9:** API response includes per-cluster breakdown, deduplication summary, total keyword count
- [ ] **KG-10:** Handles missing attribute values gracefully (skips template if required attrs not present)
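
KG-6 and KG-7 can be validated together after deduplication. A sketch under assumptions: `check_keyword_counts` is a hypothetical helper, and because KG-7 reads "300-500+", only the lower bound of the site total is treated as hard unless a `total_max` is passed.

```python
def check_keyword_counts(keywords_per_cluster, *, per_cluster=(10, 25),
                         total_min=300, total_max=None):
    """Summarise keyword counts against the per-cluster and site-wide targets."""
    low, high = per_cluster
    out_of_range = {
        cluster_id: len(keywords)
        for cluster_id, keywords in keywords_per_cluster.items()
        if not low <= len(keywords) <= high
    }
    total = sum(len(keywords) for keywords in keywords_per_cluster.values())
    total_ok = total >= total_min and (total_max is None or total <= total_max)
    return {
        "total_unique_keywords": total,
        "clusters_out_of_range": out_of_range,
        "within_constraints": not out_of_range and total_ok,
    }
```

The `within_constraints` flag corresponds to the field of the same name in the keyword generation response.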

### Keyword Conflict Resolution (01C-CR)
- [ ] **CR-1:** Identifies keywords matching multiple clusters (≥2 matches)
- [ ] **CR-2:** Decision Criterion 1: assigns to cluster with most dimensional intersections
- [ ] **CR-3:** Decision Criterion 2 (tiebreaker): assigns to more specific cluster
- [ ] **CR-4:** Decision Criterion 3 (tiebreaker): assigns by primary user intent match
- [ ] **CR-5:** Decision Criterion 4 (last resort): flags for user review with clear reasoning
- [ ] **CR-6:** Reassignment logic preserves keyword integrity (no loss, duplication, or orphaning)
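
The four criteria form a cascade of tiebreakers, which can be sketched as successive filters. This is illustrative only: the signature shown for `resolve_keyword_conflict` is hypothetical, and the candidate fields `specificity` and `primary_intent` are assumed inputs, since this section does not define how specificity or a cluster's primary intent are computed. It expects at least two candidates, as in CR-1.

```python
def resolve_keyword_conflict(keyword, candidates):
    """Pick one cluster for a keyword that matched several, or flag it.

    keyword: dict with an "intent" key.
    candidates: dicts with cluster_key, intersection_depth, specificity,
    primary_intent (the last two are assumed fields).
    """
    criteria = (
        lambda c: c["intersection_depth"],                         # CR-2
        lambda c: c["specificity"],                                # CR-3
        lambda c: int(c["primary_intent"] == keyword["intent"]),   # CR-4
    )
    remaining = list(candidates)
    for score in criteria:
        best = max(score(c) for c in remaining)
        remaining = [c for c in remaining if score(c) == best]
        if len(remaining) == 1:
            return {"assigned_to": remaining[0]["cluster_key"], "flagged_for_review": False}
    # CR-5: still tied after every criterion, so defer to the user.
    return {
        "assigned_to": None,
        "flagged_for_review": True,
        "tied": [c["cluster_key"] for c in remaining],
    }
```

Each criterion only narrows the set left by the previous one, so a later tiebreaker can never override an earlier decision.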

### Blueprint Assembly Service (01C-BA)
- [ ] **BA-1:** Creates SAGBlueprint record with status='draft'
- [ ] **BA-2:** Creates SAGAttribute records from populated attributes (preserves name, values, is_primary flag)
- [ ] **BA-3:** Creates SAGCluster records from cluster formation output (all fields populated)
- [ ] **BA-4:** Creates SAGKeyword records from keyword generation output (all fields preserved)
- [ ] **BA-5:** Associates keywords to clusters via ManyToMany relations
- [ ] **BA-6:** Generates taxonomy_plan with WP categories (primary attributes) and tags (secondary)
- [ ] **BA-7:** Generates execution_priority with 3 phases: hubs first, supporting articles, term pages
- [ ] **BA-8:** Populates denormalized JSON fields (attributes_json, clusters_json) for fast queries
- [ ] **BA-9:** Returns blueprint ID and summary (attribute count, cluster count, keyword count, next steps)
- [ ] **BA-10:** Status transitions correctly: draft → ready_for_pipeline (or intermediate statuses as needed)
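
BA-6 and BA-7 can be sketched as two pure functions. The JSON key names (`categories`, `tags`, `phase_1_hubs`, and so on) and both function names are assumptions for illustration; the stored layout of `taxonomy_plan` and `execution_priority` is not fixed by this section.

```python
def build_taxonomy_plan(attributes):
    """Primary attribute values become categories, the rest become tags."""
    plan = {"categories": [], "tags": []}
    for attr in attributes:
        bucket = "categories" if attr.get("is_primary") else "tags"
        plan[bucket].extend(attr["values"])
    return plan


def build_execution_priority(clusters, taxonomy_plan):
    """Three phases: hubs first, then supporting articles, then term pages."""
    ordered = sorted(clusters, key=lambda c: c["viability_score"], reverse=True)
    return {
        "phase_1_hubs": [c["hub_title"] for c in ordered],
        "phase_2_supporting": [
            title for c in ordered for title in c["supporting_content_plan"]
        ],
        "phase_3_term_pages": taxonomy_plan["categories"] + taxonomy_plan["tags"],
    }
```

Ordering hubs by viability score is a design choice of this sketch, so that the strongest clusters are published first.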

### Manual Keyword Supplementation (01C-MKS)
- [ ] **MKS-1:** Users can add keywords via: manual entry, CSV import, library selection, API fetch
- [ ] **MKS-2:** Manual entry accepts comma-separated keywords, validates against duplicates
- [ ] **MKS-3:** CSV import validates file structure (keyword, search_volume optional, difficulty optional)
- [ ] **MKS-4:** Library integration allows browsing & selection per site_type
- [ ] **MKS-5:** Auto-clustering maps new keywords to clusters via attribute similarity matching
- [ ] **MKS-6:** Unmatched keywords flagged for user review: gap analysis, potential new cluster, or outlier
- [ ] **MKS-7:** User can assign unmatched keywords to specific cluster or create new cluster
- [ ] **MKS-8:** API returns summary: added count, auto-clustered count, flagged count, conflicts resolved
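
The CSV structure check in MKS-3 can be sketched with the standard `csv` module. This assumes a header row with a required `keyword` column and optional `search_volume` and `difficulty` columns; `parse_keyword_csv` is a hypothetical name, and fetching the file from `csv_url` is out of scope here.

```python
import csv
import io


def parse_keyword_csv(text):
    """Parse CSV text into keyword dicts, skipping blanks and duplicates."""
    reader = csv.DictReader(io.StringIO(text))
    if not reader.fieldnames or "keyword" not in reader.fieldnames:
        raise ValueError("CSV must have a 'keyword' column")
    rows, seen = [], set()
    for record in reader:
        keyword = (record.get("keyword") or "").strip().lower()
        if not keyword or keyword in seen:
            continue
        seen.add(keyword)
        volume = (record.get("search_volume") or "").strip()
        rows.append({
            "keyword_text": keyword,
            # Non-numeric or missing volumes are stored as unknown.
            "search_volume": int(volume) if volume.isdecimal() else None,
            "difficulty": (record.get("difficulty") or "").strip().lower(),
            "source": "csv_import",
        })
    return rows
```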

### API Endpoints (01C-API)
- [ ] **API-1:** POST /api/v1/blueprints/{blueprint_id}/clusters/form/ returns 200 + cluster results
- [ ] **API-2:** POST /api/v1/blueprints/{blueprint_id}/keywords/generate/ returns 200 + keyword results
- [ ] **API-3:** POST /api/v1/blueprints/{blueprint_id}/keywords/supplement/ returns 200 + supplementation summary
- [ ] **API-4:** POST /api/v1/blueprints/{blueprint_id}/assemble/ returns 200 + blueprint summary
- [ ] **API-5:** GET /api/v1/blueprints/{blueprint_id}/clusters/ supports status, type, min_viability filters
- [ ] **API-6:** GET /api/v1/blueprints/{blueprint_id}/keywords/ supports cluster_id, source, intent, min_search_volume filters
- [ ] **API-7:** DELETE /api/v1/blueprints/{blueprint_id}/keywords/{keyword_id}/ only works while blueprint.status is `draft` or `keyword_generation_complete`
- [ ] **API-8:** Error handling: 400 (bad input), 403 (forbidden), 404 (not found), 422 (unprocessable)

### Data Integrity (01C-DI)
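- [ ] **DI-1:** No keyword appears in multiple clusters. `unique_together = ('cluster', 'keyword_text')` on SAGKeyword only blocks duplicates within a single cluster, so cross-cluster uniqueness must be enforced by global deduplication and conflict resolution (KG-4, KG-5) in the service layer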
- [ ] **DI-2:** Deleted clusters cascade-delete associated keywords (no orphaned keywords)
- [ ] **DI-3:** Deleted blueprints cascade-delete all attributes, clusters, keywords
- [ ] **DI-4:** Blueprint status transitions prevent invalid operations (e.g., can't supplement keywords on published blueprint)
- [ ] **DI-5:** Denormalized JSON fields stay in sync with normalized records (updated on every change)
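
Since the SAGKeyword constraint is declared on the pair `('cluster', 'keyword_text')`, a keyword that sits in two different clusters would not violate it. A service-level check along the following lines could cover that gap; it is a sketch, and `find_cross_cluster_duplicates` is a hypothetical helper operating on `(cluster_key, keyword_text)` pairs.

```python
from collections import defaultdict


def find_cross_cluster_duplicates(rows):
    """Map each keyword held by more than one cluster to those cluster keys.

    Keywords are compared case-insensitively with whitespace collapsed.
    """
    owners = defaultdict(set)
    for cluster_key, keyword_text in rows:
        normalized = " ".join(keyword_text.lower().split())
        owners[normalized].add(cluster_key)
    return {kw: sorted(keys) for kw, keys in owners.items() if len(keys) > 1}
```

An empty result means DI-1 holds for the given rows; any entry is a candidate for the conflict resolution rules in 01C-CR.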

### Performance (01C-PERF)
- [ ] **PERF-1:** Cluster formation completes in <5 seconds for 100+ intersection combinations
- [ ] **PERF-2:** Keyword generation completes in <10 seconds for 50 clusters
- [ ] **PERF-3:** Blueprint assembly completes in <3 seconds (DB writes + JSON generation)
- [ ] **PERF-4:** GET endpoints with filters return results in <2 seconds
- [ ] **PERF-5:** CSV import (1000 keywords) completes in <15 seconds

---

## 6. Claude Code Instructions

### 6.1 Generating Cluster Formation Logic

**Prompt Template for Claude:**
```
Generate the cluster formation algorithm for an AI-powered content planning system.

Input:
- populated_attributes: List of attributes with values from user setup wizard
  Example: [
    {"name": "Pet Type", "values": ["Dogs", "Cats", "Birds"]},
    {"name": "Health Condition", "values": ["Allergies", "Arthritis", "Obesity"]}
  ]
- sector_context: Information about the sector (e.g., "pet health e-commerce")

Task:
1. Generate all meaningful 2-value intersections across different attributes (e.g. Pet Type × Health Condition); values of the same attribute are not paired with each other
2. For each intersection, use Claude's reasoning to evaluate:
   - Is this a real topical ecosystem? (do the dimensions naturally fit together?)
   - Would users search for this? (assess search demand)
   - Can we build 1 hub + 3-8 supporting articles?
   - Is it differentiated from other clusters?
3. Classify valid clusters by type: product_category, condition_problem, feature, brand, informational, comparison, life_stage
4. Generate a compelling hub title and 5-8 supporting content titles
5. Assign a viability score (0-1) based on coherence, search demand, content potential

Output:
- clusters: Array of cluster objects with all fields from the spec
- summary: Total clusters, type distribution, viability analysis

Constraints:
- Max 50 clusters per sector
- Minimum 3 dimensional intersections for strong clusters
- Quality over quantity: prefer 5 strong clusters over 15 weak ones
```

### 6.2 Generating Keyword Generation Logic

**Prompt Template for Claude:**
```
Generate keywords for content clusters using templates and AI-driven expansion.

Input:
- clusters: Array of clusters from cluster formation (with dimensions and hub title)
- keyword_templates: Pre-configured templates for site_type
  Example: [
    "best {health_condition} for {pet_type}",
    "{pet_type} {health_condition} treatment",
    "affordable {health_condition} relief for {pet_type}"
  ]
- sector_context: Site type (ecommerce, blog, saas, etc.)

Task:
1. Load keyword templates filtered by sector site_type
2. For each cluster:
   - Extract dimension values
   - Substitute values into matching templates
   - Generate long-tail variants: best, review, vs, for, how to
   - Enrich with search volume, difficulty, intent (informational, transactional, etc.)
3. Deduplicate globally across all clusters
4. Identify multi-cluster keywords and resolve conflicts via:
   - Highest dimensional intersection count
   - Most specific cluster (tiebreaker)
   - Primary user intent match (tiebreaker)
5. Validate constraints: 10-25 per cluster, 300-500 total

Output:
- keywords_per_cluster: Keywords organized by cluster ID
- deduplication: Count of duplicates removed, conflicts flagged
- summary: Total unique keywords, per-cluster average, search volume total

Constraints:
- Do NOT generate more than 25 keywords per cluster
- Do NOT allow duplicates
- Prioritize high search volume keywords
- Ensure diversity: mix of base keywords and long-tail variants
```

### 6.3 Integrating with Setup Wizard (01D)

**Implementation Notes:**
1. After user completes attribute population in wizard:
   - Call `POST /api/v1/blueprints/{blueprint_id}/clusters/form/`
   - Display clusters to user (preview mode)
   - Allow user to: review, edit (rename hub titles, remove clusters), or confirm

2. After user confirms clusters:
   - Call `POST /api/v1/blueprints/{blueprint_id}/keywords/generate/`
   - Display keywords grouped by cluster (preview mode)
   - Allow user to: supplement keywords, remove outliers, or confirm

3. Before finalizing blueprint:
   - Optionally allow manual keyword supplementation (CSV, library, manual entry)
   - Call `POST /api/v1/blueprints/{blueprint_id}/keywords/supplement/` for each source
   - Resolve conflicts (auto or manual)
   - Call `POST /api/v1/blueprints/{blueprint_id}/assemble/` to finalize

### 6.4 Testing with Sample Data

**Test Case 1: Pet Health E-commerce Site**
```python
populated_attributes = [
    {"name": "Pet Type", "values": ["Dogs", "Cats"]},
    {"name": "Health Condition", "values": ["Arthritis", "Allergies", "Obesity"]},
    {"name": "Target Audience", "values": ["Pet Owners", "Veterinarians"]}
]

sector_context = {
    "sector_id": 1,  # integer PK (BigAutoField)
    "site_type": "ecommerce",
    "sector_name": "Pet Health Products"
}

# Expected clusters:
# 1. Dog Arthritis Relief (product_category)
# 2. Cat Allergies Nutrition (product_category)
# 3. Senior Dog Joint Support (life_stage)
# ... etc.
```

**Test Case 2: Local Service (Veterinary Clinic)**
```python
populated_attributes = [
    {"name": "Service Type", "values": ["Surgery", "Preventive Care", "Emergency"]},
    {"name": "Pet Type", "values": ["Dogs", "Cats", "Exotic"]},
    {"name": "Location", "values": ["Downtown", "Suburbs"]}
]

sector_context = {
    "sector_id": 2,  # integer PK (BigAutoField)
    "site_type": "local_service",
    "sector_name": "Veterinary Clinic"
}

# Expected clusters:
# (local_service is the site_type, not a cluster type, so only cluster types are listed)
# 1. Emergency Dog Surgery Downtown (product_category)
# 2. Preventive Cat Care Suburbs (informational)
# ... etc.
```

---

## 7. Cross-Document References

### Upstream Dependencies
- **01A (SAG Master Data Models):** Provides SAGBlueprint, SAGAttribute, SAGCluster base models
- **01B (Sector Attribute Templates):** Provides attribute framework, keyword templates, site_type configurations

### Downstream Consumers
- **01D (Setup Wizard):** Triggers cluster formation & keyword generation after attribute population
- **01E (Blueprint-aware Pipeline):** Uses clusters, keywords, taxonomy_plan, execution_priority for content generation
- **01F (Existing Site Analysis):** May feed competitor/existing keywords into supplementation process
- **01G (Health Monitoring):** Tracks cluster completeness, keyword coverage, content generation progress against blueprint

---

## 8. Appendix: Algorithm Complexity & Performance Estimates

### Cluster Formation Complexity
- **Input:** N attributes with M average values each
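- **Intersections Generated:** C(N,2) × M² = O(N²·M²) for 2-value, C(N,3) × M³ = O(N³·M³) for 3-value (one value from each of 2 or 3 distinct attributes)
- **AI Evaluations:** One call per candidate intersection, so O(N²·M²) or O(N³·M³) function calls (largest cost)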
- **Time Estimate:** ~1-2 seconds per 100 intersections (depending on Claude API latency)
- **Bottleneck:** Claude API response time for viability evaluation
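
The number of candidate intersections, and therefore of AI evaluations, can be computed exactly before any API call is made. A small sketch, assuming one value is taken from each of `depth` distinct attributes; `count_intersections` is a hypothetical helper.

```python
from itertools import combinations
from math import prod


def count_intersections(value_counts, depth):
    """Count candidates given the number of values per attribute.

    For every group of `depth` attributes, the candidates are the product
    of their value counts; the total is the sum over all groups.
    """
    return sum(prod(group) for group in combinations(value_counts, depth))
```

With uniform value counts this reduces to C(N, depth) × M^depth: for example, 4 attributes with 5 values each give 150 two-value and 500 three-value candidates, which is useful for estimating API cost up front.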

### Keyword Generation Complexity
- **Input:** C clusters, T keyword templates per cluster
- **Base Keywords:** O(C × T) (template substitution)
- **Long-tail Variants:** O(C × T × V) where V ≈ 6 (base + 5 variants: best, review, vs, for, how to)
- **Deduplication:** O(K log K) where K = total keywords (sort-based)
- **Time Estimate:** ~3-5 seconds for 300+ keywords
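
The sort-based deduplication mentioned above can be sketched with `sorted` and `itertools.groupby`. This is one possible approach, not necessarily the one in `keyword_generation.py`: `dedupe_keywords` is a hypothetical name, and keeping the highest-volume entry among duplicates is an assumption.

```python
from itertools import groupby


def _normalize(keyword):
    """Lowercase and collapse whitespace so near-identical texts compare equal."""
    return " ".join(keyword["keyword_text"].lower().split())


def dedupe_keywords(keywords):
    """Return one entry per normalized keyword text in O(K log K).

    Sorting puts duplicates next to each other with the highest
    search_volume first, so the first item of each group is kept.
    """
    ordered = sorted(
        keywords,
        key=lambda k: (_normalize(k), -(k.get("search_volume") or 0)),
    )
    return [next(group) for _, group in groupby(ordered, key=_normalize)]
```

The result is ordered by keyword text; re-sort by search volume afterwards if the caller needs that order.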

### Blueprint Assembly Complexity
- **DB Writes:** O(A + C + K) where A=attributes, C=clusters, K=keywords
- **JSON Generation:** O(A + C + K) for denormalization
- **Time Estimate:** <1 second for typical blueprints (< 10 MB JSON)

---

**Document Complete**
**Status:** Ready for Development
**Next Step:** Implement Phase 1 (AI Functions) per Section 4