igny8/v2/V2-Execution-Docs/01F-existing-site-analysis-case1.md

# 01F: IGNY8 Phase 1 — Existing Site Analysis (Case 1)

**Document Type:** Build Specification
**Phase:** Phase 1: Existing Site Analysis
**Use Case:** Case 1 (Users with existing sites)
**Status:** Active Development
**Last Updated:** 2026-03-23

---

## 1. Current State

### 1.1 Existing IGNY8 WordPress Plugin

The IGNY8 WordPress plugin is currently operational with the following capabilities:

**Current Data Collection:**
- Post status tracking
- Site metadata (domain, WordPress version, plugin count, theme)
- Keyword mapping and analysis
- Site structure analysis
- Taxonomy sync across registered taxonomies
- 7 active cron jobs managing periodic data updates

**Current Plugin Endpoint:**
- `GET /wp-json/igny8/v1/health` — basic health check
- Plugin location: WordPress plugins directory
- Sync frequency: Configurable via cron (daily default)

**Limitations:**
- Does not collect detailed product data (WooCommerce stores)
- Does not analyze product descriptions for attribute patterns
- No collection of custom attribute assignments
- No menu structure analysis
- No blog content summary extraction
- No confidence scoring for discovered patterns
- Manual attribute creation required post-analysis

### 1.2 Case 1 User Journey

**Trigger:** A user logs in to the IGNY8 platform with an existing WooCommerce-based WordPress site

**Current Flow:**
1. User connects WordPress site via API key
2. Plugin syncs basic site data
3. User manually creates SAG blueprint
4. User manually defines attributes
5. User manually tags existing products

**Desired Flow:**
1. User connects WordPress site via API key
2. Plugin collects comprehensive site data (products, categories, content)
3. AI automatically extracts attributes from product titles/descriptions
4. System generates SAG blueprint with discovered attributes
5. System performs gap analysis (what's missing vs. SAG template)
6. User reviews and confirms blueprint
7. System auto-tags existing products
8. Blueprint feeds into content pipeline (01E) and cluster formation (01C)

### 1.3 Dependencies & Prerequisites

- WordPress 5.8+ with WooCommerce 5.0+
- IGNY8 plugin v2.0+ installed and activated
- OpenAI API or compatible LLM for attribute extraction
- Celery for async task processing (analysis may take 2-5 minutes)
- Database schema supports site analysis metadata storage
- Sector templates (01B) available for validation

---

## 2. What to Build

### 2.1 Enhanced Plugin: Site Data Collection

**Objective:** Extend WordPress plugin to collect comprehensive site data for SAG analysis.

**New Plugin Endpoint:**

```
GET /wp-json/igny8/v1/sag/site-analysis
Headers: Authorization: Bearer {IGNY8_API_TOKEN}
Query Parameters:
  - limit_products: 500 (max products to analyze; default 500)
  - include_drafts: false (include draft products; default false)
  - cache_ttl: 3600 (cache results for N seconds; default 3600)

Response: 200 OK with payload (see section 2.3)
```

**Data Collection Modules:**

| Module | Responsibility | Data Returned |
|--------|-----------------|----------------|
| ProductCollector | Extract all products with metadata | titles, descriptions, prices, categories, tags, images, custom attributes, sku |
| CategoryCollector | Map product category hierarchy | names, slugs, parent-child hierarchy, descriptions, product counts |
| TaxonomyCollector | Enumerate all custom taxonomies | taxonomy names, all registered terms, term hierarchies, term metadata |
| AttributeCollector | Extract WooCommerce attributes | attribute names, attribute types (select/text/color), all values, product assignments |
| PageCollector | Identify key pages | titles, URLs, content summaries (first 500 chars), page type detection |
| PostCollector | Extract blog posts | titles, URLs, content summaries, categories, tags, publish date |
| MenuCollector | Analyze navigation structure | menu items, hierarchy, target URLs/categories |
| PluginCollector | Document site technical stack | active plugins, theme, WordPress version, WooCommerce version |

**Implementation:**
- Location: `plugins/igny8-sync/includes/collectors/`
- Each collector implements `DataCollectorInterface` with `collect()` and `sanitize()` methods
- Data sanitization: Remove PII, HTML tags, limit text length
- Error handling: Log failures per collector, return partial data if one collector fails
- Performance: Optimize queries to avoid site slowdown (use transients, batch operations)
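
The "return partial data if one collector fails" rule can be illustrated with a short, language-agnostic sketch. It is written in Python only to match the other examples in this document; the plugin itself would implement the same pattern in PHP. The `run_collectors` helper and the logger name are hypothetical and are not part of the existing plugin.

```python
import logging
from typing import Any, Callable, Dict, Tuple

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("igny8.collectors")


def run_collectors(
    collectors: Dict[str, Callable[[], Any]],
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Run each collector independently and keep whatever succeeds.

    Returns (data, errors): `data` maps collector name to its result,
    `errors` maps the name of each failed collector to its error message.
    """
    data: Dict[str, Any] = {}
    errors: Dict[str, str] = {}
    for name, collect in collectors.items():
        try:
            data[name] = collect()
        except Exception as exc:  # one failure must not abort the others
            logger.warning("collector %s failed: %s", name, exc)
            errors[name] = str(exc)
    return data, errors
```

A caller could attach `errors` to the response so the backend knows which sections of the payload are missing.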

**Plugin Cron Job Addition:**
- New job: `igny8_sync_sag_site_analysis` (optional; runs only when a user triggers an analysis)
- Frequency: On-demand via API call; not added to the recurring cron schedule
- Timeout: 60 seconds (analysis itself happens server-side via Celery)

### 2.2 AI Attribute Extraction Service

**File:** `sag/ai_functions/attribute_extraction.py`
**Register Key:** `extract_site_attributes`
**Input Type:** SiteAnalysisPayload
**Output Type:** AttributeExtractionResult

**Function Signature:**

```python
def extract_site_attributes(
    site_data: SiteAnalysisPayload,
    sector_template: Optional[SectorTemplate] = None,
    confidence_threshold: float = 0.6,
    max_attributes: int = 20
) -> AttributeExtractionResult:
    """
    Analyze site data to discover attributes.

    Args:
        site_data: Raw site data from WordPress plugin
        sector_template: Optional sector template for validation
        confidence_threshold: Min confidence to include attribute (0.0-1.0)
        max_attributes: Max attributes to return

    Returns:
        AttributeExtractionResult with discovered attributes, frequencies, confidence scores
    """
```

**Algorithm:**

1. **Text Analysis Phase**
   - Concatenate product titles and descriptions
   - Apply tokenization and noun phrase extraction
   - Identify recurring modifiers and descriptors
   - Extract from category names and tags
   - Extract from custom attribute values (if any exist)

2. **Pattern Recognition Phase**
   - Group similar terms (e.g., "back pain" + "back relief" + "lower back" → "back/spine")
   - Calculate frequency across product dataset
   - Identify dimensional axes (e.g., "target area," "device type")
   - Score statistical significance

3. **Validation Phase**
   - Cross-reference against sector template (if provided)
   - Validate against common attribute taxonomies
   - Flag conflicting or ambiguous discoveries
   - Assign confidence scores based on:
     - Frequency (how often the term appears)
     - Consistency (whether it appears across multiple products)
     - Specificity (whether it is specific enough to be useful)
     - Template alignment (whether it matches a known template attribute)

4. **Ranking Phase**
   - Rank by frequency and confidence
   - Assign dimensionality (Primary/Secondary/Tertiary)
   - Cap results at `max_attributes`

**Output Structure:**

```json
{
  "analysis_id": "uuid",
  "site_id": "uuid",
  "timestamp": "2026-03-23T14:30:00Z",
  "analysis_confidence": 0.82,
  "attributes": [
    {
      "name": "Target Area",
      "dimension": "Primary",
      "confidence": 0.95,
      "frequency": 32,
      "discovered_from": ["product_titles", "product_descriptions", "categories"],
      "values": [
        {
          "value": "Neck",
          "frequency": 12,
          "example_products": ["Product A", "Product B"]
        },
        {
          "value": "Back",
          "frequency": 8,
          "example_products": ["Product C"]
        },
        {
          "value": "Foot",
          "frequency": 25,
          "example_products": ["Product D", "Product E"]
        }
      ],
      "template_validation": {
        "matched_sector": "massage_devices",
        "matched_attribute": "body_region",
        "alignment_score": 0.98
      }
    },
    {
      "name": "Device Type",
      "dimension": "Primary",
      "confidence": 0.88,
      "frequency": 28,
      "discovered_from": ["product_titles", "product_descriptions"],
      "values": [
        {
          "value": "Shiatsu",
          "frequency": 18,
          "example_products": ["Product F"]
        },
        {
          "value": "EMS",
          "frequency": 7,
          "example_products": ["Product G"]
        },
        {
          "value": "Percussion",
          "frequency": 3,
          "example_products": ["Product H"]
        }
      ],
      "template_validation": {
        "matched_sector": "massage_devices",
        "matched_attribute": "therapy_type",
        "alignment_score": 0.91
      }
    },
    {
      "name": "Heat Setting",
      "dimension": "Secondary",
      "confidence": 0.72,
      "frequency": 15,
      "discovered_from": ["product_descriptions"],
      "values": [
        {
          "value": "Heated",
          "frequency": 15,
          "example_products": ["Product I", "Product J"]
        }
      ],
      "template_validation": {
        "matched_sector": "massage_devices",
        "matched_attribute": "heat_enabled",
        "alignment_score": 0.85
      }
    }
  ],
  "low_confidence_discoveries": [
    {
      "name": "Brand",
      "confidence": 0.55,
      "reason": "High variability, many single-mention values"
    }
  ],
  "analysis_notes": {
    "total_products_analyzed": 50,
    "total_categories": 8,
    "total_tags": 23,
    "extraction_method": "llm_analysis",
    "model_used": "gpt-4-turbo"
  }
}
```

**Error Handling:**
- Insufficient data: Log warning, return empty attributes list
- LLM API failure: Retry with exponential backoff (3 retries)
- Timeout (>5 minutes): Abort and return partial results
- Invalid sector template: Log error, continue analysis without validation
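
The retry rule for LLM API failures can be sketched as a small wrapper. This is a minimal sketch under stated assumptions: `LLMCallError` is a hypothetical exception standing in for whatever the real LLM client raises, and the 1-second base delay is an assumed value that this spec does not define.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class LLMCallError(Exception):
    """Hypothetical stand-in for a transient LLM API failure."""


def call_with_backoff(
    fn: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn; on LLMCallError retry up to `retries` times.

    The wait doubles after each failure: base_delay, 2x, 4x, ...
    The final failure is re-raised to the caller.
    """
    attempt = 0
    while True:
        try:
            return fn()
        except LLMCallError:
            if attempt >= retries:
                raise
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```

The `sleep` parameter is injectable so that tests can run without real delays.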

**Performance Considerations:**
- Cache sector templates in memory
- Batch LLM calls (process 5-10 products per API call)
- Store extraction results in database for audit trail
- Return results within 2-5 minutes for typical sites
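
Batching products before each LLM call could look like the following. The helper name `batched` and the default batch size of 8 (chosen from within the 5-10 range above) are assumptions for illustration.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int = 8) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

With 50 products and a batch size of 8, this yields seven batches, the last containing two products.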

### 2.3 Data Models

#### SiteAnalysisPayload

```python
from dataclasses import dataclass
from typing import List, Dict, Optional

@dataclass
class Product:
    id: str
    title: str
    description: str
    sku: str
    price: float
    categories: List[str]
    tags: List[str]
    custom_attributes: Dict[str, List[str]]
    image_urls: List[str]

@dataclass
class Category:
    id: str
    name: str
    slug: str
    parent_id: Optional[str]
    description: str
    product_count: int

@dataclass
class Taxonomy:
    name: str
    label: str
    is_hierarchical: bool
    terms: List['Term']

@dataclass
class Term:
    id: str
    name: str
    slug: str
    parent_id: Optional[str]
    description: str
    count: int

@dataclass
class Page:
    id: str
    title: str
    url: str
    content_summary: str
    page_type: str  # e.g., "shop", "landing", "faq"

@dataclass
class Post:
    id: str
    title: str
    url: str
    content_summary: str
    categories: List[str]
    tags: List[str]
    publish_date: str

@dataclass
class MenuItem:
    id: str
    title: str
    url: str
    target: str
    parent_id: Optional[str]

@dataclass
class SiteMetadata:
    site_id: str
    domain: str
    wordpress_version: str
    woocommerce_version: str
    total_products: int
    total_categories: int
    total_pages: int
    total_posts: int
    active_plugins: List[str]
    theme: str

@dataclass
class SiteAnalysisPayload:
    metadata: SiteMetadata
    products: List[Product]
    categories: List[Category]
    taxonomies: List[Taxonomy]
    pages: List[Page]
    posts: List[Post]
    menus: List[MenuItem]
    collected_at: str  # ISO 8601 timestamp
```

#### AttributeExtractionResult

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AttributeValue:
    value: str
    frequency: int
    example_products: List[str]

@dataclass
class TemplateValidation:
    matched_sector: str
    matched_attribute: str
    alignment_score: float

@dataclass
class DiscoveredAttribute:
    name: str
    dimension: str  # "Primary", "Secondary", "Tertiary"
    confidence: float  # 0.0-1.0
    frequency: int
    discovered_from: List[str]  # ["product_titles", "product_descriptions", ...]
    values: List[AttributeValue]
    template_validation: Optional[TemplateValidation]

@dataclass
class LowConfidenceDiscovery:
    name: str
    confidence: float
    reason: str

@dataclass
class AnalysisNotes:
    total_products_analyzed: int
    total_categories: int
    total_tags: int
    extraction_method: str
    model_used: str

@dataclass
class AttributeExtractionResult:
    analysis_id: str
    site_id: str
    timestamp: str
    analysis_confidence: float
    attributes: List[DiscoveredAttribute]
    low_confidence_discoveries: List[LowConfidenceDiscovery]
    analysis_notes: AnalysisNotes
```

### 2.4 Gap Analysis Service

**File:** `sag/services/gap_analysis_service.py`
**Class:** `GapAnalysisService`
**Method:** `analyze_gap(site_data: SiteAnalysisPayload, blueprint: SAGBlueprint) -> GapAnalysisReport`

**Purpose:** Compare existing site structure against SAG blueprint to identify gaps.

**Analysis Dimensions:**

1. **Attribute Coverage Gap**
   - SAG blueprint specifies X attributes
   - Site currently has Y custom attributes assigned to products
   - Gap: Missing attributes or low coverage (% of products with attribute values)

2. **Hub Page Gap**
   - Blueprint specifies Z cluster hubs
   - Site analysis identifies M existing pages
   - Gap: Missing hub pages (authority pages for attribute clusters)

3. **Term Landing Page Gap**
   - Blueprint specifies N attribute values requiring term landing pages
   - Site has existing category/tag pages
   - Gap: Missing term landing pages (one per attribute value)

4. **Blog Content Gap**
   - Blueprint specifies recommended blog posts per cluster
   - Site has P existing blog posts
   - Gap: Missing blog content aligned to clusters and keyword targets

5. **Internal Linking Gap**
   - Blueprint specifies internal linking strategy
   - Site has current internal link structure
   - Gap: Missing cross-cluster and term-to-hub links

6. **Product Enrichment Gap**
   - Products lacking attribute assignments
   - Products missing description optimization
   - Products missing images

7. **Technical SEO Gap**
   - Missing schema markup for products
   - Category pages lacking optimization
   - Menu structure not optimized for crawlability
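
As an illustration of the first dimension, attribute coverage can be computed as the share of products that carry at least one value for an attribute. This is a sketch only: the function name `attribute_coverage` and the shape of its return value are assumptions, and the input is taken to be the list of `custom_attributes` mappings from the collected products.

```python
from typing import Dict, List


def attribute_coverage(
    product_attributes: List[Dict[str, List[str]]],
    attribute: str,
) -> Dict[str, object]:
    """Count products with at least one value assigned for `attribute`."""
    total = len(product_attributes)
    covered = sum(1 for attrs in product_attributes if attrs.get(attribute))
    percent = round(100 * covered / total) if total else 0
    return {
        "attribute": attribute,
        "covered": covered,
        "total": total,
        "coverage_percent": percent,
        "missing": total - covered,
    }
```

For a 50-product catalog where 40 products carry a Device Type value, this reports 80% coverage and 10 missing assignments.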

**Output Structure:**

```json
{
  "analysis_id": "uuid",
  "site_id": "uuid",
  "blueprint_id": "uuid",
  "timestamp": "2026-03-23T14:30:00Z",
  "summary": {
    "products_current": 50,
    "products_gap": 0,
    "attributes_current": 3,
    "attributes_blueprint": 8,
    "attributes_gap": 5,
    "hub_pages_current": 2,
    "hub_pages_blueprint": 4,
    "hub_pages_gap": 2,
    "term_pages_current": 12,
    "term_pages_blueprint": 35,
    "term_pages_gap": 23,
    "blog_posts_current": 8,
    "blog_posts_blueprint": 24,
    "blog_posts_gap": 16,
    "overall_gap_percentage": 62
  },
  "attributes_gap_detail": [
    {
      "attribute": "Target Area",
      "coverage_current": "100% (50/50)",
      "coverage_blueprint": "100% (50/50)",
      "gap": "None — attribute well-covered"
    },
    {
      "attribute": "Device Type",
      "coverage_current": "80% (40/50)",
      "coverage_blueprint": "100% (50/50)",
      "gap": "10 products missing Device Type assignment"
    }
  ],
  "hub_pages_gap_detail": [
    {
      "cluster": "Foot Massagers",
      "status": "EXISTS",
      "url": "/shop/foot-massagers",
      "optimization_notes": "Good; consider adding testimonials section"
    },
    {
      "cluster": "Neck & Shoulder Relief",
      "status": "MISSING",
      "recommendation": "Create hub page at /neck-shoulder-relief"
    }
  ],
  "term_pages_gap_detail": [
    {
      "attribute": "Target Area",
      "term": "Neck",
      "status": "MISSING",
      "recommendation": "Create term page at /target-area/neck (products filter + blog links)"
    }
  ],
  "blog_posts_gap_detail": [
    {
      "cluster": "Foot Massagers",
      "recommended_posts": [
        "Best Foot Massagers for Neuropathy",
        "How to Use Shiatsu Foot Massagers",
        "Foot Massage Benefits"
      ],
      "existing_posts": [
        "Foot Massage 101"
      ],
      "gap": 2
    }
  ],
  "internal_linking_gap": {
    "status": "High gaps identified",
    "recommendation": "Blueprint specifies 3-5 internal links per hub page; current average: 1.2",
    "priority_links": [
      "Neck hub → Foot hub (shared body region cluster)",
      "Device Type pages → Hub pages",
      "Blog posts → Related term pages"
    ]
  },
  "actionable_recommendations": [
    "IMMEDIATE: Assign Device Type to 10 untagged products",
    "WEEK 1: Create 2 missing hub pages",
    "WEEK 2: Create 23 term landing pages via script",
    "WEEK 3: Bulk create 16 blog posts (outline + AI generation)",
    "WEEK 4: Implement internal linking strategy"
  ]
}
```

### 2.5 Product Auto-Tagging Service

**File:** `sag/services/auto_tagger_service.py`
**Class:** `ProductAutoTagger`
**Method:** `generate_tag_suggestions(products: List[Product], attributes: List[DiscoveredAttribute], blueprint: SAGBlueprint) -> List[TagSuggestion]`

**Purpose:** Generate batch product-to-attribute assignments based on product titles/descriptions.

**Algorithm:**

1. For each product:
   - Extract key terms from title and description
   - Match against attribute values (fuzzy matching allowed)
   - Score confidence for each attribute assignment
   - Rank by confidence

2. For each attribute:
   - Verify assignment makes semantic sense
   - Check for conflicting assignments (e.g., can't be both "Shiatsu" and "EMS")
   - Return ranked list

3. Group by product for review UI
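
The matching and conflict-resolution steps can be sketched with the standard library's `difflib`. This is a simplified stand-in for the real scorer, under several assumptions: the string-similarity ratio is used directly as the confidence, only single-word attribute values are handled, and conflicts are resolved by keeping the single best value per attribute. The function name `suggest_tags` is hypothetical.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, List


def suggest_tags(
    title: str,
    description: str,
    attribute_values: Dict[str, List[str]],
    min_confidence: float = 0.6,
) -> List[Dict[str, object]]:
    """Propose at most one value per attribute, best match first."""
    tokens = re.findall(r"[a-z0-9]+", f"{title} {description}".lower())
    suggestions: List[Dict[str, object]] = []
    for attribute, values in attribute_values.items():
        best = None
        for value in values:
            target = value.lower()
            # Fuzzy match: best similarity between the value and any token.
            confidence = max(
                (SequenceMatcher(None, tok, target).ratio() for tok in tokens),
                default=0.0,
            )
            if confidence >= min_confidence and (
                best is None or confidence > best["confidence"]
            ):
                best = {
                    "attribute": attribute,
                    "value": value,
                    "confidence": round(confidence, 2),
                }
        # Keeping only the best value avoids conflicting assignments.
        if best is not None:
            suggestions.append(best)
    suggestions.sort(key=lambda s: s["confidence"], reverse=True)
    return suggestions
```

Character-level similarity produces false positives on short values (for example, "Neck" scores fairly high against the brand name "Nekteck"), which is one reason the algorithm above includes a semantic verification step before suggestions are shown for review.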

**Output Structure:**

```json
{
  "batch_id": "uuid",
  "site_id": "uuid",
  "blueprint_id": "uuid",
  "timestamp": "2026-03-23T14:30:00Z",
  "total_products": 50,
  "total_suggestions": 87,
  "suggestions": [
    {
      "product_id": "woo_123",
      "product_title": "Nekteck Foot Massager with Heat",
      "proposed_tags": [
        {
          "attribute": "Target Area",
          "value": "Foot",
          "confidence": 0.98,
          "reasoning": "Title contains 'Foot Massager'"
        },
        {
          "attribute": "Device Type",
          "value": "Shiatsu",
          "confidence": 0.82,
          "reasoning": "Description mentions shiatsu nodes"
        },
        {
          "attribute": "Heat Setting",
          "value": "Heated",
          "confidence": 0.95,
          "reasoning": "Title explicitly states 'with Heat'"
        }
      ],
      "status": "pending_review"
    }
  ],
  "summary": {
    "high_confidence_suggestions": 72,
    "medium_confidence_suggestions": 12,
    "low_confidence_suggestions": 3,
    "conflicts_detected": 0,
    "ready_to_apply": true
  }
}
```

---

## 3. APIs & Endpoints

### 3.1 Backend API Endpoints

All endpoints are authenticated via `Authorization: Bearer {IGNY8_API_TOKEN}` header.

#### POST /api/v1/sag/sites/{site_id}/analyze/

**Purpose:** Trigger comprehensive site analysis (async).

**Request:**
```json
{
  "include_draft_products": false,
  "product_limit": 500,
  "sector_template_id": "optional_uuid",
  "webhook_url": "optional_https_url_for_completion_notification"
}
```

**Response:** 202 Accepted
```json
{
  "task_id": "celery_task_uuid",
  "site_id": "site_uuid",
  "status": "queued",
  "estimated_duration_seconds": 120,
  "check_status_url": "/api/v1/sag/sites/{site_id}/analysis-status/?task_id={task_id}"
}
```

**Error Responses:**
- 400: Invalid parameters
- 401: Unauthorized
- 404: Site not found
- 429: Rate limited (max 1 analysis per 30 minutes per site)
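
The 30-minute rate limit could be checked with a helper like the one below before queueing the Celery task. This is a sketch under assumptions: the function name `retry_after_seconds` is hypothetical, and where the timestamp of the last analysis is stored is left open.

```python
import math
from datetime import datetime, timedelta
from typing import Optional

ANALYSIS_COOLDOWN = timedelta(minutes=30)


def retry_after_seconds(last_started: Optional[datetime], now: datetime) -> int:
    """Return 0 if a new analysis may start, else the seconds left to wait."""
    if last_started is None:
        return 0
    remaining = (last_started + ANALYSIS_COOLDOWN - now).total_seconds()
    return max(0, math.ceil(remaining))
```

A non-zero result would map to the 429 response; the value could also be returned to the client in a `Retry-After` header, although this spec does not require that.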

---

#### GET /api/v1/sag/sites/{site_id}/analysis-status/

**Purpose:** Check analysis progress.

**Query Parameters:**
- `task_id` (required): Celery task ID from analysis trigger

**Response:** 200 OK
```json
{
  "task_id": "celery_task_uuid",
  "site_id": "site_uuid",
  "status": "processing",
  "progress_percent": 45,
  "current_step": "Analyzing product attributes",
  "elapsed_seconds": 32,
  "estimated_remaining_seconds": 48
}
```

**Status Values:**
- `queued` — waiting to start
- `processing` — actively analyzing
- `complete` — analysis finished
- `failed` — analysis error (see error message)
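
A client could poll this endpoint until a terminal status is reached. The sketch below is transport-agnostic: `fetch_status` is a hypothetical callable that performs the GET request and returns the decoded JSON body, and the 5-second interval and 300-second timeout are assumed defaults.

```python
import time
from typing import Callable, Dict

TERMINAL_STATUSES = {"complete", "failed"}


def wait_for_analysis(
    fetch_status: Callable[[], Dict],
    poll_interval: float = 5.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict:
    """Poll until status is `complete` or `failed`, or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status.get("status") in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError("analysis did not finish before the timeout")
        sleep(poll_interval)
```

Clients that pass `webhook_url` when triggering the analysis would not need to poll at all.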

---

#### GET /api/v1/sag/sites/{site_id}/analysis-results/

**Purpose:** Retrieve completed analysis results.

**Response:** 200 OK
```json
{
  "analysis_id": "uuid",
  "site_id": "site_uuid",
  "timestamp": "2026-03-23T14:30:00Z",
  "site_data_summary": {
    "total_products": 50,
    "total_categories": 8,
    "total_pages": 12,
    "total_posts": 8
  },
  "extracted_attributes": {
    "analysis_confidence": 0.82,
    "attributes_count": 8,
    "attributes": [
      { "name": "Target Area", "dimension": "Primary", "confidence": 0.95, ... }
    ]
  },
  "gap_analysis": {
    "overall_gap_percentage": 62,
    "summary": { ... }
  },
  "status": "ready_for_review"
}
```

**Status Values:**
- `ready_for_review` — user should review before confirming
- `confirmed` — user has accepted analysis
- `archived` — superseded by newer analysis

---

#### POST /api/v1/sag/sites/{site_id}/confirm-analysis/

**Purpose:** User confirms analysis; creates SAG blueprint.

**Request:**
```json
{
  "analysis_id": "uuid",
  "approved_attributes": [
    {
      "name": "Target Area",
      "approved_values": ["Neck", "Back", "Foot"],
      "exclude_values": []
    }
  ],
  "confirmed_by_user_id": "user_uuid"
}
```

**Response:** 201 Created
```json
{
  "blueprint_id": "uuid",
  "site_id": "site_uuid",
  "analysis_id": "uuid",
  "status": "created",
  "attributes_count": 8,
  "attribute_values_count": 45,
  "created_at": "2026-03-23T14:32:00Z",
  "next_steps": [
    "Review auto-tagging suggestions",
    "Approve product tags",
    "Start content pipeline (01E)"
  ]
}
```

---

#### GET /api/v1/sag/sites/{site_id}/auto-tag/suggestions/

**Purpose:** Retrieve product auto-tagging suggestions.

**Query Parameters:**
- `blueprint_id` (required): ID of confirmed blueprint
- `confidence_min` (optional): Filter by minimum confidence (0.0-1.0, default 0.6)
- `limit` (optional): Max suggestions per product (default 5)

**Response:** 200 OK
```json
{
  "batch_id": "uuid",
  "blueprint_id": "blueprint_uuid",
  "total_suggestions": 87,
  "suggestions": [
    {
      "product_id": "woo_123",
      "product_title": "Nekteck Foot Massager",
      "proposed_tags": [
        {
          "attribute": "Target Area",
          "value": "Foot",
          "confidence": 0.98,
          "reasoning": "Title contains 'Foot Massager'"
        }
      ]
    }
  ]
}
```

---

#### POST /api/v1/sag/sites/{site_id}/auto-tag/apply/

**Purpose:** Apply approved product tags to site (async bulk operation).

**Request:**
```json
{
  "blueprint_id": "uuid",
  "approved_suggestions": [
    {
      "product_id": "woo_123",
      "approved_tags": [
        {
          "attribute": "Target Area",
          "value": "Foot"
        }
      ]
    }
  ],
  "skip_existing_values": true
}
```

**Response:** 202 Accepted
```json
{
  "task_id": "celery_task_uuid",
  "site_id": "site_uuid",
  "blueprint_id": "blueprint_uuid",
  "status": "processing",
  "products_to_tag": 47,
  "tags_to_apply": 87,
  "check_status_url": "/api/v1/sag/sites/{site_id}/auto-tag/status/?task_id={task_id}"
}
```
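
One possible reading of `skip_existing_values` is sketched below: when the flag is true, an attribute that already has values on the product is left untouched; when false, the approved value replaces whatever is there. This interpretation and the `merge_tags` helper are assumptions, since this spec does not define the flag's exact semantics.

```python
from typing import Dict, List


def merge_tags(
    existing: Dict[str, List[str]],
    approved_tags: List[Dict[str, str]],
    skip_existing_values: bool = True,
) -> Dict[str, List[str]]:
    """Return a new attribute mapping with the approved tags applied."""
    merged = {attribute: list(values) for attribute, values in existing.items()}
    for tag in approved_tags:
        attribute, value = tag["attribute"], tag["value"]
        if skip_existing_values and merged.get(attribute):
            continue  # keep the product's current values
        merged[attribute] = [value]
    return merged
```

The function copies its input rather than mutating it, so a failed batch can be retried against the original product data.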

---

#### GET /api/v1/sag/sites/{site_id}/auto-tag/status/

**Purpose:** Check auto-tagging progress.

**Query Parameters:**
- `task_id` (required): Celery task ID

**Response:** 200 OK
```json
{
  "task_id": "celery_task_uuid",
  "site_id": "site_uuid",
  "status": "processing",
  "progress_percent": 62,
  "products_tagged": 29,
  "total_products": 47,
  "tags_applied": 54,
  "estimated_remaining_seconds": 30
}
```

---

### 3.2 WordPress Plugin Endpoint

#### GET /wp-json/igny8/v1/sag/site-analysis

**Purpose:** Collect comprehensive site data for analysis.

**Headers:**
- `Authorization: Bearer {IGNY8_API_TOKEN}`
- `X-IGNY8-Request-ID: {uuid}` (optional, for request tracking)

**Query Parameters:**
- `limit_products`: int (1-1000, default 500)
- `include_drafts`: boolean (default false)
- `cache_ttl`: int (seconds, default 3600)

**Response:** 200 OK
```json
{
  "metadata": {
    "site_id": "uuid",
    "domain": "example-store.com",
    "wordpress_version": "6.4.2",
    "woocommerce_version": "8.5.0",
    "total_products": 50,
    "total_categories": 8,
    "total_pages": 12,
    "total_posts": 8,
    "active_plugins": ["woocommerce", "yoast-seo", ...],
    "theme": "storefront"
  },
  "products": [
    {
      "id": "woo_123",
      "title": "Nekteck Foot Massager with Heat",
      "description": "Premium foot massage device...",
      "sku": "NEKTECK-FM-001",
      "price": 79.99,
      "categories": ["Foot Massagers", "Massage Devices"],
      "tags": ["heated", "cordless"],
      "custom_attributes": {
        "brand": ["Nekteck"],
        "color": ["Black"],
        "warranty": ["2 Year"]
      },
      "image_urls": ["image1.jpg", "image2.jpg"]
    }
  ],
  "categories": [
    {
      "id": "cat_1",
      "name": "Foot Massagers",
      "slug": "foot-massagers",
      "parent_id": null,
      "description": "Electronic foot massage devices",
      "product_count": 12
    }
  ],
  "taxonomies": [
    {
      "name": "brand",
      "label": "Brand",
      "is_hierarchical": false,
      "terms": [
        {
          "id": "brand_1",
          "name": "Nekteck",
          "slug": "nekteck",
          "parent_id": null,
          "description": "",
          "count": 5
        }
      ]
    }
  ],
  "pages": [
    {
      "id": "page_1",
      "title": "Shop",
      "url": "/shop",
      "content_summary": "Browse our selection of massage devices",
      "page_type": "shop"
    }
  ],
  "posts": [
    {
      "id": "post_1",
      "title": "Benefits of Foot Massage",
      "url": "/blog/foot-massage-benefits",
      "content_summary": "Learn why foot massage is beneficial...",
      "categories": ["Health"],
      "tags": ["foot", "massage"],
      "publish_date": "2026-03-15"
    }
  ],
  "menus": [
    {
      "id": "menu_1",
      "title": "Main Menu",
      "items": [
        {
          "id": "item_1",
          "title": "Shop",
          "url": "/shop",
          "target": "_self",
          "parent_id": null
        }
      ]
    }
  ],
  "collected_at": "2026-03-23T14:30:00Z"
}
```

**Error Responses:**
- 400: Invalid query parameters
- 401: Invalid or missing API token
- 500: Plugin error (logged on WordPress side)

**Performance:**
- Response time target: <5 seconds for sites with <500 products
- Data is cached for 1 hour (configurable via `cache_ttl`)
- Uses WordPress transients API for caching

---

## 4. Implementation Steps

### Phase 1: Plugin Enhancement (Week 1)

**Tasks:**
1. Create collector classes in `plugins/igny8-sync/includes/collectors/`
   - ProductCollector
   - CategoryCollector
   - TaxonomyCollector
   - AttributeCollector
   - PageCollector
   - PostCollector
   - MenuCollector
   - PluginCollector

2. Implement `DataCollectorInterface`
   - `collect()` method (fetches raw data)
   - `sanitize()` method (removes PII, normalizes format)
   - Error handling per collector

3. Add `/wp-json/igny8/v1/sag/site-analysis` endpoint
   - Route definition
   - Parameter validation
   - Response formatting
   - Caching logic

4. Add unit tests for collectors
   - Mock data tests
   - Error condition tests
   - Performance tests

**Acceptance Criteria:**
- Endpoint returns valid JSON payload matching schema
- All 8 collectors implemented and tested
- Response time <5 seconds for 500 products
- Caching works correctly
- Error handling tested

---

### Phase 2: AI Attribute Extraction (Week 1-2)

**Tasks:**
1. Implement `attribute_extraction.py`
   - Text analysis functions
   - Pattern recognition logic
   - Confidence scoring
   - Validation against sector templates

2. Register with LLM framework
   - Implement `extract_site_attributes` function
   - Add input/output validation
   - Error handling (retry logic)

3. Create data models
   - DiscoveredAttribute
   - AttributeValue
   - TemplateValidation
   - AttributeExtractionResult

4. Add unit and integration tests
   - Mock LLM responses
   - Test with real site data
   - Confidence scoring validation
   - Performance tests (2-5 minute runtime)

**Acceptance Criteria:**
- Extracts 5-20 attributes from sample site data
- Confidence scores accurate and meaningful
- Sector template validation works
- Low-confidence discoveries flagged
- Results auditable (model used, reasoning provided)

---

### Phase 3: Gap Analysis Service (Week 2)

**Tasks:**
1. Implement `gap_analysis_service.py`
   - GapAnalysisService class
   - analyze_gap() method
   - All 7 gap dimensions analyzed

2. Create gap analysis models
   - GapAnalysisReport
   - Recommendation structures
   - Detail sections

3. Integrate with blueprint comparison
   - Query SAG blueprint
   - Compare against site data
   - Calculate gap percentages

4. Add unit tests
   - Test each gap dimension
   - Test recommendation generation
   - Test report structure

**Acceptance Criteria:**
- All 7 gap dimensions analyzed
- Report clearly identifies missing elements
- Actionable recommendations provided
- Report generated in <1 second

---

### Phase 4: API Endpoints (Week 2-3)

**Tasks:**
1. Implement analysis trigger endpoint
   - POST /api/v1/sag/sites/{site_id}/analyze/
   - Celery task queueing
   - Webhook support

2. Implement status check endpoint
   - GET /api/v1/sag/sites/{site_id}/analysis-status/
   - Real-time progress updates

3. Implement results retrieval endpoint
   - GET /api/v1/sag/sites/{site_id}/analysis-results/
   - Caching of results

4. Implement blueprint confirmation endpoint
   - POST /api/v1/sag/sites/{site_id}/confirm-analysis/
   - Attribute approval logic
   - Blueprint creation

5. Add request/response validation
   - Marshmallow schemas
   - Error responses

6. Add authentication/authorization checks
   - API token validation
   - User site ownership verification

**Acceptance Criteria:**
- All 4 endpoints implemented
- Endpoints return correct status codes
- Validation working
- Authentication required and checked
- Error responses follow standard format

---

### Phase 5: Product Auto-Tagging (Week 3)

**Tasks:**
1. Implement `auto_tagger_service.py`
   - ProductAutoTagger class
   - generate_tag_suggestions() method
   - Confidence scoring

2. Create auto-tagging endpoints
   - GET /api/v1/sag/sites/{site_id}/auto-tag/suggestions/
   - POST /api/v1/sag/sites/{site_id}/auto-tag/apply/
   - GET /api/v1/sag/sites/{site_id}/auto-tag/status/

3. Implement Celery task for bulk tagging
   - Batch product processing
   - Conflict detection
   - Error handling

4. Add unit tests
   - Test suggestion generation
   - Test bulk tagging
   - Test conflict detection

**Acceptance Criteria:**
- Suggestions endpoint returns valid suggestions
- Returned suggestions have confidence scores at or above the default minimum (0.6)
- Bulk tagging applies tags correctly to products
- Progress tracking works
- 47+ products can be tagged in <2 minutes

---

### Phase 6: Frontend Components (Week 3-4)

**Tasks:**
1. Implement SiteAnalysisPanel
   - Trigger analysis button
   - Progress indicator
   - Error messaging

2. Implement DiscoveredAttributesReview
   - Display discovered attributes
   - Show confidence scores
   - Allow approval/rejection per attribute
   - Show example products

3. Implement GapAnalysisReport
   - Visual representation of gaps
   - Actionable recommendations
   - Priority ordering

4. Implement AutoTagReviewPanel
   - Display product suggestions
   - Batch selection/deselection
   - Apply tags button
   - Progress tracking

5. Add styling and UX polish
   - Responsive design
   - Loading states
   - Error states
   - Success confirmations

**Acceptance Criteria:**
- All 4 components implemented
- Responsive on desktop/tablet
- Accessible (WCAG 2.1 AA)
- User can complete workflow without errors
- Loading/error states clearly communicated

---

### Phase 7: Integration & Testing (Week 4)

**Tasks:**
1. End-to-end testing
   - Connect real WordPress site
   - Run full analysis workflow
   - Confirm blueprint created
   - Verify auto-tagging works

2. Performance testing
   - Benchmark analysis with various site sizes
   - Optimize slow operations
   - Load testing on API endpoints

3. Documentation
   - API documentation (OpenAPI/Swagger)
   - Plugin setup guide
   - User guide for Case 1 workflow
   - Developer setup guide

4. Bug fixing and refinement
   - Fix integration issues
   - Refine UI/UX based on testing
   - Improve error messages

**Acceptance Criteria:**
- End-to-end workflow works without errors
- Performance meets targets (analysis <5 min for 500 products)
- Documentation complete
- All bugs fixed
- Ready for beta testing

---

## 5. Acceptance Criteria

### 5.1 Functional Requirements

**Site Data Collection:**
- Plugin collects all 8 data types (products, categories, taxonomies, pages, posts, menus, attributes, metadata)
- Data is valid JSON matching defined schema
- All product titles/descriptions included
- Custom attribute values extracted correctly
- Menu hierarchy preserved

**Attribute Extraction:**
- AI identifies 5-20 attributes from site data
- Confidence scores meaningful and accurate
- Low-confidence discoveries flagged
- Sector template validation working
- Results include frequency counts and example products

**Gap Analysis:**
- All 7 gap dimensions analyzed
- Missing hubs, term pages, blog posts clearly identified
- Product attribute coverage calculated
- Internal linking gaps identified
- Actionable recommendations provided
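
As an illustration of the product attribute coverage criterion, the sketch below computes, per attribute, the share of products that carry a non-empty value. The product dictionary shape (an `attributes` mapping per product) and the function name are assumptions made for this example.

```python
def attribute_coverage(products: list[dict], attributes: list[str]) -> dict[str, float]:
    """Return, per attribute, the share of products with a non-empty value."""
    if not products:
        # Avoid division by zero on empty catalogs.
        return {name: 0.0 for name in attributes}
    coverage = {}
    for name in attributes:
        tagged = sum(1 for p in products if p.get("attributes", {}).get(name))
        coverage[name] = tagged / len(products)
    return coverage
```

A value of 0.5 for an attribute would mean half of the catalog still lacks that attribute, which is the kind of figure a gap report could surface.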

**Blueprint Creation:**
- Confirmed analysis creates valid SAGBlueprint
- Attributes and values recorded correctly
- Gap analysis linked to blueprint
- Blueprint feeds into cluster formation (01C)

**Product Auto-Tagging:**
- Suggestions generated for 90%+ of products
- Suggestions meet the minimum confidence score (0.6+)
- Bulk tagging applies tags correctly
- No data loss or corruption
- Existing tags not overwritten (configurable)
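
The "existing tags not overwritten (configurable)" rule can be sketched as a merge step that also reports conflicts. The function name, the flat `attribute -> value` mapping, and the `overwrite` flag are hypothetical; the actual tag storage model is defined elsewhere in the platform.

```python
def merge_tags(
    existing: dict[str, str],
    suggested: dict[str, str],
    overwrite: bool = False,
) -> tuple[dict[str, str], list[str]]:
    """Merge suggested tags into existing ones and list conflicting attributes."""
    merged = dict(existing)  # copy so the caller's data is never mutated
    conflicts = []
    for attribute, value in suggested.items():
        current = merged.get(attribute)
        if current is None:
            merged[attribute] = value
        elif current != value:
            conflicts.append(attribute)
            if overwrite:
                merged[attribute] = value
    return merged, conflicts
```

With the default `overwrite=False`, a conflicting suggestion is recorded but the product keeps its current value.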

**API Endpoints:**
- All 4 analysis endpoints implemented
- All 3 auto-tagging endpoints implemented
- Correct HTTP status codes
- Valid error responses
- Authentication required

**Frontend Components:**
- SiteAnalysisPanel triggers analysis and shows progress
- DiscoveredAttributesReview allows attribute approval
- GapAnalysisReport displays gaps clearly
- AutoTagReviewPanel allows batch product tagging
- All components responsive and accessible

### 5.2 Non-Functional Requirements

**Performance:**
- Site analysis completes in <5 minutes for typical sites (50-500 products)
- WordPress plugin endpoint responds in <5 seconds
- API endpoints respond in <2 seconds
- Frontend components load in <3 seconds

**Reliability:**
- Plugin handles errors gracefully (missing products, etc.)
- Partial failures return partial data with warnings
- Celery tasks have retry logic
- Webhook notifications reliable
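
The "partial data with warnings" behavior can be illustrated with a small aggregation pattern. It is shown in Python for consistency with the backend services; the WordPress plugin would implement the equivalent in PHP. The collector names and the response keys (`data`, `warnings`, `partial`) are assumptions for this sketch.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def collect_site_data(collectors: dict[str, Callable[[], Any]]) -> dict[str, Any]:
    """Run every collector; keep successful results and record failures."""
    data = {}
    warnings = []
    for name, collector in collectors.items():
        try:
            data[name] = collector()
        except Exception as exc:  # one failing collector must not abort the rest
            logger.warning("Collector %s failed: %s", name, exc)
            warnings.append(f"{name}: {exc}")
    return {"data": data, "warnings": warnings, "partial": bool(warnings)}
```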

**Security:**
- API token authentication required
- User can only access own sites
- No PII in logs
- HTTPS enforced
- Input validation on all endpoints

**Scalability:**
- Plugin handles 1000+ products
- API handles 100+ concurrent analysis requests
- Database indexes optimized for queries
- Caching prevents redundant processing

**Data Quality:**
- Analysis results auditable (model used, timestamps, reasoning)
- No duplicate attribute suggestions
- Confidence scores calibrated
- Low-confidence results flagged for review
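
A minimal sketch of the de-duplication and low-confidence flagging criteria, assuming each suggestion is a dictionary with `name` and `confidence` keys. The 0.6 review threshold mirrors the figure used elsewhere in this document, but applying it here is an assumption.

```python
def dedupe_and_flag(suggestions: list[dict], review_threshold: float = 0.6) -> list[dict]:
    """Keep the highest-confidence suggestion per attribute name and flag weak ones."""
    best: dict[str, dict] = {}
    for suggestion in suggestions:
        # Names are compared case-insensitively and without surrounding spaces.
        key = suggestion["name"].strip().lower()
        if key not in best or suggestion["confidence"] > best[key]["confidence"]:
            best[key] = suggestion
    return [
        {**s, "needs_review": s["confidence"] < review_threshold}
        for s in best.values()
    ]
```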

### 5.3 User Experience Requirements

**Clarity:**
- User understands analysis process and time required
- Gap analysis clearly shows what's missing
- Recommendations are actionable
- Error messages explain what went wrong

**Simplicity:**
- Workflow is 5 steps (analyze → review → confirm → auto-tag → apply)
- One button to trigger analysis
- Clear next steps after each stage

**Feedback:**
- Real-time progress updates during analysis
- Success/error notifications
- Ability to view raw analysis results
- Audit trail of approvals

---

## 6. Claude Code Instructions

### 6.1 Skill Development

**Skill Name:** `igny8-case1-analysis`
**Version:** 2.0
**Prerequisites:** IGNY8 platform deployed, WordPress plugin v2.0+, Celery configured

**Skill Workflow:**

```yaml
Trigger: User connects existing WordPress site to IGNY8

Step 1:
  Name: Collect Site Data
  Call: POST /api/v1/sag/sites/{site_id}/analyze/
  Wait: Poll /api/v1/sag/sites/{site_id}/analysis-status/ every 10 seconds
  Timeout: 5 minutes
  Output: task_id for tracking

Step 2:
  Name: Retrieve Analysis Results
  Call: GET /api/v1/sag/sites/{site_id}/analysis-results/
  Parse: extracted_attributes, gap_analysis
  Display: DiscoveredAttributesReview panel
  User action: Approve/reject attributes

Step 3:
  Name: Confirm Analysis
  Call: POST /api/v1/sag/sites/{site_id}/confirm-analysis/
  Payload: approved_attributes from user review
  Output: blueprint_id
  Display: Gap analysis report
  Next: Show auto-tagging recommendations

Step 4:
  Name: Generate Auto-Tag Suggestions
  Call: GET /api/v1/sag/sites/{site_id}/auto-tag/suggestions/?blueprint_id={blueprint_id}
  Display: AutoTagReviewPanel
  User action: Select products to tag

Step 5:
  Name: Apply Auto-Tags
  Call: POST /api/v1/sag/sites/{site_id}/auto-tag/apply/
  Wait: Poll /api/v1/sag/sites/{site_id}/auto-tag/status/ every 5 seconds
  Timeout: 10 minutes
  Output: Number of tags applied, products tagged

Step 6:
  Name: Complete & Next Steps
  Display: Success message
  Recommendations: Run cluster formation (01C), start content pipeline (01E)
  Links: View blueprint, view gap report, start cluster creation
```
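
The poll-with-timeout behavior in Steps 1 and 5 can be sketched as a small helper. This is a sketch under stated assumptions: the terminal status values (`completed`, `failed`) and the `status` key are hypothetical, and `fetch_status` stands in for whatever HTTP call retrieves the status endpoint.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], dict],
    interval_seconds: float,
    timeout_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call fetch_status repeatedly until a terminal status or the timeout."""
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status()
        # Assumed terminal states; adjust to the real status payload.
        if status.get("status") in {"completed", "failed"}:
            return status
        # Stop if the next poll would land past the deadline.
        if clock() + interval_seconds > deadline:
            raise TimeoutError(f"no terminal status within {timeout_seconds} seconds")
        sleep(interval_seconds)
```

Under these assumptions, Step 1 would call the helper with `interval_seconds=10, timeout_seconds=300` and Step 5 with `interval_seconds=5, timeout_seconds=600`. Injecting `sleep` and `clock` keeps the helper testable without real delays.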

### 6.2 Development Checklist

**Code Quality:**
- [ ] All functions have docstrings
- [ ] Type hints on all function parameters and returns
- [ ] Logging at DEBUG, INFO, WARNING levels as appropriate
- [ ] Error handling with specific exception types
- [ ] No hardcoded values (use config/env vars)

**Testing:**
- [ ] Unit tests for each service (>80% coverage)
- [ ] Integration tests for API endpoints
- [ ] Fixtures for sample site data
- [ ] Mock LLM responses for deterministic tests
- [ ] Performance tests for analysis (time and memory)

**Documentation:**
- [ ] Docstrings follow Google style
- [ ] README with setup instructions
- [ ] API documentation in OpenAPI format
- [ ] Example requests/responses for each endpoint
- [ ] Troubleshooting guide for common errors

**Security:**
- [ ] API token validation on all endpoints
- [ ] User ownership checks before accessing site data
- [ ] Input validation with Marshmallow
- [ ] SQL injection prevention (use ORM)
- [ ] No credentials in logs or errors

**Performance:**
- [ ] Database queries indexed
- [ ] Caching implemented for plugin endpoint
- [ ] Celery task optimization
- [ ] LLM API call batching
- [ ] Frontend component lazy loading

### 6.3 Debugging & Troubleshooting

**Common Issues:**

**Issue:** Analysis hangs or times out
- Check: Celery worker status (`celery -A sag inspect active`)
- Check: Redis/message queue status
- Check: LLM API rate limits
- Solution: Reduce product limit, retry analysis

**Issue:** Plugin endpoint returns partial data
- Check: Specific collector failure (check logs)
- Solution: Fix the failing collector, then re-run analysis with cache bypass so stale data is not reused
- Note: Partial data is returned if one collector fails

**Issue:** Auto-tagging misses products
- Check: Product title/description quality (missing keywords)
- Check: Confidence threshold (lower if needed)
- Solution: Review low-confidence suggestions, adjust threshold

**Issue:** Gap analysis shows 100% gaps
- Check: Blueprint created correctly
- Check: Gap analysis query (verify site_id matches)
- Solution: Re-run analysis, confirm blueprint

### 6.4 Integration Checkpoints

**Integration with 01A (SAGBlueprint):**
- Confirmed analysis creates SAGBlueprint via POST /api/v1/sag/sites/{site_id}/confirm-analysis/
- Blueprint includes extracted attributes and values
- Blueprint links to analysis for audit trail
- Blueprint ready for cluster formation (01C)

**Integration with 01B (Sector Templates):**
- Attribute extraction uses sector template for validation (optional parameter)
- Alignment scores show how closely discovered attributes match template
- Low-confidence discoveries flagged if they don't align with template
- Template selection based on site category detection
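
One simple way to picture an alignment score is the share of discovered attribute names that also appear in the sector template. The sketch below assumes exactly that definition and a case-insensitive name comparison; the scoring actually used with 01B templates may be defined differently.

```python
def alignment_score(discovered: list[str], template: list[str]) -> float:
    """Share of discovered attribute names that also appear in the template."""
    if not discovered:
        return 0.0
    template_names = {name.strip().lower() for name in template}
    matched = sum(1 for name in discovered if name.strip().lower() in template_names)
    return matched / len(discovered)
```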

**Integration with 01C (Cluster Formation):**
- Blueprint created from Case 1 analysis feeds into cluster formation
- Attributes and values used to create cluster hierarchies
- Cluster formation references blueprint_id for traceability
- Users can override generated clusters if needed

**Integration with 01E (Content Pipeline):**
- Blueprint creation triggers content pipeline pre-planning
- Gap analysis informs content prioritization
- Hub page templates created for missing clusters
- Blog post outlines generated for content gaps

**Integration with 01G (Health Monitoring):**
- Analysis metrics stored for health dashboard
- Gap analysis metrics tracked over time
- Product attribute coverage tracked
- Auto-tagging success rate monitored

---

## 7. Related Documents

- **01A:** SAGBlueprint Definition — Output of Case 1 analysis
- **01B:** Sector Templates — Used for attribute validation
- **01C:** Cluster Formation — Consumes SAGBlueprint from Case 1
- **01D:** Case 2 Wizard — Alternative path for new sites
- **01E:** Content Pipeline — Consumes blueprint and gap analysis
- **01G:** Health Monitoring — Tracks analysis and enrichment metrics

---

## 8. Glossary

- **SAG:** Semantic Attribute Grid — the structured product attribute framework
- **Attribute:** A dimension of product information (e.g., "Target Area," "Device Type")
- **Attribute Value:** A specific instance of an attribute (e.g., "Foot" for Target Area)
- **Cluster:** A group of related attribute values forming a content hub
- **Gap:** Missing element compared to SAG blueprint (hub pages, term pages, blog posts, etc.)
- **Confidence Score:** AI's confidence in discovered attribute (0.0-1.0)
- **Dimension:** Priority level of attribute (Primary, Secondary, Tertiary)
- **Term Landing Page:** Single page optimized for a specific attribute value
- **Hub Page:** Authority page for entire attribute cluster
- **Auto-Tagging:** Bulk assignment of attributes to products

---

**Document Status:** Ready for Development
**Last Review:** 2026-03-23
**Next Review:** Post-Phase 2 Development