# IGNY8 Phase 2: Internal Linker (02D)
## SAG-Based Internal Linking Engine

**Document Version:** 1.0
**Date:** 2026-03-23
**Phase:** IGNY8 Phase 2: Feature Expansion
**Status:** Build Ready
**Source of Truth:** Codebase at `/data/app/igny8/`
**Audience:** Claude Code, Backend Developers, Architects

---

## 1. CURRENT STATE

### Internal Linking Today

There is **no** internal linking system in IGNY8. Content is generated and published without any cross-linking strategy. The only links within content are those the AI incidentally includes during generation.

### What Exists

- `Content` model (app_label=`writer`, db_table=`igny8_content`): stores `content_html`, where links would be inserted
- `SAGCluster` and `SAGBlueprint` models (from 01A): provide the cluster hierarchy for link topology
- The 7-stage automation pipeline (01E) generates and publishes content but has no linking stage between generation and publish
- `SiteIntegration` model (app_label=`integration`) tracks WordPress connections

### What Does Not Exist

- No SAGLink model, no LinkMap model, no SAGLinkAudit model
- No link scoring algorithm
- No anchor text management
- No link density enforcement
- No link insertion into content_html
- No orphan page detection
- No link health monitoring
- No link audit system

### Foundation Available

- `SAGBlueprint` (01A): defines the SAG hierarchy (site → sectors → clusters → content)
- `SAGCluster` (01A): cluster_type, hub_page_type, hub_page_structure
- `SAGAttribute` (01A): attribute values shared across clusters (basis for cross-cluster linking)
- 01E pipeline: post-generation hook point available between Stage 4 (Content) and Stage 7 (Publish)
- `Content.content_type` and `Content.content_structure`: determine link density rules
- 02B `ContentTaxonomy` with cluster mapping: taxonomy-to-cluster relationships for taxonomy contextual links

---

## 2. WHAT TO BUILD

### Overview

Build a SAG-aware internal linking engine that automatically plans, scores, and inserts internal links into content. The system operates in two modes: new content mode (pipeline integration) and existing content remediation (audit + fix).

### 2.1 Seven Link Types

| # | Link Type | Direction | Description | Limit | Placement |
|---|-----------|-----------|-------------|-------|-----------|
| 1 | **Vertical Upward** | Supporting → Hub | MANDATORY: every supporting article links to its cluster hub | 1 per article | First 2 paragraphs |
| 2 | **Vertical Downward** | Hub → Supporting | Hub lists ALL its supporting articles | No cap | "Related Articles" section + contextual body links |
| 3 | **Horizontal Sibling** | Supporting ↔ Supporting | Same-cluster articles linking to each other | Max 2 per article | Natural content overlap points |
| 4 | **Cross-Cluster** | Hub ↔ Hub | Hubs sharing a SAGAttribute value can cross-link | Max 2 per hub | Contextual body links |
| 5 | **Taxonomy Contextual** | Term Page → Hubs | Term pages link to ALL cluster hubs using that attribute | No cap | Auto-generated from 02B taxonomy-cluster mapping |
| 6 | **Breadcrumb** | Hierarchical | Home → Sector → [Attribute] → Hub → Current Page | 1 chain per page | Top of page (auto-generated from SAG hierarchy) |
| 7 | **Related Content** | Cross-cluster allowed | 2-3 links in "Related Reading" section at end of article | 2-3 per article | End of article section |

**Link Density Rules (outbound links per page, by page type and word count):**

| Page Type | <1000 words | 1000-2000 words | 2000+ words |
|-----------|------------|-----------------|-------------|
| Hub (`cluster_hub`) | 5-10 | 10-15 | 15-20 |
| Blog (article/guide/etc.) | 2-5 | 3-8 | 4-12 |
| Product/Service | 2-3 | 3-5 | 3-5 |
| Term Page (taxonomy) | 3+ | 3+ | unlimited |

### 2.2 Link Scoring Algorithm (5 Factors)

Each candidate link target receives a score (0-100):

| Factor | Weight | Description |
|--------|--------|-------------|
| Shared attribute values | 40% | Count of SAGAttribute values shared between source and target clusters |
| Target page authority | 25% | Inbound link count of target page (from LinkMap) |
| Keyword overlap | 20% | Common keywords between source cluster and target content |
| Content recency | 10% | Newer content gets a boost (exponential decay over 6 months) |
| Link count gap | 5% | Pages with fewest inbound links get a priority boost |

**Threshold:** Score ≥ 60 qualifies for automatic linking. Scores 40-59 are suggested for manual review.

### 2.3 Anchor Text Rules

| Rule | Value |
|------|-------|
| Min length | 2 words |
| Max length | 8 words |
| Grammatically natural | Must read naturally in surrounding sentence |
| No exact-match overuse | Same exact anchor cannot be used >3 times to same target URL |
| Anchor distribution per target | Primary keyword 60%, page title 30%, natural phrase 10% |
| Diversification audit | Flag if any single anchor accounts for >40% of links to a target |

**Anchor Types:**

- `primary_keyword`: cluster primary keyword
- `page_title`: target content's title (or shortened version)
- `natural`: AI-selected contextually appropriate phrase
- `branded`: brand/site name (for homepage links)

### 2.4 Two Operating Modes

#### A. New Content Mode (Pipeline Integration)

Runs after Stage 4 (content generated), before Stage 7 (publish):

1. Content generated by pipeline → link planning triggers
2. Calculate link targets using scoring algorithm
3. Insert links into `content_html` at natural positions
4. Store link plan in SAGLink records
5. If content is a hub → auto-generate "Related Articles" section with links to all supporting articles in cluster
6. **Mandatory check:** if content is a supporting article, verify vertical_up link to hub exists; insert if missing

#### B. Existing Content Remediation (Audit + Fix)

For already-published content without proper internal linking:

1. **Crawl phase:** Scan all published content for a site, extract all `<a>` tags, build LinkMap
2. **Audit analysis:**
   - Orphan pages: 0 inbound internal links
   - Over-linked pages: outbound > density max for page type/word count
   - Under-linked pages: outbound < density min
   - Missing mandatory links: supporting articles without hub uplink
   - Broken links: target URL returns 4xx/5xx
3. **Recommendation generation:** Priority-scored fix recommendations with AI-suggested anchor text
4. **Batch application:** Insert missing links across multiple content records

### 2.5 Cluster-Level Link Health Score

Per-cluster health score (0-100) for link coverage:

| Factor | Points |
|--------|--------|
| Hub published and linked (has outbound + inbound links) | 25 |
| All supporting articles have mandatory uplink to hub | 25 |
| At least 1 cross-cluster link from hub | 15 |
| Term pages link to hub | 15 |
| No broken links in cluster | 10 |
| Link density within range for all pages | 10 |

Site-wide link health = average of all cluster scores. Feeds into SAG health monitoring (01G).

---

## 3. DATA MODELS & APIS

### 3.1 New Models

#### SAGLink (new `linker` app)

```python
from django.db import models


class SAGLink(SiteSectorBaseModel):
    """
    Represents a planned or inserted internal link between two content pages.
    Tracks link type, anchor text, score, and status through lifecycle.
    """
    blueprint = models.ForeignKey(
        'planner.SAGBlueprint',
        on_delete=models.SET_NULL,
        null=True,
        blank=True,
        related_name='sag_links'
    )
    source_content = models.ForeignKey(
        'writer.Content',
        on_delete=models.CASCADE,
        related_name='outbound_sag_links'
    )
    target_content = models.ForeignKey(
        'writer.Content',
        on_delete=models.CASCADE,
        related_name='inbound_sag_links'
    )
    link_type = models.CharField(
        max_length=20,
        choices=[
            ('vertical_up', 'Vertical Upward'),
            ('vertical_down', 'Vertical Downward'),
            ('horizontal', 'Horizontal Sibling'),
            ('cross_cluster', 'Cross-Cluster'),
            ('taxonomy', 'Taxonomy Contextual'),
            ('breadcrumb', 'Breadcrumb'),
            ('related', 'Related Content'),
        ]
    )
    anchor_text = models.CharField(max_length=200)
    anchor_type = models.CharField(
        max_length=20,
        choices=[
            ('primary_keyword', 'Primary Keyword'),
            ('page_title', 'Page Title'),
            ('natural', 'Natural Phrase'),
            ('branded', 'Branded'),
        ]
    )
    placement_zone = models.CharField(
        max_length=20,
        choices=[
            ('in_body', 'In Body'),
            ('related_section', 'Related Section'),
            ('breadcrumb', 'Breadcrumb'),
            ('sidebar', 'Sidebar'),
        ]
    )
    placement_position = models.IntegerField(
        null=True,
        blank=True,
        help_text='Paragraph number for in_body placement'
    )
    score = models.FloatField(
        default=0,
        help_text='Link scoring algorithm result (0-100)'
    )
    status = models.CharField(
        max_length=15,
        choices=[
            ('planned', 'Planned'),
            ('inserted', 'Inserted'),
            ('verified', 'Verified'),
            ('broken', 'Broken'),
            ('removed', 'Removed'),
        ],
        default='planned'
    )
    is_mandatory = models.BooleanField(
        default=False,
        help_text='True for vertical_up links (supporting → hub)'
    )
    inserted_at = models.DateTimeField(null=True, blank=True)

    class Meta:
        app_label = 'linker'
        db_table = 'igny8_sag_links'
```

**PK:** BigAutoField (integer), inherited from SiteSectorBaseModel

#### SAGLinkAudit (linker app)

```python
class SAGLinkAudit(SiteSectorBaseModel):
    """
    Stores results of a site-wide or cluster-level link audit.
    """
    blueprint = models.ForeignKey(
        'planner.SAGBlueprint',
        on_delete=models.SET_NULL,
        null=True,
        blank=True,
        related_name='link_audits'
    )
    audit_date = models.DateTimeField(auto_now_add=True)
    total_links = models.IntegerField(default=0)
    missing_mandatory = models.IntegerField(default=0)
    orphan_pages = models.IntegerField(default=0)
    broken_links = models.IntegerField(default=0)
    over_linked_pages = models.IntegerField(default=0)
    under_linked_pages = models.IntegerField(default=0)
    cluster_scores = models.JSONField(
        default=dict,
        help_text='{cluster_id: {score, missing, issues[]}}'
    )
    recommendations = models.JSONField(
        default=list,
        help_text='[{content_id, action, link_type, target_id, anchor_suggestion, priority}]'
    )
    overall_health_score = models.FloatField(
        default=0,
        help_text='Average of cluster scores (0-100)'
    )

    class Meta:
        app_label = 'linker'
        db_table = 'igny8_sag_link_audits'
```

**PK:** BigAutoField (integer), inherited from SiteSectorBaseModel

#### LinkMap (linker app)

```python
class LinkMap(SiteSectorBaseModel):
    """
    Full link map of all internal (and external) links found in published content.
    Built by crawling content_html of all published content records.
    """
    source_url = models.URLField()
    source_content = models.ForeignKey(
        'writer.Content',
        on_delete=models.SET_NULL,
        null=True,
        blank=True,
        related_name='outbound_link_map'
    )
    target_url = models.URLField()
    target_content = models.ForeignKey(
        'writer.Content',
        on_delete=models.SET_NULL,
        null=True,
        blank=True,
        related_name='inbound_link_map'
    )
    anchor_text = models.CharField(max_length=500)
    is_internal = models.BooleanField(default=True)
    is_follow = models.BooleanField(default=True)
    position = models.CharField(
        max_length=20,
        choices=[
            ('in_content', 'In Content'),
            ('navigation', 'Navigation'),
            ('footer', 'Footer'),
            ('sidebar', 'Sidebar'),
        ],
        default='in_content'
    )
    last_verified = models.DateTimeField(null=True, blank=True)
    status = models.CharField(
        max_length=15,
        choices=[
            ('active', 'Active'),
            ('broken', 'Broken'),
            ('removed', 'Removed'),
        ],
        default='active'
    )

    class Meta:
        app_label = 'linker'
        db_table = 'igny8_link_map'
```

**PK:** BigAutoField (integer), inherited from SiteSectorBaseModel

### 3.2 Modified Models

#### Content (writer app): add 4 fields

```python
# Add to Content model:
link_plan = models.JSONField(
    null=True,
    blank=True,
    help_text='Planned links before insertion: [{target_id, link_type, anchor, score}]'
)
links_inserted = models.BooleanField(
    default=False,
    help_text='Whether link plan has been applied to content_html'
)
inbound_link_count = models.IntegerField(
    default=0,
    help_text='Cached count of inbound internal links'
)
outbound_link_count = models.IntegerField(
    default=0,
    help_text='Cached count of outbound internal links'
)
```

### 3.3 New App Registration

Create linker app:

- **App config:** `igny8_core/modules/linker/apps.py` with `app_label = 'linker'`
- **Add to INSTALLED_APPS** in `igny8_core/settings.py`

### 3.4 Migration

```
igny8_core/migrations/XXXX_add_linker_models.py
```

**Operations:**

1. `CreateModel('SAGLink', ...)`, with indexes on source_content, target_content, link_type, status
2. `CreateModel('SAGLinkAudit', ...)`
3. `CreateModel('LinkMap', ...)`, with indexes on source_url, target_url
4. `AddField('Content', 'link_plan', JSONField(null=True, blank=True))`
5. `AddField('Content', 'links_inserted', BooleanField(default=False))`
6. `AddField('Content', 'inbound_link_count', IntegerField(default=0))`
7. `AddField('Content', 'outbound_link_count', IntegerField(default=0))`

### 3.5 API Endpoints

All endpoints under `/api/v1/linker/`:

#### Link Management

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/linker/links/?site_id=X` | List all SAGLink records with filters (link_type, status, cluster_id, source_content_id) |
| POST | `/api/v1/linker/links/plan/` | Generate link plan for a content piece. Body: `{content_id}`. Returns planned SAGLink records. |
| POST | `/api/v1/linker/links/insert/` | Insert planned links into content_html. Body: `{content_id}`. Modifies Content.content_html. |
| POST | `/api/v1/linker/links/batch-insert/` | Batch insert for multiple content records. Body: `{content_ids: [int]}`. Queues Celery task. |
| GET | `/api/v1/linker/content/{id}/links/` | All inbound + outbound links for a specific content piece. |

#### Link Audit

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/linker/audit/?site_id=X` | Latest SAGLinkAudit results. |
| POST | `/api/v1/linker/audit/run/` | Trigger site-wide link audit. Body: `{site_id}`. Queues Celery task. Returns task ID. |
| GET | `/api/v1/linker/audit/recommendations/?site_id=X` | Get fix recommendations from latest audit. |
| POST | `/api/v1/linker/audit/apply/` | Apply recommended fixes in batch. Body: `{site_id, recommendation_ids: [int]}`. |

#### Link Map & Health

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/linker/link-map/?site_id=X` | Full LinkMap for site with pagination. |
| GET | `/api/v1/linker/orphans/?site_id=X` | List orphan pages (0 inbound internal links). |
| GET | `/api/v1/linker/health/?site_id=X` | Cluster-level link health scores. |

**Permissions:** All endpoints use `SiteSectorModelViewSet` permission patterns.

### 3.6 Link Planning Service

**Location:** `igny8_core/business/link_planning.py`

```python
class LinkPlanningService:
    """
    Generates internal link plans for content based on SAG hierarchy
    and scoring algorithm.
    """

    SCORE_WEIGHTS = {
        'shared_attributes': 0.40,
        'target_authority': 0.25,
        'keyword_overlap': 0.20,
        'content_recency': 0.10,
        'link_count_gap': 0.05,
    }

    AUTO_LINK_THRESHOLD = 60
    REVIEW_THRESHOLD = 40

    def plan(self, content_id):
        """
        Generate link plan for a content piece.
        1. Identify content's cluster and role (hub vs supporting)
        2. Determine mandatory links (vertical_up for supporting)
        3. Score all candidate targets
        4. Select targets within density limits
        5. Generate anchor text per link
        6. Create SAGLink records with status='planned'
        Returns list of planned SAGLink records.
        """
        pass

    def _get_mandatory_links(self, content, cluster):
        """Vertical upward: supporting → hub. Always added."""
        pass

    def _get_candidates(self, content, cluster, blueprint):
        """Gather all potential link targets from cluster and related clusters."""
        pass

    def _score_candidate(self, source_content, target_content,
                         source_cluster, target_cluster, blueprint):
        """Calculate 0-100 score using 5-factor algorithm."""
        pass

    def _select_within_density(self, content, scored_candidates):
        """Filter candidates to stay within density limits for page type and word count."""
        pass

    def _generate_anchor_text(self, source_content, target_content, link_type):
        """AI-generate contextually appropriate anchor text."""
        pass
```

### 3.7 Link Insertion Service

**Location:** `igny8_core/business/link_insertion.py`

```python
class LinkInsertionService:
    """
    Inserts planned links into content_html.
    Handles placement, anchor text insertion, and collision avoidance.
    """

    def insert(self, content_id):
        """
        Insert all planned SAGLink records into Content.content_html.
        1. Load all SAGLinks where source_content=content_id, status='planned'
        2. Parse content_html
        3. For each link, find insertion point based on placement_zone + position
        4. Insert <a> tag with anchor text
        5. Update SAGLink status='inserted', set inserted_at
        6. Update Content.content_html, links_inserted=True, outbound_link_count
        7. Update target Content.inbound_link_count
        """
        pass

    def _find_insertion_point(self, html_tree, link):
        """
        Find best insertion point in parsed HTML:
        - in_body: find paragraph at placement_position, find natural spot for anchor
        - related_section: append to "Related Articles" section (create if missing)
        - breadcrumb: insert breadcrumb trail at top
        """
        pass

    def _insert_link(self, html_tree, position, anchor_text, target_url):
        """Insert <a> tag at position without breaking existing HTML."""
        pass
```

### 3.8 Link Audit Service

**Location:** `igny8_core/business/link_audit.py`

```python
class LinkAuditService:
    """
    Runs site-wide link audits: builds link map, identifies issues,
    generates recommendations.
    """

    def run_audit(self, site_id):
        """
        Full audit:
        1. Crawl all published Content for site
        2. Extract all <a> tags, build/update LinkMap records
        3. Identify orphan pages, over/under-linked, missing mandatory, broken
        4. Calculate per-cluster health scores
        5. Generate prioritized recommendations
        6. Create SAGLinkAudit record
        Returns SAGLinkAudit instance.
        """
        pass

    def _build_link_map(self, site_id):
        """Extract links from all published content_html, create LinkMap records."""
        pass

    def _find_orphans(self, site_id):
        """Content with 0 inbound internal links."""
        pass

    def _check_density(self, site_id):
        """Compare outbound counts against density rules per page type."""
        pass

    def _check_mandatory(self, site_id):
        """Verify all supporting articles have vertical_up link to their hub."""
        pass

    def _calculate_cluster_health(self, site_id, cluster):
        """Calculate 0-100 health score per cluster."""
        pass

    def _generate_recommendations(self, issues):
        """Priority-scored recommendations with AI-suggested anchor text."""
        pass
```

### 3.9 Celery Tasks

**Location:** `igny8_core/tasks/linker_tasks.py`

```python
from celery import shared_task


@shared_task(name='generate_link_plan')
def generate_link_plan(content_id):
    """Runs after content generation, before publish. Creates SAGLink records."""
    pass


@shared_task(name='run_link_audit')
def run_link_audit(site_id):
    """Scheduled weekly or triggered manually. Full site-wide audit."""
    pass


@shared_task(name='verify_links')
def verify_links(site_id):
    """Check for broken links via HTTP status checks on LinkMap URLs."""
    pass


@shared_task(name='rebuild_link_map')
def rebuild_link_map(site_id):
    """Full crawl of published content to rebuild LinkMap from scratch."""
    pass
```

**Beat Schedule Additions:**

| Task | Schedule | Notes |
|------|----------|-------|
| `run_link_audit` | Weekly (Sunday 1:00 AM) | Site-wide audit for all active sites |
| `verify_links` | Weekly (Wednesday 2:00 AM) | HTTP check all active LinkMap entries |

---

## 4. IMPLEMENTATION STEPS

### Step 1: Create Linker App

1. Create `igny8_core/modules/linker/` directory with `__init__.py` and `apps.py`
2. Add `linker` to `INSTALLED_APPS` in settings.py
3. Create models: SAGLink, SAGLinkAudit, LinkMap

### Step 2: Migration

1. Create migration for 3 new models
2. Add 4 new fields to Content model (link_plan, links_inserted, inbound_link_count, outbound_link_count)
3. Run migration

### Step 3: Services

1. Implement `LinkPlanningService` in `igny8_core/business/link_planning.py`
2. Implement `LinkInsertionService` in `igny8_core/business/link_insertion.py`
3. Implement `LinkAuditService` in `igny8_core/business/link_audit.py`

### Step 4: Pipeline Integration

Insert link planning + insertion between Stage 4 and Stage 7:

```python
from igny8_core.business.link_planning import LinkPlanningService
from igny8_core.business.link_insertion import LinkInsertionService


# After content generation completes in pipeline:
def post_content_generation(content_id):
    # 02G: Generate schema + SERP elements
    # ...

    # 02D: Plan and insert internal links
    link_service = LinkPlanningService()
    link_service.plan(content_id)

    insertion_service = LinkInsertionService()
    insertion_service.insert(content_id)
```

### Step 5: API Endpoints

1. Create `igny8_core/urls/linker.py` with link, audit, and health endpoints
2. Create views extending `SiteSectorModelViewSet`
3. Register URL patterns under `/api/v1/linker/`

### Step 6: Celery Tasks

1. Implement all 4 tasks in `igny8_core/tasks/linker_tasks.py`
2. Add `run_link_audit` and `verify_links` to Celery beat schedule

### Step 7: Serializers & Admin

1. Create DRF serializers for SAGLink, SAGLinkAudit, LinkMap
2. Register models in Django admin

### Step 8: Credit Cost Configuration

Add to `CreditCostConfig` (billing app):

| operation_type | default_cost | description |
|---------------|-------------|-------------|
| `link_audit` | 1 | Site-wide link audit |
| `link_generation` | 0.5 | Generate 1-5 links with AI anchor text |
| `link_audit_full` | 3-5 | Full site audit with recommendations |

---

## 5. ACCEPTANCE CRITERIA

### Link Types

- [ ] Vertical upward link (supporting → hub) automatically inserted for all supporting articles
- [ ] Vertical downward links (hub → supporting) generated with "Related Articles" section
- [ ] Horizontal sibling links (max 2) between same-cluster supporting articles
- [ ] Cross-cluster links (max 2) between hubs sharing SAGAttribute values
- [ ] Taxonomy contextual links from term pages to all relevant cluster hubs
- [ ] Breadcrumb chain generated from SAG hierarchy for all content
- [ ] Related content section (2-3 links) generated at end of article

### Link Scoring

- [ ] 5-factor scoring algorithm produces 0-100 scores
- [ ] Links with score ≥ 60 auto-inserted
- [ ] Links with score 40-59 suggested for manual review
- [ ] Score algorithm uses: shared attributes (40%), authority (25%), keyword overlap (20%), recency (10%), gap boost (5%)

### Anchor Text

- [ ] Anchor text 2-8 words, grammatically natural
- [ ] Same exact anchor not used >3 times to same target
- [ ] Distribution per target: 60% primary keyword, 30% page title, 10% natural
- [ ] Diversification audit flags if any anchor >40% of links to a target

### Link Density

- [ ] Hub pages: 5-20 outbound links based on word count
- [ ] Blog pages: 2-12 outbound links based on word count
- [ ] Product/Service pages: 2-5 outbound links
- [ ] Term pages: 3+ outbound, unlimited for taxonomy contextual

### Audit & Remediation

- [ ] Link audit identifies orphan pages, over/under-linked, missing mandatory, broken links
- [ ] Cluster-level health score (0-100) calculated per cluster
- [ ] Recommendations generated with priority scores and AI-suggested anchors
- [ ] Batch application of recommendations modifies content_html correctly

### Pipeline Integration

- [ ] Link plan generated automatically after content generation in pipeline
- [ ] Links inserted before publish stage
- [ ] Mandatory vertical_up link verified before allowing publish
- [ ] Content.inbound_link_count and outbound_link_count updated on insert

---

## 6. CLAUDE CODE INSTRUCTIONS

### File Locations

```
igny8_core/
├── modules/
│   └── linker/
│       ├── __init__.py
│       ├── apps.py                  # app_label = 'linker'
│       └── models.py                # SAGLink, SAGLinkAudit, LinkMap
├── business/
│   ├── link_planning.py             # LinkPlanningService
│   ├── link_insertion.py            # LinkInsertionService
│   └── link_audit.py                # LinkAuditService
├── tasks/
│   └── linker_tasks.py              # Celery tasks
├── urls/
│   └── linker.py                    # Linker endpoints
└── migrations/
    └── XXXX_add_linker_models.py
```

### Conventions

- **PKs:** BigAutoField (integer); do NOT use UUIDs
- **Table prefix:** `igny8_` on all new tables
- **App label:** `linker` (new app)
- **Celery app name:** `igny8_core`
- **URL pattern:** `/api/v1/linker/...`
- **Permissions:** Use `SiteSectorModelViewSet` permission pattern
- **Model inheritance:** SAGLink, SAGLinkAudit, and LinkMap all extend `SiteSectorBaseModel`
- **Frontend:** `.tsx` files with Zustand stores for state management

### Cross-References

| Doc | Relationship |
|-----|-------------|
| **01A** | SAGBlueprint/SAGCluster/SAGAttribute provide hierarchy and cross-cluster relationships |
| **01E** | Pipeline integration: link planning hooks after Stage 4, before Stage 7 |
| **01G** | SAG health monitoring incorporates cluster link health scores |
| **02B** | ContentTaxonomy cluster mapping enables taxonomy contextual links |
| **02E** | External backlinks complement internal links; authority distributed by internal links |
| **02F** | Optimizer identifies internal link opportunities and feeds to linker |
| **03A** | WP plugin standalone mode has its own internal linking module, separate from this |
| **03C** | Theme renders breadcrumbs and related content sections generated by linker |

### Key Decisions

1. **New `linker` app:** Separate app because linking is a distinct domain with its own models, not tightly coupled to writer or planner
2. **SAGLink stores planned AND inserted:** Single model tracks the full lifecycle from planning through insertion to verification
3. **LinkMap is separate from SAGLink:** LinkMap stores the actual crawled link state (including non-SAG links); SAGLink stores the planned/managed links
4. **Cached counts on Content:** `inbound_link_count` and `outbound_link_count` are denormalized for fast queries; updated on insert/removal
5. **HTML parsing for insertion:** Use a Python HTML parser (BeautifulSoup or lxml) for safe link insertion without corrupting content_html
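
### Illustrative Sketch: 5-Factor Scoring

The scoring algorithm from Section 2.2 can be sketched as follows. This is a minimal sketch, not the implementation: it assumes each factor has already been normalized to the range 0.0-1.0, which the spec does not define. The `CandidateSignals` container, the `recency_factor` helper, and the 90-day half-life are hypothetical; the spec only states "exponential decay over 6 months". The weights and thresholds are the ones listed in Sections 2.2 and 3.6.

```python
from dataclasses import dataclass

# Weights and thresholds as given in Sections 2.2 and 3.6.
SCORE_WEIGHTS = {
    "shared_attributes": 0.40,
    "target_authority": 0.25,
    "keyword_overlap": 0.20,
    "content_recency": 0.10,
    "link_count_gap": 0.05,
}
AUTO_LINK_THRESHOLD = 60
REVIEW_THRESHOLD = 40


@dataclass
class CandidateSignals:
    """Hypothetical container: every factor is assumed pre-normalized to 0.0-1.0."""
    shared_attributes: float
    target_authority: float
    keyword_overlap: float
    content_recency: float
    link_count_gap: float


def recency_factor(age_days: float, half_life_days: float = 90.0) -> float:
    """Exponential decay by content age. The half-life is an assumption."""
    return 0.5 ** (max(age_days, 0.0) / half_life_days)


def score_candidate(signals: CandidateSignals) -> float:
    """Weighted sum of the five factors, scaled to 0-100."""
    total = 0.0
    for name, weight in SCORE_WEIGHTS.items():
        value = min(max(getattr(signals, name), 0.0), 1.0)  # clamp to 0-1
        total += weight * value
    return round(total * 100, 2)


def classify(score: float) -> str:
    """Map a score to the action described in Section 2.2."""
    if score >= AUTO_LINK_THRESHOLD:
        return "auto"      # qualifies for automatic linking
    if score >= REVIEW_THRESHOLD:
        return "review"    # suggested for manual review
    return "skip"


if __name__ == "__main__":
    candidate = CandidateSignals(
        shared_attributes=1.0,
        target_authority=0.5,
        keyword_overlap=0.5,
        content_recency=recency_factor(age_days=90),
        link_count_gap=0.0,
    )
    score = score_candidate(candidate)
    print(score, classify(score))  # 67.5 auto
```

Keeping normalization separate from weighting means `_score_candidate` only has to decide how raw counts (shared attribute values, inbound links, common keywords) map onto 0.0-1.0; the weights and thresholds stay as plain constants that are easy to test.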