# Automation System Enhancement Plan
**Created:** January 17, 2026
**Updated:** January 17, 2026 (IMPLEMENTATION COMPLETE)
**Status:** ✅ ALL PHASES COMPLETE
**Priority:** 🔴 CRITICAL - Blocks Production Launch

---
## Implementation Progress
### ✅ PHASE 1: Bug Fixes (COMPLETE)
1. **Bug #1:** Cancel releases lock - [views.py](../../backend/igny8_core/business/automation/views.py)
2. **Bug #2:** Scheduled check includes 'paused' - [tasks.py](../../backend/igny8_core/business/automation/tasks.py)
3. **Bug #3:** Resume reacquires lock - [tasks.py](../../backend/igny8_core/business/automation/tasks.py)
4. **Bug #4:** Resume has pause/cancel checks - [tasks.py](../../backend/igny8_core/business/automation/tasks.py)
5. **Bug #5:** Pause logs to files - [views.py](../../backend/igny8_core/business/automation/views.py)
6. **Bug #6:** Resume exception releases lock - [tasks.py](../../backend/igny8_core/business/automation/tasks.py)
### ✅ PHASE 2: Per-Run Item Limits (COMPLETE)
- Added 8 new fields to `AutomationConfig` model:
  - `max_keywords_per_run`, `max_clusters_per_run`, `max_ideas_per_run`
  - `max_tasks_per_run`, `max_content_per_run`, `max_images_per_run`
  - `max_approvals_per_run`, `max_credits_per_run`
- Migration: [0014_automation_per_run_limits.py](../../backend/migrations/0014_automation_per_run_limits.py)
- Service: Updated `automation_service.py` with `_get_per_run_limit()`, `_apply_per_run_limit()`, `_check_credit_budget()`
- API: Updated config endpoints in views.py
### ✅ PHASE 3: Publishing Settings Overhaul (COMPLETE)
- Added scheduling modes: `time_slots`, `stagger`, `immediate`
- New fields: `scheduling_mode`, `stagger_start_time`, `stagger_end_time`, `stagger_interval_minutes`, `queue_limit`
- Migration: [0015_publishing_settings_overhaul.py](../../backend/migrations/0015_publishing_settings_overhaul.py)
- Scheduler: Updated `_calculate_available_slots()` with three mode handlers
### ✅ PHASE 4: Credit % Allocation per AI Function (COMPLETE)
- New model: `SiteAIBudgetAllocation` in billing/models.py
- Default allocations: 15% clustering, 10% ideas, 40% content, 5% prompts, 30% images
- Migration: [0016_site_ai_budget_allocation.py](../../backend/migrations/0016_site_ai_budget_allocation.py)
- API: New viewset at `/api/v1/billing/sites/{site_id}/ai-budget/`
### ✅ PHASE 5: UI Updates (COMPLETE)
- Updated `AutomationConfig` interface in `automationService.ts` with new per-run limit fields
- GlobalProgressBar already implements correct calculation using `initial_snapshot`
---
## Migrations To Run
```bash
cd /data/app/igny8/backend
python manage.py migrate
```
## Files Modified
### Backend
- `backend/igny8_core/business/automation/views.py` - Cancel releases lock, pause logs
- `backend/igny8_core/business/automation/tasks.py` - Resume fixes, scheduled check
- `backend/igny8_core/business/automation/models.py` - Per-run limit fields
- `backend/igny8_core/business/automation/services/automation_service.py` - Limit enforcement
- `backend/igny8_core/business/integration/models.py` - Publishing modes
- `backend/igny8_core/business/billing/models.py` - SiteAIBudgetAllocation
- `backend/igny8_core/modules/billing/views.py` - AI budget viewset
- `backend/igny8_core/modules/billing/urls.py` - AI budget route
- `backend/igny8_core/modules/integration/views.py` - Publishing serializer
- `backend/igny8_core/tasks/publishing_scheduler.py` - Scheduling modes
### Frontend
- `frontend/src/services/automationService.ts` - Config interface updated
### Migrations
- `backend/migrations/0014_automation_per_run_limits.py`
- `backend/migrations/0015_publishing_settings_overhaul.py`
- `backend/migrations/0016_site_ai_budget_allocation.py`
---
## Executive Summary
This plan covers four workstreams: one set of critical bug fixes and three enhancements.
1. **Fix Critical Automation Bugs** - Lock management, scheduled runs, logging
2. **Credit Budget Allocation** - Configurable % per AI function
3. **Publishing Schedule Overhaul** - Robust, predictable scheduling
4. **Per-Run Item Limits** - Control throughput per automation run
---
## Part 1: Critical Bug Fixes ✅ COMPLETE
### 🔴 BUG #1: Cancel Action Doesn't Release Lock
**Location:** `backend/igny8_core/business/automation/views.py` line ~1614
**Current Code:**
```python
def cancel_automation(self, request):
    run.status = 'cancelled'
    run.cancelled_at = timezone.now()
    run.completed_at = timezone.now()
    run.save(update_fields=['status', 'cancelled_at', 'completed_at'])
    # ❌ MISSING: cache.delete(f'automation_lock_{run.site.id}')
```
**Fix:**
```python
def cancel_automation(self, request):
    run.status = 'cancelled'
    run.cancelled_at = timezone.now()
    run.completed_at = timezone.now()
    run.save(update_fields=['status', 'cancelled_at', 'completed_at'])
    # Release the lock so user can start new automation
    from django.core.cache import cache
    cache.delete(f'automation_lock_{run.site.id}')
    # Log the cancellation
    from igny8_core.business.automation.services.automation_logger import AutomationLogger
    logger = AutomationLogger()
    logger.log_stage_progress(
        run.run_id, run.account.id, run.site.id, run.current_stage,
        "Automation cancelled by user"
    )
```
**Impact:** Users can immediately start a new automation after cancelling

---
### 🔴 BUG #2: Scheduled Automation Doesn't Check 'paused' Status
**Location:** `backend/igny8_core/business/automation/tasks.py` line ~52
**Current Code:**
```python
# Check if already running
if AutomationRun.objects.filter(site=config.site, status='running').exists():
    logger.info(f"[AutomationTask] Skipping site {config.site.id} - already running")
    continue
```
**Fix:**
```python
# Check if already running OR paused
if AutomationRun.objects.filter(site=config.site, status__in=['running', 'paused']).exists():
    logger.info(f"[AutomationTask] Skipping site {config.site.id} - automation in progress (running/paused)")
    continue
```
**Impact:** Prevents duplicate runs when one is paused

---
### 🔴 BUG #3: Resume Doesn't Reacquire Lock
**Location:** `backend/igny8_core/business/automation/tasks.py` line ~164
**Current Code:**
```python
def resume_automation_task(self, run_id: str):
    service = AutomationService.from_run_id(run_id)
    # ❌ No lock check - could run unprotected after 6hr expiry
```
**Fix:**
```python
def resume_automation_task(self, run_id: str):
    """Resume paused automation run from current stage"""
    logger.info(f"[AutomationTask] Resuming automation run: {run_id}")
    try:
        run = AutomationRun.objects.get(run_id=run_id)
        # Verify run is actually in 'running' status (set by views.resume)
        if run.status != 'running':
            logger.warning(f"[AutomationTask] Run {run_id} status is {run.status}, not 'running'. Aborting resume.")
            return
        # Reacquire lock in case it expired during long pause
        from django.core.cache import cache
        lock_key = f'automation_lock_{run.site.id}'
        # Try to acquire - if fails, another run may have started
        if not cache.add(lock_key, 'locked', timeout=21600):
            # Check if WE still own it (compare run_id if stored)
            existing = cache.get(lock_key)
            if existing and existing != 'locked':
                logger.warning(f"[AutomationTask] Lock held by different run. Aborting resume for {run_id}")
                run.status = 'failed'
                run.error_message = 'Lock acquired by another run during pause'
                run.save()
                return
            # Lock exists but may be ours - proceed cautiously
        service = AutomationService.from_run_id(run_id)
        # ... rest of processing with pause/cancel checks between stages
```
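The comment "compare run_id if stored" in the fix above only works if the lock value identifies its owner. The following is a minimal sketch of that idea, assuming the lock stores the `run_id` instead of the literal `'locked'`. `InMemoryCache`, `acquire_or_confirm_lock`, and `release_lock` are hypothetical names used for illustration; the stand-in cache only imitates the `add`/`get`/`delete` calls of Django's cache API and ignores expiry.

```python
class InMemoryCache:
    """Hypothetical stand-in for Django's cache (add/get/delete, no expiry)."""

    def __init__(self):
        self._data = {}

    def add(self, key, value, timeout=None):
        # Like cache.add: store only when the key is absent.
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        return self._data.pop(key, None) is not None


LOCK_TIMEOUT_SECONDS = 21600  # 6 hours, the timeout used in the plan


def acquire_or_confirm_lock(cache, site_id, run_id):
    """Return True if run_id holds the site lock after this call."""
    lock_key = f"automation_lock_{site_id}"
    if cache.add(lock_key, run_id, timeout=LOCK_TIMEOUT_SECONDS):
        return True  # lock was free (never taken or expired); we own it now
    return cache.get(lock_key) == run_id  # still ours from before the pause


def release_lock(cache, site_id, run_id):
    """Delete the lock only when run_id is the recorded owner."""
    lock_key = f"automation_lock_{site_id}"
    if cache.get(lock_key) == run_id:
        cache.delete(lock_key)
        return True
    return False
```

With an owner token, a resumed run can tell "the lock is mine" apart from "another run took it". The check-then-delete in `release_lock` is not atomic, so a production version would need a backend-level compare-and-delete or an equivalent guard.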
---
### 🔴 BUG #4: Resume Missing Pause/Cancel Checks Between Stages
**Location:** `backend/igny8_core/business/automation/tasks.py` line ~183
**Current Code:**
```python
for stage in range(run.current_stage - 1, 7):
    if stage_enabled[stage]:
        stage_methods[stage]()
        # ❌ No pause/cancel check after each stage
```
**Fix:**
```python
for stage in range(run.current_stage - 1, 7):
    if stage_enabled[stage]:
        stage_methods[stage]()
        # Check for pause/cancel AFTER each stage (same as run_automation_task)
        service.run.refresh_from_db()
        if service.run.status in ['paused', 'cancelled']:
            logger.info(f"[AutomationTask] Resumed automation {service.run.status} after stage {stage + 1}")
            return
    else:
        logger.info(f"[AutomationTask] Stage {stage + 1} is disabled, skipping")
```
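The control flow of this fix can be exercised without Django. The sketch below is an illustration only: `run_stages` and `get_status` are hypothetical names, with `get_status` standing in for `service.run.refresh_from_db()` followed by reading `service.run.status`.

```python
def run_stages(start_stage, stage_methods, stage_enabled, get_status):
    """Run stages from start_stage (1-based), stopping on pause or cancel.

    Returns (completed_stage_numbers, final_status).
    """
    completed = []
    for index in range(start_stage - 1, len(stage_methods)):
        if not stage_enabled[index]:
            continue  # disabled stages are skipped, as in the fix above
        stage_methods[index]()
        completed.append(index + 1)
        # Re-read the status after every stage so a pause or cancel
        # requested mid-stage takes effect before the next stage starts.
        status = get_status()
        if status in ("paused", "cancelled"):
            return completed, status
    return completed, "completed"
```

For example, resuming at stage 2 with stage 4 disabled and a pause arriving during stage 5 runs stages 2, 3, and 5 and then stops with status `paused`.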
---
### 🟡 BUG #5: Pause Missing File Log Entry
**Location:** `backend/igny8_core/business/automation/views.py` pause action
**Fix:** Add logging call:
```python
def pause(self, request):
    # ... existing code ...
    service.pause_automation()
    # Log to automation files
    service.logger.log_stage_progress(
        service.run.run_id, service.account.id, service.site.id,
        service.run.current_stage, "Automation paused by user"
    )
    return Response({'message': 'Automation paused'})
```
---
## Part 2: Credit Budget Allocation System
### Overview
Add configurable credit % allocation per AI function. Users can:
- Use global defaults (configured by admin)
- Override with site-specific allocations
### Database Changes
**Extend `CreditCostConfig` model:**
```python
class CreditCostConfig(models.Model):
    # ... existing fields ...
    # NEW: Budget allocation percentage
    budget_percentage = models.DecimalField(
        max_digits=5,
        decimal_places=2,
        default=0,
        validators=[MinValueValidator(0), MaxValueValidator(100)],
        help_text="Default % of credits allocated to this operation (0-100)"
    )
```
**New `SiteAIBudgetAllocation` model:**
```python
class SiteAIBudgetAllocation(AccountBaseModel):
    """Site-specific credit budget allocation overrides"""
    site = models.OneToOneField(
        'igny8_core_auth.Site',
        on_delete=models.CASCADE,
        related_name='ai_budget_allocation'
    )
    use_global_defaults = models.BooleanField(
        default=True,
        help_text="Use global CreditCostConfig percentages"
    )
    # Per-operation overrides (only used when use_global_defaults=False)
    clustering_percentage = models.DecimalField(max_digits=5, decimal_places=2, default=10)
    idea_generation_percentage = models.DecimalField(max_digits=5, decimal_places=2, default=10)
    content_generation_percentage = models.DecimalField(max_digits=5, decimal_places=2, default=40)
    image_prompt_extraction_percentage = models.DecimalField(max_digits=5, decimal_places=2, default=5)
    image_generation_percentage = models.DecimalField(max_digits=5, decimal_places=2, default=35)

    class Meta:
        db_table = 'igny8_site_ai_budget_allocations'
```
### Service Changes
**New `BudgetAllocationService`:**
```python
class BudgetAllocationService:
    @staticmethod
    def get_operation_budget(site, operation_type, total_credits):
        """
        Get credits allocated for an operation based on site settings.

        Args:
            site: Site instance
            operation_type: 'clustering', 'content_generation', etc.
            total_credits: Total credits available

        Returns:
            int: Credits allocated for this operation
        """
        allocation = SiteAIBudgetAllocation.objects.filter(site=site).first()
        if not allocation or allocation.use_global_defaults:
            # Use global CreditCostConfig percentages
            config = CreditCostConfig.objects.filter(
                operation_type=operation_type,
                is_active=True
            ).first()
            percentage = config.budget_percentage if config else 0
        else:
            # Use site-specific override
            field_map = {
                'clustering': 'clustering_percentage',
                'idea_generation': 'idea_generation_percentage',
                'content_generation': 'content_generation_percentage',
                'image_prompt_extraction': 'image_prompt_extraction_percentage',
                'image_generation': 'image_generation_percentage',
            }
            field = field_map.get(operation_type)
            percentage = getattr(allocation, field, 0) if field else 0
        return int(total_credits * (percentage / 100))
```
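As a worked example of the `int(total_credits * (percentage / 100))` rule, the sketch below applies it to the site-override defaults from the model above (10/10/40/5/35). The `allocate` helper and the `DEFAULT_PERCENTAGES` mapping are illustrative names, not part of the codebase.

```python
from decimal import Decimal

# Site-override defaults taken from SiteAIBudgetAllocation above.
DEFAULT_PERCENTAGES = {
    "clustering": Decimal("10"),
    "idea_generation": Decimal("10"),
    "content_generation": Decimal("40"),
    "image_prompt_extraction": Decimal("5"),
    "image_generation": Decimal("35"),
}


def allocate(total_credits, percentages=DEFAULT_PERCENTAGES):
    """Apply int(total * pct / 100) to every operation type."""
    return {
        operation: int(total_credits * (pct / 100))
        for operation, pct in percentages.items()
    }


budgets_1000 = allocate(1000)  # divides evenly: 100, 100, 400, 50, 350
budgets_999 = allocate(999)    # each share is truncated toward zero
unallocated = 999 - sum(budgets_999.values())  # 4 credits left over
```

Because `int()` truncates, the per-operation budgets can sum to slightly less than the total (995 of 999 credits here). Whether leftover credits should be assigned to a particular operation is a design decision the plan does not specify.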
### Frontend Changes
**Site Settings > AI Settings Tab:**
- Add "Credit Budget Allocation" section
- Toggle: "Use Global Defaults" / "Custom Allocation"
- If custom: Show sliders for each operation (must sum to 100%)
- Visual pie chart showing allocation
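
The "must sum to 100%" rule for custom allocations would plausibly be enforced on the server as well as in the sliders. A minimal sketch of such a check follows; `validate_allocation` is a hypothetical helper and the error messages are placeholders.

```python
from decimal import Decimal


def validate_allocation(percentages):
    """Return a list of error strings; an empty list means valid."""
    errors = []
    for name, value in percentages.items():
        if not (Decimal("0") <= value <= Decimal("100")):
            errors.append(f"{name} must be between 0 and 100")
    total = sum(percentages.values(), Decimal("0"))
    if total != Decimal("100"):
        errors.append(f"percentages sum to {total}, expected 100")
    return errors
```

Using `Decimal` rather than floats keeps the comparison with 100 exact for two-decimal inputs such as 33.33 + 33.33 + 33.34.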
---
## Part 3: Publishing Schedule Overhaul
### Current Issues
1. Limits are confusing - daily/weekly/monthly are treated as hard caps
2. Items not getting scheduled (30% missed in last run)
3. Time slot calculation doesn't account for stagger intervals
4. No visibility into WHY items weren't scheduled
### New Publishing Model
**Replace `PublishingSettings` with enhanced version:**
```python
import datetime


def default_publish_days():
    # Callable default so the list is not shared between instances
    return ['mon', 'tue', 'wed', 'thu', 'fri']


def default_publish_time_slots():
    return ['09:00', '14:00', '18:00']


class PublishingSettings(AccountBaseModel):
    site = models.OneToOneField('igny8_core_auth.Site', on_delete=models.CASCADE)
    # Auto-approval/publish toggles (keep existing)
    auto_approval_enabled = models.BooleanField(default=True)
    auto_publish_enabled = models.BooleanField(default=True)
    # NEW: Scheduling configuration (replaces hard limits)
    scheduling_mode = models.CharField(
        max_length=20,
        choices=[
            ('slots', 'Time Slots'),      # Publish at specific times
            ('stagger', 'Staggered'),     # Spread evenly throughout day
            ('immediate', 'Immediate'),   # Publish as soon as approved
        ],
        default='slots'
    )
    # Time slot configuration
    publish_days = models.JSONField(
        default=default_publish_days,
        help_text="Days allowed for publishing"
    )
    publish_time_slots = models.JSONField(
        default=default_publish_time_slots,
        help_text="Specific times for slot mode"
    )
    # Stagger mode configuration (time objects, so datetime.combine works on unsaved instances)
    stagger_start_time = models.TimeField(default=datetime.time(9, 0))
    stagger_end_time = models.TimeField(default=datetime.time(18, 0))
    stagger_interval_minutes = models.IntegerField(
        default=15,
        help_text="Minutes between publications in stagger mode"
    )
    # Daily TARGET (soft limit - for estimation, not blocking)
    daily_publish_target = models.IntegerField(
        default=3,
        help_text="Target articles per day (for scheduling spread)"
    )
    # Weekly/Monthly targets (informational only)
    weekly_publish_target = models.IntegerField(default=15)
    monthly_publish_target = models.IntegerField(default=50)
    # NEW: Maximum queue depth (actual limit)
    max_scheduled_queue = models.IntegerField(
        default=100,
        help_text="Maximum items that can be in 'scheduled' status at once"
    )
```
### New Scheduling Algorithm
```python
from datetime import datetime, timedelta

from django.utils import timezone


def calculate_publishing_slots(settings, site, count_needed):
    """
    Calculate publishing slots with NO arbitrary limits.

    Returns:
        List of (datetime, slot_info) tuples
    """
    now = timezone.now()
    if settings.scheduling_mode == 'immediate':
        # Space items one minute apart, starting now
        return [(now + timedelta(seconds=i*60), {'mode': 'immediate'}) for i in range(count_needed)]
    elif settings.scheduling_mode == 'stagger':
        # Spread throughout each day
        return _calculate_stagger_slots(settings, site, count_needed, now)
    else:  # 'slots' mode
        return _calculate_time_slot_slots(settings, site, count_needed, now)


def _calculate_stagger_slots(settings, site, count_needed, now):
    """
    Stagger mode: Spread publications evenly throughout publish hours.
    """
    slots = []
    day_map = {'mon': 0, 'tue': 1, 'wed': 2, 'thu': 3, 'fri': 4, 'sat': 5, 'sun': 6}
    allowed_days = [day_map[d] for d in settings.publish_days if d in day_map]
    current_date = now.date()
    interval = timedelta(minutes=settings.stagger_interval_minutes)
    for day_offset in range(90):  # Look up to 90 days ahead
        check_date = current_date + timedelta(days=day_offset)
        if check_date.weekday() not in allowed_days:
            continue
        # Generate slots for this day
        day_start = timezone.make_aware(
            datetime.combine(check_date, settings.stagger_start_time)
        )
        day_end = timezone.make_aware(
            datetime.combine(check_date, settings.stagger_end_time)
        )
        # Get existing scheduled for this day
        existing = Content.objects.filter(
            site=site,
            site_status='scheduled',
            scheduled_publish_at__date=check_date
        ).values_list('scheduled_publish_at', flat=True)
        existing_times = set(existing)
        current_slot = day_start
        if check_date == current_date and now > day_start:
            # Start from next interval after now
            minutes_since_start = (now - day_start).total_seconds() / 60
            intervals_passed = int(minutes_since_start / settings.stagger_interval_minutes) + 1
            current_slot = day_start + timedelta(minutes=intervals_passed * settings.stagger_interval_minutes)
        while current_slot <= day_end and len(slots) < count_needed:
            if current_slot not in existing_times:
                slots.append((current_slot, {'mode': 'stagger', 'date': str(check_date)}))
            current_slot += interval
        if len(slots) >= count_needed:
            break
    return slots
```
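The algorithm above calls `_calculate_time_slot_slots` without showing it. The sketch below illustrates how the time-slot mode could pick candidates. It is an assumption, not the project's implementation: it uses naive datetimes, a plain `taken` set in place of the `Content` query, and a hypothetical function name `time_slot_candidates`.

```python
from datetime import datetime, timedelta

DAY_MAP = {"mon": 0, "tue": 1, "wed": 2, "thu": 3, "fri": 4, "sat": 5, "sun": 6}


def time_slot_candidates(now, publish_days, time_slots, count_needed,
                         taken=frozenset(), horizon_days=90):
    """Return up to count_needed future datetimes at the configured times."""
    if count_needed <= 0:
        return []
    allowed = {DAY_MAP[d] for d in publish_days if d in DAY_MAP}
    times = sorted(datetime.strptime(t, "%H:%M").time() for t in time_slots)
    slots = []
    for offset in range(horizon_days):
        day = now.date() + timedelta(days=offset)
        if day.weekday() not in allowed:
            continue
        for slot_time in times:
            candidate = datetime.combine(day, slot_time)
            # Skip slots that are already past or already occupied.
            if candidate <= now or candidate in taken:
                continue
            slots.append(candidate)
            if len(slots) == count_needed:
                return slots
    return slots
```

Called on Friday 2026-01-16 at 15:00 with the default days and slots and `count_needed=3`, this yields Friday 18:00, then Monday 09:00 and Monday 14:00, skipping the weekend. A real implementation would use timezone-aware datetimes as the stagger code does.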
### Frontend Changes
**Site Settings > Publishing Tab - Redesign:**
```
┌─────────────────────────────────────────────────────────────────┐
│ Publishing Schedule │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Auto-Approval: [✓] Automatically approve content │
│ Auto-Publish: [✓] Automatically publish approved content │
│ │
│ ─── Scheduling Mode ─── │
│ ○ Time Slots - Publish at specific times each day │
│ ● Staggered - Spread evenly throughout publish hours │
│ ○ Immediate - Publish as soon as approved │
│ │
│ ─── Stagger Settings ─── │
│ Start Time: [09:00] End Time: [18:00] │
│ Interval: [15] minutes between publications │
│ │
│ ─── Publish Days ─── │
│ [✓] Mon [✓] Tue [✓] Wed [✓] Thu [✓] Fri [ ] Sat [ ] Sun │
│ │
│ ─── Targets (for estimation) ─── │
│ Daily: [3] Weekly: [15] Monthly: [50] │
│ │
│ ─── Current Queue ─── │
│ 📊 23 items scheduled │ Queue limit: 100 │
│ │
└─────────────────────────────────────────────────────────────────┘
```
---
## Part 4: Per-Run Item Limits
### Overview
Allow users to limit how many items are processed per automation run. This enables:
- Balancing content production with publishing capacity
- Predictable credit usage per run
- Gradual pipeline processing
### Database Changes
**Extend `AutomationConfig`:**
```python
class AutomationConfig(models.Model):
    # ... existing fields ...
    # NEW: Per-run limits (0 = unlimited)
    max_keywords_per_run = models.IntegerField(
        default=0,
        help_text="Max keywords to cluster per run (0=unlimited)"
    )
    max_clusters_per_run = models.IntegerField(
        default=0,
        help_text="Max clusters to generate ideas for per run (0=unlimited)"
    )
    max_ideas_per_run = models.IntegerField(
        default=0,
        help_text="Max ideas to convert to tasks per run (0=unlimited)"
    )
    max_tasks_per_run = models.IntegerField(
        default=0,
        help_text="Max tasks to generate content for per run (0=unlimited)"
    )
    max_content_per_run = models.IntegerField(
        default=0,
        help_text="Max content to extract image prompts for per run (0=unlimited)"
    )
    max_images_per_run = models.IntegerField(
        default=0,
        help_text="Max images to generate per run (0=unlimited)"
    )
    max_approvals_per_run = models.IntegerField(
        default=0,
        help_text="Max content to auto-approve per run (0=unlimited)"
    )
```
### Service Changes
**Modify stage methods to respect limits:**
```python
def run_stage_1(self):
    """Stage 1: Keywords → Clusters"""
    # ... existing setup ...
    # Apply per-run limit
    max_per_run = self.config.max_keywords_per_run
    if max_per_run > 0:
        pending_keywords = pending_keywords[:max_per_run]
        self.logger.log_stage_progress(
            self.run.run_id, self.account.id, self.site.id,
            1, f"Per-run limit: Processing up to {max_per_run} keywords"
        )
    total_count = pending_keywords.count()
    # ... rest of processing ...
```
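The same "0 means unlimited" rule applies to every stage, so it can be factored into one helper. The sketch below works on plain lists and is only an illustration. It is not the `_apply_per_run_limit()` method in `automation_service.py`, whose signature is not shown in this plan.

```python
def apply_per_run_limit(items, max_per_run):
    """Return (items_to_process, remaining_count); 0 means unlimited."""
    if max_per_run < 0:
        raise ValueError("max_per_run must be >= 0")
    if max_per_run == 0:
        return items, 0
    selected = items[:max_per_run]
    return selected, len(items) - len(selected)
```

With 150 pending keywords and a limit of 50, this selects the first 50 and reports 100 remaining for later runs.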
### Frontend Changes
**Automation Settings Panel - Enhanced:**
```
┌─────────────────────────────────────────────────────────────────┐
│ Per-Run Limits │
│ Control how much is processed in each automation run │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Stage 1: Keywords → Clusters │
│ [ 50 ] keywords per run │ Current pending: 150 │
│ ⚡ Will take ~3 runs to process all │
│ │
│ Stage 2: Clusters → Ideas │
│ [ 10 ] clusters per run │ Current pending: 25 │
│ │
│ Stage 3: Ideas → Tasks │
│ [ 0 ] (unlimited) │ Current pending: 30 │
│ │
│ Stage 4: Tasks → Content │
│ [ 5 ] tasks per run │ Current pending: 30 │
│ 💡 Tip: Match with daily publish target for balanced flow │
│ │
│ Stage 5: Content → Image Prompts │
│ [ 5 ] content per run │ Current pending: 10 │
│ │
│ Stage 6: Image Prompts → Images │
│ [ 20 ] images per run │ Current pending: 50 │
│ │
│ Stage 7: Review → Approved │
│ [ 5 ] approvals per run│ Current in review: 15 │
│ ⚠️ Limited by publishing schedule capacity │
│ │
└─────────────────────────────────────────────────────────────────┘
```
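The "Will take ~3 runs to process all" hint in the mockup is a ceiling division of pending items by the per-run limit. A small sketch of that estimate, assuming no new items arrive between runs (`runs_needed` is a hypothetical helper name):

```python
import math


def runs_needed(pending, max_per_run):
    """Estimate runs required to drain `pending` items (0 limit = one run)."""
    if pending <= 0:
        return 0
    if max_per_run <= 0:
        return 1  # unlimited: everything is processed in a single run
    return math.ceil(pending / max_per_run)
```

For the mockup's Stage 1 values, `runs_needed(150, 50)` gives 3, and Stage 4's 30 pending tasks at 5 per run would take 6 runs.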
---
## Part 5: UI/UX Fixes
### Automation Dashboard Issues
1. **Wrong metrics display** - Fix counts to show accurate pipeline state
2. **Confusing progress bars** - Use consistent calculation
3. **Missing explanations** - Add tooltips explaining each metric
### Run Detail Page Issues
1. **Stage results showing wrong data** - Fix JSON field mapping
2. **Missing "items remaining" after partial run** - Calculate from initial_snapshot
3. **No clear indication of WHY run stopped** - Show stopped_reason prominently
### Fixes
**GlobalProgressBar.tsx - Fix progress calculation:**
```typescript
// Use initial_snapshot as denominator, stage results as numerator
const calculateGlobalProgress = (run: AutomationRun): number => {
  if (!run.initial_snapshot) return 0;
  const total = run.initial_snapshot.total_initial_items || 0;
  if (total === 0) return 0;
  let processed = 0;
  processed += run.stage_1_result?.keywords_processed || 0;
  processed += run.stage_2_result?.clusters_processed || 0;
  processed += run.stage_3_result?.ideas_processed || 0;
  processed += run.stage_4_result?.tasks_processed || 0;
  processed += run.stage_5_result?.content_processed || 0;
  processed += run.stage_6_result?.images_processed || 0;
  processed += run.stage_7_result?.approved_count || 0;
  return Math.min(100, Math.round((processed / total) * 100));
};
```
---
## Implementation Order
### Phase 1: Critical Bug Fixes (Day 1)
1. ✅ Cancel releases lock
2. ✅ Scheduled check includes 'paused'
3. ✅ Resume reacquires lock
4. ✅ Resume has pause/cancel checks
5. ✅ Pause logs to files
### Phase 2: Per-Run Limits (Day 2)
1. Add model fields to AutomationConfig
2. Migration
3. Update automation_service.py stage methods
4. Frontend settings panel
5. Test with small limits
### Phase 3: Publishing Overhaul (Day 3)
1. Update PublishingSettings model
2. Migration
3. New scheduling algorithm
4. Frontend redesign
5. Test scheduling edge cases
### Phase 4: Credit Budget (Day 4)
1. Add model fields/new model
2. Migration
3. BudgetAllocationService
4. Frontend AI Settings section
5. Test budget calculations
### Phase 5: UI Fixes (Day 5)
1. Fix GlobalProgressBar
2. Fix AutomationPage metrics
3. Fix RunDetail display
4. Add helpful tooltips
5. End-to-end testing
---
## Testing Checklist
### Automation Flow
- [ ] Manual run starts, pauses, resumes, completes
- [ ] Manual run cancels, lock released, new run can start
- [ ] Scheduled run starts on time
- [ ] Scheduled run skips if manual run paused
- [ ] Resume after 7+ hour pause works
- [ ] Per-run limits respected
- [ ] Remaining items processed in next run
### Publishing
- [ ] Stagger mode spreads correctly
- [ ] Time slot mode uses exact times
- [ ] Immediate mode publishes right away
- [ ] No items missed due to limits
- [ ] Queue shows accurate count
### Credits
- [ ] Budget allocation calculates correctly
- [ ] Site override works
- [ ] Global defaults work
- [ ] Estimation uses budget
### UI
- [ ] Progress bar accurate during run
- [ ] Metrics match database counts
- [ ] Run detail shows correct stage results
- [ ] Stopped reason displayed clearly
---
## Rollback Plan
If issues arise:
1. All changes in separate migrations - can rollback individually
2. Feature flags for new behaviors (use_new_scheduling, use_budget_allocation)
3. Keep existing fields alongside new ones initially
4. Frontend changes are purely additive
---
## Success Criteria
1. **Zero lock issues** - Users never stuck unable to start automation
2. **100% scheduling** - All approved content gets scheduled
3. **Predictable runs** - Per-run limits produce consistent results
4. **Clear visibility** - UI shows exactly what's happening and why
5. **No regressions** - All existing functionality continues working