igny8/UNDER-OBSERVATION.md
2025-12-10 14:52:31 +00:00


# UNDER OBSERVATION
## Issue: User Logged Out During Image Prompt Generation (Dec 10, 2025)
### Original Problem
User performed workflow: auto-cluster → generate ideas → queue to writer → generate content → generate image prompt. During image prompt generation (near completion), user was automatically logged out.
### Investigation Timeline
**Initial Analysis:**
- Suspected backend container restarts invalidating sessions
- `docker ps` showed all containers up for 19+ minutes - NO RESTARTS during the incident
- Backend logs showed: `[IsAuthenticatedAndActive] DENIED: User not authenticated` and `Client error: Authentication credentials were not provided`
- Token was not being sent with API requests
**Root Cause Identified:**
The logout was NOT caused by backend issues or container restarts. It was caused by **frontend state corruption during HMR (Hot Module Replacement)**, triggered by code changes made to fix an unrelated `useLocation()` error.
**What Actually Happened:**
1. **Commit 5fb3687854d9aadfc5d604470f3712004b23243c** - Already had the proper fix for the `useLocation()` error (Suspense outside Routes)
2. **Additional "fixes" applied on Dec 10, 2025:**
   - Changed `cacheDir: "/tmp/vite-cache"` in vite.config.ts
   - Moved BrowserRouter above ErrorBoundary in main.tsx
   - Added `watch.interval: 100` and `fs.strict: false`
3. **These changes triggered:**
   - Vite cache stored in /tmp got wiped on container operations
   - Full rebuild with HMR
   - Component tree restructuring (BrowserRouter position change)
   - Auth store (Zustand persist) lost state during rapid unmount/remount cycle
   - Frontend started making API calls WITHOUT Authorization header
   - Backend correctly rejected unauthenticated requests
   - Frontend logout() triggered
### Fix Applied
**Reverted the problematic changes:**
- Removed `cacheDir: "/tmp/vite-cache"` - let Vite use default node_modules/.vite
- Restored BrowserRouter position inside ErrorBoundary/ThemeProvider (original structure)
- Removed `watch.interval` and `fs.strict` additions
**Kept the actual fixes:**
- Backend: Removed `IsSystemAccountOrDeveloper` from IntegrationSettingsViewSet class-level permissions
- Backend: Auto-cluster `extra_data` → `debug_info` parameter fix
- Frontend: Suspense wrapping Routes (from commit 5fb3687) - THIS was the real useLocation() fix
### What to Watch For
**1. useLocation() Error After Container Restarts**
- **Symptom:** "useLocation() may be used only in the context of a `<Router>` component"
- **Where:** Keywords page, other planner/writer module pages (50-60% of pages)
- **If it happens:**
  - Check if Vite cache is stale
  - Clear node_modules/.vite inside frontend container: `docker compose exec igny8_frontend rm -rf /app/node_modules/.vite`
  - Restart frontend container
  - DO NOT change cacheDir or component tree structure
**2. Auth State Loss During Development**
- **Symptom:** Random logouts during active sessions, "Authentication credentials were not provided"
- **Triggers:**
  - HMR with significant component tree changes
  - Rapid container restarts during development
  - Changes to context provider order in main.tsx
- **Prevention:**
  - Avoid restructuring main.tsx component tree
  - Test auth persistence after any main.tsx changes
  - Monitor browser console for localStorage errors during HMR
**3. Permission Errors for Normal Users**
- **Symptom:** "You do not have permission to perform this action" for valid users with complete account setup
- **Check:**
  - Backend logs for permission class debug output: `[IsAuthenticatedAndActive]`, `[IsViewerOrAbove]`, `[HasTenantAccess]`
  - Verify user has `role='owner'` and `is_active=True`
  - Ensure viewset doesn't have `IsSystemAccountOrDeveloper` at class level for endpoints normal users need
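
The log check above can be partly automated. The following is a minimal sketch, assuming every permission class logs in the same `[ClassName] DENIED: reason` shape seen for `[IsAuthenticatedAndActive]` during this incident; the function name `find_permission_denials` is hypothetical and not part of the codebase.

```python
# Markers emitted by the permission classes' debug logging, as listed above.
PERMISSION_MARKERS = (
    "[IsAuthenticatedAndActive]",
    "[IsViewerOrAbove]",
    "[HasTenantAccess]",
)


def find_permission_denials(log_lines):
    """Return (marker, line) pairs for lines carrying a marker and 'DENIED'.

    Assumption: all three classes use the 'DENIED' keyword; only the
    IsAuthenticatedAndActive format is confirmed by the incident logs.
    """
    hits = []
    for line in log_lines:
        for marker in PERMISSION_MARKERS:
            if marker in line and "DENIED" in line:
                hits.append((marker, line.strip()))
                break  # one hit per line is enough
    return hits
```

It accepts any iterable of strings, so it can be fed an open file handle for saved backend logs.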
**4. Celery Task Progress Polling 403 Errors**
- **Symptom:** Task progress endpoint returns 403 for normal users
- **Root cause:** ViewSet class-level permissions blocking action-level overrides
- **Solution:** Ensure IntegrationSettingsViewSet permission_classes doesn't include IsSystemAccountOrDeveloper
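
To illustrate why a class-level restriction blocks every action, here is a framework-free sketch of per-action permission resolution. It only models the idea behind Django REST Framework's `get_permissions()` hook and is not the project's code: the `User` dataclass, the check functions, and the action name `save_settings` are hypothetical stand-ins, while `task_progress` and `get_image_generation_settings` are the actions named in this document.

```python
from dataclasses import dataclass


@dataclass
class User:
    is_authenticated: bool
    is_active: bool = True
    is_system_account: bool = False


def is_authenticated_and_active(user):
    # Stand-in for the IsAuthenticatedAndActive permission class.
    return user.is_authenticated and user.is_active


def is_system_account_or_developer(user):
    # Stand-in for the IsSystemAccountOrDeveloper permission class.
    return user.is_system_account


# Actions that every authenticated user must be able to call.
OPEN_ACTIONS = {"task_progress", "get_image_generation_settings"}


def get_permissions(action):
    """Pick the checks per action instead of applying one list to all."""
    checks = [is_authenticated_and_active]
    if action not in OPEN_ACTIONS:
        # All other actions (e.g. save/test) stay restricted.
        checks.append(is_system_account_or_developer)
    return checks


def is_allowed(user, action):
    return all(check(user) for check in get_permissions(action))
```

With the restrictive check in a single class-level list, `task_progress` would be evaluated against it as well, which matches the 403 symptom described above.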
### Lessons Learned
1. **Don't layer fixes on top of fixes** - Identify root cause first
2. **Vite cache location matters** - /tmp gets wiped, breaking HMR state persistence
3. **Component tree structure is fragile** - Moving BrowserRouter breaks auth rehydration timing
4. **Container uptime ≠ code stability** - HMR can cause issues without restart
5. **Permission debugging** - Logging added to the permission classes was critical for diagnosis
6. **The original fix was already correct** - Commit 5fb3687 had it right, additional "improvements" broke it
### Files Modified (Reverted)
- `frontend/vite.config.ts` - Removed cacheDir and watch config changes
- `frontend/src/main.tsx` - Restored original component tree structure
### Files Modified (Kept)
- `backend/igny8_core/modules/system/integration_views.py` - Removed IsSystemAccountOrDeveloper
- `backend/igny8_core/modules/planner/views.py` - Fixed extra_data → debug_info
- `backend/igny8_core/api/permissions.py` - Added debug logging (can be removed later)
### Status
**RESOLVED** - Auth state stable, backend permissions correct, useLocation fix preserved.
**ADDITIONAL FIX (Dec 10, 2025 - Evening):**
1. **Permission Fix**: Fixed image generation task progress polling 403 errors
   - Root cause: `IsSystemAccountOrDeveloper` was still in class-level permissions
   - Solution: Moved to `get_permissions()` method to allow action-level overrides
   - `task_progress` and `get_image_generation_settings` now accessible to all authenticated users
   - Save/test operations still restricted to system accounts
2. **System Account Fallback**: Fixed "Image generation settings not found" for normal users
   - Root cause: IntegrationSettings are account-specific - normal users don't have their own settings
   - Only super user account (aws-admin) has configured API keys
   - Solution: Added fallback to system account (aws-admin) settings in `process_image_generation_queue` task
   - When user's account doesn't have IntegrationSettings, falls back to system account
   - Allows normal users to use centralized API keys managed by super users
   - Files modified: `backend/igny8_core/ai/tasks.py`
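
The system account fallback described above can be sketched as a plain lookup. This is an illustration under stated assumptions, not the code in `tasks.py`: the function `resolve_image_settings` and the dictionary keyed by account slug are hypothetical, and the real task presumably queries the IntegrationSettings model instead.

```python
SYSTEM_ACCOUNT = "aws-admin"  # system account named in this document


def resolve_image_settings(account_slug, settings_by_account):
    """Return (settings, source_account) for an image generation request.

    Prefers the user's own account settings and falls back to the
    system account when the user's account has none configured.
    """
    own = settings_by_account.get(account_slug)
    if own:
        return own, account_slug

    fallback = settings_by_account.get(SYSTEM_ACCOUNT)
    if fallback:
        return fallback, SYSTEM_ACCOUNT

    # Neither the user's account nor the system account is configured.
    raise LookupError("Image generation settings not found")
```

Returning the source account alongside the settings makes it easy to log when the fallback was used, which helps while this fix is under observation.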
**Monitor for 48 hours** - Watch for any recurrence of useLocation errors or auth issues after container restarts. Test image generation with normal user accounts (paid-2).