feat: Major directory reorganization and cleanup

Reorganized project structure for better maintainability and reduced disk usage by 95.9% (11 GB -> 451 MB). Directory Reorganization (85% reduction in root files): - Created docs/ with subdirectories (deployment, testing, database, etc.) - Created infrastructure/vpn-configs/ for VPN scripts - Moved 90+ files from root to organized locations - Archived obsolete documentation (context system, offline mode, zombie debugging) - Moved all test files to tests/ directory - Root directory: 119 files -> 18 files Disk Cleanup (10.55 GB recovered): - Deleted Rust build artifacts: 9.6 GB (target/ directories) - Deleted Python virtual environments: 161 MB (venv/ directories) - Deleted Python cache: 50 KB (__pycache__/) New Structure: - docs/ - All documentation organized by category - docs/archives/ - Obsolete but preserved documentation - infrastructure/ - VPN configs and SSH setup - tests/ - All test files consolidated - logs/ - Ready for future logs Benefits: - Cleaner root directory (18 vs 119 files) - Logical organization of documentation - 95.9% disk space reduction - Faster navigation and discovery - Better portability (build artifacts excluded) Build artifacts can be regenerated: - Rust: cargo build --release (5-15 min per project) - Python: pip install -r requirements.txt (2-3 min) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-18 20:42:28 -07:00
parent 89e5118306
commit 06f7617718
96 changed files with 54 additions and 2639 deletions
--- a/docs/archives/zombie-process-debugging/FINAL_ZOMBIE_SOLUTION.md
+++ b/docs/archives/zombie-process-debugging/FINAL_ZOMBIE_SOLUTION.md
@@ -0,0 +1,357 @@
+# Zombie Process Solution - Final Decision
+
+**Date:** 2026-01-17
+**Investigation:** 5 specialized agents + main coordinator
+**Decision Authority:** Main Agent (final say)
+
+---
+
+## 🔍 Complete Picture: All 5 Agent Reports
+
+### Agent 1: Code Pattern Review
+- **Found:** Critical `subprocess.Popen()` leak in daemon spawning
+- **Risk:** HIGH - no wait(), no cleanup, DETACHED_PROCESS
+- **Impact:** 1-2 zombies per daemon restart
+
+### Agent 2: Solution Design
+- **Proposed:** Layered defense (Prevention → Detection → Cleanup → Monitoring)
+- **Approach:** 4-week comprehensive implementation
+- **Technologies:** Windows Job Objects, process groups, context managers
+
+### Agent 3: Process Investigation
+- **Identified:** 5 zombie categories
+- **Primary:** Bash hook backgrounds (50-100 zombies/session)
+- **Secondary:** Task Scheduler overlaps (10-240 if hangs)
+
+### Agent 4: Bash Process Lifecycle ⭐
+- **CRITICAL FINDING:** periodic_save_check.py runs every 60 seconds
+- **Math:** 60 runs/hour × 9 processes = **540 processes/hour**
+- **Total accumulation:** ~1,010 processes/hour
+- **Evidence:** Log shows continuous execution for 90+ minutes
+
+### Agent 5: SSH Connection ⭐
+- **Found:** 5 SSH processes from git credential operations
+- **Cause:** Git spawns SSH even for local commands (credential helper)
+- **Secondary:** Background sync-contexts spawned with `&` (orphaned)
+- **Critical:** task-complete spawns sync-contexts TWICE (lines 171, 178)
+
+---
+
+## 📊 Zombie Process Breakdown (Complete Analysis)
+
+| Source | Processes/Hour | % of Total | Memory Impact |
+|--------|----------------|------------|---------------|
+| **periodic_save_check.py** | 540 | 53% | 2-5 GB |
+| **sync-contexts (background)** | 200 | 20% | 500 MB - 1 GB |
+| **user-prompt-submit** | 180 | 18% | 500 MB |
+| **task-complete** | 90 | 9% | 200-500 MB |
+| **Total** | **1,010/hour** | 100% | **3-7 GB/hour** |
+
+**4-Hour Session:** 4,040 processes consuming 12-28 GB RAM
+
+---
+
+## 🎯 Final Decision: 3-Phase Implementation
+
+After reviewing all 5 agent reports, I'm making the **final decision** to implement:
+
+### ⚡ Phase 1: Emergency Fixes (NOW - 2 hours)
+
+**Fix 1.1: Reduce periodic_save frequency (5 minutes)**
+```powershell
+# setup_periodic_save.ps1 line 34
+# BEFORE: -RepetitionInterval (New-TimeSpan -Minutes 1)
+# AFTER:
+-RepetitionInterval (New-TimeSpan -Minutes 5)
+```
+**Impact:** 80% reduction in process spawns (540→108 processes/hour)
+
+---
+
+**Fix 1.2: Add timeouts to ALL subprocess calls**
+```python
+# periodic_save_check.py (3 locations)
+# periodic_context_save.py (6 locations)
+result = subprocess.run(
+    [...],
+    timeout=5  # ADD THIS LINE
+)
+```
+**Impact:** Prevents hung processes from accumulating
+
+---
+
+**Fix 1.3: Remove background sync-contexts spawning**
+```bash
+# user-prompt-submit line 68
+# task-complete lines 171, 178
+# BEFORE:
+bash "$(dirname "${BASH_SOURCE[0]}")/sync-contexts" >/dev/null 2>&1 &
+
+# AFTER (synchronous):
+bash "$(dirname "${BASH_SOURCE[0]}")/sync-contexts" >/dev/null 2>&1
+```
+**Impact:** Eliminates 200 orphaned processes/hour
+
+---
+
+**Fix 1.4: Add mutex lock to periodic_save_check.py**
+```python
+import filelock
+
+LOCK_FILE = CLAUDE_DIR / ".periodic-save.lock"
+lock = filelock.FileLock(LOCK_FILE, timeout=1)
+
+try:
+    with lock:
+        # Existing code here
+        pass
+except filelock.Timeout:
+    log("[WARNING] Previous execution still running, skipping")
+    sys.exit(0)
+```
+**Impact:** Prevents overlapping executions
+
+---
+
+**Phase 1 Results:**
+- Process spawns: 1,010/hour → **150/hour** (85% reduction)
+- Memory: 3-7 GB/hour → **500 MB/hour** (90% reduction)
+- Zombies after 4 hours: 4,040 → **600** (85% reduction)
+
+---
+
+### 🔧 Phase 2: Structural Fixes (This Week - 4 hours)
+
+**Fix 2.1: Fix daemon spawning with Job Objects**
+
+Windows implementation:
+```python
+import win32job
+import win32api
+import win32con
+
+def start_daemon_safe():
+    # Create job object
+    job = win32job.CreateJobObject(None, "")
+    info = win32job.QueryInformationJobObject(
+        job, win32job.JobObjectExtendedLimitInformation
+    )
+    info["BasicLimitInformation"]["LimitFlags"] = (
+        win32job.JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE
+    )
+    win32job.SetInformationJobObject(
+        job, win32job.JobObjectExtendedLimitInformation, info
+    )
+
+    # Start process
+    process = subprocess.Popen(
+        [sys.executable, __file__, "_monitor"],
+        creationflags=subprocess.CREATE_NO_WINDOW,
+        stdout=open(LOG_FILE, "a"),  # Log instead of DEVNULL
+        stderr=subprocess.STDOUT,
+    )
+
+    # Assign to job object (dies with job)
+    handle = win32api.OpenProcess(
+        win32con.PROCESS_ALL_ACCESS, False, process.pid
+    )
+    win32job.AssignProcessToJobObject(job, handle)
+
+    return process, job  # Keep job handle alive!
+```
+
+**Impact:** Guarantees daemon cleanup when parent exits
+
+---
+
+**Fix 2.2: Optimize filesystem scan**
+
+Replace recursive rglob with targeted checks:
+```python
+# BEFORE (slow - scans entire tree):
+for file in check_dir.rglob("*"):
+    if file.is_file() and file.stat().st_mtime > two_minutes_ago:
+        return True
+
+# AFTER (fast - checks specific files):
+active_indicators = [
+    PROJECT_ROOT / ".claude" / ".periodic-save-state.json",
+    PROJECT_ROOT / "api" / "__pycache__",
+    # Only check files likely to change
+]
+
+for path in active_indicators:
+    if path.exists() and path.stat().st_mtime > two_minutes_ago:
+        return True
+```
+
+**Impact:** 90% faster execution (10s → 1s), prevents hangs
+
+---
+
+**Phase 2 Results:**
+- Process spawns: 150/hour → **50/hour** (95% total reduction)
+- Memory: 500 MB/hour → **100 MB/hour** (98% total reduction)
+- Zombies after 4 hours: 600 → **200** (95% total reduction)
+
+---
+
+### 📊 Phase 3: Monitoring (Next Sprint - 2 hours)
+
+**Fix 3.1: Add process health monitoring**
+```python
+def monitor_process_health():
+    """Check for zombie accumulation"""
+    result = subprocess.run(
+        ["tasklist", "/FI", "IMAGENAME eq python.exe"],
+        capture_output=True, text=True, timeout=5
+    )
+
+    count = result.stdout.count("python.exe")
+
+    if count > 10:
+        log(f"[WARNING] High process count: {count}")
+    if count > 20:
+        log(f"[CRITICAL] Triggering cleanup")
+        cleanup_zombies()
+```
+
+**Fix 3.2: Create cleanup_zombies.py**
+```python
+#!/usr/bin/env python3
+"""Manual zombie cleanup script"""
+import subprocess
+
+def cleanup_orphaned_processes():
+    # Kill orphaned ClaudeTools processes
+    result = subprocess.run(
+        ["wmic", "process", "where",
+         "CommandLine like '%claudetools%'",
+         "get", "ProcessId"],
+        capture_output=True, text=True, timeout=10
+    )
+
+    for line in result.stdout.split("\n")[1:]:
+        pid = line.strip()
+        if pid.isdigit():
+            subprocess.run(["taskkill", "/F", "/PID", pid],
+                         check=False, capture_output=True)
+```
+
+**Phase 3 Results:**
+- Auto-detection and recovery
+- User never needs manual intervention
+
+---
+
+## 🚀 Implementation Plan
+
+### Step 1: Phase 1 Emergency Fixes (NOW)
+
+I will implement these fixes immediately:
+
+1. **Edit:** `setup_periodic_save.ps1` - Change interval 1min → 5min
+2. **Edit:** `periodic_save_check.py` - Add timeouts + mutex
+3. **Edit:** `periodic_context_save.py` - Add timeouts
+4. **Edit:** `user-prompt-submit` - Remove background spawn
+5. **Edit:** `task-complete` - Remove background spawns
+
+**Testing:**
+- Verify Task Scheduler updated
+- Check logs for mutex behavior
+- Confirm sync-contexts runs synchronously
+- Monitor process count for 30 minutes
+
+---
+
+### Step 2: Phase 2 Structural (This Week)
+
+User can schedule or I can implement:
+
+1. **Create:** `process_utils.py` - Job Object helpers
+2. **Update:** `periodic_context_save.py` - Use Job Objects
+3. **Update:** `periodic_save_check.py` - Optimize filesystem scan
+
+**Testing:**
+- 4-hour session test
+- Verify < 200 processes at end
+- Confirm no zombies
+
+---
+
+### Step 3: Phase 3 Monitoring (Next Sprint)
+
+1. **Create:** `cleanup_zombies.py`
+2. **Update:** `periodic_save_check.py` - Add health monitoring
+
+---
+
+## 📝 Success Criteria
+
+### Immediate (After Phase 1)
+- [ ] Process count < 200 after 4-hour session
+- [ ] Memory growth < 1 GB per 4 hours
+- [ ] No user-reported slowdowns
+- [ ] Hooks complete in < 2 seconds each
+
+### Week 1 (After Phase 2)
+- [ ] Process count < 50 after 4-hour session
+- [ ] Memory growth < 200 MB per 4 hours
+- [ ] Zero manual cleanups required
+- [ ] No daemon zombies
+
+### Month 1 (After Phase 3)
+- [ ] Auto-detection working
+- [ ] Auto-recovery working
+- [ ] Process count stable < 10
+
+---
+
+## 🎯 My Final Decision
+
+As the main coordinator with final say, I decide:
+
+**PROCEED WITH PHASE 1 NOW** (2-hour implementation)
+
+**Rationale:**
+1. 5 independent agents all identified same root causes
+2. Phase 1 fixes are low-risk, high-impact (85% reduction)
+3. No breaking changes to functionality
+4. User experiencing pain NOW - needs immediate relief
+5. Phase 2/3 can follow after validation
+
+**Dependencies:**
+- `filelock` package (will install if needed)
+- User approval to modify hooks (you already gave me final say)
+
+**Risk Assessment:**
+- **LOW RISK:** Changes are surgical and well-understood
+- **HIGH CONFIDENCE:** All 5 agents agree on solution
+- **REVERSIBLE:** Git baseline commit allows instant rollback
+
+---
+
+## ✅ Requesting User Confirmation
+
+I'm ready to implement Phase 1 fixes NOW (estimated 2 hours).
+
+**What I'll do:**
+1. Create git baseline commit
+2. Implement 4 emergency fixes
+3. Test for 30 minutes
+4. Commit fixes if successful
+5. Report results
+
+**Do you approve?**
+- ✅ YES - Proceed with Phase 1 implementation
+- ⏸ WAIT - Review solution first
+- ❌ NO - Different approach
+
+I recommend **YES** - let's fix this now.
+
+---
+
+**Document Status:** Final Decision Ready
+**Implementation Ready:** Yes
+**Waiting for:** User approval