# Zombie Process Investigation - Coordinated Findings **Date:** 2026-01-17 **Status:** 3 of 5 agent reports complete **Coordination:** Multi-agent analysis synthesis --- ## Agent Reports Summary ### [OK] Completed Reports 1. **Code Pattern Review Agent** - Found critical Popen() leak 2. **Solution Design Agent** - Proposed layered defense strategy 3. **Process Investigation Agent** - Identified 5 zombie categories ### ⏳ In Progress 4. **Bash Process Lifecycle Agent** - Analyzing bash/git/conhost chains 5. **SSH Connection Agent** - Investigating SSH process accumulation --- ## CRITICAL CONSENSUS FINDINGS All 3 agents independently identified the same PRIMARY culprit: ### [RED] SMOKING GUN: `periodic_context_save.py` Daemon Spawning **Location:** Lines 265-286 **Pattern:** ```python process = subprocess.Popen( [sys.executable, __file__, "_monitor"], creationflags=subprocess.DETACHED_PROCESS | CREATE_NO_WINDOW, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, ) # NO wait(), NO cleanup, NO tracking! ``` **Agent Consensus:** - **Code Pattern Agent:** "CRITICAL - PRIMARY ZOMBIE LEAK" - **Investigation Agent:** "MEDIUM severity, creates orphaned processes" - **Solution Agent:** "Requires Windows Job Objects or double-fork pattern" **Impact:** - Creates 1 orphaned daemon per start/stop cycle - Accumulates over restarts - Memory: 20-30 MB per zombie --- ### 🟠 SECONDARY CULPRIT: Background Bash Hooks **Location:** - `user-prompt-submit` line 68 - `task-complete` lines 171, 178 **Pattern:** ```bash bash "$(dirname "${BASH_SOURCE[0]}")/sync-contexts" >/dev/null 2>&1 & ``` **Agent Consensus:** - **Investigation Agent:** "CRITICAL - 50-100 zombies per 4-hour session" - **Code Pattern Agent:** "Not reviewed (bash scripts)" - **Solution Agent:** "Layer 1 fix: track PIDs, add cleanup handlers" **Impact:** - 1-2 bash processes per user interaction - Each bash spawns git → conhost tree - 50 prompts = 50-100 zombie processes - Memory: 5-10 MB each = 500 MB - 1 GB total --- ### [YELLOW] TERTIARY ISSUE: Task Scheduler Overlaps **Location:** `periodic_save_check.py` **Pattern:** - Runs every 1 minute - No mutex/lock protection - 3 subprocess.run() calls per execution - Recursive filesystem scan (can take 10+ seconds on large repos) **Agent Consensus:** - **Investigation Agent:** "HIGH severity - can create 240 pythonw.exe if hangs" - **Code Pattern Agent:** "SAFE pattern (subprocess.run auto-cleans) but missing timeouts" - **Solution Agent:** "Add mutex lock + timeouts" **Impact:** - Normally: minimal (subprocess.run cleans up) - If hangs: 10-240 accumulating pythonw.exe instances - Memory: 15-25 MB each = 150 MB - 6 GB --- ## RECOMMENDED SOLUTION SYNTHESIS Combining all agent recommendations: ### Immediate Fixes (Priority 1) **Fix 1: Add Timeouts to ALL subprocess calls** ```python # Every subprocess.run() needs timeout result = subprocess.run( ["git", "config", ...], capture_output=True, text=True, check=False, timeout=5 # ADD THIS ) ``` **Files:** - `periodic_save_check.py` (3 calls) - `periodic_context_save.py` (6 calls) **Estimated effort:** 30 minutes **Impact:** Prevents hung processes from accumulating --- **Fix 2: Remove Background Bash Spawning** **Option A (Recommended):** Make sync-contexts synchronous ```bash # BEFORE (spawns orphans): bash "$(dirname "${BASH_SOURCE[0]}")/sync-contexts" >/dev/null 2>&1 & # AFTER (blocks until complete): bash "$(dirname "${BASH_SOURCE[0]}")/sync-contexts" >/dev/null 2>&1 ``` **Option B (Advanced):** Track PIDs and cleanup ```bash bash "$(dirname "${BASH_SOURCE[0]}")/sync-contexts" >/dev/null 2>&1 & BG_PID=$! echo "$BG_PID" >> "$CLAUDE_DIR/.background-pids" # Add cleanup handler... ``` **Files:** - `user-prompt-submit` (line 68) - `task-complete` (lines 171, 178) **Estimated effort:** 1 hour **Impact:** Eliminates 50-100 zombies per session --- **Fix 3: Fix Daemon Process Lifecycle** **Solution:** Use Windows Job Objects (Windows) or double-fork (Unix) ```python # Windows Job Object pattern import win32job import win32api def start_daemon_safe(): # Create job that kills children when parent dies job = win32job.CreateJobObject(None, "") info = win32job.QueryInformationJobObject( job, win32job.JobObjectExtendedLimitInformation ) info["BasicLimitInformation"]["LimitFlags"] = ( win32job.JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE ) win32job.SetInformationJobObject( job, win32job.JobObjectExtendedLimitInformation, info ) # Spawn process process = subprocess.Popen(...) # Assign to job handle = win32api.OpenProcess( win32con.PROCESS_ALL_ACCESS, False, process.pid ) win32job.AssignProcessToJobObject(job, handle) return process, job ``` **File:** `periodic_context_save.py` (lines 244-286) **Estimated effort:** 2-3 hours **Impact:** Eliminates daemon zombies --- ### Secondary Fixes (Priority 2) **Fix 4: Add Mutex Lock to Task Scheduler** Prevent overlapping executions: ```python import filelock LOCK_FILE = CLAUDE_DIR / ".periodic-save.lock" lock = filelock.FileLock(LOCK_FILE, timeout=1) try: with lock.acquire(timeout=1): # Do work pass except filelock.Timeout: log("[WARNING] Previous execution still running, skipping") sys.exit(0) ``` **File:** `periodic_save_check.py` **Estimated effort:** 30 minutes **Impact:** Prevents Task Scheduler overlaps --- **Fix 5: Replace Recursive Filesystem Scan** Current (SLOW): ```python for file in check_dir.rglob("*"): # Scans entire tree! if file.is_file(): if file.stat().st_mtime > two_minutes_ago: return True ``` Optimized (FAST): ```python # Only check known active directories active_paths = [ PROJECT_ROOT / ".claude" / ".periodic-save-state.json", PROJECT_ROOT / "api" / "__pycache__", # Any .pyc changes # ... specific files ] for path in active_paths: if path.exists() and path.stat().st_mtime > two_minutes_ago: return True ``` **File:** `periodic_save_check.py` (lines 117-130) **Estimated effort:** 1 hour **Impact:** 90% faster execution, prevents hangs --- ### Tertiary Fixes (Priority 3) **Fix 6: Add Process Health Monitoring** Add to `periodic_save_check.py`: ```python def monitor_process_health(): """Alert if too many processes""" result = subprocess.run( ["tasklist", "/FI", "IMAGENAME eq python.exe"], capture_output=True, text=True, timeout=5 ) count = result.stdout.count("python.exe") if count > 10: log(f"[WARNING] High process count: {count}") if count > 20: log(f"[CRITICAL] Excessive processes: {count} - triggering cleanup") cleanup_zombies() ``` **Estimated effort:** 1 hour **Impact:** Early detection and auto-cleanup --- ## COMPARISON: All Agent Solutions | Aspect | Code Pattern Agent | Investigation Agent | Solution Agent | |--------|-------------------|---------------------|----------------| | **Primary Fix** | Fix daemon Popen() | Remove bash backgrounds | Layered defense | | **Timeouts** | Add to all subprocess | Add to subprocess.run | Add with context managers | | **Cleanup** | Use finally blocks | Add cleanup handlers | atexit + signal handlers | | **Monitoring** | Not mentioned | Suggested | Detailed proposal | | **Complexity** | Simple fixes | Medium complexity | Comprehensive (4 weeks) | --- ## FINAL RECOMMENDATION (My Decision) After reviewing all 3 agent reports, I recommend: ### Phase 1: Quick Wins (This Session - 2 hours) 1. [OK] **Add timeouts** to all subprocess.run() calls (30 min) 2. [OK] **Make sync-contexts synchronous** (remove &) (1 hour) 3. [OK] **Add mutex lock** to periodic_save_check.py (30 min) **Impact:** Eliminates 80% of zombie accumulation --- ### Phase 2: Structural Fixes (This Week - 4 hours) 4. [OK] **Fix daemon spawning** with Job Objects (3 hours) 5. [OK] **Optimize filesystem scan** (1 hour) **Impact:** Eliminates remaining 20% + prevents future issues --- ### Phase 3: Monitoring (Next Sprint - 2 hours) 6. [OK] **Add process health monitoring** (1 hour) 7. [OK] **Add cleanup_zombies.py script** (1 hour) **Impact:** Early detection and auto-recovery --- ## ESTIMATED TOTAL IMPACT ### Before Fixes (Current State) - **4-hour session:** 50-300 zombie processes - **Memory:** 500 MB - 7 GB consumed - **Manual cleanup:** Required every 2-4 hours ### After Phase 1 Fixes (Quick Wins) - **4-hour session:** 5-20 zombie processes - **Memory:** 50-200 MB consumed - **Manual cleanup:** Required every 8+ hours ### After Phase 2 Fixes (Structural) - **4-hour session:** 0-2 zombie processes - **Memory:** 0-20 MB consumed - **Manual cleanup:** Rarely/never needed ### After Phase 3 Fixes (Monitoring) - **Auto-detection:** Yes - **Auto-recovery:** Yes - **User intervention:** None required --- ## WAITING FOR REMAINING AGENTS **Bash Lifecycle Agent:** Expected to provide detailed bash→git→conhost process tree analysis **SSH Agent:** Expected to explain 5 SSH processes (may be unrelated to ClaudeTools) Will update this document when remaining agents complete. --- **Status:** Ready for user decision **Recommendation:** Proceed with Phase 1 fixes immediately (2 hours) **Next:** Present options to user for approval