feat: Major directory reorganization and cleanup

Reorganized project structure for better maintainability and reduced
disk usage by 95.9% (11 GB -> 451 MB).

Directory Reorganization (85% reduction in root files):
- Created docs/ with subdirectories (deployment, testing, database, etc.)
- Created infrastructure/vpn-configs/ for VPN scripts
- Moved 90+ files from root to organized locations
- Archived obsolete documentation (context system, offline mode, zombie debugging)
- Moved all test files to tests/ directory
- Root directory: 119 files -> 18 files

Disk Cleanup (10.55 GB recovered):
- Deleted Rust build artifacts: 9.6 GB (target/ directories)
- Deleted Python virtual environments: 161 MB (venv/ directories)
- Deleted Python cache: 50 KB (__pycache__/)

New Structure:
- docs/ - All documentation organized by category
- docs/archives/ - Obsolete but preserved documentation
- infrastructure/ - VPN configs and SSH setup
- tests/ - All test files consolidated
- logs/ - Ready for future logs

Benefits:
- Cleaner root directory (18 vs 119 files)
- Logical organization of documentation
- 95.9% disk space reduction
- Faster navigation and discovery
- Better portability (build artifacts excluded)

Build artifacts can be regenerated:
- Rust: cargo build --release (5-15 min per project)
- Python: pip install -r requirements.txt (2-3 min)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Date: 2026-01-18 20:42:28 -07:00
Parent: 89e5118306
Commit: 06f7617718
96 changed files with 54 additions and 2639 deletions

# Zombie Process Investigation - Coordinated Findings
**Date:** 2026-01-17
**Status:** 3 of 5 agent reports complete
**Coordination:** Multi-agent analysis synthesis
---
## Agent Reports Summary
### ✅ Completed Reports
1. **Code Pattern Review Agent** - Found critical Popen() leak
2. **Solution Design Agent** - Proposed layered defense strategy
3. **Process Investigation Agent** - Identified 5 zombie categories
### ⏳ In Progress
4. **Bash Process Lifecycle Agent** - Analyzing bash/git/conhost chains
5. **SSH Connection Agent** - Investigating SSH process accumulation
---
## CRITICAL CONSENSUS FINDINGS
All 3 agents independently identified the same PRIMARY culprit:
### 🔴 SMOKING GUN: `periodic_context_save.py` Daemon Spawning
**Location:** Lines 265-286
**Pattern:**
```python
process = subprocess.Popen(
    [sys.executable, __file__, "_monitor"],
    creationflags=subprocess.DETACHED_PROCESS | subprocess.CREATE_NO_WINDOW,
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)
# NO wait(), NO cleanup, NO tracking!
```
**Agent Consensus:**
- **Code Pattern Agent:** "CRITICAL - PRIMARY ZOMBIE LEAK"
- **Investigation Agent:** "MEDIUM severity, creates orphaned processes"
- **Solution Agent:** "Requires Windows Job Objects or double-fork pattern"
**Impact:**
- Creates 1 orphaned daemon per start/stop cycle
- Accumulates over restarts
- Memory: 20-30 MB per zombie
---
### 🟠 SECONDARY CULPRIT: Background Bash Hooks
**Location:**
- `user-prompt-submit` line 68
- `task-complete` lines 171, 178
**Pattern:**
```bash
bash "$(dirname "${BASH_SOURCE[0]}")/sync-contexts" >/dev/null 2>&1 &
```
**Agent Consensus:**
- **Investigation Agent:** "CRITICAL - 50-100 zombies per 4-hour session"
- **Code Pattern Agent:** "Not reviewed (bash scripts)"
- **Solution Agent:** "Layer 1 fix: track PIDs, add cleanup handlers"
**Impact:**
- 1-2 bash processes per user interaction
- Each bash spawns git → conhost tree
- 50 prompts = 50-100 zombie processes
- Memory: 5-10 MB each = 500 MB - 1 GB total
---
### 🟡 TERTIARY ISSUE: Task Scheduler Overlaps
**Location:** `periodic_save_check.py`
**Pattern:**
- Runs every 1 minute
- No mutex/lock protection
- 3 subprocess.run() calls per execution
- Recursive filesystem scan (can take 10+ seconds on large repos)
**Agent Consensus:**
- **Investigation Agent:** "HIGH severity - can create 240 pythonw.exe if hangs"
- **Code Pattern Agent:** "SAFE pattern (subprocess.run auto-cleans) but missing timeouts"
- **Solution Agent:** "Add mutex lock + timeouts"
**Impact:**
- Normally: minimal (subprocess.run cleans up)
- If hangs: 10-240 accumulating pythonw.exe instances
- Memory: 15-25 MB each = 150 MB - 6 GB
---
## RECOMMENDED SOLUTION SYNTHESIS
Combining all agent recommendations:
### Immediate Fixes (Priority 1)
**Fix 1: Add Timeouts to ALL subprocess calls**
```python
# Every subprocess.run() needs a timeout
result = subprocess.run(
    ["git", "config", ...],
    capture_output=True,
    text=True,
    check=False,
    timeout=5,  # ADD THIS
)
```
**Files:**
- `periodic_save_check.py` (3 calls)
- `periodic_context_save.py` (6 calls)
**Estimated effort:** 30 minutes
**Impact:** Prevents hung processes from accumulating
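Note that `timeout=` alone raises `subprocess.TimeoutExpired` when a command hangs, so each call site also needs a handler. A minimal sketch of the pattern (the wrapped command and the `None` fallback are illustrative, not the project's actual call sites):

```python
import subprocess


def run_git_config(timeout=5):
    """Run a short git query, treating a hang or missing git as a soft failure."""
    try:
        result = subprocess.run(
            ["git", "config", "--get", "user.name"],
            capture_output=True,
            text=True,
            check=False,
            timeout=timeout,
        )
        return result.stdout.strip()
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # subprocess.run() kills and reaps the child on timeout,
        # so no process is left behind
        return None
```

On timeout, `subprocess.run()` terminates the child before raising, which is exactly the property that stops hung calls from accumulating.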
---
**Fix 2: Remove Background Bash Spawning**
**Option A (Recommended):** Make sync-contexts synchronous
```bash
# BEFORE (spawns orphans):
bash "$(dirname "${BASH_SOURCE[0]}")/sync-contexts" >/dev/null 2>&1 &
# AFTER (blocks until complete):
bash "$(dirname "${BASH_SOURCE[0]}")/sync-contexts" >/dev/null 2>&1
```
**Option B (Advanced):** Track PIDs and cleanup
```bash
bash "$(dirname "${BASH_SOURCE[0]}")/sync-contexts" >/dev/null 2>&1 &
BG_PID=$!
echo "$BG_PID" >> "$CLAUDE_DIR/.background-pids"
# One possible cleanup handler (illustrative): kill the job if it is
# still running when this script exits
trap 'kill "$BG_PID" 2>/dev/null' EXIT
```
**Files:**
- `user-prompt-submit` (line 68)
- `task-complete` (lines 171, 178)
**Estimated effort:** 1 hour
**Impact:** Eliminates 50-100 zombies per session
---
**Fix 3: Fix Daemon Process Lifecycle**
**Solution:** Use Windows Job Objects (Windows) or double-fork (Unix)
```python
# Windows Job Object pattern (requires pywin32)
import subprocess
import sys

import win32api
import win32con
import win32job


def start_daemon_safe():
    # Create a job that kills its child processes when the job handle closes
    job = win32job.CreateJobObject(None, "")
    info = win32job.QueryInformationJobObject(
        job, win32job.JobObjectExtendedLimitInformation
    )
    info["BasicLimitInformation"]["LimitFlags"] = (
        win32job.JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE
    )
    win32job.SetInformationJobObject(
        job, win32job.JobObjectExtendedLimitInformation, info
    )
    # Spawn the monitor process (same invocation as the current code)
    process = subprocess.Popen([sys.executable, __file__, "_monitor"])
    # Assign it to the job so it dies with the parent
    handle = win32api.OpenProcess(
        win32con.PROCESS_ALL_ACCESS, False, process.pid
    )
    win32job.AssignProcessToJobObject(job, handle)
    return process, job
```
**File:** `periodic_context_save.py` (lines 244-286)
**Estimated effort:** 2-3 hours
**Impact:** Eliminates daemon zombies
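For completeness, the Unix double-fork counterpart mentioned above can be sketched as follows. This is illustrative only, since `os.fork` does not exist on Windows:

```python
import os
import sys


def daemonize():
    """Classic double fork: detach so the daemon is adopted by init (PID 1)
    and can never linger as a zombie child of the original process."""
    if os.fork() > 0:
        return False          # original process: keep running normally
    os.setsid()               # new session, no controlling terminal
    if os.fork() > 0:
        os._exit(0)           # first child exits immediately...
    sys.stdout.flush()        # ...so init adopts the grandchild
    sys.stderr.flush()
    return True               # grandchild: run the daemon work here
```

Because the intermediate child exits at once, the grandchild is reparented to init, which reaps it automatically when it terminates.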
---
### Secondary Fixes (Priority 2)
**Fix 4: Add Mutex Lock to Task Scheduler**
Prevent overlapping executions:
```python
import sys

import filelock

LOCK_FILE = CLAUDE_DIR / ".periodic-save.lock"
lock = filelock.FileLock(LOCK_FILE, timeout=1)
try:
    with lock.acquire(timeout=1):
        # Do work
        pass
except filelock.Timeout:
    log("[WARNING] Previous execution still running, skipping")
    sys.exit(0)
```
**File:** `periodic_save_check.py`
**Estimated effort:** 30 minutes
**Impact:** Prevents Task Scheduler overlaps
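If pulling in the third-party `filelock` package is undesirable, the same skip-if-already-running behavior can be approximated with a stdlib-only `O_EXCL` lock file. A sketch (the lock path is hypothetical; unlike `filelock`, a crashed run can leave a stale lock that needs manual removal):

```python
import os
import tempfile

LOCK_PATH = os.path.join(tempfile.gettempdir(), "periodic-save.lock")


def try_acquire_lock(path=LOCK_PATH):
    """Atomically create the lock file; fail if another run already holds it."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.write(fd, str(os.getpid()).encode())
        os.close(fd)
        return True
    except FileExistsError:
        return False


def release_lock(path=LOCK_PATH):
    """Remove the lock file; safe to call even if it was never acquired."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass
```

`O_CREAT | O_EXCL` makes creation atomic at the filesystem level, so two overlapping Task Scheduler runs cannot both acquire the lock.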
---
**Fix 5: Replace Recursive Filesystem Scan**
Current (SLOW):
```python
for file in check_dir.rglob("*"):  # Scans entire tree!
    if file.is_file():
        if file.stat().st_mtime > two_minutes_ago:
            return True
```
Optimized (FAST):
```python
# Only check known active directories
active_paths = [
    PROJECT_ROOT / ".claude" / ".periodic-save-state.json",
    PROJECT_ROOT / "api" / "__pycache__",  # Any .pyc changes
    # ... specific files
]
for path in active_paths:
    if path.exists() and path.stat().st_mtime > two_minutes_ago:
        return True
```
**File:** `periodic_save_check.py` (lines 117-130)
**Estimated effort:** 1 hour
**Impact:** 90% faster execution, prevents hangs
---
### Tertiary Fixes (Priority 3)
**Fix 6: Add Process Health Monitoring**
Add to `periodic_save_check.py`:
```python
def monitor_process_health():
    """Alert if too many processes are running."""
    result = subprocess.run(
        ["tasklist", "/FI", "IMAGENAME eq python.exe"],
        capture_output=True, text=True, timeout=5
    )
    count = result.stdout.count("python.exe")
    if count > 10:
        log(f"[WARNING] High process count: {count}")
    if count > 20:
        log(f"[CRITICAL] Excessive processes: {count} - triggering cleanup")
        cleanup_zombies()
```
**Estimated effort:** 1 hour
**Impact:** Early detection and auto-cleanup
---
## COMPARISON: All Agent Solutions
| Aspect | Code Pattern Agent | Investigation Agent | Solution Agent |
|--------|-------------------|---------------------|----------------|
| **Primary Fix** | Fix daemon Popen() | Remove bash backgrounds | Layered defense |
| **Timeouts** | Add to all subprocess | Add to subprocess.run | Add with context managers |
| **Cleanup** | Use finally blocks | Add cleanup handlers | atexit + signal handlers |
| **Monitoring** | Not mentioned | Suggested | Detailed proposal |
| **Complexity** | Simple fixes | Medium complexity | Comprehensive (4 weeks) |
---
## FINAL RECOMMENDATION (My Decision)
After reviewing all 3 agent reports, I recommend:
### Phase 1: Quick Wins (This Session - 2 hours)
1. **Add timeouts** to all subprocess.run() calls (30 min)
2. **Make sync-contexts synchronous** (remove &) (1 hour)
3. **Add mutex lock** to periodic_save_check.py (30 min)
**Impact:** Eliminates 80% of zombie accumulation
---
### Phase 2: Structural Fixes (This Week - 4 hours)
4. **Fix daemon spawning** with Job Objects (3 hours)
5. **Optimize filesystem scan** (1 hour)
**Impact:** Eliminates remaining 20% + prevents future issues
---
### Phase 3: Monitoring (Next Sprint - 2 hours)
6. **Add process health monitoring** (1 hour)
7. **Add cleanup_zombies.py script** (1 hour)
**Impact:** Early detection and auto-recovery
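The `cleanup_zombies.py` script in step 7 does not exist yet; one possible shape for it, using only `tasklist`/`taskkill` (Windows only; the image name and helper names are assumptions for this sketch):

```python
import subprocess


def find_stale_pids(tasklist_csv, image="pythonw.exe"):
    """Parse `tasklist /FO CSV /NH` output and return PIDs for the given image."""
    pids = []
    for line in tasklist_csv.splitlines():
        parts = [p.strip('"') for p in line.split('","')]
        if parts and parts[0].lower() == image:
            pids.append(int(parts[1]))
    return pids


def cleanup_zombies(image="pythonw.exe"):
    """Force-kill every process with the given image name.

    Targeting pythonw.exe (not python.exe) avoids killing the
    interactive interpreter this script may be running under.
    """
    result = subprocess.run(
        ["tasklist", "/FI", f"IMAGENAME eq {image}", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, timeout=5,
    )
    for pid in find_stale_pids(result.stdout, image):
        subprocess.run(["taskkill", "/PID", str(pid), "/F"],
                       capture_output=True, timeout=5)
```

Keeping the parsing separate from the killing makes the risky half easy to dry-run: log `find_stale_pids(...)` first, enable `taskkill` only once the counts look right.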
---
## ESTIMATED TOTAL IMPACT
### Before Fixes (Current State)
- **4-hour session:** 50-300 zombie processes
- **Memory:** 500 MB - 7 GB consumed
- **Manual cleanup:** Required every 2-4 hours
### After Phase 1 Fixes (Quick Wins)
- **4-hour session:** 5-20 zombie processes
- **Memory:** 50-200 MB consumed
- **Manual cleanup:** Required every 8+ hours
### After Phase 2 Fixes (Structural)
- **4-hour session:** 0-2 zombie processes
- **Memory:** 0-20 MB consumed
- **Manual cleanup:** Rarely/never needed
### After Phase 3 Fixes (Monitoring)
- **Auto-detection:** Yes
- **Auto-recovery:** Yes
- **User intervention:** None required
---
## WAITING FOR REMAINING AGENTS
**Bash Lifecycle Agent:** Expected to provide detailed bash→git→conhost process tree analysis
**SSH Agent:** Expected to explain 5 SSH processes (may be unrelated to ClaudeTools)
Will update this document when remaining agents complete.
---
**Status:** Ready for user decision
**Recommendation:** Proceed with Phase 1 fixes immediately (2 hours)
**Next:** Present options to user for approval