Zombie Process Investigation - Preliminary Findings

Date: 2026-01-17
Issue: Zombie processes accumulate during long dev sessions and run the machine out of memory


Reported Symptoms

User reports these specific zombie processes:

  1. Multiple "Git for Windows" processes
  2. Multiple "Console Window Host" (conhost.exe) processes
  3. Many bash instances
  4. 5 SSH processes
  5. 1 ssh-agent process

Initial Investigation Findings

SMOKING GUN: periodic_save_check.py

File: .claude/hooks/periodic_save_check.py
Frequency: Runs EVERY 1 MINUTE via Task Scheduler
Problem: Spawns subprocess without timeout

Subprocess Calls (per execution):

# Line 70-76: Git config check (NO TIMEOUT)
subprocess.run(
    ["git", "config", "--local", "claude.projectid"],
    capture_output=True,
    text=True,
    check=False,
    cwd=PROJECT_ROOT,
)

# Line 81-87: Git remote URL check (NO TIMEOUT)
subprocess.run(
    ["git", "config", "--get", "remote.origin.url"],
    capture_output=True,
    text=True,
    check=False,
    cwd=PROJECT_ROOT,
)

# Line 102-107: Process check (NO TIMEOUT)
subprocess.run(
    ["tasklist.exe"],
    capture_output=True,
    text=True,
    check=False,
)

Impact Analysis:

  • Runs: 60 times/hour, 1,440 times/day
  • Each run spawns: 3 subprocess calls
  • Total spawns: 180/hour, 4,320/day
  • If 1% hang: 1.8 zombies/hour, 43 zombies/day
  • If 5% hang: 9 zombies/hour, 216 zombies/day
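
The figures above follow directly from the schedule and the number of calls per run. A minimal sketch of the arithmetic, assuming one hung call counts as one leaked process (a hung git call may in fact hold a whole process tree, so these are lower bounds); `leaked_processes` is a hypothetical helper name, not part of the hook:

```python
RUNS_PER_HOUR = 60   # one run per minute, per the Task Scheduler frequency
CALLS_PER_RUN = 3    # subprocess.run() calls per execution


def leaked_processes(hang_rate: float, hours: float = 24.0) -> float:
    """Expected number of hung subprocess calls over `hours` for a given hang fraction."""
    return RUNS_PER_HOUR * CALLS_PER_RUN * hours * hang_rate


for rate in (0.01, 0.05):
    per_hour = leaked_processes(rate, hours=1)
    per_day = leaked_processes(rate)
    print(f"{rate:.0%} hang rate: {per_hour:.1f}/hour, {per_day:.1f}/day")
```

The hang rates themselves are illustrative; nothing measured so far says which one is closer to reality.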

Process Tree (Windows):

periodic_save_check.py (python.exe)
  └─> git.exe (Git for Windows)
       └─> bash.exe (for git internals)
            └─> conhost.exe (Console Window Host)

Each git command can spawn this entire tree, and two of the three subprocess calls per run are git commands.


Why Git/Bash/Conhost Zombies?

Git for Windows Architecture

Git for Windows is built on MSYS2 (a Cygwin-derived runtime), and a git invocation can spawn:

  1. git.exe - Main Git binary
  2. bash.exe - Shell for git hooks/internals
  3. conhost.exe - Console host for each shell

Normal Lifecycle

subprocess.run(["git", ...])
  → spawn git.exe
  → git spawns bash.exe
  → bash spawns conhost.exe
  → command completes
  → all processes terminate

Problem Scenarios

Scenario 1: Git Hangs (No Timeout)

  • Git operation waits indefinitely
  • Subprocess never returns
  • Processes accumulate

Scenario 2: Orphaned Processes

  • Parent (python) terminates before children
  • bash.exe and conhost.exe orphaned
  • Windows does not terminate child processes when their parent exits

Scenario 3: Rapid Spawning

  • Running every 60 seconds
  • Each run makes 3 subprocess calls, and each git call can spawn its own process tree
  • Cleanup slower than spawning
  • Processes accumulate

SSH Process Mystery

Question: Why 5 SSH processes if remote is HTTPS?

Remote URL Check:

git config --get remote.origin.url
# Result: https://git.azcomputerguru.com/azcomputerguru/claudetools.git

Hypotheses:

  1. Credential Helper: The credential helper configured for HTTPS remotes may itself invoke SSH
  2. SSH Agent: ssh-agent running for other purposes (GitHub, other repos)
  3. Git Hooks: Pre-commit/post-commit hooks might use SSH
  4. Background Fetches: Git background maintenance tasks
  5. Multiple Repos: Other repos on system using SSH

Action: Agents investigating this further


Agents Currently Investigating

  1. Process Investigation Agent (a381b9a): Root cause analysis
  2. Solution Design Agent (a8dbf87): Proposing solutions
  3. Code Pattern Review Agent (a06900a): Reviewing subprocess patterns
  4. Bash Process Lifecycle Agent (a0da635): Bash/git/conhost lifecycle (IN PROGRESS)
  5. SSH/Network Connection Agent (a6a748f): SSH connection analysis (IN PROGRESS)

Immediate Observations

Confirmed Issues

  1. [HIGH] No Timeout on Subprocess Calls

    • periodic_save_check.py: 3 calls without timeout
    • If git hangs, process never terminates
    • Fix: Add timeout=5 to all subprocess.run() calls
  2. [HIGH] High Frequency Execution

    • Every 1 minute = 1,440 executions/day
    • Each spawns 3+ processes
    • Cleanup lag accumulates zombies
  3. [MEDIUM] No Error Handling

    • No try/finally for cleanup
    • If an exception occurs, processes may not clean up
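
Issues 1 and 3 can be addressed together by routing every call through one small wrapper. The following is a sketch under assumptions, not the hook's actual code: `run_with_timeout` is a hypothetical helper name, and the 5-second default mirrors the suggested fix rather than any measured value.

```python
import subprocess
from typing import Optional, Sequence


def run_with_timeout(
    cmd: Sequence[str],
    timeout: float = 5.0,
    cwd: Optional[str] = None,
) -> Optional[str]:
    """Run cmd and return its stripped stdout, or None on timeout, launch failure, or non-zero exit."""
    try:
        result = subprocess.run(
            list(cmd),
            capture_output=True,
            text=True,
            check=False,
            cwd=cwd,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the direct child before raising.
        # Grandchildren (for example a bash.exe started by git) are not covered here.
        return None
    except OSError:
        # The executable could not be started (for example, not on PATH).
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()
```

A call such as `run_with_timeout(["git", "config", "--local", "claude.projectid"], cwd=PROJECT_ROOT)` would then replace the bare `subprocess.run()`. One caveat: on Windows, after the timeout `subprocess.run()` still collects output from the killed child, which can block if a grandchild keeps the output pipe open, so this wrapper alone may not be sufficient if the process-tree suspicion proves correct.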

Suspected Issues

  1. [MEDIUM] Git for Windows Process Tree

    • Each git call spawns bash + conhost
    • Windows may not clean up tree properly
    • Need process group cleanup
  2. [LOW] SSH Processes

    • 5 SSH + 1 ssh-agent
    • Not directly related to HTTPS git URL
    • May be separate issue (background git operations?)
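
If the process-tree suspicion holds, a timeout has to take down the children as well as the direct child. A rough sketch of one way to do that, assuming the stock `taskkill` utility is available on Windows (its `/T` switch terminates the process and any children it started, `/F` forces termination); `run_and_kill_tree` is a hypothetical name, and the POSIX branch exists only so the sketch also runs elsewhere:

```python
import os
import signal
import subprocess
import sys
from typing import Optional, Sequence, Tuple


def run_and_kill_tree(cmd: Sequence[str], timeout: float = 5.0) -> Tuple[Optional[int], str]:
    """Run cmd; on timeout, terminate it and its descendants. Returns (returncode, stdout)."""
    popen_kwargs = {}
    if sys.platform != "win32":
        # New session so the child and its descendants share one process group.
        popen_kwargs["start_new_session"] = True
    proc = subprocess.Popen(
        list(cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
        **popen_kwargs,
    )
    try:
        out, _ = proc.communicate(timeout=timeout)
        return proc.returncode, out
    except subprocess.TimeoutExpired:
        if sys.platform == "win32":
            # Kill the process and every child it started.
            subprocess.run(
                ["taskkill", "/PID", str(proc.pid), "/T", "/F"],
                capture_output=True,
                check=False,
                timeout=10,
            )
        else:
            try:
                os.killpg(proc.pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # the group exited on its own in the meantime
        proc.communicate()  # reap the child and close the pipe
        return None, ""
```

A Windows Job Object with kill-on-close semantics is a sturdier mechanism, but it needs the Win32 API and is not shown here.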

Immediate Fixes (High Priority)

  1. Add Timeouts to All Subprocess Calls

    subprocess.run(
        ["git", "config", "--local", "claude.projectid"],
        capture_output=True,
        text=True,
        check=False,
        cwd=PROJECT_ROOT,
        timeout=5,  # ADD THIS
    )
    
  2. Reduce Execution Frequency

    • Change from every 1 minute to every 5 minutes
    • 80% reduction in process spawns
    • Still frequent enough for context saving
  3. Cache Git Config Results

    • Project ID doesn't change frequently
    • Cache for 5-10 minutes
    • Reduce git calls by 80-90%
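
Because Task Scheduler starts a fresh Python process on every run, an in-memory cache would not survive between runs; the cached value has to live on disk. A minimal sketch, assuming a small JSON file is acceptable for this purpose (`cached_value`, the file layout, and the 300-second default are all hypothetical):

```python
import json
import time
from pathlib import Path
from typing import Callable, Optional


def cached_value(
    cache_file: Path,
    key: str,
    compute: Callable[[], Optional[str]],
    ttl_seconds: float = 300.0,
    now: Optional[float] = None,
) -> Optional[str]:
    """Return the cached value for key if it is younger than ttl_seconds, else recompute and store it."""
    now = time.time() if now is None else now
    try:
        cache = json.loads(cache_file.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        cache = {}  # missing or corrupt cache file: start over
    if not isinstance(cache, dict):
        cache = {}

    entry = cache.get(key)
    if isinstance(entry, dict) and now - entry.get("saved_at", 0) < ttl_seconds:
        return entry.get("value")

    value = compute()  # only here is git actually invoked
    if value is not None:
        cache[key] = {"value": value, "saved_at": now}
        try:
            cache_file.write_text(json.dumps(cache), encoding="utf-8")
        except OSError:
            pass  # a failed write only costs an extra git call next run
    return value
```

With a 1-minute schedule and a 5-minute TTL, only one run in five calls git, which is where the 80% figure comes from; a 10-minute TTL gives 90%.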

Secondary Fixes (Medium Priority)

  1. Process Group Cleanup

    • Use process groups on Windows
    • Ensure child processes terminate with parent
  2. Monitor and Alert

    • Track running process count
    • Alert if exceeds threshold
    • Auto-cleanup under memory pressure
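
Monitoring could start as a simple count of the reported process names. The sketch below parses the CSV that `tasklist /FO CSV /NH` prints (image name in the first column); the watched names come from the reported symptoms, while `count_watched`, `over_threshold`, and the threshold of 20 are assumptions for illustration only:

```python
import csv
import io
from collections import Counter
from typing import Dict, Iterable

WATCHED = ("git.exe", "bash.exe", "conhost.exe", "ssh.exe", "ssh-agent.exe")


def count_watched(tasklist_csv: str, watched: Iterable[str] = WATCHED) -> Counter:
    """Count rows of `tasklist /FO CSV /NH` output whose image name is in `watched`."""
    names = {name.lower() for name in watched}
    counts: Counter = Counter()
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if row and row[0].lower() in names:
            counts[row[0].lower()] += 1
    return counts


def over_threshold(counts: Counter, threshold: int = 20) -> Dict[str, int]:
    """Return only the image names whose count exceeds the threshold."""
    return {name: n for name, n in counts.items() if n > threshold}
```

The input string would come from a call to `tasklist` that is itself run with a timeout, so the monitor does not add to the problem it is watching.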

Pending Agent Analysis

Waiting for comprehensive reports from:

  • Bash Process Lifecycle Agent (analyzing bash/git lifecycle)
  • SSH/Network Connection Agent (analyzing SSH zombies)
  • Solution Design Agent (proposing comprehensive solution)
  • Code Pattern Review Agent (finding all subprocess usage)

Next Steps

  1. Wait for all agent reports to complete
  2. Coordinate findings across all agents
  3. Synthesize comprehensive solution
  4. Present options to user for final decision
  5. Implement chosen solution
  6. Test and verify fix

Status: Investigation in progress
Preliminary Confidence: HIGH that periodic_save_check.py is primary culprit
ETA: Waiting for agent reports (est. 5-10 minutes)