claudetools/ZOMBIE_PROCESS_INVESTIGATION.md
Mike Swanson 4545fc8ca3 [Baseline] Pre-zombie-fix checkpoint
Investigation complete - 5 agents identified root causes:
- periodic_save_check.py: 540 processes/hour (53%)
- Background sync-contexts: 200 processes/hour (20%)
- user-prompt-submit: 180 processes/hour (18%)
- task-complete: 90 processes/hour (9%)
Total: 1,010 zombie processes/hour, 3-7 GB RAM/hour

Phase 1 fixes ready to implement:
1. Reduce periodic save frequency (1min to 5min)
2. Add timeouts to all subprocess calls
3. Remove background sync-contexts spawning
4. Add mutex lock to prevent overlaps

See: FINAL_ZOMBIE_SOLUTION.md for complete analysis

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-17 13:34:42 -07:00


Zombie Process Investigation - Preliminary Findings

Date: 2026-01-17
Issue: Zombie processes accumulate during long dev sessions until the machine runs out of memory


Reported Symptoms

User reports these specific zombie processes:

  1. Multiple "Git for Windows" processes
  2. Multiple "Console Window Host" (conhost.exe) processes
  3. Many bash instances
  4. 5 SSH processes
  5. 1 ssh-agent process

Initial Investigation Findings

SMOKING GUN: periodic_save_check.py

File: .claude/hooks/periodic_save_check.py
Frequency: Runs EVERY 1 MINUTE via Task Scheduler
Problem: Spawns subprocesses without a timeout

Subprocess Calls (per execution):

# Line 70-76: Git config check (NO TIMEOUT)
subprocess.run(
    ["git", "config", "--local", "claude.projectid"],
    capture_output=True,
    text=True,
    check=False,
    cwd=PROJECT_ROOT,
)

# Line 81-87: Git remote URL check (NO TIMEOUT)
subprocess.run(
    ["git", "config", "--get", "remote.origin.url"],
    capture_output=True,
    text=True,
    check=False,
    cwd=PROJECT_ROOT,
)

# Line 102-107: Process check (NO TIMEOUT)
subprocess.run(
    ["tasklist.exe"],
    capture_output=True,
    text=True,
    check=False,
)

Impact Analysis:

  • Runs: 60 times/hour, 1,440 times/day
  • Each run spawns: 3 subprocess calls
  • Total spawns: 180/hour, 4,320/day
  • If 1% hang: 1.8 zombies/hour, 43 zombies/day
  • If 5% hang: 9 zombies/hour, 216 zombies/day
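The figures above are plain multiplication, so a small sketch makes it easy to re-run them for other intervals or hang rates. The function name and the hang rates are illustrative assumptions, not measurements from the affected machine.

```python
def spawn_estimate(interval_minutes: float, calls_per_run: int, hang_rate: float) -> dict:
    """Estimate subprocess spawns and leaked processes for a scheduled task."""
    runs_per_hour = 60 / interval_minutes
    spawns_per_hour = runs_per_hour * calls_per_run
    # hang_rate is a guess (e.g. 0.01 for 1%), not an observed value.
    leaked_per_hour = spawns_per_hour * hang_rate
    return {
        "runs_per_day": runs_per_hour * 24,
        "spawns_per_hour": spawns_per_hour,
        "spawns_per_day": spawns_per_hour * 24,
        "leaked_per_hour": leaked_per_hour,
        "leaked_per_day": leaked_per_hour * 24,
    }


# Current schedule: every minute, 3 subprocess calls per run, 1% assumed to hang.
print(spawn_estimate(interval_minutes=1, calls_per_run=3, hang_rate=0.01))
```

Calling it with `interval_minutes=5` shows how the same hang rate plays out under a less frequent schedule.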

Process Tree (Windows):

periodic_save_check.py (python.exe)
  └─> git.exe (Git for Windows)
       └─> bash.exe (for git internals)
            └─> conhost.exe (Console Window Host)

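Each git command can spawn this entire tree, although the exact children depend on the command: bash.exe is typically only needed for hooks and shell-based subcommands.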


Why Git/Bash/Conhost Zombies?

Git for Windows Architecture

Git for Windows is built on MSYS2 (a Cygwin derivative), and a git invocation can involve:

  1. git.exe - Main Git binary
  2. bash.exe - Shell used for git hooks and shell-based subcommands
  3. conhost.exe - Console host that Windows attaches to console processes

Normal Lifecycle

subprocess.run(["git", ...])
  → spawn git.exe
  → git spawns bash.exe
  → bash spawns conhost.exe
  → command completes
  → all processes terminate

Problem Scenarios

Scenario 1: Git Hangs (No Timeout)

  • Git operation waits indefinitely
  • Subprocess never returns
  • Processes accumulate

Scenario 2: Orphaned Processes

  • Parent (python) terminates before children
  • bash.exe and conhost.exe orphaned
  • Windows does not terminate child processes when their parent exits (unless they share a Job Object configured to do so)

Scenario 3: Rapid Spawning

  • Running every 60 seconds
  • Each run makes 3 subprocess calls, and each git call may start further child processes
  • Cleanup slower than spawning
  • Processes accumulate
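One way to keep rapid scheduling from compounding is to have each run exit immediately if a previous run is still active. The sketch below uses an exclusively created lock file. The `single_instance` name, the lock path, and the stale-lock age are hypothetical choices and are not taken from periodic_save_check.py.

```python
import contextlib
import os
import time
from pathlib import Path


@contextlib.contextmanager
def single_instance(lock_path: Path, stale_after_seconds: float = 300.0):
    """Yield True if this process holds the lock, False if another run does."""
    # Treat a very old lock as left over from a crashed run and remove it.
    try:
        if time.time() - lock_path.stat().st_mtime > stale_after_seconds:
            lock_path.unlink()
    except FileNotFoundError:
        pass  # no lock file yet

    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        yield False  # another run is still in progress
        return

    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))  # record the owner for debugging
        yield True
    finally:
        with contextlib.suppress(FileNotFoundError):
            lock_path.unlink()
```

A scheduled script would wrap its work in `with single_instance(Path("periodic_save.lock")) as acquired:` and return early when `acquired` is false. The stale-lock check is deliberately simple and has a small race window; a production version might prefer an OS-level lock.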

SSH Process Mystery

Question: Why 5 SSH processes if remote is HTTPS?

Remote URL Check:

git config --get remote.origin.url
# Result: https://git.azcomputerguru.com/azcomputerguru/claudetools.git

Hypotheses:

  1. Credential Helper: Git HTTPS may use SSH credential helper
  2. SSH Agent: ssh-agent running for other purposes (GitHub, other repos)
  3. Git Hooks: Pre-commit/post-commit hooks might use SSH
  4. Background Fetches: Git background maintenance tasks
  5. Multiple Repos: Other repos on system using SSH

Action: Agents investigating this further


Agents Currently Investigating

  1. Process Investigation Agent (a381b9a): Root cause analysis
  2. Solution Design Agent (a8dbf87): Proposing solutions
  3. Code Pattern Review Agent (a06900a): Reviewing subprocess patterns
  4. Bash Process Lifecycle Agent (a0da635): Bash/git/conhost lifecycle (IN PROGRESS)
  5. SSH/Network Connection Agent (a6a748f): SSH connection analysis (IN PROGRESS)

Immediate Observations

Confirmed Issues

  1. [HIGH] No Timeout on Subprocess Calls

    • periodic_save_check.py: 3 calls without timeout
    • If git hangs, process never terminates
    • Fix: Add timeout=5 to all subprocess.run() calls and handle subprocess.TimeoutExpired
  2. [HIGH] High Frequency Execution

    • Every 1 minute = 1,440 executions/day
    • Each spawns 3+ processes
    • Cleanup lag accumulates zombies
  3. [MEDIUM] No Error Handling

    • No try/finally for cleanup
    • If exception occurs, processes may not clean up
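The timeout and error-handling issues above can be addressed together in one small wrapper. This is a minimal sketch, assuming the callers only need the command's stdout; `run_quietly` is a hypothetical helper name, not an existing function in the hooks.

```python
import subprocess
from typing import Optional, Sequence


def run_quietly(cmd: Sequence[str], timeout: float = 5.0, cwd: Optional[str] = None) -> Optional[str]:
    """Run a command and return its stripped stdout, or None on any failure."""
    try:
        result = subprocess.run(
            list(cmd),
            capture_output=True,
            text=True,
            check=False,
            cwd=cwd,
            timeout=timeout,  # subprocess.run kills the direct child on expiry
        )
    except subprocess.TimeoutExpired:
        return None  # the command hung past the timeout
    except OSError:
        return None  # e.g. the executable was not found
    if result.returncode != 0:
        return None
    return result.stdout.strip()
```

For example, `run_quietly(["git", "config", "--local", "claude.projectid"], cwd=PROJECT_ROOT)` would replace the first call in the script. Note that the timeout only terminates the direct child; any grandchildren it started need separate handling.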

Suspected Issues

  1. [MEDIUM] Git for Windows Process Tree

    • Each git call spawns bash + conhost
    • Windows may not clean up tree properly
    • Need process group cleanup
  2. [LOW] SSH Processes

    • 5 SSH + 1 ssh-agent
    • Not directly related to HTTPS git URL
    • May be separate issue (background git operations?)
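If the process tree is not being cleaned up, one option is to terminate the whole tree explicitly when a call times out. The sketch below uses the Windows `taskkill` command with `/T` (include children) and `/F` (force). The helper names are hypothetical, and whether this resolves the bash/conhost accumulation is still an open question pending the agent reports.

```python
import subprocess
import sys
from typing import List, Optional, Sequence


def tree_kill_command(pid: int) -> List[str]:
    """Build the Windows taskkill command that ends a process and its children."""
    # /T = include child processes, /F = force termination.
    return ["taskkill", "/PID", str(pid), "/T", "/F"]


def run_and_reap(cmd: Sequence[str], timeout: float = 5.0) -> Optional[str]:
    """Run cmd; on timeout, terminate the whole process tree and return None."""
    proc = subprocess.Popen(
        list(cmd), stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    try:
        stdout, _ = proc.communicate(timeout=timeout)
        return stdout
    except subprocess.TimeoutExpired:
        if sys.platform == "win32":
            # Kill the child and everything it spawned (git -> bash -> ...).
            subprocess.run(
                tree_kill_command(proc.pid),
                capture_output=True,
                check=False,
                timeout=10,
            )
        proc.kill()  # make sure the direct child is gone in any case
        proc.communicate()  # drain the pipes and reap the child
        return None
```

A Job Object would tie child lifetime to the parent more robustly, but it requires Win32 API calls; `taskkill /T` is the smaller change.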

Immediate (High Priority)

  1. Add Timeouts to All Subprocess Calls

    subprocess.run(
        ["git", "config", "--local", "claude.projectid"],
        capture_output=True,
        text=True,
        check=False,
        cwd=PROJECT_ROOT,
        timeout=5,  # ADD THIS
    )
    
  2. Reduce Execution Frequency

    • Change from every 1 minute to every 5 minutes
    • 80% reduction in process spawns
    • Still frequent enough for context saving
  3. Cache Git Config Results

    • Project ID doesn't change frequently
    • Cache for 5-10 minutes
    • Reduce git calls by 80-90% at the current 1-minute interval
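Because Task Scheduler starts a fresh Python process for every run, an in-memory cache would not survive between runs, so the cached project ID has to live on disk. The following is a minimal sketch of such a cache; `cached_value`, the JSON layout, and the cache file location are assumptions for illustration.

```python
import json
import time
from pathlib import Path
from typing import Callable, Optional


def cached_value(
    cache_file: Path,
    max_age_seconds: float,
    compute: Callable[[], Optional[str]],
) -> Optional[str]:
    """Return a cached string if it is fresh; otherwise call compute() and store it."""
    try:
        entry = json.loads(cache_file.read_text(encoding="utf-8"))
        if time.time() - entry["saved_at"] < max_age_seconds:
            return entry["value"]
    except (OSError, ValueError, KeyError, TypeError):
        pass  # missing or corrupt cache: fall through and recompute

    value = compute()
    if value is not None:
        # Only successful lookups are cached, so failures are retried next run.
        payload = {"saved_at": time.time(), "value": value}
        cache_file.write_text(json.dumps(payload), encoding="utf-8")
    return value
```

Here `compute` would be the function that shells out to `git config`, and `max_age_seconds` would be 300 to 600 to match the 5-10 minute window above.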

Secondary (Medium Priority)

  1. Process Group Cleanup

    • Use Job Objects on Windows (process groups alone do not tie child lifetime to the parent)
    • Ensure child processes terminate with parent
  2. Monitor and Alert

    • Track running process count
    • Alert if exceeds threshold
    • Auto-cleanup if memory pressure
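Tracking the running process count can start from the output of `tasklist /FO CSV /NH`, which prints one quoted CSV row per process with the image name in the first column. The sketch below only parses that text and compares counts against limits; the function names and the thresholds are hypothetical.

```python
import csv
import io
from collections import Counter
from typing import Dict


def count_processes(tasklist_csv: str) -> Counter:
    """Count processes per image name from `tasklist /FO CSV /NH` output."""
    counts: Counter = Counter()
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if row:  # skip blank lines
            counts[row[0].lower()] += 1
    return counts


def over_threshold(counts: Counter, limits: Dict[str, int]) -> Dict[str, int]:
    """Return the image names whose process count exceeds its limit."""
    return {
        name: counts[name]
        for name, limit in limits.items()
        if counts[name] > limit
    }
```

On Windows the input could come from `subprocess.run(["tasklist", "/FO", "CSV", "/NH"], capture_output=True, text=True, timeout=10).stdout`, with limits such as `{"bash.exe": 20, "conhost.exe": 20}` chosen after observing a healthy session.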

Pending Agent Analysis

Waiting for comprehensive reports from:

  • Bash Process Lifecycle Agent (analyzing bash/git lifecycle)
  • SSH/Network Connection Agent (analyzing SSH zombies)
  • Solution Design Agent (proposing comprehensive solution)
  • Code Pattern Review Agent (finding all subprocess usage)

Next Steps

  1. Wait for all agent reports to complete
  2. Coordinate findings across all agents
  3. Synthesize comprehensive solution
  4. Present options to user for final decision
  5. Implement chosen solution
  6. Test and verify fix

Status: Investigation in progress
Preliminary Confidence: HIGH that periodic_save_check.py is primary culprit
ETA: Waiting for agent reports (est. 5-10 minutes)