[Baseline] Pre-zombie-fix checkpoint
Investigation complete - 5 agents identified root causes:
- periodic_save_check.py: 540 processes/hour (53%)
- Background sync-contexts: 200 processes/hour (20%)
- user-prompt-submit: 180 processes/hour (18%)
- task-complete: 90 processes/hour (9%)

Total: 1,010 zombie processes/hour, 3-7 GB RAM/hour

Phase 1 fixes ready to implement:
1. Reduce periodic save frequency (1min to 5min)
2. Add timeouts to all subprocess calls
3. Remove background sync-contexts spawning
4. Add mutex lock to prevent overlaps

See: FINAL_ZOMBIE_SOLUTION.md for complete analysis

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
ZOMBIE_PROCESS_INVESTIGATION.md
Normal file
@@ -0,0 +1,239 @@
# Zombie Process Investigation - Preliminary Findings

**Date:** 2026-01-17
**Issue:** Zombie processes accumulate during long dev sessions until the machine runs out of memory

---

## Reported Symptoms

User reports these specific zombie processes:
1. Multiple "Git for Windows" processes
2. Multiple "Console Window Host" (conhost.exe) processes
3. Many bash instances
4. 5 SSH processes
5. 1 ssh-agent process

---

## Initial Investigation Findings

### SMOKING GUN: periodic_save_check.py

**File:** `.claude/hooks/periodic_save_check.py`
**Frequency:** Runs EVERY 1 MINUTE via Task Scheduler
**Problem:** Spawns subprocesses without a timeout

**Subprocess Calls (per execution):**

```python
# Line 70-76: Git config check (NO TIMEOUT)
subprocess.run(
    ["git", "config", "--local", "claude.projectid"],
    capture_output=True,
    text=True,
    check=False,
    cwd=PROJECT_ROOT,
)

# Line 81-87: Git remote URL check (NO TIMEOUT)
subprocess.run(
    ["git", "config", "--get", "remote.origin.url"],
    capture_output=True,
    text=True,
    check=False,
    cwd=PROJECT_ROOT,
)

# Line 102-107: Process check (NO TIMEOUT)
subprocess.run(
    ["tasklist.exe"],
    capture_output=True,
    text=True,
    check=False,
)
```

**Impact Analysis:**
- Runs: 60 times/hour, 1,440 times/day
- Each run spawns: 3 subprocess calls
- Total spawns: 180/hour (60 × 3), 4,320/day
- If 1% hang: 1.8 zombies/hour, ~43 zombies/day
- If 5% hang: 9 zombies/hour, 216 zombies/day
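
The arithmetic above can be reproduced with a short sketch. The function name `estimate_zombies` and its parameters are illustrative and not part of the project; the 1% and 5% hang rates are the hypothetical figures from the list, not measurements.

```python
def estimate_zombies(runs_per_hour: int, calls_per_run: int, hang_rate: float) -> dict:
    """Estimate spawned and hung subprocesses (hypothetical helper)."""
    spawns_per_hour = runs_per_hour * calls_per_run
    spawns_per_day = spawns_per_hour * 24
    return {
        "spawns_per_hour": spawns_per_hour,
        "spawns_per_day": spawns_per_day,
        "zombies_per_hour": spawns_per_hour * hang_rate,
        "zombies_per_day": spawns_per_day * hang_rate,
    }


if __name__ == "__main__":
    # Assumed hang rates from the analysis above: 1% and 5%
    for rate in (0.01, 0.05):
        print(rate, estimate_zombies(runs_per_hour=60, calls_per_run=3, hang_rate=rate))
```

Changing `runs_per_hour` from 60 to 12 models the proposed 5-minute interval.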

**Process Tree (Windows):**
```
periodic_save_check.py (python.exe)
└─> git.exe (Git for Windows)
    └─> bash.exe (for git internals)
        └─> conhost.exe (Console Window Host)
```

Each git command spawns this entire tree!

---

## Why Git/Bash/Conhost Zombies?

### Git for Windows Architecture
Git for Windows is built on MSYS2 (derived from Cygwin), and a git call can spawn:
1. `git.exe` - Main Git binary
2. `bash.exe` - Shell for git hooks/internals
3. `conhost.exe` - Console host (started by Windows for each console session)

### Normal Lifecycle
```
subprocess.run(["git", ...])
  → spawn git.exe
  → git spawns bash.exe
  → bash spawns conhost.exe
  → command completes
  → all processes terminate
```

### Problem Scenarios

**Scenario 1: Git Hangs (No Timeout)**
- Git operation waits indefinitely
- `subprocess.run()` never returns
- Processes accumulate

**Scenario 2: Orphaned Processes**
- Parent (python) terminates before children
- bash.exe and conhost.exe are orphaned
- Windows does not terminate child processes when their parent exits

**Scenario 3: Rapid Spawning**
- Running every 60 seconds
- Each git call spawns 3 processes
- Cleanup is slower than spawning
- Processes accumulate
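
The commit summary lists a mutex lock as one of the Phase 1 fixes for overlapping runs. A minimal sketch of that idea, assuming a lock file is acceptable, is shown below. The name `single_instance` and the lock path are hypothetical, and the sketch does not handle a stale lock left behind by a crashed run, which a real implementation would need to detect (for example, by checking the PID stored in the file).

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path: str):
    """Yield True if this process acquired the lock, False if another run holds it."""
    try:
        # O_CREAT | O_EXCL makes creation atomic: it fails if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        yield False
        return
    try:
        # Record the owner's PID so a stale lock can be diagnosed later.
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))
        yield True
    finally:
        os.remove(lock_path)


if __name__ == "__main__":
    # Hypothetical lock location; the real script may prefer a project-local path.
    lock = os.path.join(tempfile.gettempdir(), "periodic_save_check.lock")
    with single_instance(lock) as acquired:
        if acquired:
            print("lock acquired, running check")
        else:
            print("previous run still active, skipping")
```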

---

## SSH Process Mystery

**Question:** Why are there 5 SSH processes if the remote is HTTPS?

**Remote URL Check:**
```bash
git config --get remote.origin.url
# Result: https://git.azcomputerguru.com/azcomputerguru/claudetools.git
```

**Hypotheses:**
1. **Credential Helper:** The credential helper used for Git over HTTPS may invoke SSH (unverified)
2. **SSH Agent:** ssh-agent running for other purposes (GitHub, other repos)
3. **Git Hooks:** Pre-commit/post-commit hooks might use SSH
4. **Background Fetches:** Git background maintenance tasks
5. **Multiple Repos:** Other repos on the system using SSH
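
One way to test the "Multiple Repos" hypothesis is to classify each repository's remote URL by transport. The sketch below is an assumption about how that check could look, not something taken from the project; `remote_transport` is a hypothetical helper, and the `git@github.com:...` URL in the usage lines is an illustrative example only. It works on URL strings and does not call git.

```python
from urllib.parse import urlparse


def remote_transport(url: str) -> str:
    """Return 'ssh', 'https', 'http', or 'other' for a Git remote URL (hypothetical helper)."""
    scheme = urlparse(url).scheme.lower()
    if scheme in ("http", "https", "ssh"):
        return scheme
    # scp-like syntax (user@host:path) has no "://" and implies SSH.
    # Bare "host:path" without a user is not detected by this sketch.
    if "://" not in url and ":" in url and "@" in url.split(":", 1)[0]:
        return "ssh"
    return "other"


if __name__ == "__main__":
    print(remote_transport("https://git.azcomputerguru.com/azcomputerguru/claudetools.git"))
    print(remote_transport("git@github.com:example/repo.git"))  # illustrative URL
```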

**Action:** Agents are investigating this further

---

## Agents Currently Investigating

1. **Process Investigation Agent (a381b9a):** Root cause analysis
2. **Solution Design Agent (a8dbf87):** Proposing solutions
3. **Code Pattern Review Agent (a06900a):** Reviewing subprocess patterns
4. **Bash Process Lifecycle Agent (a0da635):** Bash/git/conhost lifecycle (IN PROGRESS)
5. **SSH/Network Connection Agent (a6a748f):** SSH connection analysis (IN PROGRESS)

---

## Immediate Observations

### Confirmed Issues

1. [HIGH] **No Timeout on Subprocess Calls**
   - periodic_save_check.py: 3 calls without timeout
   - If git hangs, the call blocks and the process never terminates
   - Fix: Add `timeout=5` to all `subprocess.run()` calls

2. [HIGH] **High Frequency Execution**
   - Every 1 minute = 1,440 executions/day
   - Each execution spawns 3+ processes
   - Cleanup lags behind spawning, so zombies accumulate

3. [MEDIUM] **No Error Handling**
   - No try/finally for cleanup
   - If an exception occurs, processes may not be cleaned up
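
Issues 1 and 3 can be addressed together by routing every call through one wrapper that sets a timeout and handles the resulting exceptions. The sketch below is one possible shape, not the project's implementation; `run_with_timeout` is a hypothetical name. Note that when the timeout expires, `subprocess.run()` kills only the direct child, so processes that child started may still need separate cleanup.

```python
import subprocess
from typing import Optional, Sequence


def run_with_timeout(cmd: Sequence[str], timeout: float = 5.0,
                     cwd: Optional[str] = None) -> Optional[str]:
    """Run cmd and return stripped stdout, or None on timeout, launch failure, or non-zero exit."""
    try:
        result = subprocess.run(
            list(cmd),
            capture_output=True,
            text=True,
            check=False,
            cwd=cwd,
            timeout=timeout,  # the direct child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    except OSError:
        # For example, FileNotFoundError when the executable is not on PATH
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    # The original script passes cwd=PROJECT_ROOT; "." is a placeholder here.
    print(run_with_timeout(["git", "config", "--local", "claude.projectid"], cwd="."))
```

Returning `None` for every failure mode keeps callers simple; a variant that logs which failure occurred would make hangs easier to count.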

### Suspected Issues

4. [MEDIUM] **Git for Windows Process Tree**
   - Each git call spawns bash + conhost
   - Windows may not clean up tree properly
   - Need process group cleanup

5. [LOW] **SSH Processes**
   - 5 SSH + 1 ssh-agent
   - Not directly related to HTTPS git URL
   - May be separate issue (background git operations?)

---

## Recommended Fixes (Pending Agent Reports)

### Immediate (High Priority)

1. **Add Timeouts to All Subprocess Calls**
   ```python
   subprocess.run(
       ["git", "config", "--local", "claude.projectid"],
       capture_output=True,
       text=True,
       check=False,
       cwd=PROJECT_ROOT,
       timeout=5,  # ADD THIS (raises subprocess.TimeoutExpired after 5 seconds)
   )
   ```

2. **Reduce Execution Frequency**
   - Change from every 1 minute to every 5 minutes
   - 80% reduction in process spawns
   - Still frequent enough for context saving

3. **Cache Git Config Results**
   - Project ID doesn't change frequently
   - Cache for 5-10 minutes
   - Reduce git calls by 80-90%
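
Because Task Scheduler starts a fresh Python process for every run, an in-memory cache would not survive between executions. The sketch below therefore assumes a small on-disk cache for fix 3. `cached_value`, the cache file layout, and the `now` parameter (included so the expiry logic can be tested) are all hypothetical.

```python
import json
import time
from typing import Callable, Optional


def cached_value(cache_path: str, ttl_seconds: float, compute: Callable[[], str],
                 now: Optional[float] = None) -> str:
    """Return the cached value if younger than ttl_seconds, else recompute and store it."""
    now = time.time() if now is None else now
    try:
        with open(cache_path, "r", encoding="utf-8") as handle:
            entry = json.load(handle)
        if now - entry["saved_at"] < ttl_seconds:
            return entry["value"]  # fresh: compute() is skipped, so no git subprocess is spawned
    except (OSError, ValueError, KeyError, TypeError):
        pass  # missing or corrupt cache: fall through and recompute
    value = compute()
    with open(cache_path, "w", encoding="utf-8") as handle:
        json.dump({"saved_at": now, "value": value}, handle)
    return value
```

With a 10-minute TTL and a 1-minute schedule, `compute` (the git call) runs on roughly one execution in ten, which matches the 80-90% reduction estimated above.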

### Secondary (Medium Priority)

4. **Process Group Cleanup**
   - Use process groups (or job objects) on Windows
   - Ensure child processes terminate with parent

5. **Monitor and Alert**
   - Track running process count
   - Alert if it exceeds a threshold
   - Auto-cleanup under memory pressure
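
The "Monitor and Alert" item could start from a simple count of suspect processes. The sketch below parses text in the format produced by `tasklist /FO CSV /NH` and counts watched image names. The watch list, the function names, and the threshold are assumptions for illustration; the SSH image names in particular should be checked against what the machine actually shows. Capturing the `tasklist` output itself should go through a call that has a timeout.

```python
import csv
import io
from collections import Counter
from typing import Dict, Iterable

# Assumed image names, based on the processes listed under Reported Symptoms
WATCHED = ("git.exe", "bash.exe", "conhost.exe", "ssh.exe", "ssh-agent.exe")


def count_processes(tasklist_csv: str, watched: Iterable[str] = WATCHED) -> Dict[str, int]:
    """Count watched image names in the output of `tasklist /FO CSV /NH`."""
    wanted = {name.lower() for name in watched}
    counts = Counter()
    for row in csv.reader(io.StringIO(tasklist_csv)):
        # The first CSV column is the image name
        if row and row[0].lower() in wanted:
            counts[row[0].lower()] += 1
    return dict(counts)


def over_threshold(counts: Dict[str, int], limit: int) -> Dict[str, int]:
    """Return only the image names whose count exceeds limit."""
    return {name: n for name, n in counts.items() if n > limit}
```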

---

## Pending Agent Analysis

Waiting for comprehensive reports from:
- Bash Process Lifecycle Agent (analyzing bash/git lifecycle)
- SSH/Network Connection Agent (analyzing SSH zombies)
- Solution Design Agent (proposing comprehensive solution)
- Code Pattern Review Agent (finding all subprocess usage)

---

## Next Steps

1. Wait for all agent reports to complete
2. Coordinate findings across all agents
3. Synthesize comprehensive solution
4. Present options to user for final decision
5. Implement chosen solution
6. Test and verify fix

---

**Status:** Investigation in progress
**Preliminary Confidence:** HIGH that periodic_save_check.py is primary culprit
**ETA:** Waiting for agent reports (est. 5-10 minutes)