sync: auto-sync from Mikes-MacBook-Air.local at 2026-05-24 13:57:12
Author: Mike Swanson Machine: Mikes-MacBook-Air.local Timestamp: 2026-05-24 13:57:12
@@ -883,3 +883,88 @@ git push origin main # -> 04f70c9
- Confidence levels: high (exact match), medium (FQDN match), low (no match, no alerts)

---

## Update: 15:30 PT — Auto-Update Mechanism Investigation and Fix (In Progress)

### User
- **User:** Mike Swanson (mike)
- **Machine:** Mikes-MacBook-Air
- **Role:** admin
- **Session span:** ~14:50–15:30 PT (interrupted for save)

---

### Session Summary

Investigated root cause of two production agents (BB-SERVER and RECEPTIONIST-PC) stuck on version 0.6.37 when the fleet is on 0.6.38. The agents had flaky WebSocket connections that caused them to miss the auto-update dispatch window. The investigation identified that the server does not query for pending updates when agents reconnect, causing updates to be permanently missed.

Implemented comprehensive fix with validation, version comparison, atomic status transitions, and enhanced logging. Two code review iterations identified critical issues: first review caught security vulnerabilities (empty string validation), logic flaws (no version comparison), and race conditions. Second review identified a critical control flow bug where successful re-dispatch incorrectly fell through to normal update check, causing duplicate updates.

Session was interrupted mid-implementation of the control flow fix to save progress.

---

### Key Decisions

- **Root cause investigation over manual fixes**: Chose to understand and prevent future failures rather than just applying updates to the two stuck agents
- **Pending update query on reconnect**: Server now checks database for pending updates before creating new ones
- **Atomic status transitions**: Used SQL `UPDATE...WHERE...RETURNING` pattern to prevent race conditions
- **Version comparison with semver**: Skip re-dispatch if agent already on or past target version
- **Security validation**: Check `download_url` and `checksum` for non-empty values before re-dispatch
- **Early return statements over flags**: Control flow uses explicit returns to prevent fallthrough
- **Multiple code review iterations**: Thorough review process caught critical bugs before deployment
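
The atomic status transition decision can be illustrated with a small in-memory stand-in. This is only a sketch: `UpdateStore`, `Status`, and `claim` are hypothetical names, and the SQL in the comment uses assumed table and column names rather than the real schema. The point it shows is that a conditional transition from pending to dispatched succeeds for exactly one caller, which is the property `UPDATE...WHERE...RETURNING` provides when concurrent reconnects race.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// The SQL shape this stand-in imitates (table/column names are assumptions):
//   UPDATE agent_updates SET status = 'dispatched'
//   WHERE id = $1 AND status = 'pending'
//   RETURNING id;
// A row comes back only for the caller that actually flipped the status.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Pending,
    Dispatched,
}

/// Hypothetical in-memory replacement for the updates table.
struct UpdateStore {
    rows: Mutex<HashMap<u32, Status>>,
}

impl UpdateStore {
    /// Create a store holding a single pending update with the given id.
    fn with_pending(id: u32) -> Self {
        UpdateStore {
            rows: Mutex::new(HashMap::from([(id, Status::Pending)])),
        }
    }

    /// Check and transition under one lock, so the two steps cannot interleave.
    /// Returns Some(id) for the single winner, None for everyone else.
    fn claim(&self, id: u32) -> Option<u32> {
        let mut rows = self.rows.lock().unwrap();
        match rows.get_mut(&id) {
            Some(status) if *status == Status::Pending => {
                *status = Status::Dispatched;
                Some(id)
            }
            _ => None, // already dispatched, or no such row
        }
    }
}
```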

---

### Problems Encountered

- **First implementation had security vulnerability**: Using `unwrap_or_default()` on `download_url` and `checksum` would allow empty strings, potentially letting agents accept malicious binaries. Fixed with `filter(|u| !u.is_empty())` validation.
- **No version comparison in first implementation**: Would re-dispatch updates even if the agent was already on the target version. Fixed by adding semver version comparison.
- **Race condition in concurrent reconnects**: Multiple connections could dispatch the same update. Fixed with atomic status transition using SQL `UPDATE...WHERE...RETURNING`.
- **Control flow bug causing duplicate updates**: After successful re-dispatch, code incorrectly proceeded to normal update check. Identified in second code review; fix was in progress when the session was interrupted.
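
The first two fixes above can be sketched together as one decision function. Everything here is illustrative: `PendingUpdate`, `Decision`, `parse_version`, and `decide` are hypothetical names, the struct fields are assumed, and the hand-rolled `major.minor.patch` parser is a simplified stand-in for a real semver library (it ignores pre-release and build metadata). Treating an unparseable version as invalid is also an assumption. Only the `filter(|u| !u.is_empty())` check is taken from the notes.

```rust
/// Hypothetical shape of a pending update record; field names are assumptions.
struct PendingUpdate {
    target_version: String,
    download_url: Option<String>,
    checksum: Option<String>,
}

#[derive(Debug, PartialEq)]
enum Decision {
    Redispatch { url: String, checksum: String },
    SkipAlreadyCurrent,
    RejectInvalid,
}

/// Parse "major.minor.patch" into a tuple that compares numerically.
/// Simplified stand-in for a semver library.
fn parse_version(v: &str) -> Option<(u64, u64, u64)> {
    let mut parts = v.trim().split('.');
    let major: u64 = parts.next()?.parse().ok()?;
    let minor: u64 = parts.next()?.parse().ok()?;
    let patch: u64 = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None; // more than three components
    }
    Some((major, minor, patch))
}

/// Decide whether a pending update should be re-sent to a reconnecting agent.
fn decide(agent_version: &str, update: &PendingUpdate) -> Decision {
    // Assumption: a version that cannot be parsed is treated as invalid.
    let (Some(current), Some(target)) = (
        parse_version(agent_version),
        parse_version(&update.target_version),
    ) else {
        return Decision::RejectInvalid;
    };

    // Agent already on or past the target: nothing to re-dispatch.
    if current >= target {
        return Decision::SkipAlreadyCurrent;
    }

    // A missing or empty value is rejected instead of defaulting to "".
    let Some(url) = update.download_url.clone().filter(|u| !u.is_empty()) else {
        return Decision::RejectInvalid;
    };
    let Some(checksum) = update.checksum.clone().filter(|c| !c.is_empty()) else {
        return Decision::RejectInvalid;
    };

    Decision::Redispatch { url, checksum }
}
```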

---

### Configuration Changes

**Modified (not yet committed):**
- `server/src/ws/mod.rs` (~lines 812-1090) - Added pending update check on agent reconnect with validation, version comparison, atomic transitions, and early returns (incomplete - control flow fix in progress)

---

### Pending / Incomplete Tasks

**Immediate:**
1. Complete control flow fix - replace flag-based logic with early returns
2. Test compilation and verify no syntax errors
3. Create database migration for pending update index (performance optimization)
4. Commit changes to gururmm repository
5. Build and deploy to production server
6. Monitor logs for `[RE-DISPATCH]` messages when BB-SERVER and RECEPTIONIST-PC reconnect
7. Verify both agents successfully update to 0.6.38
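
For task 1, a rough sketch of the intended control flow, assuming the reconnect path reduces to two questions: was a pending update re-dispatched, and does the normal check find a newer version? `Sent` and `updates_sent_on_reconnect` are hypothetical names, not the handler in `server/src/ws/mod.rs`. Returning as soon as the re-dispatch has gone out means the normal update check is never reached on that path, so one reconnect yields at most one update message.

```rust
#[derive(Debug, PartialEq)]
enum Sent {
    Redispatch,
    NormalUpdate,
}

/// Which update messages one reconnect produces (hypothetical reduction
/// of the real handler to its two branches).
fn updates_sent_on_reconnect(
    redispatched_pending: bool,
    newer_version_available: bool,
) -> Vec<Sent> {
    // Early return: once the pending update has been re-sent, stop here
    // so execution cannot fall through into the normal update check.
    if redispatched_pending {
        return vec![Sent::Redispatch];
    }

    // Normal path, reached only when nothing was re-dispatched.
    if newer_version_available {
        return vec![Sent::NormalUpdate];
    }

    Vec::new()
}
```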

**Investigation Details:**
- Root Cause: Server sends update via `send_to()` which uses mpsc channel. If WS disconnects during send, message is lost with no retry mechanism.
- Missing Link: Server has pending update records in database but doesn't query for them on agent reconnect.
- Solution: Query `get_pending_update()` on reconnect and re-send if found, with proper validation and atomicity.
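
The failure mode and the fix can be reproduced in miniature with the standard library channel. This is a sketch under assumptions: the server most likely uses an async channel rather than `std::sync::mpsc`, and `lose_message_on_disconnect` and `resend_pending` are hypothetical helpers, with a `HashMap` standing in for the pending update records in the database. The first function shows that a send reports success even though the receiving side goes away before reading it; the second shows the reconnect lookup that re-sends from the durable record.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, Sender};

/// Queue a message, then drop the receiver before it is read.
/// Returns (first send succeeded, send after disconnect succeeded).
fn lose_message_on_disconnect() -> (bool, bool) {
    let (tx, rx) = mpsc::channel::<String>();

    // The send only places the message in the queue; success says
    // nothing about whether the agent ever receives it.
    let queued = tx.send("update:0.6.38".to_string()).is_ok();

    // Connection drops: the queued message is discarded unread.
    drop(rx);

    // Only later sends fail; the earlier one is silently lost.
    let after_disconnect = tx.send("update:0.6.38".to_string()).is_ok();

    (queued, after_disconnect)
}

/// On reconnect, look up the agent's pending update (HashMap stands in
/// for the database) and re-send it over the new connection's channel.
/// Returns true if something was re-sent.
fn resend_pending(
    pending: &HashMap<String, String>,
    agent: &str,
    tx: &Sender<String>,
) -> bool {
    match pending.get(agent) {
        Some(version) => tx.send(format!("update:{version}")).is_ok(),
        None => false,
    }
}
```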

---

### Reference Information

**Key Files Reviewed:**
- `server/src/db/updates.rs` - `get_pending_update()` function exists (line 129)
- `server/src/ws/mod.rs` - WebSocket connection handler and auto-update dispatch
- `server/src/api/agents.rs` - Manual update trigger endpoint
- `agent/src/updater/mod.rs` - Agent-side update logic

**Agents Affected:**
- BB-SERVER: stuck on 0.6.37
- RECEPTIONIST-PC: stuck on 0.6.37
- Fleet version: 0.6.38

**Code Review Findings:**
- Iteration 1: REJECTED - Security vulnerabilities, logic flaws, race conditions
- Iteration 2: REJECTED - Critical control flow bug causing duplicate updates

---