sync: auto-sync from DESKTOP-0O8A1RL at 2026-05-15 09:15:55
Author: Mike Swanson Machine: DESKTOP-0O8A1RL Timestamp: 2026-05-15 09:15:55
@@ -388,3 +388,136 @@ INFO gururmm_server::ws: Agent 8cd0440f-... reconnected after update: 0.6.19 ->
- **Discovery DB fixes**: in `server/src/db/discovery.rs`, `host(ip_address)` replaces `ip_address::text`; `complete_scan()` computes `new_devices` via a CTE
- **Subnet field**: agents now report `ipv4_subnets: Vec<String>` alongside `ipv4_addresses` in the `NetworkInterface` struct (both agent and server side)
- **PTR lookup**: in `agent/src/discovery/mod.rs`, `dns_lookup::lookup_addr(&ip)` is wrapped in `spawn_blocking`

---

## Update: 09:13 PT - Zombie connection fix (0.6.21) + automated changelog system

## User

- **User:** Mike Swanson (mike)
- **Machine:** DESKTOP-0O8A1RL
- **Role:** admin
- **Session span:** ~08:30–09:13 PT (continued from prior context window)

## Session Summary

Investigation began after a screenshot showed a failed network discovery scan at 8:26 AM (19ms, no devices) on the gururmm site. The discovery node (agent 8cd0440f on host `gururmm`) had been unavailable since 14:48:36 UTC and did not reconnect for roughly an hour, even though its process (PID 1026153) was still running.

Diagnostic work confirmed the agent had zero TCP connections but was still logging metrics every 60 seconds (in two interleaved streams, ~3 seconds apart). The dual metrics stream is normal: the `connect_and_run` metrics task and the `main.rs` metrics loop both log independently. The absence of any reconnect attempts or timeout messages pointed to the agent being stuck inside `connect_and_run` with what looked like a live WebSocket but was actually a zombie: Cloudflare held the client-side WebSocket open after the backend server closed it at 14:48:36 (TCP RST), so the agent's receive side blocked indefinitely without an error.

Root cause, in `agent/src/transport/websocket.rs`: the 90-second connection timeout was implemented as `tokio::time::sleep(Duration::from_secs(90))` inside the select loop. That sleep is recreated on every loop iteration, so it restarts from zero each time. Because the heartbeat task fires every 30 seconds and completes an iteration each time it does, the timeout could never expire. Fix: track `last_incoming = Instant::now()`, initialized before the loop and updated only in the incoming-message branch, and replace the sleep with `sleep_until(last_incoming + Duration::from_secs(90))`. The timeout now fires if no server message is received for 90 seconds, regardless of outgoing heartbeat frequency.

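The difference between the two timer strategies can be sketched without the async runtime. The following is a minimal, std-only simulation, not the agent's code: `IdleDeadline`, `simulate_idle_deadline`, and `simulate_resetting_sleep` are hypothetical names, and time advances in heartbeat-sized ticks rather than through a real `select` loop.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: remembers when the last server message arrived.
/// Only `on_incoming` moves the deadline; outgoing heartbeats never touch it.
struct IdleDeadline {
    last_incoming: Instant,
    timeout: Duration,
}

impl IdleDeadline {
    fn new(now: Instant, timeout: Duration) -> Self {
        Self { last_incoming: now, timeout }
    }

    /// Call from the incoming-message branch only.
    fn on_incoming(&mut self, now: Instant) {
        self.last_incoming = now;
    }

    /// Absolute instant that a `sleep_until`-style timer would wait for.
    fn deadline(&self) -> Instant {
        self.last_incoming + self.timeout
    }

    fn expired(&self, now: Instant) -> bool {
        now >= self.deadline()
    }
}

/// Fixed strategy: one loop iteration per outgoing heartbeat. Server messages
/// arrive at the offsets in `incoming_at` (an empty slice models a zombie).
/// Returns the elapsed time at which the idle deadline fires, if it does.
fn simulate_idle_deadline(
    timeout: Duration,
    heartbeat: Duration,
    run_for: Duration,
    incoming_at: &[Duration],
) -> Option<Duration> {
    let start = Instant::now();
    let mut idle = IdleDeadline::new(start, timeout);
    let mut elapsed = Duration::ZERO;
    while elapsed <= run_for {
        if incoming_at.contains(&elapsed) {
            idle.on_incoming(start + elapsed);
        }
        if idle.expired(start + elapsed) {
            return Some(elapsed);
        }
        elapsed += heartbeat;
    }
    None
}

/// Buggy strategy: a fresh relative sleep starts on every iteration, and the
/// next heartbeat ends the iteration before the sleep can finish.
fn simulate_resetting_sleep(
    timeout: Duration,
    heartbeat: Duration,
    run_for: Duration,
) -> Option<Duration> {
    let mut elapsed = Duration::ZERO;
    while elapsed <= run_for {
        // The sleep only wins the race if it is shorter than the heartbeat gap.
        if timeout < heartbeat {
            return Some(elapsed + timeout);
        }
        elapsed += heartbeat;
    }
    None
}
```

With a 90-second timeout and 30-second heartbeats, the resetting variant never fires on a silent connection, while the deadline variant fires at the 90-second mark, or 90 seconds after the last simulated server message.
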

The service was restarted first to restore the discovery node immediately. The fix was then implemented, the agent bumped to 0.6.21, built, and deployed. The scanner picked up the new binary and dispatched the auto-update at 16:12:02 UTC. The PID changed from 1033371 to 1038912, and the "Backup file cleaned up" message confirmed the full update flow end-to-end.

The second half of the session implemented automated changelog generation. `scripts/generate-changelog.sh` generates two sections per build: a user-facing release notes section (parsed from conventional commits with feat/fix/perf prefixes) and a full developer section (the complete git log, with commit bodies, for the component path since the previous version). The script is wired into `agent/build-all-platforms.sh` and the new `build-server.sh`. Files are stored in `changelogs/agent/vX.Y.Z.md` and `changelogs/server/vX.Y.Z.md` in the repo (so GrepAI indexes them) and copied to `/var/www/gururmm/changelogs/` for serving. Two server API endpoints were added: `GET /api/changelog/:component/latest` and `GET /api/changelog/:component/:version`. All changes were committed and pushed to Gitea.

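The release-notes rule described above (keep only feat/fix/perf subjects, drop everything else) can be illustrated as follows. This is a sketch in Rust for consistency with the rest of the codebase; the actual generator is a shell script, and the `Section` enum, the function names, and the `### Features` / `### Fixes` / `### Performance` headings are assumptions, not taken from `generate-changelog.sh`.

```rust
/// User-facing sections recognized by this sketch (names are assumptions).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Section {
    Features,
    Fixes,
    Performance,
}

/// Parses a commit subject such as `fix(agent): message`.
/// Returns the section and description, or None for any other prefix.
fn classify(subject: &str) -> Option<(Section, &str)> {
    let (head, desc) = subject.split_once(':')?;
    // Drop an optional breaking-change `!` and an optional `(scope)`.
    let head = head.trim_end_matches('!');
    let kind = head.split('(').next().unwrap_or(head).trim();
    let section = match kind {
        "feat" => Section::Features,
        "fix" => Section::Fixes,
        "perf" => Section::Performance,
        _ => return None,
    };
    Some((section, desc.trim()))
}

/// Groups matching subjects under one heading per section, skipping
/// sections that have no entries.
fn release_notes(subjects: &[&str]) -> String {
    let sections = [
        (Section::Features, "Features"),
        (Section::Fixes, "Fixes"),
        (Section::Performance, "Performance"),
    ];
    let mut out = String::new();
    for (section, title) in sections {
        let items: Vec<&str> = subjects
            .iter()
            .copied()
            .filter_map(|s| classify(s))
            .filter(|(sec, _)| *sec == section)
            .map(|(_, desc)| desc)
            .collect();
        if items.is_empty() {
            continue;
        }
        out.push_str(&format!("### {}\n", title));
        for item in items {
            out.push_str(&format!("- {}\n", item));
        }
    }
    out
}
```

Subjects without a recognized prefix, such as the `sync: auto-sync ...` commits, fall through `classify` and appear only in the developer section.
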

## Key Decisions

- **`sleep_until` anchored to incoming messages only**: the fix must not reset the deadline on outgoing writes. Cloudflare accepts writes from the agent while sending nothing back, so any reset on outgoing events would continue to mask zombie connections.
- **90-second deadline retained**: matches the existing intent. Healthy connections see server messages (ConfigUpdate, AuthAck) on reconnect well within 90 seconds.
- **Service restart before code fix**: restored the discovery node immediately rather than waiting for the full build cycle.
- **Changelog in-repo + served directory**: the git repo location ensures GrepAI indexes the content for context searches; the `/var/www/gururmm/changelogs/` copy backs the API endpoint.
- **No Ollama for changelog generation**: the server (172.16.3.30) cannot reach Ollama at 100.92.127.64:11434. Shell-based conventional commit parsing is used instead, which gives clean release notes without an AI dependency.
- **Version path sanitization in changelog endpoint**: only digits, dots, and a leading `v` are allowed, to prevent path traversal. The component is validated against an allowlist.

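The sanitization decision in the last bullet can be sketched as below. This is an illustration under stated assumptions, not the contents of `server/src/api/changelog.rs`: the function names are hypothetical, the allowlist is assumed to hold only `agent` and `server`, the served directory is assumed to mirror the repo's `component/vX.Y.Z.md` layout, and the check is slightly stricter than the rule above in that it also rejects empty segments such as `..`.

```rust
use std::path::{Path, PathBuf};

/// Assumed allowlist; the log mentions only these two components.
const COMPONENTS: [&str; 2] = ["agent", "server"];

/// Accepts `0.6.21` or `v0.6.21` and returns the version without the `v`.
/// Every dot-separated segment must be a non-empty run of ASCII digits, so
/// inputs like `..`, `1..2`, or `1/2` are rejected.
fn sanitize_version(raw: &str) -> Option<&str> {
    let version = raw.strip_prefix('v').unwrap_or(raw);
    let ok = version
        .split('.')
        .all(|part| !part.is_empty() && part.chars().all(|c| c.is_ascii_digit()));
    if ok {
        Some(version)
    } else {
        None
    }
}

/// Maps a validated (component, version) pair to a file under `root`.
/// Returns None instead of building a path from untrusted input.
fn changelog_path(root: &Path, component: &str, version: &str) -> Option<PathBuf> {
    if !COMPONENTS.iter().any(|allowed| *allowed == component) {
        return None;
    }
    let version = sanitize_version(version)?;
    Some(root.join(component).join(format!("v{}.md", version)))
}
```

Because both path segments are checked before `join` is called, a request such as `/api/changelog/agent/../../etc/passwd` would be rejected before any filesystem access.
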
## Problems Encountered

- **Zombie connection not self-detecting**: Agent stuck ~56 minutes without triggering its own 90s timeout. `sleep(90s)` inside the select loop resets on every iteration; 30s heartbeats prevented it from ever firing. Fixed with `sleep_until`.
- **Dual metrics stream misread**: Initially suspected as evidence of two concurrent reconnects or a task leak. Actually normal: two independent timers started at slightly different times. Not a bug.
- **Changelog directory write permissions**: `generate-changelog.sh` runs as `guru`; `/var/www/gururmm/changelogs/` is owned by root. Added `sudo mkdir -p` and `sudo cp` with a `|| true` fallback.
- **Heredoc quoting failures**: Multiple SSH heredoc and Python one-liner attempts failed due to quote escaping. Resolved by writing scripts to `/tmp/` locally and using `scp`.

## Configuration Changes

**Modified (gururmm repo):**

- `agent/src/transport/websocket.rs`: `last_incoming` deadline replacing `sleep(90s)`; imports updated
- `agent/Cargo.toml`: version 0.6.20 -> 0.6.21
- `server/src/api/mod.rs`: added `pub mod changelog;` and two changelog routes
- `agent/build-all-platforms.sh`: appended changelog generation call

**Created (gururmm repo):**

- `server/src/api/changelog.rs`: `latest` and `by_version` handlers
- `scripts/generate-changelog.sh`: dev + user changelog generator
- `build-server.sh`: build, deploy, changelog in one script
- `changelogs/agent/v0.6.21.md`, `changelogs/server/v0.3.1.md`
- `changelogs/LATEST_AGENT.md`, `changelogs/LATEST_SERVER.md`

**Modified (server filesystem):**

- `/opt/gururmm/.env`: added `CHANGELOG_DIR=/var/www/gururmm/changelogs`
- `/usr/local/bin/gururmm-agent`: auto-updated to 0.6.21
- `/opt/gururmm/gururmm-server`: redeployed with changelog endpoint

**Created (server filesystem):**

- `/var/www/gururmm/changelogs/`: served changelog directory
- `/var/www/gururmm/downloads/gururmm-agent-linux-amd64-0.6.21` + `.sha256`

## Credentials & Secrets

None new.

## Infrastructure & Servers

- **GuruRMM server**: 172.16.3.30:3001, service `gururmm-server` (PID 1022326)
- **GuruRMM agent** (gururmm host): PID 1038912, version 0.6.21
- **Agent WebSocket**: `wss://rmm-api.azcomputerguru.com/ws` (through Cloudflare)
- **Changelog API**: `https://rmm-api.azcomputerguru.com/api/changelog/:component/latest`
- **Changelogs served**: `/var/www/gururmm/changelogs/`
- **Changelogs in repo**: `/home/guru/gururmm/changelogs/`

## Commands & Outputs

```bash
# Restore discovery node
sudo systemctl restart gururmm-agent

# Build agent 0.6.21 (server-side)
source ~/.cargo/env && cd /home/guru/gururmm/agent && cargo build --release
# Finished release in 1m 24s

# Deploy binary + sha256
sudo cp agent/target/release/gururmm-agent /var/www/gururmm/downloads/gururmm-agent-linux-amd64-0.6.21
sha256sum /var/www/gururmm/downloads/gururmm-agent-linux-amd64-0.6.21 | awk '{print $1}' | sudo tee ...sha256
# SHA256: 54637a82d113471fe11983800bf0ef207ec250dcaf1b2fe2cfd15e2e03cd8b76

# Build server with changelog endpoint
source ~/.cargo/env && cd /home/guru/gururmm/server && cargo build --release
# Finished in 4m 28s

# Test endpoints
curl http://localhost:3001/api/changelog/agent/latest   # 200 text/markdown
curl http://localhost:3001/api/changelog/agent/0.6.21   # 200
curl http://localhost:3001/api/changelog/server/latest  # 200

# Auto-update log (agent, 16:12:02 UTC)
# INFO Received update command: 0.6.20 -> 0.6.21 (id: 3721cb41-e87c-487e-899e-079186ff8dd5)
# INFO Downloading from https://rmm-api.azcomputerguru.com/downloads/gururmm-agent-linux-amd64-0.6.21
# INFO Exiting for service restart by systemd
# INFO Server confirmed update success — cleaning up rollback artifacts
```

## Pending / Incomplete Tasks

- **BB-SERVER enrollment loop**: duplicate key `idx_agents_site_device` every ~10s; pre-existing, unresolved
- **Windows/macOS agent builds**: 0.6.21 not built for Windows or macOS
- **LHM bundling in MSI**: LibreHardwareMonitor not in build pipeline
- **Build lock**: `build-all-platforms.sh` has no `flock` mutex
- **Portal changelog page**: API endpoints exist; no dashboard UI to display them yet
- **Tray changelog link**: no `changelog_url` in TrayPolicy yet
- **Policy wiring plan** (`ticklish-questing-stallman.md`): still deferred
- **IMC1 Unicode escape sequence** in hardware inventory JSON: unresolved

## Reference Information

- **Commits (gururmm repo)**:
  - `1849733` - fix(agent): replace resetting sleep with sleep_until for zombie connection detection
  - `b8809c5` - feat: add automated changelog generation for agent and server builds
  - `52b5695` - feat(server): add changelog API endpoints + deploy-to-serve in generate script
- **Changelog API**:
  - `GET https://rmm-api.azcomputerguru.com/api/changelog/agent/latest`
  - `GET https://rmm-api.azcomputerguru.com/api/changelog/server/latest`
  - `GET https://rmm-api.azcomputerguru.com/api/changelog/agent/0.6.21`
- **Agent 0.6.21 SHA256**: `54637a82d113471fe11983800bf0ef207ec250dcaf1b2fe2cfd15e2e03cd8b76`
- **Auto-update dispatch**: 2026-05-15T16:12:02Z, update_id `3721cb41-e87c-487e-899e-079186ff8dd5`
- **Key file**: `agent/src/transport/websocket.rs` (`last_incoming` at line ~279, `sleep_until` at line ~361)
- **Key file**: `server/src/api/changelog.rs`
- **Key file**: `scripts/generate-changelog.sh`