sync: auto-sync from GURU-5070 at 2026-05-31 16:35:50
Author: Mike Swanson Machine: GURU-5070 Timestamp: 2026-05-31 16:35:50
@@ -24,3 +24,11 @@ built/linted/tested locally — **no more build-host (172.16.3.30) round-trips j
**How to apply:** when a Coding Agent works on GuruConnect Rust, have it self-verify with the local toolchain (set PROTOC, run the four gates, iterate to green) and commit CI-green code; don't delegate fmt/clippy to the build host. See [[project_guruconnect_v2_direction]].

**CI fmt gate, don't omit it from agent briefs (incident 2026-05-31):** the CI `Build Server (Linux)` job runs `cargo fmt --check` as a hard gate FIRST (before build/test). SPEC-004 Task 2 + Task 4 (commits ffca7f0, 4e80573) went red on Linux CI even though the Coding Agents reported "clippy clean, tests pass", because the briefs listed only `cargo check` + `clippy` + `test` and NOT `cargo fmt --check`, so rustfmt drift in compactly written new test code slipped through. Fixed with a `cargo fmt` follow-up commit (cef1928). **Every Coding-Agent brief that touches GuruConnect Rust MUST list `cargo fmt --check` (server: `cd server && cargo fmt --check`) as a required gate alongside clippy/test**: a clippy-clean, test-green change can still fail CI on formatting.
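The gate order from the fmt-gate note can be sketched as a small std-only Rust runner. This is a minimal sketch, not the project's actual tooling: `Gate`, `gates`, `first_failure`, and `run_cargo` are hypothetical names, and the bare `clippy` and `test` invocations are assumptions (the real CI job may pass extra flags). It assumes `PROTOC` is already set in the environment. It encodes only the lesson from the incident: `cargo fmt --check` runs first, and the run stops at the first red gate.

```rust
use std::process::Command;

/// One gate: a short name plus the arguments passed to `cargo`.
#[derive(Debug, Clone, PartialEq)]
struct Gate {
    name: &'static str,
    args: &'static [&'static str],
}

/// The four gates, with `cargo fmt --check` first to mirror the CI job.
/// The clippy/test arguments are assumptions; match them to the real workflow.
fn gates() -> Vec<Gate> {
    vec![
        Gate { name: "fmt", args: &["fmt", "--check"] },
        Gate { name: "check", args: &["check"] },
        Gate { name: "clippy", args: &["clippy"] },
        Gate { name: "test", args: &["test"] },
    ]
}

/// Runs each gate in order through `run` and stops at the first failure.
/// Returns the failing gate's name, or `None` when every gate passes.
fn first_failure(gates: &[Gate], mut run: impl FnMut(&Gate) -> bool) -> Option<&'static str> {
    for gate in gates {
        if !run(gate) {
            return Some(gate.name);
        }
    }
    None
}

/// Executes one gate with the real `cargo` binary inside `dir`.
fn run_cargo(dir: &str, gate: &Gate) -> bool {
    Command::new("cargo")
        .args(gate.args)
        .current_dir(dir)
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}

fn main() {
    // With no argument this is a dry run that prints the plan.
    // Pass a crate directory (for example `server`) to execute the gates.
    match std::env::args().nth(1) {
        None => {
            for gate in gates() {
                println!("cargo {}", gate.args.join(" "));
            }
        }
        Some(dir) => match first_failure(&gates(), |gate| run_cargo(&dir, gate)) {
            None => println!("all gates green"),
            Some(name) => {
                eprintln!("gate failed: {}", name);
                std::process::exit(1);
            }
        },
    }
}
```

Keeping the gate list as data makes it easy to paste the same ordered list into an agent brief, so the brief and the local check cannot drift apart.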
89
session-logs/2026-05-31-mike-spec004-sprint-deploy.md
Normal file
@@ -0,0 +1,89 @@
## User

- **User:** Mike Swanson (mike)
- **Machine:** GURU-5070
- **Role:** admin
## Session Summary

Executed the GuruConnect P1 sprint A-track (SPEC-004: stable machine identity, session lifecycle, operator removal) end to end, from code through production deploy, then purged the live ghost-machine rows. This continued the same session as the earlier H.264 viewer fix. Began by scoping the P1 roadmap items into a tracked task list with workstream tags (A: identity/lifecycle, B: release-eng, C: phase-1 exit, D: rollout) and blocked-by dependencies, then worked the A-track critical path autonomously.

Implemented four SPEC-004 work items (Tasks 2, 4, and 5, with Task 5 split in two), each via a single Coding Agent (explore→implement→test), then Code Review (security/data-integrity focus), then a hardening pass on the review nits, then commit via the Gitea Agent with a two-repo flow (GC submodule commit + push, then parent pointer bump):

- A1 (Task 2, commit ffca7f0): migration 008 `machine_uid` + two-path upsert dedup (ON CONFLICT machine_uid | agent_id) + session reattach index; closed an L1 startup-spoof gap fail-closed.
- A2 (Task 4, 4e80573): TTL session reaper (10min, 60s sweep) + same-machine supersede; closed a snapshot→remove TOCTOU with a write-lock-guarded `remove_session_if`.
- A3a (Task 5 server, 5ee6675): admin-gated soft-delete/purge endpoints (DELETE ?purge=true + bulk-remove) + audit to connect_session_events + list/get `deleted_at IS NULL` filtering + upsert revive; closed two dead-code nits.
- A3b (Task 5 dashboard, 96f9c0a): per-row + multi-select removal UI wired to the API, admin-gated via useAuth().isAdmin, danger confirm modals, refetch-on-success; passed a frontend-design validation pass + Code Review + a count-consistency nit fix.

Hit a CI regression: A1 and A2 went red on the CI Build-Server-Linux job because it gates on `cargo fmt --check` first, which the Coding-Agent validation briefs had omitted (they ran check/clippy/test only). Fixed with a rustfmt-only commit (cef1928), confirmed CI green, and recorded the lesson in memory so future GC-Rust agent briefs always include the fmt gate. All subsequent agent briefs included `cargo fmt --check`.

Deployed the GC server (b3e8f32 → 96f9c0a) to production 172.16.3.30 following the existing deploy memory: backup-first (pg_dump + binary + the downloadable agent .exe preserved across the git reset), dashboard SPA build + Linux server binary build on the box (login shell, PROTOC set), systemctl restart. Migrations 008/009 auto-applied, the stale-session reaper started, and health/dashboard/public-ingress all returned 200. Then purged 11 ghost connect_machines rows via a principled "keep newest per hostname" soft-delete (19→8 live, 0 duplicate hostnames), verified via dry-run before the destructive UPDATE. The entire SPEC-004 A-track is now live in production, and the ghost-session accumulation is fixed at the root and cleaned up.
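The A2 reaper and its TOCTOU fix can be illustrated with a std-only Rust sketch. It is not the code from `session/mod.rs`: `SessionManager`, `Session`, the field names, and the integer timestamps are assumptions made for illustration, and the periodic 60s sweep task is left out. What it shows is the shape of the guard: candidates are snapshotted under a read lock, and `remove_session_if` re-checks the predicate while holding the write lock, so a session that came back online between snapshot and removal is kept.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// 10-minute TTL, as stated in the session log.
const TTL_SECS: u64 = 600;

/// Hypothetical session record; timestamps are plain seconds for testability.
#[derive(Debug)]
struct Session {
    online: bool,
    last_seen: u64,
}

struct SessionManager {
    sessions: RwLock<HashMap<u32, Session>>,
}

impl SessionManager {
    fn new() -> Self {
        SessionManager { sessions: RwLock::new(HashMap::new()) }
    }

    fn insert(&self, id: u32, session: Session) {
        self.sessions.write().unwrap().insert(id, session);
    }

    /// Removes `id` only if `pred` still holds while the write lock is held.
    /// Returns true when a session was actually removed.
    fn remove_session_if(&self, id: u32, pred: impl Fn(&Session) -> bool) -> bool {
        let mut map = self.sessions.write().unwrap();
        let should_remove = match map.get(&id) {
            Some(session) => pred(session),
            None => false,
        };
        if should_remove {
            map.remove(&id);
        }
        should_remove
    }

    /// One sweep: snapshot stale candidates under a read lock, then re-check
    /// each one under the write lock before removing it.
    fn reap_stale(&self, now: u64) -> Vec<u32> {
        let stale = move |s: &Session| !s.online && now.saturating_sub(s.last_seen) > TTL_SECS;
        let candidates: Vec<u32> = {
            let map = self.sessions.read().unwrap();
            map.iter().filter(|(_, s)| stale(*s)).map(|(id, _)| *id).collect()
        };
        let mut reaped: Vec<u32> = candidates
            .into_iter()
            .filter(|id| self.remove_session_if(*id, &stale))
            .collect();
        reaped.sort();
        reaped
    }
}

fn main() {
    let mgr = SessionManager::new();
    mgr.insert(1, Session { online: false, last_seen: 0 });
    mgr.insert(2, Session { online: true, last_seen: 0 });
    println!("reaped: {:?}", mgr.reap_stale(1_000));
}
```

Passing the predicate into the removal call, rather than deciding outside the lock, is what closes the window between "looked stale" and "was removed".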
## Key Decisions

- Split A3 (Task 5, "operator removal API + dashboard") into A3a (server) and A3b (dashboard): the server API alone unblocks A4 (ghost purge needs the endpoint, not the UI), and each half is a manageable review unit.
- Fixed every substantive Code Review nit before committing (L1 startup-spoof, reap TOCTOU, dead-code update-command leak, NULL-session audit type, bulk-count display) rather than deferring: security/data-integrity work warrants leaving each task fully hardened.
- Ran Coding Agents sequentially (not in parallel) on the same GC repo to avoid cargo target-dir lock contention between concurrent builds.
- Purged ghosts via DIRECT SQL soft-delete (deleted_at=NOW()) rather than the removal API. The two are functionally identical for these rows (all offline, no live in-memory sessions), and direct SQL avoids resetting Mike/Howard's dashboard admin password (no .admin-credentials file on the box, Argon2id hash unrecoverable). Soft-delete is reversible.
- Purge rule: "keep the most-recently-seen row per hostname, soft-delete older duplicates", which removes exactly the duplicates that dedup would have prevented. Applied only to duplicate hostnames, leaving single-row stale machines untouched (decommission is a separate decision).
- Preserved the downloadable agent .exe (server/static/downloads/guruconnect.exe, locally modified) across `git reset --hard origin/main` by copying it aside and restoring it: the deploy must not roll back what clients download.
- Used the direct internal Gitea API (172.16.3.20:3000) + plink/pscp for all server ops; transferred SQL via pscp'd files after multi-layer shell quoting repeatedly mangled `$$` dollar-quoting and parens.
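The purge rule from the decisions above ("keep the most-recently-seen row per hostname, soft-delete older duplicates") was applied in SQL during the session; the Rust below is only a model of the selection logic, useful as a dry-run check. `MachineRow`, its fields, and `rows_to_soft_delete` are hypothetical names, and the tie-break on `id` is an added assumption to make the result deterministic.

```rust
use std::collections::HashMap;

/// Hypothetical projection of a connect_machines row.
#[derive(Debug, Clone)]
struct MachineRow {
    id: u32,
    hostname: String,
    last_seen: u64,
}

/// Returns the ids to soft-delete: for each hostname, every row except the
/// most recently seen one. Hostnames with a single row are left untouched.
fn rows_to_soft_delete(rows: &[MachineRow]) -> Vec<u32> {
    let mut newest: HashMap<&str, &MachineRow> = HashMap::new();
    for row in rows {
        let entry = newest.entry(row.hostname.as_str()).or_insert(row);
        // Newer last_seen wins; equal timestamps fall back to the higher id.
        if (row.last_seen, row.id) > (entry.last_seen, entry.id) {
            *entry = row;
        }
    }
    let mut doomed: Vec<u32> = rows
        .iter()
        .filter(|r| newest[r.hostname.as_str()].id != r.id)
        .map(|r| r.id)
        .collect();
    doomed.sort();
    doomed
}

/// Small constructor to keep the examples short.
fn row(id: u32, hostname: &str, last_seen: u64) -> MachineRow {
    MachineRow { id, hostname: hostname.to_string(), last_seen }
}

fn main() {
    let rows = vec![row(1, "HOST-A", 10), row(2, "HOST-A", 30), row(3, "HOST-B", 5)];
    println!("soft-delete ids: {:?}", rows_to_soft_delete(&rows));
}
```

Computing the id list first and applying the UPDATE second mirrors the dry-run-then-destructive order used in the session.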
## Problems Encountered

- CI Build-Server-Linux failed on commits ffca7f0 + 4e80573. Root cause: `cargo fmt --check` is CI's first gate, but local validation ran check/clippy/test and not fmt. Fixed with rustfmt-only commit cef1928; added the gate to all later agent briefs and to memory.
- Multi-layer shell quoting (plink → bash -lc "..." → heredoc) mangled psql `$$` dollar-quoting (expanded to the shell PID) and choked on parens in echo labels. Resolved by writing SQL to local files and pscp-transferring them, and by removing parens from echo strings.
- No GC dashboard admin creds available (.admin-credentials absent, hash unrecoverable, and no connect_users table exists because the users table is named differently). Sidestepped by purging via direct SQL.
- Server working tree was not clean (modified downloadable .exe), so the memory's `git reset --hard` would have clobbered it; preserved the file across the reset.
- Migration 007 already fixed the NULL-tags bug (null_tags=0), so the deploy memory's manual `UPDATE ... SET tags='{}'` mitigation was unnecessary.
## Configuration Changes

GuruConnect repo (azcomputerguru/guru-connect), all committed + pushed:

- `ffca7f0` server: migration 008_machine_uid.sql (new), db/machines.rs, db/agent_keys.rs, main.rs, relay/mod.rs, session/mod.rs. Adds machine_uid dedup + reattach index + startup keyed-skip.
- `4e80573` server: session/mod.rs, main.rs. Adds reap_stale_persistent + 60s sweep task + supersede + remove_session_if.
- `cef1928` server: machines.rs, main.rs, session/mod.rs. cargo fmt only.
- `5ee6675` server: migration 009_session_machine_soft_delete.sql (new), api/removal.rs (new), api/mod.rs, main.rs, db/{machines,sessions,events,releases}.rs. Adds removal API + soft-delete + audit.
- `96f9c0a` dashboard: api/{types,machines,sessions}.ts, features/machines/{hooks,MachinesPage}.tsx + 2 new dialogs, features/sessions/{hooks,SessionsPage}.tsx + 1 new dialog, features/machines/DeleteMachineDialog.tsx (comment), components/ui/table.css.

ClaudeTools monorepo: submodule pointer bumps for each GC commit; `.claude/memory/reference_guru5070_rust_toolchain.md` updated (CI fmt-gate lesson; uncommitted at save time, will sync now).

Production server 172.16.3.30: code reset b3e8f32→96f9c0a; rebuilt dashboard SPA (server/static/app/) + server binary; service restarted; migrations 008/009 applied; 11 connect_machines rows soft-deleted.
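The `machine_uid` dedup (`ffca7f0`) and the soft-delete filtering with upsert revive (`5ee6675`) can be pictured with an in-memory Rust model. This is one plausible reading of the log, not the real SQL in `db/machines.rs`: `Store`, `Machine`, and the method names are hypothetical, and the rule "match on `machine_uid` when the agent reports one, otherwise on `agent_id`" is an assumption about how the two upsert paths divide the work.

```rust
/// Hypothetical in-memory stand-in for a connect_machines row.
#[derive(Debug, Clone)]
struct Machine {
    agent_id: String,
    machine_uid: Option<String>,
    hostname: String,
    deleted_at: Option<u64>,
}

struct Store {
    rows: Vec<Machine>,
}

impl Store {
    /// Two-path upsert (assumed semantics): match by machine_uid when present,
    /// else by agent_id. A matched soft-deleted row is revived, not duplicated.
    /// Returns the index of the row that was updated or inserted.
    fn upsert(&mut self, agent_id: &str, machine_uid: Option<&str>, hostname: &str) -> usize {
        let found = match machine_uid {
            Some(uid) => self.rows.iter().position(|m| m.machine_uid.as_deref() == Some(uid)),
            None => self.rows.iter().position(|m| m.agent_id == agent_id),
        };
        match found {
            Some(i) => {
                let m = &mut self.rows[i];
                m.agent_id = agent_id.to_string();
                m.hostname = hostname.to_string();
                m.deleted_at = None; // revive on reconnect
                i
            }
            None => {
                self.rows.push(Machine {
                    agent_id: agent_id.to_string(),
                    machine_uid: machine_uid.map(str::to_string),
                    hostname: hostname.to_string(),
                    deleted_at: None,
                });
                self.rows.len() - 1
            }
        }
    }

    /// Soft-delete: mark the row instead of removing it.
    fn soft_delete(&mut self, index: usize, now: u64) {
        self.rows[index].deleted_at = Some(now);
    }

    /// Equivalent of filtering on `deleted_at IS NULL` in list/get queries.
    fn list_live(&self) -> Vec<&Machine> {
        self.rows.iter().filter(|m| m.deleted_at.is_none()).collect()
    }
}

fn main() {
    let mut store = Store { rows: Vec::new() };
    let first = store.upsert("agent-1", Some("uid-A"), "HOST-A");
    // A reinstall with a fresh agent_id but the same machine_uid reuses the row.
    let second = store.upsert("agent-2", Some("uid-A"), "HOST-A");
    println!("same row: {}, live rows: {}", first == second, store.list_live().len());
}
```

The model makes the intent visible: a reinstalled agent lands on its existing row instead of creating a ghost, and a removed machine that reconnects reappears in the live list.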
## Credentials & Secrets
- SSH deploy host: `guru@172.16.3.30`, password `Gptf*77ttb123!@#-rmm` (plink/pscp).
- GC DB: `postgresql://guruconnect:<pw>@localhost/guruconnect` (password in server/.env on the box; vault entry `projects/guruconnect/database.sops.yaml` has username only, no password field).
- Gitea API token (read CI status): vault `services/gitea.sops.yaml` → `credentials.api.api-token` (40 chars).
- No GC dashboard admin password available (no .admin-credentials file; Argon2id hash in DB).
## Infrastructure & Servers

- GC relay/server: 172.16.3.30:3002, systemd `guruconnect.service` (User=guru, EnvironmentFile=server/.env, no WatchdogSec; do NOT run setup-systemd.sh). Repo on box: /home/guru/guru-connect (NOT a submodule there). Binary: target/x86_64-unknown-linux-gnu/release/guruconnect-server. Toolchain in login shell only (cargo ~/.cargo/bin, protoc /home/guru/.local/bin, node/npm /usr/bin); PROTOC must be set for build.rs.
- Public: connect.azcomputerguru.com via NPM on Jupiter (172.16.3.20) → :3002. CONNECT_TRUSTED_PROXIES=127.0.0.1,::1,172.16.3.20.
- DB: Postgres `guruconnect` on localhost; migrations sqlx-embedded, auto-run on boot.
- CI: Gitea Actions; jobs Build Server (Linux), Build Agent (Windows, Pluto runner), Security Audit, Build Summary. Build-Server-Linux gates on `cargo fmt --check` FIRST. Dashboard is NOT built by CI (only server/agent); validated locally.
- Backups on box: ~/backups/guruconnect/pre-deploy-20260531-222216.sql.gz, rollback-sha-*.txt (b3e8f32), guruconnect-server-*.bak.
## Commands & Outputs

- Deploy build (on box, login shell): `export PROTOC=/home/guru/.local/bin/protoc; cd dashboard && npm ci && npm run build; cd .. && cargo build --release -p guruconnect-server --target x86_64-unknown-linux-gnu` (~2 min server).
- Cutover: `sudo systemctl restart guruconnect`; journal showed "Migrations complete", "Stale-session reaper started (ttl 600s, sweep every 60s)", "Server listening on 0.0.0.0:3002".
- Verify: health/dashboard/public all HTTP 200; no errors/panics since restart.
- Purge result: UPDATE 11; live_machines=8, soft_deleted=11, 0 duplicate hostnames remaining.
- Purged hostnames: DESKTOP-I66IM5Q (10→1, kept 935a3920), DESKTOP-N9MIFGD (2→1, kept 99432392), GURU-5070 (2→1, kept h264sync).
- CI status via `GET http://172.16.3.20:3000/api/v1/repos/azcomputerguru/guru-connect/actions/tasks` (token auth) and `/actions/runs/<n>/jobs`.
- Coord: PUT /api/coord/components/guruconnect/server → state=deployed, version=0.2.0.
## Pending / Incomplete Tasks

- In-memory ghost sessions (restored at startup) should auto-reap within ~10 min of the 22:25 restart (offline persistent sessions past the 10-min TTL). This doubles as live Task-4 validation; no action needed.
- 5 stale single-row machines (offline since Dec 2025; ACG-M-L5090, ACG-TECH-02L, DESKTOP-ET51AJO, NOTE-XZD817, kept I66IM5Q) left intact; decommission is a separate user decision.
- B-track: #18 auto-versioning (build.rs conventional-commit bump) → #19 code signing (jsign/Azure Trusted Signing). Ready entry point; gates the fleet rollout.
- C-track: #3 Phase-1 functional verification + /gc-audit security re-pass; #4 HW-H.264 cross-GPU validation (beast→5070), then DEFAULT_PREFER_H264 (stays false until validated).
- D: #16 fleet rollout off 0.1.0 (blocked by B1+B2).
- Downstream: #7 fleet per-agent cak_ migration (now unblocked by SPEC-004 dedup) → #5 retire shared AGENT_API_KEY.
- DeleteMachineDialog.tsx left as an intentionally-unwired component (legacy uninstall/export) with a header comment.
## Reference Information

- GC HEAD: 96f9c0a. Parent HEAD after A3b: ed8bfe7 (further advanced by other-machine syncs since).
- GC commits this session: ffca7f0 (Task2), 4e80573 (Task4), cef1928 (fmt), 5ee6675 (Task5 server), 96f9c0a (Task5 dashboard).
- Rollback: `cd /home/guru/guru-connect && git reset --hard b3e8f32 && <rebuild> && sudo systemctl restart guruconnect && gunzip -c ~/backups/guruconnect/pre-deploy-20260531-222216.sql.gz | psql "$DATABASE_URL"`.
- Specs: specs/v2-stable-identity/plan.md (SPEC-004 Tasks 1-5). docs/specs/SPEC-004-session-lifecycle-and-removal.md.
- Sprint tasks: #13 A1, #14 A2, #15 A3a, #21 A3b, #20 A4 (all done); #18/#19 B, #3/#4 C, #16 D pending.
- Memory updated: reference_guru5070_rust_toolchain.md (CI fmt-gate lesson). Deploy procedure: project_guruconnect_deploy.md.