Files
claudetools/session-logs/2026-05-31-mike-spec004-sprint-deploy.md
Mike Swanson 5ee92ad5b1 sync: auto-sync from GURU-5070 at 2026-05-31 18:23:00
Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-05-31 18:23:00
2026-05-31 18:23:05 -07:00

15 KiB

User

  • User: Mike Swanson (mike)
  • Machine: GURU-5070
  • Role: admin

Session Summary

Executed the GuruConnect P1 sprint A-track (SPEC-004: stable machine identity, session lifecycle, operator removal) end to end — code through production deploy — then purged the live ghost-machine rows. This continued the same session as the earlier H.264 viewer fix. Began by scoping the P1 roadmap items into a tracked task list with workstream tags (A: identity/lifecycle, B: release-eng, C: phase-1 exit, D: rollout) and blocked-by dependencies, then worked the A-track critical path autonomously.

Implemented four SPEC-004 tasks, each via a single Coding Agent (explore→implement→test), then Code Review (security/data-integrity focus), then a hardening pass on the review nits, then commit via the Gitea Agent with a two-repo flow (GC submodule commit + push, then parent pointer bump). A1 (Task 2, commit ffca7f0): migration 008 machine_uid + two-path upsert dedup (ON CONFLICT machine_uid | agent_id) + session reattach index; closed an L1 startup-spoof gap fail-closed. A2 (Task 4, 4e80573): TTL session reaper (10min, 60s sweep) + same-machine supersede; closed a snapshot→remove TOCTOU with a write-lock-guarded remove_session_if. A3a (Task 5 server, 5ee6675): admin-gated soft-delete/purge endpoints (DELETE ?purge=true + bulk-remove) + audit to connect_session_events + list/get deleted_at IS NULL filtering + upsert revive; closed two dead-code nits. A3b (Task 5 dashboard, 96f9c0a): per-row + multi-select removal UI wired to the API, admin-gated via useAuth().isAdmin, danger confirm modals, refetch-on-success; passed a frontend-design validation pass + Code Review + a count-consistency nit fix.

Hit a CI regression: A1 and A2 went red on the CI Build-Server-Linux job because it gates on cargo fmt --check first, which the Coding-Agent validation briefs had omitted (they ran check/clippy/test only). Fixed with a rustfmt-only commit (cef1928), confirmed CI green, and recorded the lesson in memory so future GC-Rust agent briefs always include the fmt gate. All subsequent agent briefs included cargo fmt --check.

Deployed the GC server (b3e8f32 → 96f9c0a) to production 172.16.3.30 following the existing deploy memory: backup-first (pg_dump + binary + the downloadable agent .exe preserved across the git reset), dashboard SPA build + Linux server binary build on the box (login shell, PROTOC set), systemctl restart. Migrations 008/009 auto-applied, the stale-session reaper started, health/dashboard/public-ingress all returned 200. Then purged 11 ghost connect_machines rows via a principled "keep newest per hostname" soft-delete (19→8 live, 0 duplicate hostnames), verified via dry-run before the destructive UPDATE. The entire SPEC-004 A-track is now live in production and the ghost-session accumulation is fixed at the root and cleaned up.

Key Decisions

  • Split A3 (Task 5, "operator removal API + dashboard") into A3a (server) and A3b (dashboard): the server API alone unblocks A4 (ghost purge needs the endpoint, not the UI), and each half is a manageable review unit.
  • Fixed every substantive Code Review nit before committing (L1 startup-spoof, reap TOCTOU, dead-code update-command leak, NULL-session audit type, bulk-count display) rather than deferring — security/data-integrity work warrants leaving each task fully hardened.
  • Ran Coding Agents sequentially (not parallel) on the same GC repo to avoid cargo target-dir lock contention between concurrent builds.
  • Purged ghosts via DIRECT SQL soft-delete (deleted_at=NOW()) rather than the removal API: functionally identical for these rows (all offline, no live in-memory sessions), avoids resetting Mike/Howard's dashboard admin password (no .admin-credentials file on the box, Argon2id hash unrecoverable). Soft-delete is reversible.
  • Purge rule = "keep the most-recently-seen row per hostname, soft-delete older duplicates" — exactly what dedup would have prevented; applied only to duplicate hostnames, leaving single-row stale machines untouched (decommission is a separate decision).
  • Preserved the downloadable agent .exe (server/static/downloads/guruconnect.exe, locally modified) across git reset --hard origin/main by copying it aside and restoring — the deploy must not roll back what clients download.
  • Used direct internal Gitea API (172.16.3.20:3000) + plink/pscp for all server ops; transferred SQL via pscp'd files after multi-layer shell quoting repeatedly mangled $$ dollar-quoting and parens.

Problems Encountered

  • CI Build-Server-Linux failed on commits ffca7f0 + 4e80573: root cause was cargo fmt --check (CI's first gate) — local validation ran check/clippy/test but not fmt. Fixed with rustfmt-only commit cef1928; added the gate to all later agent briefs and to memory.
  • Multi-layer shell quoting (plink → bash -lc "..." → heredoc) mangled psql $$ dollar-quoting (expanded to the shell PID) and choked on parens in echo labels. Resolved by writing SQL to local files and pscp-transferring them, and removing parens from echo strings.
  • No GC dashboard admin creds available (.admin-credentials absent, hash unrecoverable, connect_users table doesn't exist — users table differently named). Sidestepped by purging via direct SQL.
  • Server working tree was not clean (modified downloadable .exe) so the memory's git reset --hard would have clobbered it — preserved it across the reset.
  • Migration 007 already fixed the NULL-tags bug (null_tags=0), so the deploy memory's manual UPDATE ... SET tags='{}' mitigation was unnecessary.

Configuration Changes

GuruConnect repo (azcomputerguru/guru-connect), all committed + pushed:

  • ffca7f0 server: migration 008_machine_uid.sql (new), db/machines.rs, db/agent_keys.rs, main.rs, relay/mod.rs, session/mod.rs — machine_uid dedup + reattach index + startup keyed-skip.
  • 4e80573 server: session/mod.rs, main.rs — reap_stale_persistent + 60s sweep task + supersede + remove_session_if.
  • cef1928 server: machines.rs, main.rs, session/mod.rs — cargo fmt only.
  • 5ee6675 server: migration 009_session_machine_soft_delete.sql (new), api/removal.rs (new), api/mod.rs, main.rs, db/{machines,sessions,events,releases}.rs — removal API + soft-delete + audit.
  • 96f9c0a dashboard: api/{types,machines,sessions}.ts, features/machines/{hooks,MachinesPage}.tsx + 2 new dialogs, features/sessions/{hooks,SessionsPage}.tsx + 1 new dialog, features/machines/DeleteMachineDialog.tsx (comment), components/ui/table.css.

ClaudeTools monorepo: submodule pointer bumps for each GC commit; .claude/memory/reference_guru5070_rust_toolchain.md updated (CI fmt-gate lesson, uncommitted at save time — will sync now).

Production server 172.16.3.30: code reset b3e8f32→96f9c0a; rebuilt dashboard SPA (server/static/app/) + server binary; service restarted; migrations 008/009 applied; 11 connect_machines rows soft-deleted.

Credentials & Secrets

  • SSH deploy host: guru@172.16.3.30, password Gptf*77ttb123!@#-rmm (plink/pscp).
  • GC DB: postgresql://guruconnect:<pw>@localhost/guruconnect (password in server/.env on the box; vault entry projects/guruconnect/database.sops.yaml has username only, no password field).
  • Gitea API token (read CI status): vault services/gitea.sops.yamlcredentials.api.api-token (40 chars).
  • No GC dashboard admin password available (no .admin-credentials file; Argon2id hash in DB).

Infrastructure & Servers

  • GC relay/server: 172.16.3.30:3002, systemd guruconnect.service (User=guru, EnvironmentFile=server/.env, no WatchdogSec — do NOT run setup-systemd.sh). Repo on box: /home/guru/guru-connect (NOT a submodule there). Binary: target/x86_64-unknown-linux-gnu/release/guruconnect-server. Toolchain in login shell only (cargo ~/.cargo/bin, protoc /home/guru/.local/bin, node/npm /usr/bin); PROTOC must be set for build.rs.
  • Public: connect.azcomputerguru.com via NPM on Jupiter (172.16.3.20) → :3002. CONNECT_TRUSTED_PROXIES=127.0.0.1,::1,172.16.3.20.
  • DB: Postgres guruconnect on localhost; migrations sqlx-embedded, auto-run on boot.
  • CI: Gitea Actions; jobs Build Server (Linux), Build Agent (Windows, Pluto runner), Security Audit, Build Summary. Build-Server-Linux gates on cargo fmt --check FIRST. Dashboard is NOT built by CI (only server/agent); validated locally.
  • Backups on box: ~/backups/guruconnect/pre-deploy-20260531-222216.sql.gz, rollback-sha-.txt (b3e8f32), guruconnect-server-.bak.

Commands & Outputs

  • Deploy build (on box, login shell): export PROTOC=/home/guru/.local/bin/protoc; cd dashboard && npm ci && npm run build; cd .. && cargo build --release -p guruconnect-server --target x86_64-unknown-linux-gnu (~2 min server).
  • Cutover: sudo systemctl restart guruconnect; journal showed "Migrations complete", "Stale-session reaper started (ttl 600s, sweep every 60s)", "Server listening on 0.0.0.0:3002".
  • Verify: health/dashboard/public all HTTP 200; no errors/panics since restart.
  • Purge result: UPDATE 11; live_machines=8, soft_deleted=11, 0 duplicate hostnames remaining.
  • Purged hostnames: DESKTOP-I66IM5Q (10→1, kept 935a3920), DESKTOP-N9MIFGD (2→1, kept 99432392), GURU-5070 (2→1, kept h264sync).
  • CI status via GET http://172.16.3.20:3000/api/v1/repos/azcomputerguru/guru-connect/actions/tasks (token auth) and /actions/runs/<n>/jobs.
  • Coord: PUT /api/coord/components/guruconnect/server → state=deployed, version=0.2.0.

Pending / Incomplete Tasks

  • In-memory ghost sessions (restored at startup) auto-reap within ~10 min of the 22:25 restart (offline persistent past 10-min TTL) — live Task-4 validation; no action.
  • 5 stale single-row machines (offline since Dec 2025; ACG-M-L5090, ACG-TECH-02L, DESKTOP-ET51AJO, NOTE-XZD817, kept I66IM5Q) left intact — decommission is a separate user decision.
  • B-track: #18 auto-versioning (build.rs conventional-commit bump) → #19 code signing (jsign/Azure Trusted Signing). Ready entry point; gates the fleet rollout.
  • C-track: #3 Phase-1 functional verification + /gc-audit security re-pass; #4 HW-H.264 cross-GPU validation (beast→5070), then DEFAULT_PREFER_H264 (stays false until validated).
  • D: #16 fleet rollout off 0.1.0 (blocked by B1+B2).
  • Downstream: #7 fleet per-agent cak_ migration (now unblocked by SPEC-004 dedup) → #5 retire shared AGENT_API_KEY.
  • DeleteMachineDialog.tsx left as an intentionally-unwired component (legacy uninstall/export) with a header comment.

Reference Information

  • GC HEAD: 96f9c0a. Parent HEAD after A3b: ed8bfe7 (further advanced by other-machine syncs since).
  • GC commits this session: ffca7f0 (Task2), 4e80573 (Task4), cef1928 (fmt), 5ee6675 (Task5 server), 96f9c0a (Task5 dashboard).
  • Rollback: cd /home/guru/guru-connect && git reset --hard b3e8f32 && <rebuild> && sudo systemctl restart guruconnect && gunzip -c ~/backups/guruconnect/pre-deploy-20260531-222216.sql.gz | psql "$DATABASE_URL".
  • Specs: specs/v2-stable-identity/plan.md (SPEC-004 Tasks 1-5). docs/specs/SPEC-004-session-lifecycle-and-removal.md.
  • Sprint tasks: #13 A1, #14 A2, #15 A3a, #21 A3b, #20 A4 (all done); #18/#19 B, #3/#4 C, #16 D pending.
  • Memory updated: reference_guru5070_rust_toolchain.md (CI fmt-gate lesson). Deploy procedure: project_guruconnect_deploy.md.

Update: 18:22 PT — B-track (v0.3.0 release), D1 (publish), C1 (Phase-1 EXIT)

Continued the P1 sprint after the A-track deploy. Discovered that B1 (auto-versioning) and B2 (code signing) were ALREADY implemented in .gitea/workflows/release.yml (24KB; conventional-commit version bump + git-cliff changelog + native Windows build on Pluto + jsign Azure Trusted Signing + Gitea release) — they had been mislabeled in the sprint plan as "build" tasks. Validated them by cutting a real release: reconciled the drifted manifests to a clean v0.2.2 baseline (agent/server Cargo.toml were hardcoded 0.2.0 below the workspace/tag; dashboard was on a divergent 2.0.0 scheme; synced package-lock; commit 16586c4), then triggered release.yml. Gitea 1.25.2's workflow_dispatch API returns 204 but does NOT enqueue a run (known bug) — Mike triggered it from the web UI. Release run #71 succeeded all three jobs and published v0.3.0: tag e967cce, signed guruconnect.exe (Azure Trusted Signing) + .sha256 + CHANGELOG.md as Gitea release assets.

D1 (#16): there is NO 0.1.0 fleet (Mike corrected the stale task framing) — the connect_machines rows were ghosts (purged) + test boxes + stale Dec-2025 client rows. So D1 reduced to PUBLISHING v0.3.0 as the canonical release rather than a fleet push. The /api/version endpoint reads get_latest_stable_release from the empty releases table ("No stable release available"). Registered v0.3.0 via direct SQL INSERT (no dashboard admin creds available): download_url = the public Gitea release asset, checksum bc4767f4...06ef (verified the binary actually hashes to it). /api/version now serves v0.3.0 on both the local endpoint and the public connect.azcomputerguru.com ingress.

C1 (#3): v2 Phase 1 formally EXITED. Live functional verification of the secure-session-core CRITICAL boundaries against the DEPLOYED binary (forged HS256 tokens via curl WS upgrade): login-JWT on /ws/viewer -> 401; validly-signed viewer token for session AAAA used on session BBBB -> 403 (session bind enforced); login-JWT as agent api_key on /ws/agent -> 401; wrong-sig -> 401. Then ran /gc-audit --pass=security (Agent E, Opus): PASS, 0 CRITICAL/HIGH/MEDIUM/LOW. The 3 relay CRITICALs stay closed, the prior agent-update-TLS HIGH and chat-logging LOW are fixed, and the net-new SPEC-004 surface (machine_uid dedup gate, reaper/supersede, operator removal API) audits clean — no non-admin removal path, no uid-spoof hijack (worst case denial-of-persistence), no auth-plane crossover. Report: reports/2026-05-31-gc-audit.md (commit 1601745). Roadmap banner updated to mark Phase 1 exited.

Commits (this update)

  • 16586c4 chore: reconcile manifest versions to v0.2.2 baseline
  • e967cce chore: release v0.3.0 [skip ci] (release.yml; tag v0.3.0)
  • 1601745 docs: 2026-05-31 security re-audit (Phase-1 EXIT) + roadmap reconcile
  • parent: 8fafd5a..3a3362b (submodule pointer bumps)

Key facts

  • v0.3.0 release assets (public): https://git.azcomputerguru.com/azcomputerguru/guru-connect/releases/download/v0.3.0/guruconnect.exe (+ .sha256, CHANGELOG.md). sha256 bc4767f4d2088459b984f7b266f45b0678aad8edddbdab716bbf3c1ae8ee06ef
  • /api/version serves: latest_version 0.3.0, that download_url, is_mandatory false.
  • Gitea 1.25.2 workflow_dispatch API = 204 but no-op; use the web UI to trigger releases.
  • release.yml does NOT bump package-lock.json root version (cosmetic; npm ci tolerates the lag).
  • Running server binary self-reports 0.2.0 (built from 96f9c0a pre-bump); functionally == 0.3.0 (version-string only). Optional redeploy at the v0.3.0 tag for self-version consistency.

P1 sprint status

A1-A4 done+deployed (ghosts purged 19->8); B1/B2 done (v0.3.0 signed); C1 done (Phase-1 exit); D1 done (v0.3.0 published). ONLY C2 (#4) remains: live HW-H.264 cross-GPU validation (beast agent -> 5070 viewer), then decide DEFAULT_PREFER_H264 (stays false until validated). Not a Phase-1 blocker.