sync: auto-sync from GURU-KALI at 2026-05-25 13:49:31

Author: Mike Swanson
Machine: GURU-KALI
Timestamp: 2026-05-25 13:49:31
2026-05-25 13:49:31 -07:00
parent c5f7c73381
commit 0d1085b145
3 changed files with 198 additions and 114 deletions


@@ -1 +1 @@
-{"sessionId":"2158f2e7-8168-4859-b2cf-e0b05d6517b2","pid":18624,"acquiredAt":1779727871169}
+{"sessionId":"eda9a628-252f-4dd7-b4cf-1d987ea11512","pid":16195,"procStart":"259600","acquiredAt":1779740400025}


@@ -1088,115 +1088,199 @@ if let (Some(version), Some(arch)) = (
- 12:25 PT - Final compilation successful on Saturn
- 12:40 PT - Session log written, ready to sync
---
## Update: 12:55 PT - Dataforth ESXi License Recovery + Syncro Emergency Billing Skill
### User
- **User:** Mike Swanson (mike)
- **Machine:** GURU-5070
- **Role:** admin
- **Session span:** ~2026-05-24 evening to 2026-05-25 afternoon
### Session Summary
Session began as an emergency response: John Lehman texted after hours reporting VPN was down. Investigation via SSH (through D2TESTNAS at 192.168.0.9 as jump host) revealed AD1 and AD2 were offline because ESXi-122's 60-day evaluation license had expired, taking all VMs with it. ESXi-124 was also at risk. SSH was not running on ESXi-122, requiring DCUI physical console access to enable it first.
License recovery on ESXi-122 was accomplished by copying the hidden backup license file (`/etc/vmware/.#license.cfg`) over the active `license.cfg`, then restarting hostd. This resets the 60-day evaluation timer. ESXi-124 was treated preemptively with the same procedure. After license restoration, all four VMs on ESXi-122 (AD1, AD2, FILES-D1, PBX) were powered on. Both ESXi hosts were configured with a persistent monthly cron job (first Sunday of each month at 02:00) to auto-reset the license and reboot, written directly to `/var/spool/cron/crontabs/root` via paramiko SFTP and persisted through `/etc/rc.local.d/local.sh` since ESXi's filesystem is RAM-based.
A Syncro ticket was created (#32320) for the incident. The session then shifted to building out emergency/afterhours billing rules as a skill file (`syncro-emergency-billing.md`), researching Winter's historical tickets to establish the correct billing pattern. The key finding: block customers (Dataforth, VWP, Cascades) require two line items on the standard product (actual hours + 0.5x labeled "Afterhours rate") because block accounts track hours, not dollars; non-block customers use a single dedicated emergency product (26184, $262.50/hr).
Adding labor to the Dataforth ticket required discovering the correct Syncro API endpoint through trial and error: `/tickets/{id}/add_line_item` (not `/line_item`, `/line_items`, or top-level endpoints). Experimented on ACG internal test ticket #32321 to confirm payload format before touching the real ticket. Once confirmed, added 2.0hr main labor + 1.0hr afterhours premium to ticket #32320, then deleted the test ticket. The skill was then audited: live product rate fetch revealed two rate errors in the original draft ($150/hr not $175 for Remote Business and In-Shop Business), residential rates were removed as legacy, and the confirmed API method was documented with all required fields.
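The block vs. non-block billing rule from the summary can be sketched as a small helper. This is an illustration only, not the contents of the skill file: the function name, the dictionary shape, and the "Afterhours work" / "Emergency work" descriptions are assumptions; the product IDs and the "Afterhours rate" label come from the log.

```python
# Sketch of the emergency billing rule described above (names are hypothetical).
STANDARD_PRODUCT_ID = 1190473   # Labor - Remote Business (from the log)
EMERGENCY_PRODUCT_ID = 26184    # Emergency Business product (from the log)


def emergency_line_items(hours, is_block_customer):
    """Return the line items to bill for `hours` of after-hours work."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    if is_block_customer:
        # Block accounts track hours, so the premium is billed as extra hours:
        # actual hours plus 0.5x on the standard product.
        return [
            {"product_id": STANDARD_PRODUCT_ID, "quantity": hours,
             "description": "Afterhours work"},
            {"product_id": STANDARD_PRODUCT_ID, "quantity": hours * 0.5,
             "description": "Afterhours rate"},
        ]
    # Non-block customers get a single line on the dedicated emergency product.
    return [
        {"product_id": EMERGENCY_PRODUCT_ID, "quantity": hours,
         "description": "Emergency work"},
    ]


if __name__ == "__main__":
    for item in emergency_line_items(2.0, is_block_customer=True):
        print(item)
```

For the Dataforth incident (a block customer, 2.0 hours), this yields the same 2.0 hr + 1.0 hr split that was added to ticket #32320.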
### Key Decisions
- **ESXi crontab via SFTP, not shell**: ESXi has no `crontab` command. Wrote directly to `/var/spool/cron/crontabs/root` via paramiko SFTP; sent SIGHUP to crond after. Shell-based approaches (echo/heredoc) were tried first and failed.
- **local.sh persistence in Python, not shell**: `grep -c` through a shell command produced "0\n0" (grep output + fallback), causing false-positive match detection. Rewrote local.sh update logic using SFTP read/write in Python to avoid shell quoting/output ambiguity.
- **Test before touching real ticket**: Rather than guessing the Syncro line item payload format and hitting the real Dataforth ticket, opened a test ticket on ACG internal customer to confirm endpoint and required fields first.
- **Both `name` and `description` required**: Syncro's `add_line_item` endpoint returns 422 if either field is missing, which is not obvious from the API name. Documented explicitly.
- **Live rate fetch mandatory**: Memory note confirmed rates had been wrong before (2026-05-20 incident). Fetched all product rates live before finalizing the skill; found Remote Business ($150) and In-Shop Business ($150) were both documented as $175 in the original draft.
- **$262.50 emergency product covers all business work**: Confirmed with Mike that there is no distinction between remote and onsite emergency. One product for all business emergency billing regardless of service delivery method.
- **Residential rates are legacy**: Removed 42584 and 1190471 from all active sections of the skill; added to "Products NOT to Use."
### Problems Encountered
- **SSH not enabled on ESXi-122**: License expiration locks out management, so SSH had to be enabled via DCUI physical console before remote work was possible. No automated fix; required hands-on at the host.
- **`crontab` command missing on ESXi**: ESXi busybox environment does not include the `crontab` CLI. Fix: write the crontab file directly via SFTP.
- **`grep -c` false positive in local.sh check**: Shell command `grep -c 'pattern' file 2>/dev/null || echo 0` emitted both the grep count and the fallback "0": with no match, `grep -c` prints `0` and exits non-zero, so the `|| echo 0` fallback also runs. The Python string comparison then saw "0\n0" (truthy). Fixed by using SFTP to read and rewrite local.sh entirely in Python.
- **Syncro line item endpoint discovery**: No working documentation for the correct path. Tried `/line_item`, `/line_items`, PUT with `line_items_attributes`; all returned 404. Eventually fetched the Syncro Swagger spec from `api-docs.syncromsp.com/swagger.json` and found `add_line_item`.
- **422 on add_line_item with only `name` field**: Both `name` and `description` are required; omitting either returns 422.
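The `grep -c` problem above goes away once the "is the entry already there?" check runs on the file text in Python rather than on shell output. A minimal sketch of such an idempotent update follows. It assumes the script text has already been read (for example over SFTP) and that the script ends with an `exit 0` line that new entries should precede; the function name and the example line are hypothetical, not taken from the actual setup script.

```python
from typing import Tuple


def ensure_line(script_text: str, line: str) -> Tuple[str, bool]:
    """Return (new_text, changed); insert `line` only if it is not already present."""
    lines = script_text.splitlines()
    if line in lines:
        # Exact line match: no shell output to parse, no truthiness surprises.
        return script_text, False
    # Assumption: the script ends with `exit 0`, so insert before the last one.
    insert_at = len(lines)
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].strip() == "exit 0":
            insert_at = i
            break
    lines.insert(insert_at, line)
    return "\n".join(lines) + "\n", True


if __name__ == "__main__":
    original = "#!/bin/sh\n# local configuration\nexit 0\n"
    hypothetical_entry = "/bin/sh /vmfs/volumes/datastore1/restore_cron.sh"  # placeholder
    updated, changed = ensure_line(original, hypothetical_entry)
    print(changed)
    print(updated, end="")
    # A second pass is a no-op, which is what the shell check failed to guarantee.
    print(ensure_line(updated, hypothetical_entry)[1])
```

The shell version failed because any non-empty string, including "0\n0", is truthy in Python; comparing whole lines avoids interpreting command output at all.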
### Configuration Changes
- **Created:** `D:\claudetools\.claude\commands\syncro-emergency-billing.md`: Emergency/afterhours billing skill for Syncro (rules, billing scenarios, confirmed API method)
- **Modified:** `syncro-emergency-billing.md`: Rate corrections (Remote Business $150, In-Shop $150), residential removed as legacy, API section added
- **ESXi-122** (`192.168.0.122`): license.cfg restored, cron job written, local.sh updated, all VMs powered on
- **ESXi-124** (`192.168.0.124`): license.cfg restored preemptively, cron job written, local.sh updated
### Credentials & Secrets
- **D2TESTNAS (jump host):** `192.168.0.9` — root / `Paper123!@#` - **D2TESTNAS (jump host):** `192.168.0.9` — root / `Paper123!@#`
- **ESXi root password (both hosts):** `Gptf*77ttb!@#!@#` - **ESXi root password (both hosts):** `Gptf*77ttb!@#!@#`
- **Syncro API key:** `T259810e5c9917386b-52c2aeea7cdb5ff41c6685a73cebbeb3` — vault: `msp-tools/syncro.sops.yaml` → `credentials.credential` - **Syncro API key:** `T259810e5c9917386b-52c2aeea7cdb5ff41c6685a73cebbeb3` — vault: `msp-tools/syncro.sops.yaml` → `credentials.credential`
### Infrastructure & Servers
| Host | IP | Role | Notes |
|---|---|---|---|
| D2TESTNAS | 192.168.0.9 | Jump host / NAS | SSH root access; used as paramiko jump for ESXi |
| ESXi-122 | 192.168.0.122 | Hypervisor | Datastore: `datastore1`; hosts AD1, AD2, FILES-D1, PBX |
| ESXi-124 | 192.168.0.124 | Hypervisor | Datastore: `Backup`; treated preemptively |
| AD1 | (on ESXi-122) | Domain Controller | Was offline due to license expiry; restored |
| AD2 | (on ESXi-122) | Domain Controller | Was offline; restored |
| FILES-D1 | (on ESXi-122) | File server | Was offline; restored |
| PBX | (on ESXi-122) | Phone system | Was offline; restored |
ESXi license reset script locations:
- ESXi-122: `/vmfs/volumes/datastore1/license_reset.sh`
- ESXi-124: `/vmfs/volumes/Backup/license_reset.sh`
Cron schedule (both hosts): `0 2 * * 0 [ $(date +%d) -le 7 ] && <script> >> /tmp/license_reset.log 2>&1`
Persistence: `/etc/rc.local.d/local.sh` restores the crontab entry on each boot.
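The cron entry above pairs a Sunday-only schedule (`* * 0`) with a day-of-month guard (`-le 7`). A short Python check of why that combination selects exactly the first Sunday of each month; the function names are illustrative and not part of the scripts on the hosts.

```python
import calendar
from datetime import date


def is_first_sunday(d: date) -> bool:
    """Mirror the cron guard: a Sunday whose day-of-month is 7 or lower."""
    # date.weekday() numbers Monday as 0, so Sunday is 6.
    return d.weekday() == 6 and d.day <= 7


def first_sundays(year: int):
    """List every date in `year` on which the guarded cron entry would run."""
    hits = []
    for month in range(1, 13):
        days_in_month = calendar.monthrange(year, month)[1]
        for day in range(1, days_in_month + 1):
            d = date(year, month, day)
            if is_first_sunday(d):
                hits.append(d)
    return hits


if __name__ == "__main__":
    for d in first_sundays(2026):
        print(d.isoformat())
```

Any seven consecutive days contain exactly one Sunday, so days 1 through 7 always yield one run per month.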
### Commands & Outputs
```bash
# ESXi license reset (run on each host via SSH)
cp /etc/vmware/.#license.cfg /etc/vmware/license.cfg
/etc/init.d/hostd restart

# Verify license state
vim-cmd vimsvc/license --show | grep -E 'serial|diagnostic|expirationHours'

# Add line item to existing Syncro ticket (confirmed working 2026-05-25)
curl -s -X POST "https://computerguru.syncromsp.com/api/v1/tickets/{ticket_id}/add_line_item" \
  -H "Authorization: <api_key>" \
  -H "Content-Type: application/json" \
  -d '{"product_id":1190473,"name":"Labor - Remote Business","description":"Work description","quantity":2.0,"price":0.0,"taxable":false}'

# Fetch live product rate before billing non-block
curl -s "https://computerguru.syncromsp.com/api/v1/products/{product_id}" \
  -H "Authorization: <api_key>" | jq '.product.price_retail'
```
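The same `add_line_item` call can be prepared in Python with the `name`/`description` requirement checked client-side, so a missing field fails before the request is sent rather than as a 422. This is a sketch under assumptions: the helper name is hypothetical, the `Authorization` header simply mirrors the curl placeholder above, and the path is assumed to take the internal ticket ID (110958232) rather than the ticket number, which the log does not state explicitly. The request is only built here, not sent.

```python
import json
import urllib.request

BASE_URL = "https://computerguru.syncromsp.com/api/v1"
REQUIRED_FIELDS = ("name", "description")  # both required per the 422 finding


def build_add_line_item_request(ticket_id, api_key, item):
    """Validate `item` and return an unsent POST request for add_line_item."""
    missing = [f for f in REQUIRED_FIELDS if not item.get(f)]
    if missing:
        raise ValueError("missing required field(s): " + ", ".join(missing))
    return urllib.request.Request(
        f"{BASE_URL}/tickets/{ticket_id}/add_line_item",
        data=json.dumps(item).encode("utf-8"),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_add_line_item_request(
        110958232,
        "<api_key>",  # placeholder, as in the curl example
        {
            "product_id": 1190473,
            "name": "Labor - Remote Business",
            "description": "Work description",
            "quantity": 2.0,
            "price": 0.0,
            "taxable": False,
        },
    )
    print(request.get_method(), request.full_url)
    # Sending would be: urllib.request.urlopen(request)
```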
Dataforth ticket #32320 (ID: 110958232), line items added:
- ID 42571127: Labor - Remote Business, 2.0 hr, "Afterhours remote - John Lehman reported VPN down..."
- ID 42571130: Labor - Remote Business, 1.0 hr, "Afterhours rate"
### Pending / Incomplete Tasks
None. Ticket is complete, skill is complete, ESXi cron is configured and persistent.
### Reference Information
- **Syncro ticket:** #32320 (ID: 110958232), "Afterhours - VMware ESXi - Evaluation License Expired / VMs Down", Dataforth Corporation
- **Syncro test ticket deleted:** #32321 (ID: 110961873), ACG internal customer
- **Reference invoice:** 67594 (VWP block customer emergency billing example, 2026-05-12)
- **Reference ticket:** #32269 (VWP, block emergency billing reference)
- **Syncro add_line_item endpoint:** `POST /api/v1/tickets/{id}/add_line_item`
- **Syncro product IDs:** 1190473 (Remote Business $150), 26118 (Onsite $175), 573881 (In-Shop $150), 26184 (Emergency Business $262.50)
- **Python scripts (Temp):**
  - `C:\Users\guru\AppData\Local\Temp\esxi_schedule_monthly_reset_v2.py`: final cron setup script (SFTP method)
  - `C:\Users\guru\AppData\Local\Temp\esxi_schedule_monthly_reset.py`: v1 (heredoc method, superseded)
  - `C:\Users\guru\AppData\Local\Temp\esxi124_hostd_restart.py`: hostd restart + verification
---
## Update: 13:48 MST — GuruRMM CRITICAL auth fix + run_analysis UX fix DEPLOYED; migration incident recovered (Mike Swanson / GURU-KALI)
### Session Summary
Continuation of the 09:34 GURU-KALI session (audit + submodule fixes). First corrected the CLAUDE.md guru-rmm submodule wording (it called the submodule a "stale reference copy"; it actually tracks the active azcomputerguru/gururmm repo, pinned commit just lags main) — committed `f2ece8e`.
Then implemented and DEPLOYED the two CRITICAL auth findings from the morning's audit. Root cause: the server has no router-level auth — every route is gated only by whether its handler includes the `AuthUser` extractor, and `metrics.rs` + `logs.rs` omitted it, leaving per-agent and fleet-wide metrics/logs anonymously readable (plus `/logs/analyze` firing an outbound LLM call and `/agents/:id/logs/request` commanding agents). Coding Agent (opus) added `AuthUser` to all 8 handlers, scoping per-agent endpoints to the caller's orgs (matching the `get_agent` pattern), fleet aggregates require-auth + `TODO(authz)`, and `run_analysis` admin-only. Code Review APPROVED. Merged to main (`1d5a08f`), deployed via build-server.sh as v0.3.14, verified anon -> 401 on all six endpoints (login still 422, so public routes intact).
Mike then asked to fix the run_analysis UX regression (admin-only `/logs/analyze` 403'd non-admin techs doing per-agent analysis). Coding Agent relaxed it: per-agent analysis (agent_id present) -> `authorize_agent_access` org check; fleet (no agent_id) stays admin-only; dashboard hides the fleet Analyze button for non-admins (`useAuth` role check matching backend `is_admin()`). Reviewed APPROVED, merged (`7be2f52`).
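The relaxed `run_analysis` rule (per-agent requests pass an org check, fleet-wide requests stay admin-only) can be modeled in a few lines. This is a Python model for illustration, not the Rust handler: the `User` type and function name are invented here, and the sketch applies the org check to every caller, without modeling any admin bypass the real `authorize_agent_access` may have.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class User:
    """Hypothetical stand-in for the authenticated caller."""
    is_admin: bool
    org_ids: FrozenSet[str] = field(default_factory=frozenset)


def may_run_analysis(user: User, agent_org_id: Optional[str]) -> bool:
    """Decide whether `user` may trigger log analysis.

    `agent_org_id` is the org owning the targeted agent, or None when the
    request has no agent_id (fleet-wide analysis).
    """
    if agent_org_id is None:
        # Fleet-wide analysis stays admin-only.
        return user.is_admin
    # Per-agent analysis: the caller must belong to the agent's org.
    return agent_org_id in user.org_ids


if __name__ == "__main__":
    tech = User(is_admin=False, org_ids=frozenset({"org-a"}))
    print(may_run_analysis(tech, "org-a"))  # per-agent, own org
    print(may_run_analysis(tech, "org-b"))  # per-agent, other org
    print(may_run_analysis(tech, None))     # fleet-wide, non-admin
```

The dashboard change described above follows the same split: the fleet Analyze button is hidden exactly when the fleet branch would deny the request.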
Deploying run_analysis surfaced that main did not compile — the unrelated crash-detection health-monitoring feature (`health.rs`, committed earlier today under the shared azcomputerguru account) had a type error. Per Mike's choice, coordinated with the owner (GURU-5070) via coord message rather than fixing it. This also exposed a hostname issue: I'd addressed the message to the stale `DESKTOP-0O8A1RL` session id (the retired hostname); re-sent to `GURU-5070/claude-main` + a fallback. GURU-5070 launched a fleet-wide identity audit in response; GURU-KALI verified clean (identity.json user=mike/machine=GURU-KALI, git user.name normalized to "Mike Swanson", in known_machines) and replied.
GURU-5070 committed a health.rs fix (`42790f5`) but it was incomplete — it assumed os_type AND architecture are non-null String; per migrations + .sqlx, os_type IS NOT NULL but architecture is nullable, so `&crashed.architecture` gave E0308. Fixed forward (`646eb0a`: as_deref() on version_to + architecture, &os_type direct) — the first version of this code with a verified-clean cargo check; reviewed, merged. Deploying via build-server.sh then hit a MIGRATION INCIDENT and brief outage: migration 046 (safe_rollout) had been applied to the DB out-of-band (3 tables existed) but never recorded in `_sqlx_migrations`, so the new binary crash-looped on boot ("relation update_rollouts already exists"). Since build-server.sh stops the old service before validating the new binary, the server went down. Database Agent recovered: confirmed all 3 tables empty (0 rows, no FK deps), dropped them, restart -> sqlx ran 046 fresh + recorded it. Server v0.3.22 live; dashboard redeployed; anon -> 401 confirmed; no data lost.
### Key Decisions
- **Coordinate vs. fix the health.rs blocker:** initially coordinated with GURU-5070 (Mike's choice, to avoid stepping on WIP). After their committed fix was still broken and they'd declared "done" (no active WIP), fixed it forward — aligned with Mike's "resume the deploy" intent.
- **Database recovery = drop empty tables, not checksum-insert:** Database Agent chose dropping the 3 empty tables (letting sqlx re-run 046 and self-record) over manually inserting a `_sqlx_migrations` row — avoids a fragile hand-computed SHA-384 and eliminates any out-of-band schema drift. Safe only because all 3 tables were empty.
- **Branch-not-main for the audit report; non-main pushes don't build:** verified the webhook builds on `refs/heads/main` only with no path filtering — so the audit branch and feature branches don't trigger builds; merging to main does.
- **Delegated all code/DB/git through agents (opus for auth/migration/security):** coordinator never hand-edited production code or ran DB writes; mandatory Code Review on every change caught that even my own prescribed health.rs fix was wrong.
### Problems Encountered
- **Self-inflicted git race (first run_analysis server build):** ran build-server.sh right after the merge push, which had triggered the webhook build on the same /home/guru/gururmm repo; concurrent `git reset --hard` left a stale tree and a false build failure. Fix: always check for in-flight builds before build-server.sh; resolved by waiting for idle.
- **health.rs compile saga (3 attempts):** original .as_ref() tuple (E0277 x3) -> GURU-5070's partial fix (E0308, architecture nullable) -> correct fix `646eb0a` (as_deref on the two Option fields). Root issue: nobody ran a clean `cargo check` before committing the prior attempts.
- **Migration 046 unrecorded -> crash-loop + outage:** see summary; recovered by Database Agent. Lesson sent to GURU-5070: don't apply migration SQL manually during dev; let the server apply via sqlx.
- **Coord message misaddressed to retired hostname:** DESKTOP-0O8A1RL is retired (now GURU-5070); re-sent + fallback. Triggered the fleet identity audit.
- **Public dashboard 403:** Cloudflare bot-mitigation on a server-side curl, not an nginx/deploy fault (origin serves the new bundle at local 200).
### Configuration Changes
- claudetools `f2ece8e` — `.claude/CLAUDE.md` guru-rmm submodule wording corrected.
- gururmm `1d5a08f` — `server/src/api/metrics.rs` + `logs.rs`: AuthUser on 8 handlers (CRITICAL auth fix).
- gururmm `7be2f52` — `server/src/api/logs.rs` (run_analysis per-agent authz) + `dashboard/src/pages/Logs.tsx` (hide fleet Analyze for non-admins).
- gururmm `646eb0a` — `server/src/updates/health.rs`: as_deref() fix for nullable Option fields (follow-up to GURU-5070's `42790f5`).
- DB: dropped + sqlx-recreated `update_rollouts`, `update_health_metrics`, `agent_update_events`; migration 046 now recorded in `_sqlx_migrations`.
- Deployed: gururmm-server v0.3.22 (`/opt/gururmm/gururmm-server`); dashboard rebuilt + copied to `/var/www/gururmm/dashboard/` (bundle `index-DUF78gxN.js`).
- `.claude/current-mode` -> infra during deploy.
### Credentials & Secrets
- No new credentials. Build server DB access via `DATABASE_URL` in `/home/guru/.cargo/env` (build server builds ONLINE, which is why health.rs query! macros validated against the live DB). GuruRMM API admin creds: vault `infrastructure/gururmm-server.sops.yaml`.
### Infrastructure & Servers
- gururmm-server: `172.16.3.30:3001`, systemd `gururmm-server`, binary `/opt/gururmm/gururmm-server` (the `/usr/local/bin` path in old CONTEXT.md is stale). Running **v0.3.22**.
- Server deploy = MANUAL `sudo /opt/gururmm/build-server.sh` (git reset --hard origin/main -> cargo build --release -> stop/cp/start). NOT triggered by the webhook (webhook = agents only). **Latent bug:** stops the service BEFORE validating the new binary's migrations -> a bad migration causes an outage; also doesn't check `git reset` exit code (race) and has no build lock.
- Dashboard: nginx serves `/var/www/gururmm/dashboard` (root-owned, server_name _); `/api/` proxied to `:3001`; second vhost `server_name rmm-api.azcomputerguru.com`. Dashboard `API_BASE_URL` defaults to `https://rmm-api.azcomputerguru.com` (no .env), so a plain `npm run build` is correct for prod. Public `rmm.azcomputerguru.com` is behind Cloudflare (IPv6 2606:4700; 403s bare curls via bot-mitigation).
- DB: PostgreSQL `localhost:5432/gururmm` on .30. `_sqlx_migrations` now at version 46.
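Two of the `build-server.sh` gaps noted above (no build lock, unchecked exit codes) can be illustrated with a small wrapper. This is a sketch of the idea in Python, not a replacement for the script: the lock path and step list are hypothetical, `fcntl` makes it Unix-only, and migration validation is deliberately left out because the log does not say how it should be done.

```python
import fcntl
import os
import subprocess
from contextlib import contextmanager


@contextmanager
def build_lock(path):
    """Hold an exclusive, non-blocking flock on `path` while the block runs."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError as exc:
            raise RuntimeError(f"another build holds {path}") from exc
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock


def run_steps(steps):
    """Run (name, argv) steps in order and stop at the first non-zero exit."""
    completed = []
    for name, argv in steps:
        # check=True raises CalledProcessError, so later steps never run.
        subprocess.run(argv, check=True)
        completed.append(name)
    return completed


# Hypothetical ordering: everything that can fail runs before the service stops.
DEPLOY_STEPS = [
    ("fetch", ["git", "fetch", "origin", "main"]),
    ("reset", ["git", "reset", "--hard", "origin/main"]),
    ("build", ["cargo", "build", "--release"]),
    ("stop", ["systemctl", "stop", "gururmm-server"]),
]

if __name__ == "__main__":
    with build_lock("/tmp/gururmm-build.lock"):  # hypothetical lock path
        print("lock acquired; run_steps(DEPLOY_STEPS) would go here")
```

With this shape, a concurrent webhook build fails fast on the lock instead of racing `git reset --hard`, and a failed reset or build aborts before the service is touched.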
### Commands & Outputs
```bash
# Server deploy (manual, intended path):
ssh guru@172.16.3.30 'sudo /opt/gururmm/build-server.sh' # ~4min build, then stop/cp/start
# Dashboard deploy:
ssh guru@172.16.3.30 'cd /home/guru/gururmm/dashboard && npm ci && npm run build && sudo cp -r dist/* /var/www/gururmm/dashboard/'
# Migration recovery (Database Agent, after confirming 3 tables empty):
# BEGIN; <guard: raise if any rows>; DROP TABLE IF EXISTS update_rollouts, update_health_metrics, agent_update_events CASCADE; COMMIT;
# then systemctl restart gururmm-server -> sqlx runs 046 fresh + records it
# Smoke test (auth enforcement live):
curl -s -o /dev/null -w '%{http_code}' http://localhost:3001/api/metrics/summary # -> 401
curl -s -o /dev/null -w '%{http_code}' http://localhost:3001/status # -> 200
```
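The guarded drop in the migration recovery above ("raise if any rows", then drop) can be sketched as follows. The production database is PostgreSQL; this example uses Python's built-in `sqlite3` purely as a stand-in so it runs anywhere, and the function name is hypothetical. Only the three table names come from the log.

```python
import sqlite3

TABLES = ("update_rollouts", "update_health_metrics", "agent_update_events")


def drop_if_all_empty(conn, tables=TABLES):
    """Drop `tables` in one transaction, refusing if any of them has rows."""
    cur = conn.cursor()
    cur.execute("BEGIN")
    try:
        # Guard first: check every table before dropping anything.
        # Table names are fixed constants here, never user input.
        for table in tables:
            (count,) = cur.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
            if count:
                raise RuntimeError(f"{table} has {count} row(s); refusing to drop")
        for table in tables:
            cur.execute(f'DROP TABLE IF EXISTS "{table}"')
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")
        raise


if __name__ == "__main__":
    # isolation_level=None: the connection issues no implicit BEGIN/COMMIT.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    for table in TABLES:
        conn.execute(f'CREATE TABLE "{table}" (id INTEGER)')
    drop_if_all_empty(conn)
    remaining = conn.execute(
        "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table'"
    ).fetchone()[0]
    print("tables remaining:", remaining)
```

Running the guard and the drops in one transaction means a non-empty table aborts the whole operation with nothing dropped, which matches the "safe only because all 3 tables were empty" condition in the Key Decisions.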
### Pending / Incomplete Tasks
- **HIGH follow-ups from the audit (not started):** validate Entra SSO ID-token signature (`sso.rs:212`); auth+scope the agent-status SSE (`agents.rs:583`); add `client_id`/`update_channel` to the agent response structs (dead frontend links); org-scope the 3 fleet endpoints (`/metrics/summary`, `/logs`, `/logs/analysis` — TODO(authz), need client_ids-filtered queries); mac build gate stuck (mac builder offline since Pluto outage).
- **Structural:** add a router-level auth layer so "public" is opt-in (kills the missing-AuthUser bug class).
- **Hand to GURU-5070 (coord msg 2d518a70):** don't apply migration SQL manually; harden build-server.sh (validate migrations before service swap; check git reset exit; add build lock); `046_safe_rollout.sql` header comment mislabeled "Migration 045".
- Audit report still only on branch `audit/2026-05-25-rmm-audit` (merge to main when bundling code).
### Reference Information
- gururmm commits: `1d5a08f` (CRITICAL auth), `7be2f52` (run_analysis), `646eb0a` (health fix), `42790f5` (GURU-5070 partial health fix). Audit report: `reports/2026-05-25-rmm-audit.md` on branch `audit/2026-05-25-rmm-audit` (`da1d4ee`).
- claudetools commits: `413df93` (sync.sh submodule fix + solverbot removal), `f2ece8e` (CLAUDE.md wording).
- Coord: component `gururmm/server` = deployed 0.3.22. Messages: `16aa12fb`/`74a1a3e5` (build-blocked to GURU-5070 + DESKTOP fallback), `b99f718c` (identity check-in reply), `2d518a70` (deploy-done + lessons). DESKTOP-0O8A1RL retired; GURU-5070 is Mike's current session id.
- Audit tally: 61 findings (2 critical [both now FIXED+deployed], 10 high, 16 medium, 7 low, 26 info).