sync: auto-sync from GURU-KALI at 2026-05-25 13:49:31

Author: Mike Swanson
Machine: GURU-KALI
Timestamp: 2026-05-25 13:49:31
2026-05-25 13:49:31 -07:00
parent c5f7c73381
commit 0d1085b145
3 changed files with 198 additions and 114 deletions

@@ -1 +1 @@
{"sessionId":"2158f2e7-8168-4859-b2cf-e0b05d6517b2","pid":18624,"acquiredAt":1779727871169}
{"sessionId":"eda9a628-252f-4dd7-b4cf-1d987ea11512","pid":16195,"procStart":"259600","acquiredAt":1779740400025}

@@ -1088,115 +1088,199 @@ if let (Some(version), Some(arch)) = (
- 12:25 PT - Final compilation successful on Saturn
- 12:40 PT - Session log written, ready to sync
---
## Update: 12:55 PT — Dataforth ESXi License Recovery + Syncro Emergency Billing Skill
### User
- **User:** Mike Swanson (mike)
- **Machine:** GURU-5070
- **Role:** admin
- **Session span:** ~2026-05-24 evening to 2026-05-25 afternoon
### Session Summary
Session began as an emergency response: John Lehman texted after hours reporting VPN was down. Investigation via SSH (through D2TESTNAS at 192.168.0.9 as jump host) revealed AD1 and AD2 were offline because ESXi-122's 60-day evaluation license had expired, taking all VMs with it. ESXi-124 was also at risk. SSH was not running on ESXi-122, requiring DCUI physical console access to enable it first.
License recovery on ESXi-122 was accomplished by copying the hidden backup license file (`/etc/vmware/.#license.cfg`) over the active `license.cfg`, then restarting hostd. This resets the 60-day evaluation timer. ESXi-124 was treated preemptively with the same procedure. After license restoration, all four VMs on ESXi-122 (AD1, AD2, FILES-D1, PBX) were powered on. Both ESXi hosts were configured with a persistent monthly cron job (first Sunday of each month at 02:00) to auto-reset the license and reboot, written directly to `/var/spool/cron/crontabs/root` via paramiko SFTP and persisted through `/etc/rc.local.d/local.sh` since ESXi's filesystem is RAM-based.
A Syncro ticket was created (#32320) for the incident. The session then shifted to building out emergency/afterhours billing rules as a skill file (`syncro-emergency-billing.md`), researching Winter's historical tickets to establish the correct billing pattern. The key finding: block customers (Dataforth, VWP, Cascades) require two line items on the standard product (actual hours + 0.5x labeled "Afterhours rate") because block accounts track hours not dollars; non-block customers use a single dedicated emergency product (26184, $262.50/hr).
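The block versus non-block rule above can be sketched as a small function. This is a minimal illustration, not the contents of the skill file: the `LineItem` class and the `emergency_line_items` helper are hypothetical names, and only the product ID 26184, the 0.5x multiplier, and the "Afterhours rate" label come from the log.

```python
from dataclasses import dataclass
from typing import List

EMERGENCY_PRODUCT_ID = 26184   # Emergency Business product, per the log
AFTERHOURS_MULTIPLIER = 0.5    # extra hours billed to block customers


@dataclass
class LineItem:
    product_id: int
    quantity: float            # hours
    description: str


def emergency_line_items(hours: float, is_block: bool,
                         standard_product_id: int,
                         work_description: str) -> List[LineItem]:
    """Return the line items for one emergency/afterhours job (hypothetical helper)."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    if is_block:
        # Block accounts track hours, not dollars, so the premium is
        # expressed as additional hours on the same standard product.
        return [
            LineItem(standard_product_id, hours, work_description),
            LineItem(standard_product_id, hours * AFTERHOURS_MULTIPLIER,
                     "Afterhours rate"),
        ]
    # Non-block customers get a single line on the dedicated emergency product.
    return [LineItem(EMERGENCY_PRODUCT_ID, hours, work_description)]
```

For the Dataforth incident (block customer, 2.0 hours on product 1190473) this yields a 2.0 hr line plus a 1.0 hr "Afterhours rate" line, which matches the two line items recorded for ticket #32320.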
Adding labor to the Dataforth ticket required discovering the correct Syncro API endpoint through trial and error — `/tickets/{id}/add_line_item` (not `/line_item`, `/line_items`, or top-level endpoints). Experimented on ACG internal test ticket #32321 to confirm payload format before touching the real ticket. Once confirmed, added 2.0hr main labor + 1.0hr afterhours premium to ticket #32320, then deleted the test ticket. The skill was then audited: live product rate fetch revealed two rate errors in the original draft ($150/hr not $175 for Remote Business and In-Shop Business), residential rates were removed as legacy, and the confirmed API method was documented with all required fields.
### Key Decisions
- **ESXi crontab via SFTP, not shell**: ESXi has no `crontab` command. Wrote directly to `/var/spool/cron/crontabs/root` via paramiko SFTP; sent SIGHUP to crond after. Shell-based approaches (echo/heredoc) were tried first and failed.
- **local.sh persistence in Python, not shell**: `grep -c` through a shell command produced "0\n0" (grep output + fallback), causing false-positive match detection. Rewrote local.sh update logic using SFTP read/write in Python to avoid shell quoting/output ambiguity.
- **Test before touching real ticket**: Rather than guessing the Syncro line item payload format and hitting the real Dataforth ticket, opened a test ticket on ACG internal customer to confirm endpoint and required fields first.
- **Both `name` and `description` required**: Syncro's `add_line_item` endpoint returns 422 if either field is missing — not obvious from the API name. Documented explicitly.
- **Live rate fetch mandatory**: Memory note confirmed rates had been wrong before (2026-05-20 incident). Fetched all product rates live before finalizing the skill; found Remote Business ($150) and In-Shop Business ($150) were both documented as $175 in the original draft.
- **$262.50 emergency product covers all business work**: Confirmed with Mike — no distinction between remote and onsite emergency. One product for all business emergency billing regardless of service delivery method.
- **Residential rates are legacy**: Removed 42584 and 1190471 from all active sections of the skill; added to "Products NOT to Use."
### Problems Encountered
- **SSH not enabled on ESXi-122**: License expiration locks out management — had to enable SSH via DCUI physical console before remote work was possible. No automated fix; required hands-on at the host.
- **`crontab` command missing on ESXi**: ESXi busybox environment does not include the `crontab` CLI. Fix: write the crontab file directly via SFTP.
- **`grep -c` false positive in local.sh check**: When nothing matches, `grep -c` prints `0` and also exits with status 1, so the shell command `grep -c 'pattern' file 2>/dev/null || echo 0` emitted both the grep count and the fallback "0". The Python string comparison then saw "0\n0" (truthy) and reported a match that was not there. Fixed by using SFTP to read and rewrite local.sh entirely in Python.
- **Syncro line item endpoint discovery**: No working documentation for the correct path. Tried `/line_item`, `/line_items`, PUT with `line_items_attributes` — all 404. Eventually fetched the Syncro Swagger spec from `api-docs.syncromsp.com/swagger.json` and found `add_line_item`.
- **422 on add_line_item with only `name` field**: Both `name` and `description` are required; omitting either returns 422.
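The `grep -c` problem above goes away once the check is done on the file text itself. The following is a minimal sketch of that idea, not the actual script: `ensure_line_before_exit` is a hypothetical helper, it works on plain strings rather than a paramiko SFTP handle, and it assumes local.sh ends with an `exit 0` line.

```python
def ensure_line_before_exit(script_text: str, marker: str, new_line: str) -> str:
    """Add new_line to a shell script unless a line containing marker exists.

    Hypothetical helper: the real script read and wrote the file over SFTP;
    here the file content is passed in and returned as a string.
    """
    lines = script_text.splitlines()
    if any(marker in line for line in lines):
        return script_text              # already present, leave untouched
    # Insert before the last "exit 0" so the new line actually runs.
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].strip() == "exit 0":
            lines.insert(i, new_line)
            break
    else:
        lines.append(new_line)          # no "exit 0" found: append instead
    return "\n".join(lines) + "\n"


# Why the shell check misled the caller: the combined output is not "0",
# and any non-empty string is truthy in Python.
shell_output = "0\n0"
assert shell_output != "0" and bool(shell_output)
```

Because the function returns the input unchanged when the marker is found, running it on every boot or every deploy does not duplicate the entry.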
### Configuration Changes
- **Created:** `D:\claudetools\.claude\commands\syncro-emergency-billing.md` — Emergency/afterhours billing skill for Syncro (rules, billing scenarios, confirmed API method)
- **Modified:** `syncro-emergency-billing.md` — Rate corrections (Remote Business $150, In-Shop $150), residential removed as legacy, API section added
- **ESXi-122** (`192.168.0.122`): license.cfg restored, cron job written, local.sh updated, all VMs powered on
- **ESXi-124** (`192.168.0.124`): license.cfg restored preemptively, cron job written, local.sh updated
### Credentials & Secrets
- **D2TESTNAS (jump host):** `192.168.0.9` — root / `Paper123!@#`
- **ESXi root password (both hosts):** `Gptf*77ttb!@#!@#`
- **Syncro API key:** `T259810e5c9917386b-52c2aeea7cdb5ff41c6685a73cebbeb3` — vault: `msp-tools/syncro.sops.yaml` → `credentials.credential`
### Infrastructure & Servers
| Host | IP | Role | Notes |
|---|---|---|---|
| D2TESTNAS | 192.168.0.9 | Jump host / NAS | SSH root access; used as paramiko jump for ESXi |
| ESXi-122 | 192.168.0.122 | Hypervisor | Datastore: `datastore1`; hosts AD1, AD2, FILES-D1, PBX |
| ESXi-124 | 192.168.0.124 | Hypervisor | Datastore: `Backup`; treated preemptively |
| AD1 | (on ESXi-122) | Domain Controller | Was offline due to license expiry; restored |
| AD2 | (on ESXi-122) | Domain Controller | Was offline; restored |
| FILES-D1 | (on ESXi-122) | File server | Was offline; restored |
| PBX | (on ESXi-122) | Phone system | Was offline; restored |
ESXi license reset script locations:
- ESXi-122: `/vmfs/volumes/datastore1/license_reset.sh`
- ESXi-124: `/vmfs/volumes/Backup/license_reset.sh`
Cron schedule (both hosts): `0 2 * * 0 [ $(date +%d) -le 7 ] && <script> >> /tmp/license_reset.log 2>&1`
Persistence: `/etc/rc.local.d/local.sh` — restores crontab entry on each boot.
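The schedule combines a weekly cron trigger (`0 2 * * 0`, Sundays at 02:00) with a day-of-month guard, and the two together select the first Sunday of each month. A short sketch of that logic, using hypothetical function names, makes it easy to check which dates would fire:

```python
import calendar
from datetime import date
from typing import List


def cron_would_fire(d: date) -> bool:
    """Mirror the cron entry: Sunday trigger plus the `-le 7` day guard."""
    is_sunday = d.weekday() == 6    # Python: Monday=0 ... Sunday=6 (cron uses 0 for Sunday)
    in_first_week = d.day <= 7      # same test as [ $(date +%d) -le 7 ]
    return is_sunday and in_first_week


def firing_days(year: int, month: int) -> List[int]:
    """Days of the given month on which the guarded job would run."""
    last_day = calendar.monthrange(year, month)[1]
    return [day for day in range(1, last_day + 1)
            if cron_would_fire(date(year, month, day))]
```

Every month contains exactly one Sunday among days 1 to 7, so the job runs once per month; for June 2026 that is the 7th.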
### Commands & Outputs
```bash
# ESXi license reset (run on each host via SSH)
cp /etc/vmware/.#license.cfg /etc/vmware/license.cfg
/etc/init.d/hostd restart

# Verify license state
vim-cmd vimsvc/license --show | grep -E 'serial|diagnostic|expirationHours'

# Add line item to existing Syncro ticket (confirmed working 2026-05-25)
curl -s -X POST "https://computerguru.syncromsp.com/api/v1/tickets/{ticket_id}/add_line_item" \
  -H "Authorization: <api_key>" \
  -H "Content-Type: application/json" \
  -d '{"product_id":1190473,"name":"Labor - Remote Business","description":"Work description","quantity":2.0,"price":0.0,"taxable":false}'

# Fetch live product rate before billing non-block
curl -s "https://computerguru.syncromsp.com/api/v1/products/{product_id}" \
  -H "Authorization: <api_key>" | jq '.product.price_retail'
```
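The same `add_line_item` call can be prepared from Python with a local check for the two fields that caused the 422. This is a sketch under assumptions: `build_add_line_item_request` is a hypothetical helper, the payload fields and the bare `Authorization` header mirror the curl command above, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_add_line_item_request(base_url: str, ticket_id: int, api_key: str, *,
                                product_id: int, name: str, description: str,
                                quantity: float, price: float = 0.0,
                                taxable: bool = False) -> urllib.request.Request:
    """Build (but do not send) a POST to /tickets/{id}/add_line_item."""
    # The log reports a 422 when either field is missing, so fail early.
    if not name or not description:
        raise ValueError("both name and description are required")
    payload = {
        "product_id": product_id,
        "name": name,
        "description": description,
        "quantity": quantity,
        "price": price,
        "taxable": taxable,
    }
    return urllib.request.Request(
        f"{base_url}/api/v1/tickets/{ticket_id}/add_line_item",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": api_key,
                 "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the call. Keeping construction separate from sending lets the payload be inspected, or pointed at a test ticket, before anything touches a real one.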
Dataforth ticket #32320 (ID: 110958232) — line items added:
- ID 42571127: Labor - Remote Business, 2.0 hr, "Afterhours remote — John Lehman reported VPN down..."
- ID 42571130: Labor - Remote Business, 1.0 hr, "Afterhours rate"
### Pending / Incomplete Tasks
None. Ticket is complete, skill is complete, ESXi cron is configured and persistent.
### Reference Information
- **Syncro ticket:** #32320 (ID: 110958232) — "Afterhours - VMware ESXi - Evaluation License Expired / VMs Down" — Dataforth Corporation
- **Syncro test ticket deleted:** #32321 (ID: 110961873) — ACG internal customer
- **Reference invoice:** 67594 (VWP block customer emergency billing example, 2026-05-12)
- **Reference ticket:** #32269 (VWP, block emergency billing reference)
- **Syncro add_line_item endpoint:** `POST /api/v1/tickets/{id}/add_line_item`
- **Syncro product IDs:** 1190473 (Remote Business $150), 26118 (Onsite $175), 573881 (In-Shop $150), 26184 (Emergency Business $262.50)
- **Python scripts (Temp):**
- `C:\Users\guru\AppData\Local\Temp\esxi_schedule_monthly_reset_v2.py` — final cron setup script (SFTP method)
- `C:\Users\guru\AppData\Local\Temp\esxi_schedule_monthly_reset.py` — v1 (heredoc method, superseded)
- `C:\Users\guru\AppData\Local\Temp\esxi124_hostd_restart.py` — hostd restart + verification
---
## Update: 13:48 MST — GuruRMM CRITICAL auth fix + run_analysis UX fix DEPLOYED; migration incident recovered (Mike Swanson / GURU-KALI)
### Session Summary
Continuation of the 09:34 GURU-KALI session (audit + submodule fixes). First corrected the CLAUDE.md guru-rmm submodule wording (it called the submodule a "stale reference copy"; it actually tracks the active azcomputerguru/gururmm repo, pinned commit just lags main) — committed `f2ece8e`.
Then implemented and DEPLOYED the two CRITICAL auth findings from the morning's audit. Root cause: the server has no router-level auth — every route is gated only by whether its handler includes the `AuthUser` extractor, and `metrics.rs` + `logs.rs` omitted it, leaving per-agent and fleet-wide metrics/logs anonymously readable (plus `/logs/analyze` firing an outbound LLM call and `/agents/:id/logs/request` commanding agents). Coding Agent (opus) added `AuthUser` to all 8 handlers, scoping per-agent endpoints to the caller's orgs (matching the `get_agent` pattern), fleet aggregates require-auth + `TODO(authz)`, and `run_analysis` admin-only. Code Review APPROVED. Merged to main (`1d5a08f`), deployed via build-server.sh as v0.3.14, verified anon -> 401 on all six endpoints (login still 422, so public routes intact).
Mike then asked to fix the run_analysis UX regression (admin-only `/logs/analyze` 403'd non-admin techs doing per-agent analysis). Coding Agent relaxed it: per-agent analysis (agent_id present) -> `authorize_agent_access` org check; fleet (no agent_id) stays admin-only; dashboard hides the fleet Analyze button for non-admins (`useAuth` role check matching backend `is_admin()`). Reviewed APPROVED, merged (`7be2f52`).
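The relaxed rule can be summarized as a single decision function. This is an illustrative sketch in Python, not the server's Rust code: `User`, `may_run_analysis`, and the agent-to-org mapping are hypothetical, and whether admins bypass the per-agent org check is not stated in the log, so the sketch applies the org check to everyone.

```python
from dataclasses import dataclass
from typing import FrozenSet, Mapping, Optional


@dataclass(frozen=True)
class User:
    is_admin: bool
    org_ids: FrozenSet[str]


def may_run_analysis(user: User, agent_id: Optional[str],
                     agent_org: Mapping[str, str]) -> bool:
    """Decide whether a log-analysis request is allowed (hypothetical model)."""
    if agent_id is None:
        # Fleet-wide analysis (no agent_id) stays admin-only.
        return user.is_admin
    org = agent_org.get(agent_id)
    if org is None:
        return False                    # unknown agent: deny
    # Per-agent analysis: the caller must belong to the agent's org.
    # Assumption: no admin bypass, since the log does not mention one.
    return org in user.org_ids
```

A non-admin tech in the agent's org is allowed for that agent and denied for the fleet view, which is the regression the change was meant to fix.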
Deploying run_analysis surfaced that main did not compile — the unrelated crash-detection health-monitoring feature (`health.rs`, committed earlier today under the shared azcomputerguru account) had a type error. Per Mike's choice, coordinated with the owner (GURU-5070) via coord message rather than fixing it. This also exposed a hostname issue: I'd addressed the message to the stale `DESKTOP-0O8A1RL` session id (the retired hostname); re-sent to `GURU-5070/claude-main` + a fallback. GURU-5070 launched a fleet-wide identity audit in response; GURU-KALI verified clean (identity.json user=mike/machine=GURU-KALI, git user.name normalized to "Mike Swanson", in known_machines) and replied.
GURU-5070 committed a health.rs fix (`42790f5`) but it was incomplete — it assumed os_type AND architecture are non-null String; per migrations + .sqlx, os_type IS NOT NULL but architecture is nullable, so `&crashed.architecture` gave E0308. Fixed forward (`646eb0a`: as_deref() on version_to + architecture, &os_type direct) — the first version of this code with a verified-clean cargo check; reviewed, merged. Deploying via build-server.sh then hit a MIGRATION INCIDENT and brief outage: migration 046 (safe_rollout) had been applied to the DB out-of-band (3 tables existed) but never recorded in `_sqlx_migrations`, so the new binary crash-looped on boot ("relation update_rollouts already exists"). Since build-server.sh stops the old service before validating the new binary, the server went down. Database Agent recovered: confirmed all 3 tables empty (0 rows, no FK deps), dropped them, restart -> sqlx ran 046 fresh + recorded it. Server v0.3.22 live; dashboard redeployed; anon -> 401 confirmed; no data lost.
### Key Decisions
- **Coordinate vs. fix the health.rs blocker:** initially coordinated with GURU-5070 (Mike's choice, to avoid stepping on WIP). After their committed fix was still broken and they'd declared "done" (no active WIP), fixed it forward — aligned with Mike's "resume the deploy" intent.
- **Database recovery = drop empty tables, not checksum-insert:** Database Agent chose dropping the 3 empty tables (letting sqlx re-run 046 and self-record) over manually inserting a `_sqlx_migrations` row — avoids a fragile hand-computed SHA-384 and eliminates any out-of-band schema drift. Safe only because all 3 tables were empty.
- **Branch-not-main for the audit report; non-main pushes don't build:** verified the webhook builds on `refs/heads/main` only with no path filtering — so the audit branch and feature branches don't trigger builds; merging to main does.
- **Delegated all code/DB/git through agents (opus for auth/migration/security):** coordinator never hand-edited production code or ran DB writes; mandatory Code Review on every change caught that even my own prescribed health.rs fix was wrong.
### Problems Encountered
- **Self-inflicted git race (first run_analysis server build):** ran build-server.sh right after the merge push, which had triggered the webhook build on the same /home/guru/gururmm repo; concurrent `git reset --hard` left a stale tree and a false build failure. Fix: always check for in-flight builds before build-server.sh; resolved by waiting for idle.
- **health.rs compile saga (3 attempts):** original .as_ref() tuple (E0277 x3) -> GURU-5070's partial fix (E0308, architecture nullable) -> correct fix `646eb0a` (as_deref on the two Option fields). Root issue: nobody ran a clean `cargo check` before committing the prior attempts.
- **Migration 046 unrecorded -> crash-loop + outage:** see summary; recovered by Database Agent. Lesson sent to GURU-5070: don't apply migration SQL manually during dev; let the server apply via sqlx.
- **Coord message misaddressed to retired hostname:** DESKTOP-0O8A1RL is retired (now GURU-5070); re-sent + fallback. Triggered the fleet identity audit.
- **Public dashboard 403:** Cloudflare bot-mitigation on a server-side curl, not an nginx/deploy fault (origin serves the new bundle at local 200).
### Configuration Changes
- claudetools `f2ece8e` — `.claude/CLAUDE.md` guru-rmm submodule wording corrected.
- gururmm `1d5a08f` — `server/src/api/metrics.rs` + `logs.rs`: AuthUser on 8 handlers (CRITICAL auth fix).
- gururmm `7be2f52` — `server/src/api/logs.rs` (run_analysis per-agent authz) + `dashboard/src/pages/Logs.tsx` (hide fleet Analyze for non-admins).
- gururmm `646eb0a` — `server/src/updates/health.rs`: as_deref() fix for nullable Option fields (follow-up to GURU-5070's `42790f5`).
- DB: dropped + sqlx-recreated `update_rollouts`, `update_health_metrics`, `agent_update_events`; migration 046 now recorded in `_sqlx_migrations`.
- Deployed: gururmm-server v0.3.22 (`/opt/gururmm/gururmm-server`); dashboard rebuilt + copied to `/var/www/gururmm/dashboard/` (bundle `index-DUF78gxN.js`).
- `.claude/current-mode` -> infra during deploy.
### Credentials & Secrets
- No new credentials. Build server DB access via `DATABASE_URL` in `/home/guru/.cargo/env` (build server builds ONLINE, which is why health.rs query! macros validated against the live DB). GuruRMM API admin creds: vault `infrastructure/gururmm-server.sops.yaml`.
### Infrastructure & Servers
- gururmm-server: `172.16.3.30:3001`, systemd `gururmm-server`, binary `/opt/gururmm/gururmm-server` (the `/usr/local/bin` path in old CONTEXT.md is stale). Running **v0.3.22**.
- Server deploy = MANUAL `sudo /opt/gururmm/build-server.sh` (git reset --hard origin/main -> cargo build --release -> stop/cp/start). NOT triggered by the webhook (webhook = agents only). **Latent bug:** stops the service BEFORE validating the new binary's migrations -> a bad migration causes an outage; also doesn't check `git reset` exit code (race) and has no build lock.
- Dashboard: nginx serves `/var/www/gururmm/dashboard` (root-owned, server_name _); `/api/` proxied to `:3001`; second vhost `server_name rmm-api.azcomputerguru.com`. Dashboard `API_BASE_URL` defaults to `https://rmm-api.azcomputerguru.com` (no .env), so a plain `npm run build` is correct for prod. Public `rmm.azcomputerguru.com` is behind Cloudflare (IPv6 2606:4700; 403s bare curls via bot-mitigation).
- DB: PostgreSQL `localhost:5432/gururmm` on .30. `_sqlx_migrations` now at version 46.
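The latent bug in the deploy script is one of ordering: the service is stopped before the new binary is known to be good, and nothing prevents two builds from running at once. The sketch below shows the safer ordering in Python rather than in the actual shell script. `build_lock` and `deploy` are hypothetical names, and the three steps are passed in as callables so the ordering is the only thing being demonstrated.

```python
import os
from contextlib import contextmanager
from typing import Callable


@contextmanager
def build_lock(path: str):
    """Hold an exclusive lock file for the duration of a build.

    O_EXCL makes creation atomic, so a second caller gets FileExistsError.
    A crash can leave a stale lock behind; this sketch does not handle that.
    """
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        os.unlink(path)


def deploy(build: Callable[[], str],
           validate: Callable[[str], None],
           swap: Callable[[str], None],
           lock_path: str) -> None:
    """Build, then validate, and only then stop/copy/start."""
    with build_lock(lock_path):
        artifact = build()
        validate(artifact)   # raises on failure; the old service keeps running
        swap(artifact)       # reached only after validation succeeded
```

With this ordering, a failure such as the migration 046 conflict would abort the deploy while the previous version is still serving, and a build started while the webhook build is in flight would fail fast instead of racing on the same working tree.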
### Commands & Outputs
```bash
# Server deploy (manual, intended path):
ssh guru@172.16.3.30 'sudo /opt/gururmm/build-server.sh' # ~4min build, then stop/cp/start
# Dashboard deploy:
ssh guru@172.16.3.30 'cd /home/guru/gururmm/dashboard && npm ci && npm run build && sudo cp -r dist/* /var/www/gururmm/dashboard/'
# Migration recovery (Database Agent, after confirming 3 tables empty):
# BEGIN; <guard: raise if any rows>; DROP TABLE IF EXISTS update_rollouts, update_health_metrics, agent_update_events CASCADE; COMMIT;
# then systemctl restart gururmm-server -> sqlx runs 046 fresh + records it
# Smoke test (auth enforcement live):
curl -s -o /dev/null -w '%{http_code}' http://localhost:3001/api/metrics/summary # -> 401
curl -s -o /dev/null -w '%{http_code}' http://localhost:3001/status # -> 200
```
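The guard in the migration recovery (refuse to drop unless every table is empty, all inside one transaction) can be sketched as follows. This is an illustration only: it uses the standard-library `sqlite3` module as a stand-in for PostgreSQL, `drop_tables_if_empty` is a hypothetical helper, and it assumes a connection opened with `isolation_level=None` so that the explicit `BEGIN`/`COMMIT` statements control the transaction.

```python
import sqlite3
from typing import Sequence


def drop_tables_if_empty(conn: sqlite3.Connection, tables: Sequence[str]) -> None:
    """Drop all tables atomically, but only if every one has zero rows."""
    for table in tables:
        # Table names cannot be bound as parameters, so validate them first.
        if not table.isidentifier():
            raise ValueError(f"unsafe table name: {table!r}")
    conn.execute("BEGIN")
    try:
        for table in tables:
            (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
            if count:
                raise RuntimeError(f"{table} has {count} rows; refusing to drop")
        for table in tables:
            conn.execute(f"DROP TABLE {table}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")   # leave the schema exactly as it was
        raise
```

Checking every table before dropping any of them, inside a single transaction, means a surprise row in the third table cannot leave the first two already gone.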
### Pending / Incomplete Tasks
- **HIGH follow-ups from the audit (not started):** validate Entra SSO ID-token signature (`sso.rs:212`); auth+scope the agent-status SSE (`agents.rs:583`); add `client_id`/`update_channel` to the agent response structs (dead frontend links); org-scope the 3 fleet endpoints (`/metrics/summary`, `/logs`, `/logs/analysis` — TODO(authz), need client_ids-filtered queries); mac build gate stuck (mac builder offline since Pluto outage).
- **Structural:** add a router-level auth layer so "public" is opt-in (kills the missing-AuthUser bug class).
- **Hand to GURU-5070 (coord msg 2d518a70):** don't apply migration SQL manually; harden build-server.sh (validate migrations before service swap; check git reset exit; add build lock); `046_safe_rollout.sql` header comment mislabeled "Migration 045".
- Audit report still only on branch `audit/2026-05-25-rmm-audit` (merge to main when bundling code).
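The structural follow-up (auth enforced at the router so that "public" is opt-in) can be illustrated with a toy router. This is a language-neutral sketch written in Python, not the server's actual routing code: the `Router` class, its `route` decorator, and the tuple return values are all hypothetical.

```python
from typing import Any, Callable, Dict, Optional, Tuple


class Router:
    """Toy router where every route requires auth unless marked public."""

    def __init__(self) -> None:
        self._routes: Dict[str, Tuple[Callable[[Optional[str]], Any], bool]] = {}

    def route(self, path: str, *, public: bool = False):
        def register(handler):
            self._routes[path] = (handler, public)
            return handler
        return register

    def dispatch(self, path: str, user: Optional[str] = None) -> Tuple[int, Any]:
        entry = self._routes.get(path)
        if entry is None:
            return 404, None
        handler, public = entry
        # The check lives here, not in each handler, so a handler that
        # forgets about auth is still protected by default.
        if not public and user is None:
            return 401, None
        return 200, handler(user)


router = Router()


@router.route("/status", public=True)       # explicitly opted in as public
def status(user):
    return "ok"


@router.route("/api/metrics/summary")       # protected by default
def metrics_summary(user):
    return {"requested_by": user}
```

Under this design, the failure mode from the audit (a handler that omits the auth extractor) results in a protected route rather than an anonymous one.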
### Reference Information
- gururmm commits: `1d5a08f` (CRITICAL auth), `7be2f52` (run_analysis), `646eb0a` (health fix), `42790f5` (GURU-5070 partial health fix). Audit report: `reports/2026-05-25-rmm-audit.md` on branch `audit/2026-05-25-rmm-audit` (`da1d4ee`).
- claudetools commits: `413df93` (sync.sh submodule fix + solverbot removal), `f2ece8e` (CLAUDE.md wording).
- Coord: component `gururmm/server` = deployed 0.3.22. Messages: `16aa12fb`/`74a1a3e5` (build-blocked to GURU-5070 + DESKTOP fallback), `b99f718c` (identity check-in reply), `2d518a70` (deploy-done + lessons). DESKTOP-0O8A1RL retired; GURU-5070 is Mike's current session id.
- Audit tally: 61 findings (2 critical [both now FIXED+deployed], 10 high, 16 medium, 7 low, 26 info).