scc: Session save and push from GURU-5070 at 2026-06-02 16:27:55
RMM per-site EXE signing fix (on hold) session log. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
79
session-logs/2026-06-02-mike-rmm-exe-signing-fix.md
Normal file
@@ -0,0 +1,79 @@
# GuruRMM Per-Site Agent EXE Signing Fix — Session Log 2026-06-02

## User
- **User:** Mike Swanson (mike)
- **Machine:** GURU-5070
- **Role:** admin

## Session Summary

Diagnosed and fixed a production bug: the GuruRMM per-site Windows agent **EXE** was served **unsigned**. It surfaced when Mike tried to install the agent for Tucson Coin (site `SWIFT-STORM-6516`).

**Root cause:** the download endpoint `GET /install/{site_code}/download/windows` (`server/src/api/install.rs::build_site_binary`) appends a self-config trailer `[site_code][4-byte LE len][8-byte "GRMM_CFG"]` to the **already Authenticode-signed** base agent and serves the result with **no re-sign**. Appending bytes after the PE cert table invalidates the signature. The per-site **MSI** path re-signs after baking in the SITEKEY (fresh WiX build), so MSIs were fine; only the EXE path was broken.

**Proof:** the actual artifact `E:\gururmm-agent-main.exe` from the external drive reports `NotSigned`, carries trailer magic `GRMM_CFG` and embedded site code `SWIFT-STORM-6516`, and its 5,178,664-byte prefix is SHA-256-identical to the signed base. In other words, the file is the signed base plus a 28-byte trailer (16-byte site code + 4-byte length + 8-byte magic).

Fixed in four coordinated changes, each implemented by a Coding Agent (opus) and reviewed by the Code Review Agent (opus):
- **Phase 1** (`agent/src/embedded.rs`): the trailer reader is now **cert-table-aware**. It locates the trailer relative to the PE security-directory (cert-table) offset rather than physical EOF, so the trailer is still readable after the EXE is re-signed. In the same commit, `deploy_and_sign` in `deploy/build-pipeline/build-windows.sh` was made **fail-closed**: on sign failure it removes the unsigned artifact and runs `exit 1`, so an unsigned build is never published. Commit `0d0d382`.
- **Phase 2** (`server/src/api/install.rs`): the Windows per-site EXE path now **appends the trailer, THEN re-signs** via `/opt/gururmm/sign-windows.sh`. It uses a per-site/version cache mirroring the MSI path, is fail-closed, and writes to a unique-per-request UUID temp file (review caught a concurrent-sign corruption and a temp leak; both fixed). MSI signing was also made fail-closed. Commit `9fe7834`.
- **Phase 2b** (`install.rs`): **strip the base's existing Authenticode signature before appending the trailer** so jsign can re-sign. The strip truncates the file at the security-directory VA (for this directory entry the field is a file offset, not an RVA) and zeroes the entry, leaving a clean PE. Without this, jsign failed at `PEFile.writeCertificateTable` because the existing cert table sat mid-file. Commit `84dfdae`.
- **Phase 1c** (`embedded.rs`): **scan backward up to `PAD_MAX=16` bytes for the magic**, because jsign 8-byte-aligns the new cert table and inserts 0–7 padding bytes between the trailer and the cert table (verified: 4-byte gap). Commit `4c60874`.
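The Phase 1 and Phase 1c reader logic can be illustrated with a minimal sketch. This is not the code in `agent/src/embedded.rs`; only `MAGIC` and `PAD_MAX` come from the log, and every function name here is hypothetical. It assumes the trailer sits just before the cert table when the image is signed, or at physical EOF when it is not, and that the security directory entry stores a file offset.

```rust
// Sketch only: names other than MAGIC and PAD_MAX are hypothetical.
const MAGIC: &[u8; 8] = b"GRMM_CFG";
const PAD_MAX: usize = 16;

/// Read a little-endian u32 at `off`, or None if out of bounds.
fn read_u32_le(buf: &[u8], off: usize) -> Option<u32> {
    let b = buf.get(off..off.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// File offset of the security (certificate table) data-directory entry.
fn security_dir_entry(pe: &[u8]) -> Option<usize> {
    if !pe.starts_with(b"MZ") {
        return None;
    }
    let nt = read_u32_le(pe, 0x3C)? as usize; // e_lfanew
    if pe.get(nt..nt.checked_add(4)?)? != &b"PE\0\0"[..] {
        return None;
    }
    let opt = nt + 24; // PE signature (4 bytes) + COFF header (20 bytes)
    let m = pe.get(opt..opt + 2)?;
    let dirs = match u16::from_le_bytes([m[0], m[1]]) {
        0x10b => opt + 96,  // PE32
        0x20b => opt + 112, // PE32+
        _ => return None,
    };
    Some(dirs + 4 * 8) // directory index 4, 8 bytes per entry
}

/// Locate `[site_code][4-byte LE len][8-byte MAGIC]` relative to the cert
/// table (or EOF when unsigned), tolerating up to PAD_MAX padding bytes.
pub fn read_site_code(pe: &[u8]) -> Option<String> {
    let entry = security_dir_entry(pe)?;
    let cert_off = read_u32_le(pe, entry)? as usize;
    let anchor = if cert_off != 0 && cert_off <= pe.len() {
        cert_off
    } else {
        pe.len()
    };
    for pad in 0..=PAD_MAX {
        let magic_end = anchor.checked_sub(pad)?;
        let magic_start = magic_end.checked_sub(MAGIC.len())?;
        if pe[magic_start..magic_end] != MAGIC[..] {
            continue; // not here; try one byte further back
        }
        let len_at = magic_start.checked_sub(4)?;
        let len = read_u32_le(pe, len_at)? as usize;
        let code_start = len_at.checked_sub(len)?;
        return String::from_utf8(pe[code_start..len_at].to_vec()).ok();
    }
    None
}
```

Calling `read_site_code(&bytes)` on a per-site EXE would return the embedded site code whether the trailer is flush against the cert table or separated from it by alignment padding; on a file without a trailer it returns `None`.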

At hold time the signing problem is **fixed and deployed**: the per-site EXE download returns HTTP 200 with a **Valid** signature, signer `CN=Arizona Computer Guru LLC`. The only unverified item is the embedded-site-code read on the final agent build (`0.6.54`, padding-aware scan), which builds server-side from the webhook and will publish on its own. Mike put the task **on hold** mid-final-build; the lock was released, the watcher stopped, and the resume point captured in coord todo `5bfcb968`.

Also this session (logged/committed separately): the Deere Park Development (Glabman) UniFi WiFi 7 Syncro estimate #7190 (`clients/deere-park-development/`), and GuruConnect **SPEC-017 end-user (sub-user) remote access** (`projects/msp-tools/guru-connect/docs/specs/SPEC-017-*.md`, pushed to GC main).

## Key Decisions

- **Agent change first, then server** (two-phase deploy): the cert-aware base must be live before the server re-signs per-site EXEs; otherwise field agents could not read the re-signed trailer.
- **Strip-then-sign over server-side alignment hacks**: re-signing requires a clean PE; stripping the base signature is deterministic and avoids depending on jsign internals.
- **Backward-scan for the magic** rather than trusting jsign to place the cert table at a fixed offset: robust to any alignment padding, now or later.
- **Fail-closed everywhere** (build pipeline + per-site EXE + per-site MSI): never serve or publish an unsigned client artifact. This is what *surfaced* the jsign re-sign failure (HTTP 500 instead of a silently unsigned download), i.e. the fail-closed design working as intended.
- **MSI is the interim install path for Tucson Coin** (verified Valid-signed) while the EXE fix completes.
- Pushed to `gururmm main` to drive builds via the Gitea webhook pipeline (never manual `build-agents.sh`).
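The strip-then-sign decision relies on turning a signed PE back into a clean one. A rough sketch of that step follows, under stated assumptions: the function names are hypothetical, the cert table is required to be the final block of the file, and the optional-header checksum is left untouched. The actual `install.rs` implementation may differ on all three points.

```rust
// Sketch only: all names are hypothetical, not taken from install.rs.

/// Read a little-endian u32 at `off`, or None if out of bounds.
fn read_u32_le(buf: &[u8], off: usize) -> Option<u32> {
    let b = buf.get(off..off.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// File offset of the security (certificate table) data-directory entry.
fn security_dir_entry(pe: &[u8]) -> Option<usize> {
    if !pe.starts_with(b"MZ") {
        return None;
    }
    let nt = read_u32_le(pe, 0x3C)? as usize; // e_lfanew
    if pe.get(nt..nt.checked_add(4)?)? != &b"PE\0\0"[..] {
        return None;
    }
    let opt = nt + 24; // PE signature (4 bytes) + COFF header (20 bytes)
    let m = pe.get(opt..opt + 2)?;
    let dirs = match u16::from_le_bytes([m[0], m[1]]) {
        0x10b => opt + 96,  // PE32
        0x20b => opt + 112, // PE32+
        _ => return None,
    };
    Some(dirs + 4 * 8) // directory index 4, 8 bytes per entry
}

/// Remove an Authenticode signature in place.
/// Returns Ok(true) if a signature was stripped, Ok(false) if none existed.
pub fn strip_signature(pe: &mut Vec<u8>) -> Result<bool, &'static str> {
    let entry = security_dir_entry(pe.as_slice()).ok_or("not a PE image")?;
    let off = read_u32_le(pe.as_slice(), entry).ok_or("truncated header")? as usize;
    let size = read_u32_le(pe.as_slice(), entry + 4).ok_or("truncated header")? as usize;
    if off == 0 && size == 0 {
        return Ok(false); // already unsigned
    }
    // Assumption: only handle the common layout where the table runs to EOF.
    if off < entry + 8 || off.checked_add(size) != Some(pe.len()) {
        return Err("cert table is not the final block of the file");
    }
    pe.truncate(off); // drop the certificate table
    pe[entry..entry + 8].fill(0); // zero the directory entry (offset + size)
    Ok(true)
}
```

Refusing to truncate when the table is not at EOF keeps the sketch fail-closed: an unexpected layout produces an error rather than a silently damaged binary.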

## Problems Encountered

- **jsign re-sign failed (`writeCertificateTable`)** after the Phase 2 deploy: the base was already signed, so the appended trailer landed after the existing cert table. Fixed by Phase 2b (strip the existing signature first). Note: the Phase 2 fail-closed path correctly returned HTTP 500 rather than serving an unsigned file; that is how the bug surfaced.
- **jsign alignment padding** (0–7 bytes) between the trailer and the new cert table broke the fixed-offset trailer read (site code intact, but 4 bytes earlier than expected). Fixed by Phase 1c (backward scan).
- **Content-Length watcher false timeout**: the cert-aware base rebuild came out byte-identical in size to the prior version, so a size-change poll missed it. Completion was confirmed via the build log instead (`Windows build complete: v0.6.53`).
- **Code review caught** a concurrent-temp signing-corruption race and a temp leak on rename failure in Phase 2; both were fixed before commit.
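The two review findings (shared temp file, temp leak on failure) point at a common pattern: write to a per-request temp path, sign it, rename it into place, and clean up on any error. The sketch below illustrates that pattern only. `publish_fail_closed` and `unique_temp_path` are hypothetical names, the `sign` closure stands in for invoking `/opt/gururmm/sign-windows.sh`, and a process id plus counter replaces the UUID used in the real fix so the example needs only the standard library.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch only: a pid + counter suffix stands in for the per-request UUID.
static COUNTER: AtomicU64 = AtomicU64::new(0);

/// Build a temp path next to `final_path` that no other request shares.
fn unique_temp_path(final_path: &Path) -> PathBuf {
    let n = COUNTER.fetch_add(1, Ordering::Relaxed);
    let mut name = final_path.file_name().unwrap_or_default().to_os_string();
    name.push(format!(".{}.{}.tmp", std::process::id(), n));
    final_path.with_file_name(name)
}

/// Write `bytes` to a unique temp file, sign it in place, then rename it
/// over `final_path`. On any failure the temp file is removed and the error
/// is returned, so an unsigned file never appears at `final_path`.
pub fn publish_fail_closed<F>(final_path: &Path, bytes: &[u8], sign: F) -> io::Result<()>
where
    F: FnOnce(&Path) -> io::Result<()>,
{
    let tmp = unique_temp_path(final_path);
    let result = fs::write(&tmp, bytes)
        .and_then(|_| sign(&tmp))
        .and_then(|_| fs::rename(&tmp, final_path));
    if result.is_err() {
        // Best-effort cleanup; the original error is what the caller sees.
        let _ = fs::remove_file(&tmp);
    }
    result
}
```

A handler built this way would map the returned `Err` to an HTTP 500, which matches the fail-closed behaviour that surfaced the jsign failure.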

## Configuration Changes

GuruRMM submodule (`projects/msp-tools/guru-rmm`, repo `azcomputerguru/gururmm`), all committed and pushed to `main`:
- `agent/src/embedded.rs`: cert-table-aware + padding-scan trailer reader (commits `0d0d382`, `4c60874`)
- `deploy/build-pipeline/build-windows.sh`: `deploy_and_sign` fail-closed (`0d0d382`)
- `server/src/api/install.rs`: per-site EXE re-sign + cache + strip, fail-closed MSI (`9fe7834`, `84dfdae`)

Parent repo: submodule pointer bumps for the above, plus this session log.

## Credentials & Secrets

- None created. Syncro API accessed via the existing per-user key (mike, baked into the `/syncro` skill).
- Signing creds (reference only, not retrieved): `/etc/gururmm-signing.env` (root-only) + `/opt/gururmm/.env` on 172.16.3.30. These are for the Azure Trusted Signing SP used by jsign.

## Infrastructure & Servers

- **GuruRMM server / Linux build orchestrator:** `172.16.3.30`.
  - The `gururmm-server` systemd service runs as **User=root**, so runtime per-site signing has the privileges `sign-windows.sh` needs.
  - Deployed build scripts live in `/opt/gururmm/`. On every build, `build-shared.sh` runs `git reset --hard origin/main` and `install`s `build-windows.sh`, `sign-windows.sh`, etc. from `deploy/build-pipeline/` (the repo is the source of truth).
  - Downloads are served from `/var/www/gururmm/downloads/`.
  - Windows build log: `/var/log/gururmm-build-windows.log`.
  - Webhook orchestrator: `/opt/gururmm/webhook-handler.py` (persistent HTTP server; lock `/var/run/gururmm-build.lock`).
- **Pluto:** `172.16.3.36`. Windows builder (Rust agent variants + WiX MSI), reached over SSH from the orchestrator.
- **Public download:** `https://rmm.azcomputerguru.com/` (NPM → server). Install landing: `/install/{site_code}`.
- Agent versions: base `0.6.52` → `0.6.53` (Phase 1) → `0.6.54` (Phase 1c; still building at hold time, publication not yet verified). New builds default to the **beta** channel; `-latest` symlinks point to the newest build.

## Commands & Outputs

- Verify base signed: `Get-AuthenticodeSignature` on the file downloaded from `https://rmm.azcomputerguru.com/downloads/gururmm-agent-windows-amd64-latest.exe` → `Valid`, `CN=Arizona Computer Guru LLC`.
- Real unsigned artifact (Tucson Coin): `E:\gururmm-agent-main.exe` → `NotSigned`; trailer magic `GRMM_CFG`; site code `SWIFT-STORM-6516`; prefix SHA-256 == signed base.
- Build host check: `ssh guru@172.16.3.30 'tail /var/log/gururmm-build-windows.log'` → `=== Windows build complete: v0.6.53 in 1534s ===`, all variants "Signed OK".
- Server signing error (pre-Phase-2b): `journalctl -u gururmm-server` → `jsign … writeCertificateTable`; service `User=root`.
- Post-Phase-2b: `GET /install/SWIFT-STORM-6516/download/windows` → HTTP 200, `Valid`-signed; cert table VA 5,166,112, magic ends at 5,166,108 (**4-byte jsign alignment gap**), site code `SWIFT-STORM-6516`.
- Per-site MSI (workaround): `GET /install/SWIFT-STORM-6516/download/msi` → `Valid`-signed.
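The 4-byte gap in the post-Phase-2b output is what 8-byte alignment predicts for a trailer ending at offset 5,166,108. A small worked sketch of that arithmetic (the helper names are illustrative, not from the codebase):

```rust
/// Round `offset` up to the next multiple of `align` (a power of two).
pub fn align_up(offset: u64, align: u64) -> u64 {
    debug_assert!(align.is_power_of_two());
    (offset + align - 1) & !(align - 1)
}

/// Number of padding bytes needed to reach the next aligned offset.
pub fn padding_for(offset: u64, align: u64) -> u64 {
    align_up(offset, align) - offset
}
```

With the logged values, `align_up(5_166_108, 8)` is `5_166_112` (the reported cert table VA) and `padding_for(5_166_108, 8)` is `4`. Since the padding for 8-byte alignment can never exceed 7, a `PAD_MAX` of 16 leaves headroom.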

## Pending / Incomplete Tasks

- **ON HOLD** (Mike, 2026-06-02). Remaining: when the Phase 1c base `0.6.54` publishes, pull the per-site EXE and confirm both a Valid signature **and** that the site code is readable via the padding-aware scan. Tracked in **coord todo `5bfcb968`** (gururmm, assigned mike). Coord lock released; background watcher stopped.
- Fleet/coord: the `/autotask` baseline decision (un-localize vs capability-gate) is still pending with Mike. It was flagged in the memory-consolidation broadcast; the memory cleanup itself was resolved upstream by `ae51988` (sync-memory mirror mode).

## Reference Information

- GuruRMM commits: `0d0d382` (P1 agent+build), `9fe7834` (P2 server), `84dfdae` (P2b strip), `4c60874` (P1c scan). Repo: `azcomputerguru/gururmm`.
- GuruConnect: SPEC-017 (`azcomputerguru/guru-connect` main).
- Syncro estimate #7190 / ticket #32366: Deere Park Development. (`SWIFT-STORM-6516` is unrelated; it is Tucson Coin's RMM site code.)
- Coord todo: `5bfcb968-c1fd-4390-84af-6e3222a7cba2`.
- Trailer format (source of truth: `agent/src/embedded.rs`, MAGIC `b"GRMM_CFG"`): `[site_code][4-byte LE len][8-byte MAGIC]`, located before the PE cert table (with 0–7 bytes of jsign alignment padding in between).
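For reference, the writer side of the trailer format can be sketched in a few lines. This is an illustration of the layout only, assuming the length field counts the site-code bytes; `append_trailer` is a hypothetical name and not the function in `install.rs`.

```rust
// Sketch only: MAGIC is from the log; the function name is hypothetical.
const MAGIC: &[u8; 8] = b"GRMM_CFG";

/// Return `base` followed by `[site_code][4-byte LE len][8-byte MAGIC]`.
pub fn append_trailer(base: &[u8], site_code: &str) -> Vec<u8> {
    let code = site_code.as_bytes();
    let mut out = Vec::with_capacity(base.len() + code.len() + 4 + MAGIC.len());
    out.extend_from_slice(base);
    out.extend_from_slice(code);
    // Length of the site code, little-endian (assumed meaning of `len`).
    out.extend_from_slice(&(code.len() as u32).to_le_bytes());
    out.extend_from_slice(MAGIC);
    out
}
```

For the 16-character code `SWIFT-STORM-6516` this adds 16 + 4 + 8 = 28 bytes, matching the trailer size observed on the unsigned artifact.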