From e61b39b5c8f586658b99314d550ac1636e2915aa Mon Sep 17 00:00:00 2001 From: Mike Swanson Date: Thu, 25 Jun 2026 12:36:24 -0700 Subject: [PATCH] sync: auto-sync from GURU-5070 at 2026-06-25 12:35:22 Author: Mike Swanson Machine: GURU-5070 Timestamp: 2026-06-25 12:35:22 --- .claude/memory/MEMORY.md | 1 + .../reference_syncro_agent_handle_leak.md | 32 +++++++++++++++++++ .../reference_tailscale_subnet_key_expiry.md | 27 ++++++++++++++++ errorlog.md | 6 ++++ 4 files changed, 66 insertions(+) create mode 100644 .claude/memory/reference_syncro_agent_handle_leak.md create mode 100644 .claude/memory/reference_tailscale_subnet_key_expiry.md diff --git a/.claude/memory/MEMORY.md b/.claude/memory/MEMORY.md index 6211d353..f5bb251b 100644 --- a/.claude/memory/MEMORY.md +++ b/.claude/memory/MEMORY.md @@ -2,6 +2,7 @@ ## Reference - [ACG resource map](reference_resource_map.md) — **READ THIS FIRST** when a task references a server/service/tenant/API. What we have access to, how to connect from this machine, per-machine exceptions, gotchas. Points at the detail files below. +- [Tailscale subnet-route key expiry](reference_tailscale_subnet_key_expiry.md) — "internet OK but all of 172.16.3.x (Gitea .20, RMM/coord .30) dead" = Tailscale infra-node KEY EXPIRY (pfSense subnet router advertises 172.16.0.0/22), NOT a LAN outage; expiry now disabled on infra nodes (2026-06-25). Fallback: gururmm-server direct at tailnet 100.86.12.15:3001. - [GravityZone support center](reference_gravityzone_support.md) — Authoritative Bitdefender GravityZone product + Public API docs; use to confirm UNVERIFIED `bitdefender` skill methods/param shapes (push setPushEventSettings, assignPolicy, report/account writes, maintenancewindows/integrations names). - [GURU-5070 Rust toolchain](reference_guru5070_rust_toolchain.md) — GURU-5070 now has cargo + MSVC + protoc; build/clippy/test guru-connect LOCALLY (set PROTOC to the winget path) instead of the build host. 
CI only clippy-checks the Linux server, not the Windows agent. - [ACG Office Network Infrastructure](infra_office_network.md) — IPs/hosts/roles for pfSense/Jupiter/VMs/Docker. Check before assuming; .21 (Uranus) is storage. diff --git a/.claude/memory/reference_syncro_agent_handle_leak.md b/.claude/memory/reference_syncro_agent_handle_leak.md new file mode 100644 index 00000000..5deb29b1 --- /dev/null +++ b/.claude/memory/reference_syncro_agent_handle_leak.md @@ -0,0 +1,32 @@ +--- +name: reference_syncro_agent_handle_leak +description: RDS "no available computers in the pool" (0x3/0x408) can really be a SyncroLive.Agent.Runner handle leak starving the box. How to spot + fix. +metadata: + type: reference +--- + +**Symptom chain that hid the real cause (IMC1 / Instrumental Music Center, 2026-06-25):** +RemoteApp/RDP launch fails → *"There are no available computers in the pool"* (RDP **error 0x3 / extended 0x408**). +The RD Connection Broker **Admin** log (`Microsoft-Windows-TerminalServices-SessionBroker/Admin`, +**Event 802**) shows the truth: *"RD Connection Broker failed to process the connection request … +Error: **Insufficient system resources** exist to complete the requested service."* The +SessionBroker-Client log shows 1296 "Element not found" / 1306 redirect-failed. The collection + +session host are healthy (`NewConnectionAllowed: Yes`), so the broker isn't the bug — the **box is +out of a kernel resource**. + +**Root cause:** the **Syncro RMM agent `SyncroLive.Agent.Runner` was leaking HANDLES** — 1,135,414 +handles in one process (~80% of the box's 1.41M total). Handle/object exhaustion → broker can't +create the session. + +**Diagnose:** `Get-Process | Sort-Object Handles -Descending | Select -First 6 Name,Handles,Id` and +`(Get-Process | Measure-Object Handles -Sum).Sum`. Memory looks fine (it's handles, not RAM/commit). +Services on the box: `Syncro`, `SyncroLive`, `SyncroOvermind` (SyncroRecovery). 
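The handle-share check described in the Diagnose step can be sketched as below. This is a minimal illustration, not part of the Syncro tooling: the function name `handle_hogs`, the 50% threshold, and the "everything else" bucket are assumptions made for the example; only the two handle figures come from the IMC1 incident above.

```python
def handle_hogs(handles_by_process, share_threshold=0.5):
    """Return (name, handles, share) for each process holding more than
    share_threshold of all handles, largest first.

    share_threshold is an assumed cutoff chosen for illustration.
    """
    total = sum(handles_by_process.values())
    if total == 0:
        return []
    hogs = [
        (name, count, count / total)
        for name, count in handles_by_process.items()
        if count / total > share_threshold
    ]
    return sorted(hogs, key=lambda hog: hog[1], reverse=True)


# Figures from the incident: the runner held 1,135,414 of roughly 1.41M handles.
# "everything else" is a hypothetical bucket standing in for all other processes.
sample = {
    "SyncroLive.Agent.Runner": 1_135_414,
    "everything else": 1_410_000 - 1_135_414,
}

for name, count, share in handle_hogs(sample):
    print(f"{name}: {count} handles ({share:.1%} of total)")
```

In practice the per-process counts would come from the `Get-Process` output quoted above; a single process holding most of the box's handles is the signal to look for, whatever exact cutoff is used.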
+ +**Fix (no reboot needed):** `Stop-Process -Name 'SyncroLive.Agent.Runner' -Force` — the Syncro +watchdog respawns it clean (dropped 1.41M → 280K handles, runner came back at ~900). Have the user +retry the RemoteApp immediately. + +**It recurs** (leak accumulates over uptime) → schedule a periodic SyncroLive restart and/or update the +agent; **likely fleet-wide** — sweep other client servers for high-handle `SyncroLive.Agent.Runner` +(deferred 2026-06-25). IMC1 also had a separate pending reboot (wedged KB5075999) + expired RDS certs. +See [[reference_resource_map]]. diff --git a/.claude/memory/reference_tailscale_subnet_key_expiry.md b/.claude/memory/reference_tailscale_subnet_key_expiry.md new file mode 100644 index 00000000..1c37198a --- /dev/null +++ b/.claude/memory/reference_tailscale_subnet_key_expiry.md @@ -0,0 +1,27 @@ +--- +name: reference_tailscale_subnet_key_expiry +description: "Internet OK but all of 172.16.3.x dead" = Tailscale infra-node key expiry, not a LAN outage. How to diagnose + the fallback path. +metadata: + type: reference +--- + +The ACG internal subnet **172.16.3.x is reached over Tailscale**, not a local LAN — `pfsense-2` +(the pfSense node) is the **subnet router** advertising **172.16.0.0/22**. Key hosts on it: +Gitea/Jupiter `172.16.3.20:3000`, GuruRMM + coord `172.16.3.30:3001`/`:8001`. + +**Symptom → cause:** if `sync.sh` fetch fails and the WHOLE `172.16.3.x` subnet is unreachable +(both .20 and .30) **while general internet is fine**, the cause is almost always a **Tailscale +node KEY EXPIRY** on an infra node (the subnet router or a server) — an expired key drops that node +off the tailnet, killing the route. It is NOT a "transient blip" and NOT a real LAN outage (logged +as a correction 2026-06-25 after I mis-called it). Mike **disabled key expiration** on the infra +node(s) 2026-06-25 so it shouldn't recur; if it does, re-auth the node + confirm expiry is off in the +Tailscale admin console. 
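The symptom-to-cause rule above can be written down as a small decision function. This is a sketch under stated assumptions: the name `classify_outage` and the returned wording are hypothetical, and the reachability results are supplied by the caller rather than probed here.

```python
def classify_outage(internet_ok, subnet_hosts_ok):
    """Map reachability observations to the likely cause described above.

    internet_ok: True if general internet access works.
    subnet_hosts_ok: dict of host -> bool for hosts on 172.16.3.x.
    The returned strings are illustrative, not output of any real tool.
    """
    if not internet_ok:
        return "general connectivity problem: check the local network first"
    if not subnet_hosts_ok:
        return "no subnet hosts checked"
    if not any(subnet_hosts_ok.values()):
        return "whole subnet down, internet up: suspect Tailscale key expiry on an infra node"
    if all(subnet_hosts_ok.values()):
        return "subnet reachable"
    return "partial outage: look at the individual host, not the subnet route"


# The 2026-06-25 pattern: internet fine, both .20 and .30 unreachable.
observed = {"172.16.3.20": False, "172.16.3.30": False}
print(classify_outage(internet_ok=True, subnet_hosts_ok=observed))
```

The point of the split is that a single dead host and a dead subnet call for different checks: only the all-hosts-down case with working internet points at the subnet router's key.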
+ +**Diagnose (Windows `tailscale.exe` at `C:\Program Files\Tailscale\`):** +- `tailscale status` — look for peers marked `offline`/key-expired, esp. `pfsense-2` and `gururmm-server`. +- `tailscale debug prefs | grep RouteAll` — must be `true` (this machine accepts subnet routes). +- `tailscale status --json` — confirm a peer advertises `172.16.0.0/22` (PrimaryRoutes) and is `Online`. +- `tailscale ping ` — tests tailnet path independent of the subnet route. + +**Fallback:** `gururmm-server` is directly reachable at its **tailnet IP `100.86.12.15:3001`** — usable +in place of `172.16.3.30:3001` if the subnet route is down but the node itself is up. See [[feedback_tmp_path_windows]]. diff --git a/errorlog.md b/errorlog.md index 146d6cde..9da82e91 100644 --- a/errorlog.md +++ b/errorlog.md @@ -17,6 +17,12 @@ Categories (the `[type]` tag): _(none)_ = skill/command execution failure · +2026-06-25 | GURU-5070 | remediation-tool/EOP | [friction] checking ACG own-tenant EOP quarantine: reached for investigator-exo (401 - Exchange Admin role only on Exchange OPERATOR SP, not Investigator), then RecipientAddress needs JSON array not string (400); skill has no EOP/quarantine section at all [ctx: ref=feedback_exchange_role_recurring_gap] + +2026-06-25 | GURU-5070 | sync/tailscale | [correction] diagnosed 172.16.3.x unreachable as transient blip; real cause was Tailscale node KEY EXPIRY on the subnet-router node (pfSense advertising 172.16.0.0/22) dropping it off the tailnet [ctx: fix=disabled key expiration on the node; symptom=internet OK but whole 172.16.3.x dead] + +2026-06-25 | GURU-5070 | sync/gitea | fetch failed: could not connect to 172.16.3.20:3000 (Gitea unreachable, exit 128) [ctx: host=172.16.3.20:3000 machine=GURU-5070] + 2026-06-25 | Howard-Home | remediation-tool/reset-password.sh | JIT cleanup cannot self-remove: after elevating the Tenant Admin SP to Privileged Authentication Administrator to reset a password, the DELETE of that role assignment is performed 
BY the same SP and Graph blocks it (HTTP 400 'Removing self from built-in role is not allowed'), leaving a STANDING PAA role on the SP - needs a Global Admin/portal removal; script should detect this and surface portal steps instead of a bare WARNING [ctx: tenant=cascadestucson SP=ComputerGuru-Tenant-Admin role=PrivilegedAuthAdmin] 2026-06-25 | Howard-Home | rmm/dispatch | [friction] embedded escaped quotes " , " in a PowerShell -join inside the jq/heredoc dispatch chain caused a parse error (script failed pre-exec, wasted one dispatch); fix: build strings with + concatenation or [char]44, never escaped quotes in RMM PowerShell payloads [ctx: ref=feedback_windows_quote_stripping]