From e61b39b5c8f586658b99314d550ac1636e2915aa Mon Sep 17 00:00:00 2001
From: Mike Swanson <mike@azcomputerguru.com>
Date: Thu, 25 Jun 2026 12:36:24 -0700
Subject: [PATCH] sync: auto-sync from GURU-5070 at 2026-06-25 12:35:22

Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-25 12:35:22
---
 .claude/memory/MEMORY.md                      |  1 +
 .../reference_syncro_agent_handle_leak.md     | 32 +++++++++++++++++++
 .../reference_tailscale_subnet_key_expiry.md  | 27 ++++++++++++++++
 errorlog.md                                   |  6 ++++
 4 files changed, 66 insertions(+)
 create mode 100644 .claude/memory/reference_syncro_agent_handle_leak.md
 create mode 100644 .claude/memory/reference_tailscale_subnet_key_expiry.md
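The `tailscale status --json` check described in reference_tailscale_subnet_key_expiry.md (confirm a peer advertises `172.16.0.0/22` in PrimaryRoutes and is `Online`) could be scripted roughly as below. This is a minimal sketch, not part of the commit. It assumes the JSON has a top-level `Peer` map whose entries carry `HostName`, `PrimaryRoutes`, and `Online`; the function names are hypothetical.

```python
import json

# Subnet advertised by the pfSense subnet router, per the note in this patch.
SUBNET = "172.16.0.0/22"


def route_advertisers(status, prefix=SUBNET):
    """Return (hostname, online) for every peer whose PrimaryRoutes include prefix.

    `status` is the parsed output of `tailscale status --json`.
    The key names used here ("Peer", "HostName", "PrimaryRoutes", "Online")
    are assumptions about that JSON layout.
    """
    found = []
    for peer in (status.get("Peer") or {}).values():
        routes = peer.get("PrimaryRoutes") or []
        if prefix in routes:
            found.append((peer.get("HostName", "?"), bool(peer.get("Online"))))
    return found


def route_is_up(status, prefix=SUBNET):
    """True only if at least one online peer advertises the prefix."""
    return any(online for _, online in route_advertisers(status, prefix))


def load_status(text):
    """Parse the raw JSON text captured from `tailscale status --json`."""
    return json.loads(text)
```

To use it, capture the command output to a string, pass it through `load_status`, and call `route_is_up`. An empty result from `route_advertisers` means no visible peer is advertising the route at all, which matches the "whole 172.16.3.x dead" symptom.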

diff --git a/.claude/memory/MEMORY.md b/.claude/memory/MEMORY.md
index 6211d353..f5bb251b 100644
--- a/.claude/memory/MEMORY.md
+++ b/.claude/memory/MEMORY.md
@@ -2,6 +2,7 @@
 
 ## Reference
 - [ACG resource map](reference_resource_map.md) — **READ THIS FIRST** when a task references a server/service/tenant/API. What we have access to, how to connect from this machine, per-machine exceptions, gotchas. Points at the detail files below.
+- [Tailscale subnet-route key expiry](reference_tailscale_subnet_key_expiry.md) — "internet OK but all of 172.16.3.x (Gitea .20, RMM/coord .30) dead" = Tailscale infra-node KEY EXPIRY (pfSense subnet router advertises 172.16.0.0/22), NOT a LAN outage; expiry now disabled on infra nodes (2026-06-25). Fallback: gururmm-server direct at tailnet 100.86.12.15:3001.
 - [GravityZone support center](reference_gravityzone_support.md) — Authoritative Bitdefender GravityZone product + Public API docs; use to confirm UNVERIFIED `bitdefender` skill methods/param shapes (push setPushEventSettings, assignPolicy, report/account writes, maintenancewindows/integrations names).
 - [GURU-5070 Rust toolchain](reference_guru5070_rust_toolchain.md) — GURU-5070 now has cargo + MSVC + protoc; build/clippy/test guru-connect LOCALLY (set PROTOC to the winget path) instead of the build host. CI only clippy-checks the Linux server, not the Windows agent.
 - [ACG Office Network Infrastructure](infra_office_network.md) — IPs/hosts/roles for pfSense/Jupiter/VMs/Docker. Check before assuming; .21 (Uranus) is storage.
diff --git a/.claude/memory/reference_syncro_agent_handle_leak.md b/.claude/memory/reference_syncro_agent_handle_leak.md
new file mode 100644
index 00000000..5deb29b1
--- /dev/null
+++ b/.claude/memory/reference_syncro_agent_handle_leak.md
@@ -0,0 +1,32 @@
+---
+name: reference_syncro_agent_handle_leak
+description: RDS "no available computers in the pool" (0x3/0x408) can really be a SyncroLive.Agent.Runner handle leak starving the box. How to spot + fix.
+metadata:
+  type: reference
+---
+
+**Symptom chain that hid the real cause (IMC1 / Instrumental Music Center, 2026-06-25):**
+RemoteApp/RDP launch fails → *"There are no available computers in the pool"* (RDP **error 0x3 / extended 0x408**).
+The RD Connection Broker **Admin** log (`Microsoft-Windows-TerminalServices-SessionBroker/Admin`,
+**Event 802**) shows the truth: *"RD Connection Broker failed to process the connection request …
+Error: **Insufficient system resources** exist to complete the requested service."* The
+SessionBroker-Client log shows 1296 "Element not found" / 1306 redirect-failed. The collection +
+session host are healthy (`NewConnectionAllowed: Yes`), so the broker isn't the bug — the **box is
+out of a kernel resource**.
+
+**Root cause:** the **Syncro RMM agent `SyncroLive.Agent.Runner` was leaking HANDLES** — 1,135,414
+handles in one process (~80% of the box's 1.41M total). Handle/object exhaustion → broker can't
+create the session.
+
+**Diagnose:** `Get-Process | Sort-Object Handles -Descending | Select -First 6 Name,Handles,Id` and
+`(Get-Process | Measure-Object Handles -Sum).Sum`. Memory looks fine (it's handles, not RAM/commit).
+Services on the box: `Syncro`, `SyncroLive`, `SyncroOvermind` (SyncroRecovery).
+
+**Fix (no reboot needed):** `Stop-Process -Name 'SyncroLive.Agent.Runner' -Force` — the Syncro
+watchdog respawns it clean (dropped 1.41M → 280K handles, runner came back at ~900). Have the user
+retry the RemoteApp immediately.
+
+**It recurs** (leak accumulates over uptime) → schedule a periodic SyncroLive restart and/or update the
+agent; **likely fleet-wide** — sweep other client servers for high-handle `SyncroLive.Agent.Runner`
+(deferred 2026-06-25). IMC1 also had a separate pending reboot (wedged KB5075999) + expired RDS certs.
+See [[reference_resource_map]].
diff --git a/.claude/memory/reference_tailscale_subnet_key_expiry.md b/.claude/memory/reference_tailscale_subnet_key_expiry.md
new file mode 100644
index 00000000..1c37198a
--- /dev/null
+++ b/.claude/memory/reference_tailscale_subnet_key_expiry.md
@@ -0,0 +1,27 @@
+---
+name: reference_tailscale_subnet_key_expiry
+description: "Internet OK but all of 172.16.3.x dead" = Tailscale infra-node key expiry, not a LAN outage. How to diagnose + the fallback path.
+metadata:
+  type: reference
+---
+
+The ACG internal subnet **172.16.3.x is reached over Tailscale**, not a local LAN — `pfsense-2`
+(the pfSense node) is the **subnet router** advertising **172.16.0.0/22**. Key hosts on it:
+Gitea/Jupiter `172.16.3.20:3000`, GuruRMM + coord `172.16.3.30:3001`/`:8001`.
+
+**Symptom → cause:** if `sync.sh` fetch fails and the WHOLE `172.16.3.x` subnet is unreachable
+(both .20 and .30) **while general internet is fine**, the cause is almost always a **Tailscale
+node KEY EXPIRY** on an infra node (the subnet router or a server) — an expired key drops that node
+off the tailnet, killing the route. It is NOT a "transient blip" and NOT a real LAN outage (logged
+as a correction 2026-06-25 after I mis-called it). Mike **disabled key expiration** on the infra
+node(s) 2026-06-25 so it shouldn't recur; if it does, re-auth the node + confirm expiry is off in the
+Tailscale admin console.
+
+**Diagnose (Windows `tailscale.exe` at `C:\Program Files\Tailscale\`):**
+- `tailscale status` — look for peers marked `offline`/key-expired, esp. `pfsense-2` and `gururmm-server`.
+- `tailscale debug prefs | grep RouteAll` (in PowerShell, where `grep` is not built in: `| Select-String RouteAll`) — must be `true` (this machine accepts subnet routes).
+- `tailscale status --json` — confirm a peer advertises `172.16.0.0/22` (PrimaryRoutes) and is `Online`.
+- `tailscale ping <tailnet-100.x>` — tests tailnet path independent of the subnet route.
+
+**Fallback:** `gururmm-server` is directly reachable at its **tailnet IP `100.86.12.15:3001`** — usable
+in place of `172.16.3.30:3001` if the subnet route is down but the node itself is up. See [[feedback_tmp_path_windows]].
diff --git a/errorlog.md b/errorlog.md
index 146d6cde..9da82e91 100644
--- a/errorlog.md
+++ b/errorlog.md
@@ -17,6 +17,12 @@ Categories (the `[type]` tag): _(none)_ = skill/command execution failure ·
 
 <!-- Append entries below this line -->
 
+2026-06-25 | GURU-5070 | remediation-tool/EOP | [friction] checking ACG own-tenant EOP quarantine: reached for investigator-exo (401 - Exchange Admin role only on Exchange OPERATOR SP, not Investigator), then RecipientAddress needs JSON array not string (400); skill has no EOP/quarantine section at all [ctx: ref=feedback_exchange_role_recurring_gap]
+
+2026-06-25 | GURU-5070 | sync/tailscale | [correction] diagnosed 172.16.3.x unreachable as transient blip; real cause was Tailscale node KEY EXPIRY on the subnet-router node (pfSense advertising 172.16.0.0/22) dropping it off the tailnet [ctx: fix=disabled key expiration on the node; symptom=internet OK but whole 172.16.3.x dead]
+
+2026-06-25 | GURU-5070 | sync/gitea | fetch failed: could not connect to 172.16.3.20:3000 (Gitea unreachable, exit 128) [ctx: host=172.16.3.20:3000 machine=GURU-5070]
+
 2026-06-25 | Howard-Home | remediation-tool/reset-password.sh | JIT cleanup cannot self-remove: after elevating the Tenant Admin SP to Privileged Authentication Administrator to reset a password, the DELETE of that role assignment is performed BY the same SP and Graph blocks it (HTTP 400 'Removing self from built-in role is not allowed'), leaving a STANDING PAA role on the SP - needs a Global Admin/portal removal; script should detect this and surface portal steps instead of a bare WARNING [ctx: tenant=cascadestucson SP=ComputerGuru-Tenant-Admin role=PrivilegedAuthAdmin]
 
 2026-06-25 | Howard-Home | rmm/dispatch | [friction] embedded escaped quotes " , " in a PowerShell -join inside the jq/heredoc dispatch chain caused a parse error (script failed pre-exec, wasted one dispatch); fix: build strings with + concatenation or [char]44, never escaped quotes in RMM PowerShell payloads [ctx: ref=feedback_windows_quote_stripping]