sync: auto-sync from GURU-5070 at 2026-06-25 12:35:22

Author: Mike Swanson Machine: GURU-5070 Timestamp: 2026-06-25 12:35:22
2026-06-25 12:36:24 -07:00
parent 0f803c2d9c
commit e61b39b5c8
4 changed files with 66 additions and 0 deletions
--- a/.claude/memory/MEMORY.md
+++ b/.claude/memory/MEMORY.md
@@ -2,6 +2,7 @@

 ## Reference
 - [ACG resource map](reference_resource_map.md) — **READ THIS FIRST** when a task references a server/service/tenant/API. What we have access to, how to connect from this machine, per-machine exceptions, gotchas. Points at the detail files below.
+- [Tailscale subnet-route key expiry](reference_tailscale_subnet_key_expiry.md) — "internet OK but all of 172.16.3.x (Gitea .20, RMM/coord .30) dead" = Tailscale infra-node KEY EXPIRY (pfSense subnet router advertises 172.16.0.0/22), NOT a LAN outage; expiry now disabled on infra nodes (2026-06-25). Fallback: gururmm-server direct at tailnet 100.86.12.15:3001.
 - [GravityZone support center](reference_gravityzone_support.md) — Authoritative Bitdefender GravityZone product + Public API docs; use to confirm UNVERIFIED `bitdefender` skill methods/param shapes (push setPushEventSettings, assignPolicy, report/account writes, maintenancewindows/integrations names).
 - [GURU-5070 Rust toolchain](reference_guru5070_rust_toolchain.md) — GURU-5070 now has cargo + MSVC + protoc; build/clippy/test guru-connect LOCALLY (set PROTOC to the winget path) instead of the build host. CI only clippy-checks the Linux server, not the Windows agent.
 - [ACG Office Network Infrastructure](infra_office_network.md) — IPs/hosts/roles for pfSense/Jupiter/VMs/Docker. Check before assuming; .21 (Uranus) is storage.
--- a/.claude/memory/reference_syncro_agent_handle_leak.md
+++ b/.claude/memory/reference_syncro_agent_handle_leak.md
@@ -0,0 +1,32 @@
+---
+name: reference_syncro_agent_handle_leak
+description: RDS "no available computers in the pool" (0x3/0x408) can really be a SyncroLive.Agent.Runner handle leak starving the box. How to spot + fix.
+metadata:
+  type: reference
+---
+
+**Symptom chain that hid the real cause (IMC1 / Instrumental Music Center, 2026-06-25):**
+RemoteApp/RDP launch fails → *"There are no available computers in the pool"* (RDP **error 0x3 / extended 0x408**).
+The RD Connection Broker **Admin** log (`Microsoft-Windows-TerminalServices-SessionBroker/Admin`,
+**Event 802**) shows the truth: *"RD Connection Broker failed to process the connection request …
+Error: **Insufficient system resources** exist to complete the requested service."* The
+SessionBroker-Client log shows Event 1296 ("Element not found") and Event 1306 (redirect failed). The collection and
+session host are healthy (`NewConnectionAllowed: Yes`), so the broker isn't the bug — the **box is
+out of a kernel resource**.
+
+**Root cause:** the **Syncro RMM agent `SyncroLive.Agent.Runner` was leaking HANDLES** — 1,135,414
+handles in one process (~80% of the box's 1.41M total). Handle/object exhaustion → broker can't
+create the session.
+
+**Diagnose:** `Get-Process | Sort-Object Handles -Descending | Select -First 6 Name,Handles,Id` and
+`(Get-Process | Measure-Object Handles -Sum).Sum`. Memory looks fine (it's handles, not RAM/commit).
+Services on the box: `Syncro`, `SyncroLive`, `SyncroOvermind` (SyncroRecovery).
+
+**Fix (no reboot needed):** `Stop-Process -Name 'SyncroLive.Agent.Runner' -Force` — the Syncro
+watchdog respawns it clean (box total dropped 1.41M → 280K handles; the new runner came back at ~900). Have the user
+retry the RemoteApp immediately.
+
+**It recurs** (leak accumulates over uptime) → schedule a periodic SyncroLive restart and/or update the
+agent; **likely fleet-wide** — sweep other client servers for high-handle `SyncroLive.Agent.Runner`
+(deferred 2026-06-25). IMC1 also had a separate pending reboot (wedged KB5075999) + expired RDS certs.
+See [[reference_resource_map]].
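
Note on the deferred fleet sweep: the handle check in the Diagnose step could be scripted roughly as below. This is a minimal sketch, not what was run on IMC1. It assumes per-process counts have already been exported (for example with `Get-Process | Select-Object Name,Id,Handles | ConvertTo-Json`); `ProcHandles`, `find_handle_hogs`, and the 50% threshold are hypothetical names and values chosen for illustration.

```python
from typing import Iterable, List, NamedTuple, Tuple


class ProcHandles(NamedTuple):
    """One process row: the Name, Id and Handles columns from Get-Process."""
    name: str
    pid: int
    handles: int


def find_handle_hogs(
    procs: Iterable[ProcHandles], share_threshold: float = 0.5
) -> Tuple[int, List[ProcHandles]]:
    """Return (total_handles, hogs).

    A "hog" is any process holding at least `share_threshold` of all
    handles on the box. The threshold is an assumption, not a Syncro or
    Windows constant.
    """
    rows = list(procs)
    total = sum(p.handles for p in rows)
    if total == 0:
        return 0, []
    hogs = [p for p in rows if p.handles / total >= share_threshold]
    hogs.sort(key=lambda p: p.handles, reverse=True)
    return total, hogs


if __name__ == "__main__":
    # The runner count is the one recorded in the note; the second row is a
    # made-up remainder so the total matches the ~1.41M figure.
    sample = [
        ProcHandles("SyncroLive.Agent.Runner", 4242, 1_135_414),
        ProcHandles("everything-else", 0, 274_586),
    ]
    total, hogs = find_handle_hogs(sample)
    for p in hogs:
        print(f"{p.name} (pid {p.pid}): {p.handles} of {total} handles")
```

A share-of-total test was chosen over a fixed handle count because healthy totals differ a lot between servers; either would flag the IMC1 case.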
--- a/.claude/memory/reference_tailscale_subnet_key_expiry.md
+++ b/.claude/memory/reference_tailscale_subnet_key_expiry.md
@@ -0,0 +1,27 @@
+---
+name: reference_tailscale_subnet_key_expiry
+description: "Internet OK but all of 172.16.3.x dead" = Tailscale infra-node key expiry, not a LAN outage. How to diagnose + the fallback path.
+metadata:
+  type: reference
+---
+
+The ACG internal subnet **172.16.3.x is reached over Tailscale**, not a local LAN — `pfsense-2`
+(the pfSense node) is the **subnet router** advertising **172.16.0.0/22**. Key hosts on it:
+Gitea/Jupiter `172.16.3.20:3000`, GuruRMM + coord `172.16.3.30:3001`/`:8001`.
+
+**Symptom → cause:** if `sync.sh` fetch fails and the WHOLE `172.16.3.x` subnet is unreachable
+(both .20 and .30) **while general internet is fine**, the cause is almost always a **Tailscale
+node KEY EXPIRY** on an infra node (the subnet router or a server) — an expired key drops that node
+off the tailnet, killing the route. It is NOT a "transient blip" and NOT a real LAN outage (logged
+as a correction 2026-06-25 after I mis-called it). Mike **disabled key expiration** on the infra
+node(s) 2026-06-25 so it shouldn't recur; if it does, re-auth the node + confirm expiry is off in the
+Tailscale admin console.
+
+**Diagnose (Windows `tailscale.exe` at `C:\Program Files\Tailscale\`):**
+- `tailscale status` — look for peers marked `offline`/key-expired, esp. `pfsense-2` and `gururmm-server`.
+- `tailscale debug prefs | grep RouteAll` — must be `true` (this machine accepts subnet routes).
+- `tailscale status --json` — confirm a peer advertises `172.16.0.0/22` (PrimaryRoutes) and is `Online`.
+- `tailscale ping <tailnet-100.x>` — tests tailnet path independent of the subnet route.
+
+**Fallback:** `gururmm-server` is directly reachable at its **tailnet IP `100.86.12.15:3001`** — usable
+in place of `172.16.3.30:3001` if the subnet route is down but the node itself is up. See [[feedback_tmp_path_windows]].
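
Note on the `tailscale status --json` check: the "confirm a peer advertises the route and is Online" step can be sketched as a small parser. This is a hedged example, not part of `sync.sh`. The field names `Peer`, `HostName`, `Online`, `Expired`, and `PrimaryRoutes` are how I understand the JSON output and should be verified against the installed client version; `diagnose_subnet_route` is a hypothetical helper name.

```python
import json
from typing import Any, Dict

SUBNET = "172.16.0.0/22"  # route advertised by pfsense-2, per the note above


def diagnose_subnet_route(status: Dict[str, Any], route: str = SUBNET) -> str:
    """Classify the subnet-route state from parsed `tailscale status --json`.

    Assumes `status["Peer"]` maps node keys to peer objects, each of which
    may carry HostName, Online, Expired and PrimaryRoutes fields.
    """
    peers = list((status.get("Peer") or {}).values())
    routers = [p for p in peers if route in (p.get("PrimaryRoutes") or [])]
    if not routers:
        # An expired router may simply stop being listed for the route.
        return f"no peer holds {route}: check the subnet router for key expiry"
    for p in routers:
        host = p.get("HostName", "<unknown>")
        if p.get("Expired"):
            return f"{host}: key expired, re-auth and confirm expiry is disabled"
        if not p.get("Online"):
            return f"{host}: holds {route} but is offline"
    return "ok"


def diagnose_from_text(raw: str, route: str = SUBNET) -> str:
    """Convenience wrapper for raw JSON text captured from the CLI."""
    return diagnose_subnet_route(json.loads(raw), route)
```

Usage would be along the lines of capturing `tailscale status --json` to a file and passing its contents to `diagnose_from_text`. Any result other than `ok` while general internet works points at the key-expiry case described above rather than a LAN outage.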