sync: auto-sync from GURU-5070 at 2026-06-25 12:35:22

Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-25 12:35:22
2026-06-25 12:36:24 -07:00
parent 0f803c2d9c
commit e61b39b5c8
4 changed files with 66 additions and 0 deletions

@@ -2,6 +2,7 @@
## Reference
- [ACG resource map](reference_resource_map.md) — **READ THIS FIRST** when a task references a server/service/tenant/API. What we have access to, how to connect from this machine, per-machine exceptions, gotchas. Points at the detail files below.
- [Tailscale subnet-route key expiry](reference_tailscale_subnet_key_expiry.md) — "internet OK but all of 172.16.3.x (Gitea .20, RMM/coord .30) dead" = Tailscale infra-node KEY EXPIRY (pfSense subnet router advertises 172.16.0.0/22), NOT a LAN outage; expiry now disabled on infra nodes (2026-06-25). Fallback: gururmm-server direct at tailnet 100.86.12.15:3001.
- [GravityZone support center](reference_gravityzone_support.md) — Authoritative Bitdefender GravityZone product + Public API docs; use to confirm UNVERIFIED `bitdefender` skill methods/param shapes (push setPushEventSettings, assignPolicy, report/account writes, maintenancewindows/integrations names).
- [GURU-5070 Rust toolchain](reference_guru5070_rust_toolchain.md) — GURU-5070 now has cargo + MSVC + protoc; build/clippy/test guru-connect LOCALLY (set PROTOC to the winget path) instead of the build host. CI only clippy-checks the Linux server, not the Windows agent.
- [ACG Office Network Infrastructure](infra_office_network.md) — IPs/hosts/roles for pfSense/Jupiter/VMs/Docker. Check before assuming; .21 (Uranus) is storage.

@@ -0,0 +1,32 @@
---
name: reference_syncro_agent_handle_leak
description: RDS "no available computers in the pool" (0x3/0x408) can really be a SyncroLive.Agent.Runner handle leak starving the box. How to spot + fix.
metadata:
type: reference
---
**Symptom chain that hid the real cause (IMC1 / Instrumental Music Center, 2026-06-25):**
RemoteApp/RDP launch fails → *"There are no available computers in the pool"* (RDP **error 0x3 / extended 0x408**).
The RD Connection Broker **Admin** log (`Microsoft-Windows-TerminalServices-SessionBroker/Admin`,
**Event 802**) shows the truth: *"RD Connection Broker failed to process the connection request …
Error: **Insufficient system resources** exist to complete the requested service."* The
SessionBroker-Client log shows Event 1296 ("Element not found") and Event 1306 (redirect failed). The collection +
session host are healthy (`NewConnectionAllowed: Yes`), so the broker isn't the bug — the **box is
out of a kernel resource**.
**Root cause:** the **Syncro RMM agent `SyncroLive.Agent.Runner` was leaking HANDLES** — 1,135,414
handles in one process (~80% of the box's 1.41M total). Handle/object exhaustion → broker can't
create the session.
**Diagnose:** `Get-Process | Sort-Object Handles -Descending | Select -First 6 Name,Handles,Id` and
`(Get-Process | Measure-Object Handles -Sum).Sum`. Memory looks fine (it's handles, not RAM/commit).
Services on the box: `Syncro`, `SyncroLive`, `SyncroOvermind` (SyncroRecovery).
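
The handle-share check above can be sketched as a small triage helper for the deferred fleet sweep. This is a minimal sketch, not the procedure used on IMC1: the `ProcHandles` type, the `find_handle_hogs` function, both thresholds, and the PID are hypothetical, and the sample rows are filler chosen only so the total lands near the 1.41M observed. It assumes the per-process counts were already exported from `Get-Process`.

```python
from typing import List, NamedTuple, Tuple


class ProcHandles(NamedTuple):
    """One row of exported Get-Process output (hypothetical shape)."""
    name: str
    pid: int
    handles: int


def find_handle_hogs(
    procs: List[ProcHandles],
    share_threshold: float = 0.5,   # assumed cutoff, not from the incident notes
    min_handles: int = 100_000,     # assumed floor to ignore tiny hosts
) -> List[Tuple[ProcHandles, float]]:
    """Return (process, share_of_total) for processes holding an outsized share of handles."""
    total = sum(p.handles for p in procs)
    if total == 0:
        return []
    hogs = [
        (p, p.handles / total)
        for p in procs
        if p.handles >= min_handles and p.handles / total >= share_threshold
    ]
    # Largest share first, matching the Sort-Object Handles -Descending view.
    return sorted(hogs, key=lambda item: item[1], reverse=True)


# Only the runner's 1,135,414 count comes from the incident; the rest is filler.
sample = [
    ProcHandles("SyncroLive.Agent.Runner", 4321, 1_135_414),
    ProcHandles("everything-else (aggregated filler)", 0, 274_586),
]

hogs = find_handle_hogs(sample)
for proc, share in hogs:
    print(f"{proc.name} (pid {proc.pid}): {proc.handles:,} handles, {share:.0%} of total")
```

A share-based cutoff is used rather than an absolute count because a healthy total varies by server; the floor keeps small boxes from tripping it.
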
**Fix (no reboot needed):** `Stop-Process -Name 'SyncroLive.Agent.Runner' -Force` — the Syncro
watchdog respawns it clean (dropped 1.41M → 280K handles, runner came back at ~900). Have the user
retry the RemoteApp immediately.
**It recurs** (leak accumulates over uptime) → schedule a periodic SyncroLive restart and/or update the
agent; **likely fleet-wide** — sweep other client servers for high-handle `SyncroLive.Agent.Runner`
(deferred 2026-06-25). IMC1 also had a separate pending reboot (wedged KB5075999) + expired RDS certs.
See [[reference_resource_map]].

@@ -0,0 +1,27 @@
---
name: reference_tailscale_subnet_key_expiry
description: "Internet OK but all of 172.16.3.x dead" = Tailscale infra-node key expiry, not a LAN outage. How to diagnose + the fallback path.
metadata:
type: reference
---
The ACG internal subnet **172.16.3.x is reached over Tailscale**, not a local LAN — `pfsense-2`
(the pfSense node) is the **subnet router** advertising **172.16.0.0/22**. Key hosts on it:
Gitea/Jupiter `172.16.3.20:3000`, GuruRMM + coord `172.16.3.30:3001`/`:8001`.
**Symptom → cause:** if `sync.sh` fetch fails and the WHOLE `172.16.3.x` subnet is unreachable
(both .20 and .30) **while general internet is fine**, the cause is almost always a **Tailscale
node KEY EXPIRY** on an infra node (the subnet router or a server) — an expired key drops that node
off the tailnet, killing the route. It is NOT a "transient blip" and NOT a real LAN outage (logged
as a correction 2026-06-25 after I mis-called it). Mike **disabled key expiration** on the infra
node(s) 2026-06-25 so it shouldn't recur; if it does, re-auth the node + confirm expiry is off in the
Tailscale admin console.
**Diagnose (Windows `tailscale.exe` at `C:\Program Files\Tailscale\`):**
- `tailscale status` — look for peers marked `offline`/key-expired, esp. `pfsense-2` and `gururmm-server`.
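- `tailscale debug prefs | Select-String RouteAll` — must be `true` (this machine accepts subnet routes). `grep` is not available in a stock PowerShell session; use it only from Git Bash or similar.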
- `tailscale status --json` — confirm a peer advertises `172.16.0.0/22` (PrimaryRoutes) and is `Online`.
- `tailscale ping <tailnet-100.x>` — tests tailnet path independent of the subnet route.
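
The `tailscale status --json` check in the list above can be scripted roughly as follows. This is a sketch under assumptions: it relies on the JSON having a top-level `Peer` map whose entries carry `HostName`, `Online`, `Expired`, and `PrimaryRoutes`, which should be confirmed against real output on this machine. The function names and the sample documents are hypothetical.

```python
import json
from typing import Dict, List

SUBNET_ROUTE = "172.16.0.0/22"  # route advertised by the subnet router, per the note above


def find_subnet_routers(status_json: str, route: str = SUBNET_ROUTE) -> List[Dict]:
    """List peers that advertise `route`, with their online/expired flags."""
    status = json.loads(status_json)
    peers = status.get("Peer") or {}          # assumed: map keyed by node key
    found = []
    for peer in peers.values():
        routes = peer.get("PrimaryRoutes") or []
        if route in routes:
            found.append({
                "host": peer.get("HostName", "?"),
                "online": bool(peer.get("Online", False)),
                "expired": bool(peer.get("Expired", False)),
            })
    return found


def route_is_healthy(status_json: str, route: str = SUBNET_ROUTE) -> bool:
    """True if at least one peer advertising `route` is online with an unexpired key."""
    return any(r["online"] and not r["expired"] for r in find_subnet_routers(status_json, route))


# Hand-written sample documents for illustration, not captured output.
healthy_sample = json.dumps({"Peer": {
    "nodekey:aaa": {"HostName": "pfsense-2", "Online": True, "PrimaryRoutes": [SUBNET_ROUTE]},
    "nodekey:bbb": {"HostName": "gururmm-server", "Online": True},
}})
expired_sample = json.dumps({"Peer": {
    "nodekey:aaa": {"HostName": "pfsense-2", "Online": False, "Expired": True,
                    "PrimaryRoutes": [SUBNET_ROUTE]},
}})

print(route_is_healthy(healthy_sample))
print(route_is_healthy(expired_sample))
```

In real use the input would be the stdout of `tailscale status --json`, for example captured with `subprocess.run([...], capture_output=True, text=True)`. An expired router may also drop out of `PrimaryRoutes` entirely, in which case the list comes back empty and the result is still `False`.
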
**Fallback:** `gururmm-server` is directly reachable at its **tailnet IP `100.86.12.15:3001`** — usable
in place of `172.16.3.30:3001` if the subnet route is down but the node itself is up. See [[feedback_tmp_path_windows]].
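
The fallback rule can be sketched as a try-in-order endpoint picker. This is a minimal sketch, not how `sync.sh` or any ACG tooling selects a host: `pick_endpoint` and `tcp_reachable` are hypothetical helpers, the timeout is an assumed value, and a bare TCP connect only shows that the port answers, not that the service behind it is healthy.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

Endpoint = Tuple[str, int]

PRIMARY: Endpoint = ("172.16.3.30", 3001)    # via the subnet route
FALLBACK: Endpoint = ("100.86.12.15", 3001)  # gururmm-server tailnet IP


def tcp_reachable(endpoint: Endpoint, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        return False


def pick_endpoint(
    candidates: Sequence[Endpoint] = (PRIMARY, FALLBACK),
    probe: Callable[[Endpoint], bool] = tcp_reachable,
) -> Optional[Endpoint]:
    """Return the first candidate the probe accepts, or None if none respond."""
    for endpoint in candidates:
        if probe(endpoint):
            return endpoint
    return None
```

Passing the probe as a parameter keeps the selection logic testable without network access, for example `pick_endpoint(probe=lambda ep: ep == FALLBACK)` simulates the subnet route being down while the node itself is up.
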