sync: auto-sync from HOWARD-HOME at 2026-05-05 18:51:23

Author: Howard Enos
Machine: HOWARD-HOME
Timestamp: 2026-05-05 18:51:23
2026-05-05 18:51:24 -07:00
parent 0f79fdedf4
commit 03d985fe33
2 changed files with 97 additions and 0 deletions

@@ -0,0 +1,82 @@
# IMC — AIM "connection broken" diagnosis + GuruRMM agent enrollment
**Date:** 2026-05-05
## User
- **User:** Howard Enos (howard)
- **Machine:** Howard-Home
- **Role:** tech
## Summary
User reported `Telerik.OpenAccess.RT.sql.SQLException: Connection has been closed` / "The connection is broken and recovery is not possible" on AIM at IMC. Provisioned a GuruRMM client + site for IMC, enrolled IMC1 as the first agent, and ran read-only diagnostics. Root cause is sustained memory pressure on IMC1, not a SQL hang or service outage. Made no changes, per Howard's instruction. A scheduled `Restart-Service 'MSSQL$AIMSQL'` is the immediate fix (the quotes keep PowerShell from expanding `$AIMSQL` as a variable); a longer-term consolidation conversation is needed (see the note for Mike below).
## What was done
### 1. GuruRMM provisioning
| Resource | Value |
|---|---|
| Client | Instrumental Music Center (`213b62a8-30f4-41dd-9bb3-549341104416`, code `IMC`) |
| Site | IMCMain (`2c5b65ad-2d5e-47b3-b12b-632e35e08ff6`, site code `INNER-BRIDGE-8354`) |
| Site enrollment key | encrypted at `clients/imc/gururmm-site-main.sops.yaml` (vault) |
| First enrolled agent | IMC1 (`fa99e913-1027-4e33-a928-7695e31068e7`) — Mike installed via ScreenConnect |
### 2. Diagnostics on IMC1 (read-only)
`MSSQL$AIMSQL` is alive and responsive. `SELECT 1` returns immediately. SQL Server 2019 Express GDR build 15.0.2165.1 (March 2026 patch). Service started 2026-04-25 22:01:37 (about 10 days ago). OS uptime 547h (~23 days; last boot 2026-04-12).
### 3. Root cause — memory pressure on IMC1
| Signal | Value |
|---|---|
| `MSSQL$MICROSOFT##WID` Event 17890: "significant part of SQL process memory paged out" | 8 events in last 4h, durations 0-919s, working sets 153-175 MB while committed 326-348 MB |
| AIMSQL Total Server Memory | **587 MB** (Target 7,224 MB; Express buffer-pool cap 1,410 MB) |
| AIMSQL `page_fault_count` since startup | **5,689,041** over roughly 10 days |
| AIMSQL Page Life Expectancy | 842,990s (~9.8 days); the buffer barely churns because it is barely populated |
| Active RDP user sessions | 4 (`repaircoordinator`, `Ru`, `leslie`, `EdServices2`) + console (`guru`) |
| SQL instances co-resident on IMC1 | `MSSQL$AIMSQL` + `MSSQL$SQLEXPRESS` (separate, purpose unclear) + `MSSQL$MICROSOFT##WID` (WSUS / AD RMS) |
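
As a rough cross-check of the figures in the table, the sketch below converts Page Life Expectancy to days, derives an average page-fault rate, and compares Total Server Memory with the Express buffer-pool cap. Treating the commit timestamp (2026-05-05 18:51:23) as the observation time is an assumption; the diagnostics ran somewhat earlier, so the uptime is slightly overstated.

```python
from datetime import datetime

# Figures copied from the table above.
ple_seconds = 842_990
page_faults = 5_689_041
total_server_memory_mb = 587
express_buffer_cap_mb = 1_410

# Service start is from the diagnostics; the observation time is an
# assumption (commit timestamp of this log).
service_start = datetime(2026, 4, 25, 22, 1, 37)
observed_at = datetime(2026, 5, 5, 18, 51, 23)

uptime_seconds = (observed_at - service_start).total_seconds()
uptime_days = uptime_seconds / 86_400
ple_days = ple_seconds / 86_400
faults_per_second = page_faults / uptime_seconds
buffer_fill = total_server_memory_mb / express_buffer_cap_mb

print(f"service uptime       : {uptime_days:.1f} days")
print(f"page life expectancy : {ple_days:.1f} days")
print(f"avg page faults      : {faults_per_second:.1f} per second")
print(f"buffer pool in use   : {buffer_fill:.0%} of the Express cap")
```

A PLE that comes out close to the service uptime fits the reading in the table: pages loaded after startup have mostly never been evicted, because the buffer pool never filled.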
### 4. Why this matches the symptom
AIM uses Telerik OpenAccess connection pooling. Pool slots hold idle TCP connections to SQL. Under memory pressure, Windows trims SQL working sets (the 17890 events above), and idle connections from the pool can be reaped and marked "unrecoverable" on the server side. The OpenAccess pool doesn't discover the dead handle until the next reuse, so it throws on whatever query comes next, however trivial (here a `SELECT` from `scconfig`). Telerik has no transient-fault retry, so the user sees the raw stack trace.
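
The failure mode described above can be illustrated with a toy model. This is not Telerik OpenAccess code and not something we can change in AIM; every class and function name below is hypothetical. It only shows why an unvalidated pool surfaces a dead handle on the next reuse, and what a transient-fault retry would do differently.

```python
class ConnectionClosedError(Exception):
    """Stand-in for the 'Connection has been closed' error the user saw."""


class FakeConnection:
    """Hypothetical connection; `alive` goes False once the server reaps it."""

    def __init__(self, conn_id):
        self.conn_id = conn_id
        self.alive = True

    def execute(self, query):
        if not self.alive:
            raise ConnectionClosedError(f"connection {self.conn_id} has been closed")
        return f"ok:{query}"


class Pool:
    """Toy pool that hands out idle connections without validating them."""

    def __init__(self, size):
        self.idle = [FakeConnection(i) for i in range(size)]
        self._next_id = size

    def acquire(self):
        if self.idle:
            return self.idle.pop()
        conn = FakeConnection(self._next_id)  # pool empty: open a fresh one
        self._next_id += 1
        return conn

    def release(self, conn):
        self.idle.append(conn)


def run_once(pool, query):
    """No retry: a dead pooled handle surfaces straight to the caller."""
    conn = pool.acquire()
    result = conn.execute(query)  # raises if the handle was reaped while idle
    pool.release(conn)
    return result


def run_with_retry(pool, query, attempts=3):
    """Retry on a closed connection, dropping the dead handle each time."""
    for attempt in range(attempts):
        conn = pool.acquire()
        try:
            result = conn.execute(query)
        except ConnectionClosedError:
            if attempt == attempts - 1:
                raise
            continue  # dead handle is discarded, not returned to the pool
        pool.release(conn)
        return result


pool = Pool(size=2)
for conn in pool.idle:
    conn.alive = False  # simulate the server reaping both idle connections

try:
    run_once(pool, "SELECT 1")
except ConnectionClosedError as exc:
    print("without retry:", exc)

print("with retry:", run_with_retry(pool, "SELECT 1"))
```

In this model the retry path burns through the remaining dead handle and then opens a fresh connection, which is also roughly what a service restart achieves in bulk: every client reconnects and the pool starts clean.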
### 5. Other findings (not root cause but logged)
- DistributedCOM 10016 fires every 5 minutes — RuntimeBroker permission noise. Cosmetic.
- Group Policy event 103 every 5 min — "removal of the assignment of application Syncro from policy Management SW failed". Stale GPO needs cleanup separately.
- `SvcRestartTask` ran 2026-05-05 at 11:00 AM — Windows service auto-recovery kicked in for *something*. Not visible in SCM events for SQL services in 24h, so it wasn't AIMSQL.
- ERRORLOG hasn't been written to since 2026-04-25 22:06 — initially flagged as suspicious, but ERRORLOG only logs startup chatter, errors at high severity, and login audits (if enabled). Quiet ERRORLOG is consistent with a healthy quiet system, not a hang. False alarm.
- Backup job `AIM Back Up` ran 2026-05-04 22:00 successfully (LastTaskResult 0).
- `AIMSQL` ran from PID 34536 the whole time, no service restarts.
- Logging in via sqlcmd as `NT AUTHORITY\SYSTEM` (the agent context) couldn't open the AIM database (login failed) — this is expected; SYSTEM isn't a granted login on AIM. We confirmed connectivity via `master` (server context).
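
For the `SvcRestartTask` item above, a later session could pull Service Control Manager events from the System log in a window around 11:00 AM. The sketch below only builds a read-only `wevtutil` query and prints it; nothing is executed. It assumes IMC1's local time is UTC-07:00 (matching the commit timestamp), and the 15-minute window and function name are arbitrary choices for illustration.

```python
import subprocess
from datetime import datetime, timedelta, timezone

# Assumption: IMC1 local time is UTC-07:00, as in the commit timestamp.
LOCAL_TZ = timezone(timedelta(hours=-7))
STAMP = "%Y-%m-%dT%H:%M:%S.000Z"  # event log XPath expects UTC timestamps


def scm_query(center_local, minutes=15):
    """Build a wevtutil command for SCM events around a local timestamp."""
    center = center_local.replace(tzinfo=LOCAL_TZ).astimezone(timezone.utc)
    start = (center - timedelta(minutes=minutes)).strftime(STAMP)
    end = (center + timedelta(minutes=minutes)).strftime(STAMP)
    xpath = (
        "*[System[Provider[@Name='Service Control Manager'] and "
        f"TimeCreated[@SystemTime>='{start}' and @SystemTime<='{end}']]]"
    )
    return ["wevtutil", "qe", "System", f"/q:{xpath}", "/f:text"]


cmd = scm_query(datetime(2026, 5, 5, 11, 0))
# list2cmdline quotes the XPath argument for a Windows command line.
print(subprocess.list2cmdline(cmd))
```

The printed command can be reviewed and then run through the agent or an RDP session; widening `minutes` is the obvious knob if nothing shows up in the first pass.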
## Plan
### Tonight — scheduled service restart (low risk)
Restart `MSSQL$AIMSQL` at **02:30 AM MST on 2026-05-06** via a one-shot scheduled task on IMC1. Releases the paged-out working set, clears stale pool state, ~30 seconds of unavailability. RDP users with AIM open will need to relaunch in the morning. Expected to clear the immediate symptom for at least a week or two.
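
One way the one-shot task could be assembled is sketched below. It only builds and prints a `schtasks` command for review and does not create anything. The task name is hypothetical, the `/SD` date format is assumed to be the en-US `MM/DD/YYYY` form (it follows the server's locale), and `-Force` is included on the assumption that dependent services should not block the restart.

```python
import subprocess


def one_shot_restart_task(task_name, service, start_date, start_time):
    """Return a schtasks argument list for a single scheduled service restart.

    Nothing is executed here; the list is meant to be reviewed first.
    """
    # Single quotes stop PowerShell from expanding "$AIMSQL" as a variable.
    action = (
        "powershell.exe -NoProfile -Command "
        f"Restart-Service -Name '{service}' -Force"
    )
    return [
        "schtasks", "/Create",
        "/TN", task_name,
        "/TR", action,
        "/SC", "ONCE",
        "/SD", start_date,  # locale-dependent date format (assumed en-US)
        "/ST", start_time,  # 24-hour HH:mm, server local time
        "/RU", "SYSTEM",
    ]


cmd = one_shot_restart_task(
    task_name="IMC-Restart-AIMSQL",  # hypothetical task name
    service="MSSQL$AIMSQL",
    start_date="05/06/2026",
    start_time="02:30",
)
print(subprocess.list2cmdline(cmd))
```

A `/SC ONCE` task stays registered after it fires, so it would need to be deleted by hand afterwards (for example with `schtasks /Delete /TN <name>`).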
### Note for Mike
See the dedicated section below — the long-term issues (3 SQL instances on a DC+RDS+file server with 4 chronic RDP users) need a strategy decision before they hit again.
## Note for Mike
Howard ran read-only diagnostics on IMC1 today after the user-facing AIM "connection broken" error. Service is up, but the server is sustaining memory pressure that's chewing idle connection-pool slots. **A scheduled SQL service restart at 02:30 AM 2026-05-06 will clear the immediate symptom**, but doesn't address the underlying squeeze. Couple of decisions to make when you're back at this:
1. **Why is `MSSQL$SQLEXPRESS` running alongside `MSSQL$AIMSQL`?** Running two SQL instances doubles the memory overhead. If SQLEXPRESS is left over from an old install (pre-AIMSQL migration?), shutting it down and uninstalling it would give AIMSQL roughly another 1 GB of headroom. If something still uses it, we need to know what. Status: `SQLAgent$SQLEXPRESS` Stopped, `MSSQL$SQLEXPRESS` Running.
2. **Is the `MSSQL$MICROSOFT##WID` (Windows Internal Database) instance actually needed?** It's used by WSUS and AD RMS. IMC1 doesn't appear to be a WSUS server in production — it has the role installed but I didn't see active WSUS clients. Same question: kill it if unused, free ~300 MB. (The 17890 paging events fired specifically against this instance, which is the canary.)
3. **Memory budget for the DC+RDS+SQL stack.** IMC1 is hosting 4 concurrent RDP users + Domain Controller services + AIMSQL + the two other SQL instances + AIMsi backend, all on Server 2016 Standard with 32 GB. The current allocation is fine on paper but contention shows up under load. The existing "Server 2019 migration" plan in the README is the right answer; this incident is more evidence to push it forward. Worth scoping cost/timeline at next ACG strategy call.
4. **Server 2016 EOL** is approaching (extended support ends 2027-01-12). Migration window is finite.
5. **`SvcRestartTask` mystery** — Windows service auto-recovery ran today at 11:00 AM and we don't know which service it restarted. Worth checking System log around that timestamp on a later session.
The scheduled restart should hold things stable for now. If AIM users hit this again before we have a longer-term plan, escalate to me and we'll do an interactive deep-dive (longer Application log scan, perfmon trace, ETW capture, etc.) without making changes during business hours.
## References
- AIM error screenshot: provided by Mike at 2026-05-05 ~16:00 PT
- IMC1 vault: `clients/imc/imc1.sops.yaml` (existing, includes IMC\guru creds)
- GuruRMM site enrollment vault: `clients/imc/gururmm-site-main.sops.yaml` (created today)
- IMC client folder README: `clients/instrumental-music-center/README.md`
- GuruRMM agent ID for IMC1: `fa99e913-1027-4e33-a928-7695e31068e7`