IMC — AIM "connection broken" diagnosis + GuruRMM agent enrollment
Date: 2026-05-05
User
- User: Howard Enos (howard)
- Machine: Howard-Home
- Role: tech
Summary
User reported Telerik.OpenAccess.RT.sql.SQLException: Connection has been closed / "The connection is broken and recovery is not possible" on AIM at IMC. Provisioned a GuruRMM client + site for IMC, enrolled IMC1 as the first agent, ran read-only diagnostics. Root cause is sustained memory pressure on IMC1, not a SQL hang or service outage. Made no changes per Howard's instruction. A scheduled Restart-Service MSSQL$AIMSQL is the immediate fix; a longer-term consolidation conversation is needed (note for Mike below).
What was done
1. GuruRMM provisioning
| Resource | Value |
|---|---|
| Client | Instrumental Music Center (213b62a8-30f4-41dd-9bb3-549341104416, code IMC) |
| Site | IMCMain (2c5b65ad-2d5e-47b3-b12b-632e35e08ff6, site code INNER-BRIDGE-8354) |
| Site enrollment key | encrypted at clients/imc/gururmm-site-main.sops.yaml (vault) |
| First enrolled agent | IMC1 (fa99e913-1027-4e33-a928-7695e31068e7) — Mike installed via ScreenConnect |
2. Diagnostics on IMC1 (read-only)
MSSQL$AIMSQL is alive and responsive. SELECT 1 returns immediately. SQL Server 2019 Express GDR build 15.0.2165.1 (March 2026 patch). Service started 2026-04-25 22:01:37 (11 days ago). OS uptime 547h (last boot 2026-04-12).
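The checks above can be reproduced with something like the following. This is a minimal read-only sketch, not the exact commands run through the agent; it assumes sqlcmd is on PATH, that the instance is reachable as .\AIMSQL (inferred from the service name), and that the calling account can log in at server level.

```powershell
# Read-only liveness check; nothing here changes service or database state.
$svc = Get-Service -Name 'MSSQL$AIMSQL'
$os  = Get-CimInstance -ClassName Win32_OperatingSystem
$uptimeHours = [math]::Round(((Get-Date) - $os.LastBootUpTime).TotalHours)

# Service start time, taken from the process the service is running under.
$svcCim = Get-CimInstance -ClassName Win32_Service -Filter "Name='MSSQL`$AIMSQL'"
$proc   = Get-Process -Id $svcCim.ProcessId

"Service status : $($svc.Status)"
"Process start  : $($proc.StartTime)"
"OS uptime (h)  : $uptimeHours"

# -E = Windows auth, -b = non-zero exit code on error, -l 5 = 5s login timeout.
# Instance path .\AIMSQL is an assumption based on the service name.
sqlcmd -S '.\AIMSQL' -E -b -l 5 -Q 'SELECT 1, @@VERSION'
"sqlcmd exit code: $LASTEXITCODE"
```

A zero exit code with an immediate result is the "alive and responsive" signal; a hang or login timeout here would have pointed at SQL itself rather than memory.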
3. Root cause — memory pressure on IMC1
| Signal | Value |
|---|---|
| MSSQL$MICROSOFT##WID Event 17890 ("significant part of SQL process memory paged out") | 8 events in last 4h, durations 0–919s, working sets 153–175 MB while committed 326–348 MB |
| AIMSQL Total Server Memory | 587 MB (Target 7,224 MB; Express buffer-pool cap 1,410 MB) |
| AIMSQL page_fault_count since startup | 5,689,041 over 11 days |
| AIMSQL Page Life Expectancy | 842,990s (~9.7 days) — buffer barely churns because barely populated |
| Active RDP user sessions | 4 (repaircoordinator, Ru, leslie, EdServices2) + console (guru) |
| SQL instances co-resident on IMC1 | MSSQL$AIMSQL + MSSQL$SQLEXPRESS (separate, purpose unclear) + MSSQL$MICROSOFT##WID (WSUS / AD RMS) |
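For a later session, the memory signals in the table can be pulled again with a sketch along these lines. Assumptions: the instance path .\AIMSQL, a login with VIEW SERVER STATE, and that the 17890 events are logged to the Application log under the provider name MSSQL$MICROSOFT##WID. None of this is confirmed to be the exact method used today.

```powershell
# Read-only: SQL memory counters and process page faults for AIMSQL.
$query = @'
SELECT RTRIM(counter_name) AS counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE (counter_name IN ('Total Server Memory (KB)', 'Target Server Memory (KB)')
       AND object_name LIKE '%Memory Manager%')
   OR (counter_name = 'Page life expectancy'
       AND object_name LIKE '%Buffer Manager%');

SELECT page_fault_count, physical_memory_in_use_kb
FROM sys.dm_os_process_memory;
'@
sqlcmd -S '.\AIMSQL' -E -b -W -Q $query

# Working-set trim events (17890) from the WID instance over the last 4 hours.
# Provider name is assumed to match the event source shown in the table.
Get-WinEvent -FilterHashtable @{
    LogName      = 'Application'
    ProviderName = 'MSSQL$MICROSOFT##WID'
    Id           = 17890
    StartTime    = (Get-Date).AddHours(-4)
} -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Message
```

Total Server Memory sitting far below both Target and the Express cap, together with recurring 17890 events, is the pattern to compare against on the next check.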
4. Why this matches the symptom
AIM uses Telerik OpenAccess connection pooling. Pool slots hold idle TCP connections to SQL. Under memory pressure, Windows trims SQL working sets (the 17890 events above), and idle connections from the pool can be reaped — marked "unrecoverable" on the server side. The OpenAccess pool doesn't discover the dead handle until the next reuse, then throws on the trivial query (SELECT FROM scconfig). Telerik has no transient-fault retry, so the user sees the raw stack trace.
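To illustrate what "no transient-fault retry" means in practice: the first use of a stale pooled connection fails, and a second attempt on a fresh connection would normally succeed. The sketch below is a generic retry wrapper with a simulated stale-connection failure. It is purely illustrative; Invoke-WithRetry and the simulated action are hypothetical and say nothing about how AIM or OpenAccess are actually implemented.

```powershell
# Hypothetical helper: run an action, retrying on any terminating error.
function Invoke-WithRetry {
    param(
        [Parameter(Mandatory)] [scriptblock] $Action,
        [int] $MaxAttempts = 3,
        [int] $DelayMs = 200
    )
    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        try {
            return & $Action
        }
        catch {
            # Out of attempts: surface the original error to the caller.
            if ($attempt -eq $MaxAttempts) { throw }
            Start-Sleep -Milliseconds $DelayMs
        }
    }
}

# Simulation: the first call hits a "dead" pooled connection, the second succeeds.
$state = @{ Calls = 0 }
$result = Invoke-WithRetry -DelayMs 10 -Action {
    $state.Calls++
    if ($state.Calls -eq 1) { throw 'Connection has been closed' }
    'ok'
}
"result=$result after $($state.Calls) calls"
```

Without a wrapper like this, the first failure goes straight to the user, which matches the raw stack trace in the report.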
5. Other findings (not root cause but logged)
- DistributedCOM 10016 fires every 5 minutes — RuntimeBroker permission noise. Cosmetic.
- Group Policy event 103 every 5 min — "removal of the assignment of application Syncro from policy Management SW failed". Stale GPO needs cleanup separately.
- SvcRestartTask ran 2026-05-05 at 11:00 AM: Windows service auto-recovery kicked in for something. Not visible in SCM events for SQL services in 24h, so it wasn't AIMSQL.
- ERRORLOG hasn't been written to since 2026-04-25 22:06. Initially flagged as suspicious, but ERRORLOG only logs startup chatter, errors at high severity, and login audits (if enabled). Quiet ERRORLOG is consistent with a healthy quiet system, not a hang. False alarm.
- Backup job AIM Back Up ran 2026-05-04 22:00 successfully (LastTaskResult 0). AIMSQL ran from PID 34536 the whole time, no service restarts.
- Logging in via sqlcmd as NT AUTHORITY\SYSTEM (the agent context) couldn't open the AIM database (login failed). This is expected; SYSTEM isn't a granted login on AIM. We confirmed connectivity via master (server context).
Plan
Tonight — scheduled service restart (REGISTERED)
| Field | Value |
|---|---|
| Task name | AIMSQL_Restart_20260506_0230 |
| Trigger | 2026-05-06 02:30 AM MST (one-shot) |
| Runs as | SYSTEM |
| Action | powershell.exe -NoProfile -ExecutionPolicy Bypass -File "C:\Windows\Temp\aimsql-restart.ps1" |
| Restart script path | C:\Windows\Temp\aimsql-restart.ps1 (984 bytes) |
| Run log | C:\Windows\Temp\aimsql-restart.log (created on first run) |
| Auto-delete | none — manually remove after verifying clean run |
Script does: log pre-state → Restart-Service "MSSQL$AIMSQL" -Force → 15s sleep → log post-state → SELECT 1, GETDATE() via sqlcmd to confirm responsiveness → log result.
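The flow above looks roughly like the sketch below. This is a reconstruction from the step list, not a copy of the 984-byte script on IMC1; the Write-Log helper, the log line format, and the .\AIMSQL instance path are assumptions.

```powershell
# Hypothetical reconstruction of the restart flow; not the script deployed on IMC1.
$log     = 'C:\Windows\Temp\aimsql-restart.log'
$svcName = 'MSSQL$AIMSQL'

# Append a timestamped line to the run log.
function Write-Log([string] $Message) {
    "{0}  {1}" -f (Get-Date -Format 'yyyy-MM-dd HH:mm:ss'), $Message |
        Add-Content -Path $log
}

Write-Log "pre-state: $((Get-Service -Name $svcName).Status)"
try {
    # -Force also restarts any dependent services.
    Restart-Service -Name $svcName -Force -ErrorAction Stop
    Start-Sleep -Seconds 15
    Write-Log "post-state: $((Get-Service -Name $svcName).Status)"

    # Responsiveness probe at server level (SYSTEM cannot open the AIM database).
    $out = sqlcmd -S '.\AIMSQL' -E -b -l 15 -Q 'SELECT 1, GETDATE()' 2>&1
    Write-Log "sqlcmd exit=$LASTEXITCODE output=$($out -join ' | ')"
}
catch {
    Write-Log "ERROR: $($_.Exception.Message)"
    exit 1
}
```

Logging the sqlcmd exit code alongside the output means the morning check can confirm success from the log alone.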
Quirks worth noting for future scheduled-task work on this server:
- Register-ScheduledTask rejected -DeleteExpiredTaskAfter on its own; it requires a trigger EndBoundary. Setting EndBoundary after the fact also returned "parameter is incorrect" on this Server 2016 box. Workaround: register without DeleteExpiredTaskAfter and clean up by hand. Worth saving as a recipe for future jobs.
- schtasks.exe /Create failed with "parameter is incorrect", possibly because the colon in the original task name (... 02:30) hit a path-vs-name validation. Switched to underscore-only name (AIMSQL_Restart_20260506_0230), which Register-ScheduledTask accepted cleanly.
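As a recipe for future one-shot jobs on this server, the registration that worked can be sketched as follows. The values come from the task table above; -LogonType ServiceAccount and -RunLevel Highest are assumptions about how "Runs as SYSTEM" was configured, and the trigger time is interpreted in the server's local time zone.

```powershell
# One-shot task, no DeleteExpiredTaskAfter (that route failed on this Server 2016 box).
$action = New-ScheduledTaskAction -Execute 'powershell.exe' `
    -Argument '-NoProfile -ExecutionPolicy Bypass -File "C:\Windows\Temp\aimsql-restart.ps1"'

# -At is evaluated in the server's local time.
$trigger = New-ScheduledTaskTrigger -Once -At ([datetime]'2026-05-06T02:30:00')

# Assumption: SYSTEM as a service account with highest run level.
$principal = New-ScheduledTaskPrincipal -UserId 'SYSTEM' `
    -LogonType ServiceAccount -RunLevel Highest

# Underscore-only task name; the name containing a colon was rejected by schtasks.exe.
Register-ScheduledTask -TaskName 'AIMSQL_Restart_20260506_0230' `
    -Action $action -Trigger $trigger -Principal $principal
```

Because nothing auto-deletes the task, the manual Unregister-ScheduledTask step in the morning checklist is part of the recipe.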
Morning verification (Howard, 2026-05-06 first thing)
- Check C:\Windows\Temp\aimsql-restart.log via GuruRMM
- Confirm SELECT 1 succeeded and AIMSQL post-state is Running
- Sample-test AIM from a workstation if possible
- Delete the scheduled task:
Unregister-ScheduledTask -TaskName AIMSQL_Restart_20260506_0230 -Confirm:$false
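The checklist above could be run as one pass through the agent, for example as below. A sketch only; the tail length and output format are arbitrary choices, and the cleanup line is left commented so nothing is removed before the log has been read.

```powershell
$task = 'AIMSQL_Restart_20260506_0230'

# 1. Run log written by the restart script.
Get-Content -Path 'C:\Windows\Temp\aimsql-restart.log' -Tail 20

# 2. Task Scheduler's own view of the run (0 = success).
$info = Get-ScheduledTaskInfo -TaskName $task
"LastRunTime={0} LastTaskResult={1}" -f $info.LastRunTime, $info.LastTaskResult

# 3. Current service state.
"AIMSQL status: $((Get-Service -Name 'MSSQL$AIMSQL').Status)"

# 4. Only after the above look clean:
# Unregister-ScheduledTask -TaskName $task -Confirm:$false
```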
Note for Mike
See the dedicated section below — the long-term issues (3 SQL instances on a DC+RDS+file server with 4 chronic RDP users) need a strategy decision before they hit again.
Note for Mike
Howard ran read-only diagnostics on IMC1 today after the user-facing AIM "connection broken" error. Service is up, but the server is sustaining memory pressure that's chewing idle connection-pool slots. A scheduled SQL service restart at 02:30 AM 2026-05-06 will clear the immediate symptom, but doesn't address the underlying squeeze. Couple of decisions to make when you're back at this:
- Why is MSSQL$SQLEXPRESS running alongside MSSQL$AIMSQL? Two SQL instances doubles the memory overhead. If SQLEXPRESS is leftover from an old install (pre-AIMSQL migration?), shutting it down + uninstalling would give AIMSQL another ~1 GB of headroom. If something still uses it, we need to know what. Status: SQLAgent$SQLEXPRESS Stopped, MSSQL$SQLEXPRESS Running.
- Is the MSSQL$MICROSOFT##WID (Windows Internal Database) instance actually needed? It's used by WSUS and AD RMS. IMC1 doesn't appear to be a WSUS server in production; it has the role installed but I didn't see active WSUS clients. Same question: kill it if unused, free ~300 MB. (The 17890 paging events fired specifically against this instance, which is the canary.)
- Memory budget for the DC+RDS+SQL stack. IMC1 is hosting 4 concurrent RDP users + Domain Controller services + AIMSQL + the two other SQL instances + AIMsi backend, all on Server 2016 Standard with 32 GB. The current allocation is fine on paper but contention shows up under load. The existing "Server 2019 migration" plan in the README is the right answer; this incident is more evidence to push it forward. Worth scoping cost/timeline at next ACG strategy call.
- Server 2016 EOL is approaching (extended support ends 2027-01-12). Migration window is finite.
- SvcRestartTask mystery: Windows service auto-recovery ran today at 11:00 AM and we don't know which service it restarted. Worth checking System log around that timestamp on a later session.
The scheduled restart should hold things stable for now. If AIM users hit this again before we have a longer-term plan, escalate to me and we'll do an interactive deep-dive (longer Application log scan, perfmon trace, ETW capture, etc.) without making changes during business hours.
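For the SvcRestartTask question, a read-only starting point on a later session might look like this. Assumptions: that SvcRestartTask is a Task Scheduler task (so its action can be inspected directly), and that a 30-minute window around 11:00 AM is wide enough to catch the related Service Control Manager events.

```powershell
# If SvcRestartTask is a scheduled task, its action shows what it restarts.
Get-ScheduledTask -TaskName 'SvcRestartTask' -ErrorAction SilentlyContinue |
    ForEach-Object { $_.Actions }

# Service Control Manager events around the 11:00 AM run (window is an assumption).
$start = [datetime]'2026-05-05T10:45:00'
$end   = [datetime]'2026-05-05T11:15:00'
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Service Control Manager'
    StartTime    = $start
    EndTime      = $end
} -ErrorAction SilentlyContinue |
    Sort-Object TimeCreated |
    Select-Object TimeCreated, Id, Message |
    Format-List
```

Service state-change and unexpected-termination events in that window should name the service involved.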
References
- AIM error screenshot: provided by Mike at 2026-05-05 ~16:00 PT
- IMC1 vault: clients/imc/imc1.sops.yaml (existing, includes IMC\guru creds)
- GuruRMM site enrollment vault: clients/imc/gururmm-site-main.sops.yaml (created today)
- IMC client folder README: clients/instrumental-music-center/README.md
- GuruRMM agent ID for IMC1: fa99e913-1027-4e33-a928-7695e31068e7