diff --git a/clients/internal-infrastructure/session-logs/2026-04-23-neptune-inbound-mail-outage.md b/clients/internal-infrastructure/session-logs/2026-04-23-neptune-inbound-mail-outage.md index 9753a37..e1054ac 100644 --- a/clients/internal-infrastructure/session-logs/2026-04-23-neptune-inbound-mail-outage.md +++ b/clients/internal-infrastructure/session-logs/2026-04-23-neptune-inbound-mail-outage.md @@ -469,3 +469,267 @@ All prior backups from phase 1 remain in `C:\BackupBeforeFix\`. No new registry **For future Windows Update rollouts on this server:** verify Exchange mailflow end-to-end (including `Get-MessageTrackingLog -EventId DELIVER`) within 10 minutes of the post-update reboot. The initial FETS-accepts-SMTP test is not sufficient — it only validates external ingest, not mailbox delivery. +--- + +## Update: 13:55 – 14:45 — Post-Reboot Recovery & Revised Root Cause + +### Revised root cause (supersedes the KB-update theory above) + +**Neptune was in-place upgraded from Windows Server 2016 to Windows Server 2022 on 2026-04-22 at 15:04 local** — 23 hours before today's outage. Mike disclosed this mid-session. This changes the whole analysis: KB5082142 and KB5084071 were *triggers* that forced Exchange transport services to reload, but the *underlying* problem is that **Exchange 2016 on Windows Server 2022 is an unsupported combination** (Exch 2016 requires WS2016; only Exch 2019 supports WS2022, and 2019 is out of support as of 2025-10-14). Multiple latent WS2022-incompatibilities surfaced the moment services were forced to reinitialize: + +1. NETWORK SERVICE missing rights on `AssistantsQuarantine` — default ACL changed between WS2016 and WS2022. +2. Exchange transport's internal DNS resolver on WS2022 **does not honor the OS DNS suffix search list**. Queries for single-label names like `n-hosting1` get `DnsDomainDoesNotExist` even when `n-hosting1.acg.local` resolves fine via the OS resolver. This is the new big one — it's not a registry ACL fluke, it's a behavioral incompatibility baked into how `edgetransport.exe` talks to the DNS client on 2022. +3. Service-start ordering at boot: `MSExchangeADTopology` hit the SCM 45-second start-control timeout on the first post-reboot boot (WS2022 AD-locator is slower than Exchange 2016's SCM expectation), which cascaded to every other service failing with "An instance of the service is already running." +4. Built-in agents `RMS Encryption Agent` and `Index Routing Agent` (both non-disable-able via `Disable-TransportAgent`) timed out ~160 sec and ~95 sec per message respectively, effectively freezing the categorizer. + +### Rollback status — NOT AVAILABLE + +Checked every path. `C:\Windows.old` exists but is useless by itself: + +| Recovery path | State | +|---|---| +| `dism /Online /Initiate-OSUninstall` | Error 1168 (disarmed — uninstall pointers cleaned) | +| `$Windows.~BT` / `$Windows.~WS` staging | Deleted | +| VSS shadow copies on C: | None | +| VM host snapshot | N/A — bare-metal Dell PowerEdge R720 | +| Windows Server Backup | Not installed (`wbengine` service missing) | +| External backup (Veeam/Datto/etc.) | None for this machine — Mike confirmed | + +Windows cleanmgr or a post-upgrade Windows Update most likely nuked the rollback metadata inside the first 24 hours. No restore point, no snapshot, no backup. The WS2022 upgrade is effectively permanent unless we reinstall WS2016 from media. + +### Recovery sequence that actually worked (chronological) + +**13:45** — Full reboot completed. Server uptime reset. + +**13:46–13:48** — `MSExchangeADTopology` hit 45-sec SCM timeout on auto-start; all 27 Exchange services cascaded into failed/Stopped state. No manual start was attempted during this window — it's a cold-boot race condition. + +**13:57** — `Start-Service MSExchangeADTopology` (manual) completed in **3.6 seconds**. AD was reachable all along — the issue is purely that WS2022's boot timing doesn't satisfy Exchange 2016's pre-ADTopology dependency checks in the default 45-sec SCM window. + +**13:58 – 14:00** — Started remaining 25 Exchange services in dependency order. All reached Running. `MSExchangeMailboxAssistants` took 41 sec, `MSExchangeHM` took 33 sec, others <30 sec. No failures. + +**14:02** — Verified post-reboot state: ACL ACE survived, `msExchRoutingMasterDN = NEPTUNE`, DkimSigner + messageconcept SBR still disabled, DBs mounted (N-Hosting1 809 GB + N-LargeBoxes 313 GB), hosts + DC-side DNS entries intact. + +**14:03** — Found `n-hosting1` queue stuck with 405+ messages, LastError `DnsDomainDoesNotExist`. Force-retried — no progress. This is when the WS2022 suffix-search behavior became apparent: `.NET's GetHostEntry('n-hosting1')` returns 172.16.3.11 (via suffix search), but edgetransport queries the raw label and the DC correctly answers SERVFAIL. + +**14:05–14:09** — Enabled GlobalNames Zone (GNZ) on ACG-DC16 as a potential fix; added CNAMEs for `n-hosting1`, `n-largeboxes`, `mail` → `mail.acg.local.`. Did **not** help. Even after restarting DNS on the DC, `Resolve-DnsName n-hosting1` still returned SERVFAIL locally on the DC — something about GNZ lookup interacting with AD-integrated zones on this specific server refused to work. GNZ remains enabled (no harm) but it wasn't the fix. + +**14:10** — Restarted `MSExchangeTransport` hoping stale topology cache was the issue. The n-hosting1 / n-largeboxes queues dropped to 0 — but the messages all moved into Submission, which then sat at 640 with 0 DELIVER events. + +**14:11–14:13** — Diagnosed categorizer stall. Event Viewer showed: +- `RMS Encryption Agent` Event 1050 warnings every ~160 sec (7+ fires in one minute), `OnRoutedMessage`. +- `Index Routing Agent` Event 1050 warnings every ~90–97 sec (8+ fires in one minute), `OnResolvedMessage`. + +Both are built-in, both failed `Disable-TransportAgent "Transport agent 'X' isn't found"` (they're not in the public agent list and must be silenced indirectly). + +**14:14** — RMS Encryption Agent silenced via full IRM lockdown: +```powershell +Set-IRMConfiguration -InternalLicensingEnabled $false -ExternalLicensingEnabled $false ` + -TransportDecryptionSetting Disabled -JournalReportDecryptionEnabled $false ` + -SimplifiedClientAccessEnabled $false -EDiscoverySuperUserEnabled $false -Confirm:$false +``` +(The earlier pre-reboot `Set-IRMConfiguration` from phase 1 only addressed TransportDecryption and JournalReport. The full lockdown with `InternalLicensingEnabled $false` + `ExternalLicensingEnabled $false` is what actually stopped the agent from firing.) + +**14:15** — Index Routing Agent silenced by restarting the search subsystem: +```powershell +Stop-Service HostControllerService -Force +Stop-Service MSExchangeFastSearch -Force +Get-Process noderunner | Stop-Process -Force # 4 workers, 1.7 GB total +Start-Service MSExchangeFastSearch +Start-Service HostControllerService +Restart-Service MSExchangeTransport -Force +``` +After this, Event 1050 stopped. But Submission queue still sat at 640 with no DELIVER events. Categorizer moved some messages (TRANSFER, REDIRECT, DSN) but nothing to mailbox. + +**14:27** — **The actual fix.** On WS2022, edgetransport's DNS resolver behaves differently from WS2016 — it does not follow the NIC's `ConnectionSpecificSuffixSearchList` or the global `SuffixSearchList = {acg.local}`. Flipped the transport server off adapter-mode DNS and onto explicit DNS servers: +```powershell +Set-TransportServer NEPTUNE -InternalDNSAdapterEnabled $false ` + -InternalDNSServers @('172.16.3.50','172.16.3.52') +Set-TransportServer NEPTUNE -ExternalDNSAdapterEnabled $false ` + -ExternalDNSServers @('8.8.8.8','1.1.1.1','172.16.3.50') +Restart-Service MSExchangeTransport -Force +Retry-Queue -Identity 'NEPTUNE\11392' -Confirm:$false +``` +With explicit DNS servers, Exchange's own resolver applies `acg.local` suffix append when querying the configured servers, and the DC's normal `acg.local` zone A records resolve `n-hosting1.acg.local` → 172.16.3.11 correctly. **This is the config that must survive and that will fix this class of issue going forward** — stored in AD (`msExchInternalDNSServers`), survives reboots, not a runtime flag. + +**14:32** — First post-fix DELIVER event: Mike's earlier probe `NEPTUNE repair probe 142532` landed in `administrator@acghosting.com` at 14:32:01 (6.5 min after SMTP-accept — first message through the stack after the explicit-DNS change). + +**14:38** — Bertie started receiving her day's backlog. By 14:45 she had 114 DELIVER events in the preceding 2 hours covering the whole day's traffic. + +### Current state at 14:45 + +| Item | State | +|---|---| +| External inbound (FETS) | Accepting 250 2.6.0 cleanly | +| Submission queue | 0 (no backlog) | +| `n-hosting1` / `n-largeboxes` queues | 0 / 0 | +| Shadow queue | 9 (transient, normal) | +| Outbound delivery queue (residual) | 159 messages total, mostly Retry to dead external domains (mayo.edu rejecting + backscatter to bot domains) | +| DELIVER rate | ~10/min (32 in last 3 min) | +| DELIVER total since midnight | 532 | +| `MSExchangeDelivery` Event 10003 | 0 since the pre-reboot ACL fix | +| Event 1050 (agent timeout) | 0 since 14:13 | +| Event 1057 (async no-resume) | 0 since DkimSigner disabled in phase 1 | +| AssistantsQuarantine ACL | NETWORK SERVICE ACE persists | +| msExchRoutingMasterDN | NEPTUNE (persists) | +| `Get-MailboxStatistics bertie@amtransit.com` | 212,048 items, 31.78 GB, LastLogon 14:21 — healthy | + +### Sticky risks that will reappear on next reboot + +Everything that *survives* reboot through AD/Exchange config (explicit DNS servers, RoutingMasterDN, disabled agents, IRM lockdown) will stay put. Things that will bite again: + +1. **`MSExchangeADTopology` 45-sec SCM start timeout** — every cold boot on this WS2022 box. Manual remediation sequence: `Start-Service MSExchangeADTopology` → then every other `MSExchangeXxx` service in dependency order. Consider setting the service recovery options (`sc failure MSExchangeADTopology reset= 60 actions= restart/5000/restart/5000/restart/5000`) so Windows retries it automatically. Better: document the manual sequence and treat reboots as planned events. +2. **AssistantsQuarantine ACL** — the registry ACE is applied, but if Windows Update or an Exchange CU ever resets default ACLs on this key, delivery crashes return. Check on any future Exchange or WS CU. +3. **IRM config** — if any Exchange CU reinstaller runs `Set-IRMConfiguration` to defaults, RMS agent timeouts return immediately. Re-apply the full lockdown after any CU. +4. **FastSearch cold-start** — the 4 noderunner workers sometimes come back in a state where Index Routing Agent blocks. If queue stalls and Event 1050 Index Routing fires post-reboot, re-run the HostController + FastSearch restart + noderunner kill sequence. + +### Plan + +- **Mike parallel track:** building fresh WS2016 VM to host a replacement MAIL server. Once up and joined to the DAG, mailbox moves from NEPTUNE can proceed and this box becomes decommission-able. +- **Neptune:** leave as-is on WS2022 for now. Known-good config documented above. Do not apply any Exchange CU without testing end-to-end DELIVER first. Do not touch IRM, transport agents, or TransportServer DNS settings. +- **Long-term:** Exchange Online migration is the only clean forward path — both Exchange 2016 and 2019 are out of extended support as of 2025-10-14. Separate project. + +### What did NOT fix the problem (so future-us doesn't repeat) + +- **GlobalNames Zone (GNZ) on ACG-DC16** — enabled with CNAMEs for `n-hosting1`, `n-largeboxes`, `mail`. Zone created, records present, GNZ enabled server-wide. Didn't resolve on the DC even locally (SERVFAIL). Left enabled, harmless. Not the fix. +- **Hosts file entries on Neptune** — re-confirmed edgetransport bypasses the Windows hosts file for transport-internal DNS. The hosts file works for `.NET` / `[System.Net.Dns]::GetHostEntry()` and for service clients like OWA, but not for the transport pipeline's resolver. +- **`Retry-Queue` with `-Resubmit $true`** — re-queued messages but they hit the same DNS lookup path and failed again. +- **Full MSExchangeTransport restart alone** — temporarily drops queue messagecounts (they move to Submission) but doesn't clear the DNS lookup failure. + +### Diagnostic commands that actually helped + +```powershell +# Confirm the DNS behavior gap (suffix search vs raw label) +[System.Net.Dns]::GetHostEntry('n-hosting1') # Works (OS resolver w/ suffix search) +Resolve-DnsName 'n-hosting1' -Server 172.16.3.50 # SERVFAIL (raw label, no suffix) + +# Find the stuck agent — this was the smoking gun for categorizer stall +Get-WinEvent -FilterHashtable @{LogName='Application'; + ProviderName='MSExchange Extensibility'; Id=1050; + StartTime=(Get-Date).AddMinutes(-3)} | ` + Group-Object { $_.Message -replace "^The execution time of agent '([^']+)'.*$",'$1' } + +# Check if SUBMIT events are firing (categorizer health) +Get-MessageTrackingLog -Server NEPTUNE -Start (Get-Date).AddMinutes(-3) -ResultSize Unlimited | ` + Group-Object EventId | Sort-Object Count -Descending + +# Verify explicit-DNS config is in place (the fix) +Get-TransportServer NEPTUNE | fl InternalDNSAdapterEnabled, InternalDNSServers, + ExternalDNSAdapterEnabled, ExternalDNSServers +``` + +### Cross-machine / cross-user + +- All work done locally on NEPTUNE as `administrator.ACG` — no WinRM hops except `Invoke-Command ACG-DC16` for DNS. +- Will sync this log to Gitea via `/scc` from a workstation after Mike wraps the MAIL-server VM build. + +--- + +## Update: 14:45 – 15:00 — Migration Planning & Config Snapshot + +### Decision pivot — Exchange 2019 on WS2022 instead of Exchange 2016 rebuild + +Mike's first plan after rollback was ruled out was to rebuild the decommissioned `MAIL` server on fresh WS2016, move all mailboxes there, and decommission NEPTUNE. I flagged it as workable but suboptimal. Mike pushed back with the better plan: install Exchange 2019 on a new WS2022 VM into the existing Exchange org, move mailboxes, force-remove the old server carcasses. + +**Agreed direction:** Exchange 2019 CU14 (or CU15) on fresh WS2022 Datacenter VM. Coexists with the existing Exchange 2016 org cleanly, supported OS+Exchange combo (modulo Exchange 2019 itself being out of extended support since 2025-10-14), no new 180-day Eval clock, and sets up a trivial in-place upgrade to Exchange Subscription Edition later. + +**Path rejected:** Exchange SE directly is the forward-looking product but has subscription-licensing friction (per-seat monthly + M365 tenant activation via HCW). 2019 CU14 installs without license activation — install now, upgrade to SE later as a CU-style patch once licensing is sorted. + +### Organization inventory (used for migration planning) + +| Item | Count / Value | +|---|---| +| Mailboxes | 56 | +| Total mailbox data | 270.1 GB | +| Mailbox DBs | 2 — N-Hosting1 (809 GB allocated, hosts 54 boxes), N-LargeBoxes (313 GB, hosts 2 boxes) | +| Send connectors | 12 (per-client SBR rewrites) | +| Receive connectors | 6 (5 standard + 1 WesternTire Relay) | +| Transport rules | 3 (organization-wide) | +| Accepted domains | 19 (client-hosted) | +| Transport agents enabled | 10 | +| Transport agents disabled | 2 (Exchange DkimSigner, messageconcept SBR priority 11) | +| UM-enabled mailboxes | **0** — role dormant despite Dial Plan + Policy being defined. No blocker for Exchange 2019. | +| Mobile device policies | 2 | +| DLP / journal rules | 0 | +| Domain controllers | 1 — ACG-DC16 (WS2016 Std, 172.16.3.52) | +| Forest functional level | Windows2012R2Forest (meets 2019 CU14 minimum) | +| Schema version (rangeUpper) | 17002 (Exchange 2016 CU23) | +| Forest schema master / PDC | ACG-DC16.acg.local (all FSMO roles) | + +**Top-5 mailboxes by size** (plan overnight migration windows for these): + +| Mailbox | Size (MB) | Items | +|---|---|---| +| cansley@devconllc.com | 43,769 | 132,226 | +| Channa@farwestwell.com | 31,936 | 220,938 | +| bertie@amtransit.com | 32,553 | 212,074 | +| marylou@littleheartslittlehands.org | 27,229 | 232,790 | +| scott@justsimplysmart.com | 25,745 | 386,570 | + +### Config snapshot captured + +Full Exchange config dumped to **`C:\NeptuneConfigExport-20260423\`** (on NEPTUNE itself for now; will copy to vault / new server as needed). Every object captured as both `.txt` (human-readable `Format-List`) and `.xml` (`Export-Clixml` for re-import if needed). + +Exports: +- `send-connectors` — 12 items (full SourceTransportServers, AddressSpaces, SmartHosts, authentication) +- `receive-connectors` — 6 items +- `transport-rules` — 3 items +- `transport-config` — org-wide (1 item, including the `InternalSMTPServers = 172.21.3.50` typo still unfixed) +- `transport-server` — NEPTUNE-specific transport config incl. the DNS override we just set +- `transport-agents` — 12 agents with Enabled/Priority state +- `accepted-domains` — 19 domains +- `remote-domains` — 1 (Default) +- `email-address-policies` — 1 +- `irm-config` — fully disabled state +- `malware-filter` — BypassFiltering=True on both NEPTUNE and (stale) MAIL +- `owa-vdir`, `ecp-vdir`, `ews-vdir`, `eas-vdir`, `mapi-vdir`, `oab-vdir`, `autodiscover-vdir` — all vdir configs w/ InternalUrl/ExternalUrl +- `client-access` — ClientAccessService (SCP / AutoDiscoverServiceInternalUri) +- `outlook-anywhere` — Outlook Anywhere config +- `um-dialplans` — reference only (not migrating) +- `mobile-device-policies` — 2 policies +- `mailbox-databases` — N-Hosting1 + N-LargeBoxes full state +- `mailbox-inventory.csv` — 56-row spreadsheet for move planning (all mailboxes with size+count) +- `sbr-configs/` — copies of `Microsoft.Exchange.SBR.InternalDomains.config` (172 B), `Microsoft.Exchange.SBR.OverrideSettings.config` (491 B), `Microsoft.Exchange.SBR.IgnoreAuthAs.config` (empty) + +### Migration runbook written + +**`C:\NeptuneConfigExport-20260423\MIGRATION-RUNBOOK.md`** — full 6-phase runbook: + +- **Phase 1:** VM prereqs (WS2022, 4 vCPU / 16 GB / 1.2 TB data disk), Windows features, prerequisite software list (.NET 4.8.1, VC++ 2012/2013, URL Rewrite, UCMA 4.0), AD admin group requirements, **backup ACG-DC16 system state before schema prep** +- **Phase 2:** Schema prep (`Setup.exe /PrepareSchema` → `/PrepareAD` → `/PrepareDomain`). Expected rangeUpper bump: 17002 → 17003 (CU14) or 17005 (CU15). Forest-permanent. +- **Phase 3:** Install Exchange 2019 + port the 14 items of config from the snapshot (incl. the critical **WS2022 DNS adapter workaround** `Set-TransportService -InternalDNSAdapterEnabled $false`, IRM lockdown, SBR config file copy). New DBs should be named `DB01-Hosting` / `DB02-LargeBoxes` — avoid the `N-Hosting1` / `N-LargeBoxes` names that triggered today's DNS mess. +- **Phase 4:** Pilot (5 small mailboxes) then bulk `New-MoveRequest` in batches. Plan ~10 GB/hr LAN transfer rate → ~27 hr total transfer. +- **Phase 5:** Cutover — repoint MailProtector relay target, public DNS A records, vdir external URLs, AutoDiscover SCP. +- **Phase 6:** Clean-uninstall Exchange from NEPTUNE (`Setup /Mode:Uninstall`), then `Remove-ADObject -Recursive` on the MAIL carcass (that object never had a clean uninstall). Clean up the workaround hosts-file + DC DNS records + disable GlobalNames Zone. + +**Estimated total wall time: ~1 week** with appropriate change windows. + +### Known coexistence risks for the transition + +- Certs: new server needs cert matching `mail.acghosting.com` + autodiscover SAN. Can re-export from NEPTUNE (thumbprint `E58BFCBAEFEFDCAED0BF9E894127A3DE64CE9C69`, expires 2026-07-22) or issue new from Let's Encrypt. +- MAIL AD carcass `msExchCurrentServerRoles=16439`, build `15.1.2507.18` (newer than NEPTUNE!) — this is why FETS kept proxying to it during today's outage. Force-remove in Phase 6. +- Outlook clients cached credentials behave well; Mac Outlook sometimes needs restart after mailbox move. + +### No new credentials discovered this phase + +All work in this phase was local PowerShell on NEPTUNE; no new accounts, secrets, or endpoints discovered beyond what's in the phase-1 credentials section. + +### Files written in this phase + +| Path | Purpose | +|---|---| +| `C:\NeptuneConfigExport-20260423\*.{txt,xml,csv}` | 34 config exports (22 areas × 2 formats + CSV inventory) | +| `C:\NeptuneConfigExport-20260423\sbr-configs\*.config` | 3 SBR agent config files copied from live path | +| `C:\NeptuneConfigExport-20260423\MIGRATION-RUNBOOK.md` | Full 6-phase migration runbook | + +### Next-session entry points + +- Mike is building the WS2022 VM for the Exchange 2019 server (temp name `MAIL2` recommended; rename after carcass cleanup). +- Copy `C:\NeptuneConfigExport-20260423\` off NEPTUNE to a location that will outlive NEPTUNE (vault, or the new MAIL2 box directly). +- Before running `/PrepareSchema`: **take a system-state backup or VM snapshot of ACG-DC16**. Schema changes are forest-permanent and cannot be rolled back via uninstall. +- Mike will ping for real-time walk-through before the schema-prep step. + +## Note for Howard + +Today NEPTUNE (Exchange 2016 Eval) hit a severe outage triggered by the Windows Server 2016→2022 in-place upgrade Mike did on 2026-04-22. Exchange 2016 is unsupported on WS2022; multiple latent issues surfaced (DNS resolver suffix-search, RMS/Index Routing agent timeouts, ADTopology boot-timing, AssistantsQuarantine ACL). **Mail is flowing as of ~14:32** after disabling adapter-mode DNS on the transport server and using explicit DNS servers (`Set-TransportService NEPTUNE -InternalDNSAdapterEnabled $false -InternalDNSServers 172.16.3.50,172.16.3.52`). Rollback to WS2016 is **not available** — all paths (DISM uninstall, Windows.old staging, VSS, Windows Server Backup, VM snapshot) are dead. Mike is going to build Exchange 2019 on a fresh WS2022 VM into the same org, then migrate mailboxes and decommission NEPTUNE + force-remove the old MAIL AD carcass. If you end up running the migration, the full runbook is at `C:\NeptuneConfigExport-20260423\MIGRATION-RUNBOOK.md` on NEPTUNE — copy that whole folder off before NEPTUNE goes away. **Do NOT run `/PrepareSchema` without first backing up ACG-DC16** (single-DC forest, schema changes are forest-permanent). + +