From 742c25c96e08d92433d0d87e110851092ea2048a Mon Sep 17 00:00:00 2001
From: Administrator
Date: Thu, 23 Apr 2026 13:38:52 -0700
Subject: [PATCH] session log: Neptune inbound mail outage + partial recovery (pre-reboot snapshot)

KB5082142 (Windows Server 21H2 CU) + KB5084071 (.NET Framework CU)
triggered cascading Exchange 2016 failures on NEPTUNE today. External
SMTP ingest was restored after 4 fixes (registry ACL on
AssistantsQuarantine, Routing Master DN, disabled messageconcept ExSBR,
hosts entries for dead MAIL server). But internal pipeline (Submission
-> categorizer -> mailbox delivery) remained broken until 3 more fixes
(DNS records on ACG-DC16 for n-hosting1/n-largeboxes/mail, disabled
hung DkimSigner agent, disabled IRM to silence RMS Encryption Agent
timeouts). Submission queue still pinned at ~427 messages pre-reboot;
full Neptune reboot queued to clear edgetransport.exe in-memory DNS
cache and pending KB5082142 reboot actions. All registry/AD/config
backups in C:\BackupBeforeFix\ on Neptune. Post-reboot verification
checklist documented in the log.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../2026-04-23-neptune-inbound-mail-outage.md | 471 ++++++++++++++++++
 1 file changed, 471 insertions(+)
 create mode 100644 clients/internal-infrastructure/session-logs/2026-04-23-neptune-inbound-mail-outage.md

diff --git a/clients/internal-infrastructure/session-logs/2026-04-23-neptune-inbound-mail-outage.md b/clients/internal-infrastructure/session-logs/2026-04-23-neptune-inbound-mail-outage.md
new file mode 100644
index 0000000..9753a37
--- /dev/null
+++ b/clients/internal-infrastructure/session-logs/2026-04-23-neptune-inbound-mail-outage.md
@@ -0,0 +1,471 @@
+# Session Log: 2026-04-23 — Neptune Inbound Mail Outage (KB5082142 Aftermath)
+
+## User
+- **User:** Mike Swanson (mike)
+- **Machine:** NEPTUNE (Exchange server, not Mike's workstation — session executed locally on the server as administrator.ACG)
+- **Role:** admin
+
+## Session Summary
+
+**Incident:** At approximately 12:28 PM local (19:28 UTC) on 2026-04-23, all inbound mail to Neptune began being deferred. MailProtector returned `451 4.7.0 Temporary server error. Please try again later. PRX2` at end-of-DATA for every inbound relay attempt. Google / Amazon SES / other direct senders hit the same error. Outbound mail continued to work.
+
+**Duration:** ~42 minutes of full inbound mail outage (12:28 PM – 1:10 PM local).
+
+**Root cause (layered):**
+
+1. **Trigger — Windows Update KB5082142** (2026-04 Cumulative Update for Windows Server 21H2) + KB5084071 (.NET Framework 3.5/4.8/4.8.1) installed between ~09:35 and ~10:09 local; the update installer then restarted the Exchange transport services at 12:26 PM.
+
+2. **First layer — `MSExchangeDelivery` registry ACL crash:** The service runs as `NT AUTHORITY\NETWORK SERVICE`. On restart it attempted to create a new GUID subkey at `HKLM\SYSTEM\CurrentControlSet\Services\AssistantsQuarantine\7ae551a5-6ed4-4389-9b06-44a926841ead` — but that key's ACL granted full control only to `Administrators`, `SYSTEM`, and `CREATOR OWNER` (other principals had read access at most). NETWORK SERVICE had no right to create subkeys. Result: `UnauthorizedAccessException` (Event 10003) → service terminated unexpectedly (SCM Event 7031) → transport proxy sessions failing (Event 1048: "More than 15% of proxy sessions over the last 15 minutes have failed").
+
+3. **Second layer — Dead MAIL server in Exchange topology:** Once delivery was up again, `MSExchangeFrontEndTransport` began picking a proxy target for each inbound session from the list of Mailbox servers in its AD site. The dead `MAIL` server still existed in AD (same site `Default-First-Site-Name`, same `msExchCurrentServerRoles = 16439`, and a nearly identical build: 15.1.2507.18 versus NEPTUNE's 15.1.2507.17). FETS selected MAIL, tried to resolve `mail.acg.local` via DNS, failed, and fell back to the literal placeholder string `internalproxy` as the proxy destination. DNS of course returned non-existent for `internalproxy` as well. Exchange returned `554 5.4.4 SMTPSEND.DNS.NonExistentDomain; nonexistent domain internalproxy -> DnsDomainDoesNotExist: InfoDomainNonexistent` internally, which FETS then surfaced to the external sender as `451 4.7.0 PRX2`.
+
+4. **Third layer — Tombstoned Routing Master:** The Exchange Routing Group's `msExchRoutingMasterDN` attribute still pointed at the AD *deleted-objects* DN of MAIL: `CN=MAIL\0ADEL:528e09e8-4d78-4054-9e6d-e8ad490cd1d1,CN=Servers,...`. Event 2937 was logging this on every startup: "Property [RoutingMasterDN] ... is pointing to the Deleted Objects container in Active Directory. This property should be fixed as soon as possible."
+
+**Resolution path:**
+
+Four fixes applied in sequence. The **first** stopped the delivery-service crash (no more Event 10003). The **second** cleared the tombstoned routing master warning but did not stop 451 PRX2 on its own — Exchange still had MAIL as a legitimate mailbox-server candidate for site-internal proxying. The **third** (disabling the expired messageconcept ExSBR agent) ruled out the agent as a source of the "internalproxy" string. The **fourth** — adding hosts-file entries to redirect the dead MAIL server name to NEPTUNE's own IP — was the one that actually restored mail flow, because it made FETS's DNS lookup for `mail.acg.local` succeed and proxy back to Neptune.
+
+The clean long-term fix is to remove the MAIL server AD object entirely, at which point the hosts entries can come off.
+
+### Key Decisions
+- Chose hosts-file workaround over AD object deletion for the immediate outage because it was reversible in seconds if anything broke.
+- Formal MAIL server AD decommission deferred to a scheduled follow-up (not hot-path).
+- Left the `InternalSMTPServers = 172.21.3.50` typo alone for now (doesn't appear load-bearing; likely unrelated to this outage).
+
+### Problems Encountered
+1. **First fix worked but wasn't sufficient.** After granting NETWORK SERVICE rights on `AssistantsQuarantine`, 10003 crashes stopped but 451 PRX2 continued at ~30/min. Submission queue went from 0 → 124 → 0 briefly (messages moving again), but individual proxy sessions still failed. Required pivoting to layer-2 investigation.
+2. **RoutingMaster fix was misleading.** The tombstoned reference was clearly wrong (Event 2937 explicitly said so) but swapping it to NEPTUNE did not fix the PRX2. The routing master is used for OUTBOUND routing-topology coordination, not for inbound FETS→BETS proxy target selection. Good to fix anyway, but not the cause.
+3. **Disabling messageconcept ExSBR didn't help either.** Ruled it out. Left disabled — it's an expired-license artifact from 2024 and the Microsoft SBR at priority 12 is what actually does the outbound SBR rewriting.
+4. **Log filename confusion.** Frontend protocol logs use a 2-digit hour suffix in the filename that appears to be UTC-based (`RECV2026042319-1.LOG` covered 19:00-19:59 UTC), not local. Briefly threw off the "is logging still working?" check when the 20:00Z hour-boundary rotation produced a 0-byte file before traffic arrived.
+5. **Hosts file mutated mid-session.** The file previously had `172.16.3.13 mail.acghosting.com` (stale, per my earlier Grep of CLAUDE.md context), but when I backed it up it had already been edited to `172.16.3.11 mail.acghosting.com`. Either someone (Mike?) edited it live during the session, or my earlier read was of a different cached copy. Non-blocking; the new MAIL / mail.acg.local entries I added are what mattered.
+
+---
+
+## Infrastructure Details
+
+### Neptune Exchange Server
+- **Hostname:** neptune.acghosting.com / mail.acghosting.com
+- **Internal FQDN:** NEPTUNE.acg.local
+- **Public IP:** 67.206.163.124 (at Dataforth)
+- **Internal IP:** 172.16.3.11
+- **AD Domain:** acg.local
+- **Exchange:** 2016 Standard Evaluation, Build 15.1.2507.17
+- **Site:** Default-First-Site-Name
+- **Two mailbox DBs:** N-LargeBoxes, N-Hosting1 (both mounted on NEPTUNE)
+
+### Dead MAIL Server (still in AD)
+- **DN:** `CN=MAIL,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=First Organization,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=acg,DC=local`
+- **Exchange build in AD:** 15.1.2507.18 (newer than NEPTUNE — part of why FETS preferred it)
+- **msExchCurrentServerRoles:** 16439 (same as NEPTUNE)
+- **networkAddress:** `ncacn_ip_tcp:mail.acg.local` — does **not** resolve in DNS
+- **State:** No physical server exists; object is the carcass of the old Exchange 2016 Enterprise box that was decommissioned.
+
+### DNS Server
+- **Primary DNS used by Neptune:** 172.16.3.50 (ACG-DC16)
+- **Secondary:** 8.8.8.8
+
+### MailProtector (emailservice.io) — unchanged from March session
+All existing outbound send connectors still good (rieussetcorp, devconllc, littleheartslittlehands, airandspaceacademy, tucsongoldencorral, amtransit, tucsonsafety, farwestwell, patriot, LLA, TGC, Horseshoe, Sorensen).
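+
+The stale `MAIL` object described under "Dead MAIL Server" is the kind of thing a quick topology sweep might surface before it bites. The following is a minimal sketch, not something run during this session: it assumes the Exchange Management Shell snap-in is loaded and that `Resolve-DnsName` is available on the box. The output column names are illustrative only.
+
+```powershell
+# Sketch only: list every Exchange server object in AD and flag any whose
+# FQDN has no A record. Assumes the Exchange snap-in is already loaded.
+Get-ExchangeServer | ForEach-Object {
+    $server = $_
+    # -DnsOnly skips NetBIOS/LLMNR so only real DNS answers count
+    $ips = @(Resolve-DnsName -Name $server.Fqdn -Type A -DnsOnly -ErrorAction SilentlyContinue |
+        Where-Object { $_.Type -eq 'A' } |
+        ForEach-Object { $_.IPAddress })
+    [pscustomobject]@{
+        Name       = $server.Name
+        Site       = $server.Site
+        Version    = $server.AdminDisplayVersion
+        Fqdn       = $server.Fqdn
+        ResolvesTo = if ($ips.Count) { $ips -join ', ' } else { '(unresolved)' }
+    }
+} | Format-Table -AutoSize
+```
+
+Any row showing `(unresolved)` in the same AD site as a live server is a candidate proxy target that FETS cannot reach. Note that `Resolve-DnsName` still honors the hosts file by default, so after the hosts workaround is in place this sweep would report MAIL as resolving.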
+
+---
+
+## Root Cause Timeline
+
+| Time (local) | Event |
+|---|---|
+| ~09:35 | KB5084071 (.NET Framework CU) installed |
+| ~10:09 | KB5082142 (Windows Server 21H2 CU) installed |
+| 12:26:37 | MSExchangeTransport stopped by update installer |
+| 12:26:50 | MSExchangeTransport started |
+| 12:27:03 | MSExchangeDelivery stopped |
+| 12:27:04 | MSExchangeDelivery started |
+| 12:28:36 | **First 10003 crash** — UnauthorizedAccessException on AssistantsQuarantine\7ae551a5-... |
+| 12:28:39 | SCM 7031: Mailbox Transport Delivery terminated unexpectedly (1x) |
+| 12:28:45 | Delivery restarted, but unhealthy |
+| 12:45:51 | Event 1048: >15% proxy sessions failing over last 15 min |
+| 12:49 | Fix #1 applied: NETWORK SERVICE FullControl on AssistantsQuarantine. Delivery crashes stopped (no more Event 10003). 451 PRX2 continues. |
+| 12:58 | Fix #2 applied: msExchRoutingMasterDN → NEPTUNE DN. Event 2937 cleared. 451 PRX2 continues. |
+| 13:04 | Fix #3 applied: messageconcept SenderBasedRouting agent disabled. 451 PRX2 continues. |
+| 13:09 | Fix #4 applied: hosts file entries `172.16.3.11 MAIL` + `172.16.3.11 mail.acg.local`. Test message delivered: `250 2.6.0 ... Queued mail for delivery`. |
+| 13:10 | 38 inbound RECEIVE events in 2 min. Submission queue draining normally. Last PRX2 at 13:09:41; none since. |
+
+---
+
+## Changes Made
+
+### 1. Registry ACL — `HKLM\SYSTEM\CurrentControlSet\Services\AssistantsQuarantine`
+
+**Before (SDDL):**
+```
+O:BAG:SYD:AI(A;CIID;KR;;;BU)(A;CIID;KA;;;BA)(A;CIID;KA;;;SY)(A;CIIOID;KA;;;CO)(A;CIID;KR;;;AC)(A;CIID;KR;;;S-1-15-3-1024-1065365936-1281604716-3511738428-1654721687-432734479-3232135806-4053264122-3456934681)
+```
+
+**After (SDDL):**
+```
+O:BAG:SYD:AI(A;CI;KA;;;NS)(A;CIID;KR;;;BU)(A;CIID;KA;;;BA)(A;CIID;KA;;;SY)(A;CIIOID;KA;;;CO)(A;CIID;KR;;;AC)(A;CIID;KR;;;S-1-15-3-1024-1065365936-1281604716-3511738428-1654721687-432734479-3232135806-4053264122-3456934681)
+```
+
+Added ACE: `(A;CI;KA;;;NS)` — Allow, ContainerInherit, KEY_ALL_ACCESS, NETWORK SERVICE.
+
+**PowerShell applied:**
+```powershell
+$key = 'HKLM:\SYSTEM\CurrentControlSet\Services\AssistantsQuarantine'
+$acl = Get-Acl $key
+$ns = New-Object System.Security.Principal.NTAccount('NT AUTHORITY','NETWORK SERVICE')
+$rule = New-Object System.Security.AccessControl.RegistryAccessRule(
+    $ns,
+    [System.Security.AccessControl.RegistryRights]::FullControl,
+    [System.Security.AccessControl.InheritanceFlags]::ContainerInherit,
+    [System.Security.AccessControl.PropagationFlags]::None,
+    [System.Security.AccessControl.AccessControlType]::Allow
+)
+$acl.AddAccessRule($rule)
+Set-Acl -Path $key -AclObject $acl
+```
+
+**Backups:** `C:\BackupBeforeFix\AssistantsQuarantine-20260423-125228.reg` (registry values) and `C:\BackupBeforeFix\AssistantsQuarantine-SDDL-20260423-125228.txt` (SDDL).
+
+### 2. Exchange Routing Group — `msExchRoutingMasterDN`
+
+**Before:**
+```
+CN=MAIL\0ADEL:528e09e8-4d78-4054-9e6d-e8ad490cd1d1,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),...
+```
+(DN of a deleted/tombstoned AD object.)
+
+**After:**
+```
+CN=NEPTUNE,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=First Organization,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=acg,DC=local
+```
+
+**PowerShell applied:**
+```powershell
+$rgDN = 'LDAP://CN=Exchange Routing Group (DWBGZMFD01QNBJR),CN=Routing Groups,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=First Organization,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=acg,DC=local'
+$newMaster = 'CN=NEPTUNE,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=First Organization,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=acg,DC=local'
+$rg = [ADSI]$rgDN
+$rg.Put('msExchRoutingMasterDN', $newMaster)
+$rg.SetInfo()
+```
+
+**Backup:** `C:\BackupBeforeFix\RoutingMasterDN-20260423-125845.txt`
+
+### 3. Disabled messageconcept ExSBR transport agent
+
+```powershell
+Disable-TransportAgent -Identity 'SenderBasedRouting' -Confirm:$false
+Restart-Service MSExchangeTransport -Force
+Restart-Service MSExchangeFrontEndTransport -Force
+```
+
+**Agent state now:** `SenderBasedRouting` (priority 11, messageconcept) = **Disabled**.
+**Still active:** `Sender Based Routing` (priority 12, `Microsoft.Exchange.SBR.SbrRoutingAgentFactory`) = Enabled — this is the one actually doing outbound SBR rewriting for the client domains.
+
+The messageconcept license expired 2024-04-26 + 60-day grace = 2024-06-25. Harmless to leave disabled; can be fully uninstalled later.
+
+### 4. Hosts file — `C:\Windows\System32\drivers\etc\hosts`
+
+**Before (active lines):**
+```
+172.16.3.11 mail.acghosting.com
+```
+
+**After:**
+```
+172.16.3.11 mail.acghosting.com
+
+# Redirect dead MAIL server to NEPTUNE (2026-04-23 fix for PRX2 internalproxy issue)
+172.16.3.11 MAIL
+172.16.3.11 mail.acg.local
+```
+
+**Backup:** `C:\BackupBeforeFix\hosts-20260423-130927`
+
+DNS cache flushed with `ipconfig /flushdns` after.
+
+### 5. Services Restarted (multiple times during investigation)
+
+Final restart sequence (after hosts fix):
+- MSExchangeTransport
+- MSExchangeFrontEndTransport
+- MSExchangeDelivery
+- MSExchangeSubmission
+
+All returned to `Running`. Submission queue now processing.
+
+---
+
+## Verification After Fix
+
+```powershell
+# SMTP test from localhost. Test message accepted with:
+# >> 250 2.6.0 <0ef6332e-8ac5-4042-a3ed-01d7832efb03@neptune.acg.local> [InternalId=165837277233159, Hostname=NEPTUNE.acg.local] 1636 bytes in 0.108, 14.686 KB/sec Queued mail for delivery
+```
+
+```
+PRX2 per minute since hosts fix (20:09Z):
+  20:09   5   (in-flight sessions with cached bad DNS)
+  20:10   0
+  20:11   0
+  ...
+
+Successful end-of-DATA '250 2.6.0' responses since 20:09Z: 40+
+Submission queue: 0-1 messages (draining normally)
+No new 10003 / 25012 / 1048 / 4999 events since 12:49 PM
+```
+
+Inbound sources confirmed flowing again: Amazon SES (54.240.73.150), MailProtector (52.0.70.91 et al.), Google (209.85.210.43), generic internet senders.
+
+---
+
+## Credentials Used / Discovered
+
+### Neptune Exchange (local admin)
+- **Username:** `ACG\administrator`
+- **Password:** `[REDACTED]` *(supplied by Mike mid-session; not actually needed — I was already running interactively on NEPTUNE as administrator.ACG)*
+- **Access method:** Local PowerShell with `Microsoft.Exchange.Management.PowerShell.SnapIn`. No WinRM needed when executing on the box itself.
+
+### DC / DNS (referenced, not touched)
+- ACG-DC16.acg.local @ 172.16.3.50 — AD DNS server used for `acg.local` resolution.
+
+*(No other credentials used or discovered this session.)*
+
+---
+
+## Known Tombstoned / Stale AD Objects
+
+The MAIL server AD carcass has at minimum these references still in Exchange config:
+
+| Object | DN / Identity | Status |
+|---|---|---|
+| Exchange Server | `CN=MAIL,CN=Servers,...` | Whole object — needs removal |
+| ClientAccessServer | `MAIL` (per `Get-ClientAccessService`) | Attached to the server object — removes with it |
+| FrontendTransportService | `MAIL` (per `Get-FrontendTransportService`) | Same |
+| TransportService | `MAIL` | Same |
+| MailboxTransportService | `MAIL` | Same |
+| Receive connectors (6) | `MAIL\Default MAIL`, `MAIL\Client Proxy MAIL`, `MAIL\Default Frontend MAIL`, `MAIL\Outbound Proxy Frontend MAIL`, `MAIL\Client Frontend MAIL`, `MAIL\WesternTire Relay` | Will be deleted with server |
+
+Nothing of value lives on MAIL — both mailbox databases (N-Hosting1, N-LargeBoxes) are hosted on NEPTUNE.
+
+---
+
+## Pending / Incomplete Tasks
+
+### Critical (scheduled for follow-up)
+1. **Formal MAIL server AD decommission** — Proper removal of the `CN=MAIL,CN=Servers,...` subtree via ADSI Edit or `Remove-ADObject -Recursive`. Once done, the hosts file entries for `MAIL` / `mail.acg.local` can be reverted. This closes the real root cause so we don't need a workaround.
+
+### Lower priority
+2. **Fix `Get-TransportConfig` — `InternalSMTPServers` typo:** Currently set to `{172.21.3.50}`. Network is `172.16.x.x`. Likely should be `172.16.3.50` (the DC) or simply empty. Doesn't appear load-bearing for this outage but worth fixing.
+3. **Remove messageconcept ExSBR fully:** Agent is disabled. Uninstall the DLL at `C:\Program Files\messageconcept\ExSBR\` and the registry key `HKLM\SOFTWARE\SenderBasedRouting`.
+4. **airandspaceacademy.com MX change** (from March session — still pending). Direct delivery to Neptune is now being rejected by the transport rule, but MX still points to `mail.acghosting.com` on GoDaddy.
+5. **littleheartslittlehands.com MX change** (from March session — still pending). Cloudflare DNS still points to cbsolt.net.
+6. **Transport cert expiry** — Thumbprint `E58BFCBAEFEFDCAED0BF9E894127A3DE64CE9C69` expires 2026-07-22 (Event 12017 logs it as "will expire soon"). Separate from the Let's Encrypt frontend cert.
+7. **Windows Update after-care:** Watch for 10003 crashes on next Exchange service restart to confirm the AssistantsQuarantine ACL fix is sticky (the new ACE is explicit, so ACL inheritance shouldn't roll it back, but worth one more reboot verification).
+
+---
+
+## Reference
+
+### Key file paths on Neptune
+- Exchange SMTP Receive logs (frontend):
+  `C:\Program Files\Microsoft\Exchange Server\V15\TransportRoles\Logs\FrontEnd\ProtocolLog\SmtpReceive\RECV-.LOG`
+- Exchange SMTP Receive logs (hub):
+  `C:\Program Files\Microsoft\Exchange Server\V15\TransportRoles\Logs\Hub\ProtocolLog\SmtpReceive\`
+- Message tracking:
+  `C:\Program Files\Microsoft\Exchange Server\V15\TransportRoles\Logs\MessageTracking\`
+- SBR config (Microsoft agent, still active):
+  `C:\Program Files\Microsoft\Exchange Server\V15\TransportRoles\agents\Custom\Microsoft.Exchange.SBR.{InternalDomains,OverrideSettings,IgnoreAuthAs}.config`
+- messageconcept ExSBR (disabled but DLL still present):
+  `C:\Program Files\messageconcept\ExSBR\SenderBasedRouting.dll`
+- All backups from this session:
+  `C:\BackupBeforeFix\`
+
+### Diagnostic one-liners that were useful
+
+**Count PRX2 events in the most recent non-empty frontend log:**
+```powershell
+$f = Get-ChildItem 'C:\Program Files\Microsoft\Exchange Server\V15\TransportRoles\Logs\FrontEnd\ProtocolLog\SmtpReceive' | Sort-Object LastWriteTime -Descending | Where-Object Length -gt 0 | Select -First 1
+Get-Content $f.FullName | Where-Object { $_ -match '451 4\.7\.0.*PRX2' } | Measure-Object | % Count
+```
+
+**Find a specific session's full SMTP trace (reuses `$f` from above):**
+```powershell
+$sid = '08DEA1720522673C'
+Get-Content $f.FullName | Where-Object { $_ -match $sid }
+```
+
+**Check AssistantsQuarantine ACL:**
+```powershell
+Get-Acl 'HKLM:\SYSTEM\CurrentControlSet\Services\AssistantsQuarantine' | Select -Expand Access | ft IdentityReference, RegistryRights, AccessControlType -Auto
+```
+
+**Check routing master DN:**
+```powershell
+$rg = [ADSI]'LDAP://CN=Exchange Routing Group (DWBGZMFD01QNBJR),CN=Routing Groups,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=First Organization,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=acg,DC=local'
+$rg.Properties['msExchRoutingMasterDN'].Value
+```
+
+### Event IDs to watch
+| Event | Source | Meaning |
+|---|---|---|
+| 10003 | MSExchangeTransportDelivery | Transport process exception (e.g., registry ACL) |
+| 25012 | MSExchangeTransportDelivery | Unhandled exception in SMTP session |
+| 1048 | MSExchangeFrontEndTransport | >15% proxy sessions failing over 15-min window |
+| 4999 | MSExchange Common | Watson crash report |
+| 1035 | MSExchangeFrontEndTransport | Inbound auth failure (normal for brute-force attempts, ignore unless from trusted IPs) |
+| 12035 | MSExchangeFrontEndTransport | Unable to load transport cert |
+| 12017 | MSExchangeTransport / FETS | Internal transport cert expiring soon |
+| 2937 | MSExchange ADAccess | AD object property points to Deleted Objects |
+
+### Windows Updates installed today
+- KB5082142 — 2026-04 Cumulative Update for Windows Server 21H2 (~10:09 local)
+- KB5084071 — 2026-04 Cumulative Update for .NET Framework 3.5 / 4.8 / 4.8.1 (~09:35 local)
+- KB5082427 — Update
+- KB5082137 — Security Update
+
+Neither KB5082142 nor KB5084071 release notes are known to explicitly mention the AssistantsQuarantine ACL regression — worth watching for any MS advisory or community reports. If multiple Exchange 2016 admins hit this same issue today, Microsoft will likely publish a workaround.
+
+---
+
+## Cross-Machine Notes
+
+None for this session — all work was local to NEPTUNE. ClaudeTools repo is synced to this server for session-log writing; the commit from this session will sync back via normal `/sync` workflow.
+
+---
+
+## Update: 13:40 — Post-Fix Verification Failed, Pre-Reboot Checkpoint
+
+After the initial 4 fixes, external SMTP senders stopped getting 451 PRX2 and inbound `RECEIVE` events resumed (~111 in a 5-minute window). **But delivery to mailboxes never actually started.** Mike confirmed `bertie@amtransit.com` was not seeing new mail. `Get-MessageTrackingLog -EventId DELIVER` returned 0 events for the last 5 min even with 383+ messages sitting in the Submission queue.
+
+### What was missed in the first pass
+
+The external-facing SMTP flow (MailProtector → FETS) was restored, but the **internal pipeline** (Submission → categorizer → SmtpDeliveryToMailbox → mailbox store) was still broken in two distinct ways. Both were exposed by today's KB5084071 (.NET Framework CU) restart — they had been latent for a long time but started mattering only after transport services reloaded their agent assemblies and topology caches against the updated .NET runtime.
+
+### Fix #5 — DNS records on ACG-DC16 for dead-MAIL short names
+
+The `SmtpDeliveryToMailbox` queue was stuck on `NextHopDomain = n-hosting1` (the mailbox **database** short name, not a server FQDN). Last error:
+```
+451 4.4.0 DNS query failed. The error was: SMTPSEND.DNS.NonExistentDomain;
+nonexistent domain n-hosting1 -> DnsDomainDoesNotExist: InfoDomainNonexistent
+```
+
+Hosts-file workaround for n-hosting1/n-largeboxes didn't help — **Exchange Transport's internal DNS resolver (edgetransport.exe) bypasses the Windows hosts file**. It queries the configured/adapter DNS servers directly. Verified by `[System.Net.Dns]::GetHostEntry('n-hosting1')` returning 172.16.3.11 (hosts file works for .NET/Windows resolver) while Exchange queue LastError kept reporting non-existent domain.
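+
+The two resolution paths described above can be compared side by side. This is a rough sketch rather than what was run in the session: it assumes `Resolve-DnsName` with `-DnsOnly -NoHostsFile` is a reasonable stand-in for what the transport resolver sees, which is an approximation. The names and the DNS server IP come from this log; the output column names are illustrative.
+
+```powershell
+# Sketch only: compare the hosts-aware resolver with a DNS-only lookup.
+$dnsServer = '172.16.3.50'   # ACG-DC16, per "DNS Server" above
+foreach ($name in 'n-hosting1', 'n-largeboxes', 'mail.acg.local') {
+    # Path 1: .NET / Windows resolver, which honors the hosts file.
+    $viaWindows = try {
+        ([System.Net.Dns]::GetHostEntry($name).AddressList |
+            ForEach-Object { $_.ToString() }) -join ', '
+    } catch { '(failed)' }
+
+    # Path 2: ask the DNS server directly and ignore the hosts file.
+    $answers = @(Resolve-DnsName -Name $name -Type A -Server $dnsServer `
+            -DnsOnly -NoHostsFile -ErrorAction SilentlyContinue |
+        Where-Object { $_.Type -eq 'A' })
+    $viaDns = if ($answers.Count) {
+        ($answers | ForEach-Object { $_.IPAddress }) -join ', '
+    } else { '(no A record)' }
+
+    [pscustomobject]@{ Name = $name; WindowsResolver = $viaWindows; DnsOnly = $viaDns }
+}
+```
+
+A name that shows an address under `WindowsResolver` but `(no A record)` under `DnsOnly` is being satisfied by the hosts file alone, which matches the symptom seen here.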
+
+**Fix:** Added AD-integrated DNS A records on `ACG-DC16.acg.local` (172.16.3.50) via `Invoke-Command -ComputerName ACG-DC16` → `Add-DnsServerResourceRecordA`:
+
+| Name | Type | Target | TTL |
+|---|---|---|---|
+| `n-hosting1.acg.local` | A | 172.16.3.11 | 1h |
+| `n-largeboxes.acg.local` | A | 172.16.3.11 | 1h |
+| `mail.acg.local` | A | 172.16.3.11 | 1h |
+| `MAIL` (uppercase) | — | already exists (as `mail`, DNS isn't case-sensitive) | — |
+
+Also added hosts-file entries on Neptune itself (belt-and-suspenders; see below).
+
+After DNS records + `Restart-Service MSExchangeTransport`, the `n-hosting1` retry queue (164 → 235 msgs at peak) drained to 0. But messages just moved back into **Submission** instead of completing delivery.
+
+### Fix #6 — Disabled `Exchange DkimSigner` transport agent (hung async)
+
+Submission queue pinned at exactly 383 messages, no movement for 30+ seconds. Found this in the Application log:
+```
+MSExchange Extensibility 1057: Agent 'Exchange DkimSigner' went async but did
+not call Resume on the new thread, while handling event 'OnCategorizedMessage'
+```
+
+The third-party Exchange DkimSigner (priority 10, assembly `C:\Program Files\Exchange DkimSigner\ExchangeDkimSigner.dll`, `IsCritical=true`) went async on `OnCategorizedMessage` and never resumed the categorizer thread. Classic transport agent async-bug pattern — the .NET Framework CU likely changed some async/threading behavior the DkimSigner depends on.
+
+**Fix:** `Disable-TransportAgent -Identity 'Exchange DkimSigner' -Confirm:$false` then `Restart-Service MSExchangeTransport, MSExchangeSubmission`. Outbound mail temporarily loses DKIM signatures — receivers with strict DMARC `p=reject` (devconllc.com is the only one we run that tight) may now get fail results on reply traffic until we re-enable.
+
+### Fix #7 — IRM / RMS Encryption Agent timeout lockdown
+
+After DkimSigner was disabled, the queue was still stuck. Found:
+```
+MSExchange Extensibility 1050: The execution time of agent 'RMS Encryption Agent'
+exceeded 162154 milliseconds while handling event 'OnRoutedMessage' for message
+with InternetMessageId: 'Not Available'.
+```
+
+RMS Encryption Agent is a hidden built-in Exchange agent (not in `Get-TransportAgent` output, not in `agents.config`, can't be disabled via `Disable-TransportAgent` — fails with "Transport agent 'RMS Encryption Agent' isn't found"). It was trying to reach a non-existent RMS licensing server and timing out after ~160s per message. `IRMConfiguration` already showed `InternalLicensingEnabled = False`, but the agent was still firing.
+
+**Fix:**
+```powershell
+Set-IRMConfiguration -TransportDecryptionSetting Disabled -Confirm:$false
+Set-IRMConfiguration -JournalReportDecryptionEnabled $false -Confirm:$false
+```
+
+**Result:** Event 1050 RMS warnings in last 1 min: **0**. Agent no longer firing on messages.
+
+### Current State Before Reboot
+
+| Item | State |
+|---|---|
+| Queues (Submission) | ~427 messages still accumulated, no new DELIVER events yet observed |
+| External inbound (FETS) | Accepting 250 2.6.0 Queued mail responses cleanly |
+| `MSExchangeDelivery` registry crash | Fixed (no 10003 since 12:49) |
+| Routing master DN | NEPTUNE (fixed) |
+| DkimSigner agent | Disabled |
+| messageconcept ExSBR agent | Disabled |
+| RMS Encryption Agent | Silenced via IRM config (can't be fully disabled) |
+| DNS records (on DC) | `n-hosting1`, `n-largeboxes`, `mail.acg.local` all resolve to 172.16.3.11 |
+| Hosts file (on NEPTUNE) | Has MAIL, mail.acg.local, n-hosting1, n-largeboxes → 172.16.3.11 |
+| AssistantsQuarantine ACL | NETWORK SERVICE has FullControl (inheritable) |
+
+Mike and I agreed a **full Neptune reboot** is the right next step. Rationale:
+- KB5082142 (cumulative update) almost certainly has a pending-reboot action we're working around
+- `edgetransport.exe` has been restarted many times but its in-memory DNS cache may still hold stale negative entries ("n-hosting1 doesn't exist")
+- RMS Encryption Agent may be holding a timed-out socket to an old RMS endpoint that only a fresh process will release
+
+### Post-Reboot Verification Checklist (for next pass)
+
+1. `Get-Acl 'HKLM:\SYSTEM\CurrentControlSet\Services\AssistantsQuarantine'` — verify `NETWORK SERVICE` ACE survived.
+2. `$rg = [ADSI]'LDAP://CN=Exchange Routing Group (DWBGZMFD01QNBJR),...'; $rg.Properties['msExchRoutingMasterDN'].Value` — verify NEPTUNE DN.
+3. `Get-TransportAgent | ? Enabled -eq $false` — verify DkimSigner + SenderBasedRouting still disabled.
+4. `[System.Net.Dns]::GetHostEntry('n-hosting1')` and `GetHostEntry('mail.acg.local')` — verify DNS resolution.
+5. Watch Event Viewer for:
+   - Event 10003 / 25012 / 4999 (should not recur)
+   - Event 1057 "went async but did not call Resume" (should not recur — DkimSigner disabled)
+   - Event 1050 RMS Encryption Agent timeouts (should not recur — IRM disabled)
+6. `Get-Queue | ? MessageCount -gt 0` — should drain rapidly from the ~400 messages sitting in Submission right now.
+7. **End-to-end smoke test:** Send an SMTP test message (`postmaster@azcomputerguru.com → rklem@lifelonglearningacademy.com`), confirm `250 2.6.0 Queued mail`, then `Get-MessageTrackingLog -Recipients rklem@... -EventId DELIVER` within 60 seconds.
+8. Have Mike check `bertie@amtransit.com` in OWA/Outlook — confirm actual mailbox delivery of any message in the backlog.
+
+### If reboot doesn't fully resolve delivery
+
+Prime suspect if still stuck: the Exchange database itself. The two databases `N-Hosting1` (809 GB) and `N-LargeBoxes` (313 GB) mount on boot but if the KB5082142 update left the ESE store in a bad state, `MSExchangeIS` may be crashing after N seconds of delivery attempts. Check:
+- `Get-MailboxDatabase -Status | Format-List Name, Mounted, MountedOnServer, Recovery`
+- `Get-EventLog -LogName Application -Source MSExchangeIS -After (Get-Date).AddMinutes(-10) | ? EntryType -eq 'Error'`
+- Event IDs to watch: 9542 (DB mount failure), 474 (page checksum mismatch), 103 (store corruption)
+
+---
+
+## Credentials (addendum from this phase)
+
+### ACG-DC16 PowerShell Remoting
+- No explicit creds — worked via Kerberos from an already-logged-in domain-admin session on NEPTUNE (`administrator@acg.local`).
+- Used `Invoke-Command -ComputerName ACG-DC16` (NOT the IP — Kerberos needs a SPN-matching hostname).
+
+---
+
+## Additional File Changes (this phase)
+
+| File | Change |
+|---|---|
+| `C:\Windows\System32\drivers\etc\hosts` (Neptune) | Added `172.16.3.11 n-hosting1` and `172.16.3.11 n-largeboxes` in addition to earlier MAIL/mail.acg.local entries |
+| ACG-DC16 `acg.local` DNS zone | New A records: `n-hosting1`, `n-largeboxes`, `mail` → 172.16.3.11 |
+| Exchange transport agents | `Exchange DkimSigner` (priority 10) → Disabled |
+| Exchange IRM config | `TransportDecryptionSetting = Disabled`, `JournalReportDecryptionEnabled = False` |
+
+All prior backups from phase 1 remain in `C:\BackupBeforeFix\`. No new registry / AD writes needed a backup in this phase (DNS records are additive; IRM config change is reversible via `Set-IRMConfiguration`).
+
+---
+
+## Key Lesson
+
+**KB5084071 (.NET Framework 3.5/4.8/4.8.1 CU) + KB5082142 (Windows Server 21H2 CU)** hit this box hard. Three separate latent Exchange issues surfaced simultaneously when the transport services reloaded against the updated runtime:
+
+1. NETWORK SERVICE missing rights on `AssistantsQuarantine` (likely pre-existing since ~2021 when Neptune was installed; only mattered once a post-update service restart tried to mint a new GUID subkey)
+2. Dead MAIL server in topology being picked for proxying (pre-existing since 2026-03 cleanup; only mattered once FETS re-walked the candidate list post-restart)
+3. Third-party DkimSigner async-completion behavior broken by the .NET update + RMS Encryption Agent timeout firing on every message (both latent problems that older service state tolerated)
+
+**For future Windows Update rollouts on this server:** verify Exchange mailflow end-to-end (including `Get-MessageTrackingLog -EventId DELIVER`) within 10 minutes of the post-update reboot. The initial FETS-accepts-SMTP test is not sufficient — it only validates external ingest, not mailbox delivery.
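+
+The end-to-end check recommended above could be scripted roughly as follows. This is a sketch under stated assumptions, not a procedure used in this session: it assumes the Exchange Management Shell snap-in is loaded on NEPTUNE. The 10-minute window comes from the lesson above; the variable names, output columns, and warning text are illustrative.
+
+```powershell
+# Sketch only: compare accepted (RECEIVE) vs delivered (DELIVER) events
+# after a post-update reboot, plus the total number of queued messages.
+$since     = (Get-Date).AddMinutes(-10)
+$received  = @(Get-MessageTrackingLog -EventId RECEIVE -Start $since -ResultSize Unlimited)
+$delivered = @(Get-MessageTrackingLog -EventId DELIVER -Start $since -ResultSize Unlimited)
+$queued    = (Get-Queue | Measure-Object -Property MessageCount -Sum).Sum
+
+[pscustomobject]@{
+    WindowStart    = $since
+    ReceiveEvents  = $received.Count
+    DeliverEvents  = $delivered.Count
+    QueuedMessages = $queued
+}
+
+# Ingest without delivery is the exact failure that the first pass missed.
+if ($received.Count -gt 0 -and $delivered.Count -eq 0) {
+    Write-Warning 'Mail is being accepted but nothing has been delivered to mailboxes.'
+}
+```
+
+A non-zero `ReceiveEvents` with zero `DeliverEvents` reproduces the state seen at 13:40, where external ingest looked healthy while the Submission queue kept growing.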