session log: Neptune inbound mail outage + partial recovery (pre-reboot snapshot)
KB5082142 (Windows Server 21H2 CU) + KB5084071 (.NET Framework CU) triggered cascading Exchange 2016 failures on NEPTUNE today.

External SMTP ingest was restored after 4 fixes (registry ACL on AssistantsQuarantine, Routing Master DN, disabled messageconcept ExSBR, hosts entries for dead MAIL server). But the internal pipeline (Submission -> categorizer -> mailbox delivery) remained broken and needed 3 more fixes (DNS records on ACG-DC16 for n-hosting1/n-largeboxes/mail, disabled hung DkimSigner agent, disabled IRM to silence RMS Encryption Agent timeouts).

Submission queue still pinned at ~427 messages pre-reboot; full Neptune reboot queued to clear edgetransport.exe in-memory DNS cache and pending KB5082142 reboot actions. All registry/AD/config backups in C:\BackupBeforeFix\ on Neptune. Post-reboot verification checklist documented in the log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,471 @@
# Session Log: 2026-04-23 — Neptune Inbound Mail Outage (KB5082142 Aftermath)

## User
- **User:** Mike Swanson (mike)
- **Machine:** NEPTUNE (Exchange server, not Mike's workstation — session executed locally on the server as administrator.ACG)
- **Role:** admin

## Session Summary

**Incident:** At approximately 12:28 PM local (19:28 UTC) on 2026-04-23, all inbound mail to Neptune began being deferred. Neptune answered every inbound MailProtector relay attempt with `451 4.7.0 Temporary server error. Please try again later. PRX2` at end-of-DATA. Google / Amazon SES / other direct senders hit the same error. Outbound mail continued to work.

**Duration:** ~42 minutes of full inbound mail outage (12:28 PM – 1:10 PM local).

**Root cause (layered):**

1. **Trigger — Windows Update KB5082142** (2026-04 Cumulative Update for Windows Server 21H2) + KB5084071 (.NET Framework 3.5/4.8/4.8.1), installed at ~10:09 and ~09:35 local respectively. The update installer restarted the Exchange transport services at 12:26 PM.

2. **First layer — `MSExchangeDelivery` registry ACL crash:** The service runs as `NT AUTHORITY\NETWORK SERVICE`. On restart it attempted to create a new GUID subkey at `HKLM\SYSTEM\CurrentControlSet\Services\AssistantsQuarantine\7ae551a5-6ed4-4389-9b06-44a926841ead` — but that key's ACL granted full control only to `Administrators`, `SYSTEM`, and `CREATOR OWNER` (`Users` and app packages had read access). NETWORK SERVICE had no ACE of its own and no right to create subkeys. Result: `UnauthorizedAccessException` (Event 10003) → service terminated unexpectedly (SCM Event 7031) → transport proxy sessions failing (Event 1048: "More than 15% of proxy sessions over the last 15 minutes have failed").

3. **Second layer — Dead MAIL server in Exchange topology:** Once delivery was up again, `MSExchangeFrontEndTransport` began picking a proxy target for each inbound session from the list of Mailbox servers in its AD site. The dead `MAIL` server still existed in AD (same site `Default-First-Site-Name`, same `msExchCurrentServerRoles = 16439`, and a near-identical build: 15.1.2507.18 against NEPTUNE's 15.1.2507.17). FETS selected MAIL, tried to resolve `mail.acg.local` via DNS, failed, and fell back to the literal placeholder string `internalproxy` as the proxy destination. DNS of course returned non-existent for `internalproxy` as well. Exchange returned `554 5.4.4 SMTPSEND.DNS.NonExistentDomain; nonexistent domain internalproxy -> DnsDomainDoesNotExist: InfoDomainNonexistent` internally, which FETS then surfaced to the external sender as `451 4.7.0 PRX2`.

4. **Third layer — Tombstoned Routing Master:** The Exchange Routing Group's `msExchRoutingMasterDN` attribute still pointed at the AD *deleted-objects* DN of MAIL: `CN=MAIL\0ADEL:528e09e8-4d78-4054-9e6d-e8ad490cd1d1,CN=Servers,...`. Event 2937 was logging this on every startup: "Property [RoutingMasterDN] ... is pointing to the Deleted Objects container in Active Directory. This property should be fixed as soon as possible."

**Resolution path:**

Four fixes applied in sequence. The **first** stopped the delivery-service crash (no more Event 10003). The **second** cleared the tombstoned routing master warning but did not stop 451 PRX2 on its own — Exchange still had MAIL as a legitimate mailbox-server candidate for site-internal proxying. The **third** (disabling expired messageconcept ExSBR) ruled out the agent as a source of the "internalproxy" string. The **fourth** — adding hosts-file entries to redirect the dead MAIL server name to NEPTUNE's own IP — was the one that actually restored mail flow, because it made FETS's DNS lookup for `mail.acg.local` succeed and proxy back to Neptune.

The clean long-term fix is to remove the MAIL server AD object entirely, at which point the hosts entries can come off.

### Key Decisions
- Chose hosts-file workaround over AD object deletion for the immediate outage because it was reversible in seconds if anything broke.
- Formal MAIL server AD decommission deferred to a scheduled follow-up (not hot-path).
- Left the `InternalSMTPServers = 172.21.3.50` typo alone for now (doesn't appear load-bearing; likely unrelated to this outage).

### Problems Encountered
1. **First fix worked but wasn't sufficient.** After granting NETWORK SERVICE rights on `AssistantsQuarantine`, 10003 crashes stopped but 451 PRX2 continued at ~30/min. Submission queue went from 0 → 124 → 0 briefly (messages moving again), but individual proxy sessions still failed. Required pivoting to layer-2 investigation.
2. **RoutingMaster fix was misleading.** The tombstoned reference was clearly wrong (Event 2937 explicitly said so) but swapping it to NEPTUNE did not fix the PRX2. The routing master is used for OUTBOUND routing-topology coordination, not for inbound FETS→BETS proxy target selection. Good to fix anyway, but not the cause.
3. **Disabling messageconcept ExSBR didn't help either.** Ruled it out. Left disabled — it's an expired-license artifact from 2024 and the Microsoft SBR at priority 12 is what actually does the outbound SBR rewriting.
4. **Log filename confusion.** Frontend protocol logs use a 2-digit hour suffix in the filename that appears to be UTC-based (`RECV2026042319-1.LOG` covered 19:00-19:59 UTC), not local. Briefly threw off the "is logging still working?" check when the 20:00Z hour-boundary rotation produced a 0-byte file before traffic arrived.
5. **Hosts file mutated mid-session.** The file previously had `172.16.3.13 mail.acghosting.com` (stale, per my earlier Grep of CLAUDE.md context), but when I backed it up it had already been edited to `172.16.3.11 mail.acghosting.com`. Someone (Mike?) edited it live during the session or my earlier read was of a different cached copy. Non-blocking; the new MAIL / mail.acg.local entries I added are what mattered.

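
The log-filename confusion in item 4 can be sidestepped by deriving the expected filename prefix from UTC rather than from the local clock. This is a minimal sketch, assuming the `RECV<YYYYMMDDHH>` stamp really is UTC-based as observed; `Get-ReceiveLogPrefix` is a hypothetical helper name, not something used in the session.

```powershell
# Hypothetical helper: filename prefix a frontend SmtpReceive log would carry
# for a given instant, assuming the hour stamp in the name is UTC.
function Get-ReceiveLogPrefix {
    param([Parameter(Mandatory)][datetime]$When)
    'RECV{0:yyyyMMddHH}' -f $When.ToUniversalTime()
}

# 12:28 PM local on the incident day was 19:28 UTC.
$incident = [datetime]::new(2026, 4, 23, 19, 28, 0, [System.DateTimeKind]::Utc)
Get-ReceiveLogPrefix -When $incident

# List the files for the current UTC hour, including any 0-byte file
# created at the hour boundary before traffic arrives.
$logDir = 'C:\Program Files\Microsoft\Exchange Server\V15\TransportRoles\Logs\FrontEnd\ProtocolLog\SmtpReceive'
Get-ChildItem -Path $logDir -Filter "$(Get-ReceiveLogPrefix -When (Get-Date))*.LOG" -ErrorAction SilentlyContinue |
    Select-Object Name, Length, LastWriteTime
```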
---

## Infrastructure Details

### Neptune Exchange Server
- **Hostname:** neptune.acghosting.com / mail.acghosting.com
- **Internal FQDN:** NEPTUNE.acg.local
- **Public IP:** 67.206.163.124 (at Dataforth)
- **Internal IP:** 172.16.3.11
- **AD Domain:** acg.local
- **Exchange:** 2016 Standard Evaluation, Build 15.1.2507.17
- **Site:** Default-First-Site-Name
- **Two mailbox DBs:** N-LargeBoxes, N-Hosting1 (both mounted on NEPTUNE)

### Dead MAIL Server (still in AD)
- **DN:** `CN=MAIL,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=First Organization,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=acg,DC=local`
- **Exchange build in AD:** 15.1.2507.18 (newer than NEPTUNE — part of why FETS preferred it)
- **msExchCurrentServerRoles:** 16439 (same as NEPTUNE)
- **networkAddress:** `ncacn_ip_tcp:mail.acg.local` — does **not** resolve in DNS
- **State:** No physical server exists; object is the carcass of the old Exchange 2016 Enterprise box that was decommissioned.

### DNS Server
- **Primary DNS used by Neptune:** 172.16.3.50 (ACG-DC16)
- **Secondary:** 8.8.8.8

### MailProtector (emailservice.io) — unchanged from March session
All existing outbound send connectors still good (rieussetcorp, devconllc, littleheartslittlehands, airandspaceacademy, tucsongoldencorral, amtransit, tucsonsafety, farwestwell, patriot, LLA, TGC, Horseshoe, Sorensen).

---

## Root Cause Timeline

| Time (local) | Event |
|---|---|
| ~09:35 | KB5084071 (.NET Framework CU) installed |
| ~10:09 | KB5082142 (Windows Server 21H2 CU) installed |
| 12:26:37 | MSExchangeTransport stopped by update installer |
| 12:26:50 | MSExchangeTransport started |
| 12:27:03 | MSExchangeDelivery stopped |
| 12:27:04 | MSExchangeDelivery started |
| 12:28:36 | **First 10003 crash** — UnauthorizedAccessException on AssistantsQuarantine\7ae551a5-... |
| 12:28:39 | SCM 7031: Mailbox Transport Delivery terminated unexpectedly (1x) |
| 12:28:45 | Delivery restarted, but unhealthy |
| 12:45:51 | Event 1048: >15% proxy sessions failing over last 15 min |
| 12:49 | Fix #1 applied: NETWORK SERVICE FullControl on AssistantsQuarantine. Delivery crashes stopped (no more Event 10003). 451 PRX2 continues. |
| 12:58 | Fix #2 applied: msExchRoutingMasterDN → NEPTUNE DN. Event 2937 cleared. 451 PRX2 continues. |
| 13:04 | Fix #3 applied: messageconcept SenderBasedRouting agent disabled. 451 PRX2 continues. |
| 13:09 | Fix #4 applied: hosts file entries `172.16.3.11 MAIL` + `172.16.3.11 mail.acg.local`. Test message delivered: `250 2.6.0 ... Queued mail for delivery`. |
| 13:10 | 38 inbound RECEIVE events in 2 min. Submission queue draining normally. Last PRX2 at 13:09:41; none since. |

---

## Changes Made

### 1. Registry ACL — `HKLM\SYSTEM\CurrentControlSet\Services\AssistantsQuarantine`

**Before (SDDL):**
```
O:BAG:SYD:AI(A;CIID;KR;;;BU)(A;CIID;KA;;;BA)(A;CIID;KA;;;SY)(A;CIIOID;KA;;;CO)(A;CIID;KR;;;AC)(A;CIID;KR;;;S-1-15-3-1024-1065365936-1281604716-3511738428-1654721687-432734479-3232135806-4053264122-3456934681)
```

**After (SDDL):**
```
O:BAG:SYD:AI(A;CI;KA;;;NS)(A;CIID;KR;;;BU)(A;CIID;KA;;;BA)(A;CIID;KA;;;SY)(A;CIIOID;KA;;;CO)(A;CIID;KR;;;AC)(A;CIID;KR;;;S-1-15-3-1024-1065365936-1281604716-3511738428-1654721687-432734479-3232135806-4053264122-3456934681)
```

Added ACE: `(A;CI;KA;;;NS)` — Allow, ContainerInherit, KEY_ALL_ACCESS, NETWORK SERVICE.

**PowerShell applied:**
```powershell
$key = 'HKLM:\SYSTEM\CurrentControlSet\Services\AssistantsQuarantine'
$acl = Get-Acl $key
$ns = New-Object System.Security.Principal.NTAccount('NT AUTHORITY','NETWORK SERVICE')
# Explicit allow ACE: FullControl on this key, inherited by subkeys (ContainerInherit).
$rule = New-Object System.Security.AccessControl.RegistryAccessRule(
    $ns,
    [System.Security.AccessControl.RegistryRights]::FullControl,
    [System.Security.AccessControl.InheritanceFlags]::ContainerInherit,
    [System.Security.AccessControl.PropagationFlags]::None,
    [System.Security.AccessControl.AccessControlType]::Allow
)
$acl.AddAccessRule($rule)
# Write the modified ACL back to the key.
Set-Acl -Path $key -AclObject $acl
```

**Backups:** `C:\BackupBeforeFix\AssistantsQuarantine-20260423-125228.reg` (registry values) and `C:\BackupBeforeFix\AssistantsQuarantine-SDDL-20260423-125228.txt` (SDDL).

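
The before/after SDDL strings can also be checked mechanically instead of by eye. The following is a minimal sketch, assuming a Windows host; `Test-NetworkServiceFullControl` is a hypothetical helper name, and `S-1-5-20` is the well-known SID that the `NS` alias stands for.

```powershell
# Sketch only: report whether an SDDL string contains an allow ACE that gives
# NETWORK SERVICE (S-1-5-20) full key access. Not a command run in the session.
function Test-NetworkServiceFullControl {
    param([Parameter(Mandatory)][string]$Sddl)

    $keyAllAccess = 0xF003F   # KEY_ALL_ACCESS, the mask behind the 'KA' right
    $sd = New-Object -TypeName System.Security.AccessControl.RawSecurityDescriptor -ArgumentList $Sddl

    foreach ($ace in $sd.DiscretionaryAcl) {
        if ($ace.SecurityIdentifier.Value -eq 'S-1-5-20' -and
            $ace.AceType -eq [System.Security.AccessControl.AceType]::AccessAllowed -and
            ($ace.AccessMask -band $keyAllAccess) -eq $keyAllAccess) {
            return $true
        }
    }
    return $false
}

# Shortened versions of the strings above, kept to the ACEs that matter here.
$before = 'O:BAG:SYD:AI(A;CIID;KA;;;BA)(A;CIID;KA;;;SY)'
$after  = 'O:BAG:SYD:AI(A;CI;KA;;;NS)(A;CIID;KA;;;BA)(A;CIID;KA;;;SY)'
Test-NetworkServiceFullControl -Sddl $before
Test-NetworkServiceFullControl -Sddl $after
```

Against the live key, the same check could be fed from `(Get-Acl 'HKLM:\SYSTEM\CurrentControlSet\Services\AssistantsQuarantine').Sddl`.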
### 2. Exchange Routing Group — `msExchRoutingMasterDN`

**Before:**
```
CN=MAIL\0ADEL:528e09e8-4d78-4054-9e6d-e8ad490cd1d1,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),...
```
(DN of a deleted/tombstoned AD object.)

**After:**
```
CN=NEPTUNE,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=First Organization,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=acg,DC=local
```

**PowerShell applied:**
```powershell
# Routing group object that carries the msExchRoutingMasterDN attribute.
$rgDN = 'LDAP://CN=Exchange Routing Group (DWBGZMFD01QNBJR),CN=Routing Groups,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=First Organization,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=acg,DC=local'
# DN of the live NEPTUNE server object that replaces the tombstoned MAIL reference.
$newMaster = 'CN=NEPTUNE,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=First Organization,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=acg,DC=local'
$rg = [ADSI]$rgDN
$rg.Put('msExchRoutingMasterDN', $newMaster)
# Commit the change to AD.
$rg.SetInfo()
```

**Backup:** `C:\BackupBeforeFix\RoutingMasterDN-20260423-125845.txt`

### 3. Disabled messageconcept ExSBR transport agent

```powershell
Disable-TransportAgent -Identity 'SenderBasedRouting' -Confirm:$false
Restart-Service MSExchangeTransport -Force
Restart-Service MSExchangeFrontEndTransport -Force
```

**Agent state now:** `SenderBasedRouting` (priority 11, messageconcept) = **Disabled**.
**Still active:** `Sender Based Routing` (priority 12, `Microsoft.Exchange.SBR.SbrRoutingAgentFactory`) = Enabled — this is the one actually doing outbound SBR rewriting for the client domains.

The messageconcept license expired 2024-04-26 + 60-day grace = 2024-06-25. Harmless to leave disabled; can be fully uninstalled later.

### 4. Hosts file — `C:\Windows\System32\drivers\etc\hosts`

**Before (active lines):**
```
172.16.3.11 mail.acghosting.com
```

**After:**
```
172.16.3.11 mail.acghosting.com

# Redirect dead MAIL server to NEPTUNE (2026-04-23 fix for PRX2 internalproxy issue)
172.16.3.11 MAIL
172.16.3.11 mail.acg.local
```

**Backup:** `C:\BackupBeforeFix\hosts-20260423-130927`

DNS cache flushed with `ipconfig /flushdns` after.

### 5. Services Restarted (multiple times during investigation)

Final restart sequence (after hosts fix):
- MSExchangeTransport
- MSExchangeFrontEndTransport
- MSExchangeDelivery
- MSExchangeSubmission

All returned to `Running`. Submission queue now processing.

---

## Verification After Fix

SMTP test from localhost. The test message was accepted with:
```
250 2.6.0 <0ef6332e-8ac5-4042-a3ed-01d7832efb03@neptune.acg.local> [InternalId=165837277233159, Hostname=NEPTUNE.acg.local] 1636 bytes in 0.108, 14.686 KB/sec Queued mail for delivery
```

```
PRX2 per minute since hosts fix (20:09Z):
20:09 5 (in-flight sessions with cached bad DNS)
20:10 0
20:11 0
...

Successful end-of-DATA '250 2.6.0' responses since 20:09Z: 40+
Submission queue: 0-1 messages (draining normally)
No new 10003 / 25012 / 1048 / 4999 events since 12:49 PM
```

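
The per-minute PRX2 tally above can be reproduced from the frontend protocol log. This is a minimal sketch under two assumptions that were not verified in the session: that each data line begins with an ISO-8601 UTC timestamp (for example `2026-04-23T20:09:41.123Z`) and that header lines begin with `#`. `Get-Prx2PerMinute` is a hypothetical helper name.

```powershell
# Sketch only: bucket '451 4.7.0 ... PRX2' responses by UTC minute.
function Get-Prx2PerMinute {
    param([Parameter(Mandatory)][AllowEmptyString()][string[]]$Line)

    # Keep the yyyy-MM-ddTHH:mm prefix of every line that carries a PRX2 rejection.
    # Header lines start with '#', so the timestamp anchor skips them.
    $minutes = foreach ($l in $Line) {
        if ($l -match '^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}).*451 4\.7\.0.*PRX2') { $Matches[1] }
    }

    $minutes | Group-Object | Sort-Object Name |
        Select-Object @{ n = 'MinuteUtc'; e = { $_.Name } }, Count
}

# Usage against the newest non-empty log ($f as in the diagnostic one-liners):
#   Get-Prx2PerMinute -Line (Get-Content $f.FullName)
```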
Inbound sources confirmed flowing again: Amazon SES (54.240.73.150), MailProtector (52.0.70.91 et al.), Google (209.85.210.43), generic internet senders.

---

## Credentials Used / Discovered

### Neptune Exchange (local admin)
- **Username:** `ACG\administrator`
- **Password:** `Gptf*77ttb##` *(supplied by Mike mid-session; not actually needed — I was already running interactively on NEPTUNE as administrator.ACG)*
- **Access method:** Local PowerShell with `Microsoft.Exchange.Management.PowerShell.SnapIn`. No WinRM needed when executing on the box itself.

### DC / DNS (referenced, not touched)
- ACG-DC16.acg.local @ 172.16.3.50 — AD DNS server used for `acg.local` resolution.

*(No other credentials used or discovered this session.)*

---

## Known Tombstoned / Stale AD Objects

The MAIL server AD carcass has at minimum these references still in Exchange config:

| Object | DN / Identity | Status |
|---|---|---|
| Exchange Server | `CN=MAIL,CN=Servers,...` | Whole object — needs removal |
| ClientAccessServer | `MAIL` (per `Get-ClientAccessService`) | Attached to the server object — removes with it |
| FrontendTransportService | `MAIL` (per `Get-FrontendTransportService`) | Same |
| TransportService | `MAIL` | Same |
| MailboxTransportService | `MAIL` | Same |
| Receive connectors (6) | `MAIL\Default MAIL`, `MAIL\Client Proxy MAIL`, `MAIL\Default Frontend MAIL`, `MAIL\Outbound Proxy Frontend MAIL`, `MAIL\Client Frontend MAIL`, `MAIL\WesternTire Relay` | Will be deleted with server |

Nothing of value lives on MAIL — both mailbox databases (N-Hosting1, N-LargeBoxes) are hosted on NEPTUNE.

---

## Pending / Incomplete Tasks

### Critical (scheduled for follow-up)
1. **Formal MAIL server AD decommission** — Proper removal of the `CN=MAIL,CN=Servers,...` subtree via ADSI Edit or `Remove-ADObject -Recursive`. Once done, the hosts file entries for `MAIL` / `mail.acg.local` can be reverted. This closes the real root cause so we don't need a workaround.

### Lower priority
2. **Fix `Get-TransportConfig` — `InternalSMTPServers` typo:** Currently set to `{172.21.3.50}`. Network is `172.16.x.x`. Likely should be `172.16.3.50` (the DC) or simply empty. Doesn't appear load-bearing for this outage but worth fixing.
3. **Remove messageconcept ExSBR fully:** Agent is disabled. Uninstall the DLL at `C:\Program Files\messageconcept\ExSBR\` and the registry key `HKLM\SOFTWARE\SenderBasedRouting`.
4. **airandspaceacademy.com MX change** (from March session — still pending). Direct delivery to Neptune is now being rejected by the transport rule, but MX still points to `mail.acghosting.com` on GoDaddy.
5. **littleheartslittlehands.com MX change** (from March session — still pending). Cloudflare DNS still points to cbsolt.net.
6. **Transport cert expiry** — Thumbprint `E58BFCBAEFEFDCAED0BF9E894127A3DE64CE9C69` expires 2026-07-22 (Event 12017 logs it as "will expire soon"). Separate from the Let's Encrypt frontend cert.
7. **Windows Update after-care:** Watch for 10003 crashes on next Exchange service restart to confirm the AssistantsQuarantine ACL fix is sticky (the new ACE is explicit, so registry ACL inheritance shouldn't roll it back, but worth one more reboot verification).

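
For the transport cert in task 6, the remaining runway can be computed rather than estimated. A minimal sketch follows; `Get-DaysUntilExpiry` is a hypothetical helper, and the `Get-ExchangeCertificate` call only runs where the Exchange cmdlets are loaded.

```powershell
# Sketch only: whole days left before a certificate's NotAfter date.
function Get-DaysUntilExpiry {
    param(
        [Parameter(Mandatory)][datetime]$NotAfter,
        [datetime]$Now = (Get-Date)
    )
    [math]::Floor(($NotAfter - $Now).TotalDays)
}

# Using the dates recorded in this log (expiry 2026-07-22, incident day 2026-04-23).
Get-DaysUntilExpiry -NotAfter ([datetime]'2026-07-22') -Now ([datetime]'2026-04-23')

# On NEPTUNE, with the Exchange snap-in loaded, read the real expiry instead.
if (Get-Command Get-ExchangeCertificate -ErrorAction SilentlyContinue) {
    $cert = Get-ExchangeCertificate -Thumbprint 'E58BFCBAEFEFDCAED0BF9E894127A3DE64CE9C69'
    '{0}: {1} days left, services: {2}' -f $cert.Subject, (Get-DaysUntilExpiry -NotAfter $cert.NotAfter), $cert.Services
}
```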
---

## Reference

### Key file paths on Neptune
- Exchange SMTP Receive logs (frontend):
  `C:\Program Files\Microsoft\Exchange Server\V15\TransportRoles\Logs\FrontEnd\ProtocolLog\SmtpReceive\RECV<YYYYMMDDHH>-<n>.LOG`
- Exchange SMTP Receive logs (hub):
  `C:\Program Files\Microsoft\Exchange Server\V15\TransportRoles\Logs\Hub\ProtocolLog\SmtpReceive\`
- Message tracking:
  `C:\Program Files\Microsoft\Exchange Server\V15\TransportRoles\Logs\MessageTracking\`
- SBR config (Microsoft agent, still active):
  `C:\Program Files\Microsoft\Exchange Server\V15\TransportRoles\agents\Custom\Microsoft.Exchange.SBR.{InternalDomains,OverrideSettings,IgnoreAuthAs}.config`
- messageconcept ExSBR (disabled but DLL still present):
  `C:\Program Files\messageconcept\ExSBR\SenderBasedRouting.dll`
- All backups from this session:
  `C:\BackupBeforeFix\`

### Diagnostic one-liners that were useful

**Count PRX2 events in the most recent non-empty frontend log:**
```powershell
$f = Get-ChildItem 'C:\Program Files\Microsoft\Exchange Server\V15\TransportRoles\Logs\FrontEnd\ProtocolLog\SmtpReceive' | Sort-Object LastWriteTime -Descending | Where-Object Length -gt 0 | Select -First 1
Get-Content $f.FullName | Where-Object { $_ -match '451 4\.7\.0.*PRX2' } | Measure-Object | % Count
```

**Find a specific session's full SMTP trace** (reuses `$f` from the previous snippet):
```powershell
$sid = '08DEA1720522673C'
Get-Content $f.FullName | Where-Object { $_ -match $sid }
```

**Check AssistantsQuarantine ACL:**
```powershell
Get-Acl 'HKLM:\SYSTEM\CurrentControlSet\Services\AssistantsQuarantine' | Select -Expand Access | ft IdentityReference, RegistryRights, AccessControlType -Auto
```

**Check routing master DN:**
```powershell
$rg = [ADSI]'LDAP://CN=Exchange Routing Group (DWBGZMFD01QNBJR),CN=Routing Groups,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=First Organization,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=acg,DC=local'
$rg.Properties['msExchRoutingMasterDN'].Value
```

### Event IDs to watch
| Event | Source | Meaning |
|---|---|---|
| 10003 | MSExchangeTransportDelivery | Transport process exception (e.g., registry ACL) |
| 25012 | MSExchangeTransportDelivery | Unhandled exception in SMTP session |
| 1048 | MSExchangeFrontEndTransport | >15% proxy sessions failing over 15-min window |
| 4999 | MSExchange Common | Watson crash report |
| 1035 | MSExchangeFrontEndTransport | Inbound auth failure (normal for brute-force attempts, ignore unless from trusted IPs) |
| 12035 | MSExchangeFrontEndTransport | Unable to load transport cert |
| 12017 | MSExchangeTransport / FETS | Internal transport cert expiring soon |
| 2937 | MSExchange ADAccess | AD object property points to Deleted Objects |

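
The event IDs in the table can be tallied in one pass over the Application log. This is a sketch rather than a command from the session; it assumes all of these events land in the Application log, and because event IDs are only unique per source, it groups by provider as well so unrelated sources that reuse a number stand out.

```powershell
# Sketch only: count the watched event IDs over the last hour, per ID and source.
$watch = 10003, 25012, 1048, 4999, 1035, 12035, 12017, 2937

$events = Get-WinEvent -FilterHashtable @{
    LogName   = 'Application'
    Id        = $watch
    StartTime = (Get-Date).AddHours(-1)
} -ErrorAction SilentlyContinue   # suppresses the "no events found" error

$events | Group-Object Id, ProviderName | Sort-Object Count -Descending |
    Format-Table Count, Name -AutoSize
```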
### Windows Updates installed today
- KB5082142 — 2026-04 Cumulative Update for Windows Server 21H2 (~10:09 local)
- KB5084071 — 2026-04 Cumulative Update for .NET Framework 3.5 / 4.8 / 4.8.1 (~09:35 local)
- KB5082427 — Update
- KB5082137 — Security Update

Neither KB5082142 nor KB5084071 release notes are known to explicitly mention the AssistantsQuarantine ACL regression — worth watching for any MS advisory or community reports. If multiple Exchange 2016 admins hit this same issue today, Microsoft will likely publish a workaround.

---

## Cross-Machine Notes

None for this session — all work was local to NEPTUNE. ClaudeTools repo is synced to this server for session-log writing; the commit from this session will sync back via normal `/sync` workflow.

---

## Update: 13:40 — Post-Fix Verification Failed, Pre-Reboot Checkpoint

After the initial 4 fixes, external SMTP senders stopped getting 451 PRX2 and inbound `RECEIVE` events resumed (~111 in a 5-minute window). **But delivery to mailboxes never actually started.** Mike confirmed `bertie@amtransit.com` was not seeing new mail. Investigation of `Get-MessageTrackingLog -EventId DELIVER` returned 0 events for the last 5 min even with 383+ messages sitting in the Submission queue.

### What was missed in the first pass

The external-facing SMTP flow (MailProtector → FETS) was restored, but the **internal pipeline** (Submission → categorizer → SmtpDeliveryToMailbox → mailbox store) was still broken in two distinct ways. Both were exposed by today's KB5084071 (.NET Framework CU) restart — they had been latent for a long time but started mattering only after transport services reloaded their agent assemblies and topology caches against the updated .NET runtime.

### Fix #5 — DNS records on ACG-DC16 for dead-MAIL short names

The `SmtpDeliveryToMailbox` queue was stuck on `NextHopDomain = n-hosting1` (the mailbox **database** short name, not a server FQDN). Last error:
```
451 4.4.0 DNS query failed. The error was: SMTPSEND.DNS.NonExistentDomain;
nonexistent domain n-hosting1 -> DnsDomainDoesNotExist: InfoDomainNonexistent
```

Hosts-file workaround for n-hosting1/n-largeboxes didn't help — **Exchange Transport's internal DNS resolver (edgetransport.exe) bypasses the Windows hosts file**. It queries the configured/adapter DNS servers directly. Verified by `[System.Net.Dns]::GetHostEntry('n-hosting1')` returning 172.16.3.11 (hosts file works for .NET/Windows resolver) while Exchange queue LastError kept reporting non-existent domain.

**Fix:** Added AD-integrated DNS A records on `ACG-DC16.acg.local` (172.16.3.50) via `Invoke-Command -ComputerName ACG-DC16` → `Add-DnsServerResourceRecordA`:

| Name | Type | Target | TTL |
|---|---|---|---|
| `n-hosting1.acg.local` | A | 172.16.3.11 | 1h |
| `n-largeboxes.acg.local` | A | 172.16.3.11 | 1h |
| `mail.acg.local` | A | 172.16.3.11 | 1h |
| `MAIL` (uppercase) | — | already exists (as `mail`, DNS isn't case-sensitive) | — |

Also added hosts-file entries on Neptune itself (belt-and-suspenders; see below).

After DNS records + `Restart-Service MSExchangeTransport`, the `n-hosting1` retry queue (164 → 235 msgs at peak) drained to 0. But messages just moved back into **Submission** instead of completing delivery.

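
Fix #5 can be sketched in two parts: creating the A records on the DC, and then checking what the DNS server itself answers with the hosts file taken out of the picture. The exact commands were not captured in this log, so the block below is an assumption about how they might look; the record creation is shown with `-WhatIf` so that it only previews the change.

```powershell
# Sketch only. Part 1: preview the A records on ACG-DC16 (remove -WhatIf to apply).
Invoke-Command -ComputerName ACG-DC16 -ScriptBlock {
    foreach ($name in 'n-hosting1', 'n-largeboxes', 'mail') {
        Add-DnsServerResourceRecordA -ZoneName 'acg.local' -Name $name `
            -IPv4Address '172.16.3.11' -TimeToLive (New-TimeSpan -Hours 1) -WhatIf
    }
}

# Part 2: compare the OS resolver (which honours the hosts file) with a direct
# query to the DNS server that skips the hosts file, since the transport
# service appeared to rely only on the latter.
$dnsServer = '172.16.3.50'   # ACG-DC16
foreach ($name in 'n-hosting1', 'n-largeboxes', 'mail') {
    $viaOs = try { [System.Net.Dns]::GetHostEntry($name).AddressList -join ',' } catch { 'FAILED' }
    $viaDns = Resolve-DnsName -Name "$name.acg.local" -Type A -Server $dnsServer `
        -DnsOnly -NoHostsFile -ErrorAction SilentlyContinue
    [pscustomobject]@{
        Name       = $name
        OsResolver = $viaOs
        DnsServer  = if ($viaDns) { ($viaDns | Where-Object Type -eq 'A').IPAddress -join ',' } else { 'no answer' }
    }
}
```

If the two columns disagree for a name, a hosts-file entry is masking a missing or stale DNS record.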
### Fix #6 — Disabled `Exchange DkimSigner` transport agent (hung async)

Submission queue pinned at exactly 383 messages, no movement for 30+ seconds. Found this in the Application log:
```
MSExchange Extensibility 1057: Agent 'Exchange DkimSigner' went async but did
not call Resume on the new thread, while handling event 'OnCategorizedMessage'
```

The third-party Exchange DkimSigner (priority 10, assembly `C:\Program Files\Exchange DkimSigner\ExchangeDkimSigner.dll`, `IsCritical=true`) went async on `OnCategorizedMessage` and never resumed the categorizer thread. Classic transport agent async-bug pattern — the .NET Framework CU likely changed some async/threading behavior the DkimSigner depends on.

**Fix:** `Disable-TransportAgent -Identity 'Exchange DkimSigner' -Confirm:$false` then `Restart-Service MSExchangeTransport, MSExchangeSubmission`. Outbound mail temporarily loses DKIM signatures — receivers with strict DMARC `p=reject` (devconllc.com is the only one we run that tight) may now get fail results on reply traffic until we re-enable.

### Fix #7 — IRM / RMS Encryption Agent timeout lockdown

After DkimSigner disabled, queue was still stuck. Found:
```
MSExchange Extensibility 1050: The execution time of agent 'RMS Encryption Agent'
exceeded 162154 milliseconds while handling event 'OnRoutedMessage' for message
with InternetMessageId: 'Not Available'.
```

RMS Encryption Agent is a hidden built-in Exchange agent (not in `Get-TransportAgent` output, not in `agents.config`, can't be disabled via `Disable-TransportAgent` — fails with "Transport agent 'RMS Encryption Agent' isn't found"). It was trying to reach a non-existent RMS licensing server and timing out after ~160s per message. `IRMConfiguration` already showed `InternalLicensingEnabled = False`, but the agent was still firing.

**Fix:**
```powershell
Set-IRMConfiguration -TransportDecryptionSetting Disabled -Confirm:$false
Set-IRMConfiguration -JournalReportDecryptionEnabled $false -Confirm:$false
```

**Result:** Event 1050 RMS warnings in last 1 min: **0**. Agent no longer firing on messages.

### Current State Before Reboot

| Item | State |
|---|---|
| Queues (Submission) | ~427 messages still accumulated, no new DELIVER events yet observed |
| External inbound (FETS) | Returning `250 2.6.0 Queued mail` responses cleanly |
| `MSExchangeDelivery` registry crash | Fixed (no 10003 since 12:49) |
| Routing master DN | NEPTUNE (fixed) |
| DkimSigner agent | Disabled |
| messageconcept ExSBR agent | Disabled |
| RMS Encryption Agent | Silenced via IRM config (can't be fully disabled) |
| DNS records (on DC) | `n-hosting1`, `n-largeboxes`, `mail.acg.local` all resolve to 172.16.3.11 |
| Hosts file (on NEPTUNE) | Has MAIL, mail.acg.local, n-hosting1, n-largeboxes → 172.16.3.11 |
| AssistantsQuarantine ACL | NETWORK SERVICE has FullControl (inheritable) |

Mike and I agreed a **full Neptune reboot** is the right next step. Rationale:
- KB5082142 (cumulative update) almost certainly has a pending-reboot action we're working around
- `edgetransport.exe` has been restarted many times but its in-memory DNS cache may still hold stale negative entries ("n-hosting1 doesn't exist")
- RMS Encryption Agent may be holding a timed-out socket to an old RMS endpoint that only a fresh process will release

### Post-Reboot Verification Checklist (for next pass)

1. `Get-Acl 'HKLM:\SYSTEM\CurrentControlSet\Services\AssistantsQuarantine'` — verify `NETWORK SERVICE` ACE survived.
2. `$rg = [ADSI]'LDAP://CN=Exchange Routing Group (DWBGZMFD01QNBJR),...'; $rg.Properties['msExchRoutingMasterDN'].Value` — verify NEPTUNE DN.
3. `Get-TransportAgent | ? Enabled -eq $false` — verify DkimSigner + SenderBasedRouting still disabled.
4. `[System.Net.Dns]::GetHostEntry('n-hosting1')` and `GetHostEntry('mail.acg.local')` — verify DNS resolution.
5. Watch Event Viewer for:
   - Event 10003 / 25012 / 4999 (should not recur)
   - Event 1057 "went async but did not call Resume" (should not recur — DkimSigner disabled)
   - Event 1050 RMS Encryption Agent timeouts (should not recur — IRM disabled)
6. `Get-Queue | ? MessageCount -gt 0` — should drain rapidly from the ~400 messages sitting in Submission right now.
7. **End-to-end smoke test:** Send an SMTP test message from `postmaster@azcomputerguru.com` to `rklem@lifelonglearningacademy.com`, confirm `250 2.6.0 Queued mail`, then run `Get-MessageTrackingLog -Recipients rklem@... -EventId DELIVER` within 60 seconds.
8. Have Mike check `bertie@amtransit.com` in OWA/Outlook — confirm actual mailbox delivery of any message in the backlog.

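
Items 1 through 6 of the checklist could be collected into a single pass. The block below is a sketch under stated assumptions, not a script that was run: it assumes an elevated session on NEPTUNE with the Exchange snap-in loaded, and it treats "since last boot" as the window for the event check.

```powershell
# Sketch only: gather checklist items 1-6 into one report.
$report = [ordered]@{}

# 1. NETWORK SERVICE ACE on AssistantsQuarantine
$acl = Get-Acl 'HKLM:\SYSTEM\CurrentControlSet\Services\AssistantsQuarantine'
$report['NETWORK SERVICE FullControl'] = [bool]($acl.Access | Where-Object {
    $_.IdentityReference.Value -eq 'NT AUTHORITY\NETWORK SERVICE' -and
    $_.RegistryRights -eq 'FullControl'
})

# 2. Routing master points at NEPTUNE
$rg = [ADSI]'LDAP://CN=Exchange Routing Group (DWBGZMFD01QNBJR),CN=Routing Groups,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=First Organization,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=acg,DC=local'
$report['Routing master is NEPTUNE'] = "$($rg.Properties['msExchRoutingMasterDN'].Value)" -like 'CN=NEPTUNE,*'

# 3. Both third-party agents still disabled
$disabled = Get-TransportAgent | Where-Object { -not $_.Enabled } | ForEach-Object { [string]$_.Identity }
$report['DkimSigner disabled'] = $disabled -contains 'Exchange DkimSigner'
$report['ExSBR disabled']      = $disabled -contains 'SenderBasedRouting'

# 4. Name resolution (OS resolver, so hosts-file entries count here)
foreach ($name in 'n-hosting1', 'mail.acg.local') {
    $report["Resolves $name"] = try {
        @([System.Net.Dns]::GetHostEntry($name).AddressList).Count -gt 0
    } catch { $false }
}

# 5. Watched events since the last boot (expected: 0)
$boot = (Get-CimInstance Win32_OperatingSystem).LastBootUpTime
$bad = Get-WinEvent -FilterHashtable @{
    LogName = 'Application'; Id = 10003, 25012, 4999, 1057, 1050; StartTime = $boot
} -ErrorAction SilentlyContinue
$report['Watched events since boot'] = ($bad | Measure-Object).Count

# 6. Messages still sitting in queues (expected: falling toward 0)
$report['Messages still queued'] = (Get-Queue | Measure-Object -Property MessageCount -Sum).Sum

$report.GetEnumerator() | Format-Table Name, Value -AutoSize
```

Items 7 and 8 still need a real message and a human looking at a mailbox, so they are left out of the report on purpose.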
### If reboot doesn't fully resolve delivery

Prime suspect if still stuck: the Exchange database itself. The two databases `N-Hosting1` (809 GB) and `N-LargeBoxes` (313 GB) mount on boot, but if the KB5082142 update left the ESE store in a bad state, `MSExchangeIS` may be crashing after N seconds of delivery attempts. Check:
- `Get-MailboxDatabase -Status | Format-List Name, Mounted, MountedOnServer, Recovery`
- `Get-EventLog -LogName Application -Source MSExchangeIS -After (Get-Date).AddMinutes(-10) | ? EntryType -eq 'Error'`
- Event IDs to watch: 9542 (DB mount failure), 474 (page checksum mismatch), 103 (store corruption)

---

## Credentials (addendum from this phase)

### ACG-DC16 PowerShell Remoting
- No explicit creds — worked via Kerberos from an already-logged-in domain-admin session on NEPTUNE (`administrator@acg.local`).
- Used `Invoke-Command -ComputerName ACG-DC16` (NOT the IP — Kerberos needs an SPN-matching hostname).

---

## Additional File Changes (this phase)

| File | Change |
|---|---|
| `C:\Windows\System32\drivers\etc\hosts` (Neptune) | Added `172.16.3.11 n-hosting1` and `172.16.3.11 n-largeboxes` in addition to earlier MAIL/mail.acg.local entries |
| ACG-DC16 `acg.local` DNS zone | New A records: `n-hosting1`, `n-largeboxes`, `mail` → 172.16.3.11 |
| Exchange transport agents | `Exchange DkimSigner` (priority 10) → Disabled |
| Exchange IRM config | `TransportDecryptionSetting = Disabled`, `JournalReportDecryptionEnabled = False` |

All prior backups from phase 1 remain in `C:\BackupBeforeFix\`. No new registry / AD writes needed a backup in this phase (DNS records are additive; IRM config change is reversible via `Set-IRMConfiguration`).

---

## Key Lesson

**KB5084071 (.NET Framework 3.5/4.8/4.8.1 CU) + KB5082142 (Windows Server 21H2 CU)** hit this box hard. Three separate latent Exchange issues surfaced simultaneously when the transport services reloaded against the updated runtime:

1. NETWORK SERVICE missing rights on `AssistantsQuarantine` (likely pre-existing since ~2021 when Neptune was installed; only mattered once a post-update service restart tried to mint a new GUID subkey)
2. Dead MAIL server in topology being picked for proxying (pre-existing since 2026-03 cleanup; only mattered once FETS re-walked the candidate list post-restart)
3. Third-party DkimSigner async-completion behavior broken by the .NET update + RMS Encryption Agent timeout firing on every message (both latent problems that older service state tolerated)

**For future Windows Update rollouts on this server:** verify Exchange mailflow end-to-end (including `Get-MessageTrackingLog -EventId DELIVER`) within 10 minutes of the post-update reboot. The initial FETS-accepts-SMTP test is not sufficient — it only validates external ingest, not mailbox delivery.

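
The end-to-end check recommended above could look like the following sketch: send a uniquely tagged probe message, then poll message tracking until a `DELIVER` event for it appears. `Test-MailboxDelivery` is a hypothetical function, the defaults are assumptions, and it presumes Windows PowerShell with the Exchange snap-in loaded and a receive connector that accepts the probe from localhost.

```powershell
# Sketch only: returns $true once a DELIVER event is logged for the probe,
# or $false if none shows up before the timeout.
function Test-MailboxDelivery {
    param(
        [Parameter(Mandatory)][string]$From,
        [Parameter(Mandatory)][string]$To,
        [string]$SmtpServer = 'localhost',
        [int]$TimeoutSeconds = 60
    )

    # A unique subject makes the probe easy to find in the tracking log.
    $subject = "Delivery probe $([guid]::NewGuid())"
    $start = Get-Date
    Send-MailMessage -SmtpServer $SmtpServer -From $From -To $To `
        -Subject $subject -Body 'Post-update mail-flow probe.'

    while (((Get-Date) - $start).TotalSeconds -lt $TimeoutSeconds) {
        $hit = Get-MessageTrackingLog -Recipients $To -MessageSubject $subject `
            -EventId DELIVER -Start $start
        if ($hit) { return $true }
        Start-Sleep -Seconds 5
    }
    return $false
}
```

With the addresses from checklist item 7, a call would be `Test-MailboxDelivery -From 'postmaster@azcomputerguru.com' -To 'rklem@lifelonglearningacademy.com'`. A `$false` result after an accepted `250 2.6.0` is exactly the failure mode this incident hid: ingest working while mailbox delivery is stalled.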