**1. Agent-side mechanisms that ACK TCP + handle Ping but starve Text frames (ranked by plausibility given the evidence)**

The capture + ongoing pings + outbound heartbeats already prove the TCP bytes arrived, the TLS records were decrypted, and the WS frame reader advanced past the 193-byte Text payloads (TCP is a strict ordered byte stream; you cannot reach a later Ping in the stream without having consumed the preceding command bytes in userspace). Highest to lowest:

- **Command dispatch / executor side is not consuming the parsed Text messages** (most likely). The low-level frame loop (or auto-control path) still runs, sees Ping, emits Pong, and keeps the socket moving. The Text is parsed but handed to a channel / task / match arm that has no active consumer, is parked, or was disabled after the first post-update command. Classic patterns: a `tokio::spawn`ed handler that deadlocked on a shared resource/mutex after the successful 22:23 command; an mpsc receiver with no task draining it; a `select!` loop where the "commands" branch is no longer selected after a state transition.
- **Per-session state machine / phase / "current command" slot left in a terminal state after the first success**. The 22:23 command (or the ConfigUpdate + inventory that preceded it) advanced a flag, consumed a oneshot, set a "busy" or "awaiting result" guard, or switched the agent into a mode that drops or ignores subsequent regular command Text frames. Later pings still flow because they are handled in the control path before or outside that state machine.
- **High-level message stream vs. control frame path divergence in the WS library**. The app code only polls a `Stream` of user messages (Text/Binary) that is no longer being driven (or whose internal buffer is not drained), while the library (or a small helper) keeps reading the underlying TLS stream specifically to service Ping/Pong/Close automatically. This is common in split read-half + auto-pong designs.
- **Silent drop after parse but before ack path** (e.g. command ID dedup, an unknown opcode after a reassembly edge case, a payload that no longer matches the expected envelope post-update, or a write-half for acks that sits on a different, stalled handle while heartbeat writes use a timer path that still works).
- **Internal WS reassembly buffer desync that only affects data frames**. Less likely: control frames are short and non-fragmented and can be processed even when a data message is in "waiting for continuation" state. The fact that tiny commands worked once and pings keep working makes a persistent parser state corruption less probable than an app-level dispatch problem.
- Pure "recv buffer filled, kernel still ACKing, app not reading at all" is ruled out by the later pings (see #2).

**2. "Kernel ACKs while app stopped reading" plausibility and ping reconciliation**

This is normal TCP behavior: the kernel ACKs as soon as segments land in the socket receive buffer; userspace `read()` / TLS decrypt / WS parse is completely independent. So the three 214-byte frames being ACKed with no retransmit only tells us the bytes reached the agent's TCP stack.

It cannot explain the observed pings, however. Because the pings (31-byte TLS records) were interleaved with or arrived after the commands in the TCP byte stream, any userspace code that saw and responded to a later Ping necessarily performed enough reads to drain past the command bytes. Therefore the WS frame parser in the agent did see the Text frames. The break is after parsing: in dispatch, state, or the consumer of user messages.

If the only pings that were answered were the ones before the commands in the stream, and no further pings arrived afterward, this argument would be weaker. But "continues to answer" and "interleaved" make the strict "app has stopped reading the socket" story untenable.

**3. Non-agent explanations (middlebox, UDR inspection, offload) weighed**

- **Agent-site middlebox / transparent TCP proxy / UDR doing L4 mangling**: Low. The capture on the NPM side sees the TCP 4-tuple terminating at the agents' own public IPs (98.190..., 64.139..., 68.230...) with clean ACKs coming back from those same IPs. A real proxy or interception device would normally terminate the connection and present its own source IP to NPM for the server leg. The alternative, a transparent L3 forwarder, does almost nothing to the TLS/WS bytes.
- **TLS inspection / MITM at the agent's UDR or local gateway**: Very low for the same reason. Full MITM terminates TLS on the middlebox, re-encrypts toward NPM, and the source IP NPM sees becomes the middlebox egress IP, not the listed agent public IPs. The agents would also have to trust the middlebox CA; a Rust agent using system roots or rustls is unlikely to do so silently unless the machines are heavily managed. Content-aware mangling inside TLS is not possible without breaking TLS.
- **GRO/LRO/TSO offload corruption on the server-grade NICs**: Possible in principle but low weight here. These offloads sit below TCP; corruption that still produces valid TCP checksums and clean ACKs with no retransmits would be a very specific driver/firmware bug. It would also tend to affect the entire stream (pings and commands are just TLS records over the same TCP bytes), not selectively Text frames. The "worked for exactly one command right after update, then never again" pattern is atypical for a hardware offload issue (those are usually consistent from the first packet). Different ISPs make a common NIC/driver across exactly the broken agents less likely unless they are all the same hardware model.

Overall the capture + behavior already pushes probability very heavily onto the agent process after the WS Text is parsed.

**4. Single most decisive next test (given constraints)**

Force-evict the three agents' WebSocket sessions from the server side (allowed). As soon as each shows a fresh connection (DB `last_seen` / connection event), immediately dispatch one CLEAN test command to it while running a fresh, filtered capture on the NPM host (and ideally also on the origin side) for traffic to/from exactly those three public IPs. Also read the 22:21 last-uploaded log batch for these agents if it is still available in the DB.

Why this single action is best:

- It directly tests whether the problem is tied to the long-lived session state or is reproducible on every new connection.
- If commands now produce CommandAck + execution after a fresh connect: the root cause is almost certainly per-connection accumulated state inside the agent (stuck executor, consumed channel, uncleared "busy" flag, reassembly state after the 22:23 command, etc.). This exonerates the NPM/pf path further and tells us the bug is in the agent's WS read loop + command dispatch after the first post-reconnect message.
- If the exact same symptom repeats on the new connection (TCP ACKed, pings answered, no CommandAck, no execution): the problem lives in the agent binary's core message handling for these hosts (or something local and persistent on those three machines). At that point we have strong evidence to deprioritize further NPM/pf changes and instead focus on (a) deep analysis of the 22:21 logs for init differences, (b) any per-agent DB state or config that these three share and others don't, and (c) eventually asking the sites for local captures or offload status (since we cannot push or visit).
- It is cheap, targeted, does not disturb the other ~40 working agents, and produces a clean before/after capture we can compare to the previous decisive capture.
- Restarting NPM would also force reconnects but is heavier and less precise.
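The branches above can be written down as a small decision helper so the result of the eviction test is recorded the same way for all three agents. This is only a sketch: `Observed`, `Verdict`, and `classify` are hypothetical names, not part of the agent or server code, and it assumes one or more follow-up commands may be sent after the first CLEAN command on the fresh connection to check for the "only the first command works" pattern.

```rust
/// What was observed for one command sent on the fresh connection.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Observed {
    /// CommandAck came back and the command executed.
    AckedAndExecuted,
    /// TCP ACKed, pings still answered, but no CommandAck and no execution.
    SilentlyIgnored,
}

/// Where the evidence points after the eviction test.
#[derive(Debug, PartialEq)]
enum Verdict {
    /// Fresh connection works: state accumulated on the old session is the suspect.
    PerConnectionState,
    /// First command works, a later one does not: the 22:23 pattern reproduces.
    FirstCommandOnly,
    /// Even the first command is ignored: agent binary or host-local cause.
    PersistsAcrossConnections,
    /// No command was sent, so nothing can be concluded.
    Inconclusive,
}

/// Classify the per-command observations, in the order the commands were sent.
fn classify(results: &[Observed]) -> Verdict {
    match results {
        [] => Verdict::Inconclusive,
        [Observed::SilentlyIgnored, ..] => Verdict::PersistsAcrossConnections,
        [Observed::AckedAndExecuted, rest @ ..] => {
            if rest.iter().all(|r| *r == Observed::AckedAndExecuted) {
                Verdict::PerConnectionState
            } else {
                Verdict::FirstCommandOnly
            }
        }
    }
}

fn main() {
    let runs = [
        vec![Observed::AckedAndExecuted],
        vec![Observed::SilentlyIgnored],
        vec![Observed::AckedAndExecuted, Observed::SilentlyIgnored],
    ];
    for run in &runs {
        println!("{:?} -> {:?}", run, classify(run));
    }
}
```

A single acked command is classified as `PerConnectionState`, matching the reading in the list above; sending a second command is what separates it from `FirstCommandOnly`.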
Secondary cheap adjunct while doing the eviction test: ensure an explicit outbound SNAT rule exists for 172.16.3.20 (even though the prior ACK behavior already argues against the SNAT gap being causal for these flows).

**5. "Worked at 22:23 right after the 0.6.63 update, never since" implications**

Yes, this is highly diagnostic. It almost certainly means:

- The update caused a process restart / fresh WS connection.
- The very first command(s) after that fresh connection (ConfigUpdate success + the one at 22:23) exercised a code path that succeeded.
- Something in the handling of that first command (or the ConfigUpdate + inventory that happened at update time) left the agent's command-receive or command-execution machinery in a broken state for all subsequent commands on that connection.
- Or the 22:23 command used a different code path (update-time special command, different envelope, different handler registration) than the later "hostname" / CLEAN commands.

This pattern is classic for: one-shot initialization that runs only on first message or right after connect; a channel or oneshot that is consumed once; a state machine that advances to a "post-first-command" mode that has a bug; a resource (lock, slot, pending-command table) acquired by the successful command and never released; or a difference between the commands the updater itself sends versus the ones the server later dispatches.

The fact that only three agents (different ISPs) show it while dozens of other 0.6.63 agents do not suggests these three either (a) hit a narrow race or ordering during their update/reconnect that others missed, (b) have distinct persisted local state or recent command history that interacts with the new version, or (c) share some other per-agent attribute the server or agent code now treats specially.

The decisive reconnect test above will tell us whether a brand-new post-eviction connection reproduces the "only the first command works" behavior or restores normal function. That single bit of information collapses most of the remaining hypotheses.
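To make the leading hypothesis concrete, here is a minimal single-threaded model of a read loop that keeps answering Ping while a "busy" slot taken by the first command is never released. It is an illustration under stated assumptions, not the agent's code: `Frame`, `FrameLoop`, `Executor`, and `simulate` are hypothetical, the real agent presumably uses async tasks rather than this polling loop, and the never-cleared flag is a deliberate stand-in for whichever lock, slot, or oneshot is actually involved.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Simplified inbound WebSocket frames (hypothetical model).
#[derive(Debug)]
enum Frame {
    Ping(u32),
    Text(String),
}

/// Low-level read loop: answers Ping inline, forwards Text to the executor.
struct FrameLoop {
    commands: SyncSender<String>,
    pongs_sent: Vec<u32>,
    dropped: usize,
}

impl FrameLoop {
    fn on_frame(&mut self, frame: Frame) {
        match frame {
            // Control path: never touches the command channel.
            Frame::Ping(n) => self.pongs_sent.push(n),
            // Data path: parsed fine, but only useful if someone drains the channel.
            Frame::Text(body) => {
                if self.commands.try_send(body).is_err() {
                    self.dropped += 1;
                }
            }
        }
    }
}

/// Command executor with a "busy" guard that is set but never cleared.
struct Executor {
    inbox: Receiver<String>,
    busy: bool,
    acks: Vec<String>,
}

impl Executor {
    fn poll(&mut self) {
        if self.busy {
            return; // stuck: later commands are never taken from the inbox
        }
        if let Ok(cmd) = self.inbox.try_recv() {
            self.acks.push(format!("ack:{cmd}"));
            self.busy = true; // deliberate bug: no code path resets this
        }
    }
}

/// Feed frames through the loop; returns (pongs sent, acks sent, Text frames dropped).
fn simulate(frames: Vec<Frame>) -> (Vec<u32>, Vec<String>, usize) {
    let (tx, rx) = sync_channel(1); // room for one queued command
    let mut reader = FrameLoop { commands: tx, pongs_sent: Vec::new(), dropped: 0 };
    let mut executor = Executor { inbox: rx, busy: false, acks: Vec::new() };
    for frame in frames {
        reader.on_frame(frame);
        executor.poll();
    }
    (reader.pongs_sent, executor.acks, reader.dropped)
}

fn main() {
    let frames = vec![
        Frame::Text("first-command".to_string()),
        Frame::Ping(1),
        Frame::Text("hostname".to_string()),
        Frame::Ping(2),
        Frame::Text("hostname".to_string()),
        Frame::Ping(3),
    ];
    let (pongs, acks, dropped) = simulate(frames);
    println!("pongs={pongs:?} acks={acks:?} dropped={dropped}");
}
```

With this input every Ping gets a Pong, only the first command is acknowledged, one later command sits unread in the channel, and the next is dropped. From the server side that looks the same as the capture: bytes ACKed, pings answered, no CommandAck.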