**Round 2 revised analysis (Cloudflare removed from agent path; two nginx layers + endpoint gateway now primary surface)**

### Key factual corrections applied

- Agent path is DNS-only (grey cloud) to `rmm-api.azcomputerguru.com`. Cloudflare is completely out of the agent data plane. All prior hypotheses that invoked CF WAF, CF WebSocket buffering, CF frame handling, or CDN normalization of MTU/keepalives are dropped.
- Explicit two-nginx chain for agents: NPM (172.16.3.20, TLS termination, standard WS-enabled proxy host) → origin nginx (172.16.3.30:80, plaintext) → `proxy_pass http://127.0.0.1:3001`.
- Origin `/ws` location supplies the basic upgrade headers and a long `proxy_read_timeout`, but lacks `proxy_buffering off;`, any `proxy_send_timeout`, `proxy_cache off`, or other streaming/WS hardening. NPM settings include `websockets=ON`, `http2_support=ON`, `block_exploits=ON`, caching=OFF, empty advanced config.
- The endpoint gateway (e.g. PST UniFi UDR) sits directly on the WAN leg with no CDN in front.
- Symptom core is unchanged and still asymmetric: agent→server (heartbeats, last_seen, re-offer triggering) works; server believes it dispatched (send_to true); tiny (~80 B) and larger commands produce no CommandAck and no execution; the agent does not trip its 90 s no-inbound reconnect; one early success occurred right after update/reconnect; forced eviction + fresh reconnect reproduced the failure for a subsequent tiny command.
- ~40 other agents succeed through the identical NPM+origin path; the three affected (two sharing a UDR, one at a different site) are on 0.6.63.

### Revised ranked hypotheses

Ranked by current fit to the corrected topology and the persistent asymmetry. I explicitly note changes from round 1.

**1. Origin nginx (and/or NPM) proxy configuration is insufficient for reliable server→client WebSocket data-frame delivery after upgrade.**

Missing `proxy_buffering off;` (nginx default is on) plus absent send timeout, combined with two proxy layers in series and NPM's block_exploits + http2_support, can cause server-originated Text frames to be buffered, not flushed promptly, or mishandled while small control frames (Pings) and the reverse direction (client heartbeats) continue. The upgrade handshake succeeds and the connection stays "live" from the origin's viewpoint, but application Text is not reliably forwarded to NPM's client-facing side.

**For:** Exactly matches asymmetry + "Pings appear to arrive (no 90 s reconnect) but commands never do"; both tiny and large Text fail; fresh TCP session after eviction still uses the same proxy config so reproduces; early post-reconnect success is consistent with transient "hot" proxy state before buffering or idle behavior kicks in. The provided origin config is the minimal common template, not the hardened WS template.

**Against:** 40 agents work on the same path, so any defect must be intermittent, connection-age-dependent, timing-dependent, or only triggered for certain agents' command dispatch patterns.

**Verdict:** Now the single strongest hypothesis. Round-1 #1 (server socket stall) is reframed here because the two visible nginx layers are the new, obvious place where Text vs. control or directional asymmetry can be introduced without killing the whole connection.

**2. Agent 0.6.63 command receive / Ack path is broken or left in a bad state while the heartbeat write path and control-frame handling remain functional.**

The WS reader task processes Pings (or at least whatever keeps the 90 s timer happy) and can still emit heartbeats, but never delivers a parsed `Command` Text frame to the handler (or the Ack path is one-way dead). Could be triggered by ConfigUpdate content, a reconnect sequence, or per-agent state.

**For:** Explains why only a subset of 0.6.63 agents are affected, why eviction + new server-side connection does not help (agent process state persists), why both sizes fail, and the "worked once then stopped" timeline on PST-SERVER. Heartbeats out are a separate task from inbound command dispatch.

**Against:** Same build works for the large majority through the identical infrastructure; would require something agent-specific (config, enrollment data, or a race that only some hit).

**Verdict:** Still very plausible and hard to rule out without agent visibility. Promoted slightly relative to round 1 because the network path is now better understood and the symptom remains after clean reconnects.

**3. Endpoint gateway (UDR at PST site; whatever is at GTS-PEDRO-H) selectively interferes with server→client WebSocket Text frames on the direct WAN leg.**

Stateful inspection, DPI/IPS, TCP flow tracking, or long-lived-connection handling in the gateway drops, delays, or mangles downstream application data frames (Text opcode) while passing control frames (Ping) and upstream traffic. No Cloudflare means the gateway's behavior is unmediated.

**For:** PST-SERVER and PST-SERVER2 share the same UDR and site; the direct internet path puts the gateway squarely in the frame path. Early success followed by persistent failure is consistent with a flow entry being installed with a bad timeout or inspection state.

**Against:** Tiny 80 B command also fails (argues against pure MTU/fragmentation); GTS-PEDRO-H is a different site (unless it has similar gateway hardware/policy); 40 other agents with their own gateways work.

**Verdict:** Plausible primary cause for the PST pair; weaker as a unified explanation for all three unless the gateways are alike. Round-1 site/gateway hypothesis is retained but narrowed (no Cloudflare involvement).

**4. Server-side outbound send task or WS library usage silently fails (or starves) Text frames for commands while continuing to emit Pings and accept inbound frames.**

`send_to()` only enqueues to an in-memory channel; the actual tungstenite/etc. write for Text either never completes, is not flushed, targets a stale sender, or is deprioritized behind Pings in the select!/timer loop.

**For:** Classic "enqueue reports success, bytes never leave." Would explain reaper behavior and "online but no delivery."

**Against:** Eviction + fresh reconnect (T8) should have produced a brand-new sender and socket; same server binary successfully delivers hundreds of commands to other agents on the same path. Round-1 version of this hypothesis is demoted because the new-connect failure and removal of Cloudflare weaken the "old socket only" story.

**Verdict:** Lower than in round 1 but not eliminated; a subtle difference between how Command Text vs. Ping frames are written/serialized/flushed remains possible.

**5. NPM-generated config or runtime behavior for the rmm-api proxy host does not correctly forward server→client WS data frames post-upgrade.**

Even with `websockets=ON`, the combination of http2_support, block_exploits, or the standard NPM template may only partially upgrade the backend connection or may apply buffering/inspection to the NPM↔origin leg that affects one direction or one opcode.

**For:** NPM is the first hop after the internet and the TLS terminator; its config is a black box until dumped. block_exploits sounds like it could contain request/response or header inspection that survives the upgrade in unexpected ways.

**Against:** If the WS proxying were fundamentally broken, far more (or all) agents would be affected.

**Verdict:** Worth immediate checking via config dump; likely to combine with #1 rather than stand alone.

**6. Stale or duplicate connection registration in the Rust in-memory agent map (commands routed to a sender no longer attached to the live WS sink).**

**For:** Perfectly explains "send_to true, nothing arrives, receive side still updates last_seen."

**Against:** Explicit eviction + re-handshake (T7/T8) should have replaced the entry; single-process server, single nginx path.

**Verdict:** Strongly demoted from round 1. The fresh-connect failure makes registry split-brain much less likely unless there is a subtle map-key or cleanup bug not exercised by normal eviction.

**7. Multiple independent root causes (GTS-PEDRO-H is a separate bug from the PST pair).**

GTS showed `attempts=1` + 60 s timeout after a long idle period; PST shows attempts climbing to 10 and "undeliverable." Different sites.

**Verdict:** Retain as a live possibility for GTS until the captures or more data merge or separate them. The shared "tiny hostname command, no ack, 0.6.63" observation argues against treating them as wholly unrelated.

Dropped from round 1 (explicitly):

- All Cloudflare hypotheses (WAF, CF buffering, CF WebSocket frame handling, CDN MTU/keepalive normalization). These no longer apply to the agent path.
- Pure "long-lived socket half-open only on old connections" framing is demoted (T8 fresh reconnect reproduced the symptom).
- Any assumption that "Cloudflare is in the path and could be normalizing or interfering" is removed.

The core puzzle is now cleanly: bytes for server→agent Text either never leave the Rust process, are not forwarded by one or both nginx layers, are lost or mangled by the endpoint gateway, or are received by the agent but not turned into CommandAck + execution.
### Single most diagnostic next test

**Dump NPM's exact generated nginx config for the rmm-api proxy host + coordinated packet capture on the .30 host (both legs) during a marked tiny command dispatch to a known-stuck agent (PST-SERVER).**

**Execution:**

- On the .20 (NPM) host, locate and export the full server block / location for `rmm-api.azcomputerguru.com` (typically under `/data/nginx/proxy_host/...` inside the NPM container, or the live conf). Capture at minimum: listen directives, the `location /` (or `/ws`) block, all `proxy_*` settings, any upgrade/connection headers, buffering/cache/timeout settings, http2 directives, and any reference to block_exploits or advanced config.
- On the .30 host, with PST-SERVER connected and in the failing state:
  - Start two background captures (pcap preferred for full frames; add `-vv -X` or similar text output if you need immediate readability):
    - Loopback: traffic involving port 3001 (Rust ↔ origin nginx). `tcpdump -i lo -s 0 -w /tmp/loopback-3001.pcap 'port 3001'`
    - LAN leg: traffic between .30 and .20 on port 80. Identify the interface with 172.16.3.30 (or the subnet), then `tcpdump -i <iface> -s 0 -w /tmp/lan-80.pcap 'host 172.16.3.20 and port 80'`, where `<iface>` is that interface's name.
  - Optionally raise origin nginx log level (error/access) for the duration if it is not already informative.
- From the server side (admin/test path or direct), dispatch one uniquely identifiable tiny command (hostname with a nonce/timestamp in command_id or payload if possible) targeted only at PST-SERVER.
- Wait 60–90 s (covers reaper cycles).
- Stop captures, collect the pcaps + origin nginx logs for the relevant minute(s) + server-side logs for that connection_id (every outbound Text/Ping with size/command_id, every inbound frame including heartbeats and any Pongs).
- (If the UDR web UI or API exposes it) query active sessions, flow table, or recent traffic logs for the public server IP or the agent's known public IP around the dispatch timestamp.
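
The "uniquely identifiable tiny command" step can be sketched as follows: generate a marker to embed in the command_id or payload, then do a raw byte search for it in each pcap. This is a minimal sketch, not the project's tooling. The helper names and the `rmmdiag` prefix are hypothetical. It assumes both captured legs are plaintext and that server→client frames are unmasked and uncompressed (no permessage-deflate negotiated); it can also miss a marker that is split across two TCP segments.

```python
import secrets
import time
from pathlib import Path


def make_marker(prefix: str = "rmmdiag") -> str:
    """Build a unique ASCII marker to embed in the test command (hypothetical format)."""
    return f"{prefix}-{int(time.time())}-{secrets.token_hex(4)}"


def marker_offsets(capture: bytes, marker: str) -> list[int]:
    """Return every byte offset at which the marker occurs in raw capture bytes."""
    needle = marker.encode("ascii")
    offsets = []
    start = capture.find(needle)
    while start != -1:
        offsets.append(start)
        start = capture.find(needle, start + 1)
    return offsets


def marker_in_pcap(path: str, marker: str) -> bool:
    """True if the marker appears anywhere in the pcap file at `path`."""
    return bool(marker_offsets(Path(path).read_bytes(), marker))
```

A hit in `/tmp/loopback-3001.pcap` but not in `/tmp/lan-80.pcap` would suggest the frame left Rust but did not cross the LAN leg. A raw search does not show direction or timing, so any hit should still be confirmed in tshark/wireshark against the server's "sent" timestamp.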

**Analysis focus (use tshark/wireshark or text dump):**

- Does the loopback capture show the WS Text frame (first byte 0x81: FIN set, opcode 0x1) containing the command leaving the Rust side toward 3001?
- Does the LAN capture show the corresponding data shortly afterward on the established TCP stream from .30 → .20 (the same stream that carried the original WS upgrade)?
- Are inbound heartbeats visible in both captures (client → .20 → .30 → Rust)?
- Are server Ping control frames (first byte 0x89: FIN set, opcode 0x9) visible leaving Rust and/or crossing to .20?
- Any RST/FIN, zero window, or errors around the time?
- Cross-reference exact timestamps with server "sent" logs.

**Outcome table and interpretation**

| Outcome (what the captures + config show) | Interpretation |
|-------------------------------------------|----------------|
| No Text frame for the dispatched command appears on loopback from Rust (server log may still claim send_to success); heartbeats arrive on loopback. | Failure is inside the Rust process (send task never actually wrote the frame to the socket for this connection, wrong sender, flush not happening, or enqueue-only success). Points to revised #4 (server outbound Text path). Rules out nginx/gateway/agent for this dispatch. |
| Text frame visible leaving Rust on loopback to origin nginx, but no corresponding WS data seen in LAN capture from .30 to .20 on the proxy conn (inbound heartbeats visible both places). | Origin nginx accepted the frame on 3001 but did not forward it (or buffered it indefinitely) to its socket toward NPM. Strong evidence for inadequate proxy config at origin (#1), especially the missing `proxy_buffering off;`, lack of send timeout, or interaction with the two-layer setup. |
| Text frame visible on both loopback and LAN (left .30 toward NPM). Agent still never acks/executes. | Bytes passed through origin nginx and left .30 toward NPM. Problem is downstream of .30: NPM did not relay over the TLS client socket (#5), or the frame was emitted by NPM but dropped/mangled by the endpoint gateway or on-path devices (#3), or the agent received it at the WS layer but did not parse/ack/execute (#2). Next actions become: capture at .20 if possible, UDR flow inspection, or agent-side instrumentation in a later build. |
| Pings (control) from server visible on loopback and/or LAN; client heartbeats flow normally; the specific command Text is absent or truncated at one of the capture points. | Selective handling of Text (data) opcode vs. control frames or vs. client→server direction. Reinforces proxy-layer (#1) or gateway (#3) as the filter. |
| Full bidirectional frame flow visible in captures (heartbeats, Pings if emitted, the command Text leaves Rust and crosses to .20), yet no CommandAck ever arrives at the server. | The infrastructure up to NPM forwarded the frame. Either NPM→WAN→gateway→agent path lost it, the agent ignored it, or (less likely) the capture timing missed a very delayed delivery. This would elevate #2 or #3 and de-emphasize the nginx config. |
| Captures + server logs show normal inbound activity and that the server did emit Pings, but no Pongs are ever observed for those Pings, yet the agent remains "online" with fresh last_seen and does not 90 s reconnect. | The round-1 inference that "the agent must be receiving server Pings" is weaker than assumed. The agent's no-inbound timer may be driven only by app-level heartbeats, by any outbound activity, or may simply be longer/more forgiving. Does not prove the connection is fully healthy for server→agent data. |
| No evidence the command was even attempted in server logs for the live connection_id (despite operator dispatch), while heartbeats continue to update last_seen on that same connection. | Registry / send_to targeting issue (#6). The eviction did not fully replace the live sender for commands. |

This single coordinated step (config dump + dual-leg capture on .30 during one marked dispatch) directly answers the three critical questions without any endpoint access: (1) did bytes leave the Rust process?, (2) did the origin nginx forward them across the plaintext leg to NPM?, (3) was the proxy config obviously deficient? It will confirm or largely eliminate the two-nginx configuration surface (#1/#5) and sharply narrow the remaining possibilities to server, gateway, or agent. If the frames cleanly exit .30, the next practical steps become NPM-side visibility or a controlled bypass of one nginx layer for a test agent.

The asymmetry after a clean reconnect is the fact that continues to make pure "old socket" or "stale registry" stories less attractive; the data now points most strongly at either the visible proxy layers or an agent-side receive/handler defect that survives reconnection. The captures will decide.
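
The Text-versus-Ping checks in the analysis focus come down to reading the first two bytes of each WebSocket frame on the plaintext legs. The decoder below is a minimal sketch of the RFC 6455 base framing. The `FrameHeader` type and `parse_frame_header` name are hypothetical, and it assumes the bytes passed in start exactly at a frame boundary (for example, TCP payload taken from a followed stream after the 101 response).

```python
from dataclasses import dataclass

# Opcode names from RFC 6455, section 5.2.
OPCODES = {0x0: "continuation", 0x1: "text", 0x2: "binary",
           0x8: "close", 0x9: "ping", 0xA: "pong"}


@dataclass
class FrameHeader:
    fin: bool          # final fragment of a message
    rsv1: bool         # set on compressed messages when permessage-deflate is negotiated
    opcode: str        # human-readable opcode name
    masked: bool       # True for client->server frames, False for server->client
    payload_len: int   # declared payload length in bytes
    header_len: int    # bytes before the payload (including any masking key)


def parse_frame_header(data: bytes) -> FrameHeader:
    """Decode the header of one WebSocket frame starting at data[0]."""
    if len(data) < 2:
        raise ValueError("a WebSocket frame header is at least 2 bytes")
    b0, b1 = data[0], data[1]
    length = b1 & 0x7F
    header_len = 2
    # 126 and 127 signal a 16-bit or 64-bit extended length field.
    ext = 2 if length == 126 else 8 if length == 127 else 0
    if ext:
        if len(data) < 2 + ext:
            raise ValueError("truncated extended payload length")
        length = int.from_bytes(data[2:2 + ext], "big")
        header_len += ext
    masked = bool(b1 & 0x80)
    if masked:
        header_len += 4  # 4-byte masking key follows the length
    code = b0 & 0x0F
    return FrameHeader(
        fin=bool(b0 & 0x80),
        rsv1=bool(b0 & 0x40),
        opcode=OPCODES.get(code, f"reserved(0x{code:x})"),
        masked=masked,
        payload_len=length,
        header_len=header_len,
    )
```

For example, `parse_frame_header(bytes([0x81, 0x50]))` describes an unmasked, final Text frame with an 80-byte payload, which is what a tiny server→agent command should look like on loopback if compression is off. The mask bit also gives direction: client heartbeats should show `masked=True`, server Pings and commands `masked=False`.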