**1. Diagnostic bisection value**

Success (affected agents start receiving commands after NPM is removed and traffic goes direct to .30's single nginx):

- This would be strong evidence that the problem lives in the current dual-nginx path (H1), most likely in or between NPM and the origin nginx. Heartbeats (agent→server) and, apparently, server WS Pings keep working while Text frames do not, which already points to a server→agent data-path issue; removing one entire proxy layer and its configuration would cleanly implicate that layer or the interaction between the two.
- It would make H2 (pure agent 0.6.63 defect) much less likely, because the same agent binaries would now be receiving frames they previously did not. A reconnect already failed to fix it, so a pure agent-side stuck state is less plausible if the path change alone restores delivery.
- H3 (UDR/gateway interference) would be largely ruled out for the affected agents, because the path from the UDR outward (to the public IP) is unchanged; only the internal destination after the public IP changes.

Failure (agents still do not receive commands):

- This would rule out "NPM + two nginx layers" as the sole cause. It would shift weight to H2 or H3, or to something that survives the change: the Rust server itself (send path or per-connection state), the specific nginx config that gets deployed on .30, the TLS termination behavior on .30 vs NPM, or some other invariant (e.g., the public IP, the outer NAT device, or a property of the long-lived connections from those specific agents).
- It would not cleanly distinguish H2 from H3.

**Confounds and why it is not a perfectly clean bisection**:

- You are not just "removing NPM"; you are also moving TLS termination from NPM's nginx to a new (or newly configured) nginx instance on .30. Any difference in nginx version, compiled modules, exact location block, `proxy_*` directives, or even OpenSSL behavior becomes a variable.
- The internal hop (.20 ↔ .30) disappears. If there was any subtle interaction (extra buffering, header rewriting, a connection-pooling oddity, or an MTU/fragmentation effect on the internal segment), it is removed at the same time.
- If the public IP's 443 is DNATed wholesale to .30, both the grey-cloud API and the Cloudflare orange-cloud dashboard now land on .30's nginx. Any difference in how Cloudflare-injected headers, real IP handling, or HTTP/2 are processed on .30 vs the current NPM setup becomes entangled.
- The affected population is a small subset of agents on the same path that already works for ~40 others. A fix that only appears after a full topology + TLS-terminator change does not tell you whether the root cause was a per-connection state bug in NPM that only some flows hit, or something narrower.
- In short: success would be good evidence against the current dual-layer setup; failure would be informative but would leave the remaining hypotheses entangled with the new single-layer config and exposure.

**2. Permanent fix / simplification pros and cons**

Pros of collapsing to a single nginx layer on .30 (NAT public 443 straight to .30, terminate TLS there, proxy `/ws` directly to 127.0.0.1:3001):

- Removes one entire place where server→agent Text frames can be buffered, delayed, or mishandled. The explicit `proxy_buffering off;`, `proxy_http_version 1.1;`, Upgrade/Connection headers, and appropriate read/write timeouts can be put in one location block and kept consistent.
- Simpler mental model and fewer moving parts for the agent control path (grey-cloud direct path especially).
- Slightly lower latency and one fewer internal TCP hop.
- If the current NPM-generated config has been quietly suboptimal for long-lived bidirectional WS with server-initiated data frames, this eliminates that class of problem permanently.
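
As a minimal sketch of the single-layer layout these pros describe, the server block on .30 might look roughly like the following. Only the hostname, the `/ws` location, the 127.0.0.1:3001 upstream, and the WebSocket directives come from this discussion; the certificate paths (certbot's default layout) and the `Host` header line are assumptions and would need to match the real deployment.

```nginx
# Sketch only: single nginx layer on .30 terminating TLS for the agent path.
server {
    listen 443 ssl;
    server_name rmm-api.azcomputerguru.com;

    # ASSUMPTION: certbot-style paths; the certs currently live on NPM (.20).
    ssl_certificate     /etc/letsencrypt/live/rmm-api.azcomputerguru.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/rmm-api.azcomputerguru.com/privkey.pem;

    location /ws {
        # Plaintext hop to the Rust server on the same box.
        proxy_pass http://127.0.0.1:3001;

        # WebSocket upgrade handling, kept in one place.
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;   # ASSUMPTION: the server wants the original Host

        proxy_buffering off;

        # Long-lived connections: avoid idle closes between commands.
        proxy_read_timeout 3600s;
        proxy_send_timeout 3600s;
    }
}
```

If the DNAT moves all of 443 to .30, a second `server` block with the dashboard's `server_name` and the Cloudflare real-IP handling would have to sit alongside this one, which is part of what the cons below weigh.
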

Cons and real risks:

- **Exposure change**: .30 becomes directly internet-facing on 443 for rmm-api (and likely the dashboard if the same public IP is used). Today the public face is NPM on .20. You are moving the edge TLS terminator and the first point of public contact onto the origin box.
- **Cert and renewal problem**: NPM currently holds and auto-renews the LE certs. Moving termination to .30 requires either (a) installing the certs + private keys on .30 and setting up independent renewal (certbot or equivalent) for both subdomains, or (b) some other sharing mechanism. Renewal failure or a botched cutover risks mass agent disconnects (all 200) plus dashboard outage. This is a classic "moved the certs, forgot the renewal" regression vector.
- **Dashboard entanglement**: Both subdomains currently resolve to the same public IP. Changing the DNAT target for 443 moves *all* traffic (Cloudflare-proxied dashboard + direct grey-cloud API) to .30. You cannot easily move only the agent path without either a second public IP, port tricks (not viable for wss), or careful SNI/Host-based server blocks on .30 that correctly handle Cloudflare's X-Forwarded-For / CF-Connecting-IP plus real-ip module configuration. Keeping the dashboard on NPM while moving only API traffic requires more routing/NAT cleverness than the proposal states.
- **Cloudflare-specific config**: The orange-cloud dashboard path expects proper trusted-proxy handling, real IP restoration, and possibly HTTP/2 or early hints behavior. Origin nginx on .30 was previously only doing internal plaintext proxy_pass; it will now need the full hardened public + Cloudflare config that NPM was providing.
- **Grey-cloud direct surface**: rmm-api is already DNS-only grey-cloud, so the public IP was already reachable. But the actual listener and TLS stack were on NPM. Moving the listener changes the attack surface and the exact software that faces the internet for agent connections.
- **Operational blast radius**: A misconfig on .30's nginx (wrong `server_name`, missing WebSocket upgrade handling, wrong `listen` directives, http2 vs http1.1 issues, etc.) takes down or degrades the entire agent fleet plus the dashboard in one shot. Current setup has some separation (NPM as a distinct TLS/proxy tier).
- **Rollback complexity**: Re-pointing the DNAT back to .20 and restoring NPM as terminator is possible but not instantaneous, and any state (long-lived WS connections) will be torn down.
- Minor: HTTP/2 handling for the dashboard, HSTS, OCSP stapling, cipher suite parity, etc., all have to be re-validated on the new terminator.

The simplification is attractive for the agent WS path specifically, but the proposal as written does not isolate that path cleanly from the dashboard and does not address the cert-renewal and exposure shift.

**3. Lower-risk / faster ways to get equivalent diagnostic signal**

Yes, several that are strictly safer than re-NATing production:

- **Highest value, lowest risk first**: Add the missing WebSocket hardening to the *existing* layers before touching routing. On the origin nginx (.30), for the `/ws` location (and any upstream for the agent path), explicitly set:

  ```
  proxy_buffering off;
  proxy_http_version 1.1;
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection "upgrade";
  proxy_read_timeout 3600s;   # or similar long value
  proxy_send_timeout 3600s;
  ```

  Do the same in NPM's generated config for the rmm-api vhost if it is not already perfect. Reload (not restart) and immediately test command delivery to the known-bad agents. This directly attacks H1 with almost zero blast radius. If commands start flowing, you have strong evidence without any topology change.
- Run the packet captures that were just installed. Capture on .30 (both the external-facing interface and the loopback to 3001) while deliberately sending a command (small and large) to a bad agent and a good agent.
  See whether the Text frames ever arrive at .30's nginx listen socket, whether they are forwarded to the Rust process, and whether ACKs or responses come back. Also capture on NPM (.20) if feasible. This localizes where the frame dies without any config or routing change.
- Server-side instrumentation: add targeted logging (or a temporary debug path) in the Rust server that logs when it attempts to send a command frame to specific agent IDs (or connection IDs) and whether the WS write succeeds or returns an error. Correlate with the agents that are "healthy" by last_seen/heartbeat but never ACK.
- Controlled client test: from a machine behind one of the affected UDRs (or, if possible, temporarily on the same LAN as an affected agent), use `wscat`, `websocat`, or a tiny test binary to connect to `wss://rmm-api.azcomputerguru.com/ws` and have the server send Text frames. Observe whether this test client receives them. This bypasses the specific agent binary while using the real path.
- Per-agent or per-site config override if the agent supports an alternate server URL (or can be pointed via a short-lived DNS change + test hostname). Point one affected agent (or a spare test agent placed behind the same UDR) at a bypass that avoids NPM entirely: for example, a temporary high-port listener on .30 with its own TLS (self-signed is acceptable for one test agent) or a minimal nginx instance on .30 listening on an alternate port that you DNAT only for a test public port. This gives a "same UDR, different proxy path" signal with tiny scope.
- If the agents expose any local diagnostic or can have a one-off config pushed, collect WS receive logs from one bad agent and one good agent during a forced command.

These give you localization (where the frame stops) and a direct test of the "remove the second nginx layer" hypothesis without moving the production NAT target or the TLS certs for 200 agents.

**4. Net recommendation and concrete ordered action list**

Do **not** do the NAT-straight-to-.30 change as the first step. It is high-risk, has significant confounds for the diagnostic, forces a cert/renewal/Cloudflare transition, and likely pulls the dashboard along whether intended or not. It is a reasonable *later* simplification once you have evidence that the dual-layer path is the culprit, but it is not a cheap or clean bisection today.

Ordered action list (do in this sequence):

1. Immediately harden both existing nginx layers for the `/ws` location (add `proxy_buffering off;`, the Upgrade/Connection headers, long timeouts, and `proxy_http_version 1.1;`). Reload and re-test command delivery to the known-bad agents (PST-SERVER etc.). This is the cheapest, highest-signal step for H1.
2. Run the packet captures on .30 (and .20 if possible) during deliberate command sends to bad vs good agents. Determine whether Text frames reach .30's nginx and whether they are proxied to the Rust process.
3. Add minimal server-side send logging in the Rust binary for the affected agent IDs so you can see whether the server believes it is writing the frames successfully.
4. Perform a narrow-path test with one affected agent or a test client (alternate listener on .30, test hostname, or hosts-level override where feasible) that bypasses NPM while keeping everything else identical. This is the closest practical approximation to the operator's proposal without touching the production DNAT or certs for the fleet.
5. Only if steps 1–4 still point at the dual-layer path (and the captures/logs show frames dying in NPM or the origin nginx), *then* plan the single-layer cutover.
   At that point treat it as a deliberate simplification project: separate public IP or very careful server-block design if you want to keep dashboard on NPM; explicit cert migration + renewal automation on .30; staged rollout (small cohort first); documented rollback; and full validation of Cloudflare + grey-cloud paths.

The operator's proposal does not cleanly isolate "NPM" because it also moves TLS termination and (almost certainly) forces the dashboard onto the new terminator. That only a subset of agents is affected, and that reconnects did not help, already suggests the root cause may be stateful/per-connection rather than a blanket "any traffic through NPM is broken." Get the direct evidence from config hardening + captures + narrow bypass before you change the public routing and cert story for the whole system.
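
For step 4, one way to build the narrow bypass is a temporary extra listener on .30 that a single test agent (or test client) reaches through a test-only DNAT rule. The following is a sketch under stated assumptions: port 8443, the self-signed certificate paths, and the log file name are all hypothetical, and the test agent would have to be configured to accept the self-signed certificate.

```nginx
# Sketch only: temporary NPM-bypass listener on .30 for ONE test agent.
# Remove this block and the matching test DNAT rule when the test is done.
server {
    listen 8443 ssl;                       # HYPOTHETICAL alternate port
    server_name rmm-api.azcomputerguru.com;

    # HYPOTHETICAL self-signed pair, generated just for this test.
    ssl_certificate     /etc/nginx/test/ws-bypass.crt;
    ssl_certificate_key /etc/nginx/test/ws-bypass.key;

    # Separate log so test traffic is easy to tell apart from fleet traffic.
    access_log /var/log/nginx/ws-bypass-test.log;

    location /ws {
        # Same upstream and WebSocket settings as the production /ws path.
        proxy_pass http://127.0.0.1:3001;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_buffering off;
        proxy_read_timeout 3600s;
        proxy_send_timeout 3600s;
    }
}
```

Because production 443 and the fleet's certs are untouched, rollback is a matter of deleting the block, reloading nginx, and removing the test DNAT rule.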