
Mike Swanson fd99ee327c sync: auto-sync from GURU-5070 at 2026-06-12 05:57:38

Author: Mike Swanson
Machine: GURU-5070
Timestamp: 2026-06-12 05:57:38

2026-06-12 05:58:05 -07:00


GuruRMM command-delivery diagnosis — ROUND 3 (specific proposal to evaluate)

You are part of a two-model quorum diagnosing why a subset of RMM agents stop receiving server->agent WebSocket COMMAND (Text) frames while everything else about their connection looks healthy. Rounds 1-2 established (summarized):

Topology (agent path, NO Cloudflare — agents use grey-cloud rmm-api): agent -> endpoint gateway (e.g. UniFi UDR) -> public internet -> public IP 72.194.62.10 -> NPM (Nginx Proxy Manager, host .20, TERMINATES TLS) -> http://172.16.3.30:80 (origin nginx, 2nd layer) -> proxy_pass 127.0.0.1:3001 (Rust server). Two nginx layers in series.
Symptom: affected agents (PST-SERVER and PST-SERVER2 at the same site behind the same UDR; GTS-PEDRO-H at a different site; all on v0.6.63) keep heartbeating (agent->server is fine, last_seen stays fresh) and do NOT trip their 90s no-inbound reconnect, so they appear to receive at least the server's 30s WS Ping. They never ACK or execute commands, whether tiny (~80B) or large (~4KB). ~40 other agents on the SAME NPM+nginx path receive commands fine. A forced server-side eviction followed by a fresh reconnect did NOT restore delivery.
Current top hypotheses (both models, round 2): (H1) one of the two nginx layers mishandles/buffers server->agent Text frames after the WS upgrade (origin /ws lacks an explicit proxy_buffering off;); (H2) agent-side 0.6.63 receive/command path defect that survives reconnect; (H3) endpoint gateway (UDR) selectively interferes with downstream Text frames on the un-CDN'd direct path. Cloudflare and MTU were dropped.
We have NOT yet localized where a Text frame dies (packet capture tooling was just installed; not run yet).
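
Once a capture exists, the two plaintext legs in the topology above (NPM -> 172.16.3.30:80 and origin nginx -> 127.0.0.1:3001) are the places where WebSocket frames can be read without decrypting TLS. The sketch below counts frames per opcode in a server->agent byte stream, so a leg that carries Ping but no Text frames would stand out. It assumes the TCP payload has already been reassembled from the capture and the HTTP 101 response stripped; the function names are illustrative and are not part of GuruRMM. If permessage-deflate is negotiated the payload is compressed, but the opcode in the header is still readable.

```python
import struct
from typing import Dict, NamedTuple, Optional

# Opcode values defined by RFC 6455, section 5.2.
OPCODES = {0x0: "continuation", 0x1: "text", 0x2: "binary",
           0x8: "close", 0x9: "ping", 0xA: "pong"}


class FrameHeader(NamedTuple):
    fin: bool
    opcode: str
    masked: bool
    payload_len: int
    header_len: int  # bytes before the payload, including any mask key


def parse_frame_header(data: bytes) -> Optional[FrameHeader]:
    """Parse one frame header; return None if more bytes are needed."""
    if len(data) < 2:
        return None
    b0, b1 = data[0], data[1]
    fin = bool(b0 & 0x80)
    opcode = OPCODES.get(b0 & 0x0F, "reserved")
    masked = bool(b1 & 0x80)  # server->agent frames are normally unmasked
    length = b1 & 0x7F
    offset = 2
    if length == 126:  # 16-bit extended length follows
        if len(data) < 4:
            return None
        (length,) = struct.unpack("!H", data[2:4])
        offset = 4
    elif length == 127:  # 64-bit extended length follows
        if len(data) < 10:
            return None
        (length,) = struct.unpack("!Q", data[2:10])
        offset = 10
    if masked:
        offset += 4  # skip the 4-byte masking key
    return FrameHeader(fin, opcode, masked, length, offset)


def count_opcodes(stream: bytes) -> Dict[str, int]:
    """Count complete frames per opcode in a reassembled byte stream."""
    counts: Dict[str, int] = {}
    pos = 0
    while pos < len(stream):
        hdr = parse_frame_header(stream[pos:])
        if hdr is None or pos + hdr.header_len + hdr.payload_len > len(stream):
            break  # trailing partial frame: stop rather than guess
        counts[hdr.opcode] = counts.get(hdr.opcode, 0) + 1
        pos += hdr.header_len + hdr.payload_len
    return counts
```

Running `count_opcodes` on the same connection's stream at each plaintext leg, and comparing the "text" counts, is one way to see whether a command frame leaves the Rust server but does not leave the origin nginx.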

The proposal to evaluate (from the operator)

Remove NPM (Nginx Proxy Manager) from the equation entirely — NAT the public traffic straight to .30.

Relevant constraints / facts:

NPM (.20) is currently the PUBLIC TLS terminator for BOTH rmm.azcomputerguru.com (dashboard, Cloudflare orange-cloud -> NPM) AND rmm-api.azcomputerguru.com (agents, grey-cloud DNS-only -> straight to the public IP -> NPM). NPM holds the TLS certs.
The origin nginx on .30 currently listens on :80 PLAINTEXT only (no :443, no TLS cert installed there today).
Agents connect with wss://rmm-api.azcomputerguru.com/ws — TLS is mandatory for them.
So "NAT straight to .30" implies the edge firewall/router DNATs public :443 to .30, and .30's nginx must be reconfigured to terminate TLS on :443 (install/move the cert) and proxy /ws -> 127.0.0.1:3001. Net effect: ONE nginx layer instead of two; NPM out of the agent path.
~200 agents total. The dashboard path (rmm, Cloudflare) would also need a decision (keep it on NPM, or also move it).

Please answer (each model, concise but rigorous)

1. As a DIAGNOSTIC bisection: if we remove NPM and the affected agents (PST-SERVER etc.) START receiving commands, what does that prove? If they STILL don't, what does that prove? Is this a clean bisection of the current hypothesis set (H1 nginx-layer vs H2 agent vs H3 gateway), or are there confounds?
2. As a permanent FIX/SIMPLIFICATION: pros/cons of collapsing to a single nginx layer on .30. What could break or regress (TLS/cert handling, the Cloudflare-fronted dashboard, HTTP/2, the grey-cloud direct exposure of .30 to the internet, security surface, cert renewal, etc.)?
3. Is there a LOWER-RISK / FASTER way to get the SAME diagnostic signal WITHOUT re-NATing production, e.g. a parallel test listener, a single test agent pointed at a bypass hostname/port, an /etc/hosts override on one endpoint, or adding proxy_buffering off;/WS hardening to one layer first? If so, specify it.
4. Net recommendation: do the NAT-straight-to-.30 change now, or do a specific cheaper test first? Give a concrete ordered action list.

Be specific and call out anything in the operator's proposal that is risky or that wouldn't actually isolate the variable they think it does.
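
For the bypass-probe option raised in the questions above, a stdlib-only sketch of the WebSocket upgrade step may help when comparing layers. It builds the HTTP/1.1 upgrade request and checks the Sec-WebSocket-Accept value in the reply, per RFC 6455. The helper names are hypothetical, and the real agent may send authentication headers or a subprotocol that are not described in this brief, so this only tests whether a given layer completes the upgrade, not whether commands are delivered.

```python
import base64
import hashlib
import os

# Fixed GUID from RFC 6455, used to derive Sec-WebSocket-Accept.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def new_client_key() -> str:
    """Random 16-byte nonce, base64-encoded, as the RFC requires."""
    return base64.b64encode(os.urandom(16)).decode("ascii")


def expected_accept(client_key: str) -> str:
    """Value a compliant server must return for this client key."""
    digest = hashlib.sha1((client_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def build_upgrade_request(host: str, path: str, client_key: str) -> bytes:
    """Minimal upgrade request; real agents may add more headers."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {client_key}",
        "Sec-WebSocket-Version: 13",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def upgrade_succeeded(response_head: bytes, client_key: str) -> bool:
    """True if the reply is a 101 with the correct accept value."""
    text = response_head.decode("iso-8859-1")
    status_line, _, rest = text.partition("\r\n")
    parts = status_line.split(" ", 2)
    if len(parts) < 2 or parts[1] != "101":
        return False
    headers = {}
    for line in rest.split("\r\n"):
        if not line:
            break  # blank line ends the header section
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers.get("sec-websocket-accept") == expected_accept(client_key)
```

As a usage sketch, the request from `build_upgrade_request("rmm-api.azcomputerguru.com", "/ws", key)` could be sent over a plain TCP socket to 172.16.3.30:80 from inside the LAN, or over a TLS-wrapped socket to the public IP, and the reply head passed to `upgrade_succeeded`. Which targets are reachable from a given test host is an assumption to verify.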
