# GuruRMM command-delivery diagnosis: ROUND 3 (specific proposal to evaluate)

You are part of a two-model quorum diagnosing why a subset of RMM agents stop receiving
server->agent WebSocket COMMAND (Text) frames while everything else about their connection
looks healthy. Rounds 1-2 established (summarized):

- Topology (agent path, NO Cloudflare; agents use grey-cloud `rmm-api`):
`agent -> endpoint gateway (e.g. UniFi UDR) -> public internet -> public IP 72.194.62.10
-> NPM (Nginx Proxy Manager, host .20, TERMINATES TLS) -> http://172.16.3.30:80 (origin nginx, 2nd layer)
-> proxy_pass 127.0.0.1:3001 (Rust server)`. Two nginx layers in series.
- Symptom: affected agents (PST-SERVER, PST-SERVER2 same site/UDR; GTS-PEDRO-H different site; all v0.6.63)
keep heartbeating (agent->server fine, last_seen fresh), do NOT trip their 90s no-inbound reconnect
(so they seem to receive at least the server's 30s WS Ping), but never ACK/execute commands (tiny ~80B AND
large ~4KB). ~40 other agents on the SAME NPM+nginx path receive commands fine. A forced server-side
eviction + fresh reconnect did NOT restore delivery.
- Current top hypotheses (both models, round 2): (H1) one of the two nginx layers mishandles/buffers
server->agent Text frames after the WS upgrade (origin `/ws` lacks an explicit `proxy_buffering off;`);
(H2) agent-side 0.6.63 receive/command path defect that survives reconnect; (H3) endpoint gateway
(UDR) selectively interferes with downstream Text frames on the un-CDN'd direct path. Cloudflare and MTU
were dropped.
- We have NOT yet localized where a Text frame dies (packet capture tooling was just installed; not run yet).
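Since the capture has not been run yet, here is a minimal sketch of how captured bytes could be classified once it has. It assumes the capture is taken on a plaintext hop (NPM -> 172.16.3.30:80, or loopback to 127.0.0.1:3001), because the agent-facing leg is TLS. It also assumes the TCP stream has already been reassembled, the HTTP upgrade response has been stripped, and the buffer ends on a frame boundary. `parse_frame_header` and `classify_stream` are hypothetical helper names; the header layout and opcode values follow RFC 6455.

```python
import struct
from typing import List, NamedTuple

# Opcode values defined by RFC 6455, section 5.2.
OPCODES = {
    0x0: "continuation",
    0x1: "text",
    0x2: "binary",
    0x8: "close",
    0x9: "ping",
    0xA: "pong",
}


class FrameHeader(NamedTuple):
    fin: bool
    opcode: str
    masked: bool
    payload_len: int
    header_len: int


def parse_frame_header(data: bytes) -> FrameHeader:
    """Decode the header of the WebSocket frame that starts at data[0]."""
    if len(data) < 2:
        raise ValueError("need at least 2 bytes for a frame header")
    fin = bool(data[0] & 0x80)
    opcode = OPCODES.get(data[0] & 0x0F, "reserved")
    masked = bool(data[1] & 0x80)
    length = data[1] & 0x7F
    offset = 2
    if length == 126:  # 16-bit extended payload length follows
        (length,) = struct.unpack_from("!H", data, offset)
        offset += 2
    elif length == 127:  # 64-bit extended payload length follows
        (length,) = struct.unpack_from("!Q", data, offset)
        offset += 8
    if masked:  # client->server frames carry a 4-byte masking key
        offset += 4
    return FrameHeader(fin, opcode, masked, length, offset)


def classify_stream(data: bytes) -> List[str]:
    """Walk back-to-back frames in a reassembled stream and list their opcodes."""
    kinds = []
    pos = 0
    while pos < len(data):
        hdr = parse_frame_header(data[pos:])
        kinds.append(hdr.opcode)
        pos += hdr.header_len + hdr.payload_len
    return kinds
```

Comparing the lists from two capture points in the server->agent direction would show between which hops a `text` frame disappears while `ping` frames keep flowing. A ~80B command fits the 7-bit length field and a ~4KB one uses the 16-bit extended length, so the sketch handles both header shapes.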

## The proposal to evaluate (from the operator)

**Remove NPM (Nginx Proxy Manager) from the equation entirely: NAT the public traffic straight to .30.**

Relevant constraints / facts:
- NPM (.20) is currently the PUBLIC TLS terminator for BOTH `rmm.azcomputerguru.com` (dashboard, Cloudflare
orange-cloud -> NPM) AND `rmm-api.azcomputerguru.com` (agents, grey-cloud DNS-only -> straight to the public
IP -> NPM). NPM holds the TLS certs.
- The origin nginx on .30 currently listens on :80 PLAINTEXT only (no :443, no TLS cert installed there today).
- Agents connect with `wss://rmm-api.azcomputerguru.com/ws`, so TLS is mandatory for them.
- So "NAT straight to .30" implies the edge firewall/router DNATs public :443 to .30, and .30's nginx must be
reconfigured to terminate TLS on :443 (install/move the cert) and proxy `/ws` -> 127.0.0.1:3001. Net effect:
ONE nginx layer instead of two; NPM out of the agent path.
- ~200 agents total. The dashboard path (`rmm`, Cloudflare) would also need a decision (keep it on NPM, or
also move it).
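The constraints above imply that the proposal changes more than one thing on the agent path at once. As a rough illustration, the sketch below models the current and proposed paths as hop lists and diffs them. The hop names come from the topology summary. The `role` strings, the `Hop` type, and the helper names are illustrative assumptions, and the current edge forwarding rule (public :443 to NPM on .20) is inferred rather than stated.

```python
from typing import List, NamedTuple, Tuple


class Hop(NamedTuple):
    name: str
    role: str


# Agent path today. Role strings are illustrative labels, not configuration.
CURRENT_PATH: List[Hop] = [
    Hop("agent", "wss client"),
    Hop("endpoint gateway", "NAT/firewall"),
    Hop("public IP 72.194.62.10", "forwards :443 to NPM (.20)"),  # assumed rule
    Hop("NPM (.20)", "TLS termination + proxy"),
    Hop("origin nginx (.30)", "plaintext :80 proxy"),
    Hop("Rust server 127.0.0.1:3001", "WebSocket endpoint"),
]

# Agent path under the proposal: the edge DNATs :443 to .30, which terminates TLS.
PROPOSED_PATH: List[Hop] = [
    Hop("agent", "wss client"),
    Hop("endpoint gateway", "NAT/firewall"),
    Hop("public IP 72.194.62.10", "DNAT :443 to .30"),
    Hop("origin nginx (.30)", "TLS termination on :443 + proxy"),
    Hop("Rust server 127.0.0.1:3001", "WebSocket endpoint"),
]


def changed_hops(before: List[Hop], after: List[Hop]) -> Tuple[List[Hop], List[Hop]]:
    """Return (entries only in `before`, entries only in `after`)."""
    only_before = [h for h in before if h not in after]
    only_after = [h for h in after if h not in before]
    return only_before, only_after


def proxy_layers(path: List[Hop]) -> int:
    """Count hops whose role label marks them as a proxy layer."""
    return sum(1 for h in path if "proxy" in h.role)


if __name__ == "__main__":
    old, new = changed_hops(CURRENT_PATH, PROPOSED_PATH)
    print("only in current path: ", [h.name for h in old])
    print("only in proposed path:", [h.name for h in new])
    print("proxy layers:", proxy_layers(CURRENT_PATH), "->", proxy_layers(PROPOSED_PATH))
```

With these definitions the first list holds the old edge, NPM, and origin-nginx entries, and the second holds the new edge and origin-nginx entries. The diff therefore shows three variables moving together: NPM is removed, the edge rule changes, and TLS termination moves to .30.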

## Please answer (each model, concise but rigorous)

1. As a DIAGNOSTIC bisection: if we remove NPM and the affected agents (PST-SERVER etc.) START receiving
commands, what does that prove? If they STILL don't, what does that prove? Is this a clean bisection of the
current hypothesis set (H1 nginx-layer vs H2 agent vs H3 gateway), or are there confounds?
2. As a permanent FIX/SIMPLIFICATION: pros/cons of collapsing to a single nginx layer on .30. What could break
or regress (TLS/cert handling, the Cloudflare-fronted dashboard, HTTP/2, the grey-cloud direct exposure of
.30 to the internet, security surface, cert renewal, etc.)?
3. Is there a LOWER-RISK / FASTER way to get the SAME diagnostic signal WITHOUT re-NATing production, e.g. a
parallel test listener, a single test agent pointed at a bypass hostname/port, an `/etc/hosts` override on
one endpoint, or adding `proxy_buffering off;`/WS hardening to one layer first? If so, specify it.
4. Net recommendation: do the NAT-straight-to-.30 change now, or do a specific cheaper test first? Give a
concrete ordered action list.

Be specific and call out anything in the operator's proposal that is risky or that wouldn't actually isolate
the variable they think it does.