# GuruRMM command-delivery diagnosis: ROUND 2 (important topology correction)

You analyzed this in round 1 (your prior answer is included at the bottom). A FACTUAL CORRECTION
to the architecture has come to light that invalidates part of the round-1 premise. Please
re-evaluate from scratch where needed. Be willing to discard your prior top hypothesis.

## CRITICAL CORRECTION to the ingress topology

In round 1 we told you: "agent -> Cloudflare -> nginx -> Rust server". **That was wrong for the
AGENT path.** Verified facts now:

- The agent's hard-coded WebSocket URL is **`wss://rmm-api.azcomputerguru.com/ws`** (from agent source:
  `DEFAULT_SERVER_URL`, and the agent config default). The installer and enrollment also use
  `rmm-api.azcomputerguru.com`.
- DNS / Cloudflare proxy status for the two hostnames (both have A records pointing to the same public IP 72.194.62.10):
  - `rmm.azcomputerguru.com` -> **proxied = true** (orange cloud; goes THROUGH Cloudflare). This is the human DASHBOARD.
  - `rmm-api.azcomputerguru.com` -> **proxied = false** (grey cloud; **DNS-only, BYPASSES Cloudflare**). This is what the AGENTS use.
- Therefore **Cloudflare is NOT in the agent's path at all.** All round-1 hypotheses about Cloudflare
  WAF / CF WebSocket buffering / CF frame handling are moot for agent command delivery. (Cloudflare only
  fronts the dashboard.)

## The ACTUAL agent path (verified)

```
agent (endpoint, e.g. PST-SERVER behind a UniFi UDR/"Cloudflare-Ultra"-class gateway)
  -> endpoint LAN/NAT
  -> public internet
  -> 72.194.62.10 (public IP; this is the NPM box)
  -> NPM = "Nginx Proxy Manager" on host 172.16.3.20 (terminates TLS; one nginx layer)
       NPM proxy-host settings for rmm-api: forward http://172.16.3.30:80, websockets=ON,
       http2_support=ON, block_exploits=ON (NPM "Block Common Exploits"), caching=OFF, advanced_config=EMPTY
  -> http://172.16.3.30:80 (the ORIGIN nginx; a SECOND nginx layer) -- PLAINTEXT HTTP over the LAN here
  -> proxy_pass http://127.0.0.1:3001 (the Rust server)
```

So there are **TWO nginx proxy layers** in series (NPM on .20, then origin nginx on .30), no CDN.

Origin nginx `/ws` block (verbatim):
```
location /ws {
    proxy_pass http://127.0.0.1:3001;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_read_timeout 86400;
}
```
(No explicit `proxy_buffering off;`, no `proxy_send_timeout`. NPM's generated config for the proxy host should be
the standard NPM template with websockets enabled + block_exploits, but we have not yet dumped its exact directives.)

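Since NPM's generated config on .20 has not been dumped yet, one way to compare the two nginx layers is to scan each config for the directives that matter for long-lived WebSocket proxying. The sketch below is illustrative only: `find_directives`, `WATCHED`, and `ORIGIN_WS_BLOCK` are hypothetical names (not part of NPM or nginx), the choice of watched directives is an assumption, and the line-based parsing assumes one directive per line, as in the origin block above.

```python
import re

# Directives worth comparing between the NPM layer (.20) and the origin layer (.30).
# This selection is an assumption about what is relevant, not an exhaustive audit.
WATCHED = (
    "proxy_http_version",
    "proxy_buffering",
    "proxy_read_timeout",
    "proxy_send_timeout",
)

# Matches simple "name value;" directives that sit on a single line.
DIRECTIVE_RE = re.compile(r"^(\w+)\s+(.*?);\s*$")


def find_directives(config_text, names=WATCHED):
    """Return {directive: [values...]} for each watched directive.

    An empty list means the directive never appears in the text, so
    nginx's built-in default applies at that layer.
    """
    found = {name: [] for name in names}
    for raw_line in config_text.splitlines():
        # Naive comment stripping: assumes no '#' inside quoted values.
        line = raw_line.split("#", 1)[0].strip()
        match = DIRECTIVE_RE.match(line)
        if match and match.group(1) in found:
            found[match.group(1)].append(match.group(2))
    return found


# The origin /ws block quoted in this document, used here as sample input.
ORIGIN_WS_BLOCK = """
location /ws {
    proxy_pass http://127.0.0.1:3001;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_read_timeout 86400;
}
"""

report = find_directives(ORIGIN_WS_BLOCK)
for name, values in report.items():
    print(f"{name}: {values if values else 'not set (nginx default applies)'}")
```

Feeding the same function the NPM-generated config from .20 would give a side-by-side view of which layer sets which timeout or buffering directive.
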
## What this changes / things to reconsider

- The "asymmetry" symptom is unchanged and still the core puzzle: for affected agents the server keeps
  receiving inbound traffic (heartbeats, last_seen fresh), the agent does NOT trip its 90s no-inbound reconnect
  (so it appears to be receiving at least the server's 30s WS Ping frames), yet server->agent COMMAND (Text)
  frames are never acked/executed. This holds for both ~4 KB commands AND ~80 B `hostname` commands. One command DID
  succeed earlier (right after the agent updated/reconnected). A server-forced eviction + fresh reconnect did
  NOT restore command delivery (a tiny command on the fresh connection was still never acked), but note we have
  NO confirmation the agent received anything server->agent on that fresh connection either.
- Affected: PST-SERVER + PST-SERVER2 (same site/UDR), and GTS-PEDRO-H (different site). For ~40 other agents,
  command delivery works fine through this SAME NPM+nginx path.
- Because rmm-api is grey-cloud, the agent's TLS connection goes straight from the endpoint's gateway to the
  public IP. There is no CDN normalizing MTU/keepalive/buffering on the path; the endpoint's own gateway
  (e.g. PST's UDR) is directly in the path.

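To check the Ping-versus-Text asymmetry on the wire, the plaintext .20->.30:80 leg can be inspected at the WebSocket framing level. The sketch below decodes the fixed part of a frame header as laid out in RFC 6455. `parse_frame_header` is a hypothetical helper, and it assumes the bytes passed in start exactly at a frame boundary (after the HTTP 101 upgrade response), which a raw TCP capture does not guarantee. It also assumes the frames cross that leg unmodified.

```python
# Opcode values defined by RFC 6455, section 5.2.
OPCODES = {
    0x0: "continuation",
    0x1: "text",
    0x2: "binary",
    0x8: "close",
    0x9: "ping",
    0xA: "pong",
}


def parse_frame_header(data):
    """Decode a WebSocket frame header from the start of `data`.

    Returns the frame type, whether the payload is masked, the declared
    payload length, and how many header bytes precede the payload.
    """
    if len(data) < 2:
        raise ValueError("need at least 2 bytes for a frame header")
    first, second = data[0], data[1]
    opcode = first & 0x0F
    masked = bool(second & 0x80)
    payload_len = second & 0x7F
    header_len = 2

    # 126 and 127 are escape values for 16-bit and 64-bit extended lengths.
    ext_bytes = {126: 2, 127: 8}.get(payload_len, 0)
    if ext_bytes:
        if len(data) < 2 + ext_bytes:
            raise ValueError("truncated extended payload length")
        payload_len = int.from_bytes(data[2:2 + ext_bytes], "big")
        header_len += ext_bytes

    # Client->server frames carry a 4-byte masking key; server->client do not.
    if masked:
        header_len += 4

    return {
        "fin": bool(first & 0x80),
        "rsv1": bool(first & 0x40),  # set when a compression extension is in use
        "opcode": OPCODES.get(opcode, f"reserved(0x{opcode:x})"),
        "masked": masked,
        "payload_len": payload_len,
        "header_len": header_len,
    }


# Hand-built sample headers, not captured data.
ping_frame = parse_frame_header(b"\x89\x00")
small_text = parse_frame_header(b"\x81\x50")
large_text = parse_frame_header(b"\x81\x7e\x10\x00")
for label, hdr in (("ping", ping_frame), ("80 B text", small_text), ("4 KiB text", large_text)):
    print(label, hdr)
```

In a capture, seeing unmasked `ping` headers but no `text` headers in the origin-to-NPM direction during a marked dispatch would point upstream of NPM; seeing both would point at NPM or beyond.
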
## A note on one earlier "test"

We tried to `tcpdump` the origin loopback (nginx->Rust on 127.0.0.1:3001) during a marked command dispatch and
saw 0 packets. However, that capture almost certainly FAILED (the backgrounded tcpdump was killed when the SSH
session closed), so treat it as NO DATA, not as evidence that the loopback is silent.

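Because the earlier capture died with the SSH session, a repeat attempt needs a capture process that does not depend on the login shell. What follows is a minimal sketch, assuming a Linux host with `tcpdump` installed and sufficient privileges. `build_capture_cmd` and `launch_detached` are hypothetical helper names, and the interface name `lo` and the output path are assumptions.

```python
import subprocess


def build_capture_cmd(iface, port, out_path):
    """Build a tcpdump argv that writes raw packets for one TCP port to a file."""
    return [
        "tcpdump",
        "-i", iface,     # capture interface (assumed "lo" for the loopback leg)
        "-n",            # skip name resolution
        "-U",            # flush each packet to the file as it arrives
        "-w", out_path,  # write a pcap file instead of printing
        "tcp", "port", str(port),
    ]


def launch_detached(cmd):
    """Start `cmd` in its own session so a terminal hangup does not reach it."""
    return subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # runs setsid() in the child before exec
    )


capture_cmd = build_capture_cmd("lo", 3001, "/tmp/ws-loopback.pcap")
print(" ".join(capture_cmd))

# To actually start the capture (not executed here):
#   proc = launch_detached(capture_cmd)
#   print("tcpdump pid:", proc.pid)   # stop it later with: kill <pid>
```

Checking that the pcap file keeps growing after the SSH session is closed and reopened would confirm the capture survived, so an empty result can be read as real silence and not as a dead process.
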
## Please now provide (round 2)

1. Your REVISED ranked hypotheses given Cloudflare is out of the agent path and there are two nginx layers +
   the endpoint gateway directly in the path. Explicitly say which round-1 hypotheses you drop or demote.
2. The single most diagnostic next test (we can run packet captures on the .30 host on the plaintext .20->.30:80
   leg and on the 127.0.0.1:3001 loopback; we can read NPM's generated nginx config on .20; we can query the
   UDR; we currently CANNOT get shell/logs on the affected endpoints). For each plausible outcome, what it tells you.

---
## YOUR ROUND-1 ANSWER (for reference)
__ROUND1__