# GuruRMM command-delivery diagnosis: ROUND 4 (new packet-capture evidence)

Quorum of two models, continuing. New hard evidence has localized the failure further. Please
update your hypothesis ranking and pick the single best next action.

## Recap (established)

- Agent path (NO Cloudflare; agents use grey-cloud rmm-api): `agent -> endpoint gateway (UniFi UDR) ->
  internet -> public IP 72.194.62.10 -> edge firewall DNAT -> NPM (Nginx Proxy Manager, host .20, Docker
  on an Unraid box; TERMINATES TLS) -> http://172.16.3.30:80 (origin nginx) -> 127.0.0.1:3001 (Rust server)`.
- Symptom: affected agents (PST-SERVER and PST-SERVER2 at the same site/UDR; GTS-PEDRO-H at a different site;
  all v0.6.63) heartbeat fine (agent->server OK) and don't trip their 90s no-inbound reconnect, but never
  ACK or execute server->agent commands (tiny 80B AND 4KB). ~40 other agents on the SAME path get commands fine.

## NEW EVIDENCE: packet capture on the origin host (.30), both legs simultaneously

We dispatched 5 uniquely-marked tiny `hostname-FINALMARKn-www` commands to the stuck PST-SERVER while
capturing on .30 (loopback + LAN). Findings (payloads are PLAINTEXT: WS permessage-deflate is NOT in use,
so frames are directly readable; agent->server frames are masked as expected):

- On the **loopback :3001** leg (Rust server <-> origin nginx): the command Text frame is present, e.g.:
  `localhost.3001 > localhost.54036 {"type":"command","payload":{"id":"c109e270...","command":"hostname-FINALMARK1-www",...}}`
  => **The Rust server DID emit the WS Text frame.**
- On the **.30 -> .20:80** leg (origin nginx -> NPM): the SAME command frame is present, e.g.:
  `gururmm.http > 172.16.3.20.45614 {"type":"command",...FINALMARK1...}`
  => **The origin nginx DID forward the frame out toward NPM (.20).**
- So the frame provably traverses the Rust server AND the origin nginx and is sent to NPM. It still never
  reaches the agent (no ACK, no execution).
- (Minor: the marker appeared 13x on loopback vs 8x on the LAN leg over the window, possibly retransmit/re-offer
  timing; not yet interpreted. We did NOT capture on NPM (.20) or on the agent.)

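The 13x vs 8x marker discrepancy could be broken down per command before anyone interprets it. The following is a minimal sketch, assuming each capture leg has been exported as text (for example with `tcpdump -A`) and that the markers follow the `hostname-FINALMARKn-www` pattern quoted above. The function names and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Matches the uniquely-marked commands described above, e.g. hostname-FINALMARK3-www.
MARKER_RE = re.compile(r"hostname-FINALMARK(\d+)-www")


def count_markers(capture_lines):
    """Count how often each FINALMARK index appears in one leg's text dump."""
    counts = Counter()
    for line in capture_lines:
        for index in MARKER_RE.findall(line):
            counts[int(index)] += 1
    return counts


def compare_legs(loopback_lines, lan_lines):
    """Return {marker index: (loopback count, LAN count)} for every marker seen."""
    loopback = count_markers(loopback_lines)
    lan = count_markers(lan_lines)
    return {
        index: (loopback[index], lan[index])
        for index in sorted(set(loopback) | set(lan))
    }


if __name__ == "__main__":
    # Hypothetical sample lines, shaped like the excerpts quoted above.
    loopback_sample = [
        'localhost.3001 > localhost.54036 {"type":"command","payload":{"command":"hostname-FINALMARK1-www"}}',
        'localhost.3001 > localhost.54036 {"type":"command","payload":{"command":"hostname-FINALMARK1-www"}}',
        'localhost.3001 > localhost.54036 {"type":"command","payload":{"command":"hostname-FINALMARK2-www"}}',
    ]
    lan_sample = [
        'gururmm.http > 172.16.3.20.45614 {"type":"command","payload":{"command":"hostname-FINALMARK1-www"}}',
    ]
    for index, (on_loopback, on_lan) in compare_legs(loopback_sample, lan_sample).items():
        print(f"FINALMARK{index}: loopback={on_loopback} lan={on_lan}")
```

Per-marker counts make it easier to check that every one of the five commands appears at least once on both legs, which is the property the conclusion above relies on.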
## What this rules in/out

- Refuted: "Rust send is lying" and "origin nginx (.30) swallows/buffers the frame". The frame leaves both.
- The failure is now localized to: **NPM (.20) itself, OR the NPM->agent leg (public internet / the endpoint's
  UDR gateway), OR the agent.**

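The elimination above amounts to walking the server->agent path and keeping only the hops after the last point where the frame was seen leaving. A small sketch of that bookkeeping follows; the hop labels and the `Observation` enum are illustrative names, not anything taken from the GuruRMM codebase.

```python
from enum import Enum


class Observation(Enum):
    SEEN = "frame observed leaving this hop"
    UNKNOWN = "no capture on this hop yet"
    MISSING = "frame expected but no effect observed"


# Server->agent direction, in order. Labels are illustrative.
PATH = [
    ("Rust server (127.0.0.1:3001)", Observation.SEEN),
    ("origin nginx (.30)", Observation.SEEN),
    ("NPM (.20)", Observation.UNKNOWN),
    ("NPM->agent leg (internet / endpoint UDR)", Observation.UNKNOWN),
    ("agent", Observation.MISSING),
]


def suspects(path):
    """Return every hop after the last one where the frame was seen leaving."""
    last_seen = -1
    for position, (_, observation) in enumerate(path):
        if observation is Observation.SEEN:
            last_seen = position
    return [name for name, _ in path[last_seen + 1:]]


if __name__ == "__main__":
    for name in suspects(PATH):
        print("still suspect:", name)
```

Updating a hop to `SEEN` after a new capture (for instance on .20) shrinks the suspect list in the same way the .30 capture did.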
## Constraints on next tests

- We CAN now capture on the NPM box (.20, Unraid host; we have root). The NPM->agent leg is TLS (encrypted),
  so on .20 we could see whether NPM emits TLS application-data toward the agent's WAN IP (98.190.129.150 for
  PST) shortly after receiving each command, and whether those packets are ACK'd or retransmitted (a black-hole
  signature), but we cannot read the frame contents on that leg.
- We have a CATCH-22 for any "point the agent at a bypass" test: the only way to reconfigure an affected
  agent's server URL is to send it a command/config, which is the very channel that's broken. There is no
  known second, WORKING agent at the PST site. (On-site access would be required to re-point an agent there.)
- We can freely add config to NPM (via its API/advanced-config) and to the origin nginx, and reload.
- ~40 other agents work fine through this identical NPM. The affected set is a subset.

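The flow/retransmit analysis described in the first constraint can be sketched as a per-command classifier. This is a heuristic under stated assumptions, not a finished tool: it assumes the .20 capture has already been parsed into simple records for the single NPM<->agent TCP flow, that the time each command arrived at NPM on the plaintext :80 leg is known, and that no unrelated outbound payload falls inside the window. The `Segment` type, the `classify_command` function, and the 1-second window are all hypothetical, and sequence-number wraparound is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One TCP segment on the NPM<->agent flow, as parsed from a capture."""
    ts: float        # capture timestamp in seconds
    seq: int         # TCP sequence number of the first payload byte
    length: int      # TCP payload length in bytes
    outbound: bool   # True for NPM -> agent, False for agent -> NPM
    ack: int = 0     # acknowledgement number (used on inbound segments)


def classify_command(command_ts, segments, window=1.0):
    """Classify what the encrypted leg did after one command reached NPM.

    Returns one of:
      "not_emitted"   - no outbound payload within `window` seconds
      "acked"         - the first outbound payload was acknowledged
      "retransmitted" - the same sequence number was sent again, never acknowledged
      "unacked"       - payload sent once, no acknowledgement in the capture
    """
    sent = sorted(
        (
            s for s in segments
            if s.outbound and s.length > 0
            and command_ts <= s.ts <= command_ts + window
        ),
        key=lambda s: s.ts,
    )
    if not sent:
        return "not_emitted"

    first = sent[0]
    end_of_payload = first.seq + first.length  # wraparound ignored (assumption)
    acked = any(
        (not s.outbound) and s.ts >= first.ts and s.ack >= end_of_payload
        for s in segments
    )
    if acked:
        return "acked"

    resent = any(
        s.outbound and s.length > 0 and s.seq == first.seq and s.ts > first.ts
        for s in segments
    )
    return "retransmitted" if resent else "unacked"
```

In the terms used above, `"retransmitted"` corresponds to the black-hole signature and `"not_emitted"` to NPM receiving the command without sending anything toward the agent's WAN IP. Because the leg is encrypted, `"acked"` only shows that the TCP peer acknowledged the bytes, not that the agent processed the frame.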
## Please answer (each model, concise)

1. Updated ranked hypotheses given the frame reaches NPM but not the agent. Where do you now put: NPM
   buffering/relay defect; NPM->WAN/UDR black-hole (H3); agent-side defect (H2)? What does "subset of agents
   on the same NPM" imply about NPM-internal vs per-connection-WAN causes?
2. The single most decisive next test we can actually run (given we can capture on .20 but CANNOT easily
   reconfigure the broken agents). For each outcome, what it tells us. Specifically: is a .20 capture of the
   NPM->agent leg (encrypted; flow/retransmit analysis only) decisive enough, or is there a better move?
3. Does this new evidence change the verdict on the operator's "remove NPM / NAT straight to .30" idea: is it
   now more justified, or is there still a cheaper test that isolates NPM specifically?