21 lines
2.5 KiB
Plaintext
21 lines
2.5 KiB
Plaintext
[OK] [VERDICT] The failure is localized to the **NPM (.20) -> Agent** segment. The origin-to-NPM hop is healthy.
|
|
|
|
### 1. Updated Hypotheses
|
|
* **H1 (Top): NPM Connection-Specific Failure / State Corruption.** (Confidence: High). Since ~40 agents work, the "relay logic" isn't globally broken. However, NPM/OpenResty may have stalled the specific TCP/WS streams for these agents (e.g., full send-buffers, socket-state mismatch, or worker-process hang). The fact that the frame reaches NPM but doesn't reach the agent suggests NPM accepted the frame from origin but failed to "push" it to the downstream socket.
|
|
* **H2: WAN / MTU / Black-Hole (H3).** (Confidence: Medium). A specific path MTU issue or firewall "stale entry" at the UniFi UDR (PST site) could be black-holing downstream packets that exceed a certain size or signature, while allowing the upstream heartbeats.
|
|
* **H3: Agent-side Socket Freeze.** (Confidence: Low). If the agent OS/NIC was dropping packets, we would likely see the agent trip its 90s timeout.
|
|
|
|
**Implication of "Subset of Agents":** This points away from global NPM config and toward **per-connection state**. NPM might be holding "ghost" connections or failing to drain buffers for specific client IPs.
|
|
|
|
### 2. The Decisive Next Action: Packet Capture on NPM (.20)
|
|
**Action:** `tcpdump` on NPM's LAN interface, filtering for the **Agent's Public IP** (98.190.129.150) while sending a command.
|
|
|
|
* **Outcome A (NPM emits nothing):** If NPM receives the plaintext frame from .30 but emits **zero** TLS application data toward the agent IP, **NPM is the culprit.** (Logic: The proxy is swallowing the frame internally).
|
|
* **Outcome B (NPM emits, no ACK):** If NPM emits TLS packets but sees **TCP Retransmissions**, the **Internet/UDR** is black-holing the traffic.
|
|
* **Outcome C (NPM emits, Agent ACKs, still no execution):** The **Agent** is receiving the data but the WS stack inside the agent is dead/frozen.
|
|
|
|
### 3. Verdict on "Remove NPM" Idea
|
|
The "Remove NPM" idea is now **highly justified** as a diagnostic isolation step. Since we have provably tracked the packet *to* NPM's doorstep, bypassing it eliminates the most complex variable (TLS termination and WS proxying).
|
|
|
|
**Cheaper alternative first:** Before a full NAT bypass, **force-restart the NPM Docker container.** If the "stuck" agents immediately ACK commands upon reconnecting, you have confirmed an NPM-internal state corruption (H1) without changing infrastructure. If a restart doesn't fix it, proceed immediately to the NAT bypass (Direct to .30).
|