claudetools/tmp/r3-gemini.txt

[INFO] Independent Reviewer Verdict: The "NAT-straight-to-.30" proposal is a high-effort diagnostic that provides an "unclean" bisection. It should be deferred in favor of a 5-minute configuration change and a packet capture.

### 1. Diagnostic Bisection Analysis
- **If it works (commands resume):** It proves the failure occurs at the NPM layer (.20) or in the NPM-to-Origin network leg. **Critique:** It does *not* distinguish between an Nginx configuration error (H1) and an NPM-specific resource exhaustion or OS-level socket leak.
- **If it fails (still no commands):** It exonerates NPM. The issue is localized to the origin Nginx, the Rust server, the Agent (H2), or the Gateway (H3).
- **Confounds:** This is NOT a clean bisection. Moving TLS termination from NPM to Origin changes the TLS stack (OpenResty vs. Nginx), cipher suites, and TCP window behavior. If the Gateway (H3) is dropping packets based on specific TCP/TLS fingerprinting or MTU issues triggered by NPM's headers, the "fix" might be a side effect, not a localization.

### 2. Permanent Fix: Pros and Cons
- **Pros:**
    - Reduces architectural complexity and "double-proxying" overhead.
    - Eliminates one potential point of failure and logging fragmentation.
- **Cons:**
    - **Security Surface:** .30 is now directly exposed. NPM acts as a sacrificial buffer; losing it moves the "front line" to the application host.
    - **Cert Management:** You must migrate and automate Certbot/ACME on .30. If this is manual, you risk expiry-related downtime.
    - **Fragmentation:** If the Cloudflare-fronted dashboard stays on NPM, you now have two different TLS entry points and two different Nginx configs to maintain.
    - **Cloudflare Integration:** You lose the ability to use NPM’s GUI to quickly toggle ACLs or WAF settings for the agent path.

### 3. Lower-Risk / Faster Diagnostic Signals
The current plan skips the "Evidence" phase and jumps to "Structural Change."
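- **A. Config Hardening (The "Magic" Flag):** Add `proxy_buffering off;` to the `/ws` block on BOTH NPM and Origin. The H1 theory is that a proxy layer holds back small upstream Text frames while WebSocket Ping/Pong control frames (which are WebSocket-level, not TCP-level) still pass. One caveat: once the `Upgrade` handshake succeeds, Nginx tunnels the connection and response buffering should no longer apply, so also confirm that the block sets `proxy_http_version 1.1;` and forwards the `Upgrade` and `Connection` headers. This is still the cheapest H1 candidate to rule out.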
- **B. Targeted PCAP (The Smoking Gun):** Since `tcpdump` is installed, run it on both .20 and .30 simultaneously:
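  `tcpdump -i any -w capture.pcap 'port 3001 or port 80'`
  Send a command. If the frame is seen entering .20 but not leaving it, NPM is the killer. If it clears both hosts but isn't seen at the agent's WAN, H3 is the remaining suspect. Frame contents are only readable on plaintext legs; on TLS legs you can compare packet timing and sizes but not payloads.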
- **C. Single-Agent Pivot:** Instead of re-NATing all 200 agents, NAT a non-standard port (e.g., 4443) straight to .30. Point *one* affected agent to `wss://rmm-api.azcomputerguru.com:4443/ws`. This isolates the variable for PST-SERVER without risking the fleet.
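
Option B depends on telling Text frames apart from Ping frames in the captured bytes. The following is a minimal Python sketch of that classification, based on the RFC 6455 frame layout. It assumes you have already exported one direction of a plaintext leg as raw bytes, starting after the HTTP `101` response (for example via a "follow stream" export in a capture viewer). `Frame` and `parse_frames` are illustrative names, not part of any tool mentioned here, and the sketch does not handle extensions such as permessage-deflate, which would leave payloads compressed.

```python
import struct
from typing import Iterator, NamedTuple

# Opcode values defined by RFC 6455, section 5.2.
OPCODES = {
    0x0: "continuation",
    0x1: "text",
    0x2: "binary",
    0x8: "close",
    0x9: "ping",
    0xA: "pong",
}


class Frame(NamedTuple):
    fin: bool
    opcode: str
    payload: bytes


def parse_frames(stream: bytes) -> Iterator[Frame]:
    """Yield complete WebSocket frames from a reassembled plaintext byte stream.

    Parsing stops silently at the first truncated frame.
    """
    pos = 0
    while pos + 2 <= len(stream):
        b0, b1 = stream[pos], stream[pos + 1]
        pos += 2
        fin = bool(b0 & 0x80)
        opcode = OPCODES.get(b0 & 0x0F, "reserved")
        masked = bool(b1 & 0x80)  # set on client-to-server frames
        length = b1 & 0x7F
        if length == 126:  # 16-bit extended length
            if pos + 2 > len(stream):
                return
            (length,) = struct.unpack_from("!H", stream, pos)
            pos += 2
        elif length == 127:  # 64-bit extended length
            if pos + 8 > len(stream):
                return
            (length,) = struct.unpack_from("!Q", stream, pos)
            pos += 8
        key = b""
        if masked:
            if pos + 4 > len(stream):
                return
            key = stream[pos:pos + 4]
            pos += 4
        if pos + length > len(stream):
            return
        payload = stream[pos:pos + length]
        pos += length
        if masked:
            payload = bytes(b ^ key[i % 4] for i, b in enumerate(payload))
        yield Frame(fin, opcode, payload)
```

Running this over the same direction of the stream captured on .20 and on .30 and comparing the `text` frames would show whether a command frame that left one host ever arrived at the other.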

### 4. Net Recommendation
[WARN] Do NOT re-NAT the production fleet yet. The risk of TLS/cert misconfiguration and downtime is high for a diagnostic gain that can be achieved more cheaply.

**Ordered Action List:**
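1. **Apply `proxy_buffering off;` and `proxy_read_timeout 86400;`** (86400 seconds, i.e. 24 hours) to the `/ws` location blocks on both Nginx layers. Reload or restart Nginx; either way, expect existing WebSocket sessions to reconnect. (Time: 5 mins).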
2. **Execute a dual-sided `tcpdump` capture.** Verify if the Text frame (search for command payload strings in hex) survives the jump from .30 to .20. (Time: 15 mins).
3. **Internal Bypass Test:** From a machine that can reach .30 directly over the LAN (if possible), use `wscat` or a similar tool to connect to .30:80 and see whether commands arrive when TLS and NPM are bypassed locally.
4. **Pivot only if Step 2 shows NPM swallowing frames.** If you must pivot, use the single-agent parallel-port method (Section 3, option C) before moving the entire fleet.
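
Before and after step 1, it can help to check mechanically that both `/ws` blocks carry the same directives. The sketch below is one way to do that in Python, under stated assumptions: it treats the config as plain text, finds the first `location /ws { ... }` block, and reports which expected directives are absent. The `proxy_http_version`, `Upgrade`, and `Connection` entries are not from the list above; they are the standard Nginx settings for WebSocket proxying and are included as an assumption about what a working block needs. `REQUIRED`, `location_body`, `missing_directives`, and the sample upstream address are all hypothetical.

```python
import re
from typing import List, Optional

# Directive name -> regex that must match inside the location block.
# The last two come from step 1; the first three are assumed prerequisites.
REQUIRED = {
    "proxy_http_version 1.1": r"proxy_http_version\s+1\.1\s*;",
    "Upgrade header": r"proxy_set_header\s+Upgrade\s+\$http_upgrade\s*;",
    "Connection header": r"proxy_set_header\s+Connection\s+\S+\s*;",
    "proxy_read_timeout": r"proxy_read_timeout\s+\S+\s*;",
    "proxy_buffering off": r"proxy_buffering\s+off\s*;",
}


def location_body(conf: str, path: str) -> Optional[str]:
    """Return the text inside the first `location <path> { ... }` block."""
    m = re.search(r"location\s+(?:[=~^*]+\s+)?" + re.escape(path) + r"\s*\{", conf)
    if not m:
        return None
    depth = 1
    for i in range(m.end(), len(conf)):
        if conf[i] == "{":
            depth += 1
        elif conf[i] == "}":
            depth -= 1
            if depth == 0:
                return conf[m.end():i]
    return None  # unbalanced braces


def missing_directives(conf: str, path: str = "/ws") -> List[str]:
    """List the REQUIRED directives that the location block does not set."""
    body = location_body(conf, path)
    if body is None:
        return list(REQUIRED)
    body = re.sub(r"#.*", "", body)  # ignore commented-out directives
    return [name for name, pat in REQUIRED.items() if not re.search(pat, body)]


# Hypothetical config; the upstream address is a placeholder.
SAMPLE = """
server {
    location /ws {
        proxy_pass http://127.0.0.1:3001;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        # proxy_buffering off;
        proxy_read_timeout 86400;
    }
}
"""

if __name__ == "__main__":
    print(missing_directives(SAMPLE))
```

This is a text-level check only: it does not follow `include` files or inherited settings, so an empty result means the directives are present in that block, not that Nginx is applying them as intended.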

**Confidence:** 90% that the issue is `proxy_buffering` or a TCP keep-alive mismatch between the two Nginx layers, rather than a fundamental flaw in NPM itself.