**1. Re-rank and shared attribute**
Per-site-gateway (H3) drops sharply. The three affected sites sit behind three independent ISPs and three distinct public IPs; two are explicitly UniFi UDR and the third is on different hardware and a different ISP. All three show the *exact* same signature: heartbeats and agent→server frames work, while server→agent Text/command frames disappear after the origin nginx hands them to NPM. That is extremely unlikely to be three independent gateway black holes or asymmetric-routing failures. A gateway-level cause would have to be replicated across vendors, or the gateways would all have to react identically to some property of packets sourced from the RMM host's single public IP (for example, all mishandling a particular TCP option, window scale, or large frame). That is possible in theory but is now a low-probability explanation.
Agent-side (H2) and NPM-relay/state rise. The common element is no longer "the network path" but "these three long-lived connections are currently terminated at this specific NPM instance and are being handled by the agents running on these three machines." Global NPM and global 0.6.63 are already falsified by the 40 working agents.
**Most likely shared attribute class (concrete + falsifiable):**
- **Windows Server OS vs workstation OS (or "server role" detection).**
  Two of the three hostnames (PST-SERVER, PST-SERVER2) suggest servers; GTS-PEDRO-H does not, so its role needs confirming rather than assuming. The working guess is that the 40 working agents are mostly Win10/11 workstations and laptops that sleep, reboot, or lose network frequently, while the three affected machines are always-on hardware running Windows Server (or treated as servers). This is easily falsified: query the agent-reported `os_version`, `platform`, `product_name`, or any "is_server" / role field for exactly these three rows versus a sample of 10–15 working agents. Also check reported hostname patterns and any installer "server" branch.
- **Connection age / long-lived without reconnect (second strongest).**
  Servers rarely sleep or reboot, so their WS connections can be days or weeks old. Workstations cycle connections on wake/suspend. A resource leak, buffer state, or internal nginx fd event-loop entry that only manifests after N hours/days of a single upgraded connection would hit precisely the always-on machines. Falsifiable: pull the `connected_at`, `last_heartbeat_at`, or equivalent session start timestamp (or derive from metrics/logs) for the three vs. others. Also look at any "reconnect count" or "uptime" the agents report.
- **Enrollment/update cohort or config snapshot at connection time.**
  Less likely but cheap to check: same narrow enrollment window, same build hash + config version at first connect, or a one-time policy push that only these three received before the population diverged. Query enrollment timestamp, last update timestamp, and any per-agent policy or "features" blob.
- **Server-grade NIC / TCP offload / RSS / checksum offload.**
  Common on physical servers. These can produce subtly different TCP behavior (TSO, LRO, delayed ACKs, different window scaling) that interacts badly with the return path through Docker bridge + NPM + internet when the proxy is doing WS frame forwarding. Harder to observe remotely but can be inferred if the OS check points to servers and packet captures later show retransmit or zero-window behavior only on these three.
- Weaker: pure idle pattern (servers send almost no other traffic) or specific outbound NAT characteristics at the three sites. These are possible but secondary to the OS/role + age hypotheses.
The pattern "these three but not the other ~40 on identical infrastructure and version" is the signature of either (a) a property of the *machines themselves* (OS, hardware, role, uptime) or (b) a per-connection state that only these three conns accumulated inside NPM.
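The comparison described above can be run mechanically. The sketch below lists the fields whose values never overlap between the affected and working groups, and checks whether connection age alone separates them. It is a minimal illustration, not the RMM's actual schema: the field names (`agent_id`, `os_family`, `agent_version`, `conn_age_hours`) and every sample value are hypothetical.

```python
from typing import Any

Row = dict[str, Any]

# Hypothetical rows; real ones would come from the agent DB or telemetry.
AFFECTED: list[Row] = [
    {"agent_id": "a1", "os_family": "windows-server", "agent_version": "0.6.63", "conn_age_hours": 310},
    {"agent_id": "a2", "os_family": "windows-server", "agent_version": "0.6.63", "conn_age_hours": 290},
    {"agent_id": "a3", "os_family": "windows-server", "agent_version": "0.6.63", "conn_age_hours": 402},
]
WORKING: list[Row] = [
    {"agent_id": "w1", "os_family": "windows-workstation", "agent_version": "0.6.63", "conn_age_hours": 5},
    {"agent_id": "w2", "os_family": "windows-workstation", "agent_version": "0.6.63", "conn_age_hours": 31},
]


def separating_fields(
    affected: list[Row],
    working: list[Row],
    ignore: frozenset[str] = frozenset({"agent_id"}),
) -> dict[str, set]:
    """Return fields whose affected-group values never occur in the working group.

    Identity fields (unique per agent) separate trivially, so they are ignored.
    """
    fields = set().union(*(row.keys() for row in affected)) - ignore
    result: dict[str, set] = {}
    for field in sorted(fields):
        affected_values = {row.get(field) for row in affected}
        working_values = {row.get(field) for row in working}
        if affected_values.isdisjoint(working_values):
            result[field] = affected_values
    return result


def age_separates(affected: list[Row], working: list[Row], field: str = "conn_age_hours") -> bool:
    """True if every affected connection is older than every working one."""
    return min(r[field] for r in affected) > max(r[field] for r in working)


if __name__ == "__main__":
    print(separating_fields(AFFECTED, WORKING))
    print(age_separates(AFFECTED, WORKING))
```

A field that shows up here is only a candidate. With three affected rows, coincidental separation is easy, so each hit still needs a causal explanation and, ideally, a counterexample search across the full working population rather than a sample.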
**2. http2 + proxy_buffering + WS upgrade downstream stall**
Yes, this combination is a known source of subtle, one-directional problems on long-lived upgraded connections.
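- `http2 on` applies only to the client→NPM leg, and the classic WebSocket handshake (`Upgrade: websocket` answered with `101 Switching Protocols`) is an HTTP/1.1 mechanism, so the agents' WS connections should be negotiated as HTTP/1.1 on that leg even with `http2` enabled on the listener. The backend leg is forced to 1.1 with the explicit `proxy_http_version 1.1` + Upgrade/Connection headers. This is the standard and usually correct way to do WS through a terminating proxy, and it makes `http2` by itself a weak suspect.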
- However, `proxy_buffering on` (the nginx default, and not overridden) still governs ordinary response handling. After a 101 upgrade, nginx is *supposed* to stop using the response buffer and relay bytes in both directions between the two sockets, so buffering should not apply to WS frames. Whether the nginx build inside this NPM container has edge cases here is unverified (check the actual version with `nginx -v` inside the container). The candidate failure modes would be:
  - Large or bursty server→client Text/Binary frames still triggering internal temporary buffering or hitting the proxy's send buffer limits.
  - Silent stalls: the "an upstream response is buffered to a temporary file" warnings are only logged for non-upgraded responses, so a stuck WS frame would not produce one.
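- `proxy_read_timeout 60s` (default) bounds how long nginx waits between reads from the upstream. When it fires on an upgraded connection, nginx tears down the proxied connection as a whole, so the agent would see a disconnect and reconnect rather than a half-working session. The heartbeats also reach Rust over the same upstream connection that carries the command frames, which means the upstream leg of each stuck connection is demonstrably alive. A plain read timeout therefore does not by itself explain the observed symptom (client can still send, but server frames stop being delivered). It is still worth raising, because it removes one variable on otherwise idle links.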
- Docker bridge + NAT on the client side of NPM adds another layer of buffering and potential delayed ACK or window interaction.
- There are reports of nginx WS proxying (including NPM setups) where one direction, commonly server→client, stops flowing while pings/heartbeats continue, especially on connections that have been upgraded for a long time or when the client read side is slow. The usual mitigation is to add `proxy_buffering off;`, set `proxy_read_timeout 300s` or higher, and sometimes raise `proxy_send_timeout` as well.
The absence of WS-specific errors in the NPM log is consistent with a silent stall rather than a hard failure. The fact that only three connections are affected is also consistent: most of the other 40 either reconnected more recently (fresh state) or are less idle (different buffering/timeout trigger points).
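Since the argument above depends on which proxy directives are explicitly set and which silently fall back to nginx defaults, a quick check of the generated location block helps. The sketch below is a deliberately naive line-based parser (one directive per line, no `include` resolution), and the sample location body and upstream hostname are hypothetical. The defaults listed are the documented nginx defaults for these four directives.

```python
import re

# Documented nginx defaults for the directives discussed in this section.
NGINX_DEFAULTS = {
    "proxy_http_version": "1.0",
    "proxy_buffering": "on",
    "proxy_read_timeout": "60s",
    "proxy_send_timeout": "60s",
}

_DIRECTIVE = re.compile(r"^(\w+)\s+(.+?)\s*;$")

# Hypothetical location body; the real one is whatever NPM generated for rmm-api.
SAMPLE_LOCATION = """
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_pass http://origin.example:8443;  # hypothetical upstream
"""


def explicit_directives(location_body: str) -> dict[str, str]:
    """Map each directive of interest to the value explicitly configured for it."""
    found: dict[str, str] = {}
    for raw in location_body.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        match = _DIRECTIVE.match(line)
        if match and match.group(1) in NGINX_DEFAULTS:
            found[match.group(1)] = match.group(2)
    return found


def effective_settings(location_body: str) -> dict[str, tuple[str, bool]]:
    """Return {directive: (effective value, was it set explicitly?)}."""
    explicit = explicit_directives(location_body)
    return {
        name: (explicit.get(name, default), name in explicit)
        for name, default in NGINX_DEFAULTS.items()
    }


if __name__ == "__main__":
    for name, (value, is_explicit) in effective_settings(SAMPLE_LOCATION).items():
        origin = "explicit" if is_explicit else "default"
        print(f"{name:20s} {value:6s} ({origin})")
```

Directives inherited from the `http` or `server` level would not be seen by this check, so dumping the fully resolved config with `nginx -T` inside the container and inspecting that is the more reliable input.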
**3. Most discriminating test sequence (next 2–3 rounds), ordered by information gained per unit of risk**
Risk here is primarily "disruption to the ~200 agents" (NPM restart or reload) and secondarily "time to next data point." We have passive/low-risk moves that directly attack the "what is special about these three" and "is data even leaving NPM" questions.
**Round 5–6 (near-zero risk, highest immediate value):**
- Pull the agent DB rows (or equivalent telemetry) for the three affected + a representative sample of working agents on the same NPM/0.6.63. Extract: OS/version/platform strings, hostname, enrollment/first-seen, connection/session start time or age, last reconnect, any policy/site/customer tag, reported uptime or last reboot, any capability or config hash differences. This is the single highest-leverage action right now — it will either confirm or kill the Windows Server + long-lived-conn hypotheses within minutes.
- While that query runs, complete/analyze the in-progress NPM-box capture of the NPM→agent leg for the three IPs. Correlate exact timestamps of a deliberately sent test command (Text frame) with what appears on the wire.
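- On the .20 host, run `ss -tpi` / `ss -tan` (and `conntrack` if available) filtered to the three public IPs. Because NPM sits on a Docker bridge, the proxied sockets live in the container's network namespace and will not appear in the host's own `ss` listing; run `ss` inside that namespace instead (for example via `docker exec`, or `nsenter -t <container-pid> -n ss -tpi` from the host) and use host-side `conntrack` for the NAT entries. Look at Send-Q/Recv-Q sizes, timer state, retransmit counts, and window sizes right after a test command is injected. Also capture (tcpdump/wireshark) on the host's public interface for traffic to those three destinations. This tells us immediately whether the frames are leaving the host at all, whether they are retransmitting, or whether the sockets look "healthy but idle."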
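To make the `ss` check repeatable across the three peers, the socket table can be filtered in a few lines. The sketch below parses the summary lines of `ss -tan` style output (State, Recv-Q, Send-Q, local, peer) and flags sockets to the peers of interest that have unacknowledged bytes queued. The sample output and all addresses are made up (documentation ranges), and only IPv4 `addr:port` peers are handled.

```python
# Made-up sample in the column layout printed by `ss -tan`.
SAMPLE_SS = """\
State  Recv-Q  Send-Q  Local Address:Port   Peer Address:Port
ESTAB  0       0       192.0.2.20:443       198.51.100.7:50112
ESTAB  0       4380    192.0.2.20:443       203.0.113.9:61544
"""


def flag_sockets(ss_output: str, peer_ips: set[str]) -> list[dict]:
    """Return one record per socket whose peer IP is in peer_ips.

    On an established TCP socket, Send-Q is the number of bytes sent but not
    yet acknowledged by the peer, so a persistently non-zero value right after
    a test command suggests the data left the application but is stuck on the path.
    """
    rows = []
    for line in ss_output.splitlines():
        parts = line.split()
        # Skip the header and any `-i` continuation lines (non-numeric queue columns).
        if len(parts) < 5 or not parts[1].isdigit() or not parts[2].isdigit():
            continue
        state, recv_q, send_q, _local, peer = parts[:5]
        peer_ip = peer.rsplit(":", 1)[0]
        if peer_ip in peer_ips:
            rows.append(
                {
                    "state": state,
                    "recv_q": int(recv_q),
                    "send_q": int(send_q),
                    "peer": peer,
                    "unacked": int(send_q) > 0,
                }
            )
    return rows


if __name__ == "__main__":
    for row in flag_sockets(SAMPLE_SS, {"203.0.113.9"}):
        print(row)
```

A single snapshot is weak evidence; sampling a few times over several seconds after the test command distinguishes a queue that drains from one that stays stuck.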
**Round 6–7 (low-to-medium risk, targeted config change):**
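- Edit the NPM proxy host config for the rmm-api location: add `proxy_buffering off;`, raise `proxy_read_timeout` to `300s` (or `600s`), and set `proxy_send_timeout 300s;`. Reload nginx inside the container (`nginx -s reload`). Do **not** restart the whole container yet.
- Note what a reload does and does not change: nginx starts new workers with the new config and lets the old workers keep serving their existing connections until those close. The three stuck WS connections therefore stay on the old settings until they reconnect, so force a reconnect of only those three (for example by closing their sessions from the server side, or by restarting the agent service out of band) before testing.
- Once they have reconnected, send test commands to the three affected agents and watch both the new host capture and the agent side (CommandAck / execution).
- This tests the known nginx WS + buffering + timeout failure mode without a full reconnect storm. If the three start working and no others regress, that is a useful signal, but it is confounded by the fresh connections themselves. A simple control is to reconnect one stuck agent *before* applying the config change and see whether a fresh connection alone restores delivery. If nothing changes after the config change and reconnect, we have ruled out the default buffering/timeout path as the sole cause for these conns.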
**Round 7+ (higher risk, one-shot probe):**
- Only after the above: consider a deliberate NPM container restart. See question 4 for interpretation.
Secondary cheap actions that can run in parallel: inspect NPM access/error logs for bytes-written on the three specific connections, look for any conntrack table oddities or Docker network restarts around the time the problem appeared, and check whether the three agents show any difference in ping/pong timing or last successful upstream frame.
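The timestamp correlation between a deliberately sent test command and the capture can be reduced to a three-way classification: nothing left this capture point, payload left once, or the same segment was sent repeatedly. The sketch below assumes the capture has already been exported to simple records and that the command timestamp and capture clock are synchronized; the `Packet` type, its fields, and the sample values are hypothetical. Because the traffic is TLS, only TCP-level facts (payload length, sequence number) are used, not WS frame contents.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    ts: float         # capture timestamp, seconds
    dst_ip: str       # destination address of the packet
    seq: int          # TCP sequence number
    payload_len: int  # TCP payload bytes (0 for pure ACKs)


# Made-up capture records using documentation address ranges.
CAPTURE = [
    Packet(100.00, "203.0.113.9", 1000, 0),    # pure ACK before the command
    Packet(100.50, "203.0.113.9", 1000, 120),  # first transmission
    Packet(100.80, "203.0.113.9", 1000, 120),  # same seq again: retransmission
    Packet(100.52, "198.51.100.7", 5000, 120),
]


def classify_delivery(
    packets: list[Packet], agent_ip: str, sent_at: float, window_s: float = 5.0
) -> str:
    """Classify what the capture shows for one test command sent at `sent_at`."""
    hits = [
        p
        for p in packets
        if p.dst_ip == agent_ip
        and p.payload_len > 0
        and sent_at <= p.ts <= sent_at + window_s
    ]
    if not hits:
        return "absent"  # no payload toward the agent at this capture point
    seq_counts = Counter(p.seq for p in hits)
    if any(count > 1 for count in seq_counts.values()):
        return "retransmitted"  # sent, but apparently never acknowledged
    return "sent_once"  # left the host; look further downstream


if __name__ == "__main__":
    for ip in ("203.0.113.9", "198.51.100.7", "192.0.2.55"):
        print(ip, classify_delivery(CAPTURE, ip, sent_at=100.4))
```

"absent" points back at NPM or the origin, "retransmitted" points at the return path or the agent's receive side, and "sent_once" with no CommandAck points at the agent. Heartbeat replies in the same window can produce false "sent_once" results, so sending the test command during a quiet interval helps.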
**4. Restarting the NPM container as a probe**
It is a worthwhile one-shot probe, but it should be the *last* move in the sequence above, not the first, because the cost is real (brief loss of command delivery for the entire fleet while ~200 agents reconnect).
Interpretation is clean and high-value:
- **Stuck agents recover (they begin receiving Text frames and acking/executing after the reconnect)**: Evidence that the defect lived in *per-connection state* rather than in anything permanent. The leading candidate is state inside the running NPM/nginx process. Classic examples: a particular upgraded socket's event loop entry got into a bad state, an internal send buffer or WS frame queue for that fd was wedged, or a long-idle connection accumulated a condition that only a fresh accept + upgrade cleared. One caveat: a restart also forces each agent to build a fresh connection, so recovery alone cannot separate NPM-side state from stale state in the agent's own WS client; the Round 5–6 captures, which show whether frames left the host, are what split those two. The fact that only three were affected fits a rare per-conn corruption or timeout path. Recovery also exonerates (or greatly weakens) "the agent machines themselves are permanently broken" and most per-site gateway theories, because the new connections from the same three public IPs now work.
- **Stuck agents still do not receive commands after restart (while the rest of the fleet does)**: The problem is *not* transient NPM internal state. This shifts weight heavily to:
  - Agent-side (H2): the three agents, once they re-establish, still cannot process inbound frames (possible causes: OS-level socket receive handling, WS client library state machine bug triggered by something in their environment, server-grade NIC offload interacting badly with the specific return traffic pattern, or the agent process having a stuck read goroutine / select that only affects certain machines).
  - Or a persistent property of the return path from this specific public IP to those three destinations (e.g., the three gateways all apply some policy or have a path MTU / TCP option incompatibility that only manifests for certain frame sizes or after the initial upgrade handshake). Three different ISPs still makes this surprising, but a restart would have forced fresh TCP connections, so any "established conn half-open" theory would be weakened.
  - Less likely: something in the origin nginx or Rust side that only these three agent identities trigger.
A restart also gives a natural before/after on connection age: if the problem reappears only after the new connections have been up for many hours/days, that further supports the long-lived-conn hypothesis.
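The interpretation above can be written down as a small decision function, which is useful mainly for agreeing on the reading of each outcome before the restart happens. This is a sketch of the reasoning in this section, not a diagnostic tool: the `Verdict` names are made up, and the "fleet did not come back" branch is an added assumption (a restart that breaks the healthy agents tells us nothing about the three).

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    PER_CONNECTION_STATE = "transient per-connection state (NPM or agent WS client)"
    LONG_LIVED_CONNECTION = "defect that builds up with connection age"
    PERSISTENT_AGENT_OR_PATH = "agent-side (H2) or persistent return-path property"
    INCONCLUSIVE = "restart disturbed the healthy fleet; no conclusion"


def interpret_restart(
    stuck_recovered: bool,
    fleet_ok: bool,
    recurred_after_hours: Optional[float] = None,
) -> Verdict:
    """Map the outcome of an NPM container restart to the hypothesis it supports.

    stuck_recovered      -- the three agents receive and ack commands after reconnecting
    fleet_ok             -- the previously healthy agents still work after the restart
    recurred_after_hours -- hours until the stall reappeared on the three, if it did
    """
    if not fleet_ok:
        return Verdict.INCONCLUSIVE
    if not stuck_recovered:
        return Verdict.PERSISTENT_AGENT_OR_PATH
    if recurred_after_hours is not None:
        return Verdict.LONG_LIVED_CONNECTION
    return Verdict.PER_CONNECTION_STATE


if __name__ == "__main__":
    print(interpret_restart(stuck_recovered=True, fleet_ok=True).value)
    print(interpret_restart(stuck_recovered=True, fleet_ok=True, recurred_after_hours=72).value)
    print(interpret_restart(stuck_recovered=False, fleet_ok=True).value)
```

The long-lived-connection verdict can only be reached some time after the restart, so the first reading will always be one of the other three and should be revisited once the new connections have aged.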
**Summary recommendation for the next round**
Do the DB attribute query + host ss + targeted capture first (today). Apply the `proxy_buffering off` + longer timeouts + reload next (cheap and directly addresses a documented class of nginx WS failure). Only then decide on a container restart. The three-different-ISPs fact has already done most of the work of demoting H3; the population difference is now the best remaining signal, and the low-risk data pulls will likely tell us whether we are looking at "server OS + long conn age" or "rare NPM per-conn stall."