claudetools/.claude/memory/ad2-ssh-mtu-blackhole.md at c2e5f4faeb68588e939b7c73cfaa00a795aa6d7d


Mike Swanson c5643ee419 dataforth/dsca33-45: recover lost specs from Hoffman API (56/58 models)

The DSCA33/DSCA45 main spec files lost in the cryptolocker wipe are recoverable:
the original software published correct certs to the Hoffman product API before
the wipe and our null-skipping renderer never overwrote them. Mine per-model
Final-Test templates (names + specs + verbatim accuracy headers) straight from
those originals instead of requesting spec files from Dataforth/John.

- dsca33-45-templates.json: 56 models (DSCA33 34/35, DSCA45 22/23); only
  DSCA33-1948 + DSCA45-1746 (24 units) lack an original.
- mine-hoffman-dsca.py: the re-runnable miner.
- DSCA33-45-HOFFMAN-RECOVERY handoff for the AD2 session (incl. the gate:
  validate each render vs its Hoffman original before enabling live rendering).
- memories: Hoffman recovery (supersedes the spec-gap "need John" note) and the
  AD2 SSH MTU-blackhole root cause/fix; errorlog entries (syncro jq, ssh correction).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-18 12:50:43 -07:00

2.9 KiB


---
name: ad2-ssh-mtu-blackhole
description: AD2 SSH "lockouts"/mid-session timeouts over the Dataforth OpenVPN were an MTU/PMTU blackhole, not a ban/account-lockout/flaky tunnel; fix = pin the tunnel adapter MTU to 1400
metadata:
  type: project
---

AD2 (Dataforth, `192.168.0.6`) SSH from the fleet over OpenVPN (client subnet `192.168.6.x`) intermittently looked "locked out": sessions authenticated fine, then died mid-session with `Read error from remote host 192.168.6.2 ... Unknown error [postauth]` and `ssh_dispatch_run_fatal: Connection from authenticating user sysadmin ... Connection timed out [preauth]`. Small/interactive commands often worked; bulk reads and `scp` stalled.

Root cause (diagnosed 2026-06-18 via RMM — SSH itself was the failing channel, so don't diagnose it over SSH):

- NOT account lockout: the Windows lockout threshold is 5/30min, but there were zero Event ID 4740 (account locked out) events; `sysadmin` was never locked.
- NOT an IP ban: no IPBan/wail2ban/RdpGuard, 0 inbound firewall block rules.
- NOT auth: every `Accepted publickey for sysadmin` succeeded.
- NOT load: AD2 was at CPU ~11% with 11.7 GB RAM free.

It was a PMTU blackhole. The OpenVPN tunnel path MTU is ~1424 (DF ping: wire 1424 passes, 1428 drops). But GURU-5070's OpenVPN adapter (`Local Area Connection`, ifIndex 12, IP `192.168.6.2`) was set to MTU 1500, so TCP negotiated MSS 1460 (1500 minus 40 bytes of IPv4 and TCP headers). Full-size bulk/scp segments exceeded the tunnel and were silently dropped (DF set), while sub-MTU interactive packets passed. That is why it presented as random "lockouts" that got worse with bulk transfer.
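The arithmetic behind that diagnosis can be checked in a few lines. This is a minimal sketch, assuming plain IPv4 and TCP headers of 20 bytes each with no options; the function names are illustrative and not part of any tool used in the diagnosis.

```python
# Header sizes assume IPv4 and TCP without options (an assumption of this sketch).
IPV4_HEADER_BYTES = 20
TCP_HEADER_BYTES = 20


def mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload a host advertises for a given interface MTU."""
    return mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES


def full_segment_fits(adapter_mtu: int, path_mtu: int) -> bool:
    """True if a full-size TCP segment from this adapter fits the path MTU."""
    packet_size = mss_for_mtu(adapter_mtu) + IPV4_HEADER_BYTES + TCP_HEADER_BYTES
    return packet_size <= path_mtu


# Values from the note: adapter at 1500 vs. 1400, tunnel path MTU ~1424.
print(mss_for_mtu(1500))              # 1460
print(full_segment_fits(1500, 1424))  # False: bulk segments get dropped
print(full_segment_fits(1400, 1424))  # True: pinned MTU stays under the tunnel
```

With the adapter pinned to 1400 the advertised MSS becomes 1360, which leaves 24 bytes of headroom under the measured 1424-byte ceiling.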

Fix applied (2026-06-18): `Set-NetIPInterface -InterfaceIndex 12 -AddressFamily IPv4 -NlMtuBytes 1400`, run via GURU-5070's own RMM agent (`819df0c8...`, runs as `nt authority\system` = elevated; this is the elevated lever on the local box when you can't self-elevate from the Claude shell). Validated: a 1.41 MB single-session SSH transfer to AD2 completed in 9s with no read error (previously blackholed). The `~/.ssh/config` `ad2` block was annotated and its keepalives tightened (`ServerAliveInterval 15`, `ServerAliveCountMax 4`, `ConnectTimeout 20`).

Durability / permanent fix: `Set-NetIPInterface` is registry-persistent, but OpenVPN Connect may reset the adapter MTU to 1500 on reconnect. Re-apply if SSH bulk transfers start stalling again (check `Get-NetIPInterface -InterfaceIndex 12`). The real permanent fix is server-side on the Dataforth OpenVPN server: `mssfix 1360` (or `push "tun-mtu 1400"`) so every fleet client clamps automatically. `192.168.6.4` showed the identical symptom, so this is fleet-wide, not 5070-only.
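If stalls return after a reconnect, re-measuring the tunnel ceiling is the same DF-ping test repeated at different sizes. The sketch below shows only the search logic, assuming the probe is monotonic (every size at or below the ceiling passes). `simulated_probe` is a hypothetical stand-in that hard-codes a 1424-byte ceiling to match the measurement above; a real probe would send a DF-flagged ping of the given wire size (on Windows, `ping -f -l <payload>`, where the payload is the wire size minus 28 bytes of IPv4 and ICMP headers).

```python
from typing import Callable


def largest_passing_size(
    probe: Callable[[int], bool], low: int = 576, high: int = 1500
) -> int:
    """Binary-search the largest wire size in [low, high] that the probe passes.

    Assumes the probe is monotonic: if a size passes, every smaller size passes.
    """
    if not probe(low):
        raise ValueError("even the smallest probe size is dropped")
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid       # mid passes, so the ceiling is at least mid
        else:
            high = mid - 1  # mid is dropped, so the ceiling is below mid
    return low


def simulated_probe(wire_size: int) -> bool:
    """Hypothetical stand-in for a DF ping; the 1424 ceiling is assumed."""
    return wire_size <= 1424


print(largest_passing_size(simulated_probe))  # 1424
```

Comparing the result against the adapter's current MTU shows whether the pin is still in place and still safely below the tunnel ceiling.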

Corrects the earlier wrong attribution ("flaky VPN tunnel" / "my rapid scp+ssh bursts triggering a ban") — the tunnel is up and stable for small packets; only over-MSS segments were dropped. See prefer-ssh-over-rmm (RMM-as-fallback guidance still holds; the reason was MTU, not a flaky VPN).
