diff --git a/clients/cascades-tucson/session-logs/2026-06/2026-06-18-howard-cascades-outage-followup-openvpn-printer.md b/clients/cascades-tucson/session-logs/2026-06/2026-06-18-howard-cascades-outage-followup-openvpn-printer.md
new file mode 100644
index 00000000..22bc801b
--- /dev/null
+++ b/clients/cascades-tucson/session-logs/2026-06/2026-06-18-howard-cascades-outage-followup-openvpn-printer.md
@@ -0,0 +1,81 @@
+# Cascades — power-outage follow-up: OpenVPN flapping root cause + kitchen printer post-outage casualty
+
+- **Date:** 2026-06-18
+- **Machine:** Howard-Home
+- **Client:** Cascades of Tucson
+- **Continuation of:** 2026-06-17 power-outage incident (`clients/cascades-tucson/reports/2026-06-17-power-outage-incident.md`)
+
+## User
+- **User:** Howard Enos (howard)
+- **Machine:** Howard-Home
+- **Role:** tech
+
+## Session Summary
+
+Short follow-up session on the 2026-06-17 Cascades power outage. Two items.
+
+Diagnosed why the Howard-Home OpenVPN Connect tunnel to Cascades pfSense kept disconnecting/
+reconnecting. Read the pfSense OpenVPN server log (`/var/log/openvpn.log`): the disconnects are
+caused by a configured **inactivity timeout** — `Howard/... Inactivity timeout (--inactive),
+exiting` firing at ~5 min (connected 23:23:52 -> dropped 23:28:57 ~= 305s), after which OpenVPN
+Connect auto-reconnects. Ruled OUT duplicate-CN (0 "will cause previous active session" events),
+WAN instability (Cox gateway stable since the 20:47 recovery), and TLS/auth errors (clean auth each
+time; the "IP packet with unknown IP version=0" line is cosmetic). It is a configured idle-disconnect,
+not a fault. Fix = raise/disable the OpenVPN server `--inactive` timeout (keepalive pings do NOT
+reset it — `--inactive` measures tunnel data). Proposed, not applied (standing no-change-without-go rule).
+
+Second: the kitchen thermal printer (iPad POS ticket printer) reported "disconnected from the network"
+and would not print the morning after the outage; Howard power-cycled it and it resumed printing iPad
+tickets. Root cause: it powered up DURING the DHCP-down window of the recovery (duplicate dhcpd +
+2nd-floor switch not passing offers), never got an IP, cached a disconnected state, and did not retry
+once the network was healthy. The power-cycle forced a fresh DHCP request against the now-healthy
+network. Not a printer or network fault. Ran a read-only straggler sweep on pfSense (pulled recent
+dhcpd.log, per-MAC DISCOVER vs ACK): 13/13 active DISCOVER senders are completing, 0 stuck — network
+healthy. Noted that "gave-up" casualties like the printer are INVISIBLE to a DHCP scan (they stopped
+requesting), so expect a few more "won't connect" reports today, each fixed by a power-cycle.
+Updated the incident report with the printer casualty + a recovery-checklist lesson; synced.
+
+## Key Decisions
+
+- **OpenVPN flapping = configured `--inactive` idle timeout, not instability** — diagnosed from the
+ server log rather than guessing; ruled out duplicate-CN / WAN / TLS. Fix proposed (raise/disable inactive), not applied.
+- **Printer = power-outage DHCP-down-window casualty** — correct fix was the power-cycle (re-DHCP);
+ no network change needed. Captured as a recovery-checklist item (power-cycle devices that booted during the DHCP-down window).
+- **A DHCP-log scan cannot find gave-up casualties** (they stop requesting) — so the realistic plan is
+ reactive (power-cycle as reports come in), not a proactive scan.
+
+## Configuration Changes
+
+- No infrastructure changes. pfSense access was read-only (OpenVPN log, dhcpd log).
+- Repo: updated `clients/cascades-tucson/reports/2026-06-17-power-outage-incident.md` (added "Post-Recovery
+ Casualties / Lessons" section — kitchen printer + the power-cycle-stragglers checklist item).
+
+## Infrastructure & Servers
+
+- **Cascades pfSense** `192.168.0.1`, Plus 25.07. OpenVPN server `ovpns1`, user `Howard` (client IP pool
+ 192.168.10.x; this session it got 192.168.10.2). Server has an **`--inactive` idle timeout ~300s** that
+ drops idle clients. WAN = Cox (igc0, 184.191.143.x / dpinger WAN_DHCP + WANCOAX_DHCP). pfSense logs are
+ PLAIN TEXT (read with tail/grep, not clog).
+- OpenVPN client on Howard-Home: OpenVPN Connect (IV_GUI_VER=OCWindows_3.9.0-5008), public src 98.168.18.21.
+- **Kitchen thermal printer:** iPad POS ticket printer (exact IP/MAC not captured); resolved by power-cycle.
+
+## Commands & Outputs
+
+- OpenVPN flap cause: `grep -i "inactivity timeout" /var/log/openvpn.log` -> `Howard/... Inactivity timeout (--inactive), exiting`; duplicate-CN count = 0.
+- Straggler sweep: pulled `tail -5000 /var/log/dhcpd.log` locally -> python per-MAC DISCOVER vs ACK -> 13 senders, 13 completing, 0 stuck.
+
+## Pending / Incomplete Tasks
+
+- **OpenVPN flapping fix:** raise/disable the pfSense OpenVPN server `--inactive` timeout (proposed; needs go).
+- **Watch for more post-outage stragglers** (printers/POS/IoT that gave up) — power-cycle each as reported.
+- Carryover from the outage (unchanged): rotate the exposed Synology credential (vault history commit 1fbc0e1);
+ enable AutoConfigBackup; UPS coverage/runtime/clean-shutdown review; 5GHz Option B + 2.4 Low->Medium bump
+ (plus the auto-channel change still needs a proper data-driven re-plan).
+
+## Reference Information
+
+- Incident report: `clients/cascades-tucson/reports/2026-06-17-power-outage-incident.md` (updated).
+- Prior session logs (same outage): `2026-06-17-howard-cascades-power-outage-recovery-and-5ghz.md`,
+ `2026-06-17-howard-cascades-poly-phone-drops-network-smoothing.md`.
+- Memory: `reference_pfsense_25_07_ops.md`, `feedback_cascades.md` #4 (no prod change without discussing).
+- pfSense access: `bash .claude/skills/unifi-wifi/scripts/pfsense-ssh.sh cascades-tucson run ""`.
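
Note on the straggler sweep recorded under "Commands & Outputs": the log says the per-MAC DISCOVER vs ACK comparison was done in Python, but the script itself is not part of this change. A minimal sketch of that kind of sweep might look like the following. It assumes ISC dhcpd's usual message shapes (`DHCPDISCOVER from <mac> via <if>` and `DHCPACK on <ip> to <mac> ...`); the function name `sweep` and the sample lines are hypothetical and are not taken from the session.

```python
import re

# Assumed ISC dhcpd message shapes (not confirmed against this pfSense box):
#   DHCPDISCOVER from aa:bb:cc:dd:ee:ff via igc1
#   DHCPACK on 192.168.0.50 to aa:bb:cc:dd:ee:ff (host) via igc1
MAC = r"((?:[0-9a-f]{2}:){5}[0-9a-f]{2})"
DISCOVER_RE = re.compile(r"DHCPDISCOVER from " + MAC, re.IGNORECASE)
ACK_RE = re.compile(r"DHCPACK on \S+ to " + MAC, re.IGNORECASE)


def sweep(lines):
    """Split DISCOVER senders into (completing, stuck) sets of MACs.

    A MAC is 'completing' if the same log window holds at least one
    DHCPACK addressed to it, and 'stuck' if it sent DISCOVERs but never
    received an ACK. Ordering inside the window is not checked.
    """
    discovered = set()
    acked = set()
    for line in lines:
        m = DISCOVER_RE.search(line)
        if m:
            discovered.add(m.group(1).lower())
            continue
        m = ACK_RE.search(line)
        if m:
            acked.add(m.group(1).lower())
    completing = discovered & acked
    stuck = discovered - acked
    return completing, stuck


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = [
        "Jun 18 07:01:02 fw dhcpd[100]: DHCPDISCOVER from aa:bb:cc:00:00:01 via igc1",
        "Jun 18 07:01:03 fw dhcpd[100]: DHCPACK on 192.168.0.50 to aa:bb:cc:00:00:01 (ipad) via igc1",
        "Jun 18 07:02:10 fw dhcpd[100]: DHCPDISCOVER from aa:bb:cc:00:00:02 via igc1",
    ]
    completing, stuck = sweep(sample)
    print(f"{len(completing) + len(stuck)} senders, "
          f"{len(completing)} completing, {len(stuck)} stuck")
```

As the log itself points out, a device that has stopped sending DISCOVERs never enters either set, so a sweep like this can confirm that active requesters are being served but cannot find gave-up devices such as the kitchen printer.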