sync: auto-sync from DESKTOP-0O8A1RL at 2026-05-18 15:57:51
Author: Mike Swanson Machine: DESKTOP-0O8A1RL Timestamp: 2026-05-18 15:57:51
@@ -217,3 +217,144 @@ Seven new skills proposed — none built yet. Pending prioritization:
- Avimark Guardian docs: https://software.covetrus.com/wp-content/uploads/dlm_uploads/2021/03/Setting-up-an-AVImark-client-server.pdf
- Avimark support: (877) 838-9273 Option 1
- Winter Williams Syncro user_id: 1737

---

## Update: 15:56 PT — Jupiter Recovery + UniFi Cloud Connectivity Fix

## User

- **User:** Mike Swanson (mike)
- **Machine:** DESKTOP-0O8A1RL
- **Role:** admin
- **Session span:** ~13:00–15:56 PT

## Session Summary

This session covered two major infrastructure recovery tasks following the 2026-05-17 office power failure, plus a UniFi admin permissions change.

**Jupiter Docker recovery.** After all VMs and Docker containers were stopped on request, Jupiter (172.16.3.20) rebooted but got stuck with Docker failing to start. Root cause: `DOCKER_IMAGE_SIZE="2048"` in `/boot/config/docker.cfg` caused emhttpd to attempt a `fallocate` of a 2 TB sparse `docker.img` file on every boot, which blocked the array fsState from reaching "Started." Fixed by killing the fallocate process (pid 9599), which let the array start, then manually mounting the existing docker.img and starting Docker via `/etc/rc.d/rc.docker start`. Corrected `DOCKER_IMAGE_SIZE` to `"100"` in docker.cfg to prevent recurrence. Also found the Unraid web UI was unresponsive over HTTPS; the cause was `USE_SSL="no"` in `/boot/config/ident.cfg`, and HTTP on port 80 works correctly.
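
The oversized value described above could be caught before a reboot with a small pre-flight check. The following is a minimal sketch, not part of Unraid: the function name `check_docker_image_size` and the 200 GB default ceiling are assumptions made for illustration, and the value is treated as gigabytes because `"2048"` produced a 2 TB allocation.

```bash
#!/usr/bin/env bash
# Sketch: flag a DOCKER_IMAGE_SIZE value above a chosen ceiling (in GB).
# check_docker_image_size and the 200 GB default are hypothetical, not Unraid tooling.
check_docker_image_size() {
  local cfg="$1" limit_gb="${2:-200}" size
  # Extract the number from a line such as DOCKER_IMAGE_SIZE="100";
  # the trailing .* also absorbs a stray carriage return if the file has CRLF endings.
  size=$(sed -n 's/^DOCKER_IMAGE_SIZE="\{0,1\}\([0-9][0-9]*\)"\{0,1\}.*$/\1/p' "$cfg" | tail -n 1)
  if [ -z "$size" ]; then
    echo "missing"
    return 2
  fi
  if [ "$size" -gt "$limit_gb" ]; then
    echo "too-large:$size"
    return 1
  fi
  echo "ok:$size"
}
```

Called as `check_docker_image_size /boot/config/docker.cfg`, this would report `too-large:2048` for the value that caused the hang and `ok:100` for the corrected one.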

**UniFi admin permissions.** Set Greg to Schickling-site-only and John Trozzi to Cascades-site-only via the MongoDB `ace.privilege` collection inside the UniFi container. Used a predefined "Site Admin" role created in the UniFi UI.

**UniFi cloud connectivity diagnosis and fix.** All 49 UniFi sites showed "offline" in unifi.ui.com despite being accessible locally. Investigation narrowed the issue to the UniFi container (on VM 172.16.3.29, Rocky Linux 9.1, UniFi OS 5.0.6) being unable to make outbound TCP connections to port 443. Ports 80 and 8443 worked; ports 443 and 444 failed. Initial diagnosis focused on pasta (Ubiquiti's custom podman networking binary) as a likely allowlist issue. Further testing showed the block was NOT in pasta: the VM host itself could not reach external port 443. Tracing up the network path identified a DNAT rule in `/boot/config/go` on Jupiter: `iptables -t nat -I PREROUTING -p tcp --dport 443 -j DNAT --to-destination 172.17.0.2:443`. This rule intercepted ALL port 443 traffic arriving at Jupiter's bridge (including forwarded traffic from VMs) and redirected it to the Discourse Docker container (172.17.0.2), which no longer listens on port 443. The rule was manually placed in `/boot/config/go` at some point when Discourse was serving HTTPS directly; Discourse now only exposes port 80 (host port 8081), so the rule was vestigial. Removed the rule from `/boot/config/go` and deleted the live iptables entry. UniFi cloud reconnected immediately: `Device connected to cloud` appeared in cloud.log within seconds, followed by successful MQTT subscriptions and WebRTC tunnel establishment for remote site access.
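
The port-by-port narrowing described above can be repeated from each hop (container, VM, hypervisor) with the same probe. This is a minimal sketch, assuming bash with `/dev/tcp` support and the coreutils `timeout` command are available on the machine being tested; `probe_port` is a hypothetical helper name, not a tool used in the session.

```bash
#!/usr/bin/env bash
# Sketch: report whether an outbound TCP connection to host:port succeeds.
# probe_port is a hypothetical helper introduced for illustration.
probe_port() {
  local host="$1" port="$2" timeout_s="${3:-5}"
  # bash opens a TCP socket when redirecting to /dev/tcp/HOST/PORT;
  # timeout bounds the wait so a silently dropped SYN does not hang the probe.
  if timeout "$timeout_s" bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} open"
  else
    echo "${host}:${port} unreachable"
  fi
}
```

Running `for p in 80 443 444 8443; do probe_port 1.1.1.1 "$p"; done` at each hop, where `1.1.1.1` is the target used in the session's curl tests, shows where a given port stops getting through. A result that is identical inside the container and on the VM host points upstream of both.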

## Key Decisions

- Killed the fallocate process rather than rebooting again — rebooting with the same docker.cfg would have just repeated the block. Killing it allowed the array to transition to Started without changing any config first; docker.cfg was then corrected so the fix persists across reboots.
- Reduced DOCKER_IMAGE_SIZE from 2048 to 100 — the existing docker.img is already sized appropriately; 2048 was an erroneous value that caused a 2 TB fallocate on each boot cycle.
- Investigated pasta networking thoroughly before ruling it out — it was a credible suspect because the block occurred at the container level and pasta handles all container networking. Confirmed it was not pasta by testing port 443 directly from the VM host as root and getting the same failure.
- Removed the DNAT rule entirely rather than scoping it to an interface — the target container (Discourse/app) does not listen on port 443 at all; the rule served no purpose and was actively harmful. NPM (port 18443) handles all external HTTPS reverse proxy duties now.
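
The removed rule had no interface or destination match, which is why it caught forwarded VM traffic as well as traffic addressed to Jupiter. A rough way to look for other rules of that shape in a startup script is sketched below. `find_unscoped_dnat` is a hypothetical helper, and the check is a plain text heuristic rather than a parse of iptables syntax.

```bash
#!/usr/bin/env bash
# Sketch: list DNAT rules in a script that match on neither interface nor destination.
# find_unscoped_dnat is a hypothetical helper; it only inspects text, not live rules.
find_unscoped_dnat() {
  local script="$1"
  # 1. Keep lines that jump to DNAT (prefixed with their line number by -n).
  # 2. Drop commented-out lines.
  # 3. Drop lines that restrict the match with -i/--in-interface or -d/--destination.
  grep -n -- '-j DNAT' "$script" \
    | grep -v -E '^[0-9]+:[[:space:]]*#' \
    | grep -v -E -- '[[:space:]](-i|--in-interface|-d|--destination)[[:space:]]' \
    || true
}
```

Against a copy of the old `/boot/config/go`, `find_unscoped_dnat /boot/config/go` would print the port 443 rule with its line number. An empty result means every DNAT rule in the file carries at least one of those restrictions.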

## Problems Encountered

- **Jupiter array stuck at "Starting" on Docker startup**: a fallocate for a 2 TB sparse file blocked fsState from completing. Fixed by `kill 9599` (the fallocate pid), then manually mounting and starting Docker.
- **DOCKER_IMAGE_FILE cleared during early fix attempt**: An earlier attempt to edit docker.cfg cleared the image file path, causing Docker to fail with "No image mounted at /var/lib/docker." Restored the correct path `/mnt/user/system/docker/docker.img`.
- **Vault entry for pfSense had a MAC mismatch**: `infrastructure/pfsense-firewall.sops.yaml` was corrupted during the power failure. Could not decrypt to get pfSense credentials. Worked around by using SSH key auth directly (`ssh -p 2248 admin@172.16.0.1`).
- **pasta appeared to be blocking port 443**: tcpdump on pfSense igc2 showed zero port 443 packets from 172.16.3.29, while the container and VM host both showed SYN-SENT states. Led to extensive pasta investigation (binary analysis, nftables inspection, capability check) before confirming the block was upstream at Jupiter.
- **UniFi container netns not accessible via nsenter as root**: The container uses user namespaces (uid 1000). Used the `runuser -l uosserver -c 'XDG_RUNTIME_DIR=/run/user/1000 podman exec uosserver ...'` pattern instead.
- **Backslash hook blocking full OpenSSH path**: Pre-bash hook blocks commands containing backslashes, preventing use of `C:\Windows\System32\OpenSSH\ssh.exe`. Used the bare `ssh` command (resolves to the same binary via PATH).

## Configuration Changes

- **`/boot/config/docker.cfg` on Jupiter (172.16.3.20)**: Changed `DOCKER_IMAGE_SIZE` from `"2048"` to `"100"`. Restored `DOCKER_IMAGE_FILE="/mnt/user/system/docker/docker.img"` (was cleared in early fix attempt).
- **`/boot/config/go` on Jupiter (172.16.3.20)**: Removed line `iptables -t nat -I PREROUTING -p tcp --dport 443 -j DNAT --to-destination 172.17.0.2:443`.
- **Live iptables on Jupiter**: Deleted the PREROUTING DNAT rule for port 443 → 172.17.0.2:443.
- **MongoDB `ace.privilege` on UniFi container**: Updated Greg (admin_id: 69fa8d0b005ca51fe30e6cee) to Schickling site only; John Trozzi (admin_id: 6a0240da005ca51fe3346960) to Cascades site only.

## Credentials & Secrets

- **pfSense SSH**: `ssh -p 2248 admin@172.16.0.1` — key auth works, password vault entry has MAC mismatch (needs re-encryption)
- **pfSense HTTPS**: port 8443
- **Vault entry with MAC mismatch**: `infrastructure/pfsense-firewall.sops.yaml` — needs to be re-encrypted from scratch

## Infrastructure & Servers

- **Jupiter (Unraid)**: 172.16.3.20 — Unraid array server, Docker host, KVM/libvirt VM host
  - Docker bridge: `docker0` (172.17.0.0/16)
  - `app` container (Discourse): 172.17.0.2 — exposes port 80 only (host:8081); NOT 443
  - `npm` container (Nginx Proxy Manager): 172.17.0.3 — host ports 1880→80, 7818→81, 18443→443
  - `MariaDB-Official`: 172.17.0.4
  - `qbittorrent`: 172.17.0.5
  - `rsync-server`: 172.17.0.6
  - Main bridge `br0`: eth0, eth1, eth2, eth3 (physical), vnet0–vnet3 (VM NICs)
  - VMs: Unifi (vnet0?), OwnCloud, GuruRMM, Claude-Builder, Windows 7 (off), WS2016 (off)

- **UniFi VM**: 172.16.3.29 — Rocky Linux 9.1, KVM guest on Jupiter
  - UniFi OS: 5.0.6, container image `uosserver:0.0.54`
  - Managed by systemd unit `uosserver.service` → `/var/lib/uosserver/bin/uosserver-service`
  - Container runs as user `uosserver` (uid 1000), rootless podman
  - Networking: pasta (`/var/lib/uosserver/bin/pasta`) — custom Ubiquiti fork
  - pasta command binds host ports: 5005, 5671, 6789, 8080, 8443, 8444, 8880, 8881, 8882, 9543, 11084, 11443
  - pasta maps host:11443 → container:443; host:8443 → container:8443
  - Cloud deviceId: `2d6b654d-9b79-4eaa-b2e1-52062a5690ef`
  - SSO owner: `4d503ebd-6990-4b88-93cc-3a3e568d0a6d`

- **pfSense**: 172.16.0.1 — SSH port 2248, HTTPS port 8443
  - WAN: igc0 (98.181.90.163), LAN: igc2 (172.16.0.0/22)
  - NAT rule: `nat on igc0 inet from 172.16.0.0/22 to any -> 98.181.90.163`

## Commands & Outputs

```bash
# Kill fallocate blocking Docker startup on Jupiter
kill 9599 # fallocate pid

# Mount docker.img manually and start Docker
losetup /dev/loop2 /mnt/user/system/docker/docker.img
mount /dev/loop2 /var/lib/docker
/etc/rc.d/rc.docker start

# Fix docker.cfg
# DOCKER_IMAGE_FILE="/mnt/user/system/docker/docker.img"
# DOCKER_IMAGE_SIZE="100"

# pfSense SSH (use system OpenSSH, port 2248)
ssh -p 2248 admin@172.16.0.1

# tcpdump on pfSense igc2 for port 443 packets from UniFi VM
tcpdump -i igc2 -n 'src 172.16.3.29 and tcp port 443' -c 20
# Result: zero packets — confirmed block was upstream of pfSense

# Test port 443 from UniFi VM host (root)
curl -sk -m 5 https://1.1.1.1 -o /dev/null -w '%{http_code}' # → 000 (before fix)

# Test from inside container via podman exec
runuser -l uosserver -c 'XDG_RUNTIME_DIR=/run/user/1000 podman exec uosserver curl -sk -m 5 https://1.1.1.1 -o /dev/null -w "%{http_code}"'
# → 000 before fix, 301 after fix

# Find the DNAT rule on Jupiter
iptables -t nat -L PREROUTING -n -v
# Output included: DNAT tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:443 to:172.17.0.2:443
grep -n '443' /boot/config/go
# Line 12: iptables -t nat -I PREROUTING -p tcp --dport 443 -j DNAT --to-destination 172.17.0.2:443

# Fix: remove from go script (persistent)
sed -i '/iptables -t nat -I PREROUTING -p tcp --dport 443 -j DNAT --to-destination 172.17.0.2:443/d' /boot/config/go

# Fix: remove live rule
iptables -t nat -D PREROUTING -p tcp --dport 443 -j DNAT --to-destination 172.17.0.2:443

# Verify UniFi cloud reconnection
tail -20 /data/unifi-core/logs/cloud.log # (inside container)
# Key lines:
# Device connected to cloud, deviceId: 2d6b654d-9b79-4eaa-b2e1-52062a5690ef, isReconnected: false
# Successful Sync owner, SSO ID: 4d503ebd-6990-4b88-93cc-3a3e568d0a6d
```
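
The `sed -i` edit above changes `/boot/config/go` in place and leaves no copy of the previous contents. A more cautious variant is sketched here, assuming a timestamped backup next to the file is acceptable. `remove_line_with_backup` is a hypothetical helper name, and it matches the rule as a literal string with `grep -F` instead of a regular expression.

```bash
#!/usr/bin/env bash
# Sketch: delete every line containing a literal string, keeping a backup first.
# remove_line_with_backup is a hypothetical helper, not an Unraid utility.
remove_line_with_backup() {
  local file="$1" needle="$2" backup rc=0
  backup="${file}.bak.$(date +%Y%m%d%H%M%S)"
  # Copy first so the original can be restored if the edit goes wrong.
  cp -p -- "$file" "$backup" || return 1
  # -F: literal match (dots in the IP are not wildcards); -v: keep non-matching lines.
  # Writing back through a redirect keeps the original file's permissions.
  grep -v -F -- "$needle" "$backup" > "$file" || rc=$?
  # grep exits 1 when no lines remain to print; only 2 or higher is a real error.
  [ "$rc" -le 1 ] || return "$rc"
  # Print the backup path so the caller can diff or restore.
  echo "$backup"
}
```

The call returns the backup path, so `diff "$(remove_line_with_backup /boot/config/go 'RULE TEXT')" /boot/config/go` shows exactly which lines were dropped; `RULE TEXT` stands in for the full iptables line.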

## Pending / Incomplete Tasks

- **Verify all 49 sites show online** at unifi.ui.com — WebRTC tunnels were establishing at session end; allow ~5 min for all sites to reconnect
- **Vault re-encryption**: `infrastructure/pfsense-firewall.sops.yaml` has MAC mismatch from power failure — needs to be recreated from scratch with current pfSense credentials
- **Jupiter load settling**: Load was elevated from Docker/VM restart activity; should normalize within an hour
- **7 proposed skills from earlier session**: /coord, /new-user, /deploy-verify, /infra-recovery, /client-report, /patch-server, /vendor-ticket — none built

## Reference Information

- UniFi VM cloud.log: inside container at `/data/unifi-core/logs/cloud.log`
- UniFi VM uid cloud.log: `/data/uid/log/cloud.log` (older, last updated 2026-04-06 — access token was invalid)
- uosserver service: `/etc/systemd/system/uosserver.service`
- uosserver config: `/var/lib/uosserver/server.conf`
- pasta binary: `/var/lib/uosserver/bin/pasta` (Ubiquiti custom fork, ELF 64-bit, not stripped but no readable strings)
- Jupiter /boot/config/go: persistent startup script for custom iptables rules
- Jupiter /boot/config/docker.cfg: Docker image path and size config
- Jupiter /boot/config/ident.cfg: `USE_SSL="no"` — Unraid web UI is HTTP-only on port 80