sync: auto-sync from HOWARD-HOME at 2026-06-24 12:49:35

Author: Howard Enos
Machine: HOWARD-HOME
Timestamp: 2026-06-24 12:49:35
2026-06-24 12:50:03 -07:00
parent 8d6ac5ef6e
commit be2ae8b07e
6 changed files with 57 additions and 7 deletions
--- a/.claude/memory/MEMORY.md
+++ b/.claude/memory/MEMORY.md
@@ -34,6 +34,7 @@
 - [Syncro prepay: full-GET only](feedback_syncro_prepay_full_get_only.md) — read prepay_hours ONLY from GET /customers/{id}; the customer search/list endpoint returns null/stale prepay. Never assert "no block" in a billing preview from search data.
 - [Syncro priority/type format](feedback_syncro_priority_type_format.md) — every ticket create needs a number-prefixed priority ("2 Normal", not bare "Normal" which renders blank) AND a valid problem_type. Winter flagged #32193/#32194. Use the syncro skill's create flow.
 - [RMM drive-map Explorer refresh](reference_rmm_drive_map_explorer_refresh.md) — drive mapped via RMM user_session works but the user's running Explorer won't show it until SHChangeNotify(DRIVEADD); also UNC \\ gets eaten in heredoc+jq, build it from [char]92.
+- [Verify live before acting](feedback_verify_live_before_acting.md) — pull LIVE data (OMSA/iDRAC/live API) before acting on a hardware/infra flag; wiki/logs go stale. Cascades CS-SERVER "degraded RAID" was 9-day-stale (mirror self-recovered, SSDs bought needlessly). Windows can't see RAID member health.
 - [AAD Connect msDS-KeyCredentialLink writeback](reference_aadconnect_keycredlink_writeback.md) — "completed-export-errors" + 8344 INSUFF_ACCESS_RIGHTS on a protected admin account = WHfB key writeback blocked by AdminSDHolder. Diagnose with csexport /f:x; fix with dsacls WP;msDS-KeyCredentialLink on AdminSDHolder + SDProp.
 - [UniFi Site Manager cloud API](reference_unifi_site_manager_api.md) — `api.ui.com` + `X-API-KEY` (vault `services/unifi-site-manager`) = remote access to the WHOLE ACG UniFi fleet (~36 consoles) outside UOS. Tier1 `/v1/hosts|sites|devices|isp-metrics` = inventory+health+WAN. Tier2 CONNECTOR `/v1/connector/consoles/{id}/proxy/network/api/s/default/stat/{device,sta}` = **full UOS parity** (per-radio cu_total airtime + per-client RSSI) for ANY console, remote. Backend `unifi-wifi/scripts/gw-sitemanager.sh` (`fleet|devices|sites|isp|net`). Standalone UDM WAN SSH usually firewalled; per-console SSH pw at `clients/<slug>/udm-ssh`.
 - [reference_sqlx_migrations_immutable](reference_sqlx_migrations_immutable.md) -- NEVER edit an already-applied sqlx migration file — even a comment. sqlx::migrate! checksums each file at compile time and validates against _sqlx_migrations at startup; a changed checksum crash-loops the server with "migration N was previously applied but has been modified". Code review MUST flag any edit to an applied migration.
--- a/.claude/memory/feedback_verify_live_before_acting.md
+++ b/.claude/memory/feedback_verify_live_before_acting.md
@@ -0,0 +1,30 @@
+---
+name: feedback_verify_live_before_acting
+description: Always pull LIVE current data before acting on (or alarming about) a hardware/infra finding — wiki/session-logs are point-in-time snapshots that go stale
+metadata:
+  type: feedback
+---
+
+Before acting on, or raising alarm about, any hardware/infrastructure state — **pull live current
+data first and lead with THAT, not the wiki or a recalled fact.** The wiki, session logs, and memory
+are point-in-time snapshots; a "broken/degraded/failing" flag may have changed by the time you read it.
+This matters most for **hard-to-reverse or money-spending actions** (drive swaps, hardware pulls,
+parts purchases, "it's down" escalations).
+
+**Why:** 2026-06-24 — the Cascades CS-SERVER wiki carried a `[CRITICAL] RAID degraded / failing
+drive` flag from 2026-06-15. Acting on it, **SSDs were purchased** and Howard went onsite ready to
+hot-swap the "failing" drive. A **live Dell OMSA `omreport` query (via the RMM agent)** then showed
+the OS mirror had **self-recovered** (the flaky drive dropped out and re-synced after a power cycle):
+all 5 disks Online/Ok, all LEDs green, and the "5th unused drive" was actually the **global hot
+spare**. Acting on the 9-day-stale flag wasted a drive purchase and nearly got a healthy drive pulled.
+Howard's directive: "always go with live current data to make sure our findings are real."
+
+**How to apply:**
+- For Dell servers: `omreport storage controller`, `omreport storage vdisk controller=0`,
+  `omreport storage pdisk controller=0`, and `omreport system esmlog` via the RMM agent (OMSA reads the
+  controller directly, so it is authoritative). iDRAC/Redfish is the out-of-band equivalent (no iDRAC skill yet; creds not vaulted as of 2026-06-24).
+- Windows `Get-PhysicalDisk`/`Get-Disk` see only the controller's VIRTUAL disks (reported "Healthy" even with a
+  degraded member), not the physical members behind hardware RAID; never conclude RAID health from the OS view alone.
+- For any infra claim sourced from the wiki/a recalled fact: re-verify that the specific file/flag/host
+  state still holds before recommending action. State the data's timestamp and source.
+- See [[reference_rmm_drive_map_explorer_refresh]] for the OMSA-via-RMM pattern context.
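
The "state the data's timestamp and source" rule above can be sketched as a small staleness check. This is only an illustrative sketch, not part of the commit: the helper names (`flag_age`, `needs_live_check`, `describe`) and the one-day `MAX_FLAG_AGE` threshold are assumptions. Only the two dates come from the CS-SERVER incident described in the note.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: how old a wiki/log flag may be before it must be
# re-verified against live data. The value is an assumption, not from the note.
MAX_FLAG_AGE = timedelta(days=1)


def flag_age(flag_time: datetime, now: datetime) -> timedelta:
    """Return how long ago the flag was recorded."""
    return now - flag_time


def needs_live_check(flag_time: datetime, now: datetime,
                     max_age: timedelta = MAX_FLAG_AGE) -> bool:
    """True when the flag is older than max_age and must be re-verified live."""
    return flag_age(flag_time, now) > max_age


def describe(source: str, flag_time: datetime, now: datetime) -> str:
    """Report the data's source, timestamp, and age alongside the verdict."""
    age_days = flag_age(flag_time, now).days
    verdict = "re-verify live" if needs_live_check(flag_time, now) else "fresh"
    return f"{source} @ {flag_time:%Y-%m-%d} ({age_days}d old): {verdict}"


# Dates from the CS-SERVER incident: flag written 2026-06-15, read 2026-06-24.
flagged = datetime(2026, 6, 15)
read_at = datetime(2026, 6, 24)
print(describe("wiki", flagged, read_at))
```

A check like this only decides when to go and pull live data. It does not replace the live `omreport` query itself.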