From f22d33f2aeaa17e9f539e38a2500b9c8d8fb09c3 Mon Sep 17 00:00:00 2001
From: Mike Swanson
Date: Wed, 29 Apr 2026 07:51:02 -0700
Subject: [PATCH] =?UTF-8?q?pavon:=20session=20log=20=E2=80=94=20OwnCloud?=
 =?UTF-8?q?=20VM=20cron=20stacking=20diagnosed=20and=20stabilized?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Found 75-126 stale `occ system:cron` processes on 172.16.3.22 piling up
since 2026-04-27 due to bad oc_filecache LIKE query against pavon's 257K
camera files. Killed stale procs (load 80 -> 5), wrapped apache crontab
with `flock -n /tmp/oc-cron.lock` to prevent restacking.

Per-user versioning disable rejected by OwnCloud Community
(`files_versions` can't be enabled for groups); workaround
`occ versions:cleanup pavon` identified and deferred. Migration/retention
cron deferred per user.

NVR architecture clarified: GeoVision NVRs sync via OC Desktop client
with virtual file placeholders; no direct SMB access to Jupiter.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 clients/pavon/PROJECT_STATE.md | 32 +-
 .../pavon/session-logs/2026-04-29-session.md | 337 ++++++++++++++++++
 2 files changed, 361 insertions(+), 8 deletions(-)
 create mode 100644 clients/pavon/session-logs/2026-04-29-session.md

diff --git a/clients/pavon/PROJECT_STATE.md b/clients/pavon/PROJECT_STATE.md
index 0e83d4d..9ea97de 100644
--- a/clients/pavon/PROJECT_STATE.md
+++ b/clients/pavon/PROJECT_STATE.md
@@ -1,23 +1,39 @@
 # Pavon — Project State
 
-> Last updated: 2026-04-20
+> Last updated: 2026-04-29
 
-**Status:** COMPLETE
-**Last Activity:** 2026-04-12
+**Status:** ACTIVE — deferred follow-ups
+**Last Activity:** 2026-04-29
 
-Video archive management project. Cleaned 25 TB of old footage (184,124 files deleted), then integrated the remaining 35 TB with OwnCloud for organized archival access.
+Video archive management with OwnCloud as source of truth (3-year retention). Original 25 TB cleanup completed 2026-04-12.
New chapter opened 2026-04-28: OwnCloud VM cron stacking spiral diagnosed and stabilized, but root cause cleanup deferred per user.
 
 ## What Was Done
 
+### 2026-04-12 (original project, COMPLETE)
 - Identified and deleted 184,124 redundant/old files totaling 25 TB
 - Infrastructure analysis of storage environment
 - Remaining 35 TB integrated with OwnCloud via external storage setup
 - Archive scan and cleanup completion documented
 - Final setup summary written
 
+### 2026-04-29 (cron stacking incident, STABLE / FOLLOW-UPS DEFERRED)
+- Diagnosed that the OwnCloud VM (172.16.3.22) was running 75-126 stale `occ system:cron` processes since 2026-04-27, all spinning on a bad `oc_filecache` LIKE query against pavon's storage 78 (~237K camera files)
+- Killed stale crons, load avg dropped 80 -> 5
+- Wrapped the apache crontab line with `flock -n /tmp/oc-cron.lock` to prevent stacking — current production state
+- Architecture clarified: GeoVision NVRs at Curves and Raiders sites use OwnCloud Desktop sync client with virtual file placeholders; NVRs have no direct SMB access to Jupiter/Saturn; pavon never touches OwnCloud directly
+- Discovered `files_versions` cannot be group-restricted in OwnCloud Community; per-user disable not possible. Identified `occ versions:cleanup pavon` as the workaround. Deferred.
+
+## Current Operational State
+
+- OwnCloud VM stable. Cron protected by flock. No active firefighting needed.
+- 30 GB of accumulated junk versions sitting in `/owncloud/pavon/files_versions/` waiting for `occ versions:cleanup pavon`
+- A dangling `versioning_users` group exists on the OwnCloud VM (created during a failed group-restrict attempt; harmless)
+- 1Password password for OwnCloud VM is stale (`Paper123!@#-unifi!`); SOPS has the working value (`r3tr0gradE99!!`) — needs reconciliation
+
 ## If Resuming
 
-- Check `clients/pavon/owncloud-archive-setup.md` and `final-setup-summary.md` for the OwnCloud configuration
-- Storage layout documented in `infrastructure-analysis.md`
-- `cleanup-completion-report.md` confirms what was deleted and what was kept
-- Session logs in `clients/pavon/session-logs/`
+- **Most recent context:** `clients/pavon/session-logs/2026-04-29-session.md` (full diagnostic trail, command outputs, deferred-task checklist)
+- **Original archive integration:** `owncloud-archive-setup.md`, `final-setup-summary.md`
+- **Storage layout:** `infrastructure-analysis.md`
+- **Pending work checklist:** see "Concrete next session checklist" in 2026-04-29 session log
+- **Session logs index:** `clients/pavon/session-logs/`
diff --git a/clients/pavon/session-logs/2026-04-29-session.md b/clients/pavon/session-logs/2026-04-29-session.md
new file mode 100644
index 0000000..4d53ecd
--- /dev/null
+++ b/clients/pavon/session-logs/2026-04-29-session.md
@@ -0,0 +1,337 @@
+# Session Log — Pavon — 2026-04-29
+
+## User
+- **User:** Mike Swanson (mike)
+- **Machine:** GURU-BEAST-ROG
+- **Role:** admin
+
+---
+
+## Mode
+`infra` — OwnCloud VM diagnostics on Jupiter, multi-host SSH, scheduled cron edits.
+
+---
+
+## Session Summary
+
+A performance issue was reported on a Windows Server 2016 VM running SharePoint Migration Tool for BirthBiologic, hosted on Jupiter (Unraid 172.16.3.20). Initial analysis incorrectly attributed high CPU usage to CPU pinning, but further investigation revealed that the high load was caused by the OwnCloud VM (172.16.3.22) instead.
The OwnCloud VM, running Rocky Linux 9.6, was experiencing a severe performance degradation due to an excessive number of stalled `occ system:cron` processes. These processes were executing inefficient MariaDB queries that were scanning the `oc_filecache` table repeatedly, leading to sustained high CPU usage and load average. The inefficient query pattern used a wildcard in the LIKE clause, incorrect collation, and missing indexes, resulting in full table scans.
+
+After killing the stalled processes and implementing a flock lock to prevent future stacking, CPU usage dropped significantly. Further investigation into the storage configuration and data retention policy confirmed that OwnCloud is the source of truth for the data (3-year retention), and no immediate migration or cleanup was performed at the user's request. Credentials access was switched from 1Password to SOPS partway through the session per user preference, which uncovered drift between the two sources for the OwnCloud VM root password.
+
+---
+
+## Key Decisions
+
+- **Kill stalled cron processes immediately** rather than waiting them out — system was in deep distress (load avg 80, ~70 tasks queued waiting on CPU) and there was no risk to user data
+- **Wrap cron with `flock -n`** instead of disabling cron outright — preserves OwnCloud's normal background-job processing while making the stacking-spiral structurally impossible
+- **Defer per-user versioning disable** — OwnCloud Community refused `--groups` flag for `files_versions`.
Workaround identified (`occ versions:cleanup pavon` on a daily cron) but deferred per user request
+- **Did not bypass OwnCloud for ingest** — user clarified GeoVision NVRs at Curves and Raiders sites use OwnCloud Desktop sync client with virtual file placeholders to save NVR-local disk; NVR units lack direct SMB access to Jupiter/Saturn, so the existing ingest path stays
+- **Defer the migration/retention scripts** — current state (cron-flock + status quo on versioning) is stable enough; user wants to revisit the architectural cleanup with more time
+
+---
+
+## Problems Encountered
+
+- **Incorrect initial diagnosis of CPU pinning** — first hypothesis was that the Windows VM's 16 vCPUs were oversubscribed onto 8 host threads. `virsh dumpxml` showed no `<cputune>` block at all on the Windows VM. Re-pivoted by checking which qemu process actually owned the load, found OwnCloud
+- **Stale `occ system:cron` processes** — 75-126 piled up since 2026-04-27. No flock wrapper meant every 15-min tick fired a new run while previous ones still ground on. Resolved with `pkill -f 'occ system:cron'` then `pkill -9` for stragglers
+- **Inefficient MariaDB query pattern** in OwnCloud's filecache scanner — `name COLLATE utf8mb4_general_ci LIKE 'pattern%middle%suffix'` defeats the `utf8mb4_bin` index and the wildcard mid-string. The schema only has `(parent, name)` and `(storage, path_hash)` indexes, no `(storage, name)`. Each query becomes a 257K-row scan on storage 78
+- **Failed attempt to disable versioning for one user** — `app:enable files_versions --groups <group>` rejected with "files_versions can't be enabled for groups." Reverted by re-enabling globally
+- **Credential drift between 1Password and SOPS** — SOPS had `r3tr0gradE99!!` for OwnCloud VM root, 1Password had `Paper123!@#-unifi!`. SOPS value worked, 1Password did not. Worth reconciling
+- **PowerShell mangling shell metachars** — repeatedly tripped on `$()`, `*`, `(`, backticks when invoking SSH inline.
Worked around by writing scripts to local files, scp-ing, then executing remotely
+
+---
+
+## Credentials Used / Discovered
+
+### Jupiter (Unraid Primary)
+- **Host:** 172.16.3.20
+- **User:** root
+- **Password (1Password and SOPS agree):** `Th1nk3r^99##`
+- **SOPS path:** `infrastructure/jupiter-unraid-primary.sops.yaml`
+- **1Password item id:** `5ji4rsgvn6feare6fahxsqauui`
+
+### OwnCloud VM (Rocky Linux 9.6)
+- **Host:** 172.16.3.22
+- **Hostname:** cloud.acghosting.com
+- **User:** root
+- **Password (SOPS — works):** `r3tr0gradE99!!`
+- **Password (1Password — STALE, does NOT work):** `Paper123!@#-unifi!` ← **needs reconciliation**
+- **SOPS path:** `infrastructure/owncloud-vm.sops.yaml`
+- **1Password item id:** `h6usgzxxn26kvckxz5dhssxdai`
+- **SSH host key (ed25519):** `SHA256:Yy4oFv5HudmKjNJ4IZgHcuSSmeBvUg+ZJta6iLasdqU`
+
+### MariaDB on OwnCloud VM
+- Local socket auth as root from the VM (no password prompt for `mysql` CLI when run as root)
+- OwnCloud's app DB: `owncloud` schema, user `owncloud@localhost`
+
+---
+
+## Infrastructure Touched
+
+| Host | IP | Role | Action |
+|---|---|---|---|
+| Jupiter (Unraid Primary) | 172.16.3.20 | Hypervisor | Read-only diagnostics: `virsh list`, `virsh dumpxml`, `virsh vcpuinfo`, `mpstat`, `ps` |
+| OwnCloud VM | 172.16.3.22 | Rocky Linux 9.6, runs OwnCloud + MariaDB 10.5.29 | `pkill` cron processes, edited apache crontab, group/app changes (reverted) |
+| Uranus (Unraid Secondary) | 172.16.3.21 | SMB share host (`Storage` share) | None this session — referenced as future archive target |
+
+### VM cputune findings on Jupiter
+
+- **Windows Server 2016 VM:** No CPU pinning. Topology `sockets=1, cores=8, threads=2` (16 vCPUs). Memory 16 GB current / 32 GB max with ballooning.
+- **OwnCloud VM:** vCPUs 0-7 pinned 1:1 to host CPU 0,28,1,29,2,30,3,31 — exactly the eight LPs that were saturated. Topology `cores=4, threads=2` (8 vCPUs).
+
+### OwnCloud users and groups
+
+- 10 users total: Martell, anaise, bst, jburger, mara, minrec, **pavon**, rohrbach, sysadmin, themarcgroup
+- Existing groups: ACG, Clients, PST, QMS, Stamback, Stoltz, WCP, admin
+- Only sysadmin is in any group (ACG, admin); all other users are unaffiliated
+
+### Storage layout (OwnCloud VM)
+
+- **OwnCloud data dir:** `/owncloud`, NFS-mounted from `172.16.3.20:/mnt/user/OwnCloud`
+- **Filesystem state:** 932 GB total, 677 GB used, 248 GB free, **74% full**
+- **Pavon's storage (numeric_id 78, `home::pavon`):**
+  - `/owncloud/pavon/files/Curves/` — 188,920 files (all sub-paths Curves/Data-F/CamNN/YYYYMMDD/Event*.Avi or .Wav)
+  - `/owncloud/pavon/files/Raiders/` — 48,978 files (Raiders/Cameras{,2}/CamNN/YYYYMMDD/Event*.Avi)
+  - Two NVR log files at root (`NVR-18019140.out`, `NVR-18082322.out`, ~16K each)
+  - **Total ~237K files**
+  - **30 GB of accumulated junk versions** in `/owncloud/pavon/files_versions/` (1,326 version files, 1,383 filecache rows)
+- **External storage (numeric_id 6, mount_id 6):**
+  - Mount point in pavon's view: `/Archive`
+  - Backend: SMB Personal (unique file IDs)
+  - Host: 172.16.3.21 (Uranus)
+  - Share: `Storage`
+  - SMB user: `owncloud`
+  - **`filesystem_check_changes` already set to 0** — OwnCloud doesn't auto-rescan this on cron
+
+### File age distribution (pavon)
+
+```
+2024: 1 file (oldest from 2024-12-21)
+2025: 162,898 files
+2026: 74,719 files
+Older than 365 days: 256 files (as of 2026-04-29)
+```
+
+---
+
+## Architecture (NEW context from this session)
+
+- **GeoVision NVR** units at Curves and Raiders client sites
+- Each NVR runs **OwnCloud Desktop sync client** with the sync folder pointed at the NVR's data drop directory
+- After upload, OwnCloud client converts the local file to a **virtual file placeholder** to conserve NVR-local disk; if the NVR ever needs the file, the placeholder triggers a re-download
+- **NVR units have NO direct SMB access** to Jupiter or Saturn — they reach
OwnCloud only via the WebDAV interface used by the desktop client
+- **OwnCloud is the source of truth** for this footage, not a backup of it
+- **Retention policy: 3 years**. Older may be deleted
+- **Pavon never uses OwnCloud directly** — only the NVR interface for footage retrieval
+- **GeoVision has no built-in age-based file routing** — can't move old files to a different folder on the NVR side
+
+---
+
+## Commands & Outputs (Critical)
+
+### Identifying the runaway VM (on Jupiter)
+
+```bash
+ssh root@172.16.3.20 'ps -eo pid,pcpu,pmem,comm,args --sort=-pcpu | head -3'
+# PID 15343 486% CPU qemu-system-x86_64 ... guest=OwnCloud
+# PID 2349755 118% CPU qemu-system-x86_64 ... guest=Windows Server 2016
+# PID 13887 25% CPU qemu-system-x86_64 ... guest=Unifi
+
+mpstat -P 0,1,2,3,28,29,30,31 1 1
+# All eight LPs showed %guest near 100, %idle near 0 — load was guest VM, not host
+```
+
+### OwnCloud cputune confirming pin (on Jupiter)
+
+```bash
+virsh dumpxml OwnCloud | grep -A 10 cputune
+# <cputune>
+#   <vcpupin vcpu='0' cpuset='0'/>
+#   ...
+# </cputune>
+```
+
+### Stale cron count and load (on OwnCloud VM)
+
+```bash
+ps -ef | grep 'occ system:cron' | grep -v grep | wc -l
+# 126
+uptime
+# load average: 80.38, 77.60, 76.98
+```
+
+### The bad query pattern (sample)
+
+```sql
+SELECT `fileid`, `storage`, `path`, `parent`, `name`, `mimetype`, `mimepart`,
+       `size`, `mtime`, `encrypted`, `etag`, `permissions`, `checksum`
+FROM `oc_filecache`
+WHERE `storage` = '78'
+  AND `name` COLLATE utf8mb4_general_ci LIKE 'Event20260412190705025.Wav.v%.d1776373045'
+```
+
+### oc_filecache schema confirming the index situation
+
+```
+CREATE TABLE oc_filecache (
+  ...
+  `name` varchar(250) DEFAULT NULL,
+  PRIMARY KEY (`fileid`),
+  UNIQUE KEY `fs_storage_path_hash` (`storage`, `path_hash`),
+  KEY `fs_parent_name_hash` (`parent`, `name`),
+  KEY `fs_storage_mimetype` (`storage`, `mimetype`),
+  KEY `fs_storage_mimepart` (`storage`, `mimepart`),
+  KEY `fs_storage_size` (`storage`, `size`, `fileid`),
+  KEY `fs_parent_storage_size` (`parent`, `storage`, `size`)
+) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
+```
+
+No `(storage, name)` index exists. Combined with the `COLLATE utf8mb4_general_ci` override and mid-string wildcard, every query becomes a full scan of the 257K rows in storage 78.
+
+### Killing the cron stack
+
+```bash
+pkill -f 'occ system:cron'
+sleep 3
+pkill -9 -f 'occ system:cron'
+# Connection dropped briefly during the kill due to system shock — expected
+# Reconnected: 0 stale processes; load avg dropped 80 -> 27 -> 5 within ~3 minutes
+```
+
+### Active queries after cleanup
+
+```sql
+SELECT COUNT(*) FROM information_schema.PROCESSLIST WHERE COMMAND != 'Sleep';
+-- 1 (just my own query)
+```
+
+---
+
+## Configuration Changes (PERSISTED)
+
+### OwnCloud VM — apache crontab (`/var/spool/cron/apache`)
+
+**Before (caused stacking):**
+```cron
+*/15 * * * * /usr/bin/php -f /var/www/owncloud/occ system:cron
+```
+
+**After (current production):**
+```cron
+*/15 * * * * /usr/bin/flock -n /tmp/oc-cron.lock /usr/bin/php -f /var/www/owncloud/occ system:cron
+```
+
+`flock -n` makes new ticks bail immediately if the previous run still holds the lock — at most one `occ system:cron` ever runs.
+
+**Backup:** `/root/apache-crontab.backup-20260428-pre-flock` (67 bytes, contains the original line)
+
+### Group changes (NOT REVERTED)
+
+A group named **`versioning_users`** was created on the OwnCloud VM during the failed per-user-versioning attempt, with all 9 non-pavon users added. The intent was to scope `files_versions` to that group, but `app:enable --groups` was rejected.
The group **still exists** with those memberships. It's harmless (no app uses it) but worth knowing about for next session — could be deleted with:
+
+```bash
+sudo -u apache php /var/www/owncloud/occ group:delete versioning_users
+```
+
+### `files_versions` app
+
+Was DISABLED briefly during the failed attempt; re-ENABLED globally to restore status quo. Currently enabled for all users as before.
+
+---
+
+## Pending / Incomplete Tasks
+
+All deferred per user request. The system is stable now thanks to the flock fix; the items below are improvements, not emergencies.
+
+| # | Task | Status | Notes |
+|---|---|---|---|
+| 1 | Investigate pavon storage layout | DONE | Findings captured above |
+| 2 | Disable versioning for pavon | DEFERRED | Approach: daily cron `occ versions:cleanup pavon` + `occ trashbin:cleanup pavon` (also clean existing 30 GB of pavon's accumulated versions) |
+| 3 | Set up external storage mount for archive | DONE (already existed) | Storage 6, /Archive, SMB to Uranus, filesystem_check_changes already 0 |
+| 4 | Disable trash on external archive storage | DEFERRED | Mooted by retention design — `find -mtime +1095 -delete` bypasses OwnCloud trash anyway |
+| 5 | Build monthly migration cron (internal → /Archive) | DEFERRED | NVR architecture forces files to land internal first. Cutoff target: 90 days. Caveat: `/Archive` uses "SMB Personal (unique file IDs)" backend — host-level CIFS moves may break file-ID invariant |
+| 6 | Build 3-year retention pruning cron | DEFERRED | Weekly `find /Archive -type f -mtime +1095 -delete` then `occ files:scan pavon/Archive` |
+
+### Concrete next session checklist
+
+1. Decide approach for pavon versioning: A (disable globally), C (aggressive 30-day migration), or D (`versions:cleanup pavon` daily cron) — D is what user proposed
+2. If D: run one-time `occ versions:cleanup pavon` (reclaims 30 GB) + `occ trashbin:cleanup pavon`, then schedule `0 3 * * *` cron
+3.
Decide migration cutoff (90 days is current target; could go shorter for capacity reasons since `/owncloud` is 74% full)
+4. Build migration script — open question whether to use OwnCloud API or host-level CIFS mount + `mv`
+5. Build retention pruning cron on Uranus or via CIFS mount on OwnCloud VM
+6. Reconcile the 1Password OwnCloud password (currently has stale value `Paper123!@#-unifi!`; should be `r3tr0gradE99!!` per SOPS)
+7. Optionally clean up: `occ group:delete versioning_users` (created but unused this session)
+
+---
+
+## Reference
+
+### File paths
+
+- OwnCloud install root: `/var/www/owncloud`
+- OwnCloud occ command: `sudo -u apache php /var/www/owncloud/occ ...`
+- OwnCloud data dir: `/owncloud` (NFS to Jupiter)
+- Pavon's home files: `/owncloud/pavon/files/`
+- Pavon's versions junk: `/owncloud/pavon/files_versions/` (30 GB, 1,326 files)
+- Apache crontab: `/var/spool/cron/apache`
+- Crontab backup: `/root/apache-crontab.backup-20260428-pre-flock`
+- Cron lock file: `/tmp/oc-cron.lock` (used by flock wrapper)
+
+### Useful occ commands for next session
+
+```bash
+# Per-user version cleanup (deletes BOTH disk files and filecache rows)
+sudo -u apache php /var/www/owncloud/occ versions:cleanup pavon
+sudo -u apache php /var/www/owncloud/occ trashbin:cleanup pavon
+
+# Trigger a scan after filesystem-level changes
+sudo -u apache php /var/www/owncloud/occ files:scan pavon
+sudo -u apache php /var/www/owncloud/occ files:scan pavon/Archive
+
+# Per-user expire (uses retention policy)
+sudo -u apache php /var/www/owncloud/occ versions:expire pavon
+sudo -u apache php /var/www/owncloud/occ trashbin:expire pavon
+
+# External storage management
+sudo -u apache php /var/www/owncloud/occ files_external:list
+sudo -u apache php /var/www/owncloud/occ files_external:option <mount_id> filesystem_check_changes 0
+```
+
+### Local files created this session
+
+All under `c:/Users/guru/ClaudeTools/temp/` (small bash scripts uploaded via pscp to /tmp on OwnCloud VM):
+
+- 
`owncloud-investigate.sh` — initial pavon storage walk
+- `owncloud-versioning-check.sh` — schema/state check
+- `owncloud-groups-check.sh` — group enumeration
+- `owncloud-pavon-groups.sh` — per-user group mapping
+- `owncloud-versioning-restrict.sh` — failed group-restrict attempt
+- `occ-versions-help.sh` — discover available occ subcommands
+
+These are scratch scripts; no need to preserve.
+
+### MariaDB on OwnCloud VM (cheat sheet)
+
+- Version: MariaDB 10.5.29
+- Local CLI: `mysql owncloud --skip-column-names <<<'SQL...'` works as root via socket auth
+- `mysql -e 'SQL'` does NOT work via plink heredoc; PowerShell mangles quoting. Use heredoc (`<<<` or `<