claudetools/clients/pavon/session-logs/2026-04-29-session.md
Mike Swanson f22d33f2ae pavon: session log — OwnCloud VM cron stacking diagnosed and stabilized
Found 75-126 stale `occ system:cron` processes on 172.16.3.22 piling up
since 2026-04-27 due to bad oc_filecache LIKE query against pavon's 257K
camera files. Killed stale procs (load 80 -> 5), wrapped apache crontab
with `flock -n /tmp/oc-cron.lock` to prevent restacking. Per-user
versioning disable rejected by OwnCloud Community (`files_versions`
can't be enabled for groups); workaround `occ versions:cleanup pavon`
identified and deferred. Migration/retention cron deferred per user.
NVR architecture clarified: GeoVision NVRs sync via OC Desktop client
with virtual file placeholders; no direct SMB access to Jupiter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 07:51:21 -07:00


Session Log — Pavon — 2026-04-29

User

  • User: Mike Swanson (mike)
  • Machine: GURU-BEAST-ROG
  • Role: admin

Mode

infra — OwnCloud VM diagnostics on Jupiter, multi-host SSH, scheduled cron edits.


Session Summary

A performance issue was reported on a Windows Server 2016 VM running the SharePoint Migration Tool (SPMT) for BirthBiologic, hosted on Jupiter (Unraid, 172.16.3.20). The first hypothesis blamed CPU pinning on the Windows VM, but checking which qemu process actually owned the load showed the source was the OwnCloud VM (172.16.3.22, Rocky Linux 9.6). That VM was saturated by 75-126 stacked occ system:cron processes, each issuing MariaDB queries that repeatedly scanned the oc_filecache table, which kept CPU usage and load average high. The query pattern combined a mid-string wildcard in the LIKE clause, a COLLATE utf8mb4_general_ci override against a utf8mb4_bin table, and the absence of a (storage, name) index, so every query scanned all rows in storage 78.

After killing the stalled processes and implementing a flock lock to prevent future stacking, CPU usage dropped significantly. Further investigation into the storage configuration and data retention policy confirmed that OwnCloud is the source of truth for the data (3-year retention), and no immediate migration or cleanup was performed at the user's request. Credentials access was switched from 1Password to SOPS partway through the session per user preference, which uncovered drift between the two sources for the OwnCloud VM root password.


Key Decisions

  • Kill stalled cron processes immediately rather than waiting them out — system was in deep distress (load avg 80, ~70 tasks queued waiting on CPU) and there was no risk to user data
  • Wrap cron with flock -n instead of disabling cron outright — preserves OwnCloud's normal background-job processing while making the stacking-spiral structurally impossible
  • Defer per-user versioning disable — OwnCloud Community refused --groups flag for files_versions. Workaround identified (occ versions:cleanup pavon on a daily cron) but deferred per user request
  • Did not bypass OwnCloud for ingest — user clarified GeoVision NVRs at Curves and Raiders sites use OwnCloud Desktop sync client with virtual file placeholders to save NVR-local disk; NVR units lack direct SMB access to Jupiter/Saturn, so the existing ingest path stays
  • Defer the migration/retention scripts — current state (cron-flock + status quo on versioning) is stable enough; user wants to revisit the architectural cleanup with more time

Problems Encountered

  • Incorrect initial diagnosis of CPU pinning — first hypothesis was that the Windows VM's 16 vCPUs were oversubscribed onto 8 host threads. virsh dumpxml showed no <cputune> block at all on the Windows VM. Re-pivoted by checking which qemu process actually owned the load, found OwnCloud
  • Stale occ system:cron processes — 75-126 piled up since 2026-04-27. No flock wrapper meant every 15-min tick fired a new run while previous ones still ground on. Resolved with pkill -f 'occ system:cron' then pkill -9 for stragglers
  • Inefficient MariaDB query pattern in OwnCloud's filecache scanner: name COLLATE utf8mb4_general_ci LIKE 'pattern%middle%suffix'. The COLLATE override and the mid-string wildcard together defeat the utf8mb4_bin index on name. The schema only has (parent, name) and (storage, path_hash) indexes, no (storage, name). Each query becomes a 257K-row scan on storage 78
  • Failed attempt to disable versioning for one user: app:enable files_versions --groups <group> was rejected with "files_versions can't be enabled for groups." Reverted by re-enabling the app globally
  • Credential drift between 1Password and SOPS — SOPS had r3tr0gradE99!! for OwnCloud VM root, 1Password had Paper123!@#-unifi!. SOPS value worked, 1Password did not. Worth reconciling
  • PowerShell mangling shell metachars — repeatedly tripped on $(), *, (, backticks when invoking SSH inline. Worked around by writing scripts to local files, scp-ing, then executing remotely

Credentials Used / Discovered

Jupiter (Unraid Primary)

  • Host: 172.16.3.20
  • User: root
  • Password (1Password and SOPS agree): Th1nk3r^99##
  • SOPS path: infrastructure/jupiter-unraid-primary.sops.yaml
  • 1Password item id: 5ji4rsgvn6feare6fahxsqauui

OwnCloud VM (Rocky Linux 9.6)

  • Host: 172.16.3.22
  • Hostname: cloud.acghosting.com
  • User: root
  • Password (SOPS — works): r3tr0gradE99!!
  • Password (1Password — STALE, does NOT work): Paper123!@#-unifi!needs reconciliation
  • SOPS path: infrastructure/owncloud-vm.sops.yaml
  • 1Password item id: h6usgzxxn26kvckxz5dhssxdai
  • SSH host key (ed25519): SHA256:Yy4oFv5HudmKjNJ4IZgHcuSSmeBvUg+ZJta6iLasdqU

MariaDB on OwnCloud VM

  • Local socket auth as root from the VM (no password prompt for mysql CLI when run as root)
  • OwnCloud's app DB: owncloud schema, user owncloud@localhost

Infrastructure Touched

| Host | IP | Role | Action |
|---|---|---|---|
| Jupiter (Unraid Primary) | 172.16.3.20 | Hypervisor | Read-only diagnostics: virsh list, virsh dumpxml, virsh vcpuinfo, mpstat, ps |
| OwnCloud VM | 172.16.3.22 | Rocky Linux 9.6, runs OwnCloud + MariaDB 10.5.29 | pkill cron processes, edited apache crontab, files_versions app toggled (reverted), versioning_users group created (not reverted) |
| Uranus (Unraid Secondary) | 172.16.3.21 | SMB share host (Storage share) | None this session; referenced as future archive target |

VM cputune findings on Jupiter

  • Windows Server 2016 VM: No CPU pinning. Topology sockets=1, cores=8, threads=2 (16 vCPUs). Memory 16 GB current / 32 GB max with ballooning.
  • OwnCloud VM: vCPUs 0-7 pinned 1:1 to host CPU 0,28,1,29,2,30,3,31 — exactly the eight LPs that were saturated. Topology cores=4, threads=2 (8 vCPUs).

OwnCloud users and groups

  • 10 users total: Martell, anaise, bst, jburger, mara, minrec, pavon, rohrbach, sysadmin, themarcgroup
  • Existing groups: ACG, Clients, PST, QMS, Stamback, Stoltz, WCP, admin
  • Only sysadmin is in any group (ACG, admin); all other users are unaffiliated

Storage layout (OwnCloud VM)

  • OwnCloud data dir: /owncloud, NFS-mounted from 172.16.3.20:/mnt/user/OwnCloud
  • Filesystem state: 932 GB total, 677 GB used, 248 GB free, 74% full
  • Pavon's storage (numeric_id 78, home::pavon):
    • /owncloud/pavon/files/Curves/ — 188,920 files (all sub-paths Curves/Data-F/CamNN/YYYYMMDD/Event*.Avi or .Wav)
    • /owncloud/pavon/files/Raiders/ — 48,978 files (Raiders/Cameras{,2}/CamNN/YYYYMMDD/Event*.Avi)
    • Two NVR log files at root (NVR-18019140.out, NVR-18082322.out, ~16K each)
    • Total ~237K files
    • 30 GB of accumulated junk versions in /owncloud/pavon/files_versions/ (1,326 version files, 1,383 filecache rows)
  • External storage (numeric_id 6, mount_id 6):
    • Mount point in pavon's view: /Archive
    • Backend: SMB Personal (unique file IDs)
    • Host: 172.16.3.21 (Uranus)
    • Share: Storage
    • SMB user: owncloud
    • filesystem_check_changes already set to 0 — OwnCloud doesn't auto-rescan this on cron

File age distribution (pavon)

2024:      1 file (oldest from 2024-12-21)
2025: 162,898 files
2026:  74,719 files
Older than 365 days: 256 files (as of 2026-04-29)
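The log does not record how the distribution above was produced. As a rough sketch, assuming the counts are based on on-disk mtime and that GNU find is available on the VM, a per-year histogram could be reproduced like this (the function name age_histogram is hypothetical):

```shell
#!/usr/bin/env bash
# Hypothetical helper: count regular files per modification year under a directory.
# Assumes GNU find (for -printf); uses file mtime, which may differ from
# the mtime column OwnCloud stores in oc_filecache.
age_histogram() {
  local dir="$1"
  # %TY prints the four-digit year of each file's last modification time.
  find "$dir" -type f -printf '%TY\n' | sort | uniq -c
}
```

Usage on the VM would be something like age_histogram /owncloud/pavon/files, run as a user that can read the data directory.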

Architecture (NEW context from this session)

  • GeoVision NVR units at Curves and Raiders client sites
  • Each NVR runs OwnCloud Desktop sync client with the sync folder pointed at the NVR's data drop directory
  • After upload, OwnCloud client converts the local file to a virtual file placeholder to conserve NVR-local disk; if the NVR ever needs the file, the placeholder triggers a re-download
  • NVR units have NO direct SMB access to Jupiter or Saturn — they reach OwnCloud only via the WebDAV interface used by the desktop client
  • OwnCloud is the source of truth for this footage, not a backup of it
  • Retention policy: 3 years. Older may be deleted
  • Pavon never uses OwnCloud directly — only the NVR interface for footage retrieval
  • GeoVision has no built-in age-based file routing — can't move old files to a different folder on the NVR side

Commands & Outputs (Critical)

Identifying the runaway VM (on Jupiter)

ssh root@172.16.3.20 'ps -eo pid,pcpu,pmem,comm,args --sort=-pcpu | head -3'
# PID 15343  486% CPU  qemu-system-x86_64 ... guest=OwnCloud
# PID 2349755 118% CPU  qemu-system-x86_64 ... guest=Windows Server 2016
# PID 13887   25% CPU  qemu-system-x86_64 ... guest=Unifi

mpstat -P 0,1,2,3,28,29,30,31 1 1
# All eight LPs showed %guest near 100, %idle near 0 — load was guest VM, not host

OwnCloud cputune confirming pin (on Jupiter)

virsh dumpxml OwnCloud | grep -A 10 cputune
# <vcpupin vcpu='0' cpuset='0'/>
# <vcpupin vcpu='1' cpuset='28'/>
# ...
# <vcpupin vcpu='7' cpuset='31'/>

Stale cron count and load (on OwnCloud VM)

ps -ef | grep 'occ system:cron' | grep -v grep | wc -l
# 126
uptime
# load average: 80.38, 77.60, 76.98

The bad query pattern (sample)

SELECT `fileid`, `storage`, `path`, `parent`, `name`, `mimetype`, `mimepart`,
       `size`, `mtime`, `encrypted`, `etag`, `permissions`, `checksum`
FROM `oc_filecache`
WHERE `storage` = '78'
  AND `name` COLLATE utf8mb4_general_ci LIKE 'Event20260412190705025.Wav.v%.d1776373045'

oc_filecache schema confirming the index situation

CREATE TABLE oc_filecache (
  ...
  `name` varchar(250) DEFAULT NULL,
  PRIMARY KEY (`fileid`),
  UNIQUE KEY `fs_storage_path_hash`   (`storage`, `path_hash`),
  KEY `fs_parent_name_hash`           (`parent`,  `name`),
  KEY `fs_storage_mimetype`           (`storage`, `mimetype`),
  KEY `fs_storage_mimepart`           (`storage`, `mimepart`),
  KEY `fs_storage_size`               (`storage`, `size`, `fileid`),
  KEY `fs_parent_storage_size`        (`parent`,  `storage`, `size`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin

No (storage, name) index exists. Combined with the COLLATE utf8mb4_general_ci override and mid-string wildcard, every query becomes a full scan of the 257K rows in storage 78.
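One way to check the full-scan claim on a later session is to ask MariaDB for the query plan. The sketch below only builds the EXPLAIN statement; the function name explain_filecache_query is hypothetical, the column list is trimmed to fileid for brevity, and nothing is sent to the database until the output is piped into the mysql CLI.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an EXPLAIN statement for the filecache name lookup.
# It does not touch the database; pipe the output to the mysql CLI to run it.
explain_filecache_query() {
  local storage="$1" pattern="$2"
  cat <<SQL
EXPLAIN SELECT fileid FROM oc_filecache
WHERE storage = '${storage}'
  AND name COLLATE utf8mb4_general_ci LIKE '${pattern}';
SQL
}
```

On the OwnCloud VM, as root with socket auth, this could be run as explain_filecache_query 78 'Event20260412190705025.Wav.v%.d1776373045' | mysql owncloud. If the diagnosis holds, the plan should show no key that involves name and a rows estimate on the order of the whole storage, but that output was not captured this session.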

Killing the cron stack

pkill -f 'occ system:cron'
sleep 3
pkill -9 -f 'occ system:cron'
# Connection dropped briefly during the kill due to system shock — expected
# Reconnected: 0 stale processes; load avg dropped 80 -> 27 -> 5 within ~3 minutes

Active queries after cleanup

SELECT COUNT(*) FROM information_schema.PROCESSLIST WHERE COMMAND != 'Sleep';
-- 1  (just my own query)

Configuration Changes (PERSISTED)

OwnCloud VM — apache crontab (/var/spool/cron/apache)

Before (caused stacking):

*/15  *  *  *  * /usr/bin/php -f /var/www/owncloud/occ system:cron

After (current production):

*/15 * * * * /usr/bin/flock -n /tmp/oc-cron.lock /usr/bin/php -f /var/www/owncloud/occ system:cron

flock -n makes new ticks bail immediately if the previous run still holds the lock — at most one occ system:cron ever runs.
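The non-blocking behaviour can be seen in isolation with a throwaway lock file. This is a minimal sketch, assuming util-linux flock and GNU sleep; it uses a temporary file from mktemp rather than the production /tmp/oc-cron.lock, and sleep stands in for a long-running occ system:cron.

```shell
#!/usr/bin/env bash
# Demonstrates flock -n semantics with a temporary lock file (not the production lock).
lock="$(mktemp)"

# "First tick": hold the lock for 2 seconds in the background.
flock -n "$lock" sleep 2 &
holder=$!
sleep 0.5   # give the background job time to acquire the lock

# "Second tick" while the first still holds the lock: flock -n fails at once.
if flock -n "$lock" true; then second_tick="ran"; else second_tick="skipped"; fi

# Once the first run exits, the lock is released and the next tick proceeds.
wait "$holder"
if flock -n "$lock" true; then third_tick="ran"; else third_tick="skipped"; fi

rm -f "$lock"
echo "second=$second_tick third=$third_tick"
```

The skipped tick exits non-zero without running its command, which is why overlapping cron invocations can no longer pile up behind a slow run.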

Backup: /root/apache-crontab.backup-20260428-pre-flock (67 bytes, contains the original line)

Group changes (NOT REVERTED)

A group named versioning_users was created on the OwnCloud VM during the failed per-user-versioning attempt, with all 9 non-pavon users added. The intent was to scope files_versions to that group, but app:enable --groups was rejected. The group still exists with those memberships. It's harmless (no app uses it) but worth knowing about for next session — could be deleted with:

sudo -u apache php /var/www/owncloud/occ group:delete versioning_users

files_versions app

Was DISABLED briefly during the failed attempt; re-ENABLED globally to restore status quo. Currently enabled for all users as before.


Pending / Incomplete Tasks

All deferred per user request. The system is stable now thanks to the flock fix; the items below are improvements, not emergencies.

| # | Task | Status | Notes |
|---|---|---|---|
| 1 | Investigate pavon storage layout | DONE | Findings captured above |
| 2 | Disable versioning for pavon | DEFERRED | Approach: daily cron occ versions:cleanup pavon + occ trashbin:cleanup pavon (also clean existing 30 GB of pavon's accumulated versions) |
| 3 | Set up external storage mount for archive | DONE (already existed) | Storage 6, /Archive, SMB to Uranus, filesystem_check_changes already 0 |
| 4 | Disable trash on external archive storage | DEFERRED | Mooted by retention design: find -mtime +1095 -delete bypasses OwnCloud trash anyway |
| 5 | Build monthly migration cron (internal → /Archive) | DEFERRED | NVR architecture forces files to land internal first. Cutoff target: 90 days. Caveat: /Archive uses "SMB Personal (unique file IDs)" backend, so host-level CIFS moves may break file-ID invariant |
| 6 | Build 3-year retention pruning cron | DEFERRED | Weekly find /Archive -type f -mtime +1095 -delete then occ files:scan pavon/Archive |
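For task 6, a dry-run-first wrapper around the find command may be safer than deleting on the first pass. This is a sketch only: the function name prune_old_files and its mode argument are hypothetical, and it assumes GNU find and that the archive is reachable as a local path (for example via a CIFS mount).

```shell
#!/usr/bin/env bash
# Hypothetical helper: list (default) or delete regular files older than N days.
# Usage: prune_old_files <dir> <days> [delete]
prune_old_files() {
  local dir="$1" days="$2" mode="${3:-dry-run}"
  if [ "$mode" = "delete" ]; then
    # -mtime +N matches files last modified more than N whole days ago.
    find "$dir" -type f -mtime +"$days" -delete
  else
    # Dry run: print what would be removed, change nothing.
    find "$dir" -type f -mtime +"$days" -print
  fi
}
```

With the 3-year retention policy this would be called as prune_old_files /Archive 1095 to review the list, then prune_old_files /Archive 1095 delete, followed by occ files:scan pavon/Archive so the filecache matches the disk.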

Concrete next session checklist

  1. Decide approach for pavon versioning: A (disable globally), C (aggressive 30-day migration), or D (versions:cleanup pavon daily cron) — D is what user proposed
  2. If D: run one-time occ versions:cleanup pavon (reclaims 30 GB) + occ trashbin:cleanup pavon, then schedule 0 3 * * * cron
  3. Decide migration cutoff (90 days is current target; could go shorter for capacity reasons since /owncloud is 74% full)
  4. Build migration script — open question whether to use OwnCloud API or host-level CIFS mount + mv
  5. Build retention pruning cron on Uranus or via CIFS mount on OwnCloud VM
  6. Reconcile the 1Password OwnCloud password (currently has stale value Paper123!@#-unifi!; should be r3tr0gradE99!! per SOPS)
  7. Optionally clean up: occ group:delete versioning_users (created but unused this session)
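If option D is chosen, steps 1-2 could be wrapped in a small script that the 0 3 * * * cron entry calls. The sketch below is an assumption about how that might look, not something that was deployed: the function name cleanup_user and the OCC override variable are hypothetical, and the default occ invocation is the one recorded in this log.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper for the daily versions/trashbin cleanup (option D).
# OCC can be overridden (e.g. OCC="echo occ") to dry-run the sequence.
OCC="${OCC:-sudo -u apache php /var/www/owncloud/occ}"

cleanup_user() {
  local user="$1"
  # Run trashbin cleanup only if the versions cleanup succeeded.
  # $OCC is intentionally unquoted so it splits into command and arguments.
  $OCC versions:cleanup "$user" && $OCC trashbin:cleanup "$user"
}
```

When scheduling it, wrapping the cron line in flock -n with its own lock file, as was done for occ system:cron, would keep a slow cleanup from stacking in the same way.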

Reference

File paths

  • OwnCloud install root: /var/www/owncloud
  • OwnCloud occ command: sudo -u apache php /var/www/owncloud/occ ...
  • OwnCloud data dir: /owncloud (NFS to Jupiter)
  • Pavon's home files: /owncloud/pavon/files/
  • Pavon's versions junk: /owncloud/pavon/files_versions/ (30 GB, 1,326 files)
  • Apache crontab: /var/spool/cron/apache
  • Crontab backup: /root/apache-crontab.backup-20260428-pre-flock
  • Cron lock file: /tmp/oc-cron.lock (used by flock wrapper)

Useful occ commands for next session

# Per-user version cleanup (deletes BOTH disk files and filecache rows)
sudo -u apache php /var/www/owncloud/occ versions:cleanup pavon
sudo -u apache php /var/www/owncloud/occ trashbin:cleanup pavon

# Trigger a scan after filesystem-level changes
sudo -u apache php /var/www/owncloud/occ files:scan pavon
sudo -u apache php /var/www/owncloud/occ files:scan pavon/Archive

# Per-user expire (uses retention policy)
sudo -u apache php /var/www/owncloud/occ versions:expire pavon
sudo -u apache php /var/www/owncloud/occ trashbin:expire pavon

# External storage management
sudo -u apache php /var/www/owncloud/occ files_external:list
sudo -u apache php /var/www/owncloud/occ files_external:option <id> filesystem_check_changes 0

Local files created this session

All under c:/Users/guru/ClaudeTools/temp/ (small bash scripts uploaded via pscp to /tmp on OwnCloud VM):

  • owncloud-investigate.sh — initial pavon storage walk
  • owncloud-versioning-check.sh — schema/state check
  • owncloud-groups-check.sh — group enumeration
  • owncloud-pavon-groups.sh — per-user group mapping
  • owncloud-versioning-restrict.sh — failed group-restrict attempt
  • occ-versions-help.sh — discover available occ subcommands

These are scratch scripts; no need to preserve.

MariaDB on OwnCloud VM (cheat sheet)

  • Version: MariaDB 10.5.29
  • Local CLI: mysql owncloud --skip-column-names <<<'SQL...' works as root via socket auth
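  • mysql -e 'SQL' does NOT work when passed inline through plink, because PowerShell mangles the quoting. Feed the SQL on stdin instead, using a here-string (<<<) or heredoc (<<EOF) inside a script that runs on the VM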

Notes for Howard / Future Mike

  • Don't kill the OwnCloud VM while this OwnCloud + GeoVision NVR architecture is in place — pavon's NVRs are dependent on it being reachable to either upload new footage OR rehydrate placeholders if footage is requested
  • The flock wrapper is the only thing keeping the cron stable right now. Any future change to /var/spool/cron/apache should preserve flock -n /tmp/oc-cron.lock
  • The bad query is a property of OwnCloud's files_versions background job + camera filename pattern + filecache schema — not specific to anything we did. If we ever upgrade OwnCloud, retest under load before declaring victory
  • Storage 6 (/Archive) is already configured correctly — don't recreate it. Just use it
  • 30 GB cleanup waiting: occ versions:cleanup pavon will reclaim it instantly when you're ready

Status at session end

  • OwnCloud VM load: dropping past 5.95 (1-min) from a peak of 80, recovering toward normal
  • MariaDB CPU: oscillating 0-140% (normal background)
  • Stale cron processes: 0
  • New cron via flock: armed; first protected tick will fire within 15 minutes of the change
  • BirthBiologic SPMT (the original ticket): unchanged on the Windows VM side; the user stopped SPMT manually during diagnosis. Decision pending whether to resume SPMT or switch to OneDrive client (separate question discussed mid-session)