Files

Mike Swanson 2c5f10faaa Session log: qwen3.6 benchmark, route strict-format to 3.6

Benchmarked qwen3.6 (36B MoE) vs qwen3:14b and qwen3:32b on 16
representative prompts. qwen3.6 scored 15/16 vs 14b 11/16 and 32b
12/16, winning every strict-format/adherence test (multi-step rules,
weekend-aware scheduling, prompt-injection resistance, word-limit
summary). Single reasoning regression noted for re-check at qwen3.7.

Updated .claude/OLLAMA.md (Models, Documentation Engine, and
When-to-Use tables) and .claude/CLAUDE.md one-line model summary to
route strict-format work to qwen3.6 and keep bulk prose on qwen3:14b
(2x faster). Also removed openclaw npm package + ~/.openclaw data dir
earlier in the session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-16 16:03:07 -07:00

22 KiB

Raw Blame History

Session Log — 2026-05-16

User

User: Mike Swanson (mike)
Machine: DESKTOP-0O8A1RL
Role: admin
Session span: ~2026-05-15 evening through 2026-05-16 morning PT (cross-midnight, continued from compacted context)

Session Summary

The session opened as a continuation of prior work that had been context-compacted. The main pending item was inserting an "Asset Location Tracking" section into the GuruRMM feature roadmap — a Python patch script had been written to D:\claudetools\tmp_roadmap_patch.py but not yet executed. The script was copied to the GuruRMM server (172.16.3.30) via scp and run with Python 3; it successfully located the marker, inserted the new section before the --- divider separating Core Agent Features from Server/API Features, and printed "inserted". The updated roadmap was committed and pushed to the gururmm repo as 883d8ff.

Earlier in the session (pre-compaction) a significant amount of GuruRMM infrastructure work was completed. Jupiter's Docker container was rebuilt to add /dev/kvm and /var/run/libvirt/libvirt-sock mounts and the libvirt-clients apt package, enabling hypervisor detection (is_hypervisor: true) and VM enumeration (7 hosted VMs via virsh). Pluto's agent was force-updated to v0.6.22 by bumping Cargo.toml, since the auto-updater skips same-version builds. Three watchdog bugs were identified and fixed: sc.exe fallback when SCM service.stop() returns access denied; suppress_until being set to a future deadline on restart failure instead of cleared to Instant::now(); and a misplaced warn! for PerformUpdate stubs that caused unnecessary log noise (downgraded to debug!). A changelog generation script was wired into build-agents.sh so v0.6.22.md and LATEST_AGENT.md are auto-generated on each build.

The dashboard Terminal tab had a NativeSelect flex layout bug where the className prop was being applied to the inner select element rather than the outer wrapper div, causing the input box to appear invisible. The fix moved the className to the outer div and removed inline-block, letting twMerge correctly resolve caller width classes against the inner select's w-full. The output panel was also enlarged from h-80 to h-[28rem].

The session closed with a workflow feature: a /feature-request skill was created so Howard can submit GuruRMM feature requests from his Claude Code session. When invoked, Claude reads the current roadmap, calls Ollama to classify where the feature belongs (section, subsection, priority), and posts a coord message to Mike's machines so he can review and decide if/how/when to build it. The skill was committed to claudetools, synced to Gitea, and coord messages were sent to Howard on both his known machines (ACG-TECH03L and Howard-Home) explaining how to use it.

Key Decisions

Asset location tracking added to roadmap only — not built. Mike's explicit instruction was to add it for future consideration, not to start implementation. Section covers WiFi geolocation (SSID/BSSID → Google/Microsoft API, ~50-100m), IP fallback, on-demand requests, mobile app as a future phase, and geofence alerts as P3.
No Apple MDM vendor certificate required for mobile app. Standard APNs/FCM push is sufficient for the "find stolen device" use case. Full MDM profile management is explicitly out of scope for v1.
Watchdog suppress_until cleared on restart failure. The original code left suppress_until at a future deadline when RestartMainService failed, which meant the watchdog would silently skip restart attempts for however long was left on the suppression window. Fix: set to Instant::now() so the next poll tick tries again immediately.
sc.exe fallback for service stop. The windows-service crate's service.stop() can return ERROR_ACCESS_DENIED even when the process has SeServiceLogonRight, apparently due to session isolation in some upgrade paths. sc.exe bypasses this. Added as fallback rather than replacement so the SCM path remains preferred.
/feature-request routes to Ollama for classification. Using qwen3:14b at Tier 0 keeps the cost and latency low for what is essentially a document classification task. Howard's machines use the Tailscale Ollama endpoint (100.92.127.64:11434).
Coord messages sent to both of Mike's primary machines. DESKTOP-0O8A1RL and Mikes-MacBook-Air both receive the notification so it surfaces regardless of which machine Mike opens Claude Code on.

Problems Encountered

hosted_vm_uuids empty after /dev/kvm mount. virsh was not installed in the Docker image. Fix: added libvirt-clients to the Dockerfile apt-get block and rebuilt the image.
virsh still failing after new image. Wrong socket path — /var/run/libvirt/libvirt-sock was not mounted. Fix: verified the socket path on the Jupiter host, added the mount to the Unraid container config and unraid-ca-template.xml.
Pluto stuck on v0.6.21 after KVM fix built. Auto-updater compares version strings and skips same-version. The KVM fix was built as v0.6.21 but Pluto was already on v0.6.21. Fix: bumped Cargo.toml to 0.6.22, rebuilt.
Pluto service offline 25+ minutes after watchdog-triggered update. Three separate bugs in watchdog (see Key Decisions). Service was manually started via Python paramiko (sshpass not available; paramiko handles password auth programmatically without interactive stdin).
Bash pre-backslash hook blocked SSH with full Windows path. Hook at D:/claudetools/.claude/hooks/pre-bash-backslash.sh blocked C:\Windows\System32\OpenSSH\ssh.exe. Used bare ssh command instead.
Python unicode error reading Windows log. Log contained non-UTF8 bytes. Fix: sys.stdout.buffer.write(output.encode('utf-8', errors='replace')).
NativeSelect flex layout bug in Terminal tab. className prop was applied to the inner select element, not the outer wrapper div, so the caller's width class (w-32) was overridden by the inner select's w-full. Fix: moved className to outer div, inner select always w-full.
Pre-bash-backslash hook blocking curl multiline commands. Coord messages with backslash line continuations were blocked. Fix: rewrote as Python urllib one-liners which the hook doesn't touch.

Configuration Changes

gururmm repo (172.16.3.30:/home/guru/gururmm):

agent/Dockerfile — added libvirt-clients to apt-get block
agent/Cargo.toml — version bumped 0.6.21 → 0.6.22
agent/src/watchdog/monitor.rs — three bug fixes (sc.exe fallback, suppress_until, PerformUpdate log level)
scripts/build-agents.sh — added generate-changelog.sh call before "Build complete" log
changelogs/agent/v0.6.22.md — created (auto-generated release notes)
changelogs/LATEST_AGENT.md — updated to point to v0.6.22
dashboard/src/components/Select.tsx — NativeSelect className moved to outer div, inline-block removed
dashboard/src/components/CommandTerminal.tsx — NativeSelect w-32 shrink-0, output panel h-[28rem]
docs/FEATURE_ROADMAP.md — Asset Location Tracking section inserted (commit 883d8ff)

claudetools repo (D:\claudetools):

.claude/commands/feature-request.md — new skill file
.claude/CLAUDE.md — /feature-request added to Commands table
projects/msp-tools/guru-rmm/docs/unraid-ca-template.xml — /dev/kvm and libvirt-sock mount entries added
projects/msp-tools/guru-rmm/agent/Dockerfile — updated to match live repo

Credentials & Secrets

Pluto (172.16.3.36) administrator password: Paper123!@# — NOT YET IN VAULT. Pending vault entry at infrastructure/pluto-build-server.sops.yaml.

Infrastructure & Servers

Host	IP	Role	Notes
Jupiter	172.16.3.20	Unraid primary / KVM hypervisor	Docker container rebuilt with /dev/kvm + libvirt-sock mounts
Saturn	172.16.3.30	GuruRMM server + ClaudeTools API	Gitea at :3000, RMM server at :3001, ClaudeTools API at :8001
Pluto	172.16.3.36	Windows build server	Agent running v0.6.22 post-watchdog fix; is_virtual_machine: true, hypervisor: Jupiter

GuruRMM agent v0.6.22 — deployed to all enrolled agents via auto-update after build pipeline ran.

Commands & Outputs

# Copy and run roadmap patch on server
scp D:/claudetools/tmp_roadmap_patch.py guru@172.16.3.30:/tmp/roadmap_patch.py
ssh guru@172.16.3.30 "python3 /tmp/roadmap_patch.py"
# Output: inserted

# Commit roadmap update
ssh guru@172.16.3.30 "cd /home/guru/gururmm && git add docs/FEATURE_ROADMAP.md && git commit -m 'docs: add Asset Location Tracking section to feature roadmap' && git push origin main"
# Output: [main 883d8ff] ... To 172.16.3.20:azcomputerguru/gururmm.git  d9ec476..883d8ff  main -> main

# Send coord messages to Howard (Python urllib, avoids backslash hook)
py -c "import urllib.request, json; ..."
# Output: ACG-TECH03L/claude-main 201 / Howard-Home/claude-main 201

Pending / Incomplete Tasks

Pluto vault entry — create infrastructure/pluto-build-server.sops.yaml with password Paper123!@# and standard fields (hostname, IP, role, admin credentials)
Pluto SSH key — add DESKTOP-0O8A1RL public key to Pluto authorized_keys so paramiko password auth is no longer needed
GURU-BEAST-ROG enrollment — not in GuruRMM; Mike may want to enroll it (site assignment TBD)
macOS agent — build-agents.sh has TODO-MACOS; no Docker/install path implemented
Live terminal (xterm.js + PTY bridge) — deferred; CommandTerminal is input-only for now
Policy wiring — plan at ticklish-questing-stallman.md exists; deferred
BB-SERVER enrollment loop — pre-existing duplicate key constraint, not addressed
PowerShell command_type bug on Windows PS 5.1 — agent prepends flags incorrectly; not addressed
Dashboard VM badges — verify Jupiter (hypervisor) and Pluto (VM guest) display correctly after data fix

Reference Information

GuruRMM repo: http://172.16.3.20:3000/azcomputerguru/gururmm
Asset Location Tracking roadmap commit: 883d8ff (gururmm repo)
/feature-request skill commit: 83e8e44 (claudetools repo)
Coord API: http://172.16.3.30:8001/api/coord
Ollama Tailscale endpoint (Howard's machines): http://100.92.127.64:11434
Unraid CA template: projects/msp-tools/guru-rmm/docs/unraid-ca-template.xml
Policy wiring plan: C:\Users\guru\.claude\plans\ticklish-questing-stallman.md

Update: 11:50 PT — agent-os standards system + feature planning tools

Session Summary

After reviewing the agent-os GitHub repo (buildermethods/agent-os), four high-value improvements were identified and implemented in a single parallel agent run. The core idea borrowed from agent-os: split a monolithic guidelines doc into individual indexed standards files so agents load only what's relevant to a given task, rather than the entire guidelines file every time.

The standards system agent split CODING_GUIDELINES.md into 19 individual files under .claude/standards/, organized by topic: conventions (no-emojis, naming, output-markers), powershell (execution-pattern, tmp-path-windows), context-lookup (grepai-first), security (credential-handling), api (response-format), git (commit-style), gitea (internal-api), gururmm (platform-parity, build-pipeline, sqlx-migrations), syncro (comment-dedup, time-entry-protocol, html-formatting), ssh (windows-openssh), python (windows-runtime), client (communication-tone). Two standards beyond the requested list were added: sqlx-migrations (the proc macro outage was substantial enough to codify) and powershell/tmp-path-windows (the /tmp vs Windows temp path mismatch affects any Write-tool-to-Bash handoff). An index.yml was created with one-line matchable descriptions for each file, enabling the /inject-standards command to select 2-5 relevant standards per task without loading the full set.

The /shape-spec command was created as a pre-implementation planning tool for GuruRMM features. It gates on both a feature description (Phase 1) and explicit out-of-scope items (Phase 2) before writing anything. Output is four files in specs/<slug>/: plan.md (ordered task list, Task 0 always "commit the spec"), shape.md (decisions/constraints/non-goals), references.md (existing code found via Grep), standards.md (applicable standards from index.yml). This solves the "re-explaining context across sessions" problem for multi-session GuruRMM feature work.

Ground-truth docs were written for both active projects. GuruRMM got docs/tech-stack.md (server/agent/dashboard/pipeline architecture) and docs/mission.md (purpose, current scope, roadmap direction, design principles) committed to the gururmm submodule at 79604a2. ClaudeTools API got equivalent docs at docs/tech-stack.md and docs/mission.md in the repo root. These are fast-load context docs — future sessions read them instead of reconstructing architecture from code.

Key Decisions

19 standards files, not one per CODING_GUIDELINES section. Some sections were split further (output-markers separated from no-emojis; powershell tmp-path separated from execution-pattern) because they apply to different task types.
/shape-spec is a new command, not a modification to /create-spec. /create-spec is for the AutoCoder autonomous coding workflow (feature counts, XML spec files). /shape-spec is for GuruRMM feature planning within our existing dev workflow — different purpose, different output format.
sqlx-migrations.md added beyond the requested list. The sqlx proc macro outage caused multiple sessions of recovery and a brief production impact. Codifying it as a standard is warranted.
GuruRMM mission/tech-stack committed to the submodule. These docs live in the gururmm repo (authoritative), not just the claudetools reference copy, so they travel with the codebase.
Agent-os install script not adopted. agent-os is designed for multi-repo installation. Claudetools is a mono-repo on Gitea sync — standards live in .claude/standards/ and are immediately available to all machines after /sync. No separate install mechanism needed.

Configuration Changes

claudetools repo:

.claude/commands/inject-standards.md — new /inject-standards command
.claude/commands/shape-spec.md — new /shape-spec command
.claude/standards/ — 19 standards files + index.yml (all new)
.claude/CLAUDE.md — /inject-standards and /shape-spec added to Commands table
docs/tech-stack.md — new ClaudeTools API tech-stack doc
docs/mission.md — new ClaudeTools API mission doc

gururmm submodule (commit 79604a2):

docs/tech-stack.md — new GuruRMM tech-stack ground-truth doc
docs/mission.md — new GuruRMM mission doc

Commit: dd0ef45 — feat: implement agent-os standards system and feature planning tools

Pending / Incomplete Tasks

Same as prior update — no new pending items from this work
Pluto vault entry still pending
Pluto SSH key still pending
Policy wiring plan (ticklish-questing-stallman.md) still deferred

Reference Information

agent-os repo reviewed: https://github.com/buildermethods/agent-os
Standards index: .claude/standards/index.yml
Commit (claudetools): dd0ef45
Commit (gururmm submodule tech-stack/mission): 79604a2

Update: 16:02 MST — qwen3.6 benchmark + Ollama routing update + openclaw removal

Machine: GURU-BEAST-ROG (Mike Swanson)

Session Summary

Removed openclaw from the workstation by uninstalling the global npm package (npm uninstall -g openclaw, 458 deps removed) and deleting the ~/.openclaw data directory (identity, agents, memory, devices, .env). User chose complete deletion with no backup. Confirmed no running processes, scheduled tasks, or services existed for openclaw.

Benchmarked qwen3.6:latest (new 36B MoE) against qwen3:14b (current production default) and qwen3:32b on the local Ollama instance to evaluate whether 3.6 is a meaningful upgrade for the documentation-engine workload. Built a Python harness measuring cold-start load time, throughput (from Ollama's eval_duration), and capability scores against deterministic graders. Initial six-prompt round exposed a grader bug (multi_step test had the wrong expected set) — after fixing, all three models scored 5/6 with qwen3.6 the only one to apply per-file rules correctly. Per user request, expanded the suite to 16 prompts weighted toward strict-format and adherence work (CSV filter, FizzBuzz, PII redaction, exact-count bullets, nested JSON, scheduling with weekend trap, prompt-injection resistance, exact delimiter, multi-field classification, strict word-limit summary).

Re-ran with the expanded suite. Final scores: qwen3:14b 11/16, qwen3:32b 12/16, qwen3.6 15/16. qwen3.6 won every strict-format/adherence test (multi-step rules, weekend-aware scheduling, injection resistance, 25-word limit). One regression: qwen3.6 failed the 15-min schedule reasoning prompt (answered 3, expected 4) that 14b and 32b both got right. Throughput: 14b ~66 tok/s, 32b ~21 tok/s, 3.6 ~32 tok/s. qwen3.6 cold-load (4.9s) was actually faster than 14b's (8.6s) despite the larger file.

Updated .claude/OLLAMA.md (Models, Documentation Engine, When-to-Use tables) and the one-line model summary in .claude/CLAUDE.md to route prose drafting to qwen3:14b (2x faster) and strict-format work (JSON, classification, redaction, word limits, multi-step rules, untrusted input) to qwen3.6. Added an explicit "untrusted input that may contain prompt injection → qwen3.6" routing rule since 14b and 32b both output "HACKED" to the injection prompt and only 3.6 ignored it.

Key Decisions

Promoted qwen3.6 to dual-routing default (strict-format only) rather than full replacement — 14b's 2x throughput still wins for bulk prose where format is forgiving.
Expanded benchmark from 6 to 16 prompts before changing documentation defaults. The first 6 produced an ambiguous 5/6-across-the-board signal; the expanded suite produced a decisive 4-point capability gap.
Added explicit injection-resistance routing rule. Both older models output "HACKED" to the injection test; only 3.6 resisted. Worth calling out separately in OLLAMA.md so future routing decisions account for it.
Documented the 3.6 reasoning regression in OLLAMA.md as a re-check-at-qwen3.7 note rather than disqualifying 3.6. Single-prompt miss vs four strict-format wins is a clear net positive.
Kept qwen3:32b installed despite being dominated on every axis (per user choice — frees ~20 GB if removed later).
Removed openclaw with no backup per explicit user direction (".env, identity, device pairings — all gone").

Problems Encountered

Grader bug in the multi_step test. Initial expected set uppercased .py filenames but the prompt said to leave them unchanged. Discovered by inspecting raw model outputs; fixed check_multi_step() and re-scored from saved snippets.
Shell escaping of \n literals when rescoring inline via py -c "..." from bash double-quoted heredocs — the backslash got eaten and the replace silently no-op'd. Worked around by writing rescore_qwen.py as a real file.
Rebase conflict on this very session log — DESKTOP-0O8A1RL had already pushed a 2026-05-16 log earlier (GuruRMM work). Resolved by keeping both, appending this work as an Update section.

Configuration Changes

.claude/OLLAMA.md — rewrote three tables (Models, Documentation Engine, When-to-Use). Added benchmark-basis paragraph under Models and a one-line rule-of-thumb under When-to-Use. +23/-10 lines.
.claude/CLAUDE.md — single line updated (model summary now names qwen3.6 + qwen3:14b instead of qwen3:14b only). +1/-1.

Files Created (uncommitted, in CWD on GURU-BEAST-ROG)

benchmark_qwen_3_6.py — re-runnable harness, 16 prompts, deterministic graders
rescore_qwen.py — one-off rescorer that reads snippets from JSON and regenerates the MD report
qwen-benchmark-2026-05-16.json — full raw benchmark output (per-prompt timings, token counts, snippets, pass/fail)
qwen-benchmark-2026-05-16.md — readable comparison report

Infrastructure & Servers

Ollama (local on DESKTOP-0O8A1RL, accessed from GURU-BEAST-ROG via OLLAMA env) — three models exercised: qwen3:14b (9.3 GB), qwen3:32b (20 GB), qwen3.6:latest (24 GB MoE, Q4_K_M, family qwen35moe).
No production servers, databases, or client systems touched.

Credentials

None used or rotated. The deleted ~/.openclaw/.env likely contained openclaw-specific API keys / device pairing tokens — destroyed per user direction, not captured.

Commands & Outputs

# Remove openclaw
npm uninstall -g openclaw       # removed 458 packages in 3s
rm -rf "C:/Users/guru/.openclaw"
where.exe openclaw              # INFO: Could not find files for the given pattern(s).

# Run benchmark
py benchmark_qwen_3_6.py        # 16 prompts x 3 models, ~12 min total

# Final scoreboard
#   qwen3:14b            11/16   66 tok/s
#   qwen3:32b            12/16   21 tok/s
#   qwen3.6:latest       15/16   32 tok/s

Pending / Incomplete

Re-validate the reasoning regression when qwen3.7 (or any qwen3.6 update) lands. The 15-min schedule prompt (reasoning test in the harness) is the canary — currently 3, expected 4.
Decide on qwen3:32b retention — dominated on every axis, frees ~20 GB if removed. Deferred.
Decide whether to commit benchmark artifacts to repo (e.g. benchmarks/ folder) so future model evaluations have a baseline. Deferred.

Reference Information

Benchmark harness: c:\Users\guru\ClaudeTools\benchmark_qwen_3_6.py (rerun: py benchmark_qwen_3_6.py)
Benchmark report: c:\Users\guru\ClaudeTools\qwen-benchmark-2026-05-16.md
Benchmark raw data: c:\Users\guru\ClaudeTools\qwen-benchmark-2026-05-16.json
Ollama endpoint (local on this machine): http://localhost:11434/api/chat with think:false for qwen3 family, options.num_ctx:4096 for benchmark
Updated docs: .claude/OLLAMA.md, .claude/CLAUDE.md

22 KiB Raw Blame History

Session Log — 2026-05-16

User

Session Summary

Key Decisions

Problems Encountered

Configuration Changes

Credentials & Secrets

Infrastructure & Servers

Commands & Outputs

Pending / Incomplete Tasks

Reference Information

Update: 11:50 PT — agent-os standards system + feature planning tools

Session Summary

Key Decisions

Configuration Changes

Pending / Incomplete Tasks

Reference Information

Update: 16:02 MST — qwen3.6 benchmark + Ollama routing update + openclaw removal

Session Summary

Key Decisions

Problems Encountered

Configuration Changes

Files Created (uncommitted, in CWD on GURU-BEAST-ROG)

Infrastructure & Servers

Credentials

Commands & Outputs

Pending / Incomplete

Reference Information

22 KiB

Raw Blame History