Skill-first rule now has two halves: route the request to a doing-skill,
then gate the result with the matching check-skill before 'done' --
inferred from the request, not user-named. Adds .claude/SKILL_ROUTING.md
(on-demand request->doing-skill->check-skill map). Enforcement tier A+B
(CORE rule + map; Stop-hook backstop deferred). Calibrate to stakes; use
Ollama Tier-0 for cheap passes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- deploy-cmd: require explicit --regkey or --group; never auto-pick an
arbitrary cross-client registration key (would enroll into wrong org).
- raw: block POST to any */scan endpoint with no non-empty `where`
(same tenant-wide footgun the scan command guards against).
- main(): catch-all for unexpected exceptions -> [ERROR] + errorlog,
plus a clean exit on KeyboardInterrupt (rc 130).
- isolate: forgiving extension-name match (exact, then substring),
excludes the paired "Restore" ext; errors on ambiguous match.
- detections: --site -> --target-group; Alert.targetGroupId is a
scan-target id, not a Location id (distinct from `agents --site`).
- status: relabel "Target groups (sites)" -> "Scan target groups".
- SKILL.md + docstrings updated to match.
Verified: py_compile clean, selftest green (216 agents), guards fire
on no-key/empty-where/no-agent, deploy-cmd --group picks the group's key.
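A minimal sketch of the forgiving isolate match described above (exact, then
substring; "Restore" excluded; ambiguous is an error). The function name, the
case-insensitive comparison, and the sample extension names are assumptions for
illustration, not the code in this commit.

```python
def match_extension(query: str, names: list[str]) -> str:
    """Resolve a user-supplied extension name: exact match first, then substring."""
    # Drop the paired "Restore" extension so isolate can never select it.
    candidates = [n for n in names if "restore" not in n.lower()]
    q = query.strip().lower()

    # Pass 1: exact (case-insensitive) match wins outright.
    exact = [n for n in candidates if n.lower() == q]
    if len(exact) == 1:
        return exact[0]

    # Pass 2: substring match, accepted only when it is unique.
    partial = [n for n in candidates if q in n.lower()]
    if len(partial) == 1:
        return partial[0]
    if not partial:
        raise LookupError(f"no extension matches {query!r}")
    raise LookupError(f"ambiguous extension {query!r}: {sorted(partial)}")
```

Trying exact before substring means a name that is a prefix of another name
still resolves without tripping the ambiguity error.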
Convergence-pass LOW/NIT cleanup:
- cmd_companies uses list_all_companies() so a >100-company tenant isn't truncated
in the listing (was page-1 only); matches sweep/inventory.
- removed unused 'field' import from dataclasses.
Deliberately NOT changed: id validation on delete-package/report-delete/
blocklist-remove/quarantine-remove/restore - those ids are not pinned 24-hex
format, so validating could reject valid input; they are --confirm-gated and bad
ids match the expected-error markers (no mislog). 81/81 selftest.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
From a third review pass (converging - all MEDIUM/LOW):
- urllib fallback: a post-send reset (RemoteDisconnected/ConnectionReset, which
urllib wraps in URLError) was misclassified as always-safe 'connect' and could
retry a non-idempotent write after a server commit. Now only ConnectionRefused/
DNS (socket.gaierror) -> 'connect'; everything else -> 'timeout' (write-gated).
- _retry_delay clamps a negative numeric Retry-After to 0 (was -> time.sleep(-1) ValueError).
- cmd_sweep + cmd_install_links now validate --company; cmd_company_create validates
--parent (finished _require_oid consistency - these mislogged as errorlog noise).
- cmd_push_test parses --extra-json before gating (validate->gate order, matches siblings).
- selftest: +sweep/install-links bad-company assertions. 81/81. Units: clamp + reset classification.
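A rough sketch of the two fixes above, assuming the classifier inspects
URLError.reason. The helper names are hypothetical; only the 'connect'/'timeout'
labels and the clamp-to-0 behavior come from this commit.

```python
import socket
import urllib.error


def classify_url_error(exc: urllib.error.URLError) -> str:
    """Return 'connect' only when the request provably never reached the server."""
    reason = exc.reason
    # Refused connections and DNS failures happen before any bytes are sent,
    # so retrying them is safe even for a non-idempotent write.
    if isinstance(reason, (ConnectionRefusedError, socket.gaierror)):
        return "connect"
    # Resets, disconnects, and anything else may have occurred after the server
    # committed the write, so they take the write-gated 'timeout' path.
    return "timeout"


def clamp_retry_after(seconds: float) -> float:
    """Clamp a negative Retry-After to 0 (time.sleep rejects negative values)."""
    return max(0.0, seconds)
```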
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Remaining LOW/NIT items from the second review pass:
- list_all_companies() paginates the company list; sweep-all + refresh_inventory
no longer truncate a >100-company tenant.
- Pre-send connection failures (httpx ConnectError/ConnectTimeout; urllib URLError
not wrapping a timeout) are now retried as 'connect' - always safe (no side
effect) even for non-idempotent writes; ambiguous read-timeouts stay idempotent-gated.
- Explicit Retry-After honored up to RETRY_AFTER_MAX_SECONDS (120s) instead of the
30s exponential cap, so a server-mandated cooldown isn't cut short.
- GravityZoneClient is now a context manager (__enter__/__exit__ -> close()).
- incident-status/note reject an empty --set-json (rc2), matching account-update/notif.
- selftest: +connect/Retry-After/ctx-mgr unit coverage, incident empty-json assertion. 79/79.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Batched the audit doc/LOW findings plus the two pagination LOWs:
- Pagination (gz_client): security_sweep and refresh_inventory stopped on a
'total' field some responses omit, truncating after page 1. Now page until a
short page (< per_page) - the reliable last-page signal.
- isolate/restore docstrings (gz_client): removed the stale 'v1.2 takes an ARRAY
endpointIds' lines that contradicted the verified single-endpointId code.
- Cache 'no PII' wording corrected (gz_client header + SKILL.md): cache holds infra
identifiers (hostnames/FQDNs); no secrets. Dead _require_company_for_sweep removed.
- Doc drift fixed: delete-package is '--id <packageId>' not '--package <name>'
(SKILL.md + api-reference.md, verified live); module docstring + sweep --company
help corrected (sweep with no --company fans out to ALL companies).
- selftest aligned to the improved behavior: malformed ids now exit rc2 client-side
(H3) instead of rc1; gate-refusal 'Would' messages now assert on stderr (they
moved off stdout so --json isn't polluted). 75/75 pass; live sweep verified.
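A minimal sketch of the "page until a short page" rule from the pagination
bullet. fetch_page is a hypothetical callable standing in for the real API call;
it is assumed to return a list of at most per_page items for a 1-based page.

```python
from typing import Callable


def fetch_all(fetch_page: Callable[[int, int], list], per_page: int = 100) -> list:
    """Collect every item by paging until a page comes back short."""
    items: list = []
    page = 1
    while True:
        batch = fetch_page(page, per_page)
        items.extend(batch)
        # A page shorter than per_page (including an empty one) is the last page;
        # this does not depend on the response carrying a 'total' field.
        if len(batch) < per_page:
            return items
        page += 1
```

When the item count is an exact multiple of per_page, this costs one extra
request that returns an empty page, which is the price of not trusting 'total'.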
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Found during a full command-surface recheck: every privileged SSH recipe
(shares/users/groups/acl) was broken: sudo secure_path drops /usr/syno/{bin,sbin},
so synoshare/synouser/synogroup/synoacltool failed with "command not found"
(non-sudo plain recipes worked because the admin login PATH includes them).
- Inject SYNO_PATH into priv()/plain(); run priv via `sh -c` so operators work.
- synouser/synogroup use `--enum local` (not the invalid `--list`).
- acl quotes the share path (handles spaces, e.g. "Sandra Fish").
- services repointed to Web API (no synoservice on DSM 7.2; synosystemctl has no list-all).
Verified live: all Web API reads, all SSH reads (acl returns real Windows ACEs),
write path (share create/delete), and every destructive command correctly gated.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Audit fix M3: _write_cache did a non-atomic CACHE_FILE.write_text and the
write-through helpers did unlocked read-modify-write, so a crash mid-write could
truncate inventory.json and two concurrent gz.py runs could lose an update.
- _write_cache now writes a temp file (fsync) then os.replace() - atomic on the
same filesystem; a reader/crash can never see a partial file, and a failed
write leaves the prior cache intact and no .tmp residue.
- Added a best-effort cross-platform advisory lock (_cache_lock) around the
read-modify-write in _cache_add_group/_cache_add_package; steals a stale lock,
proceeds unlocked on timeout (a lost update is tolerable, a hang is not).
- Dropped the dead cache.setdefault('companies', ...) line in _cache_add_group.
- Verified: compile clean; unit tests for round-trip, lock acquire/release/steal,
write-through, temp cleanup on failure, and prior-cache survival.
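A sketch of the temp-file + fsync + os.replace() pattern described above,
assuming a JSON cache. write_cache_atomic and its arguments are illustrative
names, not the actual _write_cache; the advisory lock is not shown.

```python
import json
import os
import tempfile
from pathlib import Path


def write_cache_atomic(path: Path, data: dict) -> None:
    """Write JSON so readers see either the old file or the new one, never a mix."""
    # Create the temp file next to the target: os.replace() is only atomic
    # when source and destination are on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, path)  # atomic swap over any existing cache
    except BaseException:
        # Failed write: remove the temp file and leave the prior cache untouched.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```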
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Audit fix H2 (+ M2): the live GravityZone tenant is rate-limited and sweeps fan
out one getManagedEndpointDetails per endpoint across every company, which hit a
real HTTP 429 (errorlog 2026-06-21). _post had zero retry and opened a fresh
httpx.Client (new TLS handshake) per request.
- _post now retries 429/500/502/503/504/timeout up to RETRY_MAX_ATTEMPTS with
bounded exponential backoff + jitter, honoring Retry-After (numeric or HTTP-date).
Retry notices go to stderr (don't pollute --json). Terminal errors still raise.
- M2: a single httpx.Client is created lazily and reused (connection pooling),
closed via client.close() in main()'s finally. Makes the docstring's pooling
claim true and cuts handshake overhead + 429 pressure during sweeps.
- Verified: compile clean; offline unit tests (persistent 429 -> 4 attempts then
raise, flaky 503 -> recovers, Retry-After honored); live status read OK.
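The retry behavior above could look roughly like this. It is a sketch under
stated assumptions: send is a hypothetical callable returning
(status, headers, body), full jitter is one of several reasonable jitter schemes,
and the 4-attempt default mirrors the "4 attempts then raise" test rather than
the real RETRY_MAX_ATTEMPTS value.

```python
import random
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

RETRYABLE = {429, 500, 502, 503, 504}


def parse_retry_after(value, now=None):
    """Seconds to wait from a numeric or HTTP-date Retry-After, else None."""
    if value is None:
        return None
    try:
        return max(0.0, float(value.strip()))
    except ValueError:
        pass  # not numeric; try the HTTP-date form
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def call_with_retry(send, max_attempts=4, base=0.5, cap=30.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status not in RETRYABLE:
            return status, body
        if attempt == max_attempts:
            raise RuntimeError(f"giving up after {attempt} attempts (HTTP {status})")
        hinted = parse_retry_after(headers.get("Retry-After"))
        if hinted is not None:
            delay = hinted  # the server told us how long to wait
        else:
            # Bounded exponential backoff with full jitter.
            delay = random.uniform(0.0, min(cap, base * 2 ** (attempt - 1)))
        sleep(delay)
```

Passing sleep as a parameter keeps the loop testable offline, in the same
spirit as the offline unit tests mentioned in the Verified line.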
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Audit cluster C1/C2/H1/H3/M1 on the live GravityZone tenant:
- C1/H1/M1: move, scan, create-package, make-group called the live API with
no --confirm; added _gated() + a --confirm flag to each (move can change an
endpoint's inherited policy posture).
- C2: extend raw's destructive-method denylist with moveEndpoints/moveCustomGroup/
createScanTask/createPackage/createCustomGroup so 'raw' can't bypass the gates.
- H3: add _require_oid() 24-char-hex validation to endpoint/policy/endpoints +
the gated handlers, so malformed ids no longer hit the tenant or get mislogged
as functional errors (source of the 2026-06-21 errorlog noise).
- Gate refusals now print to stderr (don't pollute --json). SKILL.md gating list
updated. Verified: compile clean; gates exit 3, bad ids exit 2, raw denylist hits.
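A minimal sketch of the 24-char-hex check in H3. require_oid is an illustrative
stand-in for _require_oid; accepting upper-case hex and raising ValueError
(which a CLI wrapper would presumably map to exit code 2) are assumptions.

```python
import re

_OID_RE = re.compile(r"[0-9a-fA-F]{24}")


def require_oid(value: str, label: str = "id") -> str:
    """Return value unchanged if it is a 24-character hex id, else raise."""
    # fullmatch rejects ids with extra characters before or after the hex run.
    if not isinstance(value, str) or not _OID_RE.fullmatch(value):
        raise ValueError(f"{label} must be a 24-character hex string, got {value!r}")
    return value
```

Validating client-side keeps malformed ids from ever reaching the tenant, so
they cannot be logged as functional API errors.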
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>