sync: auto-sync from HOWARD-HOME at 2026-06-25 14:58:19

Author: Howard Enos
Machine: HOWARD-HOME
Timestamp: 2026-06-25 14:58:19
This commit is contained in:
2026-06-25 14:58:50 -07:00
parent df75a86518
commit d87aa9dfd8

View File

@@ -0,0 +1,159 @@
# Session Log — 2026-06-25 — Bitdefender skill audit + iterative fix-to-convergence
## User
- **User:** Howard Enos (howard)
- **Machine:** Howard-Home
- **Role:** tech
## Session Summary
Audited the `bitdefender` skill (GravityZone Cloud JSON-RPC CLI over the live ACG
partner tenant; `.claude/skills/bitdefender/`, ~3,200 lines across `gz.py` CLI +
`gz_client.py` API client + `selftest.py`) for correctness, CLAUDE.md compliance, and
bugs, then fixed everything found and looped review->fix->re-review until two
consecutive review passes returned no MEDIUM+ defects.
The first audit (two parallel review sub-agents, one per big file) found two CRITICAL
gating gaps on the live tenant: `move`, `scan`, `create-package`, `make-group` called
the API with no `--confirm` gate, and the `raw` destructive-method denylist omitted the
matching method names so `raw` bypassed the gates too. It also found no HTTP 429
backoff (a real 429 was in errorlog), no client-side object-id validation, non-atomic
cache writes, and assorted doc drift. Fixed the gating cluster (C1/C2/H1/H3/M1) first,
then H2 (429/5xx retry + connection pooling), then M3 (atomic cache write + advisory
lock), then the doc LOWs + sweep pagination, each committed and verified independently.
A second review pass caught issues partly introduced by the first round of fixes — most
importantly that the new retry fired on timeout/5xx for non-idempotent writes, which
could double-execute a `createScanTask`/`createPackage` if a timeout landed after a
server-side commit. Made retry idempotency-aware (429 and pre-send `connect` always
retried; ambiguous timeout/5xx retried only for `get*`/`list*` reads), finished the
`_require_oid` rollout across all destructive + required-id handlers, made the
write-through cache best-effort, added a pagination ceiling, extended the `raw` denylist
(`createInstallTask`), and wrapped non-JSON 200s.
A third pass found the urllib fallback misclassifying a post-send reset as the
always-safe `connect` code, a negative `Retry-After` crash, and three more unvalidated
company/parent ids (`sweep`/`install-links`/`company-create`); all fixed. A fourth
(convergence) audit returned CONVERGED: yes with only a read-only `companies`
single-page listing LOW and an unused-import NIT, both then fixed. A fifth/final
confirmation pass returned CONVERGED: yes — no MEDIUM+ defects. The skill's own live
selftest grew from 75 to 81 cases and ends green; new behavior is unit-tested offline
(retry idempotency, Retry-After clamp/ceiling, urllib reset classification, atomic cache
write + lock, context manager).
Eight descriptive commits landed on `origin/main` over the session
(`d8f0974` -> `befd265`); a few were absorbed by the background auto-sync into generic
commits (recurring friction), but no change was lost.
## Key Decisions
- **Validation-before-gate ordering** in handlers (bad id -> rc2 before the `--confirm`
gate -> rc3). Fail fast on malformed input; never describe/perform an action for an
invalid target. Required updating selftest gate-refusal cases to use valid-FORMAT ids.
- **Retry only safe failure classes for writes.** 429 (pre-processing rate-limit reject)
and `connect` (pre-send) have no side effect -> always retried. Timeout/5xx are
ambiguous (may have committed server-side) -> retried only for idempotent `get*`/`list*`
methods. This preserved the 429-sweep fix while removing the double-execute hazard.
- **Did NOT validate ids of unpinned format** (package id, report id, `hashItemId`,
quarantine item id, incident type). Their formats aren't confirmed 24-hex, so
validating would risk rejecting valid input; they are `--confirm`-gated and bad ids
match the expected-error markers (no mislog). Deliberate, not an oversight.
- **`raw` denylist kept as best-effort** (substring denylist on a power-user passthrough)
rather than flipped to a read-method allowlist — minimal change, plus reworded
docstring/help to say "verify any raw method yourself; not exhaustive."
- **Surgical edits + the existing selftest as the regression gate**, rather than
regenerating files. The live selftest (read paths + gating + validation against the
real tenant) is the definitive "everything works" check.
- **Looped to convergence** per the user's instruction: kept running fresh review passes
until two consecutive passes found no MEDIUM+ defect, fixing only real/defensible items
and consciously leaving documented LOW/NITs.
## Problems Encountered
- **Retry double-execution risk introduced by my own H2 fix** — the broad timeout/5xx
retry would re-run non-idempotent writes. Caught on the second review pass; fixed with
idempotency-aware retry (Fix A) + unit tests proving writes don't retry on timeout/5xx
but do on 429/connect.
- **Selftest expectation drift after behavior changes** — moving gate messages to stderr
and adding client-side id validation changed several exit codes/streams (rc1->rc2 for
bad ids; "Would" stdout->stderr; empty-set-json now rc2). Updated the affected
assertions and added new bad-id/connect/Retry-After cases each round; ended 81/81.
- **Dummy-key test polluted errorlog.md** — a `--confirm` smoke test with a dummy key hit
a real 401 that was logged. Removed the artifact line; the API-key-error is otherwise
correctly handled.
- **Background auto-sync interleaving** — repeatedly swept in-progress edits into generic
"auto-sync" commits and once blocked a rebase via another session's unstaged files.
Worked around by stashing the other session's changes, rebasing, pushing, and popping;
no work lost, but several descriptive commit messages were absorbed.
## Configuration Changes
All under `.claude/skills/bitdefender/`:
- `scripts/gz.py` — gated move/scan/create-package/make-group; `_require_oid` rollout
across endpoint/policy/endpoints/sweep/install-links/company(+suspend/activate/delete/
create)/account(+update/delete)/assign-policy/set-label/reconfigure/delete-endpoint/
delete-group/isolate/unisolate/blocklist-add/quarantine/incidents; extended
`DESTRUCTIVE_RAW_PATTERNS` (move*/createscan/createpackage/createcustomgroup/
createinstall); gate messages -> stderr; incident-status/note empty-set-json check;
push-test parse-before-gate; companies lists full fleet; docstring/help fixes; removed
dead `_require_company_for_sweep`.
- `scripts/gz_client.py` — 429/5xx/timeout/connect retry with idempotency gating +
backoff honoring Retry-After (clamped [0,120]); single reused httpx client + close()
+ context manager; atomic `_write_cache` (temp+fsync+os.replace) + `_cache_lock`
advisory lock; best-effort write-through; pagination by short-page + MAX_PAGINATION_PAGES
ceiling; `list_all_companies()`; non-JSON-200 -> GravityZoneError; urllib URLError
conservative connect/timeout classification; removed unused `field` import.
- `scripts/selftest.py` — VID valid-format placeholder; gate-refusal cases use VID;
bad-id/sweep/install-links rc2 assertions; stderr "Would" assertions; incident
empty-json assertion. 81 cases.
- `SKILL.md` — gating list adds move/scan/create-package/make-group; delete-package
`--id <packageId>` (was `--package <name>`); cache "no PII" wording corrected.
- `references/api-reference.md``deletePackage` param corrected to `packageId`.
- `errorlog.md` — removed a dummy-key 401 test artifact.
## Credentials & Secrets
None created, rotated, or newly discovered. The GravityZone API key is loaded from the
SOPS vault (`msp-tools/gravityzone.sops.yaml` field `credentials.api_key`) via
`vault.sh`; never hardcoded, logged, or cached. Verified intact across all changes.
## Infrastructure & Servers
GravityZone Cloud Public API: `https://cloud.gravityzone.bitdefender.com/api/v1.0/jsonrpc`
(HTTP Basic, key as username/empty password). Live ACG partner tenant. ACG internal
company id `5c428b246c031893678b4569`; companies container `5c4280716c0318f3478b456e`;
root company `5c4280716c0318f3478b456a`. No infrastructure changed.
## Commands & Outputs
- `python -m py_compile gz.py gz_client.py selftest.py` — clean each round.
- `python selftest.py` — read-only + refusal-path harness against the live tenant;
ended `81/81 passed, 0 failed` (sets `GZ_SUPPRESS_ERRORLOG=1`, no errorlog pollution).
- `python gz.py sweep --company 5c428b246c031893678b4569` — exit 0, swept 9 endpoints
(verified refactored pagination end-to-end).
- Offline unit checks: retry idempotency (write+timeout NOT retried, write+429/connect
retried, read+timeout retried), `_retry_delay` clamp+ceiling, urllib reset
classification, atomic-write/lock round-trip + stale-steal + temp-cleanup.
## Pending / Incomplete Tasks
- **None for bitdefender** — converged; two consecutive review passes clean.
- Consciously left (documented, not defects): id validation deliberately omitted on
unpinned-format ids (package/report/hashItemId/quarantine-item); `socket.timeout is
TimeoutError` assumption (non-issue on the fleet's Python 3.12); `raw` denylist is
best-effort by design.
- Recurring friction worth addressing fleet-wide: background auto-sync interleaving manual
commits (absorbs descriptive messages). Option: pause auto-sync during focused dev.
## Reference Information
- Skill: `.claude/skills/bitdefender/` (SKILL.md, scripts/{gz.py,gz_client.py,selftest.py},
references/{api-reference.md,BUILDOUT.md}).
- Commits (origin/main, claudetools): `d8f0974` (gating+id-validation cluster),
`51751e6` (H2 retry+pooling), `4d75083` (M3 atomic cache), `1852f75` (doc LOWs+
pagination), `a96a15a` (auto-sync swept second-pass hardening), `778e12d` (companies
pagination+connect-retry+Retry-After ceiling+ctx mgr), `3d6cb46` (urllib reset+
Retry-After clamp+sweep/install-links validation), `befd265` (companies full fleet+
drop unused import).
- Exit-code contract: 0 success, 1 API/runtime failure, 2 bad input/validation, 3 gated
refusal, 130 KeyboardInterrupt.