From d87aa9dfd8117102fccfd06b1842f884349d0bdb Mon Sep 17 00:00:00 2001 From: Howard Enos Date: Thu, 25 Jun 2026 14:58:50 -0700 Subject: [PATCH] sync: auto-sync from HOWARD-HOME at 2026-06-25 14:58:19 Author: Howard Enos Machine: HOWARD-HOME Timestamp: 2026-06-25 14:58:19 --- ...25-howard-bitdefender-skill-audit-fixes.md | 159 ++++++++++++++++++ 1 file changed, 159 insertions(+) create mode 100644 session-logs/2026-06/2026-06-25-howard-bitdefender-skill-audit-fixes.md diff --git a/session-logs/2026-06/2026-06-25-howard-bitdefender-skill-audit-fixes.md b/session-logs/2026-06/2026-06-25-howard-bitdefender-skill-audit-fixes.md new file mode 100644 index 00000000..c4f84797 --- /dev/null +++ b/session-logs/2026-06/2026-06-25-howard-bitdefender-skill-audit-fixes.md @@ -0,0 +1,159 @@ +# Session Log — 2026-06-25 — Bitdefender skill audit + iterative fix-to-convergence + +## User +- **User:** Howard Enos (howard) +- **Machine:** Howard-Home +- **Role:** tech + +## Session Summary + +Audited the `bitdefender` skill (GravityZone Cloud JSON-RPC CLI over the live ACG +partner tenant; `.claude/skills/bitdefender/`, ~3,200 lines across `gz.py` CLI + +`gz_client.py` API client + `selftest.py`) for correctness, CLAUDE.md compliance, and +bugs, then fixed everything found and looped review->fix->re-review until two +consecutive review passes returned no MEDIUM+ defects. + +The first audit (two parallel review sub-agents, one per big file) found two CRITICAL +gating gaps on the live tenant: `move`, `scan`, `create-package`, `make-group` called +the API with no `--confirm` gate, and the `raw` destructive-method denylist omitted the +matching method names so `raw` bypassed the gates too. It also found no HTTP 429 +backoff (a real 429 was in errorlog), no client-side object-id validation, non-atomic +cache writes, and assorted doc drift. Fixed the gating cluster (C1/C2/H1/H3/M1) first, +then H2 (429/5xx retry + connection pooling), then M3 (atomic cache write + advisory +lock), then the doc LOWs + sweep pagination, each committed and verified independently. + +A second review pass caught issues partly introduced by the first round of fixes — most +importantly that the new retry fired on timeout/5xx for non-idempotent writes, which +could double-execute a `createScanTask`/`createPackage` if a timeout landed after a +server-side commit. Made retry idempotency-aware (429 and pre-send `connect` always +retried; ambiguous timeout/5xx retried only for `get*`/`list*` reads), finished the +`_require_oid` rollout across all destructive + required-id handlers, made the +write-through cache best-effort, added a pagination ceiling, extended the `raw` denylist +(`createInstallTask`), and wrapped non-JSON 200s. + +A third pass found the urllib fallback misclassifying a post-send reset as the +always-safe `connect` code, a negative `Retry-After` crash, and three more unvalidated +company/parent ids (`sweep`/`install-links`/`company-create`); all fixed. A fourth +(convergence) audit returned CONVERGED: yes with only a read-only `companies` +single-page listing LOW and an unused-import NIT, both then fixed. A fifth/final +confirmation pass returned CONVERGED: yes — no MEDIUM+ defects. The skill's own live +selftest grew from 75 to 81 cases and ends green; new behavior is unit-tested offline +(retry idempotency, Retry-After clamp/ceiling, urllib reset classification, atomic cache +write + lock, context manager). + +Eight descriptive commits landed on `origin/main` over the session +(`d8f0974` -> `befd265`); a few were absorbed by the background auto-sync into generic +commits (recurring friction), but no change was lost. + +## Key Decisions + +- **Validation-before-gate ordering** in handlers (bad id -> rc2 before the `--confirm` + gate -> rc3). Fail fast on malformed input; never describe/perform an action for an + invalid target. Required updating selftest gate-refusal cases to use valid-FORMAT ids. +- **Retry only safe failure classes for writes.** 429 (pre-processing rate-limit reject) + and `connect` (pre-send) have no side effect -> always retried. Timeout/5xx are + ambiguous (may have committed server-side) -> retried only for idempotent `get*`/`list*` + methods. This preserved the 429-sweep fix while removing the double-execute hazard. +- **Did NOT validate ids of unpinned format** (package id, report id, `hashItemId`, + quarantine item id, incident type). Their formats aren't confirmed 24-hex, so + validating would risk rejecting valid input; they are `--confirm`-gated and bad ids + match the expected-error markers (no mislog). Deliberate, not an oversight. +- **`raw` denylist kept as best-effort** (substring denylist on a power-user passthrough) + rather than flipped to a read-method allowlist — minimal change, plus reworded + docstring/help to say "verify any raw method yourself; not exhaustive." +- **Surgical edits + the existing selftest as the regression gate**, rather than + regenerating files. The live selftest (read paths + gating + validation against the + real tenant) is the definitive "everything works" check. +- **Looped to convergence** per the user's instruction: kept running fresh review passes + until two consecutive passes found no MEDIUM+ defect, fixing only real/defensible items + and consciously leaving documented LOW/NITs. + +## Problems Encountered + +- **Retry double-execution risk introduced by my own H2 fix** — the broad timeout/5xx + retry would re-run non-idempotent writes. Caught on the second review pass; fixed with + idempotency-aware retry (Fix A) + unit tests proving writes don't retry on timeout/5xx + but do on 429/connect. +- **Selftest expectation drift after behavior changes** — moving gate messages to stderr + and adding client-side id validation changed several exit codes/streams (rc1->rc2 for + bad ids; "Would" stdout->stderr; empty-set-json now rc2). Updated the affected + assertions and added new bad-id/connect/Retry-After cases each round; ended 81/81. +- **Dummy-key test polluted errorlog.md** — a `--confirm` smoke test with a dummy key hit + a real 401 that was logged. Removed the artifact line; the API-key-error is otherwise + correctly handled. +- **Background auto-sync interleaving** — repeatedly swept in-progress edits into generic + "auto-sync" commits and once blocked a rebase via another session's unstaged files. + Worked around by stashing the other session's changes, rebasing, pushing, and popping; + no work lost, but several descriptive commit messages were absorbed. + +## Configuration Changes + +All under `.claude/skills/bitdefender/`: +- `scripts/gz.py` — gated move/scan/create-package/make-group; `_require_oid` rollout + across endpoint/policy/endpoints/sweep/install-links/company(+suspend/activate/delete/ + create)/account(+update/delete)/assign-policy/set-label/reconfigure/delete-endpoint/ + delete-group/isolate/unisolate/blocklist-add/quarantine/incidents; extended + `DESTRUCTIVE_RAW_PATTERNS` (move*/createscan/createpackage/createcustomgroup/ + createinstall); gate messages -> stderr; incident-status/note empty-set-json check; + push-test parse-before-gate; companies lists full fleet; docstring/help fixes; removed + dead `_require_company_for_sweep`. +- `scripts/gz_client.py` — 429/5xx/timeout/connect retry with idempotency gating + + backoff honoring Retry-After (clamped [0,120]); single reused httpx client + close() + + context manager; atomic `_write_cache` (temp+fsync+os.replace) + `_cache_lock` + advisory lock; best-effort write-through; pagination by short-page + MAX_PAGINATION_PAGES + ceiling; `list_all_companies()`; non-JSON-200 -> GravityZoneError; urllib URLError + conservative connect/timeout classification; removed unused `field` import. +- `scripts/selftest.py` — VID valid-format placeholder; gate-refusal cases use VID; + bad-id/sweep/install-links rc2 assertions; stderr "Would" assertions; incident + empty-json assertion. 81 cases. +- `SKILL.md` — gating list adds move/scan/create-package/make-group; delete-package + `--id ` (was `--package `); cache "no PII" wording corrected. +- `references/api-reference.md` — `deletePackage` param corrected to `packageId`. +- `errorlog.md` — removed a dummy-key 401 test artifact. + +## Credentials & Secrets + +None created, rotated, or newly discovered. The GravityZone API key is loaded from the +SOPS vault (`msp-tools/gravityzone.sops.yaml` field `credentials.api_key`) via +`vault.sh`; never hardcoded, logged, or cached. Verified intact across all changes. + +## Infrastructure & Servers + +GravityZone Cloud Public API: `https://cloud.gravityzone.bitdefender.com/api/v1.0/jsonrpc` +(HTTP Basic, key as username/empty password). Live ACG partner tenant. ACG internal +company id `5c428b246c031893678b4569`; companies container `5c4280716c0318f3478b456e`; +root company `5c4280716c0318f3478b456a`. No infrastructure changed. + +## Commands & Outputs + +- `python -m py_compile gz.py gz_client.py selftest.py` — clean each round. +- `python selftest.py` — read-only + refusal-path harness against the live tenant; + ended `81/81 passed, 0 failed` (sets `GZ_SUPPRESS_ERRORLOG=1`, no errorlog pollution). +- `python gz.py sweep --company 5c428b246c031893678b4569` — exit 0, swept 9 endpoints + (verified refactored pagination end-to-end). +- Offline unit checks: retry idempotency (write+timeout NOT retried, write+429/connect + retried, read+timeout retried), `_retry_delay` clamp+ceiling, urllib reset + classification, atomic-write/lock round-trip + stale-steal + temp-cleanup. + +## Pending / Incomplete Tasks + +- **None for bitdefender** — converged; two consecutive review passes clean. +- Consciously left (documented, not defects): id validation deliberately omitted on + unpinned-format ids (package/report/hashItemId/quarantine-item); `socket.timeout is + TimeoutError` assumption (non-issue on the fleet's Python 3.12); `raw` denylist is + best-effort by design. +- Recurring friction worth addressing fleet-wide: background auto-sync interleaving manual + commits (absorbs descriptive messages). Option: pause auto-sync during focused dev. + +## Reference Information + +- Skill: `.claude/skills/bitdefender/` (SKILL.md, scripts/{gz.py,gz_client.py,selftest.py}, + references/{api-reference.md,BUILDOUT.md}). +- Commits (origin/main, claudetools): `d8f0974` (gating+id-validation cluster), + `51751e6` (H2 retry+pooling), `4d75083` (M3 atomic cache), `1852f75` (doc LOWs+ + pagination), `a96a15a` (auto-sync swept second-pass hardening), `778e12d` (companies + pagination+connect-retry+Retry-After ceiling+ctx mgr), `3d6cb46` (urllib reset+ + Retry-After clamp+sweep/install-links validation), `befd265` (companies full fleet+ + drop unused import). +- Exit-code contract: 0 success, 1 API/runtime failure, 2 bad input/validation, 3 gated + refusal, 130 KeyboardInterrupt.