sync: auto-sync from HOWARD-HOME at 2026-06-25 14:58:19
Author: Howard Enos Machine: HOWARD-HOME Timestamp: 2026-06-25 14:58:19
This commit is contained in:
@@ -0,0 +1,159 @@
|
||||
# Session Log — 2026-06-25 — Bitdefender skill audit + iterative fix-to-convergence
|
||||
|
||||
## User
|
||||
- **User:** Howard Enos (howard)
|
||||
- **Machine:** Howard-Home
|
||||
- **Role:** tech
|
||||
|
||||
## Session Summary
|
||||
|
||||
Audited the `bitdefender` skill (GravityZone Cloud JSON-RPC CLI over the live ACG
|
||||
partner tenant; `.claude/skills/bitdefender/`, ~3,200 lines across `gz.py` CLI +
|
||||
`gz_client.py` API client + `selftest.py`) for correctness, CLAUDE.md compliance, and
|
||||
bugs, then fixed everything found and looped review->fix->re-review until two
|
||||
consecutive review passes returned no MEDIUM+ defects.
|
||||
|
||||
The first audit (two parallel review sub-agents, one per big file) found two CRITICAL
|
||||
gating gaps on the live tenant: `move`, `scan`, `create-package`, `make-group` called
|
||||
the API with no `--confirm` gate, and the `raw` destructive-method denylist omitted the
|
||||
matching method names so `raw` bypassed the gates too. It also found no HTTP 429
|
||||
backoff (a real 429 was in errorlog), no client-side object-id validation, non-atomic
|
||||
cache writes, and assorted doc drift. Fixed the gating cluster (C1/C2/H1/H3/M1) first,
|
||||
then H2 (429/5xx retry + connection pooling), then M3 (atomic cache write + advisory
|
||||
lock), then the doc LOWs + sweep pagination, each committed and verified independently.
|
||||
|
||||
A second review pass caught issues partly introduced by the first round of fixes — most
|
||||
importantly that the new retry fired on timeout/5xx for non-idempotent writes, which
|
||||
could double-execute a `createScanTask`/`createPackage` if a timeout landed after a
|
||||
server-side commit. Made retry idempotency-aware (429 and pre-send `connect` always
|
||||
retried; ambiguous timeout/5xx retried only for `get*`/`list*` reads), finished the
|
||||
`_require_oid` rollout across all destructive + required-id handlers, made the
|
||||
write-through cache best-effort, added a pagination ceiling, extended the `raw` denylist
|
||||
(`createInstallTask`), and wrapped non-JSON 200s.
|
||||
|
||||
A third pass found the urllib fallback misclassifying a post-send reset as the
|
||||
always-safe `connect` code, a negative `Retry-After` crash, and three more unvalidated
|
||||
company/parent ids (`sweep`/`install-links`/`company-create`); all fixed. A fourth
|
||||
(convergence) audit returned CONVERGED: yes with only a read-only `companies`
|
||||
single-page listing LOW and an unused-import NIT, both then fixed. A fifth/final
|
||||
confirmation pass returned CONVERGED: yes — no MEDIUM+ defects. The skill's own live
|
||||
selftest grew from 75 to 81 cases and ends green; new behavior is unit-tested offline
|
||||
(retry idempotency, Retry-After clamp/ceiling, urllib reset classification, atomic cache
|
||||
write + lock, context manager).
|
||||
|
||||
Eight descriptive commits landed on `origin/main` over the session
|
||||
(`d8f0974` -> `befd265`); a few were absorbed by the background auto-sync into generic
|
||||
commits (recurring friction), but no change was lost.
|
||||
|
||||
## Key Decisions
|
||||
|
||||
- **Validation-before-gate ordering** in handlers (bad id -> rc2 before the `--confirm`
|
||||
gate -> rc3). Fail fast on malformed input; never describe/perform an action for an
|
||||
invalid target. Required updating selftest gate-refusal cases to use valid-FORMAT ids.
|
||||
- **Retry only safe failure classes for writes.** 429 (pre-processing rate-limit reject)
|
||||
and `connect` (pre-send) have no side effect -> always retried. Timeout/5xx are
|
||||
ambiguous (may have committed server-side) -> retried only for idempotent `get*`/`list*`
|
||||
methods. This preserved the 429-sweep fix while removing the double-execute hazard.
|
||||
- **Did NOT validate ids of unpinned format** (package id, report id, `hashItemId`,
|
||||
quarantine item id, incident type). Their formats aren't confirmed 24-hex, so
|
||||
validating would risk rejecting valid input; they are `--confirm`-gated and bad ids
|
||||
match the expected-error markers (no mislog). Deliberate, not an oversight.
|
||||
- **`raw` denylist kept as best-effort** (substring denylist on a power-user passthrough)
|
||||
rather than flipped to a read-method allowlist — minimal change, plus reworded
|
||||
docstring/help to say "verify any raw method yourself; not exhaustive."
|
||||
- **Surgical edits + the existing selftest as the regression gate**, rather than
|
||||
regenerating files. The live selftest (read paths + gating + validation against the
|
||||
real tenant) is the definitive "everything works" check.
|
||||
- **Looped to convergence** per the user's instruction: kept running fresh review passes
|
||||
until two consecutive passes found no MEDIUM+ defect, fixing only real/defensible items
|
||||
and consciously leaving documented LOW/NITs.
|
||||
|
||||
## Problems Encountered
|
||||
|
||||
- **Retry double-execution risk introduced by my own H2 fix** — the broad timeout/5xx
|
||||
retry would re-run non-idempotent writes. Caught on the second review pass; fixed with
|
||||
idempotency-aware retry (Fix A) + unit tests proving writes don't retry on timeout/5xx
|
||||
but do on 429/connect.
|
||||
- **Selftest expectation drift after behavior changes** — moving gate messages to stderr
|
||||
and adding client-side id validation changed several exit codes/streams (rc1->rc2 for
|
||||
bad ids; "Would" stdout->stderr; empty-set-json now rc2). Updated the affected
|
||||
assertions and added new bad-id/connect/Retry-After cases each round; ended 81/81.
|
||||
- **Dummy-key test polluted errorlog.md** — a `--confirm` smoke test with a dummy key hit
|
||||
a real 401 that was logged. Removed the artifact line; the API-key-error is otherwise
|
||||
correctly handled.
|
||||
- **Background auto-sync interleaving** — repeatedly swept in-progress edits into generic
|
||||
"auto-sync" commits and once blocked a rebase via another session's unstaged files.
|
||||
Worked around by stashing the other session's changes, rebasing, pushing, and popping;
|
||||
no work lost, but several descriptive commit messages were absorbed.
|
||||
|
||||
## Configuration Changes
|
||||
|
||||
All under `.claude/skills/bitdefender/`:
|
||||
- `scripts/gz.py` — gated move/scan/create-package/make-group; `_require_oid` rollout
|
||||
across endpoint/policy/endpoints/sweep/install-links/company(+suspend/activate/delete/
|
||||
create)/account(+update/delete)/assign-policy/set-label/reconfigure/delete-endpoint/
|
||||
delete-group/isolate/unisolate/blocklist-add/quarantine/incidents; extended
|
||||
`DESTRUCTIVE_RAW_PATTERNS` (move*/createscan/createpackage/createcustomgroup/
|
||||
createinstall); gate messages -> stderr; incident-status/note empty-set-json check;
|
||||
push-test parse-before-gate; companies lists full fleet; docstring/help fixes; removed
|
||||
dead `_require_company_for_sweep`.
|
||||
- `scripts/gz_client.py` — 429/5xx/timeout/connect retry with idempotency gating +
|
||||
backoff honoring Retry-After (clamped [0,120]); single reused httpx client + close()
|
||||
+ context manager; atomic `_write_cache` (temp+fsync+os.replace) + `_cache_lock`
|
||||
advisory lock; best-effort write-through; pagination by short-page + MAX_PAGINATION_PAGES
|
||||
ceiling; `list_all_companies()`; non-JSON-200 -> GravityZoneError; urllib URLError
|
||||
conservative connect/timeout classification; removed unused `field` import.
|
||||
- `scripts/selftest.py` — VID valid-format placeholder; gate-refusal cases use VID;
|
||||
bad-id/sweep/install-links rc2 assertions; stderr "Would" assertions; incident
|
||||
empty-json assertion. 81 cases.
|
||||
- `SKILL.md` — gating list adds move/scan/create-package/make-group; delete-package
|
||||
`--id <packageId>` (was `--package <name>`); cache "no PII" wording corrected.
|
||||
- `references/api-reference.md` — `deletePackage` param corrected to `packageId`.
|
||||
- `errorlog.md` — removed a dummy-key 401 test artifact.
|
||||
|
||||
## Credentials & Secrets
|
||||
|
||||
None created, rotated, or newly discovered. The GravityZone API key is loaded from the
|
||||
SOPS vault (`msp-tools/gravityzone.sops.yaml` field `credentials.api_key`) via
|
||||
`vault.sh`; never hardcoded, logged, or cached. Verified intact across all changes.
|
||||
|
||||
## Infrastructure & Servers
|
||||
|
||||
GravityZone Cloud Public API: `https://cloud.gravityzone.bitdefender.com/api/v1.0/jsonrpc`
|
||||
(HTTP Basic, key as username/empty password). Live ACG partner tenant. ACG internal
|
||||
company id `5c428b246c031893678b4569`; companies container `5c4280716c0318f3478b456e`;
|
||||
root company `5c4280716c0318f3478b456a`. No infrastructure changed.
|
||||
|
||||
## Commands & Outputs
|
||||
|
||||
- `python -m py_compile gz.py gz_client.py selftest.py` — clean each round.
|
||||
- `python selftest.py` — read-only + refusal-path harness against the live tenant;
|
||||
ended `81/81 passed, 0 failed` (sets `GZ_SUPPRESS_ERRORLOG=1`, no errorlog pollution).
|
||||
- `python gz.py sweep --company 5c428b246c031893678b4569` — exit 0, swept 9 endpoints
|
||||
(verified refactored pagination end-to-end).
|
||||
- Offline unit checks: retry idempotency (write+timeout NOT retried, write+429/connect
|
||||
retried, read+timeout retried), `_retry_delay` clamp+ceiling, urllib reset
|
||||
classification, atomic-write/lock round-trip + stale-steal + temp-cleanup.
|
||||
|
||||
## Pending / Incomplete Tasks
|
||||
|
||||
- **None for bitdefender** — converged; two consecutive review passes clean.
|
||||
- Consciously left (documented, not defects): id validation deliberately omitted on
|
||||
unpinned-format ids (package/report/hashItemId/quarantine-item); `socket.timeout is
|
||||
TimeoutError` assumption (non-issue on the fleet's Python 3.12); `raw` denylist is
|
||||
best-effort by design.
|
||||
- Recurring friction worth addressing fleet-wide: background auto-sync interleaving manual
|
||||
commits (absorbs descriptive messages). Option: pause auto-sync during focused dev.
|
||||
|
||||
## Reference Information
|
||||
|
||||
- Skill: `.claude/skills/bitdefender/` (SKILL.md, scripts/{gz.py,gz_client.py,selftest.py},
|
||||
references/{api-reference.md,BUILDOUT.md}).
|
||||
- Commits (origin/main, claudetools): `d8f0974` (gating+id-validation cluster),
|
||||
`51751e6` (H2 retry+pooling), `4d75083` (M3 atomic cache), `1852f75` (doc LOWs+
|
||||
pagination), `a96a15a` (auto-sync swept second-pass hardening), `778e12d` (companies
|
||||
pagination+connect-retry+Retry-After ceiling+ctx mgr), `3d6cb46` (urllib reset+
|
||||
Retry-After clamp+sweep/install-links validation), `befd265` (companies full fleet+
|
||||
drop unused import).
|
||||
- Exit-code contract: 0 success, 1 API/runtime failure, 2 bad input/validation, 3 gated
|
||||
refusal, 130 KeyboardInterrupt.
|
||||
Reference in New Issue
Block a user