sync: auto-sync from HOWARD-HOME at 2026-06-25 14:58:19
Author: Howard Enos Machine: HOWARD-HOME Timestamp: 2026-06-25 14:58:19
@@ -0,0 +1,159 @@
# Session Log — 2026-06-25 — Bitdefender skill audit + iterative fix-to-convergence
## User

- **User:** Howard Enos (howard)
- **Machine:** Howard-Home
- **Role:** tech

## Session Summary

Audited the `bitdefender` skill (GravityZone Cloud JSON-RPC CLI over the live ACG
partner tenant; `.claude/skills/bitdefender/`, ~3,200 lines across `gz.py` CLI +
`gz_client.py` API client + `selftest.py`) for correctness, CLAUDE.md compliance, and
bugs, then fixed everything found and looped review->fix->re-review until two
consecutive review passes returned no MEDIUM+ defects.

The first audit (two parallel review sub-agents, one per big file) found two CRITICAL
gating gaps on the live tenant: `move`, `scan`, `create-package`, `make-group` called
the API with no `--confirm` gate, and the `raw` destructive-method denylist omitted the
matching method names so `raw` bypassed the gates too. It also found no HTTP 429
backoff (a real 429 was in errorlog), no client-side object-id validation, non-atomic
cache writes, and assorted doc drift. Fixed the gating cluster (C1/C2/H1/H3/M1) first,
then H2 (429/5xx retry + connection pooling), then M3 (atomic cache write + advisory
lock), then the doc LOWs + sweep pagination, each committed and verified independently.

A second review pass caught issues partly introduced by the first round of fixes, most
importantly that the new retry fired on timeout/5xx for non-idempotent writes, which
could double-execute a `createScanTask`/`createPackage` if a timeout landed after a
server-side commit. Made retry idempotency-aware (429 and pre-send `connect` always
retried; ambiguous timeout/5xx retried only for `get*`/`list*` reads), finished the
`_require_oid` rollout across all destructive + required-id handlers, made the
write-through cache best-effort, added a pagination ceiling, extended the `raw` denylist
(`createInstallTask`), and wrapped non-JSON 200s.

A third pass found the urllib fallback misclassifying a post-send reset as the
always-safe `connect` code, a negative `Retry-After` crash, and three more unvalidated
company/parent ids (`sweep`/`install-links`/`company-create`); all fixed. A fourth
(convergence) audit returned CONVERGED: yes with only a read-only `companies`
single-page listing LOW and an unused-import NIT, both then fixed. A fifth/final
confirmation pass returned CONVERGED: yes, with no MEDIUM+ defects. The skill's own live
selftest grew from 75 to 81 cases and ends green; new behavior is unit-tested offline
(retry idempotency, Retry-After clamp/ceiling, urllib reset classification, atomic cache
write + lock, context manager).

Eight descriptive commits landed on `origin/main` over the session
(`d8f0974` -> `befd265`); a few were absorbed by the background auto-sync into generic
commits (recurring friction), but no change was lost.

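The idempotency-aware retry rule from the second review pass can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the code in `gz_client.py`: `should_retry`, `is_idempotent`, and the failure-class strings are hypothetical names, and `getEndpointsList` is used only as an example of a `get*` read. Only the rule itself (429 and pre-send `connect` always retried, ambiguous timeout/5xx retried only for `get*`/`list*`) comes from the log.

```python
# Hypothetical sketch of idempotency-aware retry classification.
# The names below are assumptions for illustration, not gz_client.py code.

ALWAYS_SAFE = {"429", "connect"}   # rejected or failed before the server acted
AMBIGUOUS = {"timeout", "5xx"}     # the server may already have committed
READ_PREFIXES = ("get", "list")    # methods treated as idempotent reads


def is_idempotent(method: str) -> bool:
    """Treat get*/list* JSON-RPC methods as safe to repeat."""
    return method.startswith(READ_PREFIXES)


def should_retry(method: str, failure: str) -> bool:
    """Decide whether a failed call may be retried without double-executing."""
    if failure in ALWAYS_SAFE:
        return True
    if failure in AMBIGUOUS:
        return is_idempotent(method)
    return False  # unknown failure classes are never retried


if __name__ == "__main__":
    for method, failure in [
        ("createScanTask", "timeout"),
        ("createScanTask", "429"),
        ("getEndpointsList", "timeout"),
    ]:
        print(method, failure, should_retry(method, failure))
```

Keeping the decision in a pure function is what makes the offline unit checks cheap: no network or live tenant is needed to show that a write is not retried after a timeout.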
## Key Decisions

- **Validation-before-gate ordering** in handlers (bad id -> rc2 before the `--confirm`
  gate -> rc3). Fail fast on malformed input; never describe/perform an action for an
  invalid target. Required updating selftest gate-refusal cases to use valid-FORMAT ids.
- **Retry only safe failure classes for writes.** 429 (pre-processing rate-limit reject)
  and `connect` (pre-send) have no side effect -> always retried. Timeout/5xx are
  ambiguous (may have committed server-side) -> retried only for idempotent `get*`/`list*`
  methods. This preserved the 429-sweep fix while removing the double-execute hazard.
- **Did NOT validate ids of unpinned format** (package id, report id, `hashItemId`,
  quarantine item id, incident type). Their formats aren't confirmed 24-hex, so
  validating would risk rejecting valid input; they are `--confirm`-gated and bad ids
  match the expected-error markers (no mislog). Deliberate, not an oversight.
- **`raw` denylist kept as best-effort** (substring denylist on a power-user passthrough)
  rather than flipped to a read-method allowlist: a minimal change, plus a reworded
  docstring/help to say "verify any raw method yourself; not exhaustive."
- **Surgical edits + the existing selftest as the regression gate**, rather than
  regenerating files. The live selftest (read paths + gating + validation against the
  real tenant) is the definitive "everything works" check.
- **Looped to convergence** per the user's instruction: kept running fresh review passes
  until two consecutive passes found no MEDIUM+ defect, fixing only real/defensible items
  and consciously leaving documented LOW/NITs.

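The validation-before-gate ordering in the first decision can be sketched as follows. This is a minimal illustration, assuming a 24-hex object id format; `looks_like_oid` and `handle_destructive` are hypothetical stand-ins, since the real `_require_oid` and the handlers in `gz.py` are not shown in this log.

```python
import re
import sys

RC_OK, RC_BAD_INPUT, RC_GATED = 0, 2, 3

# Assumption: object ids are 24 hexadecimal characters.
OID_RE = re.compile(r"[0-9a-fA-F]{24}")


def looks_like_oid(value: str) -> bool:
    """Hypothetical stand-in for the skill's _require_oid check."""
    return OID_RE.fullmatch(value) is not None


def handle_destructive(target_id: str, confirm: bool) -> int:
    """Validate the id first (rc2), then apply the --confirm gate (rc3)."""
    if not looks_like_oid(target_id):
        print(f"invalid object id: {target_id!r}", file=sys.stderr)
        return RC_BAD_INPUT
    if not confirm:
        # Gate messages go to stderr; nothing is sent to the API.
        print(f"Would act on {target_id}; re-run with --confirm", file=sys.stderr)
        return RC_GATED
    # The real handler would call the API here.
    return RC_OK
```

With this ordering a malformed id returns rc2 even without `--confirm`, which is why gate-refusal test cases need a valid-format placeholder id to reach rc3.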
## Problems Encountered

- **Retry double-execution risk introduced by my own H2 fix**: the broad timeout/5xx
  retry would re-run non-idempotent writes. Caught on the second review pass; fixed with
  idempotency-aware retry (Fix A) + unit tests proving writes don't retry on timeout/5xx
  but do on 429/connect.
- **Selftest expectation drift after behavior changes**: moving gate messages to stderr
  and adding client-side id validation changed several exit codes/streams (rc1->rc2 for
  bad ids; "Would" stdout->stderr; empty-set-json now rc2). Updated the affected
  assertions and added new bad-id/connect/Retry-After cases each round; ended 81/81.
- **Dummy-key test polluted errorlog.md**: a `--confirm` smoke test with a dummy key hit
  a real 401 that was logged. Removed the artifact line; the API-key-error is otherwise
  correctly handled.
- **Background auto-sync interleaving**: repeatedly swept in-progress edits into generic
  "auto-sync" commits and once blocked a rebase via another session's unstaged files.
  Worked around by stashing the other session's changes, rebasing, pushing, and popping;
  no work lost, but several descriptive commit messages were absorbed.

## Configuration Changes

All under `.claude/skills/bitdefender/`:

- `scripts/gz.py`: gated move/scan/create-package/make-group; `_require_oid` rollout
  across endpoint/policy/endpoints/sweep/install-links/company(+suspend/activate/delete/
  create)/account(+update/delete)/assign-policy/set-label/reconfigure/delete-endpoint/
  delete-group/isolate/unisolate/blocklist-add/quarantine/incidents; extended
  `DESTRUCTIVE_RAW_PATTERNS` (move*/createscan/createpackage/createcustomgroup/
  createinstall); gate messages -> stderr; incident-status/note empty-set-json check;
  push-test parse-before-gate; companies lists full fleet; docstring/help fixes; removed
  dead `_require_company_for_sweep`.
- `scripts/gz_client.py`: 429/5xx/timeout/connect retry with idempotency gating +
  backoff honoring Retry-After (clamped [0,120]); single reused httpx client + close()
  + context manager; atomic `_write_cache` (temp+fsync+os.replace) + `_cache_lock`
  advisory lock; best-effort write-through; pagination by short-page + MAX_PAGINATION_PAGES
  ceiling; `list_all_companies()`; non-JSON-200 -> GravityZoneError; urllib URLError
  conservative connect/timeout classification; removed unused `field` import.
- `scripts/selftest.py`: VID valid-format placeholder; gate-refusal cases use VID;
  bad-id/sweep/install-links rc2 assertions; stderr "Would" assertions; incident
  empty-json assertion. 81 cases.
- `SKILL.md`: gating list adds move/scan/create-package/make-group; delete-package
  `--id <packageId>` (was `--package <name>`); cache "no PII" wording corrected.
- `references/api-reference.md`: `deletePackage` param corrected to `packageId`.
- `errorlog.md`: removed a dummy-key 401 test artifact.

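The temp+fsync+`os.replace` pattern named for `_write_cache` can be sketched like this. It is a minimal illustration of the general technique, not the skill's implementation: `write_cache_atomic` and the temp-file naming are hypothetical, and the advisory lock (`_cache_lock`) is left out.

```python
import json
import os
import tempfile


def write_cache_atomic(path: str, data: dict) -> None:
    """Write JSON so readers see either the old file or the new one, never a partial."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".cache-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic swap into place
    except BaseException:
        # Clean up the temp file on any failure, then re-raise.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A crash before `os.replace` leaves the previous cache untouched, which is the property the non-atomic write lacked.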
## Credentials & Secrets

None created, rotated, or newly discovered. The GravityZone API key is loaded from the
SOPS vault (`msp-tools/gravityzone.sops.yaml` field `credentials.api_key`) via
`vault.sh`; never hardcoded, logged, or cached. Verified intact across all changes.

## Infrastructure & Servers

GravityZone Cloud Public API: `https://cloud.gravityzone.bitdefender.com/api/v1.0/jsonrpc`
(HTTP Basic, key as username/empty password). Live ACG partner tenant. ACG internal
company id `5c428b246c031893678b4569`; companies container `5c4280716c0318f3478b456e`;
root company `5c4280716c0318f3478b456a`. No infrastructure changed.

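The authentication scheme above (HTTP Basic with the API key as username and an empty password) can be sketched as a request builder. This is a hedged illustration using only the standard library: `build_request` is a hypothetical helper, the JSON-RPC 2.0 envelope fields are the generic ones from the JSON-RPC specification, and the method name and key in the usage note are placeholders. Nothing is sent over the network.

```python
import base64
import json
import urllib.request

API_URL = "https://cloud.gravityzone.bitdefender.com/api/v1.0/jsonrpc"


def build_request(url: str, api_key: str, method: str, params: dict,
                  request_id: str = "1") -> urllib.request.Request:
    """Build (but do not send) a JSON-RPC POST with Basic auth: key as user, empty password."""
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    body = {"jsonrpc": "2.0", "method": method, "params": params, "id": request_id}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

For example, `build_request(API_URL, "PLACEHOLDER-KEY", "getEndpointsList", {})` returns a request object that can be inspected offline. In the skill the key comes from the vault and should never appear in code or logs.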
## Commands & Outputs

- `python -m py_compile gz.py gz_client.py selftest.py`: clean each round.
- `python selftest.py`: read-only + refusal-path harness against the live tenant;
  ended `81/81 passed, 0 failed` (sets `GZ_SUPPRESS_ERRORLOG=1`, no errorlog pollution).
- `python gz.py sweep --company 5c428b246c031893678b4569`: exit 0, swept 9 endpoints
  (verified refactored pagination end-to-end).
- Offline unit checks: retry idempotency (write+timeout NOT retried, write+429/connect
  retried, read+timeout retried), `_retry_delay` clamp+ceiling, urllib reset
  classification, atomic-write/lock round-trip + stale-steal + temp-cleanup.

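The `_retry_delay` clamp+ceiling check in the last item can be pictured with a sketch like the one below. It assumes a numeric `Retry-After` in seconds and an exponential fallback; `retry_delay`, `BASE_DELAY`, and the fallback formula are hypothetical, and only the [0, 120] clamp comes from the log. An HTTP-date `Retry-After` is not parsed here and simply falls back to backoff.

```python
import math
from typing import Optional

MAX_DELAY = 120.0   # upper bound from the log: delays are clamped to [0, 120] seconds
BASE_DELAY = 1.0    # assumed base for the exponential fallback


def retry_delay(attempt: int, retry_after: Optional[str] = None) -> float:
    """Return seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            seconds = float(retry_after)
        except ValueError:
            seconds = math.nan  # non-numeric header: ignore it
        if not math.isnan(seconds):
            # Clamp so a negative or huge header value cannot crash or stall the client.
            return min(max(seconds, 0.0), MAX_DELAY)
    # No usable header: exponential backoff with the same ceiling.
    return min(BASE_DELAY * 2 ** attempt, MAX_DELAY)
```

A negative header value such as `"-5"` yields a zero delay instead of being passed to a sleep call, which is the class of crash found on the third pass.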
## Pending / Incomplete Tasks

- **None for bitdefender**: converged; two consecutive review passes clean.
- Consciously left (documented, not defects): id validation deliberately omitted on
  unpinned-format ids (package/report/hashItemId/quarantine-item);
  `socket.timeout is TimeoutError` assumption (non-issue on the fleet's Python 3.12);
  `raw` denylist is best-effort by design.
- Recurring friction worth addressing fleet-wide: background auto-sync interleaving manual
  commits (absorbs descriptive messages). Option: pause auto-sync during focused dev.

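Why the `raw` denylist is only best-effort is easier to see in a sketch. The pattern tuple below contains only the patterns named in this log (with `move*` read as the substring `move`); the real `DESTRUCTIVE_RAW_PATTERNS` list is longer and not reproduced here, and `is_denied` plus the method name `wipeEverything` are hypothetical.

```python
# Only the patterns named in this log; the full list in gz.py is an unknown superset.
PATTERNS = ("move", "createscan", "createpackage", "createcustomgroup", "createinstall")


def is_denied(method: str) -> bool:
    """Case-insensitive substring match of a raw method name against the denylist."""
    lowered = method.lower()
    return any(pattern in lowered for pattern in PATTERNS)


if __name__ == "__main__":
    print(is_denied("createScanTask"))    # matched by "createscan"
    print(is_denied("getEndpointsList"))  # a read, no pattern matches
    print(is_denied("wipeEverything"))    # hypothetical destructive name, not matched
```

A denylist can only block names someone thought to list, so an unlisted destructive method passes through. That is the trade-off accepted when the allowlist alternative was declined, and the reason the help text tells users to verify any raw method themselves.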
## Reference Information

- Skill: `.claude/skills/bitdefender/` (SKILL.md, scripts/{gz.py,gz_client.py,selftest.py},
  references/{api-reference.md,BUILDOUT.md}).
- Commits (origin/main, claudetools): `d8f0974` (gating+id-validation cluster),
  `51751e6` (H2 retry+pooling), `4d75083` (M3 atomic cache),
  `1852f75` (doc LOWs+pagination), `a96a15a` (auto-sync swept second-pass hardening),
  `778e12d` (companies pagination+connect-retry+Retry-After ceiling+ctx mgr),
  `3d6cb46` (urllib reset+Retry-After clamp+sweep/install-links validation),
  `befd265` (companies full fleet+drop unused import).
- Exit-code contract: 0 success, 1 API/runtime failure, 2 bad input/validation, 3 gated
  refusal, 130 KeyboardInterrupt.
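
The exit-code contract can be written down as an enum with a small dispatcher. This is a sketch under assumptions: `ExitCode`, `run`, and the mapping of `ValueError` to bad input are hypothetical; only the five numeric codes and their meanings come from the log.

```python
from enum import IntEnum
from typing import Callable


class ExitCode(IntEnum):
    SUCCESS = 0        # command completed
    FAILURE = 1        # API or runtime failure
    BAD_INPUT = 2      # validation error, e.g. malformed id
    GATED = 3          # destructive action refused without --confirm
    INTERRUPTED = 130  # KeyboardInterrupt


def run(handler: Callable[[], int]) -> int:
    """Run a command handler and translate exceptions into the contract's codes."""
    try:
        return int(handler())
    except KeyboardInterrupt:
        return int(ExitCode.INTERRUPTED)
    except ValueError:
        # Assumption: handlers signal bad input with ValueError.
        return int(ExitCode.BAD_INPUT)
    except Exception:
        return int(ExitCode.FAILURE)
```

A CLI entry point could then end with `sys.exit(run(main))`, keeping every exit path on one of the five documented codes.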