When the halacha drain hit a 429, the supervisor recorded the reset time the
error reported (e.g. "resets 6:50pm UTC") and then HELD until that timestamp,
re-reading it from its own state every tick without ever checking whether quota
had actually returned. claude.ai usually frees up quota earlier than the message
claims, so the drain sat idle for hours after it could have resumed — and only a
manual kick (clear cooldown + trigger) got it going again.
Now, on any tick where we'd otherwise hold on a cooldown, run a cheap live probe
(`quota_available()` → a tiny `claude -p` call, cost ~0) and resume the instant
it succeeds — at most one probe per 15-min tick, only while we believe we're
limited. Conservative on failure (non-zero exit / timeout / limit message →
stay held), so a flaky probe never resumes the drain into a real 429. Adds a
claude_bin() resolver so the probe works under pm2 cron where PATH is bare.
Invariants: G1 (resume decision driven by actual quota state, not a guessed
timestamp); no new control path.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The /operations "disabled" toggle only wrote drain_controls.disabled, which the
drain checks at STARTUP — so a drain already mid-run kept going until the queue
emptied or the night window closed. Disabling did not stop a running drain.
Three layers, immediate + backstops:
- web/app.py operations_drain_toggle: on disable, also stop the running process
immediately via the host pm2 bridge (_ops_pm2_control). Best-effort — a bridge
failure doesn't fail the toggle.
- halacha_drain_supervisor.py: each tick now reads the disabled flag (added to
db_snapshot) and, when set, stops the drain and never re-triggers it —
regardless of burst/window. Backstop if the UI path failed (≤ one tick).
- drain_halacha_queue.py: re-check is_drain_disabled at the top of every round,
so a drain disabled mid-run halts at the next round boundary. Per-chunk
checkpoints mean the in-flight case loses nothing.
SCRIPTS.md updated for both drain and supervisor.
Invariants: G1 (fix at source — the disable control honoured along every path,
not just at startup); G2 (no parallel control path — same drain_controls flag).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The host pm2 supervisor imports legal_mcp.services.db from the host repo checkout,
which can lag main by many commits. Depending on the just-added db.set_drain_burst/
get_drain_burst would require the host checkout to be current. Use raw SQL via the
stable db.get_pool() instead — the supervisor now depends only on get_pool + the
drain_controls.burst_until column (the shared contract with the /operations API).
The container-side API keeps using the typed helpers (it ships the code in-image).
Invariants: G1/G2 unchanged (same single DB column, no parallel path).
The halacha-extraction backlog needs to be worked off the chair's leftover weekly
Claude quota on demand. This adds a MANUAL, time-boxed "burst" — run the drain
continuously now until a chosen deadline (default the upcoming Saturday 18:00 IL),
managed interactively from /operations — plus the permanent health-supervisor that
enforces it.
Backend (this PR; deploys via Coolify + host pm2):
- db: drain_controls.burst_until (SCHEMA_V37) + set_drain_burst/get_drain_burst/
get_drain_bursts. Single source of truth shared by the container-side /operations
API and the host-side supervisor.
- web: POST /api/operations/drains/{name}/burst (on→until|next-Sat-18:00, off→NULL),
and burst_until surfaced per-service in the /operations snapshot.
- scripts/halacha_drain_supervisor.py + legal-halacha-supervisor.config.cjs: pm2 cron
(*/15, zero Claude quota) — re-triggers idle drain, restarts a HUNG run (liveness =
per-chunk checkpoints, NOT log mtime), backs off on 429 until the parsed reset
(fresh-gated), verifies crash-safe staging. Reads burst_until from the DB; burst
auto-expires at the deadline (never bleeds into a fresh week).
UI (separate follow-up PR, after Claude Design approval): the /operations toggle +
date-picker that calls the burst endpoint.
Invariants: G1 (normalize at source — burst lives once in the DB, read by both
surfaces), G2 (no parallel control path — CAPTURE field on the existing
drain_controls + orchestrates the existing drain, not a new one), G12 (no Paperclip
touch), §6 (no silent error-swallow — burst-clear failure is surfaced as a note).