Commit Graph

11 Commits

Author SHA1 Message Date
07ecb6a366 feat(halacha): עצירה-רכה של הדריינר בסף-ניצול (75/65) + מקור-אמת יחיד למכסה (#265)
All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 1m29s
G12 Leak-Guard / leak-guard (push) Successful in 5s
Lint — undefined names / undefined-names (push) Successful in 11s
Co-authored-by: Chaim <chaim@marcus-law.co.il>
Co-committed-by: Chaim <chaim@marcus-law.co.il>
2026-06-15 04:11:43 +00:00
c348903e4b fix(extraction): סינון cited_only מתור/מוני החילוץ (#140)
All checks were successful
G12 Leak-Guard / leak-guard (pull_request) Successful in 7s
Lint — undefined names / undefined-names (pull_request) Successful in 14s
31 שורות case_law עם source_kind='cited_only' (ציטוט-בלבד, ללא full_text/chunks)
נושאות halacha_extraction_status='pending' רק כברירת-מחדל ומזהמות את מונה ה-pending
ובמתזמר/בדף-התפעול — אין להן מה לחלץ.

תיקון (G1 — תיקון-במקור, G2 — מסנן יחיד משותף):
- db.EXTRACTION_ELIGIBLE_PREDICATE — מקור-אמת יחיד ל"שורה ברת-חילוץ" (source_kind
  <> 'cited_only' AND יש precedent_chunks). מוחל ב-list_pending_extraction_requests;
  #139 יעשה בו שימוש-חוזר ל-reconcile (אותו כלל, לא כפול).
- מוני-snapshot מסננים cited_only: halacha_drain_supervisor.db_snapshot,
  web/app.py meta+hal_ext (GROUP BY status).
- reconcile_metadata_status.py מורחב לכסות גם את תור-ההלכות: cited_only→'skipped'
  (אותו terminal-state כמו צד-המטא, תור-תאום, G2). בוצע על ה-DB החי: 31 הועברו
  ל-'skipped' (metadata כבר היה מיושב — אידמפוטנטי). התפלגות-אחרי: halacha
  pending=9 (עבודה אמיתית), skipped=31, completed=309.

בדיקות: test_extraction_queue_eligibility (predicate + list_pending מחיל אותו,
שני ה-kinds). כל 345 בדיקות mcp עוברות. guards נקיים.

Invariants: G1 (terminal-state אמיתי במקור), G2 (predicate יחיד, ללא תור מקביל),
INV-DM1 (stub לא-searchable אינו מועמד-חילוץ), G12 (leak-guard נקי).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 04:03:21 +00:00
1094ac9967 feat(halacha): ספי-עצירה-רכים לדריינר — 5-שעות 75% / שבועי 65% (עצירה לפני 429) (#259)
All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 9s
G12 Leak-Guard / leak-guard (push) Successful in 4s
Lint — undefined names / undefined-names (push) Successful in 10s
Co-authored-by: Chaim <chaim@marcus-law.co.il>
Co-committed-by: Chaim <chaim@marcus-law.co.il>
2026-06-15 03:18:56 +00:00
1340bff6f1 fix(halacha): a fresh CLI 429 is ground truth over the usage endpoint (rate-limit)
All checks were successful
G12 Leak-Guard / leak-guard (pull_request) Successful in 4s
Lint — undefined names / undefined-names (pull_request) Successful in 10s
PR #251 made the OAuth usage endpoint the PRIMARY rate-limit signal and the log
429 only a fallback for when the endpoint is unreachable. Observed 2026-06-15: the
endpoint reported the window <100% (available) while the claude CLI kept 429-ing
("session limit"). The supervisor then read 'rate_limited=false', classified the
drain 'hung', and restart-churned it — RE-EXTRACTING already-completed precedents
under the rate limit and DEGRADING them (e.g. 4624/21 lost halachot 3→1, only
4/18 chunks). delta_done went negative (completed cases reverting).

Fix: a FRESH CLI 429 is ground truth — the call is literally failing.
  • ENTER cooldown on EITHER signal (endpoint-exhausted OR fresh 429), so a 429
    overrides an endpoint that wrongly reports the window available.
  • VETO the early resume while a fresh 429 remains (the endpoint can lie
    "available" mid-storm → without the veto we'd bounce straight back to churn).
  • DEFAULT_COOLDOWN_MIN=30 when a fresh 429 has no parseable reset time.
While limited the drain STOPS (no 429-hammering, no degrading completed cases) and
re-ignites only once quota is back AND no fresh 429 remains.

Tested: 8 unit-tests over the decision matrix (endpoint×429×stored-cooldown),
incl. the exact tonight case and the veto. py_compile clean.

Immediate mitigation already applied out-of-band: drain stopped + disabled
(drain_controls.disabled) to halt the degradation until this deploys.

Invariants: G1 (fix at source — trust the failing call, not a lagging endpoint),
G2 (same cooldown path, no parallel control). Builds on PR #251.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 02:28:16 +00:00
49efa94d60 fix(halacha): authoritative rate-limit detection + early-morning catch-up window (supervisor)
All checks were successful
G12 Leak-Guard / leak-guard (pull_request) Successful in 4s
הדריינר רץ בלילה 13→14.6 אך חילץ 0 הלכות: מכסת-המנוי של claude.ai אזלה
(`api_error_status:429 "You've hit your session limit · resets 2:30am UTC"`,
`total_cost_usd:0`), וה-reset (5:30 IDT) נחת מעט אחרי סגירת החלון (05:00 IDT).
המתזמר סיווג זאת שגוי כ-"hung" ועשה restart-storm כל 15 דק' — כי `scan_rate_limit`
קורא רק 120 שורות-זנב, וה-429 (שורה 8273/9170) נקבר תחת ~900 שורות teardown שה-churn
שלו עצמו ייצר. בנוסף "hold" לא עצר את הדריינר → המשך הלמת-429 ובזבוז המכסה.

Fix A — זיהוי rate-limit עמיד:
  • `quota_exhausted()` חדש: מקור-האמת הוא endpoint-המכסה (`subscription_usage`,
    אותו util שה-UI מציג) — durable, לא תלוי בעומק-זנב-הלוג. log-scrape רק כ-fallback.
  • בזמן מוגבל עוצר דריינר online (`hold-stopped`) כדי לא להלום 429; מצית-מחדש
    כשהמכסה חוזרת (exit מיידי כש-endpoint <100%, או probe `claude -p` אם endpoint למטה).

Fix B — חלון catch-up בוקר [05:00–07:00 IDT):
  • נפתח רק לניקוי backlog שנותר כשהמכסה חזרה (מגודר: לא-מוגבל + תור≠ריק) כדי
    שהמכסה המשוחררת לא תתבזבז עד הלילה הבא. הקצה המורחב מועבר לדריינר (window self-guard).

נתונים בטוחים — תיקים נשארו 'processing' for retry, שום הלכה לא אבדה.
13 unit-tests עוברים (parse endpoint, gating של catch-band, win extension); `status` חי OK.

Invariants: מקיים G1 (תיקון-במקור: זיהוי ממקור-מכסה סמכותי, לא מתסמין-לוג),
G2 (אותו endpoint+מנגנון-חלון קיימים — בלי מסלול מקביל), INV-G3/X16 (לא נוגע
ב-checkpointing הדטרמיניסטי). G12 לא רלוונטי (host-side pm2, בלי Paperclip).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:02:35 +00:00
eac4dd3ac9 fix(supervisor): gate + display weekly-Sonnet, not weekly-Opus
All checks were successful
G12 Leak-Guard / leak-guard (pull_request) Successful in 8s
On this claude.ai account the populated per-model weekly cap is Sonnet;
seven_day_opus is null (no separate Opus cap). So quota_available() now gates on
five_hour + seven_day + seven_day_sonnet (was seven_day_opus, which never bound),
and `status` prints weekly-Sonnet. The all-models seven_day cap remains the
backstop for Opus usage regardless. Matches the /operations display (#245).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 10:58:05 +00:00
9e46db3c48 feat(supervisor): read real claude.ai usage % from OAuth endpoint for quota gating
All checks were successful
G12 Leak-Guard / leak-guard (pull_request) Successful in 6s
The supervisor's quota check used a tiny `claude -p` probe to decide whether the
claude.ai subscription had room. That works but is indirect (an Opus-adjacent
round trip) and only answers yes/no. Anthropic exposes the actual utilization —
the same 5-hour / weekly / weekly-Opus percentages the Claude Code status bar
shows — via the (undocumented) GET /api/oauth/usage endpoint.

- subscription_usage(): reads the OAuth token from ~/.claude/.credentials.json
  and GETs /api/oauth/usage with the required `claude-code/*` User-Agent (without
  it the request hits an aggressively rate-limited bucket and 429s). Returns the
  parsed {five_hour, seven_day, seven_day_opus, ...} or None on any failure.
- quota_available(): now prefers the endpoint — a drain run resumes only when the
  5-hour, weekly, AND weekly-Opus windows are all <100% (the extractor runs Opus).
  More precise than the probe and sees every limit the way the UI does. Falls
  back to the `claude -p` probe when the endpoint is unreachable (it's
  undocumented and may change).
- `status` subcommand now prints the live percentages + reset times.

Note: this is the data/logic layer only. Surfacing the % on the /operations
page is a visual UI change and must go through the Claude Design gate first
(web-ui/AGENTS.md) — deferred.

Invariants: G1 (resume decision driven by the authoritative usage state).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 10:19:17 +00:00
013fe39ea7 fix(supervisor): re-probe claude.ai quota instead of waiting blindly for the reported reset
All checks were successful
G12 Leak-Guard / leak-guard (pull_request) Successful in 8s
When the halacha drain hit a 429, the supervisor recorded the reset time the
error reported (e.g. "resets 6:50pm UTC") and then HELD until that timestamp,
re-reading it from its own state every tick without ever checking whether quota
had actually returned. claude.ai usually frees up quota earlier than the message
claims, so the drain sat idle for hours after it could have resumed — and only a
manual kick (clear cooldown + trigger) got it going again.

Now, on any tick where we'd otherwise hold on a cooldown, run a cheap live probe
(`quota_available()` → a tiny `claude -p` call, cost ~0) and resume the instant
it succeeds — at most one probe per 15-min tick, only while we believe we're
limited. Conservative on failure (non-zero exit / timeout / limit message →
stay held), so a flaky probe never resumes the drain into a real 429. Adds a
claude_bin() resolver so the probe works under pm2 cron where PATH is bare.

Invariants: G1 (resume decision driven by actual quota state, not a guessed
timestamp); no new control path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 09:35:45 +00:00
a44827c3dd fix(operations): disabling the halacha drain now stops a running process immediately
All checks were successful
G12 Leak-Guard / leak-guard (pull_request) Successful in 6s
The /operations "disabled" toggle only wrote drain_controls.disabled, which the
drain checks at STARTUP — so a drain already mid-run kept going until the queue
emptied or the night window closed. Disabling did not stop a running drain.

Three layers, immediate + backstops:
- web/app.py operations_drain_toggle: on disable, also stop the running process
  immediately via the host pm2 bridge (_ops_pm2_control). Best-effort — a bridge
  failure doesn't fail the toggle.
- halacha_drain_supervisor.py: each tick now reads the disabled flag (added to
  db_snapshot) and, when set, stops the drain and never re-triggers it —
  regardless of burst/window. Backstop if the UI path failed (≤ one tick).
- drain_halacha_queue.py: re-check is_drain_disabled at the top of every round,
  so a drain disabled mid-run halts at the next round boundary. Per-chunk
  checkpoints mean the in-flight case loses nothing.

SCRIPTS.md updated for both drain and supervisor.

Invariants: G1 (fix at source — the disable control honoured along every path,
not just at startup); G2 (no parallel control path — same drain_controls flag).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 09:03:07 +00:00
75a1b23972 fix(supervisor): burst set/get via raw SQL, not new db helpers (host-lag-proof)
All checks were successful
G12 Leak-Guard / leak-guard (pull_request) Successful in 9s
The host pm2 supervisor imports legal_mcp.services.db from the host repo checkout,
which can lag main by many commits. Depending on the just-added db.set_drain_burst/
get_drain_burst would require the host checkout to be current. Use raw SQL via the
stable db.get_pool() instead — the supervisor now depends only on get_pool + the
drain_controls.burst_until column (the shared contract with the /operations API).
The container-side API keeps using the typed helpers (it ships the code in-image).

Invariants: G1/G2 unchanged (same single DB column, no parallel path).
2026-06-12 11:16:38 +00:00
c7c402e7ef feat(operations): manual burst control for the halacha drain + permanent supervisor
All checks were successful
G12 Leak-Guard / leak-guard (pull_request) Successful in 6s
The halacha-extraction backlog needs to be worked off the chair's leftover weekly
Claude quota on demand. This adds a MANUAL, time-boxed "burst" — run the drain
continuously now until a chosen deadline (default the upcoming Saturday 18:00 IL),
managed interactively from /operations — plus the permanent health-supervisor that
enforces it.

Backend (this PR; deploys via Coolify + host pm2):
- db: drain_controls.burst_until (SCHEMA_V37) + set_drain_burst/get_drain_burst/
  get_drain_bursts. Single source of truth shared by the container-side /operations
  API and the host-side supervisor.
- web: POST /api/operations/drains/{name}/burst (on→until|next-Sat-18:00, off→NULL),
  and burst_until surfaced per-service in the /operations snapshot.
- scripts/halacha_drain_supervisor.py + legal-halacha-supervisor.config.cjs: pm2 cron
  (*/15, zero Claude quota) — re-triggers idle drain, restarts a HUNG run (liveness =
  per-chunk checkpoints, NOT log mtime), backs off on 429 until the parsed reset
  (fresh-gated), verifies crash-safe staging. Reads burst_until from the DB; burst
  auto-expires at the deadline (never bleeds into a fresh week).

UI (separate follow-up PR, after Claude Design approval): the /operations toggle +
date-picker that calls the burst endpoint.

Invariants: G1 (normalize at source — burst lives once in the DB, read by both
surfaces), G2 (no parallel control path — CAPTURE field on the existing
drain_controls + orchestrates the existing drain, not a new one), G12 (no Paperclip
touch), §6 (no silent error-swallow — burst-clear failure is surfaced as a note).
2026-06-12 11:11:13 +00:00