feat(operations): manual burst control for the halacha drain + permanent supervisor

The halacha-extraction backlog needs to be worked off on demand, using the chair's
leftover weekly Claude quota. This adds a MANUAL, time-boxed "burst": run the drain
continuously from now until a chosen deadline (default: the upcoming Saturday 18:00 IL),
managed interactively from /operations. It also adds the permanent health supervisor
that enforces it.
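
As an illustration of the default deadline, here is a minimal sketch of the "upcoming Saturday 18:00 IL" rule. It mirrors the helper in the diff, but takes `now` as a parameter so it can be checked against fixed dates; the name `next_saturday_18_il` and that parameter are assumptions made for this example only.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

IL = ZoneInfo("Asia/Jerusalem")


def next_saturday_18_il(now: datetime) -> datetime:
    """Upcoming Saturday 18:00 Israel time, strictly after `now` (aware datetime)."""
    now = now.astimezone(IL)
    days = (5 - now.weekday()) % 7  # Mon=0 .. Sat=5 .. Sun=6
    # Same-zone arithmetic works on wall-clock time, so 18:00 stays 18:00 across DST.
    cand = now.replace(hour=18, minute=0, second=0, microsecond=0) + timedelta(days=days)
    if cand <= now:
        # Already Saturday 18:00 or later: roll over to next week.
        cand += timedelta(days=7)
    return cand


# The commit timestamp (Friday 2026-06-12 11:11:13 UTC) resolves to the next day.
commit_time = datetime(2026, 6, 12, 11, 11, 13, tzinfo=timezone.utc)
print(next_saturday_18_il(commit_time).isoformat())  # 2026-06-13T18:00:00+03:00
```

A call made exactly at Saturday 18:00 rolls over to the following Saturday, so the default is never a deadline that has already passed.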

Backend (this PR; deploys via Coolify + host pm2):
- db: drain_controls.burst_until (SCHEMA_V37) + set_drain_burst/get_drain_burst/
  get_drain_bursts. Single source of truth shared by the container-side /operations
  API and the host-side supervisor.
- web: POST /api/operations/drains/{name}/burst (on→until|next-Sat-18:00, off→NULL),
  and burst_until surfaced per-service in the /operations snapshot.
- scripts/halacha_drain_supervisor.py + legal-halacha-supervisor.config.cjs: pm2 cron
  (*/15, zero Claude quota) — re-triggers idle drain, restarts a HUNG run (liveness =
  per-chunk checkpoints, NOT log mtime), backs off on 429 until the parsed reset
  (fresh-gated), verifies crash-safe staging. Reads burst_until from the DB; burst
  auto-expires at the deadline (never bleeds into a fresh week).
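
The supervisor rules above can be sketched as a pure decision function. This is not the code in scripts/halacha_drain_supervisor.py: the names `burst_active` and `decide`, the returned action strings, and the 30-minute hang threshold are all hypothetical, chosen only to show the order in which the checks could apply.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def burst_active(burst_until: Optional[datetime], now: datetime) -> bool:
    """A burst is active only while a deadline is set and still in the future."""
    return burst_until is not None and now < burst_until


def decide(
    now: datetime,
    burst_until: Optional[datetime],
    backoff_until: Optional[datetime],
    running: bool,
    last_checkpoint: Optional[datetime],
    hang_after: timedelta = timedelta(minutes=30),  # hypothetical threshold
) -> str:
    """Pick one action per supervisor tick (all datetimes must be timezone-aware)."""
    if not burst_active(burst_until, now):
        return "no-burst"  # expired or never set: leave the normal window alone
    if backoff_until is not None and now < backoff_until:
        return "wait"  # a 429 was seen: hold off until the parsed reset time
    if running:
        # Liveness comes from per-chunk checkpoints, not from log mtime.
        if last_checkpoint is not None and now - last_checkpoint > hang_after:
            return "restart"
        return "ok"
    return "trigger"  # idle drain inside an active burst: start it again


now = datetime(2026, 6, 12, 12, 0, tzinfo=timezone.utc)
deadline = datetime(2026, 6, 13, 15, 0, tzinfo=timezone.utc)
print(decide(now, deadline, None, False, None))  # trigger
```

Checking the deadline first is what makes expiry automatic in this sketch: once `now` passes `burst_until`, no other branch can re-trigger the drain.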

UI (separate follow-up PR, after Claude Design approval): the /operations toggle +
date-picker that calls the burst endpoint.
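
Until that UI lands, a client can build the request itself. The sketch below only builds the path and JSON body. The path and the `action`/`until` keys come from the endpoint in this commit, while the helper name `burst_request` and the service name `legal-example-drain` are made up for illustration.

```python
import json
from typing import Optional
from urllib.parse import quote


def burst_request(name: str, action: str, until: Optional[str] = None) -> tuple[str, str]:
    """Return (path, JSON body) for POST /api/operations/drains/{name}/burst."""
    if action not in ("on", "off"):
        raise ValueError("action must be 'on' or 'off'")
    payload = {"action": action}
    if action == "on" and until is not None:
        # ISO-8601 deadline; the endpoint treats a naive value as Israel time.
        payload["until"] = until
    path = f"/api/operations/drains/{quote(name, safe='')}/burst"
    return path, json.dumps(payload)


# 'legal-example-drain' is a hypothetical service name.
path, body = burst_request("legal-example-drain", "on", "2026-06-13T18:00:00+03:00")
print(path)  # /api/operations/drains/legal-example-drain/burst
print(body)  # {"action": "on", "until": "2026-06-13T18:00:00+03:00"}
```

Omitting `until` with `action="on"` leaves the deadline to the server, which falls back to the upcoming Saturday 18:00 Israel time.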

Invariants: G1 (normalize at source — burst lives once in the DB, read by both
surfaces), G2 (no parallel control path — CAPTURE field on the existing
drain_controls + orchestrates the existing drain, not a new one), G12 (no Paperclip
touch), §6 (no silent error-swallow — burst-clear failure is surfaced as a note).
2026-06-12 11:11:13 +00:00
parent 551d38dd7c
commit c7c402e7ef
5 changed files with 563 additions and 1 deletions


@@ -6637,8 +6637,10 @@ async def operations_snapshot():
    pm2 = await _ops_pm2_services()
    controls = await db.get_drain_controls()
    bursts = await db.get_drain_bursts()
    for svc in pm2["services"]:
        svc["disabled"] = controls.get(svc.get("name", ""), False)
        svc["burst_until"] = bursts.get(svc.get("name", ""))

    def _iso(rows: list[dict]) -> list[dict]:
        for d in rows:
@@ -6717,6 +6719,53 @@ async def operations_drain_toggle(name: str, body: dict = Body(...)):
    return {"ok": True, "name": name, "disabled": disabled}


def _next_saturday_18_il() -> datetime:
    """Upcoming Saturday 18:00 Israel time (DST-safe)."""
    from datetime import timedelta
    from zoneinfo import ZoneInfo
    il = ZoneInfo("Asia/Jerusalem")
    now = datetime.now(il)
    days = (5 - now.weekday()) % 7  # Mon=0 .. Sat=5 .. Sun=6
    cand = now.replace(hour=18, minute=0, second=0, microsecond=0) + timedelta(days=days)
    if cand <= now:
        cand += timedelta(days=7)
    return cand


@app.post("/api/operations/drains/{name}/burst")
async def operations_drain_burst(name: str, body: dict = Body(...)):
    """Start/stop a drain's MANUAL burst window (chair-controlled, from /operations).

    ``action='on'`` → ``burst_until`` = body ``until`` (ISO) or the upcoming
    Saturday 18:00 Israel time. ``action='off'`` → NULL. The host supervisor
    (legal-halacha-supervisor) reads this from the DB and lifts/restores the
    drain's window accordingly (takes effect within one supervisor tick, ≤15 min).
    Never set automatically — manual only."""
    if not name.startswith("legal-"):
        # Message: "Only legal-* services can be controlled"
        raise HTTPException(403, "ניתן לשלוט רק בשירותי legal-*")
    action = (body.get("action") or "").lower()
    if action == "off":
        await db.set_drain_burst(name, None)
        return {"ok": True, "name": name, "burst_until": None}
    if action == "on":
        until = body.get("until")
        if until:
            try:
                until_dt = datetime.fromisoformat(until)
            except (ValueError, TypeError):
                # Message: "until must be ISO-8601"
                raise HTTPException(400, "until חייב להיות ISO-8601")
            if until_dt.tzinfo is None:
                # A naive timestamp is interpreted as Israel local time.
                from zoneinfo import ZoneInfo
                until_dt = until_dt.replace(tzinfo=ZoneInfo("Asia/Jerusalem"))
        else:
            until_dt = _next_saturday_18_il()
        if until_dt <= datetime.now(timezone.utc):
            # Message: "until must be in the future"
            raise HTTPException(400, "until חייב להיות בעתיד")
        await db.set_drain_burst(name, until_dt)
        return {"ok": True, "name": name, "burst_until": until_dt.isoformat()}
    # Message: "action must be on|off"
    raise HTTPException(400, "action חייב להיות on|off")


# ── Live agents (/operations "סוכנים פעילים" = "Active agents") ────────────
# What the pm2/queue panels can't show: WHICH agent is doing the work right now
# and its live output. An agent-driven drain (e.g. the CEO heartbeat draining