From dba2a131e01533be33fba4533f8ace1264459057 Mon Sep 17 00:00:00 2001
From: Chaim
Date: Sun, 7 Jun 2026 21:11:30 +0000
Subject: [PATCH] feat(halacha): multi-judge approval panel + policy calibration (Trust-or-Escalate)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The chair cannot review every pending halacha. Three independent-lineage
judges (Opus via claude_session · DeepSeek · Gemini-2.5-flash — #1 on
LegalBench) vote on the COARSE axis we proved reliable across models (92%):
"is this a genuine, keepable rule?". Only an agreed verdict acts; every
split escalates to the chair (INV-G10). Buckets: clean→KEEP?;
nli_unsupported→entailment re-adjudication;
extraction-defects→re-extraction.

halacha_panel_calibrate.py calibrates the voting policy on the gold-set's
is_holding (the coarse label) per Trust-or-Escalate (ICLR 2025):
unanimous → 94.9% precision / 78% coverage; majority → 92.9% / 99%;
ZERO false-drops in both (the panel never rejects a good rule). Chosen
policy (chair-approved): clean→majority-2/3, nli→asymmetric
(majority-reject, unanimous-approve), defects→re-extraction. Reversible
(--apply backs up review_status+flags first).

Sources: Panel-of-LLM-Evaluators (PoLL) · Trust-or-Escalate (ICLR 2025,
arXiv:2407.18370) · selective-prediction / learning-to-defer.

Invariants: upholds G10 (human gate — splits escalate, panel only
collapses the queue) and G9 (provenance — reviewer records the panel +
policy). Read paths only in calibrate; --apply writes
review_status/quality_flags reversibly with backup.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 scripts/SCRIPTS.md                 |   2 +
 scripts/halacha_panel_approve.py   | 336 +++++++++++++++++++++++++++++
 scripts/halacha_panel_calibrate.py | 117 ++++++++++
 3 files changed, 455 insertions(+)
 create mode 100644 scripts/halacha_panel_approve.py
 create mode 100644 scripts/halacha_panel_calibrate.py

diff --git a/scripts/SCRIPTS.md b/scripts/SCRIPTS.md
index bc5114f..37ab229 100644
--- a/scripts/SCRIPTS.md
+++ b/scripts/SCRIPTS.md
@@ -46,6 +46,8 @@
 | `halacha_goldset.py` | python | **#81.7**: gold-set harness for halacha-extraction quality. `export --n N` exports a stratified sample (by precedent×rule_type) to CSV with empty labeling columns (`is_holding`/`correct_type`/`quote_complete`) for manual labeling (Chaim/Dafna). `score --in ` reads the labeled CSV and measures each validator (`compute_quality_flags`/`is_fact_dependent`/`is_quote_truncated`/`is_thin_restatement`) against the human ground truth: P/R/F1 + confusion. Basis for #81.8 (approval-threshold calibration). Imports the same validators the extractor runs. Runs with the mcp-server venv. **Note:** an interactive DB-backed labeling page (`/goldset`) also exists; this is the CSV fallback | Manual: export→label→score |
 | `goldset_ai_recommend.py` | python | **#81.7 QA**: produces a **second, AI opinion** (local claude, zero cost) for every item in `halacha_goldset`: `is_holding`+`type`+rationale, stored in `ai_*` and shown on the page next to the human label to surface disagreements. **Independent** of the validators being measured (no circularity) and **not** applied automatically. `--force` (regenerate)/`--limit N`. **Local-only** (claude_session). | Manual: after creating/extending a batch |
 | `goldset_independent_judge.py` | python | **INV-DM7 validation**: a **second, independent** role judge from a different model (direct DeepSeek API, OpenAI-compatible) that breaks the AI anchoring: it classifies rule_role **blind** (without seeing the human label or the claude recommendation) and computes an agreement matrix (deepseek↔human vs ai↔human) + a coarse axis (generalizable rule vs application/obiter). **Finding (2026-06-07):** ai↔human=100% (anchored), deepseek↔human=50% exact but **92% coarse** → the holding/interpretive/procedural sub-type is inherently fuzzy (do not rely on it); the coarse axis is reliable across models. Read-only on the gold set. `--model`/`--limit`/`--concurrency`. Key from `~/.hermes/profiles/deepseek/.env`. raw→`/tmp/goldset_judge_raw.json`. | Manual: label-reliability validation |
+| `halacha_panel_approve.py` | python | **Halacha approval panel (Trust-or-Escalate, dry-run by default).** 3 independent-lineage judges (Opus/claude_session · DeepSeek · Gemini-2.5-flash) vote on the **reliable coarse axis** (92% cross-model): clean→"keepable halacha?"; nli_unsupported→"does the quote support the rule?" (re-adjudication); defective→re-extraction. Only an agreed verdict acts automatically; **a split escalates to the chair** (INV-G10). Dry-run by default; `--apply` writes the chair-approved policy after a CSV backup. Keys: DeepSeek from `~/.hermes/...`, Gemini from `~/.env`. **Local-only**. Dry-run 2026-06-07: 197→103 auto (unanimous) / ~15 (majority). | Manual: approval-queue triage |
+| `halacha_panel_calibrate.py` | python | **Calibrates the panel's voting policy** (Trust-or-Escalate, ICLR 2025). Runs the KEEP question of `halacha_panel_approve` on the gold-set sample and measures precision+coverage against `is_holding` (the coarse axis) for each policy (unanimous/majority), plus false-keep/false-drop counts. Gives the **actual error rate** for choosing the risk threshold α. Imports the judges from `halacha_panel_approve` (single source of truth). Read-only, **local-only**. | Manual: before wiring `--apply` |
 | `halacha_rule_role_backfill.py` | python | **INV-DM7**: one-time backfill that reclassifies the legacy halachot (`rule_type IN ('binding','persuasive')`, authority values that were stored disguised as a role before the axes were split) into one of the five **rule roles** (holding/interpretive/procedural/application/obiter) via the local claude_session (zero cost). **Does not touch authority** (derived from `precedent_level`). `--apply` (default dry-run) / `--limit N` / `--concurrency`. Writes a backup CSV to `data/audit/` first. Fail-safe (a failed item keeps its old value). **Local-only** (claude_session). | Manual, one-time, after deploying the authority split |
 | `halacha_batch_reconcile.py` | python | **#82.7**: offline cross-ruling dedup (conservative, **dry-run only**). dedup-on-insert compares only within a ruling; here the threshold is stricter (cosine ≥0.95, `--cosine`) and non-destructive: it finds near-duplicate halacha pairs across different rulings (pgvector `<=>` exact) with a lexical signal (Jaccard/Levenshtein) and reports them to a CSV in `data/audit/` for the chair's review. Does not skip/merge/delete. `--include-pending`. **`--link`** records the pairs found as `equivalent_halachot` (parallel authority, #84.2: a halacha-level parallel link, **not** a citation; idempotent, non-destructive). Runs with the mcp-server venv. Verified: 800 halachot → 5 pairs (linked). | Manual: review report / `--link` to link |
 | `calibrate_halacha_dedup.py` | python | **#82.1**: calibrates the lexical dedup thresholds (#82.3) against the cleanup gold-set. Reads `halacha-cleanup-manifest-*.csv` (human-labeled duplicate↔survivor pairs), loads the survivor text from the DB, and sweeps (jaccard_min × levenshtein_min) with P/R/F1, marking the configured operating point. Verified that (0.55, 0.70) → **precision 1.0** (zero false-merge), recall 0.30, which suits a secondary signal that blocks auto-approve. `--manifest `. Runs with the mcp-server venv | One-time: calibration (done 2026-06-06) |
diff --git a/scripts/halacha_panel_approve.py b/scripts/halacha_panel_approve.py
new file mode 100644
index 0000000..c246216
--- /dev/null
+++ b/scripts/halacha_panel_approve.py
@@ -0,0 +1,336 @@
+#!/usr/bin/env python3
+"""Multi-judge panel to triage the halacha approval queue — DRY-RUN by default.
+
+The chair cannot review every pending halacha. We proved (goldset_independent_
+judge.py) that the COARSE axis — "is this a genuine, generalizable rule worth
+keeping as a citable precedent?" — is reliable ACROSS independent models (92%
+cross-model agreement), while the fine sub-type is not. This script turns that
+into a triage: THREE independent-lineage judges vote on the coarse question, and
+only an AGREED verdict acts automatically; every split escalates to the chair.
+That collapses the queue without removing the human gate (INV-G10).
+
+Three judges, three lineages (diversity is the point):
+  - claude   (Opus via claude_session — local CLI, zero marginal cost)  [Anthropic]
+  - deepseek (api.deepseek.com)                                          [DeepSeek]
+  - gemini   (generativelanguage — gemini-2.5-flash, #1 on LegalBench)   [Google]
+
+Three buckets of pending_review:
+  1. clean, below confidence threshold → panel votes KEEP? unanimous-keep would
+     auto-approve; split → chair.
+  2. nli_unsupported (rule maybe over-reaches its quote) → panel RE-ADJUDICATES
+     entailment; unanimous-entailed would clear the flag + approve; split → chair.
+  3. other quality flags (quote_unverified/truncated/thin) → genuine extraction
+     defects → flagged for re-extraction, never auto-approved.
+
+DRY-RUN (the default) writes NOTHING and reports unanimous verdicts. --apply
+writes the chair-approved policy after a CSV backup. Local-only (needs the CLI).
+
+    cd ~/legal-ai/mcp-server
+    .venv/bin/python ../scripts/halacha_panel_approve.py --limit 12   # smoke
+    .venv/bin/python ../scripts/halacha_panel_approve.py              # full dry-run
+"""
+from __future__ import annotations
+
+import argparse
+import asyncio
+import csv
+import json
+import os
+from collections import Counter, defaultdict
+from datetime import datetime, timezone
+from pathlib import Path
+
+import httpx
+
+from legal_mcp.services import claude_session, db
+
+# ── keys (local files, same pattern as the other local judges) ──
+
+def _env_key(name: str, *files: str) -> str:
+    for f in files:
+        p = Path(f).expanduser()
+        if p.exists():
+            for line in p.read_text().splitlines():
+                if line.startswith(name + "="):
+                    return line.split("=", 1)[1].strip()
+    return os.environ.get(name, "")
+
+
+DEEPSEEK_KEY = _env_key("DEEPSEEK_API_KEY", "~/.hermes/profiles/deepseek/.env", "~/.env")
+GEMINI_KEY = _env_key("GEMINI_API_KEY", "~/.env")
+
+# ── the two coarse questions (the reliable axis — NOT the fuzzy sub-type) ──
+
+KEEP_SYSTEM = (
+    "אתה משפטן בכיר בוועדת ערר לתכנון ובנייה. הוכרע אם 'הלכה' שחולצה מפסיקה ראויה "
+    "להישמר כתקדים בר-ציטוט. ראויה (keep=true) = עיקרון משפטי בר-הכללה והסתמכות "
+    "(holding/פרשנות/כלל-פרוצדורלי). לא-ראויה (keep=false) = החלה תלוית-עובדות על "
+    "התיק הספציפי, סוגיה שלא הוכרעה (אמרת-אגב), או חזרה מילולית על הציטוט ללא הפשטה. "
+    'החזר JSON בלבד: {"keep": true/false, "reason": "<משפט קצר>"}. ללא markdown.'
+)
+
+NLI_SYSTEM = (
+    "אתה בודק היסק משפטי. בהינתן כלל וציטוט-תומך, הכרע האם הציטוט באמת תומך בכלל "
+    "ואינו מרחיב מעבר למה שכתוב בו (entailed=true), או שהכלל מרחיב/חורג מהציטוט "
+    '(entailed=false). החזר JSON בלבד: {"entailed": true/false}. ללא markdown, ללא הסבר.'
+)
+
+
+def _keep_user(h: dict) -> str:
+    return (
+        f"ניסוח הכלל:\n{h.get('rule_statement') or ''}\n\n"
+        f"היגיון:\n{h.get('reasoning_summary') or ''}\n\n"
+        f"ציטוט תומך:\n{h.get('supporting_quote') or ''}"
+    )
+
+
+def _nli_user(h: dict) -> str:
+    return f"כלל:\n{h.get('rule_statement') or ''}\n\nציטוט:\n{h.get('supporting_quote') or ''}"
+
+
+# ── three judges, one signature: (system, user) -> dict|None ──
+
+async def judge_claude(system: str, user: str) -> dict | None:
+    try:
+        return await claude_session.query_json(user, system=system)
+    except Exception:
+        return None
+
+
+async def judge_deepseek(client: httpx.AsyncClient, system: str, user: str) -> dict | None:
+    if not DEEPSEEK_KEY:
+        return None
+    try:
+        r = await client.post(
+            "https://api.deepseek.com/v1/chat/completions",
+            headers={"Authorization": f"Bearer {DEEPSEEK_KEY}", "Content-Type": "application/json"},
+            json={"model": "deepseek-chat", "temperature": 0, "max_tokens": 120,
+                  "response_format": {"type": "json_object"},
+                  "messages": [{"role": "system", "content": system},
+                               {"role": "user", "content": user}]},
+            timeout=90,
+        )
+        r.raise_for_status()
+        return json.loads(r.json()["choices"][0]["message"]["content"])
+    except Exception:
+        return None
+
+
+async def judge_gemini(client: httpx.AsyncClient, system: str, user: str) -> dict | None:
+    if not GEMINI_KEY:
+        return None
+    try:
+        r = await client.post(
+            f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent?key={GEMINI_KEY}",
+            headers={"Content-Type": "application/json"},
+            json={"system_instruction": {"parts": [{"text": system}]},
+                  "contents": [{"parts": [{"text": user}]}],
+                  "generationConfig": {"temperature": 0, "maxOutputTokens": 4000,
+                                       "responseMimeType": "application/json"}},
+            timeout=90,
+        )
+        r.raise_for_status()
+        return json.loads(r.json()["candidates"][0]["content"]["parts"][0]["text"])
+    except Exception:
+        return None
+
+
+def _bool(d: dict | None, key: str) -> bool | None:
+    if not isinstance(d, dict) or key not in d:
+        return None
+    v = d[key]
+    if isinstance(v, bool):
+        return v
+    return str(v).strip().lower() in ("true", "1", "yes", "כן")
+
+
+async def panel_vote(client, system, user, key) -> dict:
+    """Run all three judges; return per-judge bools + the verdict."""
+    c, ds, gm = await asyncio.gather(
+        judge_claude(system, user),
+        judge_deepseek(client, system, user),
+        judge_gemini(client, system, user),
+    )
+    votes = {"claude": _bool(c, key), "deepseek": _bool(ds, key), "gemini": _bool(gm, key)}
+    valid = [v for v in votes.values() if v is not None]
+    unanimous_yes = len(valid) == 3 and all(valid)
+    unanimous_no = len(valid) == 3 and not any(valid)
+    votes["_verdict"] = ("unanimous_yes" if unanimous_yes else
+                         "unanimous_no" if unanimous_no else
+                         "split" if len(valid) >= 2 else "incomplete")
+    return votes
+
+
+async def main(args: argparse.Namespace) -> int:
+    print(f"judges available — deepseek:{bool(DEEPSEEK_KEY)} gemini:{bool(GEMINI_KEY)} "
+          f"claude:local\n", flush=True)
+    pending = await db.list_halachot(review_status="pending_review", limit=5000)
+    if args.limit:
+        pending = pending[: args.limit]
+
+    NLI = "nli_unsupported"
+    DEFECT = {"quote_unverified", "truncated_quote", "thin_restatement", "near_duplicate"}
+
+    def bucket(h):
+        flags = set(h.get("quality_flags") or [])
+        if not flags:
+            return "clean"
+        if flags & DEFECT:
+            return "defect"  # genuine extraction problem → re-extraction
+        if NLI in flags:
+            return "nli"  # re-adjudicate entailment
+        return "other"
+
+    buckets = defaultdict(list)
+    for h in pending:
+        buckets[bucket(h)].append(h)
+    print("queue:", {k: len(v) for k, v in buckets.items()}, "\n", flush=True)
+
+    sem = asyncio.Semaphore(args.concurrency)
+    results = {"clean": [], "nli": []}
+
+    async with httpx.AsyncClient() as client:
+        async def run(h, system_fn, user_fn, key, tag):
+            async with sem:
+                v = await panel_vote(client, system_fn, user_fn(h), key)
+                v["_h"] = h
+                results[tag].append(v)
+
+        tasks = []
+        for h in buckets["clean"]:
+            tasks.append(run(h, KEEP_SYSTEM, _keep_user, "keep", "clean"))
+        for h in buckets["nli"]:
+            tasks.append(run(h, NLI_SYSTEM, _nli_user, "entailed", "nli"))
+        # bounded fan-out
+        for i in range(0, len(tasks), args.concurrency):
+            await asyncio.gather(*tasks[i : i + args.concurrency])
+            done = len(results["clean"]) + len(results["nli"])
+            print(f" …{done}/{len(tasks)} judged", flush=True)
+
+    # ── report ──
+    def summarize(rows, yes_label, no_label):
+        c = Counter(r["_verdict"] for r in rows)
+        return c
+
+    print("\n" + "=" * 60)
+    print("PANEL DRY-RUN (no DB writes)")
+    print("=" * 60)
+
+    clean = results["clean"]
+    cc = summarize(clean, "keep", "drop")
+    print(f"\nBUCKET 1 — clean, below threshold ({len(clean)}):")
+    print(f" ✓ auto-APPROVE (3/3 keep): {cc['unanimous_yes']}")
+    print(f" ✗ auto-REJECT (3/3 drop): {cc['unanimous_no']}")
+    print(f" → CHAIR (split): {cc['split']}")
+    print(f" ? incomplete (judge errors): {cc['incomplete']}")
+
+    nli = results["nli"]
+    nc = summarize(nli, "entailed", "not")
+    print(f"\nBUCKET 2 — nli_unsupported ({len(nli)}):")
+    print(f" ✓ clear-flag + APPROVE (3/3 entailed): {nc['unanimous_yes']}")
+    print(f" ✗ confirm-flag (3/3 not-entailed): {nc['unanimous_no']}")
+    print(f" → CHAIR (split): {nc['split']}")
+    print(f" ? incomplete: {nc['incomplete']}")
+
+    print(f"\nBUCKET 3 — extraction defects ({len(buckets['defect'])}): → re-extraction")
+    if buckets["other"]:
+        print(f"BUCKET 4 — other flags ({len(buckets['other'])}): → chair")
+
+    auto = cc["unanimous_yes"] + cc["unanimous_no"] + nc["unanimous_yes"] + nc["unanimous_no"]
+    chair = cc["split"] + nc["split"] + cc["incomplete"] + nc["incomplete"] + len(buckets["other"])
+    reext = len(buckets["defect"])
+    print("\n" + "-" * 60)
+    print(f"NET: {len(pending)} pending → panel resolves {auto} automatically, "
+          f"{chair} to chair, {reext} to re-extraction")
+    print(f" chair queue collapses {len(pending)} → {chair}")
+
+    Path("/tmp/halacha_panel_dryrun.json").write_text(json.dumps(
+        [{**{k: v for k, v in r.items() if not k.startswith("_h")},
+          "id": str(r["_h"]["id"]), "case": r["_h"].get("case_number"),
+          "rule": (r["_h"].get("rule_statement") or "")[:120]}
+         for r in clean + nli], ensure_ascii=False, indent=1))
+    print("\nper-item verdicts → /tmp/halacha_panel_dryrun.json")
+
+    # ── apply the chair-approved policy (reversible; backup first) ──────────
+    # CLEAN → majority 2/3 (keep→approved, drop→rejected, tie→chair)
+    # NLI → asymmetric: unanimous-entailed → clear nli flag (+approve if clean),
+    #       majority not-entailed → rejected, else → chair
+    # DEFECT → untouched (needs re-extraction)
+    if not args.apply:
+        print("\n(dry-run — pass --apply to write the approved policy)")
+        return 0
+
+    def majority(v: dict) -> bool | None:
+        vs = [v[k] for k in ("claude", "deepseek", "gemini") if v[k] is not None]
+        if len(vs) < 2:
+            return None
+        y, n = sum(vs), len(vs) - sum(vs)
+        return True if y > n else (False if n > y else None)
+
+    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
+    audit = Path(__file__).resolve().parent.parent / "data" / "audit"
+    audit.mkdir(parents=True, exist_ok=True)
+    backup = audit / f"halacha-panel-apply-backup-{ts}.csv"
+    with backup.open("w", encoding="utf-8", newline="") as f:
+        w = csv.writer(f)
+        w.writerow(["id", "review_status", "quality_flags"])
+        for r in clean + nli:
+            h = r["_h"]
+            w.writerow([h["id"], h["review_status"], "|".join(h.get("quality_flags") or [])])
+
+    pool = await db.get_pool()
+    REV = "panel:opus+deepseek+gemini"
+    approved = rejected = cleared = chair = 0
+
+    for r in clean:
+        d = majority(r)
+        if d is True:
+            await pool.execute("UPDATE halachot SET review_status='approved', "
+                               "reviewed_at=now(), reviewer=$2, updated_at=now() WHERE id=$1",
+                               r["_h"]["id"], REV + " 2/3-keep")
+            approved += 1
+        elif d is False:
+            await pool.execute("UPDATE halachot SET review_status='rejected', "
+                               "reviewed_at=now(), reviewer=$2, updated_at=now() WHERE id=$1",
+                               r["_h"]["id"], REV + " 2/3-drop")
+            rejected += 1
+        else:
+            chair += 1
+
+    for r in nli:
+        vs = [r[k] for k in ("claude", "deepseek", "gemini") if r[k] is not None]
+        unanimous_yes = len(vs) == 3 and all(vs)
+        maj_no = len(vs) >= 2 and sum(vs) < len(vs) - sum(vs)
+        if unanimous_yes:
+            rest = [x for x in (r["_h"].get("quality_flags") or []) if x != "nli_unsupported"]
+            if rest:  # other flags remain → clear nli but keep in queue
+                await pool.execute("UPDATE halachot SET quality_flags=$2, updated_at=now() "
+                                   "WHERE id=$1", r["_h"]["id"], rest)
+                cleared += 1; chair += 1
+            else:  # nli was the only blocker → clear + approve
+                await pool.execute("UPDATE halachot SET quality_flags='{}', "
+                                   "review_status='approved', reviewed_at=now(), reviewer=$2, "
+                                   "updated_at=now() WHERE id=$1", r["_h"]["id"], REV + " 3/3-entailed")
+                approved += 1; cleared += 1
+        elif maj_no:
+            await pool.execute("UPDATE halachot SET review_status='rejected', "
+                               "reviewed_at=now(), reviewer=$2, updated_at=now() WHERE id=$1",
+                               r["_h"]["id"], REV + " maj-not-entailed")
+            rejected += 1
+        else:
+            chair += 1
+
+    print(f"\nAPPLIED (reversible): approved {approved} · rejected {rejected} · "
+          f"nli-flag-cleared {cleared} · left to chair {chair + len(buckets['defect'])} "
+          f"(incl. {len(buckets['defect'])} defects for re-extraction)")
+    print(f"backup → {backup}")
+    return 0
+
+
+if __name__ == "__main__":
+    ap = argparse.ArgumentParser(description=__doc__,
+                                 formatter_class=argparse.RawDescriptionHelpFormatter)
+    ap.add_argument("--limit", type=int, default=0)
+    ap.add_argument("--concurrency", type=int, default=6)
+    ap.add_argument("--apply", action="store_true",
+                    help="write the chair-approved policy (reversible; CSV backup first)")
+    raise SystemExit(asyncio.run(main(ap.parse_args())))
diff --git a/scripts/halacha_panel_calibrate.py b/scripts/halacha_panel_calibrate.py
new file mode 100644
index 0000000..8fb2898
--- /dev/null
+++ b/scripts/halacha_panel_calibrate.py
@@ -0,0 +1,117 @@
+#!/usr/bin/env python3
+"""Calibrate the approval-panel voting policy on the gold-set (Trust-or-Escalate).
+
+The literature (Trust or Escalate, ICLR 2025; PoLL; selective prediction) says:
+don't guess the aggregation policy — calibrate it to a target risk α on a
+calibration set, and ESCALATE disagreement to the human. We have a calibration
+set: the gold-set's ``is_holding`` is the COARSE "is this a real, keepable rule?"
+label — the axis we already proved is reliable across models (92%).
+
+This runs the panel's KEEP question (3 independent judges) on every gold-set item
+that has an is_holding label, then reports, FOR EACH POLICY, the auto-decision
+precision (vs is_holding) and coverage (how many it decides vs escalates):
+
+  - unanimous : auto-decide only on 3/3 agreement, else escalate
+  - majority  : auto-decide on 2/3, else escalate
+
+Pick the policy whose auto-error stays under your tolerance while covering the
+most items. Read-only. Local-only (claude_session needs the CLI).
+
+    cd ~/legal-ai/mcp-server
+    .venv/bin/python ../scripts/halacha_panel_calibrate.py
+"""
+from __future__ import annotations
+
+import argparse
+import asyncio
+
+import httpx
+
+from legal_mcp.services import db
+# reuse the exact panel judges + KEEP question (single source of truth)
+from halacha_panel_approve import (  # noqa: E402
+    KEEP_SYSTEM, _bool, _keep_user, judge_claude, judge_deepseek, judge_gemini,
+)
+
+
+async def _votes(client, h) -> list[bool]:
+    user = _keep_user(h)
+    c, ds, gm = await asyncio.gather(
+        judge_claude(KEEP_SYSTEM, user),
+        judge_deepseek(client, KEEP_SYSTEM, user),
+        judge_gemini(client, KEEP_SYSTEM, user),
+    )
+    return [v for v in (_bool(c, "keep"), _bool(ds, "keep"), _bool(gm, "keep")) if v is not None]
+
+
+def _decide(votes: list[bool], policy: str) -> bool | None:
+    """Auto-decision (True=keep / False=drop) or None=escalate."""
+    if len(votes) < 2:
+        return None
+    yes, no = sum(votes), len(votes) - sum(votes)
+    if policy == "unanimous":
+        if len(votes) == 3 and yes == 3:
+            return True
+        if len(votes) == 3 and no == 3:
+            return False
+        return None
+    # majority
+    if yes > no:
+        return True
+    if no > yes:
+        return False
+    return None  # tie
+
+
+async def main(args: argparse.Namespace) -> int:
+    items = [it for it in await db.goldset_list(args.batch) if it.get("is_holding") is not None]
+    if args.limit:
+        items = items[: args.limit]
+    print(f"calibrating panel KEEP vs is_holding on {len(items)} gold-set items\n", flush=True)
+
+    sem = asyncio.Semaphore(args.concurrency)
+    rows = []
+    async with httpx.AsyncClient() as client:
+        async def one(it):
+            async with sem:
+                v = await _votes(client, it)
+                rows.append({"truth": bool(it["is_holding"]), "votes": v})
+        tasks = [one(it) for it in items]
+        for i in range(0, len(tasks), args.concurrency):
+            await asyncio.gather(*tasks[i : i + args.concurrency])
+            print(f" …{len(rows)}/{len(items)}", flush=True)
+
+    print("\n" + "=" * 64)
+    print(f"{'policy':<11}{'auto':>6}{'escalate':>10}{'correct':>9}{'wrong':>7}{'precision':>11}{'coverage':>10}")
+    print("-" * 64)
+    for policy in ("unanimous", "majority"):
+        auto = wrong = correct = 0
+        for r in rows:
+            d = _decide(r["votes"], policy)
+            if d is None:
+                continue
+            auto += 1
+            if d == r["truth"]:
+                correct += 1
+            else:
+                wrong += 1
+        esc = len(rows) - auto
+        prec = correct / auto if auto else 0.0
+        cov = auto / len(rows) if rows else 0.0
+        print(f"{policy:<11}{auto:>6}{esc:>10}{correct:>9}{wrong:>7}{prec:>11.1%}{cov:>10.1%}")
+
+    # where do the WRONG auto-decisions fall? (false-keep is the costly one)
+    print("\n=== costly errors: panel auto-KEEPS but human says NOT-holding (per policy) ===")
+    for policy in ("unanimous", "majority"):
+        fk = sum(1 for r in rows if _decide(r["votes"], policy) is True and not r["truth"])
+        fd = sum(1 for r in rows if _decide(r["votes"], policy) is False and r["truth"])
+        print(f" {policy:<11} false-KEEP (bad rule approved): {fk} false-DROP (good rule rejected): {fd}")
+    return 0
+
+
+if __name__ == "__main__":
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--batch", default="default")
+    ap.add_argument("--limit", type=int, default=0)
+    ap.add_argument("--concurrency", type=int, default=6)
+    raise SystemExit(asyncio.run(main(ap.parse_args())))
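
The calibrate docstring says to pick the policy whose auto-error stays under a tolerance while covering the most items, but the script only prints the table and leaves that choice to the reader. A minimal sketch of how the selection could be automated is below. It is an illustration under stated assumptions, not the script's code: `decide`, `evaluate`, `pick_policy`, the `alpha` parameter, and the `TOY` rows are all hypothetical, and the toy votes are invented rather than gold-set results.

```python
from __future__ import annotations

# Hypothetical toy rows: (human is_holding label, votes from the judges that answered).
# These are made-up values for illustration only.
TOY = [
    (True, [True, True, True]),
    (True, [True, True, False]),
    (False, [False, False, False]),
    (False, [True, False, False]),
    (True, [True, False]),         # one judge failed, leaving a 1-1 tie
    (False, [True, True, False]),  # a majority would wrongly keep this one
]


def decide(votes: list[bool], policy: str) -> bool | None:
    """Return True (keep), False (drop), or None (escalate to the chair)."""
    if len(votes) < 2:
        return None
    yes = sum(votes)
    no = len(votes) - yes
    if policy == "unanimous":
        if len(votes) == 3 and no == 0:
            return True
        if len(votes) == 3 and yes == 0:
            return False
        return None
    if yes == no:  # majority policy: a tie escalates
        return None
    return yes > no


def evaluate(rows, policy: str) -> tuple[float, float]:
    """Precision over auto-decided rows and coverage over all rows."""
    decided = [(truth, decide(votes, policy)) for truth, votes in rows]
    auto = [(t, d) for t, d in decided if d is not None]
    correct = sum(1 for t, d in auto if t == d)
    precision = correct / len(auto) if auto else 0.0
    coverage = len(auto) / len(rows) if rows else 0.0
    return precision, coverage


def pick_policy(rows, alpha: float) -> str | None:
    """Highest-coverage policy whose auto-error (1 - precision) is <= alpha."""
    best = None
    for policy in ("unanimous", "majority"):
        precision, coverage = evaluate(rows, policy)
        if 1.0 - precision <= alpha and (best is None or coverage > best[1]):
            best = (policy, coverage)
    return best[0] if best else None


if __name__ == "__main__":
    for policy in ("unanimous", "majority"):
        print(policy, evaluate(TOY, policy))
    print("alpha=0.10 ->", pick_policy(TOY, 0.10))
    print("alpha=0.25 ->", pick_policy(TOY, 0.25))
```

On the toy rows a tight tolerance selects `unanimous` and a looser one selects `majority`, which mirrors the precision/coverage trade-off the commit message reports. A real run would replace `TOY` with the `rows` that `halacha_panel_calibrate.py` collects.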