feat(halacha): multi-judge approval panel + policy calibration (Trust-or-Escalate) #130

Merged
chaim merged 1 commit from worktree-halacha-panel into main 2026-06-07 21:12:04 +00:00
3 changed files with 455 additions and 0 deletions

@@ -46,6 +46,8 @@
| `halacha_goldset.py` | python | **#81.7**: gold-set harness for halacha-extraction quality. `export --n N` exports a stratified sample (by precedent×rule_type) to CSV with empty labeling columns (`is_holding`/`correct_type`/`quote_complete`) for manual labeling (Chaim/Dafna). `score --in <csv>` reads the labeled CSV and measures each validator (`compute_quality_flags`/`is_fact_dependent`/`is_quote_truncated`/`is_thin_restatement`) against the human ground truth: P/R/F1 + confusion. Basis for #81.8 (approval-threshold calibration). Imports the same validators the extractor runs. Runs with the mcp-server venv. **Note:** there is also an interactive DB-backed labeling page (`/goldset`); this is the CSV fallback | manual: export→label→score |
| `goldset_ai_recommend.py` | python | **#81.7 QA**: produces a **second, AI opinion** (local claude, zero cost) for every item in `halacha_goldset`: `is_holding`+`type`+rationale, stored in `ai_*` and shown on the page next to the human label to surface disagreements. **Independent** of the validators being measured (no circularity) and **not** applied automatically. `--force` (regenerate)/`--limit N`. **Must run locally** (claude_session). | manual: after creating/extending a batch |
| `goldset_independent_judge.py` | python | **INV-DM7 validation**: a **second, independent** role judge from a different model (direct DeepSeek API, OpenAI-compatible) that breaks the AI anchoring: it classifies rule_role **blind** (without seeing the human label or the claude recommendation) and computes an agreement matrix (deepseek↔human vs ai↔human) + a coarse axis (generalizable rule vs application/obiter). **Finding (2026-06-07):** ai↔human=100% (anchored), deepseek↔human=50% exact but **92% coarse** → the holding/interpretive/procedural sub-type is inherently fuzzy (do not gate on it); the coarse axis is reliable across models. read-only on the gold set. `--model`/`--limit`/`--concurrency`. Key from `~/.hermes/profiles/deepseek/.env`. raw→`/tmp/goldset_judge_raw.json`. | manual: label-reliability validation |
| `halacha_panel_approve.py` | python | **Halacha approval panel (Trust-or-Escalate, dry-run by default).** 3 lineage-independent judges (Opus/claude_session · DeepSeek · Gemini-2.5-flash) vote on the **reliable coarse axis** (92% cross-model): clean items→"a halacha worth keeping?"; nli_unsupported→"does the quote support the rule?" (re-adjudication); defective→re-extraction. Only an agreed verdict acts automatically, **a split escalates to the chair** (INV-G10). Dry-run by default (`--apply` writes the chair-approved policy after a backup CSV). Keys: DeepSeek from `~/.hermes/...`, Gemini from `~/.env`. **Must run locally**. dry-run 2026-06-07: 197→103 auto (unanimous) / ~15 (majority). | manual: approval-queue triage |
| `halacha_panel_calibrate.py` | python | **Calibrates the panel's voting policy** (Trust-or-Escalate, ICLR 2025). Runs the KEEP question of `halacha_panel_approve` on the gold-set sample and measures, against `is_holding` (the coarse axis), precision+coverage for each policy (unanimous/majority) + false-keep/false-drop counts. Gives the **actual error rate** for choosing a risk threshold α. Imports the judges from `halacha_panel_approve` (single source of truth). read-only, **must run locally**. | manual: before running `--apply` |
| `halacha_rule_role_backfill.py` | python | **INV-DM7**: one-time backfill that reclassifies the old halachot (`rule_type IN ('binding','persuasive')`, authority values that were stored disguised as a role before the axes were split) into one of the five **rule roles** (holding/interpretive/procedural/application/obiter) via the local claude_session (zero cost). **Does not touch authority** (derived from `precedent_level`). `--apply` (default dry-run) / `--limit N` / `--concurrency`. Writes a backup CSV to `data/audit/` first. fail-safe (a failed item keeps its old value). **Must run locally** (claude_session). | manual, one-time, after deploying the authority split |
| `halacha_batch_reconcile.py` | python | **#82.7**: offline cross-ruling dedup (conservative, **dry-run only**). dedup-on-insert compares only within a ruling; here the threshold is stricter (cosine ≥0.95, `--cosine`) and non-destructive: it finds near-duplicate halacha pairs across different rulings (pgvector `<=>` exact) with a lexical signal (Jaccard/Levenshtein) and reports them to a CSV in `data/audit/` for the chair's review. Does not skip/merge/delete. `--include-pending`. **`--link`** records the pairs found as `equivalent_halachot` (parallel authority, #84.2: a halacha-level parallel link, **not** a citation; idempotent, non-destructive). Runs with the mcp-server venv. Verified: 800 halachot → 5 pairs (linked). | manual: review report / `--link` to link |
| `calibrate_halacha_dedup.py` | python | **#82.1**: calibrates the lexical dedup thresholds (#82.3) against the cleanup gold-set. Reads `halacha-cleanup-manifest-*.csv` (human-labeled duplicate↔survivor pairs), loads the survivor text from the DB, and sweeps (jaccard_min × levenshtein_min) with P/R/F1, marking the configured operating point. Verified that (0.55, 0.70) → **precision 1.0** (zero false-merge), recall 0.30, which suits a secondary signal that blocks auto-approve. `--manifest <path>`. Runs with the mcp-server venv | one-time: calibration (done 2026-06-06) |

@@ -0,0 +1,336 @@
#!/usr/bin/env python3
"""Multi-judge panel to triage the halacha approval queue. DRY-RUN by default.

The chair cannot review every pending halacha. We proved
(goldset_independent_judge.py) that the COARSE axis, "is this a genuine,
generalizable rule worth keeping as a citable precedent?", is reliable ACROSS
independent models (92% cross-model agreement), while the fine sub-type is not.
This script turns that into a triage: THREE independent-lineage judges vote on
the coarse question, and in the dry-run report only a UNANIMOUS verdict counts
as automatic; every split escalates to the chair. That collapses the queue
without removing the human gate (INV-G10).

Three judges, three lineages (diversity is the point):
  - claude   (Opus via claude_session: local CLI, zero marginal cost)   [Anthropic]
  - deepseek (api.deepseek.com)                                         [DeepSeek]
  - gemini   (generativelanguage: gemini-2.5-flash, #1 on LegalBench)   [Google]

Three buckets of pending_review:
  1. clean, below confidence threshold → panel votes KEEP? unanimous-keep would
     auto-approve; split → chair.
  2. nli_unsupported (rule maybe over-reaches its quote) → panel RE-ADJUDICATES
     entailment; unanimous-entailed would clear the flag + approve; split → chair.
  3. other quality flags (quote_unverified/truncated/thin) → genuine extraction
     defects → flagged for re-extraction, never auto-approved.

DRY-RUN writes nothing to the DB; it only dumps per-item verdicts to
/tmp/halacha_panel_dryrun.json. --apply writes the chair-approved policy coded
in main() (clean: 2/3 majority; nli: unanimous-entailed clears the flag,
majority not-entailed rejects; defects untouched) after saving a backup CSV
under data/audit/. Review the dry-run first. Local-only (claude_session needs
the CLI).

    cd ~/legal-ai/mcp-server
    .venv/bin/python ../scripts/halacha_panel_approve.py --limit 12   # smoke
    .venv/bin/python ../scripts/halacha_panel_approve.py              # full dry-run
"""
from __future__ import annotations

import argparse
import asyncio
import csv
import json
import os
from collections import Counter, defaultdict
from datetime import datetime, timezone
from pathlib import Path

import httpx

from legal_mcp.services import claude_session, db


# ── keys (local files, same pattern as the other local judges) ──
def _env_key(name: str, *files: str) -> str:
    # First matching NAME=value line in the listed files wins;
    # fall back to the process environment.
    for f in files:
        p = Path(f).expanduser()
        if p.exists():
            for line in p.read_text().splitlines():
                if line.startswith(name + "="):
                    return line.split("=", 1)[1].strip()
    return os.environ.get(name, "")


DEEPSEEK_KEY = _env_key("DEEPSEEK_API_KEY", "~/.hermes/profiles/deepseek/.env", "~/.env")
GEMINI_KEY = _env_key("GEMINI_API_KEY", "~/.env")
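# ── the two coarse questions (the reliable axis, NOT the fuzzy sub-type) ──
# The prompts stay in Hebrew because the halachot are Hebrew; English glosses:
# KEEP_SYSTEM: "You are a senior jurist on a planning-and-building appeals
#   committee. Decide whether a 'halacha' extracted from case law deserves to be
#   kept as a citable precedent. Worthy (keep=true) = a generalizable legal
#   principle that can be relied on (holding/interpretation/procedural rule).
#   Not worthy (keep=false) = a fact-dependent application to the specific case,
#   an issue left undecided (obiter), or a verbatim restatement of the quote
#   without abstraction. Return JSON only: {"keep": ..., "reason": ...}."
KEEP_SYSTEM = (
    "אתה משפטן בכיר בוועדת ערר לתכנון ובנייה. הוכרע אם 'הלכה' שחולצה מפסיקה ראויה "
    "להישמר כתקדים בר-ציטוט. ראויה (keep=true) = עיקרון משפטי בר-הכללה והסתמכות "
    "(holding/פרשנות/כלל-פרוצדורלי). לא-ראויה (keep=false) = החלה תלוית-עובדות על "
    "התיק הספציפי, סוגיה שלא הוכרעה (אמרת-אגב), או חזרה מילולית על הציטוט ללא הפשטה. "
    'החזר JSON בלבד: {"keep": true/false, "reason": "<משפט קצר>"}. ללא markdown.'
)
# NLI_SYSTEM: "You check legal inference. Given a rule and a supporting quote,
#   decide whether the quote really supports the rule without the rule extending
#   beyond what the quote says (entailed=true), or the rule over-reaches the
#   quote (entailed=false). Return JSON only: {"entailed": ...}."
NLI_SYSTEM = (
    "אתה בודק היסק משפטי. בהינתן כלל וציטוט-תומך, הכרע האם הציטוט באמת תומך בכלל "
    "ואינו מרחיב מעבר למה שכתוב בו (entailed=true), או שהכלל מרחיב/חורג מהציטוט "
    '(entailed=false). החזר JSON בלבד: {"entailed": true/false}. ללא markdown, ללא הסבר.'
)


def _keep_user(h: dict) -> str:
    # Section labels: rule statement / reasoning / supporting quote.
    return (
        f"ניסוח הכלל:\n{h.get('rule_statement') or ''}\n\n"
        f"היגיון:\n{h.get('reasoning_summary') or ''}\n\n"
        f"ציטוט תומך:\n{h.get('supporting_quote') or ''}"
    )


def _nli_user(h: dict) -> str:
    # Section labels: rule / quote.
    return f"כלל:\n{h.get('rule_statement') or ''}\n\nציטוט:\n{h.get('supporting_quote') or ''}"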
# ── three judges, one signature: (system, user) -> dict|None ──
# Every judge swallows its own errors and returns None, so one failing
# provider degrades the vote instead of aborting the run.
async def judge_claude(system: str, user: str) -> dict | None:
    try:
        return await claude_session.query_json(user, system=system)
    except Exception:
        return None


async def judge_deepseek(client: httpx.AsyncClient, system: str, user: str) -> dict | None:
    if not DEEPSEEK_KEY:
        return None
    try:
        r = await client.post(
            "https://api.deepseek.com/v1/chat/completions",
            headers={"Authorization": f"Bearer {DEEPSEEK_KEY}", "Content-Type": "application/json"},
            json={"model": "deepseek-chat", "temperature": 0, "max_tokens": 120,
                  "response_format": {"type": "json_object"},
                  "messages": [{"role": "system", "content": system},
                               {"role": "user", "content": user}]},
            timeout=90,
        )
        r.raise_for_status()
        return json.loads(r.json()["choices"][0]["message"]["content"])
    except Exception:
        return None


async def judge_gemini(client: httpx.AsyncClient, system: str, user: str) -> dict | None:
    if not GEMINI_KEY:
        return None
    try:
        r = await client.post(
            f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent?key={GEMINI_KEY}",
            headers={"Content-Type": "application/json"},
            json={"system_instruction": {"parts": [{"text": system}]},
                  "contents": [{"parts": [{"text": user}]}],
                  "generationConfig": {"temperature": 0, "maxOutputTokens": 4000,
                                       "responseMimeType": "application/json"}},
            timeout=90,
        )
        r.raise_for_status()
        return json.loads(r.json()["candidates"][0]["content"]["parts"][0]["text"])
    except Exception:
        return None
def _bool(d: dict | None, key: str) -> bool | None:
    # None means "no usable vote" (judge failed or the key is missing).
    if not isinstance(d, dict) or key not in d:
        return None
    v = d[key]
    if isinstance(v, bool):
        return v
    return str(v).strip().lower() in ("true", "1", "yes", "כן")


async def panel_vote(client, system, user, key) -> dict:
    """Run all three judges; return per-judge bools + the verdict."""
    c, ds, gm = await asyncio.gather(
        judge_claude(system, user),
        judge_deepseek(client, system, user),
        judge_gemini(client, system, user),
    )
    votes = {"claude": _bool(c, key), "deepseek": _bool(ds, key), "gemini": _bool(gm, key)}
    valid = [v for v in votes.values() if v is not None]
    unanimous_yes = len(valid) == 3 and all(valid)
    unanimous_no = len(valid) == 3 and not any(valid)
    votes["_verdict"] = ("unanimous_yes" if unanimous_yes else
                         "unanimous_no" if unanimous_no else
                         "split" if len(valid) >= 2 else "incomplete")
    return votes
async def main(args: argparse.Namespace) -> int:
    print(f"judges available: deepseek:{bool(DEEPSEEK_KEY)} gemini:{bool(GEMINI_KEY)} "
          f"claude:local\n", flush=True)

    pending = await db.list_halachot(review_status="pending_review", limit=5000)
    if args.limit:
        pending = pending[: args.limit]

    NLI = "nli_unsupported"
    DEFECT = {"quote_unverified", "truncated_quote", "thin_restatement", "near_duplicate"}

    def bucket(h):
        flags = set(h.get("quality_flags") or [])
        if not flags:
            return "clean"
        if flags & DEFECT:
            return "defect"  # genuine extraction problem → re-extraction
        if NLI in flags:
            return "nli"     # re-adjudicate entailment
        return "other"

    buckets = defaultdict(list)
    for h in pending:
        buckets[bucket(h)].append(h)
    print("queue:", {k: len(v) for k, v in buckets.items()}, "\n", flush=True)

    sem = asyncio.Semaphore(args.concurrency)
    results = {"clean": [], "nli": []}

    async with httpx.AsyncClient() as client:
        async def run(h, system_fn, user_fn, key, tag):
            async with sem:
                v = await panel_vote(client, system_fn, user_fn(h), key)
            v["_h"] = h
            results[tag].append(v)

        tasks = []
        for h in buckets["clean"]:
            tasks.append(run(h, KEEP_SYSTEM, _keep_user, "keep", "clean"))
        for h in buckets["nli"]:
            tasks.append(run(h, NLI_SYSTEM, _nli_user, "entailed", "nli"))
        # bounded fan-out
        for i in range(0, len(tasks), args.concurrency):
            await asyncio.gather(*tasks[i : i + args.concurrency])
            done = len(results["clean"]) + len(results["nli"])
            print(f"{done}/{len(tasks)} judged", flush=True)

    # ── report ──
    def summarize(rows, yes_label, no_label):
        c = Counter(r["_verdict"] for r in rows)
        return c

    print("\n" + "=" * 60)
    print("PANEL DRY-RUN (no DB writes)")
    print("=" * 60)

    clean = results["clean"]
    cc = summarize(clean, "keep", "drop")
    print(f"\nBUCKET 1: clean, below threshold ({len(clean)}):")
    print(f" ✓ auto-APPROVE (3/3 keep): {cc['unanimous_yes']}")
    print(f" ✗ auto-REJECT (3/3 drop): {cc['unanimous_no']}")
    print(f" → CHAIR (split): {cc['split']}")
    print(f" ? incomplete (judge errors): {cc['incomplete']}")

    nli = results["nli"]
    nc = summarize(nli, "entailed", "not")
    print(f"\nBUCKET 2: nli_unsupported ({len(nli)}):")
    print(f" ✓ clear-flag + APPROVE (3/3 entailed): {nc['unanimous_yes']}")
    print(f" ✗ confirm-flag (3/3 not-entailed): {nc['unanimous_no']}")
    print(f" → CHAIR (split): {nc['split']}")
    print(f" ? incomplete: {nc['incomplete']}")

    print(f"\nBUCKET 3: extraction defects ({len(buckets['defect'])}): → re-extraction")
    if buckets["other"]:
        print(f"BUCKET 4: other flags ({len(buckets['other'])}): → chair")

    auto = cc["unanimous_yes"] + cc["unanimous_no"] + nc["unanimous_yes"] + nc["unanimous_no"]
    chair = cc["split"] + nc["split"] + cc["incomplete"] + nc["incomplete"] + len(buckets["other"])
    reext = len(buckets["defect"])
    print("\n" + "-" * 60)
    print(f"NET: {len(pending)} pending → panel resolves {auto} automatically, "
          f"{chair} to chair, {reext} to re-extraction")
    print(f" chair queue collapses {len(pending)} → {chair}")

    # Per-item verdicts, minus the full halacha row stored under "_h".
    Path("/tmp/halacha_panel_dryrun.json").write_text(json.dumps(
        [{**{k: v for k, v in r.items() if not k.startswith("_h")},
          "id": str(r["_h"]["id"]), "case": r["_h"].get("case_number"),
          "rule": (r["_h"].get("rule_statement") or "")[:120]}
         for r in clean + nli], ensure_ascii=False, indent=1))
    print("\nper-item verdicts → /tmp/halacha_panel_dryrun.json")
    # ── apply the chair-approved policy (reversible; backup first) ──────────
    # CLEAN  → majority 2/3 (keep→approved, drop→rejected, tie→chair)
    # NLI    → asymmetric: unanimous-entailed → clear nli flag (+approve if clean),
    #          majority not-entailed → rejected, else → chair
    # DEFECT → untouched (needs re-extraction)
    if not args.apply:
        print("\n(dry-run: pass --apply to write the approved policy)")
        return 0

    def majority(v: dict) -> bool | None:
        vs = [v[k] for k in ("claude", "deepseek", "gemini") if v[k] is not None]
        if len(vs) < 2:
            return None
        y, n = sum(vs), len(vs) - sum(vs)
        return True if y > n else (False if n > y else None)

    # Snapshot the rows about to be touched so the run can be reverted.
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    audit = Path(__file__).resolve().parent.parent / "data" / "audit"
    audit.mkdir(parents=True, exist_ok=True)
    backup = audit / f"halacha-panel-apply-backup-{ts}.csv"
    with backup.open("w", encoding="utf-8", newline="") as f:
        w = csv.writer(f)
        w.writerow(["id", "review_status", "quality_flags"])
        for r in clean + nli:
            h = r["_h"]
            w.writerow([h["id"], h["review_status"], "|".join(h.get("quality_flags") or [])])

    pool = await db.get_pool()
    REV = "panel:opus+deepseek+gemini"
    approved = rejected = cleared = chair = 0

    for r in clean:
        d = majority(r)
        if d is True:
            await pool.execute("UPDATE halachot SET review_status='approved', "
                               "reviewed_at=now(), reviewer=$2, updated_at=now() WHERE id=$1",
                               r["_h"]["id"], REV + " 2/3-keep")
            approved += 1
        elif d is False:
            await pool.execute("UPDATE halachot SET review_status='rejected', "
                               "reviewed_at=now(), reviewer=$2, updated_at=now() WHERE id=$1",
                               r["_h"]["id"], REV + " 2/3-drop")
            rejected += 1
        else:
            chair += 1

    for r in nli:
        vs = [r[k] for k in ("claude", "deepseek", "gemini") if r[k] is not None]
        unanimous_yes = len(vs) == 3 and all(vs)
        maj_no = len(vs) >= 2 and sum(vs) < len(vs) - sum(vs)
        if unanimous_yes:
            rest = [x for x in (r["_h"].get("quality_flags") or []) if x != "nli_unsupported"]
            if rest:  # other flags remain → clear nli but keep in queue
                await pool.execute("UPDATE halachot SET quality_flags=$2, updated_at=now() "
                                   "WHERE id=$1", r["_h"]["id"], rest)
                cleared += 1
                chair += 1
            else:     # nli was the only blocker → clear + approve
                await pool.execute("UPDATE halachot SET quality_flags='{}', "
                                   "review_status='approved', reviewed_at=now(), reviewer=$2, "
                                   "updated_at=now() WHERE id=$1", r["_h"]["id"], REV + " 3/3-entailed")
                approved += 1
                cleared += 1
        elif maj_no:
            await pool.execute("UPDATE halachot SET review_status='rejected', "
                               "reviewed_at=now(), reviewer=$2, updated_at=now() WHERE id=$1",
                               r["_h"]["id"], REV + " maj-not-entailed")
            rejected += 1
        else:
            chair += 1

    print(f"\nAPPLIED (reversible): approved {approved} · rejected {rejected} · "
          f"nli-flag-cleared {cleared} · left to chair {chair + len(buckets['defect'])} "
          f"(incl. {len(buckets['defect'])} defects for re-extraction)")
    print(f"backup → {backup}")
    return 0


if __name__ == "__main__":
    ap = argparse.ArgumentParser(description=__doc__,
                                 formatter_class=argparse.RawDescriptionHelpFormatter)
    ap.add_argument("--limit", type=int, default=0)
    ap.add_argument("--concurrency", type=int, default=6)
    ap.add_argument("--apply", action="store_true",
                    help="write the chair-approved policy to the DB after a backup CSV "
                         "(default: dry-run)")
    raise SystemExit(asyncio.run(main(ap.parse_args())))
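
The verdict rule inside `panel_vote` can be exercised without any network calls. The sketch below restates it as a pure function over three optional booleans, with `None` standing for a judge that errored or returned unusable JSON. The function name `verdict` and the sample votes are illustrative and not part of this PR.

```python
from __future__ import annotations


def verdict(votes: dict[str, bool | None]) -> str:
    """Restatement of the verdict rule in panel_vote, for three judges."""
    valid = [v for v in votes.values() if v is not None]
    if len(valid) == 3 and all(valid):
        return "unanimous_yes"
    if len(valid) == 3 and not any(valid):
        return "unanimous_no"
    # Two or three usable votes that are not a 3/3 agreement count as a split.
    return "split" if len(valid) >= 2 else "incomplete"


# Hypothetical vote patterns, one per outcome.
SAMPLES = {
    "all keep": {"claude": True, "deepseek": True, "gemini": True},
    "all drop": {"claude": False, "deepseek": False, "gemini": False},
    "two to one": {"claude": True, "deepseek": True, "gemini": False},
    "judge missing": {"claude": True, "deepseek": True, "gemini": None},
    "two missing": {"claude": True, "deepseek": None, "gemini": None},
}

for name, votes in SAMPLES.items():
    print(f"{name:<14} -> {verdict(votes)}")
```

One consequence worth checking against the intended policy: two agreeing votes with the third judge missing come out as `split` here and go to the chair in the dry-run report, whereas the `majority` helper in the `--apply` path treats the same two votes as a decision.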

@@ -0,0 +1,117 @@
#!/usr/bin/env python3
"""Calibrate the approval-panel voting policy on the gold-set (Trust-or-Escalate).

The literature (Trust or Escalate, ICLR 2025; PoLL; selective prediction) says:
don't guess the aggregation policy; calibrate it to a target risk α on a
calibration set, and ESCALATE disagreement to the human. We have a calibration
set: the gold-set's ``is_holding`` is the COARSE "is this a real, keepable rule?"
label, the axis we already proved is reliable across models (92%).

This runs the panel's KEEP question (3 independent judges) on every gold-set item
that has an is_holding label, then reports, FOR EACH POLICY, the auto-decision
precision (vs is_holding) and coverage (how many it decides vs escalates):
  - unanimous : auto-decide only on 3/3 agreement, else escalate
  - majority  : auto-decide on 2/3, else escalate

Pick the policy whose auto-error stays under your tolerance while covering the
most items. Read-only. Local-only (claude_session needs the CLI).

    cd ~/legal-ai/mcp-server
    .venv/bin/python ../scripts/halacha_panel_calibrate.py
"""
from __future__ import annotations

import argparse
import asyncio

import httpx

from legal_mcp.services import db

# reuse the exact panel judges + KEEP question (single source of truth)
from halacha_panel_approve import (  # noqa: E402
    KEEP_SYSTEM, _bool, _keep_user, judge_claude, judge_deepseek, judge_gemini,
)


async def _votes(client, h) -> list[bool]:
    # Only usable votes are returned; a failed judge simply drops out.
    user = _keep_user(h)
    c, ds, gm = await asyncio.gather(
        judge_claude(KEEP_SYSTEM, user),
        judge_deepseek(client, KEEP_SYSTEM, user),
        judge_gemini(client, KEEP_SYSTEM, user),
    )
    return [v for v in (_bool(c, "keep"), _bool(ds, "keep"), _bool(gm, "keep")) if v is not None]


def _decide(votes: list[bool], policy: str) -> bool | None:
    """Auto-decision (True=keep / False=drop) or None=escalate."""
    if len(votes) < 2:
        return None
    yes, no = sum(votes), len(votes) - sum(votes)
    if policy == "unanimous":
        if len(votes) == 3 and yes == 3:
            return True
        if len(votes) == 3 and no == 3:
            return False
        return None
    # majority
    if yes > no:
        return True
    if no > yes:
        return False
    return None  # tie


async def main(args: argparse.Namespace) -> int:
    items = [it for it in await db.goldset_list(args.batch) if it.get("is_holding") is not None]
    if args.limit:
        items = items[: args.limit]
    print(f"calibrating panel KEEP vs is_holding on {len(items)} gold-set items\n", flush=True)

    sem = asyncio.Semaphore(args.concurrency)
    rows = []
    async with httpx.AsyncClient() as client:
        async def one(it):
            async with sem:
                v = await _votes(client, it)
            rows.append({"truth": bool(it["is_holding"]), "votes": v})

        tasks = [one(it) for it in items]
        for i in range(0, len(tasks), args.concurrency):
            await asyncio.gather(*tasks[i : i + args.concurrency])
            print(f"{len(rows)}/{len(items)}", flush=True)

    print("\n" + "=" * 64)
    print(f"{'policy':<11}{'auto':>6}{'escalate':>10}{'correct':>9}{'wrong':>7}{'precision':>11}{'coverage':>10}")
    print("-" * 64)
    for policy in ("unanimous", "majority"):
        auto = wrong = correct = 0
        for r in rows:
            d = _decide(r["votes"], policy)
            if d is None:
                continue
            auto += 1
            if d == r["truth"]:
                correct += 1
            else:
                wrong += 1
        esc = len(rows) - auto
        prec = correct / auto if auto else 0.0
        cov = auto / len(rows) if rows else 0.0
        # Column widths match the header row above (precision is 11 wide).
        print(f"{policy:<11}{auto:>6}{esc:>10}{correct:>9}{wrong:>7}{prec:>11.1%}{cov:>10.1%}")

    # where do the WRONG auto-decisions fall? (false-keep is the costly one)
    print("\n=== costly errors: panel auto-KEEPS but human says NOT-holding (per policy) ===")
    for policy in ("unanimous", "majority"):
        fk = sum(1 for r in rows if _decide(r["votes"], policy) is True and not r["truth"])
        fd = sum(1 for r in rows if _decide(r["votes"], policy) is False and r["truth"])
        print(f" {policy:<11} false-KEEP (bad rule approved): {fk} false-DROP (good rule rejected): {fd}")
    return 0


if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("--batch", default="default")
    ap.add_argument("--limit", type=int, default=0)
    ap.add_argument("--concurrency", type=int, default=6)
    raise SystemExit(asyncio.run(main(ap.parse_args())))
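
To see how the two policies trade precision against coverage, here is a minimal offline sketch. It copies the decision rule of `_decide` and scores it on a handful of made-up `(truth, votes)` pairs. The names `decide`, `score` and `ROWS` are hypothetical, the data is invented for illustration, and none of the numbers are results from the gold set.

```python
from __future__ import annotations


def decide(votes: list[bool], policy: str) -> bool | None:
    """True=keep, False=drop, None=escalate (same rule as _decide)."""
    if len(votes) < 2:
        return None
    yes = sum(votes)
    no = len(votes) - yes
    if policy == "unanimous":
        if len(votes) == 3 and yes == 3:
            return True
        if len(votes) == 3 and no == 3:
            return False
        return None
    if yes > no:
        return True
    if no > yes:
        return False
    return None  # tie


def score(rows: list[tuple[bool, list[bool]]], policy: str) -> tuple[float, float]:
    """Return (precision of auto-decisions, share of items auto-decided)."""
    auto = correct = 0
    for truth, votes in rows:
        d = decide(votes, policy)
        if d is None:
            continue  # escalated to the human, not counted as an auto-decision
        auto += 1
        correct += int(d == truth)
    precision = correct / auto if auto else 0.0
    coverage = auto / len(rows) if rows else 0.0
    return precision, coverage


# Invented calibration rows: (human is_holding label, usable judge votes).
ROWS = [
    (True, [True, True, True]),      # unanimous keep, correct
    (True, [True, True, False]),     # majority keep, correct
    (False, [True, True, False]),    # majority keep, wrong (a false-keep)
    (False, [False, False, False]),  # unanimous drop, correct
    (True, [True, False]),           # one judge missing and a tie: escalate
]

for policy in ("unanimous", "majority"):
    p, c = score(ROWS, policy)
    print(f"{policy:<10} precision={p:.2f} coverage={c:.2f}")
```

On this toy data the unanimous policy decides fewer items but makes no mistakes, while majority covers more and admits one false-keep. That is the trade-off the calibration table is meant to quantify on the real gold set before a risk threshold α is chosen.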