feat(corpus): corpus redesign — eliminate halacha queue, verified-by-citation layer, rank-at-retrieval (#153)

Implements chaim's 2026-06-20 directive (5 steps; step 6 deferred):

1. No review queue: HALACHA_NO_REVIEW_QUEUE=true (auto-approve all → background); migration cleared 2,416 pending_review → approved.
2. Verified layer: halachot.verified/cite_count from chair citations (db.refresh_verified_layer + scripts/build_verified_layer.py runs the citator on ALL committee decisions). 2,775 verified / 137 precedents.
3. Retrieval ranks verified ≫ background: HALACHA_VERIFIED_BOOST in both semantic + lexical halacha queries; filter now includes background (<> rejected).
4. Ingest contract: going-forward already queues metadata; backfill_practice_area.py + 206 re-queued to the metadata drain.
5. Disabled destructive panel cap/novelty: HALACHA_PANEL_REGIME_ENABLED=false (8508/1049/1200 proved it lost 22-30 genuine principles incl. Lustrenik).

Source of truth: docs/precedent-corpus-redesign/00-final-synthesis.md.
Quality flags are 97% false-positive (nli-audit) → no longer gate.
UI queue removal → Claude Design gate.
429 tests green (no regressions).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
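
The "verified ≫ background" ranking in step 3 can be pictured with a minimal sketch. The commit names `HALACHA_VERIFIED_BOOST` but states neither its value nor whether it is additive or multiplicative, so the constant `VERIFIED_BOOST`, the `Hit` record, and `rank_hits` below are all hypothetical and chosen only for illustration:

```python
from dataclasses import dataclass

# Hypothetical magnitude: the commit does not state the real boost value.
VERIFIED_BOOST = 0.15


@dataclass(frozen=True)
class Hit:
    halacha_id: int
    similarity: float   # base semantic or lexical score, assumed to lie in [0, 1]
    verified: bool      # True when the source precedent was cited by a chair
    review_status: str  # e.g. "approved" or "rejected"


def rank_hits(hits: list[Hit], boost: float = VERIFIED_BOOST) -> list[Hit]:
    """Drop rejected rows, then sort by boosted score, highest first."""
    # Mirrors the "<> rejected" filter: background rows stay eligible.
    eligible = [h for h in hits if h.review_status != "rejected"]
    return sorted(
        eligible,
        key=lambda h: h.similarity + (boost if h.verified else 0.0),
        reverse=True,
    )


hits = [
    Hit(1, 0.80, False, "approved"),  # background, strong match
    Hit(2, 0.72, True, "approved"),   # verified, weaker match
    Hit(3, 0.95, False, "rejected"),  # filtered out regardless of score
]
ranked_ids = [h.halacha_id for h in rank_hits(hits)]
```

Under these assumed numbers the verified hit outranks the stronger background hit, and the rejected row never appears. An additive boost is only one option; the real queries may scale the score instead.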
2026-06-20 13:55:00 +00:00
parent afe6894441
commit b9fa74b875
6 changed files with 255 additions and 11 deletions
--- a/scripts/SCRIPTS.md
+++ b/scripts/SCRIPTS.md
@@ -65,6 +65,9 @@
 | `halacha_panel_calibrate.py` | python | **Panel calibration + measurement** (Trust-or-Escalate, ICLR 2025). `--source live` (default): runs the KEEP question on the gold sample and measures precision + coverage + **split-rate** against `is_holding` for each policy, plus false-keep/false-drop (imports the judges from `halacha_panel_approve`, **must run locally**). **#133/FU-5**, `--source captured`: **zero cost** (no re-vote/LLM); cross-references saved rounds (FU-1) against chair rulings (FU-2) via `db.panel_rounds_vs_chair` and reports split-rate + auto-precision **per round** (the loop trend: as the rubric improves, precision holds and split drops); shares FU-4's `analyze_pairs` (single source). Both measurements report **anon-stability** (anonymization test #81.7) as a health metric against echo-chamber. `--batch`/`--limit`/`--concurrency`. | Manual: before wiring `--apply` (live) / periodic: loop tracking (captured) |
 | `halacha_rubric_distill.py` | python | **#133/FU-4: rubric distillation, PROPOSE-ONLY.** Cross-references `halacha_panel_rounds` (FU-1, votes + reasoning) against the chair's rulings (FU-2, seeds in `halacha_goldset` batch `chair-live`) via `db.panel_rounds_vs_chair` (read-only), deterministically analyzes **systematic failures** (false-keep/false-drop, resolved splits, per-judge disagreement rate with the chair), and proposes `KEEP_SYSTEM` v2 + abstracted exemplars (local claude_session, zero cost) as a **diff report** written to `data/learning/rubric-proposal-<ts>.md`. **Never auto-applies**: adopting v2 = a human edit of the constant via PR (INV-LRN1); abstracted exemplars only (INV-LRN5); the only signal = the chair's ruling, not panel votes (anti-echo). Fewer than 12 pairs → "not enough data". `--no-llm` (statistics only) / `--limit N`. **Must run locally**. | Periodic: once chair rulings on panel disagreements have accumulated |
 | `backfill_canonical_halachot.py` | python | **V41: setting up the canonical halachot model (one-time + idempotent).** (1) Builds connected components from `equivalent_halachot` (transitive closure via union-find). (2) For each cluster: picks a canonical representative (most corroboration → confidence → earliest), creates a `canonical_halachot` row, and updates `canonical_id` + `instance_type` for all cluster members. (3) For singletons (no equivalence links): 1:1 canonical. (4) Populates `halacha_citation_corroboration.canonical_id` from `halachot.canonical_id`. `--dry-run` (default, computes and reports only) / `--apply` (writes) / `--verbose`. After running: `canonical_statement` = the representative's wording (pending_synthesis); follow-up: `backfill_canonical_synthesis.py` (Phase 4) will synthesize a broader wording via LLM. Run: `mcp-server/.venv/bin/python scripts/backfill_canonical_halachot.py --apply`. | **One-time** (after the V41 deploy) / idempotent as needed |
+| `build_verified_layer.py` | python | **#153: building the verified layer from citations.** "Verified = citation, not review" (chaim 2026-06-20): runs the citator (`extract_internal_citations`) on **all** committee decisions (not only דפנה's), then `db.refresh_verified_layer`, which computes `halachot.verified`/`cite_count` from `precedent_internal_citations` (verified = the source ruling was cited by a chair; cite_count = number of citing decisions). Idempotent; grows automatically with new decisions. regex/embeddings only, zero LLM. `--no-citator` (refresh only). Run: `mcp-server/.venv/bin/python scripts/build_verified_layer.py`. | Recurring (after ingesting chair decisions) |
+| `backfill_practice_area.py` | python | **#153 step 4: ingest contract.** 87% of external case law lacks practice_area → area-filtered retrieval misses it. Runs the existing classification (deterministic `derive_domain_practice_area` → LLM `precedent_metadata_extractor.extract_and_apply`) on the unclassified rows. Throttled (extract_metadata=claude/sonnet). Dry-run by default (the script defines no `--dry-run` flag) / `--apply`/`--limit`/`--no-throttle`. **Going-forward is already wired** (ingest.py:233 schedules metadata); this backfills the existing rows (alternative: set `metadata_extraction_requested_at` and let the drain handle them). | One-time (backfill) |
+| `compute_principle_gold.py` | python | **#153 (abandoned)**: a principle-level gold-identification approach via `match_context`→halacha matching. **Superseded** by "verified = citation" (`build_verified_layer.py`) after the matching failed (match_context = a list of references). Kept for reference. | deprecated |
 | `cull_principles.py` | python | **#152 Phase C: retroactive culling of the principle corpus via a 3-judge panel (reversible).** Runs on every existing 'original' principle the same regime the extractor uses going forward (`services/panel_extraction.panel_keep_score`, G2): 3 judges (local Claude + DeepSeek + Gemini) vote keep+score → the approval rule (3 votes → survives · 2 and score ≥0.85 → survives · 2 and <0.85 → chair · ≤1 → rejected) → a cap of `HALACHA_PANEL_MAX_NEW`=5 per decision by score (`apply_cap`). Rejected → `halachot.review_status='rejected'` + its canonical set to `rejected` (reversible, CSV backup in `data/audit/` before every write). Throttled by `usage_limits` (soft stop at the usage ceiling, resumable). `--dry-run` (default) / `--apply` / `--sample N` (random decisions) / `--limit N` / `--no-throttle` / `--verbose`. **Must run locally** (3 judges). Run: `cd mcp-server && HOME=/home/chaim .venv/bin/python ../scripts/cull_principles.py --apply`. | **One-time** (initial culling) + repeatable |
 | `backfill_canonical_synthesis.py` | python | **V41 Phase 4: LLM synthesis of `canonical_statement` (idempotent + resumable).** Iterates over canonicals with `review_status='pending_synthesis'` (multi-instance first) and distills for each a single general wording grounded in the instances' quotes (INV-AH) via `services/canonical_synthesis.py` (single path, G2). Gates: grounding/abstention, **drift-floor** (cosine against the source, default 0.80; large drift → the source is kept), and a ban on new case citations. In every case the status advances to `pending_review` for the chair gate (G10/INV-LRN6). Opus model (`HALACHA_CANONICAL_SYNTH_MODEL`). Throttled by `usage_limits` (soft stop at the usage ceiling, resumable). `--dry-run` (default) / `--apply` / `--sample N` (random sample for checking) / `--limit N` / `--no-throttle` / `--verbose`. CSV audit to `data/audit/canonical-synthesis-*.csv`. **Must run locally** (claude_session). Run: `cd mcp-server && HOME=/home/chaim .venv/bin/python ../scripts/backfill_canonical_synthesis.py --apply`. Ongoing: MCP tool `canonical_synthesize_pending`. | **One-time** (the initial bulk) + idempotent for new ones |
 | `halacha_batch_reconcile.py` | python | **#82.7**: offline cross-ruling dedup (conservative, **dry-run only**). dedup-on-insert compares only within a ruling; here the threshold is stricter (cosine ≥0.95, `--cosine`) and non-destructive: finds near-duplicate halacha pairs across different rulings (pgvector `<=>` exact) with a lexical signal (Jaccard/Levenshtein) and reports to a CSV in `data/audit/` for the chair's review. Does not skip/merge/delete. `--include-pending`. **`--link`** records the pairs found as `equivalent_halachot` (parallel authority, #84.2; **deprecated post-V41**, use `backfill_canonical_halachot.py --apply` instead). Runs with the mcp-server venv. | **deprecated**: superseded by `backfill_canonical_halachot.py` (V41). Kept for audit purposes |
--- a/scripts/backfill_practice_area.py
+++ b/scripts/backfill_practice_area.py
@@ -0,0 +1,98 @@
+#!/usr/bin/env python3
+"""Backfill practice_area for external precedents (#153, step 4 — ingest contract).
+
+87% of external court rulings (209/239) lack practice_area, so area-scoped retrieval
+misses them. The classifier infrastructure already exists
+(precedent_metadata_extractor.extract_and_apply → practice_area + metadata); it just
+never ran on these rows. This runs it on the unclassified, throttled by usage_limits.
+
+Deterministic shortcut first (derive_domain_practice_area from our case-number scheme,
+free); only rows it can't resolve go to the LLM classifier.
+
+  cd ~/legal-ai/mcp-server
+  HOME=/home/chaim .venv/bin/python ../scripts/backfill_practice_area.py            # dry-run (default)
+  HOME=/home/chaim .venv/bin/python ../scripts/backfill_practice_area.py --apply
+"""
+from __future__ import annotations
+
+import argparse
+import asyncio
+import os
+import sys
+
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "mcp-server", "src"))
+
+from legal_mcp.services import db, precedent_metadata_extractor  # noqa: E402
+from legal_mcp.services.practice_area import derive_domain_practice_area  # noqa: E402
+
+try:
+    from legal_mcp.services import usage_limits
+except Exception:  # pragma: no cover
+    usage_limits = None
+
+
+def _over_ceiling() -> tuple[bool, str]:
+    if usage_limits is None:
+        return False, ""
+    u = usage_limits.subscription_usage()
+    if u is None:
+        return False, ""
+    over, _r, detail = usage_limits.ceiling_status(u)
+    return over, detail
+
+
+async def _run(apply: bool, limit: int | None, throttle: bool) -> int:
+    pool = await db.get_pool()
+    rows = await pool.fetch(
+        "SELECT id, case_number FROM case_law "
+        "WHERE source_kind='external_upload' AND COALESCE(practice_area,'')='' "
+        "  AND COALESCE(full_text,'')<>'' ORDER BY created_at")
+    if limit:
+        rows = rows[:limit]
+    print(f"[{'APPLY' if apply else 'DRY-RUN'}] {len(rows)} unclassified external precedents\n", flush=True)
+    det = llm = stopped = 0
+    by_area: dict[str, int] = {}
+    for n, r in enumerate(rows, 1):
+        # 1) deterministic from our case-number scheme (free)
+        area = derive_domain_practice_area(r["case_number"] or "")
+        if area:
+            det += 1
+            by_area[area] = by_area.get(area, 0) + 1
+            if apply:
+                await pool.execute("UPDATE case_law SET practice_area=$2 WHERE id=$1", r["id"], area)
+            continue
+        # 2) LLM classifier (throttled)
+        if apply and throttle:  # a dry-run makes no LLM calls, so it never needs to stop
+            over, detail = _over_ceiling()
+            if over:
+                print(f"\n⏸ usage ceiling ({detail}) — stopping at {n-1}. Re-run to resume.", flush=True)
+                stopped = 1
+                break
+        if apply:
+            res = await precedent_metadata_extractor.extract_and_apply(r["id"])
+            pa = (res or {}).get("practice_area") or ""
+            if pa:
+                llm += 1
+                by_area[pa] = by_area.get(pa, 0) + 1
+        else:
+            llm += 1
+        if n % 20 == 0:
+            print(f"  …{n}/{len(rows)}", flush=True)
+    print(f"\n── summary ── deterministic: {det} · LLM: {llm} · by_area: {by_area}"
+          f"{' (stopped early)' if stopped else ''}")
+    if not apply:
+        print("dry-run — nothing written. Re-run with --apply.")
+    return 0
+
+
+def main() -> int:
+    p = argparse.ArgumentParser(description="Backfill practice_area for external precedents (#153)")
+    p.add_argument("--apply", action="store_true")
+    p.add_argument("--limit", type=int, default=None)
+    p.add_argument("--no-throttle", action="store_true")
+    a = p.parse_args()
+    return asyncio.run(_run(a.apply, a.limit, not a.no_throttle))
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
--- a/scripts/build_verified_layer.py
+++ b/scripts/build_verified_layer.py
@@ -0,0 +1,53 @@
+#!/usr/bin/env python3
+"""Build the verified principle layer from chair citations (#153, corpus redesign).
+
+"Trusted = citation, not review" (chaim 2026-06-20). A principle is `verified` iff
+its SOURCE precedent was actually cited by a chair (any committee decision); never
+from human review. This:
+  1. Runs the citator (`extract_internal_citations`) over ALL committee decisions —
+     not just דפנה's — so other chairs' citations populate the graph too (tier-2).
+  2. Recomputes halachot.verified / cite_count from precedent_internal_citations.
+
+Idempotent. Run after ingesting new chair decisions (or wire into the ingest path)
+so the verified layer grows automatically. EMBEDDING/REGEX-only for the citator,
+no LLM.
+
+  cd ~/legal-ai/mcp-server
+  HOME=/home/chaim .venv/bin/python ../scripts/build_verified_layer.py            # full
+  HOME=/home/chaim .venv/bin/python ../scripts/build_verified_layer.py --no-citator  # refresh only
+"""
+from __future__ import annotations
+
+import argparse
+import asyncio
+import os
+import sys
+
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "mcp-server", "src"))
+
+from legal_mcp.services import citation_extractor, db  # noqa: E402
+
+
+async def _run(run_citator: bool) -> int:
+    if run_citator:
+        print("→ extracting citations from ALL committee decisions (citator)…", flush=True)
+        res = await citation_extractor.extract_all_internal_committee()
+        print(f"  citator: {res}", flush=True)
+    print("→ refreshing verified/cite_count from chair citations…", flush=True)
+    stats = await db.refresh_verified_layer()
+    print("\n── verified layer ──")
+    print(f"   verified principles: {stats['verified_principles']}")
+    print(f"   verified precedents: {stats['verified_precedents']}")
+    return 0
+
+
+def main() -> int:
+    p = argparse.ArgumentParser(description="Build verified principle layer (#153)")
+    p.add_argument("--no-citator", action="store_true",
+                   help="skip citation extraction; only recompute verified/cite_count")
+    a = p.parse_args()
+    return asyncio.run(_run(run_citator=not a.no_citator))
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
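
The rule stated in the `build_verified_layer.py` docstring (a principle is verified iff its source precedent was cited by a chair, with `cite_count` as the number of citing decisions) can be illustrated in plain Python. This is a sketch under stated assumptions, not the body of `db.refresh_verified_layer`, which this diff does not show: the function name `compute_verified_layer`, the `(citing_decision_id, cited_precedent_id)` pair shape, and the sample IDs are all hypothetical.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable


def compute_verified_layer(
    citations: Iterable[tuple[str, str]],
    halacha_sources: dict[int, str],
) -> dict[int, tuple[bool, int]]:
    """Map halacha_id -> (verified, cite_count).

    citations: (citing_decision_id, cited_precedent_id) pairs (assumed shape).
    halacha_sources: halacha_id -> id of the precedent the principle came from.
    """
    citing_decisions: dict[str, set[str]] = defaultdict(set)
    for decision_id, precedent_id in citations:
        # A set de-duplicates repeated citations, so re-running gives the same result.
        citing_decisions[precedent_id].add(decision_id)

    layer: dict[int, tuple[bool, int]] = {}
    for halacha_id, source in halacha_sources.items():
        cite_count = len(citing_decisions.get(source, ()))
        layer[halacha_id] = (cite_count > 0, cite_count)
    return layer


citations = [
    ("dec-1", "prec-A"),
    ("dec-2", "prec-A"),
    ("dec-1", "prec-A"),  # duplicate citation, counted once
    ("dec-3", "prec-B"),
]
halacha_sources = {101: "prec-A", 102: "prec-A", 103: "prec-C"}
layer = compute_verified_layer(citations, halacha_sources)
```

Because the result is recomputed from the full citation set each time, the computation is idempotent and grows as new citing decisions arrive, which matches the behaviour the docstring describes.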