diff --git a/docs/legal-principles-redesign.md b/docs/legal-principles-redesign.md
new file mode 100644
index 0000000..5247fc5
--- /dev/null
+++ b/docs/legal-principles-redesign.md
@@ -0,0 +1,81 @@
+# Redesign: Legal Principles (formerly "halachot")
+
+> **Decision source:** chaim, 2026-06-19. Arose while planning the `canonical_statement` synthesis, when it emerged
+> that the corpus had swollen to 5,243 "halachot" (18.8 per decision, 1,820 of them from the committee's own decisions): a flawed conceptual model.
+> This document is the source of truth for the initiative until it is folded into `docs/spec/`.
+
+## 1. The problem
+
+The existing extraction system tagged every legal proposition as a "halacha" and extracted ~18.8 per decision, with no cap,
+no source distinction, and single-model auto-approval (confidence ≥0.80). Result: 5,243 records,
+inflated and mislabeled. **An appeals committee applies law; it does not create halacha.** Calling the 1,820
+propositions taken from the committee's own decisions "halachot" is legally wrong.
+
+## 2. The new conceptual model
+
+Umbrella term: **legal principles**. Two subtypes, by source:
+
+| Source (`case_law.source_kind`) | Term | Binding? |
+|---|---|---|
+| District/Supreme Court judgment (external, binding) | **הלכה** (halacha) | Binding precedent |
+| Appeals-committee decision (`internal_committee`) | **כלל פרשני** (interpretive rule) | Non-binding; an interpretation/application rule the committee formulated |
+
+## 3. The new extraction algorithm (applies to both sources)
+
+```text
+1. Three different models (Claude local + DeepSeek + Gemini) analyze the decision in depth;
+   each model proposes candidates, each candidate with a 0-1 score.
+2. Semantic matching across the three models → a unified candidate set; per candidate:
+     votes = how many models identified/adopted it (1-3)
+     score = mean of the voters' scores only
+3. Dedup against the corpus (V41 lookup-before-insert, cosine ≥ HALACHA_CANONICAL_THRESHOLD):
+     • known → link to the existing canonical (instance/citation). Not counted in the quota → frees a slot.
+     • new   → candidate for a new principle.
+4. Approval rule for new candidates:
+     votes = 3                    → APPROVED immediately (even if score < 0.85)
+     votes ≥ 2 AND score ≥ 0.85   → APPROVED
+     votes = 2 AND score < 0.85   → pending_review (chair gate, G10)   [default]
+     votes = 1                    → DROP (not a real principle)        [default]
+5. Cap: up to 5 new principles per decision. If >5 pass, pick 5 by descending score.   [default]
+   Linked known principles (step 3) do not count toward the cap.
+```
+
+**Engineering defaults (tunable in config):** borderline case (2 votes, score<0.85) → chair rather than
+discard; choosing 5 when >5 pass → by score; single vote → drop.
+
+## 4. Retroactive filtering
+
+The same 3-model panel + cap of 5 + 0.85 rule will run over **the existing 5,243**, grouped by source decision:
+for each decision, run the algorithm, keep the survivors (≤5), and mark the rest `rejected` (reversible,
+with an SQL/CSV backup to `data/audit/`). Modeled on the 2026-06-03 cleanup run (`docs/halacha-strict-rubric.md`)
+and on the existing `halacha_panel_approve.py`.
+
+## 5. Existing infrastructure to build on
+
+- **Tri-model panel:** `scripts/halacha_panel_approve.py` (Claude local + DeepSeek + Gemini,
+  KEEP_SYSTEM), the same 3 models as the gold set (AC1=0.92). Source of the votes.
+- **V41 dedup/linking:** `db.nearest_canonical_halacha` (cosine), lookup-before-insert during extraction.
+- **Validators:** `services/halacha_quality.py` (non_decision/application/thin/quote/NLI).
+- **Rubric:** `docs/halacha-strict-rubric.md` (6 grounds for cutting).
+- **Source gate:** `db.EXTRACTION_ELIGIBLE_PREDICATE` (db.py:7171), the injection point for the cap and the tagging.
+- **Synthesis:** `services/canonical_synthesis.py` + `backfill_canonical_synthesis.py` (already built;
+  they will apply to the survivors under the new name, as the last phase).
+
+## 6. Execution phases (proposed)
+
+| # | Phase | Content | Depends on |
+|---|---|---|---|
+| **0** | Freeze | Freeze the full synthesis run (done) | none |
+| **A** | Shared voting model | `panel_extraction` service: 3 models, semantic matching, votes+mean-score, approval rule. Single source for B and C (G2) | none |
+| **B** | Going-forward threshold | Wire A into `halacha_extractor`: cap of 5, dedup-frees-slot, הלכה / כלל פרשני tagging by source. Replaces single-model auto-approve | A |
+| **C** | Retroactive filtering | Batch script runs A over the 5,243, by decision; survivors ≤5; the rest rejected (reversible) | A |
+| **D** | Naming | "הלכה" → הלכה / כלל פרשני / principles; UI + tool descriptions + docs. A full DB rename is optional and separate | none |
+| **E** | Synthesis | `canonical_synthesis` over the survivors, under the new name | C, D |
+
+**Recommended build order:** A → (B ‖ D) → C → E. A is the shared core; D (naming) is independent and safe to bring forward.
+
+## 7. Invariants
+
+Upholds: INV-G10/INV-LRN1 (chair gate on borderline cases), INV-AH (source grounding in extraction), INV-G2
+(the voting model is the single source for B+C), INV-G9 (audit trail for the votes and for the filtering), INV-G6 (embedding refresh).
+The chair's voting model plugs into the existing active-learning loop (`halacha_panel_rounds`, [[project_active_learning_panel]]).
diff --git a/docs/spec/07-learning.md b/docs/spec/07-learning.md
index 1ef5055..0eae7eb 100644
--- a/docs/spec/07-learning.md
+++ b/docs/spec/07-learning.md
@@ -207,6 +207,27 @@ Dimensions for Data Quality* (2013) · ISO 8000 (Data quality) | status: ver
 (`lessons.py:355, 309`). Source traceability links to [X5-audit-provenance.md](X5-audit-provenance.md).
 **Known violation:** none
 
+### INV-LRN6: Canonical-principle synthesis is grounded and gated (V41 Phase 4 → G10/INV-AH/G9)
+**Rule:** synthesis of the `canonical_statement` of a canonical halacha principle (merging/distilling the instances'
+phrasings into one general phrasing) must satisfy three conditions: **(a) Grounding**: the phrasing derives from the
+instances' `supporting_quote` only, with no added law, caveat, or case citation absent from the source; lack of grounding → **abstention**
+(`grounded=false`, the existing phrasing is kept) rather than invention ([INV-AH](../anti-hallucination-gate.md), AH-1/2/3).
+**(b) Drift gate**: the synthesized phrasing is re-embedded and compared (cosine) to the source phrasing; below the floor
+(`HALACHA_CANONICAL_SYNTH_DRIFT_FLOOR`=0.80) the synthesis is **rejected** (the source is kept), so that a
+hallucinated or topic-drifted rewrite cannot silently overwrite a sound principle. **(c) Chair gate**: synthesis never approves:
+it only advances `review_status` from `pending_synthesis` to `pending_review`; the final decision
+belongs to the chair, in the panel ([INV-LRN1](#inv-lrn1-עדכון-ידע-דורש-אישור-יור-ידני--אין-auto-commit-governance-g10)/G10).
+Every synthesis attempt (accepted / source-kept / abstained) is **recorded** (CSV in `data/audit/` + log), and on commit
+the embedding is updated together with the phrasing so that lookup-before-insert (cosine) does not drift ([INV-G6](00-constitution.md#inv-g6-re-index-בכל-שינוי-תוכן)).
+**Single path (G2):** all callers (backfill, the MCP tool `canonical_synthesize_pending`, the nightly drainer)
+go through `services/canonical_synthesis.py::synthesize_canonical`; there is no parallel synthesis path.
+**Sources:** Stanford RegLab/Magesh et al. (JELS 2025, grounding vs. hallucination) · Dhuliawala et al.
+*Chain-of-Verification* (arXiv:2309.11495, 2023) · RAGAS faithfulness (atomic-claim grounding) | status: verified
+**Enforcement:** `services/canonical_synthesis.py` (grounding in the prompt, `_new_citations`, drift gate);
+`db.apply_canonical_synthesis` (atomic status→pending_review + embedding refresh); the canonical panel
+(`/precedents`, PR#300) for chair approval; CSV audit in `data/audit/canonical-synthesis-*.csv`.
+**Known violation:** none (new)
+
 ---
 
 ## 4. Scheduled jobs (infrastructure support for the loop)
diff --git a/mcp-server/src/legal_mcp/config.py b/mcp-server/src/legal_mcp/config.py
index d364a9f..1dd8675 100644
--- a/mcp-server/src/legal_mcp/config.py
+++ b/mcp-server/src/legal_mcp/config.py
@@ -162,6 +162,24 @@ HALACHA_AUTO_APPROVE_THRESHOLD = float(
     os.environ.get("HALACHA_AUTO_APPROVE_THRESHOLD", "0.80")
 )
+# ── Tri-model panel extraction regime (legal-principles-redesign, #152) ──────
+# chaim 2026-06-19: replace single-model auto-approve with a 3-model panel that
+# deep-analyzes each decision. 3 models (Claude local + DeepSeek + Gemini) each
+# PROPOSE candidate principles with a 0-1 score; candidates are matched across
+# models (cosine ≥ MATCH_COSINE) → votes (# distinct models) + score (mean of the
+# voters' scores). Approval rule (chaim): 3 votes → approve (even score= this value against an already-stored halacha of the SAME precedent
@@ -210,6 +228,20 @@ HALACHA_CONSOLIDATE_EFFORT = os.environ.get("HALACHA_CONSOLIDATE_EFFORT", "high"
 HALACHA_CANONICAL_LOOKUP_ENABLED = os.environ.get("HALACHA_CANONICAL_LOOKUP_ENABLED", "true").lower() == "true"
 HALACHA_CANONICAL_THRESHOLD = float(os.environ.get("HALACHA_CANONICAL_THRESHOLD", "0.85"))
 
+# V41 canonical synthesis (Phase 4): a claude_session pass that rewrites each
+# canonical's statement (carried over verbatim from the representative halacha at
+# backfill) into ONE clean, case-independent legal principle, grounded in the
+# instances' supporting quotes (INV-AH), then flips review_status
+# pending_synthesis → pending_review for the chair gate (G10). Opus by default:
+# substance-bearing rewrite, chair-facing. Runs through the local CLI (zero $-cost,
+# but consumes subscription usage windows → throttled via usage_limits).
+# Drift guard: the synthesized statement is re-embedded and compared (cosine) to
+# the source; below the floor the synthesis is REJECTED (kept as-is, flagged) so a
+# hallucinated/topic-drifted rewrite never silently overwrites a sound principle.
+HALACHA_CANONICAL_SYNTH_MODEL = os.environ.get("HALACHA_CANONICAL_SYNTH_MODEL", HALACHA_EXTRACT_MODEL)
+HALACHA_CANONICAL_SYNTH_EFFORT = os.environ.get("HALACHA_CANONICAL_SYNTH_EFFORT", "high")
+HALACHA_CANONICAL_SYNTH_DRIFT_FLOOR = float(os.environ.get("HALACHA_CANONICAL_SYNTH_DRIFT_FLOOR", "0.80"))
+
 # Google Cloud Vision (OCR for scanned PDFs)
 GOOGLE_CLOUD_VISION_API_KEY = os.environ.get("GOOGLE_CLOUD_VISION_API_KEY", "")
 
diff --git a/mcp-server/src/legal_mcp/server.py b/mcp-server/src/legal_mcp/server.py
index ddf04d6..f34afa5 100644
--- a/mcp-server/src/legal_mcp/server.py
+++ b/mcp-server/src/legal_mcp/server.py
@@ -465,6 +465,13 @@ async def canonical_halacha_get(canonical_id: str) -> str:
     return await plib.canonical_halacha_get(canonical_id)
 
 
+@mcp.tool()
+async def canonical_synthesize_pending(limit: int = 20) -> str:
+    """Synthesize a canonical statement for principles awaiting synthesis (pending_synthesis) → pending_review (chair gate). V41 Phase 4.
+    Grounded in the instances' quotes (INV-AH), with a drift gate. On-demand/burst; the initial bulk runs in the backfill."""
+    return await plib.canonical_synthesize_pending(limit)
+
+
 # Documents
 @mcp.tool()
 async def document_upload(
diff --git a/mcp-server/src/legal_mcp/services/canonical_synthesis.py b/mcp-server/src/legal_mcp/services/canonical_synthesis.py
new file mode 100644
index 0000000..3ab6e31
--- /dev/null
+++ b/mcp-server/src/legal_mcp/services/canonical_synthesis.py
@@ -0,0 +1,220 @@
+"""Canonical-halacha synthesis (V41 Phase 4).
+
+The backfill carried each canonical's ``canonical_statement`` over verbatim from
+its representative halacha. This pass asks a local ``claude_session`` model to
+rewrite that statement into ONE clean, case-independent legal principle (for the
+~6 multi-instance canonicals a genuine merge of the N phrasings, for the singleton
+majority a faithful generalising polish), then advances ``review_status``
+pending_synthesis → pending_review for the chair gate (G10 / INV-LRN1).
+
+Invariants this module upholds:
+  • INV-AH: the synthesis is GROUNDED in the instances' ``supporting_quote``s.
+    The model abstains (``grounded=false``) rather than invent law, no
+    new case citations may appear, and a re-embedding **drift guard**
+    rejects any rewrite that drifts too far from the source statement.
+  • G10/INV-LRN1: never auto-approves; lands at ``pending_review`` for the chair.
+  • G9: every outcome (accepted / kept-original / abstained) is logged + returned.
+  • G2: single synthesis path; the backfill script, the on-demand MCP tool and
+    the nightly drain all call :func:`synthesize_canonical` here.
+
+LLM calls go through ``claude_session`` (local ``claude -p`` CLI) only: never the
+Anthropic SDK, never from the FastAPI container (see claude_session docstring).
+"""
+
+from __future__ import annotations
+
+import logging
+import math
+import re
+from uuid import UUID
+
+from legal_mcp import config
+from legal_mcp.services import claude_session, db, embeddings
+
+logger = logging.getLogger(__name__)
+
+# Case-citation shapes (docket numbers) that must NOT be invented by the rewrite:
+# "1234/05", "85074-09-24", "8125-09-24". Statute section refs ("סעיף 197") do not
+# match and are legitimately part of a principle.
+_CITATION_RE = re.compile(r"\d{3,5}[-/]\d{2}(?:[-/]\d{2,4})?")
+
+_SYSTEM = (
+    "אתה עורך-דין בכיר המנסח עקרונות-הלכה קנוניים לבסיס-ידע משפטי של ועדת ערר "
+    "לתכנון ובנייה. תפקידך לזקק ניסוח אחד, כללי ומדויק, של עיקרון משפטי — לא לסכם "
+    "תיק ולא להמציא דין."
+)
+
+
+def _build_prompt(data: dict) -> str:
+    instances = data.get("instances") or []
+    blocks: list[str] = []
+    for i, inst in enumerate(instances, 1):
+        parts = [f"### מופע {i} (תיק {inst.get('case_number') or '—'}, "
+                 f"סוג: {inst.get('instance_type') or '—'})"]
+        if inst.get("rule_statement"):
+            parts.append(f"ניסוח-העיקרון: {inst['rule_statement']}")
+        if inst.get("supporting_quote"):
+            parts.append(f"ציטוט-תומך (מקור-העיגון): \"{inst['supporting_quote']}\"")
+        if inst.get("reasoning_summary"):
+            parts.append(f"נימוק: {inst['reasoning_summary']}")
+        blocks.append("\n".join(parts))
+    evidence = "\n\n".join(blocks) if blocks else "(אין מופעים)"
+    multi = len(instances) > 1
+
+    task = (
+        "מזג את כל ניסוחי-המופעים לעיקרון קנוני אחד המשותף לכולם."
+        if multi else
+        "נסח מחדש את העיקרון לניסוח קנוני נקי וכללי."
+    )
+
+    return f"""{_SYSTEM}
+
+הניסוח הקנוני הנוכחי (שיש לשפר):
+{data.get('canonical_statement') or '(ריק)'}
+
+מקורות-העיגון (מופעי העיקרון בפסיקה):
+{evidence}
+
+## המשימה
+{task}
+
+## כללים מחייבים (INV-AH — עיגון, ללא הזיה)
+1. **עיגון-מקור בלבד.** הניסוח חייב לנבוע מהציטוטים-התומכים שלמעלה. אסור להוסיף דין, חריג, סייג או תנאי שאינו עולה מהמקורות.
+2. **ללא ציטוטי-תיקים חדשים.** אל תוסיף מספרי-תיק/פסקי-דין שאינם מופיעים במקורות. הפניה לסעיף-חוק כללי (למשל "סעיף 197 לחוק התכנון והבניה") מותרת אם היא חלק מהעיקרון.
+3. **כללי ובלתי-תלוי-תיק.** הסר שמות-צדדים, עובדות-תיק ספציפיות ומספרים קונקרטיים. נסח עיקרון רב-תחולה, לא סיכום של מקרה.
+4. **רגיסטר משפטי נקי** בעברית, משפט אחד עד שניים, ללא מילות-פתיחה ("נקבע כי", "בית-המשפט קבע") — רק העיקרון עצמו.
+5. **הימנעות עדיפה על המצאה.** אם אינך יכול לזקק עיקרון מעוגן מהמקורות — החזר grounded=false והשאר את הניסוח הקיים.
+
+## פלט — JSON בלבד, ללא markdown וללא הסבר:
+{{
+  "canonical_statement": "<הניסוח הקנוני המזוקק, או הניסוח הקיים אם grounded=false>",
+  "grounded": true,
+  "changed": true,
+  "reason": "<משפט קצר: מה שונה, או מדוע נמנעת>"
+}}"""
+
+
+def _cosine(a: list[float], b: list[float]) -> float:
+    dot = sum(x * y for x, y in zip(a, b))
+    na = math.sqrt(sum(x * x for x in a))
+    nb = math.sqrt(sum(y * y for y in b))
+    if na == 0 or nb == 0:
+        return 0.0
+    return dot / (na * nb)
+
+
+def _new_citations(text: str, source_text: str) -> list[str]:
+    """Docket-number tokens present in the rewrite but absent from the source evidence."""
+    src = set(_CITATION_RE.findall(source_text))
+    return [tok for tok in _CITATION_RE.findall(text) if tok not in src]
+
+
+async def synthesize_canonical(
+    canonical_id: UUID,
+    *,
+    model: str | None = None,
+    effort: str | None = None,
+    drift_floor: float | None = None,
+) -> dict:
+    """Synthesize one canonical's statement. Read-only with respect to the DB: it
+    reads the input and calls the LLM and the embedder, but never writes.
+
+    Returns a proposal dict the caller applies (or not, for dry-run):
+      {status, canonical_id, accepted, original, proposed, embedding, drift_cosine, reason}
+    (``not_found`` carries only ``status`` and ``canonical_id``).
+
+    status ∈ {accepted, abstained, drift_rejected, new_citation, no_instances,
+    llm_error, not_found}. ``accepted`` carries ``proposed`` + ``embedding``
+    (the rewrite's vector, to commit alongside the statement). Every other status
+    keeps the original statement.
+    """
+    model = model or config.HALACHA_CANONICAL_SYNTH_MODEL
+    effort = effort or config.HALACHA_CANONICAL_SYNTH_EFFORT
+    drift_floor = config.HALACHA_CANONICAL_SYNTH_DRIFT_FLOOR if drift_floor is None else drift_floor
+
+    data = await db.fetch_canonical_synthesis_input(canonical_id)
+    if data is None:
+        return {"status": "not_found", "canonical_id": str(canonical_id)}
+
+    original = data.get("canonical_statement") or ""
+    instances = data.get("instances") or []
+    base = {"status": "", "canonical_id": str(canonical_id), "accepted": False,
+            "original": original, "proposed": original, "embedding": None,
+            "drift_cosine": None, "reason": ""}
+
+    if not instances:
+        return {**base, "status": "no_instances", "reason": "no linked instances"}
+
+    try:
+        result = await claude_session.query_json(
+            _build_prompt(data), model=model, effort=effort, tools="",
+        )
+    except Exception as e:
+        logger.warning("synthesize_canonical %s: LLM error: %s", canonical_id, e)
+        return {**base, "status": "llm_error", "reason": str(e)}
+
+    if not isinstance(result, dict) or not result.get("canonical_statement"):
+        return {**base, "status": "llm_error", "reason": "malformed LLM output"}
+
+    if not result.get("grounded", True):
+        return {**base, "status": "abstained",
+                "reason": result.get("reason") or "model abstained (not grounded)"}
+
+    proposed = str(result["canonical_statement"]).strip()
+    if not proposed or proposed == original:
+        return {**base, "status": "abstained", "reason": "no change proposed"}
+
+    # AH-2: no invented docket citations. Source = current statement + all evidence.
+    source_text = original + " " + " ".join(
+        f"{i.get('rule_statement', '')} {i.get('supporting_quote', '')}" for i in instances
+    )
+    invented = _new_citations(proposed, source_text)
+    if invented:
+        return {**base, "status": "new_citation", "proposed": proposed,
+                "reason": f"introduced citations absent from source: {invented}"}
+
+    # Drift guard: re-embed the rewrite, compare to the source statement's vector.
+    new_emb = (await embeddings.embed_texts([proposed]))[0]
+    src_emb = data.get("embedding")
+    if not src_emb:
+        src_emb = (await embeddings.embed_texts([original]))[0]
+    drift = _cosine(new_emb, src_emb)
+    if drift < drift_floor:
+        return {**base, "status": "drift_rejected", "proposed": proposed,
+                "drift_cosine": round(drift, 4),
+                "reason": f"drift {drift:.3f} < floor {drift_floor}"}
+
+    return {**base, "status": "accepted", "accepted": True, "proposed": proposed,
+            "embedding": new_emb, "drift_cosine": round(drift, 4),
+            "reason": result.get("reason") or "synthesized"}
+
+
+async def synthesize_and_apply(
+    canonical_id: UUID,
+    *,
+    model: str | None = None,
+    effort: str | None = None,
+    drift_floor: float | None = None,
+) -> dict:
+    """Synthesize one canonical and commit the outcome.
+
+    On ``accepted`` writes the new statement + its embedding. On any other terminal
+    outcome (abstained / drift_rejected / new_citation) the ORIGINAL statement is
+    kept but ``review_status`` still advances to ``pending_review``: a synthesis was
+    attempted, so the row leaves the queue (no infinite re-attempt) and reaches the
+    chair as-is. ``not_found`` / ``no_instances`` / ``llm_error`` are NOT committed
+    (transient or empty) so they are retried on the next pass.
+    """
+    proposal = await synthesize_canonical(
+        canonical_id, model=model, effort=effort, drift_floor=drift_floor,
+    )
+    status = proposal["status"]
+    if status in ("not_found", "no_instances", "llm_error"):
+        return proposal
+
+    if proposal["accepted"]:
+        await db.apply_canonical_synthesis(
+            canonical_id, proposal["proposed"], embedding=proposal["embedding"],
+        )
+    else:
+        # keep original statement + embedding, just advance the gate
+        await db.apply_canonical_synthesis(canonical_id, proposal["original"])
+    return proposal
diff --git a/mcp-server/src/legal_mcp/services/db.py b/mcp-server/src/legal_mcp/services/db.py
index a6d5528..6ef2c9f 100644
--- a/mcp-server/src/legal_mcp/services/db.py
+++ b/mcp-server/src/legal_mcp/services/db.py
@@ -6147,6 +6147,71 @@ async def update_canonical_statement(
     return result.split()[-1] != "0"
 
 
+async def fetch_canonical_synthesis_input(canonical_id: "UUID") -> "dict | None":
+    """Fetch everything the canonical-synthesis pass needs for one principle (V41 Phase 4).
+
+    Unlike :func:`get_canonical_halacha` (UI-facing) this returns the canonical's own
+    ``embedding`` (as a python list, for the drift guard) AND each instance's full text
+    fields (``rule_statement`` + ``supporting_quote`` + ``reasoning_summary``), i.e. the
+    grounding evidence the LLM rewrites from (INV-AH). Returns None if not found.
+    """
+    pool = await get_pool()
+    row = await pool.fetchrow(
+        "SELECT id::text, canonical_statement, rule_type, practice_areas, "
+        "       subject_tags, review_status, instance_count, embedding "
+        "FROM canonical_halachot WHERE id=$1",
+        canonical_id,
+    )
+    if not row:
+        return None
+    instances = await pool.fetch(
+        "SELECT h.instance_type, h.treatment, h.rule_statement, "
+        "       h.supporting_quote, h.reasoning_summary, "
+        "       cl.case_number, cl.case_name "
+        "FROM halachot h JOIN case_law cl ON cl.id = h.case_law_id "
+        "WHERE h.canonical_id=$1 "
+        "ORDER BY (h.instance_type='original') DESC, cl.case_number",
+        canonical_id,
+    )
+    emb = row["embedding"]
+    out = dict(row)
+    out["embedding"] = list(emb) if emb is not None else None
+    out["instances"] = [dict(i) for i in instances]
+    return out
+
+
+async def apply_canonical_synthesis(
+    canonical_id: "UUID",
+    canonical_statement: str,
+    embedding: "list[float] | None" = None,
+    review_status: str = "pending_review",
+) -> bool:
+    """Atomically commit a synthesis outcome for one canonical (V41 Phase 4).
+
+    Always advances ``review_status`` (default → ``pending_review`` for the chair
+    gate, G10/INV-LRN1) and writes ``canonical_statement``. ``embedding`` is updated
+    only when provided (None = leave as-is) so the keep-original path on a
+    drift-rejected/abstained synthesis doesn't need to re-embed. Returns True if the
+    row existed.
+    """
+    pool = await get_pool()
+    if embedding is None:
+        result = await pool.execute(
+            "UPDATE canonical_halachot "
+            "SET canonical_statement=$2, review_status=$3, updated_at=now() "
+            "WHERE id=$1",
+            canonical_id, canonical_statement, review_status,
+        )
+    else:
+        result = await pool.execute(
+            "UPDATE canonical_halachot "
+            "SET canonical_statement=$2, embedding=$3, review_status=$4, updated_at=now() "
+            "WHERE id=$1",
+            canonical_id, canonical_statement, embedding, review_status,
+        )
+    return result.split()[-1] != "0"
+
+
 async def list_canonical_instances(canonical_id: "UUID") -> list[dict]:
     """List all halachot (instances) sharing a canonical_id — used by the UI accordion."""
     pool = await get_pool()
diff --git a/mcp-server/src/legal_mcp/services/panel_extraction.py b/mcp-server/src/legal_mcp/services/panel_extraction.py
new file mode 100644
index 0000000..e93fa74
--- /dev/null
+++ b/mcp-server/src/legal_mcp/services/panel_extraction.py
@@ -0,0 +1,243 @@
+"""Tri-model panel extraction regime (legal-principles-redesign, #152).
+
+The shared core (G2) for BOTH the going-forward extractor (Phase B) and the
+retroactive cull (Phase C). chaim 2026-06-19:
+
+  1. THREE models (Claude local + DeepSeek + Gemini) deep-analyze a decision and
+     each PROPOSES candidate principles, each with a 0-1 score.
+  2. Candidates are matched ACROSS models by embedding cosine → a "merged
+     candidate" carries: votes (# distinct models that proposed it) and score
+     (mean of the voters' scores).
+  3. Approval rule:
+       votes == 3                          → approved (even if score < floor)
+       votes >= 2 AND score >= SCORE_FLOOR → approved
+       votes == 2 AND score <  SCORE_FLOOR → pending_review (chair, G10)
+       votes <= 1                          → rejected (dropped)
+  4. The CALLER applies the corpus-dedup (V41 link → frees a slot) and the
+     MAX_NEW cap (top-N approved-new by score). This module is corpus-agnostic
+     and DB-free so it is unit-testable and reused identically by B and C.
+
+Terminology (#152): a principle from a binding higher court is a הלכה; one from
+the appeals committee (internal_committee) is a כלל פרשני (interpretive rule);
+the committee applies law, it does not make binding precedent. The extract prompt
+adapts to ``source_kind`` and, for the committee, demands genuine novelty.
+"""
+from __future__ import annotations
+
+import logging
+import math
+
+import httpx
+
+from legal_mcp import config
+from legal_mcp.services import embeddings, panel_judges
+
+logger = logging.getLogger(__name__)
+
+_RULE_TYPES = ("holding", "interpretive", "procedural")  # citable kinds only
+
+
+def _extract_system(source_kind: str, is_binding: bool, max_candidates: int) -> str:
+    if source_kind == "internal_committee":
+        nature = (
+            "המקור הוא החלטת ועדת-ערר. ועדת ערר מיישמת דין קיים ואינה יוצרת הלכה מחייבת. "
+            "חלץ אך ורק כללים פרשניים חדשים לגמרי שהוועדה גיבשה — לא יישום של הלכה ידועה, "
+            "לא חזרה על דין מוכר, ולא תיאור עובדות. אם אין כלל פרשני חדש אמיתי — החזר []."
+        )
+    elif is_binding:
+        nature = (
+            "המקור הוא פסק-דין של בית-משפט מחוזי/עליון. חלץ הלכות — כללים משפטיים "
+            "בני-הכללה והסתמכות שהפסק קובע או מאמץ ומיישם."
+        )
+    else:
+        nature = (
+            "המקור הוא פסיקה משכנעת (לא-מחייבת). חלץ עקרונות משפטיים בני-הכללה בלבד."
+        )
+    return (
+        "אתה משפטן בכיר בוועדת ערר לתכנון ובנייה, מנתח פסיקה לבסיס-ידע בר-ציטוט. "
+        f"{nature}\n\n"
+        "כללי-ברזל:\n"
+        "• רק עיקרון כללי בר-הכללה והסתמכות — לא החלה תלוית-עובדות/צדדים/סכומים, "
+        "לא אמרת-אגב (סוגיה שלא הוכרעה), לא חזרה מילולית על הציטוט ללא הפשטה.\n"
+        "• כל עיקרון חייב עיגון: ציטוט מילולי מהמקור התומך בו (INV-AH).\n"
+        f"• החזר עד {max_candidates} המועמדים החזקים ביותר בלבד; מוטב מעט ואיכותי.\n\n"
+        "פלט — JSON array בלבד, ללא markdown:\n"
+        "[{\n"
+        '  "rule_statement": "<העיקרון, כללי ובלתי-תלוי-תיק>",\n'
+        '  "supporting_quote": "<ציטוט מילולי מהמקור>",\n'
+        '  "reasoning_summary": "<מדוע זה עיקרון בר-הסתמכות>",\n'
+        '  "rule_type": "holding|interpretive|procedural",\n'
+        '  "score": 0.0-1.0\n'
+        "}]\n"
+        "אם אין עקרונות ראויים — החזר []."
+    )
+
+
+def _coerce_list(reply) -> list[dict]:
+    """A judge may return a list, or {"principles":[...]}/{"items":[...]}, or junk."""
+    if isinstance(reply, list):
+        items = reply
+    elif isinstance(reply, dict):
+        for k in ("principles", "items", "halachot", "results", "candidates"):
+            if isinstance(reply.get(k), list):
+                items = reply[k]
+                break
+        else:
+            items = [reply] if reply.get("rule_statement") else []
+    else:
+        return []
+    out = []
+    for it in items:
+        if not isinstance(it, dict):
+            continue
+        rule = (it.get("rule_statement") or "").strip()
+        quote = (it.get("supporting_quote") or "").strip()
+        if not rule or not quote:
+            continue
+        rt = (it.get("rule_type") or "interpretive").strip().lower()
+        try:
+            score = float(it.get("score", 0.0))
+        except (TypeError, ValueError):
+            score = 0.0
+        out.append({
+            "rule_statement": rule,
+            "supporting_quote": quote,
+            "reasoning_summary": (it.get("reasoning_summary") or "").strip(),
+            "rule_type": rt if rt in _RULE_TYPES else "interpretive",
+            "score": max(0.0, min(1.0, score)),
+        })
+    return out
+
+
+def _cosine(a: list[float], b: list[float]) -> float:
+    dot = sum(x * y for x, y in zip(a, b))
+    na = math.sqrt(sum(x * x for x in a))
+    nb = math.sqrt(sum(y * y for y in b))
+    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)
+
+
+def classify(votes: int, score: float) -> str:
+    """The chair's approval rule → 'approved' | 'pending_review' | 'rejected'."""
+    floor = config.HALACHA_PANEL_SCORE_FLOOR
+    if votes >= 3:
+        return "approved"
+    if votes == 2:
+        return "approved" if score >= floor else "pending_review"
+    return "rejected"
+
+
+def cluster_candidates(
+    per_model: dict[str, list[dict]], embs: dict[int, list[float]],
+) -> list[dict]:
+    """Greedy cross-model clustering. ``per_model`` maps judge→its candidate list;
+    ``embs`` maps id(candidate)→embedding. Each cluster merges near-duplicate
+    proposals: votes = # distinct models present, score = mean of each model's
+    BEST score in the cluster, representative = highest-scoring member.
+
+    Pure (no I/O) given the embeddings, so it is unit-testable.
+    """
+    match = config.HALACHA_PANEL_MATCH_COSINE
+    clusters: list[dict] = []
+    # deterministic order: model order, then model-local order
+    flat: list[tuple[str, dict]] = []
+    for m in panel_judges.JUDGE_NAMES:
+        for c in per_model.get(m, []):
+            flat.append((m, c))
+
+    for model, cand in flat:
+        emb = embs.get(id(cand))
+        placed = False
+        if emb is not None:
+            for cl in clusters:
+                if cl["_emb"] is not None and _cosine(cl["_emb"], emb) >= match:
+                    cl["members"].append({"model": model, **cand})
+                    prev = cl["per_model_score"].get(model, -1.0)
+                    cl["per_model_score"][model] = max(prev, cand["score"])
+                    if cand["score"] > cl["score_rep"]:
+                        cl["score_rep"] = cand["score"]
+                        cl["rule_statement"] = cand["rule_statement"]
+                        cl["supporting_quote"] = cand["supporting_quote"]
+                        cl["reasoning_summary"] = cand["reasoning_summary"]
+                        cl["rule_type"] = cand["rule_type"]
+                        cl["_emb"] = emb
+                    placed = True
+                    break
+        if not placed:
+            clusters.append({
+                "rule_statement": cand["rule_statement"],
+                "supporting_quote": cand["supporting_quote"],
+                "reasoning_summary": cand["reasoning_summary"],
+                "rule_type": cand["rule_type"],
+                "members": [{"model": model, **cand}],
+                "per_model_score": {model: cand["score"]},
+                "score_rep": cand["score"],
+                "_emb": emb,
+            })
+
+    out = []
+    for cl in clusters:
+        pms = cl["per_model_score"]
+        votes = len(pms)
+        score = sum(pms.values()) / votes if votes else 0.0
+        out.append({
+            "rule_statement": cl["rule_statement"],
+            "supporting_quote": cl["supporting_quote"],
+            "reasoning_summary": cl["reasoning_summary"],
+            "rule_type": cl["rule_type"],
+            "votes": votes,
+            "score": round(score, 4),
+            "voters": sorted(pms.keys()),
+            "verdict": classify(votes, score),
+            "embedding": cl["_emb"],
+        })
+    # strongest first
+    out.sort(key=lambda c: (c["votes"], c["score"]), reverse=True)
+    return out
+
+
+async def _run_three(system: str, user: str, max_tokens: int) -> dict[str, object]:
+    async with httpx.AsyncClient() as client:
+        import asyncio
+        c, ds, gm = await asyncio.gather(
+            panel_judges.judge_claude(system, user, max_tokens=max_tokens),
+            panel_judges.judge_deepseek(client, system, user, max_tokens=max_tokens),
+            panel_judges.judge_gemini(client, system, user, max_tokens=max_tokens),
+        )
+    return {"claude": c, "deepseek": ds, "gemini": gm}
+
+
+async def panel_extract(
+    text: str,
+    *,
+    source_kind: str = "external_upload",
+    is_binding: bool = True,
+    propose_n: int | None = None,
+) -> list[dict]:
+    """Run the 3-model panel over a decision's text → merged candidate principles.
+
+    Returns clusters (strongest first), each:
+      {rule_statement, supporting_quote, reasoning_summary, rule_type,
+       votes, score, voters, verdict, embedding}
+    Does NOT dedup vs the corpus and does NOT apply the MAX_NEW cap; the caller
+    (extractor / cull) owns those (they need DB + differ B vs C).
+    """
+    propose_n = propose_n if propose_n is not None else config.HALACHA_PANEL_MAX_NEW + 3
+    system = _extract_system(source_kind, is_binding, propose_n)
+    user = f"--- תחילת המקור ---\n{text}\n--- סוף המקור ---"
+    replies = await _run_three(system, user, max_tokens=8000)
+
+    per_model: dict[str, list[dict]] = {}
+    for name in panel_judges.JUDGE_NAMES:
+        per_model[name] = _coerce_list(replies.get(name))
+    if not any(per_model.values()):
+        logger.warning("panel_extract: all three judges returned no candidates")
+        return []
+
+    # embed every candidate's rule_statement for cross-model matching
+    flat = [c for m in panel_judges.JUDGE_NAMES for c in per_model[m]]
+    embs: dict[int, list[float]] = {}
+    if flat:
+        vecs = await embeddings.embed_texts([c["rule_statement"] for c in flat])
+        for c, v in zip(flat, vecs):
+            embs[id(c)] = list(v)
+    return cluster_candidates(per_model, embs)
diff --git a/mcp-server/src/legal_mcp/services/panel_judges.py b/mcp-server/src/legal_mcp/services/panel_judges.py
new file mode 100644
index 0000000..3f8ec0e
--- /dev/null
+++ b/mcp-server/src/legal_mcp/services/panel_judges.py
@@ -0,0 +1,114 @@
+"""Three independent-lineage LLM judges: the shared primitive (G2).
+
+Extracted from scripts/halacha_panel_approve.py so the panel-extraction regime
+(#152) and the existing approval-triage share ONE implementation of the judges
+(no parallel HTTP/auth paths). Diversity of lineage is the point: cross-model
+agreement is the reliable signal (gold-set AC1=0.92):
+
+  • claude:   Opus via claude_session (local CLI, zero marginal cost)  [Anthropic]
+  • deepseek: api.deepseek.com (deepseek-chat)                         [DeepSeek]
+  • gemini:   generativelanguage (gemini-2.5-flash, #1 LegalBench)     [Google]
+
+Every judge takes the same ``(system, user)`` pair (the two HTTP judges also take
+an ``httpx.AsyncClient`` first), returns the parsed JSON (``dict | list``), and
+returns None on ANY failure (missing key, HTTP error, bad JSON), so callers must
+tolerate a missing judge (a 2/3 panel is still actionable).
+""" +from __future__ import annotations + +import json +import os +from pathlib import Path + +import httpx + +from legal_mcp.services import claude_session + + +def _env_key(name: str, *files: str) -> str: + for f in files: + p = Path(f).expanduser() + if p.exists(): + for line in p.read_text().splitlines(): + if line.startswith(name + "="): + return line.split("=", 1)[1].strip() + return os.environ.get(name, "") + + +DEEPSEEK_KEY = _env_key("DEEPSEEK_API_KEY", "~/.hermes/profiles/deepseek/.env", "~/.env") +# canonical Infisical name is GOOGLE_GEMINI_API_KEY (/external-apis/gemini); accept +# the bare GEMINI_API_KEY too for back-compat. +GEMINI_KEY = _env_key("GOOGLE_GEMINI_API_KEY", "~/.env") or _env_key("GEMINI_API_KEY", "~/.env") + +JUDGE_NAMES = ("claude", "deepseek", "gemini") + + +def available() -> dict[str, bool]: + return {"claude": True, "deepseek": bool(DEEPSEEK_KEY), "gemini": bool(GEMINI_KEY)} + + +async def judge_claude(system: str, user: str, *, max_tokens: int = 2000) -> dict | list | None: + try: + # tools="" → no tool_use, so a pure text→JSON extraction never trips + # error_max_turns (and wastes no retries on a web-search detour). 
+ return await claude_session.query_json(user, system=system, tools="") + except Exception: + return None + + +async def judge_deepseek( + client: httpx.AsyncClient, system: str, user: str, *, max_tokens: int = 2000, +) -> dict | list | None: + if not DEEPSEEK_KEY: + return None + try: + r = await client.post( + "https://api.deepseek.com/v1/chat/completions", + headers={"Authorization": f"Bearer {DEEPSEEK_KEY}", "Content-Type": "application/json"}, + json={"model": "deepseek-chat", "temperature": 0, "max_tokens": max_tokens, + "response_format": {"type": "json_object"}, + "messages": [{"role": "system", "content": system}, + {"role": "user", "content": user}]}, + timeout=120, + ) + r.raise_for_status() + return json.loads(r.json()["choices"][0]["message"]["content"]) + except Exception: + return None + + +async def judge_gemini( + client: httpx.AsyncClient, system: str, user: str, *, max_tokens: int = 8000, +) -> dict | list | None: + if not GEMINI_KEY: + return None + try: + r = await client.post( + f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent?key={GEMINI_KEY}", + headers={"Content-Type": "application/json"}, + json={"system_instruction": {"parts": [{"text": system}]}, + "contents": [{"parts": [{"text": user}]}], + # thinkingBudget=0 disables gemini-2.5-flash's "thinking", which + # otherwise eats the output budget on large inputs → empty parts + # → finishReason MAX_TOKENS → the judge silently dropped out. 
+ "generationConfig": {"temperature": 0, "maxOutputTokens": max_tokens, + "responseMimeType": "application/json", + "thinkingConfig": {"thinkingBudget": 0}}}, + timeout=120, + ) + r.raise_for_status() + parts = (r.json().get("candidates") or [{}])[0].get("content", {}).get("parts") + if not parts: + return None + return json.loads(parts[0]["text"]) + except Exception: + return None + + +def to_bool(d: dict | None, key: str) -> bool | None: + """Robust bool coercion for a judge JSON field (handles he/en truthy strings).""" + if not isinstance(d, dict) or key not in d: + return None + v = d[key] + if isinstance(v, bool): + return v + return str(v).strip().lower() in ("true", "1", "yes", "כן") diff --git a/mcp-server/src/legal_mcp/tools/precedent_library.py b/mcp-server/src/legal_mcp/tools/precedent_library.py index 8ac3d2f..5722378 100644 --- a/mcp-server/src/legal_mcp/tools/precedent_library.py +++ b/mcp-server/src/legal_mcp/tools/precedent_library.py @@ -21,7 +21,7 @@ from __future__ import annotations import time from uuid import UUID -from legal_mcp.services import db, precedent_library, telemetry +from legal_mcp.services import canonical_synthesis, db, precedent_library, telemetry from legal_mcp.tools.envelope import empty, err as _err, ok as _ok # GAP-48: SSoT envelope @@ -439,3 +439,34 @@ async def canonical_halacha_get(canonical_id: str) -> str: if row is None: return _err("עיקרון קנוני לא נמצא") return _ok(row) + + +async def canonical_synthesize_pending(limit: int = 20) -> str: + """סנתז ניסוח-קנוני לעקרונות הממתינים (review_status='pending_synthesis'). V41 Phase 4. + + לכל עיקרון: מודל מקומי (claude_session) מזקק ניסוח אחד, כללי ומעוגן בציטוטי-המופעים + (INV-AH), שער-drift דוחה סטייה גדולה מדי, והסטטוס מתקדם ל-pending_review לשער-היו"ר + (G10). on-demand / burst ידני; המסה הראשונית מטופלת ב-backfill_canonical_synthesis.py. + + Args: + limit: מספר מקסימלי לסבב (עד 100). רב-instance מטופלים ראשונים. 
+ """ + pool = await db.get_pool() + rows = await pool.fetch( + "SELECT id::text AS id FROM canonical_halachot " + "WHERE review_status='pending_synthesis' " + "ORDER BY instance_count DESC, created_at LIMIT $1", + min(max(limit, 1), 100), + ) + if not rows: + return _ok({"processed": 0, "results": [], "message": "אין עקרונות ממתינים לסינתזה"}) + results = [] + counts: dict[str, int] = {} + for r in rows: + res = await canonical_synthesis.synthesize_and_apply(UUID(r["id"])) + counts[res["status"]] = counts.get(res["status"], 0) + 1 + results.append({ + "canonical_id": res["canonical_id"], "status": res["status"], + "drift_cosine": res.get("drift_cosine"), "reason": res.get("reason", ""), + }) + return _ok({"processed": len(results), "by_status": counts, "results": results}) diff --git a/mcp-server/tests/test_canonical_synthesis.py b/mcp-server/tests/test_canonical_synthesis.py new file mode 100644 index 0000000..0bdf4f3 --- /dev/null +++ b/mcp-server/tests/test_canonical_synthesis.py @@ -0,0 +1,134 @@ +"""Unit tests for canonical_statement synthesis (V41 Phase 4) — INV-LRN6 / INV-AH. + +Pure-helper coverage + the grounding/drift/citation gates of synthesize_canonical, +with db / claude_session / embeddings monkeypatched (no DB, no LLM, no Voyage). 
+""" +from __future__ import annotations + +import asyncio +from uuid import uuid4 + +import pytest + +from legal_mcp.services import canonical_synthesis as cs + +CID = uuid4() + + +# ── pure helpers ─────────────────────────────────────────────────── + +def test_cosine_identity_and_orthogonal(): + assert cs._cosine([1.0, 0.0], [1.0, 0.0]) == pytest.approx(1.0) + assert cs._cosine([1.0, 0.0], [0.0, 1.0]) == pytest.approx(0.0) + assert cs._cosine([0.0, 0.0], [1.0, 1.0]) == 0.0 # zero-norm guard + + +def test_new_citations_flags_invented_docket_only(): + src = 'העיקרון מתוך ערר 1234/05 והלכה נוספת' + # statute section is fine; shared docket is fine; new docket flagged + out = 'לפי סעיף 197 לחוק, וכפי שנקבע בערר 1234/05 ובעע"מ 9999/21' + assert cs._new_citations(out, src) == ['9999/21'] + assert cs._new_citations('סעיף 197 לחוק התכנון והבניה', src) == [] + + +def _data(*, statement="עיקרון מקורי נקי", instances=None, embedding=None): + return { + "id": str(CID), + "canonical_statement": statement, + "practice_areas": [], + "subject_tags": [], + "review_status": "pending_synthesis", + "instance_count": len(instances or [{}]), + "embedding": embedding, + "instances": instances if instances is not None else [ + {"instance_type": "original", "treatment": "mentioned", + "rule_statement": "עיקרון מקורי נקי", + "supporting_quote": "ציטוט תומך מהפסיקה", "reasoning_summary": "", + "case_number": "1234-01-20", "case_name": "פלוני"}, + ], + } + + +def _patch(monkeypatch, *, data, llm, emb=None): + async def fake_fetch(_cid): + return data + + async def fake_query(*a, **k): + return llm + + async def fake_embed(texts, input_type="document"): + # default: proposed embeds identical to a [1,0] source → drift 1.0 + return [emb([t]) if emb else [1.0, 0.0] for t in texts] + + monkeypatch.setattr(cs.db, "fetch_canonical_synthesis_input", fake_fetch) + monkeypatch.setattr(cs.claude_session, "query_json", fake_query) + monkeypatch.setattr(cs.embeddings, "embed_texts", fake_embed) + + +def 
_run(monkeypatch, **kw): + return asyncio.run(cs.synthesize_canonical(CID, **kw)) + + +# ── gate behaviour ───────────────────────────────────────────────── + +def test_accepted_when_grounded_and_low_drift(monkeypatch): + _patch(monkeypatch, + data=_data(embedding=[1.0, 0.0]), + llm={"canonical_statement": "עיקרון מזוקק כללי", "grounded": True, + "changed": True, "reason": "זוקק"}) + res = _run(monkeypatch) + assert res["status"] == "accepted" and res["accepted"] is True + assert res["proposed"] == "עיקרון מזוקק כללי" + assert res["embedding"] == [1.0, 0.0] + assert res["drift_cosine"] == pytest.approx(1.0) + + +def test_abstained_when_not_grounded(monkeypatch): + _patch(monkeypatch, data=_data(), + llm={"canonical_statement": "x", "grounded": False, "reason": "אין עיגון"}) + res = _run(monkeypatch) + assert res["status"] == "abstained" and res["accepted"] is False + assert res["proposed"] == res["original"] # original kept + + +def test_abstained_when_no_change(monkeypatch): + _patch(monkeypatch, data=_data(statement="זהה"), + llm={"canonical_statement": "זהה", "grounded": True}) + assert _run(monkeypatch)["status"] == "abstained" + + +def test_drift_rejected_keeps_original(monkeypatch): + # source [1,0], proposed embeds to [0,1] → cosine 0 < floor + _patch(monkeypatch, + data=_data(embedding=[1.0, 0.0]), + llm={"canonical_statement": "עיקרון אחר לגמרי", "grounded": True}, + emb=lambda t: [0.0, 1.0]) + res = _run(monkeypatch, drift_floor=0.80) + assert res["status"] == "drift_rejected" and res["accepted"] is False + assert res["drift_cosine"] == pytest.approx(0.0) + assert res["proposed"] == "עיקרון אחר לגמרי" # surfaced for audit, not committed + + +def test_new_citation_rejected(monkeypatch): + _patch(monkeypatch, data=_data(embedding=[1.0, 0.0]), + llm={"canonical_statement": 'עיקרון עם ציטוט חדש עע"מ 8888/22', "grounded": True}) + res = _run(monkeypatch) + assert res["status"] == "new_citation" and res["accepted"] is False + + +def 
test_no_instances(monkeypatch): + _patch(monkeypatch, data=_data(instances=[]), + llm={"canonical_statement": "x", "grounded": True}) + assert _run(monkeypatch)["status"] == "no_instances" + + +def test_llm_error_on_none(monkeypatch): + _patch(monkeypatch, data=_data(), llm=None) + assert _run(monkeypatch)["status"] == "llm_error" + + +def test_not_found(monkeypatch): + async def none_fetch(_cid): + return None + monkeypatch.setattr(cs.db, "fetch_canonical_synthesis_input", none_fetch) + assert asyncio.run(cs.synthesize_canonical(CID))["status"] == "not_found" diff --git a/mcp-server/tests/test_panel_extraction.py b/mcp-server/tests/test_panel_extraction.py new file mode 100644 index 0000000..ae070b0 --- /dev/null +++ b/mcp-server/tests/test_panel_extraction.py @@ -0,0 +1,116 @@ +"""Unit tests for the tri-model panel extraction core (#152, Phase A). + +Pure logic only — classify (the chair's approval rule), _coerce_list (judge-reply +normalisation), and cluster_candidates (cross-model matching/voting) with injected +embeddings. No LLM, no Voyage, no DB. 
+""" +from __future__ import annotations + +import pytest + +from legal_mcp import config +from legal_mcp.services import panel_extraction as pe + + +# ── classify — chaim's rule ──────────────────────────────────────── + +def test_classify_three_votes_approves_regardless_of_score(): + assert pe.classify(3, 0.10) == "approved" + assert pe.classify(3, 0.99) == "approved" + + +def test_classify_two_votes_gated_by_floor(): + floor = config.HALACHA_PANEL_SCORE_FLOOR + assert pe.classify(2, floor) == "approved" + assert pe.classify(2, floor + 0.05) == "approved" + assert pe.classify(2, floor - 0.01) == "pending_review" + + +def test_classify_one_or_zero_votes_rejected(): + assert pe.classify(1, 0.99) == "rejected" + assert pe.classify(0, 0.99) == "rejected" + + +# ── _coerce_list — judge reply normalisation ─────────────────────── + +def test_coerce_list_accepts_bare_list(): + raw = [{"rule_statement": "כלל", "supporting_quote": "ציטוט", "score": 0.9}] + out = pe._coerce_list(raw) + assert len(out) == 1 and out[0]["rule_type"] == "interpretive" + + +def test_coerce_list_unwraps_dict_wrapper_and_drops_incomplete(): + raw = {"principles": [ + {"rule_statement": "כלל", "supporting_quote": "ציטוט", "rule_type": "holding", "score": 1.5}, + {"rule_statement": "", "supporting_quote": "ציטוט"}, # no rule → drop + {"rule_statement": "כלל2", "supporting_quote": ""}, # no quote → drop + ]} + out = pe._coerce_list(raw) + assert len(out) == 1 + assert out[0]["rule_type"] == "holding" + assert out[0]["score"] == 1.0 # clamped to [0,1] + + +def test_coerce_list_bad_rule_type_falls_back(): + out = pe._coerce_list([{"rule_statement": "כלל", "supporting_quote": "צ", "rule_type": "obiter", "score": 0.5}]) + assert out[0]["rule_type"] == "interpretive" + + +def test_coerce_list_junk_returns_empty(): + assert pe._coerce_list("nonsense") == [] + assert pe._coerce_list(None) == [] + + +# ── cluster_candidates — cross-model matching & voting ───────────── + +def _c(rule, score): + return 
{"rule_statement": rule, "supporting_quote": "q", "reasoning_summary": "", + "rule_type": "interpretive", "score": score} + + +def test_cluster_merges_across_models_counts_votes_and_means_score(): + # same principle proposed by all three (identical embedding) → 1 cluster, 3 votes + a, b, c = _c("X", 0.9), _c("X", 0.8), _c("X", 0.7) + per_model = {"claude": [a], "deepseek": [b], "gemini": [c]} + embs = {id(a): [1.0, 0.0], id(b): [1.0, 0.0], id(c): [1.0, 0.0]} + out = pe.cluster_candidates(per_model, embs) + assert len(out) == 1 + cl = out[0] + assert cl["votes"] == 3 + assert cl["score"] == pytest.approx((0.9 + 0.8 + 0.7) / 3, abs=1e-3) + assert cl["verdict"] == "approved" + assert cl["voters"] == ["claude", "deepseek", "gemini"] + + +def test_cluster_separates_distinct_principles(): + a, b = _c("X", 0.9), _c("Y", 0.9) + per_model = {"claude": [a, b]} + embs = {id(a): [1.0, 0.0], id(b): [0.0, 1.0]} # orthogonal → 2 clusters + out = pe.cluster_candidates(per_model, embs) + assert len(out) == 2 + assert all(cl["votes"] == 1 and cl["verdict"] == "rejected" for cl in out) + + +def test_cluster_same_model_twice_counts_one_vote_keeps_best_score(): + # one model proposes two near-dupes; another proposes the same → 2 votes, not 3 + a1, a2 = _c("X", 0.6), _c("X", 0.95) + b = _c("X", 0.88) + per_model = {"claude": [a1, a2], "deepseek": [b]} + embs = {id(a1): [1.0, 0.0], id(a2): [1.0, 0.0], id(b): [1.0, 0.0]} + out = pe.cluster_candidates(per_model, embs) + assert len(out) == 1 + cl = out[0] + assert cl["votes"] == 2 # claude counts once + # claude's best (0.95) and deepseek (0.88) → mean + assert cl["score"] == pytest.approx((0.95 + 0.88) / 2, abs=1e-3) + assert cl["rule_statement"] == "X" + + +def test_cluster_sorted_strongest_first(): + a = _c("X", 0.9) # 1 vote + b, c = _c("Y", 0.9), _c("Y", 0.9) # 2 votes + per_model = {"claude": [a, b], "deepseek": [c]} + embs = {id(a): [1.0, 0.0], id(b): [0.0, 1.0], id(c): [0.0, 1.0]} + out = pe.cluster_candidates(per_model, embs) + 
assert out[0]["rule_statement"] == "Y" and out[0]["votes"] == 2 + assert out[1]["rule_statement"] == "X" and out[1]["votes"] == 1 diff --git a/scripts/SCRIPTS.md b/scripts/SCRIPTS.md index b9fe9d2..a945c7b 100644 --- a/scripts/SCRIPTS.md +++ b/scripts/SCRIPTS.md @@ -65,6 +65,7 @@ | `halacha_panel_calibrate.py` | python | **כיול + מדידת הפאנל** (Trust-or-Escalate, ICLR 2025). `--source live` (ברירת-מחדל): מריץ את שאלת-ה-KEEP על מדגם-הזהב ומודד מול `is_holding` precision+coverage+**split-rate** לכל מדיניות + false-keep/false-drop (מייבא שופטים מ-`halacha_panel_approve`, **חובה מקומי**). **#133/FU-5** — `--source captured`: **אפס-עלות** (בלי re-vote/LLM) — מצליב סבבים שמורים (FU-1) מול הכרעות-יו"ר (FU-2) דרך `db.panel_rounds_vs_chair` ומדווח split-rate+auto-precision **לכל סבב** (מגמת הלולאה: ככל שהרובריקה משתפרת precision נשמר ו-split יורד); משתף את `analyze_pairs` של FU-4 (מקור-יחיד). שתי המדידות מדווחות **anon-stability** (מבחן-אנונימיזציה #81.7) כמטריקת-בריאות נגד echo-chamber. `--batch`/`--limit`/`--concurrency`. | ידני — לפני חיווט `--apply` (live) / תקופתי — מעקב-לולאה (captured) | | `halacha_rubric_distill.py` | python | **#133/FU-4 — זיקוק-רובריקה PROPOSE-ONLY.** מצליב `halacha_panel_rounds` (FU-1, הצבעות+נימוקים) מול הכרעות-היו"ר (FU-2, seeds ב-`halacha_goldset` batch `chair-live`) דרך `db.panel_rounds_vs_chair` (read-only), מנתח דטרמיניסטית **כשלים שיטתיים** (false-keep/false-drop, פיצולים-שהוכרעו, שיעור-מחלוקת-עם-היו"ר לכל שופט), ומציע `KEEP_SYSTEM` v2 + exemplars מופשטים (claude_session מקומי, אפס עלות) כ**דוח-diff** ל-`data/learning/rubric-proposal-.md`. **לעולם לא auto-apply** — אימוץ v2 = עריכה אנושית של הקבוע דרך PR (INV-LRN1); exemplars מופשטים בלבד (INV-LRN5); הסיגנל היחיד = הכרעת-יו"ר, לא הצבעות-פאנל (anti-echo). מתחת ל-12 זוגות → "אין מספיק נתונים". `--no-llm` (סטטיסטיקה בלבד) / `--limit N`. **חובה מקומי**. 
| תקופתי — אחרי שהצטברו הכרעות-יו"ר על מחלוקות-פאנל | | `backfill_canonical_halachot.py` | python | **V41 — הקמת מודל ההלכות הקנוניות (חד-פעמי + idempotent).** (1) בונה רכיבים-קשורים (connected components) מ-`equivalent_halachot` (transitive closure — union-find). (2) לכל אשכול: בוחר נציג-קנוני (הכי הרבה corroboration → confidence → earliest), יוצר שורת `canonical_halachot`, ומעדכן `canonical_id` + `instance_type` לכל חברי האשכול. (3) לסינגלטונים (ללא קישורי-שוויון): 1:1 canonical. (4) מאכלס `halacha_citation_corroboration.canonical_id` מ-`halachot.canonical_id`. `--dry-run` (ברירת-מחדל, מחשב ומדווח בלבד) / `--apply` (כותב) / `--verbose`. לאחר הרצה: `canonical_statement` = ניסוח-נציג (pending_synthesis); עוקב: `backfill_canonical_synthesis.py` (Phase 4) יסנתז ניסוח-רחב דרך LLM. הרץ: `mcp-server/.venv/bin/python scripts/backfill_canonical_halachot.py --apply`. | **חד-פעמי** (לאחר deploy V41) / idempotent לפי צורך | +| `backfill_canonical_synthesis.py` | python | **V41 Phase 4 — סינתזת-LLM ל-`canonical_statement` (idempotent + resumable).** עובר על canonicals ב-`review_status='pending_synthesis'` (רב-instance ראשונים) ומזקק לכל אחד ניסוח אחד כללי ומעוגן בציטוטי-המופעים (INV-AH) דרך `services/canonical_synthesis.py` (מסלול-יחיד, G2). שערים: עיגון/הימנעות, **drift-floor** (cosine מול המקור, ברירת-מחדל 0.80 — סטייה גדולה→נשמר המקור), ואיסור ציטוטי-תיק חדשים. בכל מקרה הסטטוס מתקדם ל-`pending_review` לשער-היו"ר (G10/INV-LRN6). מודל Opus (`HALACHA_CANONICAL_SYNTH_MODEL`). מרוסן ע"י `usage_limits` (עוצר-רך בתקרת-שימוש, resumable). `--dry-run` (ברירת-מחדל) / `--apply` / `--sample N` (מדגם אקראי לבדיקה) / `--limit N` / `--no-throttle` / `--verbose`. CSV-audit ל-`data/audit/canonical-synthesis-*.csv`. **חובה מקומי** (claude_session). הרץ: `cd mcp-server && HOME=/home/chaim .venv/bin/python ../scripts/backfill_canonical_synthesis.py --apply`. שוטף: כלי-MCP `canonical_synthesize_pending`. 
| **חד-פעמי** (המסה הראשונית) + idempotent לחדשים | | `halacha_batch_reconcile.py` | python | **#82.7** — dedup חוצה-פסקים offline (שמרני, **dry-run בלבד**). dedup-on-insert משווה רק תוך-פסק; כאן סף מחמיר (cosine ≥0.95, `--cosine`) ולא-הרסני: מאתר זוגות הלכות near-duplicate בין פסקים שונים (pgvector `<=>` exact) עם איתות לקסיקלי (Jaccard/Levenshtein) ומדווח ל-CSV ב-`data/audit/` לסקירת היו"ר. לא מדלג/ממזג/מוחק. `--include-pending`. **`--link`** רושם את הזוגות שנמצאו כ-`equivalent_halachot` (parallel authority, #84.2 — **deprecated post-V41** — השתמש ב-`backfill_canonical_halachot.py --apply` במקום). רץ עם venv של mcp-server. | **deprecated** — הוחלף ב-`backfill_canonical_halachot.py` (V41). נשמר לצורכי audit | | `calibrate_halacha_dedup.py` | python | **#82.1** — כיול ספי ה-dedup הלקסיקלי (#82.3) מול gold-set הניקוי. קורא `halacha-cleanup-manifest-*.csv` (זוגות duplicate↔survivor מתויגי-אדם), טוען טקסט-survivor מה-DB, ו-sweep של (jaccard_min × levenshtein_min) עם P/R/F1, מסמן את נקודת-העבודה המוגדרת. אימת ש-(0.55, 0.70) → **precision 1.0** (אפס false-merge), recall 0.30 — מתאים לאיתות-משני שחוסם auto-approve. `--manifest `. רץ עם venv של mcp-server | חד-פעמי — כיול (בוצע 2026-06-06) | | `ab_halacha_opus48.py` | python | **A/B לא-הרסני לחילוץ הלכות (Claude)** — מריץ מחדש חילוץ הלכות על פסק-דין בודד דרך מודל/effort נבחרים (`AB_MODEL`/`AB_EFFORT`, ברירת-מחדל `claude-opus-4-8`/`xhigh`) ומשווה לסטטיסטיקות ההלכות הקיימות ב-DB **בלי למחוק/לכתוב כלום**. משכפל את `halacha_extractor.extract()` (אותם פרומפטים, בחירת-צ'אנקים, אימות-ציטוט) ומחליף רק את קריאת ה-LLM ב-`claude -p --model --effort`. מפיק `data/ab_halacha__.json`. הרצה: `DOTENV_PATH=/home/chaim/.env DATA_DIR=.../data .venv/bin/python scripts/ab_halacha_opus48.py `. **ממצא 2026-05-31 (שטיין 1128-08-20):** Opus 4.8@xhigh חילץ 51 מול 124 בייצור (100% quote-verified מול 96%) אך ביטחון מכויל-נמוך יותר (חציון 0.75 מול 0.82) — ולכן **לא** מקטין את תור-האישור-הידני תחת sweep אוטו-אישור conf≥0.78 (26 מול 24). 
שיפור איכות, לא צמצום-תור. | ידני (החלטת מודל-חילוץ) | diff --git a/scripts/backfill_canonical_synthesis.py b/scripts/backfill_canonical_synthesis.py new file mode 100644 index 0000000..b40b680 --- /dev/null +++ b/scripts/backfill_canonical_synthesis.py @@ -0,0 +1,174 @@ +#!/usr/bin/env python3 +"""Backfill — LLM synthesis of canonical_halachot.canonical_statement (V41 Phase 4). + +WHAT THIS DOES +-------------- +Walks canonicals in ``review_status='pending_synthesis'`` and, for each, asks a +local ``claude_session`` model (Opus by default) to rewrite the statement carried +over from the representative halacha into ONE clean, case-independent legal +principle — grounded in the instances' supporting quotes (INV-AH). Accepted +rewrites are committed with a fresh embedding; abstained / drift-rejected / +new-citation outcomes keep the original statement. Either way ``review_status`` +advances to ``pending_review`` for the chair gate (G10 / INV-LRN1). + +All logic lives in services/canonical_synthesis.py (G2) — this script is the +batch driver: ordering, throttling, dry-run reporting and a CSV audit trail. + +IDEMPOTENCY / RESUME +-------------------- +Operates on ``pending_synthesis`` only; a committed canonical leaves the queue, so +re-running continues where it stopped. Safe to interrupt. + +THROTTLING +---------- +Each item is one Opus call against chaim's claude.ai subscription. Before every +item the shared usage_limits ceilings are checked; once a window is over its soft +ceiling the run STOPS gracefully (resumable) instead of hammering 429. Disable +with --no-throttle (e.g. small samples). 
+ +USAGE +----- +cd ~/legal-ai/mcp-server +.venv/bin/python ../scripts/backfill_canonical_synthesis.py --sample 20 # dry-run, 20 random +.venv/bin/python ../scripts/backfill_canonical_synthesis.py --dry-run --limit 50 # dry-run, first 50 (multi-instance first) +.venv/bin/python ../scripts/backfill_canonical_synthesis.py --apply # full throttled run +.venv/bin/python ../scripts/backfill_canonical_synthesis.py --apply --limit 200 +""" +from __future__ import annotations + +import argparse +import asyncio +import csv +import os +import random +import sys +from collections import Counter +from datetime import datetime, timezone +from uuid import UUID + +sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "mcp-server", "src")) + +from legal_mcp.services import canonical_synthesis, db # noqa: E402 + +try: # stdlib-only module, importable from system python too + from legal_mcp.services import usage_limits +except Exception: # pragma: no cover + usage_limits = None + +AUDIT_DIR = os.path.join(os.path.dirname(__file__), "..", "data", "audit") + + +async def _pending(limit: int | None, sample: int | None) -> list[dict]: + """Pending-synthesis canonicals, multi-instance first (highest value).""" + pool = await db.get_pool() + rows = await pool.fetch( + "SELECT id::text AS id, instance_count, canonical_statement " + "FROM canonical_halachot WHERE review_status='pending_synthesis' " + "ORDER BY instance_count DESC, created_at", + ) + items = [dict(r) for r in rows] + if sample and sample < len(items): + items = random.sample(items, sample) + if limit: + items = items[:limit] + return items + + +def _throttled() -> tuple[bool, str]: + if usage_limits is None: + return False, "usage_limits unavailable" + usage = usage_limits.subscription_usage() + if usage is None: + return False, "usage read failed (proceeding)" + over, _reset, detail = usage_limits.ceiling_status(usage) + return over, detail + + +def _short(s: str, n: int = 90) -> str: + s = (s or 
"").replace("\n", " ") + return s if len(s) <= n else s[: n - 1] + "…" + + +async def _run(apply: bool, limit: int | None, sample: int | None, + throttle: bool, verbose: bool) -> int: + items = await _pending(limit, sample) + total = len(items) + mode = "APPLY" if apply else "DRY-RUN" + print(f"[{mode}] {total} canonicals pending_synthesis to process " + f"(throttle={'on' if throttle else 'off'})\n") + if not total: + print("nothing to do.") + return 0 + + stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ") + os.makedirs(AUDIT_DIR, exist_ok=True) + audit_path = os.path.join( + AUDIT_DIR, f"canonical-synthesis-{'apply' if apply else 'dryrun'}-{stamp}.csv") + counts: Counter[str] = Counter() + stopped = False + + with open(audit_path, "w", newline="", encoding="utf-8") as fh: + w = csv.writer(fh) + w.writerow(["canonical_id", "instance_count", "status", "drift_cosine", + "reason", "before", "after"]) + for n, it in enumerate(items, 1): + if throttle: + over, detail = _throttled() + if over: + print(f"\n⏸ usage ceiling reached ({detail}) — stopping at " + f"{n - 1}/{total}. 
Re-run to resume.") + stopped = True + break + + cid = UUID(it["id"]) + if apply: + res = await canonical_synthesis.synthesize_and_apply(cid) + else: + res = await canonical_synthesis.synthesize_canonical(cid) + counts[res["status"]] += 1 + + w.writerow([it["id"], it["instance_count"], res["status"], + res.get("drift_cosine"), res.get("reason", ""), + res.get("original", ""), res.get("proposed", "")]) + + mark = {"accepted": "✓", "abstained": "·", "drift_rejected": "✗", + "new_citation": "✗", "llm_error": "!", "no_instances": "·", + "not_found": "!"}.get(res["status"], "?") + line = (f"[{n}/{total}] {mark} {res['status']:<14} " + f"inst={it['instance_count']} {it['id'][:8]}") + print(line) + if verbose and (res["status"] == "accepted" or res.get("proposed") != res.get("original")): + print(f" before: {_short(res.get('original', ''))}") + print(f" after : {_short(res.get('proposed', ''))} " + f"(drift={res.get('drift_cosine')})") + if res.get("reason"): + print(f" reason: {_short(res['reason'], 110)}") + + processed = sum(counts.values()) + print(f"\n── summary ({mode}) — {processed}/{total} processed" + f"{' (stopped early)' if stopped else ''} ──") + for status, c in counts.most_common(): + print(f" {status:<16} {c}") + print(f"\naudit CSV: {audit_path}") + if not apply: + print("dry-run — nothing written to the DB. 
Re-run with --apply to commit.") + return 0 + + +def main() -> int: + p = argparse.ArgumentParser(description="LLM synthesis of canonical_statement (V41 Phase 4)") + p.add_argument("--apply", action="store_true", help="commit to the DB (default: dry-run)") + p.add_argument("--dry-run", action="store_true", help="explicit dry-run (default; overrides --apply)") + p.add_argument("--limit", type=int, default=None, help="cap items processed") + p.add_argument("--sample", type=int, default=None, help="random sample of N (dry-run inspection)") + p.add_argument("--no-throttle", action="store_true", help="skip usage-ceiling checks") + p.add_argument("--verbose", action="store_true", help="print before/after for changed items") + args = p.parse_args() + return asyncio.run(_run( + apply=args.apply and not args.dry_run, limit=args.limit, sample=args.sample, + throttle=not args.no_throttle, verbose=args.verbose, + )) + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/scripts/halacha_panel_approve.py b/scripts/halacha_panel_approve.py index 7d00cc6..a134b4c 100644 --- a/scripts/halacha_panel_approve.py +++ b/scripts/halacha_panel_approve.py @@ -50,24 +50,17 @@ from pathlib import Path import httpx -from legal_mcp.services import claude_session, db +from legal_mcp.services import db, panel_judges +# Judges are the shared primitive (G2) — #152 lifted them to services/panel_judges. 
+from legal_mcp.services.panel_judges import ( + DEEPSEEK_KEY, + GEMINI_KEY, + judge_claude, + judge_deepseek, + judge_gemini, +) -# ── keys (local files, same pattern as the other local judges) ── - -def _env_key(name: str, *files: str) -> str: - for f in files: - p = Path(f).expanduser() - if p.exists(): - for line in p.read_text().splitlines(): - if line.startswith(name + "="): - return line.split("=", 1)[1].strip() - return os.environ.get(name, "") - - -DEEPSEEK_KEY = _env_key("DEEPSEEK_API_KEY", "~/.hermes/profiles/deepseek/.env", "~/.env") -# canonical Infisical name is GOOGLE_GEMINI_API_KEY (/external-apis/gemini); accept -# the bare GEMINI_API_KEY too for back-compat. -GEMINI_KEY = _env_key("GOOGLE_GEMINI_API_KEY", "~/.env") or _env_key("GEMINI_API_KEY", "~/.env") +_bool = panel_judges.to_bool # ── the two coarse questions (the reliable axis — NOT the fuzzy sub-type) ── @@ -99,62 +92,6 @@ def _nli_user(h: dict) -> str: return f"כלל:\n{h.get('rule_statement') or ''}\n\nציטוט:\n{h.get('supporting_quote') or ''}" -# ── three judges, one signature: (system, user) -> dict|None ── - -async def judge_claude(system: str, user: str) -> dict | None: - try: - return await claude_session.query_json(user, system=system) - except Exception: - return None - - -async def judge_deepseek(client: httpx.AsyncClient, system: str, user: str) -> dict | None: - if not DEEPSEEK_KEY: - return None - try: - r = await client.post( - "https://api.deepseek.com/v1/chat/completions", - headers={"Authorization": f"Bearer {DEEPSEEK_KEY}", "Content-Type": "application/json"}, - json={"model": "deepseek-chat", "temperature": 0, "max_tokens": 120, - "response_format": {"type": "json_object"}, - "messages": [{"role": "system", "content": system}, - {"role": "user", "content": user}]}, - timeout=90, - ) - r.raise_for_status() - return json.loads(r.json()["choices"][0]["message"]["content"]) - except Exception: - return None - - -async def judge_gemini(client: httpx.AsyncClient, system: str, 
user: str) -> dict | None: - if not GEMINI_KEY: - return None - try: - r = await client.post( - f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent?key={GEMINI_KEY}", - headers={"Content-Type": "application/json"}, - json={"system_instruction": {"parts": [{"text": system}]}, - "contents": [{"parts": [{"text": user}]}], - "generationConfig": {"temperature": 0, "maxOutputTokens": 4000, - "responseMimeType": "application/json"}}, - timeout=90, - ) - r.raise_for_status() - return json.loads(r.json()["candidates"][0]["content"]["parts"][0]["text"]) - except Exception: - return None - - -def _bool(d: dict | None, key: str) -> bool | None: - if not isinstance(d, dict) or key not in d: - return None - v = d[key] - if isinstance(v, bool): - return v - return str(v).strip().lower() in ("true", "1", "yes", "כן") - - async def panel_vote(client, system, user, key) -> dict: """Run all three judges; return per-judge bools + the verdict.""" c, ds, gm = await asyncio.gather(