feat: external precedent library with auto halacha extraction

Adds a third corpus of legal authority distinct from style_corpus (Daphna's prior decisions for voice) and case_precedents (chair-attached quotes per case). The new corpus holds chair-uploaded court rulings and other appeals committee decisions, with binding rules (הלכות) extracted automatically and queued for chair approval. Pipeline (web/app.py + services/precedent_library.py): file → extract → chunk → Voyage embed → halacha_extractor → store + publish progress over the existing Redis SSE channel. Schema V7 (services/db.py): extends case_law with source_kind + extraction status fields under a CHECK constraint pinning practice_area to the three appeals committee domains (rishuy_uvniya, betterment_levy, compensation_197). New precedent_chunks (vector(1024)) and halachot tables (vector(1024) over rule_statement, IVFFlat indexes, gin on practice_areas/subject_tags). Halachot start as pending_review; only approved/published rows are visible to search_precedent_library. Agents: legal-writer, legal-researcher, legal-analyst, legal-ceo, legal-qa get search_precedent_library. legal-writer prompt explains the three-corpus distinction and CREAC use; legal-qa now verifies that every cited halacha resolves to an approved row in the corpus. UI: /precedents page with four tabs — library / semantic search / pending review (J/K nav, A/R/E shortcuts, badge count) / stats. Reuses the existing upload-sheet progress + SSE pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 08:38:18 +00:00
parent a6edb75bbf
commit 7ee90dce31
23 changed files with 3853 additions and 67 deletions
--- a/mcp-server/src/legal_mcp/services/halacha_extractor.py
+++ b/mcp-server/src/legal_mcp/services/halacha_extractor.py
@@ -0,0 +1,326 @@
+"""Extract binding legal rules (הלכות) from external court rulings.
+
+Runs Claude (via the local headless ``claude -p`` bridge) over the
+legal_analysis / ruling / conclusion chunks of a precedent, returns a
+structured list of halachot, validates each one against the source text,
+embeds the rule statement, and stores everything as ``pending_review`` in
+the ``halachot`` table.
+
+All extraction is idempotent — calling ``extract(case_law_id)`` twice
+deletes prior rows for that precedent first.
+
+Trust model:
+    Per chair decision, NO halacha is auto-published. Every extracted
+    halacha enters with ``review_status='pending_review'``. The chair
+    approves/rejects via the UI, and only ``approved`` (or ``published``)
+    rows are visible to ``search_precedent_library`` and the writing
+    agents.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import logging
+import re
+from uuid import UUID
+
+from legal_mcp import config
+from legal_mcp.config import parse_llm_json
+from legal_mcp.services import claude_session, db, embeddings, proofreader
+
+logger = logging.getLogger(__name__)
+
+
+# Concurrency model mirrors claims_extractor — each ``claude -p`` subprocess
+# holds ~300 MB RSS, so we cap parallel chunks to keep the box healthy.
+CHUNK_CONCURRENCY = 3
+CHUNK_RETRY_ATTEMPTS = 1
+
+# Sections from which to extract. facts/intro/appellant_claims/respondent_claims
+# never contain holdings, only positions, so we skip them.
+EXTRACTABLE_SECTIONS = ("legal_analysis", "ruling", "conclusion")
+
+
+HALACHA_EXTRACTION_PROMPT = """אתה משפטן בכיר המתמחה בדיני תכנון ובניה (ועדות ערר, היטל השבחה, פיצויים לפי סעיף 197 לחוק התכנון והבניה). תפקידך: לחלץ הלכות מחייבות מתוך פסק דין/החלטה משפטית.
+
+## הגדרות מחייבות
+
+הלכה (binding rule) = כלל משפטי שהפסק קובע או מאמץ ומיישם, באופן שניתן להסתמך עליו בהחלטות עתידיות.
+
+לא-הלכה (אין לחלץ):
+- אמרת אגב (obiter dicta) — הערות שאינן הכרחיות להכרעה.
+- ממצאים עובדתיים ספציפיים לתיק ("העורר לא הוכיח X").
+- ציטוטי הלכות מפסקי דין אחרים שלא אומצו במפורש בפסק זה.
+- הצהרות על דין קיים שאינן מיושמות בהכרעה.
+
+הבחנה קריטית: כאשר הפסק מצטט הלכה מפסק קודם, חלץ אותה רק אם בית המשפט בפסק הנוכחי **מאמץ ומחיל** אותה (לא רק מזכיר אותה ברקע).
+
+## תחומים אפשריים (practice_areas) — תחומי ועדת הערר בלבד
+- rishuy_uvniya — רישוי ובניה (תיקי 1xxx: היתרים, שימוש חורג, תכניות, קווי בניין, גובה, חניה)
+- betterment_levy — היטל השבחה (תיקי 8xxx: שומה, מערכות, תכניות המקנות בה, מועד קובע, סופיות ההחלטה)
+- compensation_197 — פיצויים לפי ס' 197 (תיקי 9xxx: פגיעה במקרקעין, ירידת ערך, ס' 200/פטור)
+
+הלכה אחת יכולה לחול על כמה תחומים — practice_areas הוא array ולא string יחיד.
+
+## סוגי הלכה (rule_type)
+- binding — הלכה מחייבת שהוחלה על התיק.
+- interpretive — פרשנות סעיף חוק/תכנית שאומצה.
+- procedural — כלל פרוצדורלי (סמכות, מועדים, הליכי שמיעה).
+- obiter — אמרת אגב חשובה (חלץ רק אם משמעותית; סמן confidence נמוך).
+
+## פלט נדרש
+החזר JSON array בלבד, ללא markdown, ללא הסברים. דוגמה:
+[
+  {
+    "rule_statement": "ניסוח הכלל בלשון משפטית מדויקת בגוף שלישי, 1-3 משפטים.",
+    "rule_type": "binding",
+    "reasoning_summary": "תמצית ההיגיון: למה בית המשפט הגיע לכלל הזה (1-2 משפטים).",
+    "supporting_quote": "ציטוט מילולי מדויק מהפסק התומך בכלל. חייב להופיע מילה במילה בטקסט הקלט.",
+    "page_reference": "פס' 12 / עמ' 8 — ככל שניתן לזהות מהקלט.",
+    "practice_areas": ["betterment_levy"],
+    "subject_tags": ["מועד_קביעת_שומה", "סופיות_ההחלטה"],
+    "cites": ["עע\\"מ 3975/22"],
+    "confidence": 0.85
+  }
+]
+
+## כללי איכות
+1. **נאמנות מוחלטת לציטוט** — supporting_quote חייב להיות הדבקה מדויקת מהקלט. אם אין ציטוט מתאים — אל תמציא הלכה.
+2. **מספר הלכות** — פסק רגיל מכיל 1-4 הלכות מחייבות. אל תמתח את הרשימה. אם אין הלכה — החזר [].
+3. **לא לפצל יתר על המידה** — אם שני סעיפים מבטאים את אותו עיקרון, אחד את הניסוח.
+4. **שפה** — rule_statement בעברית משפטית מקצועית, לא צמצום מילולי של הציטוט.
+5. **subject_tags** — 2-5 תגיות בעברית, snake_case (חניה, קווי_בניין, שיקול_דעת, פגם_פרוצדורלי, סמכות, מועדים, פגיעה_במקרקעין, ירידת_ערך).
+6. **confidence** — 0..1. מתחת ל-0.7 = ספק לגבי היות זה הלכה מחייבת.
+"""
+
+
+_VALID_PRACTICE_AREAS = {"rishuy_uvniya", "betterment_levy", "compensation_197"}
+_VALID_RULE_TYPES = {"binding", "interpretive", "procedural", "obiter"}
+
+
+def _normalize_for_comparison(text: str) -> str:
+    """Normalize Hebrew text for substring matching.
+
+    Collapses whitespace and unifies the half-dozen Hebrew quote-mark
+    variants. Use ``proofreader._fix_hebrew_quotes`` for the quote part
+    so we stay consistent with the proofreader pipeline.
+    """
+    fixed = proofreader._fix_hebrew_quotes(text)
+    # Collapse all whitespace (newlines, tabs, multiple spaces) to a single space.
+    return re.sub(r"\s+", " ", fixed).strip()
+
+
+def _verify_quote(supporting_quote: str, full_text: str) -> bool:
+    """Return True if ``supporting_quote`` appears verbatim in ``full_text``
+    after Hebrew quote/whitespace normalization.
+
+    The LLM occasionally trims a leading/trailing word from the quote;
+    we accept the quote if at least 90% of its characters match a
+    contiguous substring of the source.
+    """
+    if not supporting_quote.strip():
+        return False
+    normalized_quote = _normalize_for_comparison(supporting_quote)
+    normalized_text = _normalize_for_comparison(full_text)
+    if not normalized_quote:
+        return False
+    if normalized_quote in normalized_text:
+        return True
+    # Fallback: try the inner 90% of the quote (drops boundary trim).
+    if len(normalized_quote) >= 30:
+        trim = max(2, len(normalized_quote) // 20)
+        inner = normalized_quote[trim:-trim]
+        if inner and inner in normalized_text:
+            return True
+    return False
+
+
+def _coerce_halacha(raw: dict) -> dict | None:
+    """Validate and normalize one LLM-returned halacha dict.
+
+    Returns ``None`` if the entry is missing required fields.
+    """
+    if not isinstance(raw, dict):
+        return None
+    rule_statement = (raw.get("rule_statement") or "").strip()
+    supporting_quote = (raw.get("supporting_quote") or "").strip()
+    if not rule_statement or not supporting_quote:
+        return None
+
+    rule_type = (raw.get("rule_type") or "binding").strip().lower()
+    if rule_type not in _VALID_RULE_TYPES:
+        rule_type = "binding"
+
+    practice_areas_raw = raw.get("practice_areas") or []
+    if isinstance(practice_areas_raw, str):
+        practice_areas_raw = [practice_areas_raw]
+    practice_areas = [p for p in practice_areas_raw if p in _VALID_PRACTICE_AREAS]
+
+    subject_tags_raw = raw.get("subject_tags") or []
+    if isinstance(subject_tags_raw, str):
+        subject_tags_raw = [subject_tags_raw]
+    subject_tags = [str(t).strip() for t in subject_tags_raw if str(t).strip()]
+
+    cites_raw = raw.get("cites") or []
+    if isinstance(cites_raw, str):
+        cites_raw = [cites_raw]
+    cites = [str(c).strip() for c in cites_raw if str(c).strip()]
+
+    try:
+        confidence = float(raw.get("confidence", 0.0))
+    except (TypeError, ValueError):
+        confidence = 0.0
+    confidence = max(0.0, min(1.0, confidence))
+
+    return {
+        "rule_statement": rule_statement,
+        "rule_type": rule_type,
+        "reasoning_summary": (raw.get("reasoning_summary") or "").strip(),
+        "supporting_quote": supporting_quote,
+        "page_reference": (raw.get("page_reference") or "").strip(),
+        "practice_areas": practice_areas,
+        "subject_tags": subject_tags,
+        "cites": cites,
+        "confidence": confidence,
+    }
+
+
+async def _extract_chunk(
+    chunk_text: str,
+    section_type: str,
+    chunk_index: int,
+    chunk_total: int,
+    context: str,
+) -> list[dict]:
+    """Run the halacha extractor on one chunk with retry."""
+    chunk_label = f" (חלק {chunk_index + 1}/{chunk_total})" if chunk_total > 1 else ""
+    prompt = (
+        f"{HALACHA_EXTRACTION_PROMPT}\n\n"
+        f"## הקלט\n"
+        f"סוג קטע: {section_type}\n"
+        f"{context}{chunk_label}\n\n"
+        f"--- תחילת הטקסט ---\n{chunk_text}\n--- סוף הטקסט ---"
+    )
+    last_err: Exception | None = None
+    for attempt in range(CHUNK_RETRY_ATTEMPTS + 1):
+        try:
+            result = await claude_session.query_json(prompt)
+        except Exception as e:
+            last_err = e
+            logger.warning(
+                "halacha_extractor chunk %d/%d attempt %d raised: %s",
+                chunk_index + 1, chunk_total, attempt + 1, e,
+            )
+            continue
+        if isinstance(result, list):
+            return result
+        logger.warning(
+            "halacha_extractor chunk %d/%d attempt %d returned non-list (%s)",
+            chunk_index + 1, chunk_total, attempt + 1, type(result).__name__,
+        )
+    logger.error(
+        "halacha_extractor chunk %d/%d failed after %d attempts: %s",
+        chunk_index + 1, chunk_total, CHUNK_RETRY_ATTEMPTS + 1, last_err,
+    )
+    return []
+
+
+async def extract(case_law_id: UUID | str) -> dict:
+    """Extract halachot from an uploaded precedent and store them.
+
+    Idempotent: replaces any existing halachot for this case_law_id.
+    All inserted rows start as ``review_status='pending_review'``.
+
+    Returns:
+        ``{"status": "...", "extracted": N, "verified": M, "stored": K, ...}``
+    """
+    if isinstance(case_law_id, str):
+        case_law_id = UUID(case_law_id)
+
+    record = await db.get_case_law(case_law_id)
+    if not record:
+        return {"status": "not_found", "extracted": 0, "stored": 0}
+
+    chunks = await db.list_precedent_chunks(
+        case_law_id, section_types=EXTRACTABLE_SECTIONS,
+    )
+    if not chunks:
+        await db.set_case_law_halacha_status(case_law_id, "completed")
+        return {"status": "no_chunks", "extracted": 0, "stored": 0}
+
+    await db.set_case_law_halacha_status(case_law_id, "processing")
+    await db.delete_halachot(case_law_id)
+
+    citation = record.get("case_number", "")
+    court = record.get("court", "")
+    date_str = str(record.get("date") or "")
+    context = f"מקור: {citation} — {court}, {date_str}"
+
+    sem = asyncio.Semaphore(CHUNK_CONCURRENCY)
+
+    async def _bounded(idx: int, chunk_row: dict) -> list[dict]:
+        async with sem:
+            return await _extract_chunk(
+                chunk_row["content"], chunk_row["section_type"],
+                idx, len(chunks), context,
+            )
+
+    chunk_results = await asyncio.gather(
+        *[_bounded(i, c) for i, c in enumerate(chunks)]
+    )
+    raw_halachot: list[dict] = []
+    for items in chunk_results:
+        raw_halachot.extend(items)
+
+    if not raw_halachot:
+        await db.set_case_law_halacha_status(case_law_id, "completed")
+        return {"status": "no_halachot", "extracted": 0, "stored": 0}
+
+    # Validate against the full text of the precedent for the quote check.
+    full_text = record.get("full_text") or ""
+
+    cleaned: list[dict] = []
+    for raw in raw_halachot:
+        coerced = _coerce_halacha(raw)
+        if coerced is None:
+            continue
+        coerced["quote_verified"] = _verify_quote(
+            coerced["supporting_quote"], full_text,
+        )
+        cleaned.append(coerced)
+
+    if not cleaned:
+        await db.set_case_law_halacha_status(case_law_id, "completed")
+        return {"status": "no_valid_halachot", "extracted": len(raw_halachot), "stored": 0}
+
+    # Embed rule_statement + reasoning_summary so semantic search hits the
+    # rule directly rather than the surrounding chunk centroid.
+    embed_inputs = [
+        f"{h['rule_statement']} — {h['reasoning_summary']}".strip(" —")
+        for h in cleaned
+    ]
+    try:
+        vectors = await embeddings.embed_texts(embed_inputs, input_type="document")
+    except Exception as e:
+        logger.error("halacha_extractor: embeddings failed: %s", e)
+        vectors = [None] * len(cleaned)
+
+    for halacha, vec in zip(cleaned, vectors):
+        halacha["embedding"] = vec
+
+    stored = await db.store_halachot(case_law_id, cleaned)
+
+    verified = sum(1 for h in cleaned if h["quote_verified"])
+    await db.set_case_law_halacha_status(case_law_id, "completed")
+
+    logger.info(
+        "halacha_extractor: case_law=%s extracted=%d cleaned=%d verified=%d stored=%d",
+        case_law_id, len(raw_halachot), len(cleaned), verified, stored,
+    )
+    return {
+        "status": "completed",
+        "extracted": len(raw_halachot),
+        "valid": len(cleaned),
+        "verified": verified,
+        "stored": stored,
+    }