feat(precedents): metadata auto-fill, edit sheet, persuasive extraction

Three improvements to the precedent library based on usage feedback: 1. Auto-fill metadata at upload time. New service precedent_metadata_extractor reads the ruling's full_text and suggests case_name (short), summary, headnote, key_quote, subject_tags, appeal_subtype. The merge policy fills only empty fields, preserving everything the chair typed in the upload form. Wired into the ingest pipeline; also exposed as a re-run endpoint POST /api/precedent-library/{id}/extract-metadata for existing records. 2. Edit sheet in the UI. Pencil icon on each library row opens a pre-populated form covering every field. A Sparkles button on the sheet runs the metadata extractor on demand and refreshes the form. The case_number is read-only because halachot are FK'd to it; renaming requires delete + re-upload. 3. Halacha extractor branches on is_binding. Sources marked binding (Supreme/Administrative) keep the strict halacha prompt. Non-binding sources (other appeals committees, district courts on planning matters) get a different prompt that extracts applications, interpretive principles, and persuasive conclusions — labeled with new rule_types 'application' and 'persuasive'. The fallback also widens chunk selection: if the chunker labeled nothing as legal_analysis/ruling/conclusion, we now run on all chunks rather than returning zero halachot for a usable ruling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 10:19:35 +00:00
parent b51163b67c
commit 73a79ea7e8
10 changed files with 841 additions and 21 deletions
--- a/mcp-server/src/legal_mcp/services/precedent_metadata_extractor.py
+++ b/mcp-server/src/legal_mcp/services/precedent_metadata_extractor.py
@@ -0,0 +1,216 @@
+"""Auto-extract precedent metadata from a freshly-uploaded ruling.
+
+Runs after chunking. Reads the precedent's full_text and asks Claude to
+fill in the metadata fields that an upload form usually leaves empty:
+short case_name, summary, headnote, key_quote, subject_tags,
+appeal_subtype.
+
+Caller policy: only empty user-supplied fields are filled. Anything the
+chair already typed in the upload form is preserved. This is enforced
+in ``apply_to_record``.
+"""
+
+from __future__ import annotations
+
+import logging
+from uuid import UUID
+
+from legal_mcp.config import parse_llm_json
+from legal_mcp.services import claude_session, db
+
+logger = logging.getLogger(__name__)
+
+
+# The prompt is short — we only need the first 12K chars of the ruling
+# (header + opening of discussion is enough for naming + summary). For
+# subject tags we sample the discussion section too.
+_HEAD_CHARS = 12_000
+_TAIL_CHARS = 6_000
+
+
+METADATA_EXTRACTION_PROMPT = """אתה מסייע משפטי בכיר. קרא את פסק הדין/ההחלטה הבא וחלץ ממנו מטא-דאטה לקטלוג הקורפוס.
+
+המטרה: למלא שדות בטופס העלאה שהמשתמש הזין באופן חלקי. **אל תמציא** — אם המידע לא מופיע בטקסט, השאר ריק (מחרוזת ריקה / מערך ריק).
+
+## פלט נדרש
+החזר JSON אחד (object — לא array) בפורמט הבא, ללא markdown וללא הסברים:
+
+{
+  "case_name_short": "שם קצר ל-3-6 מילים (למשל 'אהרון ברק' או 'ב. קרן-נכסים'). אל תכלול מספר תיק. שם המבקש/העורר העיקרי. אם זו החלטה מאוחדת — שם הצד המוביל.",
+  "appeal_subtype": "תת-סוג ספציפי בתוך תחום המשפט (למשל 'תכנית רחביה', 'מימוש במכר', 'תמ\\"א 38', 'שימוש חורג', 'סופיות ההחלטה'). מילה אחת או צירוף קצר.",
+  "summary": "תקציר עניני 2-3 משפטים: מה הייתה השאלה, מה הוכרע. בלי שיפוט.",
+  "headnote": "headnote בסגנון נבו: 1-2 משפטים שמסכמים את העיקרון שנקבע/יושם בפסק. למשל 'תכנית רחביה — היטל השבחה במימוש במכר — אין לחייב כשהזכויות צפות'.",
+  "key_quote": "ציטוט מילולי בודד, 30-100 מילים, שמייצג את לב הפסק. חייב להופיע מילה במילה בטקסט. אם אין ציטוט מתאים — מחרוזת ריקה.",
+  "subject_tags": ["תגיות", "נושא", "בעברית"]
+}
+
+## כללי איכות
+1. **case_name_short** — שם בולט וקצר. בלי 'נ\\'' / 'נגד' / מספרי תיק.
+2. **appeal_subtype** — אופציונלי. אם הסוגיה רחבה ולא מסווגת — השאר ריק.
+3. **summary** — תיאור ניטרלי, גוף שלישי.
+4. **headnote** — לא מצטטים, מסכמים. סגנון נבו: ביטוי קצר אחד.
+5. **key_quote** — חייב להיות הדבקה מילולית מהקלט. אם אין ציטוט בולט — השאר ריק.
+6. **subject_tags** — 3-7 תגיות בעברית, snake_case (חניה, קווי_בניין, שיקול_דעת, פגם_פרוצדורלי, סמכות, מועדים, פגיעה_במקרקעין, ירידת_ערך, תכנית_רחביה, מימוש_במכר, וכד'). שייך לתחום של ועדת ערר תכנון ובניה.
+
+## הקלט
+{context}
+
+--- תחילת הטקסט ---
+{text_window}
+--- סוף הטקסט ---
+"""
+
+
+def _build_text_window(full_text: str) -> str:
+    """Return the head + tail of the ruling, with a marker if truncated.
+
+    Most rulings have the parties/subject in the head and the conclusion
+    in the tail; the middle is the discussion which is captured via the
+    halacha extractor independently. Sending head+tail keeps the prompt
+    cheap while preserving naming and conclusion context.
+    """
+    if len(full_text) <= _HEAD_CHARS + _TAIL_CHARS:
+        return full_text
+    return (
+        full_text[:_HEAD_CHARS]
+        + "\n\n[... חלק האמצע הושמט עקב אורך — ראה את החלק האחרון של הפסק להלן ...]\n\n"
+        + full_text[-_TAIL_CHARS:]
+    )
+
+
+async def extract_metadata(case_law_id: UUID | str) -> dict:
+    """Run metadata extraction. Returns a dict with the suggested values.
+
+    Does NOT write to the DB — caller decides what to merge.
+    """
+    if isinstance(case_law_id, str):
+        case_law_id = UUID(case_law_id)
+
+    record = await db.get_case_law(case_law_id)
+    if not record:
+        return {}
+    full_text = (record.get("full_text") or "").strip()
+    if not full_text:
+        return {}
+
+    citation = record.get("case_number") or ""
+    court = record.get("court") or ""
+    date_str = str(record.get("date") or "")
+    practice_area = record.get("practice_area") or ""
+
+    context = (
+        f"מראה מקום: {citation}\n"
+        f"ערכאה: {court}\n"
+        f"תאריך: {date_str}\n"
+        f"תחום: {practice_area}"
+    )
+    prompt = METADATA_EXTRACTION_PROMPT.format(
+        context=context, text_window=_build_text_window(full_text),
+    )
+
+    try:
+        result = await claude_session.query_json(prompt)
+    except Exception as e:
+        logger.warning("precedent_metadata_extractor: query failed: %s", e)
+        return {}
+
+    if not isinstance(result, dict):
+        logger.warning(
+            "precedent_metadata_extractor: expected dict, got %s",
+            type(result).__name__,
+        )
+        return {}
+
+    # Normalize keys / types
+    out: dict = {}
+    if isinstance(result.get("case_name_short"), str):
+        out["case_name_short"] = result["case_name_short"].strip()
+    if isinstance(result.get("appeal_subtype"), str):
+        out["appeal_subtype"] = result["appeal_subtype"].strip()
+    if isinstance(result.get("summary"), str):
+        out["summary"] = result["summary"].strip()
+    if isinstance(result.get("headnote"), str):
+        out["headnote"] = result["headnote"].strip()
+    if isinstance(result.get("key_quote"), str):
+        out["key_quote"] = result["key_quote"].strip()
+    tags = result.get("subject_tags") or []
+    if isinstance(tags, list):
+        out["subject_tags"] = [str(t).strip() for t in tags if str(t).strip()]
+    return out
+
+
+async def apply_to_record(
+    case_law_id: UUID | str,
+    suggested: dict,
+) -> dict:
+    """Merge suggested metadata into the case_law row, filling ONLY empty fields.
+
+    Empty rules:
+      - string field == "" → fill from suggested
+      - list field == [] → fill from suggested
+      - if suggested key is missing or empty, skip
+
+    case_name has special handling: if the current case_name equals the
+    case_number (a tell-tale sign of the upload form sending the long
+    citation into both fields), treat it as empty and overwrite.
+    """
+    if isinstance(case_law_id, str):
+        case_law_id = UUID(case_law_id)
+    record = await db.get_case_law(case_law_id)
+    if not record:
+        return {"updated": False, "fields": []}
+
+    fields_to_update: dict = {}
+
+    cur_case_name = (record.get("case_name") or "").strip()
+    cur_case_number = (record.get("case_number") or "").strip()
+    suggested_case_name = (suggested.get("case_name_short") or "").strip()
+    if suggested_case_name and (
+        not cur_case_name or cur_case_name == cur_case_number
+    ):
+        fields_to_update["case_name"] = suggested_case_name
+
+    if not (record.get("appeal_subtype") or "").strip():
+        s = (suggested.get("appeal_subtype") or "").strip()
+        if s:
+            fields_to_update["appeal_subtype"] = s
+
+    if not (record.get("summary") or "").strip():
+        s = (suggested.get("summary") or "").strip()
+        if s:
+            fields_to_update["summary"] = s
+
+    if not (record.get("headnote") or "").strip():
+        s = (suggested.get("headnote") or "").strip()
+        if s:
+            fields_to_update["headnote"] = s
+
+    if not (record.get("key_quote") or "").strip():
+        s = (suggested.get("key_quote") or "").strip()
+        if s:
+            fields_to_update["key_quote"] = s
+
+    cur_tags = record.get("subject_tags") or []
+    if not cur_tags:
+        sug_tags = suggested.get("subject_tags") or []
+        if sug_tags:
+            fields_to_update["subject_tags"] = sug_tags
+
+    if not fields_to_update:
+        return {"updated": False, "fields": []}
+
+    await db.update_case_law(case_law_id, **fields_to_update)
+    return {"updated": True, "fields": list(fields_to_update.keys())}
+
+
+async def extract_and_apply(case_law_id: UUID | str) -> dict:
+    """Convenience wrapper: extract → merge into row → return summary."""
+    suggested = await extract_metadata(case_law_id)
+    if not suggested:
+        return {"status": "no_metadata", "fields": []}
+    result = await apply_to_record(case_law_id, suggested)
+    return {
+        "status": "completed" if result["updated"] else "no_changes",
+        "fields": result["fields"],
+        "suggested": suggested,
+    }