feat(precedents): metadata extractor also fills date, level, court

The first end-to-end run on 403-17 surfaced three fields the auto-fill left blank because the chair didn't set them in the upload form: date, precedent_level, and court. All three are right there in the ruling's header text — there's no reason to require manual entry. Prompt now asks for: - decision_date_iso (YYYY-MM-DD parsed from "ניתנה היום, … 5 בספטמבר 2022" style signatures) - precedent_level (closed enum: עליון/מנהלי/ועדת_ערר_ארצית/ועדת_ערר_מחוזית) - court (the full court name from the title block) Validation is unchanged: precedent_level only accepts the four enum values; decision_date_iso is parsed into a Python date object before being handed to update_case_law (asyncpg doesn't coerce strings to DATE columns); court is stored verbatim. Merge policy is unchanged — only fills empty fields. Anything the chair typed in the upload form survives. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 12:16:03 +00:00
parent fc3b6b6cae
commit 6420fe4b0b
1 changed files with 42 additions and 2 deletions
--- a/mcp-server/src/legal_mcp/services/precedent_metadata_extractor.py
+++ b/mcp-server/src/legal_mcp/services/precedent_metadata_extractor.py
@@ -3,7 +3,7 @@
 Runs after chunking. Reads the precedent's full_text and asks Claude to
 fill in the metadata fields that an upload form usually leaves empty:
 short case_name, summary, headnote, key_quote, subject_tags,
-appeal_subtype.
+appeal_subtype, decision_date, precedent_level, court.

 Caller policy: only empty user-supplied fields are filled. Anything the
 chair already typed in the upload form is preserved. This is enforced
@@ -13,6 +13,7 @@ in ``apply_to_record``.
 from __future__ import annotations

 import logging
+from datetime import date as date_type
 from uuid import UUID

 from legal_mcp.config import parse_llm_json
@@ -45,7 +46,10 @@ METADATA_EXTRACTION_PROMPT = """אתה מסייע משפטי בכיר. קרא א
  "summary": "תקציר עניני 2-3 משפטים: מה הייתה השאלה, מה הוכרע. בלי שיפוט.",
  "headnote": "headnote בסגנון נבו: 1-2 משפטים שמסכמים את העיקרון שנקבע/יושם בפסק. למשל 'תכנית רחביה — היטל השבחה במימוש במכר — אין לחייב כשהזכויות צפות'.",
  "key_quote": "ציטוט מילולי בודד, 30-100 מילים, שמייצג את לב הפסק. חייב להופיע מילה במילה בטקסט. אם אין ציטוט מתאים — מחרוזת ריקה.",
-  "subject_tags": ["תגיות", "נושא", "בעברית"]
+  "subject_tags": ["תגיות", "נושא", "בעברית"],
+  "decision_date_iso": "YYYY-MM-DD — תאריך מתן ההחלטה כפי שמופיע בטקסט (בכותרת או בחתימה הסופית). אם לא ניתן לזהות במדויק — מחרוזת ריקה.",
+  "precedent_level": "אחד מ-4: 'עליון' / 'מנהלי' / 'ועדת_ערר_ארצית' / 'ועדת_ערר_מחוזית'. בחר לפי הערכאה שמסומנת בכותרת הפסק. אם לא ברור — מחרוזת ריקה.",
+  "court": "שם הערכאה כפי שהוא מופיע בכותרת (למשל 'בית המשפט העליון', 'בית המשפט המחוזי בירושלים בשבתו כבית משפט לעניינים מנהליים', 'ועדת הערר לתכנון ובניה פיצויים והיטלי השבחה — מחוז ירושלים'). מחרוזת ריקה אם לא ניתן לזהות."
 }

 ## כללי איכות
@@ -55,6 +59,9 @@ METADATA_EXTRACTION_PROMPT = """אתה מסייע משפטי בכיר. קרא א
 4. **headnote** — לא מצטטים, מסכמים. סגנון נבו: ביטוי קצר אחד.
 5. **key_quote** — חייב להיות הדבקה מילולית מהקלט. אם אין ציטוט בולט — השאר ריק.
 6. **subject_tags** — 3-7 תגיות בעברית, snake_case (חניה, קווי_בניין, שיקול_דעת, פגם_פרוצדורלי, סמכות, מועדים, פגיעה_במקרקעין, ירידת_ערך, תכנית_רחביה, מימוש_במכר, וכד'). שייך לתחום של ועדת ערר תכנון ובניה.
+7. **decision_date_iso** — תאריך מדויק בלבד. אם בטקסט יש "ניתנה היום, ט' באלול תשפ"א, 5 בספטמבר 2022" — הפלט: "2022-09-05".
+8. **precedent_level** — קבע לפי הערכאה: בית המשפט העליון = "עליון"; בית משפט מחוזי בשבתו כבית משפט לעניינים מנהליים = "מנהלי"; ועדת ערר ארצית = "ועדת_ערר_ארצית"; ועדת ערר מחוזית (כמו ועדות תכנון ובניה ירושלים/מחוז המרכז וכד') = "ועדת_ערר_מחוזית". השתמש ב-underscore כפי שמופיע — לא ברווח.
+9. **court** — מהכותרת הראשית של הפסק. ניסוח מלא (לא קיצור). מחרוזת ריקה אם לא ניתן לזהות.
 """


@@ -139,6 +146,15 @@ async def extract_metadata(case_law_id: UUID | str) -> dict:
    tags = result.get("subject_tags") or []
    if isinstance(tags, list):
        out["subject_tags"] = [str(t).strip() for t in tags if str(t).strip()]
+    if isinstance(result.get("decision_date_iso"), str):
+        out["decision_date_iso"] = result["decision_date_iso"].strip()
+    if isinstance(result.get("precedent_level"), str):
+        # Validate against the closed enum used elsewhere in the system
+        lvl = result["precedent_level"].strip()
+        if lvl in {"עליון", "מנהלי", "ועדת_ערר_ארצית", "ועדת_ערר_מחוזית"}:
+            out["precedent_level"] = lvl
+    if isinstance(result.get("court"), str):
+        out["court"] = result["court"].strip()
    return out


@@ -199,6 +215,30 @@ async def apply_to_record(
        if sug_tags:
            fields_to_update["subject_tags"] = sug_tags

+    # decision_date — only fill if currently null. The DB column is DATE,
+    # so we parse the LLM's ISO string into a date object before passing
+    # it to update_case_law (asyncpg won't coerce a string to DATE).
+    if record.get("date") is None:
+        iso = (suggested.get("decision_date_iso") or "").strip()
+        if iso:
+            try:
+                fields_to_update["date"] = date_type.fromisoformat(iso[:10])
+            except ValueError:
+                logger.debug(
+                    "metadata_extractor: ignoring invalid decision_date_iso=%r",
+                    iso,
+                )
+
+    if not (record.get("precedent_level") or "").strip():
+        lvl = (suggested.get("precedent_level") or "").strip()
+        if lvl:
+            fields_to_update["precedent_level"] = lvl
+
+    if not (record.get("court") or "").strip():
+        c = (suggested.get("court") or "").strip()
+        if c:
+            fields_to_update["court"] = c
+
    if not fields_to_update:
        return {"updated": False, "fields": []}