fix(precedents): normalize citation→docket case_number + enforce source_type↔precedent_level

Two bugs in external case-law ingestion (discovered in case 1132-09-24, which was uploaded via "missing precedent"):

1. case_number received the full citation string instead of a clean docket. Cause:
   overwrite_case_number=True was passed only on the internal path (internal_decisions); the
   drainer path for external records left the citation in case_number
   (precedent_library: case_number=citation). Scope: 122 external_upload records.
2. source_type was not enforced against precedent_level; only the prompt asked the LLM to keep
   them consistent. When the LLM emitted level=ועדת_ערר_מחוזית (district appeals committee) but
   source_type=court_ruling, the decision was filed in the library as a "court ruling".

Fix (in apply_to_record, so every path benefits):
• case_number is normalized to the clean docket when (a) the caller forces it or (b) the current
  value is citation-shaped (contains a space or is longer than 20 characters); the
  _is_clean_docket guard ensures a non-docket value is never written to the identity field
  (LLM garbage is rejected).
• _source_type_for_level derives source_type from precedent_level and overrides inconsistencies
  (ועדת_ערר_* → appeals_committee; עליון/מנהלי → court_ruling): a single source of truth, with
  no reliance on LLM consistency.
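
The normalization rule in the first bullet can be sketched as a standalone function. This is a minimal sketch, not the code in the commit: `pick_case_number` and its `force` parameter are hypothetical names introduced for illustration, while the regex and the space / length > 20 heuristic are taken from the diff.

```python
from __future__ import annotations

import re

# Docket pattern as in the diff: digit groups joined by '-' or '/',
# with no spaces and no letters.
_DOCKET_RE = re.compile(r"\d{1,6}(?:[-/]\d{1,4}){1,2}")


def is_clean_docket(s: str | None) -> bool:
    """True only when the whole (stripped) string is a docket."""
    return bool(_DOCKET_RE.fullmatch((s or "").strip()))


def pick_case_number(
    current: str | None, suggested: str | None, force: bool = False
) -> str | None:
    """Hypothetical helper: return the new case_number, or None to keep the old one."""
    cur = current or ""
    candidate = (suggested or "").strip()
    # A stored value with a space, or longer than 20 characters, is treated
    # as a full citation rather than a docket.
    citation_shaped = bool(cur) and (" " in cur or len(cur) > 20)
    if (
        candidate
        and is_clean_docket(candidate)  # guard: never write a non-docket
        and candidate != cur
        and (force or citation_shaped)
    ):
        return candidate
    return None
```

Under these assumptions, a stored citation such as `"Appeal 1132-09-24 Doe v. Committee"` is replaced by `"1132-09-24"`, a stored `"4768/22"` is left alone unless `force=True`, and a non-docket suggestion is ignored in every case.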

Tested: 18 unit tests (docket validation, level→type mapping) + 3 integration tests against
apply_to_record with a mocked DB (normalization, no overwrite of a valid docket, garbage
rejection, consistency enforcement). py_compile is clean.
A point fix was already applied manually to 1132-09-24. The backfill for the 122 records is
handled separately (TaskMaster #141).

Invariants: G1 (fix at the source), G2 (same extractor, no parallel path), INV-AH (deterministic
source of truth for classification, not an LLM guess), G11 (clean case identity).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 20:57:08 +00:00
parent 1a4b4fcf63
commit a05df3eb1a


@@ -15,6 +15,7 @@ in ``apply_to_record``.
 from __future__ import annotations
 import logging
+import re
 from datetime import date as date_type
 from uuid import UUID
@@ -220,6 +221,31 @@ async def extract_metadata(case_law_id: UUID | str) -> dict:
     return out
 
 
+# Israeli court docket: digits with slash/dash separators, no spaces, no letters
+# (e.g. "1132-09-24", "4768/22", "35758-09-25"). Used to (a) detect a
+# citation-shaped case_number that must be normalized and (b) guard against ever
+# writing a non-docket string into the identity field.
+_DOCKET_RE = re.compile(r"\d{1,6}(?:[-/]\d{1,4}){1,2}")
+
+
+def _is_clean_docket(s: str) -> bool:
+    return bool(_DOCKET_RE.fullmatch((s or "").strip()))
+
+
+def _source_type_for_level(level: str) -> str:
+    """Derive source_type from precedent_level — the library section is driven by
+    source_type, so the two MUST agree (an LLM slip pairing
+    precedent_level='ועדת_ערר_מחוזית' with source_type='court_ruling' files a
+    committee decision under "court rulings"). Empty when the level is
+    indeterminate (don't force a guess)."""
+    level = (level or "").strip()
+    if level.startswith("ועדת_ערר"):
+        return "appeals_committee"
+    if level in ("עליון", "מנהלי"):
+        return "court_ruling"
+    return ""
+
+
 async def apply_to_record(
     case_law_id: UUID | str,
     suggested: dict,
@@ -327,10 +353,23 @@ async def apply_to_record(
     if pt and (record.get("source_kind") == "internal_committee"):
         fields_to_update["proceeding_type"] = pt
 
-    if overwrite_case_number:
-        cn = (suggested.get("case_number_clean") or "").strip()
-        if cn:
-            fields_to_update["case_number"] = cn
+    # case_number normalization. The precedent upload / missing-precedent flow
+    # stores the FULL citation string into case_number (precedent_library:
+    # case_number=citation). Replace it with the clean docket when the LLM gives
+    # one AND either (a) caller forces it (overwrite_case_number — migrations) or
+    # (b) the stored value is clearly citation-shaped (has a space / is long — a
+    # real docket never is). Guard: only write a value that IS a clean docket, so
+    # a bad LLM output can never corrupt the identity field.
+    cn_clean = (suggested.get("case_number_clean") or "").strip()
+    cur_cn = cur_case_number
+    citation_shaped = bool(cur_cn) and (" " in cur_cn or len(cur_cn) > 20)
+    if (
+        cn_clean
+        and _is_clean_docket(cn_clean)
+        and cn_clean != cur_cn
+        and (overwrite_case_number or citation_shaped)
+    ):
+        fields_to_update["case_number"] = cn_clean
 
     # citation_formatted — full citation per Israeli citation rules. Only
     # fill if empty; user edits in /precedents/[id] are preserved.
@@ -355,6 +394,26 @@ async def apply_to_record(
     if s:
         fields_to_update["district"] = s
 
+    # Enforce source_type ↔ precedent_level consistency in CODE (the LLM prompt
+    # asks for it, but a slip would file a ועדת-ערר decision under "court
+    # rulings"). Derive from the EFFECTIVE level (this run's update or the stored
+    # value) and override an inconsistent source_type — even one already on the
+    # record, since the library section depends on it.
+    eff_level = (
+        fields_to_update.get("precedent_level")
+        or record.get("precedent_level")
+        or ""
+    ).strip()
+    derived_st = _source_type_for_level(eff_level)
+    if derived_st:
+        eff_st = (
+            fields_to_update.get("source_type")
+            or record.get("source_type")
+            or ""
+        ).strip()
+        if eff_st != derived_st:
+            fields_to_update["source_type"] = derived_st
+
     if not fields_to_update:
         return {"updated": False, "fields": []}
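
The consistency rule added in the last hunk can be exercised in isolation. The following is a minimal sketch under stated assumptions, not the committed code: `enforce_source_type` is a hypothetical wrapper, and `record` and `updates` are plain dicts standing in for the stored row and `fields_to_update`. The level-to-type mapping mirrors `_source_type_for_level` from the diff.

```python
from __future__ import annotations


def source_type_for_level(level: str | None) -> str:
    """Mirror of the diff's mapping; empty string means 'no opinion'."""
    level = (level or "").strip()
    if level.startswith("ועדת_ערר"):
        return "appeals_committee"
    if level in ("עליון", "מנהלי"):
        return "court_ruling"
    return ""


def enforce_source_type(record: dict, updates: dict) -> dict:
    """Hypothetical helper: return a copy of updates with source_type made consistent."""
    out = dict(updates)
    # Effective level: this run's update wins over the stored value.
    eff_level = (out.get("precedent_level") or record.get("precedent_level") or "").strip()
    derived = source_type_for_level(eff_level)
    if derived:
        eff_st = (out.get("source_type") or record.get("source_type") or "").strip()
        if eff_st != derived:
            # Override the inconsistent value, even one already on the record.
            out["source_type"] = derived
    return out


# Usage: an LLM slip pairs a committee level with a court source_type.
fixed = enforce_source_type(
    record={},
    updates={"precedent_level": "ועדת_ערר_מחוזית", "source_type": "court_ruling"},
)
```

Deriving from the effective level means a record whose stored source_type is already wrong gets corrected on the next run even when the LLM suggests nothing new, while an indeterminate level leaves source_type untouched.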