fix(precedents): extract the canonical case number from the citation; never use the full citation as the identifier (#137)

When uploading through "missing precedent" (appeals-committee branch) with an
empty case_number form field, the route fell back to the full citation
(committee_case_number = case_number.strip() or citation), so a display string
containing party names was planted in the identifier field, violating
INV-ID2/INV-ID1 (X1). Observed on precedent 1bf0bae0 (ערר 85074-04-25,
רפאל לוי / חולון): case_number=85074/0425, case_name=the full citation.
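
The failure mode comes down to Python's `or` short-circuiting on an empty string. A minimal sketch, assuming the form value and citation below as hypothetical stand-ins for the request fields (only the `case_number.strip() or citation` expression comes from the message above):

```python
# Hypothetical inputs mirroring the report: a blank form field and a full citation.
case_number = "   "
citation = 'ערר (ת"א 85074-04-25) רפאל לוי ואח\' נ\' הוועדה המקומית - חולון'

# The pre-fix expression: the stripped form value is empty (falsy),
# so `or` evaluates to the whole display string.
committee_case_number = case_number.strip() or citation

# The identifier field now carries the full citation, party names included.
leaked_full_citation = committee_case_number == citation
```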

Fix (G1: normalize at the source; G2: reuse the canonical parser):
- court_citation.case_number_from_citation(citation): returns only the
  normalized number token (via classify; '' when there is no number). Correctly
  extracts 85074-04-25 even from 'ערר (ת"א 85074-04-25) ...'. Reuses the single
  parser, with no parallel regex.
- web/app.py (appeals-committee branch): the fallback goes through
  case_number_from_citation; if there is no number, it raises HTTPException 400
  "נא להזין מספר-תיק ידנית" ("please enter a case number manually") instead of
  planting the full citation.
- db._canonical_case_number: hardened to extract the number token (dropping the
  trailing party names), so the identifier field is never stored polluted, even
  on a direct call (committee + active cases). A clean number is returned
  unchanged; no month is invented (X1 §1).
- Data fix: scripts/fix_137_committee_case_number.py (already run) on 1bf0bae0:
  case_number→85074-04-25, case_name→parties, token in citation_formatted.
  Verified to be the only row in internal_committee with canon(num)≠num.
  Idempotent.
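
The fallback rule in the web/app.py bullet can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in web/app.py: `extract_number` is a hypothetical regex stand-in for `court_citation.case_number_from_citation` (which, per the message, goes through `classify`), `ManualCaseNumberRequired` stands in for the HTTPException 400, and `resolve_committee_case_number` is a name invented for this sketch.

```python
import re


class ManualCaseNumberRequired(Exception):
    """Stand-in for the HTTP 400 response (assumption, not the real exception)."""


def extract_number(citation: str) -> str:
    # Hypothetical stand-in for case_number_from_citation: take the first digit
    # run with internal "/" or "-" separators and normalize "/" to "-".
    m = re.search(r"\d[\d/\-]*", citation or "")
    return m.group(0).rstrip("/-").replace("/", "-") if m else ""


def resolve_committee_case_number(form_value: str, citation: str) -> str:
    # Prefer the manually entered number; otherwise extract one from the citation.
    number = (form_value or "").strip() or extract_number(citation)
    if not number:
        # Never fall back to the raw citation as an identifier.
        raise ManualCaseNumberRequired("נא להזין מספר-תיק ידנית")
    return number
```

The point of the design is that the empty result is an error rather than a trigger for another fallback, so a display string can no longer reach the identifier field through this route.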

Out of scope (recorded as follow-ups): the external route (precedent_library)
uses the full citation as a legacy identifier; that is migration item X1 §5
(138 external/cited_only records), not this bug. Prefilling the number in the
/missing-precedents UI requires the Claude Design gate.

Tests: test_court_citation (case_number_from_citation: party-strip/forms/empty),
test_canonical_case_number (harden). All 339 mcp tests pass. Guards are clean.

Invariants: G1 (normalize at the source), INV-ID1/ID2 (normalized identifier,
no full citation as identifier), G2 (single parser), G12 (leak-guard clean).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 03:21:10 +00:00
parent 76a29756c5
commit c27987ba72
7 changed files with 208 additions and 6 deletions


@@ -117,6 +117,23 @@ def normalize_case_number(raw: str) -> str:
     return cleaned.replace("/", "-").strip("-")


+def case_number_from_citation(citation: str) -> str:
+    """Canonical ``case_number`` extracted from a full citation, or ``''``.
+
+    Returns the normalized number token only (e.g. ``85074-04-25``) — NEVER the
+    full citation string with party names / court / date. This is the
+    identifier-field rule from X1 (INV-ID2): a citation like
+    ``ערר (ת"א 85074-04-25) רפאל לוי ואח' נ' הוועדה … - חולון`` yields
+    ``85074-04-25``, not the whole display string.
+
+    Reuses ``classify`` (the one canonical citation parser) so callers that need
+    a case_number out of an arbitrary citation never roll their own regex (#137,
+    G2). Returns ``''`` when no number can be parsed — the caller MUST treat that
+    as "needs a manual case_number" and never fall back to the raw citation.
+    """
+    return classify(citation).case_number_norm
+
+
 def _split_filed(num_norm: str) -> tuple[str, str, str] | None:
     """Split a normalized NNNNN-MM-YY number into (file, month, year).


@@ -1757,12 +1757,23 @@ def _canonical_case_number(s: str) -> str:
     Used at the write boundary for identifier-keyed corpora (internal
     committee decisions, active cases). NOT for external precedents, whose
     canonical identifier is the full citation.
+
+    Extracts the case-number TOKEN — the leading run of digits and internal
+    separators after the proceeding-type prefix — and drops any trailing display
+    text. Without this, a full citation mis-passed as the identifier (e.g.
+    ``ערר (ת"א 85074-04-25) רפאל לוי נ' …``) left party names glued to the number
+    (``85074-04-25) רפאל לוי …``), breaking equality lookups (#137, INV-ID2). A
+    clean number is returned unchanged.
     """
     s = (s or "").strip()
     m = re.search(r"\d", s)
-    if m:
-        s = s[m.start():]
-    return s.strip().replace("/", "-")
+    if not m:
+        return ""
+    s = s[m.start():]
+    token = re.match(r"\d[\d/\-]*", s)
+    if token:
+        s = token.group(0)
+    return s.strip().rstrip("/-").replace("/", "-")


 def _content_hash(text: str) -> str:
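
Read on its own, the hardened version of the canonicalizer in this hunk can be exercised as below. This is a sketch assembled from the post-change statements of the hunk; the public name `canonical_case_number` and the `examples` dict are this sketch's own, and the real function is `db._canonical_case_number`.

```python
import re


def canonical_case_number(s: str) -> str:
    # Mirrors the post-change body of the hunk; only the name differs.
    s = (s or "").strip()
    m = re.search(r"\d", s)
    if not m:
        return ""  # no digit at all: nothing usable as an identifier
    s = s[m.start():]  # drop the proceeding-type prefix
    token = re.match(r"\d[\d/\-]*", s)  # digits plus internal separators
    if token:
        s = token.group(0)  # drop trailing party names / display text
    return s.strip().rstrip("/-").replace("/", "-")


# Inputs taken from the unit tests in this commit.
examples = {
    "prefix": canonical_case_number("ערר 8137/24"),
    "legacy_two_part": canonical_case_number("8126-25"),
    "trailing_names": canonical_case_number("8137-24 פלוני נ' אלמוני"),
    "no_digit": canonical_case_number("ללא מספר"),
}
```

The second `re.match` is what separates the new behavior from the old: the first search only trims what precedes the number, while the anchored token match cuts everything after it.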


@@ -0,0 +1,34 @@
+"""Unit tests for db._canonical_case_number — #137 / INV-ID2 / X1 §1.
+
+The write-time canonicalizer must extract the case-number TOKEN only and drop
+any trailing display text (party names) that a mis-passed full citation glued
+onto the number. A clean number is returned unchanged; no month is invented.
+"""
+
+from __future__ import annotations
+
+from legal_mcp.services.db import _canonical_case_number as canon
+
+
+def test_clean_numbers_unchanged():
+    assert canon("8137-24") == "8137-24"
+    assert canon("85074-09-24") == "85074-09-24"
+    assert canon("8126-03-25") == "8126-03-25"
+    # Legacy two-part number — month is NOT invented (X1 §1).
+    assert canon("8126-25") == "8126-25"
+
+
+def test_prefix_stripped():
+    assert canon("ערר 8137/24") == "8137-24"
+    assert canon('בל"מ 85074-09-24') == "85074-09-24"
+
+
+def test_trailing_party_names_dropped():
+    # The #137 symptom: ingest left "85074-04-25) רפאל לוי …" in the identifier.
+    assert canon("85074-04-25) רפאל לוי ואח' נ' הוועדה המקומית - חולון") == "85074-04-25"
+    assert canon("8137-24 פלוני נ' אלמוני") == "8137-24"
+
+
+def test_empty_and_no_digit():
+    assert canon("") == ""
+    assert canon("ללא מספר") == ""


@@ -2,7 +2,11 @@

 from __future__ import annotations

-from legal_mcp.services.court_citation import classify, normalize_case_number
+from legal_mcp.services.court_citation import (
+    case_number_from_citation,
+    classify,
+    normalize_case_number,
+)


 def test_admin_filed_format_the_example():
@@ -80,6 +84,28 @@ def test_normalize_case_number():
     assert normalize_case_number("1110/20") == "1110-20"


+def test_case_number_from_citation_strips_party_names():
+    """#137 — a full ועדת-ערר citation yields ONLY the number, never the
+    display string with party names (INV-ID2). This is the exact precedent
+    1bf0bae0 that planted ``85074-04-25) רפאל לוי …`` into case_number."""
+    cit = 'ערר (ת"א 85074-04-25) רפאל לוי ואח\' נ\' הוועדה המקומית - חולון'
+    assert case_number_from_citation(cit) == "85074-04-25"
+
+
+def test_case_number_from_citation_various_forms():
+    assert case_number_from_citation('ערר (ת"א 1198-12-25) זאטוס') == "1198-12-25"
+    assert case_number_from_citation("85074-04-25") == "85074-04-25"
+    assert case_number_from_citation('בל"מ 85074-09-24') == "85074-09-24"
+    assert case_number_from_citation("ערר 8137/24") == "8137-24"
+
+
+def test_case_number_from_citation_empty_when_unparseable():
+    """No number → '' so the caller demands a manual number, never the raw
+    citation (the #137 fallback that caused the bug)."""
+    assert case_number_from_citation("") == ""
+    assert case_number_from_citation("פסק דין בלי מספר") == ""
+
+
 def test_supreme_with_net_format_triple():
     """A Supreme prefix carrying a נט-format number exposes the triple so the
     orchestrator can route it to Tier-1 (נט המשפט serves Supreme too)."""