legal-ai/mcp-server/tests/test_nevo_preamble.py
Chaim f8c3fd6c89 fix(nevo): strip preamble/mini-ratio from court rulings too (#86.1)
strip_nevo_preamble's _DECISION_START only matched ועדת-ערר (appeals committee)
openings (בפנינו / הערר שבנדון / ...), so Nevo COURT judgments, exactly the ones
carrying a מיני-רציו (mini-ratio), slipped through unstripped. The editorial
mini-ratio then leaked into the chunked body, which pollutes the corpus and risks
contamination: the halacha extractor could read Nevo's answer key. Proven on
בג"ץ 1764/05: its full_text still contained the מיני-רציו.

Fix:
- Extend _DECISION_START with court-ruling openings: פסק-דין/פסק דין header and
  the authoring-judge line (השופט/ת, כב' השופט, הנשיא, המשנה לנשיא). re.search
  picks the earliest line-start match → the real opinion start, not the prose
  ratio above it.
- Widen the Nevo-marker detection window 400→1500 chars so a long court/parties
  header doesn't push חקיקה שאוזכרה:/מיני-רציו: out of range.
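
The two changes above can be sketched roughly as follows. This is only an
illustration, not the actual legal_mcp.services.extractor code: the function
name strip_nevo_preamble_sketch, the constants _NEVO_MARKERS and _MARKER_WINDOW,
and the exact regex alternatives are assumptions reconstructed from the
description in this message.

```python
import re

# Hypothetical names and values; the real module may define these differently.
_NEVO_MARKERS = ("חקיקה שאוזכרה:", "מיני-רציו:")
_MARKER_WINDOW = 1500  # widened from 400 so long court/parties headers still fit

# Assumed pattern: line-start openings of the decision body, covering both the
# older ועדת-ערר openings and the new court-ruling openings.
_DECISION_START = re.compile(
    r"^(?:בפנינו|הערר שבנדון|פסק[- ]דין|כב' השופטת?|השופטת?|המשנה לנשיא|הנשיא)",
    re.MULTILINE,
)


def strip_nevo_preamble_sketch(text: str) -> str:
    """Drop everything before the first decision-body opening in Nevo text."""
    # Only touch text that carries a Nevo marker near the top.
    head = text[:_MARKER_WINDOW]
    if not any(marker in head for marker in _NEVO_MARKERS):
        return text
    # re.search returns the earliest line-start match in the text.
    match = _DECISION_START.search(text)
    if match is None:
        # Markers but no recognisable body start: leave the text intact.
        return text
    return text[match.start():]
```

Returning the input unchanged in both fallback branches keeps the sketch safe
on non-Nevo text and on summaries that have no decision body.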

Verified on the real 1764/05 full_text: strips 2702 chars, body now starts at
'השופט ס' ג'ובראן:', מיני-רציו gone. Regression: ועדת-ערר openings still strip;
non-Nevo text untouched; markers-past-400 now detected. Suite 182 passed (6 new).

This is the anti-contamination prerequisite for the Nevo-ratio gold-set (#86.3/#81.7).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 16:55:31 +00:00

from __future__ import annotations

from legal_mcp.services import extractor as ex

# Nevo preamble block shared by the Nevo-sourced cases.
_PREAMBLE = (
    "חקיקה שאוזכרה:\n"
    "חוק התכנון והבניה, תשכ\"ה-1965: סע' 197\n\n"
    "מיני-רציו:\n"
    "* העותרים לא הוכיחו טעם מיוחד.\n"
    "ביהמ\"ש העליון דחה את העתירה בקובעו:\n"
    "המחוקק הגביל את הזמן ל-3 שנים.\n\n"
)


def test_strips_court_ruling_judge_opening():
    # #86.1: court rulings open with the authoring judge; previously NOT stripped.
    text = _PREAMBLE + "השופט ס' ג'ובראן:\n\nהאם קיימים טעמים מיוחדים..."
    out = ex.strip_nevo_preamble(text)
    assert out.startswith("השופט ס' ג'ובראן:")
    assert "מיני-רציו" not in out
    assert "דחה את העתירה בקובעו" not in out


def test_strips_court_ruling_pdin_header():
    # court rulings may also open with a פסק-דין header
    text = _PREAMBLE + "פסק-דין\n\nלפנינו עתירה..."
    out = ex.strip_nevo_preamble(text)
    assert out.startswith("פסק-דין")
    assert "מיני-רציו" not in out


def test_strips_vaada_opening_regression():
    # existing behaviour must keep working
    text = _PREAMBLE + "בפנינו ערר על החלטת הוועדה המקומית..."
    out = ex.strip_nevo_preamble(text)
    assert out.startswith("בפנינו ערר")
    assert "מיני-רציו" not in out


def test_non_nevo_unchanged():
    # no Nevo markers → returned as-is even though it has a judge line
    text = "פסק דין\nהשופט כהן: בעניין שלפנינו..."
    assert ex.strip_nevo_preamble(text) == text


def test_nevo_markers_but_no_body_start_unchanged():
    # markers present but nothing that looks like a decision body → leave intact
    text = "מיני-רציו:\n* תקציר בלבד ללא גוף החלטה\n"
    assert ex.strip_nevo_preamble(text) == text


def test_markers_past_400_chars_still_detected():
    # a long court/parties header pushes the markers past the old 400-char window
    header = "בבית המשפט העליון " + ("x " * 200) + "\n"  # ~420 chars
    text = header + _PREAMBLE + "השופטת ע' ארבל:\n\nגוף ההחלטה..."
    out = ex.strip_nevo_preamble(text)
    assert out.startswith("השופטת ע' ארבל:")