fix(nevo): strip preamble/mini-ratio from court rulings too (#86.1)

strip_nevo_preamble's _DECISION_START only matched ועדת-ערר (appeals
committee) openings (בפנינו / הערר שבנדון / ...), so Nevo court judgments,
exactly the ones carrying a מיני-רציו, slipped through unstripped. The
editorial mini-ratio then leaked into the chunked body, risking
contamination (the halacha extractor reading Nevo's answer key) and
polluting the corpus.

Proven on בג"ץ 1764/05: its full_text still contained the מיני-רציו.

Fix:
- Extend _DECISION_START with court-ruling openings: the פסק-דין/פסק דין
  header and the authoring-judge line (השופט/ת, כב' השופט, הנשיא,
  המשנה לנשיא). re.search returns the earliest line-start match, which is
  the real opinion start rather than the prose ratio above it.
- Widen the Nevo-marker detection window from 400 to 1500 chars so a long
  court/parties header doesn't push חקיקה שאוזכרה:/מיני-רציו: out of range.

Verified on the real 1764/05 full_text: strips 2702 chars, the body now
starts at 'השופט ס' ג'ובראן:', and the מיני-רציו is gone.

Regression: ועדת-ערר openings still strip; non-Nevo text is untouched;
markers past char 400 are now detected. Suite: 182 passed (6 new).

This is the anti-contamination prerequisite for the Nevo-ratio gold-set
(#86.3/#81.7).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -358,8 +358,16 @@ def render_pages_for_multimodal(
 _NEVO_MARKERS = ("ספרות:", "חקיקה שאוזכרה:", "מיני-רציו:", "פסקי דין שאוזכרו:",
                  "כתבי עת:", "הועתק מנבו")
 
+# Markers for where the actual decision body begins (everything before is Nevo
+# preamble: bibliography + מיני-רציו). Two families:
+# - ועדת ערר / district openings (בפנינו / הערר שבנדון / ...)
+# - COURT-RULING openings (#86.1): a פסק-דין header or the authoring judge's
+# line ("השופט/ת X:", "כב' השופט", "הנשיא"). Without these, Nevo court
+# judgments — exactly the ones carrying a מיני-רציו — slipped through unstripped
+# (e.g. בג"ץ 1764/05), risking that the extractor reads Nevo's answer key.
 _DECISION_START = re.compile(
-    r"^(בפנינו|לפנינו|הערר שבנדון|ועדת הערר לתכנון|רקע עובדתי|עסקינן)",
+    r"^(בפנינו|לפנינו|לפניי|הערר שבנדון|ועדת הערר לתכנון|רקע עובדתי|עסקינן|"
+    r"פסק[- ]דין|פסק[- ]דינו|כב(?:וד)?['׳]?\s*השופט|המשנה לנשיא|הנשיא|השופט)",
     re.MULTILINE,
 )
 
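The earliest-line-start behaviour this hunk relies on can be checked in isolation. The following is a minimal sketch: the pattern is copied from the added lines above, while `first_decision_line` and the sample text are hypothetical and made up for illustration (they are not part of the commit or of any real ruling).

```python
import re
from typing import Optional

# Pattern copied from the added lines of the hunk above.
_DECISION_START = re.compile(
    r"^(בפנינו|לפנינו|לפניי|הערר שבנדון|ועדת הערר לתכנון|רקע עובדתי|עסקינן|"
    r"פסק[- ]דין|פסק[- ]דינו|כב(?:וד)?['׳]?\s*השופט|המשנה לנשיא|הנשיא|השופט)",
    re.MULTILINE,
)


def first_decision_line(text: str) -> Optional[str]:
    """Hypothetical helper: return the full line on which the pattern first matches."""
    m = _DECISION_START.search(text)
    if m is None:
        return None
    end = text.find("\n", m.start())
    return text[m.start():] if end == -1 else text[m.start():end]


# Made-up sample: a ratio line that mentions "השופט" mid-line, then the real opening.
sample = "\n".join([
    "מיני-רציו:",
    "* נקבע כי השופט רשאי לדחות את העתירה.",
    "פסק-דין",
    "השופט ס' ג'ובראן:",
    "בפנינו עתירה.",
])

print(first_decision_line(sample))
```

Because of `^` with `re.MULTILINE`, the mid-line mention of השופט in the ratio line is skipped and the first hit is the פסק-דין header. The same anchoring also means that a preamble line which happens to begin with one of the alternatives would be taken as the decision start, which may be worth keeping in mind for the bare `השופט` and `הנשיא` alternatives.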
@@ -369,7 +377,9 @@ def strip_nevo_preamble(text: str) -> str:
 
     Returns the original text unchanged if no preamble is detected.
     """
-    head = text[:400]
+    # Window wide enough to catch the Nevo markers even when a long court/parties
+    # header precedes them (court rulings push חקיקה שאוזכרה:/מיני-רציו: down).
+    head = text[:1500]
     if not any(marker in head for marker in _NEVO_MARKERS):
         return text
     m = _DECISION_START.search(text)
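To see why the window size matters, here is a rough, self-contained sketch of the detection-then-strip flow. `strip_preamble_sketch` and its `window` parameter are hypothetical names introduced for illustration. The diff does not show what follows the `search` call, so the final `return` is an assumption (slice from the match, or return the text unchanged when nothing matches). The input is synthetic filler, not a real ruling.

```python
import re

_NEVO_MARKERS = ("ספרות:", "חקיקה שאוזכרה:", "מיני-רציו:", "פסקי דין שאוזכרו:",
                 "כתבי עת:", "הועתק מנבו")

_DECISION_START = re.compile(
    r"^(בפנינו|לפנינו|לפניי|הערר שבנדון|ועדת הערר לתכנון|רקע עובדתי|עסקינן|"
    r"פסק[- ]דין|פסק[- ]דינו|כב(?:וד)?['׳]?\s*השופט|המשנה לנשיא|הנשיא|השופט)",
    re.MULTILINE,
)


def strip_preamble_sketch(text: str, window: int = 1500) -> str:
    """Hypothetical stand-in for strip_nevo_preamble with a configurable window."""
    head = text[:window]
    if not any(marker in head for marker in _NEVO_MARKERS):
        return text  # no Nevo marker near the top: leave the text untouched
    m = _DECISION_START.search(text)
    # Assumption: the real function's tail is not shown in the diff.
    return text[m.start():] if m else text


# Synthetic input: 600 filler chars stand in for a long court/parties header.
header = "x" * 600 + "\n"
preamble = "מיני-רציו:\n* סיכום עריכה.\n"
body = "השופט ס' ג'ובראן:\nגוף ההחלטה."
text = header + preamble + body

old = strip_preamble_sketch(text, window=400)    # marker sits past char 400
new = strip_preamble_sketch(text, window=1500)   # marker is inside the window

print(old == text)   # True: the old window never sees the marker
print(new == body)   # True: the preamble is stripped
```

With the 400-char window the marker is never seen, so the early return fires and the mini-ratio stays in the text. With 1500 chars the marker is detected and the text is cut at the judge line.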