Files
legal-ai/mcp-server/src/legal_mcp/services/court_citation.py
Chaim 0990db7a3c feat(X13): auto-fetch court verdicts from נט המשפט → corpus (Tier 0 + scaffold)
תת-מערכת אחזור-פסיקה אוטומטי: כשיומון מצביע על פס"ד בית-משפט, מסווגים את
הערכאה, מורידים מהמקור הציבורי המתאים, וקולטים דרך צינור-הקליטה הקנוני.

- spec-first: docs/spec/X13-court-fetch.md (INV-CF1..CF7) + אינדקס
- מסווג court_citation.py (supreme/admin/skip) + 10 בדיקות (עת"מ 46111-12-22 → admin)
- Tier 0: court_fetch_supreme.py — supremedecisions API (reverse-engineered), httpx
  + browser-headers (אומת 200) + politeness
- תור court_fetch_jobs (SCHEMA_V30) + DB helpers + court_fetch_orchestrator.py
- Tier 1 scaffold: legal-court-fetch-service (aiohttp+Bearer, מראת legal-chat-service)
  + camofox_client (Camoufox open-source) + recaptcha_audio (Whisper מקומי) + pm2
- Tier 2 fallback חינני: manual + missing_precedent (INV-CF2/CF3 — אין drop שקט)
- כלי-MCP court_verdict_fetch / court_fetch_status; SCRIPTS.md

Invariants: מקיים G2 (מסלול-קליטה יחיד, INV-CF1) · G3/G1 (idempotent+נרמול, INV-CF5)
· G4/§6 (אין בליעה שקטה, INV-CF2) · G10 (שער-אנושי, INV-CF3) · G5 (source_type,
INV-CF6) · G9 (provenance+audit, INV-CF7). מקורות INV-CF4: RFC 9309 · Google
crawler · OWASP OAT.

Follow-ups (טרם אומתו חי): live Tier-0 validation · התקנת camofox-browser+whisper
· כיול selectors Tier-1 · COURT_FETCH_SHARED_SECRET (Infisical+Coolify) · טריגר
מ-digest try_autolink (worktree-digests-radar). V30 עלול להתנגש עם digests-radar.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 18:12:13 +00:00

205 lines
7.7 KiB
Python
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
"""Court-citation classifier for the auto-fetch subsystem (X13).
Given a raw citation string (typically a digest's ``underlying_citation``,
e.g. ``עת"מ 46111-12-22 יכין-אפק נ' הוועדה המחוזית``), decide:
* **which tier** can fetch it (``supreme`` | ``admin`` | ``skip``), and
* the **canonical case number** plus, for נט המשפט, the
(file, month, year) triple the public case-search form needs.
Tier mapping (INV-CF6 — only court rulings are auto-fetched; ועדת-ערר is
never sent to a public fetch, it needs Nevo):
* ``supreme`` — Supreme Court prefixes (עע"מ/בג"ץ/ע"א/רע"א/דנ"א/בר"מ/בש"א).
Fetched directly from ``supremedecisions.court.gov.il`` (Tier 0, no CAPTCHA).
* ``admin`` — district / administrative-court prefixes (עת"מ/עמ"נ/…) and
the bare נט-המשפט "filed" format ``NNNNN-MM-YY``. Fetched via the
host-side stealth browser against נט המשפט (Tier 1).
* ``skip`` — ועדת-ערר (ערר/בל"מ). Not publicly fetchable → missing_precedent.
Regex families intentionally mirror ``citation_extractor.py`` (the canonical
prefix/number patterns) so the two stay in sync — we reuse ``_NUM_RX`` shape
and ``_normalize_case_number`` semantics rather than inventing a parallel
parser (INV-CF1 / engineering "symmetry" rule).
"""
from __future__ import annotations
import re
from dataclasses import dataclass
# Canonical number core, identical shape to citation_extractor._NUM_RX:
# 3-5 digits, optional separator + 2-4 digits, optional third group
# (the NNNNN-MM-YY "filed" format — 46111-12-22 = file 46111, month 12, yr 22).
_NUM_RX = r"\d{1,5}(?:[-/]\d{1,4}(?:[-/]\d{2,4})?)?"
# Hebrew gershayim: straight (") or curly (״).
_Q = r"[\"״]"
# Optional leading one-letter Hebrew preposition/conjunction (ב/ל/ה/ו/כ/מ/ש)
# attached to the prefix — e.g. "בערר", "וערר", "כפי שקבעתי בערר". Anchored by
# a lookbehind that forbids a *preceding* Hebrew letter, so we don't match a
# prefix buried inside a longer word. Regex backtracking lets the preposition
# match empty when the prefix itself starts with one of these letters (בג"ץ).
_LEAD = r"(?<![א-ת])(?:[בלהוכמש])?"
# Supreme Court prefixes → Tier 0 (supremedecisions public download API).
_SUPREME_PREFIXES = [
rf"עע{_Q}מ", # ערעור מנהלי (לעליון)
rf"בג{_Q}ץ", # בג"ץ
rf"בג{_Q}צ", # variant spelling
rf"דנג{_Q}ץ", # דיון נוסף בג"ץ
rf"ע{_Q}א", # ערעור אזרחי
rf"רע{_Q}א", # רשות ערעור אזרחי
rf"דנ{_Q}א", # דיון נוסף אזרחי
rf"בר{_Q}מ", # בקשת רשות ערעור מנהלי (עליון)
rf"בש{_Q}א", # בקשת רשות … (עליון)
]
# District / administrative-court prefixes → Tier 1 (נט המשפט case viewer).
_ADMIN_PREFIXES = [
rf"עת{_Q}מ", # עתירה מנהלית (בימ"ש לעניינים מנהליים)
rf"עמ{_Q}נ", # ערעור מנהלי (מחוזי)
rf"ת{_Q}א", # תביעה אזרחית (מחוזי/שלום)
rf"ה{_Q}פ", # המרצת פתיחה
]
# Appeals-committee → skip (needs Nevo; never auto-fetched).
_SKIP_PREFIXES = [
rf"ערר",
rf"בל{_Q}מ",
]
_SUPREME_RX = re.compile(
_LEAD + r"(" + "|".join(_SUPREME_PREFIXES) + r")\s*(" + _NUM_RX + r")",
re.UNICODE,
)
_ADMIN_RX = re.compile(
_LEAD + r"(" + "|".join(_ADMIN_PREFIXES) + r")\s*(" + _NUM_RX + r")",
re.UNICODE,
)
_SKIP_RX = re.compile(
_LEAD + r"(" + "|".join(_SKIP_PREFIXES) + r")" + r"(?:\s*\([^)\n]{0,80}\))?\s*(" + _NUM_RX + r")",
re.UNICODE,
)
# Bare נט-המשפט filed format with no prefix: 46111-12-22 (5/4-digit file,
# 1-2 digit month, 2-4 digit year). Used when a digest gives just the number.
_BARE_FILED_RX = re.compile(r"(?<!\d)(\d{1,5})-(\d{1,2})-(\d{2,4})(?!\d)", re.UNICODE)
@dataclass
class CourtCitation:
"""Result of classifying a citation for auto-fetch routing."""
tier: str # "supreme" | "admin" | "skip" | "unknown"
court_prefix: str # e.g. 'עת"מ', or "" for bare/unknown
case_number_raw: str # the matched number as written, e.g. "46111-12-22"
case_number_norm: str # canonical: slashes→dashes, digits/sep only
# נט-המשפט form fields (only when the filed format NNNNN-MM-YY is present):
file_number: str | None = None
month: str | None = None
year: str | None = None
@property
def fetchable(self) -> bool:
return self.tier in ("supreme", "admin")
def normalize_case_number(raw: str) -> str:
"""Canonicalize a case number for idempotency keys / matching.
Mirrors ``citation_extractor._normalize_case_number``: strip everything
but digits and separators, unify ``/`` → ``-``. Display value is never
derived from this.
"""
cleaned = re.sub(r"[^\d/\-]", "", raw or "")
return cleaned.replace("/", "-").strip("-")
def _split_filed(num_norm: str) -> tuple[str, str, str] | None:
"""Split a normalized NNNNN-MM-YY number into (file, month, year).
Only the three-group "filed" format yields a נט-המשפט triple; two-group
formats (1234-22 / 1234/22) are Supreme-style serials and return None.
"""
m = _BARE_FILED_RX.fullmatch(num_norm)
if not m:
return None
file_no, month, year = m.group(1), m.group(2), m.group(3)
# Plausibility: month 1-12, year 2-4 digits. Reject implausible months
# (avoids mis-reading a 2-group serial that slipped through).
if not (1 <= int(month) <= 12):
return None
return file_no, month, year
def classify(citation: str) -> CourtCitation:
"""Classify a raw citation string into a fetch tier + parsed number.
Resolution order: ועדת-ערר (skip) is checked FIRST so an "ערר" prefix is
never mis-routed to a court tier; then Supreme prefixes; then admin
prefixes; then a bare filed number defaults to ``admin`` (נט המשפט is the
only public source for prefix-less district/שלום numbers).
"""
text = (citation or "").strip()
if not text:
return CourtCitation("unknown", "", "", "")
# 1. ועדת-ערר → skip (must win over any court match).
m = _SKIP_RX.search(text)
if m:
raw = m.group(2)
return CourtCitation(
tier="skip",
court_prefix=m.group(1),
case_number_raw=raw,
case_number_norm=normalize_case_number(raw),
)
# 2. Supreme Court prefix → Tier 0.
m = _SUPREME_RX.search(text)
if m:
raw = m.group(2)
return CourtCitation(
tier="supreme",
court_prefix=m.group(1),
case_number_raw=raw,
case_number_norm=normalize_case_number(raw),
)
# 3. District / admin prefix → Tier 1.
m = _ADMIN_RX.search(text)
if m:
raw = m.group(2)
norm = normalize_case_number(raw)
filed = _split_filed(norm)
return CourtCitation(
tier="admin",
court_prefix=m.group(1),
case_number_raw=raw,
case_number_norm=norm,
file_number=filed[0] if filed else None,
month=filed[1] if filed else None,
year=filed[2] if filed else None,
)
# 4. Bare filed number (no prefix) → default admin (נט המשפט).
m = _BARE_FILED_RX.search(text)
if m:
raw = m.group(0)
norm = normalize_case_number(raw)
filed = _split_filed(norm)
if filed:
return CourtCitation(
tier="admin",
court_prefix="",
case_number_raw=raw,
case_number_norm=norm,
file_number=filed[0],
month=filed[1],
year=filed[2],
)
return CourtCitation("unknown", "", "", "")