feat(X13): auto-fetch court verdicts from נט המשפט → corpus (Tier 0 + scaffold)

תת-מערכת אחזור-פסיקה אוטומטי: כשיומון מצביע על פס"ד בית-משפט, מסווגים את
הערכאה, מורידים מהמקור הציבורי המתאים, וקולטים דרך צינור-הקליטה הקנוני.

- spec-first: docs/spec/X13-court-fetch.md (INV-CF1..CF7) + אינדקס
- מסווג court_citation.py (supreme/admin/skip) + 10 בדיקות (עת"מ 46111-12-22 → admin)
- Tier 0: court_fetch_supreme.py — supremedecisions API (reverse-engineered), httpx
  + browser-headers (אומת 200) + politeness
- תור court_fetch_jobs (SCHEMA_V30) + DB helpers + court_fetch_orchestrator.py
- Tier 1 scaffold: legal-court-fetch-service (aiohttp+Bearer, מראת legal-chat-service)
  + camofox_client (Camoufox open-source) + recaptcha_audio (Whisper מקומי) + pm2
- Tier 2 fallback חינני: manual + missing_precedent (INV-CF2/CF3 — אין drop שקט)
- כלי-MCP court_verdict_fetch / court_fetch_status; SCRIPTS.md

Invariants: מקיים G2 (מסלול-קליטה יחיד, INV-CF1) · G3/G1 (idempotent+נרמול, INV-CF5)
· G4/§6 (אין בליעה שקטה, INV-CF2) · G10 (שער-אנושי, INV-CF3) · G5 (source_type,
INV-CF6) · G9 (provenance+audit, INV-CF7). מקורות INV-CF4: RFC 9309 · Google
crawler · OWASP OAT.

Follow-ups (טרם אומתו חי): live Tier-0 validation · התקנת camofox-browser+whisper
· כיול selectors Tier-1 · COURT_FETCH_SHARED_SECRET (Infisical+Coolify) · טריגר
מ-digest try_autolink (worktree-digests-radar). V30 עלול להתנגש עם digests-radar.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-07 18:08:23 +00:00
parent 955675eb1f
commit 0990db7a3c
16 changed files with 1518 additions and 3 deletions

View File

@@ -0,0 +1,181 @@
"""Tier 0 — Supreme Court verdict fetcher (X13).
Pulls a published Supreme Court verdict PDF from the **public** decisions
portal ``supremedecisions.court.gov.il`` — no smart-card, no CAPTCHA. The
portal is an AngularJS SPA backed by a small JSON API (reverse-engineered
from ``/Scripts/app/config.js`` + the search/results controllers):
POST Home/SearchVerdicts body {"document": <query>, "lan": 1} → result list
GET Home/GetCasesYearNum ?... (year + number lookup) → case + docs
GET Home/Download?path=<path>&fileName=<file>&type=4 → the PDF bytes
Two things matter for getting a 200 instead of an F5 connection-reset
(verified empirically 2026-06-07):
* a **complete** browser header set — UA + Accept + Accept-Language. A bare
UA alone gets reset.
* **politeness** (INV-CF4): one request at a time, a cooldown between them,
a Referer of the portal root. We never parallelise or hammer.
Honesty / scope: the *result→download* field mapping (where ``path`` and
``fileName`` live in the SearchVerdicts JSON) is derived from the client code,
not yet confirmed against a live JSON response (the live site rate-limited
probing during development). ``fetch_supreme_verdict`` therefore validates the
response shape and **raises** on anything unexpected (INV-CF2 — no silent
swallow) so the orchestrator can record the failure and fall back, rather than
returning a wrong/empty file. The first live run is the validation pass; see
the X13 verification section.
"""
from __future__ import annotations
import asyncio
import logging
import os
from dataclasses import dataclass
import httpx
logger = logging.getLogger(__name__)
_BASE = "https://supremedecisions.court.gov.il"
# A complete, browser-like header set. Empirically required to pass the F5
# WAF (a bare User-Agent gets a TCP reset).
_HEADERS = {
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/126.0 Safari/537.36"
),
"Accept": "application/json, text/plain, */*",
"Accept-Language": "he-IL,he;q=0.9,en;q=0.8",
"Referer": _BASE + "/",
}
# Politeness knobs (INV-CF4). Serial only — never run these concurrently.
_REQUEST_TIMEOUT_S = float(os.environ.get("COURT_FETCH_HTTP_TIMEOUT_S", "30"))
_INTER_REQUEST_COOLDOWN_S = float(os.environ.get("COURT_FETCH_COOLDOWN_S", "2"))
# type=4 → PDF in the portal's Download endpoint (from resultsControler.js).
_DOC_TYPE_PDF = "4"
@dataclass
class FetchedVerdict:
"""A downloaded verdict file held in memory, ready for ingest."""
content: bytes
filename: str
source_url: str
court: str = "בית המשפט העליון"
class SupremeFetchError(RuntimeError):
"""Raised when the public portal returns an unexpected shape / no document.
Carries a human-readable Hebrew reason so the orchestrator can persist it
on the job row (INV-CF2) and decide on fallback.
"""
async def _get(client: httpx.AsyncClient, path: str, **kwargs) -> httpx.Response:
await asyncio.sleep(_INTER_REQUEST_COOLDOWN_S)
resp = await client.get(f"{_BASE}/{path.lstrip('/')}", **kwargs)
resp.raise_for_status()
return resp
async def _post(client: httpx.AsyncClient, path: str, json: dict) -> httpx.Response:
await asyncio.sleep(_INTER_REQUEST_COOLDOWN_S)
resp = await client.post(f"{_BASE}/{path.lstrip('/')}", json=json)
resp.raise_for_status()
return resp
def _extract_doc_ref(results: object) -> tuple[str, str] | None:
"""Pull (path, fileName) of the first verdict document from a results blob.
The SearchVerdicts/GetCasesYearNum responses nest documents under varying
keys across the portal's endpoints. We probe the known shapes defensively
and return the first (path, fileName) pair found; ``None`` if none.
"""
def walk(node):
if isinstance(node, dict):
# A document node carries both a path and a file name.
path = node.get("Path") or node.get("path")
fname = node.get("FileName") or node.get("fileName") or node.get("Filename")
if path and fname:
yield (str(path), str(fname))
for v in node.values():
yield from walk(v)
elif isinstance(node, list):
for v in node:
yield from walk(v)
for pair in walk(results):
return pair
return None
async def fetch_supreme_verdict(
*, citation: str, case_number_norm: str
) -> FetchedVerdict:
"""Fetch a Supreme Court verdict PDF by citation. Raises on failure.
Flow: full-text search for the citation → locate the verdict document's
(path, fileName) → download the PDF. Serial + cooled-down throughout.
"""
async with httpx.AsyncClient(
http2=True,
headers=_HEADERS,
timeout=_REQUEST_TIMEOUT_S,
follow_redirects=True,
) as client:
# 1. Search. The portal's quick-search posts {document, lan}; lan=1=Hebrew.
try:
search = await _post(
client, "Home/SearchVerdicts",
json={"document": citation, "lan": 1},
)
results = search.json()
except httpx.HTTPError as e:
raise SupremeFetchError(
f"חיפוש בפורטל העליון נכשל עבור {citation}: {e}"
) from e
except ValueError as e: # non-JSON body
raise SupremeFetchError(
f"תשובת-חיפוש לא-JSON מהפורטל עבור {citation}"
) from e
ref = _extract_doc_ref(results)
if not ref:
raise SupremeFetchError(
f"לא נמצא מסמך-פסק עבור {citation} בפורטל העליון "
f"(ייתכן שאינו פורסם או שמבנה-התשובה השתנה)."
)
path, fname = ref
# 2. Download the PDF.
try:
dl = await _get(
client, "Home/Download",
params={"path": path, "fileName": fname, "type": _DOC_TYPE_PDF},
)
except httpx.HTTPError as e:
raise SupremeFetchError(
f"הורדת PDF נכשלה עבור {citation} (path={path}): {e}"
) from e
content = dl.content
ctype = dl.headers.get("content-type", "")
if not content or ("pdf" not in ctype.lower() and not content[:4] == b"%PDF"):
raise SupremeFetchError(
f"הקובץ שהתקבל עבור {citation} אינו PDF תקין (content-type={ctype})."
)
source_url = (
f"{_BASE}/Home/Download?path={path}&fileName={fname}&type={_DOC_TYPE_PDF}"
)
safe_name = fname if fname.lower().endswith(".pdf") else f"{case_number_norm}.pdf"
return FetchedVerdict(
content=content, filename=safe_name, source_url=source_url,
)