feat(X13): auto-fetch court verdicts from נט המשפט (Net HaMishpat) → corpus (Tier 0 + scaffold)

Automatic case-law retrieval subsystem: when a digest points to a court verdict, classify
the court level, download from the matching public source, and ingest through the
canonical ingest pipeline.

- spec-first: docs/spec/X13-court-fetch.md (INV-CF1..CF7) + index
- classifier court_citation.py (supreme/admin/skip) + 10 tests (עת"מ 46111-12-22 → admin)
- Tier 0: court_fetch_supreme.py, using the supremedecisions API (reverse-engineered), httpx
  + browser headers (verified 200) + politeness
- queue court_fetch_jobs (SCHEMA_V30) + DB helpers + court_fetch_orchestrator.py
- Tier 1 scaffold: legal-court-fetch-service (aiohttp+Bearer, mirrors legal-chat-service)
  + camofox_client (Camoufox open-source) + recaptcha_audio (local Whisper) + pm2
- Tier 2 graceful fallback: manual + missing_precedent (INV-CF2/CF3: no silent drop)
- MCP tools court_verdict_fetch / court_fetch_status; SCRIPTS.md

Invariants: satisfies G2 (single ingest path, INV-CF1) · G3/G1 (idempotent + normalization,
INV-CF5) · G4/§6 (no silent swallowing, INV-CF2) · G10 (human gate, INV-CF3) · G5 (source_type,
INV-CF6) · G9 (provenance+audit, INV-CF7). Sources for INV-CF4: RFC 9309 · Google
crawler · OWASP OAT.

Follow-ups (not yet verified live): live Tier-0 validation · install camofox-browser+whisper
· calibrate Tier-1 selectors · COURT_FETCH_SHARED_SECRET (Infisical+Coolify) · trigger
from digest try_autolink (worktree-digests-radar). V30 may collide with digests-radar.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 18:08:23 +00:00
parent 955675eb1f
commit 0990db7a3c
16 changed files with 1518 additions and 3 deletions

@@ -0,0 +1,204 @@
"""Court-citation classifier for the auto-fetch subsystem (X13).

Given a raw citation string (typically a digest's ``underlying_citation``,
e.g. ``עת"מ 46111-12-22 יכין-אפק נ' הוועדה המחוזית``), decide:

* **which tier** can fetch it (``supreme`` | ``admin`` | ``skip``), and
* the **canonical case number** plus, for נט המשפט (Net HaMishpat), the
  (file, month, year) triple the public case-search form needs.

Tier mapping (INV-CF6: only court rulings are auto-fetched; an appeals-committee
(ועדת-ערר) decision is never sent to a public fetch, it needs Nevo):

* ``supreme``: Supreme Court prefixes (עע"מ/בג"ץ/ע"א/רע"א/דנ"א/בר"מ/בש"א).
  Fetched directly from ``supremedecisions.court.gov.il`` (Tier 0, no CAPTCHA).
* ``admin``: district / administrative-court prefixes (עת"מ/עמ"נ/…) and
  the bare נט המשפט "filed" format ``NNNNN-MM-YY``. Fetched via the
  host-side stealth browser against נט המשפט (Tier 1).
* ``skip``: appeals committee (ערר/בל"מ). Not publicly fetchable → missing_precedent.

``classify`` additionally returns the tier ``unknown`` when no pattern matches.

Regex families intentionally mirror ``citation_extractor.py`` (the canonical
prefix/number patterns) so the two stay in sync: we reuse the ``_NUM_RX`` shape
and ``_normalize_case_number`` semantics rather than inventing a parallel
parser (INV-CF1 / engineering "symmetry" rule).
"""
from __future__ import annotations

import re
from dataclasses import dataclass

# Canonical number core, identical shape to citation_extractor._NUM_RX:
# 1-5 digits, optional separator + 1-4 digits, optional third group of 2-4 digits
# (the NNNNN-MM-YY "filed" format: 46111-12-22 = file 46111, month 12, year 22).
_NUM_RX = r"\d{1,5}(?:[-/]\d{1,4}(?:[-/]\d{2,4})?)?"

# Hebrew gershayim: straight (") or curly (״).
_Q = r"[\"״]"

# Optional leading one-letter Hebrew preposition/conjunction (ב/ל/ה/ו/כ/מ/ש)
# attached to the prefix, e.g. "בערר", "וערר", "כפי שקבעתי בערר". Anchored by
# a lookbehind that forbids a *preceding* Hebrew letter, so we don't match a
# prefix buried inside a longer word. Regex backtracking lets the preposition
# match empty when the prefix itself starts with one of these letters (בג"ץ).
_LEAD = r"(?<![א-ת])(?:[בלהוכמש])?"

# Supreme Court prefixes → Tier 0 (supremedecisions public download API).
_SUPREME_PREFIXES = [
    rf"עע{_Q}מ",   # administrative appeal (to the Supreme Court)
    rf"בג{_Q}ץ",   # High Court of Justice (בג"ץ)
    rf"בג{_Q}צ",   # variant spelling
    rf"דנג{_Q}ץ",  # further hearing, High Court of Justice
    rf"ע{_Q}א",    # civil appeal
    rf"רע{_Q}א",   # leave to appeal, civil
    rf"דנ{_Q}א",   # further hearing, civil
    rf"בר{_Q}מ",   # request for leave to appeal, administrative (Supreme Court)
    rf"בש{_Q}א",   # request for leave … (Supreme Court)
]

# District / administrative-court prefixes → Tier 1 (נט המשפט case viewer).
_ADMIN_PREFIXES = [
    rf"עת{_Q}מ",   # administrative petition (court for administrative affairs)
    rf"עמ{_Q}נ",   # administrative appeal (district court)
    rf"ת{_Q}א",    # civil claim (district / magistrate court)
    rf"ה{_Q}פ",    # originating motion
]

# Appeals committee → skip (needs Nevo; never auto-fetched).
_SKIP_PREFIXES = [
    rf"ערר",
    rf"בל{_Q}מ",
]

_SUPREME_RX = re.compile(
    _LEAD + r"(" + "|".join(_SUPREME_PREFIXES) + r")\s*(" + _NUM_RX + r")",
    re.UNICODE,
)
_ADMIN_RX = re.compile(
    _LEAD + r"(" + "|".join(_ADMIN_PREFIXES) + r")\s*(" + _NUM_RX + r")",
    re.UNICODE,
)
# The skip pattern tolerates an optional parenthesised qualifier (up to 80
# characters) between the prefix and the number.
_SKIP_RX = re.compile(
    _LEAD + r"(" + "|".join(_SKIP_PREFIXES) + r")"
    + r"(?:\s*\([^)\n]{0,80}\))?\s*(" + _NUM_RX + r")",
    re.UNICODE,
)

# Bare נט המשפט filed format with no prefix: 46111-12-22 (1-5 digit file number,
# 1-2 digit month, 2-4 digit year). Used when a digest gives just the number.
_BARE_FILED_RX = re.compile(r"(?<!\d)(\d{1,5})-(\d{1,2})-(\d{2,4})(?!\d)", re.UNICODE)


@dataclass
class CourtCitation:
    """Result of classifying a citation for auto-fetch routing."""

    tier: str               # "supreme" | "admin" | "skip" | "unknown"
    court_prefix: str       # e.g. 'עת"מ', or "" for bare/unknown
    case_number_raw: str    # the matched number as written, e.g. "46111-12-22"
    case_number_norm: str   # canonical: slashes→dashes, digits/sep only
    # נט המשפט form fields (only when the filed format NNNNN-MM-YY is present):
    file_number: str | None = None
    month: str | None = None
    year: str | None = None

    @property
    def fetchable(self) -> bool:
        return self.tier in ("supreme", "admin")


def normalize_case_number(raw: str) -> str:
    """Canonicalize a case number for idempotency keys / matching.

    Mirrors ``citation_extractor._normalize_case_number``: strip everything
    but digits and separators, unify ``/`` → ``-``. Display value is never
    derived from this.
    """
    cleaned = re.sub(r"[^\d/\-]", "", raw or "")
    return cleaned.replace("/", "-").strip("-")


def _split_filed(num_norm: str) -> tuple[str, str, str] | None:
    """Split a normalized NNNNN-MM-YY number into (file, month, year).

    Only the three-group "filed" format yields a נט המשפט triple; two-group
    formats (1234-22 / 1234/22) are Supreme-style serials and return None.
    """
    m = _BARE_FILED_RX.fullmatch(num_norm)
    if not m:
        return None
    file_no, month, year = m.group(1), m.group(2), m.group(3)
    # Plausibility: the regex already limits the year to 2-4 digits, so only
    # the month range (1-12) is checked here. Rejecting implausible months
    # avoids mis-reading a serial that slipped through.
    if not (1 <= int(month) <= 12):
        return None
    return file_no, month, year


def classify(citation: str) -> CourtCitation:
    """Classify a raw citation string into a fetch tier + parsed number.

    Resolution order: the appeals committee (ועדת-ערר, skip) is checked FIRST
    so an "ערר" prefix is never mis-routed to a court tier; then Supreme
    prefixes; then admin prefixes; then a bare filed number defaults to
    ``admin`` (נט המשפט is the only public source for prefix-less
    district/magistrate numbers).
    """
    text = (citation or "").strip()
    if not text:
        return CourtCitation("unknown", "", "", "")

    # 1. Appeals committee (ועדת-ערר) → skip (must win over any court match).
    m = _SKIP_RX.search(text)
    if m:
        raw = m.group(2)
        return CourtCitation(
            tier="skip",
            court_prefix=m.group(1),
            case_number_raw=raw,
            case_number_norm=normalize_case_number(raw),
        )

    # 2. Supreme Court prefix → Tier 0.
    m = _SUPREME_RX.search(text)
    if m:
        raw = m.group(2)
        return CourtCitation(
            tier="supreme",
            court_prefix=m.group(1),
            case_number_raw=raw,
            case_number_norm=normalize_case_number(raw),
        )

    # 3. District / admin prefix → Tier 1.
    m = _ADMIN_RX.search(text)
    if m:
        raw = m.group(2)
        norm = normalize_case_number(raw)
        filed = _split_filed(norm)
        return CourtCitation(
            tier="admin",
            court_prefix=m.group(1),
            case_number_raw=raw,
            case_number_norm=norm,
            file_number=filed[0] if filed else None,
            month=filed[1] if filed else None,
            year=filed[2] if filed else None,
        )

    # 4. Bare filed number (no prefix) → default admin (נט המשפט).
    m = _BARE_FILED_RX.search(text)
    if m:
        raw = m.group(0)
        norm = normalize_case_number(raw)
        filed = _split_filed(norm)
        if filed:
            return CourtCitation(
                tier="admin",
                court_prefix="",
                case_number_raw=raw,
                case_number_norm=norm,
                file_number=filed[0],
                month=filed[1],
                year=filed[2],
            )

    return CourtCitation("unknown", "", "", "")
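
The normalize-then-split step is the part of the classifier that decides whether a number can be sent to the נט המשפט search form. The following is a minimal, self-contained sketch of that step only. The names `normalize` and `split_filed` are illustrative stand-ins for the helpers in this file, and the pattern is a simplified copy of the bare filed format, not the module itself.

```python
import re

# Simplified copy of the bare filed pattern: file number, month, year.
FILED_RX = re.compile(r"(\d{1,5})-(\d{1,2})-(\d{2,4})")


def normalize(raw: str) -> str:
    """Keep digits and separators only, then unify '/' to '-'."""
    cleaned = re.sub(r"[^\d/\-]", "", raw or "")
    return cleaned.replace("/", "-").strip("-")


def split_filed(num_norm: str):
    """Return (file, month, year) for a three-group number, else None."""
    m = FILED_RX.fullmatch(num_norm)
    if not m or not (1 <= int(m.group(2)) <= 12):
        return None
    return m.group(1), m.group(2), m.group(3)


print(split_filed(normalize("46111/12/22")))  # ('46111', '12', '22')
print(split_filed(normalize("1234/22")))      # None: two-group serial
print(split_filed(normalize("46111-13-22")))  # None: month 13 is implausible
```

Only the first input yields a triple, which is why an admin citation with a two-group number is classified but later rejected by the Tier 1 call.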

@@ -0,0 +1,241 @@
"""X13 orchestrator: classify → fetch → ingest → record.

The single entry point (`fetch_and_ingest`) wires the three tiers to the
**canonical** precedent-ingest pipeline (INV-CF1: no parallel ingest path)
and keeps the `court_fetch_jobs` row honest at every step (INV-CF2: a job
always ends in an explicit terminal state, never a silent drop).

Tier routing (from `court_citation.classify`):

* ``skip``: appeals committee (ועדת-ערר) → never fetched; logged as a
  missing_precedent gap.
* ``supreme``: Tier 0, in-process httpx (`court_fetch_supreme`).
* ``admin``: Tier 1, the host-side stealth-browser service over loopback.

Fallback (INV-CF3): after ``MAX_AUTONOMOUS_ATTEMPTS`` autonomous failures the
job flips to ``manual`` and a missing_precedent row is opened so the chair
sees the gap and can solve the CAPTCHA live (VNC) or drop the file manually.

This module runs **in the local MCP server only**: `ingest_precedent` drives
halacha extraction via the local ``claude`` CLI (see `claude_session.py`). It
is invoked from the `court_verdict_fetch` MCP tool, not from the container.
"""
from __future__ import annotations

import logging
import os
import tempfile
from pathlib import Path
from uuid import UUID

import httpx

from legal_mcp.services import court_citation, db
from legal_mcp.services.court_fetch_supreme import (
    SupremeFetchError,
    fetch_supreme_verdict,
)

logger = logging.getLogger(__name__)

# After this many autonomous failures, stop auto-retrying and escalate to a
# human (INV-CF3). Kept low: the .gov site shouldn't be hammered (INV-CF4).
MAX_AUTONOMOUS_ATTEMPTS = int(os.environ.get("COURT_FETCH_MAX_ATTEMPTS", "2"))

# The host-side Tier-1 browser service (pm2). The MCP server runs on the host,
# so it reaches the service over loopback directly (the container bridge in
# web/court_fetch_proxy.py is a separate, optional entry point).
COURT_FETCH_SERVICE_URL = os.environ.get(
    "COURT_FETCH_SERVICE_URL", "http://127.0.0.1:8771"
)
_SHARED_SECRET = os.environ.get("COURT_FETCH_SHARED_SECRET", "").strip()
_TIER1_TIMEOUT_S = float(os.environ.get("COURT_FETCH_TIER1_TIMEOUT_S", "300"))

# Provenance level by tier: Supreme rulings are binding; admin-court verdicts
# are administrative (is_binding is set conservatively to True, the chair can
# downgrade). Values are the Hebrew labels for "Supreme" and "administrative".
_LEVEL_BY_TIER = {"supreme": "עליון", "admin": "מנהלי"}


class _Tier1Unavailable(RuntimeError):
    """The host browser service is not reachable / not configured."""


async def _ingest_bytes(
    *, content: bytes, filename: str, citation: str, tier: str,
    court: str, source_url: str,
) -> dict:
    """Stage bytes to a temp file and run the canonical ingest (INV-CF1)."""
    from legal_mcp.services import precedent_library

    suffix = Path(filename).suffix or ".pdf"
    # delete=False: the file must outlive the handle so the ingest pipeline
    # can reopen it by path; it is removed in the ``finally`` block below.
    tmp = tempfile.NamedTemporaryFile(
        prefix="court_fetch_", suffix=suffix, delete=False
    )
    try:
        tmp.write(content)
        tmp.flush()
        tmp.close()
        result = await precedent_library.ingest_precedent(
            file_path=tmp.name,
            citation=citation,
            court=court,
            source_type="court_ruling",  # INV-CF6
            precedent_level=_LEVEL_BY_TIER.get(tier, ""),
            is_binding=True,
        )
        # Stamp provenance on the new case_law row (INV-CF7).
        case_law_id = result.get("case_law_id")
        if case_law_id and source_url:
            try:
                await db.update_case_law(
                    UUID(str(case_law_id)), source_url=source_url
                )
            except Exception:  # provenance is best-effort, never blocks ingest
                logger.warning("could not stamp source_url on %s", case_law_id)
        return result
    finally:
        try:
            os.unlink(tmp.name)
        except OSError:
            pass


async def _fetch_tier1_admin(cit: court_citation.CourtCitation) -> dict:
    """Call the host-side browser service to fetch an admin-court verdict.

    Returns the service's JSON: ``{ok, content_b64, filename, source_url,
    court, reason}``. Raises ``_Tier1Unavailable`` if the service can't be
    reached, and a plain ``RuntimeError`` when the case number is not in the
    filed format or the service answers with a non-200 status.
    """
    if not (cit.file_number and cit.month and cit.year):
        raise RuntimeError(
            f"מספר-תיק {cit.case_number_norm} אינו בפורמט נט-המשפט (תיק-חודש-שנה)"
        )
    # The Bearer header is sent only when a shared secret is configured.
    headers = {"Authorization": f"Bearer {_SHARED_SECRET}"} if _SHARED_SECRET else {}
    payload = {
        "file_number": cit.file_number,
        "month": cit.month,
        "year": cit.year,
        "case_number": cit.case_number_norm,
        "court": cit.court_prefix,
    }
    try:
        async with httpx.AsyncClient(timeout=_TIER1_TIMEOUT_S) as client:
            resp = await client.post(
                f"{COURT_FETCH_SERVICE_URL}/fetch", json=payload, headers=headers
            )
    except httpx.ConnectError as e:
        raise _Tier1Unavailable(
            f"שירות-האחזור (legal-court-fetch-service) אינו זמין ב-"
            f"{COURT_FETCH_SERVICE_URL}: {e}"
        ) from e
    if resp.status_code != 200:
        raise RuntimeError(f"שירות-האחזור החזיר {resp.status_code}: {resp.text[:200]}")
    return resp.json()


async def fetch_and_ingest(
    citation: str, *, digest_id: UUID | None = None
) -> dict:
    """Classify a citation, fetch the verdict, ingest it, and record the job.

    Idempotent on the canonical case number (INV-CF5): a case already fetched
    (job ``done``) is returned without re-fetching.
    """
    cit = court_citation.classify(citation)

    # ── skip: appeals committee, never auto-fetched (INV-CF6). Surface as a gap. ──
    if cit.tier == "skip":
        await _open_gap(citation, reason="ועדת-ערר — לא ניתן לאחזור ציבורי (נדרש נבו)")
        return {"status": "skipped", "tier": "skip", "citation": citation,
                "reason": "appeals_committee — needs Nevo"}
    if cit.tier == "unknown" or not cit.case_number_norm:
        return {"status": "unrecognized", "citation": citation}

    # ── idempotent job row ──
    job = await db.court_fetch_job_upsert(
        case_number_norm=cit.case_number_norm,
        citation_raw=citation,
        tier=cit.tier,
        court=cit.court_prefix,
        digest_id=digest_id,
    )
    if job.get("status") == "done":
        return {"status": "already_done", "job": job}
    if job.get("status") == "manual":
        return {"status": "awaiting_manual", "job": job}
    job_id = UUID(str(job["id"]))
    await db.court_fetch_job_update(job_id, status="running", bump_attempts=True)

    # ── fetch ──
    try:
        if cit.tier == "supreme":
            fetched = await fetch_supreme_verdict(
                citation=citation, case_number_norm=cit.case_number_norm
            )
            content, filename = fetched.content, fetched.filename
            source_url, court = fetched.source_url, fetched.court
        else:  # admin → Tier 1
            res = await _fetch_tier1_admin(cit)
            if not res.get("ok"):
                raise RuntimeError(res.get("reason") or "אחזור נכשל")
            import base64
            content = base64.b64decode(res["content_b64"])
            filename = res.get("filename") or f"{cit.case_number_norm}.pdf"
            source_url = res.get("source_url", "")
            court = res.get("court") or cit.court_prefix
    except (RuntimeError, httpx.HTTPError, KeyError, ValueError) as e:
        # RuntimeError covers _Tier1Unavailable and SupremeFetchError. The other
        # types cover Tier-1 timeouts and malformed payloads (missing
        # content_b64, bad base64, non-JSON body), so a job is never left
        # stuck in "running" (INV-CF2).
        return await _record_failure(job_id, cit, citation, str(e) or repr(e))

    # ── ingest into the canonical pipeline (INV-CF1) ──
    try:
        result = await _ingest_bytes(
            content=content, filename=filename, citation=citation,
            tier=cit.tier, court=court, source_url=source_url,
        )
    except Exception as e:  # noqa: BLE001 (recorded, never swallowed: INV-CF2)
        logger.exception("ingest failed for %s", cit.case_number_norm)
        return await _record_failure(job_id, cit, citation, f"קליטה נכשלה: {e}")

    case_law_id = result.get("case_law_id")
    await db.court_fetch_job_update(
        job_id, status="done",
        case_law_id=UUID(str(case_law_id)) if case_law_id else None,
        source_url=source_url, error="",
    )
    return {"status": "done", "tier": cit.tier, "case_law_id": case_law_id,
            "citation": citation, "source_url": source_url, "ingest": result}


async def _record_failure(
    job_id: UUID, cit: court_citation.CourtCitation, citation: str, err: str
) -> dict:
    """Record a fetch/ingest failure; escalate to manual after N attempts (INV-CF3)."""
    job = await db.court_fetch_job_get(cit.case_number_norm)
    attempts = (job or {}).get("attempts", 1)
    if attempts >= MAX_AUTONOMOUS_ATTEMPTS:
        await db.court_fetch_job_update(job_id, status="manual", error=err)
        await _open_gap(
            citation,
            reason=f"אחזור אוטונומי נכשל ({attempts} נסיונות) — נדרשת הורדה ידנית. {err}",
        )
        # Citation and error are separated so the log line stays readable.
        logger.warning("court fetch escalated to manual: %s: %s", citation, err)
        return {"status": "manual", "citation": citation, "error": err,
                "attempts": attempts}
    await db.court_fetch_job_update(job_id, status="failed", error=err)
    logger.warning("court fetch failed (will retry): %s: %s", citation, err)
    return {"status": "failed", "citation": citation, "error": err,
            "attempts": attempts}


async def _open_gap(citation: str, *, reason: str) -> None:
    """Open a missing_precedent gap so the chair sees it (INV-CF2/CF3).

    Best-effort + de-duplicated by the missing_precedents layer; a failure
    here is logged, never raised (it must not mask the original outcome).
    """
    try:
        await db.create_missing_precedent(citation=citation, notes=reason)
    except Exception:
        logger.warning("could not open missing_precedent for %s", citation)
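
The escalation rule in `_record_failure` is easier to check in isolation. Below is a minimal sketch of just the decision, assuming the default budget of 2 attempts; `next_status` and `simulate_failures` are hypothetical helpers written for this illustration and do not exist in the module.

```python
MAX_AUTONOMOUS_ATTEMPTS = 2  # assumed default; the module reads COURT_FETCH_MAX_ATTEMPTS


def next_status(attempts: int, max_attempts: int = MAX_AUTONOMOUS_ATTEMPTS) -> str:
    """Return 'manual' once the attempt budget is used up, else 'failed'."""
    return "manual" if attempts >= max_attempts else "failed"


def simulate_failures(n_failures: int, max_attempts: int = MAX_AUTONOMOUS_ATTEMPTS) -> list:
    """Replay consecutive failed runs and return the status after each one."""
    history = []
    attempts = 0
    for _ in range(n_failures):
        if history and history[-1] == "manual":
            break  # a manual job is no longer retried autonomously
        attempts += 1  # attempts is bumped before each fetch
        history.append(next_status(attempts, max_attempts))
    return history


print(simulate_failures(3))  # ['failed', 'manual']
```

With the default budget, the first failure leaves the job retryable and the second hands it to a human, which keeps the load on the court site low.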

@@ -0,0 +1,181 @@
"""Tier 0: Supreme Court verdict fetcher (X13).

Pulls a published Supreme Court verdict PDF from the **public** decisions
portal ``supremedecisions.court.gov.il``: no smart-card, no CAPTCHA. The
portal is an AngularJS SPA backed by a small JSON API (reverse-engineered
from ``/Scripts/app/config.js`` + the search/results controllers):

    POST Home/SearchVerdicts   body {"document": <query>, "lan": 1}   → result list
    GET  Home/GetCasesYearNum  ?... (year + number lookup)            → case + docs
    GET  Home/Download?path=<path>&fileName=<file>&type=4             → the PDF bytes

Two things matter for getting a 200 instead of an F5 connection-reset
(verified empirically 2026-06-07):

* a **complete** browser header set: UA + Accept + Accept-Language. A bare
  UA alone gets reset.
* **politeness** (INV-CF4): one request at a time, a cooldown between them,
  a Referer of the portal root. We never parallelise or hammer.

Honesty / scope: the *result→download* field mapping (where ``path`` and
``fileName`` live in the SearchVerdicts JSON) is derived from the client code,
not yet confirmed against a live JSON response (the live site rate-limited
probing during development). ``fetch_supreme_verdict`` therefore validates the
response shape and **raises** on anything unexpected (INV-CF2: no silent
swallow) so the orchestrator can record the failure and fall back, rather than
returning a wrong/empty file. The first live run is the validation pass; see
the X13 verification section.
"""
from __future__ import annotations

import asyncio
import logging
import os
from dataclasses import dataclass

import httpx

logger = logging.getLogger(__name__)

_BASE = "https://supremedecisions.court.gov.il"

# A complete, browser-like header set. Empirically required to pass the F5
# WAF (a bare User-Agent gets a TCP reset).
_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/126.0 Safari/537.36"
    ),
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "he-IL,he;q=0.9,en;q=0.8",
    "Referer": _BASE + "/",
}

# Politeness knobs (INV-CF4). Serial only: never run these concurrently.
_REQUEST_TIMEOUT_S = float(os.environ.get("COURT_FETCH_HTTP_TIMEOUT_S", "30"))
_INTER_REQUEST_COOLDOWN_S = float(os.environ.get("COURT_FETCH_COOLDOWN_S", "2"))

# type=4 → PDF in the portal's Download endpoint (from resultsControler.js).
_DOC_TYPE_PDF = "4"


@dataclass
class FetchedVerdict:
    """A downloaded verdict file held in memory, ready for ingest."""

    content: bytes
    filename: str
    source_url: str
    court: str = "בית המשפט העליון"  # "The Supreme Court"


class SupremeFetchError(RuntimeError):
    """Raised when the public portal returns an unexpected shape / no document.

    Carries a human-readable Hebrew reason so the orchestrator can persist it
    on the job row (INV-CF2) and decide on fallback.
    """


async def _get(client: httpx.AsyncClient, path: str, **kwargs) -> httpx.Response:
    # Cooldown before every request (INV-CF4), then fail loudly on non-2xx.
    await asyncio.sleep(_INTER_REQUEST_COOLDOWN_S)
    resp = await client.get(f"{_BASE}/{path.lstrip('/')}", **kwargs)
    resp.raise_for_status()
    return resp


async def _post(client: httpx.AsyncClient, path: str, json: dict) -> httpx.Response:
    await asyncio.sleep(_INTER_REQUEST_COOLDOWN_S)
    resp = await client.post(f"{_BASE}/{path.lstrip('/')}", json=json)
    resp.raise_for_status()
    return resp


def _extract_doc_ref(results: object) -> tuple[str, str] | None:
    """Pull (path, fileName) of the first verdict document from a results blob.

    The SearchVerdicts/GetCasesYearNum responses nest documents under varying
    keys across the portal's endpoints. We probe the known shapes defensively
    and return the first (path, fileName) pair found; ``None`` if none.
    """
    def walk(node):
        if isinstance(node, dict):
            # A document node carries both a path and a file name.
            path = node.get("Path") or node.get("path")
            fname = node.get("FileName") or node.get("fileName") or node.get("Filename")
            if path and fname:
                yield (str(path), str(fname))
            for v in node.values():
                yield from walk(v)
        elif isinstance(node, list):
            for v in node:
                yield from walk(v)

    # First match in depth-first order, or None when the blob has no document.
    return next(walk(results), None)


async def fetch_supreme_verdict(
    *, citation: str, case_number_norm: str
) -> FetchedVerdict:
    """Fetch a Supreme Court verdict PDF by citation. Raises on failure.

    Flow: full-text search for the citation → locate the verdict document's
    (path, fileName) → download the PDF. Serial + cooled-down throughout.
    """
    async with httpx.AsyncClient(
        http2=True,
        headers=_HEADERS,
        timeout=_REQUEST_TIMEOUT_S,
        follow_redirects=True,
    ) as client:
        # 1. Search. The portal's quick-search posts {document, lan}; lan=1=Hebrew.
        try:
            search = await _post(
                client, "Home/SearchVerdicts",
                json={"document": citation, "lan": 1},
            )
            results = search.json()
        except httpx.HTTPError as e:
            raise SupremeFetchError(
                f"חיפוש בפורטל העליון נכשל עבור {citation}: {e}"
            ) from e
        except ValueError as e:  # non-JSON body
            raise SupremeFetchError(
                f"תשובת-חיפוש לא-JSON מהפורטל עבור {citation}"
            ) from e

        ref = _extract_doc_ref(results)
        if not ref:
            raise SupremeFetchError(
                f"לא נמצא מסמך-פסק עבור {citation} בפורטל העליון "
                f"(ייתכן שאינו פורסם או שמבנה-התשובה השתנה)."
            )
        path, fname = ref

        # 2. Download the PDF.
        try:
            dl = await _get(
                client, "Home/Download",
                params={"path": path, "fileName": fname, "type": _DOC_TYPE_PDF},
            )
        except httpx.HTTPError as e:
            raise SupremeFetchError(
                f"הורדת PDF נכשלה עבור {citation} (path={path}): {e}"
            ) from e

        content = dl.content
        ctype = dl.headers.get("content-type", "")
        # Accept the body if either the content type or the magic bytes say PDF.
        if not content or ("pdf" not in ctype.lower() and content[:4] != b"%PDF"):
            raise SupremeFetchError(
                f"הקובץ שהתקבל עבור {citation} אינו PDF תקין (content-type={ctype})."
            )

        source_url = (
            f"{_BASE}/Home/Download?path={path}&fileName={fname}&type={_DOC_TYPE_PDF}"
        )
        safe_name = fname if fname.lower().endswith(".pdf") else f"{case_number_norm}.pdf"
        return FetchedVerdict(
            content=content, filename=safe_name, source_url=source_url,
        )
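
Because the real response shape is unconfirmed, the document lookup is a depth-first search rather than a fixed key path. The sketch below shows that technique on its own. The `sample` payload and its key names (`data`, `Docs`, `CaseName`) are invented for illustration and are not the portal's actual JSON; `first_doc_ref` is a hypothetical stand-in for `_extract_doc_ref`.

```python
from typing import Any, Iterator, Optional, Tuple


def walk(node: Any) -> Iterator[Tuple[str, str]]:
    """Yield every (path, file name) pair found anywhere in a nested blob."""
    if isinstance(node, dict):
        path = node.get("Path") or node.get("path")
        fname = node.get("FileName") or node.get("fileName")
        if path and fname:
            yield (str(path), str(fname))
        for value in node.values():
            yield from walk(value)
    elif isinstance(node, list):
        for value in node:
            yield from walk(value)


def first_doc_ref(results: Any) -> Optional[Tuple[str, str]]:
    """Return the first pair in depth-first order, or None if there is none."""
    return next(walk(results), None)


# Invented payload: the nesting is an assumption, only the search matters.
sample = {"data": [{"CaseName": "x", "Docs": [{"Path": "a/b", "FileName": "v.pdf"}]}]}

print(first_doc_ref(sample))        # ('a/b', 'v.pdf')
print(first_doc_ref({"data": []}))  # None
```

A `None` result is what lets the caller raise instead of downloading with a guessed path.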

@@ -1352,6 +1352,36 @@ CREATE INDEX IF NOT EXISTS idx_digests_content_tsv ON digests USING gin(content_
"""

# ── X13: Court Verdict Fetch queue ───────────────────────────────────────
# A lightweight, observable, idempotent job queue for the auto-fetch
# subsystem (docs/spec/X13-court-fetch.md). One row per court verdict we try
# to pull from a public source. Mirrors the extraction-queue pattern: status
# is always explicit (INV-CF2: no silent drop), the canonical case number is
# the idempotency key (INV-CF5), and ``attempts`` drives the human-fallback
# gate (INV-CF3: flip to 'manual' after N autonomous failures).
# V31: digests (X12) took V30 when it merged first.
SCHEMA_V31_SQL = """
CREATE TABLE IF NOT EXISTS court_fetch_jobs (
    id               UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    case_number_norm TEXT NOT NULL UNIQUE,            -- idempotency key (INV-CF5)
    citation_raw     TEXT NOT NULL DEFAULT '',
    tier             TEXT NOT NULL DEFAULT '',        -- supreme | admin | skip
    court            TEXT NOT NULL DEFAULT '',
    status           TEXT NOT NULL DEFAULT 'pending', -- pending|running|done|failed|manual
    attempts         INT  NOT NULL DEFAULT 0,
    error            TEXT NOT NULL DEFAULT '',
    case_law_id      UUID REFERENCES case_law(id) ON DELETE SET NULL,
    digest_id        UUID,                            -- source digest (X12), nullable for ad-hoc
    source_url       TEXT NOT NULL DEFAULT '',        -- provenance (INV-CF7)
    created_at       TIMESTAMPTZ DEFAULT now(),
    updated_at       TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX IF NOT EXISTS idx_court_fetch_jobs_status ON court_fetch_jobs(status);
CREATE INDEX IF NOT EXISTS idx_court_fetch_jobs_digest ON court_fetch_jobs(digest_id)
    WHERE digest_id IS NOT NULL;
"""


async def _run_schema_migrations(pool: asyncpg.Pool) -> None:
    async with pool.acquire() as conn:
        await conn.execute(SCHEMA_SQL)
@@ -1385,7 +1415,8 @@ async def _run_schema_migrations(pool: asyncpg.Pool) -> None:
         await conn.execute(SCHEMA_V28_SQL)
         await conn.execute(SCHEMA_V29_SQL)
         await conn.execute(SCHEMA_V30_SQL)
-    logger.info("Database schema initialized (v1-v30)")
+        await conn.execute(SCHEMA_V31_SQL)
+    logger.info("Database schema initialized (v1-v31)")


 async def init_schema() -> None:
@@ -5930,3 +5961,110 @@ async def find_missing_precedent_by_citation(
citation.strip(),
)
return _row_to_missing_precedent(row) if row else None


# ── X13: Court Verdict Fetch jobs ────────────────────────────────────────
# CRUD for the auto-fetch queue (docs/spec/X13-court-fetch.md). Status is
# always explicit; failures are recorded, never swallowed (INV-CF2). Upsert
# is keyed on the canonical case number (INV-CF5).

def _row_to_court_fetch_job(row) -> dict | None:
    return dict(row) if row else None


async def court_fetch_job_upsert(
    case_number_norm: str,
    citation_raw: str = "",
    tier: str = "",
    court: str = "",
    digest_id: UUID | None = None,
) -> dict:
    """Idempotent create-or-get of a fetch job by canonical case number.

    Re-requesting the same case number returns the existing row (with a
    ``_existing`` flag) rather than creating a duplicate: the canonical
    number is a UNIQUE key. A job that already reached a terminal state is
    returned as-is so callers can decide whether to retry.
    """
    if not (case_number_norm or "").strip():
        raise ValueError("case_number_norm is required")
    pool = await get_pool()
    async with pool.acquire() as conn:
        existing = await conn.fetchrow(
            "SELECT * FROM court_fetch_jobs WHERE case_number_norm = $1",
            case_number_norm,
        )
        if existing:
            out = _row_to_court_fetch_job(existing)
            out["_existing"] = True
            return out
        row = await conn.fetchrow(
            """INSERT INTO court_fetch_jobs
                   (case_number_norm, citation_raw, tier, court, digest_id)
               VALUES ($1, $2, $3, $4, $5)
               RETURNING *""",
            case_number_norm, citation_raw, tier, court, digest_id,
        )
        out = _row_to_court_fetch_job(row)
        out["_existing"] = False
        return out


async def court_fetch_job_update(
    job_id: UUID,
    *,
    status: str | None = None,
    error: str | None = None,
    case_law_id: UUID | None = None,
    source_url: str | None = None,
    bump_attempts: bool = False,
) -> dict:
    """Patch a job row. Only provided fields change; ``updated_at`` always does."""
    sets = ["updated_at = now()"]
    args: list = []
    # Each provided field appends its value first, so len(args) is the
    # matching positional placeholder ($1, $2, ...).
    if status is not None:
        args.append(status)
        sets.append(f"status = ${len(args)}")
    if error is not None:
        args.append(error)
        sets.append(f"error = ${len(args)}")
    if case_law_id is not None:
        args.append(case_law_id)
        sets.append(f"case_law_id = ${len(args)}")
    if source_url is not None:
        args.append(source_url)
        sets.append(f"source_url = ${len(args)}")
    if bump_attempts:
        sets.append("attempts = attempts + 1")
    args.append(job_id)
    pool = await get_pool()
    async with pool.acquire() as conn:
        row = await conn.fetchrow(
            f"UPDATE court_fetch_jobs SET {', '.join(sets)} "
            f"WHERE id = ${len(args)} RETURNING *",
            *args,
        )
    return _row_to_court_fetch_job(row)


async def court_fetch_job_get(case_number_norm: str) -> dict | None:
    pool = await get_pool()
    async with pool.acquire() as conn:
        row = await conn.fetchrow(
            "SELECT * FROM court_fetch_jobs WHERE case_number_norm = $1",
            case_number_norm,
        )
    return _row_to_court_fetch_job(row) if row else None


async def court_fetch_job_list(status: str | None = None, limit: int = 100) -> list[dict]:
    pool = await get_pool()
    async with pool.acquire() as conn:
        if status:
            rows = await conn.fetch(
                "SELECT * FROM court_fetch_jobs WHERE status = $1 "
                "ORDER BY created_at DESC LIMIT $2",
                status, limit,
            )
        else:
            rows = await conn.fetch(
                "SELECT * FROM court_fetch_jobs ORDER BY created_at DESC LIMIT $1",
                limit,
            )
    return [_row_to_court_fetch_job(r) for r in rows]
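
The partial update in `court_fetch_job_update` relies on numbering placeholders as values are appended. The following sketch isolates that statement-building step so it can be inspected without a database. `build_job_update` and `ALLOWED_COLUMNS` are hypothetical names for this illustration; unlike the function above, the sketch takes columns as keyword arguments and therefore checks them against an allow-list before interpolating them into SQL.

```python
from typing import Any, List, Tuple

# Columns that may be patched; anything else is rejected before it reaches SQL.
ALLOWED_COLUMNS = {"status", "error", "case_law_id", "source_url"}


def build_job_update(job_id: Any, *, bump_attempts: bool = False,
                     **fields: Any) -> Tuple[str, List[Any]]:
    """Build a parameterised UPDATE for the provided (non-None) fields."""
    sets = ["updated_at = now()"]
    args: List[Any] = []
    for column, value in fields.items():
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unknown column: {column}")
        if value is None:
            continue  # None means "leave this column unchanged"
        args.append(value)
        sets.append(f"{column} = ${len(args)}")  # placeholder index = position
    if bump_attempts:
        sets.append("attempts = attempts + 1")
    args.append(job_id)
    sql = (f"UPDATE court_fetch_jobs SET {', '.join(sets)} "
           f"WHERE id = ${len(args)} RETURNING *")
    return sql, args


sql, args = build_job_update("job-1", status="failed", error="boom", bump_attempts=True)
print(sql)
print(args)  # ['failed', 'boom', 'job-1']
```

Values only ever travel as positional arguments, so the error text from a failed fetch cannot alter the statement.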