feat(X13): auto-fetch court verdicts from נט המשפט → corpus (Tier 0 + scaffold)

Automatic case-law retrieval subsystem: when a digest points to a court judgment, we classify the court instance, download the judgment from the matching public source, and ingest it through the canonical ingestion pipeline.

- spec-first: docs/spec/X13-court-fetch.md (INV-CF1..CF7) + index
- classifier court_citation.py (supreme/admin/skip) + 10 tests (עת"מ 46111-12-22 → admin)
- Tier 0: court_fetch_supreme.py: supremedecisions API (reverse-engineered), httpx + browser headers (verified 200) + politeness
- queue court_fetch_jobs (SCHEMA_V30) + DB helpers + court_fetch_orchestrator.py
- Tier 1 scaffold: legal-court-fetch-service (aiohttp + Bearer, mirrors legal-chat-service) + camofox_client (Camoufox, open-source) + recaptcha_audio (local Whisper) + pm2
- Tier 2 graceful fallback: manual + missing_precedent (INV-CF2/CF3: no silent drop)
- MCP tools court_verdict_fetch / court_fetch_status; SCRIPTS.md

Invariants: satisfies G2 (single ingestion path, INV-CF1) · G3/G1 (idempotent + normalization, INV-CF5) · G4/§6 (no silent swallowing, INV-CF2) · G10 (human gate, INV-CF3) · G5 (source_type, INV-CF6) · G9 (provenance + audit, INV-CF7).
Sources for INV-CF4: RFC 9309 · Google crawler · OWASP OAT.

Follow-ups (not yet verified live): live Tier-0 validation · install camofox-browser + whisper · calibrate Tier-1 selectors · COURT_FETCH_SHARED_SECRET (Infisical + Coolify) · trigger from digest try_autolink (worktree-digests-radar). V30 may conflict with digests-radar.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
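The classifier itself (court_citation.py) is not part of the files shown on this page, so the following is only a minimal sketch of the idea, under stated assumptions: a citation such as עת"מ 46111-12-22 splits into the file number, month and year that the Tier-1 service expects, and the עת"מ prefix routes it to the admin tier. The names `Citation`, `classify`, and the one-entry prefix table are hypothetical and exist only for illustration.

```python
import re
from typing import NamedTuple, Optional


class Citation(NamedTuple):
    tier: str          # "admin" or "skip" in this sketch
    file_number: str
    month: str
    year: str


# Hypothetical prefix table for illustration; the real classifier's rules
# (including its "supreme" branch) are not shown in this commit page.
_ADMIN_PREFIXES = ('עת"מ',)

# Case numbers of the form <file>-<MM>-<YY>, e.g. 46111-12-22.
_CASE_RE = re.compile(r"(\d{1,6})-(\d{2})-(\d{2})")


def classify(citation: str) -> Optional[Citation]:
    """Return the parsed citation, or None when no case number is found."""
    match = _CASE_RE.search(citation)
    if match is None:
        return None
    tier = "admin" if citation.strip().startswith(_ADMIN_PREFIXES) else "skip"
    return Citation(tier, *match.groups())
```

A citation without a recognisable case number returns `None` rather than a guessed result, which is in the spirit of the "no silent drop" invariant: the caller has to decide explicitly what to do with it.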
7
mcp-server/src/legal_mcp/court_fetch_service/__init__.py
Normal file
@@ -0,0 +1,7 @@
"""Host-side Tier-1 verdict fetch service (X13).

Runs on the host under pm2 (it needs a real browser, which the legal-ai
container can't run). Drives a Camoufox stealth browser against נט המשפט to
download administrative/district-court verdicts the Supreme portal (Tier 0)
doesn't carry. See docs/spec/X13-court-fetch.md.
"""
148
mcp-server/src/legal_mcp/court_fetch_service/camofox_client.py
Normal file
@@ -0,0 +1,148 @@
"""Camoufox-browser client + נט-המשפט navigation flow (X13, Tier 1).

Open-source, zero-API-cost stealth browsing: a self-hosted ``camofox-browser``
REST server (``jo-inc/camofox-browser``, wrapping Camoufox, a Firefox fork
with C++ fingerprint spoofing) drives a real browser. We talk to it over the
same REST surface the Hermes agent uses (``~/.hermes/.../browser_camofox.py``):

    POST   /tabs                      → {tab_id}
    POST   /tabs/{tab}/navigate       {url}
    GET    /tabs/{tab}/snapshot       → accessibility tree w/ element refs
    POST   /tabs/{tab}/click          {ref}
    POST   /tabs/{tab}/type           {ref,text}
    GET    /tabs/{tab}/screenshot
    DELETE /sessions/{user}

Set ``CAMOFOX_URL`` (e.g. ``http://127.0.0.1:9377``) to enable. The server's
``/health`` exposes a VNC URL; that's the human-fallback surface (INV-CF3):
when the autonomous reCAPTCHA solve fails, the chair opens the VNC and solves
it live, and this flow continues.

⚠ CALIBRATION: the נט-המשפט external-case-search is an ASP.NET WebForms app
behind an F5 WAF + reCAPTCHA. The element selectors and step sequence below
are the *documented plan* of the flow; they must be calibrated against the
live snapshot on first run (the site rate-limited static probing during
development). Every step that can't find its target **raises** a clear Hebrew
reason (INV-CF2: no silent success-with-garbage) so the orchestrator escalates
to the Tier-2 human fallback rather than returning an empty/wrong file.
"""

from __future__ import annotations

import logging
import os

import httpx

logger = logging.getLogger(__name__)

# נט המשפט public entry points (discovered from the homepage __doPostBack menu).
NGCS_HOME = "https://www.court.gov.il/ngcs.web.site/homepage.aspx"

CAMOFOX_URL = os.environ.get("CAMOFOX_URL", "").rstrip("/")
_TIMEOUT = float(os.environ.get("COURT_FETCH_BROWSER_TIMEOUT_S", "60"))


class CamofoxUnavailable(RuntimeError):
    """camofox-browser isn't configured/reachable."""


class NgcsFlowError(RuntimeError):
    """A step in the נט-המשפט flow failed (selector/CAPTCHA/navigation)."""


def is_enabled() -> bool:
    return bool(CAMOFOX_URL)


async def health() -> dict:
    """Probe camofox-browser; surfaces the VNC URL for the human fallback."""
    if not CAMOFOX_URL:
        raise CamofoxUnavailable("CAMOFOX_URL is not set")
    async with httpx.AsyncClient(timeout=10) as c:
        r = await c.get(f"{CAMOFOX_URL}/health")
        r.raise_for_status()
        return r.json()


class _Browser:
    """Thin async wrapper over the camofox-browser REST surface."""

    def __init__(self, client: httpx.AsyncClient, tab_id: str, user_id: str):
        self._c = client
        self.tab = tab_id
        self.user = user_id

    @classmethod
    async def open(cls, client: httpx.AsyncClient) -> "_Browser":
        r = await client.post(f"{CAMOFOX_URL}/tabs", json={})
        r.raise_for_status()
        data = r.json()
        # Fall back to the tab id when the server reports no separate user id.
        return cls(client, data["tab_id"], data.get("user_id", data["tab_id"]))

    async def navigate(self, url: str) -> None:
        r = await self._c.post(f"{CAMOFOX_URL}/tabs/{self.tab}/navigate", json={"url": url})
        r.raise_for_status()

    async def snapshot(self) -> dict:
        r = await self._c.get(f"{CAMOFOX_URL}/tabs/{self.tab}/snapshot")
        r.raise_for_status()
        return r.json()

    async def click(self, ref: str) -> dict:
        r = await self._c.post(f"{CAMOFOX_URL}/tabs/{self.tab}/click", json={"ref": ref})
        r.raise_for_status()
        return r.json()

    async def type(self, ref: str, text: str) -> None:
        r = await self._c.post(
            f"{CAMOFOX_URL}/tabs/{self.tab}/type", json={"ref": ref, "text": text}
        )
        r.raise_for_status()

    async def close(self) -> None:
        # Best-effort cleanup: a failed session delete must not mask the real
        # error from the flow, but it is logged rather than dropped silently.
        try:
            await self._c.delete(f"{CAMOFOX_URL}/sessions/{self.user}")
        except httpx.HTTPError as e:
            logger.debug("camofox session cleanup failed: %s", e)


async def fetch_admin_verdict(
    *, file_number: str, month: str, year: str, case_number: str, court: str
) -> dict:
    """Drive נט המשפט to download an admin/district verdict PDF.

    Returns ``{content: bytes, filename: str, source_url: str, court: str}``.
    Raises ``CamofoxUnavailable`` / ``NgcsFlowError`` on failure.

    The flow (to be calibrated against the live snapshot):
      1. Open the homepage; trigger "חיפוש תיקים חיצוני" (btnExternalSearchCases).
      2. Fill the case-number / month / year fields.
      3. Solve the reCAPTCHA via the audio challenge (recaptcha_audio); on
         repeated failure, surface the VNC URL for a human solve (INV-CF3).
      4. Submit; open the matched case; locate the verdict ("פסק דין") document.
      5. Download the cleared PDF (served via S3 pre-signed URL) and return bytes.
    """
    if not CAMOFOX_URL:
        raise CamofoxUnavailable(
            "שירות-הדפדפן (camofox-browser) אינו מוגדר: הגדר CAMOFOX_URL "
            "והפעל את jo-inc/camofox-browser. ראה docs/spec/X13-court-fetch.md."
        )

    async with httpx.AsyncClient(timeout=_TIMEOUT) as client:
        br = await _Browser.open(client)
        try:
            await br.navigate(NGCS_HOME)
            snap = await br.snapshot()
            _ = snap  # calibration anchor: locate btnExternalSearchCases here.

            # The concrete selector/CAPTCHA/download steps require live
            # calibration with camofox running. Until calibrated we fail
            # loudly so the orchestrator escalates to the human fallback
            # (INV-CF2/CF3) rather than pretending success.
            raise NgcsFlowError(
                "זרימת נט-המשפט (Tier 1) ממתינה לכיול מול snapshot חי של "
                "camofox-browser; בקשת-אחזור מוסלמת ל-fallback אנושי (VNC/ידני)."
            )
        finally:
            await br.close()
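Step 3 of the flow above (try the audio solve, then hand over to a human) is not implemented yet in this commit. One way it might look is sketched below, as an illustration only: `solve_or_escalate`, its parameters, and the local `NgcsFlowError` stand-in are hypothetical names, and the caller is assumed to supply a coroutine that performs one solve attempt and reports success.

```python
import asyncio
from typing import Awaitable, Callable


class NgcsFlowError(RuntimeError):
    """Local stand-in for camofox_client.NgcsFlowError, so the sketch runs alone."""


async def solve_or_escalate(
    attempt_solve: Callable[[], Awaitable[bool]],
    *,
    vnc_url: str,
    max_attempts: int = 3,
) -> int:
    """Try the autonomous solve a bounded number of times.

    Returns the 1-based attempt number that succeeded. After the last failed
    attempt it raises, carrying the VNC URL so a human can take over.
    """
    for attempt in range(1, max_attempts + 1):
        if await attempt_solve():
            return attempt
    raise NgcsFlowError(
        f"reCAPTCHA not solved after {max_attempts} attempts; "
        f"human solve needed via VNC: {vnc_url}"
    )
```

Raising instead of returning a sentinel keeps the failure on the same path the module already uses: the service turns `NgcsFlowError` into an `ok: false` response with a reason.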
@@ -0,0 +1,80 @@
"""Open-source reCAPTCHA v2 audio-challenge solver (X13, Tier 1).

Pure open-source, zero-API-cost: switch the reCAPTCHA widget to its **audio**
challenge, download the mp3, transcribe it with a **local Whisper** model
(``faster-whisper``), and submit the transcript. This is the well-known
"Buster"-style technique. It is intentionally a *best-effort* solver:
reCAPTCHA actively fights audio solving, so a non-trivial failure rate is
expected and handled by the Tier-2 human fallback (INV-CF3), never hidden.

Model is loaded lazily and cached; ``WHISPER_MODEL`` (default ``small``) and
``WHISPER_DEVICE`` (default ``cpu``) tune it. The dependency is optional: if
``faster-whisper`` isn't installed, ``transcribe_audio`` raises a clear error
so the caller falls back to a human solve rather than crashing the service.
"""

from __future__ import annotations

import logging
import os
import tempfile

import httpx

logger = logging.getLogger(__name__)

_WHISPER_MODEL_NAME = os.environ.get("WHISPER_MODEL", "small")
_WHISPER_DEVICE = os.environ.get("WHISPER_DEVICE", "cpu")
_model = None


class AudioSolveUnavailable(RuntimeError):
    """faster-whisper isn't installed; cannot solve audio locally."""


def _get_model():
    global _model
    if _model is not None:
        return _model
    try:
        from faster_whisper import WhisperModel  # type: ignore
    except ImportError as e:
        raise AudioSolveUnavailable(
            "faster-whisper אינו מותקן: לא ניתן לפתור reCAPTCHA אודיו מקומית. "
            "התקן `pip install faster-whisper` או הסתמך על fallback אנושי (VNC)."
        ) from e
    logger.info("loading whisper model %s on %s", _WHISPER_MODEL_NAME, _WHISPER_DEVICE)
    _model = WhisperModel(
        _WHISPER_MODEL_NAME, device=_WHISPER_DEVICE, compute_type="int8"
    )
    return _model


async def download_audio(audio_url: str) -> bytes:
    async with httpx.AsyncClient(timeout=30, follow_redirects=True) as c:
        r = await c.get(audio_url)
        r.raise_for_status()
        return r.content


def transcribe_audio(mp3_bytes: bytes) -> str:
    """Transcribe a reCAPTCHA audio clip to its (English) digit/word phrase.

    Raises ``AudioSolveUnavailable`` if the local model isn't installed.
    """
    model = _get_model()
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=True) as f:
        f.write(mp3_bytes)
        f.flush()
        # reCAPTCHA audio is English regardless of page locale.
        segments, _info = model.transcribe(f.name, language="en")
        # ``segments`` is a lazy generator, so consume it while the temporary
        # file still exists.
        text = " ".join(seg.text for seg in segments).strip()
    # Normalise: reCAPTCHA expects the bare phrase, lower-case, no punctuation.
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


async def solve_from_audio_url(audio_url: str) -> str:
    """Convenience: download + transcribe an audio-challenge URL."""
    mp3 = await download_audio(audio_url)
    return transcribe_audio(mp3)
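Two points about the module above may be easier to see in isolation. The first is the normalisation step at the end of `transcribe_audio`. The second is that `solve_from_audio_url` calls the synchronous, CPU-bound transcription directly from a coroutine, which would hold up the event loop (and with it `/health`) while Whisper runs. The sketch below shows the normalisation as a standalone function and one possible way to move the blocking call onto a worker thread. `normalise_transcript` and `transcribe_off_loop` are hypothetical names, not part of the commit, and `asyncio.to_thread` assumes Python 3.9 or later.

```python
import asyncio
from typing import Callable


def normalise_transcript(text: str) -> str:
    """Lower-case, drop punctuation, collapse whitespace (same rule as above)."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


async def transcribe_off_loop(
    transcribe: Callable[[bytes], str], mp3_bytes: bytes
) -> str:
    """Run a blocking transcription function in a worker thread.

    ``transcribe`` stands in for ``transcribe_audio``; any exception it raises
    (for example AudioSolveUnavailable) propagates to the awaiting caller.
    """
    return await asyncio.to_thread(transcribe, mp3_bytes)
```

Whether the extra thread is worth it depends on how long the chosen Whisper model takes on the host; for a single-user service the direct call may be acceptable.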
145
mcp-server/src/legal_mcp/court_fetch_service/server.py
Normal file
@@ -0,0 +1,145 @@
"""Host-side HTTP bridge for Tier-1 verdict fetching (X13).

Mirrors ``legal_mcp.chat_service.server``, the proven host-side pattern: an
aiohttp app, bound to the docker bridge gateway, Bearer-auth, that does the one
thing the container can't (here: drive a real browser against נט המשפט).

Endpoints:
    POST /fetch    body {file_number, month, year, case_number, court}
                   → {ok, content_b64, filename, source_url, court, reason}
                   REQUIRES Authorization: Bearer <COURT_FETCH_SHARED_SECRET>.
    GET  /health   liveness (no auth); reports camofox + VNC URL if available.

Run with pm2:
    pm2 start scripts/legal-court-fetch-service.config.cjs

Security posture (identical rationale to legal-chat-service):
    1. Bind defaults to ``10.0.1.1`` (docker0 bridge gateway): reachable from
       the host + containers on docker bridges, invisible to outside networks.
    2. ``/fetch`` requires a Bearer token (constant-time compare); the service
       refuses to start without ``COURT_FETCH_SHARED_SECRET`` set.
    3. ``/health`` is unauthenticated and spawns nothing.
"""

from __future__ import annotations

import argparse
import base64
import hmac
import json
import logging
import os
import sys

from aiohttp import web

_pkg_root = os.path.dirname(os.path.dirname(os.path.dirname(__file__)))
if _pkg_root not in sys.path:
    sys.path.insert(0, _pkg_root)

from legal_mcp.court_fetch_service import camofox_client  # noqa: E402

logger = logging.getLogger("legal_court_fetch_service")

_SHARED_SECRET: str = ""


async def health(request: web.Request) -> web.Response:
    info = {"ok": True, "service": "legal-court-fetch-service",
            "camofox_enabled": camofox_client.is_enabled()}
    if camofox_client.is_enabled():
        try:
            info["camofox"] = await camofox_client.health()
        except Exception as e:  # health must never throw
            info["camofox_error"] = str(e)
    return web.json_response(info)


def _check_bearer(request: web.Request) -> web.Response | None:
    auth = request.headers.get("Authorization", "")
    expected = "Bearer " + _SHARED_SECRET
    # Compare as bytes: hmac.compare_digest() raises TypeError on non-ASCII
    # str input, so a malformed header would otherwise surface as a 500.
    if not auth or not hmac.compare_digest(
        auth.encode("utf-8", "surrogateescape"),
        expected.encode("utf-8", "surrogateescape"),
    ):
        return web.json_response(
            {"error": "unauthorized: missing or invalid Bearer token"}, status=401
        )
    return None


async def fetch(request: web.Request) -> web.Response:
    unauth = _check_bearer(request)
    if unauth is not None:
        return unauth
    try:
        body = await request.json()
    except (json.JSONDecodeError, UnicodeDecodeError):
        return web.json_response({"error": "invalid JSON body"}, status=400)
    # Valid JSON that isn't an object (e.g. a list) has no .get(); reject it
    # here instead of failing with an AttributeError below.
    if not isinstance(body, dict):
        return web.json_response({"error": "JSON body must be an object"}, status=400)

    required = ("file_number", "month", "year")
    if not all(body.get(k) for k in required):
        return web.json_response(
            {"ok": False, "reason": f"missing one of {required}"}, status=400
        )

    try:
        result = await camofox_client.fetch_admin_verdict(
            file_number=str(body["file_number"]),
            month=str(body["month"]),
            year=str(body["year"]),
            case_number=str(body.get("case_number", "")),
            court=str(body.get("court", "")),
        )
        return web.json_response({
            "ok": True,
            "content_b64": base64.b64encode(result["content"]).decode("ascii"),
            "filename": result.get("filename", ""),
            "source_url": result.get("source_url", ""),
            "court": result.get("court", ""),
        })
    except (camofox_client.CamofoxUnavailable, camofox_client.NgcsFlowError) as e:
        # Expected, recoverable failure → orchestrator escalates (INV-CF3).
        return web.json_response({"ok": False, "reason": str(e)}, status=200)
    except Exception as e:  # noqa: BLE001
        logger.exception("fetch failed")
        return web.json_response({"ok": False, "reason": f"unexpected: {e}"}, status=200)


def build_app() -> web.Application:
    app = web.Application(client_max_size=64 * 1024 * 1024)
    app.router.add_get("/health", health)
    app.router.add_post("/fetch", fetch)
    return app


def main() -> int:
    parser = argparse.ArgumentParser(description="legal-court-fetch-service")
    parser.add_argument("--port", type=int, default=8771)
    parser.add_argument("--host", default="10.0.1.1",
                        help="bind address; default = docker0 bridge gateway")
    parser.add_argument("--log-level", default="INFO")
    args = parser.parse_args()

    logging.basicConfig(level=args.log_level.upper(),
                        format="%(asctime)s %(name)s %(levelname)s %(message)s")

    secret = os.environ.get("COURT_FETCH_SHARED_SECRET", "").strip()
    if not secret:
        logger.error(
            "COURT_FETCH_SHARED_SECRET is empty; refusing to start. Set it in "
            "/home/chaim/.legal-court-fetch-service.env (loaded by pm2) and "
            "mirror it as a Coolify env var on the legal-ai app."
        )
        return 2
    # The hard floor enforced here is 24 characters; the message states both
    # the floor and the expected length so the two no longer disagree.
    if len(secret) < 24:
        logger.error(
            "COURT_FETCH_SHARED_SECRET too short (minimum 24 chars, >=32 expected)."
        )
        return 2
    global _SHARED_SECRET
    _SHARED_SECRET = secret

    app = build_app()
    logger.info("legal-court-fetch-service listening on %s:%d", args.host, args.port)
    web.run_app(app, host=args.host, port=args.port, print=lambda _m: None)
    return 0


if __name__ == "__main__":
    sys.exit(main())
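For orientation, here is a rough sketch of what a caller of `POST /fetch` has to do, based only on the request and response shapes documented in the module docstring. The helper names `build_fetch_request` and `decode_fetch_response` are hypothetical; the orchestrator's real client is not shown in this commit page. The point worth noticing is that the service answers expected failures with HTTP 200 and `ok: false`, so the caller has to check `ok` rather than the status code.

```python
import base64


def build_fetch_request(
    secret: str,
    *,
    file_number: str,
    month: str,
    year: str,
    case_number: str = "",
    court: str = "",
) -> tuple[dict, dict]:
    """Return (headers, JSON payload) for POST /fetch."""
    headers = {"Authorization": f"Bearer {secret}"}
    payload = {
        "file_number": file_number,
        "month": month,
        "year": year,
        "case_number": case_number,
        "court": court,
    }
    return headers, payload


def decode_fetch_response(data: dict) -> bytes:
    """Return the PDF bytes, or raise with the service's stated reason."""
    if not data.get("ok"):
        # Failures arrive as HTTP 200 with ok=False and a human-readable reason.
        raise RuntimeError(data.get("reason") or "fetch failed")
    return base64.b64decode(data["content_b64"])
```

With the defaults from `main()`, the request itself could then be sent with, for example, `httpx.post("http://10.0.1.1:8771/fetch", headers=headers, json=payload)`, followed by `decode_fetch_response(response.json())`.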