feat(X13): auto-fetch court verdicts from נט המשפט → corpus (Tier 0 + scaffold)

תת-מערכת אחזור-פסיקה אוטומטי: כשיומון מצביע על פס"ד בית-משפט, מסווגים את הערכאה, מורידים מהמקור הציבורי המתאים, וקולטים דרך צינור-הקליטה הקנוני. - spec-first: docs/spec/X13-court-fetch.md (INV-CF1..CF7) + אינדקס - מסווג court_citation.py (supreme/admin/skip) + 10 בדיקות (עת"מ 46111-12-22 → admin) - Tier 0: court_fetch_supreme.py — supremedecisions API (reverse-engineered), httpx + browser-headers (אומת 200) + politeness - תור court_fetch_jobs (SCHEMA_V30) + DB helpers + court_fetch_orchestrator.py - Tier 1 scaffold: legal-court-fetch-service (aiohttp+Bearer, מראת legal-chat-service) + camofox_client (Camoufox open-source) + recaptcha_audio (Whisper מקומי) + pm2 - Tier 2 fallback חינני: manual + missing_precedent (INV-CF2/CF3 — אין drop שקט) - כלי-MCP court_verdict_fetch / court_fetch_status; SCRIPTS.md Invariants: מקיים G2 (מסלול-קליטה יחיד, INV-CF1) · G3/G1 (idempotent+נרמול, INV-CF5) · G4/§6 (אין בליעה שקטה, INV-CF2) · G10 (שער-אנושי, INV-CF3) · G5 (source_type, INV-CF6) · G9 (provenance+audit, INV-CF7). מקורות INV-CF4: RFC 9309 · Google crawler · OWASP OAT. Follow-ups (טרם אומתו חי): live Tier-0 validation · התקנת camofox-browser+whisper · כיול selectors Tier-1 · COURT_FETCH_SHARED_SECRET (Infisical+Coolify) · טריגר מ-digest try_autolink (worktree-digests-radar). V30 עלול להתנגש עם digests-radar. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-07 18:08:23 +00:00
parent 955675eb1f
commit 0990db7a3c
16 changed files with 1518 additions and 3 deletions
--- a/mcp-server/src/legal_mcp/court_fetch_service/recaptcha_audio.py
+++ b/mcp-server/src/legal_mcp/court_fetch_service/recaptcha_audio.py
@@ -0,0 +1,80 @@
+"""Open-source reCAPTCHA v2 audio-challenge solver (X13, Tier 1).
+
+Pure open-source, zero-API-cost: switch the reCAPTCHA widget to its **audio**
+challenge, download the mp3, transcribe it with a **local Whisper** model
+(``faster-whisper``), and submit the transcript. This is the well-known
+"Buster"-style technique. It is intentionally a *best-effort* solver —
+reCAPTCHA actively fights audio solving, so a non-trivial failure rate is
+expected and handled by the Tier-2 human fallback (INV-CF3), never hidden.
+
+Model is loaded lazily and cached; ``WHISPER_MODEL`` (default ``small``) and
+``WHISPER_DEVICE`` (default ``cpu``) tune it. The dependency is optional — if
+``faster-whisper`` isn't installed, ``transcribe_audio`` raises a clear error
+so the caller falls back to a human solve rather than crashing the service.
+"""
+
+from __future__ import annotations
+
+import logging
+import os
+import tempfile
+
+import httpx
+
+logger = logging.getLogger(__name__)
+
+_WHISPER_MODEL_NAME = os.environ.get("WHISPER_MODEL", "small")
+_WHISPER_DEVICE = os.environ.get("WHISPER_DEVICE", "cpu")
+_model = None
+
+
+class AudioSolveUnavailable(RuntimeError):
+    """faster-whisper isn't installed — cannot solve audio locally."""
+
+
+def _get_model():
+    global _model
+    if _model is not None:
+        return _model
+    try:
+        from faster_whisper import WhisperModel  # type: ignore
+    except ImportError as e:
+        raise AudioSolveUnavailable(
+            "faster-whisper אינו מותקן — לא ניתן לפתור reCAPTCHA אודיו מקומית. "
+            "התקן `pip install faster-whisper` או הסתמך על fallback אנושי (VNC)."
+        ) from e
+    logger.info("loading whisper model %s on %s", _WHISPER_MODEL_NAME, _WHISPER_DEVICE)
+    _model = WhisperModel(
+        _WHISPER_MODEL_NAME, device=_WHISPER_DEVICE, compute_type="int8"
+    )
+    return _model
+
+
+async def download_audio(audio_url: str) -> bytes:
+    async with httpx.AsyncClient(timeout=30, follow_redirects=True) as c:
+        r = await c.get(audio_url)
+        r.raise_for_status()
+        return r.content
+
+
+def transcribe_audio(mp3_bytes: bytes) -> str:
+    """Transcribe a reCAPTCHA audio clip to its (English) digit/word phrase.
+
+    Raises ``AudioSolveUnavailable`` if the local model isn't installed.
+    """
+    model = _get_model()
+    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=True) as f:
+        f.write(mp3_bytes)
+        f.flush()
+        # reCAPTCHA audio is English regardless of page locale.
+        segments, _info = model.transcribe(f.name, language="en")
+        text = " ".join(seg.text for seg in segments).strip()
+    # Normalise: reCAPTCHA expects the bare phrase, lower-case, no punctuation.
+    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
+    return " ".join(cleaned.split())
+
+
+async def solve_from_audio_url(audio_url: str) -> str:
+    """Convenience: download + transcribe an audio-challenge URL."""
+    mp3 = await download_audio(audio_url)
+    return transcribe_audio(mp3)