feat(X13): auto-fetch court verdicts from נט המשפט → corpus (Tier 0 + scaffold)
Automatic case-law retrieval subsystem: when a digest points to a court verdict, classify the court instance, download the verdict from the matching public source, and ingest it through the canonical ingest pipeline.

- spec-first: docs/spec/X13-court-fetch.md (INV-CF1..CF7) + index
- classifier court_citation.py (supreme/admin/skip) + 10 tests (עת"מ 46111-12-22 → admin)
- Tier 0: court_fetch_supreme.py, using the supremedecisions API (reverse-engineered), httpx + browser headers (verified 200) + politeness
- queue court_fetch_jobs (SCHEMA_V30) + DB helpers + court_fetch_orchestrator.py
- Tier 1 scaffold: legal-court-fetch-service (aiohttp+Bearer, mirrors legal-chat-service) + camofox_client (Camoufox, open-source) + recaptcha_audio (local Whisper) + pm2
- Tier 2 graceful fallback: manual + missing_precedent (INV-CF2/CF3: no silent drop)
- MCP tools court_verdict_fetch / court_fetch_status; SCRIPTS.md

Invariants: satisfies G2 (single ingest path, INV-CF1) · G3/G1 (idempotent + normalization, INV-CF5) · G4/§6 (no silent swallowing, INV-CF2) · G10 (human gate, INV-CF3) · G5 (source_type, INV-CF6) · G9 (provenance + audit, INV-CF7).
INV-CF4 sources: RFC 9309 · Google crawler · OWASP OAT.

Follow-ups (not yet verified live): live Tier-0 validation · install camofox-browser + whisper · calibrate Tier-1 selectors · COURT_FETCH_SHARED_SECRET (Infisical + Coolify) · trigger from digest try_autolink (worktree-digests-radar). V30 may collide with digests-radar.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -250,6 +250,7 @@ Hellyer (Law Library Journal 110:4, 2018, open-access) — טיפול-שיפוט
| [X9-mcp-tool-contract.md](X9-mcp-tool-contract.md) | Contract for the 71 MCP tools: envelope · naming · idempotency · extract/get symmetry · permission completeness | G2, G3, G10 |
| [X10-deploy-env-secrets.md](X10-deploy-env-secrets.md) | env-catalog SSoT · single config source (Coolify) · no hardcoding · secrets · drift | G2, G4, G9 |
| [X11-citation-corroboration.md](X11-citation-corroboration.md) | Internal citator: validating holdings (halachot) by cumulative judicial treatment · controlled G10 correction · corroboration threshold · match to holding | G9, G10 |
| [X13-court-fetch.md](X13-court-fetch.md) | Automatic case-law retrieval from נט המשפט: 3 tiers (supreme/admin/skip) · host service · reCAPTCHA · human gate | G2, G3, G4, G5, G9, G10 |

> **X6–X10 (cycle 2):** cover the 8 application surfaces outside the core pipeline (integration, web-ui, field filling,
> analysis storage, MCP tools, deploy/env). The findings are in [gap-audit.md](gap-audit.md) (GAP-24..62 → FU-9..15)

@@ -3,9 +3,10 @@
This is the canonical source of truth for "what is correct" in the system. Entry point: [00-constitution.md](00-constitution.md).
Every invariant is backed by ≥3 authoritative sources; an unverified item is marked ⚠ UNVERIFIED and escalated to the chair.

-Structure: 00 constitution · 01–07 lifecycle · X1–X10 cross-stage. See the full index in the constitution.
+Structure: 00 constitution · 01–07 lifecycle · X1–X13 cross-stage. See the full index in the constitution.
- X1–X5: identifiers · multi-company · integration+deploy · agents · audit.
- X6–X10 (cycle 2, the 8 application surfaces): UI↔API contract · Paperclip client · field filling · MCP tool contract · deploy/env/secrets.
- X11 · X13: holdings-validation citator · automatic case-law retrieval from נט המשפט (service).

Findings maps: [gap-audit.md](gap-audit.md) (GAP-01..62 → FU-1..15; cycle 1 ✅ complete, cycle 2 open) · [ui-audit.md](ui-audit.md) (audit of 13 UI pages).
Design basis: docs/superpowers/specs/2026-05-30-system-spec-design.md

151
docs/spec/X13-court-fetch.md
Normal file
@@ -0,0 +1,151 @@
# X13: Automatic Case-Law Retrieval from נט המשפט (Court Verdict Fetch)

> Subject to the [system constitution](00-constitution.md). A **service** subsystem (not a corpus) that downloads public
> court verdicts and feeds them into the **canonical ingest pipeline** of the case-law library. Conceptual sibling
> of [X12: Digests Radar](X12-digests-radar.md) (the main trigger) and of [01-ingest](01-ingest.md)
> (the destination). It is neither a fourth corpus nor a parallel ingest path.

---

## 0. Purpose and context

A digest points to an underlying verdict (`underlying_citation`, for example `עת"מ 46111-12-22`). When the verdict
is not in the corpus, the system **retrieves it automatically** from a public source, extracts the text, and ingests it through
`precedent_library_upload` → `ingest_precedent`. This turns a verdict from "cited only" into "usable for search
and holding extraction".

**Critical source distinction:** only **court verdicts** can be retrieved publicly. **Appeals-committee decisions**
are not publicly available (Nevo is required); they are flagged as a gap and are not sent for retrieval.

**Two public source routes:**
- **Supreme Court** (עע"מ/בג"ץ/ע"א/רע"א/בר"מ/דנ"א) → `supremedecisions.court.gov.il`: direct download (httpx), no CAPTCHA.
- **Administrative/district/magistrate** (עת"מ/עמ"נ/...) → the case viewer of **נט המשפט**: ASP.NET WebForms
  (`__doPostBack`/VIEWSTATE), F5 anti-bot, reCAPTCHA on the public search, documents served as S3 cleared URLs.
  Requires a **real browser** (host-side), hence a host service under pm2 (following the `legal-chat-service` pattern).

---

## 1. Architecture: three tiers

```
underlying_citation → [classifier] → tier ∈ {supreme, admin, skip}
  skip (ערר/בל"מ) → missing_precedent (manual Nevo); no retrieval
  supreme → Tier 0: httpx in the container → supremedecisions; fully autonomous
  admin   → Tier 1: legal-court-fetch-service (host/pm2); autonomous-first
              → Camoufox stealth browser → external-search → reCAPTCHA (audio/Whisper)
              → download cleared PDF
            → Tier 2 fallback: manual VNC / missing_precedent + alert; human gate
  (all tiers) → precedent_library_upload(source_type=court_ruling) → ingest_precedent
              → chunks+embeddings+halachot(pending) → relink digest / close gap
```

Job state is managed in the `court_fetch_jobs` queue table (idempotent, observable, retryable).

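The tier routing in the diagram can be sketched as a small lookup table. This is only an illustration: `Route`, `ROUTES` and `route_for` are hypothetical names, and the handler labels are placeholders rather than the real orchestrator entry points.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    tier: str
    handler: str      # placeholder label, not a real function name
    autonomous: bool  # False means a human/manual path from the start


# One route per classifier outcome, mirroring the diagram.
ROUTES = {
    "supreme": Route("supreme", "tier0_httpx", True),
    "admin": Route("admin", "tier1_browser_service", True),
    "skip": Route("skip", "missing_precedent", False),
}


def route_for(tier: str) -> Route:
    """Return the route for a tier; an unknown tier is an error, never a silent skip."""
    try:
        return ROUTES[tier]
    except KeyError:
        raise ValueError(f"unknown tier: {tier!r}") from None
```

Raising on an unknown tier, instead of returning a default, keeps the routing consistent with the rule that nothing is dropped silently.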
---

## 2. Invariants

### INV-CF1: Single ingest path, no parallel ingest
**Rule:** all tiers drain into the **single canonical ingest pipeline** (`precedent_library_upload` →
`ingest_precedent`). The fetcher supplies only a file plus metadata; it must not write `case_law`/`precedent_chunks`/
`halachot` directly or duplicate chunking/embedding logic.
**Source of authority:** project-operational; applies [G2](00-constitution.md#inv-g2) (single source of truth, no parallel path) to this subsystem.
**Enforcement:** the orchestrator calls only the existing ingest API/service; architecture review in PR.
**Known violation:** none

### INV-CF2: No silent swallowing, every retrieval is observable
**Rule:** every verdict identified for retrieval has a job record with an explicit final status
(`done`/`failed`/`manual`). A retrieval failure is **never swallowed**: it is flagged and escalated (Tier 2),
not silently dropped. `except: pass` is forbidden.
**Source of authority:** project-operational; applies [G4](00-constitution.md#inv-g4) and the engineering rule "no silent swallowing" (§6).
**Enforcement:** the `court_fetch_jobs` table (status+error+attempts) + a warning log on every failure + the Tier-2 gate.
**Known violation:** the existing gap in X12: a failing `try_autolink` returns `None` silently (to be fixed by this trigger).

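A minimal sketch of what "never swallowed" might look like for one attempt, assuming an in-memory job object. `FetchJob` and `run_once` are hypothetical names; the real state lives in `court_fetch_jobs`.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("court_fetch_sketch")


@dataclass
class FetchJob:
    case_number_norm: str
    status: str = "pending"
    attempts: int = 0
    error: Optional[str] = None


def run_once(job: FetchJob, fetch: Callable[[str], bytes]) -> FetchJob:
    """Run one attempt; every outcome ends in an explicit status."""
    job.status = "running"
    job.attempts += 1
    try:
        fetch(job.case_number_norm)
    except Exception as exc:  # broad on purpose: record and log, never `pass`
        job.status = "failed"
        job.error = f"{type(exc).__name__}: {exc}"
        logger.warning(
            "court fetch failed for %s (attempt %d): %s",
            job.case_number_norm, job.attempts, job.error,
        )
    else:
        job.status = "done"
        job.error = None
    return job
```

The failure path both stores the message on the job and emits a warning, so the failure is visible in the queue and in the logs.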
### INV-CF3: Autonomous-first, mandatory human gate on fallback
**Rule:** retrieval first tries autonomously; but once N attempts fail, a **human gate** (VNC for live CAPTCHA
solving / marking missing_precedent + alert) is **mandatory, not optional**. The system neither "gives up" nor
"hides" the failure; it escalates to a human.
**Source of authority:** project-operational; applies [G10](00-constitution.md#inv-g10) (the system assists; human gates = invariant).
**Enforcement:** an attempts counter in the queue table + automatic transition to status=`manual` with an action path for chaim.
**Known violation:** none

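The escalation rule can be expressed as a pure function of the attempt counter. This is a sketch under assumptions: the spec does not state the value of N, so `MAX_AUTONOMOUS_ATTEMPTS = 3` and the function name are illustrative only.

```python
MAX_AUTONOMOUS_ATTEMPTS = 3  # hypothetical value for N; the spec leaves N unspecified


def status_after_attempt(
    succeeded: bool, attempts: int, max_attempts: int = MAX_AUTONOMOUS_ATTEMPTS
) -> str:
    """Map the outcome of the latest attempt to the next job status."""
    if succeeded:
        return "done"
    if attempts >= max_attempts:
        # Autonomous budget exhausted: the human gate is mandatory.
        return "manual"
    return "failed"  # still retryable
```

Keeping the decision free of I/O makes the gate easy to test in isolation from the browser and the queue.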
### INV-CF4: Responsible retrieval (politeness): serial, spaced, genuine signature
**Rule:** retrieval from a government site is **responsible**: serial (not parallel), with a cooldown between requests,
respect for `robots`/terms of use, and a moderate rate. Flooding or parallel hammering that could get the IP blocked
or overload a public service is forbidden.
**Sources:** RFC 9309 (*Robots Exclusion Protocol*, IETF 2022) · Google Search Central,
*Crawler / crawl-rate guidance* · OWASP, *Automated Threat Handbook* (OAT-015 Denial of
Service / responsible automation) | status: verified
**Enforcement:** the orchestrator and the service enforce serial execution + `INTER_FETCH_COOLDOWN_SEC`; Camoufox provides
a genuine browser signature (not a "greedy" spoof). Mirrors the queue pattern in [`precedent_library.py`](../../mcp-server/src/legal_mcp/services/precedent_library.py).
**Known violation:** none

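A minimal sketch of serial fetching with a cooldown, assuming an async `fetch_one` callable supplied by the caller. `INTER_FETCH_COOLDOWN_SEC` is the name used in the spec; its value here and the helper `fetch_serially` are assumptions.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List

INTER_FETCH_COOLDOWN_SEC = 5.0  # name from the spec; the value is an assumption


async def fetch_serially(
    case_numbers: Iterable[str],
    fetch_one: Callable[[str], Awaitable[str]],
    cooldown_sec: float = INTER_FETCH_COOLDOWN_SEC,
) -> List[str]:
    """Fetch one case at a time, sleeping between requests (never in parallel)."""
    results: List[str] = []
    for index, case_number in enumerate(case_numbers):
        if index > 0:
            await asyncio.sleep(cooldown_sec)  # spacing between consecutive requests
        results.append(await fetch_one(case_number))
    return results
```

A plain loop is used instead of `asyncio.gather` precisely because `gather` would issue the requests concurrently.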
### INV-CF5: Idempotent retrieval
**Rule:** retrieval is **idempotent**: the job key is deterministic, derived from the normalized `case_number`. Retrieving
the same case again neither creates a duplicate job nor ingests the verdict twice (upsert on the canonical key).
**Source of authority:** project-operational; applies [G3](00-constitution.md#inv-g3) (idempotent ingest) and [G1](00-constitution.md#inv-g1) (identifier normalized on write).
**Enforcement:** a uniqueness constraint on `court_fetch_jobs.case_number_norm`; ingestion itself is idempotent via `ingest_precedent`.
**Known violation:** none

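The uniqueness-based idempotency can be sketched with an in-memory SQLite table standing in for the real database. The normalization rule shown (unify `/` and `-` separators) is an assumption for illustration, not the project's `_normalize_case_number`; `make_queue` and `enqueue` are hypothetical helpers.

```python
import re
import sqlite3


def normalize_case_number(raw: str) -> str:
    """Illustrative normalization: trim and unify '/' or '-' separators to '-'."""
    return re.sub(r"\s*[-/]\s*", "-", raw.strip())


def make_queue() -> sqlite3.Connection:
    """Create a stand-in queue table keyed by the normalized case number."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE court_fetch_jobs ("
        " case_number_norm TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def enqueue(conn: sqlite3.Connection, raw_case_number: str) -> str:
    """Insert a job unless one already exists for the same normalized key."""
    norm = normalize_case_number(raw_case_number)
    conn.execute(
        "INSERT OR IGNORE INTO court_fetch_jobs (case_number_norm) VALUES (?)",
        (norm,),
    )
    return norm
```

Enqueuing `46111-12-22` and then ` 46111/12/22 ` leaves a single row, because both inputs normalize to the same key and the second insert is ignored.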
### INV-CF6: Source-classification gate, court verdicts only
**Rule:** only a citation classified as a **court verdict** is sent for retrieval. **An appeals committee (ערר/בל"מ) is never
sent for public retrieval** (Nevo is required); it is only marked `missing_precedent`. The ingested item
carries `source_type=court_ruling`, `source_kind=external_upload`, and a `precedent_level` according to the court instance.
**Source of authority:** project-operational; applies [G5](00-constitution.md#inv-g5) (full metadata + corpus separation)
and matches the source distinction in [01-ingest](01-ingest.md) (`court_ruling` vs. `appeals_committee`).
**Enforcement:** the classifier returns `tier=skip` for ערר/בל"מ; ingestion enforces `source_type`.
**Known violation:** none

### INV-CF7: Source traceability + ToS boundary
**Rule:** every retrieval carries full **provenance** (`source_url`, tier, time, job id) in the audit trail.
Retrieval is limited to **public documents** available without authentication (smart-card); the system's nature is
**assisted download** (with a human gate), not a covert bot for bypassing access control.
**Source of authority:** project-operational; applies [G9](00-constitution.md#inv-g9) (traceability + audit trail);
the ToS boundary is escalated to the chair (Chaim) as a policy consideration (working principle 4: the user is the authority).
**Enforcement:** `source_url`+tier are stored on `case_law`/`court_fetch_jobs`; the human gate preserves the assisted nature.
**Known violation:** none

---

## 3. Data model: `court_fetch_jobs`

| Column | Type | Role |
|--------|------|------|
| `id` | UUID PK | job identifier |
| `case_number_norm` | TEXT UNIQUE | canonical idempotency key (INV-CF5) |
| `citation_raw` | TEXT | the original citation as detected |
| `tier` | TEXT | `supreme` \| `admin` \| `skip` |
| `court` | TEXT | detected court instance |
| `status` | TEXT | `pending` \| `running` \| `done` \| `failed` \| `manual` |
| `attempts` | INT | attempts counter (for the Tier 2 gate, INV-CF3) |
| `error` | TEXT | last failure message (INV-CF2) |
| `case_law_id` | UUID FK | the ingested verdict (NULL until done) |
| `digest_id` | UUID FK | source digest (NULL for ad hoc requests) |
| `source_url` | TEXT | provenance (INV-CF7) |
| `created_at` / `updated_at` | TIMESTAMPTZ | |

---

## 4. Implementation components (mapping to code)

| Component | File | Template source / reuse |
|-----------|------|-------------------------|
| Classifier | `mcp-server/.../services/court_citation.py` | regex from `citation_extractor.py:67-132` |
| Tier 0 | `services/court_fetch_supreme.py` | httpx; cooldown pattern from `precedent_library.py:176-186` |
| Tier 1 service | `mcp-server/.../court_fetch_service/server.py` | clone of `chat_service/server.py` (aiohttp+Bearer+bind 10.0.1.1) |
| Camoufox client | `court_fetch_service/camofox_client.py` | modeled on `~/.hermes/.../browser_camofox.py` |
| reCAPTCHA audio | `court_fetch_service/recaptcha_audio.py` | local faster-whisper |
| In-container proxy | `web/court_fetch_proxy.py` | clone of `web/chat_proxy.py` |
| pm2 | `scripts/legal-court-fetch-service.config.cjs` | clone of `legal-chat-service.config.cjs` |
| Orchestrator + queue | `services/court_fetch_orchestrator.py` + `db.py` (SCHEMA_Vxx) | existing queue pattern |
| MCP tool | `tools/court_fetch.py` (`court_verdict_fetch`) | envelope contract [X9](X9-mcp-tool-contract.md) |
| Trigger | `services/digest_library.py` (`try_autolink` fail-path) | X12 |
| Secret | `COURT_FETCH_SHARED_SECRET` (Infisical + Coolify) | `LEGAL_CHAT_SHARED_SECRET` pattern, [X10](X10-deploy-env-secrets.md) |

---

## 5. Risks (R&D, to track)
- reCAPTCHA actively fights audio solvers → potentially high failure rate → Tier 2 is the line of defense (INV-CF3).
- F5/anti-bot may block the IP → serial politeness + Camoufox (INV-CF4).
- Fragility against site changes → selectors concentrated in one place + periodic smoke tests.
- ToS boundary on a .gov site → INV-CF7 + the chair's consideration.

7
mcp-server/src/legal_mcp/court_fetch_service/__init__.py
Normal file
@@ -0,0 +1,7 @@
"""Host-side Tier-1 verdict fetch service (X13).

Runs on the host under pm2 (it needs a real browser, which the legal-ai
container can't run). Drives a Camoufox stealth browser against נט המשפט to
download administrative/district-court verdicts the Supreme portal (Tier 0)
doesn't carry. See docs/spec/X13-court-fetch.md.
"""
148
mcp-server/src/legal_mcp/court_fetch_service/camofox_client.py
Normal file
@@ -0,0 +1,148 @@
"""Camoufox-browser client + נט-המשפט navigation flow (X13, Tier 1).

Open-source, zero-API-cost stealth browsing: a self-hosted ``camofox-browser``
REST server (``jo-inc/camofox-browser``, wrapping Camoufox, a Firefox fork
with C++ fingerprint spoofing) drives a real browser. We talk to it over the
same REST surface the Hermes agent uses (``~/.hermes/.../browser_camofox.py``):

    POST   /tabs                    → {tab_id}
    POST   /tabs/{tab}/navigate     {url}
    GET    /tabs/{tab}/snapshot     → accessibility tree w/ element refs
    POST   /tabs/{tab}/click        {ref}
    POST   /tabs/{tab}/type         {ref,text}
    GET    /tabs/{tab}/screenshot
    DELETE /sessions/{user}

Set ``CAMOFOX_URL`` (e.g. ``http://127.0.0.1:9377``) to enable. The server's
``/health`` exposes a VNC URL; that's the human-fallback surface (INV-CF3):
when the autonomous reCAPTCHA solve fails, the chair opens the VNC and solves
it live, and this flow continues.

⚠ CALIBRATION: the נט-המשפט external-case-search is an ASP.NET WebForms app
behind an F5 WAF + reCAPTCHA. The element selectors and step sequence below
are the *documented plan* of the flow; they must be calibrated against the
live snapshot on first run (the site rate-limited static probing during
development). Every step that can't find its target **raises** a clear Hebrew
reason (INV-CF2: no silent success-with-garbage) so the orchestrator escalates
to the Tier-2 human fallback rather than returning an empty/wrong file.
"""

from __future__ import annotations

import logging
import os

import httpx

logger = logging.getLogger(__name__)

# נט המשפט public entry points (discovered from the homepage __doPostBack menu).
NGCS_HOME = "https://www.court.gov.il/ngcs.web.site/homepage.aspx"

CAMOFOX_URL = os.environ.get("CAMOFOX_URL", "").rstrip("/")
_TIMEOUT = float(os.environ.get("COURT_FETCH_BROWSER_TIMEOUT_S", "60"))


class CamofoxUnavailable(RuntimeError):
    """camofox-browser isn't configured/reachable."""


class NgcsFlowError(RuntimeError):
    """A step in the נט-המשפט flow failed (selector/CAPTCHA/navigation)."""


def is_enabled() -> bool:
    return bool(CAMOFOX_URL)


async def health() -> dict:
    """Probe camofox-browser; surfaces the VNC URL for the human fallback."""
    if not CAMOFOX_URL:
        raise CamofoxUnavailable("CAMOFOX_URL is not set")
    async with httpx.AsyncClient(timeout=10) as c:
        r = await c.get(f"{CAMOFOX_URL}/health")
        r.raise_for_status()
        return r.json()


class _Browser:
    """Thin async wrapper over the camofox-browser REST surface."""

    def __init__(self, client: httpx.AsyncClient, tab_id: str, user_id: str):
        self._c = client
        self.tab = tab_id
        self.user = user_id

    @classmethod
    async def open(cls, client: httpx.AsyncClient) -> "_Browser":
        r = await client.post(f"{CAMOFOX_URL}/tabs", json={})
        r.raise_for_status()
        data = r.json()
        return cls(client, data["tab_id"], data.get("user_id", data["tab_id"]))

    async def navigate(self, url: str) -> None:
        r = await self._c.post(f"{CAMOFOX_URL}/tabs/{self.tab}/navigate", json={"url": url})
        r.raise_for_status()

    async def snapshot(self) -> dict:
        r = await self._c.get(f"{CAMOFOX_URL}/tabs/{self.tab}/snapshot")
        r.raise_for_status()
        return r.json()

    async def click(self, ref: str) -> dict:
        r = await self._c.post(f"{CAMOFOX_URL}/tabs/{self.tab}/click", json={"ref": ref})
        r.raise_for_status()
        return r.json()

    async def type(self, ref: str, text: str) -> None:
        r = await self._c.post(
            f"{CAMOFOX_URL}/tabs/{self.tab}/type", json={"ref": ref, "text": text}
        )
        r.raise_for_status()

    async def close(self) -> None:
        try:
            await self._c.delete(f"{CAMOFOX_URL}/sessions/{self.user}")
        except httpx.HTTPError as exc:
            # Cleanup is best-effort, but the failure is logged rather than
            # swallowed with a bare `pass` (INV-CF2).
            logger.warning("camofox session cleanup failed for %s: %s", self.user, exc)


async def fetch_admin_verdict(
    *, file_number: str, month: str, year: str, case_number: str, court: str
) -> dict:
    """Drive נט המשפט to download an admin/district verdict PDF.

    Returns ``{content: bytes, filename: str, source_url: str, court: str}``.
    Raises ``CamofoxUnavailable`` / ``NgcsFlowError`` on failure.

    The flow (to be calibrated against the live snapshot):
      1. Open the homepage; trigger "חיפוש תיקים חיצוני" (btnExternalSearchCases).
      2. Fill the case-number / month / year fields.
      3. Solve the reCAPTCHA via the audio challenge (recaptcha_audio); on
         repeated failure, surface the VNC URL for a human solve (INV-CF3).
      4. Submit; open the matched case; locate the verdict ("פסק דין") document.
      5. Download the cleared PDF (served via S3 pre-signed URL) and return bytes.
    """
    if not CAMOFOX_URL:
        raise CamofoxUnavailable(
            "שירות-הדפדפן (camofox-browser) אינו מוגדר — הגדר CAMOFOX_URL "
            "והפעל את jo-inc/camofox-browser. ראה docs/spec/X13-court-fetch.md."
        )

    async with httpx.AsyncClient(timeout=_TIMEOUT) as client:
        br = await _Browser.open(client)
        try:
            await br.navigate(NGCS_HOME)
            snap = await br.snapshot()
            _ = snap  # calibration anchor: locate btnExternalSearchCases here.

            # The concrete selector/CAPTCHA/download steps require live
            # calibration with camofox running. Until calibrated we fail
            # loudly so the orchestrator escalates to the human fallback
            # (INV-CF2/CF3) rather than pretending success.
            raise NgcsFlowError(
                "זרימת נט-המשפט (Tier 1) ממתינה לכיול מול snapshot חי של "
                "camofox-browser — בקשת-אחזור מוסלמת ל-fallback אנושי (VNC/ידני)."
            )
        finally:
            await br.close()
80
mcp-server/src/legal_mcp/court_fetch_service/recaptcha_audio.py
Normal file
@@ -0,0 +1,80 @@
"""Open-source reCAPTCHA v2 audio-challenge solver (X13, Tier 1).

Pure open-source, zero-API-cost: switch the reCAPTCHA widget to its **audio**
challenge, download the mp3, transcribe it with a **local Whisper** model
(``faster-whisper``), and submit the transcript. This is the well-known
"Buster"-style technique. It is intentionally a *best-effort* solver:
reCAPTCHA actively fights audio solving, so a non-trivial failure rate is
expected and handled by the Tier-2 human fallback (INV-CF3), never hidden.

Model is loaded lazily and cached; ``WHISPER_MODEL`` (default ``small``) and
``WHISPER_DEVICE`` (default ``cpu``) tune it. The dependency is optional: if
``faster-whisper`` isn't installed, ``transcribe_audio`` raises a clear error
so the caller falls back to a human solve rather than crashing the service.
"""

from __future__ import annotations

import logging
import os
import tempfile

import httpx

logger = logging.getLogger(__name__)

_WHISPER_MODEL_NAME = os.environ.get("WHISPER_MODEL", "small")
_WHISPER_DEVICE = os.environ.get("WHISPER_DEVICE", "cpu")
_model = None


class AudioSolveUnavailable(RuntimeError):
    """faster-whisper isn't installed, so audio cannot be solved locally."""


def _get_model():
    global _model
    if _model is not None:
        return _model
    try:
        from faster_whisper import WhisperModel  # type: ignore
    except ImportError as e:
        raise AudioSolveUnavailable(
            "faster-whisper אינו מותקן — לא ניתן לפתור reCAPTCHA אודיו מקומית. "
            "התקן `pip install faster-whisper` או הסתמך על fallback אנושי (VNC)."
        ) from e
    logger.info("loading whisper model %s on %s", _WHISPER_MODEL_NAME, _WHISPER_DEVICE)
    _model = WhisperModel(
        _WHISPER_MODEL_NAME, device=_WHISPER_DEVICE, compute_type="int8"
    )
    return _model


async def download_audio(audio_url: str) -> bytes:
    async with httpx.AsyncClient(timeout=30, follow_redirects=True) as c:
        r = await c.get(audio_url)
        r.raise_for_status()
        return r.content


def transcribe_audio(mp3_bytes: bytes) -> str:
    """Transcribe a reCAPTCHA audio clip to its (English) digit/word phrase.

    Raises ``AudioSolveUnavailable`` if the local model isn't installed.
    """
    model = _get_model()
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=True) as f:
        f.write(mp3_bytes)
        f.flush()
        # reCAPTCHA audio is English regardless of page locale.
        segments, _info = model.transcribe(f.name, language="en")
        # Consume the segments while the temp file still exists.
        text = " ".join(seg.text for seg in segments).strip()
    # Normalise: reCAPTCHA expects the bare phrase, lower-case, no punctuation.
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


async def solve_from_audio_url(audio_url: str) -> str:
    """Convenience: download + transcribe an audio-challenge URL."""
    mp3 = await download_audio(audio_url)
    return transcribe_audio(mp3)
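The normalisation step at the end of `transcribe_audio` can be exercised without Whisper. The sketch below copies that step into a standalone helper; `normalize_transcript` is a hypothetical name used only for illustration.

```python
def normalize_transcript(text: str) -> str:
    """Lower-case, drop punctuation, and collapse whitespace, as transcribe_audio does."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


print(normalize_transcript(" Seven, FOUR  two. "))  # prints: seven four two
```

Isolating this step makes it possible to unit-test the cleanup rules without loading a model or touching the network.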
145
mcp-server/src/legal_mcp/court_fetch_service/server.py
Normal file
@@ -0,0 +1,145 @@
"""Host-side HTTP bridge for Tier-1 verdict fetching (X13).

Mirrors ``legal_mcp.chat_service.server`` (the proven host-side pattern): an
aiohttp app, bound to the docker bridge gateway, Bearer-auth, that does the one
thing the container can't (here: drive a real browser against נט המשפט).

Endpoints:
    POST /fetch    body {file_number, month, year, case_number, court}
                   → {ok, content_b64, filename, source_url, court, reason}
                   REQUIRES Authorization: Bearer <COURT_FETCH_SHARED_SECRET>.
    GET  /health   liveness (no auth); reports camofox + VNC URL if available.

Run with pm2:
    pm2 start scripts/legal-court-fetch-service.config.cjs

Security posture (identical rationale to legal-chat-service):
  1. Bind defaults to ``10.0.1.1`` (docker0 bridge gateway): reachable from
     the host + containers on docker bridges, invisible to outside networks.
  2. ``/fetch`` requires a Bearer token (constant-time compare); the service
     refuses to start without ``COURT_FETCH_SHARED_SECRET`` set.
  3. ``/health`` is unauthenticated and spawns nothing.
"""

from __future__ import annotations

import argparse
import base64
import hmac
import json
import logging
import os
import sys

from aiohttp import web

_pkg_root = os.path.dirname(os.path.dirname(os.path.dirname(__file__)))
if _pkg_root not in sys.path:
    sys.path.insert(0, _pkg_root)

from legal_mcp.court_fetch_service import camofox_client  # noqa: E402

logger = logging.getLogger("legal_court_fetch_service")

_SHARED_SECRET: str = ""


async def health(request: web.Request) -> web.Response:
    info = {"ok": True, "service": "legal-court-fetch-service",
            "camofox_enabled": camofox_client.is_enabled()}
    if camofox_client.is_enabled():
        try:
            info["camofox"] = await camofox_client.health()
        except Exception as e:  # health must never throw
            info["camofox_error"] = str(e)
    return web.json_response(info)


def _check_bearer(request: web.Request) -> web.Response | None:
    auth = request.headers.get("Authorization", "")
    expected = "Bearer " + _SHARED_SECRET
    # Compare as bytes: hmac.compare_digest() raises TypeError on non-ASCII
    # str input, which would turn a malformed header into a 500, not a 401.
    if not auth or not hmac.compare_digest(
        auth.encode("utf-8", "surrogateescape"), expected.encode("utf-8")
    ):
        return web.json_response(
            {"error": "unauthorized: missing or invalid Bearer token"}, status=401
        )
    return None


async def fetch(request: web.Request) -> web.Response:
    unauth = _check_bearer(request)
    if unauth is not None:
        return unauth
    try:
        body = await request.json()
    except json.JSONDecodeError:
        return web.json_response({"error": "invalid JSON body"}, status=400)
    if not isinstance(body, dict):
        # A JSON array or scalar would make body.get() raise AttributeError.
        return web.json_response({"error": "JSON body must be an object"}, status=400)

    required = ("file_number", "month", "year")
    if not all(body.get(k) for k in required):
        return web.json_response(
            {"ok": False, "reason": f"missing one of {required}"}, status=400
        )

    try:
        result = await camofox_client.fetch_admin_verdict(
            file_number=str(body["file_number"]),
            month=str(body["month"]),
            year=str(body["year"]),
            case_number=str(body.get("case_number", "")),
            court=str(body.get("court", "")),
        )
        return web.json_response({
            "ok": True,
            "content_b64": base64.b64encode(result["content"]).decode("ascii"),
            "filename": result.get("filename", ""),
            "source_url": result.get("source_url", ""),
            "court": result.get("court", ""),
        })
    except (camofox_client.CamofoxUnavailable, camofox_client.NgcsFlowError) as e:
        # Expected, recoverable failure → orchestrator escalates (INV-CF3).
        return web.json_response({"ok": False, "reason": str(e)}, status=200)
    except Exception as e:  # noqa: BLE001
        logger.exception("fetch failed")
        return web.json_response({"ok": False, "reason": f"unexpected: {e}"}, status=200)


def build_app() -> web.Application:
    app = web.Application(client_max_size=64 * 1024 * 1024)
    app.router.add_get("/health", health)
    app.router.add_post("/fetch", fetch)
    return app


def main() -> int:
    parser = argparse.ArgumentParser(description="legal-court-fetch-service")
    parser.add_argument("--port", type=int, default=8771)
    parser.add_argument("--host", default="10.0.1.1",
                        help="bind address; default = docker0 bridge gateway")
    parser.add_argument("--log-level", default="INFO")
    args = parser.parse_args()

    logging.basicConfig(level=args.log_level.upper(),
                        format="%(asctime)s %(name)s %(levelname)s %(message)s")

    secret = os.environ.get("COURT_FETCH_SHARED_SECRET", "").strip()
    if not secret:
        logger.error(
            "COURT_FETCH_SHARED_SECRET is empty; refusing to start. Set it in "
            "/home/chaim/.legal-court-fetch-service.env (loaded by pm2) and "
            "mirror it as a Coolify env var on the legal-ai app."
        )
        return 2
    if len(secret) < 24:
        # The message now states the threshold the check actually enforces.
        logger.error("COURT_FETCH_SHARED_SECRET too short (>=24 chars required).")
        return 2
    global _SHARED_SECRET
    _SHARED_SECRET = secret

    app = build_app()
    logger.info("legal-court-fetch-service listening on %s:%d", args.host, args.port)
    web.run_app(app, host=args.host, port=args.port, print=lambda _m: None)
    return 0


if __name__ == "__main__":
    sys.exit(main())
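A caller of this service has to send the Bearer token and, because expected failures come back as HTTP 200 with `ok: false`, must check the `ok` flag rather than the status code. The sketch below only builds the request and decodes a response body; it does not open a connection. `build_fetch_request` and `decode_fetch_response` are hypothetical helper names, and the base URL in the usage note is assembled from the defaults shown in `main()`.

```python
import base64
import json
import urllib.request


def build_fetch_request(base_url: str, secret: str, *, file_number: str,
                        month: str, year: str, case_number: str = "",
                        court: str = "") -> urllib.request.Request:
    """Build (but do not send) a POST /fetch request with Bearer auth."""
    payload = {"file_number": file_number, "month": month, "year": year,
               "case_number": case_number, "court": court}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/fetch",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {secret}",
                 "Content-Type": "application/json"},
        method="POST",
    )


def decode_fetch_response(body: dict) -> bytes:
    """Return the PDF bytes, or raise with the service's reason when ok is false."""
    if not body.get("ok"):
        raise RuntimeError(body.get("reason") or "fetch failed without a reason")
    return base64.b64decode(body["content_b64"])
```

For example, `build_fetch_request("http://10.0.1.1:8771", secret, file_number="46111", month="12", year="22")` targets the default bind address and port; the resulting request could then be passed to `urllib.request.urlopen` or translated to any other HTTP client.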
@@ -58,6 +58,7 @@ from legal_mcp.tools import ( # noqa: E402
    missing_precedents as mp_tools,
    citations as cit_tools,
    training_enrichment as train_tools,
    court_fetch as cf_tools,
)


@@ -895,6 +896,22 @@ async def missing_precedent_close(
    )


# ── Court verdict auto-fetch (X13) ────────────────────────────────
@mcp.tool()
async def court_verdict_fetch(citation: str) -> str:
    """אחזור אוטומטי של פסק-דין בית-משפט מנט המשפט/פורטל-העליון וקליטה לקורפוס.

    מסווג את הציטוט (עליון→Tier0 / מנהלי→Tier1 / ערר→skip), מוריד וקולט דרך
    צינור-הקליטה הקנוני. דוגמה: 'עת"מ 46111-12-22'. כלי מקומי בלבד."""
    return await cf_tools.court_verdict_fetch(citation)


@mcp.tool()
async def court_fetch_status(case_number: str = "", status_filter: str = "") -> str:
    """סטטוס תור-אחזור הפסיקה. case_number לפריט יחיד, או status_filter (pending/failed/manual/done)."""
    return await cf_tools.court_fetch_status(case_number, status_filter)


# ── Internal citations graph (TaskMaster #34) ─────────────────────


204
mcp-server/src/legal_mcp/services/court_citation.py
Normal file
@@ -0,0 +1,204 @@
"""Court-citation classifier for the auto-fetch subsystem (X13).

Given a raw citation string (typically a digest's ``underlying_citation``,
e.g. ``עת"מ 46111-12-22 יכין-אפק נ' הוועדה המחוזית``), decide:

* **which tier** can fetch it (``supreme`` | ``admin`` | ``skip``), and
* the **canonical case number** plus, for נט המשפט, the
  (file, month, year) triple the public case-search form needs.

Tier mapping (INV-CF6: only court rulings are auto-fetched; ועדת-ערר is
never sent to a public fetch, it needs Nevo):

* ``supreme``: Supreme Court prefixes (עע"מ/בג"ץ/ע"א/רע"א/דנ"א/בר"מ/בש"א).
  Fetched directly from ``supremedecisions.court.gov.il`` (Tier 0, no CAPTCHA).
* ``admin``: district / administrative-court prefixes (עת"מ/עמ"נ/…) and
  the bare נט-המשפט "filed" format ``NNNNN-MM-YY``. Fetched via the
  host-side stealth browser against נט המשפט (Tier 1).
* ``skip``: ועדת-ערר (ערר/בל"מ). Not publicly fetchable → missing_precedent.

Regex families intentionally mirror ``citation_extractor.py`` (the canonical
prefix/number patterns) so the two stay in sync: we reuse the ``_NUM_RX`` shape
and ``_normalize_case_number`` semantics rather than inventing a parallel
parser (INV-CF1 / engineering "symmetry" rule).
"""

from __future__ import annotations

import re
from dataclasses import dataclass

# Canonical number core, same shape as citation_extractor._NUM_RX:
# 1-5 digits, optional separator + 1-4 digits, optional third group of 2-4
# digits (the NNNNN-MM-YY "filed" format: 46111-12-22 = file 46111, month 12, yr 22).
_NUM_RX = r"\d{1,5}(?:[-/]\d{1,4}(?:[-/]\d{2,4})?)?"

# Hebrew gershayim: straight (") or curly (״).
_Q = r"[\"״]"

# Optional leading one-letter Hebrew preposition/conjunction (ב/ל/ה/ו/כ/מ/ש)
# attached to the prefix, e.g. "בערר", "וערר", "כפי שקבעתי בערר". Anchored by
# a lookbehind that forbids a *preceding* Hebrew letter, so we don't match a
# prefix buried inside a longer word. Regex backtracking lets the preposition
# match empty when the prefix itself starts with one of these letters (בג"ץ).
_LEAD = r"(?<![א-ת])(?:[בלהוכמש])?"

# Supreme Court prefixes → Tier 0 (supremedecisions public download API).
|
||||
_SUPREME_PREFIXES = [
    rf"עע{_Q}מ",   # administrative appeal (to the Supreme Court)
    rf"בג{_Q}ץ",   # High Court of Justice (בג"ץ)
    rf"בג{_Q}צ",   # variant spelling
    rf"דנג{_Q}ץ",  # further hearing, High Court of Justice
    rf"ע{_Q}א",    # civil appeal
    rf"רע{_Q}א",   # leave to appeal, civil
    rf"דנ{_Q}א",   # further hearing, civil
    rf"בר{_Q}מ",   # leave to appeal, administrative (Supreme Court)
    rf"בש{_Q}א",   # application for leave … (Supreme Court)
]
|
||||
|
||||
# District / administrative-court prefixes → Tier 1 (נט המשפט case viewer).
|
||||
_ADMIN_PREFIXES = [
    rf"עת{_Q}מ",  # administrative petition (Court for Administrative Affairs)
    rf"עמ{_Q}נ",  # administrative appeal (District Court)
    rf"ת{_Q}א",   # civil claim (District / Magistrates' Court)
    rf"ה{_Q}פ",   # originating motion
]
|
||||
|
||||
# Appeals-committee → skip (needs Nevo; never auto-fetched).
|
||||
_SKIP_PREFIXES = [
|
||||
rf"ערר",
|
||||
rf"בל{_Q}מ",
|
||||
]
|
||||
|
||||
_SUPREME_RX = re.compile(
|
||||
_LEAD + r"(" + "|".join(_SUPREME_PREFIXES) + r")\s*(" + _NUM_RX + r")",
|
||||
re.UNICODE,
|
||||
)
|
||||
_ADMIN_RX = re.compile(
|
||||
_LEAD + r"(" + "|".join(_ADMIN_PREFIXES) + r")\s*(" + _NUM_RX + r")",
|
||||
re.UNICODE,
|
||||
)
|
||||
_SKIP_RX = re.compile(
|
||||
_LEAD + r"(" + "|".join(_SKIP_PREFIXES) + r")" + r"(?:\s*\([^)\n]{0,80}\))?\s*(" + _NUM_RX + r")",
|
||||
re.UNICODE,
|
||||
)
|
||||
|
||||
# Bare נט-המשפט filed format with no prefix: 46111-12-22 (1-5 digit file,
# 1-2 digit month, 2-4 digit year). Used when a digest gives just the number.
|
||||
_BARE_FILED_RX = re.compile(r"(?<!\d)(\d{1,5})-(\d{1,2})-(\d{2,4})(?!\d)", re.UNICODE)
|
||||
|
||||
|
||||
@dataclass
|
||||
class CourtCitation:
|
||||
"""Result of classifying a citation for auto-fetch routing."""
|
||||
|
||||
tier: str # "supreme" | "admin" | "skip" | "unknown"
|
||||
court_prefix: str # e.g. 'עת"מ', or "" for bare/unknown
|
||||
case_number_raw: str # the matched number as written, e.g. "46111-12-22"
|
||||
case_number_norm: str # canonical: slashes→dashes, digits/sep only
|
||||
# נט-המשפט form fields (only when the filed format NNNNN-MM-YY is present):
|
||||
file_number: str | None = None
|
||||
month: str | None = None
|
||||
year: str | None = None
|
||||
|
||||
@property
|
||||
def fetchable(self) -> bool:
|
||||
return self.tier in ("supreme", "admin")
|
||||
|
||||
|
||||
def normalize_case_number(raw: str) -> str:
|
||||
"""Canonicalize a case number for idempotency keys / matching.
|
||||
|
||||
Mirrors ``citation_extractor._normalize_case_number``: strip everything
|
||||
but digits and separators, unify ``/`` → ``-``. Display value is never
|
||||
derived from this.
|
||||
"""
|
||||
cleaned = re.sub(r"[^\d/\-]", "", raw or "")
|
||||
return cleaned.replace("/", "-").strip("-")
|
||||
|
||||
|
||||
def _split_filed(num_norm: str) -> tuple[str, str, str] | None:
|
||||
"""Split a normalized NNNNN-MM-YY number into (file, month, year).
|
||||
|
||||
Only the three-group "filed" format yields a נט-המשפט triple; two-group
|
||||
formats (1234-22 / 1234/22) are Supreme-style serials and return None.
|
||||
"""
|
||||
m = _BARE_FILED_RX.fullmatch(num_norm)
|
||||
if not m:
|
||||
return None
|
||||
file_no, month, year = m.group(1), m.group(2), m.group(3)
|
||||
# Plausibility: month 1-12, year 2-4 digits. Reject implausible months
|
||||
# (avoids mis-reading a 2-group serial that slipped through).
|
||||
if not (1 <= int(month) <= 12):
|
||||
return None
|
||||
return file_no, month, year
|
||||
|
||||
|
||||
def classify(citation: str) -> CourtCitation:
|
||||
"""Classify a raw citation string into a fetch tier + parsed number.
|
||||
|
||||
Resolution order: ועדת-ערר (skip) is checked FIRST so an "ערר" prefix is
|
||||
never mis-routed to a court tier; then Supreme prefixes; then admin
|
||||
prefixes; then a bare filed number defaults to ``admin`` (נט המשפט is the
|
||||
only public source for prefix-less district/שלום numbers).
|
||||
"""
|
||||
text = (citation or "").strip()
|
||||
if not text:
|
||||
return CourtCitation("unknown", "", "", "")
|
||||
|
||||
# 1. ועדת-ערר → skip (must win over any court match).
|
||||
m = _SKIP_RX.search(text)
|
||||
if m:
|
||||
raw = m.group(2)
|
||||
return CourtCitation(
|
||||
tier="skip",
|
||||
court_prefix=m.group(1),
|
||||
case_number_raw=raw,
|
||||
case_number_norm=normalize_case_number(raw),
|
||||
)
|
||||
|
||||
# 2. Supreme Court prefix → Tier 0.
|
||||
m = _SUPREME_RX.search(text)
|
||||
if m:
|
||||
raw = m.group(2)
|
||||
return CourtCitation(
|
||||
tier="supreme",
|
||||
court_prefix=m.group(1),
|
||||
case_number_raw=raw,
|
||||
case_number_norm=normalize_case_number(raw),
|
||||
)
|
||||
|
||||
# 3. District / admin prefix → Tier 1.
|
||||
m = _ADMIN_RX.search(text)
|
||||
if m:
|
||||
raw = m.group(2)
|
||||
norm = normalize_case_number(raw)
|
||||
filed = _split_filed(norm)
|
||||
return CourtCitation(
|
||||
tier="admin",
|
||||
court_prefix=m.group(1),
|
||||
case_number_raw=raw,
|
||||
case_number_norm=norm,
|
||||
file_number=filed[0] if filed else None,
|
||||
month=filed[1] if filed else None,
|
||||
year=filed[2] if filed else None,
|
||||
)
|
||||
|
||||
# 4. Bare filed number (no prefix) → default admin (נט המשפט).
|
||||
m = _BARE_FILED_RX.search(text)
|
||||
if m:
|
||||
raw = m.group(0)
|
||||
norm = normalize_case_number(raw)
|
||||
filed = _split_filed(norm)
|
||||
if filed:
|
||||
return CourtCitation(
|
||||
tier="admin",
|
||||
court_prefix="",
|
||||
case_number_raw=raw,
|
||||
case_number_norm=norm,
|
||||
file_number=filed[0],
|
||||
month=filed[1],
|
||||
year=filed[2],
|
||||
)
|
||||
|
||||
return CourtCitation("unknown", "", "", "")
|
||||
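The resolution order documented in `classify` (skip first, then Supreme, then admin, then a bare filed number) can be illustrated with a reduced sketch. This is not the module's code: `route`, `TIERS` and the trimmed prefix tables are hypothetical stand-ins that keep only one or two prefixes per tier, and the number pattern is copied from `_NUM_RX` above.

```python
import re

# Trimmed, illustrative tables (assumption: the real module carries more prefixes).
Q = r"[\"״]"                         # straight or curly gershayim
NUM = r"\d{1,5}(?:[-/]\d{1,4}(?:[-/]\d{2,4})?)?"
LEAD = r"(?<![א-ת])(?:[בלהוכמש])?"   # optional one-letter preposition

# Checked in this order: the skip tier must win over any court match.
TIERS = [
    ("skip", re.compile(LEAD + r"(ערר)\s*(" + NUM + r")")),
    ("supreme", re.compile(LEAD + rf"(עע{Q}מ|בג{Q}ץ)\s*(" + NUM + r")")),
    ("admin", re.compile(LEAD + rf"(עת{Q}מ)\s*(" + NUM + r")")),
]
BARE_FILED = re.compile(r"(?<!\d)(\d{1,5})-(\d{1,2})-(\d{2,4})(?!\d)")


def route(citation: str) -> tuple[str, str]:
    """Return (tier, normalized case number) for a raw citation string."""
    for tier, rx in TIERS:
        m = rx.search(citation)
        if m:
            # Normalize separators the same way normalize_case_number does.
            return tier, m.group(2).replace("/", "-")
    m = BARE_FILED.search(citation)
    if m and 1 <= int(m.group(2)) <= 12:  # plausible month: a filed-format triple
        return "admin", m.group(0)
    return "unknown", ""
```

Under these assumptions, `route('בערר 1234/22')` lands in the `skip` tier even though the prefix carries an attached preposition, and a prefix-less `46111-12-22` defaults to `admin`.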
241
mcp-server/src/legal_mcp/services/court_fetch_orchestrator.py
Normal file
@@ -0,0 +1,241 @@
|
||||
"""X13 orchestrator — classify → fetch → ingest → record.
|
||||
|
||||
The single entry point (`fetch_and_ingest`) wires the three tiers to the
|
||||
**canonical** precedent-ingest pipeline (INV-CF1 — no parallel ingest path)
|
||||
and keeps the `court_fetch_jobs` row honest at every step (INV-CF2 — a job
|
||||
always ends in an explicit terminal state, never a silent drop).
|
||||
|
||||
Tier routing (from `court_citation.classify`):
|
||||
* ``skip`` — ועדת-ערר → never fetched; logged as a missing_precedent gap.
|
||||
* ``supreme`` — Tier 0, in-process httpx (`court_fetch_supreme`).
|
||||
* ``admin`` — Tier 1, the host-side stealth-browser service over loopback.
|
||||
|
||||
Fallback (INV-CF3): after ``MAX_AUTONOMOUS_ATTEMPTS`` autonomous failures the
|
||||
job flips to ``manual`` and a missing_precedent row is opened so the chair
|
||||
sees the gap and can solve the CAPTCHA live (VNC) or drop the file manually.
|
||||
|
||||
This module runs **in the local MCP server only** — `ingest_precedent` drives
|
||||
halacha extraction via the local ``claude`` CLI (see `claude_session.py`). It
|
||||
is invoked from the `court_verdict_fetch` MCP tool, not from the container.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import os
|
||||
import tempfile
|
||||
from pathlib import Path
|
||||
from uuid import UUID
|
||||
|
||||
import httpx
|
||||
|
||||
from legal_mcp.services import court_citation, db
|
||||
from legal_mcp.services.court_fetch_supreme import (
|
||||
SupremeFetchError,
|
||||
fetch_supreme_verdict,
|
||||
)
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# After this many autonomous failures, stop auto-retrying and escalate to a
|
||||
# human (INV-CF3). Kept low — the .gov site shouldn't be hammered (INV-CF4).
|
||||
MAX_AUTONOMOUS_ATTEMPTS = int(os.environ.get("COURT_FETCH_MAX_ATTEMPTS", "2"))
|
||||
|
||||
# The host-side Tier-1 browser service (pm2). The MCP server runs on the host,
|
||||
# so it reaches the service over loopback directly (the container bridge in
|
||||
# web/court_fetch_proxy.py is a separate, optional entry point).
|
||||
COURT_FETCH_SERVICE_URL = os.environ.get(
|
||||
"COURT_FETCH_SERVICE_URL", "http://127.0.0.1:8771"
|
||||
)
|
||||
_SHARED_SECRET = os.environ.get("COURT_FETCH_SHARED_SECRET", "").strip()
|
||||
_TIER1_TIMEOUT_S = float(os.environ.get("COURT_FETCH_TIER1_TIMEOUT_S", "300"))
|
||||
|
||||
# Provenance level by tier — Supreme rulings are binding; admin-court verdicts
|
||||
# are administrative (set is_binding conservatively True, chair can downgrade).
|
||||
_LEVEL_BY_TIER = {"supreme": "עליון", "admin": "מנהלי"}
|
||||
|
||||
|
||||
class _Tier1Unavailable(RuntimeError):
|
||||
"""The host browser service is not reachable / not configured."""
|
||||
|
||||
|
||||
async def _ingest_bytes(
|
||||
*, content: bytes, filename: str, citation: str, tier: str,
|
||||
court: str, source_url: str,
|
||||
) -> dict:
|
||||
"""Stage bytes to a temp file and run the canonical ingest (INV-CF1)."""
|
||||
from legal_mcp.services import precedent_library
|
||||
|
||||
suffix = Path(filename).suffix or ".pdf"
|
||||
tmp = tempfile.NamedTemporaryFile(
|
||||
prefix="court_fetch_", suffix=suffix, delete=False
|
||||
)
|
||||
try:
|
||||
tmp.write(content)
|
||||
tmp.flush()
|
||||
tmp.close()
|
||||
result = await precedent_library.ingest_precedent(
|
||||
file_path=tmp.name,
|
||||
citation=citation,
|
||||
court=court,
|
||||
source_type="court_ruling", # INV-CF6
|
||||
precedent_level=_LEVEL_BY_TIER.get(tier, ""),
|
||||
is_binding=True,
|
||||
)
|
||||
# Stamp provenance on the new case_law row (INV-CF7).
|
||||
case_law_id = result.get("case_law_id")
|
||||
if case_law_id and source_url:
|
||||
try:
|
||||
await db.update_case_law(
|
||||
UUID(str(case_law_id)), source_url=source_url
|
||||
)
|
||||
except Exception: # provenance is best-effort, never blocks ingest
|
||||
logger.warning("could not stamp source_url on %s", case_law_id)
|
||||
return result
|
||||
finally:
|
||||
try:
|
||||
os.unlink(tmp.name)
|
||||
except OSError:
|
||||
pass
|
||||
|
||||
|
||||
async def _fetch_tier1_admin(cit: court_citation.CourtCitation) -> dict:
    """Call the host-side browser service to fetch an admin-court verdict.

    Returns the service's JSON: ``{ok, content_b64, filename, source_url,
    court, reason}``. Raises ``_Tier1Unavailable`` if the service can't be
    reached, and a plain ``RuntimeError`` for any other failure (timeout,
    non-200 status, non-JSON body, or a case number that is not in the
    נט-המשפט filed format).
    """
    if not (cit.file_number and cit.month and cit.year):
        raise RuntimeError(
            f"מספר-תיק {cit.case_number_norm} אינו בפורמט נט-המשפט (תיק-חודש-שנה)"
        )
    headers = {"Authorization": f"Bearer {_SHARED_SECRET}"} if _SHARED_SECRET else {}
    payload = {
        "file_number": cit.file_number,
        "month": cit.month,
        "year": cit.year,
        "case_number": cit.case_number_norm,
        "court": cit.court_prefix,
    }
    try:
        async with httpx.AsyncClient(timeout=_TIER1_TIMEOUT_S) as client:
            resp = await client.post(
                f"{COURT_FETCH_SERVICE_URL}/fetch", json=payload, headers=headers
            )
    except httpx.ConnectError as e:
        raise _Tier1Unavailable(
            f"שירות-האחזור (legal-court-fetch-service) אינו זמין ב-"
            f"{COURT_FETCH_SERVICE_URL}: {e}"
        ) from e
    except httpx.HTTPError as e:
        # Timeouts and other transport errors are not RuntimeError subclasses;
        # convert them so the caller records the failure (INV-CF2) instead of
        # leaving the job stuck in "running".
        raise RuntimeError(f"הבקשה לשירות-האחזור נכשלה: {e}") from e
    if resp.status_code != 200:
        raise RuntimeError(f"שירות-האחזור החזיר {resp.status_code}: {resp.text[:200]}")
    try:
        return resp.json()
    except ValueError as e:  # non-JSON body
        raise RuntimeError("שירות-האחזור החזיר תשובה שאינה JSON") from e
|
||||
|
||||
|
||||
async def fetch_and_ingest(
|
||||
citation: str, *, digest_id: UUID | None = None
|
||||
) -> dict:
|
||||
"""Classify a citation, fetch the verdict, ingest it, and record the job.
|
||||
|
||||
Idempotent on the canonical case number (INV-CF5): a case already fetched
|
||||
(job ``done``) is returned without re-fetching.
|
||||
"""
|
||||
cit = court_citation.classify(citation)
|
||||
|
||||
# ── skip: ועדת-ערר — never auto-fetched (INV-CF6). Surface as a gap. ──
|
||||
if cit.tier == "skip":
|
||||
await _open_gap(citation, reason="ועדת-ערר — לא ניתן לאחזור ציבורי (נדרש נבו)")
|
||||
return {"status": "skipped", "tier": "skip", "citation": citation,
|
||||
"reason": "appeals_committee — needs Nevo"}
|
||||
if cit.tier == "unknown" or not cit.case_number_norm:
|
||||
return {"status": "unrecognized", "citation": citation}
|
||||
|
||||
# ── idempotent job row ──
|
||||
job = await db.court_fetch_job_upsert(
|
||||
case_number_norm=cit.case_number_norm,
|
||||
citation_raw=citation,
|
||||
tier=cit.tier,
|
||||
court=cit.court_prefix,
|
||||
digest_id=digest_id,
|
||||
)
|
||||
if job.get("status") == "done":
|
||||
return {"status": "already_done", "job": job}
|
||||
if job.get("status") == "manual":
|
||||
return {"status": "awaiting_manual", "job": job}
|
||||
|
||||
job_id = UUID(str(job["id"]))
|
||||
await db.court_fetch_job_update(job_id, status="running", bump_attempts=True)
|
||||
|
||||
# ── fetch ──
|
||||
try:
|
||||
if cit.tier == "supreme":
|
||||
fetched = await fetch_supreme_verdict(
|
||||
citation=citation, case_number_norm=cit.case_number_norm
|
||||
)
|
||||
content, filename = fetched.content, fetched.filename
|
||||
source_url, court = fetched.source_url, fetched.court
|
||||
        else:  # admin → Tier 1
            res = await _fetch_tier1_admin(cit)
            if not res.get("ok"):
                raise RuntimeError(res.get("reason") or "אחזור נכשל")
            import base64
            import binascii
            try:
                content = base64.b64decode(res["content_b64"])
            except (KeyError, TypeError, binascii.Error) as e:
                # A malformed service reply is a recorded failure, not a crash
                # that would leave the job stuck in "running" (INV-CF2).
                raise RuntimeError("שירות-האחזור החזיר תוכן לא תקין") from e
            filename = res.get("filename") or f"{cit.case_number_norm}.pdf"
            source_url = res.get("source_url", "")
            court = res.get("court") or cit.court_prefix
    except (_Tier1Unavailable, SupremeFetchError, RuntimeError) as e:
        return await _record_failure(job_id, cit, citation, str(e))
|
||||
|
||||
# ── ingest into the canonical pipeline (INV-CF1) ──
|
||||
try:
|
||||
result = await _ingest_bytes(
|
||||
content=content, filename=filename, citation=citation,
|
||||
tier=cit.tier, court=court, source_url=source_url,
|
||||
)
|
||||
except Exception as e: # noqa: BLE001 — recorded, never swallowed (INV-CF2)
|
||||
logger.exception("ingest failed for %s", cit.case_number_norm)
|
||||
return await _record_failure(job_id, cit, citation, f"קליטה נכשלה: {e}")
|
||||
|
||||
case_law_id = result.get("case_law_id")
|
||||
await db.court_fetch_job_update(
|
||||
job_id, status="done",
|
||||
case_law_id=UUID(str(case_law_id)) if case_law_id else None,
|
||||
source_url=source_url, error="",
|
||||
)
|
||||
return {"status": "done", "tier": cit.tier, "case_law_id": case_law_id,
|
||||
"citation": citation, "source_url": source_url, "ingest": result}
|
||||
|
||||
|
||||
async def _record_failure(
|
||||
job_id: UUID, cit: court_citation.CourtCitation, citation: str, err: str
|
||||
) -> dict:
|
||||
"""Record a fetch/ingest failure; escalate to manual after N attempts (INV-CF3)."""
|
||||
job = await db.court_fetch_job_get(cit.case_number_norm)
|
||||
attempts = (job or {}).get("attempts", 1)
|
||||
if attempts >= MAX_AUTONOMOUS_ATTEMPTS:
|
||||
await db.court_fetch_job_update(job_id, status="manual", error=err)
|
||||
await _open_gap(
|
||||
citation,
|
||||
reason=f"אחזור אוטונומי נכשל ({attempts} נסיונות) — נדרשת הורדה ידנית. {err}",
|
||||
)
|
||||
logger.warning("court fetch escalated to manual: %s — %s", citation, err)
|
||||
return {"status": "manual", "citation": citation, "error": err,
|
||||
"attempts": attempts}
|
||||
await db.court_fetch_job_update(job_id, status="failed", error=err)
|
||||
logger.warning("court fetch failed (will retry): %s — %s", citation, err)
|
||||
return {"status": "failed", "citation": citation, "error": err,
|
||||
"attempts": attempts}
|
||||
|
||||
|
||||
async def _open_gap(citation: str, *, reason: str) -> None:
|
||||
"""Open a missing_precedent gap so the chair sees it (INV-CF2/CF3).
|
||||
|
||||
Best-effort + de-duplicated by the missing_precedents layer; a failure
|
||||
here is logged, never raised (it must not mask the original outcome).
|
||||
"""
|
||||
try:
|
||||
await db.create_missing_precedent(citation=citation, notes=reason)
|
||||
except Exception:
|
||||
logger.warning("could not open missing_precedent for %s", citation)
|
||||
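The escalation rule behind INV-CF3 (count the attempt up front, then move to `failed` or `manual` on error) can be sketched without a database. The `Job` dataclass and the two helpers below are hypothetical in-memory stand-ins for the `court_fetch_jobs` row and for `_record_failure`; only the threshold comparison mirrors the module above.

```python
from dataclasses import dataclass

MAX_AUTONOMOUS_ATTEMPTS = 2  # same default as the module; illustrative here


@dataclass
class Job:
    """In-memory stand-in for a court_fetch_jobs row (hypothetical)."""
    status: str = "pending"
    attempts: int = 0
    error: str = ""


def start_attempt(job: Job) -> None:
    """Mark the job running and count the attempt before any I/O happens."""
    job.status = "running"
    job.attempts += 1


def record_failure(job: Job, err: str) -> str:
    """Store the error and pick the next explicit state: failed or manual."""
    job.error = err
    job.status = "manual" if job.attempts >= MAX_AUTONOMOUS_ATTEMPTS else "failed"
    return job.status
```

With the default threshold of 2, the first failure leaves the job retryable (`failed`) and the second one hands it to a human (`manual`); in both cases the row ends in an explicit state.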
181
mcp-server/src/legal_mcp/services/court_fetch_supreme.py
Normal file
@@ -0,0 +1,181 @@
|
||||
"""Tier 0 — Supreme Court verdict fetcher (X13).
|
||||
|
||||
Pulls a published Supreme Court verdict PDF from the **public** decisions
|
||||
portal ``supremedecisions.court.gov.il`` — no smart-card, no CAPTCHA. The
|
||||
portal is an AngularJS SPA backed by a small JSON API (reverse-engineered
|
||||
from ``/Scripts/app/config.js`` + the search/results controllers):
|
||||
|
||||
POST Home/SearchVerdicts body {"document": <query>, "lan": 1} → result list
|
||||
GET Home/GetCasesYearNum ?... (year + number lookup) → case + docs
|
||||
GET Home/Download?path=<path>&fileName=<file>&type=4 → the PDF bytes
|
||||
|
||||
Two things matter for getting a 200 instead of an F5 connection-reset
|
||||
(verified empirically 2026-06-07):
|
||||
* a **complete** browser header set — UA + Accept + Accept-Language. A bare
|
||||
UA alone gets reset.
|
||||
* **politeness** (INV-CF4): one request at a time, a cooldown between them,
|
||||
a Referer of the portal root. We never parallelise or hammer.
|
||||
|
||||
Honesty / scope: the *result→download* field mapping (where ``path`` and
|
||||
``fileName`` live in the SearchVerdicts JSON) is derived from the client code,
|
||||
not yet confirmed against a live JSON response (the live site rate-limited
|
||||
probing during development). ``fetch_supreme_verdict`` therefore validates the
|
||||
response shape and **raises** on anything unexpected (INV-CF2 — no silent
|
||||
swallow) so the orchestrator can record the failure and fall back, rather than
|
||||
returning a wrong/empty file. The first live run is the validation pass; see
|
||||
the X13 verification section.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import asyncio
|
||||
import logging
|
||||
import os
|
||||
from dataclasses import dataclass
|
||||
|
||||
import httpx
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
_BASE = "https://supremedecisions.court.gov.il"
|
||||
|
||||
# A complete, browser-like header set. Empirically required to pass the F5
|
||||
# WAF (a bare User-Agent gets a TCP reset).
|
||||
_HEADERS = {
|
||||
"User-Agent": (
|
||||
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
|
||||
"(KHTML, like Gecko) Chrome/126.0 Safari/537.36"
|
||||
),
|
||||
"Accept": "application/json, text/plain, */*",
|
||||
"Accept-Language": "he-IL,he;q=0.9,en;q=0.8",
|
||||
"Referer": _BASE + "/",
|
||||
}
|
||||
|
||||
# Politeness knobs (INV-CF4). Serial only — never run these concurrently.
|
||||
_REQUEST_TIMEOUT_S = float(os.environ.get("COURT_FETCH_HTTP_TIMEOUT_S", "30"))
|
||||
_INTER_REQUEST_COOLDOWN_S = float(os.environ.get("COURT_FETCH_COOLDOWN_S", "2"))
|
||||
|
||||
# type=4 → PDF in the portal's Download endpoint (from resultsControler.js).
|
||||
_DOC_TYPE_PDF = "4"
|
||||
|
||||
|
||||
@dataclass
|
||||
class FetchedVerdict:
|
||||
"""A downloaded verdict file held in memory, ready for ingest."""
|
||||
|
||||
content: bytes
|
||||
filename: str
|
||||
source_url: str
|
||||
court: str = "בית המשפט העליון"
|
||||
|
||||
|
||||
class SupremeFetchError(RuntimeError):
|
||||
"""Raised when the public portal returns an unexpected shape / no document.
|
||||
|
||||
Carries a human-readable Hebrew reason so the orchestrator can persist it
|
||||
on the job row (INV-CF2) and decide on fallback.
|
||||
"""
|
||||
|
||||
|
||||
async def _get(client: httpx.AsyncClient, path: str, **kwargs) -> httpx.Response:
|
||||
await asyncio.sleep(_INTER_REQUEST_COOLDOWN_S)
|
||||
resp = await client.get(f"{_BASE}/{path.lstrip('/')}", **kwargs)
|
||||
resp.raise_for_status()
|
||||
return resp
|
||||
|
||||
|
||||
async def _post(client: httpx.AsyncClient, path: str, json: dict) -> httpx.Response:
|
||||
await asyncio.sleep(_INTER_REQUEST_COOLDOWN_S)
|
||||
resp = await client.post(f"{_BASE}/{path.lstrip('/')}", json=json)
|
||||
resp.raise_for_status()
|
||||
return resp
|
||||
|
||||
|
||||
def _extract_doc_ref(results: object) -> tuple[str, str] | None:
|
||||
"""Pull (path, fileName) of the first verdict document from a results blob.
|
||||
|
||||
The SearchVerdicts/GetCasesYearNum responses nest documents under varying
|
||||
keys across the portal's endpoints. We probe the known shapes defensively
|
||||
and return the first (path, fileName) pair found; ``None`` if none.
|
||||
"""
|
||||
def walk(node):
|
||||
if isinstance(node, dict):
|
||||
# A document node carries both a path and a file name.
|
||||
path = node.get("Path") or node.get("path")
|
||||
fname = node.get("FileName") or node.get("fileName") or node.get("Filename")
|
||||
if path and fname:
|
||||
yield (str(path), str(fname))
|
||||
for v in node.values():
|
||||
yield from walk(v)
|
||||
elif isinstance(node, list):
|
||||
for v in node:
|
||||
yield from walk(v)
|
||||
|
||||
    # The first pair in depth-first order wins; None when nothing matches.
    return next(walk(results), None)
|
||||
|
||||
|
||||
async def fetch_supreme_verdict(
|
||||
*, citation: str, case_number_norm: str
|
||||
) -> FetchedVerdict:
|
||||
"""Fetch a Supreme Court verdict PDF by citation. Raises on failure.
|
||||
|
||||
Flow: full-text search for the citation → locate the verdict document's
|
||||
(path, fileName) → download the PDF. Serial + cooled-down throughout.
|
||||
"""
|
||||
async with httpx.AsyncClient(
|
||||
http2=True,
|
||||
headers=_HEADERS,
|
||||
timeout=_REQUEST_TIMEOUT_S,
|
||||
follow_redirects=True,
|
||||
) as client:
|
||||
# 1. Search. The portal's quick-search posts {document, lan}; lan=1=Hebrew.
|
||||
try:
|
||||
search = await _post(
|
||||
client, "Home/SearchVerdicts",
|
||||
json={"document": citation, "lan": 1},
|
||||
)
|
||||
results = search.json()
|
||||
except httpx.HTTPError as e:
|
||||
raise SupremeFetchError(
|
||||
f"חיפוש בפורטל העליון נכשל עבור {citation}: {e}"
|
||||
) from e
|
||||
except ValueError as e: # non-JSON body
|
||||
raise SupremeFetchError(
|
||||
f"תשובת-חיפוש לא-JSON מהפורטל עבור {citation}"
|
||||
) from e
|
||||
|
||||
ref = _extract_doc_ref(results)
|
||||
if not ref:
|
||||
raise SupremeFetchError(
|
||||
f"לא נמצא מסמך-פסק עבור {citation} בפורטל העליון "
|
||||
f"(ייתכן שאינו פורסם או שמבנה-התשובה השתנה)."
|
||||
)
|
||||
path, fname = ref
|
||||
|
||||
# 2. Download the PDF.
|
||||
try:
|
||||
dl = await _get(
|
||||
client, "Home/Download",
|
||||
params={"path": path, "fileName": fname, "type": _DOC_TYPE_PDF},
|
||||
)
|
||||
except httpx.HTTPError as e:
|
||||
raise SupremeFetchError(
|
||||
f"הורדת PDF נכשלה עבור {citation} (path={path}): {e}"
|
||||
) from e
|
||||
|
||||
content = dl.content
|
||||
ctype = dl.headers.get("content-type", "")
|
||||
        if not content or ("pdf" not in ctype.lower() and content[:4] != b"%PDF"):
|
||||
raise SupremeFetchError(
|
||||
f"הקובץ שהתקבל עבור {citation} אינו PDF תקין (content-type={ctype})."
|
||||
)
|
||||
|
||||
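        # Record the URL httpx actually requested: it is percent-encoded, so the
        # provenance link (INV-CF7) stays valid when path/fileName contain
        # spaces, backslashes or Hebrew characters.
        source_url = str(dl.url)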
|
||||
safe_name = fname if fname.lower().endswith(".pdf") else f"{case_number_norm}.pdf"
|
||||
return FetchedVerdict(
|
||||
content=content, filename=safe_name, source_url=source_url,
|
||||
)
|
||||
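The search-then-download flow rests on two small steps: locating the first `(path, fileName)` pair in a nested response, and turning it into a Download URL. The sketch below shows one way to do both with the standard library. The `sample` response shape is invented for illustration (the module docstring notes that the real shape has not been confirmed against a live response), and `first_doc_ref` / `download_url` are hypothetical names, not the module's functions.

```python
from urllib.parse import urlencode

BASE = "https://supremedecisions.court.gov.il"


def first_doc_ref(node):
    """Depth-first search for the first dict carrying both a path and a file name."""
    if isinstance(node, dict):
        path = node.get("Path") or node.get("path")
        fname = node.get("FileName") or node.get("fileName")
        if path and fname:
            return str(path), str(fname)
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None  # scalars cannot hold a document reference
    for child in children:
        found = first_doc_ref(child)
        if found:
            return found
    return None


def download_url(path: str, file_name: str, doc_type: str = "4") -> str:
    """Build the Download URL with percent-encoded query parameters."""
    query = urlencode({"path": path, "fileName": file_name, "type": doc_type})
    return f"{BASE}/Home/Download?{query}"


# Hypothetical response shape, invented for illustration only.
sample = {"data": [{"CaseName": "x", "Docs": [{"Path": "a b/2022", "FileName": "v01.pdf"}]}]}
ref = first_doc_ref(sample)
```

Encoding the query through `urlencode` keeps the stored URL well-formed even when the path contains spaces or slashes, which matters if the URL is later used as a provenance link.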
@@ -1287,6 +1287,39 @@ ALTER TABLE halacha_goldset ADD COLUMN IF NOT EXISTS ai_generated_at TIMESTAMPTZ
|
||||
"""
|
||||
|
||||
|
||||
# ── X13 — Court Verdict Fetch queue ──────────────────────────────────────
|
||||
# A lightweight, observable, idempotent job queue for the auto-fetch
|
||||
# subsystem (docs/spec/X13-court-fetch.md). One row per court verdict we try
|
||||
# to pull from a public source. Mirrors the extraction-queue pattern: status
|
||||
# is always explicit (INV-CF2 — no silent drop), the canonical case number is
|
||||
# the idempotency key (INV-CF5), and ``attempts`` drives the human-fallback
|
||||
# gate (INV-CF3 — flip to 'manual' after N autonomous failures).
|
||||
#
|
||||
# NOTE (merge): main is at V29; the digests-radar worktree adds its own V30.
|
||||
# If digests-radar lands first, renumber this block to V31 and update the
|
||||
# apply loop. Kept as V30 here so the branch is self-consistent on main.
|
||||
SCHEMA_V30_SQL = """
|
||||
CREATE TABLE IF NOT EXISTS court_fetch_jobs (
|
||||
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
|
||||
case_number_norm TEXT NOT NULL UNIQUE, -- idempotency key (INV-CF5)
|
||||
citation_raw TEXT NOT NULL DEFAULT '',
|
||||
tier TEXT NOT NULL DEFAULT '', -- supreme | admin | skip
|
||||
court TEXT NOT NULL DEFAULT '',
|
||||
status TEXT NOT NULL DEFAULT 'pending', -- pending|running|done|failed|manual
|
||||
attempts INT NOT NULL DEFAULT 0,
|
||||
error TEXT NOT NULL DEFAULT '',
|
||||
case_law_id UUID REFERENCES case_law(id) ON DELETE SET NULL,
|
||||
digest_id UUID, -- source digest (X12), nullable for ad-hoc
|
||||
source_url TEXT NOT NULL DEFAULT '', -- provenance (INV-CF7)
|
||||
created_at TIMESTAMPTZ DEFAULT now(),
|
||||
updated_at TIMESTAMPTZ DEFAULT now()
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_court_fetch_jobs_status ON court_fetch_jobs(status);
|
||||
CREATE INDEX IF NOT EXISTS idx_court_fetch_jobs_digest ON court_fetch_jobs(digest_id)
|
||||
WHERE digest_id IS NOT NULL;
|
||||
"""
|
||||
|
||||
|
||||
async def _run_schema_migrations(pool: asyncpg.Pool) -> None:
|
||||
async with pool.acquire() as conn:
|
||||
await conn.execute(SCHEMA_SQL)
|
||||
@@ -1319,7 +1352,8 @@ async def _run_schema_migrations(pool: asyncpg.Pool) -> None:
|
||||
await conn.execute(SCHEMA_V27_SQL)
|
||||
await conn.execute(SCHEMA_V28_SQL)
|
||||
await conn.execute(SCHEMA_V29_SQL)
|
||||
logger.info("Database schema initialized (v1-v29)")
|
||||
await conn.execute(SCHEMA_V30_SQL)
|
||||
logger.info("Database schema initialized (v1-v30)")
|
||||
|
||||
|
||||
async def init_schema() -> None:
|
||||
@@ -5559,3 +5593,110 @@ async def find_missing_precedent_by_citation(
|
||||
citation.strip(),
|
||||
)
|
||||
return _row_to_missing_precedent(row) if row else None
|
||||
|
||||
|
||||
# ── X13 — Court Verdict Fetch jobs ───────────────────────────────────────
|
||||
# CRUD for the auto-fetch queue (docs/spec/X13-court-fetch.md). Status is
|
||||
# always explicit; failures are recorded, never swallowed (INV-CF2). Upsert
|
||||
# is keyed on the canonical case number (INV-CF5).
|
||||
|
||||
def _row_to_court_fetch_job(row) -> dict | None:
|
||||
return dict(row) if row else None
|
||||
|
||||
|
||||
async def court_fetch_job_upsert(
|
||||
case_number_norm: str,
|
||||
citation_raw: str = "",
|
||||
tier: str = "",
|
||||
court: str = "",
|
||||
digest_id: UUID | None = None,
|
||||
) -> dict:
|
||||
"""Idempotent create-or-get of a fetch job by canonical case number.
|
||||
|
||||
Re-requesting the same case number returns the existing row (with a
|
||||
``_existing`` flag) rather than creating a duplicate — the canonical
|
||||
number is a UNIQUE key. A job that already reached a terminal state is
|
||||
returned as-is so callers can decide whether to retry.
|
||||
"""
|
||||
if not (case_number_norm or "").strip():
|
||||
raise ValueError("case_number_norm is required")
|
||||
pool = await get_pool()
|
||||
async with pool.acquire() as conn:
|
||||
existing = await conn.fetchrow(
|
||||
"SELECT * FROM court_fetch_jobs WHERE case_number_norm = $1",
|
||||
case_number_norm,
|
||||
)
|
||||
if existing:
|
||||
out = _row_to_court_fetch_job(existing)
|
||||
out["_existing"] = True
|
||||
return out
|
||||
row = await conn.fetchrow(
|
||||
"""INSERT INTO court_fetch_jobs
|
||||
(case_number_norm, citation_raw, tier, court, digest_id)
|
||||
VALUES ($1, $2, $3, $4, $5)
|
||||
RETURNING *""",
|
||||
case_number_norm, citation_raw, tier, court, digest_id,
|
||||
)
|
||||
out = _row_to_court_fetch_job(row)
|
||||
out["_existing"] = False
|
||||
return out
|
||||
|
||||
|
||||
async def court_fetch_job_update(
|
||||
job_id: UUID,
|
||||
*,
|
||||
status: str | None = None,
|
||||
error: str | None = None,
|
||||
case_law_id: UUID | None = None,
|
||||
source_url: str | None = None,
|
||||
bump_attempts: bool = False,
|
||||
) -> dict:
|
||||
"""Patch a job row. Only provided fields change; ``updated_at`` always does."""
|
||||
    sets = ["updated_at = now()"]
    args: list = []
    if status is not None:
        args.append(status)
        sets.append(f"status = ${len(args)}")
    if error is not None:
        args.append(error)
        sets.append(f"error = ${len(args)}")
    if case_law_id is not None:
        args.append(case_law_id)
        sets.append(f"case_law_id = ${len(args)}")
    if source_url is not None:
        args.append(source_url)
        sets.append(f"source_url = ${len(args)}")
    if bump_attempts:
        sets.append("attempts = attempts + 1")
    args.append(job_id)
|
||||
pool = await get_pool()
|
||||
async with pool.acquire() as conn:
|
||||
row = await conn.fetchrow(
|
||||
f"UPDATE court_fetch_jobs SET {', '.join(sets)} "
|
||||
f"WHERE id = ${len(args)} RETURNING *",
|
||||
*args,
|
||||
)
|
||||
return _row_to_court_fetch_job(row)
|
||||
|
||||
|
||||
async def court_fetch_job_get(case_number_norm: str) -> dict | None:
|
||||
pool = await get_pool()
|
||||
async with pool.acquire() as conn:
|
||||
row = await conn.fetchrow(
|
||||
"SELECT * FROM court_fetch_jobs WHERE case_number_norm = $1",
|
||||
case_number_norm,
|
||||
)
|
||||
return _row_to_court_fetch_job(row) if row else None
|
||||
|
||||
|
||||
async def court_fetch_job_list(status: str | None = None, limit: int = 100) -> list[dict]:
|
||||
pool = await get_pool()
|
||||
async with pool.acquire() as conn:
|
||||
if status:
|
||||
rows = await conn.fetch(
|
||||
"SELECT * FROM court_fetch_jobs WHERE status = $1 "
|
||||
"ORDER BY created_at DESC LIMIT $2",
|
||||
status, limit,
|
||||
)
|
||||
else:
|
||||
rows = await conn.fetch(
|
||||
"SELECT * FROM court_fetch_jobs ORDER BY created_at DESC LIMIT $1",
|
||||
limit,
|
||||
)
|
||||
return [_row_to_court_fetch_job(r) for r in rows]
|
||||
|
||||
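The create-or-get semantics of `court_fetch_job_upsert` depend on the UNIQUE key on `case_number_norm`. As a rough illustration, the sketch below lets that key arbitrate in a single statement using `ON CONFLICT ... DO NOTHING`. It runs against an in-memory SQLite database purely as a stand-in for PostgreSQL (both accept this clause); the trimmed table and the `job_upsert` helper are assumptions made for the example, not the project's code.

```python
import sqlite3
import uuid

# In-memory SQLite stands in for PostgreSQL; the table is trimmed to the
# columns the example needs.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    """CREATE TABLE court_fetch_jobs (
           id TEXT PRIMARY KEY,
           case_number_norm TEXT NOT NULL UNIQUE,
           status TEXT NOT NULL DEFAULT 'pending',
           attempts INTEGER NOT NULL DEFAULT 0
       )"""
)


def job_upsert(case_number_norm: str) -> dict:
    """Create-or-get a job row; the UNIQUE key decides who created it."""
    new_id = str(uuid.uuid4())
    # The insert is a no-op when a row with this case number already exists.
    conn.execute(
        "INSERT INTO court_fetch_jobs (id, case_number_norm) VALUES (?, ?) "
        "ON CONFLICT (case_number_norm) DO NOTHING",
        (new_id, case_number_norm),
    )
    row = conn.execute(
        "SELECT * FROM court_fetch_jobs WHERE case_number_norm = ?",
        (case_number_norm,),
    ).fetchone()
    out = dict(row)
    # If the stored id is not ours, another call created the row earlier.
    out["_existing"] = out["id"] != new_id
    return out
```

Calling `job_upsert` twice with the same case number returns the same `id`, with `_existing` set only on the second call. Because the insert never raises on a duplicate, two concurrent callers could not collide the way a separate SELECT followed by INSERT might.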
56
mcp-server/src/legal_mcp/tools/court_fetch.py
Normal file
@@ -0,0 +1,56 @@
|
||||
"""MCP tools for the X13 court-verdict auto-fetch subsystem.
|
||||
|
||||
- ``court_verdict_fetch`` — classify a citation, fetch the verdict from the
|
||||
matching public source (Supreme portal / נט המשפט), and ingest it into the
|
||||
precedent library via the canonical pipeline. The standalone entry point
|
||||
(also driven automatically from digest auto-link, see X12/X13).
|
||||
- ``court_fetch_status`` — inspect the fetch-job queue (pending/failed/manual).
|
||||
|
||||
Local-only: ``court_verdict_fetch`` runs the ingest pipeline, which drives
|
||||
halacha extraction via the local ``claude`` CLI — same constraint as
|
||||
``precedent_process_pending``. Invoking it from the container will fail.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from legal_mcp.services import court_fetch_orchestrator as orch
|
||||
from legal_mcp.services import db
|
||||
from legal_mcp.tools.envelope import err as _err, ok as _ok
|
||||
|
||||
|
||||
async def court_verdict_fetch(citation: str) -> str:
    """Auto-fetch a court verdict and ingest it into the corpus.

    Takes a citation (e.g. 'עת"מ 46111-12-22' or 'עע"מ 1234/22'), classifies the
    court tier, downloads the verdict from the matching public source, and
    ingests it through the canonical ingest pipeline.
    ערר / בל"מ (appeals-committee) citations cannot be fetched from a public
    source and are flagged as a gap.
    """
    if not (citation or "").strip():
        return _err("citation is required")
    try:
        result = await orch.fetch_and_ingest(citation.strip())
    except Exception as e:  # noqa: BLE001
        # Surfaced to the caller, not swallowed (INV-CF2). Message: "fetch failed".
        return _err(f"אחזור נכשל: {e}")

    status = result.get("status")
    if status in ("done", "already_done"):
        # "The verdict was ingested into the corpus."
        return _ok(result, message="הפסק נקלט לקורפוס")
    if status == "skipped":
        # "Appeals committee: not publicly fetchable (flagged as a gap)."
        return _ok(result, message="ועדת-ערר — לא ניתן לאחזור ציבורי (סומן כפער)")
    if status in ("manual", "awaiting_manual"):
        # "Autonomous fetch failed: escalated to manual download."
        return _ok(result, message="האחזור האוטונומי נכשל — הוסלם להורדה ידנית")
    if status == "unrecognized":
        # "The citation was not recognized as a valid case number."
        return _err("הציטוט לא זוהה כמספר-תיק תקין")
    # Any other status is passed through as "status: <value>".
    return _ok(result, message=f"סטטוס: {status}")


async def court_fetch_status(case_number: str = "", status_filter: str = "") -> str:
    """Fetch-queue status: pass case_number for a single job, or status_filter to filter the list."""
    if case_number.strip():
        from legal_mcp.services.court_citation import normalize_case_number

        job = await db.court_fetch_job_get(normalize_case_number(case_number))
        if not job:
            # "No job exists for this case."
            return _ok({"job": None}, message="אין job עבור תיק זה")
        return _ok({"job": job})
    jobs = await db.court_fetch_job_list(status=status_filter.strip() or None)
    return _ok({"jobs": jobs, "count": len(jobs)})

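To make the status handling in `court_verdict_fetch` concrete, here is a minimal, self-contained sketch of the same dispatch with a stubbed orchestrator. The JSON envelope shape (`ok`, `data`, `message`, `error`), the English messages, and the `fake_fetch_and_ingest` stub are assumptions for illustration only; the real `legal_mcp.tools.envelope` helpers and `court_fetch_orchestrator.fetch_and_ingest` are not shown in this diff and may behave differently.

```python
import asyncio
import json

# Statuses reported as success-with-message; the status names come from the tool above.
OK_MESSAGES = {
    "done": "ingested",
    "already_done": "ingested",
    "skipped": "appeals committee, flagged as a gap",
    "manual": "escalated to manual download",
    "awaiting_manual": "escalated to manual download",
}


def ok(data: dict, message: str = "") -> str:
    # Hypothetical envelope; the real one may differ.
    return json.dumps({"ok": True, "data": data, "message": message}, ensure_ascii=False)


def err(message: str) -> str:
    return json.dumps({"ok": False, "error": message}, ensure_ascii=False)


async def fake_fetch_and_ingest(citation: str) -> dict:
    # Stub standing in for the orchestrator: returns a canned status per citation.
    canned = {"ערר 1110/20": "skipped", "garbage": "unrecognized"}
    return {"status": canned.get(citation, "done"), "citation": citation}


async def verdict_fetch(citation: str) -> str:
    if not (citation or "").strip():
        return err("citation is required")
    try:
        result = await fake_fetch_and_ingest(citation.strip())
    except Exception as e:  # surfaced to the caller, never swallowed
        return err(f"fetch failed: {e}")
    status = result.get("status")
    if status == "unrecognized":
        return err("citation not recognized as a valid case number")
    return ok(result, message=OK_MESSAGES.get(status, f"status: {status}"))


if __name__ == "__main__":
    print(asyncio.run(verdict_fetch('עע"מ 1234/22')))
```

Note that only an empty citation, an exception, or an `unrecognized` status yields an error envelope; `skipped` and `manual` are reported as successful calls whose message tells the caller that a human follow-up is needed.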
80
mcp-server/tests/test_court_citation.py
Normal file
@@ -0,0 +1,80 @@
"""Unit tests for the X13 court-citation classifier."""

from __future__ import annotations

from legal_mcp.services.court_citation import classify, normalize_case_number


def test_admin_filed_format_the_example():
    """The plan's example: עת"מ 46111-12-22 → admin, parsed into (46111,12,22)."""
    c = classify('עת"מ 46111-12-22 יכין-אפק בע"מ נ\' הוועדה המחוזית')
    assert c.tier == "admin"
    assert c.court_prefix in ('עת"מ', "עת״מ")
    assert c.case_number_raw == "46111-12-22"
    assert c.case_number_norm == "46111-12-22"
    assert (c.file_number, c.month, c.year) == ("46111", "12", "22")
    assert c.fetchable is True


def test_bare_filed_number_defaults_admin():
    c = classify("46111-12-22")
    assert c.tier == "admin"
    assert (c.file_number, c.month, c.year) == ("46111", "12", "22")


def test_supreme_prefixes():
    for cit, tier in [
        ('עע"מ 1234/22', "supreme"),
        ('בג"ץ 5678/21', "supreme"),
        ('ע"א 999/20', "supreme"),
        ('רע"א 4/19', "supreme"),
        ('בר"מ 8126/24', "supreme"),
    ]:
        c = classify(cit)
        assert c.tier == tier, f"{cit} -> {c.tier}"
        assert c.fetchable is True


def test_appeals_committee_is_skip():
    """ערר / בל"מ must never be auto-fetched (needs Nevo); see INV-CF6."""
    for cit in ['ערר 1110/20', 'בל"מ 8048/24', "ערר 1015-01-24 ירושלים שקופה"]:
        c = classify(cit)
        assert c.tier == "skip", f"{cit} -> {c.tier}"
        assert c.fetchable is False


def test_skip_wins_over_court_match():
    """An 'ערר' citation that also contains court-like digits stays skip."""
    c = classify("ראה החלטתי בערר 1041/24 ובהמשך")
    assert c.tier == "skip"


def test_admin_amn_prefix():
    c = classify('עמ"נ 12345-06-23')
    assert c.tier == "admin"
    assert (c.file_number, c.month, c.year) == ("12345", "06", "23")


def test_two_group_serial_has_no_filed_triple():
    """Supreme serial 1234/22 normalizes but yields no (file,month,year)."""
    c = classify('עע"מ 1234/22')
    assert c.case_number_norm == "1234-22"
    assert c.file_number is None


def test_implausible_month_not_parsed_as_filed():
    # 1234-22-05 has month=22 → not a valid filed triple.
    c = classify("1234-22-05")
    assert c.tier in ("unknown", "admin")
    if c.tier == "admin":
        assert c.month is None


def test_empty_and_garbage():
    assert classify("").tier == "unknown"
    assert classify("שלום עולם בלי ציטוט").tier == "unknown"


def test_normalize_case_number():
    assert normalize_case_number('עת"מ 46111/12/22') == "46111-12-22"
    assert normalize_case_number("1110/20") == "1110-20"
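The tests above pin down the classifier's observable behavior without showing `court_citation.py` itself. As a reading aid, the following is a minimal sketch of a classifier that would satisfy the main cases. It is an assumption-laden illustration, not the module from this commit: the prefix lists are inferred from the test data only, matching is done by plain substring search, and the `Classified` dataclass omits fields such as `court_prefix` and `case_number_raw`.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Prefix lists inferred from the test cases; the real module may know more.
SKIP_PREFIXES = ("ערר", 'בל"מ')
SUPREME_PREFIXES = ('עע"מ', 'בג"ץ', 'רע"א', 'בר"מ', 'ע"א')
ADMIN_PREFIXES = ('עת"מ', 'עמ"נ')
# Two or three digit groups separated by "-" or "/".
NUMBER_RE = re.compile(r"\d+(?:[-/]\d+){1,2}")


@dataclass
class Classified:
    tier: str
    case_number_norm: Optional[str] = None
    file_number: Optional[str] = None
    month: Optional[str] = None
    year: Optional[str] = None

    @property
    def fetchable(self) -> bool:
        return self.tier in ("supreme", "admin")


def normalize_case_number(text: str) -> str:
    """Return the first case number in text with '/' folded to '-', or ''."""
    m = NUMBER_RE.search(text)
    return m.group(0).replace("/", "-") if m else ""


def classify(citation: str) -> Classified:
    text = citation.replace("״", '"')  # fold Hebrew gershayim to an ASCII quote
    norm = normalize_case_number(text)
    if not norm:
        return Classified("unknown")
    parts = norm.split("-")
    # A filed triple is file-month-year with a plausible month.
    triple = parts if len(parts) == 3 and 1 <= int(parts[1]) <= 12 else None
    if any(p in text for p in SKIP_PREFIXES):  # skip wins over any court match
        tier = "skip"
    elif any(p in text for p in SUPREME_PREFIXES):
        tier = "supreme"
    elif triple or any(p in text for p in ADMIN_PREFIXES):
        tier = "admin"
    else:
        tier = "unknown"
    file_number, month, year = triple or (None, None, None)
    return Classified(tier, norm, file_number, month, year)
```

The ordering of the checks is the design point the tests enforce: the skip prefixes are tested first so that an appeals-committee citation is never routed to a fetcher, and a bare three-group number defaults to the admin tier only when its middle group is a valid month.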
@@ -19,6 +19,7 @@
| `fu2c_reconcile_external_case_numbers.py` | python | **FU-2c (GAP-08, #68): reconcile `case_number` of external case law** (`source_kind <> internal_committee`) from the full citation to the canonical form **proceeding designator + docket** (chair's decision 2026-05-31, Option A: `/` is kept, *not* `-`; consistent with db.py:369 and INV-ID2). Deterministic (designator+docket; 0 or >1 dockets → flag). `--dry-run` (default) produces `data/audit/fu2c-reconciliation-*.{csv,md}` with flags (MISMATCH / NO_CITATION / CIT_NO_DOCKET / DESIG_MISMATCH / DUP_CHECK). `--apply --approved <csv>` takes a backup and then updates non-blocking rows (including ADVISORY/NO_CITATION). `--overrides <csv>` (id,proposed_canonical,reason) unblocks blocking rows by explicit chair decision (e.g. a consolidated verdict; see `data/audit/fu2c-overrides.csv` for the לויתן/קלמנוביץ record). The extraction logic and flag splitting were verified offline on 24 records. Scope: external only (internal = FU-2b). FK-safe. | one-off, **chair-gated** (apply only after approval from דפנה) |
| `eval_gold_bootstrap.py` | python | **FU-5 (GAP-11): gold-set bootstrap** for the retrieval evaluation, written to `data/eval/gold-set.jsonl`. Two sources: `--source citations` (cited==relevant from `search_relevance_feedback`; empty until citations accumulate) and `--source known_item` (query=case name → relevant=itself; the only real signal today). Idempotent: keeps `source=chair` rows, regenerates `bootstrap_*`. Requires POSTGRES. | before eval; rerun as ground truth accumulates |
| `eval_retrieval.py` | python | **FU-5 (GAP-11, INV-RET4/G8): retrieval-evaluation harness.** Runs the production retrieval path (`search_library`/`search_internal`) over the gold set, computes precision@k/recall@k/MRR/nDCG@k (k=5,10), writes overall + per-corpus + per-PA results to `data/eval/eval-report-<ts>.{json,md}` plus a delta against `data/eval/baseline.json` (records retrieval_config). `--self-test` checks the metrics offline; `--update-baseline` adopts a snapshot. **CI gate by discipline:** run before and after every change to the retrieval layer, with the same config. Requires POSTGRES+VOYAGE_API_KEY. | before/after changing RRF/k/embedder/rerank |
| `legal-court-fetch-service.config.cjs` | pm2/js | **Tier-1 host service for fetching verdicts from נט המשפט (X13).** Runs `python -m legal_mcp.court_fetch_service.server` under pm2, bound to `10.0.1.1:8771`, Bearer auth (`COURT_FETCH_SHARED_SECRET` from `~/.legal-court-fetch-service.env`). Runs a Camoufox browser (open-source) because the container cannot. Dependencies for actual fetching: `camofox-browser` running (`CAMOFOX_URL`) + `faster-whisper` for audio reCAPTCHA; otherwise it returns ok:false and the orchestrator escalates to the human fallback. Mirrors the `legal-chat-service.config.cjs` pattern. Spec: `docs/spec/X13-court-fetch.md`. Install: `pm2 start scripts/legal-court-fetch-service.config.cjs && pm2 save`. Health: `curl http://10.0.1.1:8771/health`. | pm2 (host-side) |
| `auto-sync-cases.sh` | bash | Syncs appeal (ערר) case files to Gitea; runs every minute | `* * * * *` (cron) |
| `backup-db.sh` | bash | Daily PostgreSQL backup to `data/backups/` (gzip) | to be scheduled: `0 2 * * *` |
| `restore-db.sh` | bash | Restores the DB from a backup (companion to backup-db.sh) | manual |

65
scripts/legal-court-fetch-service.config.cjs
Normal file
@@ -0,0 +1,65 @@
/**
 * pm2 ecosystem entry for legal-court-fetch-service, the host-side Tier-1
 * verdict fetcher (X13). It drives a Camoufox stealth browser against
 * נט המשפט to download administrative/district-court verdicts the Supreme
 * portal (Tier 0) doesn't carry. Lives on the host because the legal-ai
 * container can't run a browser. See docs/spec/X13-court-fetch.md.
 *
 * Mirrors legal-chat-service.config.cjs (same security model):
 *   1. Bind to 10.0.1.1 (docker0 bridge gateway): host + docker-bridge
 *      containers only; nothing from outside the host.
 *   2. Bearer token auth: COURT_FETCH_SHARED_SECRET loaded from
 *      /home/chaim/.legal-court-fetch-service.env (chmod 600) and mirrored in
 *      Coolify so the FastAPI proxy sends a matching Authorization header.
 *      The service refuses to start without the secret.
 *
 * Prereqs for Tier-1 to actually fetch (otherwise it returns ok:false and the
 * orchestrator escalates to the human fallback, INV-CF3):
 *   - camofox-browser running, CAMOFOX_URL set (e.g. http://127.0.0.1:9377).
 *       git clone https://github.com/jo-inc/camofox-browser && cd camofox-browser && npm i && npm start
 *   - faster-whisper installed in the venv for the reCAPTCHA audio solver.
 *
 * Install (once):
 *   pm2 start /home/chaim/legal-ai/scripts/legal-court-fetch-service.config.cjs
 *   pm2 save
 * Smoke test:
 *   curl http://10.0.1.1:8771/health
 * Update:
 *   pm2 restart legal-court-fetch-service --update-env
 */
const fs = require("fs");

const ENV_FILE = "/home/chaim/.legal-court-fetch-service.env";
const env = {
  HOME: "/home/chaim",
  PATH: "/home/chaim/.local/bin:/usr/local/bin:/usr/bin:/bin",
  PYTHONUNBUFFERED: "1",
  // CAMOFOX_URL: "http://127.0.0.1:9377",  // set when camofox-browser is up
};
try {
  const text = fs.readFileSync(ENV_FILE, "utf8");
  for (const line of text.split("\n")) {
    if (!line || line.trim().startsWith("#")) continue;
    const m = line.match(/^\s*([A-Z_][A-Z0-9_]*)\s*=\s*(.*?)\s*$/);
    if (m) env[m[1]] = m[2];
  }
} catch (e) {
  console.error(`legal-court-fetch-service: failed to load ${ENV_FILE}: ${e.message}`);
  console.error("Service will refuse to start without COURT_FETCH_SHARED_SECRET.");
}

module.exports = {
  apps: [
    {
      name: "legal-court-fetch-service",
      cwd: "/home/chaim/legal-ai/mcp-server",
      script: "/home/chaim/legal-ai/mcp-server/.venv/bin/python",
      args: "-m legal_mcp.court_fetch_service.server --port 8771 --host 10.0.1.1",
      env,
      restart_delay: 5000,
      max_restarts: 10,
      autorestart: true,
      max_memory_restart: "1G",
    },
  ],
};
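The env-file loader in the config above accepts only `KEY=VALUE` lines with upper-case keys, skips blank and `#` lines, and does not strip quotes. A quick way to check an env file before starting pm2 is to replay the same rule in Python. The sketch below mirrors the regular expression from the config; the function name `parse_env_text` is hypothetical and is not part of the commit.

```python
import re

# Same pattern as the JS loader: upper-case key, "=", value trimmed of outer whitespace.
ENV_LINE = re.compile(r"^\s*([A-Z_][A-Z0-9_]*)\s*=\s*(.*?)\s*$")


def parse_env_text(text: str) -> dict:
    """Parse KEY=VALUE lines the way the pm2 config does (no quote stripping)."""
    env = {}
    for line in text.split("\n"):
        if not line or line.strip().startswith("#"):
            continue  # blank line or comment
        m = ENV_LINE.match(line)
        if m:
            env[m.group(1)] = m.group(2)
    return env
```

Two consequences follow from the pattern: a value written as `SECRET="abc"` keeps its quotation marks, and a line such as `export SECRET=abc` or a lower-case key is silently ignored, so the service would start without the variable set.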