From 0990db7a3cb757ae682c287bbe40a92c00e200e9 Mon Sep 17 00:00:00 2001
From: Chaim
Date: Sun, 7 Jun 2026 18:08:23 +0000
Subject: [PATCH] feat(X13): auto-fetch court verdicts from נט המשפט → corpus (Tier 0 + scaffold)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Automatic verdict-fetch subsystem: when a digest points to a court verdict, classify the court, download from the matching public source, and ingest through the canonical ingest pipeline.

- spec-first: docs/spec/X13-court-fetch.md (INV-CF1..CF7) + index
- classifier court_citation.py (supreme/admin/skip) + 10 tests (עת"מ 46111-12-22 → admin)
- Tier 0: court_fetch_supreme.py, supremedecisions API (reverse-engineered), httpx + browser-headers (verified 200) + politeness
- queue court_fetch_jobs (SCHEMA_V30) + DB helpers + court_fetch_orchestrator.py
- Tier 1 scaffold: legal-court-fetch-service (aiohttp+Bearer, mirrors legal-chat-service) + camofox_client (Camoufox open-source) + recaptcha_audio (local Whisper) + pm2
- graceful Tier 2 fallback: manual + missing_precedent (INV-CF2/CF3: no silent drop)
- MCP tools court_verdict_fetch / court_fetch_status; SCRIPTS.md

Invariants: satisfies G2 (single ingest path, INV-CF1) · G3/G1 (idempotent + normalization, INV-CF5) · G4/§6 (no silent swallowing, INV-CF2) · G10 (human gate, INV-CF3) · G5 (source_type, INV-CF6) · G9 (provenance+audit, INV-CF7). Sources for INV-CF4: RFC 9309 · Google crawler · OWASP OAT.

Follow-ups (not yet verified live): live Tier-0 validation · install camofox-browser+whisper · calibrate Tier-1 selectors · COURT_FETCH_SHARED_SECRET (Infisical+Coolify) · trigger from digest try_autolink (worktree-digests-radar). V30 may conflict with digests-radar.

Co-Authored-By: Claude Opus 4.8 (1M context) --- docs/spec/00-constitution.md | 1 + docs/spec/README.md | 4 +- docs/spec/X13-court-fetch.md | 151 +++++++++++ .../legal_mcp/court_fetch_service/__init__.py | 7 + .../court_fetch_service/camofox_client.py | 148 +++++++++++ .../court_fetch_service/recaptcha_audio.py | 80 ++++++ .../legal_mcp/court_fetch_service/server.py | 145 +++++++++++ mcp-server/src/legal_mcp/server.py | 17 ++ .../src/legal_mcp/services/court_citation.py | 204 +++++++++++++++ .../services/court_fetch_orchestrator.py | 241 ++++++++++++++++++ .../legal_mcp/services/court_fetch_supreme.py | 181 +++++++++++++ mcp-server/src/legal_mcp/services/db.py | 140 +++++++++- mcp-server/src/legal_mcp/tools/court_fetch.py | 56 ++++ mcp-server/tests/test_court_citation.py | 80 ++++++ scripts/SCRIPTS.md | 1 + scripts/legal-court-fetch-service.config.cjs | 65 +++++ 16 files changed, 1518 insertions(+), 3 deletions(-) create mode 100644 docs/spec/X13-court-fetch.md create mode 100644 mcp-server/src/legal_mcp/court_fetch_service/__init__.py create mode 100644 mcp-server/src/legal_mcp/court_fetch_service/camofox_client.py create mode 100644 mcp-server/src/legal_mcp/court_fetch_service/recaptcha_audio.py create mode 100644 mcp-server/src/legal_mcp/court_fetch_service/server.py create mode 100644 mcp-server/src/legal_mcp/services/court_citation.py create mode 100644 mcp-server/src/legal_mcp/services/court_fetch_orchestrator.py create mode 100644 mcp-server/src/legal_mcp/services/court_fetch_supreme.py create mode 100644 mcp-server/src/legal_mcp/tools/court_fetch.py create mode 100644 mcp-server/tests/test_court_citation.py create mode 100644 scripts/legal-court-fetch-service.config.cjs diff --git a/docs/spec/00-constitution.md b/docs/spec/00-constitution.md index e50a87b..3996145 100644 --- a/docs/spec/00-constitution.md +++ b/docs/spec/00-constitution.md @@ -251,6 +251,7 @@ Hellyer (Law Library Journal 110:4, 2018, open-access) — טיפול-שיפוט | 
[X10-deploy-env-secrets.md](X10-deploy-env-secrets.md) | env-catalog SSoT · single config source (Coolify) · no hardcode · secrets · drift | G2, G4, G9 |
 | [X11-citation-corroboration.md](X11-citation-corroboration.md) | internal citator: validating holdings (halachot) through cumulative judicial treatment · controlled G10 amendment · corroboration threshold · match-to-holding | G9, G10 |
 | [X12-digests-radar.md](X12-digests-radar.md) | digests as a discovery layer (radar): a secondary source pointing at the original ruling · not a fourth citation corpus · not cited, no holdings extracted | G2, G4, G9 |
+| [X13-court-fetch.md](X13-court-fetch.md) | automatic verdict fetch from נט המשפט: 3 tiers (Supreme/administrative/skip) · host-side service · reCAPTCHA · human gate | G2, G3, G4, G5, G9, G10 |

 > **X6–X10 (cycle 2):** cover the 8 application surfaces outside the core pipeline (integration, web-ui, field filling,
 > analysis storage, MCP tools, deploy/env). Findings are in [gap-audit.md](gap-audit.md) (GAP-24..62 → FU-9..15)
diff --git a/docs/spec/README.md b/docs/spec/README.md
index e6f8546..35128ea 100644
--- a/docs/spec/README.md
+++ b/docs/spec/README.md
@@ -3,10 +3,10 @@
 This is the canonical source of truth for "what is correct" in the system. Entry gate: [00-constitution.md](00-constitution.md).
 Every invariant is backed by ≥3 authoritative sources; an unverified item is marked ⚠ UNVERIFIED and escalated to the chair.

-Structure: 00 constitution · 01–07 lifecycle · X1–X12 cross-stage. See the full index in the constitution.
+Structure: 00 constitution · 01–07 lifecycle · X1–X13 cross-stage. See the full index in the constitution.
 - X1–X5: identifiers · multi-company · integration+deploy · agents · audit.
 - X6–X10 (cycle 2, the 8 application surfaces): UI↔API contract · Paperclip client · field filling · MCP tool contract · deploy/env/secrets.
-- X11–X12 (domain extensions): internal citator (holdings validation) · digests as a discovery layer (radar).
+- X11–X13 (domain extensions): internal citator (holdings validation) · digests as a discovery layer (radar) · automatic verdict fetch from נט המשפט (service).

 Findings maps: [gap-audit.md](gap-audit.md) (GAP-01..62 → FU-1..15; cycle 1 ✅ done, cycle 2 open) · [ui-audit.md](ui-audit.md) (audit of 13 UI pages).
Design basis: docs/superpowers/specs/2026-05-30-system-spec-design.md
diff --git a/docs/spec/X13-court-fetch.md b/docs/spec/X13-court-fetch.md
new file mode 100644
index 0000000..acde7c9
--- /dev/null
+++ b/docs/spec/X13-court-fetch.md
@@ -0,0 +1,151 @@
+# X13: Automatic Verdict Fetch from נט המשפט (Court Verdict Fetch)
+
+> Subject to the [system constitution](00-constitution.md). A **service** subsystem (not a corpus) that downloads public
+> court verdicts and feeds them into the **canonical ingest pipeline** of the case-law library. A conceptual sibling
+> of [X12: Digests Radar](X12-digests-radar.md) (the main trigger) and of [01-ingest](01-ingest.md)
+> (the destination). It is not a fourth corpus and not a parallel ingest path.
+
+---
+
+## 0. Purpose and context
+
+A digest points to an underlying verdict (`underlying_citation`, e.g. `עת"מ 46111-12-22`). When that verdict
+is not in the corpus, the system **fetches it automatically** from a public source, extracts the text, and ingests it through
+`precedent_library_upload` → `ingest_precedent`. This turns a verdict from "cited only" into "usable for search
+and holdings extraction".
+
+**Critical source distinction:** only **court verdicts** can be fetched publicly. **Appeals-committee decisions** (ועדת-ערר)
+are not publicly available (Nevo is required); they are flagged as a gap and are not sent to fetch.
+
+**Two public source routes:**
+- **Supreme** (עע"מ/בג"ץ/ע"א/רע"א/בר"מ/דנ"א) → `supremedecisions.court.gov.il`: direct download (httpx), no CAPTCHA.
+- **Administrative/District/Magistrate** (עת"מ/עמ"נ/...) → the case viewer of **נט המשפט**: ASP.NET WebForms
+  (`__doPostBack`/VIEWSTATE), F5 anti-bot, reCAPTCHA on the public search, documents served as S3 cleared URLs.
+  This requires a **real browser** (host-side), hence a host service under pm2 (following the `legal-chat-service` pattern).
+
+---
+
+## 1. Architecture: three tiers
+
+```
+underlying_citation → [classifier] → tier ∈ {supreme, admin, skip}
+  skip (ערר/בל"מ) → missing_precedent (manual Nevo), no fetch
+  supreme         → Tier 0: httpx in the container → supremedecisions, fully autonomous
+  admin           → Tier 1: legal-court-fetch-service (host/pm2), autonomous-first
+                    → Camoufox stealth browser → external-search → reCAPTCHA(audio/Whisper)
+                    → download cleared PDF
+                    → Tier 2 fallback: manual VNC / missing_precedent + alert, human gate
+  (all tiers)     → precedent_library_upload(source_type=court_ruling) → ingest_precedent
+                    → chunks+embeddings+halachot(pending) → relink digest / close gap
+```
+
+Work state is managed in the `court_fetch_jobs` queue table (idempotent, observable, retryable).
+
+---
+
+## 2. Invariants
+
+### INV-CF1: Single ingest path, no parallel ingest
+**Rule:** all tiers drain into the **single canonical ingest pipeline** (`precedent_library_upload` →
+`ingest_precedent`). The fetcher supplies a file plus metadata only; it must not write `case_law`/`precedent_chunks`/
+`halachot` directly or duplicate chunking/embedding logic.
+**Authority source:** project-operational; applies [G2](00-constitution.md#inv-g2) (single source of truth, no parallel path) to this subsystem.
+**Enforcement:** the orchestrator calls only the existing ingest API/service; architecture review in the PR.
+**Known violation:** none
+
+### INV-CF2: No silent swallowing, every fetch is observable
+**Rule:** every verdict identified for fetch has a job record with an explicit terminal status
+(`done`/`failed`/`manual`). A fetch failure is **never swallowed**: it is flagged and escalated (Tier 2),
+not silently dropped. `except: pass` is forbidden.
+**Authority source:** project-operational; applies [G4](00-constitution.md#inv-g4) and the engineering rule "no silent swallowing" (§6).
+**Enforcement:** the `court_fetch_jobs` table (status+error+attempts) + a warning log on every failure + the Tier-2 gate.
+**Known violation:** the existing gap in X12, where a failed `try_autolink` returns `None` silently (to be fixed by this trigger).
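
As a rough illustration of INV-CF2 (a sketch, not code from this patch), the rule can be read as "every attempt ends with an explicit status and a recorded error". The names `FetchJob`, `run_job`, and `MAX_ATTEMPTS` are hypothetical; the real orchestrator in `court_fetch_orchestrator.py` is not shown in this section and may be structured differently. The status values and the `attempts`/`error` fields follow the `court_fetch_jobs` description.

```python
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("court_fetch_sketch")

# Hypothetical threshold: the spec only speaks of "N attempts".
MAX_ATTEMPTS = 3


@dataclass
class FetchJob:
    """In-memory stand-in for one court_fetch_jobs row (assumed shape)."""
    case_number_norm: str
    status: str = "pending"  # pending | running | done | failed | manual
    attempts: int = 0
    error: str = ""


def run_job(job: FetchJob, fetcher: Callable[[str], bytes]) -> FetchJob:
    """Run one attempt and always leave an explicit status on the job."""
    job.status = "running"
    job.attempts += 1
    try:
        fetcher(job.case_number_norm)
    except Exception as exc:  # broad on purpose: no failure may go unrecorded
        job.error = f"{type(exc).__name__}: {exc}"
        # After repeated failures the job is handed to a human instead of dropped.
        job.status = "manual" if job.attempts >= MAX_ATTEMPTS else "failed"
        logger.warning(
            "fetch failed for %s (attempt %d): %s",
            job.case_number_norm, job.attempts, job.error,
        )
        return job
    job.status = "done"
    job.error = ""
    return job
```

The point of the sketch is the shape of the `except` branch: the failure is written to the record and logged before anything else happens, so a caller can never mistake a failed fetch for a missing one.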
+
+### INV-CF3: Autonomous-first, mandatory human gate on fallback
+**Rule:** the fetch tries autonomously; but when N attempts fail, a **human gate** (VNC for live CAPTCHA
+solving / marking missing_precedent + alert) is **mandatory, not optional**. The system does not "give up" and does not
+"hide"; it escalates to a human.
+**Authority source:** project-operational; applies [G10](00-constitution.md#inv-g10) (the system assists; human gates = invariant).
+**Enforcement:** an attempts counter in the queue table + automatic transition to status=`manual` with an action path for chaim.
+**Known violation:** none
+
+### INV-CF4: Responsible fetching (politeness): serial, spaced, real signature
+**Rule:** fetching from a government site is done **responsibly**: serial (not parallel), with a cooldown between requests,
+honoring `robots`/terms of use, and at a moderate rate. No flooding or parallel hammering that could get the IP blocked
+or overload a public service.
+**Sources:** RFC 9309 (*Robots Exclusion Protocol*, IETF 2022) · Google Search Central:
+*Crawler / crawl-rate guidance* · OWASP: *Automated Threat Handbook* (OAT-015 Denial of
+Service / responsible automation) | status: verified
+**Enforcement:** the orchestrator and the service enforce serial execution + `INTER_FETCH_COOLDOWN_SEC`; Camoufox provides
+a genuine browser signature (not a greedy spoof). Mirrors the queue pattern in [`precedent_library.py`](../../mcp-server/src/legal_mcp/services/precedent_library.py).
+**Known violation:** none
+
+### INV-CF5: Idempotent fetch
+**Rule:** a fetch is **idempotent**: the job key is deterministic, derived from the normalized `case_number`. Fetching
+the same case again does not create a duplicate job and does not ingest the verdict twice (upsert on the canonical key).
+**Authority source:** project-operational; applies [G3](00-constitution.md#inv-g3) (idempotent ingest) and [G1](00-constitution.md#inv-g1) (normalized identifier on write).
+**Enforcement:** a uniqueness constraint on `court_fetch_jobs.case_number_norm`; the ingest itself is idempotent through `ingest_precedent`.
+**Known violation:** none
+
+### INV-CF6: Source classification gate, court verdicts only
+**Rule:** only a citation classified as a **court verdict** is sent to fetch. **An appeals committee (ערר/בל"מ) is never
+sent to public fetch** (Nevo is required); it is only marked `missing_precedent`. The ingested item
+carries `source_type=court_ruling`, `source_kind=external_upload`, and a `precedent_level` matching the court instance.
+**Authority source:** project-operational; applies [G5](00-constitution.md#inv-g5) (full metadata + corpus separation)
+and matches the source distinction in [01-ingest](01-ingest.md) (`court_ruling` vs `appeals_committee`).
+**Enforcement:** the classifier returns `tier=skip` for ערר/בל"מ; the ingest enforces `source_type`.
+**Known violation:** none
+
+### INV-CF7: Source traceability + ToS boundary
+**Rule:** every fetch carries full **provenance** (`source_url`, tier, time, job id) in the audit trail.
+Fetching is limited to **public documents** available without authentication (smart-card); the system's character is
+**assisted download** (with a human gate), not a covert bot for bypassing access control.
+**Authority source:** project-operational; applies [G9](00-constitution.md#inv-g9) (traceability + audit trail);
+the ToS boundary is escalated to the chair (Chaim) as a policy consideration (working principle 4: the user is the authority).
+**Enforcement:** `source_url`+tier are stored on `case_law`/`court_fetch_jobs`; the human gate preserves the assisted character.
+**Known violation:** none
+
+---
+
+## 3. Data model: `court_fetch_jobs`
+
+| Column | Type | Role |
+|--------|-------|-------|
+| `id` | UUID PK | job id |
+| `case_number_norm` | TEXT UNIQUE | canonical idempotency key (INV-CF5) |
+| `citation_raw` | TEXT | the original citation as detected |
+| `tier` | TEXT | `supreme` \| `admin` \| `skip` |
+| `court` | TEXT | the court instance identified |
+| `status` | TEXT | `pending` \| `running` \| `done` \| `failed` \| `manual` |
+| `attempts` | INT | attempts counter (for the Tier 2 gate, INV-CF3) |
+| `error` | TEXT | last failure message (INV-CF2) |
+| `case_law_id` | UUID FK | the ingested verdict (NULL until done) |
+| `digest_id` | UUID FK | the source digest (NULL for ad hoc) |
+| `source_url` | TEXT | provenance (INV-CF7) |
+| `created_at` / `updated_at` | TIMESTAMPTZ | |
+
+---
+
+## 4. Implementation components (code mapping)
+
+| Component | File | Template source / reuse |
+|------|------|------------------------|
+| Classifier | `mcp-server/.../services/court_citation.py` | regex from `citation_extractor.py:67-132` |
+| Tier 0 | `services/court_fetch_supreme.py` | httpx; cooldown pattern from `precedent_library.py:176-186` |
+| Tier 1 service | `mcp-server/.../court_fetch_service/server.py` | copy of `chat_service/server.py` (aiohttp+Bearer+bind 10.0.1.1) |
+| Camoufox client | `court_fetch_service/camofox_client.py` | modeled on `~/.hermes/.../browser_camofox.py` |
+| reCAPTCHA audio | `court_fetch_service/recaptcha_audio.py` | local faster-whisper |
+| Proxy in the container | `web/court_fetch_proxy.py` | copy of `web/chat_proxy.py` |
+| pm2 | `scripts/legal-court-fetch-service.config.cjs` | copy of `legal-chat-service.config.cjs` |
+| Orchestrator + queue | `services/court_fetch_orchestrator.py` + `db.py` (SCHEMA_Vxx) | existing queue pattern |
+| MCP tool | `tools/court_fetch.py` (`court_verdict_fetch`) | envelope contract [X9](X9-mcp-tool-contract.md) |
+| Trigger | `services/digest_library.py` (`try_autolink` fail-path) | X12 |
+| Secret | `COURT_FETCH_SHARED_SECRET` (Infisical + Coolify) | `LEGAL_CHAT_SHARED_SECRET` pattern, [X10](X10-deploy-env-secrets.md) |
+
+---
+
+## 5. Risks (R&D, to track)
+- reCAPTCHA actively fights audio solvers → potentially high failure rate → Tier 2 is the line of defense (INV-CF3).
+- F5/anti-bot may block the IP → serial politeness + Camoufox (INV-CF4).
+- Fragility against site changes → keep selectors in one place + periodic smoke tests.
+- ToS boundary on a .gov site → INV-CF7 + the chair's consideration.
diff --git a/mcp-server/src/legal_mcp/court_fetch_service/__init__.py b/mcp-server/src/legal_mcp/court_fetch_service/__init__.py
new file mode 100644
index 0000000..91cd4dd
--- /dev/null
+++ b/mcp-server/src/legal_mcp/court_fetch_service/__init__.py
@@ -0,0 +1,7 @@
+"""Host-side Tier-1 verdict fetch service (X13).
+
+Runs on the host under pm2 (it needs a real browser, which the legal-ai
+container can't run).
Drives a Camoufox stealth browser against נט המשפט to +download administrative/district-court verdicts the Supreme portal (Tier 0) +doesn't carry. See docs/spec/X13-court-fetch.md. +""" diff --git a/mcp-server/src/legal_mcp/court_fetch_service/camofox_client.py b/mcp-server/src/legal_mcp/court_fetch_service/camofox_client.py new file mode 100644 index 0000000..f21c436 --- /dev/null +++ b/mcp-server/src/legal_mcp/court_fetch_service/camofox_client.py @@ -0,0 +1,148 @@ +"""Camoufox-browser client + נט-המשפט navigation flow (X13, Tier 1). + +Open-source, zero-API-cost stealth browsing: a self-hosted ``camofox-browser`` +REST server (``jo-inc/camofox-browser``, wrapping Camoufox — a Firefox fork +with C++ fingerprint spoofing) drives a real browser. We talk to it over the +same REST surface the Hermes agent uses (``~/.hermes/.../browser_camofox.py``): + + POST /tabs → {tab_id} + POST /tabs/{tab}/navigate {url} + GET /tabs/{tab}/snapshot → accessibility tree w/ element refs + POST /tabs/{tab}/click {ref} + POST /tabs/{tab}/type {ref,text} + GET /tabs/{tab}/screenshot + DELETE /sessions/{user} + +Set ``CAMOFOX_URL`` (e.g. ``http://127.0.0.1:9377``) to enable. The server's +``/health`` exposes a VNC URL — that's the human-fallback surface (INV-CF3): +when the autonomous reCAPTCHA solve fails, the chair opens the VNC and solves +it live, and this flow continues. + +⚠ CALIBRATION: the נט-המשפט external-case-search is an ASP.NET WebForms app +behind an F5 WAF + reCAPTCHA. The element selectors and step sequence below +are the *documented plan* of the flow; they must be calibrated against the +live snapshot on first run (the site rate-limited static probing during +development). Every step that can't find its target **raises** a clear Hebrew +reason (INV-CF2 — no silent success-with-garbage) so the orchestrator escalates +to the Tier-2 human fallback rather than returning an empty/wrong file. 
+"""
+
+from __future__ import annotations
+
+import logging
+import os
+
+import httpx
+
+logger = logging.getLogger(__name__)
+
+# נט המשפט public entry points (discovered from the homepage __doPostBack menu).
+NGCS_HOME = "https://www.court.gov.il/ngcs.web.site/homepage.aspx"
+
+CAMOFOX_URL = os.environ.get("CAMOFOX_URL", "").rstrip("/")
+_TIMEOUT = float(os.environ.get("COURT_FETCH_BROWSER_TIMEOUT_S", "60"))
+
+
+class CamofoxUnavailable(RuntimeError):
+    """camofox-browser isn't configured/reachable."""
+
+
+class NgcsFlowError(RuntimeError):
+    """A step in the נט-המשפט flow failed (selector/CAPTCHA/navigation)."""
+
+
+def is_enabled() -> bool:
+    return bool(CAMOFOX_URL)
+
+
+async def health() -> dict:
+    """Probe camofox-browser; surfaces the VNC URL for the human fallback."""
+    if not CAMOFOX_URL:
+        raise CamofoxUnavailable("CAMOFOX_URL is not set")
+    async with httpx.AsyncClient(timeout=10) as c:
+        r = await c.get(f"{CAMOFOX_URL}/health")
+        r.raise_for_status()
+        return r.json()
+
+
+class _Browser:
+    """Thin async wrapper over the camofox-browser REST surface."""
+
+    def __init__(self, client: httpx.AsyncClient, tab_id: str, user_id: str):
+        self._c = client
+        self.tab = tab_id
+        self.user = user_id
+
+    @classmethod
+    async def open(cls, client: httpx.AsyncClient) -> "_Browser":
+        r = await client.post(f"{CAMOFOX_URL}/tabs", json={})
+        r.raise_for_status()
+        data = r.json()
+        return cls(client, data["tab_id"], data.get("user_id", data["tab_id"]))
+
+    async def navigate(self, url: str) -> None:
+        r = await self._c.post(f"{CAMOFOX_URL}/tabs/{self.tab}/navigate", json={"url": url})
+        r.raise_for_status()
+
+    async def snapshot(self) -> dict:
+        r = await self._c.get(f"{CAMOFOX_URL}/tabs/{self.tab}/snapshot")
+        r.raise_for_status()
+        return r.json()
+
+    async def click(self, ref: str) -> dict:
+        r = await self._c.post(f"{CAMOFOX_URL}/tabs/{self.tab}/click", json={"ref": ref})
+        r.raise_for_status()
+        return r.json()
+
+    async def type(self, ref: str, text: str) -> None:
+        r = await self._c.post(
+            f"{CAMOFOX_URL}/tabs/{self.tab}/type", json={"ref": ref, "text": text}
+        )
+        r.raise_for_status()
+
+    async def close(self) -> None:
+        try:
+            await self._c.delete(f"{CAMOFOX_URL}/sessions/{self.user}")
+        except httpx.HTTPError as e:
+            logger.warning("camofox session cleanup failed for %s: %s", self.user, e)
+
+
+async def fetch_admin_verdict(
+    *, file_number: str, month: str, year: str, case_number: str, court: str
+) -> dict:
+    """Drive נט המשפט to download an admin/district verdict PDF.
+
+    Returns ``{content: bytes, filename: str, source_url: str, court: str}``.
+    Raises ``CamofoxUnavailable`` / ``NgcsFlowError`` on failure.
+
+    The flow (to be calibrated against the live snapshot):
+      1. Open the homepage; trigger "חיפוש תיקים חיצוני" (btnExternalSearchCases).
+      2. Fill the case-number / month / year fields.
+      3. Solve the reCAPTCHA via the audio challenge (recaptcha_audio); on
+         repeated failure, surface the VNC URL for a human solve (INV-CF3).
+      4. Submit; open the matched case; locate the verdict ("פסק דין") document.
+      5. Download the cleared PDF (served via S3 pre-signed URL) and return bytes.
+    """
+    if not CAMOFOX_URL:
+        raise CamofoxUnavailable(
+            "שירות-הדפדפן (camofox-browser) אינו מוגדר — הגדר CAMOFOX_URL "
+            "והפעל את jo-inc/camofox-browser. ראה docs/spec/X13-court-fetch.md."
+        )
+
+    async with httpx.AsyncClient(timeout=_TIMEOUT) as client:
+        br = await _Browser.open(client)
+        try:
+            await br.navigate(NGCS_HOME)
+            snap = await br.snapshot()
+            _ = snap  # calibration anchor: locate btnExternalSearchCases here.
+
+            # The concrete selector/CAPTCHA/download steps require live
+            # calibration with camofox running. Until calibrated we fail
+            # loudly so the orchestrator escalates to the human fallback
+            # (INV-CF2/CF3) rather than pretending success.
+            raise NgcsFlowError(
+                "זרימת נט-המשפט (Tier 1) ממתינה לכיול מול snapshot חי של "
+                "camofox-browser — בקשת-אחזור מוסלמת ל-fallback אנושי (VNC/ידני)."
+ ) + finally: + await br.close() diff --git a/mcp-server/src/legal_mcp/court_fetch_service/recaptcha_audio.py b/mcp-server/src/legal_mcp/court_fetch_service/recaptcha_audio.py new file mode 100644 index 0000000..ea9f623 --- /dev/null +++ b/mcp-server/src/legal_mcp/court_fetch_service/recaptcha_audio.py @@ -0,0 +1,80 @@ +"""Open-source reCAPTCHA v2 audio-challenge solver (X13, Tier 1). + +Pure open-source, zero-API-cost: switch the reCAPTCHA widget to its **audio** +challenge, download the mp3, transcribe it with a **local Whisper** model +(``faster-whisper``), and submit the transcript. This is the well-known +"Buster"-style technique. It is intentionally a *best-effort* solver — +reCAPTCHA actively fights audio solving, so a non-trivial failure rate is +expected and handled by the Tier-2 human fallback (INV-CF3), never hidden. + +Model is loaded lazily and cached; ``WHISPER_MODEL`` (default ``small``) and +``WHISPER_DEVICE`` (default ``cpu``) tune it. The dependency is optional — if +``faster-whisper`` isn't installed, ``transcribe_audio`` raises a clear error +so the caller falls back to a human solve rather than crashing the service. +""" + +from __future__ import annotations + +import logging +import os +import tempfile + +import httpx + +logger = logging.getLogger(__name__) + +_WHISPER_MODEL_NAME = os.environ.get("WHISPER_MODEL", "small") +_WHISPER_DEVICE = os.environ.get("WHISPER_DEVICE", "cpu") +_model = None + + +class AudioSolveUnavailable(RuntimeError): + """faster-whisper isn't installed — cannot solve audio locally.""" + + +def _get_model(): + global _model + if _model is not None: + return _model + try: + from faster_whisper import WhisperModel # type: ignore + except ImportError as e: + raise AudioSolveUnavailable( + "faster-whisper אינו מותקן — לא ניתן לפתור reCAPTCHA אודיו מקומית. " + "התקן `pip install faster-whisper` או הסתמך על fallback אנושי (VNC)." 
+ ) from e + logger.info("loading whisper model %s on %s", _WHISPER_MODEL_NAME, _WHISPER_DEVICE) + _model = WhisperModel( + _WHISPER_MODEL_NAME, device=_WHISPER_DEVICE, compute_type="int8" + ) + return _model + + +async def download_audio(audio_url: str) -> bytes: + async with httpx.AsyncClient(timeout=30, follow_redirects=True) as c: + r = await c.get(audio_url) + r.raise_for_status() + return r.content + + +def transcribe_audio(mp3_bytes: bytes) -> str: + """Transcribe a reCAPTCHA audio clip to its (English) digit/word phrase. + + Raises ``AudioSolveUnavailable`` if the local model isn't installed. + """ + model = _get_model() + with tempfile.NamedTemporaryFile(suffix=".mp3", delete=True) as f: + f.write(mp3_bytes) + f.flush() + # reCAPTCHA audio is English regardless of page locale. + segments, _info = model.transcribe(f.name, language="en") + text = " ".join(seg.text for seg in segments).strip() + # Normalise: reCAPTCHA expects the bare phrase, lower-case, no punctuation. + cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()) + return " ".join(cleaned.split()) + + +async def solve_from_audio_url(audio_url: str) -> str: + """Convenience: download + transcribe an audio-challenge URL.""" + mp3 = await download_audio(audio_url) + return transcribe_audio(mp3) diff --git a/mcp-server/src/legal_mcp/court_fetch_service/server.py b/mcp-server/src/legal_mcp/court_fetch_service/server.py new file mode 100644 index 0000000..c6b6136 --- /dev/null +++ b/mcp-server/src/legal_mcp/court_fetch_service/server.py @@ -0,0 +1,145 @@ +"""Host-side HTTP bridge for Tier-1 verdict fetching (X13). + +Mirrors ``legal_mcp.chat_service.server`` — the proven host-side pattern: an +aiohttp app, bound to the docker bridge gateway, Bearer-auth, that does the one +thing the container can't (here: drive a real browser against נט המשפט). 
+
+Endpoints:
+  POST /fetch     body {file_number, month, year, case_number, court}
+                  → {ok, content_b64, filename, source_url, court, reason}
+                  REQUIRES an Authorization: Bearer header carrying the shared secret.
+  GET  /health    liveness (no auth); reports camofox + VNC URL if available.
+
+Run with pm2:
+  pm2 start scripts/legal-court-fetch-service.config.cjs
+
+Security posture (identical rationale to legal-chat-service):
+  1. Bind defaults to ``10.0.1.1`` (docker0 bridge gateway) — reachable from
+     the host + containers on docker bridges, invisible to outside networks.
+  2. ``/fetch`` requires a Bearer token (constant-time compare); the service
+     refuses to start without ``COURT_FETCH_SHARED_SECRET`` set.
+  3. ``/health`` is unauthenticated and spawns nothing.
+"""
+
+from __future__ import annotations
+
+import argparse
+import base64
+import hmac
+import json
+import logging
+import os
+import sys
+
+from aiohttp import web
+
+_pkg_root = os.path.dirname(os.path.dirname(os.path.dirname(__file__)))
+if _pkg_root not in sys.path:
+    sys.path.insert(0, _pkg_root)
+
+from legal_mcp.court_fetch_service import camofox_client  # noqa: E402
+
+logger = logging.getLogger("legal_court_fetch_service")
+
+_SHARED_SECRET: str = ""
+
+
+async def health(request: web.Request) -> web.Response:
+    info = {"ok": True, "service": "legal-court-fetch-service",
+            "camofox_enabled": camofox_client.is_enabled()}
+    if camofox_client.is_enabled():
+        try:
+            info["camofox"] = await camofox_client.health()
+        except Exception as e:  # health must never throw
+            info["camofox_error"] = str(e)
+    return web.json_response(info)
+
+
+def _check_bearer(request: web.Request) -> web.Response | None:
+    auth = request.headers.get("Authorization", "")
+    expected = "Bearer " + _SHARED_SECRET
+    if not auth or not hmac.compare_digest(auth, expected):
+        return web.json_response(
+            {"error": "unauthorized: missing or invalid Bearer token"}, status=401
+        )
+    return None
+
+
+async def fetch(request: web.Request) -> web.Response:
+    unauth =
_check_bearer(request) + if unauth is not None: + return unauth + try: + body = await request.json() + except json.JSONDecodeError: + return web.json_response({"error": "invalid JSON body"}, status=400) + + required = ("file_number", "month", "year") + if not all(body.get(k) for k in required): + return web.json_response( + {"ok": False, "reason": f"missing one of {required}"}, status=400 + ) + + try: + result = await camofox_client.fetch_admin_verdict( + file_number=str(body["file_number"]), + month=str(body["month"]), + year=str(body["year"]), + case_number=str(body.get("case_number", "")), + court=str(body.get("court", "")), + ) + return web.json_response({ + "ok": True, + "content_b64": base64.b64encode(result["content"]).decode("ascii"), + "filename": result.get("filename", ""), + "source_url": result.get("source_url", ""), + "court": result.get("court", ""), + }) + except (camofox_client.CamofoxUnavailable, camofox_client.NgcsFlowError) as e: + # Expected, recoverable failure → orchestrator escalates (INV-CF3). 
+ return web.json_response({"ok": False, "reason": str(e)}, status=200) + except Exception as e: # noqa: BLE001 + logger.exception("fetch failed") + return web.json_response({"ok": False, "reason": f"unexpected: {e}"}, status=200) + + +def build_app() -> web.Application: + app = web.Application(client_max_size=64 * 1024 * 1024) + app.router.add_get("/health", health) + app.router.add_post("/fetch", fetch) + return app + + +def main() -> int: + parser = argparse.ArgumentParser(description="legal-court-fetch-service") + parser.add_argument("--port", type=int, default=8771) + parser.add_argument("--host", default="10.0.1.1", + help="bind address; default = docker0 bridge gateway") + parser.add_argument("--log-level", default="INFO") + args = parser.parse_args() + + logging.basicConfig(level=args.log_level.upper(), + format="%(asctime)s %(name)s %(levelname)s %(message)s") + + secret = os.environ.get("COURT_FETCH_SHARED_SECRET", "").strip() + if not secret: + logger.error( + "COURT_FETCH_SHARED_SECRET is empty; refusing to start. Set it in " + "/home/chaim/.legal-court-fetch-service.env (loaded by pm2) and " + "mirror it as a Coolify env var on the legal-ai app." 
+        )
+        return 2
+    if len(secret) < 24:
+        logger.error("COURT_FETCH_SHARED_SECRET too short (minimum 24 chars; >=32 recommended).")
+        return 2
+    global _SHARED_SECRET
+    _SHARED_SECRET = secret
+
+    app = build_app()
+    logger.info("legal-court-fetch-service listening on %s:%d", args.host, args.port)
+    web.run_app(app, host=args.host, port=args.port, print=lambda _m: None)
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/mcp-server/src/legal_mcp/server.py b/mcp-server/src/legal_mcp/server.py
index aadfd9c..dfaacb2 100644
--- a/mcp-server/src/legal_mcp/server.py
+++ b/mcp-server/src/legal_mcp/server.py
@@ -59,6 +59,7 @@ from legal_mcp.tools import (  # noqa: E402
     citations as cit_tools,
     training_enrichment as train_tools,
     digests as digest_tools,
+    court_fetch as cf_tools,
 )


@@ -965,6 +966,22 @@ async def missing_precedent_close(
     )


+# ── Court verdict auto-fetch (X13) ────────────────────────────────
+@mcp.tool()
+async def court_verdict_fetch(citation: str) -> str:
+    """אחזור אוטומטי של פסק-דין בית-משפט מנט המשפט/פורטל-העליון וקליטה לקורפוס.
+
+    מסווג את הציטוט (עליון→Tier0 / מנהלי→Tier1 / ערר→skip), מוריד וקולט דרך
+    צינור-הקליטה הקנוני. דוגמה: 'עת"מ 46111-12-22'. כלי מקומי בלבד."""
+    return await cf_tools.court_verdict_fetch(citation)
+
+
+@mcp.tool()
+async def court_fetch_status(case_number: str = "", status_filter: str = "") -> str:
+    """סטטוס תור-אחזור הפסיקה. case_number לפריט יחיד, או status_filter (pending/failed/manual/done)."""
+    return await cf_tools.court_fetch_status(case_number, status_filter)
+
+
 # ── Internal citations graph (TaskMaster #34) ─────────────────────


diff --git a/mcp-server/src/legal_mcp/services/court_citation.py b/mcp-server/src/legal_mcp/services/court_citation.py
new file mode 100644
index 0000000..c85495f
--- /dev/null
+++ b/mcp-server/src/legal_mcp/services/court_citation.py
@@ -0,0 +1,204 @@
+"""Court-citation classifier for the auto-fetch subsystem (X13).
+
+Given a raw citation string (typically a digest's ``underlying_citation``,
+e.g. ``עת"מ 46111-12-22 יכין-אפק נ' הוועדה המחוזית``), decide:
+
+  * **which tier** can fetch it (``supreme`` | ``admin`` | ``skip``), and
+  * the **canonical case number** plus, for נט המשפט, the
+    (file, month, year) triple the public case-search form needs.
+
+Tier mapping (INV-CF6 — only court rulings are auto-fetched; ועדת-ערר is
+never sent to a public fetch, it needs Nevo):
+
+  * ``supreme`` — Supreme Court prefixes (עע"מ/בג"ץ/ע"א/רע"א/דנ"א/בר"מ/בש"א).
+    Fetched directly from ``supremedecisions.court.gov.il`` (Tier 0, no CAPTCHA).
+  * ``admin`` — district / administrative-court prefixes (עת"מ/עמ"נ/…) and
+    the bare נט-המשפט "filed" format ``NNNNN-MM-YY``. Fetched via the
+    host-side stealth browser against נט המשפט (Tier 1).
+  * ``skip`` — ועדת-ערר (ערר/בל"מ). Not publicly fetchable → missing_precedent.
+
+Regex families intentionally mirror ``citation_extractor.py`` (the canonical
+prefix/number patterns) so the two stay in sync — we reuse ``_NUM_RX`` shape
+and ``_normalize_case_number`` semantics rather than inventing a parallel
+parser (INV-CF1 / engineering "symmetry" rule).
+"""
+
+from __future__ import annotations
+
+import re
+from dataclasses import dataclass
+
+# Canonical number core, identical shape to citation_extractor._NUM_RX:
+# 1-5 digits, optional separator + 1-4 digits, optional third group of 2-4 digits
+# (the NNNNN-MM-YY "filed" format — 46111-12-22 = file 46111, month 12, yr 22).
+_NUM_RX = r"\d{1,5}(?:[-/]\d{1,4}(?:[-/]\d{2,4})?)?"
+
+# Hebrew gershayim: straight (") or curly (״).
+_Q = r"[\"״]"
+
+# Optional leading one-letter Hebrew preposition/conjunction (ב/ל/ה/ו/כ/מ/ש)
+# attached to the prefix — e.g. "בערר", "וערר", "כפי שקבעתי בערר". Anchored by
+# a lookbehind that forbids a *preceding* Hebrew letter, so we don't match a
+# prefix buried inside a longer word.
Regex backtracking lets the preposition +# match empty when the prefix itself starts with one of these letters (בג"ץ). +_LEAD = r"(? bool: + return self.tier in ("supreme", "admin") + + +def normalize_case_number(raw: str) -> str: + """Canonicalize a case number for idempotency keys / matching. + + Mirrors ``citation_extractor._normalize_case_number``: strip everything + but digits and separators, unify ``/`` → ``-``. Display value is never + derived from this. + """ + cleaned = re.sub(r"[^\d/\-]", "", raw or "") + return cleaned.replace("/", "-").strip("-") + + +def _split_filed(num_norm: str) -> tuple[str, str, str] | None: + """Split a normalized NNNNN-MM-YY number into (file, month, year). + + Only the three-group "filed" format yields a נט-המשפט triple; two-group + formats (1234-22 / 1234/22) are Supreme-style serials and return None. + """ + m = _BARE_FILED_RX.fullmatch(num_norm) + if not m: + return None + file_no, month, year = m.group(1), m.group(2), m.group(3) + # Plausibility: month 1-12, year 2-4 digits. Reject implausible months + # (avoids mis-reading a 2-group serial that slipped through). + if not (1 <= int(month) <= 12): + return None + return file_no, month, year + + +def classify(citation: str) -> CourtCitation: + """Classify a raw citation string into a fetch tier + parsed number. + + Resolution order: ועדת-ערר (skip) is checked FIRST so an "ערר" prefix is + never mis-routed to a court tier; then Supreme prefixes; then admin + prefixes; then a bare filed number defaults to ``admin`` (נט המשפט is the + only public source for prefix-less district/שלום numbers). + """ + text = (citation or "").strip() + if not text: + return CourtCitation("unknown", "", "", "") + + # 1. ועדת-ערר → skip (must win over any court match). + m = _SKIP_RX.search(text) + if m: + raw = m.group(2) + return CourtCitation( + tier="skip", + court_prefix=m.group(1), + case_number_raw=raw, + case_number_norm=normalize_case_number(raw), + ) + + # 2. Supreme Court prefix → Tier 0. 
+ m = _SUPREME_RX.search(text) + if m: + raw = m.group(2) + return CourtCitation( + tier="supreme", + court_prefix=m.group(1), + case_number_raw=raw, + case_number_norm=normalize_case_number(raw), + ) + + # 3. District / admin prefix → Tier 1. + m = _ADMIN_RX.search(text) + if m: + raw = m.group(2) + norm = normalize_case_number(raw) + filed = _split_filed(norm) + return CourtCitation( + tier="admin", + court_prefix=m.group(1), + case_number_raw=raw, + case_number_norm=norm, + file_number=filed[0] if filed else None, + month=filed[1] if filed else None, + year=filed[2] if filed else None, + ) + + # 4. Bare filed number (no prefix) → default admin (נט המשפט). + m = _BARE_FILED_RX.search(text) + if m: + raw = m.group(0) + norm = normalize_case_number(raw) + filed = _split_filed(norm) + if filed: + return CourtCitation( + tier="admin", + court_prefix="", + case_number_raw=raw, + case_number_norm=norm, + file_number=filed[0], + month=filed[1], + year=filed[2], + ) + + return CourtCitation("unknown", "", "", "") diff --git a/mcp-server/src/legal_mcp/services/court_fetch_orchestrator.py b/mcp-server/src/legal_mcp/services/court_fetch_orchestrator.py new file mode 100644 index 0000000..d7a1cea --- /dev/null +++ b/mcp-server/src/legal_mcp/services/court_fetch_orchestrator.py @@ -0,0 +1,241 @@ +"""X13 orchestrator — classify → fetch → ingest → record. + +The single entry point (`fetch_and_ingest`) wires the three tiers to the +**canonical** precedent-ingest pipeline (INV-CF1 — no parallel ingest path) +and keeps the `court_fetch_jobs` row honest at every step (INV-CF2 — a job +always ends in an explicit terminal state, never a silent drop). + +Tier routing (from `court_citation.classify`): + * ``skip`` — ועדת-ערר → never fetched; logged as a missing_precedent gap. + * ``supreme`` — Tier 0, in-process httpx (`court_fetch_supreme`). + * ``admin`` — Tier 1, the host-side stealth-browser service over loopback. 
+ +Fallback (INV-CF3): after ``MAX_AUTONOMOUS_ATTEMPTS`` autonomous failures the +job flips to ``manual`` and a missing_precedent row is opened so the chair +sees the gap and can solve the CAPTCHA live (VNC) or drop the file manually. + +This module runs **in the local MCP server only** — `ingest_precedent` drives +halacha extraction via the local ``claude`` CLI (see `claude_session.py`). It +is invoked from the `court_verdict_fetch` MCP tool, not from the container. +""" + +from __future__ import annotations + +import logging +import os +import tempfile +from pathlib import Path +from uuid import UUID + +import httpx + +from legal_mcp.services import court_citation, db +from legal_mcp.services.court_fetch_supreme import ( + SupremeFetchError, + fetch_supreme_verdict, +) + +logger = logging.getLogger(__name__) + +# After this many autonomous failures, stop auto-retrying and escalate to a +# human (INV-CF3). Kept low — the .gov site shouldn't be hammered (INV-CF4). +MAX_AUTONOMOUS_ATTEMPTS = int(os.environ.get("COURT_FETCH_MAX_ATTEMPTS", "2")) + +# The host-side Tier-1 browser service (pm2). The MCP server runs on the host +# and reaches the service at the address the pm2 config binds it to (10.0.1.1, +# the docker0 gateway); web/court_fetch_proxy.py is a separate, optional entry point. +COURT_FETCH_SERVICE_URL = os.environ.get( + "COURT_FETCH_SERVICE_URL", "http://10.0.1.1:8771" +) +_SHARED_SECRET = os.environ.get("COURT_FETCH_SHARED_SECRET", "").strip() +_TIER1_TIMEOUT_S = float(os.environ.get("COURT_FETCH_TIER1_TIMEOUT_S", "300")) + +# Provenance level by tier — Supreme rulings are binding; admin-court verdicts +# are administrative (set is_binding conservatively True, chair can downgrade). 
+_LEVEL_BY_TIER = {"supreme": "עליון", "admin": "מנהלי"} + + +class _Tier1Unavailable(RuntimeError): + """The host browser service is not reachable / not configured.""" + + +async def _ingest_bytes( + *, content: bytes, filename: str, citation: str, tier: str, + court: str, source_url: str, +) -> dict: + """Stage bytes to a temp file and run the canonical ingest (INV-CF1).""" + from legal_mcp.services import precedent_library + + suffix = Path(filename).suffix or ".pdf" + tmp = tempfile.NamedTemporaryFile( + prefix="court_fetch_", suffix=suffix, delete=False + ) + try: + tmp.write(content) + tmp.flush() + tmp.close() + result = await precedent_library.ingest_precedent( + file_path=tmp.name, + citation=citation, + court=court, + source_type="court_ruling", # INV-CF6 + precedent_level=_LEVEL_BY_TIER.get(tier, ""), + is_binding=True, + ) + # Stamp provenance on the new case_law row (INV-CF7). + case_law_id = result.get("case_law_id") + if case_law_id and source_url: + try: + await db.update_case_law( + UUID(str(case_law_id)), source_url=source_url + ) + except Exception: # provenance is best-effort, never blocks ingest + logger.warning("could not stamp source_url on %s", case_law_id) + return result + finally: + try: + os.unlink(tmp.name) + except OSError: + pass + + +async def _fetch_tier1_admin(cit: court_citation.CourtCitation) -> dict: + """Call the host-side browser service to fetch an admin-court verdict. + + Returns the service's JSON: ``{ok, content_b64, filename, source_url, + court, reason}``. Raises ``_Tier1Unavailable`` if the service can't be + reached, ``SupremeFetchError``-style RuntimeError on a fetch failure the + service reports. 
+ """ + if not (cit.file_number and cit.month and cit.year): + raise RuntimeError( + f"מספר-תיק {cit.case_number_norm} אינו בפורמט נט-המשפט (תיק-חודש-שנה)" + ) + headers = {"Authorization": f"Bearer {_SHARED_SECRET}"} if _SHARED_SECRET else {} + payload = { + "file_number": cit.file_number, + "month": cit.month, + "year": cit.year, + "case_number": cit.case_number_norm, + "court": cit.court_prefix, + } + try: + async with httpx.AsyncClient(timeout=_TIER1_TIMEOUT_S) as client: + resp = await client.post( + f"{COURT_FETCH_SERVICE_URL}/fetch", json=payload, headers=headers + ) + except httpx.TransportError as e: # connect failures and timeouts alike + raise _Tier1Unavailable( + f"שירות-האחזור (legal-court-fetch-service) אינו זמין ב-" + f"{COURT_FETCH_SERVICE_URL}: {e}" + ) from e + if resp.status_code != 200: + raise RuntimeError(f"שירות-האחזור החזיר {resp.status_code}: {resp.text[:200]}") + return resp.json() + + +async def fetch_and_ingest( + citation: str, *, digest_id: UUID | None = None +) -> dict: + """Classify a citation, fetch the verdict, ingest it, and record the job. + + Idempotent on the canonical case number (INV-CF5): a case already fetched + (job ``done``) is returned without re-fetching. + """ + cit = court_citation.classify(citation) + + # ── skip: ועדת-ערר — never auto-fetched (INV-CF6). Surface as a gap. 
── + if cit.tier == "skip": + await _open_gap(citation, reason="ועדת-ערר — לא ניתן לאחזור ציבורי (נדרש נבו)") + return {"status": "skipped", "tier": "skip", "citation": citation, + "reason": "appeals_committee — needs Nevo"} + if cit.tier == "unknown" or not cit.case_number_norm: + return {"status": "unrecognized", "citation": citation} + + # ── idempotent job row ── + job = await db.court_fetch_job_upsert( + case_number_norm=cit.case_number_norm, + citation_raw=citation, + tier=cit.tier, + court=cit.court_prefix, + digest_id=digest_id, + ) + if job.get("status") == "done": + return {"status": "already_done", "job": job} + if job.get("status") == "manual": + return {"status": "awaiting_manual", "job": job} + + job_id = UUID(str(job["id"])) + await db.court_fetch_job_update(job_id, status="running", bump_attempts=True) + + # ── fetch ── + try: + if cit.tier == "supreme": + fetched = await fetch_supreme_verdict( + citation=citation, case_number_norm=cit.case_number_norm + ) + content, filename = fetched.content, fetched.filename + source_url, court = fetched.source_url, fetched.court + else: # admin → Tier 1 + res = await _fetch_tier1_admin(cit) + if not res.get("ok"): + raise RuntimeError(res.get("reason") or "אחזור נכשל") + import base64 + content = base64.b64decode(res["content_b64"]) + filename = res.get("filename") or f"{cit.case_number_norm}.pdf" + source_url = res.get("source_url", "") + court = res.get("court") or cit.court_prefix + except (_Tier1Unavailable, SupremeFetchError, RuntimeError) as e: + return await _record_failure(job_id, cit, citation, str(e)) + + # ── ingest into the canonical pipeline (INV-CF1) ── + try: + result = await _ingest_bytes( + content=content, filename=filename, citation=citation, + tier=cit.tier, court=court, source_url=source_url, + ) + except Exception as e: # noqa: BLE001 — recorded, never swallowed (INV-CF2) + logger.exception("ingest failed for %s", cit.case_number_norm) + return await _record_failure(job_id, cit, citation, 
f"קליטה נכשלה: {e}") + + case_law_id = result.get("case_law_id") + await db.court_fetch_job_update( + job_id, status="done", + case_law_id=UUID(str(case_law_id)) if case_law_id else None, + source_url=source_url, error="", + ) + return {"status": "done", "tier": cit.tier, "case_law_id": case_law_id, + "citation": citation, "source_url": source_url, "ingest": result} + + +async def _record_failure( + job_id: UUID, cit: court_citation.CourtCitation, citation: str, err: str +) -> dict: + """Record a fetch/ingest failure; escalate to manual after N attempts (INV-CF3).""" + job = await db.court_fetch_job_get(cit.case_number_norm) + attempts = (job or {}).get("attempts", 1) + if attempts >= MAX_AUTONOMOUS_ATTEMPTS: + await db.court_fetch_job_update(job_id, status="manual", error=err) + await _open_gap( + citation, + reason=f"אחזור אוטונומי נכשל ({attempts} נסיונות) — נדרשת הורדה ידנית. {err}", + ) + logger.warning("court fetch escalated to manual: %s — %s", citation, err) + return {"status": "manual", "citation": citation, "error": err, + "attempts": attempts} + await db.court_fetch_job_update(job_id, status="failed", error=err) + logger.warning("court fetch failed (will retry): %s — %s", citation, err) + return {"status": "failed", "citation": citation, "error": err, + "attempts": attempts} + + +async def _open_gap(citation: str, *, reason: str) -> None: + """Open a missing_precedent gap so the chair sees it (INV-CF2/CF3). + + Best-effort + de-duplicated by the missing_precedents layer; a failure + here is logged, never raised (it must not mask the original outcome). 
+ """ + try: + await db.create_missing_precedent(citation=citation, notes=reason) + except Exception: + logger.warning("could not open missing_precedent for %s", citation) diff --git a/mcp-server/src/legal_mcp/services/court_fetch_supreme.py b/mcp-server/src/legal_mcp/services/court_fetch_supreme.py new file mode 100644 index 0000000..7acdbeb --- /dev/null +++ b/mcp-server/src/legal_mcp/services/court_fetch_supreme.py @@ -0,0 +1,181 @@ +"""Tier 0 — Supreme Court verdict fetcher (X13). + +Pulls a published Supreme Court verdict PDF from the **public** decisions +portal ``supremedecisions.court.gov.il`` — no smart-card, no CAPTCHA. The +portal is an AngularJS SPA backed by a small JSON API (reverse-engineered +from ``/Scripts/app/config.js`` + the search/results controllers): + + POST Home/SearchVerdicts body {"document": , "lan": 1} → result list + GET Home/GetCasesYearNum ?... (year + number lookup) → case + docs + GET Home/Download?path=&fileName=&type=4 → the PDF bytes + +Two things matter for getting a 200 instead of an F5 connection-reset +(verified empirically 2026-06-07): + * a **complete** browser header set — UA + Accept + Accept-Language. A bare + UA alone gets reset. + * **politeness** (INV-CF4): one request at a time, a cooldown between them, + a Referer of the portal root. We never parallelise or hammer. + +Honesty / scope: the *result→download* field mapping (where ``path`` and +``fileName`` live in the SearchVerdicts JSON) is derived from the client code, +not yet confirmed against a live JSON response (the live site rate-limited +probing during development). ``fetch_supreme_verdict`` therefore validates the +response shape and **raises** on anything unexpected (INV-CF2 — no silent +swallow) so the orchestrator can record the failure and fall back, rather than +returning a wrong/empty file. The first live run is the validation pass; see +the X13 verification section. 
+""" + +from __future__ import annotations + +import asyncio +import logging +import os +from dataclasses import dataclass + +import httpx + +logger = logging.getLogger(__name__) + +_BASE = "https://supremedecisions.court.gov.il" + +# A complete, browser-like header set. Empirically required to pass the F5 +# WAF (a bare User-Agent gets a TCP reset). +_HEADERS = { + "User-Agent": ( + "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " + "(KHTML, like Gecko) Chrome/126.0 Safari/537.36" + ), + "Accept": "application/json, text/plain, */*", + "Accept-Language": "he-IL,he;q=0.9,en;q=0.8", + "Referer": _BASE + "/", +} + +# Politeness knobs (INV-CF4). Serial only — never run these concurrently. +_REQUEST_TIMEOUT_S = float(os.environ.get("COURT_FETCH_HTTP_TIMEOUT_S", "30")) +_INTER_REQUEST_COOLDOWN_S = float(os.environ.get("COURT_FETCH_COOLDOWN_S", "2")) + +# type=4 → PDF in the portal's Download endpoint (from resultsControler.js). +_DOC_TYPE_PDF = "4" + + +@dataclass +class FetchedVerdict: + """A downloaded verdict file held in memory, ready for ingest.""" + + content: bytes + filename: str + source_url: str + court: str = "בית המשפט העליון" + + +class SupremeFetchError(RuntimeError): + """Raised when the public portal returns an unexpected shape / no document. + + Carries a human-readable Hebrew reason so the orchestrator can persist it + on the job row (INV-CF2) and decide on fallback. 
+ """ + + +async def _get(client: httpx.AsyncClient, path: str, **kwargs) -> httpx.Response: + await asyncio.sleep(_INTER_REQUEST_COOLDOWN_S) + resp = await client.get(f"{_BASE}/{path.lstrip('/')}", **kwargs) + resp.raise_for_status() + return resp + + +async def _post(client: httpx.AsyncClient, path: str, json: dict) -> httpx.Response: + await asyncio.sleep(_INTER_REQUEST_COOLDOWN_S) + resp = await client.post(f"{_BASE}/{path.lstrip('/')}", json=json) + resp.raise_for_status() + return resp + + +def _extract_doc_ref(results: object) -> tuple[str, str] | None: + """Pull (path, fileName) of the first verdict document from a results blob. + + The SearchVerdicts/GetCasesYearNum responses nest documents under varying + keys across the portal's endpoints. We probe the known shapes defensively + and return the first (path, fileName) pair found; ``None`` if none. + """ + def walk(node): + if isinstance(node, dict): + # A document node carries both a path and a file name. + path = node.get("Path") or node.get("path") + fname = node.get("FileName") or node.get("fileName") or node.get("Filename") + if path and fname: + yield (str(path), str(fname)) + for v in node.values(): + yield from walk(v) + elif isinstance(node, list): + for v in node: + yield from walk(v) + + for pair in walk(results): + return pair + return None + + +async def fetch_supreme_verdict( + *, citation: str, case_number_norm: str +) -> FetchedVerdict: + """Fetch a Supreme Court verdict PDF by citation. Raises on failure. + + Flow: full-text search for the citation → locate the verdict document's + (path, fileName) → download the PDF. Serial + cooled-down throughout. + """ + async with httpx.AsyncClient( + http2=True, + headers=_HEADERS, + timeout=_REQUEST_TIMEOUT_S, + follow_redirects=True, + ) as client: + # 1. Search. The portal's quick-search posts {document, lan}; lan=1=Hebrew. 
+ try: + search = await _post( + client, "Home/SearchVerdicts", + json={"document": citation, "lan": 1}, + ) + results = search.json() + except httpx.HTTPError as e: + raise SupremeFetchError( + f"חיפוש בפורטל העליון נכשל עבור {citation}: {e}" + ) from e + except ValueError as e: # non-JSON body + raise SupremeFetchError( + f"תשובת-חיפוש לא-JSON מהפורטל עבור {citation}" + ) from e + + ref = _extract_doc_ref(results) + if not ref: + raise SupremeFetchError( + f"לא נמצא מסמך-פסק עבור {citation} בפורטל העליון " + f"(ייתכן שאינו פורסם או שמבנה-התשובה השתנה)." + ) + path, fname = ref + + # 2. Download the PDF. + try: + dl = await _get( + client, "Home/Download", + params={"path": path, "fileName": fname, "type": _DOC_TYPE_PDF}, + ) + except httpx.HTTPError as e: + raise SupremeFetchError( + f"הורדת PDF נכשלה עבור {citation} (path={path}): {e}" + ) from e + + content = dl.content + ctype = dl.headers.get("content-type", "") + if not content or ("pdf" not in ctype.lower() and not content[:4] == b"%PDF"): + raise SupremeFetchError( + f"הקובץ שהתקבל עבור {citation} אינו PDF תקין (content-type={ctype})." + ) + + source_url = ( + f"{_BASE}/Home/Download?path={path}&fileName={fname}&type={_DOC_TYPE_PDF}" + ) + safe_name = fname if fname.lower().endswith(".pdf") else f"{case_number_norm}.pdf" + return FetchedVerdict( + content=content, filename=safe_name, source_url=source_url, + ) diff --git a/mcp-server/src/legal_mcp/services/db.py b/mcp-server/src/legal_mcp/services/db.py index 36e9301..e09a281 100644 --- a/mcp-server/src/legal_mcp/services/db.py +++ b/mcp-server/src/legal_mcp/services/db.py @@ -1352,6 +1352,36 @@ CREATE INDEX IF NOT EXISTS idx_digests_content_tsv ON digests USING gin(content_ """ +# ── X13 — Court Verdict Fetch queue ────────────────────────────────────── +# A lightweight, observable, idempotent job queue for the auto-fetch +# subsystem (docs/spec/X13-court-fetch.md). One row per court verdict we try +# to pull from a public source. 
Mirrors the extraction-queue pattern: status +# is always explicit (INV-CF2 — no silent drop), the canonical case number is +# the idempotency key (INV-CF5), and ``attempts`` drives the human-fallback +# gate (INV-CF3 — flip to 'manual' after N autonomous failures). +# V31 — digests (X12) took V30 when it merged first. +SCHEMA_V31_SQL = """ +CREATE TABLE IF NOT EXISTS court_fetch_jobs ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + case_number_norm TEXT NOT NULL UNIQUE, -- idempotency key (INV-CF5) + citation_raw TEXT NOT NULL DEFAULT '', + tier TEXT NOT NULL DEFAULT '', -- supreme | admin | skip + court TEXT NOT NULL DEFAULT '', + status TEXT NOT NULL DEFAULT 'pending', -- pending|running|done|failed|manual + attempts INT NOT NULL DEFAULT 0, + error TEXT NOT NULL DEFAULT '', + case_law_id UUID REFERENCES case_law(id) ON DELETE SET NULL, + digest_id UUID, -- source digest (X12), nullable for ad-hoc + source_url TEXT NOT NULL DEFAULT '', -- provenance (INV-CF7) + created_at TIMESTAMPTZ DEFAULT now(), + updated_at TIMESTAMPTZ DEFAULT now() +); +CREATE INDEX IF NOT EXISTS idx_court_fetch_jobs_status ON court_fetch_jobs(status); +CREATE INDEX IF NOT EXISTS idx_court_fetch_jobs_digest ON court_fetch_jobs(digest_id) + WHERE digest_id IS NOT NULL; +""" + + async def _run_schema_migrations(pool: asyncpg.Pool) -> None: async with pool.acquire() as conn: await conn.execute(SCHEMA_SQL) @@ -1385,7 +1415,8 @@ async def _run_schema_migrations(pool: asyncpg.Pool) -> None: await conn.execute(SCHEMA_V28_SQL) await conn.execute(SCHEMA_V29_SQL) await conn.execute(SCHEMA_V30_SQL) - logger.info("Database schema initialized (v1-v30)") + await conn.execute(SCHEMA_V31_SQL) + logger.info("Database schema initialized (v1-v31)") async def init_schema() -> None: @@ -5930,3 +5961,110 @@ async def find_missing_precedent_by_citation( citation.strip(), ) return _row_to_missing_precedent(row) if row else None + + +# ── X13 — Court Verdict Fetch jobs ─────────────────────────────────────── +# 
CRUD for the auto-fetch queue (docs/spec/X13-court-fetch.md). Status is +# always explicit; failures are recorded, never swallowed (INV-CF2). Upsert +# is keyed on the canonical case number (INV-CF5). + +def _row_to_court_fetch_job(row) -> dict: + return dict(row) if row else None + + +async def court_fetch_job_upsert( + case_number_norm: str, + citation_raw: str = "", + tier: str = "", + court: str = "", + digest_id: UUID | None = None, +) -> dict: + """Idempotent create-or-get of a fetch job by canonical case number. + + Re-requesting the same case number returns the existing row (with a + ``_existing`` flag) rather than creating a duplicate — the canonical + number is a UNIQUE key. A job that already reached a terminal state is + returned as-is so callers can decide whether to retry. + """ + if not (case_number_norm or "").strip(): + raise ValueError("case_number_norm is required") + pool = await get_pool() + async with pool.acquire() as conn: + existing = await conn.fetchrow( + "SELECT * FROM court_fetch_jobs WHERE case_number_norm = $1", + case_number_norm, + ) + if existing: + out = _row_to_court_fetch_job(existing) + out["_existing"] = True + return out + row = await conn.fetchrow( + """INSERT INTO court_fetch_jobs + (case_number_norm, citation_raw, tier, court, digest_id) + VALUES ($1, $2, $3, $4, $5) + RETURNING *""", + case_number_norm, citation_raw, tier, court, digest_id, + ) + out = _row_to_court_fetch_job(row) + out["_existing"] = False + return out + + +async def court_fetch_job_update( + job_id: UUID, + *, + status: str | None = None, + error: str | None = None, + case_law_id: UUID | None = None, + source_url: str | None = None, + bump_attempts: bool = False, +) -> dict: + """Patch a job row. 
Only provided fields change; ``updated_at`` always does.""" + sets = ["updated_at = now()"] + args: list = [] + if status is not None: + args.append(status); sets.append(f"status = ${len(args)}") + if error is not None: + args.append(error); sets.append(f"error = ${len(args)}") + if case_law_id is not None: + args.append(case_law_id); sets.append(f"case_law_id = ${len(args)}") + if source_url is not None: + args.append(source_url); sets.append(f"source_url = ${len(args)}") + if bump_attempts: + sets.append("attempts = attempts + 1") + args.append(job_id) + pool = await get_pool() + async with pool.acquire() as conn: + row = await conn.fetchrow( + f"UPDATE court_fetch_jobs SET {', '.join(sets)} " + f"WHERE id = ${len(args)} RETURNING *", + *args, + ) + return _row_to_court_fetch_job(row) + + +async def court_fetch_job_get(case_number_norm: str) -> dict | None: + pool = await get_pool() + async with pool.acquire() as conn: + row = await conn.fetchrow( + "SELECT * FROM court_fetch_jobs WHERE case_number_norm = $1", + case_number_norm, + ) + return _row_to_court_fetch_job(row) if row else None + + +async def court_fetch_job_list(status: str | None = None, limit: int = 100) -> list[dict]: + pool = await get_pool() + async with pool.acquire() as conn: + if status: + rows = await conn.fetch( + "SELECT * FROM court_fetch_jobs WHERE status = $1 " + "ORDER BY created_at DESC LIMIT $2", + status, limit, + ) + else: + rows = await conn.fetch( + "SELECT * FROM court_fetch_jobs ORDER BY created_at DESC LIMIT $1", + limit, + ) + return [_row_to_court_fetch_job(r) for r in rows] diff --git a/mcp-server/src/legal_mcp/tools/court_fetch.py b/mcp-server/src/legal_mcp/tools/court_fetch.py new file mode 100644 index 0000000..e4c44b2 --- /dev/null +++ b/mcp-server/src/legal_mcp/tools/court_fetch.py @@ -0,0 +1,56 @@ +"""MCP tools for the X13 court-verdict auto-fetch subsystem. 
+ +- ``court_verdict_fetch`` — classify a citation, fetch the verdict from the + matching public source (Supreme portal / נט המשפט), and ingest it into the + precedent library via the canonical pipeline. The standalone entry point + (also driven automatically from digest auto-link, see X12/X13). +- ``court_fetch_status`` — inspect the fetch-job queue (pending/failed/manual). + +Local-only: ``court_verdict_fetch`` runs the ingest pipeline, which drives +halacha extraction via the local ``claude`` CLI — same constraint as +``precedent_process_pending``. Invoking it from the container will fail. +""" + +from __future__ import annotations + +from legal_mcp.services import court_fetch_orchestrator as orch +from legal_mcp.services import db +from legal_mcp.tools.envelope import err as _err, ok as _ok + + +async def court_verdict_fetch(citation: str) -> str: + """אחזור אוטומטי של פסק-דין בית-משפט וקליטה לקורפוס. + + מקבל ציטוט (למשל 'עת"מ 46111-12-22' או 'עע"מ 1234/22'), מסווג את הערכאה, + מוריד את הפסק מהמקור הציבורי המתאים, וקולט אותו דרך צינור-הקליטה הקנוני. + ערר/בל"מ (ועדת-ערר) אינם ניתנים לאחזור ציבורי ויסומנו כפער. + """ + if not (citation or "").strip(): + return _err("citation is required") + try: + result = await orch.fetch_and_ingest(citation.strip()) + except Exception as e: # noqa: BLE001 — surfaced, not swallowed (INV-CF2) + return _err(f"אחזור נכשל: {e}") + + status = result.get("status") + if status in ("done", "already_done"): + return _ok(result, message="הפסק נקלט לקורפוס") + if status == "skipped": + return _ok(result, message="ועדת-ערר — לא ניתן לאחזור ציבורי (סומן כפער)") + if status in ("manual", "awaiting_manual"): + return _ok(result, message="האחזור האוטונומי נכשל — הוסלם להורדה ידנית") + if status == "unrecognized": + return _err("הציטוט לא זוהה כמספר-תיק תקין") + return _ok(result, message=f"סטטוס: {status}") + + +async def court_fetch_status(case_number: str = "", status_filter: str = "") -> str: + """סטטוס תור-האחזור. 
case_number לפריט יחיד, או status_filter לסינון רשימה.""" + if case_number.strip(): + from legal_mcp.services.court_citation import normalize_case_number + job = await db.court_fetch_job_get(normalize_case_number(case_number)) + if not job: + return _ok({"job": None}, message="אין job עבור תיק זה") + return _ok({"job": job}) + jobs = await db.court_fetch_job_list(status=status_filter.strip() or None) + return _ok({"jobs": jobs, "count": len(jobs)}) diff --git a/mcp-server/tests/test_court_citation.py b/mcp-server/tests/test_court_citation.py new file mode 100644 index 0000000..3521aa6 --- /dev/null +++ b/mcp-server/tests/test_court_citation.py @@ -0,0 +1,80 @@ +"""Unit tests for the X13 court-citation classifier.""" + +from __future__ import annotations + +from legal_mcp.services.court_citation import classify, normalize_case_number + + +def test_admin_filed_format_the_example(): + """The plan's example: עת"מ 46111-12-22 → admin, parsed into (46111,12,22).""" + c = classify('עת"מ 46111-12-22 יכין-אפק בע"מ נ\' הוועדה המחוזית') + assert c.tier == "admin" + assert c.court_prefix in ('עת"מ', "עת״מ") + assert c.case_number_raw == "46111-12-22" + assert c.case_number_norm == "46111-12-22" + assert (c.file_number, c.month, c.year) == ("46111", "12", "22") + assert c.fetchable is True + + +def test_bare_filed_number_defaults_admin(): + c = classify("46111-12-22") + assert c.tier == "admin" + assert (c.file_number, c.month, c.year) == ("46111", "12", "22") + + +def test_supreme_prefixes(): + for cit, pref in [ + ('עע"מ 1234/22', "supreme"), + ('בג"ץ 5678/21', "supreme"), + ('ע"א 999/20', "supreme"), + ('רע"א 4/19', "supreme"), + ('בר"מ 8126/24', "supreme"), + ]: + c = classify(cit) + assert c.tier == pref, f"{cit} -> {c.tier}" + assert c.fetchable is True + + +def test_appeals_committee_is_skip(): + """ערר / בל"מ must never be auto-fetched (needs Nevo) — INV-CF6.""" + for cit in ['ערר 1110/20', 'בל"מ 8048/24', "ערר 1015-01-24 ירושלים שקופה"]: + c = classify(cit) + assert 
c.tier == "skip", f"{cit} -> {c.tier}"
+        assert c.fetchable is False
+
+
+def test_skip_wins_over_court_match():
+    """An 'ערר' citation that also contains court-like digits stays skip."""
+    c = classify("ראה החלטתי בערר 1041/24 ובהמשך")
+    assert c.tier == "skip"
+
+
+def test_admin_amn_prefix():
+    c = classify('עמ"נ 12345-06-23')
+    assert c.tier == "admin"
+    assert (c.file_number, c.month, c.year) == ("12345", "06", "23")
+
+
+def test_two_group_serial_has_no_filed_triple():
+    """Supreme serial 1234/22 normalizes but yields no (file,month,year)."""
+    c = classify('עע"מ 1234/22')
+    assert c.case_number_norm == "1234-22"
+    assert c.file_number is None
+
+
+def test_implausible_month_not_parsed_as_filed():
+    # 1234-22-05 has month=22 → not a valid filed triple.
+    assert classify("1234-22-05").tier in ("unknown", "admin")
+    c = classify("1234-22-05")
+    if c.tier == "admin":
+        assert c.month is None
+
+
+def test_empty_and_garbage():
+    assert classify("").tier == "unknown"
+    assert classify("שלום עולם בלי ציטוט").tier == "unknown"
+
+
+def test_normalize_case_number():
+    assert normalize_case_number('עת"מ 46111/12/22') == "46111-12-22"
+    assert normalize_case_number("1110/20") == "1110-20"
diff --git a/scripts/SCRIPTS.md b/scripts/SCRIPTS.md
index c24f292..2f39b41 100644
--- a/scripts/SCRIPTS.md
+++ b/scripts/SCRIPTS.md
@@ -19,6 +19,7 @@
 | `fu2c_reconcile_external_case_numbers.py` | python | **FU-2c (GAP-08, #68): reconciles the `case_number` of external case law** (`source_kind <> internal_committee`) from a full citation to the canonical form **procedure designator + docket** (chair decision 2026-05-31, Option A: `/` is kept, *not* `-`; consistent with db.py:369 and INV-ID2). Deterministic (designator+docket; 0/>1 docket → flag). `--dry-run` (the default) produces `data/audit/fu2c-reconciliation-*.{csv,md}` with flags (MISMATCH / NO_CITATION / CIT_NO_DOCKET / DESIG_MISMATCH / DUP_CHECK). `--apply --approved ` takes a backup and then updates the non-blocking rows (including ADVISORY/NO_CITATION). `--overrides ` (id,proposed_canonical,reason) unblocks blocking rows by explicit chair ruling (e.g. a consolidated verdict; see `data/audit/fu2c-overrides.csv` for the לויתן/קלמנוביץ record). The extraction logic and the flag split were verified offline on 24 records. Scope: external only (internal = FU-2b). FK-safe. | One-off, **chair-gated** (apply only after approval from דפנה) |
 | `eval_gold_bootstrap.py` | python | **FU-5 (GAP-11): bootstrap for the retrieval-evaluation gold set**, written to `data/eval/gold-set.jsonl`. Two sources: `--source citations` (cited==relevant from `search_relevance_feedback`; empty until citations accumulate) and `--source known_item` (query=case name → relevant=itself; the real signal available today). Idempotent: keeps `source=chair` rows, regenerates `bootstrap_*`. Requires POSTGRES. | Before eval; rerun as ground truth accumulates |
 | `eval_retrieval.py` | python | **FU-5 (GAP-11, INV-RET4/G8): retrieval-evaluation harness.** Runs the production retrieval path (`search_library`/`search_internal`) over the gold set, computes precision@k/recall@k/MRR/nDCG@k (k=5,10), aggregates overall+per-corpus+per-PA into `data/eval/eval-report-.{json,md}` + a delta against `data/eval/baseline.json` (records retrieval_config). `--self-test` checks the metrics offline; `--update-baseline` adopts a snapshot. **CI gate by discipline:** run before and after every change to the retrieval layer, with the same config. Requires POSTGRES+VOYAGE_API_KEY. | Before/after a change to RRF/k/embedder/rerank |
+| `legal-court-fetch-service.config.cjs` | pm2/js | **Tier-1 host service for fetching verdicts from נט המשפט (X13).** Runs `python -m legal_mcp.court_fetch_service.server` under pm2, bound to `10.0.1.1:8771`, with Bearer auth (`COURT_FETCH_SHARED_SECRET` from `~/.legal-court-fetch-service.env`). Runs a Camoufox (open-source) browser because the container cannot. Dependencies for actual fetching: a running `camofox-browser` (`CAMOFOX_URL`) + `faster-whisper` for reCAPTCHA audio; otherwise it returns ok:false and the orchestrator escalates to the human fallback. Mirrors the `legal-chat-service.config.cjs` pattern. Spec: `docs/spec/X13-court-fetch.md`. Install: `pm2 start scripts/legal-court-fetch-service.config.cjs && pm2 save`. Health: `curl http://10.0.1.1:8771/health`. | pm2 (host-side) |
 | `auto-sync-cases.sh` | bash | Syncs appeal case files to Gitea; runs every minute | `* * * * *` (cron) |
 | `backup-db.sh` | bash | Daily PostgreSQL backup to `data/backups/` (gzip) | To schedule: `0 2 * * *` |
 | `restore-db.sh` | bash | Restores the DB from a backup (companion to backup-db.sh) | Manual |
diff --git a/scripts/legal-court-fetch-service.config.cjs b/scripts/legal-court-fetch-service.config.cjs
new file mode 100644
index 0000000..2cc6ec8
--- /dev/null
+++ b/scripts/legal-court-fetch-service.config.cjs
@@ -0,0 +1,65 @@
+/**
+ * pm2 ecosystem entry for legal-court-fetch-service, the host-side Tier-1
+ * verdict fetcher (X13). It drives a Camoufox stealth browser against
+ * נט המשפט to download administrative/district-court verdicts the Supreme
+ * portal (Tier 0) doesn't carry. Lives on the host because the legal-ai
+ * container can't run a browser. See docs/spec/X13-court-fetch.md.
+ *
+ * Mirrors legal-chat-service.config.cjs (same security model):
+ *   1. Bind to 10.0.1.1 (docker0 bridge gateway): host + docker-bridge
+ *      containers only; nothing from outside the host.
+ *   2. Bearer token auth: COURT_FETCH_SHARED_SECRET loaded from
+ *      /home/chaim/.legal-court-fetch-service.env (chmod 600) and mirrored in
+ *      Coolify so the FastAPI proxy sends a matching Authorization header.
+ *      The service refuses to start without the secret.
+ *
+ * Prereqs for Tier-1 to actually fetch (otherwise it returns ok:false and the
+ * orchestrator escalates to the human fallback, INV-CF3):
+ *   - camofox-browser running, CAMOFOX_URL set (e.g. http://127.0.0.1:9377).
+ *       git clone https://github.com/jo-inc/camofox-browser && npm i && npm start
+ *   - faster-whisper installed in the venv for the reCAPTCHA audio solver.
+ *
+ * Install (once):
+ *   pm2 start /home/chaim/legal-ai/scripts/legal-court-fetch-service.config.cjs
+ *   pm2 save
+ * Smoke test:
+ *   curl http://10.0.1.1:8771/health
+ * Update:
+ *   pm2 restart legal-court-fetch-service --update-env
+ */
+const fs = require("fs");
+
+const ENV_FILE = "/home/chaim/.legal-court-fetch-service.env";
+const env = {
+  HOME: "/home/chaim",
+  PATH: "/home/chaim/.local/bin:/usr/local/bin:/usr/bin:/bin",
+  PYTHONUNBUFFERED: "1",
+  // CAMOFOX_URL: "http://127.0.0.1:9377", // set when camofox-browser is up
+};
+try {
+  const text = fs.readFileSync(ENV_FILE, "utf8");
+  for (const line of text.split("\n")) {
+    if (!line || line.trim().startsWith("#")) continue;
+    const m = line.match(/^\s*([A-Z_][A-Z0-9_]*)\s*=\s*(.*?)\s*$/);
+    if (m) env[m[1]] = m[2];
+  }
+} catch (e) {
+  console.error(`legal-court-fetch-service: failed to load ${ENV_FILE}: ${e.message}`);
+  console.error("Service will refuse to start without COURT_FETCH_SHARED_SECRET.");
+}
+
+module.exports = {
+  apps: [
+    {
+      name: "legal-court-fetch-service",
+      cwd: "/home/chaim/legal-ai/mcp-server",
+      script: "/home/chaim/legal-ai/mcp-server/.venv/bin/python",
+      args: "-m legal_mcp.court_fetch_service.server --port 8771 --host 10.0.1.1",
+      env,
+      restart_delay: 5000,
+      max_restarts: 10,
+      autorestart: true,
+      max_memory_restart: "1G",
+    },
+  ],
+};
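
Reviewer note: the env-file loader in the pm2 config above accepts only `KEY=VALUE` lines whose key is upper-case, skips blank lines and `#` comments, and trims whitespace around the value. Below is a minimal Python sketch of the same parsing rule, for anyone who wants to check a `.legal-court-fetch-service.env` file from the Python side. The function name `parse_env_text` is hypothetical and is not part of this patch; only the regex is taken from the config.

```python
import re

# Same pattern as the regex in legal-court-fetch-service.config.cjs:
# an upper-case key, "=", and a value with surrounding whitespace trimmed.
_ENV_LINE = re.compile(r"^\s*([A-Z_][A-Z0-9_]*)\s*=\s*(.*?)\s*$")


def parse_env_text(text: str) -> dict[str, str]:
    """Hypothetical helper mirroring the config's KEY=VALUE loader."""
    env: dict[str, str] = {}
    for line in text.split("\n"):
        # Skip empty lines and comment lines, as the JS loop does.
        if not line or line.strip().startswith("#"):
            continue
        m = _ENV_LINE.match(line)
        if m:
            env[m.group(1)] = m.group(2)
    return env
```

Under this rule, quotes are not stripped: a line such as `COURT_FETCH_SHARED_SECRET="abc"` yields a value that still includes the quote characters, so the secret in the env file should probably be written unquoted. Lines with lower-case keys or an `export` prefix are ignored without any warning.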