feat(X13 Tier-1): calibrate נט המשפט fetch — Camoufox python, proven on 46111-12-22

אומת end-to-end: פס"ד 34 עמ' של עת"מ 46111-12-22 הורד אוטונומית מלא, נטו
קוד-פתוח, ללא כרטיס-חכם וללא פתרון-CAPTCHA.

ממצאי-כיול עיקריים:
- החיפוש+הניווט-לתיק ללא reCAPTCHA כלל. reCAPTCHA קיים רק בצופה ורק על
  שמירה/הדפסה מפורשת — לא על הצגת המסמך.
- הצופה מגיש עמודים כ-PNG דרך PageMethod GetImages (4/batch); משיכה ב-fetch
  עם הכותרת X-Requested-With: XMLHttpRequest (חובה — F5 WAF חוסם בלעדיה) →
  הרכבת PDF (Pillow).

שינויים:
- camofox_client.py: שכתוב מלא — Camoufox דרך חבילת-הפייתון (in-process,
  לא שרת-Node REST). מסלול מכויל: home→btnExternalSearchCases→Bama fields→
  CaseDetails→פסקי דין→DecisionList→NGCSViewerPage→GetImages→PDF.
- pm2 config: app Xvfb :99 + DISPLAY=:99 (Camoufox קורס headless בלי צג וירטואלי).
- pyproject: extra [court-fetch] = camoufox + faster-whisper (host-only; הקונטיינר
  לא מריץ דפדפן). Pillow כבר בבסיס.
- X13 spec + SCRIPTS.md: עודכנו לממצאים (image-API, Xvfb, אימות).

reCAPTCHA audio (Whisper) נשמר כ-fallback למסלול-השמירה-המפורש בלבד; המסלול
הראשי אינו זקוק לו. Invariants: מקיים INV-CF1/CF4/CF6 (ללא שינוי).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-07 19:32:13 +00:00
parent f3740fef68
commit 781f24c643
5 changed files with 256 additions and 115 deletions

View File

@@ -1,148 +1,251 @@
"""Camoufox-browser client + נט-המשפט navigation flow (X13, Tier 1).
"""Camoufox driver for נט המשפט — calibrated, proven flow (X13, Tier 1).
Open-source, zero-API-cost stealth browsing: a self-hosted ``camofox-browser``
REST server (``jo-inc/camofox-browser``, wrapping Camoufox — a Firefox fork
with C++ fingerprint spoofing) drives a real browser. We talk to it over the
same REST surface the Hermes agent uses (``~/.hermes/.../browser_camofox.py``):
Open-source, zero-API-cost: drives a **Camoufox** stealth browser (a Firefox
fork with C++ fingerprint spoofing) via its official Python package
(``camoufox.async_api``) — in-process, no separate Node server. The full flow
was reverse-engineered and validated end-to-end against עת"מ 46111-12-22
(2026-06-07): a 34-page verdict PDF retrieved with **no smart-card and no
CAPTCHA-solving**.
POST /tabs → {tab_id}
POST /tabs/{tab}/navigate {url}
GET /tabs/{tab}/snapshot → accessibility tree w/ element refs
POST /tabs/{tab}/click {ref}
POST /tabs/{tab}/type {ref,text}
GET /tabs/{tab}/screenshot
DELETE /sessions/{user}
The proven path:
1. homepage → DOM-click ``btnExternalSearchCases`` ("תיקים לפי מס' תיק מקור").
2. Fill the visible header case-locator: ``BamaCaseNumberTextBoxH`` = case
number, ``BamaMonthYearTextBoxHT`` = "MM-YY"; click ``SearchHeaderCaseButton``.
→ lands on ``FolderCaseDetails/CaseDetails.aspx`` for the case.
3. Click the "פסקי דין" sidebar tab → ``Decisions/DecisionList.aspx``.
4. Click the document → popup ``Viewer/NGCSViewerPage.aspx?DocumentNumber=…``.
5. The viewer renders pages as PNG images via the ``GetImages`` PageMethod —
**served without reCAPTCHA** (the reCAPTCHA on the viewer only gates the
explicit save/print, which we don't use). Capture the internal
``documentNumber`` from the viewer's first ``GetImages`` call, then pull
every 4-page batch via ``fetch`` **with header ``X-Requested-With:
XMLHttpRequest``** (required — the F5 WAF blocks AJAX calls without it).
6. Decode the base64 PNGs → assemble a PDF (Pillow). The existing ingest
pipeline OCRs it (Google Vision) → text → corpus.
Set ``CAMOFOX_URL`` (e.g. ``http://127.0.0.1:9377``) to enable. The server's
``/health`` exposes a VNC URL — that's the human-fallback surface (INV-CF3):
when the autonomous reCAPTCHA solve fails, the chair opens the VNC and solves
it live, and this flow continues.
Operational requirements (see scripts/legal-court-fetch-service.config.cjs):
* a virtual display — Camoufox/Firefox crashes headless on this server
without one. Set ``DISPLAY`` to a running Xvfb (e.g. ``:99``).
* RAM — a Firefox content process loading the heavy ASP.NET pages needs
~0.51 GB; keep the box from swapping.
⚠ CALIBRATION: the נט-המשפט external-case-search is an ASP.NET WebForms app
behind an F5 WAF + reCAPTCHA. The element selectors and step sequence below
are the *documented plan* of the flow; they must be calibrated against the
live snapshot on first run (the site rate-limited static probing during
development). Every step that can't find its target **raises** a clear Hebrew
reason (INV-CF2 — no silent success-with-garbage) so the orchestrator escalates
to the Tier-2 human fallback rather than returning an empty/wrong file.
reCAPTCHA note: ``recaptcha_audio`` (local Whisper) remains as a fallback for
the explicit-PDF-download path, but the primary image-API path needs no
solving, so it is normally unused.
"""
from __future__ import annotations
import asyncio
import base64
import io
import json
import logging
import os
import httpx
import re
logger = logging.getLogger(__name__)
# נט המשפט public entry points (discovered from the homepage __doPostBack menu).
NGCS_HOME = "https://www.court.gov.il/ngcs.web.site/homepage.aspx"
CAMOFOX_URL = os.environ.get("CAMOFOX_URL", "").rstrip("/")
_TIMEOUT = float(os.environ.get("COURT_FETCH_BROWSER_TIMEOUT_S", "60"))
# Headless Camoufox needs a virtual display on this server.
_DISPLAY = os.environ.get("DISPLAY", "")
_NAV_TIMEOUT_MS = int(float(os.environ.get("COURT_FETCH_BROWSER_TIMEOUT_S", "60")) * 1000)
_PAGE_BATCH = 4 # the viewer's GetImages batch size
_MAX_PAGES = 400 # hard cap on a single document
class CamofoxUnavailable(RuntimeError):
"""camofox-browser isn't configured/reachable."""
"""Camoufox (or its virtual display) isn't available."""
class NgcsFlowError(RuntimeError):
"""A step in the נט-המשפט flow failed (selector/CAPTCHA/navigation)."""
"""A step in the נט-המשפט flow failed (navigation / not found / blocked)."""
def is_enabled() -> bool:
return bool(CAMOFOX_URL)
"""True if the Camoufox package imports (browser binary present)."""
try:
import camoufox.async_api # noqa: F401
return True
except Exception:
return False
async def health() -> dict:
"""Probe camofox-browser; surfaces the VNC URL for the human fallback."""
if not CAMOFOX_URL:
raise CamofoxUnavailable("CAMOFOX_URL is not set")
async with httpx.AsyncClient(timeout=10) as c:
r = await c.get(f"{CAMOFOX_URL}/health")
r.raise_for_status()
return r.json()
return {"camoufox_import": is_enabled(), "display": _DISPLAY or "(none)"}
class _Browser:
"""Thin async wrapper over the camofox-browser REST surface."""
def __init__(self, client: httpx.AsyncClient, tab_id: str, user_id: str):
self._c = client
self.tab = tab_id
self.user = user_id
@classmethod
async def open(cls, client: httpx.AsyncClient) -> "_Browser":
r = await client.post(f"{CAMOFOX_URL}/tabs", json={})
r.raise_for_status()
data = r.json()
return cls(client, data["tab_id"], data.get("user_id", data["tab_id"]))
async def navigate(self, url: str) -> None:
r = await self._c.post(f"{CAMOFOX_URL}/tabs/{self.tab}/navigate", json={"url": url})
r.raise_for_status()
async def snapshot(self) -> dict:
r = await self._c.get(f"{CAMOFOX_URL}/tabs/{self.tab}/snapshot")
r.raise_for_status()
return r.json()
async def click(self, ref: str) -> dict:
r = await self._c.post(f"{CAMOFOX_URL}/tabs/{self.tab}/click", json={"ref": ref})
r.raise_for_status()
return r.json()
async def type(self, ref: str, text: str) -> None:
r = await self._c.post(
f"{CAMOFOX_URL}/tabs/{self.tab}/type", json={"ref": ref, "text": text}
)
r.raise_for_status()
async def close(self) -> None:
async def _fill_visible(page, id_substr: str, value: str) -> bool:
for el in await page.locator(f"input[id*='{id_substr}']").all():
try:
await self._c.delete(f"{CAMOFOX_URL}/sessions/{self.user}")
except httpx.HTTPError:
pass
if await el.is_visible() and await el.is_editable():
await el.fill(value)
return True
except Exception:
continue
return False
async def _reach_viewer(page, *, case_number: str, month_year: str):
"""Drive home → search → case → פסקי דין → viewer popup. Returns the popup page."""
await page.goto(NGCS_HOME, wait_until="domcontentloaded", timeout=_NAV_TIMEOUT_MS)
await page.wait_for_timeout(2500)
await page.eval_on_selector(
"#Header1_UpperMenu1_btnExternalSearchCases", "el => el.click()"
)
try:
await page.wait_for_load_state("domcontentloaded", timeout=_NAV_TIMEOUT_MS)
except Exception:
pass
await page.wait_for_timeout(4500)
if not await _fill_visible(page, "BamaCaseNumberTextBoxH", case_number):
raise NgcsFlowError("שדה מספר-תיק לא נמצא בעמוד החיפוש")
my_filled = False
for el in await page.locator("input[id*='BamaMonthYearTextBoxHT']").all():
if await el.is_visible():
await el.click()
await page.keyboard.type(month_year, delay=60)
my_filled = True
break
if not my_filled:
raise NgcsFlowError("שדה חודש-שנה לא נמצא")
clicked = False
for b in await page.locator("[id*='SearchHeaderCaseButton']").all():
if await b.is_visible():
await b.click()
clicked = True
break
if not clicked:
raise NgcsFlowError("כפתור החיפוש לא נמצא")
await page.wait_for_timeout(6000)
if "CaseDetails" not in page.url:
raise NgcsFlowError(
f"לא הגענו לעמוד-התיק (URL={page.url[:80]}) — ייתכן שהתיק לא נמצא/לא פתוח לעיון"
)
# פסקי דין tab → DecisionList
psak = page.locator("a:has-text('פסקי דין')")
opened = False
for i in range(await psak.count()):
el = psak.nth(i)
if await el.is_visible():
await el.click()
opened = True
break
if not opened:
raise NgcsFlowError("לשונית 'פסקי דין' לא נמצאה בעמוד-התיק")
await page.wait_for_timeout(6000)
# open the verdict document viewer (popup)
viewers = page.locator(
"a[href*='Viewer'],[onclick*='Viewer'],a[href*='Document'],a:has-text('צפייה')"
)
async with page.context.expect_page(timeout=15000) as pop:
clicked = False
for i in range(await viewers.count()):
el = viewers.nth(i)
if await el.is_visible():
await el.click()
clicked = True
break
if not clicked:
raise NgcsFlowError("לא נמצא מסמך פסק-דין לצפייה")
return await pop.value
async def fetch_admin_verdict(
*, file_number: str, month: str, year: str, case_number: str, court: str
) -> dict:
"""Drive נט המשפט to download an admin/district verdict PDF.
"""Fetch an admin/district court verdict as a PDF. Returns
``{content: bytes, filename, source_url, court}``; raises on failure.
Returns ``{content: bytes, filename: str, source_url: str, court: str}``.
Raises ``CamofoxUnavailable`` / ``NgcsFlowError`` on failure.
The flow (to be calibrated against the live snapshot):
1. Open the homepage; trigger "חיפוש תיקים חיצוני" (btnExternalSearchCases).
2. Fill the case-number / month / year fields.
3. Solve the reCAPTCHA via the audio challenge (recaptcha_audio); on
repeated failure, surface the VNC URL for a human solve (INV-CF3).
4. Submit; open the matched case; locate the verdict ("פסק דין") document.
5. Download the cleared PDF (served via S3 pre-signed URL) and return bytes.
``file_number``/``month``/``year`` are the נט-המשפט triple (e.g. 46111/12/22).
"""
if not CAMOFOX_URL:
try:
from camoufox.async_api import AsyncCamoufox
except Exception as e:
raise CamofoxUnavailable(
"שירות-הדפדפן (camofox-browser) אינו מוגדר — הגדר CAMOFOX_URL "
"והפעל את jo-inc/camofox-browser. ראה docs/spec/X13-court-fetch.md."
"חבילת camoufox אינה מותקנת/זמינה. הרץ `pip install camoufox` ו-"
"`python -m camoufox fetch`. ראה docs/spec/X13-court-fetch.md."
) from e
if not _DISPLAY:
# Headless Firefox crashes here without a virtual display.
raise CamofoxUnavailable(
"אין DISPLAY — Camoufox דורש Xvfb על שרת ללא מסך. הפעל Xvfb (למשל :99) "
"והגדר DISPLAY (ראה pm2 config)."
)
async with httpx.AsyncClient(timeout=_TIMEOUT) as client:
br = await _Browser.open(client)
try:
await br.navigate(NGCS_HOME)
snap = await br.snapshot()
_ = snap # calibration anchor: locate btnExternalSearchCases here.
month_year = f"{int(month):02d}-{year[-2:]}"
doc_num = {"v": None}
# The concrete selector/CAPTCHA/download steps require live
# calibration with camofox running. Until calibrated we fail
# loudly so the orchestrator escalates to the human fallback
# (INV-CF2/CF3) rather than pretending success.
raise NgcsFlowError(
"זרימת נט-המשפט (Tier 1) ממתינה לכיול מול snapshot חי של "
"camofox-browser — בקשת-אחזור מוסלמת ל-fallback אנושי (VNC/ידני)."
)
finally:
await br.close()
async def on_resp(resp):
if "GetImages" in resp.url and not doc_num["v"]:
try:
doc_num["v"] = json.loads(resp.request.post_data).get("documentNumber")
except Exception:
pass
async with AsyncCamoufox(
headless=True, geoip=False, humanize=True, locale="he-IL"
) as browser:
page = await browser.new_page()
page.context.on("response", lambda r: asyncio.create_task(on_resp(r)))
vp = await _reach_viewer(page, case_number=file_number, month_year=month_year)
source_url = vp.url
await vp.wait_for_timeout(9000)
if not doc_num["v"]:
raise NgcsFlowError("לא נלכד documentNumber מהצופה (ייתכן שהמסמך לא נטען)")
# Pull every page batch through fetch() with X-Requested-With (WAF-safe).
imgs = await vp.evaluate(
"""async (args) => {
const [dn, maxPages, batch] = args;
const url = window.location.href.split('?')[0] + '/GetImages';
const out = {};
for (let f = 0; f < maxPages; f += batch) {
let d;
try {
const r = await fetch(url, {method:'POST', credentials:'include',
headers:{'Content-Type':'application/json; charset=utf-8',
'X-Requested-With':'XMLHttpRequest'},
body: JSON.stringify({documentNumber:dn, fromIndex:f, toIndex:f+batch-1})});
if (!r.ok) break;
const j = await r.json(); d = (j.d !== undefined) ? j.d : j;
} catch (e) { break; }
if (!Array.isArray(d) || d.length === 0) break;
d.forEach((html, k) => { if (html) out[f+k] = html; });
if (d.length < batch) break;
await new Promise(r => setTimeout(r, 350));
}
return out;
}""",
[doc_num["v"], _MAX_PAGES, _PAGE_BATCH],
)
if not imgs:
raise NgcsFlowError("לא התקבלו עמודי-מסמך מ-GetImages")
from PIL import Image
pages = []
for idx in sorted(imgs, key=lambda x: int(x)):
m = re.search(r"base64,([A-Za-z0-9+/=]+)", imgs[idx] or "")
if not m:
continue
pages.append(Image.open(io.BytesIO(base64.b64decode(m.group(1)))).convert("RGB"))
if not pages:
raise NgcsFlowError("עמודי-המסמך לא ניתנים לפענוח (base64)")
buf = io.BytesIO()
pages[0].save(buf, format="PDF", save_all=True, append_images=pages[1:])
content = buf.getvalue()
logger.info("נט המשפט: fetched %s%d pages, %d bytes",
case_number, len(pages), len(content))
return {
"content": content,
"filename": f"{case_number}.pdf",
"source_url": source_url,
"court": court or "בית משפט מחוזי",
"pages": len(pages),
}