feat(digests): קורפוס יומונים כשכבת-גילוי (radar) — X12

מאגר חדש ליומוני "כל יום" (עפר טויסטר) כשכבת-גילוי מעל קורפוסי-הפסיקה:
מקור-משני המצביע על פסק הדין המקורי, נקלט לטבלה נפרדת `digests`, נחפש
סמנטית, ומקושר לפסק המקורי בספריית הפסיקה — אך לעולם אינו מצוטט בהחלטה
ואינו מחלץ הלכות.

Phase 0 (spec):
- docs/spec/X12-digests-radar.md — INV-DIG1 (מצביע לא מצוטט) /
  INV-DIG2 (מסלול-קליטה נפרד, לא מקביל — מקיים G2) / INV-DIG3 (קישור-לפסק
  הוא הגשר; חוסר-קישור = פער גלוי). עדכון אינדקס 00/03/README.

Phase 1 (MVP):
- SCHEMA_V30: טבלת `digests` (HNSW על embedding — לא ivfflat, להימנע מ-recall
  cliff בקורפוס קטן/צומח) + GIN/FTS + UNIQUE חלקי ל-idempotent.
- services/digest_metadata_extractor.py — חילוץ-LLM (claude_session local-only,
  ייבוא lazy): תג-מושג, כותרת-הלכה, מראה-מקום, שני-תאריכים מובחנים, תגיות.
- services/digest_library.py — מסלול קצר עצמאי (INV-DIG2): extract→hash→LLM→
  embedding יחיד→autolink. לא משתמש ב-ingest.ingest_document.
- tools/digests.py + רישום 7 כלים ב-server.py (digest_upload/list/get/link/
  relink/delete + search_digests).
- scripts/ingest_digests_batch.py — קליטה ידנית מ-data/digests/incoming.
- legal-researcher.md: שלב 2ב.0 (סריקת-radar לפני אימות) + סעיף-דוח ט +
  3 כלים ב-frontmatter. HEARTBEAT §8: ניתוב יומון→digest_upload.

אומת end-to-end: 4 יומונים נקלטו (מטא-דאטה מדויק), חיפוש סמנטי מדרג נכון
("היטל השבחה"→5160, "תמא 38"→5158), link/relink/autolink/revert + מעטפת-MCP.

Invariants: מוסיף INV-DIG1/2/3 (X12). מקיים G2 (bounded context נפרד, לא
מסלול מקביל), G3 (idempotent upsert), G4 (אין בליעה שקטה — פער-קישור מוצף),
G9 (עקיבוּת — היומון מצביע על מקור עקיב). נוגע G7 (RRF) — נדחה, חיפוש
סמנטי-בלבד בשלב 1 (FTS index מוכן).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-07 17:49:00 +00:00
parent 9eaabffba4
commit 8171572cdd
13 changed files with 1353 additions and 5 deletions

View File

@@ -0,0 +1,137 @@
"""Batch ingest of "כל יום" daily digests staged in data/digests/incoming/ (X12).
Sequential (NOT concurrent — same load-spike caution as ingest_incoming_batch.py)
ingest of each yomon PDF via the standalone digest pipeline
(``digest_library.ingest_digest``), which:
- extracts text, dedups on content_hash (idempotent),
- runs the local LLM metadata extractor (concept_tag, headline, underlying
citation, two dates, practice_area, subject_tags),
- stores a single embedding,
- auto-links to the underlying ruling if it is already in the precedent
library (INV-DIG3).
The digest is a SECONDARY, radar-only source — it never enters the precedent /
halacha pipeline and is never cited in a decision (INV-DIG1/2). After this run,
relink unmatched digests once the originals are uploaded, or surface them via
missing_precedent_create.
Yomon number + issue date are parsed from the filename
("יומון 5158 - 31.5.26.pdf") as hints; the LLM also extracts them from the
body and the explicit hint wins. The monthly bulletin (e.g. "201 יוני.pdf") is
multi-topic and skipped (Phase 3).
Run: mcp-server/.venv/bin/python scripts/ingest_digests_batch.py
(optionally pass explicit file paths as args)
Config (POSTGRES_URL, VOYAGE_API_KEY, ANTHROPIC_API_KEY) auto-loads from ~/.env.
"""
import asyncio
import os
import re
import shutil
import sys
import traceback
from pathlib import Path
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "mcp-server", "src"))
from legal_mcp import config # noqa: E402
from legal_mcp.services import digest_library as svc # noqa: E402
INCOMING = Path(config.DATA_DIR) / "digests" / "incoming"
PROCESSED = Path(config.DATA_DIR) / "digests" / "processed"
# Matches "יומון 5158 - 31.5.26" → ("5158", "31.5.26")
_NAME_RE = re.compile(r"יומון\s*(\d+)\s*-\s*(\d{1,2})\.(\d{1,2})\.(\d{2,4})")
def _parse_name(fname: str) -> tuple[str, str | None]:
"""Return (yomon_number, iso_date_or_None) parsed from the filename."""
m = _NAME_RE.search(fname)
if not m:
return "", None
num, dd, mm, yy = m.groups()
year = int(yy)
if year < 100:
year += 2000
try:
iso = f"{year:04d}-{int(mm):02d}-{int(dd):02d}"
except ValueError:
iso = None
return num, iso
def _discover() -> list[Path]:
if not INCOMING.exists():
return []
out = []
for p in sorted(INCOMING.glob("*.pdf")):
if "יומון" not in p.name:
print(f"⊘ skip (not a single yomon): {p.name}", flush=True)
continue
out.append(p)
return out
async def main(argv: list[str]) -> None:
files = [Path(a) for a in argv] if argv else _discover()
if not files:
print(f"No yomon PDFs found in {INCOMING}", flush=True)
return
PROCESSED.mkdir(parents=True, exist_ok=True)
results = []
for idx, fp in enumerate(files):
rec = {"file": fp.name}
if not fp.exists():
rec["error"] = "file-missing"
print(f"{fp.name}: file missing", flush=True)
results.append(rec)
continue
yomon_number, iso_date = _parse_name(fp.name)
try:
out = await svc.ingest_digest(
file_path=fp,
yomon_number=yomon_number,
digest_date=iso_date,
)
rec.update({
"status": out.get("status"),
"digest_id": out.get("digest_id"),
"yomon_number": out.get("yomon_number"),
"underlying_citation": out.get("underlying_citation"),
"linked_case_law_id": out.get("linked_case_law_id"),
})
link = "🔗 linked" if out.get("linked_case_law_id") else "⚠ unlinked"
print(
f"{fp.name}: {out.get('status')} | yomon={out.get('yomon_number')} | "
f"{link} | {out.get('underlying_citation')}",
flush=True,
)
# Move to processed/ so re-runs are clean (idempotent anyway).
try:
shutil.move(str(fp), str(PROCESSED / fp.name))
except Exception as e:
print(f" (could not move {fp.name}: {e})", flush=True)
except Exception as e:
rec["error"] = f"{type(e).__name__}: {e}"
print(f"{fp.name}: {e}", flush=True)
traceback.print_exc()
results.append(rec)
print("\n===SUMMARY===", flush=True)
for r in results:
print(r, flush=True)
linked = sum(1 for r in results if r.get("linked_case_law_id"))
unlinked = sum(
1 for r in results
if r.get("status") in ("completed", "exists") and not r.get("linked_case_law_id")
)
print(
f"\nTotal: {len(results)} | linked: {linked} | unlinked (need precedent upload): {unlinked}",
flush=True,
)
if __name__ == "__main__":
asyncio.run(main(sys.argv[1:]))