refactor(digests): single source of truth — drop processed/ folder state (X12)
The DB (`digests` table) is the single source of truth for ingestion state. ingest_digests_batch.py used to move files from incoming/ to processed/, which created folder-based state parallel to the DB (a minor G2 violation).

- Removed the move to processed/, along with `import shutil` and `PROCESSED`. The script relies on content_hash dedup (ingest_digest returns 'exists' for already-ingested files), so re-running is safe.
- Folders (incoming/) are staging only, not state.
- X12 INV-DIG2: documented the single source of truth and the violation that was fixed (processed/).
- SCRIPTS.md updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
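The idempotency argument above can be illustrated with a minimal sketch. It assumes a plain dict stands in for the `digests` table and SHA-256 stands in for whatever hash the real `content_hash` uses; `ingest_once` and the `"created"` status are hypothetical names, and only the `exists` status comes from the commit itself.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hash the raw file bytes (SHA-256 is an assumption, not the project's confirmed choice)."""
    return hashlib.sha256(data).hexdigest()


def ingest_once(table: dict[str, str], name: str, data: bytes) -> str:
    """Hypothetical stand-in for ingest_digest: the table alone decides what is ingested."""
    key = content_hash(data)
    if key in table:
        # Already ingested: report it and change nothing, so re-runs are harmless.
        return "exists"
    table[key] = name
    return "created"  # assumed status name for a first-time ingest


# The same file left in incoming/ and seen on two consecutive runs.
table: dict[str, str] = {}
first = ingest_once(table, "a.pdf", b"same bytes")
second = ingest_once(table, "a.pdf", b"same bytes")
```

Because the second call is a no-op, nothing needs to be moved out of the staging folder to keep re-runs clean.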
@@ -15,6 +15,13 @@ halacha pipeline and is never cited in a decision (INV-DIG1/2). After this run,
 relink unmatched digests once the originals are uploaded, or surface them via
 missing_precedent_create.
 
+SINGLE SOURCE OF TRUTH: the `digests` table (DB) is the ONLY authority for what
+has been ingested. This script does NOT move files between folders — re-running
+is safe because ``ingest_digest`` dedups on content_hash (already-ingested →
+returns ``exists``). Files left in ``incoming/`` are simply re-checked and
+skipped. (Earlier versions moved files to a ``processed/`` folder; that created
+a second, divergent state and was removed.)
+
 Yomon number + issue date are parsed from the filename
 ("יומון 5158 - 31.5.26.pdf") as hints; the LLM also extracts them from the
 body and the explicit hint wins. The monthly bulletin (e.g. "201 יוני.pdf") is
@@ -28,7 +35,6 @@ Config (POSTGRES_URL, VOYAGE_API_KEY, ANTHROPIC_API_KEY) auto-loads from ~/.env.
 import asyncio
 import os
 import re
-import shutil
 import sys
 import traceback
 from pathlib import Path
@@ -39,7 +45,6 @@ from legal_mcp import config # noqa: E402
 from legal_mcp.services import digest_library as svc # noqa: E402
 
 INCOMING = Path(config.DATA_DIR) / "digests" / "incoming"
-PROCESSED = Path(config.DATA_DIR) / "digests" / "processed"
 
 # Matches "יומון 5158 - 31.5.26" → ("5158", "31.5.26")
 _NAME_RE = re.compile(r"יומון\s*(\d+)\s*-\s*(\d{1,2})\.(\d{1,2})\.(\d{2,4})")
@@ -78,7 +83,6 @@ async def main(argv: list[str]) -> None:
     if not files:
         print(f"No yomon PDFs found in {INCOMING}", flush=True)
         return
-    PROCESSED.mkdir(parents=True, exist_ok=True)
 
     results = []
     for idx, fp in enumerate(files):
@@ -108,11 +112,8 @@ async def main(argv: list[str]) -> None:
                 f"{link} | {out.get('underlying_citation')}",
                 flush=True,
             )
-            # Move to processed/ so re-runs are clean (idempotent anyway).
-            try:
-                shutil.move(str(fp), str(PROCESSED / fp.name))
-            except Exception as e:
-                print(f" (could not move {fp.name}: {e})", flush=True)
+            # No folder move — the DB (content_hash) is the single source of
+            # truth. Re-running re-checks incoming/ and skips already-ingested.
         except Exception as e:
             rec["error"] = f"{type(e).__name__}: {e}"
             print(f"✗ {fp.name}: {e}", flush=True)