תיקון: העלאת פסיקה/החלטת-ועדה (precedent-library + internal-decisions) נכשלה
תחת backend s3-only עם "Package not found at '/data/...docx'" / "Converted file
not found". השורש: `ingest._stage_file` כותב את הקובץ דרך `storage.put_file`
ומחזיר נתיב-DATA_DIR, אבל תחת s3-only ה-blob נכתב רק ל-MinIO ואין עותק בדיסק —
ואז הצינור קרא את הנתיב ישירות מהדיסק (extract_text) → קובץ לא קיים. מסלול
תיקי-המקרה לא נפגע כי הוא שומר עותק-דיסק + mirror_file; רק מסלול _stage_file
המשותף קרא את ה-key כאילו הוא על הדיסק.
התיקון (נרמול-במקור, G1; קריאה דרך שכבת-האחסון, INV-STG1):
- `_stage_file` מחזיר עכשיו את ה-KEY (נתיב יחסי-DATA_DIR), לא Path.
- `ingest_document` ו-`digest_library` מאתרים נתיב-קריאה מקומי דרך
`storage.ensure_local` (עותק-דיסק תחת filesystem/dual; הורדה ל-temp תחת
s3-only) ומנקים את ה-temp ב-finally — בלי דליפה ל-/tmp.
- מולטימודל (PDF) קורא את אותו נתיב מקומי מאומת.
בדיקות: test_unified_ingest::test_ingest_reads_via_ensure_local_when_no_disk_copy
מדמה backend ללא עותק-דיסק ומוודא שהצינור משלים (נכשל מול הקוד הישן). 55 עוברות.
Invariants: מקיים INV-STG1 (קריאה/כתיבה רק דרך שכבת-האחסון), G1 (נרמול-במקור,
לא תיקון-בקריאה), G2 (אין מסלול מקביל — תיקון הצינור הקנוני).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rewire the source-document staging writes onto the unified storage layer
(INV-STG1), replacing direct shutil.copy2 calls:
- tools/documents.py: case originals + training-corpus uploads
- services/ingest.py: _stage_file (now async) — covers precedent-library,
internal-decisions, and digests (the canonical intake helper)
- services/digest_library.py: awaits the now-async _stage_file
Each write goes through storage.put_file(..., bucket=DOCUMENTS) with the
DATA_DIR-relative key; the Hebrew original filename rides as object metadata
(INV-STG2), content-type is guessed from the extension. DB path columns are
unchanged (still the absolute dest) — object_key backfill is Phase 3.
Under the default STORAGE_BACKEND=filesystem the bytes land at the exact
legacy on-disk location (put_file → shutil.copy2 to DATA_DIR/key), so this
is zero behaviour change in prod. shutil import dropped where now unused.
tests: +2 staging regression tests (file lands under DATA_DIR at the legacy
path); 20 storage + 22 ingest tests green; 242 collected with no import
breakage.
Derived/export write sites (thumbnails, extracted text, DOCX exports) are
Phase 2b. Keeps G2; advances INV-STG1. Spec: docs/spec/X14-storage-minio.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#86.2 backfill + #86.3 benchmark, plus a #86.1 over-strip fix found en route.
extractor.py
- extract_nevo_ratio(): capture Nevo's מיני-רציו block (editorial holdings
summary) before it is stripped — a free professional gold-set (#86.3).
- _DECISION_START hardening (#86.2): the merged #86.1 regex over-stripped.
(a) פסק-דין headers are markdown-wrapped (**פסק דין**); the old anchor
required the keyword as the first line char with one separator, so it
missed the header and matched a citation 32K deep (עמ"נ 50567-07-21,
losing 45% of the body). Now tolerates leading markdown + 0-3 seps,
and the final-nun form (דין ן vs דינו נ).
(b) bare השופט/הנשיא matched CITATIONS ("השופט מ' חשין, פסקה 23"). The
authoring-judge line ends with a colon; we now require it.
ingest.py
- capture the ratio before stripping and store it on the row (best-effort,
non-fatal); also strip the text-upload path (was file-only).
db.py
- add case_law.nevo_ratio column (additive); allow it in update_case_law.
scripts/backfill_nevo_preamble.py (#86.2) — dry-run-by-default data migration:
finds historically-leaked rulings, captures ratio→nevo_ratio, rewrites
full_text (+content_hash), reindexes, and FLAGS (never deletes) halachot whose
quote lives in the removed preamble (review_status=pending_review +
nevo_preamble_leak flag). Safety guard: rows with keep%<--min-keep (60) are
excluded from --apply as suspected over-strip. --apply writes backup+manifest
to data/audit/ first. Chair-gated — NOT applied here.
scripts/nevo_ratio_benchmark.py (#86.3) — LLM-as-judge (local claude_session,
zero cost) measures recall/precision/granularity of our halachot vs the Nevo
ratio. Works pre- and post-backfill (reads nevo_ratio, falls back to full_text).
Verified:
- pytest tests/test_nevo_preamble.py — 12 passed (incl. citation/markdown
over-strip regressions).
- backfill dry-run: 19 leaked rulings, 27 contaminated halachot, all ≥75%
keep (the 32K over-strip is gone).
- benchmark on בג"ץ 1764/05: recall=0.875 precision=1.0 granularity=1.75x.
Invariants: G1 (normalize at source — strip/capture at ingest, not at read);
no silent swallow (contaminated halachot flagged + reported, not dropped);
data-migration is dry-run-default with backup+manifest, chair-gated.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wire db.recompute_searchable into the ingest pipeline (after statuses are set) and into
extract_and_apply (after fields are persisted to DB, success path only).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>