Commit Graph

6 Commits

Author SHA1 Message Date
1986fe3b14 feat(storage): X14 Phase 2a — route source-document writes through storage.py
Rewire the source-document staging writes onto the unified storage layer
(INV-STG1), replacing direct shutil.copy2 calls:
- tools/documents.py: case originals + training-corpus uploads
- services/ingest.py: _stage_file (now async) — covers precedent-library,
  internal-decisions, and digests (the canonical intake helper)
- services/digest_library.py: awaits the now-async _stage_file

Each write goes through storage.put_file(..., bucket=DOCUMENTS) with the
DATA_DIR-relative key; the Hebrew original filename rides as object metadata
(INV-STG2), content-type is guessed from the extension. DB path columns are
unchanged (still the absolute dest) — object_key backfill is Phase 3.

Under the default STORAGE_BACKEND=filesystem the bytes land at the exact
legacy on-disk location (put_file → shutil.copy2 to DATA_DIR/key), so this
is zero behaviour change in prod. shutil import dropped where now unused.

tests: +2 staging regression tests (file lands under DATA_DIR at the legacy
path); 20 storage + 22 ingest tests green; 242 collected with no import
breakage.

Derived/export write sites (thumbnails, extracted text, DOCX exports) are
Phase 2b. Keeps G2; advances INV-STG1. Spec: docs/spec/X14-storage-minio.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 08:00:27 +00:00
fb51a0e869 feat(nevo): backfill leaked preamble + ratio gold-set benchmark (#86)
#86.2 backfill + #86.3 benchmark, plus a #86.1 over-strip fix found en route.

extractor.py
- extract_nevo_ratio(): capture Nevo's מיני-רציו block (editorial holdings
  summary) before it is stripped — a free professional gold-set (#86.3).
- _DECISION_START hardening (#86.2): the merged #86.1 regex over-stripped.
  (a) פסק-דין headers are markdown-wrapped (**פסק  דין**); the old anchor
      required the keyword as the first line char with one separator, so it
      missed the header and matched a citation 32K deep (עמ"נ 50567-07-21,
      losing 45% of the body). Now tolerates leading markdown + 0-3 seps,
      and the final-nun form (דין ן vs דינו נ).
  (b) bare השופט/הנשיא matched CITATIONS ("השופט מ' חשין, פסקה 23"). The
      authoring-judge line ends with a colon; we now require it.

ingest.py
- capture the ratio before stripping and store it on the row (best-effort,
  non-fatal); also strip the text-upload path (was file-only).

db.py
- add case_law.nevo_ratio column (additive); allow it in update_case_law.

scripts/backfill_nevo_preamble.py (#86.2) — dry-run-by-default data migration:
finds historically-leaked rulings, captures ratio→nevo_ratio, rewrites
full_text (+content_hash), reindexes, and FLAGS (never deletes) halachot whose
quote lives in the removed preamble (review_status=pending_review +
nevo_preamble_leak flag). Safety guard: rows with keep%<--min-keep (60) are
excluded from --apply as suspected over-strip. --apply writes backup+manifest
to data/audit/ first. Chair-gated — NOT applied here.

scripts/nevo_ratio_benchmark.py (#86.3) — LLM-as-judge (local claude_session,
zero cost) measures recall/precision/granularity of our halachot vs the Nevo
ratio. Works pre- and post-backfill (reads nevo_ratio, falls back to full_text).

Verified:
- pytest tests/test_nevo_preamble.py — 12 passed (incl. citation/markdown
  over-strip regressions).
- backfill dry-run: 19 leaked rulings, 27 contaminated halachot, all ≥75%
  keep (the 32K over-strip is gone).
- benchmark on בג"ץ 1764/05: recall=0.875 precision=1.0 granularity=1.75x.

Invariants: G1 (normalize at source — strip/capture at ingest, not at read);
no silent swallow (contaminated halachot flagged + reported, not dropped);
data-migration is dry-run-default with backup+manifest, chair-gated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 19:45:43 +00:00
c7c7a1e119 feat(reindex): reindex_case_law from stored text + mark_indexed on ingest (GAP-09, FU-3)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 22:06:17 +00:00
6dbcb7e798 feat(ingest): recompute searchable on ingest + metadata completion (GAP-13, FU-2a)
Wire db.recompute_searchable into the ingest pipeline (after statuses are set) and into
extract_and_apply (after fields are persisted to DB, success path only).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 20:47:51 +00:00
be4f7bbe99 feat(ingest): canonical ingest_document pipeline (FU-1) 2026-05-30 19:13:15 +00:00
d4663eba8f feat(ingest): IntakeSpec + shared helpers for canonical pipeline (FU-1)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-30 19:11:27 +00:00