feat(storage): X14 Phase 2a — route source-document writes through storage.py

Rewire the source-document staging writes onto the unified storage layer
(INV-STG1), replacing direct shutil.copy2 calls:
- tools/documents.py: case originals + training-corpus uploads
- services/ingest.py: _stage_file (now async) — covers precedent-library,
  internal-decisions, and digests (the canonical intake helper)
- services/digest_library.py: awaits the now-async _stage_file

Each write goes through storage.put_file(..., bucket=DOCUMENTS) with the
DATA_DIR-relative key; the Hebrew original filename rides as object metadata
(INV-STG2), content-type is guessed from the extension. DB path columns are
unchanged (still the absolute dest) — object_key backfill is Phase 3.

Under the default STORAGE_BACKEND=filesystem the bytes land at the exact
legacy on-disk location (put_file → shutil.copy2 to DATA_DIR/key), so this
is zero behaviour change in prod. shutil import dropped where now unused.

tests: +2 staging regression tests (file lands under DATA_DIR at the legacy
path); 20 storage + 22 ingest tests green; 242 collected with no import
breakage.

Derived/export write sites (thumbnails, extracted text, DOCX exports) are
Phase 2b. Keeps G2; advances INV-STG1. Spec: docs/spec/X14-storage-minio.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-08 08:00:27 +00:00
parent 81b3de6f4f
commit 1986fe3b14
4 changed files with 82 additions and 18 deletions

View File

@@ -14,8 +14,8 @@ from __future__ import annotations
import asyncio
import logging
import mimetypes
import re
import shutil
from dataclasses import dataclass
from datetime import date
from pathlib import Path
@@ -23,7 +23,7 @@ from typing import Awaitable, Callable
from uuid import UUID, uuid4
from legal_mcp import config
from legal_mcp.services import chunker, db, embeddings, extractor
from legal_mcp.services import chunker, db, embeddings, extractor, storage
logger = logging.getLogger(__name__)
@@ -66,11 +66,20 @@ def _safe_filename(name: str) -> str:
return re.sub(r"[^\w.\-+א-ת ]", "_", base) or f"upload-{uuid4().hex[:8]}"
def _stage_file(src_path: Path, root: Path, subdir: str) -> Path:
dest_dir = root / (subdir or "other")
dest_dir.mkdir(parents=True, exist_ok=True)
dest = dest_dir / f"{uuid4().hex[:8]}_{_safe_filename(src_path.name)}"
shutil.copy2(src_path, dest)
async def _stage_file(src_path: Path, root: Path, subdir: str) -> Path:
"""Stage an intake file through the unified storage layer (INV-STG1).
Returns the DATA_DIR path the rest of the pipeline reads from — under the
filesystem/dual backends the bytes are on disk and the key is the
DATA_DIR-relative path. The Hebrew original filename rides as object
metadata, never as the key (INV-STG2)."""
dest = root / (subdir or "other") / f"{uuid4().hex[:8]}_{_safe_filename(src_path.name)}"
key = dest.relative_to(config.DATA_DIR).as_posix()
await storage.put_file(
src_path, key, bucket=storage.Bucket.DOCUMENTS,
content_type=mimetypes.guess_type(src_path.name)[0],
metadata={"filename": src_path.name},
)
return dest
@@ -151,7 +160,7 @@ async def ingest_document(
if not src.is_file():
raise FileNotFoundError(f"file not found: {src}")
await progress("staging", 5, "מעתיק את הקובץ לאחסון")
staged = _stage_file(src, spec.staging_root, spec.staging_subdir(inputs))
staged = await _stage_file(src, spec.staging_root, spec.staging_subdir(inputs))
await progress("extracting", 15, "מחלץ טקסט מהקובץ")
try:
raw_text, page_count, page_offsets = await extractor.extract_text(str(staged))