feat(retrieval): add voyage-multimodal-3 page-image embeddings (feature flag)

Stage C: per-page image embeddings via voyage-multimodal-3 + hybrid
text+image search. Off by default; enable with MULTIMODAL_ENABLED=true.

- Schema V9: document_image_embeddings + precedent_image_embeddings
  (vector(1024), page_number, image_thumbnail_path)
- extractor.render_pages_for_multimodal renders PDF pages at
  MULTIMODAL_DPI (144) for embedding + JPEG thumbnails at
  MULTIMODAL_THUMB_DPI (96) for UI preview, in one pass
- embeddings.embed_images calls voyage-multimodal-3 in 50-page batches
- services/hybrid_search.py orchestrator: rerank applied to text side
  first (rerank-2 is text-only); image side cosine; weighted merge
  with text_weight 0.65 (env-tunable); image-only pages surface as
  match_type='image' so dense scanned content still appears
- processor.process_document and precedent_library.ingest_precedent
  gated by flag — non-fatal on multimodal failure
- scripts/multimodal_backfill.py — idempotent per-case CLI to embed
  existing documents without re-extracting text
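
The weighted merge described above can be sketched as follows. This is an
illustrative sketch only, not the code in services/hybrid_search.py: the
function name merge_hybrid, the Hit dataclass, the (document_id, page_number)
key, and the 'text' and 'hybrid' labels are assumptions made for the example.
Only match_type='image' and the 0.65 default come from this commit. It also
assumes both score sets are already normalized to a comparable [0, 1] range.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    key: tuple        # hypothetical key: (document_id, page_number)
    score: float      # merged score
    match_type: str   # 'image' is from the commit; 'text'/'hybrid' are assumed


def merge_hybrid(text_scores, image_scores, text_weight=0.65):
    """Merge text-side (reranked) and image-side (cosine) scores.

    Both arguments map a key to a score assumed to lie in [0, 1].
    A side that did not return a given key contributes 0 for that key.
    """
    image_weight = 1.0 - text_weight
    merged = []
    for key in set(text_scores) | set(image_scores):
        t = text_scores.get(key)
        i = image_scores.get(key)
        score = text_weight * (t or 0.0) + image_weight * (i or 0.0)
        if t is None:
            match_type = "image"    # image-only page still surfaces
        elif i is None:
            match_type = "text"
        else:
            match_type = "hybrid"
        merged.append(Hit(key, score, match_type))
    # Highest merged score first; ties broken by key for stable output.
    merged.sort(key=lambda h: (-h.score, h.key))
    return merged


text_scores = {("d1", 1): 0.9, ("d1", 2): 0.4}
image_scores = {("d1", 2): 0.8, ("d1", 3): 0.7}
results = merge_hybrid(text_scores, image_scores)
```

In this sketch a page found only by the image side keeps a non-zero score
(image_weight times its cosine score), so scanned pages with no usable text
still appear in the results, ranked below comparable text matches.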

Validated locally on a 5-page response brief: render 0.31s, embed 8.32s,
hybrid merge surfaces image rows correctly. Production rollout starts
with flag=false (no behavior change), then per-case A/B.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 19:24:52 +00:00
parent b9cdcf980d
commit 242f668319
10 changed files with 1038 additions and 40 deletions

@@ -58,6 +58,29 @@ VOYAGE_RERANK_ENABLED = (
# 50 was the depth used in the POC; balances recall vs rerank cost.
VOYAGE_RERANK_FETCH_K = int(os.environ.get("VOYAGE_RERANK_FETCH_K", "50"))
# Multimodal — page-image embeddings via voyage-multimodal-3. Off by
# default; flip with env to enable per-page image embedding during
# ingestion + hybrid (text+image) ranking at search time. POC #3
# validated on an 89-page appraisal PDF (38s, 312K tokens, recovered
# table structure + image-only scanned pages that text-OCR misses).
MULTIMODAL_ENABLED = (
    os.environ.get("MULTIMODAL_ENABLED", "false").lower() == "true"
)
MULTIMODAL_MODEL = os.environ.get("MULTIMODAL_MODEL", "voyage-multimodal-3")
# Render DPI for the image fed to the embedder. POC used 144 — sweet
# spot between embedding quality and tokens/page (144 ≈ 3.5K tok/page).
MULTIMODAL_DPI = int(os.environ.get("MULTIMODAL_DPI", "144"))
# Separate, lower DPI for the JPEG thumbnail saved to disk for UI
# preview. ~96dpi → ~20KB/page; ingestion-time, no re-render at view.
MULTIMODAL_THUMB_DPI = int(os.environ.get("MULTIMODAL_THUMB_DPI", "96"))
# Hybrid merge weight for the *text* side. The image side gets
# (1 - this). POC found text dominates most queries; image wins only
# on table/visual queries — slight text bias starting point, tunable
# per env without redeploy.
MULTIMODAL_TEXT_WEIGHT = float(
    os.environ.get("MULTIMODAL_TEXT_WEIGHT", "0.65")
)
# Halacha extraction — auto-approve threshold. Halachot with extractor
# confidence >= this value are inserted with review_status='approved'
# instead of 'pending_review' (so they immediately appear in