feat(retrieval): add voyage-multimodal-3 page-image embeddings (feature flag)

Stage C: per-page image embeddings via voyage-multimodal-3 + hybrid
text+image search. Off by default; enable with MULTIMODAL_ENABLED=true.

- Schema V9: document_image_embeddings + precedent_image_embeddings
  (vector(1024), page_number, image_thumbnail_path)
- extractor.render_pages_for_multimodal renders PDF pages at
  MULTIMODAL_DPI (144) for embedding + JPEG thumbnails at
  MULTIMODAL_THUMB_DPI (96) for UI preview, in one pass
- embeddings.embed_images calls voyage-multimodal-3 in 50-page batches
- services/hybrid_search.py orchestrator: rerank applied to text side
  first (rerank-2 is text-only); image side cosine; weighted merge with
  text_weight 0.65 (env-tunable); image-only pages surface as
  match_type='image' so dense scanned content still appears
- processor.process_document and precedent_library.ingest_precedent
  gated by flag; non-fatal on multimodal failure
- scripts/multimodal_backfill.py: idempotent per-case CLI to embed
  existing documents without re-extracting text

Validated locally on a 5-page response brief: render 0.31s, embed 8.32s,
hybrid merge surfaces image rows correctly. Production rollout starts
with flag=false (no behavior change), then per-case A/B.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 19:24:52 +00:00
parent b9cdcf980d
commit 242f668319
10 changed files with 1038 additions and 40 deletions
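Reviewer note: the weighted merge described in the commit message can be sketched roughly as below. This is a minimal illustration, not the code in services/hybrid_search.py. It assumes both sides produce scores on a comparable 0..1 scale, keyed by a hypothetical (document_id, page_number) tuple. The name merge_hits and the match_type labels 'text' and 'hybrid' are made up for the example; only match_type='image' and the 0.65 default come from the commit.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical join key: (document_id, page_number).
PageKey = Tuple[str, int]

# Mirrors the MULTIMODAL_TEXT_WEIGHT default from config.py.
DEFAULT_TEXT_WEIGHT = 0.65


@dataclass
class MergedHit:
    key: PageKey
    score: float
    # 'image' is named in the commit message; 'text' and 'hybrid'
    # are assumed labels used only for this sketch.
    match_type: str


def merge_hits(
    text_scores: Dict[PageKey, float],
    image_scores: Dict[PageKey, float],
    text_weight: float = DEFAULT_TEXT_WEIGHT,
) -> List[MergedHit]:
    """Blend text and image scores; a missing side contributes 0."""
    merged = []
    for key in text_scores.keys() | image_scores.keys():
        t = text_scores.get(key)
        i = image_scores.get(key)
        score = text_weight * (t or 0.0) + (1.0 - text_weight) * (i or 0.0)
        if t is None:
            match_type = "image"   # page found only by the image side
        elif i is None:
            match_type = "text"
        else:
            match_type = "hybrid"
        merged.append(MergedHit(key, score, match_type))
    # Highest blended score first.
    merged.sort(key=lambda hit: hit.score, reverse=True)
    return merged
```

With the default weight, a page matched only by its image can score at most 0.35, so it ranks below strong text hits but still appears in the results, which seems to be the intent of the image-only rows.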
--- a/mcp-server/src/legal_mcp/config.py
+++ b/mcp-server/src/legal_mcp/config.py
@@ -58,6 +58,29 @@ VOYAGE_RERANK_ENABLED = (
 # 50 was the depth used in the POC; balances recall vs rerank cost.
 VOYAGE_RERANK_FETCH_K = int(os.environ.get("VOYAGE_RERANK_FETCH_K", "50"))

+# Multimodal — page-image embeddings via voyage-multimodal-3. Off by
+# default; flip with env to enable per-page image embedding during
+# ingestion + hybrid (text+image) ranking at search time. POC #3
+# validated on an 89-page appraisal PDF (38s, 312K tokens, recovered
+# table structure + image-only scanned pages that text-OCR misses).
+MULTIMODAL_ENABLED = (
+    os.environ.get("MULTIMODAL_ENABLED", "false").lower() == "true"
+)
+MULTIMODAL_MODEL = os.environ.get("MULTIMODAL_MODEL", "voyage-multimodal-3")
+# Render DPI for the image fed to the embedder. POC used 144 — sweet
+# spot between embedding quality and tokens/page (144 ≈ 3.5K tok/page).
+MULTIMODAL_DPI = int(os.environ.get("MULTIMODAL_DPI", "144"))
+# Separate, lower DPI for the JPEG thumbnail saved to disk for UI
+# preview. ~96dpi → ~20KB/page; ingestion-time, no re-render at view.
+MULTIMODAL_THUMB_DPI = int(os.environ.get("MULTIMODAL_THUMB_DPI", "96"))
+# Hybrid merge weight for the *text* side. The image side gets
+# (1 - this). POC found text dominates most queries; image wins only
+# on table/visual queries — slight text bias starting point, tunable
+# per env without redeploy.
+MULTIMODAL_TEXT_WEIGHT = float(
+    os.environ.get("MULTIMODAL_TEXT_WEIGHT", "0.65")
+)
+
 # Halacha extraction — auto-approve threshold. Halachot with extractor
 # confidence >= this value are inserted with review_status='approved'
 # instead of 'pending_review' (so they immediately appear in