feat(retrieval): add voyage-multimodal-3 page-image embeddings (feature flag)
All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 1m50s
Stage C: per-page image embeddings via voyage-multimodal-3 + hybrid text+image search. Off by default; enable with MULTIMODAL_ENABLED=true.

- Schema V9: document_image_embeddings + precedent_image_embeddings (vector(1024), page_number, image_thumbnail_path)
- extractor.render_pages_for_multimodal renders PDF pages at MULTIMODAL_DPI (144) for embedding + JPEG thumbnails at MULTIMODAL_THUMB_DPI (96) for UI preview, in one pass
- embeddings.embed_images calls voyage-multimodal-3 in 50-page batches
- services/hybrid_search.py orchestrator: rerank applied to text side first (rerank-2 is text-only); image side cosine; weighted merge with text_weight 0.65 (env-tunable); image-only pages surface as match_type='image' so dense scanned content still appears
- processor.process_document and precedent_library.ingest_precedent gated by flag; non-fatal on multimodal failure
- scripts/multimodal_backfill.py: idempotent per-case CLI to embed existing documents without re-extracting text

Validated locally on a 5-page response brief: render 0.31s, embed 8.32s, hybrid merge surfaces image rows correctly. Production rollout starts with flag=false (no behavior change), then per-case A/B.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
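The weighted merge described in the message could look roughly like the sketch below. services/hybrid_search.py is not part of this diff, so everything here is an assumption for illustration: the `(document_id, page_number)` key, scores already normalized to [0, 1], the `merge_hits` name, and the `'text'` / `'hybrid'` labels. Only `text_weight = 0.65` and `match_type='image'` come from the commit message.

```python
from dataclasses import dataclass

TEXT_WEIGHT = 0.65  # value quoted in the commit message (env-tunable there)


@dataclass
class MergedHit:
    key: tuple  # hypothetical (document_id, page_number) key
    score: float
    match_type: str  # 'image' is from the commit; 'text'/'hybrid' are assumed


def merge_hits(text_scores, image_scores, text_weight=TEXT_WEIGHT):
    """Blend per-page text and image scores into one ranked list.

    Both inputs map a page key to a score assumed to lie in [0, 1].
    Pages found on only one side keep that side's weighted score.
    """
    image_weight = 1.0 - text_weight
    merged = []
    for key in text_scores.keys() | image_scores.keys():
        t = text_scores.get(key)
        i = image_scores.get(key)
        if t is not None and i is not None:
            merged.append(MergedHit(key, text_weight * t + image_weight * i, "hybrid"))
        elif t is not None:
            merged.append(MergedHit(key, text_weight * t, "text"))
        else:
            # Image-only page, e.g. a dense scan with no usable text chunk.
            merged.append(MergedHit(key, image_weight * i, "image"))
    # Highest score first; the key breaks ties deterministically.
    merged.sort(key=lambda h: (-h.score, h.key))
    return merged
```

With this weighting an image-only page can never score above 0.35, so it ranks below any strong text match but still appears in the results, which matches the stated goal of keeping scanned content visible.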
@@ -3,15 +3,24 @@
from __future__ import annotations

import logging
from typing import TYPE_CHECKING

import voyageai

from legal_mcp import config

if TYPE_CHECKING:
    from PIL import Image as PILImage

logger = logging.getLogger(__name__)

_client: voyageai.Client | None = None

# Per-call cap for multimodal_embed. POC ran 89 pages (~312K tokens)
# in a single call comfortably; 50 leaves safe headroom for densely-
# OCR'd legal pages where tokens/page can exceed 4K.
_MULTIMODAL_BATCH_SIZE = 50


def _get_client() -> voyageai.Client:
    global _client
@@ -55,6 +64,45 @@ async def embed_query(query: str) -> list[float]:
    return results[0]


async def embed_images(
    images: "list[PILImage.Image]",
    input_type: str = "document",
) -> list[list[float]]:
    """Embed page images via voyage-multimodal-3.

    Each input is a single PIL.Image (one page = one embedding).
    Returns a list of 1024-dim vectors, one per input image, in order.
    Batches at ``_MULTIMODAL_BATCH_SIZE`` to stay within Voyage's
    per-request limits on dense legal pages.
    """
    if not images:
        return []
    client = _get_client()
    out: list[list[float]] = []
    for i in range(0, len(images), _MULTIMODAL_BATCH_SIZE):
        batch = images[i : i + _MULTIMODAL_BATCH_SIZE]
        result = client.multimodal_embed(
            inputs=[[img] for img in batch],
            model=config.MULTIMODAL_MODEL,
            input_type=input_type,
            truncation=True,
        )
        out.extend(result.embeddings)
    return out


async def embed_query_for_multimodal(query: str) -> list[float]:
    """Embed a text query in the multimodal vector space, so it can be
    cosine-compared against page-image embeddings."""
    client = _get_client()
    result = client.multimodal_embed(
        inputs=[[query]],
        model=config.MULTIMODAL_MODEL,
        input_type="query",
    )
    return result.embeddings[0]


async def voyage_rerank(
    query: str, documents: list[str], top_k: int | None = None,
) -> list[tuple[int, float]]:
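Review note on the hunk above: `embed_images` is declared `async`, but `client.multimodal_embed` is called without `await`. If `voyageai.Client` is the blocking client, each batch would hold the event loop for the length of the request (the message reports 8.32s for five pages). One possible way to keep the same batching while moving the call off the loop is sketched below. It is a sketch under stated assumptions, not the author's code: `embed_in_batches` and `embed_batch` are hypothetical names, and `embed_batch` stands in for whatever synchronous function wraps the real Voyage call.

```python
import asyncio

BATCH_SIZE = 50  # mirrors _MULTIMODAL_BATCH_SIZE from the diff


async def embed_in_batches(items, embed_batch, batch_size=BATCH_SIZE):
    """Embed items in order, one batch per call, without blocking the loop.

    embed_batch is a hypothetical synchronous callable that takes a list
    of inputs and returns one vector per input, in the same order.
    """
    out = []
    for start in range(0, len(items), batch_size):
        batch = items[start : start + batch_size]
        # Run the blocking call in a worker thread (Python 3.9+).
        vectors = await asyncio.to_thread(embed_batch, batch)
        if len(vectors) != len(batch):
            raise ValueError(
                f"expected {len(batch)} vectors, got {len(vectors)}"
            )
        out.extend(vectors)
    return out
```

Batches still run one after another, so request ordering and rate-limit behavior stay the same as in the committed loop; only the thread that waits on the network changes.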