Stage C of the voyage-upgrades-plan shipped to production on
2026-05-03. The doc now leads with the final state and the two
empirical corrections vs the original plan:
1. Reciprocal Rank Fusion replaces weighted-sum hybrid merge.
voyage-3 cosines (~0.4-0.5) run systematically higher than
voyage-multimodal-3 cosines (~0.20-0.25), so a weighted sum lets
text dominate even when the image is the better signal. RRF uses
only ranks, so it is robust to that scale difference.
2. Chunker now propagates page_number end-to-end: the extractor returns
per-page offsets and the chunker tags each chunk with the page of its
first character. A retrofit script backfills page_number on existing
document_chunks without re-running OCR. It uses the stored
documents.extracted_text, with PyMuPDF direct text reads as page
anchors and linear interpolation for OCR-only pages.
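A minimal sketch of the page tagging and anchor interpolation described
in correction 2. The function names, the (start_offset, text) chunk
shape, and the anchor dict are hypothetical, not the actual chunker or
retrofit script; it only assumes page start offsets are known for some
pages and must be estimated for the rest.

```python
from bisect import bisect_right


def interpolate_page_starts(known, num_pages, text_len):
    """Estimate a start offset for every page.

    `known` maps 1-based page number -> start offset for pages whose text
    was located directly (the "anchors"). Pages without an anchor are placed
    by linear interpolation between the nearest anchored neighbours.
    Assumption: page 1 starts at 0 and the document ends at text_len.
    """
    points = dict(known)
    points.setdefault(1, 0)
    points[num_pages + 1] = text_len  # virtual end-of-document anchor
    pages = sorted(points)
    starts = []
    for lo, hi in zip(pages, pages[1:]):
        for page in range(lo, hi):
            frac = (page - lo) / (hi - lo)
            starts.append(round(points[lo] + frac * (points[hi] - points[lo])))
    return starts


def page_for_offset(page_starts, offset):
    """Return the 1-based page containing character `offset`."""
    return bisect_right(page_starts, offset)


def tag_chunks(chunks, page_starts):
    """Tag each (start_offset, text) chunk with its first character's page."""
    return [
        {"text": text, "page_number": page_for_offset(page_starts, start)}
        for start, text in chunks
    ]
```

Tagging by the first character means a chunk that straddles a page
break is attributed to the page where it begins.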
Production state on cases 8174-24 + 8137-24: 419 page-image
embeddings, 819 chunks tagged with page_number, MULTIMODAL_ENABLED=true
in Coolify env, hybrid search verified A/B against text-only baseline.
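For reference, a minimal sketch of Reciprocal Rank Fusion as used for
the hybrid merge in correction 1. The function name and the constant
k=60 are assumptions (60 is the conventional default for RRF; the value
used in production is not stated here).

```python
from collections import defaultdict


def rrf_merge(rankings, k=60):
    """Fuse several ranked id lists (best first) into one ranking.

    Each list contributes 1 / (k + rank) per id, so only rank positions
    matter and the raw cosine scales of the two models never interact.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], str(d)))


text_ranked = ["c1", "c2", "c3"]   # e.g. ordered by voyage-3 cosine
image_ranked = ["c3", "c1", "c4"]  # e.g. ordered by multimodal cosine
print(rrf_merge([text_ranked, image_ranked]))
# ['c1', 'c3', 'c2', 'c4']
```

c3 rises above c2 because it is top-ranked on the image side, even
though its image cosine would be numerically small next to any text
cosine.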
The original stage C plan section is retained below for reference.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase A — voyage-3 migration (executed):
- VOYAGE_MODEL=voyage-3 set in Coolify (legal-ai app) and ~/.env
- scripts/reembed_voyage.py: re-embeds document_chunks (6157),
case_law_embeddings (9), precedent_chunks (385), and halachot (400)
using the new model; paragraph_embeddings was empty. Total: 6951
rows re-embedded in 93s (~75 rows/sec).
- Same 1024 dim → no schema change needed.
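A rough sketch of the batched re-embedding loop, not the contents of
scripts/reembed_voyage.py. The (row_id, text) row shape, the embed_fn
callable, and batch_size=128 are assumptions; embed_fn stands in for
whatever wraps the embedding API call. The dimension check mirrors the
"same 1024 dim" point above.

```python
from itertools import islice

EXPECTED_DIM = 1024  # unchanged dimension, so no schema change


def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


def reembed(rows, embed_fn, batch_size=128):
    """Yield (row_id, vector) for each (row_id, text) row.

    `embed_fn` takes a list of texts and returns one vector per text.
    """
    for batch in batched(rows, batch_size):
        vectors = embed_fn([text for _, text in batch])
        if len(vectors) != len(batch):
            raise ValueError("embed_fn returned a different number of vectors")
        for (row_id, _), vector in zip(batch, vectors):
            if len(vector) != EXPECTED_DIM:
                raise ValueError(f"row {row_id}: unexpected dim {len(vector)}")
            yield row_id, vector
```

Passing the embedder in as a callable keeps the loop testable without
network access; the caller writes each yielded vector back to its table.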
Why voyage-3 over voyage-law-2: in a benchmark of 3 Hebrew legal
queries against real passages from the corpus, voyage-3 ordered the
passages perfectly on 3/3 tests and showed a larger separation than
voyage-law-2 (+0.483 vs +0.238). The voyage-4 family had an even
larger separation but missed top-1 on the hardest test.
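The benchmark script and its exact definition of "separation" are not
part of this commit. As an illustration only, the sketch below ASSUMES
separation means the lowest cosine among relevant passages minus the
highest cosine among irrelevant ones, and that "top-1" means the best
relevant passage outranks every irrelevant one. All names are
hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def evaluate(query_vec, relevant_vecs, irrelevant_vecs):
    """Score one query under the ASSUMED metric definitions above."""
    rel = [cosine(query_vec, v) for v in relevant_vecs]
    irr = [cosine(query_vec, v) for v in irrelevant_vecs]
    separation = min(rel) - max(irr)
    return {
        "top1_ok": max(rel) > max(irr),
        # Every relevant passage outranks every irrelevant one.
        "perfect_ordering": separation > 0,
        "separation": separation,
    }
```

Under this definition a model can post a large gap on easy queries and
still fail top-1 on a hard one, which is why both numbers are tracked.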
Phase B (voyage-context-3) and Phase C (voyage-multimodal-3.5 for
scanned + appraiser docs) are designed in docs/voyage-upgrades-plan.md
but deferred, to be picked up in a fresh conversation. The plan
includes:
- Phase B: contextualized embeddings refactor (~49% recall lift on
legal docs per Anthropic's research). Same dim, but ingestion
pipeline must pass full doc context per chunk.
- Phase C: page-level image embeddings via voyage-multimodal-3.5,
stored in a parallel *_image_embeddings table. Hybrid text+image
search. Targets appraiser report tables and scanned PDFs where
current OCR loses layout.
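To illustrate the Phase B ingestion change ("pass full doc context per
chunk"): contextualized embedding expects the chunks of each document
grouped together and in order, rather than a flat list of chunks. The
sketch below only builds that grouping; the (document_id, chunk_index,
text) row shape and the function name are assumptions, and the actual
embedding call is left out.

```python
def group_chunks_by_document(rows):
    """Group (document_id, chunk_index, text) rows per document.

    Returns (doc_ids, grouped_texts), where grouped_texts[i] holds the
    chunk texts of doc_ids[i] in chunk order.
    """
    grouped = {}
    for doc_id, _idx, text in sorted(rows, key=lambda r: (r[0], r[1])):
        grouped.setdefault(doc_id, []).append(text)
    doc_ids = list(grouped)
    return doc_ids, [grouped[d] for d in doc_ids]
```

Each inner list would be embedded as one unit, so every chunk vector
can reflect the rest of its document; the output dimension is unchanged.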
After this commit: the MCP server needs a /mcp reconnect to pick up
the new VOYAGE_MODEL env, and the legal-ai container will pick it up
on its next redeploy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>