The legacy chunker did not track which PDF page each chunk came from.
Stored chunks had page_number=NULL, which blocked the multimodal
hybrid retriever's text+image boost — it joins (chunk, image) on
(document_id, page_number) and the join could never fire.
This change:
- extractor.extract_text now returns (text, page_count, page_offsets);
page_offsets[i] is the start char offset of page (i+1) in the joined
text. None for non-PDFs.
- chunker.chunk_document accepts an optional page_offsets and tags
each chunk with the page that contains its first character (uses
the existing chunker logic; pages assigned post-hoc by content
search to keep the diff minimal).
- processor.process_document and precedent_library.ingest_precedent
forward page_offsets through the chunker. New uploads now carry
accurate page_number on every chunk.
- Other extract_text callers (tools/documents, tools/workflow,
web/app.py) updated to unpack the third element (ignored).
- scripts/backfill_chunk_pages.py: per-case retrofit. Re-extracts each
PDF (re-OCRs via Google Vision if needed, ~$0.0015/page), computes
page_offsets, and updates page_number on every chunk by content
search. Idempotent; --force re-runs on already-tagged docs.
Forward-only would leave the 419 image embeddings backfilled on
cases 8174-24 + 8137-24 unable to boost their corresponding text
chunks. The retrofit script closes that gap (cost ~$0.60).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stage C: per-page image embeddings via voyage-multimodal-3 + hybrid
text+image search. Off by default; enable with MULTIMODAL_ENABLED=true.
- Schema V9: document_image_embeddings + precedent_image_embeddings
(vector(1024), page_number, image_thumbnail_path)
- extractor.render_pages_for_multimodal renders PDF pages at
MULTIMODAL_DPI (144) for embedding + JPEG thumbnails at
MULTIMODAL_THUMB_DPI (96) for UI preview, in one pass
- embeddings.embed_images calls voyage-multimodal-3 in 50-page batches
- services/hybrid_search.py orchestrator: rerank applied to text side
first (rerank-2 is text-only); image side cosine; weighted merge
with text_weight 0.65 (env-tunable); image-only pages surface as
match_type='image' so dense scanned content still appears
- processor.process_document and precedent_library.ingest_precedent
gated by flag — non-fatal on multimodal failure
- scripts/multimodal_backfill.py — idempotent per-case CLI to embed
existing documents without re-extracting text
Validated locally on a 5-page response brief: render 0.31s, embed 8.32s,
hybrid merge surfaces image rows correctly. Production rollout starts
with flag=false (no behavior change), then per-case A/B.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add delete_document_chunks for reprocessing, save extracted text to disk
- Expand case directory structure (original/extracted/proofread/backup)
- Update classifier patterns (תגובה, הודעת עמדה)
- Fix proofreader agent paths for new directory layout
- Update HEARTBEAT to notify on every task completion
- Improve bidi_table with LRE/PDF directional embedding
- Add Paperclip project verification and auto-close setup issue
- Add auto-sync-cases.sh for Gitea synchronization
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces API-based classifier with:
1. Filename pattern matching (covers 95%+ of legal docs)
2. Content keyword matching for ambiguous filenames
3. Claude Code headless (claude -p) fallback for edge cases
No Anthropic API calls needed for classification.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Text extraction, chunking and embedding proceed even if Claude API
classification or reference extraction fails (e.g. API quota exceeded).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
block_writer: _build_precedents_context now searches both
paragraph_embeddings (other decisions by Dafna) and case_law_embeddings
(precedent case law). Previously only searched document_chunks which
had no cross-case data. Now returns ~2400 chars from 3 other decisions.
processor: Step 1.6 auto-updates case appellants/respondents from
classifier results when they're empty. Prevents blank party fields.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ezer Mishpati - AI legal decision drafting system with:
- MCP server (FastMCP) with document processing pipeline
- Web upload interface (FastAPI) for file upload and classification
- pgvector-based semantic search
- Hebrew legal document chunking and embedding