feat(retrieval): track page_number on text chunks for multimodal hybrid boost
All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 6m33s

The legacy chunker did not track which PDF page each chunk came from.
Stored chunks had page_number=NULL, which blocked the multimodal
hybrid retriever's text+image boost — it joins (chunk, image) on
(document_id, page_number) and the join could never fire.

This change:

- extractor.extract_text now returns (text, page_count, page_offsets);
  page_offsets[i] is the start char offset of page (i+1) in the joined
  text. None for non-PDFs.
- chunker.chunk_document accepts an optional page_offsets and tags
  each chunk with the page that contains its first character (uses
  the existing chunker logic; pages assigned post-hoc by content
  search to keep the diff minimal).
- processor.process_document and precedent_library.ingest_precedent
  forward page_offsets through the chunker. New uploads now carry
  accurate page_number on every chunk.
- Other extract_text callers (tools/documents, tools/workflow,
  web/app.py) updated to unpack the third element (ignored).
- scripts/backfill_chunk_pages.py: per-case retrofit. Re-extracts each
  PDF (re-OCRs via Google Vision if needed, ~$0.0015/page), computes
  page_offsets, and updates page_number on every chunk by content
  search. Idempotent; --force re-runs on already-tagged docs.

Forward-only would leave the 419 image embeddings backfilled on
cases 8174-24 + 8137-24 unable to boost their corresponding text
chunks. The retrofit script closes that gap (cost ~$0.60).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-03 19:49:41 +00:00
parent 5724ed8e5b
commit 81ccf3a888
9 changed files with 301 additions and 18 deletions

View File

@@ -23,6 +23,7 @@
| `voyage_rerank_judge_poc.py` | python | POC #4 — voyage-3 vs rerank-2 vs context-3 על אהרון ברק, 18 שאילתות, claude-haiku-4-5 כ-judge. הכרעה: rerank-2 ניצח עם +9% mean@3 | בנצ'מרק חד-פעמי |
| `voyage_rerank_corpus_poc.py` | python | POC #5 — voyage-3 vs rerank-2 על קורפוס מלא (785 docs). הכרעה: +4.5% mean@3 כללי, +11.6% על P queries (practical) | בנצ'מרק חד-פעמי, אישר את שלב B |
| `multimodal_backfill.py` | python | Backfill voyage-multimodal-3 page embeddings על מסמכי תיקים קיימים. idempotent (skips by default), forces `MULTIMODAL_ENABLED=true` ל-run, רץ מהקונטיינר. שלב C — ראה `docs/voyage-upgrades-plan.md` | ידני per-case (`python multimodal_backfill.py 8174-24 8137-24`) |
| `backfill_chunk_pages.py` | python | Backfill `page_number` ב-`document_chunks` קיימים. legacy chunker לא tracked עמודים → `page_number=NULL` חוסם boost של multimodal hybrid (text+image join על אותו עמוד). re-extracts כל PDF (re-OCR אם צריך, ~$0.0015/page), מחשב page_offsets, ומעדכן chunks. idempotent | ידני per-case (`python backfill_chunk_pages.py 8174-24 8137-24`) |
## תיקיית `.archive/` — סקריפטים שהושלמו