feat(retrieval): track page_number on text chunks for multimodal hybrid boost
All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 6m33s
All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 6m33s
The legacy chunker did not track which PDF page each chunk came from. Stored chunks had page_number=NULL, which blocked the multimodal hybrid retriever's text+image boost — it joins (chunk, image) on (document_id, page_number) and the join could never fire. This change: - extractor.extract_text now returns (text, page_count, page_offsets); page_offsets[i] is the start char offset of page (i+1) in the joined text. None for non-PDFs. - chunker.chunk_document accepts an optional page_offsets and tags each chunk with the page that contains its first character (uses the existing chunker logic; pages assigned post-hoc by content search to keep the diff minimal). - processor.process_document and precedent_library.ingest_precedent forward page_offsets through the chunker. New uploads now carry accurate page_number on every chunk. - Other extract_text callers (tools/documents, tools/workflow, web/app.py) updated to unpack the third element (ignored). - scripts/backfill_chunk_pages.py: per-case retrofit. Re-extracts each PDF (re-OCRs via Google Vision if needed, ~$0.0015/page), computes page_offsets, and updates page_number on every chunk by content search. Idempotent; --force re-runs on already-tagged docs. Forward-only would leave the 419 image embeddings backfilled on cases 8174-24 + 8137-24 unable to boost their corresponding text chunks. The retrofit script closes that gap (cost ~$0.60). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -32,7 +32,7 @@ async def process_document(document_id: UUID, case_id: UUID) -> dict:
|
||||
try:
|
||||
# Step 1: Extract text
|
||||
logger.info("Extracting text from %s", doc["file_path"])
|
||||
text, page_count = await extractor.extract_text(doc["file_path"])
|
||||
text, page_count, page_offsets = await extractor.extract_text(doc["file_path"])
|
||||
|
||||
await db.update_document(
|
||||
document_id,
|
||||
@@ -70,9 +70,9 @@ async def process_document(document_id: UUID, case_id: UUID) -> dict:
|
||||
except Exception as e:
|
||||
logger.warning("Classification failed (non-fatal): %s", e)
|
||||
|
||||
# Step 2: Chunk
|
||||
# Step 2: Chunk (page_offsets propagates page_number into chunks)
|
||||
logger.info("Chunking document (%d chars)", len(text))
|
||||
chunks = chunker.chunk_document(text)
|
||||
chunks = chunker.chunk_document(text, page_offsets=page_offsets)
|
||||
|
||||
if not chunks:
|
||||
await db.update_document(document_id, extraction_status="completed")
|
||||
|
||||
Reference in New Issue
Block a user