feat(retrieval): track page_number on text chunks for multimodal hybrid boost
All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 6m33s
The legacy chunker did not track which PDF page each chunk came from. Stored chunks had page_number=NULL, which blocked the multimodal hybrid retriever's text+image boost: it joins (chunk, image) on (document_id, page_number), and the join could never fire.

This change:

- extractor.extract_text now returns (text, page_count, page_offsets); page_offsets[i] is the start char offset of page (i+1) in the joined text. None for non-PDFs.
- chunker.chunk_document accepts an optional page_offsets and tags each chunk with the page that contains its first character (uses the existing chunker logic; pages assigned post-hoc by content search to keep the diff minimal).
- processor.process_document and precedent_library.ingest_precedent forward page_offsets through the chunker. New uploads now carry accurate page_number on every chunk.
- Other extract_text callers (tools/documents, tools/workflow, web/app.py) updated to unpack the third element (ignored).
- scripts/backfill_chunk_pages.py: per-case retrofit. Re-extracts each PDF (re-OCRs via Google Vision if needed, ~$0.0015/page), computes page_offsets, and updates page_number on every chunk by content search. Idempotent; --force re-runs on already-tagged docs.

Forward-only would leave the 419 image embeddings backfilled on cases 8174-24 + 8137-24 unable to boost their corresponding text chunks. The retrofit script closes that gap (cost ~$0.60).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
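The offset convention in the commit message (page_offsets[i] is the start offset of page i+1 in the joined text) can be sketched as below. This is an illustration only, not the code from this commit: build_page_offsets and the "\n\n" page separator are assumptions, and page_at_offset is a stand-in that shares a name and call shape with the helper the chunker imports from extractor, whose real implementation is not part of the diff shown here.

```python
from bisect import bisect_right


def build_page_offsets(pages: list[str], sep: str = "\n\n") -> tuple[str, list[int]]:
    """Join per-page texts and record the start offset of each page.

    Hypothetical helper: the name and the separator are assumptions.
    """
    offsets: list[int] = []
    cursor = 0
    for i, page in enumerate(pages):
        offsets.append(cursor)          # page (i + 1) starts here
        cursor += len(page)
        if i < len(pages) - 1:
            cursor += len(sep)          # separator sits between pages only
    return sep.join(pages), offsets


def page_at_offset(offset: int, page_offsets: list[int]) -> int:
    """Return the 1-based page whose start offset is the last one <= offset."""
    return max(1, bisect_right(page_offsets, offset))


text, offsets = build_page_offsets(["alpha", "beta", "gamma"])
print(offsets)                       # [0, 7, 13]
print(page_at_offset(7, offsets))    # 2
```

Under this convention the separator characters count as part of the preceding page, so an offset that lands between two pages resolves to the earlier one. A binary search keeps each lookup logarithmic in the page count, which matters little for a single document but is cheap to get right.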
@@ -33,8 +33,15 @@ def chunk_document(
     text: str,
     chunk_size: int = config.CHUNK_SIZE_TOKENS,
     overlap: int = config.CHUNK_OVERLAP_TOKENS,
+    page_offsets: list[int] | None = None,
 ) -> list[Chunk]:
-    """Split a legal document into chunks, respecting section boundaries."""
+    """Split a legal document into chunks, respecting section boundaries.
+
+    When ``page_offsets`` is supplied (from a PDF extraction), each chunk
+    is tagged with the page number of its first character — used by the
+    multimodal hybrid retriever to join (text chunk, image at same page)
+    and surface text+image matches.
+    """
     if not text.strip():
         return []
 
@@ -52,9 +59,34 @@ def chunk_document(
             ))
             idx += 1
 
+    if page_offsets:
+        _assign_pages(chunks, text, page_offsets)
     return chunks
 
 
+def _assign_pages(chunks: list[Chunk], text: str, page_offsets: list[int]) -> None:
+    """Locate each chunk's first character in ``text`` and tag with the
+    page that contains that offset. Mutates chunks in-place.
+
+    Chunks have overlap so we search forward from a position slightly
+    past the previous chunk's start. Falls back to a global search if
+    the forward scan misses (rare — happens only when overlap is bigger
+    than the advance distance below).
+    """
+    from legal_mcp.services.extractor import page_at_offset
+    pos = 0
+    for c in chunks:
+        idx = text.find(c.content, pos)
+        if idx < 0:
+            idx = text.find(c.content)
+            if idx < 0:
+                continue
+        c.page_number = page_at_offset(idx, page_offsets)
+        # advance past the chunk's halfway point — overlap is < 50% so
+        # the next chunk's starting point will be after this cursor.
+        pos = idx + max(1, len(c.content) // 2)
+
+
 def _split_into_sections(text: str) -> list[tuple[str, str]]:
     """Split text into (section_type, text) pairs based on Hebrew headers."""
     # Find all section headers and their positions
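The forward-cursor search in _assign_pages matters when the same text appears on more than one page, for example a repeated header. The sketch below reproduces the cursor logic from the diff on a two-page toy document. The Chunk dataclass is a minimal stand-in for the project's own type, and page_at_offset is an assumed implementation (1-based page whose start offset is the last one at or before the given offset), since the real helper lives in extractor and is not shown in this diff.

```python
from __future__ import annotations

from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Chunk:
    # Minimal stand-in for the project's Chunk; only the fields used here.
    content: str
    page_number: int | None = None


def page_at_offset(offset: int, page_offsets: list[int]) -> int:
    # Assumed behaviour: 1-based page whose start is the last one <= offset.
    return max(1, bisect_right(page_offsets, offset))


def assign_pages(chunks: list[Chunk], text: str, page_offsets: list[int]) -> None:
    # Same cursor logic as _assign_pages in the diff above.
    pos = 0
    for c in chunks:
        idx = text.find(c.content, pos)      # forward search from the cursor
        if idx < 0:
            idx = text.find(c.content)       # global fallback
            if idx < 0:
                continue                     # chunk text not found; leave untagged
        c.page_number = page_at_offset(idx, page_offsets)
        pos = idx + max(1, len(c.content) // 2)


# Two pages that both open with the same header line.
page1 = "HEADER\nfirst page body. "
page2 = "HEADER\nsecond page body."
text = page1 + page2
page_offsets = [0, len(page1)]

chunks = [
    Chunk("HEADER\n"),
    Chunk("first page body. "),
    Chunk("HEADER\n"),
    Chunk("second page body."),
]
assign_pages(chunks, text, page_offsets)
print([c.page_number for c in chunks])  # [1, 1, 2, 2]
```

A plain text.find(c.content) without the cursor would match the first "HEADER\n" again and tag the third chunk as page 1. The global fallback still has that weakness, which is acceptable only because it is expected to fire rarely.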