feat(retrieval): track page_number on text chunks for multimodal hybrid boost

The legacy chunker did not track which PDF page each chunk came from. Stored chunks had page_number=NULL, which blocked the multimodal hybrid retriever's text+image boost: it joins (chunk, image) on (document_id, page_number) and the join could never fire.

This change:

- extractor.extract_text now returns (text, page_count, page_offsets); page_offsets[i] is the start char offset of page (i+1) in the joined text. None for non-PDFs.
- chunker.chunk_document accepts an optional page_offsets and tags each chunk with the page that contains its first character (uses the existing chunker logic; pages assigned post-hoc by content search to keep the diff minimal).
- processor.process_document and precedent_library.ingest_precedent forward page_offsets through the chunker. New uploads now carry accurate page_number on every chunk.
- Other extract_text callers (tools/documents, tools/workflow, web/app.py) updated to unpack the third element (ignored).
- scripts/backfill_chunk_pages.py: per-case retrofit. Re-extracts each PDF (re-OCRs via Google Vision if needed, ~$0.0015/page), computes page_offsets, and updates page_number on every chunk by content search. Idempotent; --force re-runs on already-tagged docs.

Forward-only would leave the 419 image embeddings backfilled on cases 8174-24 + 8137-24 unable to boost their corresponding text chunks. The retrofit script closes that gap (cost ~$0.60).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 19:49:41 +00:00
parent 5724ed8e5b
commit 81ccf3a888
9 changed files with 301 additions and 18 deletions
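The diff below imports page_at_offset from the extractor but does not show its body. As a minimal sketch of the offset convention the commit message describes, assuming pages are joined with a single newline, the following Python builds page_offsets and maps a character offset to a 1-based page. Here join_pages is a hypothetical helper, and this page_at_offset is an assumed stand-in built on bisect, not the repository's implementation.

```python
from bisect import bisect_right


def join_pages(pages: list[str], sep: str = "\n") -> tuple[str, list[int]]:
    """Hypothetical helper: join per-page text and record each page's start offset.

    offsets[i] is the start char offset of page (i + 1) in the joined text,
    which is the convention the commit message gives for page_offsets.
    """
    offsets: list[int] = []
    parts: list[str] = []
    cursor = 0
    for i, page in enumerate(pages):
        if i:
            # The separator is counted as part of the preceding page.
            parts.append(sep)
            cursor += len(sep)
        offsets.append(cursor)
        parts.append(page)
        cursor += len(page)
    return "".join(parts), offsets


def page_at_offset(offset: int, page_offsets: list[int]) -> int:
    """Assumed stand-in: return the 1-based page containing ``offset``.

    bisect_right counts how many page starts are <= offset, which is the
    1-based page number when page_offsets is sorted ascending.
    """
    return max(1, bisect_right(page_offsets, offset))


# Three short pages joined with "\n".
text, offsets = join_pages(["alpha beta", "gamma", "delta epsilon"])

# A chunk is tagged with the page that contains its first character.
gamma_page = page_at_offset(text.find("gamma"), offsets)
epsilon_page = page_at_offset(text.find("epsilon"), offsets)
```

A chunk that starts on one page and runs into the next is tagged with the page of its first character, matching the rule stated in the chunk_document docstring.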
--- a/mcp-server/src/legal_mcp/services/chunker.py
+++ b/mcp-server/src/legal_mcp/services/chunker.py
@@ -33,8 +33,15 @@ def chunk_document(
    text: str,
    chunk_size: int = config.CHUNK_SIZE_TOKENS,
    overlap: int = config.CHUNK_OVERLAP_TOKENS,
+    page_offsets: list[int] | None = None,
 ) -> list[Chunk]:
-    """Split a legal document into chunks, respecting section boundaries."""
+    """Split a legal document into chunks, respecting section boundaries.
+
+    When ``page_offsets`` is supplied (from a PDF extraction), each chunk
+    is tagged with the page number of its first character — used by the
+    multimodal hybrid retriever to join (text chunk, image at same page)
+    and surface text+image matches.
+    """
    if not text.strip():
        return []

@@ -52,9 +59,34 @@ def chunk_document(
            ))
            idx += 1

+    if page_offsets:
+        _assign_pages(chunks, text, page_offsets)
    return chunks


+def _assign_pages(chunks: list[Chunk], text: str, page_offsets: list[int]) -> None:
+    """Locate each chunk's first character in ``text`` and tag with the
+    page that contains that offset. Mutates chunks in-place.
+
+    Chunks have overlap so we search forward from a position slightly
+    past the previous chunk's start. Falls back to a global search if
+    the forward scan misses (rare — happens only when overlap is bigger
+    than the advance distance below).
+    """
+    from legal_mcp.services.extractor import page_at_offset
+    pos = 0
+    for c in chunks:
+        idx = text.find(c.content, pos)
+        if idx < 0:
+            idx = text.find(c.content)
+        if idx < 0:
+            continue
+        c.page_number = page_at_offset(idx, page_offsets)
+        # advance past the chunk's halfway point — overlap is < 50% so
+        # the next chunk's starting point will be after this cursor.
+        pos = idx + max(1, len(c.content) // 2)
+
+
 def _split_into_sections(text: str) -> list[tuple[str, str]]:
    """Split text into (section_type, text) pairs based on Hebrew headers."""
    # Find all section headers and their positions