fix(retrieval): make decisions findable by name + unhide committee uploads

Root cause of "agent can't find the Agasi decision in the corpus" (CMPA-55): the decision was fully ingested, but the retrieval layer failed on the realistic agent query, which is a search by case name.

- RC-A (#52): lexical tsvector covered only chunk content + halacha text, so a bare-name query ("אגסי") matched decisions that *cite* the case, not the case itself. Add meta_tsv on case_law(case_name, case_number) (SCHEMA V20) and OR it into the lexical halacha/chunk SQL with a match boost, so a name/number hit surfaces the case's own rows. Agasi: rank 4 → rank 1.
- RC-B (#53): precedent_library_list hard-defaulted source_kind=external_upload and never exposed the param, hiding uploaded ערר/בל"מ (internal_committee) decisions. Thread source_kind through service → tool → MCP tool (supports 'internal_committee' / 'all_committees').
- #54: agent instructions (researcher/analyst/writer), search-by-name protocol: add content/case-number, search both corpora, use all_committees before declaring "not in corpus".
- #55: chunker produced tiny fragment chunks ("דיון", "החלטה") from header keywords matched mid-sentence. Anchor SECTION_PATTERNS to line start + merge sub-min sections; exclude <50-char fragments at query time (484 existing fragments hidden; full re-chunk tracked as #57).

Tests: scripts/test_retrieval_by_name.py (name ranks case above citer + substantive regressions); chunker unit checks (0 tiny chunks).

New findings filed as tasks #56 (halacha source_kind leak) and #57 (re-chunk migration).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 11:26:19 +00:00
parent 165efc62b0
commit 58ab003206
11 changed files with 355 additions and 57 deletions
--- a/mcp-server/src/legal_mcp/services/chunker.py
+++ b/mcp-server/src/legal_mcp/services/chunker.py
@@ -97,13 +97,32 @@ def _assign_pages(chunks: list[Chunk], text: str, page_offsets: list[int]) -> No
        pos = idx + max(1, len(c.content) // 2)


+# A section shorter than this (stripped chars) is not a real section — it's
+# an artifact of a header keyword matched mid-text. Such a fragment is merged
+# into the preceding section rather than emitted as its own chunk. See #55:
+# unanchored keywords like "דיון"/"החלטה"/"מסקנה" appearing inside a sentence
+# used to carve tiny boundary chunks ("דיון). במסגרת ה") that polluted search.
+MIN_SECTION_CHARS = 60
+
+
 def _split_into_sections(text: str) -> list[tuple[str, str]]:
-    """Split text into (section_type, text) pairs based on Hebrew headers."""
+    """Split text into (section_type, text) pairs based on Hebrew headers.
+
+    Header keywords are matched only at the **start of a line** (after
+    optional whitespace / list numbering like ``5.`` or ``ג.``). A real
+    section header in these decisions sits on its own line; anchoring to
+    the line start prevents common words ("דיון", "החלטה", "מסקנה") that
+    appear mid-sentence from being treated as section boundaries — which
+    previously produced tiny fragment chunks (#55).
+    """
    # Find all section headers and their positions
    markers: list[tuple[int, str]] = []

    for pattern, section_type in SECTION_PATTERNS:
-        for match in re.finditer(pattern, text):
+        # ^ + MULTILINE: line start only. Optional leading spaces/tabs and an
+        # optional ordinal prefix ("5.", "5)", "ג.") before the keyword.
+        anchored = rf"^[ \t]*(?:\d+[.)]\s*|[א-ת][.)]\s*)?(?:{pattern})"
+        for match in re.finditer(anchored, text, re.MULTILINE):
            markers.append((match.start(), section_type))

    if not markers:
@@ -120,11 +139,18 @@ def _split_into_sections(text: str) -> list[tuple[str, str]]:
        if intro_text:
            sections.append(("intro", intro_text))

-    # Each section
+    # Each section. A section whose text is too short to stand alone is
+    # merged into the previous section (keeping the previous type) so a
+    # near-adjacent pair of headers can't produce a fragment chunk.
    for i, (pos, section_type) in enumerate(markers):
        end = markers[i + 1][0] if i + 1 < len(markers) else len(text)
        section_text = text[pos:end].strip()
-        if section_text:
+        if not section_text:
+            continue
+        if len(section_text) < MIN_SECTION_CHARS and sections:
+            prev_type, prev_text = sections[-1]
+            sections[-1] = (prev_type, f"{prev_text}\n{section_text}")
+        else:
            sections.append((section_type, section_text))

    return sections
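
A minimal standalone sketch of the two chunker changes in this hunk (line-start anchoring and short-section merging), for trying the behavior outside the repo. The two-entry SECTION_PATTERNS list, the section-type names, the "body" fallback for text without headers, and the sample text are assumptions for illustration; the real pattern list and the no-marker branch are outside this hunk. Sorting the markers by position is also assumed here.

```python
import re

# Hypothetical stand-in for the real SECTION_PATTERNS (not shown in this diff).
SECTION_PATTERNS = [
    (r"דיון", "discussion"),
    (r"החלטה", "decision"),
]
MIN_SECTION_CHARS = 60  # same threshold as the hunk above


def find_markers(text: str) -> list[tuple[int, str]]:
    """Return (position, section_type) for keywords found at a line start."""
    markers: list[tuple[int, str]] = []
    for pattern, section_type in SECTION_PATTERNS:
        # Same anchoring as the diff: optional indent, optional ordinal prefix.
        anchored = rf"^[ \t]*(?:\d+[.)]\s*|[א-ת][.)]\s*)?(?:{pattern})"
        for match in re.finditer(anchored, text, re.MULTILINE):
            markers.append((match.start(), section_type))
    return sorted(markers)  # assumption: markers are processed in text order


def split_sections(text: str) -> list[tuple[str, str]]:
    """Split into (section_type, text); fold short sections into the previous one."""
    markers = find_markers(text)
    if not markers:
        # Assumption: placeholder fallback, the real one is not in this hunk.
        return [("body", text.strip())]

    sections: list[tuple[str, str]] = []
    intro_text = text[: markers[0][0]].strip()
    if intro_text:
        sections.append(("intro", intro_text))

    for i, (pos, section_type) in enumerate(markers):
        end = markers[i + 1][0] if i + 1 < len(markers) else len(text)
        section_text = text[pos:end].strip()
        if not section_text:
            continue
        if len(section_text) < MIN_SECTION_CHARS and sections:
            # Too short to stand alone: append to the previous section.
            prev_type, prev_text = sections[-1]
            sections[-1] = (prev_type, f"{prev_text}\n{section_text}")
        else:
            sections.append((section_type, section_text))
    return sections


# Made-up sample: keyword mid-sentence, a numbered header, a short last section.
body = " ".join(["העוררת טענה כי התכנית פוגעת בזכויותיה."] * 3)
SAMPLE = "\n".join([
    "הוועדה קיימה דיון ארוך בנושא.",  # "דיון" mid-sentence: not a header
    "5. דיון",                        # header with ordinal prefix
    body,
    "החלטה",                          # header followed by a very short body
    "הערר נדחה.",
])

sections = split_sections(SAMPLE)
for section_type, section_text in sections:
    print(section_type, len(section_text))
```

On this sample the mid-sentence "דיון" in the first line no longer opens a section, and the short "החלטה" section is folded into the preceding discussion section instead of becoming its own chunk. One consequence worth keeping in mind when reviewing: a genuinely short section also loses its own section type under this rule, since the merged text keeps the previous section's type.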