LLM session: async, 30min timeout, semantic chunking + parallel

The claude_session bridge had two structural defects that made any non-trivial document extraction unreliable:

1. subprocess.run() blocked the asyncio event loop in the MCP server for the full duration of every LLM call (60-180s typical).
2. The 120-second timeout was below the cold-cache cost of any document over ~12K Hebrew characters. Three back-to-back timeouts on case 8174-24 dropped 43 appellant claims on the floor.

Phase 1 of the remediation plan keeps claude_session as the engine (no Anthropic API switch) and restructures around it:

claude_session.py
• query / query_json are now async: they use asyncio.create_subprocess_exec instead of subprocess.run, so the MCP server can serve other coroutines while a call is in flight.
• DEFAULT_TIMEOUT 120 → 1800 (30 min). High enough that no realistic document hits it; bounded so a runaway process never lingers forever.
• LONG_TIMEOUT 300 → 3600 for opus block writing on full case context.
• TimeoutError now actually kills the subprocess (asyncio.wait_for cancellation alone leaves the child running).

claims_extractor.py
• _split_by_sections: chunks at numbered sections / Hebrew letter headings / "פרק" markers / markdown ##, falls back to paragraph breaks, then to hard splits. Targets 12K chars per chunk, small enough that each chunk reliably finishes inside the timeout.
• _extract_chunk: per-chunk retry (1 retry by default, CHUNK_RETRY_ATTEMPTS) with structured logging on failure. Failed chunks no longer crash the overall extraction; they are skipped with a partial-result warning.
• extract_claims_with_ai now runs chunks in parallel via asyncio.gather, bounded by a semaphore (CHUNK_CONCURRENCY=3). A 25K-char appeal took 150-300s sequentially and now takes ~70-90s.

Updated all 9 callers (claims, appraiser facts, block writer, qa validator, brainstorm, learning loop, style analyzer × 3) to await the now-async API.

The one-shot scripts/extract_claims_8174.py, used to recover the 43 appellant claims on case 8174-24, has been moved to .archive/ because phase 1 makes it obsolete.

SCRIPTS.md updated.

Phase 2 (background-task wrapper around LLM-bound MCP tools, persistent llm_tasks table, SSE progress) is the structural follow-up, in a separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
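
The claude_session.py changes are not part of the diff shown below, so here is a minimal sketch of the kill-on-timeout pattern the message describes. It is an illustration under stated assumptions, not the code from this commit: run_with_timeout and its parameters are hypothetical names, and the real query / query_json may be structured differently.

```python
import asyncio
import sys


async def run_with_timeout(argv: list[str], timeout: float) -> bytes:
    """Run ``argv`` without blocking the event loop and return its stdout.

    Hypothetical helper for illustration only; not the claude_session API.
    """
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, _stderr = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        # wait_for only cancels communicate(); the child process keeps
        # running unless it is killed and reaped explicitly.
        proc.kill()
        await proc.wait()
        raise
    return stdout


if __name__ == "__main__":
    # A fast child finishes well inside the timeout.
    out = asyncio.run(run_with_timeout([sys.executable, "-c", "print('ok')"], 30))
    print(out.strip())
```

Because the helper awaits the child instead of calling subprocess.run, other coroutines on the same loop keep running while the call is in flight, which is the property the first bullet under claude_session.py relies on.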
2026-04-30 14:21:35 +00:00
parent 9bdfb05350
commit 28f49defff
10 changed files with 329 additions and 82 deletions
--- a/mcp-server/src/legal_mcp/services/claims_extractor.py
+++ b/mcp-server/src/legal_mcp/services/claims_extractor.py
@@ -7,6 +7,7 @@

 from __future__ import annotations

+import asyncio
 import logging
 import re
 from uuid import UUID
@@ -17,6 +18,21 @@ from legal_mcp.services import db, claude_session

 logger = logging.getLogger(__name__)

+# Each chunk targets ~12K chars (≈3K tokens of Hebrew). Smaller than the
+# previous 25K because:
+#   • A single ``claude -p`` call on a 25K-char Hebrew prompt with cold
+#     cache routinely hit ~150-180s. 12K chunks finish in ~60-90s.
+#   • Per-chunk retry costs less when chunks are smaller.
+#   • Parallel chunks benefit more — see CHUNK_CONCURRENCY.
+CHUNK_TARGET_CHARS = 12000
+
+# How many chunks to send to Claude in parallel. Each subprocess holds
+# ~300 MB RSS plus its own MCP stack; concurrency=3 keeps the box usable.
+CHUNK_CONCURRENCY = 3
+
+# How many retry attempts per failed chunk before giving up on it.
+CHUNK_RETRY_ATTEMPTS = 1
+

 EXTRACT_CLAIMS_PROMPT = """אתה מנתח מסמכים משפטיים בתחום תכנון ובניה. תפקידך לחלץ טענות מכתב טענות.

@@ -43,6 +59,103 @@ EXTRACT_CLAIMS_PROMPT = """אתה מנתח מסמכים משפטיים בתחו
 """


+# Section markers we treat as natural chunk boundaries when present.
+# Hebrew legal briefs almost always use numbered sections like "10." or
+# letter-section headings (".א", ".ב"). Splitting between sections keeps
+# every chunk a self-contained argumentative unit.
+_SECTION_BOUNDARY_RE = re.compile(
+    r"\n\s*("
+    r"\d+\.\s+\S"             # numbered section: "10. טענות"
+    r"|[א-ת]\.\s+\S"          # Hebrew letter section: "א. רקע"
+    r"|##\s+\S"               # markdown heading
+    r"|פרק\s+\S"              # "פרק" headings
+    r")"
+)
+
+
+def _split_by_sections(text: str, target: int = CHUNK_TARGET_CHARS) -> list[str]:
+    """Split a long document into roughly ``target``-sized chunks at section
+    boundaries. Falls back to paragraph breaks, then to hard splits if a
+    section happens to be larger than ``target`` on its own.
+    """
+    if len(text) <= target:
+        return [text]
+
+    boundaries = [m.start() for m in _SECTION_BOUNDARY_RE.finditer(text)]
+    boundaries = [0, *boundaries, len(text)]
+
+    chunks: list[str] = []
+    start = 0
+    for cut in boundaries[1:]:
+        # Greedy: keep extending the current chunk section by section and
+        # cut at the first boundary that brings it to ``target`` or beyond.
+        if cut - start < target:
+            continue
+        end = cut
+        if end - start > target * 1.5:
+            # Section group exceeds 1.5× target — fall back to paragraph
+            # break inside it to avoid one chunk being far too big.
+            soft = text.rfind("\n\n", start, start + target)
+            if soft > start + target // 2:
+                end = soft
+        chunks.append(text[start:end].strip())
+        start = end
+    if start < len(text):
+        chunks.append(text[start:].strip())
+
+    # Hard splits for any chunk that is still too large (rare, but
+    # documents without any section markers can fall through).
+    final: list[str] = []
+    for c in chunks:
+        if len(c) <= target * 1.5:
+            final.append(c)
+            continue
+        for i in range(0, len(c), target):
+            final.append(c[i:i + target])
+    return [c for c in final if c.strip()]
+
+
+async def _extract_chunk(
+    chunk: str,
+    chunk_index: int,
+    chunk_total: int,
+    context: str,
+) -> tuple[int, list[dict] | None]:
+    """Run extraction on one chunk with retry. Returns ``(chunk_index, claims_or_None)``.
+
+    None means the chunk failed both the initial call and every retry
+    (caller can use this to mark the result as partial).
+    """
+    chunk_label = f" (חלק {chunk_index + 1}/{chunk_total})" if chunk_total > 1 else ""
+    prompt = (
+        f"{EXTRACT_CLAIMS_PROMPT}\n\n"
+        f"{context}{chunk_label}\n\n"
+        f"--- תחילת מסמך ---\n{chunk}\n--- סוף מסמך ---"
+    )
+    last_err: Exception | None = None
+    for attempt in range(CHUNK_RETRY_ATTEMPTS + 1):
+        try:
+            claims = await claude_session.query_json(prompt)
+        except Exception as e:
+            last_err = e
+            logger.warning(
+                "extract_claims chunk %d/%d attempt %d raised: %s",
+                chunk_index + 1, chunk_total, attempt + 1, e,
+            )
+            continue
+        if isinstance(claims, list):
+            return chunk_index, claims
+        logger.warning(
+            "extract_claims chunk %d/%d attempt %d returned non-list (%s)",
+            chunk_index + 1, chunk_total, attempt + 1, type(claims).__name__,
+        )
+    logger.error(
+        "extract_claims chunk %d/%d failed after %d attempts: %s",
+        chunk_index + 1, chunk_total, CHUNK_RETRY_ATTEMPTS + 1, last_err,
+    )
+    return chunk_index, None
+
+
 async def extract_claims_with_ai(
    text: str,
    doc_type: str = "appeal",
@@ -50,68 +163,62 @@ async def extract_claims_with_ai(
 ) -> list[dict]:
    """חילוץ טענות מכתב טענות באמצעות Claude.

+    Splits ``text`` at section boundaries, runs every chunk through
+    Claude in parallel (bounded by ``CHUNK_CONCURRENCY``), retries each
+    failed chunk once, and merges the results in original document order.
+    Failed chunks are logged but don't block the overall extraction —
+    we return what we got and surface the gap via the logs.
+
    Args:
        text: טקסט המסמך
        doc_type: סוג המסמך (appeal/response)
        party_hint: רמז לזהות הצד (אם ידוע)

    Returns:
-        רשימת טענות עם party_role, claim_text, topic
+        רשימת טענות עם party_role, claim_text, topic, claim_index.
    """
    context = f"סוג המסמך: {doc_type}"
    if party_hint:
        context += f"\nהצד המגיש: {party_hint}"

-    # For very long documents, split into chunks and merge results
-    max_chars_per_call = 25000
-    chunks = []
-    if len(text) > max_chars_per_call:
-        # Split at paragraph boundaries
-        pos = 0
-        while pos < len(text):
-            end = min(pos + max_chars_per_call, len(text))
-            if end < len(text):
-                # Find paragraph break near the limit
-                break_pos = text.rfind("\n\n", pos, end)
-                if break_pos > pos + max_chars_per_call // 2:
-                    end = break_pos
-            chunks.append(text[pos:end])
-            pos = end
-        logger.info("Document split into %d chunks (%d chars total)", len(chunks), len(text))
-    else:
-        chunks = [text]
-
-    all_claims = []
-
-    for i, chunk in enumerate(chunks):
-        chunk_label = f" (חלק {i+1}/{len(chunks)})" if len(chunks) > 1 else ""
-        prompt = (
-            f"{EXTRACT_CLAIMS_PROMPT}\n\n"
-            f"{context}{chunk_label}\n\n"
-            f"--- תחילת מסמך ---\n{chunk}\n--- סוף מסמך ---"
+    chunks = _split_by_sections(text)
+    if len(chunks) > 1:
+        logger.info(
+            "extract_claims: split %d chars into %d chunks (target=%d, concurrency=%d)",
+            len(text), len(chunks), CHUNK_TARGET_CHARS, CHUNK_CONCURRENCY,
        )
-        claims = claude_session.query_json(prompt, timeout=120)
-        if claims is None:
-            logger.warning("Failed to parse claims for chunk %d: %s", i, raw[:200])
+
+    sem = asyncio.Semaphore(CHUNK_CONCURRENCY)
+
+    async def _bounded(idx: int, c: str) -> tuple[int, list[dict] | None]:
+        async with sem:
+            return await _extract_chunk(c, idx, len(chunks), context)
+
+    results = await asyncio.gather(*[_bounded(i, c) for i, c in enumerate(chunks)])
+
+    # Merge in original order. Skip chunks that failed entirely.
+    failed = [i for i, r in results if r is None]
+    if failed:
+        logger.warning(
+            "extract_claims: %d/%d chunks failed (indices=%s) — returning partial result",
+            len(failed), len(chunks), failed,
+        )
+    merged: list[dict] = []
+    for idx, claims in sorted(results, key=lambda x: x[0]):
+        if not claims:
            continue
-        if isinstance(claims, list):
-            all_claims.extend(claims)
+        merged.extend(claims)

-    claims = all_claims
-    if not claims:
-        return []
-
-    if not isinstance(claims, list):
-        return []
-
-    # Add claim_index
-    for i, claim in enumerate(claims):
-        claim["claim_index"] = i
-        # Validate required fields
+    # Add claim_index and drop entries missing required fields.
+    cleaned: list[dict] = []
+    for i, claim in enumerate(merged):
+        if not isinstance(claim, dict):
+            continue
        if "party_role" not in claim or "claim_text" not in claim:
            continue
-
-    return [c for c in claims if "party_role" in c and "claim_text" in c]
+        claim["claim_index"] = i
+        cleaned.append(claim)
+    return cleaned


 def _infer_claim_type(doc_type: str, source_name: str) -> str: