LLM session: async, 30min timeout, semantic chunking + parallel

The claude_session bridge had two structural defects that made any
non-trivial document extraction unreliable:

1. subprocess.run() blocks the asyncio event loop in the MCP server for
   the full duration of every LLM call (60-180s typical).
2. The 120-second timeout was below the cold-cache cost of any document
   over ~12K Hebrew characters. Three back-to-back timeouts on case
   8174-24 dropped 43 appellant claims on the floor.

Phase 1 of the remediation plan keeps claude_session as the engine (no
Anthropic API switch) and restructures around it:

claude_session.py
• query / query_json are now async: asyncio.create_subprocess_exec
  instead of subprocess.run, so the MCP server can serve other
  coroutines while a call is in flight.
• DEFAULT_TIMEOUT 120 → 1800 (30 min). High enough that no realistic
  document hits it; bounded so a runaway never zombifies forever.
• LONG_TIMEOUT 300 → 3600 for opus block writing on full case context.
• TimeoutError now actually kills the subprocess (asyncio.wait_for
  cancellation alone leaves the child running).

claims_extractor.py
• _split_by_sections: chunks at numbered sections / Hebrew letter
  headings / "פרק" markers / markdown ##, falls back to paragraph
  breaks, then to hard splits. Targets 12K chars per chunk, small
  enough that each chunk reliably finishes inside the timeout.
• _extract_chunk: per-chunk retry (1 attempt by default) with
  structured logging on failure. Failed chunks no longer crash the
  overall extraction; they're skipped with a partial-result warning.
• extract_claims_with_ai now runs chunks in parallel via asyncio.gather
  bounded by a semaphore (CHUNK_CONCURRENCY=3). For a 25K-char appeal:
  was sequential 150-300s, now ~70-90s.

Updated all 9 callers (claims, appraiser facts, block writer, qa
validator, brainstorm, learning loop, style analyzer × 3) to await the
now-async API.

The one-shot scripts/extract_claims_8174.py used to recover 43 appellant
claims on case 8174-24 has been moved to .archive/; phase 1 makes it
obsolete. SCRIPTS.md updated.

Phase 2 (background-task wrapper around LLM-bound MCP tools, persistent
llm_tasks table, SSE progress) is the structural follow-up, in a
separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:21:35 +00:00
parent 9bdfb05350
commit 28f49defff
10 changed files with 329 additions and 82 deletions
--- a/mcp-server/src/legal_mcp/services/claude_session.py
+++ b/mcp-server/src/legal_mcp/services/claude_session.py
@@ -1,27 +1,41 @@
 """Claude Code session bridge — runs prompts via `claude -p` instead of API.

-All LLM calls in the project should use this module instead of calling
-the Anthropic API directly. This uses the local Claude Code CLI which
-runs on the user's claude.ai session — zero API cost.
+All LLM calls in the project go through this module. We shell out to the
+local Claude Code CLI which uses the developer's claude.ai session — zero
+direct Anthropic API cost.
+
+History: this module was originally synchronous (``subprocess.run``) with
+a 120-second timeout. That broke for large legal documents:
+
+  1. Sync subprocess stalled the asyncio event loop in the MCP server
+     while a single LLM call was in flight.
+  2. 120 seconds was far too short. A 25K-character Hebrew appeal on cold
+     prompt cache routinely takes 130-180 seconds; we proved this in case
+     8174-24 (three timeouts in a row).
+
+The fix: switch to async subprocess (non-blocking) and raise the default
+ceiling to 30 minutes — long enough that no realistic document hits it,
+but bounded so a runaway never zombifies forever.
 """

 from __future__ import annotations

+import asyncio
 import json
 import logging
-import subprocess
-from pathlib import Path

 from legal_mcp.config import parse_llm_json

 logger = logging.getLogger(__name__)

-# Default timeout for claude -p calls (seconds)
-DEFAULT_TIMEOUT = 120
-LONG_TIMEOUT = 300  # For complex tasks like block writing
+# Default ceiling for any single ``claude -p`` invocation, in seconds.
+# 30 min covers any single-document call we make in practice (chunking
+# handles the rest); the bound exists only to prevent runaway zombies.
+DEFAULT_TIMEOUT = 1800
+LONG_TIMEOUT = 3600  # opus block writing on full case context


-def query(prompt: str, timeout: int = DEFAULT_TIMEOUT, max_turns: int = 1) -> str:
+async def query(prompt: str, timeout: int = DEFAULT_TIMEOUT, max_turns: int = 1) -> str:
    """Send a prompt to Claude Code headless and return the text response.

    Passes the prompt via stdin (not argv) to avoid the OS ARG_MAX limit —
@@ -29,14 +43,14 @@ def query(prompt: str, timeout: int = DEFAULT_TIMEOUT, max_turns: int = 1) -> st

    Args:
        prompt: The prompt to send.
-        timeout: Max seconds to wait.
+        timeout: Max seconds before the subprocess is killed.
        max_turns: Max conversation turns (1 = single response).

    Returns:
        The text response from Claude.

    Raises:
-        RuntimeError: If claude CLI is not available or fails.
+        RuntimeError: If claude CLI is not available, fails, or times out.
    """
    cmd = [
        "claude", "-p",
@@ -45,23 +59,34 @@ def query(prompt: str, timeout: int = DEFAULT_TIMEOUT, max_turns: int = 1) -> st
    ]

    try:
-        result = subprocess.run(
-            cmd,
-            input=prompt,
-            capture_output=True,
-            text=True,
-            timeout=timeout,
+        proc = await asyncio.create_subprocess_exec(
+            *cmd,
+            stdin=asyncio.subprocess.PIPE,
+            stdout=asyncio.subprocess.PIPE,
+            stderr=asyncio.subprocess.PIPE,
        )
    except FileNotFoundError:
        raise RuntimeError("Claude CLI not found. Install Claude Code or add 'claude' to PATH.")
-    except subprocess.TimeoutExpired:
+
+    try:
+        stdout_b, stderr_b = await asyncio.wait_for(
+            proc.communicate(input=prompt.encode("utf-8")),
+            timeout=timeout,
+        )
+    except asyncio.TimeoutError:
+        # wait_for cancellation alone leaves the child running.
+        try:
+            proc.kill()
+            await proc.wait()
+        except ProcessLookupError:
+            pass
        raise RuntimeError(f"Claude CLI timed out after {timeout}s")

-    if result.returncode != 0:
-        stderr = result.stderr.strip()[:500] if result.stderr else "unknown error"
-        raise RuntimeError(f"Claude CLI failed (exit {result.returncode}): {stderr}")
+    if proc.returncode != 0:
+        stderr = stderr_b.decode("utf-8", errors="replace").strip()[:500] or "unknown error"
+        raise RuntimeError(f"Claude CLI failed (exit {proc.returncode}): {stderr}")

-    stdout = result.stdout.strip()
+    stdout = stdout_b.decode("utf-8", errors="replace").strip()
    if not stdout:
        raise RuntimeError("Claude CLI returned empty response")

@@ -75,10 +100,10 @@ def query(prompt: str, timeout: int = DEFAULT_TIMEOUT, max_turns: int = 1) -> st
        return stdout


-def query_json(prompt: str, timeout: int = DEFAULT_TIMEOUT) -> dict | list | None:
+async def query_json(prompt: str, timeout: int = DEFAULT_TIMEOUT) -> dict | list | None:
    """Send a prompt and parse the response as JSON.

    Uses parse_llm_json for robust parsing (handles markdown wrapping, truncation).
    """
-    raw = query(prompt, timeout=timeout)
+    raw = await query(prompt, timeout=timeout)
    return parse_llm_json(raw)
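
The kill-on-timeout pattern from the diff above can be tried in isolation. The following is a minimal sketch, not the module itself: run_with_timeout and demo are hypothetical names, and the child processes are plain Python interpreters (sys.executable) standing in for the claude CLI so the example runs anywhere.

```python
import asyncio
import sys


async def run_with_timeout(cmd: list[str], stdin_data: bytes, timeout: float) -> bytes:
    """Run cmd, feed it stdin_data, and kill the child if it overruns."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout_b, _ = await asyncio.wait_for(
            proc.communicate(input=stdin_data), timeout=timeout
        )
    except asyncio.TimeoutError:
        # wait_for only cancels the communicate() coroutine; the child
        # process keeps running until it is killed explicitly.
        try:
            proc.kill()
        except ProcessLookupError:
            pass  # child already exited on its own
        await proc.wait()  # reap the child so it does not linger as a zombie
        raise RuntimeError(f"timed out after {timeout}s") from None
    return stdout_b


async def demo() -> tuple[str, str]:
    # Stand-in children: one echoes stdin, one sleeps past the deadline.
    echo = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]

    ok = (await run_with_timeout(echo, b"hello", timeout=10)).decode()
    try:
        await run_with_timeout(sleeper, b"", timeout=0.5)
        outcome = "finished"
    except RuntimeError as exc:
        outcome = str(exc)
    return ok, outcome
```

Because the call is awaited rather than blocking, other coroutines on the same event loop keep running while the child is in flight, which is the property the MCP server needs.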