legal-ai/scripts/.archive/extract_claims_8174.py
Chaim 28f49defff
LLM session: async, 30min timeout, semantic chunking + parallel
The claude_session bridge had two structural defects that made any
non-trivial document extraction unreliable:

  1. subprocess.run() blocks the asyncio event loop in the MCP server
     for the full duration of every LLM call (60-180s typical).
  2. The 120-second timeout was below the cold-cache cost of any
     document over ~12K Hebrew characters. Three back-to-back timeouts
     on case 8174-24 dropped 43 appellant claims on the floor.
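
The first defect can be reproduced in isolation. The following is a minimal sketch, not the bridge's code: `SLOW_CMD` is a hypothetical stand-in for an LLM call (a child process that sleeps 0.5 s), and `heartbeat` plays the role of any other coroutine the MCP server would want to serve in the meantime.

```python
import asyncio
import subprocess
import sys
import time

# Stand-in for a slow LLM call (assumption): a child process that sleeps 0.5 s.
SLOW_CMD = [sys.executable, "-c", "import time; time.sleep(0.5)"]


async def heartbeat(ticks: list, stop: asyncio.Event) -> None:
    """Record a timestamp every 50 ms whenever the event loop gets to run us."""
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.05)


async def blocking_call() -> None:
    # Synchronous: the whole event loop is frozen until the child exits.
    subprocess.run(SLOW_CMD, check=True)


async def non_blocking_call() -> None:
    # Asynchronous: the loop keeps scheduling other coroutines while we wait.
    proc = await asyncio.create_subprocess_exec(*SLOW_CMD)
    await proc.wait()


async def ticks_during(call) -> int:
    """Count heartbeat ticks that land strictly inside the call's lifetime."""
    ticks, stop = [], asyncio.Event()
    hb = asyncio.create_task(heartbeat(ticks, stop))
    await asyncio.sleep(0)  # let the heartbeat record its first tick
    start = time.monotonic()
    await call()
    end = time.monotonic()
    stop.set()
    await hb
    return sum(start < t < end for t in ticks)


async def compare() -> tuple:
    return await ticks_during(blocking_call), await ticks_during(non_blocking_call)


if __name__ == "__main__":
    blocked, free = asyncio.run(compare())
    print(f"ticks during subprocess.run: {blocked}; during asyncio subprocess: {free}")
```

With `subprocess.run` the heartbeat gets no turns at all while the child is running; with `asyncio.create_subprocess_exec` it keeps ticking.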

Phase 1 of the remediation plan keeps claude_session as the engine
(no Anthropic API switch) and restructures around it:

claude_session.py
  • query / query_json are now async — asyncio.create_subprocess_exec
    instead of subprocess.run, so MCP server can serve other coroutines
    while a call is in flight.
  • DEFAULT_TIMEOUT 120 → 1800 (30 min). High enough that no realistic
    document hits it; bounded so a runaway never zombifies forever.
  • LONG_TIMEOUT 300 → 3600 for opus block writing on full case context.
  • TimeoutError now actually kills the subprocess (asyncio.wait_for
    cancellation alone leaves the child running).
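
A minimal sketch of the timeout-and-kill pattern described above, assuming a generic command line rather than the real claude_session invocation. `run_with_timeout` is a hypothetical name; only `DEFAULT_TIMEOUT = 1800` is taken from this commit.

```python
import asyncio
import sys

DEFAULT_TIMEOUT = 1800  # seconds (30 min), the value named in this commit


async def run_with_timeout(cmd: list, timeout: float = DEFAULT_TIMEOUT) -> str:
    """Run cmd without blocking the event loop; kill the child if it overruns."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for only cancels our wait; the child keeps running unless
        # we terminate it explicitly and then reap it.
        proc.kill()
        await proc.wait()
        raise
    if proc.returncode != 0:
        raise RuntimeError(f"command exited with status {proc.returncode}")
    return stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(asyncio.run(run_with_timeout([sys.executable, "-c", "print('ok')"])))
```

Reaping with `await proc.wait()` after `proc.kill()` matters: without it the killed child can linger as a zombie until the parent exits.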

claims_extractor.py
  • _split_by_sections: chunks at numbered sections / Hebrew letter
    headings / "פרק" ("chapter") markers / markdown ##, falls back to paragraph
    breaks, then to hard splits. Targets 12K chars per chunk — small
    enough that each chunk reliably finishes inside the timeout.
  • _extract_chunk: per-chunk retry (1 attempt by default) with
    structured logging on failure. Failed chunks no longer crash the
    overall extraction; they're skipped with a partial-result warning.
  • extract_claims_with_ai now runs chunks in parallel via
    asyncio.gather bounded by a semaphore (CHUNK_CONCURRENCY=3).
    For a 25K-char appeal: was sequential 150-300s, now ~70-90s.
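
The chunking and bounded-parallel steps above can be sketched as follows. This is an illustration under assumptions, not the code in claims_extractor.py: the heading regex, `split_by_sections`, `extract_all`, and the `extract_one` callback are hypothetical, and only the 12K target and `CHUNK_CONCURRENCY=3` come from this commit.

```python
from __future__ import annotations

import asyncio
import re
from typing import Awaitable, Callable

CHUNK_TARGET = 12_000   # chars per chunk, from this commit
CHUNK_CONCURRENCY = 3   # from this commit

# Assumed heading shapes at line start: "12." / Hebrew letter + "." / "פרק" / "##".
HEADING = re.compile(r"(?m)^(?=[ \t]*(?:\d+\.|[א-ת][.)]|פרק\s|##\s))")


def _pack(pieces: list[str], target: int) -> list[str]:
    """Greedily merge consecutive pieces while the merged chunk fits in target."""
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if current and len(current) + len(piece) > target:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks


def split_by_sections(text: str, target: int = CHUNK_TARGET) -> list[str]:
    """Split at headings, then paragraph breaks, then hard cuts."""
    out: list[str] = []
    for section in _pack(HEADING.split(text), target):
        if len(section) <= target:
            out.append(section)
            continue
        # Fallback 1: paragraph breaks (separator stays with its paragraph).
        for para in _pack(re.split(r"(?<=\n\n)", section), target):
            # Fallback 2: hard split of anything still oversized.
            out.extend(para[i:i + target] for i in range(0, len(para), target))
    return [c for c in out if c.strip()]


async def extract_all(
    chunks: list[str],
    extract_one: Callable[[str], Awaitable[list[dict]]],
    concurrency: int = CHUNK_CONCURRENCY,
    retries: int = 1,
) -> tuple[list[dict], int]:
    """Run extract_one on every chunk, at most `concurrency` at a time.

    Returns (claims, failed_chunk_count); a chunk that keeps failing is skipped.
    """
    sem = asyncio.Semaphore(concurrency)

    async def guarded(chunk: str) -> list[dict] | None:
        async with sem:
            for attempt in range(1 + retries):
                try:
                    return await extract_one(chunk)
                except Exception as exc:
                    print(f"chunk failed (attempt {attempt + 1}): {exc!r}")
        return None

    results = await asyncio.gather(*(guarded(c) for c in chunks))
    claims = [claim for r in results if r is not None for claim in r]
    failed = sum(r is None for r in results)
    return claims, failed
```

The semaphore is acquired inside each task rather than around `gather`, so all chunks are scheduled at once but only `concurrency` of them hold an LLM call at any moment. Whether "1 attempt by default" means one retry or one total attempt is not stated here; the sketch treats `retries=1` as one extra attempt.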

Updated all 9 callers (claims, appraiser facts, block writer, qa
validator, brainstorm, learning loop, style analyzer × 3) to await
the now-async API.

The one-shot scripts/extract_claims_8174.py, which was used to recover
43 appellant claims on case 8174-24, has been moved to .archive/ because
phase 1 makes it obsolete. SCRIPTS.md updated.

Phase 2 (background-task wrapper around LLM-bound MCP tools, persistent
llm_tasks table, SSE progress) is the structural follow-up — separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 14:21:35 +00:00

115 lines
3.7 KiB
Python

#!/usr/bin/env python3
"""One-shot: extract appellant claims for case 8174-24.

The analyst (CMPA-13) finished, but `extract_claims` timed out three times on
the main 25K-char appeal document, so the DB holds only 19 committee/response
claims and zero appellant claims. This script reruns extraction with
a higher timeout and parallel chunks.

Targets:
  • כתב ערר 18.12.24 ("statement of appeal"; appeal, 25,474 chars): appellant claims
  • השלמת מסמכים תמ״א 38 ("supplementary documents, TAMA 38"; decision, 3,718 chars): supplementary appeal filing

After phase 1.1-1.3 lands, this script becomes obsolete.

Usage: /home/chaim/legal-ai/mcp-server/.venv/bin/python scripts/extract_claims_8174.py
"""
from __future__ import annotations

import asyncio
import json
import sys
import time
from pathlib import Path
from uuid import UUID

# Ensure we can import legal_mcp from this repo's mcp-server tree.
# NOTE: parent.parent is the repo root only while this file lives in scripts/;
# from scripts/.archive/ the repo root is one directory higher.
sys.path.insert(0, str(Path(__file__).resolve().parent.parent / "mcp-server" / "src"))

from legal_mcp.services import claims_extractor, claude_session, db

# ── Patch claude_session to use 30-min ceiling ───────────────────────
# The hard-coded timeout=120 in claims_extractor.extract_claims_with_ai is
# what kept failing. Force every claude_session call here to use 1800s.
_orig_query_json = claude_session.query_json
_orig_query = claude_session.query


def _patched_query_json(prompt: str, timeout: int = 120):
    # Pass through whatever the original returns, with the timeout floor raised.
    return _orig_query_json(prompt, timeout=max(timeout, 1800))


def _patched_query(prompt: str, timeout: int = 120, max_turns: int = 1):
    return _orig_query(prompt, timeout=max(timeout, 1800), max_turns=max_turns)


claude_session.query_json = _patched_query_json
claude_session.query = _patched_query

CASE_NUMBER = "8174-24"
TARGETS = [
    # (doc_id, title hint, doc_type override, party_hint)
    ("655f96f7-d406-44ac-bb53-6b2c1ab2909c", "כתב ערר 18.12.24", "appeal", "יואל גולדמן"),
    ("13b4795a-4fb7-460e-bddf-a5d282a1a67f", "השלמת מסמכים תמ״א 38", "appeal", "יואל גולדמן"),
]


async def main() -> int:
    case = await db.get_case_by_number(CASE_NUMBER)
    if not case:
        print(f"ERROR: case {CASE_NUMBER} not found")
        return 1
    case_id = UUID(case["id"])
    print(f"=== Case {CASE_NUMBER}: {case['title']} ===")
    print()

    for doc_id, label, doc_type, party_hint in TARGETS:
        text = await db.get_document_text(UUID(doc_id))
        if not text:
            print(f"SKIP {label} - no extracted_text")
            continue
        chars = len(text)
        print(f"--- {label} ({chars:,} chars, doc_type={doc_type}) ---")
        t0 = time.monotonic()
        try:
            result = await claims_extractor.extract_and_store_claims(
                case_id=case_id,
                document_id=UUID(doc_id),
                text=text,
                doc_type=doc_type,
                party_hint=party_hint,
            )
        except Exception as e:
            # One failed document should not stop the remaining targets.
            print(f"  FAILED: {e}")
            continue
        dt = time.monotonic() - t0
        print(f"  done in {dt:.1f}s - {json.dumps(result, ensure_ascii=False)}")
        print()

    # Final tally
    pool = await db.get_pool()
    async with pool.acquire() as conn:
        rows = await conn.fetch(
            """SELECT party_role, claim_type, source_document, count(*) as n
               FROM claims WHERE case_id = $1
               GROUP BY 1, 2, 3 ORDER BY 1, 3""",
            case_id,
        )
    print("=== Final claims breakdown ===")
    total = 0
    for r in rows:
        n = r["n"]
        total += n
        print(f"  {r['party_role']:12} {r['claim_type']:10} ({n:3}) ← {r['source_document']}")
    print(f"  TOTAL: {total} claims")
    return 0


if __name__ == "__main__":
    sys.exit(asyncio.run(main()))
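
A caveat on the timeout patch near the top of this script: replacing `claude_session.query` only affects callers that look the function up on the module at call time. Whether claims_extractor does so is not shown here. The sketch below illustrates the difference with a hypothetical stand-in module named `session`; none of these names come from the repo.

```python
import types

# Hypothetical stand-in for claude_session: one function with a default timeout.
session = types.ModuleType("session")
session.query = lambda prompt, timeout=120: f"{prompt} (timeout={timeout})"


# Caller style A: resolves session.query at call time, so it sees the patch.
def call_via_module(prompt: str) -> str:
    return session.query(prompt, timeout=120)


# Caller style B: captured the function object up front, which is what
# `from session import query` does. It keeps pointing at the original.
captured_query = session.query


def call_via_captured(prompt: str) -> str:
    return captured_query(prompt, timeout=120)


# Same shape of patch as in the script: raise the timeout floor to 1800 s.
_orig = session.query
session.query = lambda prompt, timeout=120: _orig(prompt, timeout=max(timeout, 1800))

if __name__ == "__main__":
    print(call_via_module("p"))    # patched path
    print(call_via_captured("p"))  # unpatched path
```

If a caller turns out to use the captured style, the patch has to be applied to the name inside that caller's module instead.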