chore(#57): re-chunk+re-embed legacy precedents (pre-#55 chunker remediation)

Adds scripts/rechunk_legacy_precedents.py: selects every case_law with a tiny chunk (content<50, the pre-fix chunker fingerprint) and runs ingest.reindex_case_law on it (re-chunk+re-embed from stored full_text only, no re-OCR/LLM, idempotent). The script is also batch-idempotent: it re-queries the affected set on every run.

Run result (2026-06-03): 73 precedents reindexed, 0 failed. Tiny chunks 483 -> 4 (a 99.2% reduction); total precedent_chunks 5019 -> 3115 (fragments merged). Search verified healthy (substantial coherent passages, no errors).

The 4 residual tiny chunks are isolated section headings ('דיון' / "Discussion", 'טענות המשיבים' / "The respondents' arguments", ...) emitted by the CURRENT (fixed) chunker, not legacy fragments, and are already filtered at query time (>=50, #55). This is a minor chunker edge case and a candidate #55 follow-up.

The DB chunk migration is already applied to prod; this commit is the script + SCRIPTS.md entry only (no app code change, no deploy needed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 07:55:42 +00:00
parent c7c6f3eb9c
commit 434341cc29
2 changed files with 83 additions and 0 deletions
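
The selection rule and the reduction figure quoted in the commit message can be checked with a few lines of Python. This is only an illustrative sketch: `is_tiny` and `reduction_pct` are hypothetical helper names that do not exist in the repository, and the predicate assumes PostgreSQL's default `trim()` behaviour (stripping spaces only), which is why it uses `strip(" ")` rather than a bare `strip()`.

```python
TINY_THRESHOLD = 50  # threshold taken from the commit message (content < 50)


def is_tiny(content: str) -> bool:
    """Approximate the SQL predicate length(trim(content)) < 50.

    Assumption: trim() with no arguments removes only spaces, so we strip
    spaces only instead of all whitespace.
    """
    return len(content.strip(" ")) < TINY_THRESHOLD


def reduction_pct(before: int, after: int) -> float:
    """Percentage drop from `before` to `after`, rounded to one decimal."""
    return round(100 * (before - after) / before, 1)


# Figures reported in the commit message.
tiny_drop = reduction_pct(483, 4)
total_drop = reduction_pct(5019, 3115)
print(f"tiny chunks: -{tiny_drop}%  total chunks: -{total_drop}%")
```

A bare section heading such as 'דיון' is far below the threshold, so `is_tiny` flags it. That is consistent with the four residual chunks described in the message.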
--- a/scripts/SCRIPTS.md
+++ b/scripts/SCRIPTS.md
@@ -33,6 +33,7 @@
 | `voyage_rerank_corpus_poc.py` | python | POC #5: voyage-3 vs rerank-2 on the full corpus (785 docs). Verdict: +4.5% mean@3 overall, +11.6% on P queries (practical) | One-off benchmark; approved phase B |
 | `multimodal_backfill.py` | python | Backfill voyage-multimodal-3 page embeddings for existing case documents. Idempotent (skips by default), forces `MULTIMODAL_ENABLED=true` for the run, runs from the container. Phase C; see `docs/voyage-upgrades-plan.md` | Manual, per case (`python multimodal_backfill.py 8174-24 8137-24`) |
 | `backfill_chunk_pages.py` | python | Backfill `page_number` on existing `document_chunks`. The legacy chunker did not track pages, so `page_number=NULL` blocks the multimodal hybrid boost (text+image join on the same page). Re-extracts every PDF (re-OCR if needed, ~$0.0015/page), computes page_offsets, and updates the chunks. Idempotent | Manual, per case (`python backfill_chunk_pages.py 8174-24 8137-24`) |
+| `rechunk_legacy_precedents.py` | python | **#57**: re-chunk + re-embed case law that was embedded before the chunker fix (#55). Selects every `case_law` with a tiny chunk (`length(trim(content))<50`, the fingerprint of the old chunker) and runs `ingest.reindex_case_law` (re-chunk+re-embed from the stored `full_text` only; no re-OCR/LLM, feedback_no_reocr_retrofit; idempotent DELETE-then-INSERT). Idempotent at the batch level (re-queries the affected set on every run). Flag: `--limit N`. Runs with the mcp-server venv (`cd mcp-server && .venv/bin/python ../scripts/rechunk_legacy_precedents.py`) | One-off: data migration of legacy case law (fixed 2026-06-03) |
 | `audit_corpus_integrity.py` | python | Periodic corpus-consistency check: 3 read-only SQL checks on `case_law` and `cases`: (A) `external_upload` with an internal prefix `ערר`/`בל"מ`; (B) `internal_committee` missing `chair_name`/`district`; (C) `cases.practice_area` outside {`rishuy_uvniya`, `betterment_levy`, `compensation_197`, `''`}. Writes a cumulative log to `data/logs/corpus_integrity_audit.log` and, when violations are found, sends a wakeup to the CEO in Paperclip (best-effort, only if `PAPERCLIP_API_URL`+`PAPERCLIP_API_KEY` are set). Flag: `--no-notify`. Idempotent, exits 0. **Daily cron at 07:00**: `0 7 * * * /home/chaim/legal-ai/mcp-server/.venv/bin/python /home/chaim/legal-ai/scripts/audit_corpus_integrity.py` | `0 7 * * *` (cron) |
 | `backfill_legal_arguments.py` | python | Backfill `legal_arguments` for cases with existing `claims` (TaskMaster #36). Groups raw propositions into distinct legal arguments (~6-12 per side) via `argument_aggregator.aggregate_claims_to_arguments` (Claude CLI). Supports `--dry-run`/`--apply`/`--force`/`--case <num>...`. **Must run from the local machine** (not the container), because `claude_session` requires the Claude CLI | Manual, per case (`python scripts/backfill_legal_arguments.py --apply --case 1017-03-26`) |
 | `upload_blam_decisions.py` | python | One-off (2026-05-26): upload of 2 בל"מ decisions to `case_law` (8126/24 סופר נוח, 8047/23 הרנון) via a direct `ingest_internal_decision` call, bypassing the MCP server, which had not yet been reloaded after `proceeding_type` was added. **Do not run again** | One-off; move to `.archive/` when convenient |
--- a/scripts/rechunk_legacy_precedents.py
+++ b/scripts/rechunk_legacy_precedents.py
@@ -0,0 +1,82 @@
+#!/usr/bin/env python3
+"""#57 — re-chunk + re-embed legacy precedents that were embedded before the
+chunker fix (#55).
+
+Selects every case_law row that still has at least one tiny chunk
+(length(trim(content)) < 50) — the fingerprint of the pre-fix chunker — and
+runs ``ingest.reindex_case_law`` on it. That helper re-chunks + re-embeds from
+the STORED full_text only (no re-OCR / no LLM — feedback_no_reocr_retrofit) and
+is idempotent (store_precedent_chunks is DELETE-then-INSERT).
+
+Idempotent at the batch level too: it re-queries the affected set each run, so
+already-fixed rows drop out automatically. Safe to re-run.
+
+Run with the MCP server venv (config loads ~/.env / Infisical for VOYAGE +
+POSTGRES, same as the live MCP tools):
+
+    cd ~/legal-ai/mcp-server
+    .venv/bin/python ../scripts/rechunk_legacy_precedents.py            # all affected
+    .venv/bin/python ../scripts/rechunk_legacy_precedents.py --limit 5  # first N (smoke)
+"""
+import argparse
+import asyncio
+import sys
+
+from legal_mcp.services import db, ingest
+
+
+async def affected_ids(conn) -> list:
+    rows = await conn.fetch(
+        """
+        SELECT pc.case_law_id,
+               cl.case_number,
+               count(*) FILTER (WHERE length(trim(pc.content)) < 50) AS tiny,
+               count(*) AS total
+        FROM precedent_chunks pc
+        JOIN case_law cl ON cl.id = pc.case_law_id
+        GROUP BY pc.case_law_id, cl.case_number
+        HAVING count(*) FILTER (WHERE length(trim(pc.content)) < 50) > 0
+        ORDER BY total ASC
+        """
+    )
+    return rows
+
+
+async def main(limit: int | None) -> int:
+    pool = await db.get_pool()
+    async with pool.acquire() as conn:
+        rows = await affected_ids(conn)
+
+    if limit is not None:
+        rows = rows[:limit]
+
+    n = len(rows)
+    print(f"affected precedents to re-chunk: {n}", flush=True)
+    ok = 0
+    failed = []
+    for i, r in enumerate(rows, 1):
+        cid = r["case_law_id"]
+        cn = r["case_number"]
+        try:
+            res = await ingest.reindex_case_law(cid)
+            ok += 1
+            print(
+                f"[{i}/{n}] OK  {cn}: {r['total']} chunks ({r['tiny']} tiny) "
+                f"-> {res['chunks']} chunks",
+                flush=True,
+            )
+        except Exception as e:  # noqa: BLE001 — report per-doc, keep going
+            failed.append((cn, str(e)))
+            print(f"[{i}/{n}] FAIL {cn}: {e}", flush=True)
+
+    print(f"\nDONE — {ok}/{n} reindexed, {len(failed)} failed", flush=True)
+    for cn, e in failed:
+        print(f"  FAILED {cn}: {e}", flush=True)
+    return 0 if not failed else 1
+
+
+if __name__ == "__main__":
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--limit", type=int, default=None, help="process only first N")
+    args = ap.parse_args()
+    sys.exit(asyncio.run(main(args.limit)))
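
The batch-level idempotency described in the docstring (re-query the affected set, then DELETE-then-INSERT per row) can be illustrated with an in-memory stand-in. The sketch below is not the project's code: `select_affected`, `fake_reindex`, and `run_batch` are hypothetical names, the dict replaces the `precedent_chunks` table, and the fixed-size splitter with tail merging is an assumed toy chunker rather than the real one.

```python
THRESHOLD = 50  # same tiny-chunk cutoff the script's SQL uses


def select_affected(store: dict[str, list[str]]) -> list[str]:
    """Ids with at least one tiny chunk, smallest documents first."""
    hit = [cid for cid, chunks in store.items()
           if any(len(c.strip(" ")) < THRESHOLD for c in chunks)]
    return sorted(hit, key=lambda cid: len(store[cid]))


def fake_reindex(store: dict[str, list[str]], texts: dict[str, str],
                 cid: str, size: int = 200) -> dict[str, int]:
    """Toy re-chunk from stored full text; replaces the old chunks wholesale."""
    text = texts[cid]
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    # Merge a short tail into the previous chunk so no fragment is left over.
    if len(chunks) > 1 and len(chunks[-1]) < THRESHOLD:
        tail = chunks.pop()
        chunks[-1] += tail
    store[cid] = chunks  # DELETE-then-INSERT in miniature
    return {"chunks": len(chunks)}


def run_batch(store: dict[str, list[str]], texts: dict[str, str]) -> int:
    """Re-query the affected set and reindex it; returns how many were processed."""
    affected = select_affected(store)
    for cid in affected:
        fake_reindex(store, texts, cid)
    return len(affected)


texts = {"a": "x" * 430, "b": "y" * 300}
store = {"a": ["x" * 10] * 43, "b": ["y" * 150, "y" * 150]}
first = run_batch(store, texts)   # "a" is fragmented, "b" is already healthy
second = run_batch(store, texts)  # nothing left to select
print(first, second)
```

In this toy setup the second run selects nothing. In the real corpus, a precedent whose only remaining tiny chunk is a bare section heading would presumably still match the selection query on a re-run, which is harmless only because the per-row reindex is itself idempotent.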