Merge pull request 'feat(pipeline): עמידות (LangGraph) ל-final_halacha (P0, X16/INV-DUR1, #114)' (#178) from worktree-langgraph-durable-pipeline into main

2026-06-10 09:53:07 +00:00
parent 61d235175f e7d8b24d7c
commit f5650196b7
6 changed files with 303 additions and 33 deletions
--- a/scripts/SCRIPTS.md
+++ b/scripts/SCRIPTS.md
@@ -53,7 +53,8 @@
 | `halacha_panel_approve.py` | python | **פאנל-אישור הלכות (Trust-or-Escalate, dry-run).** 3 שופטים בלתי-תלויי-לינאז' (Opus/claude_session · DeepSeek · Gemini-2.5-flash) מצביעים על ה**ציר-הגס האמין** (92% חוצה-מודלים): נקיות→"הלכה לשמירה?"; nli_unsupported→"הציטוט תומך בכלל?" (שיפוט-מחדש); פגומות→re-extraction. רק ורדיקט מוסכם פועל אוטומטית, **פיצול מסלים ליו"ר** (INV-G10). `--apply` **מחווט** (clean: רוב 2/3; nli: פה-אחד-entailed מנקה flag) — הפיך, מגבה ל-`data/audit/` קודם. מפתחות: DeepSeek מ-`~/.hermes/...`, Gemini מ-`~/.env`. **חובה מקומי**. dry-run 2026-06-07: 197→103 אוטו (פה-אחד) / ~15 (רוב). | ידני / שלב-אימות-הלכות במסלול-הסופי |
 | `style_lesson_panel.py` | python | **פאנל-סגנון דו-סוכני (למידה כפולה).** על-גבי דיסטילציית-ה-Opus (draft↔final ב-`draft_final_pairs.analysis`), שני שופטים בלתי-תלויים — DeepSeek + Gemini-2.5-flash — מצביעים לכל לקח על השאלה הגסה "האם זו הנחיית-סגנון מופשטת ובת-הכללה (INV-LRN5 — קול ולא מהות)?". הסכמה 2/2-keep → נכתב כ-`decision_lesson` (`source=panel:deepseek+gemini`); 2/2-drop → לא נכתב; פיצול/substance → מוסלם ליו"ר. `--apply` הפיך, מגבה ל-`data/audit/`. הטמעה ל-SKILL.md/lessons.md נשארת שער-יו"ר ידני (INV-G10). מפתחות כמו פאנל-ההלכות. **חובה מקומי**. `--case <num>` / `--pair-id <uuid>`. | שלב-למידה במסלול-הסופי |
 | `final_learning_pipeline.py` | python | **תזמור שלב-הלמידה (פקודה אחת).** מופעל ע"י הרמס כשלוחצים "הרץ למידת-קול" במסלול-הסופי. דטרמיניסטי: (1) `ingest_final_version` עם נתיב-הסופי, (2) רישום לקורפוס-הסגנון (idempotent), (3) `style_lesson_panel --apply`. מקפל את הזרימה לפקודה אחת כדי שהסוכן לא ירכיב כמה קריאות (חסין). idempotent. **חובה מקומי**. `--case <num>`. | אוטו (כפתור run-learning) / ידני |
-| `final_halacha_pipeline.py` | python | **תזמור שלב-אימות-ההלכות (פקודה אחת).** מופעל ע"י הרמס כשלוחצים "הרץ אימות-הלכות". דטרמיניסטי: (1) `extract_internal_citations(chair)`, (2) `corroboration.build_all()`, (3) `halacha_panel_approve --apply`. **חובה מקומי**. `--case <num>` / `--limit N` (תקרת תור). | אוטו (כפתור run-halacha) / ידני |
+| `final_halacha_pipeline.py` | python | **תזמור שלב-אימות-ההלכות (פקודה אחת).** מופעל ע"י הרמס כשלוחצים "הרץ אימות-הלכות". דטרמיניסטי: (0) `precedent_extract_halachot` (החלטה), (1) `extract_internal_citations(chair)`, (2) `corroboration.build_all()`, (3) `halacha_panel_approve --apply`. **עמיד (X16/INV-DUR1):** 4 הצעדים רצים דרך `_pipeline_runtime.py` עם checkpoint לכל תיק — קריסה בפאנל [3] ממשיכה מ-[3]. ברירת-מחדל auto-resume; `--fresh` ריצה נקייה. **חובה מקומי**. `--case <num>` / `--limit N` / `--fresh`. | אוטו (כפתור run-halacha) / ידני |
+| `_pipeline_runtime.py` | python | **runtime עמידות משותף (X16 / INV-DUR1)** ל-`final_halacha_pipeline` ו-`final_learning_pipeline` (מימוש אחד, G2). עוטף רשימת-צעדים async ב-LangGraph `StateGraph` ליניארי עם `AsyncSqliteSaver` (checkpoint לכל צעד; resume מדלג על צעדים שהושלמו). **degradation חיננית:** ללא langgraph (`pip install -e ".[durable]"`) — ריצה ליניארית כמו קודם (הכפתור לא נשבר). `Step(name, run)` + `run_pipeline(steps, thread_id, checkpoint_db, fresh)`. נבדק: `mcp-server/tests/test_pipeline_runtime.py`. | מיובא ע"י סקריפטי-המסלול-הסופי |
 | `curator_apply_pipeline_branch.py` | python | **מקור-אמת לחיווט-הכפתורים של הרמס.** prompt-ה-curator חי רק ב-Paperclip DB (`agents.adapter_config.promptTemplate`). הסקריפט מקדים branch כך שיקיצה עם reason `final_learning_*`/`final_halacha_*` מריצה את ה-pipeline המתאים (HOME/DOTENV/DATA_DIR מוחלטים → DeepSeek+Gemini keys + DATA_DIR נפתרים נכון) ועוצרת, אחרת §A/§B כרגיל. idempotent (מסיר branch קודם). מחיל על שני הסוכנים (CMP+CMPA). `--verify`. **להריץ אחרי reset/יצירה-מחדש של סוכן-curator.** | אחרי reset prompt של curator |
 | `halacha_panel_audit.py` | python | **רשת-ביטחון לפאנל** (selective-prediction monitoring) — דוגם הלכות שאושרו ע"י הפאנל (`reviewer LIKE 'panel:%'`), מריץ עליהן **שוב** את הצבעת-ה-KEEP של 3 השופטים, ומציף כל מקרה שכעת נוטה DROP (false-keep פוטנציאלי). report-only כברירת-מחדל; `--flag` מחזיר את ה-flips ל-`pending_review` לסקירת-יו"ר. `--sample N`/`--seed`. בסיס 2026-06-07: 0/15. מיועד להרצה תקופתית (שבועי). מייבא שופטים מ-`halacha_panel_approve`. **חובה מקומי**. | תקופתי (שבועי) — ניטור |
 | `halacha_panel_calibrate.py` | python | **כיול מדיניות-ההצבעה של הפאנל** (Trust-or-Escalate, ICLR 2025). מריץ את שאלת-ה-KEEP של `halacha_panel_approve` על מדגם-הזהב ומודד מול `is_holding` (הציר-הגס) precision+coverage לכל מדיניות (unanimous/majority) + ספירת false-keep/false-drop. נותן את **אחוז-הטעות בפועל** לבחירת סף-סיכון α. מייבא שופטים מ-`halacha_panel_approve` (מקור-אמת יחיד). read-only, **חובה מקומי**. | ידני — לפני חיווט `--apply` |
--- a/scripts/_pipeline_runtime.py
+++ b/scripts/_pipeline_runtime.py
@@ -0,0 +1,130 @@
+"""Durable execution runtime for the local one-shot pipelines (INV-DUR1 / X16).
+
+Wraps an ordered list of named async steps in a LangGraph linear ``StateGraph``
+with a SQLite checkpointer, so a crash / OOM / kill resumes from the last
+COMPLETED step instead of re-running the whole pipeline (idempotency makes a
+re-run *safe*; durability makes it *not pay twice*).
+
+Shared by ``final_halacha_pipeline.py`` and ``final_learning_pipeline.py`` — one
+implementation, not one-per-script (G2).
+
+Graceful degradation: if ``langgraph`` is not installed (e.g. the shared venv
+hasn't been updated yet), the steps run LINEARLY — exactly as before — with a
+warning. The production button (run-halacha / run-learning, driven by Hermes)
+never breaks waiting on the dependency; it simply gains durable resume once
+``langgraph`` + ``langgraph-checkpoint-sqlite`` are present.
+
+Scope (X16 §1): LangGraph is used ONLY as the internal engine of these local
+scripts — never as an agent-platform orchestrator (that would create a parallel
+path to Paperclip, breaking G2/G12). HITL stays with the chair gates / Paperclip.
+
+A "step" is ``Step(name, run)`` where ``run`` is an async callable taking the
+accumulated results dict and returning a dict to merge into it (typically
+``{<something>: <summary>}``). The step's real side-effects (DB writes, the LLM
+panel) happen inside ``run``; LangGraph checkpoints *that the node finished* so a
+resume skips it.
+"""
+from __future__ import annotations
+
+import logging
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Annotated, Any, Awaitable, Callable, TypedDict
+
+logger = logging.getLogger(__name__)
+
+StepFn = Callable[[dict], Awaitable[dict]]
+
+
+@dataclass(frozen=True)
+class Step:
+    name: str
+    run: StepFn
+
+
+def _merge(a: dict, b: dict) -> dict:
+    return {**a, **b}
+
+
+async def _run_linear(steps: list[Step]) -> dict:
+    """Fallback: run steps in order with no checkpointing (pre-X16 behaviour)."""
+    results: dict[str, Any] = {}
+    for step in steps:
+        out = await step.run(results)
+        if out:
+            results.update(out)
+    return results
+
+
+async def run_pipeline(
+    steps: list[Step],
+    *,
+    thread_id: str,
+    checkpoint_db: str | Path,
+    resume: bool = True,
+    fresh: bool = False,
+) -> dict:
+    """Run ``steps`` in order with durable checkpointing keyed by ``thread_id``.
+
+    * A brand-new ``thread_id`` (or ``fresh=True``) runs from the first step.
+    * An INCOMPLETE thread (a prior run crashed mid-way) is RESUMED — completed
+      steps are skipped, execution continues from the failed step.
+    * A COMPLETED thread re-run (idempotent re-extraction) starts fresh — the
+      stale checkpoint is cleared first so step-accumulators don't double-count.
+
+    Returns the accumulated results dict (``{step_name: <return>, ...}``).
+    """
+    try:
+        from langgraph.checkpoint.sqlite.aio import AsyncSqliteSaver
+        from langgraph.graph import END, START, StateGraph
+    except Exception as e:  # noqa: BLE001 — any import failure → safe linear path
+        logger.warning(
+            "langgraph unavailable (%s) — running %d steps LINEARLY without "
+            "durable checkpointing (X16/INV-DUR1 inactive; install langgraph + "
+            "langgraph-checkpoint-sqlite to enable resume).",
+            e, len(steps),
+        )
+        return await _run_linear(steps)
+
+    class State(TypedDict):
+        results: Annotated[dict, _merge]
+
+    def _make_node(step: Step):
+        async def _node(state: State) -> dict:
+            out = await step.run(state.get("results", {}))
+            return {"results": out or {}}
+        return _node
+
+    graph = StateGraph(State)
+    prev = START
+    for step in steps:
+        graph.add_node(step.name, _make_node(step))
+        graph.add_edge(prev, step.name)
+        prev = step.name
+    graph.add_edge(prev, END)
+
+    checkpoint_db = Path(checkpoint_db)
+    checkpoint_db.parent.mkdir(parents=True, exist_ok=True)
+    config = {"configurable": {"thread_id": thread_id}}
+
+    async with AsyncSqliteSaver.from_conn_string(str(checkpoint_db)) as saver:
+        app = graph.compile(checkpointer=saver)
+        snapshot = await app.aget_state(config)
+        ran = (snapshot.values or {}).get("results", {}) if snapshot else {}
+        incomplete = bool(ran) and tuple(snapshot.next or ()) != ()
+
+        if not fresh and incomplete:
+            logger.info(
+                "pipeline %s — resuming from %s (%d step(s) already done: %s)",
+                thread_id, snapshot.next, len(ran), ", ".join(ran),
+            )
+            final = await app.ainvoke(None, config)
+        else:
+            if snapshot and (snapshot.values or {}):
+                # stale/completed checkpoint — clear so this is a true fresh run.
+                await saver.adelete_thread(thread_id)
+            if fresh and ran:
+                logger.info("pipeline %s — --fresh: cleared prior checkpoint", thread_id)
+            final = await app.ainvoke({"results": {}}, config)
+
+    return (final or {}).get("results", {})
--- a/scripts/final_halacha_pipeline.py
+++ b/scripts/final_halacha_pipeline.py
@@ -21,8 +21,16 @@ chair drives that from /precedents when a missing precedent is added.

 Local-only. Idempotent. The panel pass over the full pending queue can take minutes.

+Durable (X16 / INV-DUR1): the 4 steps run through scripts/_pipeline_runtime.py
+with a SQLite checkpoint per case (data/checkpoints/halacha.sqlite). A crash/OOM
+in the long panel [3] RESUMES from [3] on the next run instead of re-paying
+[0]–[2]. Default = auto-resume an interrupted run; ``--fresh`` forces a clean run
+from [0]. Requires the host extra ``pip install -e ".[durable]"`` (mcp-server);
+without it the steps run linearly (same as before) — the button never breaks.
+
    cd ~/legal-ai/mcp-server
    .venv/bin/python ../scripts/final_halacha_pipeline.py --case 8126-03-25
+    .venv/bin/python ../scripts/final_halacha_pipeline.py --case 8126-03-25 --fresh
 """
 from __future__ import annotations

@@ -35,6 +43,8 @@ from pathlib import Path

 sys.path.insert(0, str(Path(__file__).resolve().parent))

+import _pipeline_runtime  # noqa: E402 — durable runtime (X16); scripts/ on sys.path
+from legal_mcp import config  # noqa: E402
 from legal_mcp.services import corroboration, db  # noqa: E402
 from legal_mcp.tools.citations import extract_internal_citations  # noqa: E402
 from legal_mcp.tools.precedent_library import precedent_extract_halachot  # noqa: E402
@@ -59,54 +69,89 @@ async def main(args: argparse.Namespace) -> int:
        print(f"✗ תיק {case_number} לא נמצא")
        return 1
    chair = case.get("chair_name") or "דפנה תמיר"
-
-    # [0] extract the halachot the decision ITSELF states (its own row in case_law) —
-    # so they are not left pending. Idempotent: skip when already completed or on dry-run.
    row = await _decision_law_row(case_number)
-    if not row:
-        print(f"[0/4] ההחלטה {case_number} אינה ב-case_law עדיין — דילוג על חילוץ-הלכות")
-    elif row.get("halacha_extraction_status") == "completed":
-        print(f"[0/4] חילוץ-הלכות מההחלטה — דולג (כבר completed)")
-    elif args.dry_run:
-        print(f"[0/4] חילוץ-הלכות מההחלטה — מדולג (dry-run)")
-    else:
+
+    # The 4 steps as durable nodes (X16 / INV-DUR1): each is checkpointed the
+    # moment it finishes, so a crash/OOM in the long panel [3] resumes from [3]
+    # instead of re-paying [0]–[2]. Steps [0] and [2] stay non-fatal (record the
+    # error and continue); [1]/[3] may raise → the graph halts and the next run
+    # resumes there. All steps are idempotent, so a fresh re-run is also safe.
+
+    async def step_extract(results: dict) -> dict:
+        # [0] extract the halachot the decision ITSELF states (its own case_law row).
+        if not row:
+            print(f"[0/4] ההחלטה {case_number} אינה ב-case_law עדיין — דילוג על חילוץ-הלכות")
+            return {"extract": "skipped:not-enrolled"}
+        if row.get("halacha_extraction_status") == "completed":
+            print("[0/4] חילוץ-הלכות מההחלטה — דולג (כבר completed)")
+            return {"extract": "skipped:completed"}
+        if args.dry_run:
+            print("[0/4] חילוץ-הלכות מההחלטה — מדולג (dry-run)")
+            return {"extract": "skipped:dry-run"}
        print(f"[0/4] precedent_extract_halachot (החלטה {case_number})…", flush=True)
        try:
            raw0 = await precedent_extract_halachot(str(row["id"]))
            d0 = json.loads(raw0).get("data", {})
            print(f"  ✓ status={d0.get('status')} stored={d0.get('stored', d0.get('extracted'))}")
-        except Exception as e:
+            return {"extract": d0.get("status", "done")}
+        except Exception as e:  # non-fatal — record and continue
            print(f"  ⚠ halacha extraction failed (non-fatal): {e}")
+            return {"extract": f"error:{e}"}

-    # [1] citation graph
-    print(f"[1/4] extract_internal_citations (chair={chair})…", flush=True)
-    raw = await extract_internal_citations(chair_name=chair, limit=0)
-    try:
-        d = json.loads(raw).get("data", {})
-        print(f"  ✓ extracted {d.get('extracted')} · linked {d.get('linked')} "
-              f"· new {d.get('new')}")
-    except Exception:
-        print(f"  (citations returned: {str(raw)[:160]})")
+    async def step_citations(results: dict) -> dict:
+        # [1] citation graph
+        print(f"[1/4] extract_internal_citations (chair={chair})…", flush=True)
+        raw = await extract_internal_citations(chair_name=chair, limit=0)
+        try:
+            d = json.loads(raw).get("data", {})
+            print(f"  ✓ extracted {d.get('extracted')} · linked {d.get('linked')} "
+                  f"· new {d.get('new')}")
+            return {"citations": "done"}
+        except Exception:
+            print(f"  (citations returned: {str(raw)[:160]})")
+            return {"citations": "unparsed"}

-    # [2] corroboration signal + policy (whole corpus backfill) — skipped on dry-run
-    if args.dry_run:
-        print("[2/4] corroboration_rebuild — מדולג (dry-run)")
-    else:
+    async def step_corroboration(results: dict) -> dict:
+        # [2] corroboration signal + policy (whole corpus backfill) — skip on dry-run.
+        if args.dry_run:
+            print("[2/4] corroboration_rebuild — מדולג (dry-run)")
+            return {"corroboration": "skipped:dry-run"}
        print("[2/4] corroboration_rebuild (backfill)…", flush=True)
        try:
            cr = await corroboration.build_all()
            print(f"  ✓ {cr}")
-        except Exception as e:
+            return {"corroboration": "done"}
+        except Exception as e:  # non-fatal
            print(f"  ⚠ corroboration failed (non-fatal): {e}")
+            return {"corroboration": f"error:{e}"}

-    # [3] three-judge halacha panel
-    apply = not args.dry_run
-    print(f"[3/4] halacha_panel_approve {'--apply' if apply else '(dry-run)'} "
-          f"(Opus+DeepSeek+Gemini)…", flush=True)
-    import halacha_panel_approve as hpa
-    rc = await hpa.main(Namespace(limit=args.limit, concurrency=6, apply=apply))
+    async def step_panel(results: dict) -> dict:
+        # [3] three-judge halacha panel (the long step durability protects).
+        apply = not args.dry_run
+        print(f"[3/4] halacha_panel_approve {'--apply' if apply else '(dry-run)'} "
+              f"(Opus+DeepSeek+Gemini)…", flush=True)
+        import halacha_panel_approve as hpa
+        rc = await hpa.main(Namespace(limit=args.limit, concurrency=6, apply=apply))
+        return {"panel_rc": rc or 0}
+
+    steps = [
+        _pipeline_runtime.Step("extract_decision_halachot", step_extract),
+        _pipeline_runtime.Step("citations", step_citations),
+        _pipeline_runtime.Step("corroboration", step_corroboration),
+        _pipeline_runtime.Step("panel", step_panel),
+    ]
+    checkpoint_db = config.DATA_DIR / "checkpoints" / "halacha.sqlite"
+    # Stable thread per case → an interrupted real run resumes; dry-runs are
+    # previews (own thread, always fresh — never resume a stale preview).
+    thread_id = f"halacha:{case_number}" + (":dryrun" if args.dry_run else "")
+    results = await _pipeline_runtime.run_pipeline(
+        steps,
+        thread_id=thread_id,
+        checkpoint_db=checkpoint_db,
+        fresh=bool(args.fresh) or args.dry_run,
+    )
    print("\n✓ pipeline-אימות-הלכות הושלם" + (" (dry-run)" if args.dry_run else ""))
-    return rc or 0
+    return int(results.get("panel_rc", 0) or 0)


 if __name__ == "__main__":
@@ -117,4 +162,7 @@ if __name__ == "__main__":
                    help="cap pending halachot judged (0 = full queue)")
    ap.add_argument("--dry-run", dest="dry_run", action="store_true",
                    help="citations only; skip corroboration writes; panel in dry-run")
+    ap.add_argument("--fresh", action="store_true",
+                    help="ignore any incomplete checkpoint and run from step [0] "
+                         "(default: auto-resume an interrupted run; X16/INV-DUR1)")
    raise SystemExit(asyncio.run(main(ap.parse_args())))