feat(halacha): NLI entailment validator via claude_session (#81.3) + task #86

#81.3: a post-extraction validator that flags halachot whose rule_statement is NOT entailed by its supporting_quote (the model over-reaching beyond its source).

- Engine: claude_session-as-judge (local CLI, zero API cost) per chaim's standing preference; one batched judge call per chunk, NOT a hosted NLI model.
- Pure, unit-tested helpers in halacha_quality: NLI_SYSTEM, build_nli_prompt, parse_nli_verdicts (fails OPEN: any shape/label ambiguity → 'entailed').
- halacha_extractor._nli_check wraps the call; fails OPEN on any error (e.g. no CLI in the container) so a flaky judge never blocks a genuine halacha.
- Non-entailed (neutral/contradiction) → quality_flag 'nli_unsupported', which blocks auto-approve (routes to pending_review) via the existing store gate.
- config: HALACHA_NLI_ENABLED/MODEL/EFFORT (effort 'low', since entailment is simple).

Verified: suite 166 passed (10 new); LIVE smoke test against the real claude CLI returned ['entailed','neutral'] for a supported vs unsupported rule.

Also commits TaskMaster #86 (Nevo preamble/ratio: anti-contamination strip fix + gold-set benchmark) capturing today's strip_nevo_preamble findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
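
A minimal, self-contained sketch of the fail-open flow described in this message: judge output is coerced to one label per halacha, every non-'entailed' verdict adds the 'nli_unsupported' flag, and a flagged halacha is kept out of auto-approve. The names `parse_verdicts`, `apply_verdicts`, and `can_auto_approve` are illustrative assumptions rather than the repository's API; in particular `can_auto_approve` is only a stand-in for the existing store gate, whose real logic is not part of this diff.

```python
NLI_LABELS = {"entailed", "neutral", "contradiction"}
FLAG_NLI_UNSUPPORTED = "nli_unsupported"


def parse_verdicts(raw, n):
    """Coerce judge output into exactly n labels; any ambiguity -> 'entailed'."""
    # Wrong type or wrong length: trust nothing and fail open for the whole batch.
    if not isinstance(raw, list) or len(raw) != n:
        return ["entailed"] * n
    out = []
    for item in raw:
        # Accept either a bare label or a {"verdict": label} dict.
        v = item.get("verdict") if isinstance(item, dict) else item
        v = str(v or "").strip().lower()
        out.append(v if v in NLI_LABELS else "entailed")
    return out


def apply_verdicts(halachot, verdicts):
    """Append the flag (once) to every halacha whose verdict is not 'entailed'."""
    for h, v in zip(halachot, verdicts):
        flags = h.setdefault("quality_flags", [])
        if v != "entailed" and FLAG_NLI_UNSUPPORTED not in flags:
            flags.append(FLAG_NLI_UNSUPPORTED)
    return halachot


def can_auto_approve(h):
    # Hypothetical stand-in for the store gate: a flagged halacha is not
    # auto-approved (in the real system it would route to pending_review).
    return FLAG_NLI_UNSUPPORTED not in h.get("quality_flags", [])
```

With two halachot and a judge reply of `["entailed", "neutral"]`, only the second one picks up the flag; a malformed reply such as `None` leaves both unflagged, which is the intended fail-open behaviour.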
2026-06-03 14:46:12 +00:00
parent e25507f9ad
commit f196bed564
5 changed files with 226 additions and 28 deletions
--- a/.taskmaster/tasks/tasks.json
+++ b/.taskmaster/tasks/tasks.json
@@ -2739,7 +2739,7 @@
        "description": "התגלו שני באגים: (1) halacha_index מוקצה per-chunk ולכן אינו ייחודי לפסק — שני עקרונות שונים מקבלים אותו מספר (לא כפילות, אך שובר dedup/מיון מבוסס-אינדקס); (2) חילוץ רץ פי-2/3 על אותו פסק (למשל 85026-17 שלוש ריצות תוך דקתיים) ומוסיף append במקום להחליף — ה-advisory lock לא מנע. המשימה: אינדוקס ייחודי לפסק, force=True שמוחק לפני re-extract, וחיזוק ה-lock/אידמפוטנטיות. מחקר קצר: דפוסי idempotency/exactly-once ב-pipelines.",
        "details": "קוד: halacha_extractor.py (global advisory lock, per-chunk checkpoints ב-precedent_chunks.halacha_extracted_at, force flag), db.store_halachot_for_chunk (הקצאת halacha_index). לשקול unique constraint (case_law_id, halacha_index) אחרי תיקון ההקצאה.",
        "testStrategy": "הרצת חילוץ פעמיים ברצף על אותו פסק → ספירה זהה, אפס אינדקסים כפולים. הרצת force=True → המאגר מוחלף ולא מצטבר. בדיקת מרוץ: שתי הרצות במקביל → רק אחת מבצעת (lock).",
-        "status": "pending",
+        "status": "done",
        "dependencies": [],
        "priority": "medium",
        "subtasks": [
@@ -2748,70 +2748,76 @@
            "title": "נעילה גלובלית אמינה (advisory lock על חיבור ייעודי)",
            "description": "העברת ה-pg_advisory_lock מחיבור pooled לחיבור ייעודי שאינו ממוחזר לאורך כל ה-job (או pg_advisory_xact_lock בדפוס מתאים), עם finally שמשחרר תמיד; תיעוד איסור transaction-pooler לפני החיבור.",
            "details": "מקור: PG docs — session advisory lock לא בטוח תחת transaction-pooling (PgBouncer ממליץ נגדו). השורש האמיתי של 3 ריצות תוך 2 דקות: הנעילה לא החזיקה. קוד: halacha_extractor.py:372-394.",
-            "status": "pending",
+            "status": "done",
            "dependencies": [],
            "testStrategy": "שתי הרצות extract במקביל על אותו precedent — השנייה מחזירה busy ולא רצה (נבדק עם 2 תהליכים נפרדים).",
-            "parentId": "83"
+            "parentId": "83",
+            "updatedAt": "2026-06-03T13:08:00.800Z"
          },
          {
            "id": 2,
            "title": "אילוץ UNIQUE (case_law_id, halacha_index)",
            "description": "migration ל-halachot עם UNIQUE (case_law_id, halacha_index) כרשת ביטחון נגד התנגשויות מספור.",
            "details": "מקור: FireHydrant/OneUptime — per-scope ordinal דורש UNIQUE(scope,number) כערובת התקינות; הנעילה היא אופטימיזציה. כיום ההקצאה MAX+1 מוגנת רק ב-asyncio.Lock תוך-תהליכי. db.py:645-669.",
-            "status": "pending",
+            "status": "done",
            "dependencies": [],
            "testStrategy": "INSERT ידני של index כפול לאותו precedent נכשל ב-DB; query GROUP BY case_law_id,halacha_index HAVING count>1 מחזיר 0 בקורפוס.",
-            "parentId": "83"
+            "parentId": "83",
+            "updatedAt": "2026-06-03T13:08:00.811Z"
          },
          {
            "id": 3,
            "title": "דה-דופ לפי תוכן (content_hash + ON CONFLICT DO NOTHING)",
            "description": "עמודת content_hash (md5 של rule_statement+supporting_quote) + UNIQUE(case_law_id, content_hash); שינוי ה-INSERT ל-ON CONFLICT DO NOTHING.",
            "details": "מקור: upsert/replace הם פרימיטיב האידמפוטנטיות. משלים את שער ה-fuzzy של #82 במפתח-זהות מדויק. db.py:3325-3374.",
-            "status": "pending",
+            "status": "cancelled",
            "dependencies": [],
            "testStrategy": "הרצת extract(force=False) פעמיים ברצף על precedent שהושלם → מספר ה-halachot לא גדל.",
-            "parentId": "83"
+            "parentId": "83",
+            "updatedAt": "2026-06-03T13:08:06.185Z"
          },
          {
            "id": 4,
            "title": "מספור עמיד למרוץ ב-store_halachot_for_chunk",
            "description": "retry-on-unique-violation סביב read-MAX→insert (או חישוב index מ-sequence עם RETURNING) — ללא הסתמכות על asyncio.Lock בלבד לבטיחות חוצת-תהליכים.",
            "details": "מקור: race condition handling ב-PG. כיום MAX+1 מוגן רק תוך-תהליכית. db.py:3341-3344.",
-            "status": "pending",
+            "status": "done",
            "dependencies": [
              2
            ],
            "testStrategy": "בדיקה שמדמה שני קוראי-MAX מקבילים — שניהם נשמרים עם indexים שונים רצופים, אפס DuplicateKey לא-מטופל.",
-            "parentId": "83"
+            "parentId": "83",
+            "updatedAt": "2026-06-03T13:08:00.826Z"
          },
          {
            "id": 5,
            "title": "semantics של force = replace אטומי + עמידות לקריסה",
            "description": "ודאות ש-force=True מבצע delete+checkpoint-clear בטרנזקציה אחת (קיים ב-reset_halacha_extraction), וש-resume אחרי קריסה אמצע re-extract לא מכפיל (הודות לאילוצי 83.2/83.3).",
            "details": "מקור: delete-before-insert atomic / replace-partition. per-chunk commits נשמרים ל-resumability. db.py:3299-3304.",
-            "status": "pending",
+            "status": "done",
            "dependencies": [
              2,
              3
            ],
            "testStrategy": "הזרקת קריסה אחרי חלק מהצ'אנקים → resume משלים ללא כפילויות; count(halachot) תואם הרצה נקייה.",
-            "parentId": "83"
+            "parentId": "83",
+            "updatedAt": "2026-06-03T13:08:00.834Z"
          },
          {
            "id": 6,
            "title": "ניקוי נתונים היסטוריים לפני החלת אילוצים",
            "description": "סקריפט חד-פעמי (scripts/ + SCRIPTS.md) שמזהה precedents עם indexים מתנגשים, ממספר מחדש רציף, ומכין את הקורפוס להחלת UNIQUE של 83.2/83.3.",
            "details": "הניקוי של 2026-06-03 טיפל בכפילויות תוכן אך לא במספור; אילוצי UNIQUE ייכשלו אם יש index כפול שריר. גיבוי קיים: data/audit/halacha-cleanup-backup-*.sql.",
-            "status": "pending",
+            "status": "done",
            "dependencies": [
              2
            ],
            "testStrategy": "אחרי הרצה, אילוצי 83.2/83.3 נוצרים ללא שגיאה; דוח CSV ב-data/audit/ מפרט כמה תוקנו לכל precedent.",
-            "parentId": "83"
+            "parentId": "83",
+            "updatedAt": "2026-06-03T13:08:00.841Z"
          }
        ],
-        "updatedAt": "2026-06-03T00:00:00.000Z"
+        "updatedAt": "2026-06-03T13:08:10.793Z"
      },
      {
        "id": "84",
@@ -2819,7 +2825,7 @@
        "description": "אישור ההלכות ידני ומתיש: קריאת עקרונות כמעט-זהים שוב ושוב, ללא תיעדוף או קיבוץ. המשימה: לייעל את חוויית האישור — מיון לפי ביטחון/corroboration, קיבוץ near-duplicates יחד, auto-defer/הסתרה של פריטים באיכות נמוכה, ופעולות batch (אישור/דחייה מרובים). מבוססת מחקר (human-in-the-loop review UX, active-learning prioritization, triage queues). תתי-המשימות לאחר המחקר.",
        "details": "הקשר: הדחייה כמעט לא בשימוש (1/1650) — התור הוא 'אשר-או-השאר-תלוי'. כלים קיימים: halachot_pending, halacha_review (MCP), דף ביקורת ב-UI. לשלב עם פלט #81 (איכות) ו-#82 (dedup) כדי שהתור יציג רק מועמדים אמיתיים ומקובצים.",
        "testStrategy": "מדידת זמן/קליקים לאישור N הלכות לפני/אחרי. בדיקה: פריטים כמעט-זהים מוצגים כקבוצה אחת; פריטי איכות-נמוכה אינם מופיעים כברירת-מחדל בתור.",
-        "status": "pending",
+        "status": "in-progress",
        "dependencies": [
          "81",
          "82"
@@ -2831,10 +2837,11 @@
            "title": "סינון מועמדים אמיתיים בלבד בתור (quality gating מ-#81)",
            "description": "הסתרת פריטים שסומנו low-quality (quote_verified=false, rule_type=application, truncated) מתצוגת ברירת-המחדל של halachot_pending; ניתובם ל-bucket 'דורש תיקון-חילוץ'.",
            "details": "מקור: default-defer/auto-archive של איכות-נמוכה (Prodigy/content-moderation). צורך פלט #81. כלי: halachot_pending.",
-            "status": "pending",
+            "status": "in-progress",
            "dependencies": [],
            "testStrategy": "תור ברירת-מחדל מחזיר 0 פריטים עם דגלי low-quality; פרמטר include_low_quality=true עדיין חושף אותם.",
-            "parentId": "84"
+            "parentId": "84",
+            "updatedAt": "2026-06-03T13:43:18.478Z"
          },
          {
            "id": 2,
@@ -2851,43 +2858,47 @@
            "title": "תיעדוף התור לפי ציון משוקלל (uncertainty + impact)",
            "description": "החלפת FIFO בציון עדיפות משולב: ביטחון באזור-אפור קודם, מוגבר ע\"י corroboration של ציטוט-בודד ותחומי-עיסוק בכיסוי דליל; דיכוי כפילויות של פריט-הראש.",
            "details": "מקור: active learning — least-confidence first + diversity/impact weighting (Encord/Label Studio/greip). uncertainty לבד מדגים יתר-על-המידה near-dups.",
-            "status": "pending",
+            "status": "in-progress",
            "dependencies": [],
            "testStrategy": "בהינתן set מתוכנן, ראש התור הוא הפריט בעל הציון המשולב הגבוה; שתי כפילויות לא מופיעות יחד ב-top-5.",
-            "parentId": "84"
+            "parentId": "84",
+            "updatedAt": "2026-06-03T13:43:18.488Z"
          },
          {
            "id": 4,
            "title": "פעולות batch: אישור/דחייה לכל הקבוצה",
            "description": "הרחבת halacha_review לקבלת מזהה-קבוצה והחלת approve/reject על כל הוריאנטים בקריאה אחת, עם אפשרות override ידני של וריאנט לפני commit.",
            "details": "מקור: propagate-one-decision-to-group + checkpoint אנושי על bulk (Labelbox). סיכון: bulk-apply עיוור מפיץ שגיאה — חובה הצגת וריאנטים לפני אישור.",
-            "status": "pending",
+            "status": "done",
            "dependencies": [],
            "testStrategy": "אישור קבוצת 5 וריאנטים מסמן את כולם approved בפעולה אחת; דחייה מסמנת את כולם rejected; override של וריאנט בודד אפשרי.",
-            "parentId": "84"
+            "parentId": "84",
+            "updatedAt": "2026-06-03T13:43:13.227Z"
          },
          {
            "id": 5,
            "title": "דחייה/השהיה זולה + סמנטיקת reject נכונה",
            "description": "הוספת outcomes מפורשים reject ו-defer ל-halacha_review; reject שומר אות שלילי מתמשך (מזין חזרה ל-#81), defer משאיר pending ומחזיר לסוף התור.",
            "details": "מקור: Prodigy accept/reject/ignore — הבחנה סמנטית היא הפתרון ל-1/1650. כיום אין פועל 'זבל' זול ולכן זבל מצטבר כ-pending.",
-            "status": "pending",
+            "status": "done",
            "dependencies": [],
            "testStrategy": "reject קובע status rejected מתמשך + סיבה (queryable למשוב מחלץ); defer משאיר pending אך מוריד עדיפות.",
-            "parentId": "84"
+            "parentId": "84",
+            "updatedAt": "2026-06-03T13:43:13.239Z"
          },
          {
            "id": 6,
            "title": "UI: ביקורת keyboard-first עם 4 מקשים (a/r/space/e)",
            "description": "עיצוב-מחדש של דף הביקורת ב-Next.js לכרטיס-בכל-פעם, מונע-מקלדת (Approve a / Reject r / Defer space / Edit e), עם הקשר-קבוצה וציטוט-מקור inline.",
            "details": "מקור: keyboard-first מעלה throughput 20-30% בלי פגיעה באיכות (CleverX); one-card-at-a-time מפחית עומס קוגניטיבי (Hick's law). דף: web-ui /feedback או דף ביקורת ייעודי.",
-            "status": "pending",
+            "status": "done",
            "dependencies": [
              4,
              5
            ],
            "testStrategy": "מבקר יכול approve/reject/defer/edit-then-approve כולו במקלדת בלי עכבר; הכרטיס מציג מונה-וריאנטים וציטוט-מקור.",
-            "parentId": "84"
+            "parentId": "84",
+            "updatedAt": "2026-06-03T13:43:13.253Z"
          },
          {
            "id": 7,
@@ -2902,7 +2913,7 @@
            "parentId": "84"
          }
        ],
-        "updatedAt": "2026-06-03T00:00:00.000Z"
+        "updatedAt": "2026-06-03T13:43:18.488Z"
      },
      {
        "id": "85",
@@ -2914,13 +2925,60 @@
        "dependencies": [],
        "priority": "high",
        "subtasks": []
+      },
+      {
+        "id": "86",
+        "title": "טיפול ב-preamble/רציו של נבו — anti-contamination + gold-set מהרציו",
+        "description": "התגלה (2026-06-03) ש-`strip_nevo_preamble` קיים ומחווט ל-ingest, אבל ה-regex `_DECISION_START` מזהה רק פתיחות של ועדת ערר (בפנינו/הערר שבנדון/ועדת הערר לתכנון/רקע עובדתי/עסקינן) — ולא פסקי-דין שנפתחים ב'פסק-דין' (כמו בג\"ץ 1764/05). לכן בפסקי-דין מנבו — בדיוק אלה שיש להם מיני-רציו — ה-preamble/רציו **אינו נחתך**, דולף לצ'אנקים, ועלול לזהם את חילוץ ההלכות (המחלץ קורא את התשובון של נבו) ואת הקורפוס. במקביל — הרציו של נבו הוא gold-set אנושי-מקצועי חינמי לאמידת איכות החילוץ.",
+        "details": "קוד: mcp-server/src/legal_mcp/services/extractor.py — `strip_nevo_preamble` (~367), `_NEVO_MARKERS` (ספרות:/חקיקה שאוזכרה:/מיני-רציו:/...), `_DECISION_START` (~361). מחווט ב-ingest.py:161 ו-documents.py:152. הוכחה: ב-1764/05 המיני-רציו שרד כ-chunk מסוג intro (לא נחתך) ורק במזל לא חולץ (intro לא ב-EXTRACTABLE_SECTIONS). השוואת benchmark שבוצעה ידנית על 1764/05: 14 הלכות שלנו כיסו 100% מ-4 הלכות-הרציו של נבו + 2 נוספות, בגרנולריות פי ~3.5 (קשור ל-#81.5).",
+        "testStrategy": "strip_nevo_preamble על טקסט 1764/05 מסיר את בלוק המיני-רציו ומתחיל מ'פסק-דין'; regression: פתיחות ועדת-ערר ממשיכות להיחתך נכון. benchmark מפיק recall/precision/granularity.",
+        "status": "pending",
+        "dependencies": [],
+        "priority": "high",
+        "subtasks": [
+          {
+            "id": 1,
+            "title": "הרחבת _DECISION_START לפסקי-דין (anti-contamination)",
+            "description": "הוספת פתיחות פסק-דין ל-`_DECISION_START` (פסק-דין / פסק דין / 'השופט'/'כב' השופט'/'לפני:') כך ש-strip_nevo_preamble חותך את ה-preamble/רציו גם בפסקי-דין מנבו.",
+            "details": "קוד: extractor.py `_DECISION_START`. שמירה על תאימות לאחור לפתיחות ועדת-ערר הקיימות.",
+            "status": "pending",
+            "dependencies": [],
+            "testStrategy": "unit: strip_nevo_preamble(טקסט 1764/05) מסיר את המיני-רציו ומתחיל מ'פסק-דין'; טקסט ועדת-ערר עם בפנינו עדיין נחתך נכון; טקסט ללא preamble חוזר ללא שינוי.",
+            "parentId": "86"
+          },
+          {
+            "id": 2,
+            "title": "backfill — זיהוי וטיהור פסקי-דין שהרציו דלף אליהם",
+            "description": "סקריפט לזיהוי פסקי-דין בקורפוס שה-preamble/רציו של נבו דלף לצ'אנקים (intro/legal_analysis) או להלכות שחולצו; re-ingest/strip + בדיקת זיהום בהלכות הקיימות.",
+            "details": "להריץ אחרי 86.1. גיבוי לפני re-ingest. לבדוק האם הלכות קיימות הן העתק של רציו.",
+            "status": "pending",
+            "dependencies": [
+              1
+            ],
+            "testStrategy": "דוח (CSV ב-data/audit/) של פסקים מושפעים; אחרי טיהור — אף chunk לא מכיל בלוק מיני-רציו; re-extraction נקי.",
+            "parentId": "86"
+          },
+          {
+            "id": 3,
+            "title": "Nevo-ratio gold-set benchmark (מזין #81.7)",
+            "description": "חילוץ בלוק המיני-רציו החתוך כ-ground-truth לכל פסק-דין מנבו; harness שמשווה הלכות-שלנו מול הרציו ומפיק recall (כיסוי הלכות-הרציו) / precision / יחס-גרנולריות.",
+            "details": "מקור ground-truth חינמי ואיכותי. ה-benchmark על 1764/05 כבר הודגם ידנית (recall=100%). לשמור את הרציו בשדה ייעודי (למשל case_law.headnote) במקום למחוק.",
+            "status": "pending",
+            "dependencies": [
+              1
+            ],
+            "testStrategy": "על 1764/05: recall=100%, מדווח granularity ratio; ניתן להריץ batch על כל פסקי-נבו ולהפיק טבלת איכות.",
+            "parentId": "86"
+          }
+        ],
+        "updatedAt": "2026-06-03T00:00:00.000Z"
      }
    ],
    "metadata": {
      "version": "1.0.0",
-      "lastModified": "2026-06-03T12:32:19.721Z",
+      "lastModified": "2026-06-03T13:43:18.488Z",
      "taskCount": 85,
-      "completedCount": 76,
+      "completedCount": 77,
      "tags": [
        "legal-ai"
      ]
--- a/mcp-server/src/legal_mcp/config.py
+++ b/mcp-server/src/legal_mcp/config.py
@@ -154,6 +154,15 @@ HALACHA_AUTO_APPROVE_THRESHOLD = float(
 # principle. Set > 1.0 to disable semantic dedup (exact-quote dedup still runs).
 HALACHA_DEDUP_COSINE = float(os.environ.get("HALACHA_DEDUP_COSINE", "0.93"))

+# Halacha NLI entailment validator (#81.3) — after extraction, a claude_session
+# judge checks each halacha's rule_statement is entailed by its supporting_quote.
+# Non-entailed (neutral/contradiction) → quality flag 'nli_unsupported' that
+# blocks auto-approve. Runs through the local CLI (zero cost); fails OPEN if the
+# CLI is unavailable (e.g. container). 'low' effort — entailment is a simple call.
+HALACHA_NLI_ENABLED = os.environ.get("HALACHA_NLI_ENABLED", "true").lower() == "true"
+HALACHA_NLI_MODEL = os.environ.get("HALACHA_NLI_MODEL", HALACHA_EXTRACT_MODEL)
+HALACHA_NLI_EFFORT = os.environ.get("HALACHA_NLI_EFFORT", "low")
+
 # Google Cloud Vision (OCR for scanned PDFs)
 GOOGLE_CLOUD_VISION_API_KEY = os.environ.get("GOOGLE_CLOUD_VISION_API_KEY", "")

--- a/mcp-server/src/legal_mcp/services/halacha_extractor.py
+++ b/mcp-server/src/legal_mcp/services/halacha_extractor.py
@@ -284,6 +284,27 @@ def _coerce_halacha(raw: dict, is_binding: bool = True) -> dict | None:
    }


+async def _nli_check(items: list[dict]) -> list[str]:
+    """Entailment verdict per item (quote ⊨ rule) via claude_session (#81.3).
+
+    Local CLI, zero cost. FAILS OPEN: any error returns all-'entailed' so a
+    flaky/unavailable judge (e.g. in the container) never blocks a halacha.
+    """
+    if not items:
+        return []
+    try:
+        raw = await claude_session.query_json(
+            halacha_quality.build_nli_prompt(items),
+            system=halacha_quality.NLI_SYSTEM,
+            model=config.HALACHA_NLI_MODEL or None,
+            effort=config.HALACHA_NLI_EFFORT or None,
+        )
+    except Exception as e:
+        logger.warning("halacha NLI check failed (fail-open, no flags): %s", e)
+        return ["entailed"] * len(items)
+    return halacha_quality.parse_nli_verdicts(raw, len(items))
+
+
 async def _extract_chunk(
    chunk_text: str,
    section_type: str,
@@ -511,6 +532,12 @@ async def _extract_impl(case_law_id: UUID, force: bool = False,
            if halacha_quality.FLAG_NON_DECISION in flags and coerced["rule_type"] != "obiter":
                coerced["rule_type"] = "obiter"
            cleaned.append(coerced)
+        # #81.3 NLI entailment — one batched judge call per chunk (fail-open).
+        if config.HALACHA_NLI_ENABLED and cleaned:
+            verdicts = await _nli_check(cleaned)
+            for h, v in zip(cleaned, verdicts):
+                if v != "entailed" and halacha_quality.FLAG_NLI_UNSUPPORTED not in h["quality_flags"]:
+                    h["quality_flags"].append(halacha_quality.FLAG_NLI_UNSUPPORTED)
        if cleaned:
            embed_inputs = [
                f"{h['rule_statement']} — {h['reasoning_summary']}".strip(" —")
--- a/mcp-server/src/legal_mcp/services/halacha_quality.py
+++ b/mcp-server/src/legal_mcp/services/halacha_quality.py
@@ -134,6 +134,55 @@ FLAG_NON_DECISION = "non_decision"
 FLAG_TRUNCATED_QUOTE = "truncated_quote"
 FLAG_THIN_RESTATEMENT = "thin_restatement"
 FLAG_QUOTE_UNVERIFIED = "quote_unverified"
+FLAG_NLI_UNSUPPORTED = "nli_unsupported"  # rule not entailed by its quote (#81.3)
+
+
+# ── NLI entailment check (supporting_quote ⊨ rule_statement), #81.3 ──
+#
+# Pure prompt-builder + verdict-parser; the LLM call itself runs through
+# claude_session in halacha_extractor (local CLI, zero cost). A rule that the
+# quote does not actually support (neutral) or contradicts is the model
+# over-reaching beyond its source — flag it (blocks auto-approve). EVERYTHING
+# here fails OPEN: any parse ambiguity resolves to "entailed" so a flaky judge
+# never blocks a genuine halacha.
+
+NLI_SYSTEM = (
+    "אתה בודק היסק (entailment) משפטי. לכל זוג {כלל, ציטוט} החלט האם **הכלל נובע מהציטוט** — "
+    "כלומר הציטוט תומך בכלל ואינו מרחיב מעבר למה שנכתב בו. שלוש תוויות בלבד:\n"
+    "- entailed = הכלל נתמך במלואו בציטוט.\n"
+    "- neutral = הציטוט אינו תומך בכלל (הכלל מרחיב/מוסיף מעבר לציטוט).\n"
+    "- contradiction = הכלל סותר את הציטוט.\n"
+    'החזר JSON array בלבד באורך מספר הזוגות, לדוגמה: ["entailed","neutral",...]. '
+    "ללא markdown, ללא הסבר."
+)
+
+_NLI_LABELS = {"entailed", "neutral", "contradiction"}
+
+
+def build_nli_prompt(items: list[dict]) -> str:
+    """Build the user message: a numbered list of {rule, quote} pairs."""
+    blocks = []
+    for i, h in enumerate(items, 1):
+        rule = (h.get("rule_statement") or "").strip()
+        quote = (h.get("supporting_quote") or "").strip()
+        blocks.append(f"### זוג {i}\nכלל: {rule}\nציטוט: {quote}")
+    return "\n\n".join(blocks)
+
+
+def parse_nli_verdicts(raw, n: int) -> list[str]:
+    """Coerce the judge's output into exactly ``n`` labels — fail-open.
+
+    Any shape mismatch / unknown label resolves to 'entailed' so a flaky or
+    unavailable judge never blocks a halacha.
+    """
+    if not isinstance(raw, list) or len(raw) != n:
+        return ["entailed"] * n
+    out: list[str] = []
+    for item in raw:
+        v = item.get("verdict") if isinstance(item, dict) else item
+        v = str(v or "").strip().lower()
+        out.append(v if v in _NLI_LABELS else "entailed")
+    return out


 def compute_quality_flags(
--- a/mcp-server/tests/test_halacha_quality.py
+++ b/mcp-server/tests/test_halacha_quality.py
@@ -91,3 +91,58 @@ def test_flags_accumulate():

 def test_normalize_text_quote_variants():
    assert hq.normalize_text('עע"מ   317/10') == hq.normalize_text("עע״מ 317/10")
+
+
+# ── #81.3 NLI entailment — pure prompt + parser ──
+
+def test_build_nli_prompt_contains_pairs():
+    items = [
+        {"rule_statement": "כלל אלף", "supporting_quote": "ציטוט אלף"},
+        {"rule_statement": "כלל בית", "supporting_quote": "ציטוט בית"},
+    ]
+    p = hq.build_nli_prompt(items)
+    assert "כלל אלף" in p and "ציטוט בית" in p
+    assert "זוג 1" in p and "זוג 2" in p
+
+
+@pytest.mark.parametrize("raw,n,expected", [
+    (["entailed", "neutral"], 2, ["entailed", "neutral"]),
+    (["ENTAILED", "Contradiction"], 2, ["entailed", "contradiction"]),  # case-insensitive
+    ([{"verdict": "neutral"}, {"verdict": "entailed"}], 2, ["neutral", "entailed"]),  # dict shape
+    (["entailed"], 2, ["entailed", "entailed"]),          # length mismatch -> fail-open
+    (None, 2, ["entailed", "entailed"]),                  # non-list -> fail-open
+    (["bananas", "neutral"], 2, ["entailed", "neutral"]), # unknown label -> entailed
+])
+def test_parse_nli_verdicts(raw, n, expected):
+    assert hq.parse_nli_verdicts(raw, n) == expected
+
+
+# ── _nli_check (async, via claude_session) — fail-open + verdict mapping ──
+
+def test_nli_check_fail_open(monkeypatch):
+    import asyncio
+    from legal_mcp.services import halacha_extractor as he
+
+    async def boom(*a, **k):
+        raise RuntimeError("no claude CLI here")
+    monkeypatch.setattr(he.claude_session, "query_json", boom)
+    items = [{"rule_statement": "a", "supporting_quote": "b"}]
+    assert asyncio.run(he._nli_check(items)) == ["entailed"]  # never blocks
+
+
+def test_nli_check_maps_verdicts(monkeypatch):
+    import asyncio
+    from legal_mcp.services import halacha_extractor as he
+
+    async def fake(*a, **k):
+        return ["entailed", "neutral"]
+    monkeypatch.setattr(he.claude_session, "query_json", fake)
+    items = [{"rule_statement": "a", "supporting_quote": "b"},
+             {"rule_statement": "c", "supporting_quote": "d"}]
+    assert asyncio.run(he._nli_check(items)) == ["entailed", "neutral"]
+
+
+def test_nli_check_empty():
+    import asyncio
+    from legal_mcp.services import halacha_extractor as he
+    assert asyncio.run(he._nli_check([])) == []