From 808c2e4c4619a0f84dcb5f15f067cd39cde053e9 Mon Sep 17 00:00:00 2001
From: Chaim <chaim@marcus-law.co.il>
Date: Sun, 7 Jun 2026 20:12:58 +0000
Subject: [PATCH] feat(goldset): independent second-judge for rule_role (break
 AI-anchoring)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The gold-set's human role tags were assigned while the tagger could see a
Claude AI recommendation, so the ~100% human↔AI agreement reflects anchoring,
not an independent accuracy signal. This adds a genuinely independent judge,
the third rater after the human and the first AI: a different model (DeepSeek,
called directly through its OpenAI-compatible API) classifies rule_role blind,
never seeing the human tag or the first AI's answer, and the script reports an
inter-rater agreement matrix.
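
As a rough illustration of the agreement matrix described above, the sketch
below counts pairwise matches between raters and skips items where either
rater has no label. The helper name agreement_matrix and the sample rows are
hypothetical and are not gold-set data; the script in this patch prints each
pair separately rather than building a dict.

```python
from itertools import combinations


def agreement(rows, a, b):
    """Return (matches, comparable, percent) for raters a and b."""
    # Only items where both raters produced a label are comparable.
    valid = [r for r in rows if r.get(a) and r.get(b)]
    matches = sum(1 for r in valid if r[a] == r[b])
    percent = 100.0 * matches / len(valid) if valid else 0.0
    return matches, len(valid), percent


def agreement_matrix(rows, raters=("human", "ai", "deepseek")):
    """Pairwise agreement for every unordered pair of raters."""
    return {(a, b): agreement(rows, a, b) for a, b in combinations(raters, 2)}


# Illustrative rows only (assumption), not taken from the gold-set.
rows = [
    {"human": "holding", "ai": "holding", "deepseek": "interpretive"},
    {"human": "obiter", "ai": "obiter", "deepseek": "obiter"},
    {"human": "procedural", "ai": "procedural", "deepseek": None},
]

for (a, b), (ok, n, pct) in agreement_matrix(rows).items():
    print(f"{a}<->{b}: {ok}/{n} = {pct:.0f}%")
```

An item the independent judge failed on (deepseek is None) drops out of the
deepseek pairs but still counts toward human<->ai, so the denominators can
differ between pairs.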

Finding (100 tagged items): ai↔human agreement is 100% (anchored), while
deepseek↔human agreement is 50% on the fine-grained role but 92% on the coarse
axis (generalizable rule vs application/obiter). Conclusion: the fine sub-type
(holding/interpretive/procedural) is an inherently fuzzy boundary that two
capable models draw differently, whereas the coarse "is this a real rule" axis
is robust across models. Use the coarse axis as ground truth; treat the
sub-type as advisory, never as a gate.
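
To make the fine-vs-coarse distinction concrete, here is a minimal sketch of
collapsing the five roles onto the coarse axis and comparing agreement at both
levels. The label pairs are invented for illustration (assumption) and do not
reproduce the 50%/92% figures; percent_agree is a hypothetical helper name.

```python
GENERALIZABLE = {"holding", "interpretive", "procedural"}
NON_GENERALIZABLE = {"application", "obiter"}


def coarse(role):
    """Collapse a fine-grained role onto the coarse rule/nonrule axis."""
    if role in GENERALIZABLE:
        return "rule"
    if role in NON_GENERALIZABLE:
        return "nonrule"
    return None  # unknown or missing label


def percent_agree(pairs):
    """Percent of label pairs that match, ignoring pairs with a missing label."""
    pairs = [(x, y) for x, y in pairs if x and y]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


# (human, deepseek) pairs, illustrative only.
pairs = [
    ("holding", "interpretive"),   # fine: differ, coarse: both "rule"
    ("procedural", "holding"),     # fine: differ, coarse: both "rule"
    ("obiter", "obiter"),          # agree on both axes
    ("application", "holding"),    # differ on both axes
]

fine_pct = percent_agree(pairs)
coarse_pct = percent_agree([(coarse(h), coarse(d)) for h, d in pairs])
print(f"fine: {fine_pct:.0f}%  coarse: {coarse_pct:.0f}%")
```

Disagreements that stay inside the generalizable group disappear after the
collapse; only a rule/nonrule flip survives as a coarse disagreement.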

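Requires no tagging from the chair and is read-only on the gold-set. The API
key is read from ~/.hermes/profiles/deepseek/.env, then ~/.env, then the
DEEPSEEK_API_KEY environment variable.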
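
The key lookup can be sketched roughly as follows: scan candidate .env files
in order, then fall back to the process environment. read_env_key, DEMO_KEY,
and the temporary directory are hypothetical names used only for this
illustration; stripping surrounding quotes is an assumption about how the .env
file may be written.

```python
import os
import tempfile
from pathlib import Path


def read_env_key(name, candidates):
    """Return the first NAME=value found in the candidate files, else os.environ."""
    for path in candidates:
        if not path.exists():
            continue
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.startswith(f"{name}="):
                # Drop whitespace and optional surrounding quotes (assumption).
                return line.split("=", 1)[1].strip().strip("\"'")
    return os.environ.get(name, "")


# Demo against a throwaway directory so no real credentials are touched.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / ".env").write_text('OTHER=1\nDEMO_KEY="abc123"\n', encoding="utf-8")

found = read_env_key("DEMO_KEY", [demo_dir / "missing.env", demo_dir / ".env"])
absent = read_env_key("DEMO_KEY_NOT_SET_ANYWHERE", [demo_dir / ".env"])
print(repr(found), repr(absent))
```

For the script in this patch, the candidates would be
Path.home() / ".hermes/profiles/deepseek/.env" and Path.home() / ".env".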

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 scripts/SCRIPTS.md                   |   1 +
 scripts/goldset_independent_judge.py | 166 +++++++++++++++++++++++++++
 2 files changed, 167 insertions(+)
 create mode 100644 scripts/goldset_independent_judge.py
diff --git a/scripts/SCRIPTS.md b/scripts/SCRIPTS.md
index 7cf8350..76e0d63 100644
--- a/scripts/SCRIPTS.md
+++ b/scripts/SCRIPTS.md
@@ -43,6 +43,7 @@
 | `nevo_ratio_benchmark.py` | python | **#86.3**: measures halacha-extraction quality against Nevo's mini-ratio (a free, professional gold-set). For each ruling with a `nevo_ratio` (or one derived from `full_text` if the backfill has not run yet), a local LLM judge (`claude_session`, zero cost) semantically maps the system's halachot to Nevo's and produces **recall** (coverage of Nevo's halachot), **precision** (share of our halachot that were mapped), and **granularity** (decomposition ratio, an over-extraction signal for #81.5). `--case <num>` / `--all [--limit N]` / `--model` / `--out`. Writes a CSV to `data/audit/`. Runs with the mcp-server venv (requires a local Claude CLI). Verified on HCJ 1764/05: recall 0.875, precision 1.0, granularity 1.75x | Manual: quality measurement (CI/ad-hoc) |
 | `halacha_goldset.py` | python | **#81.7**: gold-set harness for halacha-extraction quality. `export --n N` exports a stratified sample (by precedent×rule_type) to a CSV with empty tagging columns (`is_holding`/`correct_type`/`quote_complete`) for manual tagging (Chaim/Dafna). `score --in <csv>` reads the tagged CSV and measures each validator (`compute_quality_flags`/`is_fact_dependent`/`is_quote_truncated`/`is_thin_restatement`) against the human benchmark: P/R/F1 + confusion. Basis for #81.8 (calibrating the approval threshold). Imports the same validators the extractor runs. Runs with the mcp-server venv. **Note:** a DB-backed interactive tagging page (`/goldset`) also exists; this script is the CSV fallback | Manual: export→tag→score |
 | `goldset_ai_recommend.py` | python | **#81.7 QA**: produces a **second, AI opinion** (local claude, zero cost) for every item in `halacha_goldset`: `is_holding` + `type` + rationale, stored in `ai_*` and shown on the page next to the human tag to surface disagreements. **Independent** of the validators being measured (no circularity) and **not** applied automatically. `--force` (regenerate)/`--limit N`. **Must run locally** (claude_session). | Manual: after creating/extending a batch |
+| `goldset_independent_judge.py` | python | **INV-DM7 validation**: an **independent second** role judge from a different model (direct DeepSeek API, OpenAI-compatible) that breaks the AI anchoring: it classifies rule_role **blind** (without seeing the human tag or the claude recommendation) and computes an agreement matrix (deepseek↔human vs ai↔human) plus a coarse axis (generalizable rule vs application/obiter). **Finding (2026-06-07):** ai↔human=100% (anchored), deepseek↔human=50% exact but **92% coarse** → the holding/interpretive/procedural sub-type is inherently fuzzy (do not gate on it); the coarse axis is reliable across models. Read-only on the gold-set. `--batch`/`--model`/`--limit`/`--concurrency`. Key from `~/.hermes/profiles/deepseek/.env`. raw→`/tmp/goldset_judge_raw.json`. | Manual: label-reliability validation |
 | `halacha_rule_role_backfill.py` | python | **INV-DM7**: one-time backfill that reclassifies legacy halachot (`rule_type IN ('binding','persuasive')`, authority values that were stored in the guise of a role before the axes were split) into one of the five **rule roles** (holding/interpretive/procedural/application/obiter) via the local claude_session (zero cost). **Does not touch authority** (derived from `precedent_level`). `--apply` (default is dry-run) / `--limit N` / `--concurrency`. Writes a backup CSV to `data/audit/` first. Fail-safe (a failed item keeps its old value). **Must run locally** (claude_session). | Manual, one-time, after deploying the authority split |
 | `halacha_batch_reconcile.py` | python | **#82.7**: offline cross-ruling dedup (conservative, **dry-run only**). dedup-on-insert compares only within a single ruling; here the threshold is stricter (cosine ≥0.95, `--cosine`) and the run is non-destructive: it finds near-duplicate halacha pairs across different rulings (pgvector `<=>`, exact) with a lexical signal (Jaccard/Levenshtein) and reports them to a CSV in `data/audit/` for the chair's review. It does not skip, merge, or delete. `--include-pending`. **`--link`** records the pairs found as `equivalent_halachot` (parallel authority, #84.2: a halacha-level parallel link, **not** a citation; idempotent, non-destructive). Runs with the mcp-server venv. Verified: 800 halachot → 5 pairs (linked). | Manual: review report / `--link` for linking |
 | `calibrate_halacha_dedup.py` | python | **#82.1**: calibrates the lexical dedup thresholds (#82.3) against the cleanup gold-set. Reads `halacha-cleanup-manifest-*.csv` (human-tagged duplicate↔survivor pairs), loads the survivor text from the DB, and sweeps (jaccard_min × levenshtein_min) with P/R/F1, marking the configured operating point. Verified that (0.55, 0.70) → **precision 1.0** (zero false merges), recall 0.30, which suits a secondary signal that blocks auto-approve. `--manifest <path>`. Runs with the mcp-server venv | One-time: calibration (done 2026-06-06) |
diff --git a/scripts/goldset_independent_judge.py b/scripts/goldset_independent_judge.py
new file mode 100644
index 0000000..6b31b6c
--- /dev/null
+++ b/scripts/goldset_independent_judge.py
@@ -0,0 +1,166 @@
+#!/usr/bin/env python3
+"""Independent second-judge for gold-set rule_ROLE — breaks the AI-anchoring loop.
+
+The gold-set's human role tags were assigned while the tagger could see a Claude
+AI recommendation, so the ~100% human↔AI agreement is contaminated by anchoring
+and is not an independent measure of role-classification accuracy. This script
+adds a third, genuinely independent rater: a different model (DeepSeek,
+OpenAI-compatible API) classifies the rule role blind, never seeing the human
+tag or the first AI's answer. Comparing deepseek↔human against ai↔human shows
+whether the labels are trustworthy or merely anchored.
+
+Zero tagging from the chair. Read-only on the gold-set.
+
+    cd ~/legal-ai/mcp-server
+    .venv/bin/python ../scripts/goldset_independent_judge.py            # all tagged
+    .venv/bin/python ../scripts/goldset_independent_judge.py --limit 10 # smoke
+    .venv/bin/python ../scripts/goldset_independent_judge.py --model deepseek-reasoner
+"""
+from __future__ import annotations
+
+import argparse
+import asyncio
+import json
+import os
+import sys
+from collections import Counter
+from pathlib import Path
+
+import httpx
+
+from legal_mcp.services import db
+
+ROLES = {"holding", "interpretive", "procedural", "application", "obiter"}
+
+SYSTEM = (
+    "אתה משפטן בכיר המסווג 'הלכות' שחולצו מפסיקה ישראלית לפי **סוג הכלל** בלבד. "
+    "אל תסווג מחייב/משכנע (דרגת-המחייבות אינה רלוונטית). בחר ערך אחד:\n"
+    "- holding — עיקרון מהותי שהיה הכרחי להכרעה (ratio; מבחן Wambaugh).\n"
+    "- interpretive — פרשנות הוראת-חוק/מונח/תכנית.\n"
+    "- procedural — סדר-דין: סמכות/מועדים/זכות-עמידה/מיצוי/נטל.\n"
+    "- application — החלה תלוית-עובדות על נסיבות התיק (לרוב לא-הלכה בת-הכללה).\n"
+    "- obiter — אמרת-אגב שלא הוכרעה.\n"
+    'החזר JSON בלבד: {"role":"<אחד מהחמישה>"}. ללא markdown, ללא הסבר.'
+)
+
+
+def _deepseek_key() -> str:
+    for p in (Path.home() / ".hermes/profiles/deepseek/.env", Path.home() / ".env"):
+        if p.exists():
+            for line in p.read_text(encoding="utf-8").splitlines():
+                if line.startswith("DEEPSEEK_API_KEY="):
+                    return line.split("=", 1)[1].strip().strip("\"'")
+    return os.environ.get("DEEPSEEK_API_KEY", "")
+
+
+def _user_prompt(it: dict) -> str:
+    src = "פסק-דין" if it.get("source_type") == "court_ruling" else "החלטת ועדת-ערר"
+    return (
+        f"מקור: {src}.\n\n"
+        f"ניסוח הכלל:\n{it.get('rule_statement') or ''}\n\n"
+        f"היגיון:\n{it.get('reasoning_summary') or ''}\n\n"
+        f"ציטוט תומך:\n{it.get('supporting_quote') or ''}"
+    )
+
+
+async def _judge(client: httpx.AsyncClient, key: str, model: str, it: dict) -> str | None:
+    try:
+        r = await client.post(
+            "https://api.deepseek.com/v1/chat/completions",
+            headers={"Authorization": f"Bearer {key}", "Content-Type": "application/json"},
+            json={
+                "model": model,
+                "messages": [
+                    {"role": "system", "content": SYSTEM},
+                    {"role": "user", "content": _user_prompt(it)},
+                ],
+                "temperature": 0,
+                "max_tokens": 60,
+                "response_format": {"type": "json_object"},
+            },
+            timeout=90,
+        )
+        r.raise_for_status()
+        content = r.json()["choices"][0]["message"]["content"]
+        role = str(json.loads(content).get("role", "")).strip().lower()
+        return role if role in ROLES else None
+    except Exception as e:  # noqa: BLE001
+        print(f"  ! judge error: {type(e).__name__}: {e}", flush=True)
+        return None
+
+
+def _agree(rows: list[dict], a: str, b: str) -> tuple[int, int, float]:
+    """Return (matches, comparable, percent) — percent is 0..100."""
+    valid = [r for r in rows if r.get(a) and r.get(b)]
+    ok = sum(1 for r in valid if r[a] == r[b])
+    return ok, len(valid), (100.0 * ok / len(valid) if valid else 0.0)
+
+
+async def main(args: argparse.Namespace) -> int:
+    key = _deepseek_key()
+    if not key:
+        print("no DEEPSEEK_API_KEY found", flush=True)
+        return 1
+
+    items = await db.goldset_list(args.batch)
+    # only items with a HUMAN role tag (the ground truth we are testing)
+    tagged = [it for it in items if (it.get("correct_type") or "").strip().lower() in ROLES]
+    if args.limit:
+        tagged = tagged[: args.limit]
+    print(f"independent judge ({args.model}) on {len(tagged)} human-tagged items\n", flush=True)
+
+    sem = asyncio.Semaphore(args.concurrency)
+    rows: list[dict] = []
+    async with httpx.AsyncClient() as client:
+        async def one(it: dict):
+            async with sem:
+                ds = await _judge(client, key, args.model, it)
+            rows.append({
+                "human": (it.get("correct_type") or "").strip().lower(),
+                "ai": (it.get("ai_correct_type") or "").strip().lower(),
+                "deepseek": ds,
+                "machine": (it.get("rule_type") or "").strip().lower(),
+                "source": it.get("source_type"),
+            })
+        for i in range(0, len(tagged), args.concurrency):
+            await asyncio.gather(*(one(it) for it in tagged[i : i + args.concurrency]))
+            print(f"  …{len(rows)}/{len(tagged)}", flush=True)
+
+    judged = [r for r in rows if r["deepseek"]]
+    print(f"\n=== INTER-RATER AGREEMENT on rule_role ({len(judged)} judged) ===")
+    print("  ai↔human       (anchored baseline):   %d/%d = %.0f%%" % _agree(judged, "ai", "human"))
+    print("  deepseek↔human (INDEPENDENT — key):    %d/%d = %.0f%%" % _agree(judged, "deepseek", "human"))
+    print("  deepseek↔ai    (cross-model):          %d/%d = %.0f%%" % _agree(judged, "deepseek", "ai"))
+    una = [r for r in judged if r["human"] == r["ai"] == r["deepseek"]]
+    print(f"  3-way unanimous (human=ai=deepseek):   {len(una)}/{len(judged)} = {len(una)/max(1,len(judged)):.0%}")
+
+    print("\n=== where the INDEPENDENT judge disagrees with the human (the real signal) ===")
+    mm = Counter((r["human"], r["deepseek"]) for r in judged if r["human"] != r["deepseek"])
+    for (h, d), n in mm.most_common():
+        print(f"  human={h} → deepseek={d}: {n}")
+
+    # COARSE axis: is this a generalizable rule at all? (holding/interpretive/
+    # procedural collapse to one class) vs the non-generalizable markers
+    # (application/obiter). If fine-grained agreement is low but coarse is high,
+    # the disagreement is a cosmetic sub-distinction, not a meaningful one.
+    GEN = {"holding", "interpretive", "procedural"}
+    def coarse(v): return "rule" if v in GEN else ("nonrule" if v in {"application", "obiter"} else None)
+    for r in judged:
+        r["human_c"], r["deepseek_c"], r["ai_c"] = coarse(r["human"]), coarse(r["deepseek"]), coarse(r["ai"])
+    print("\n=== COARSE agreement (generalizable-rule vs application/obiter) ===")
+    print("  deepseek↔human (coarse):   %d/%d = %.0f%%" % _agree(judged, "deepseek_c", "human_c"))
+    print("  ai↔human       (coarse):   %d/%d = %.0f%%" % _agree(judged, "ai_c", "human_c"))
+
+    Path("/tmp/goldset_judge_raw.json").write_text(json.dumps(rows, ensure_ascii=False, indent=1), encoding="utf-8")
+    print("\nraw judgments → /tmp/goldset_judge_raw.json")
+    return 0
+
+
+if __name__ == "__main__":
+    ap = argparse.ArgumentParser(description=__doc__,
+                                 formatter_class=argparse.RawDescriptionHelpFormatter)
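+    ap.add_argument("--batch", default="default", help="gold-set batch to read")
+    ap.add_argument("--model", default="deepseek-chat", help="deepseek-chat | deepseek-reasoner")
+    ap.add_argument("--limit", type=int, default=0, help="judge only the first N tagged items (0 = all)")
+    ap.add_argument("--concurrency", type=int, default=6, help="maximum concurrent API requests")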
+    sys.exit(asyncio.run(main(ap.parse_args())))