feat(goldset): AI second-opinion per item (QA aid) — compare vs human tag

The chair wanted an independent recommendation beside each tag, so that he can
reconsider his own judgments. Adds a NON-ground-truth AI second-opinion:

- schema: halacha_goldset.ai_is_holding / ai_correct_type / ai_rationale /
  ai_generated_at (additive).
- db.goldset_set_ai_recommendation + goldset_list now returns the ai_* fields.
- scripts/goldset_ai_recommend.py: local claude_session judges is_holding +
  type + a one-line rationale per item, INDEPENDENTLY (own legal rubric).
  Independent of the rule-based validators #81.8 measures → no circularity.
  Never auto-applied; QA aid only.
- web-ui: each card shows "🤖 המלצת AI: הלכה/לא · type" ("AI recommendation:
  holding/not · type") + rationale and an agreement/disagreement chip vs the
  human tag (amber on disagree); a "⚠ אי-הסכמות AI (N)" ("AI disagreements (N)")
  filter to review only the conflicts.

Methodology note kept explicit: the human stays the ground truth; the AI is a
prompt to reconsider, not to copy.

Verified: tsc --noEmit 0; generator stores recs and flags disagreements with
existing human tags.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
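
The agreement/disagreement rule described in the message can be sketched in Python as below. This is an illustrative mirror of the `aiDisagrees` helper in the web-ui diff, not code from this commit; the `Item` type and the `ai_disagrees` name are assumptions made for the example. An item only counts as a conflict when an AI opinion exists and both sides have actually filled in the field being compared.

```python
from typing import List, Optional, TypedDict


class Item(TypedDict):
    """Hypothetical subset of a gold-set row, named after the ai_* columns."""
    is_holding: Optional[bool]      # human tag (None = not tagged yet)
    correct_type: Optional[str]     # human tag ("" / None = not tagged yet)
    ai_is_holding: Optional[bool]   # AI second opinion
    ai_correct_type: str            # "" when the AI type was rejected/missing
    ai_generated_at: Optional[str]  # None = no AI opinion stored


def ai_disagrees(it: Item) -> bool:
    """True when the AI opinion conflicts with the human tag on holding or type."""
    if not it["ai_generated_at"]:
        return False  # no AI opinion yet, so nothing to compare
    hold_diff = (
        it["is_holding"] is not None
        and it["ai_is_holding"] is not None
        and it["is_holding"] != it["ai_is_holding"]
    )
    type_diff = (
        bool(it["correct_type"])
        and bool(it["ai_correct_type"])
        and it["correct_type"] != it["ai_correct_type"]
    )
    return hold_diff or type_diff


def conflicts_only(items: List[Item]) -> List[Item]:
    """Equivalent of the 'disagreements only' filter: keep conflicting items."""
    return [it for it in items if ai_disagrees(it)]
```

Untagged items never register as conflicts under this rule, which keeps the filter focused on judgments the human has already made.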
2026-06-07 14:24:35 +00:00
parent a0c1b74c55
commit 0e35060d3d
5 changed files with 184 additions and 3 deletions
--- a/mcp-server/src/legal_mcp/services/db.py
+++ b/mcp-server/src/legal_mcp/services/db.py
@@ -1275,6 +1275,15 @@ CREATE TABLE IF NOT EXISTS halacha_goldset (
    UNIQUE (halacha_id, batch)
 );
 CREATE INDEX IF NOT EXISTS idx_goldset_batch ON halacha_goldset(batch);
+
+-- AI second-opinion (a QA aid, NOT ground truth): an INDEPENDENT local-LLM
+-- judgment shown beside the human tag so the chair can spot disagreements and
+-- reconsider. Independent of the rule-based validators that #81.8 measures, so
+-- no circularity. Generated locally (claude_session); never auto-applied.
+ALTER TABLE halacha_goldset ADD COLUMN IF NOT EXISTS ai_is_holding BOOLEAN;
+ALTER TABLE halacha_goldset ADD COLUMN IF NOT EXISTS ai_correct_type TEXT DEFAULT '';
+ALTER TABLE halacha_goldset ADD COLUMN IF NOT EXISTS ai_rationale TEXT DEFAULT '';
+ALTER TABLE halacha_goldset ADD COLUMN IF NOT EXISTS ai_generated_at TIMESTAMPTZ;
 """


@@ -4338,6 +4347,7 @@ async def goldset_list(batch: str = "default") -> list[dict]:
    rows = await pool.fetch(
        "SELECT g.id, g.halacha_id::text AS halacha_id, g.is_holding, "
        "       g.correct_type, g.quote_complete, g.tagged_by, g.tagged_at, "
+        "       g.ai_is_holding, g.ai_correct_type, g.ai_rationale, g.ai_generated_at, "
        "       h.rule_statement, h.supporting_quote, h.reasoning_summary, "
        "       h.rule_type, h.confidence, h.quality_flags, h.review_status, "
        "       cl.case_number, cl.case_name, cl.source_type "
@@ -4350,12 +4360,27 @@ async def goldset_list(batch: str = "default") -> list[dict]:
        d = dict(r)
        if d.get("tagged_at") is not None:
            d["tagged_at"] = d["tagged_at"].isoformat()
+        if d.get("ai_generated_at") is not None:
+            d["ai_generated_at"] = d["ai_generated_at"].isoformat()
        if d.get("confidence") is not None:
            d["confidence"] = float(d["confidence"])
        out.append(d)
    return out


+async def goldset_set_ai_recommendation(
+    goldset_id: UUID, *, ai_is_holding: bool | None,
+    ai_correct_type: str = "", ai_rationale: str = "",
+) -> None:
+    """Store the independent AI second-opinion for a gold-set item (QA aid)."""
+    pool = await get_pool()
+    await pool.execute(
+        "UPDATE halacha_goldset SET ai_is_holding = $2, ai_correct_type = $3, "
+        "ai_rationale = $4, ai_generated_at = now() WHERE id = $1",
+        goldset_id, ai_is_holding, ai_correct_type, ai_rationale,
+    )
+
+
 async def goldset_tag(
    goldset_id: UUID, *, is_holding: bool | None = None,
    correct_type: str | None = None, quote_complete: bool | None = None,
--- a/scripts/SCRIPTS.md
+++ b/scripts/SCRIPTS.md
@@ -38,7 +38,8 @@
 | `rechunk_legacy_precedents.py` | python | **#57**: re-chunk + re-embed case law that was ingested before the chunker fix (#55). Selects every `case_law` with a tiny chunk (`length(trim(content))<50`, the fingerprint of the old chunker) and runs `ingest.reindex_case_law` (re-chunk+re-embed from the stored `full_text` only, without re-OCR/LLM, feedback_no_reocr_retrofit; idempotent DELETE-then-INSERT). Idempotent at the batch level (re-fetches the affected set on every run). Flag `--limit N`. Runs with the mcp-server venv (`cd mcp-server && .venv/bin/python ../scripts/rechunk_legacy_precedents.py`) | one-off: data migration of legacy case law (fixed 2026-06-03) |
 | `backfill_nevo_preamble.py` | python | **#86.2**: data migration that strips the Nevo preamble/ratio which leaked into case law ingested before fix #86.1. Finds every `case_law` that `strip_nevo_preamble(full_text)` still shortens (historical leak) and performs: (1) capture of the mini-ratio into `case_law.nevo_ratio` (gold-set for #86.3); (2) rewrite of the stripped `full_text` + recomputation of `content_hash`; (3) `reindex_case_law` (re-chunk+embed, without re-OCR/LLM); (4) **flagging (not deleting)** halachot whose `supporting_quote` lies inside the removed preamble → `pending_review` + quality_flag `nevo_preamble_leak`. **Safety guard:** rows with keep%<`--min-keep` (default 60) are excluded from `--apply` as suspected over-strip (unless `--include-suspicious`). **dry-run by default**; `--apply` first writes a backup JSON + manifest CSV to `data/audit/`. Idempotent. Runs with the mcp-server venv. **chair-gated** (verify the manifest before apply) | data migration: dry-run done (19 rulings, 27 contaminated halachot); apply awaits approval |
 | `nevo_ratio_benchmark.py` | python | **#86.3**: measures halacha-extraction quality against Nevo's mini-ratio (a free professional gold-set). For each ruling with `nevo_ratio` (or derived from `full_text` if the backfill has not run yet): a local LLM judge (`claude_session`, zero cost) semantically maps the system's halachot to Nevo's halachot and produces **recall** (coverage of Nevo's halachot), **precision** (share of our halachot that are mapped), **granularity** (decomposition ratio, an over-extraction signal for #81.5). `--case <num>` / `--all [--limit N]` / `--model` / `--out`. Writes a CSV to `data/audit/`. Runs with the mcp-server venv (requires a local Claude CLI). Verified on בג"ץ 1764/05 (HCJ): recall 0.875, precision 1.0, granularity 1.75x | manual: quality measurement (CI/ad-hoc) |
-| `halacha_goldset.py` | python | **#81.7**: gold-set harness for halacha-extraction quality. `export --n N` exports a stratified sample (by precedent×rule_type) to CSV with empty tagging columns (`is_holding`/`correct_type`/`quote_complete`) for manual tagging (Chaim/Dafna). `score --in <csv>` reads the tagged CSV and measures each validator (`compute_quality_flags`/`is_fact_dependent`/`is_quote_truncated`/`is_thin_restatement`) against the human benchmark: P/R/F1 + confusion. Basis for #81.8 (calibrating the approval threshold). Imports the same validators that the extractor runs. Runs with the mcp-server venv | manual: export→tag→score |
+| `halacha_goldset.py` | python | **#81.7**: gold-set harness for halacha-extraction quality. `export --n N` exports a stratified sample (by precedent×rule_type) to CSV with empty tagging columns (`is_holding`/`correct_type`/`quote_complete`) for manual tagging (Chaim/Dafna). `score --in <csv>` reads the tagged CSV and measures each validator (`compute_quality_flags`/`is_fact_dependent`/`is_quote_truncated`/`is_thin_restatement`) against the human benchmark: P/R/F1 + confusion. Basis for #81.8 (calibrating the approval threshold). Imports the same validators that the extractor runs. Runs with the mcp-server venv. **Note:** there is also an interactive DB-backed tagging page (`/goldset`); this script is the CSV fallback | manual: export→tag→score |
+| `goldset_ai_recommend.py` | python | **#81.7 QA**: generates an **AI second opinion** (local claude, zero cost) for every item in `halacha_goldset`: `is_holding`+`type`+rationale, stored in `ai_*` and shown on the page beside the human tag to surface disagreements. **Independent** of the validators being measured (no circularity) and **not** applied automatically. `--force` (regenerate)/`--limit N`. **Must run locally** (claude_session). | manual: after creating/extending a batch |
 | `halacha_batch_reconcile.py` | python | **#82.7**: offline cross-ruling dedup (conservative, **dry-run only**). dedup-on-insert compares only within a ruling; here the threshold is stricter (cosine ≥0.95, `--cosine`) and non-destructive: finds near-duplicate halacha pairs across different rulings (pgvector `<=>` exact) with a lexical signal (Jaccard/Levenshtein) and reports them to a CSV in `data/audit/` for the chair's review. Does not skip/merge/delete. `--include-pending`. **`--link`** records the pairs found as `equivalent_halachot` (parallel authority, #84.2: a halacha-level parallel link, **not** a citation; idempotent, non-destructive). Runs with the mcp-server venv. Verified: 800 halachot → 5 pairs (linked). | manual: review report / `--link` for linking |
 | `calibrate_halacha_dedup.py` | python | **#82.1**: calibrates the lexical dedup thresholds (#82.3) against the cleanup gold-set. Reads `halacha-cleanup-manifest-*.csv` (human-tagged duplicate↔survivor pairs), loads the survivor text from the DB, and sweeps (jaccard_min × levenshtein_min) with P/R/F1, marking the configured operating point. Verified that (0.55, 0.70) → **precision 1.0** (zero false-merge), recall 0.30, which suits a secondary signal that blocks auto-approve. `--manifest <path>`. Runs with the mcp-server venv | one-off: calibration (done 2026-06-06) |
 | `audit_corpus_integrity.py` | python | Periodic corpus-consistency check: 3 read-only SQL checks on `case_law` and `cases`: (A) `external_upload` with an internal prefix `ערר`/`בל"מ`; (B) `internal_committee` missing `chair_name`/`district`; (C) `cases.practice_area` outside {`rishuy_uvniya`, `betterment_levy`, `compensation_197`, `''`}. Writes a cumulative log to `data/logs/corpus_integrity_audit.log` and, when violations are found, sends a wakeup to the CEO in Paperclip (best-effort, only if `PAPERCLIP_API_URL`+`PAPERCLIP_API_KEY` are set). Flag: `--no-notify`. Idempotent, exits 0. **Daily cron 07:00**: `0 7 * * * /home/chaim/legal-ai/mcp-server/.venv/bin/python /home/chaim/legal-ai/scripts/audit_corpus_integrity.py` | `0 7 * * *` (cron) |
--- a/scripts/goldset_ai_recommend.py
+++ b/scripts/goldset_ai_recommend.py
@@ -0,0 +1,100 @@
+#!/usr/bin/env python3
+"""Generate the AI second-opinion for gold-set items (#81.7 QA aid).
+
+For each gold-set halacha, an INDEPENDENT local-LLM (claude_session, zero cost)
+judges: is it a real generalizable holding, what is its correct rule_type, and a
+one-line rationale. Stored in halacha_goldset.ai_* and shown beside the human
+tag so the chair can spot disagreements and reconsider.
+
+This is a QA aid, NOT ground truth and NOT auto-applied. It is also independent
+of the rule-based validators that #81.8 measures, so it doesn't bias that score.
+
+Must run locally (claude_session needs the local CLI — not the container):
+
+    cd ~/legal-ai/mcp-server
+    .venv/bin/python ../scripts/goldset_ai_recommend.py            # missing only
+    .venv/bin/python ../scripts/goldset_ai_recommend.py --force    # regenerate all
+    .venv/bin/python ../scripts/goldset_ai_recommend.py --limit 10 # smoke
+"""
+from __future__ import annotations
+
+import argparse
+import asyncio
+import sys
+from uuid import UUID
+
+from legal_mcp.services import claude_session, db
+
+VALID_TYPES = {"binding", "interpretive", "obiter", "application", "procedural", "persuasive"}
+
+SYSTEM = (
+    "אתה בוחן-איכות משפטי המסווג 'הלכות' שחולצו מהחלטות ועדת-ערר ומפסקי-דין. "
+    "לכל פריט הכרע שתי שאלות, באופן עצמאי ולפי המהות:\n"
+    "1) is_holding — האם זו הלכה אמיתית בת-הכללה ובת-הסתמכות (true), או שזו יישום "
+    "תלוי-עובדות / אמרת-אגב / ציטוט-עובדה ולא כלל בר-הכללה (false).\n"
+    "2) type — הסוג הנכון: 'binding' (עיקרון הכרחי להכרעה), 'interpretive' (פרשנות "
+    "חוק/מונח/תכנית), 'procedural' (סדר-דין: מועדים/סמכות/מיצוי/נטל), 'persuasive' "
+    "(אסמכתה לא-מחייבת), 'application' (החלה על עובדות התיק — לרוב לא-הלכה), "
+    "'obiter' (אמרת-אגב שלא הוכרעה — לא-הלכה).\n"
+    "עקביות: is_holding=true → binding/interpretive/procedural/persuasive; "
+    "is_holding=false → application/obiter.\n"
+    'החזר JSON בלבד: {"is_holding": true/false, "type": "<אחד מהשישה>", '
+    '"rationale": "<משפט אחד קצר בעברית>"}. ללא markdown.'
+)
+
+
+def _prompt(item: dict) -> str:
+    src = "פסק-דין" if item.get("source_type") == "court_ruling" else "החלטת ועדת-ערר"
+    return (
+        f"מקור: {src} ({item.get('case_number') or ''}).\n"
+        f"סוג שהמכונה נתנה: {item.get('rule_type')}.\n\n"
+        f"ניסוח הכלל:\n{item.get('rule_statement') or ''}\n\n"
+        f"ציטוט תומך:\n{item.get('supporting_quote') or ''}"
+    )
+
+
+async def main(args: argparse.Namespace) -> int:
+    items = await db.goldset_list(args.batch)
+    todo = [it for it in items if args.force or not it.get("ai_generated_at")]
+    if args.limit:
+        todo = todo[: args.limit]
+    print(f"gold-set {args.batch}: {len(items)} items, {len(todo)} to recommend", flush=True)
+
+    ok, fail, disagree = 0, 0, 0
+    for i, it in enumerate(todo, 1):
+        try:
+            v = await claude_session.query_json(_prompt(it), system=SYSTEM, effort="low")
+        except Exception as e:  # noqa: BLE001
+            fail += 1
+            print(f"[{i}/{len(todo)}] {it['case_number']}: FAIL {e}", flush=True)
+            continue
+        if not isinstance(v, dict):
+            fail += 1
+            continue
+        ai_hold = bool(v.get("is_holding"))
+        ai_type = str(v.get("type") or "").strip()
+        if ai_type not in VALID_TYPES:
+            ai_type = ""
+        await db.goldset_set_ai_recommendation(
+            UUID(str(it["id"])), ai_is_holding=ai_hold, ai_correct_type=ai_type,
+            ai_rationale=str(v.get("rationale") or "")[:300],
+        )
+        ok += 1
+        # note disagreements with the human tag (if tagged)
+        flag = ""
+        if it.get("is_holding") is not None and it["is_holding"] != ai_hold:
+            disagree += 1
+            flag = "  ⚠ DISAGREE is_holding"
+        print(f"[{i}/{len(todo)}] {it['case_number']}: ai={ai_hold}/{ai_type}{flag}", flush=True)
+
+    print(f"\nDONE — {ok} stored, {fail} failed, {disagree} disagree with existing human tag",
+          flush=True)
+    return 0
+
+
+if __name__ == "__main__":
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--batch", default="default")
+    ap.add_argument("--force", action="store_true", help="regenerate even if present")
+    ap.add_argument("--limit", type=int, default=None)
+    sys.exit(asyncio.run(main(ap.parse_args())))
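
A stricter variant of the verdict handling in `goldset_ai_recommend.py` could be sketched as below. It is not part of this commit: `normalize_verdict` and `HOLDING_TYPES` are hypothetical names, and the consistency check simply restates the rule given to the model in `SYSTEM` (a holding maps to binding/interpretive/procedural/persuasive, a non-holding to application/obiter). The sketch assumes the model reply has already been parsed from JSON.

```python
from typing import Any, Dict, Optional

# Same six labels as VALID_TYPES in the script.
VALID_TYPES = {"binding", "interpretive", "obiter", "application", "procedural", "persuasive"}
# Assumption for this sketch: the types that the SYSTEM rubric pairs with is_holding=true.
HOLDING_TYPES = {"binding", "interpretive", "procedural", "persuasive"}


def normalize_verdict(v: Any) -> Optional[Dict[str, Any]]:
    """Validate a parsed model reply; return None when it cannot be trusted."""
    if not isinstance(v, dict):
        return None
    hold = v.get("is_holding")
    # Require a real JSON boolean: a string such as "false" is truthy in Python,
    # so coercing it would silently flip the verdict.
    if not isinstance(hold, bool):
        return None
    ai_type = str(v.get("type") or "").strip()
    if ai_type not in VALID_TYPES:
        ai_type = ""  # unknown label: keep the holding verdict, drop the type
    # An empty type cannot contradict the holding verdict.
    consistent = ai_type == "" or ((ai_type in HOLDING_TYPES) == hold)
    return {
        "is_holding": hold,
        "type": ai_type,
        "rationale": str(v.get("rationale") or "")[:300],  # same 300-char cap as the script
        "consistent": consistent,
    }
```

A caller could count a `None` result as a failure and log rows where `consistent` is false, so that self-contradictory verdicts are visible before they reach the tagging page.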
--- a/web-ui/src/components/goldset/goldset-panel.tsx
+++ b/web-ui/src/components/goldset/goldset-panel.tsx
@@ -67,6 +67,16 @@ function isTagged(it: GoldsetItem): boolean {
  return it.is_holding !== null && it.quote_complete !== null && !!it.correct_type;
 }

+// The AI second-opinion disagrees with the human tag (on is_holding or type).
+function aiDisagrees(it: GoldsetItem): boolean {
+  if (!it.ai_generated_at) return false;
+  const holdDiff = it.is_holding !== null && it.ai_is_holding !== null
+    && it.is_holding !== it.ai_is_holding;
+  const typeDiff = !!it.correct_type && !!it.ai_correct_type
+    && it.correct_type !== it.ai_correct_type;
+  return holdDiff || typeDiff;
+}
+
 // ─── Score panel ──────────────────────────────────────────────────────────────

 function ScorePanel({ batch }: { batch: string }) {
@@ -248,6 +258,36 @@ function TagCard({
        &ldquo;{it.supporting_quote}&rdquo;
      </blockquote>

+      {it.ai_generated_at && (() => {
+        const aiType = TYPES.find((t) => t.value === it.ai_correct_type)?.label ?? it.ai_correct_type;
+        const holdDisagree = it.is_holding !== null && it.ai_is_holding !== null
+          && it.is_holding !== it.ai_is_holding;
+        const typeDisagree = !!it.correct_type && !!it.ai_correct_type
+          && it.correct_type !== it.ai_correct_type;
+        const anyTag = it.is_holding !== null || !!it.correct_type;
+        return (
+          <div className={`rounded-md border p-2.5 text-[0.78rem] space-y-1
+            ${holdDisagree ? "border-amber-400 bg-amber-50" : "border-rule bg-rule-soft/20"}`} dir="rtl">
+            <div className="flex items-center gap-2 flex-wrap">
+              <span className="font-semibold text-navy">🤖 המלצת AI:</span>
+              <span>{it.ai_is_holding ? "הלכה" : "לא הלכה"}</span>
+              {aiType && <span className="text-ink-muted">· {aiType}</span>}
+              {anyTag && (
+                <span className={`ms-auto text-[0.7rem] px-1.5 py-0.5 rounded
+                  ${holdDisagree || typeDisagree
+                    ? "bg-amber-100 text-amber-800"
+                    : "bg-emerald-50 text-emerald-700"}`}>
+                  {holdDisagree ? "⚠ חולק על 'הלכה/לא'"
+                    : typeDisagree ? "⚠ חולק על הסוג"
+                    : "✓ מסכים איתך"}
+                </span>
+              )}
+            </div>
+            {it.ai_rationale && <div className="text-ink-soft leading-relaxed">{it.ai_rationale}</div>}
+          </div>
+        );
+      })()}
+
      <div className="grid gap-3 sm:grid-cols-3 pt-1 border-t border-rule-soft">
        {/* is_holding */}
        <div>
@@ -308,11 +348,13 @@ export function GoldsetPanel() {
  const createSample = useCreateGoldsetSample(batch);
  const [focusedId, setFocusedId] = useState<string | null>(null);
  const [hideTagged, setHideTagged] = useState(false);
+  const [disagreeOnly, setDisagreeOnly] = useState(false);
  const [sourceFilter, setSourceFilter] =
    useState<"all" | "court_ruling" | "appeals_committee">("all");

  const items = useMemo(() => data?.items ?? [], [data]);
  const taggedCount = items.filter(isTagged).length;
+  const disagreeCount = items.filter(aiDisagrees).length;
  const sourceCounts = useMemo(() => ({
    court_ruling: items.filter((i) => i.source_type === "court_ruling").length,
    appeals_committee: items.filter((i) => i.source_type === "appeals_committee").length,
@@ -321,11 +363,12 @@ export function GoldsetPanel() {
    let v = items;
    if (sourceFilter !== "all") v = v.filter((i) => i.source_type === sourceFilter);
    if (hideTagged) v = v.filter((i) => !isTagged(i));
+    if (disagreeOnly) v = v.filter(aiDisagrees);
    // group-sort: all court rulings together, then all appeals-committee decisions (clear separation).
    const order = (s: string | null) =>
      s === "court_ruling" ? 0 : s === "appeals_committee" ? 1 : 2;
    return [...v].sort((a, b) => order(a.source_type) - order(b.source_type));
-  }, [items, hideTagged, sourceFilter]);
+  }, [items, hideTagged, sourceFilter, disagreeOnly]);

  const focused = focusedId ? visible.find((i) => i.id === focusedId) ?? null : null;

@@ -424,7 +467,14 @@ export function GoldsetPanel() {
          {" "}· הלכה <kbd className="bg-rule-soft px-1.5 rounded">H</kbd> / לא <kbd className="bg-rule-soft px-1.5 rounded">N</kbd>
          {" "}· ציטוט שלם <kbd className="bg-rule-soft px-1.5 rounded">C</kbd> / קטוע <kbd className="bg-rule-soft px-1.5 rounded">X</kbd>
        </span>
-        <Button size="sm" variant="ghost" className="ms-auto" onClick={() => setHideTagged((v) => !v)}>
+        {disagreeCount > 0 && (
+          <Button size="sm" variant={disagreeOnly ? "default" : "ghost"}
+            className={disagreeOnly ? "ms-auto bg-amber-500 text-white hover:bg-amber-600" : "ms-auto text-amber-700"}
+            onClick={() => setDisagreeOnly((v) => !v)}>
+            ⚠ אי-הסכמות AI ({disagreeCount})
+          </Button>
+        )}
+        <Button size="sm" variant="ghost" className={disagreeCount > 0 ? "" : "ms-auto"} onClick={() => setHideTagged((v) => !v)}>
          {hideTagged ? "הצג הכל" : "הסתר מתויגים"}
        </Button>
      </div>
--- a/web-ui/src/lib/api/goldset.ts
+++ b/web-ui/src/lib/api/goldset.ts
@@ -29,6 +29,11 @@ export type GoldsetItem = {
  case_number: string | null;
  case_name: string | null;
  source_type: string | null;  // 'court_ruling' | 'appeals_committee' | ''
+  // AI second-opinion (QA aid — independent, not ground truth, not auto-applied)
+  ai_is_holding: boolean | null;
+  ai_correct_type: string;
+  ai_rationale: string;
+  ai_generated_at: string | null;
 };

 export type GoldsetScore = {