The old independent toggles had a trap: clicking "אי-הסכמות AI" set a filter,
and once all disagreements were resolved the toggle button disappeared
(rendered only when count>0) while the filter stayed ON — so the list showed
zero items and the untagged ones were unreachable.
Replaced hideTagged + disagreeOnly with one mutually-exclusive segmented
control: הכל / לא תויגו / תויגו / ⚠ אי-הסכמות, each with a live count and always
visible. No stuck state; "לא תויגו" makes the remaining work obvious.
Verified: tsc --noEmit 0.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The chair wanted an independent recommendation beside each tag, to reconsider
his own judgments. Adds a NON-ground-truth AI second-opinion:
- schema: halacha_goldset.ai_is_holding / ai_correct_type / ai_rationale /
ai_generated_at (additive).
- db.goldset_set_ai_recommendation + goldset_list now returns the ai_* fields.
- scripts/goldset_ai_recommend.py — local claude_session judges is_holding +
type + a one-line rationale per item, INDEPENDENTLY (own legal rubric).
Independent of the rule-based validators #81.8 measures → no circularity.
Never auto-applied; QA aid only.
- web-ui: each card shows "🤖 המלצת AI: הלכה/לא · type" + rationale and an
agreement/disagreement chip vs the human tag (amber on disagree); a
"⚠ אי-הסכמות AI (N)" filter to review only the conflicts.
Methodology note kept explicit: the human stays the ground truth; the AI is a
prompt to reconsider, not to copy.
Verified: tsc --noEmit 0; generator stores recs and flags disagreements with
existing human tags.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The validator score panel was collapsed by default, so taggers thought nothing
was happening. Now open by default, with a caption explaining the metrics
measure "not-a-holding" detection and become meaningful as more "לא הלכה" items
are tagged (showing the current negative count while it's small).
Verified: tsc --noEmit 0.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Tagging is easier one source-type at a time. goldset_list now returns
case_law.source_type; the page adds:
- a filter (הכל / פסקי דין / ועדת ערר) with live counts,
- a group-sort so even in "הכל" all court rulings come first, then all
committee decisions,
- a per-card source badge (פסק-דין / ועדת ערר).
Verified: tsc --noEmit 0; source_type splits the live batch 58 court / 92 committee.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
TYPE_HELP popover now follows the same order as the type buttons:
מחייבת · פרשני · יישום · אמרת-אגב · פרוצדורלי · משכנע.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
"לא הלכה" + "מחייבת" (or any holding-type) is a logical contradiction — binding
means it IS the holding. Likewise "הלכה" + application/obiter. The three controls
are independent, so the combo was clickable with no signal.
Adds a non-blocking amber warning under the type buttons when is_holding and
correct_type contradict (holding ↔ binding/interpretive/procedural/persuasive;
not-holding ↔ application/obiter). Soft by design — flags the inconsistency for
the tagger to fix without forcing, leaving room for genuine edge cases.
Verified: tsc --noEmit exits 0.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two UX fixes on the gold-set tagging page:
1. isTagged now requires is_holding AND correct_type AND quote_complete — not
just is_holding. Previously, in "hide tagged" mode the card vanished the
instant is_holding was clicked, so the type and quote-complete answers could
never be set. The progress counter / "תויג" badge now reflect full tagging.
2. An info (ℹ) icon next to "הסוג הנכון" opens a popover explaining the six
rule types (definition + the deciding test + an example each), so the tagger
has the criteria in front of them while tagging.
Verified: tsc --noEmit exits 0.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replaces the CSV-edit workflow with an in-app tagging page so the chair/Dafna
can label the extraction-quality gold-set by clicking, and see validator
precision/recall live.
Schema (V29): halacha_goldset — a stratified, human-tagged evaluation batch
(is_holding / correct_type / quote_complete, NULL until tagged).
db.py:
- goldset_create_sample (stratified round-robin over case×rule_type, idempotent),
- goldset_list (items + halacha content + the machine's own labels),
- goldset_tag (partial — one field at a time for keyboard tagging),
- goldset_score (ports the script's P/R/F1: each validator scored as a
not-a-holding detector against the human tags — the #81.8 input).
API: GET /api/goldset, POST /api/goldset/sample, GET /api/goldset/score,
PATCH /api/goldset/{id}.
web-ui:
- lib/api/goldset.ts (hooks),
- components/goldset/goldset-panel.tsx — card-per-item, keyboard-first
(J/K nav, H/N holding, C/X quote), progress bar, hide-tagged toggle, and a
collapsible live score table,
- app/goldset/page.tsx + nav link "מדגם-זהב" under ידע ולמידה.
Methodology guard kept explicit in UI + docstrings: tags are HUMAN ground truth,
no AI pre-fill (circular bias). Populated a 150-item stratified batch.
Verified: backend create/list/tag/score against the live DB; tsc --noEmit 0;
py_compile ok. (Local Turbopack build blocked by worktree symlink — CI builds clean.)
Invariants: G1 (eval set modeled at source in its own table); G2 (reuses the same
halacha_quality validators the extractor runs — no parallel scoring logic).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>