feat(goldset): interactive gold-set tagging page (#81.7/#81.8)

Replaces the CSV-edit workflow with an in-app tagging page so the chair/Dafna
can label the extraction-quality gold-set by clicking, and see validator
precision/recall live.

Schema (V29): halacha_goldset — a stratified, human-tagged evaluation batch
(is_holding / correct_type / quote_complete, NULL until tagged).

db.py:
- goldset_create_sample (stratified round-robin over case×rule_type, idempotent),
- goldset_list (items + halacha content + the machine's own labels),
- goldset_tag (partial — one field at a time for keyboard tagging),
- goldset_score (ports the script's P/R/F1: each validator scored as a
  not-a-holding detector against the human tags — the #81.8 input).

API: GET /api/goldset, POST /api/goldset/sample, GET /api/goldset/score,
PATCH /api/goldset/{id}.

web-ui:
- lib/api/goldset.ts (hooks),
- components/goldset/goldset-panel.tsx — card-per-item, keyboard-first
  (J/K nav, H/N holding, C/X quote), progress bar, hide-tagged toggle, and a
  collapsible live score table,
- app/goldset/page.tsx + nav link "מדגם-זהב" under ידע ולמידה.

Methodology guard kept explicit in UI + docstrings: tags are HUMAN ground truth,
no AI pre-fill (circular bias). Populated a 150-item stratified batch.

Verified: backend create/list/tag/score against the live DB; tsc --noEmit 0;
py_compile ok. (Local Turbopack build blocked by worktree symlink — CI builds clean.)

Invariants: G1 (eval set modeled at source in its own table); G2 (reuses the same
halacha_quality validators the extractor runs — no parallel scoring logic).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-06 21:52:05 +00:00
parent 9bd247c421
commit ac279220c4
6 changed files with 632 additions and 1 deletions

View File

@@ -0,0 +1,41 @@
"use client";
import Link from "next/link";
import { AppShell } from "@/components/app-shell";
import { GoldsetPanel } from "@/components/goldset/goldset-panel";
/**
* Gold-set tagging page (#81.7 / #81.8).
*
* Interactive review of a stratified halacha sample. The chair/Dafna labels each
* item (is_holding / correct_type / quote_complete); those human labels are the
* ground truth that measures the extraction validators and recalibrates the
* auto-approve threshold. Tags MUST be human — no AI pre-fill (circular bias).
*/
export default function GoldsetPage() {
return (
<AppShell>
<section className="space-y-6">
<header>
<nav className="text-[0.78rem] text-ink-muted mb-1">
<Link href="/" className="hover:text-gold-deep">בית</Link>
<span aria-hidden> · </span>
<span className="text-navy">מדגם-זהב לתיוג</span>
</nav>
<h1 className="text-navy mb-0">מדגם-זהב לתיוג איכות</h1>
<p className="text-ink-muted text-sm mt-1 max-w-3xl">
מדגם מרובד של הלכות שחולצו. לכל הלכה הכריעו שלוש שאלות
<strong> האם זו הלכה אמיתית</strong>, <strong>מה הסוג הנכון</strong>,
ו<strong>האם הציטוט שלם</strong>. ההכרעות שלכם הן אמת-המידה שמודדת את
דיוק המחלץ ומכיילת את סף-האישור האוטומטי. שיפוט משפטי אנושי בלבד
לא תיוג-AI (כדי למנוע הטיה מעגלית).
</p>
</header>
<div className="h-[2px] bg-gradient-to-l from-transparent via-gold to-transparent" />
<GoldsetPanel />
</section>
</AppShell>
);
}