feat(goldset): interactive gold-set tagging page (#81.7/#81.8)

Replaces the CSV-edit workflow with an in-app tagging page so the chair/Dafna
can label the extraction-quality gold-set by clicking, and see validator
precision/recall live.

Schema (V29): halacha_goldset — a stratified, human-tagged evaluation batch
(is_holding / correct_type / quote_complete, NULL until tagged).

db.py:
- goldset_create_sample (stratified round-robin over case×rule_type, idempotent),
- goldset_list (items + halacha content + the machine's own labels),
- goldset_tag (partial — one field at a time for keyboard tagging),
- goldset_score (ports the script's P/R/F1: each validator scored as a
  not-a-holding detector against the human tags — the #81.8 input).
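
A minimal sketch of the scoring idea described for goldset_score, under stated assumptions: the positive class is "not a holding", a validator "fires" when it flags an item, and untagged items (is_holding is NULL) are skipped. The names `Score` and `score_detector` are hypothetical and are not taken from db.py; the real function may differ in shape and edge-case handling.

```python
from dataclasses import dataclass


@dataclass
class Score:
    tp: int
    fp: int
    fn: int
    precision: float
    recall: float
    f1: float


def score_detector(pairs):
    """Score one validator as a not-a-holding detector.

    pairs: iterable of (flagged, is_holding) where `flagged` is the
    validator's verdict (True = "this is not a holding") and `is_holding`
    is the human tag (None = not yet tagged, so the item is skipped).
    """
    tp = fp = fn = 0
    for flagged, is_holding in pairs:
        if is_holding is None:
            continue  # no ground truth yet
        actual_positive = not is_holding
        if flagged and actual_positive:
            tp += 1
        elif flagged and not actual_positive:
            fp += 1
        elif not flagged and actual_positive:
            fn += 1
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Score(tp, fp, fn, precision, recall, f1)
```

Because untagged rows are skipped rather than counted as negatives, the numbers only reflect the portion of the batch a human has actually labeled.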

API: GET /api/goldset, POST /api/goldset/sample, GET /api/goldset/score,
PATCH /api/goldset/{id}.
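
The partial tagging behind PATCH /api/goldset/{id} can be illustrated roughly as follows. This is a sketch, assuming that a field left as None means "leave the stored value unchanged"; `TAG_FIELDS`, `build_tag_update` and `apply_tag` are hypothetical helpers, not the db.py implementation.

```python
TAG_FIELDS = ("is_holding", "correct_type", "quote_complete")


def build_tag_update(**fields):
    """Return only the tag fields the caller actually set.

    None is treated as "not provided". The check is `is not None`, not
    truthiness, so an explicit False (e.g. is_holding=False) is kept.
    """
    unknown = set(fields) - set(TAG_FIELDS)
    if unknown:
        raise ValueError(f"unknown tag field(s): {sorted(unknown)}")
    return {k: v for k, v in fields.items() if v is not None}


def apply_tag(row, **fields):
    """Merge a partial update into an existing row dict, returning a new dict."""
    return {**row, **build_tag_update(**fields)}


# One keypress, one field: first H/N, later C/X.
row = {"is_holding": None, "correct_type": None, "quote_complete": None}
row = apply_tag(row, is_holding=False)
row = apply_tag(row, quote_complete=True)
```

One consequence of this convention, if the real code follows it, is that a tag cannot be cleared back to NULL through the same path; that would need a separate reset mechanism.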

web-ui:
- lib/api/goldset.ts (hooks),
- components/goldset/goldset-panel.tsx — card-per-item, keyboard-first
  (J/K nav, H/N holding, C/X quote), progress bar, hide-tagged toggle, and a
  collapsible live score table,
- app/goldset/page.tsx + nav link "מדגם-זהב" ("Gold set") under ידע ולמידה ("Knowledge and learning").

Methodology guard kept explicit in UI + docstrings: tags are HUMAN ground truth,
no AI pre-fill (circular bias). Populated a 150-item stratified batch.
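
For readers unfamiliar with the sampling scheme, here is a rough in-memory sketch of a stratified round-robin over case×rule_type that is idempotent on re-run. It is an illustration only: the item shape (`id`, `case_id`, `rule_type`), the sorted stratum order, and the function name are assumptions, and the real goldset_create_sample works against the database.

```python
from collections import defaultdict


def stratified_round_robin(items, n, already=()):
    """Pick items so the batch totals at most n, cycling over strata.

    items: iterable of dicts with "id", "case_id", "rule_type" (assumed shape).
    already: ids sampled on earlier runs; they count toward n and are never
    re-added, so repeating a call with the same inputs adds nothing.
    """
    taken = set(already)
    strata = defaultdict(list)
    for item in items:
        if item["id"] not in taken:
            strata[(item["case_id"], item["rule_type"])].append(item)
    # Assumption: a fixed (sorted) stratum order keeps the result deterministic.
    queues = [strata[key] for key in sorted(strata)]
    picked = []
    depth = 0
    while len(taken) + len(picked) < n and any(depth < len(q) for q in queues):
        # Take the depth-th item from every stratum that still has one.
        for q in queues:
            if depth < len(q) and len(taken) + len(picked) < n:
                picked.append(q[depth])
        depth += 1
    return picked
```

Taking one item per stratum per pass keeps a large case or a common rule type from dominating the batch, which is the point of stratifying before tagging.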

Verified: backend create/list/tag/score against the live DB; tsc --noEmit exits 0;
py_compile ok. (Local Turbopack build blocked by worktree symlink — CI builds clean.)

Invariants: G1 (eval set modeled at source in its own table); G2 (reuses the same
halacha_quality validators the extractor runs — no parallel scoring logic).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 21:52:05 +00:00
parent 9bd247c421
commit ac279220c4
6 changed files with 632 additions and 1 deletions

@@ -6099,6 +6099,59 @@ async def halacha_equivalents_unlink(halacha_id: str, other_id: str):
    return {"ok": await db.unlink_equivalent_halachot(hid, oid)}


# ── Gold-set tagging (#81.7 / #81.8) ─────────────────────────────────────────

class GoldsetSampleRequest(BaseModel):
    n: int = 150
    batch: str = "default"
    reset: bool = False


class GoldsetTagRequest(BaseModel):
    # Every tag is optional so the UI can save one field per keypress.
    is_holding: bool | None = None
    correct_type: str | None = None
    quote_complete: bool | None = None
    tagged_by: str = "chair"


@app.get("/api/goldset")
async def goldset_list_ep(batch: str = "default"):
    """The gold-set tagging queue (halacha content + machine labels + human tags)."""
    return {"items": await db.goldset_list(batch), "batch": batch}


@app.post("/api/goldset/sample")
async def goldset_sample_ep(req: GoldsetSampleRequest):
    """Create/extend a stratified gold-set batch for tagging (#81.7)."""
    return await db.goldset_create_sample(n=req.n, batch=req.batch, reset=req.reset)


@app.get("/api/goldset/score")
async def goldset_score_ep(batch: str = "default"):
    """Measure the extraction validators against the human tags (#81.8)."""
    return await db.goldset_score(batch)


@app.patch("/api/goldset/{goldset_id}")
async def goldset_tag_ep(goldset_id: str, req: GoldsetTagRequest):
    """Save one human tag on a gold-set item."""
    try:
        gid = UUID(goldset_id)
    except ValueError:
        raise HTTPException(400, "מזהה לא תקין")  # "Invalid identifier"
    if req.correct_type and req.correct_type not in (
        "binding", "interpretive", "obiter", "application", "procedural", "persuasive",
    ):
        raise HTTPException(400, "correct_type לא תקין")  # "Invalid correct_type"
    row = await db.goldset_tag(
        gid, is_holding=req.is_holding, correct_type=req.correct_type,
        quote_complete=req.quote_complete, tagged_by=req.tagged_by,
    )
    if not row:
        raise HTTPException(404, "פריט לא נמצא")  # "Item not found"
    return {"ok": True}


@app.patch("/api/halachot/{halacha_id}")
async def halacha_update(halacha_id: str, req: HalachaUpdateRequest):
    """Approve / reject / edit a halacha. Used by the chair review queue."""