feat(goldset): interactive gold-set tagging page (#81.7/#81.8)

Replaces the CSV-edit workflow with an in-app tagging page so the chair/Dafna can label the extraction-quality gold-set by clicking and see validator precision/recall live.

Schema (V29): halacha_goldset, a stratified, human-tagged evaluation batch (is_holding / correct_type / quote_complete, NULL until tagged).

db.py:
- goldset_create_sample (stratified round-robin over case×rule_type, idempotent),
- goldset_list (items + halacha content + the machine's own labels),
- goldset_tag (partial: one field at a time, for keyboard tagging),
- goldset_score (ports the script's P/R/F1: each validator is scored as a not-a-holding detector against the human tags; this is the #81.8 input).

API: GET /api/goldset, POST /api/goldset/sample, GET /api/goldset/score, PATCH /api/goldset/{id}.

web-ui:
- lib/api/goldset.ts (hooks),
- components/goldset/goldset-panel.tsx: card-per-item, keyboard-first (J/K nav, H/N holding, C/X quote), progress bar, hide-tagged toggle, and a collapsible live score table,
- app/goldset/page.tsx + nav link "מדגם-זהב" ("Gold sample") under ידע ולמידה ("Knowledge and Learning").

Methodology guard kept explicit in UI + docstrings: tags are HUMAN ground truth, no AI pre-fill (circular bias).

Populated a 150-item stratified batch.

Verified: backend create/list/tag/score against the live DB; tsc --noEmit exits 0; py_compile ok. (Local Turbopack build blocked by worktree symlink; CI builds clean.)

Invariants: G1 (eval set modeled at source in its own table); G2 (reuses the same halacha_quality validators the extractor runs, no parallel scoring logic).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
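For readers unfamiliar with the scoring described for goldset_score, the following is a minimal sketch of scoring one validator as a not-a-holding detector against human tags. It is not the code from db.py: the GoldItem shape, the score_validator name, and the validator name "too_short" are all hypothetical and exist only for illustration. A positive here means "the validator flagged the item", and the flag is correct when the human tagged is_holding as false. Untagged items (NULL) are skipped.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class GoldItem:
    # Human ground truth: True = holding, False = not a holding, None = untagged.
    is_holding: Optional[bool]
    # Hypothetical: names of validators that flagged this item as not-a-holding.
    flags: frozenset


def score_validator(items: Iterable[GoldItem], validator: str) -> dict:
    """Precision/recall/F1 of one validator as a not-a-holding detector."""
    tp = fp = fn = 0
    for item in items:
        if item.is_holding is None:
            continue  # untagged items carry no ground truth, so skip them
        flagged = validator in item.flags
        not_holding = not item.is_holding
        if flagged and not_holding:
            tp += 1  # validator flagged it, human agrees it is not a holding
        elif flagged:
            fp += 1  # validator flagged a real holding
        elif not_holding:
            fn += 1  # validator missed a non-holding
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}
```

Returning 0.0 when a denominator is empty is one possible convention; the actual implementation may report such cases differently (for example as NULL).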
2026-06-06 21:52:05 +00:00
parent 9bd247c421
commit ac279220c4
6 changed files with 632 additions and 1 deletions
--- a/web/app.py
+++ b/web/app.py
@@ -6099,6 +6099,59 @@ async def halacha_equivalents_unlink(halacha_id: str, other_id: str):
    return {"ok": await db.unlink_equivalent_halachot(hid, oid)}


+# ── Gold-set tagging (#81.7 / #81.8) ─────────────────────────────────────────
+
+class GoldsetSampleRequest(BaseModel):
+    n: int = 150
+    batch: str = "default"
+    reset: bool = False
+
+
+class GoldsetTagRequest(BaseModel):
+    is_holding: bool | None = None
+    correct_type: str | None = None
+    quote_complete: bool | None = None
+    tagged_by: str = "chair"
+
+
+@app.get("/api/goldset")
+async def goldset_list_ep(batch: str = "default"):
+    """The gold-set tagging queue (halacha content + machine labels + human tags)."""
+    return {"items": await db.goldset_list(batch), "batch": batch}
+
+
+@app.post("/api/goldset/sample")
+async def goldset_sample_ep(req: GoldsetSampleRequest):
+    """Create/extend a stratified gold-set batch for tagging (#81.7)."""
+    return await db.goldset_create_sample(n=req.n, batch=req.batch, reset=req.reset)
+
+
+@app.get("/api/goldset/score")
+async def goldset_score_ep(batch: str = "default"):
+    """Measure the extraction validators against the human tags (#81.8)."""
+    return await db.goldset_score(batch)
+
+
+@app.patch("/api/goldset/{goldset_id}")
+async def goldset_tag_ep(goldset_id: str, req: GoldsetTagRequest):
+    """Save one human tag on a gold-set item."""
+    try:
+        gid = UUID(goldset_id)
+    except ValueError:
+        raise HTTPException(400, "מזהה לא תקין")  # "Invalid ID"
+    if req.correct_type and req.correct_type not in (
+        "binding", "interpretive", "obiter", "application", "procedural", "persuasive",
+    ):
+        raise HTTPException(400, "correct_type לא תקין")  # "Invalid correct_type"
+    row = await db.goldset_tag(
+        gid, is_holding=req.is_holding, correct_type=req.correct_type,
+        quote_complete=req.quote_complete, tagged_by=req.tagged_by,
+    )
+    if not row:
+        raise HTTPException(404, "פריט לא נמצא")  # "Item not found"
+    return {"ok": True}
+
+
@app.patch("/api/halachot/{halacha_id}")
 async def halacha_update(halacha_id: str, req: HalachaUpdateRequest):
    """Approve / reject / edit a halacha. Used by the chair review queue."""