feat(eval): FU-5 — retrieval eval harness + halacha backlog visibility (#63)

Covers GAP-11 (INV-RET4/G8) and GAP-14 (INV-QA1/G10). Retrieval quality had never
been measured, only observed through telemetry, and the halacha review backlog was
invisible (the 10/19 gap was found by accident).

Unit B: backlog visibility (pure code, container):
- metrics.halacha_backlog(conn) → {pending_review, approved, rejected, published,
total, oldest_pending_at}; surfaced in metrics.get_dashboard() (the get_metrics MCP
tool) and /api/system/diagnostics. A live count showed 178 pending of 1552 total,
with the oldest pending item dating from 2026-05-03; none of this was visible before.

Unit A: retrieval eval harness (host-side scripts):
- scripts/eval_gold_bootstrap.py seeds data/eval/gold-set.jsonl from two sources:
citations (cited==relevant via search_relevance_feedback; empty until decisions
cite precedents) and known_item (query=case_name → relevant=self; a genuine
citation-free signal, and the methodology that #52 checked by hand). The script is
idempotent and preserves source='chair' rows.
- scripts/eval_retrieval.py — runs the production retrieval path (search_library /
search_internal) over the gold-set; computes precision@k, recall@k, MRR, nDCG@k
(k=5,10); aggregates overall + per-corpus + per-practice_area; writes a report and
a delta vs committed baseline.json (which records the retrieval_config it reflects).
--self-test unit-checks the metric math offline.
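
The four metrics named above can be sketched as follows. This is a minimal illustration with binary relevance, not the code in scripts/eval_retrieval.py; the function names and the assumption that result ids are unique within a ranking are mine.

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by relevant ids (divides by k, not by hits returned)."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant ids that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant id, or 0.0 if none is returned. MRR is the mean over queries."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-gain nDCG: DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

For a known-item query there is exactly one relevant id, so recall@k is either 0 or 1 and the reciprocal rank directly reflects where the named case landed.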

Gold-set strategy = hybrid (chair decision): bootstrap + chair review. The citation
source is empty today (0 cited precedents in decisions), so the seed is known-item only
(77 queries: 54 internal_decisions + 23 precedent_library). The gold-set is
PROVISIONAL until Dafna reviews it (the domain chair-gate).
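
One way to get an idempotent known-item seed that never overwrites chair-reviewed rows is sketched below. The row fields (query, relevant, corpus, source) and the function name are hypothetical; the actual gold-set.jsonl schema is not shown in this commit, and the citations source is left out for brevity.

```python
from typing import Dict, List, Tuple


def merge_known_item_rows(
    existing: List[Dict],
    cases: List[Tuple[str, str, str]],
) -> List[Dict]:
    """Rebuild the gold set from (doc_id, case_name, corpus) tuples.

    Rows with source == "chair" are kept untouched; known_item rows are
    regenerated from scratch, so running this twice gives the same result.
    """
    chair_rows = [row for row in existing if row.get("source") == "chair"]
    # A chair row for the same (corpus, query) wins over a bootstrapped one.
    taken = {(row["corpus"], row["query"]) for row in chair_rows}

    seeded = []
    for doc_id, case_name, corpus in cases:
        key = (corpus, case_name)
        if key in taken:
            continue
        taken.add(key)
        seeded.append(
            {
                "query": case_name,      # known-item: the query is the case name
                "relevant": [doc_id],    # and the only relevant document is itself
                "corpus": corpus,
                "source": "known_item",
            }
        )
    return chair_rows + seeded
```

Regenerating the bootstrapped rows instead of appending to them is what keeps repeated runs from duplicating queries.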

Baseline (production config: multimodal+rerank on): R@10=0.987, MRR=0.837,
nDCG@10=0.872. Finding: MULTIMODAL_ENABLED=true slightly lowers known-item recall
(image-page results displace exact name matches), which is relevant to #15.
precedent_library is weaker than internal_decisions (R@10 0.957 vs 1.0): one external
precedent cannot be found by name.
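
The per-corpus and overall recall figures are consistent with each other if recall@10 is read as a hit rate over queries, which holds when each known-item query has exactly one relevant document. The check below assumes that reading and a plain per-query average; the hit counts are inferred from the reported numbers rather than taken from the report.

```python
# Query counts from the gold-set description: 54 internal + 23 precedent = 77.
internal_queries = 54
precedent_queries = 23

# Assumed hit counts: every internal query found, one precedent query missed.
internal_hits = 54
precedent_hits = 22

precedent_recall = precedent_hits / precedent_queries
overall_recall = (internal_hits + precedent_hits) / (internal_queries + precedent_queries)

print(round(precedent_recall, 3))  # 0.957
print(round(overall_recall, 3))    # 0.987
```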

The "CI gate" is realized as discipline rather than automation (re-runnable harness +
committed baseline + a run before and after any retrieval-layer change): retrieval
needs the prod DB and Voyage, and no CI runner has that access.
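
A before/after run against the committed baseline could be compared along these lines. This is a sketch only: the layout of baseline.json (a "metrics" map next to a "retrieval_config" map), the config keys, and the 0.01 tolerance are assumptions, and the "current" values in the usage example are made up for illustration.

```python
def compare_to_baseline(baseline: dict, current: dict, tolerance: float = 0.01) -> dict:
    """Report per-metric deltas and flag drops larger than `tolerance`."""
    deltas = {}
    for name, base_value in baseline["metrics"].items():
        if name not in current["metrics"]:
            continue  # metric missing from the new run; nothing to compare
        delta = current["metrics"][name] - base_value
        deltas[name] = {
            "delta": round(delta, 4),
            "regressed": delta < -tolerance,
        }
    return {
        # A changed config means the delta reflects the change under test.
        "config_changed": baseline["retrieval_config"] != current["retrieval_config"],
        "deltas": deltas,
    }


# Baseline numbers are the ones reported in this commit; config keys are hypothetical.
baseline = {
    "retrieval_config": {"multimodal": True, "rerank": True},
    "metrics": {"recall@10": 0.987, "mrr": 0.837, "ndcg@10": 0.872},
}
# Illustrative values only, not a real run.
current = {
    "retrieval_config": {"multimodal": False, "rerank": True},
    "metrics": {"recall@10": 1.0, "mrr": 0.80, "ndcg@10": 0.872},
}
report = compare_to_baseline(baseline, current)
```

Recording the config next to the metrics lets a reader tell a genuine regression apart from a run that simply used different retrieval settings.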

Spec: docs/superpowers/specs/2026-05-31-fu5-eval-harness-design.md

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -103,6 +103,30 @@ async def get_case_metrics(case_id: UUID) -> dict:
     return metrics


+async def halacha_backlog(conn) -> dict:
+    """Halacha approval queue (GAP-14 / INV-QA1 / G10): visibility of the human review backlog.
+
+    Halachot enter as `pending_review` and stay invisible to search until the chair approves them;
+    without a visible count, a missing approval stays hidden (10/19 was found by accident). Takes an
+    open connection so it can join an existing snapshot (get_dashboard, /api/system/diagnostics).
+    """
+    rows = await conn.fetch(
+        "SELECT review_status, COUNT(*) AS n FROM halachot GROUP BY review_status"
+    )
+    counts = {r["review_status"]: r["n"] for r in rows}
+    oldest = await conn.fetchval(
+        "SELECT MIN(created_at) FROM halachot WHERE review_status = 'pending_review'"
+    )
+    return {
+        "pending_review": counts.get("pending_review", 0),
+        "approved": counts.get("approved", 0),
+        "rejected": counts.get("rejected", 0),
+        "published": counts.get("published", 0),
+        "total": sum(counts.values()),
+        "oldest_pending_at": oldest.isoformat() if oldest else None,
+    }
+
+
 async def get_dashboard() -> dict:
     """Overall dashboard: summary of metrics across all cases."""
     pool = await db.get_pool()
@@ -152,6 +176,9 @@ async def get_dashboard() -> dict:
             "SELECT AVG(total_words) FROM decisions WHERE total_words > 0"
         )

+        # Halacha review backlog (GAP-14 / INV-QA1 / G10)
+        backlog = await halacha_backlog(conn)
+
     return {
         "summary": {
             "total_cases": total_cases,
@@ -168,6 +195,7 @@ async def get_dashboard() -> dict:
             "stale_embedding_case_law": stale_embedding_case_law,
         },
         "cases_by_status": cases_by_status,
+        "halacha_backlog": backlog,
         "qa": {
             "cases_validated": qa_total,
             "cases_passed": qa_passed,
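
Because halacha_backlog takes an already-open connection, it can be exercised offline with a stub. The sketch below copies the function from the diff and feeds it a hypothetical FakeConn that mimics only the two asyncpg-style calls used (fetch and fetchval); the stub and the sample counts are illustrative, not part of the commit.

```python
import asyncio
from datetime import datetime, timezone


async def halacha_backlog(conn) -> dict:
    """Copy of the function added in this commit, for a self-contained example."""
    rows = await conn.fetch(
        "SELECT review_status, COUNT(*) AS n FROM halachot GROUP BY review_status"
    )
    counts = {r["review_status"]: r["n"] for r in rows}
    oldest = await conn.fetchval(
        "SELECT MIN(created_at) FROM halachot WHERE review_status = 'pending_review'"
    )
    return {
        "pending_review": counts.get("pending_review", 0),
        "approved": counts.get("approved", 0),
        "rejected": counts.get("rejected", 0),
        "published": counts.get("published", 0),
        "total": sum(counts.values()),
        "oldest_pending_at": oldest.isoformat() if oldest else None,
    }


class FakeConn:
    """Hypothetical stand-in for a DB connection; ignores the SQL it is given."""

    def __init__(self, counts, oldest):
        self._counts = counts
        self._oldest = oldest

    async def fetch(self, query):
        return [{"review_status": s, "n": n} for s, n in self._counts.items()]

    async def fetchval(self, query):
        return self._oldest


def run_example() -> dict:
    conn = FakeConn(
        {"pending_review": 2, "approved": 5},
        datetime(2026, 5, 3, tzinfo=timezone.utc),
    )
    return asyncio.run(halacha_backlog(conn))
```

Statuses missing from the table come back as 0 rather than raising. Note that a status outside the four named keys would still be counted in total, so total can exceed the sum of the listed buckets.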