feat(eval): FU-5 — retrieval eval harness + halacha backlog visibility (#63)

Covers GAP-11 (INV-RET4/G8) and GAP-14 (INV-QA1/G10). Retrieval quality was never measured (only telemetry observation) and the halacha review backlog was invisible (the 10/19 gap was found by accident).

Unit B: backlog visibility (pure code, container):
- metrics.halacha_backlog(conn) → {pending_review, approved, rejected, published, total, oldest_pending_at}; surfaced in metrics.get_dashboard() (get_metrics MCP tool) and /api/system/diagnostics. Live count revealed 178 pending / 1552 total, oldest from 2026-05-03; previously invisible.

Unit A: retrieval eval harness (host-side scripts):
- scripts/eval_gold_bootstrap.py: seeds data/eval/gold-set.jsonl. Two sources: citations (cited==relevant via search_relevance_feedback; empty until decisions cite precedents) and known_item (query=case_name → relevant=self; a real citation-free signal, the methodology #52 checked by hand). Idempotent; preserves source='chair' rows.
- scripts/eval_retrieval.py: runs the production retrieval path (search_library / search_internal) over the gold-set; computes precision@k, recall@k, MRR, nDCG@k (k=5,10); aggregates overall + per-corpus + per-practice_area; writes a report and a delta vs committed baseline.json (which records the retrieval_config it reflects). --self-test unit-checks the metric math offline.

Gold-set strategy = hybrid (chair decision): bootstrap + chair review. The citation source is empty today (0 cited precedents in decisions), so the seed is known-item (77 queries: 54 internal_decisions + 23 precedent_library). The gold-set is PROVISIONAL until Dafna reviews it (the domain chair-gate).

Baseline (production config: multimodal+rerank on): R@10=0.987, MRR=0.837, nDCG@10=0.872.

Finding: MULTIMODAL_ENABLED=true slightly lowers known-item recall (image-page results displace exact name matches); relevant to #15. precedent_library weaker than internal (R@10 0.957 vs 1.0): one external precedent unfindable by name.

"CI gate" realized as discipline (re-runnable harness + committed baseline + run before/after any retrieval-layer change): retrieval needs prod DB + Voyage, no CI runner has that access.

Spec: docs/superpowers/specs/2026-05-31-fu5-eval-harness-design.md

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 14:58:13 +00:00
parent cfcac80de2
commit 6ff2e36bf9
10 changed files with 776 additions and 10 deletions
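
The commit message names precision@k, recall@k, MRR and nDCG@k as the metrics that scripts/eval_retrieval.py computes. The following is a minimal sketch of that metric math under binary relevance, not the script's actual code: the function names are illustrative, the ranked list is assumed to contain no duplicate ids, and the harness may differ in details such as whether the reciprocal rank is cut off at k.

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-gain nDCG: DCG of the top-k divided by the best achievable DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


if __name__ == "__main__":
    # Known-item query: exactly one relevant document, found at rank 3.
    ranked = ["a", "b", "c", "d", "e"]
    relevant = {"c"}
    print(precision_at_k(ranked, relevant, 5))
    print(recall_at_k(ranked, relevant, 5))
    print(reciprocal_rank(ranked, relevant))
    print(ndcg_at_k(ranked, relevant, 5))
```

MRR is the mean of `reciprocal_rank` over all gold-set queries. When a query has a single relevant document, as in a known-item seed, P@k cannot exceed 1/k, so P@10 values close to 0.1 go together with R@10 values close to 1.0.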
--- a/data/eval/baseline.json
+++ b/data/eval/baseline.json
@@ -0,0 +1,70 @@
+{
+  "gold_size": 77,
+  "retrieval_config": {
+    "MULTIMODAL_ENABLED": true,
+    "VOYAGE_RERANK_ENABLED": true,
+    "VOYAGE_MODEL": "voyage-3",
+    "MULTIMODAL_TEXT_WEIGHT": 0.5,
+    "MULTIMODAL_RRF_K": 60,
+    "BM25_HYBRID_ENABLED": true
+  },
+  "overall": {
+    "P@5": 0.1922,
+    "R@5": 0.9351,
+    "nDCG@5": 0.8545,
+    "P@10": 0.1013,
+    "R@10": 0.987,
+    "nDCG@10": 0.8718,
+    "MRR": 0.8367
+  },
+  "by_corpus": {
+    "internal_decisions": {
+      "P@5": 0.1963,
+      "R@5": 0.963,
+      "nDCG@5": 0.887,
+      "P@10": 0.1019,
+      "R@10": 1.0,
+      "nDCG@10": 0.899,
+      "MRR": 0.871
+    },
+    "precedent_library": {
+      "P@5": 0.1826,
+      "R@5": 0.8696,
+      "nDCG@5": 0.778,
+      "P@10": 0.1,
+      "R@10": 0.9565,
+      "nDCG@10": 0.808,
+      "MRR": 0.7562
+    }
+  },
+  "by_practice_area": {
+    "betterment_levy": {
+      "P@5": 0.1897,
+      "R@5": 0.9231,
+      "nDCG@5": 0.8595,
+      "P@10": 0.1,
+      "R@10": 0.9744,
+      "nDCG@10": 0.8761,
+      "MRR": 0.8432
+    },
+    "compensation_197": {
+      "P@5": 0.2,
+      "R@5": 1.0,
+      "nDCG@5": 1.0,
+      "P@10": 0.1,
+      "R@10": 1.0,
+      "nDCG@10": 1.0,
+      "MRR": 1.0
+    },
+    "rishuy_uvniya": {
+      "P@5": 0.2,
+      "R@5": 0.9706,
+      "nDCG@5": 0.861,
+      "P@10": 0.1029,
+      "R@10": 1.0,
+      "nDCG@10": 0.8708,
+      "MRR": 0.8346
+    }
+  },
+  "generated_at": "20260531T145742Z"
+}
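
The commit message describes a delta against the committed baseline.json. One way such a comparison could look is sketched below, using the overall figures from the file above. The `current` values, the 0.01 tolerance, and the function names are hypothetical and are not taken from scripts/eval_retrieval.py.

```python
# Overall figures copied from data/eval/baseline.json in this commit.
BASELINE_OVERALL = {"R@10": 0.987, "MRR": 0.8367, "nDCG@10": 0.8718}


def metric_delta(baseline: dict, current: dict) -> dict:
    """Return current - baseline for each metric present in both, rounded to 4 places."""
    return {
        name: round(current[name] - baseline[name], 4)
        for name in baseline
        if name in current
    }


def regressions(delta: dict, tolerance: float = 0.01) -> list:
    """Names of metrics that dropped by more than `tolerance` (an assumed threshold)."""
    return sorted(name for name, diff in delta.items() if diff < -tolerance)


if __name__ == "__main__":
    # Hypothetical result of a re-run after a retrieval-layer change.
    current = {"R@10": 0.974, "MRR": 0.8401, "nDCG@10": 0.8718}
    delta = metric_delta(BASELINE_OVERALL, current)
    print(delta)
    print(regressions(delta))
```

A delta is only like-for-like when the run uses the same retrieval_config the baseline records, so a real comparison would likely check that block first.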