feat(eval): FU-5 — retrieval eval harness + halacha backlog visibility (#63)

Covers GAP-11 (INV-RET4/G8) and GAP-14 (INV-QA1/G10). Retrieval quality was never measured (only telemetry observation) and the halacha review backlog was invisible (the 10/19 gap was found by accident).

Unit B: backlog visibility (pure code, container):
- metrics.halacha_backlog(conn) → {pending_review, approved, rejected, published, total, oldest_pending_at}; surfaced in metrics.get_dashboard() (get_metrics MCP tool) and /api/system/diagnostics. Live count revealed 178 pending / 1552 total, oldest from 2026-05-03; previously invisible.

Unit A: retrieval eval harness (host-side scripts):
- scripts/eval_gold_bootstrap.py: seeds data/eval/gold-set.jsonl. Two sources: citations (cited==relevant via search_relevance_feedback; empty until decisions cite precedents) and known_item (query=case_name → relevant=self; a real citation-free signal, the methodology #52 checked by hand). Idempotent; preserves source='chair' rows.
- scripts/eval_retrieval.py: runs the production retrieval path (search_library / search_internal) over the gold-set; computes precision@k, recall@k, MRR, nDCG@k (k=5,10); aggregates overall + per-corpus + per-practice_area; writes a report and a delta vs committed baseline.json (which records the retrieval_config it reflects). --self-test unit-checks the metric math offline.

Gold-set strategy = hybrid (chair decision): bootstrap + chair review. The citation source is empty today (0 cited precedents in decisions), so the seed is known-item (77 queries: 54 internal_decisions + 23 precedent_library). The gold-set is PROVISIONAL until Dafna reviews it (the domain chair-gate).

Baseline (production config: multimodal+rerank on): R@10=0.987, MRR=0.837, nDCG@10=0.872.

Finding: MULTIMODAL_ENABLED=true slightly lowers known-item recall (image-page results displace exact name matches); relevant to #15. precedent_library is weaker than internal (R@10 0.957 vs 1.0): one external precedent is unfindable by name.

"CI gate" realized as discipline (re-runnable harness + committed baseline + run before/after any retrieval-layer change): retrieval needs prod DB + Voyage, and no CI runner has that access.

Spec: docs/superpowers/specs/2026-05-31-fu5-eval-harness-design.md

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 14:58:13 +00:00
parent cfcac80de2
commit 6ff2e36bf9
10 changed files with 776 additions and 10 deletions
--- a/docs/superpowers/specs/2026-05-31-fu5-eval-harness-design.md
+++ b/docs/superpowers/specs/2026-05-31-fu5-eval-harness-design.md
@@ -0,0 +1,92 @@
+# FU-5 — Retrieval Eval Harness + Backlog Visibility (design)
+
+**Task:** #63 (legal-ai tag) · **Covers:** GAP-11, GAP-14 · **Provides:** INV-RET4, G8, INV-QA1, G10
+**Status:** approved 2026-05-31 (gold-set strategy = hybrid, chair decision). Technical architecture
+decided per `feedback_research_architecture_decisions` (chair adjudicates domain, not architecture).
+
+## Problem
+
+1. **GAP-11 (INV-RET4/G8):** retrieval quality is never measured. Only `telemetry.log_search_bg`
+   records queries (observation, not evaluation). No gold-set, no precision/recall. Every change
+   to the RRF weights, `k`, or the embedder is tuned "by feel".
+2. **GAP-14 (INV-QA1/G10):** the halacha review backlog (`review_status='pending_review'`) is
+   invisible — the 10/19-approved gap was found by accident. The human gate has no visibility.
+
+## Two independent units
+
+### Unit A — Retrieval eval harness (GAP-11)
+
+**Existing leverage:** `search_relevance_feedback` already captures a real ground-truth signal —
+when a finalized decision cites a precedent, `infer_relevance_from_citations` marks it
+`relevance_score=3` against the `search_logs` where it appeared (telemetry.py). This bootstraps the
+gold-set without hand-labeling.
+
+**A1. Gold-set — versioned file `data/eval/gold-set.jsonl`** (single SoT; reviewable/diffable/
+chair-editable). One JSON object per line:
+```json
+{"id":"g001","query":"...","practice_area":"betterment_levy",
+ "corpus":"precedent_library|internal_decisions",
+ "relevant_case_law_ids":["uuid",...],"source":"bootstrap|chair","note":""}
+```
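
A minimal sketch of how one line in this format could be parsed and sanity-checked before the harness scores it. The `GoldEntry` class, `parse_gold_line`, `load_gold_set`, and the validation rules are hypothetical and are not taken from the committed scripts; only the field names and the allowed `corpus` / `source` values come from the schema above.

```python
import json
from dataclasses import dataclass

# Allowed values as listed in the schema above.
ALLOWED_CORPORA = {"precedent_library", "internal_decisions"}
ALLOWED_SOURCES = {"bootstrap", "chair"}


@dataclass(frozen=True)
class GoldEntry:
    """One gold-set row (hypothetical in-memory shape)."""
    id: str
    query: str
    practice_area: str
    corpus: str
    relevant_case_law_ids: frozenset
    source: str
    note: str = ""


def parse_gold_line(line: str) -> GoldEntry:
    """Parse one JSONL line and reject rows that could not be scored."""
    raw = json.loads(line)
    if raw["corpus"] not in ALLOWED_CORPORA:
        raise ValueError(f"{raw['id']}: unknown corpus {raw['corpus']!r}")
    if raw["source"] not in ALLOWED_SOURCES:
        raise ValueError(f"{raw['id']}: unknown source {raw['source']!r}")
    if not raw["relevant_case_law_ids"]:
        # With an empty relevant set, recall@k has no defined value.
        raise ValueError(f"{raw['id']}: no relevant ids")
    return GoldEntry(
        id=raw["id"],
        query=raw["query"],
        practice_area=raw["practice_area"],
        corpus=raw["corpus"],
        relevant_case_law_ids=frozenset(raw["relevant_case_law_ids"]),
        source=raw["source"],
        note=raw.get("note", ""),
    )


def load_gold_set(text: str) -> list:
    """Parse a whole JSONL document, skipping blank lines."""
    return [parse_gold_line(ln) for ln in text.splitlines() if ln.strip()]
```

Failing loudly on an unknown `corpus` or an empty relevant set keeps a hand-edited row from silently skewing the averages.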
+
+**A2. Bootstrap generator — `scripts/eval_gold_bootstrap.py`** (host-side, mcp-server venv):
+reads `search_relevance_feedback` (score=3) ⨝ `search_logs`, groups by normalized query →
+relevant `case_law_id` set, emits `source=bootstrap` entries. Idempotent: re-run regenerates the
+bootstrap section; never overwrites `source=chair` rows. **Chair gate:** Dafna reviews the file,
+corrects/augments, promotes entries to `source=chair`.
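
The grouping and the "never overwrite chair rows" rule described above could look roughly like this. It is a sketch under stated assumptions, not the committed script: the function names, the whitespace/case normalization, the hash-derived ids, and the rule that a chair row suppresses a bootstrap row for the same (query, corpus) pair are all hypothetical. The input tuples stand in for the joined `search_relevance_feedback` / `search_logs` rows.

```python
import hashlib
from collections import defaultdict


def normalize_query(query: str) -> str:
    """Case-fold and collapse whitespace (an assumed normalization)."""
    return " ".join(query.casefold().split())


def build_bootstrap_rows(feedback_rows):
    """Group (query, corpus, practice_area, case_law_id) tuples into gold rows."""
    relevant = defaultdict(set)
    area = {}
    for query, corpus, practice_area, case_law_id in feedback_rows:
        key = (normalize_query(query), corpus)
        relevant[key].add(case_law_id)
        area.setdefault(key, practice_area)  # first-seen practice area wins
    rows = []
    for key in sorted(relevant):
        query, corpus = key
        # Hash-based id (hypothetical): stable across runs and row order.
        digest = hashlib.sha1(f"{corpus}\n{query}".encode("utf-8")).hexdigest()[:8]
        rows.append({
            "id": f"b{digest}",
            "query": query,
            "practice_area": area[key],
            "corpus": corpus,
            "relevant_case_law_ids": sorted(relevant[key]),
            "source": "bootstrap",
            "note": "",
        })
    return rows


def merge_gold_set(existing_rows, bootstrap_rows):
    """Keep every chair row verbatim and regenerate everything else."""
    chair = [r for r in existing_rows if r["source"] == "chair"]
    covered = {(normalize_query(r["query"]), r["corpus"]) for r in chair}
    fresh = [
        r for r in bootstrap_rows
        if (normalize_query(r["query"]), r["corpus"]) not in covered
    ]
    return chair + fresh
```

Because the bootstrap section is rebuilt from scratch on each run and chair rows are copied through untouched, running the merge twice on the same input yields the same file.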
+
+**A3. Harness — `scripts/eval_retrieval.py`** (host-side, mcp-server venv; needs POSTGRES + VOYAGE):
+runs the **production retrieval path** (same service functions the MCP search tools call) for each
+gold query, computes per-query **precision@k, recall@k, MRR, nDCG@k** (k∈{5,10}); relevant = gold
+ids. Aggregates means overall, per corpus, and per practice_area. Writes
+`data/eval/eval-report-<ts>.{json,md}`, then prints a summary and a delta against the committed
+`data/eval/baseline.json`. `--update-baseline` rewrites the snapshot.
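
The aggregation and baseline comparison could be sketched as below. The shape of the per-query records (`{"corpus": ..., "metrics": {...}}`), the metric key names, and both function names are assumptions made for illustration; the actual report and `baseline.json` layout are not shown in this spec.

```python
from collections import defaultdict
from statistics import fmean


def aggregate(per_query, group_field=None):
    """Mean of each metric, overall or grouped by a field such as "corpus"."""
    buckets = defaultdict(list)
    for row in per_query:
        key = row[group_field] if group_field else "overall"
        buckets[key].append(row["metrics"])
    # Assumes every query in a bucket carries the same metric names.
    return {
        key: {name: fmean(m[name] for m in rows) for name in rows[0]}
        for key, rows in buckets.items()
    }


def delta_vs_baseline(current, baseline):
    """Signed change per metric; metrics absent from the baseline are skipped."""
    return {
        name: round(value - baseline[name], 4)
        for name, value in current.items()
        if name in baseline
    }
```

A negative delta on any metric is the signal to stop and look before merging a retrieval-layer change.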
+
+**"CI gate" — realized as discipline, not automation.** Retrieval needs the prod DB + Voyage API;
+no CI runner has that access. The gate is: re-runnable harness + committed `baseline.json` + a
+documented "run before/after any retrieval-layer change, attach the delta" rule (SCRIPTS.md). A true
+automated CI gate would require a separate frozen corpus fixture — out of scope, noted as future.
+
+**Scope:** the two precedent corpora (`search_precedent_library` + `search_internal_decisions`),
+where the citation signal exists. `search_decisions`/`search_case_documents` return case-document
+chunks (not `case_law`) and carry no citation ground-truth — deliberately out of scope.
+
+**Metrics rationale:** precision@k + recall@k are spec-required (INV-RET4). MRR (first-relevant
+rank) and nDCG@k (graded, position-weighted) are standard IR complements (Manning et al., 2008) —
+nDCG matches the telemetry docstring's stated nDCG@10 aspiration.
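
For reference, the four metrics can be written in a few lines. This is a sketch of the standard definitions, not the code in `scripts/eval_retrieval.py`: the function names are hypothetical, relevance is treated as binary (the gold-set stores a set of relevant ids, not grades), ranked ids are assumed to be distinct, and precision@k divides by `k` even when fewer than `k` results come back.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k slots filled by relevant ids."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant ids that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant id, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-gain nDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / ideal if ideal else 0.0
```

MRR is the mean of `reciprocal_rank` over all gold queries. For a known-item query (one relevant id) found at rank 2, `reciprocal_rank` is 0.5 and `ndcg_at_k` is `1 / log2(3)`, about 0.631.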
+
+### Unit B — Backlog visibility (GAP-14) — pure code
+
+Expose the halacha review backlog where health is already surfaced:
+- **`metrics.get_dashboard()`** (mcp-server/src/legal_mcp/services/metrics.py) — add
+  `halacha_backlog: {pending_review, approved, rejected, published, total, oldest_pending_at}` from
+  `halachot.review_status` + `min(created_at) where pending_review`. Surfaces through the
+  `get_metrics` MCP tool (agents + dashboard).
+- **`/api/system/diagnostics`** (web/app.py) — add the same `halacha_backlog` block to the health
+  snapshot.
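
The `halacha_backlog` block could be assembled from a single grouped query, roughly as follows. This is a sketch only: `BACKLOG_SQL` and `summarize_backlog` are hypothetical names, the SQL is an assumed form of the query (the table and column names `halachot`, `review_status`, `created_at` come from the text above), and the function takes already-fetched rows so it does not presume a particular database driver.

```python
# Assumed query shape; one row per review_status value.
BACKLOG_SQL = """
SELECT review_status, count(*) AS n, min(created_at) AS oldest
FROM halachot
GROUP BY review_status
"""

STATUSES = ("pending_review", "approved", "rejected", "published")


def summarize_backlog(rows):
    """Fold (review_status, n, oldest) rows into the halacha_backlog dict."""
    counts = {status: 0 for status in STATUSES}
    oldest_pending_at = None
    total = 0
    for status, n, oldest in rows:
        total += n  # counts every row, including any unexpected status
        if status in counts:
            counts[status] = n
        if status == "pending_review":
            oldest_pending_at = oldest
    return {**counts, "total": total, "oldest_pending_at": oldest_pending_at}
```

If only the four listed statuses occur, the per-status counts sum to `total`; a mismatch would itself be a useful signal that an unknown status has appeared.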
+
+## Files
+
+| File | Unit | Kind | Deploy |
+|------|------|------|--------|
+| `scripts/eval_gold_bootstrap.py` | A2 | new, host-side | none |
+| `scripts/eval_retrieval.py` | A3 | new, host-side | none |
+| `data/eval/gold-set.jsonl` | A1 | data (on disk; chair-reviewed) | none |
+| `data/eval/baseline.json` | A3 | committed snapshot | none |
+| `mcp-server/src/legal_mcp/services/metrics.py` | B | edit `get_dashboard` | Coolify |
+| `web/app.py` | B | edit diagnostics | Coolify |
+| `scripts/SCRIPTS.md` | A | doc | none |
+
+## Test strategy
+
+- Bootstrap: idempotent (a re-run yields the same bootstrap rows and leaves every `source=chair` row untouched).
+- Harness: metric math unit-verified offline on a synthetic (ranking, relevant-set) fixture
+  (precision@k / recall@k / MRR / nDCG@k against hand-computed values) before any DB run.
+- Unit B: `get_metrics` (no case_number) returns `halacha_backlog` with counts summing to total;
+  diagnostics endpoint returns the same block. Verified against prod counts.
+
+## Chair gate (domain — the only thing requiring Dafna)
+
+After bootstrap produces `gold-set.jsonl`, Dafna reviews: are these queries representative, and are
+the marked precedents the *correct* answers? Her edits make the gold-set authoritative. Until then
+the baseline is "provisional (bootstrap-only)".