# FU-5 — Retrieval Eval Harness + Backlog Visibility (design) **Task:** #63 (legal-ai tag) · **Covers:** GAP-11, GAP-14 · **Provides:** INV-RET4, G8, INV-QA1, G10 **Status:** approved 2026-05-31 (gold-set strategy = hybrid, chair decision). Technical architecture decided per `feedback_research_architecture_decisions` (chair adjudicates domain, not architecture). ## Problem 1. **GAP-11 (INV-RET4/G8):** retrieval quality is never measured. Only `telemetry.log_search_bg` records queries (observation, not evaluation). No gold-set, no precision/recall. Every RRF-weight / `k` / embedder change is tuned "by feel". 2. **GAP-14 (INV-QA1/G10):** the halacha review backlog (`review_status='pending_review'`) is invisible — the 10/19-approved gap was found by accident. The human gate has no visibility. ## Two independent units ### Unit A — Retrieval eval harness (GAP-11) **Existing leverage:** `search_relevance_feedback` already captures a real ground-truth signal — when a finalized decision cites a precedent, `infer_relevance_from_citations` marks it `relevance_score=3` against the `search_logs` where it appeared (telemetry.py). This bootstraps the gold-set without hand-labeling. **A1. Gold-set — versioned file `data/eval/gold-set.jsonl`** (single SoT; reviewable/diffable/ chair-editable). One JSON object per line: ```json {"id":"g001","query":"...","practice_area":"betterment_levy", "corpus":"precedent_library|internal_decisions", "relevant_case_law_ids":["uuid",...],"source":"bootstrap|chair","note":""} ``` **A2. Bootstrap generator — `scripts/eval_gold_bootstrap.py`** (host-side, mcp-server venv): reads `search_relevance_feedback` (score=3) ⨝ `search_logs`, groups by normalized query → relevant `case_law_id` set, emits `source=bootstrap` entries. Idempotent: re-run regenerates the bootstrap section; never overwrites `source=chair` rows. **Chair gate:** Dafna reviews the file, corrects/augments, promotes entries to `source=chair`. **A3. Harness — `scripts/eval_retrieval.py`** (host-side, mcp-server venv; needs POSTGRES + VOYAGE): runs the **production retrieval path** (same service functions the MCP search tools call) for each gold query, computes per-query **precision@k, recall@k, MRR, nDCG@k** (k∈{5,10}); relevant = gold ids. Aggregates mean overall + per corpus + per practice_area. Writes `data/eval/eval-report-.{json,md}`, prints a summary, and a delta vs the committed `data/eval/baseline.json`. `--update-baseline` rewrites the snapshot. **"CI gate" — realized as discipline, not automation.** Retrieval needs the prod DB + Voyage API; no CI runner has that access. The gate is: re-runnable harness + committed `baseline.json` + a documented "run before/after any retrieval-layer change, attach the delta" rule (SCRIPTS.md). A true automated CI gate would require a separate frozen corpus fixture — out of scope, noted as future. **Scope:** the two precedent corpora (`search_precedent_library` + `search_internal_decisions`), where the citation signal exists. `search_decisions`/`search_case_documents` return case-document chunks (not `case_law`) and carry no citation ground-truth — deliberately out of scope. **Metrics rationale:** precision@k + recall@k are spec-required (INV-RET4). MRR (first-relevant rank) and nDCG@k (graded, position-weighted) are standard IR complements (Manning et al., 2008) — nDCG matches the telemetry docstring's stated nDCG@10 aspiration. ### Unit B — Backlog visibility (GAP-14) — pure code Expose the halacha review backlog where health is already surfaced: - **`metrics.get_dashboard()`** (mcp-server/src/legal_mcp/services/metrics.py) — add `halacha_backlog: {pending_review, approved, rejected, published, total, oldest_pending_at}` from `halachot.review_status` + `min(created_at) where pending_review`. Surfaces through the `get_metrics` MCP tool (agents + dashboard). - **`/api/system/diagnostics`** (web/app.py) — add the same `halacha_backlog` block to the health snapshot. ## Files | File | Unit | Kind | Deploy | |------|------|------|--------| | `scripts/eval_gold_bootstrap.py` | A2 | new, host-side | none | | `scripts/eval_retrieval.py` | A3 | new, host-side | none | | `data/eval/gold-set.jsonl` | A1 | data (on disk; chair-reviewed) | none | | `data/eval/baseline.json` | A3 | committed snapshot | none | | `mcp-server/src/legal_mcp/services/metrics.py` | B | edit `get_dashboard` | Coolify | | `web/app.py` | B | edit diagnostics | Coolify | | `scripts/SCRIPTS.md` | A | doc | none | ## Test strategy - Bootstrap: idempotent (re-run = same bootstrap rows; chair rows untouched); 0 chair rows clobbered. - Harness: metric math unit-verified offline on a synthetic (ranking, relevant-set) fixture (precision@k / recall@k / MRR / nDCG@k against hand-computed values) before any DB run. - Unit B: `get_metrics` (no case_number) returns `halacha_backlog` with counts summing to total; diagnostics endpoint returns the same block. Verified against prod counts. ## Chair gate (domain — the only thing requiring Dafna) After bootstrap produces `gold-set.jsonl`, Dafna reviews: are these queries representative, and are the marked precedents the *correct* answers? Her edits make the gold-set authoritative. Until then the baseline is "provisional (bootstrap-only)".