# FU-5 — Retrieval Eval Harness + Backlog Visibility (design)

**Task:** #63 (legal-ai tag) · **Covers:** GAP-11, GAP-14 · **Provides:** INV-RET4, G8, INV-QA1, G10
**Status:** approved 2026-05-31 (gold-set strategy = hybrid, chair decision). Technical architecture
decided per `feedback_research_architecture_decisions` (chair adjudicates domain, not architecture).

## Problem

1. **GAP-11 (INV-RET4/G8):** retrieval quality is never measured. Only `telemetry.log_search_bg`
   records queries (observation, not evaluation). No gold-set, no precision/recall. Every RRF-weight
   / `k` / embedder change is tuned "by feel".
2. **GAP-14 (INV-QA1/G10):** the halacha review backlog (`review_status='pending_review'`) is
   invisible — the 10/19-approved gap was found by accident. The human gate has no visibility.

## Two independent units

### Unit A — Retrieval eval harness (GAP-11)

**Existing leverage:** `search_relevance_feedback` already captures a real ground-truth signal —
when a finalized decision cites a precedent, `infer_relevance_from_citations` marks it
`relevance_score=3` against the `search_logs` where it appeared (telemetry.py). This bootstraps the
gold-set without hand-labeling.

**A1. Gold-set — versioned file `data/eval/gold-set.jsonl`** (single SoT; reviewable/diffable/
chair-editable). One JSON object per line:
```json
{"id":"g001","query":"...","practice_area":"betterment_levy",
 "corpus":"precedent_library|internal_decisions",
 "relevant_case_law_ids":["uuid",...],"source":"bootstrap|chair","note":""}
```
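A minimal sketch of how a loader could check each line against this schema before the harness uses it. It reads the `|` in the template as "one of", and it assumes every entry needs at least one relevant id; both are assumptions, and `parse_gold_line` / `load_gold_set` are hypothetical names.

```python
import json

# Keys taken from the schema template above.
REQUIRED_KEYS = {"id", "query", "practice_area", "corpus",
                 "relevant_case_law_ids", "source", "note"}
# Assumption: "a|b" in the template means the field holds exactly one of a, b.
CORPORA = {"precedent_library", "internal_decisions"}
SOURCES = {"bootstrap", "chair"}


def parse_gold_line(line: str) -> dict:
    """Parse one JSONL line and validate it; raises ValueError on a bad entry."""
    entry = json.loads(line)
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        raise ValueError(f"{entry.get('id', '?')}: missing keys {sorted(missing)}")
    if entry["corpus"] not in CORPORA:
        raise ValueError(f"{entry['id']}: unknown corpus {entry['corpus']!r}")
    if entry["source"] not in SOURCES:
        raise ValueError(f"{entry['id']}: unknown source {entry['source']!r}")
    ids = entry["relevant_case_law_ids"]
    # Assumption: an entry with no relevant ids cannot be scored, so reject it.
    if not isinstance(ids, list) or not ids:
        raise ValueError(f"{entry['id']}: relevant_case_law_ids must be a non-empty list")
    return entry


def load_gold_set(lines) -> list[dict]:
    """Validate all non-blank lines and reject duplicate ids."""
    entries = [parse_gold_line(line) for line in lines if line.strip()]
    seen = [e["id"] for e in entries]
    if len(seen) != len(set(seen)):
        raise ValueError("duplicate gold-set ids")
    return entries
```

Passing an open file handle for `data/eval/gold-set.jsonl` as `lines` would be the typical call.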

**A2. Bootstrap generator — `scripts/eval_gold_bootstrap.py`** (host-side, mcp-server venv):
reads `search_relevance_feedback` (score=3) ⨝ `search_logs`, groups by normalized query →
relevant `case_law_id` set, emits `source=bootstrap` entries. Idempotent: re-run regenerates the
bootstrap section; never overwrites `source=chair` rows. **Chair gate:** Dafna reviews the file,
corrects/augments, promotes entries to `source=chair`.
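The grouping and the "never overwrite chair rows" rule could look roughly like the sketch below. The row shape `(query, corpus, practice_area, case_law_id)`, the normalization rule, and the hash-derived ids are all assumptions made for illustration; the real join columns and id scheme are not specified here.

```python
import hashlib
import re
from collections import defaultdict


def normalize_query(q: str) -> str:
    """Hypothetical normalization: collapse whitespace, trim, lowercase."""
    return re.sub(r"\s+", " ", q).strip().lower()


def build_bootstrap_entries(rows):
    """rows: (query, corpus, practice_area, case_law_id) tuples for score=3 feedback."""
    grouped = defaultdict(set)
    for query, corpus, practice_area, case_law_id in rows:
        grouped[(corpus, normalize_query(query), practice_area)].add(case_law_id)
    entries = []
    for (corpus, query, practice_area), ids in sorted(grouped.items()):
        # Assumption: a deterministic id keeps re-runs byte-identical.
        digest = hashlib.sha1(f"{corpus}\x00{query}".encode("utf-8")).hexdigest()[:8]
        entries.append({
            "id": f"b-{digest}",
            "query": query,
            "practice_area": practice_area,
            "corpus": corpus,
            "relevant_case_law_ids": sorted(ids),
            "source": "bootstrap",
            "note": "",
        })
    return entries


def merge_gold_set(existing, bootstrap):
    """Keep every chair row; regenerate the bootstrap section around them."""
    chair = [e for e in existing if e["source"] == "chair"]
    chair_keys = {(e["corpus"], normalize_query(e["query"])) for e in chair}
    fresh = [e for e in bootstrap
             if (e["corpus"], normalize_query(e["query"])) not in chair_keys]
    return chair + fresh
```

Because the old bootstrap rows are discarded and rebuilt from the same feedback rows, running the merge twice gives the same file, which is the idempotency property the test strategy checks.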

**A3. Harness — `scripts/eval_retrieval.py`** (host-side, mcp-server venv; needs POSTGRES + VOYAGE):
runs the **production retrieval path** (same service functions the MCP search tools call) for each
gold query, computes per-query **precision@k, recall@k, MRR, nDCG@k** (k∈{5,10}); relevant = gold
ids. Aggregates the mean overall, per corpus, and per practice_area. Writes
`data/eval/eval-report-<ts>.{json,md}`, then prints a summary and the delta against the committed
`data/eval/baseline.json`. `--update-baseline` rewrites the snapshot.
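To make the aggregation and baseline delta concrete, here is a small sketch. The bucket naming (`overall`, `corpus:<name>`, `practice_area:<name>`) and the assumption that `baseline.json` stores the same bucket-to-metric mapping are illustrative choices, not a confirmed file layout.

```python
from collections import defaultdict
from statistics import mean


def aggregate(per_query):
    """per_query: dicts with 'corpus', 'practice_area', 'metrics' (name -> float)."""
    buckets = defaultdict(list)
    for row in per_query:
        buckets["overall"].append(row["metrics"])
        buckets[f"corpus:{row['corpus']}"].append(row["metrics"])
        buckets[f"practice_area:{row['practice_area']}"].append(row["metrics"])
    # Mean of each metric within each bucket.
    return {
        name: {m: mean(r[m] for r in rows) for m in rows[0]}
        for name, rows in buckets.items()
    }


def delta_vs_baseline(current, baseline):
    """Per-bucket, per-metric difference; None where the baseline has no value."""
    out = {}
    for bucket, metrics in current.items():
        base = baseline.get(bucket, {})
        out[bucket] = {m: (v - base[m]) if m in base else None
                       for m, v in metrics.items()}
    return out
```

Reporting `None` instead of `0.0` for a missing baseline value keeps a newly added bucket from looking like "no change".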

**"CI gate" — realized as discipline, not automation.** Retrieval needs the prod DB + Voyage API;
no CI runner has that access. The gate is: re-runnable harness + committed `baseline.json` + a
documented "run before/after any retrieval-layer change, attach the delta" rule (SCRIPTS.md). A true
automated CI gate would require a separate frozen corpus fixture; that is out of scope and noted as
future work.

**Scope:** the two precedent corpora (`search_precedent_library` + `search_internal_decisions`),
where the citation signal exists. `search_decisions`/`search_case_documents` return case-document
chunks (not `case_law`) and carry no citation ground-truth — deliberately out of scope.

**Metrics rationale:** precision@k + recall@k are spec-required (INV-RET4). MRR (first-relevant
rank) and nDCG@k (graded, position-weighted) are standard IR complements (Manning et al., 2008) —
nDCG matches the telemetry docstring's stated nDCG@10 aspiration.
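The four metrics can be sketched as pure functions over a ranked id list and a relevant-id set, which is also the shape the offline unit check needs. Since the gold-set lists relevant ids without grades, this sketch assumes binary gain for nDCG; a graded variant would need per-id scores that the schema does not currently carry. It also assumes `ranked` contains no duplicate ids.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k slots filled by relevant ids (denominator is k)."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant ids that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant id, or 0.0 if none is retrieved."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-gain nDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

As a hand-checkable case: for `ranked = ["a", "x", "b", "y", "z"]` and `relevant = {"a", "b", "c"}` at k=5, precision is 2/5, recall is 2/3, the reciprocal rank is 1, and nDCG is 1.5 / (1 + 1/log2(3) + 0.5), roughly 0.704. MRR is the mean of `reciprocal_rank` over all gold queries.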

### Unit B — Backlog visibility (GAP-14) — pure code

Expose the halacha review backlog where health is already surfaced:
- **`metrics.get_dashboard()`** (mcp-server/src/legal_mcp/services/metrics.py) — add
  `halacha_backlog: {pending_review, approved, rejected, published, total, oldest_pending_at}` from
  `halachot.review_status` + `min(created_at) where pending_review`. Surfaces through the
  `get_metrics` MCP tool (agents + dashboard).
- **`/api/system/diagnostics`** (web/app.py) — add the same `halacha_backlog` block to the health
  snapshot.
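One way to keep the two surfaces consistent is a single pure helper that shapes the block, fed by whatever query each caller runs. The sketch below assumes the four listed statuses are exhaustive (so `total` is their sum) and that the timestamp is serialized as ISO 8601; `build_halacha_backlog` is a hypothetical name, and the SQL in the comments is only an illustration of the inputs.

```python
from datetime import datetime
from typing import Optional

# Status names taken from the halacha_backlog field list above.
STATUSES = ("pending_review", "approved", "rejected", "published")


def build_halacha_backlog(status_counts: dict,
                          oldest_pending_at: Optional[datetime]) -> dict:
    """Shape the halacha_backlog block from pre-aggregated query results.

    status_counts: review_status -> row count, e.g. from
        SELECT review_status, count(*) FROM halachot GROUP BY review_status
    oldest_pending_at: e.g. from
        SELECT min(created_at) FROM halachot WHERE review_status = 'pending_review'
    """
    block = {status: int(status_counts.get(status, 0)) for status in STATUSES}
    # Assumption: no other review_status values exist, so total is the sum.
    block["total"] = sum(block.values())
    block["oldest_pending_at"] = (
        oldest_pending_at.isoformat() if oldest_pending_at else None
    )
    return block
```

Calling the same helper from both `get_dashboard()` and the diagnostics endpoint would make the "same block" requirement hold by construction.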

## Files

| File | Unit | Kind | Deploy |
|------|------|------|--------|
| `scripts/eval_gold_bootstrap.py` | A2 | new, host-side | none |
| `scripts/eval_retrieval.py` | A3 | new, host-side | none |
| `data/eval/gold-set.jsonl` | A1 | data (on disk; chair-reviewed) | none |
| `data/eval/baseline.json` | A3 | committed snapshot | none |
| `mcp-server/src/legal_mcp/services/metrics.py` | B | edit `get_dashboard` | Coolify |
| `web/app.py` | B | edit diagnostics | Coolify |
| `scripts/SCRIPTS.md` | A | doc | none |

## Test strategy

- Bootstrap: idempotent (a re-run yields the same bootstrap rows and leaves every chair row untouched).
- Harness: metric math unit-verified offline on a synthetic (ranking, relevant-set) fixture
  (precision@k / recall@k / MRR / nDCG@k against hand-computed values) before any DB run.
- Unit B: `get_metrics` (no case_number) returns `halacha_backlog` with counts summing to total;
  diagnostics endpoint returns the same block. Verified against prod counts.

## Chair gate (domain — the only thing requiring Dafna)

After bootstrap produces `gold-set.jsonl`, Dafna reviews: are these queries representative, and are
the marked precedents the *correct* answers? Her edits make the gold-set authoritative. Until then
the baseline is "provisional (bootstrap-only)".