Covers GAP-11 (INV-RET4/G8) and GAP-14 (INV-QA1/G10). Retrieval quality was
never measured (only telemetry observation) and the halacha review backlog was
invisible (the 10/19 gap was found by accident).
Unit B — backlog visibility (pure code, container):
- metrics.halacha_backlog(conn) → {pending_review, approved, rejected, published,
total, oldest_pending_at}; surfaced in metrics.get_dashboard() (get_metrics MCP
tool) and /api/system/diagnostics. Live count revealed 178 pending / 1552 total,
oldest from 2026-05-03 — previously invisible.
Unit A — retrieval eval harness (host-side scripts):
- scripts/eval_gold_bootstrap.py — seeds data/eval/gold-set.jsonl. Two sources:
citations (cited==relevant via search_relevance_feedback — empty until decisions
cite precedents) and known_item (query=case_name → relevant=self; a real
citation-free signal, the methodology #52 checked by hand). Idempotent; preserves
source='chair' rows.
- scripts/eval_retrieval.py — runs the production retrieval path (search_library /
search_internal) over the gold-set; computes precision@k, recall@k, MRR, nDCG@k
(k=5,10); aggregates overall + per-corpus + per-practice_area; writes a report and
a delta vs committed baseline.json (which records the retrieval_config it reflects).
--self-test unit-checks the metric math offline.
Gold-set strategy = hybrid (chair decision): bootstrap + chair review. The citation
source is empty today (0 cited precedents in decisions), so the seed is known-item
(77 queries: 54 internal_decisions + 23 precedent_library). The gold-set is
PROVISIONAL until Dafna reviews it (the domain chair-gate).
Baseline (production config: multimodal+rerank on): R@10=0.987, MRR=0.837,
nDCG@10=0.872. Finding: MULTIMODAL_ENABLED=true slightly lowers known-item recall
(image-page results displace exact name matches) — relevant to #15. precedent_library
weaker than internal (R@10 0.957 vs 1.0) — one external precedent unfindable by name.
"CI gate" realized as discipline (re-runnable harness + committed baseline + run
before/after any retrieval-layer change) — retrieval needs prod DB + Voyage, no CI
runner has that access.
Spec: docs/superpowers/specs/2026-05-31-fu5-eval-harness-design.md
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
FU-5 — Retrieval Eval Harness + Backlog Visibility (design)
Task: #63 (legal-ai tag) · Covers: GAP-11, GAP-14 · Provides: INV-RET4, G8, INV-QA1, G10
Status: approved 2026-05-31 (gold-set strategy = hybrid, chair decision). Technical architecture
decided per feedback_research_architecture_decisions (chair adjudicates domain, not architecture).
Problem
- GAP-11 (INV-RET4/G8): retrieval quality is never measured. Only telemetry.log_search_bg records queries (observation, not evaluation). No gold-set, no precision/recall. Every RRF-weight / k / embedder change is tuned "by feel".
- GAP-14 (INV-QA1/G10): the halacha review backlog (review_status='pending_review') is invisible: the 10/19-approved gap was found by accident. The human gate has no visibility.
Two independent units
Unit A — Retrieval eval harness (GAP-11)
Existing leverage: search_relevance_feedback already captures a real ground-truth signal —
when a finalized decision cites a precedent, infer_relevance_from_citations marks it
relevance_score=3 against the search_logs where it appeared (telemetry.py). This bootstraps the
gold-set without hand-labeling.
A1. Gold-set — versioned file data/eval/gold-set.jsonl (single SoT; reviewable/diffable/
chair-editable). One JSON object per line:
{"id":"g001","query":"...","practice_area":"betterment_levy",
"corpus":"precedent_library|internal_decisions",
"relevant_case_law_ids":["uuid",...],"source":"bootstrap|chair","note":""}
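A minimal sketch of how one gold-set line could be parsed and checked against the shape above. The helper name parse_gold_line is hypothetical, and two readings of the schema are assumptions: that the `|` in the corpus and source values denotes alternatives, and that note is optional.

```python
import json

# Assumed from the schema line above: "a|b" is read as "one of a, b".
REQUIRED = {"id", "query", "practice_area", "corpus", "relevant_case_law_ids", "source"}
CORPORA = {"precedent_library", "internal_decisions"}
SOURCES = {"bootstrap", "chair"}


def parse_gold_line(line: str) -> dict:
    """Parse one JSONL line and reject entries that do not fit the schema."""
    entry = json.loads(line)
    missing = REQUIRED - entry.keys()
    if missing:
        raise ValueError(f"{entry.get('id', '?')}: missing fields {sorted(missing)}")
    if entry["corpus"] not in CORPORA:
        raise ValueError(f"{entry['id']}: unknown corpus {entry['corpus']!r}")
    if entry["source"] not in SOURCES:
        raise ValueError(f"{entry['id']}: unknown source {entry['source']!r}")
    if not entry["relevant_case_law_ids"]:
        # A query with no relevant ids cannot be scored for recall.
        raise ValueError(f"{entry['id']}: relevant_case_law_ids is empty")
    return entry
```

Validating on load keeps a hand-edited file from silently skewing the metrics, which matters once chair edits land in the same file.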
A2. Bootstrap generator — scripts/eval_gold_bootstrap.py (host-side, mcp-server venv):
reads search_relevance_feedback (score=3) ⨝ search_logs, groups by normalized query →
relevant case_law_id set, emits source=bootstrap entries. Idempotent: re-run regenerates the
bootstrap section; never overwrites source=chair rows. Chair gate: Dafna reviews the file,
corrects/augments, promotes entries to source=chair.
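The idempotency rule can be sketched as a pure merge step. This is an illustration under assumptions, not the script's actual code: merge_gold_set and normalize_query are hypothetical names, the normalization (lowercase, collapsed whitespace) is a guess, and dropping a fresh bootstrap row whose query already has a chair row is one possible reading of "never overwrites source=chair rows".

```python
def normalize_query(query: str) -> str:
    """Hypothetical normalization: lowercase and collapse runs of whitespace."""
    return " ".join(query.lower().split())


def merge_gold_set(existing: list[dict], fresh_bootstrap: list[dict]) -> list[dict]:
    """Regenerate the bootstrap section while leaving chair rows untouched."""

    def key(entry: dict) -> tuple[str, str]:
        return (entry["corpus"], normalize_query(entry["query"]))

    # Chair rows are carried over verbatim, in their existing order.
    chair_rows = [e for e in existing if e.get("source") == "chair"]
    chair_keys = {key(e) for e in chair_rows}

    # Old bootstrap rows are discarded; fresh ones are kept unless the chair
    # already owns that (corpus, query) pair.
    kept = [e for e in fresh_bootstrap if key(e) not in chair_keys]
    kept.sort(key=key)  # deterministic order, so re-runs produce identical files
    return chair_rows + kept
```

Because the output depends only on the chair rows and the fresh bootstrap rows, running the merge twice with the same inputs yields the same file, which is the property the test strategy checks.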
A3. Harness — scripts/eval_retrieval.py (host-side, mcp-server venv; needs POSTGRES + VOYAGE):
runs the production retrieval path (same service functions the MCP search tools call) for each
gold query, computes per-query precision@k, recall@k, MRR, nDCG@k (k∈{5,10}); relevant = gold
ids. Aggregates mean overall + per corpus + per practice_area. Writes
data/eval/eval-report-<ts>.{json,md}, prints a summary, and a delta vs the committed
data/eval/baseline.json. --update-baseline rewrites the snapshot.
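A rough sketch of the delta step, assuming the report and baseline.json each reduce to a flat mapping of metric name to mean value. That layout, the metric key names, and the helpers metric_delta and regressions are all assumptions for illustration; the spec does not define the file structure.

```python
def metric_delta(current: dict[str, float], baseline: dict[str, float]) -> dict[str, float]:
    """Return current minus baseline for every metric present in both runs."""
    shared = sorted(current.keys() & baseline.keys())
    return {name: round(current[name] - baseline[name], 4) for name in shared}


def regressions(delta: dict[str, float], tolerance: float = 0.005) -> list[str]:
    """List metrics that dropped by more than the (hypothetical) tolerance."""
    return [name for name, change in delta.items() if change < -tolerance]
```

A run before and after a retrieval-layer change would call metric_delta on the two summaries and attach the result to the change; the tolerance only flags drops for a human to look at and is not an automated gate.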
"CI gate" — realized as discipline, not automation. Retrieval needs the prod DB + Voyage API;
no CI runner has that access. The gate is: re-runnable harness + committed baseline.json + a
documented "run before/after any retrieval-layer change, attach the delta" rule (SCRIPTS.md). A true
automated CI gate would require a separate frozen corpus fixture — out of scope, noted as future.
Scope: the two precedent corpora (search_precedent_library + search_internal_decisions),
where the citation signal exists. search_decisions/search_case_documents return case-document
chunks (not case_law) and carry no citation ground-truth — deliberately out of scope.
Metrics rationale: precision@k + recall@k are spec-required (INV-RET4). MRR (first-relevant rank) and nDCG@k (graded, position-weighted) are standard IR complements (Manning et al., 2008) — nDCG matches the telemetry docstring's stated nDCG@10 aspiration.
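For reference, the four metrics can be sketched with their textbook definitions. The function names are illustrative, and the gains are binary (an id is either in the gold set or not), which is an assumption: the gold-set format stores a set of relevant ids without grades, so nDCG here reduces to its binary form. The ranking is assumed to contain no duplicate ids.

```python
import math


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant set that appears in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-gain nDCG: DCG of the ranking divided by the best achievable DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

MRR is the mean of reciprocal_rank over all gold queries. For a known-item query with a single relevant id, recall@k is 1.0 or 0.0 and precision@k is at most 1/k, so recall, MRR and nDCG carry most of the signal on that part of the gold-set.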
Unit B — Backlog visibility (GAP-14) — pure code
Expose the halacha review backlog where health is already surfaced:
- metrics.get_dashboard() (mcp-server/src/legal_mcp/services/metrics.py): add halacha_backlog: {pending_review, approved, rejected, published, total, oldest_pending_at} from halachot.review_status + min(created_at) where pending_review. Surfaces through the get_metrics MCP tool (agents + dashboard).
- /api/system/diagnostics (web/app.py): add the same halacha_backlog block to the health snapshot.
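The backlog block amounts to one grouped count plus one MIN query. The sketch below uses the standard-library sqlite3 module only so that it runs on its own; the production service talks to Postgres, and its connection API is not shown in this spec. The table and column names (halachot, review_status, created_at) come from the description above; treating the four statuses as exhaustive is an assumption.

```python
import sqlite3

# Statuses named in the halacha_backlog block; assumed to be the full set.
STATUSES = ("pending_review", "approved", "rejected", "published")


def halacha_backlog(conn: sqlite3.Connection) -> dict:
    """Count halachot per review status and find the oldest pending row."""
    counts = dict.fromkeys(STATUSES, 0)
    total = 0
    rows = conn.execute(
        "SELECT review_status, COUNT(*) FROM halachot GROUP BY review_status"
    )
    for status, n in rows:
        total += n
        if status in counts:
            counts[status] = n
    oldest = conn.execute(
        "SELECT MIN(created_at) FROM halachot WHERE review_status = 'pending_review'"
    ).fetchone()[0]  # None when nothing is pending
    return {**counts, "total": total, "oldest_pending_at": oldest}
```

If a row ever carries a status outside the four listed, total will exceed the sum of the named counts, which makes an unexpected status visible instead of hiding it.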
Files
| File | Unit | Kind | Deploy |
|---|---|---|---|
| scripts/eval_gold_bootstrap.py | A2 | new, host-side | none |
| scripts/eval_retrieval.py | A3 | new, host-side | none |
| data/eval/gold-set.jsonl | A1 | data (on disk; chair-reviewed) | none |
| data/eval/baseline.json | A3 | committed snapshot | none |
| mcp-server/src/legal_mcp/services/metrics.py | B | edit get_dashboard | Coolify |
| web/app.py | B | edit diagnostics | Coolify |
| scripts/SCRIPTS.md | A | doc | none |
Test strategy
- Bootstrap: idempotent (re-run = same bootstrap rows; chair rows untouched); 0 chair rows clobbered.
- Harness: metric math unit-verified offline on a synthetic (ranking, relevant-set) fixture (precision@k / recall@k / MRR / nDCG@k against hand-computed values) before any DB run.
- Unit B: get_metrics (no case_number) returns halacha_backlog with counts summing to total; diagnostics endpoint returns the same block. Verified against prod counts.
Chair gate (domain — the only thing requiring Dafna)
After bootstrap produces gold-set.jsonl, Dafna reviews: are these queries representative, and are
the marked precedents the correct answers? Her edits make the gold-set authoritative. Until then
the baseline is "provisional (bootstrap-only)".