legal-ai/docs/superpowers/specs/2026-05-31-fu5-eval-harness-design.md
Chaim 6ff2e36bf9 feat(eval): FU-5 — retrieval eval harness + halacha backlog visibility (#63)
Covers GAP-11 (INV-RET4/G8) and GAP-14 (INV-QA1/G10). Retrieval quality was
never measured (only telemetry observation) and the halacha review backlog was
invisible (the 10/19 gap was found by accident).

Unit B — backlog visibility (pure code, container):
- metrics.halacha_backlog(conn) → {pending_review, approved, rejected, published,
  total, oldest_pending_at}; surfaced in metrics.get_dashboard() (get_metrics MCP
  tool) and /api/system/diagnostics. Live count revealed 178 pending / 1552 total,
  oldest from 2026-05-03 — previously invisible.

Unit A — retrieval eval harness (host-side scripts):
- scripts/eval_gold_bootstrap.py — seeds data/eval/gold-set.jsonl. Two sources:
  citations (cited==relevant via search_relevance_feedback — empty until decisions
  cite precedents) and known_item (query=case_name → relevant=self; a real
  citation-free signal, the methodology #52 checked by hand). Idempotent; preserves
  source='chair' rows.
- scripts/eval_retrieval.py — runs the production retrieval path (search_library /
  search_internal) over the gold-set; computes precision@k, recall@k, MRR, nDCG@k
  (k=5,10); aggregates overall + per-corpus + per-practice_area; writes a report and
  a delta vs committed baseline.json (which records the retrieval_config it reflects).
  --self-test unit-checks the metric math offline.

Gold-set strategy = hybrid (chair decision): bootstrap + chair review. The citation
source is empty today (0 cited precedents in decisions), so the seed is known-item
(77 queries: 54 internal_decisions + 23 precedent_library). The gold-set is
PROVISIONAL until Dafna reviews it (the domain chair-gate).

Baseline (production config: multimodal+rerank on): R@10=0.987, MRR=0.837,
nDCG@10=0.872. Finding: MULTIMODAL_ENABLED=true slightly lowers known-item recall
(image-page results displace exact name matches) — relevant to #15. precedent_library
weaker than internal (R@10 0.957 vs 1.0) — one external precedent unfindable by name.

"CI gate" realized as discipline (re-runnable harness + committed baseline + run
before/after any retrieval-layer change) — retrieval needs prod DB + Voyage, no CI
runner has that access.

Spec: docs/superpowers/specs/2026-05-31-fu5-eval-harness-design.md

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 14:58:13 +00:00

FU-5 — Retrieval Eval Harness + Backlog Visibility (design)

Task: #63 (legal-ai tag) · Covers: GAP-11, GAP-14 · Provides: INV-RET4, G8, INV-QA1, G10

Status: approved 2026-05-31 (gold-set strategy = hybrid, chair decision). Technical architecture decided per feedback_research_architecture_decisions (chair adjudicates domain, not architecture).

Problem

  1. GAP-11 (INV-RET4/G8): retrieval quality is never measured. Only telemetry.log_search_bg records queries (observation, not evaluation). No gold-set, no precision/recall. Every RRF-weight / k / embedder change is tuned "by feel".
  2. GAP-14 (INV-QA1/G10): the halacha review backlog (review_status='pending_review') is invisible — the 10/19-approved gap was found by accident. The human gate has no visibility.

Two independent units

Unit A — Retrieval eval harness (GAP-11)

Existing leverage: search_relevance_feedback already captures a real ground-truth signal — when a finalized decision cites a precedent, infer_relevance_from_citations marks it relevance_score=3 against the search_logs where it appeared (telemetry.py). This bootstraps the gold-set without hand-labeling.

A1. Gold-set: versioned file data/eval/gold-set.jsonl (single source of truth; reviewable, diffable, chair-editable). One JSON object per line:

{"id":"g001","query":"...","practice_area":"betterment_levy",
 "corpus":"precedent_library|internal_decisions",
 "relevant_case_law_ids":["uuid",...],"source":"bootstrap|chair","note":""}
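A loader for this format could look like the minimal sketch below. It reads the `a|b` notation in the schema as "exactly one of these values", and it assumes every key shown is required; both are interpretations, not rules stated by the spec. The function names `parse_gold_line` and `load_gold_set` are hypothetical.

```python
import json

# Keys taken from the schema line above; treating all of them as required is an assumption.
REQUIRED_KEYS = {"id", "query", "practice_area", "corpus",
                 "relevant_case_law_ids", "source", "note"}
CORPORA = {"precedent_library", "internal_decisions"}
SOURCES = {"bootstrap", "chair"}


def parse_gold_line(line: str) -> dict:
    """Parse one JSONL line and check it against the schema."""
    entry = json.loads(line)
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if entry["corpus"] not in CORPORA:
        raise ValueError(f"unknown corpus: {entry['corpus']!r}")
    if entry["source"] not in SOURCES:
        raise ValueError(f"unknown source: {entry['source']!r}")
    if not isinstance(entry["relevant_case_law_ids"], list):
        raise ValueError("relevant_case_law_ids must be a list")
    return entry


def load_gold_set(path: str) -> list[dict]:
    """Read every non-blank line of a gold-set file."""
    with open(path, encoding="utf-8") as fh:
        return [parse_gold_line(ln) for ln in fh if ln.strip()]
```

Failing loudly on an unknown corpus or source keeps a hand-edited file from silently skewing the per-corpus aggregates.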

A2. Bootstrap generator — scripts/eval_gold_bootstrap.py (host-side, mcp-server venv): reads search_relevance_feedback (score=3) ⨝ search_logs, groups by normalized query → relevant case_law_id set, emits source=bootstrap entries. Idempotent: re-run regenerates the bootstrap section; never overwrites source=chair rows. Chair gate: Dafna reviews the file, corrects/augments, promotes entries to source=chair.
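The grouping and the idempotent merge can be sketched without a database. Everything here is illustrative: the input row shape, the `normalize_query` rule (whitespace collapse plus casefold), and the choice to key entries by (query, corpus) are assumptions, and `id` assignment is left out.

```python
from collections import defaultdict


def normalize_query(q: str) -> str:
    """Hypothetical normalization: collapse whitespace and casefold."""
    return " ".join(q.split()).casefold()


def build_bootstrap_entries(feedback_rows):
    """Group (query, case_law_id, corpus, practice_area) rows by normalized query."""
    grouped = defaultdict(set)
    areas = {}
    for query, case_law_id, corpus, practice_area in feedback_rows:
        key = (normalize_query(query), corpus)
        grouped[key].add(case_law_id)
        areas.setdefault(key, practice_area)
    # Sorted keys and sorted ids make the output deterministic across runs.
    return [
        {"query": query, "practice_area": areas[(query, corpus)],
         "corpus": corpus,
         "relevant_case_law_ids": sorted(grouped[(query, corpus)]),
         "source": "bootstrap", "note": ""}
        for (query, corpus) in sorted(grouped)
    ]


def merge_gold_set(existing, fresh_bootstrap):
    """Keep chair rows untouched; replace the bootstrap section wholesale."""
    chair = [e for e in existing if e["source"] == "chair"]
    chair_keys = {(normalize_query(e["query"]), e["corpus"]) for e in chair}
    # A bootstrap row never shadows a query the chair has already adjudicated.
    kept = [e for e in fresh_bootstrap
            if (normalize_query(e["query"]), e["corpus"]) not in chair_keys]
    return chair + kept
```

Because the bootstrap section is rebuilt from scratch and chair rows are copied through unchanged, running the merge twice on the same feedback yields the same file.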

A3. Harness: scripts/eval_retrieval.py (host-side, mcp-server venv; needs POSTGRES + VOYAGE) runs the production retrieval path (the same service functions the MCP search tools call) for each gold query and computes per-query precision@k, recall@k, MRR, and nDCG@k (k∈{5,10}), treating the gold ids as the relevant set. It aggregates means overall, per corpus, and per practice_area; writes data/eval/eval-report-<ts>.{json,md}; and prints a summary plus a delta versus the committed data/eval/baseline.json. --update-baseline rewrites the snapshot.
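The aggregation step might be sketched as follows. The per-query record shape (a dict with `corpus`, `practice_area`, and a `metrics` mapping) and the function name `aggregate` are assumptions made for illustration; the spec does not fix the report layout.

```python
from collections import defaultdict
from statistics import mean


def aggregate(per_query):
    """Mean of each metric overall, per corpus, and per practice_area.

    per_query: list of dicts shaped like
    {"corpus": ..., "practice_area": ..., "metrics": {"recall@10": 1.0, ...}}
    (a hypothetical shape, not taken from the harness).
    """
    if not per_query:
        return {"overall": {}, "per_corpus": {}, "per_practice_area": {}}

    def mean_metrics(rows):
        names = rows[0]["metrics"].keys()
        return {n: mean(r["metrics"][n] for r in rows) for n in names}

    by_corpus, by_area = defaultdict(list), defaultdict(list)
    for row in per_query:
        by_corpus[row["corpus"]].append(row)
        by_area[row["practice_area"]].append(row)
    return {
        "overall": mean_metrics(per_query),
        "per_corpus": {k: mean_metrics(v) for k, v in by_corpus.items()},
        "per_practice_area": {k: mean_metrics(v) for k, v in by_area.items()},
    }
```

These are macro-averages over queries, so a corpus with few gold queries can swing its own mean sharply while barely moving the overall figure.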

"CI gate" — realized as discipline, not automation. Retrieval needs the prod DB + Voyage API; no CI runner has that access. The gate is: re-runnable harness + committed baseline.json + a documented "run before/after any retrieval-layer change, attach the delta" rule (SCRIPTS.md). A true automated CI gate would require a separate frozen corpus fixture — out of scope, noted as future.
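The "attach the delta" rule could be supported by a comparison like the sketch below. It assumes the baseline and the current run are each reduced to a flat metric-name to value mapping, which the spec does not state, and the 0.01 tolerance is a hypothetical threshold rather than a project rule.

```python
def metric_delta(baseline: dict, current: dict) -> dict:
    """Return current minus baseline for every metric present in both."""
    return {name: round(current[name] - baseline[name], 4)
            for name in baseline if name in current}


def regressions(delta: dict, tolerance: float = 0.01) -> list[str]:
    """Names of metrics that dropped by more than the (hypothetical) tolerance."""
    return sorted(name for name, d in delta.items() if d < -tolerance)
```

A reviewer would then read the list returned by `regressions` next to the retrieval-layer change that produced it.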

Scope: the two precedent corpora (search_precedent_library + search_internal_decisions), where the citation signal exists. search_decisions/search_case_documents return case-document chunks (not case_law) and carry no citation ground-truth — deliberately out of scope.

Metrics rationale: precision@k + recall@k are spec-required (INV-RET4). MRR (first-relevant rank) and nDCG@k (graded, position-weighted) are standard IR complements (Manning et al., 2008) — nDCG matches the telemetry docstring's stated nDCG@10 aspiration.
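For reference, the four metrics can be written in a few lines. This is a sketch under two assumptions: relevance is binary (an id is either in the gold set or not, so nDCG uses gain 1 rather than graded gains), and the ranked list contains no duplicate ids. The function names are illustrative, not taken from the harness.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k slots filled by relevant ids (denominator is k)."""
    return sum(1 for r in ranked[:k] if r in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant ids that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for r in ranked[:k] if r in relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant id, or 0.0 if none is retrieved."""
    for i, r in enumerate(ranked, start=1):
        if r in relevant:
            return 1.0 / i
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-gain nDCG: DCG of the ranking divided by DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, r in enumerate(ranked[:k], start=1) if r in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

MRR is the mean of `reciprocal_rank` over all gold queries. For a known-item query with a single relevant id, recall@k is either 0 or 1, so its mean reads directly as the share of queries whose target appeared in the top k.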

Unit B — Backlog visibility (GAP-14) — pure code

Expose the halacha review backlog where health is already surfaced:

  • metrics.get_dashboard() (mcp-server/src/legal_mcp/services/metrics.py) — add halacha_backlog: {pending_review, approved, rejected, published, total, oldest_pending_at} from halachot.review_status + min(created_at) where pending_review. Surfaces through the get_metrics MCP tool (agents + dashboard).
  • /api/system/diagnostics (web/app.py) — add the same halacha_backlog block to the health snapshot.
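The block both surfaces return could be assembled as in the sketch below. The table and column names (halachot, review_status, created_at) and the output keys come from the spec; the SQL text, the grouped-row input shape, and the function name `summarize_backlog` are assumptions, and no database driver is involved.

```python
from datetime import datetime

STATUSES = ("pending_review", "approved", "rejected", "published")

# Illustrative query only; the real service may build the block differently.
BACKLOG_SQL = """
SELECT review_status, COUNT(*) AS n, MIN(created_at) AS oldest
FROM halachot
GROUP BY review_status
"""


def summarize_backlog(rows):
    """Fold (review_status, count, oldest_created_at) rows into one block."""
    block = {status: 0 for status in STATUSES}
    block["total"] = 0
    block["oldest_pending_at"] = None
    for status, count, oldest in rows:
        # total counts every row, so a status outside STATUSES shows up as a
        # gap between total and the sum of the four named counts.
        block["total"] += count
        if status in STATUSES:
            block[status] = count
        if status == "pending_review" and oldest is not None:
            block["oldest_pending_at"] = (
                oldest.isoformat() if isinstance(oldest, datetime) else str(oldest)
            )
    return block
```

Building the block in one helper and calling it from both get_dashboard() and the diagnostics endpoint would keep the two surfaces from drifting apart.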

Files

| File | Unit | Kind | Deploy |
|---|---|---|---|
| scripts/eval_gold_bootstrap.py | A2 | new, host-side | none |
| scripts/eval_retrieval.py | A3 | new, host-side | none |
| data/eval/gold-set.jsonl | A1 | data (on disk; chair-reviewed) | none |
| data/eval/baseline.json | A3 | committed snapshot | none |
| mcp-server/src/legal_mcp/services/metrics.py | B | edit get_dashboard | Coolify |
| web/app.py | B | edit diagnostics | Coolify |
| scripts/SCRIPTS.md | A | doc | none |

Test strategy

  • Bootstrap: idempotent (a re-run yields the same bootstrap rows and clobbers 0 chair rows).
  • Harness: metric math unit-verified offline on a synthetic (ranking, relevant-set) fixture (precision@k / recall@k / MRR / nDCG@k against hand-computed values) before any DB run.
  • Unit B: get_metrics (no case_number) returns halacha_backlog with counts summing to total; diagnostics endpoint returns the same block. Verified against prod counts.

Chair gate (domain — the only thing requiring Dafna)

After bootstrap produces gold-set.jsonl, Dafna reviews: are these queries representative, and are the marked precedents the correct answers? Her edits make the gold-set authoritative. Until then the baseline is "provisional (bootstrap-only)".