feat(eval): FU-5 — retrieval eval harness + halacha backlog visibility (#63)

Covers GAP-11 (INV-RET4/G8) and GAP-14 (INV-QA1/G10). Retrieval quality was
never measured (only telemetry observation) and the halacha review backlog was
invisible (the 10/19 gap was found by accident).

Unit B — backlog visibility (pure code, container):
- metrics.halacha_backlog(conn) → {pending_review, approved, rejected, published,
  total, oldest_pending_at}; surfaced in metrics.get_dashboard() (get_metrics MCP
  tool) and /api/system/diagnostics. Live count revealed 178 pending / 1552 total,
  oldest from 2026-05-03 — previously invisible.

Unit A — retrieval eval harness (host-side scripts):
- scripts/eval_gold_bootstrap.py — seeds data/eval/gold-set.jsonl. Two sources:
  citations (cited==relevant via search_relevance_feedback — empty until decisions
  cite precedents) and known_item (query=case_name → relevant=self; a real
  citation-free signal, the methodology #52 checked by hand). Idempotent; preserves
  source='chair' rows.
- scripts/eval_retrieval.py — runs the production retrieval path (search_library /
  search_internal) over the gold-set; computes precision@k, recall@k, MRR, nDCG@k
  (k=5,10); aggregates overall + per-corpus + per-practice_area; writes a report and
  a delta vs committed baseline.json (which records the retrieval_config it reflects).
  --self-test unit-checks the metric math offline.

Gold-set strategy = hybrid (chair decision): bootstrap + chair review. The citation
source is empty today (0 cited precedents in decisions), so the seed is known-item
(77 queries: 54 internal_decisions + 23 precedent_library). The gold-set is
PROVISIONAL until Dafna reviews it (the domain chair-gate).

Baseline (production config: multimodal+rerank on): R@10=0.987, MRR=0.837,
nDCG@10=0.872. Finding: MULTIMODAL_ENABLED=true slightly lowers known-item recall
(image-page results displace exact name matches) — relevant to #15. precedent_library
weaker than internal (R@10 0.957 vs 1.0) — one external precedent unfindable by name.

"CI gate" realized as discipline (re-runnable harness + committed baseline + run
before/after any retrieval-layer change) — retrieval needs prod DB + Voyage, no CI
runner has that access.

Spec: docs/superpowers/specs/2026-05-31-fu5-eval-harness-design.md

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-31 14:58:13 +00:00
parent cfcac80de2
commit 6ff2e36bf9
10 changed files with 776 additions and 10 deletions

View File

@@ -30,7 +30,7 @@ import asyncpg
import httpx
from legal_mcp import config
from legal_mcp.services import chunker, db, embeddings, extractor, git_sync, processor, proofreader, research_md
from legal_mcp.services import chunker, db, embeddings, extractor, git_sync, metrics as metrics_service, processor, proofreader, research_md
from legal_mcp.tools import cases as cases_tools, search as search_tools, workflow as workflow_tools, drafting as drafting_tools, precedents as precedents_tools
# Import integration clients (same directory)
@@ -2210,6 +2210,9 @@ async def system_diagnostics():
"ORDER BY d.created_at DESC LIMIT 20"
)
# Halacha review backlog (GAP-14 / INV-QA1 / G10) — human gate visibility
halacha_backlog = await metrics_service.halacha_backlog(conn)
active_tasks = [
{"task_id": tid, "filename": d.get("filename", ""),
"status": d.get("status", ""), "step": d.get("step", "")}
@@ -2219,6 +2222,7 @@ async def system_diagnostics():
return {
"db_ok": db_ok,
"tables": tables,
"halacha_backlog": halacha_backlog,
"failed_documents": [
{
"id": str(r["id"]),