feat(eval): FU-5 — retrieval eval harness + halacha backlog visibility (#63)

Covers GAP-11 (INV-RET4/G8) and GAP-14 (INV-QA1/G10). Retrieval quality was
never measured (telemetry only observed queries), and the halacha review backlog
was invisible (the 10/19 gap was found by accident).

Unit B: backlog visibility (pure code, container):
- metrics.halacha_backlog(conn) → {pending_review, approved, rejected, published,
  total, oldest_pending_at}; surfaced in metrics.get_dashboard() (get_metrics MCP
  tool) and /api/system/diagnostics. The live count showed 178 pending / 1552 total,
  oldest from 2026-05-03; none of this was visible before.

Unit A: retrieval eval harness (host-side scripts):
- scripts/eval_gold_bootstrap.py seeds data/eval/gold-set.jsonl from two sources:
  citations (cited==relevant via search_relevance_feedback; empty until decisions
  cite precedents) and known_item (query=case_name → relevant=self; a real
  citation-free signal, the methodology #52 checked by hand). Idempotent; preserves
  source='chair' rows.
- scripts/eval_retrieval.py runs the production retrieval path (search_library /
  search_internal) over the gold-set; computes precision@k, recall@k, MRR, nDCG@k
  (k=5,10); aggregates overall + per-corpus + per-practice_area; writes a report and
  a delta vs committed baseline.json (which records the retrieval_config it reflects).
  --self-test unit-checks the metric math offline.

Gold-set strategy = hybrid (chair decision): bootstrap + chair review. The citation
source is empty today (0 cited precedents in decisions), so the seed is known-item
(77 queries: 54 internal_decisions + 23 precedent_library). The gold-set is
PROVISIONAL until Dafna reviews it (the domain chair-gate).

Baseline (production config: multimodal+rerank on): R@10=0.987, MRR=0.837,
nDCG@10=0.872. Finding: MULTIMODAL_ENABLED=true slightly lowers known-item recall
(image-page results displace exact name matches), which is relevant to #15.
precedent_library is weaker than internal (R@10 0.957 vs 1.0): one external
precedent is unfindable by name.

"CI gate" realized as discipline (re-runnable harness + committed baseline + run
before/after any retrieval-layer change): retrieval needs prod DB + Voyage, and no
CI runner has that access.

Spec: docs/superpowers/specs/2026-05-31-fu5-eval-harness-design.md

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
docs/superpowers/specs/2026-05-31-fu5-eval-harness-design.md
@@ -0,0 +1,92 @@
# FU-5: Retrieval Eval Harness + Backlog Visibility (design)

**Task:** #63 (legal-ai tag) · **Covers:** GAP-11, GAP-14 · **Provides:** INV-RET4, G8, INV-QA1, G10
**Status:** approved 2026-05-31 (gold-set strategy = hybrid, chair decision). Technical architecture
decided per `feedback_research_architecture_decisions` (chair adjudicates domain, not architecture).

## Problem

1. **GAP-11 (INV-RET4/G8):** retrieval quality is never measured. Only `telemetry.log_search_bg`
   records queries (observation, not evaluation). No gold-set, no precision/recall. Every RRF-weight
   / `k` / embedder change is tuned "by feel".
2. **GAP-14 (INV-QA1/G10):** the halacha review backlog (`review_status='pending_review'`) is
   invisible: the 10/19-approved gap was found by accident. The human gate has no visibility.

## Two independent units

### Unit A: Retrieval eval harness (GAP-11)

**Existing leverage:** `search_relevance_feedback` already captures a real ground-truth signal:
when a finalized decision cites a precedent, `infer_relevance_from_citations` marks it
`relevance_score=3` against the `search_logs` where it appeared (telemetry.py). This bootstraps the
gold-set without hand-labeling.

**A1. Gold-set: versioned file `data/eval/gold-set.jsonl`** (single source of truth;
reviewable/diffable/chair-editable). One JSON object per line (wrapped here for readability):
```json
{"id":"g001","query":"...","practice_area":"betterment_levy",
 "corpus":"precedent_library|internal_decisions",
 "relevant_case_law_ids":["uuid",...],"source":"bootstrap|chair","note":""}
```

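A minimal sketch of how a reader of this file might validate rows before running an evaluation. The field names come from the example above; the function name `parse_gold_lines`, the specific checks, and the error messages are assumptions made for illustration, not the behavior of the committed scripts.

```python
import json

# Allowed values are taken from the "a|b" alternatives in the example row above.
ALLOWED_CORPORA = {"precedent_library", "internal_decisions"}
ALLOWED_SOURCES = {"bootstrap", "chair"}
REQUIRED_FIELDS = ("id", "query", "practice_area", "corpus",
                   "relevant_case_law_ids", "source")


def parse_gold_lines(lines):
    """Parse JSONL lines into gold entries, raising ValueError on a bad row.

    Hypothetical helper: the checks below are illustrative assumptions.
    """
    entries, seen_ids = [], set()
    for lineno, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        row = json.loads(raw)
        missing = [f for f in REQUIRED_FIELDS if f not in row]
        if missing:
            raise ValueError(f"line {lineno}: missing fields {missing}")
        if row["corpus"] not in ALLOWED_CORPORA:
            raise ValueError(f"line {lineno}: unknown corpus {row['corpus']!r}")
        if row["source"] not in ALLOWED_SOURCES:
            raise ValueError(f"line {lineno}: unknown source {row['source']!r}")
        if not row["relevant_case_law_ids"]:
            raise ValueError(f"line {lineno}: no relevant ids")
        if row["id"] in seen_ids:
            raise ValueError(f"line {lineno}: duplicate id {row['id']!r}")
        seen_ids.add(row["id"])
        entries.append(row)
    return entries
```

Failing loudly on a malformed row keeps a hand-edited file from silently shrinking the evaluated query set.
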
**A2. Bootstrap generator: `scripts/eval_gold_bootstrap.py`** (host-side, mcp-server venv):
reads `search_relevance_feedback` (score=3) ⨝ `search_logs`, groups by normalized query →
relevant `case_law_id` set, emits `source=bootstrap` entries. Idempotent: a re-run regenerates the
bootstrap section and never overwrites `source=chair` rows. **Chair gate:** Dafna reviews the file,
corrects/augments, promotes entries to `source=chair`.

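The grouping and the idempotent merge described in A2 could look roughly like the sketch below. It works on in-memory tuples rather than the database join, and everything named here (`normalize_query`, `stable_id`, `build_bootstrap_rows`, `merge_gold`, the hash-derived `b…` ids, the whitespace and lowercase normalization) is a hypothetical choice for illustration, not the script's actual code.

```python
import hashlib
import re
from collections import defaultdict


def normalize_query(query: str) -> str:
    """Collapse whitespace and lowercase (an assumed normalization)."""
    return re.sub(r"\s+", " ", query).strip().lower()


def stable_id(corpus: str, norm_query: str) -> str:
    """Derive an id from the key so re-runs assign the same id (assumption)."""
    digest = hashlib.sha1(f"{corpus}|{norm_query}".encode("utf-8")).hexdigest()
    return "b" + digest[:8]


def build_bootstrap_rows(feedback_rows):
    """feedback_rows: (query, practice_area, corpus, case_law_id) tuples,
    assumed to be pre-filtered to relevance score 3."""
    grouped = defaultdict(set)
    areas = {}
    for query, practice_area, corpus, case_law_id in feedback_rows:
        key = (corpus, normalize_query(query))
        grouped[key].add(case_law_id)
        areas.setdefault(key, practice_area)
    return [
        {"id": stable_id(*key), "query": key[1], "practice_area": areas[key],
         "corpus": key[0], "relevant_case_law_ids": sorted(grouped[key]),
         "source": "bootstrap", "note": ""}
        for key in sorted(grouped)  # sorted keys give a deterministic file
    ]


def merge_gold(existing, fresh_bootstrap):
    """Keep every chair row as-is; replace all bootstrap rows with fresh ones,
    skipping any fresh row whose (corpus, query) a chair row already covers."""
    chair = [row for row in existing if row["source"] == "chair"]
    chair_keys = {(row["corpus"], normalize_query(row["query"])) for row in chair}
    kept = [row for row in fresh_bootstrap
            if (row["corpus"], row["query"]) not in chair_keys]
    return chair + kept
```

Because the output depends only on the chair rows and the fresh bootstrap rows, merging twice with the same input yields the same file, which is the idempotence property the test strategy checks.
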
**A3. Harness: `scripts/eval_retrieval.py`** (host-side, mcp-server venv; needs POSTGRES + VOYAGE):
runs the **production retrieval path** (same service functions the MCP search tools call) for each
gold query, computes per-query **precision@k, recall@k, MRR, nDCG@k** (k∈{5,10}); relevant = gold
ids. Aggregates mean overall + per corpus + per practice_area. Writes
`data/eval/eval-report-<ts>.{json,md}`, prints a summary and a delta vs the committed
`data/eval/baseline.json`. `--update-baseline` rewrites the snapshot.

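For reference, the four per-query metrics can be sketched as below. This is not the harness code: it assumes binary relevance (an id is either in the gold set or not), divides precision by `k` even when fewer than `k` results come back, and uses illustrative function names.

```python
import math


def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k slots filled by relevant ids (denominator is k)."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def recall_at_k(ranking, relevant, k):
    """Fraction of the relevant ids that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranking[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(ranking, relevant):
    """1/rank of the first relevant id, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranking, relevant, k):
    """Binary-gain nDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranking[:k], start=1)
              if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hand-checkable fixture: relevant ids at ranks 1 and 3, one relevant id missed.
ranking = ["a", "x", "b", "y", "z"]
relevant = {"a", "b", "c"}
print(precision_at_k(ranking, relevant, 5))  # 2 hits / 5 slots
print(recall_at_k(ranking, relevant, 5))     # 2 hits / 3 relevant
print(reciprocal_rank(ranking, relevant))    # first hit at rank 1
print(ndcg_at_k(ranking, relevant, 5))       # 1.5 / (1 + 1/log2(3) + 0.5)
```

MRR is then the mean of `reciprocal_rank` over all gold queries, and the other aggregates are plain means of the per-query values.
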
**"CI gate": realized as discipline, not automation.** Retrieval needs the prod DB + Voyage API;
no CI runner has that access. The gate is: re-runnable harness + committed `baseline.json` + a
documented "run before/after any retrieval-layer change, attach the delta" rule (SCRIPTS.md). A true
automated CI gate would require a separate frozen corpus fixture, which is out of scope and noted as
future work.

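The "attach the delta" step amounts to a per-metric subtraction against the committed snapshot. A rough sketch follows, assuming the baseline and the current run are both flat `{metric_name: value}` dicts; the real `baseline.json` layout, the function name `compare_to_baseline`, and the `tolerance` threshold are assumptions for illustration. The baseline numbers are the ones quoted in the commit message; the "current" numbers are made up.

```python
def compare_to_baseline(baseline, current, tolerance=0.01):
    """Return (deltas, regressions) for metrics present in both dicts.

    A metric counts as a regression when it drops by more than `tolerance`
    (a hypothetical threshold, not one defined by the spec).
    """
    deltas = {}
    regressions = []
    for name in sorted(baseline):
        if name not in current:
            continue  # metric not produced by this run
        delta = round(current[name] - baseline[name], 4)
        deltas[name] = delta
        if delta < -tolerance:
            regressions.append(name)
    return deltas, regressions


baseline = {"recall@10": 0.987, "mrr": 0.837, "ndcg@10": 0.872}
current = {"recall@10": 0.987, "mrr": 0.80, "ndcg@10": 0.88}  # invented run
deltas, regressions = compare_to_baseline(baseline, current)
print(deltas)
print(regressions)
```

Since the gate is a discipline rather than automation, a helper like this only formats the evidence; a person still decides whether a drop is acceptable.
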
**Scope:** the two precedent corpora (`search_precedent_library` + `search_internal_decisions`),
where the citation signal exists. `search_decisions`/`search_case_documents` return case-document
chunks (not `case_law`) and carry no citation ground-truth, so they are deliberately out of scope.

**Metrics rationale:** precision@k + recall@k are spec-required (INV-RET4). MRR (first-relevant
rank) and nDCG@k (graded, position-weighted) are standard IR complements (Manning et al., 2008);
nDCG matches the telemetry docstring's stated nDCG@10 aspiration.

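For concreteness, the textbook forms of these metrics are given below, with $T_k$ the top-$k$ retrieved ids for a query, $R$ its gold relevant set, $Q$ the set of gold queries, $\mathrm{rank}_q$ the rank of the first relevant result for $q$, and $rel_i$ the relevance grade at rank $i$. The linear-gain form of DCG is an assumption; the spec does not say which gain function is used, and with binary relevance the linear and exponential ($2^{rel_i}-1$) forms coincide.

```latex
P@k = \frac{|R \cap T_k|}{k}
\qquad
R@k = \frac{|R \cap T_k|}{|R|}
\qquad
\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}

\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}
\qquad
\mathrm{nDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}
```

Here $\mathrm{IDCG}@k$ is the DCG of the ideal ordering (all relevant ids first), and a query with no relevant result in the ranking contributes $0$ to MRR.
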
### Unit B: Backlog visibility (GAP-14), pure code

Expose the halacha review backlog where health is already surfaced:
- **`metrics.get_dashboard()`** (mcp-server/src/legal_mcp/services/metrics.py): add
  `halacha_backlog: {pending_review, approved, rejected, published, total, oldest_pending_at}` from
  `halachot.review_status` + `min(created_at) where pending_review`. Surfaces through the
  `get_metrics` MCP tool (agents + dashboard).
- **`/api/system/diagnostics`** (web/app.py): add the same `halacha_backlog` block to the health
  snapshot.

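The shape of the backlog query can be sketched as follows. The production code talks to Postgres; this sketch substitutes the standard-library `sqlite3` module so it runs standalone, and it assumes the four listed statuses are the only values of `review_status`, so that `total` is their sum. Only the table, column, and key names come from the spec; the function body is illustrative.

```python
import sqlite3

STATUSES = ("pending_review", "approved", "rejected", "published")


def halacha_backlog(conn):
    """Count halachot per review status and find the oldest pending row.

    Sketch only: sqlite3 stands in for the production Postgres connection.
    """
    counts = dict.fromkeys(STATUSES, 0)
    rows = conn.execute(
        "SELECT review_status, COUNT(*) FROM halachot GROUP BY review_status"
    )
    for status, n in rows:
        if status in counts:
            counts[status] = n
    oldest = conn.execute(
        "SELECT MIN(created_at) FROM halachot "
        "WHERE review_status = 'pending_review'"
    ).fetchone()[0]  # None when nothing is pending
    return {**counts, "total": sum(counts.values()), "oldest_pending_at": oldest}


# Tiny in-memory demo with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE halachot (review_status TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO halachot VALUES (?, ?)",
    [("pending_review", "2026-05-10"), ("pending_review", "2026-05-03"),
     ("approved", "2026-04-01")],
)
backlog = halacha_backlog(conn)
print(backlog)
```

Returning zero for statuses with no rows keeps the block's keys stable, which makes the dashboard and the diagnostics endpoint easier to compare.
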
## Files

| File | Unit | Kind | Deploy |
|------|------|------|--------|
| `scripts/eval_gold_bootstrap.py` | A2 | new, host-side | none |
| `scripts/eval_retrieval.py` | A3 | new, host-side | none |
| `data/eval/gold-set.jsonl` | A1 | data (on disk; chair-reviewed) | none |
| `data/eval/baseline.json` | A3 | committed snapshot | none |
| `mcp-server/src/legal_mcp/services/metrics.py` | B | edit `get_dashboard` | Coolify |
| `web/app.py` | B | edit diagnostics | Coolify |
| `scripts/SCRIPTS.md` | A | doc | none |

## Test strategy

- Bootstrap: idempotent (re-run = same bootstrap rows; chair rows untouched); 0 chair rows clobbered.
- Harness: metric math unit-verified offline on a synthetic (ranking, relevant-set) fixture
  (precision@k / recall@k / MRR / nDCG@k against hand-computed values) before any DB run.
- Unit B: `get_metrics` (no case_number) returns `halacha_backlog` with counts summing to total;
  diagnostics endpoint returns the same block. Verified against prod counts.

## Chair gate (domain: the only thing requiring Dafna)

After bootstrap produces `gold-set.jsonl`, Dafna reviews: are these queries representative, and are
the marked precedents the *correct* answers? Her edits make the gold-set authoritative. Until then
the baseline is "provisional (bootstrap-only)".