Files
legal-ai/data/eval/baseline.json
Chaim 7161c3d010 chore(eval): add 9 chair-approved semantic queries to gold-set (FU-5)
The gold-set was 77 known-item probes (query=case_name). Added 9 chair-approved
SEMANTIC queries (S1–S9) — a real legal question per row, relevant = the
precedents that should surface (drawn from subject_tags, chair-confirmed). These
test what matters: does retrieval answer a legal issue, not just find a case by
name. source='chair' (preserved across re-bootstrap). practice_area left empty
so the filter never excludes a cross-tagged precedent (s.197 rulings sit under
betterment_levy).

Baseline now 86 queries. Finding from the 9 semantic queries: MRR ≈ 1.0 — the
system surfaces a lead relevant precedent at rank 1 for nearly every question —
but R@10 ranges 0.5–1.0: for broad questions with many co-relevant precedents
(e.g. נטרול תמ"א 38 = 5 relevant → R@10 0.60; שמאי מכריע = 2 → 0.50) some
co-relevant rulings miss the top-10. Lead-precedent retrieval is strong;
exhaustive multi-precedent recall is the gap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 15:57:45 +00:00

1.4 KiB