The gold set was 77 known-item probes (query=case_name). This adds 9 chair-approved SEMANTIC queries (S1–S9): each row is a real legal question, and its relevant set is the precedents that should surface (drawn from subject_tags, chair-confirmed). These test what matters: whether retrieval answers a legal issue, not just whether it finds a case by name.

The new rows carry source='chair', which is preserved across re-bootstrap. practice_area is left empty so the filter never excludes a cross-tagged precedent (for example, s.197 rulings sit under betterment_levy). The baseline is now 86 queries.

Finding from the 9 semantic queries: MRR ≈ 1.0, meaning the system surfaces a relevant lead precedent at rank 1 for nearly every question, but R@10 ranges from 0.5 to 1.0. For broad questions with many co-relevant precedents, some co-relevant rulings miss the top 10 (e.g. נטרול תמ"א 38, "neutralizing TAMA 38": 5 relevant, R@10 0.60; שמאי מכריע, "deciding appraiser": 2 relevant, R@10 0.50). Lead-precedent retrieval is strong; exhaustive multi-precedent recall is the gap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
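
For reference, the two metrics quoted above can be sketched as follows. This is a minimal illustration using the standard definitions of reciprocal rank and recall@k, not the project's actual evaluation code; the function names and the precedent IDs (`p1`, `x1`, ...) are hypothetical and chosen only to reproduce the 5-relevant and 2-relevant cases.

```python
from typing import Sequence, Set


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Return the fraction of relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical broad query: 5 relevant precedents, 3 of them inside the top 10.
relevant_broad = {"p1", "p2", "p3", "p4", "p5"}
ranked_broad = ["p1", "x1", "p2", "x2", "x3", "p3", "x4", "x5", "x6", "x7", "p4", "p5"]

# Hypothetical narrow query: 2 relevant precedents, 1 of them inside the top 10.
relevant_narrow = {"q1", "q2"}
ranked_narrow = ["q1"] + [f"y{i}" for i in range(9)] + ["q2"]

rr_broad = reciprocal_rank(ranked_broad, relevant_broad)
r10_broad = recall_at_k(ranked_broad, relevant_broad, k=10)
r10_narrow = recall_at_k(ranked_narrow, relevant_narrow, k=10)
```

In both hypothetical cases the first result is relevant, so the reciprocal rank is 1.0 even though recall@10 is below 1.0. That is the pattern the finding describes: a strong lead precedent with incomplete coverage of the co-relevant rulings.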