legal-ai/data/eval/baseline.json
Chaim 7161c3d010 chore(eval): add 9 chair-approved semantic queries to gold-set (FU-5)
The gold-set was 77 known-item probes (query=case_name). Added 9 chair-approved
SEMANTIC queries (S1–S9): each row is a real legal question, and relevant = the
precedents that should surface (drawn from subject_tags, chair-confirmed). These
test what matters: whether retrieval answers a legal issue, not just whether it
finds a case by name. Rows use source='chair', which is preserved across
re-bootstrap. practice_area is left empty so the filter never excludes a
cross-tagged precedent (s.197 rulings sit under betterment_levy).

Baseline is now 86 queries. Finding from the 9 semantic queries: MRR ≈ 1.0, i.e.
the system surfaces a relevant lead precedent at rank 1 for nearly every
question, but R@10 ranges 0.5–1.0. For broad questions with many co-relevant
precedents (e.g. נטרול תמ"א 38 [neutralizing TAMA 38] = 5 relevant → R@10 0.60;
שמאי מכריע [deciding appraiser] = 2 relevant → R@10 0.50), some co-relevant
rulings miss the top 10. Lead-precedent retrieval is strong; exhaustive
multi-precedent recall is the gap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31 15:57:45 +00:00
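
The gap between MRR and R@10 described in the commit message can be illustrated with a minimal sketch. The function names, document IDs, and ranking below are hypothetical and not taken from the repository's eval code; they only show how a query can score a perfect reciprocal rank while still missing co-relevant precedents in the top 10.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical query with 5 relevant precedents (IDs are invented).
relevant = {"p1", "p2", "p3", "p4", "p5"}
# Top-10 ranking: a relevant precedent leads, but only 3 of the 5 make the cut.
ranked = ["p1", "x1", "p2", "x2", "x3", "p3", "x4", "x5", "x6", "x7"]

rr = reciprocal_rank(ranked, relevant)     # first relevant hit is at rank 1
r10 = recall_at_k(ranked, relevant, k=10)  # 3 of 5 relevant items retrieved
print(f"RR={rr:.2f}  R@10={r10:.2f}")      # prints: RR=1.00  R@10=0.60
```

MRR is the mean of the per-query reciprocal rank, so it rewards only the first relevant hit, while R@10 penalizes every relevant precedent left outside the top 10.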


{
  "gold_size": 86,
  "retrieval_config": {
    "MULTIMODAL_ENABLED": true,
    "VOYAGE_RERANK_ENABLED": true,
    "VOYAGE_MODEL": "voyage-3",
    "MULTIMODAL_TEXT_WEIGHT": 0.5,
    "MULTIMODAL_RRF_K": 60,
    "BM25_HYBRID_ENABLED": true
  },
  "overall": {
    "P@5": 0.214,
    "R@5": 0.899,
    "nDCG@5": 0.8311,
    "P@10": 0.1163,
    "R@10": 0.9649,
    "nDCG@10": 0.8554,
    "MRR": 0.8482
  },
  "by_corpus": {
    "internal_decisions": {
      "P@5": 0.1963,
      "R@5": 0.963,
      "nDCG@5": 0.887,
      "P@10": 0.1019,
      "R@10": 1.0,
      "nDCG@10": 0.8994,
      "MRR": 0.8713
    },
    "precedent_library": {
      "P@5": 0.2438,
      "R@5": 0.7911,
      "nDCG@5": 0.7367,
      "P@10": 0.1406,
      "R@10": 0.9057,
      "nDCG@10": 0.7813,
      "MRR": 0.8092
    }
  },
  "by_practice_area": {
    "betterment_levy": {
      "P@5": 0.1897,
      "R@5": 0.9231,
      "nDCG@5": 0.8595,
      "P@10": 0.1,
      "R@10": 0.9744,
      "nDCG@10": 0.8766,
      "MRR": 0.8437
    },
    "compensation_197": {
      "P@5": 0.2,
      "R@5": 1.0,
      "nDCG@5": 1.0,
      "P@10": 0.1,
      "R@10": 1.0,
      "nDCG@10": 1.0,
      "MRR": 1.0
    },
    "rishuy_uvniya": {
      "P@5": 0.2,
      "R@5": 0.9706,
      "nDCG@5": 0.861,
      "P@10": 0.1029,
      "R@10": 1.0,
      "nDCG@10": 0.8708,
      "MRR": 0.8346
    }
  },
  "generated_at": "20260531T155717Z"
}
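
A baseline file like this is typically compared against later evaluation runs. The sketch below is one possible way to do that and is not the repository's actual tooling: `find_regressions`, the `tolerance` value, and the "current" metrics are all assumptions made for illustration. Only the three baseline numbers are copied from the `overall` block above.

```python
import json
from typing import Dict, List


def find_regressions(baseline: Dict[str, float], current: Dict[str, float],
                     tolerance: float = 0.01) -> List[str]:
    """Return metric names whose current value fell more than `tolerance` below baseline."""
    return [
        name for name, base_value in baseline.items()
        if name in current and current[name] < base_value - tolerance
    ]


# Trimmed copy of the "overall" block from the baseline above.
baseline_doc = json.loads(
    '{"overall": {"R@10": 0.9649, "nDCG@10": 0.8554, "MRR": 0.8482}}'
)

# Hypothetical metrics from a later run (values invented for illustration).
current_overall = {"R@10": 0.9700, "nDCG@10": 0.8200, "MRR": 0.8480}

regressed = find_regressions(baseline_doc["overall"], current_overall)
print(regressed)  # prints: ['nDCG@10']
```

A small tolerance keeps run-to-run noise from being flagged; the right threshold would depend on how stable the retrieval pipeline's scores are in practice.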