legal-ai

Author	SHA1	Message	Date
Chaim	4debe9995b	chore(#15): adopt MULTIMODAL_TEXT_WEIGHT=0.65 + close #15, open #80. A/B eval (eval_retrieval.py, 86-query gold-set) showed the 0.5 default was mis-tuned: the image side was too heavy and dragged precedent_library recall down from 0.971 to 0.885. Sweep 0.5..0.75: at 0.65 multimodal beats text-only on every overall metric AND every corpus (R@5 0.994 vs 0.989, nDCG@5 0.960 vs 0.944, MRR 0.954 vs 0.936). Dafna approved. - MULTIMODAL_TEXT_WEIGHT=0.65 set in Coolify (legal-ai, runtime) + redeploy. - baseline.json updated to the 0.65 config (future regression reference). - #15 done (premise was stale: multimodal already default on 110 docs; the win was tuning the weight, not the backfill). - #80 opened: the costly 140-doc legacy backfill is deferred until a targeted image-answer gold-set proves the table/image value prop (untested here). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 08:45:06 +00:00
Chaim	7161c3d010	chore(eval): add 9 chair-approved semantic queries to gold-set (FU-5) The gold-set was 77 known-item probes (query=case_name). Added 9 chair-approved SEMANTIC queries (S1–S9) — a real legal question per row, relevant = the precedents that should surface (drawn from subject_tags, chair-confirmed). These test what matters: does retrieval answer a legal issue, not just find a case by name. source='chair' (preserved across re-bootstrap). practice_area left empty so the filter never excludes a cross-tagged precedent (s.197 rulings sit under betterment_levy). Baseline now 86 queries. Finding from the 9 semantic queries: MRR ≈ 1.0 — the system surfaces a lead relevant precedent at rank 1 for nearly every question — but R@10 ranges 0.5–1.0: for broad questions with many co-relevant precedents (e.g. נטרול תמ"א 38 = 5 relevant → R@10 0.60; שמאי מכריע = 2 → 0.50) some co-relevant rulings miss the top-10. Lead-precedent retrieval is strong; exhaustive multi-precedent recall is the gap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-31 15:57:45 +00:00
Chaim	411ee18786	chore(eval): chair review — rename code-named record + refresh gold-set Chair review of the FU-5 gold-set surfaced one internal_committee record whose case_name was a code ("ARAR-24-9002") rather than a real name. Per the chair's citation (ערר 9002/24 קרקעות ירושלים 2 בע"מ נ' הוועדה המקומית ירושלים, נבו 13.8.2025, a s.197 compensation appeal), case_name corrected in the DB to "קרקעות ירושלים 2" (case_number 9002-24 and citation_formatted were already correct; only 1 such code-named record exists corpus-wide). Re-bootstrapped the gold-set (the known-item query is now the real name) and refreshed baseline (aggregate unchanged — the case retrieves identically under the corrected name). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-31 15:47:57 +00:00
Chaim	6ff2e36bf9	feat(eval): FU-5: retrieval eval harness + halacha backlog visibility (#63) Covers GAP-11 (INV-RET4/G8) and GAP-14 (INV-QA1/G10). Retrieval quality was never measured (only telemetry observation) and the halacha review backlog was invisible (the 10/19 gap was found by accident). Unit B: backlog visibility (pure code, container): - metrics.halacha_backlog(conn) → {pending_review, approved, rejected, published, total, oldest_pending_at}; surfaced in metrics.get_dashboard() (get_metrics MCP tool) and /api/system/diagnostics. Live count revealed 178 pending / 1552 total, oldest from 2026-05-03; previously invisible. Unit A: retrieval eval harness (host-side scripts): - scripts/eval_gold_bootstrap.py: seeds data/eval/gold-set.jsonl. Two sources: citations (cited==relevant via search_relevance_feedback; empty until decisions cite precedents) and known_item (query=case_name → relevant=self; a real citation-free signal, the methodology #52 checked by hand). Idempotent; preserves source='chair' rows. - scripts/eval_retrieval.py: runs the production retrieval path (search_library / search_internal) over the gold-set; computes precision@k, recall@k, MRR, nDCG@k (k=5,10); aggregates overall + per-corpus + per-practice_area; writes a report and a delta vs committed baseline.json (which records the retrieval_config it reflects). --self-test unit-checks the metric math offline. Gold-set strategy = hybrid (chair decision): bootstrap + chair review. The citation source is empty today (0 cited precedents in decisions), so the seed is known-item (77 queries: 54 internal_decisions + 23 precedent_library). The gold-set is PROVISIONAL until Dafna reviews it (the domain chair-gate). Baseline (production config: multimodal+rerank on): R@10=0.987, MRR=0.837, nDCG@10=0.872. Finding: MULTIMODAL_ENABLED=true slightly lowers known-item recall (image-page results displace exact name matches), which is relevant to #15. precedent_library is weaker than internal (R@10 0.957 vs 1.0): one external precedent is unfindable by name. "CI gate" realized as discipline (re-runnable harness + committed baseline + run before/after any retrieval-layer change): retrieval needs prod DB + Voyage, and no CI runner has that access. Spec: docs/superpowers/specs/2026-05-31-fu5-eval-harness-design.md Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-31 14:58:13 +00:00
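For readers unfamiliar with the metrics these commits report (precision@k, recall@k, MRR, nDCG@k), the following is a minimal sketch of the standard binary-relevance definitions. It is not the code in scripts/eval_retrieval.py, which is not shown in this log; the function names and the document IDs in the usage note are hypothetical, and nDCG is assumed here to use binary gains with a log2 discount.

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result; 0.0 if none is retrieved.
    MRR is the mean of this value over all queries."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-gain nDCG: DCG of the ranking divided by the DCG of an ideal
    ranking that places all relevant documents first."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

As a usage illustration with made-up IDs: for relevant = {"p1", "p2", "p3", "p4", "p5"} and a top-10 ranking that contains p1 at rank 1, p2 at rank 3, and p3 at rank 5, reciprocal_rank is 1.0 while recall_at_k(..., k=10) is 3/5 = 0.6. That is the same pattern the semantic-query finding describes: the lead precedent is found at rank 1, but some co-relevant rulings miss the top 10.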

4 Commits