legal-ai

Author	SHA1	Message	Date
Chaim	7161c3d010	chore(eval): add 9 chair-approved semantic queries to gold-set (FU-5) The gold-set was 77 known-item probes (query=case_name). Added 9 chair-approved SEMANTIC queries (S1–S9) — a real legal question per row, relevant = the precedents that should surface (drawn from subject_tags, chair-confirmed). These test what matters: does retrieval answer a legal issue, not just find a case by name. source='chair' (preserved across re-bootstrap). practice_area left empty so the filter never excludes a cross-tagged precedent (s.197 rulings sit under betterment_levy). Baseline now 86 queries. Finding from the 9 semantic queries: MRR ≈ 1.0 — the system surfaces a lead relevant precedent at rank 1 for nearly every question — but R@10 ranges 0.5–1.0: for broad questions with many co-relevant precedents (e.g. נטרול תמ"א 38 = 5 relevant → R@10 0.60; שמאי מכריע = 2 → 0.50) some co-relevant rulings miss the top-10. Lead-precedent retrieval is strong; exhaustive multi-precedent recall is the gap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-31 15:57:45 +00:00
Chaim	411ee18786	chore(eval): chair review — rename code-named record + refresh gold-set Chair review of the FU-5 gold-set surfaced one internal_committee record whose case_name was a code ("ARAR-24-9002") rather than a real name. Per the chair's citation (ערר 9002/24 קרקעות ירושלים 2 בע"מ נ' הוועדה המקומית ירושלים, נבו 13.8.2025, a s.197 compensation appeal), case_name corrected in the DB to "קרקעות ירושלים 2" (case_number 9002-24 and citation_formatted were already correct; only 1 such code-named record exists corpus-wide). Re-bootstrapped the gold-set (the known-item query is now the real name) and refreshed baseline (aggregate unchanged — the case retrieves identically under the corrected name). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-31 15:47:57 +00:00
Chaim	6ff2e36bf9	feat(eval): FU-5 — retrieval eval harness + halacha backlog visibility (#63 ) Covers GAP-11 (INV-RET4/G8) and GAP-14 (INV-QA1/G10). Retrieval quality was never measured (only telemetry observation) and the halacha review backlog was invisible (the 10/19 gap was found by accident). Unit B — backlog visibility (pure code, container): - metrics.halacha_backlog(conn) → {pending_review, approved, rejected, published, total, oldest_pending_at}; surfaced in metrics.get_dashboard() (get_metrics MCP tool) and /api/system/diagnostics. Live count revealed 178 pending / 1552 total, oldest from 2026-05-03 — previously invisible. Unit A — retrieval eval harness (host-side scripts): - scripts/eval_gold_bootstrap.py — seeds data/eval/gold-set.jsonl. Two sources: citations (cited==relevant via search_relevance_feedback — empty until decisions cite precedents) and known_item (query=case_name → relevant=self; a real citation-free signal, the methodology #52 checked by hand). Idempotent; preserves source='chair' rows. - scripts/eval_retrieval.py — runs the production retrieval path (search_library / search_internal) over the gold-set; computes precision@k, recall@k, MRR, nDCG@k (k=5,10); aggregates overall + per-corpus + per-practice_area; writes a report and a delta vs committed baseline.json (which records the retrieval_config it reflects). --self-test unit-checks the metric math offline. Gold-set strategy = hybrid (chair decision): bootstrap + chair review. The citation source is empty today (0 cited precedents in decisions), so the seed is known-item (77 queries: 54 internal_decisions + 23 precedent_library). The gold-set is PROVISIONAL until Dafna reviews it (the domain chair-gate). Baseline (production config: multimodal+rerank on): R@10=0.987, MRR=0.837, nDCG@10=0.872. Finding: MULTIMODAL_ENABLED=true slightly lowers known-item recall (image-page results displace exact name matches) — relevant to #15. precedent_library weaker than internal (R@10 0.957 vs 1.0) — one external precedent unfindable by name. "CI gate" realized as discipline (re-runnable harness + committed baseline + run before/after any retrieval-layer change) — retrieval needs prod DB + Voyage, no CI runner has that access. Spec: docs/superpowers/specs/2026-05-31-fu5-eval-harness-design.md Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-31 14:58:13 +00:00
Chaim	b409f1c7eb	Add case data, benchmark embeddings, and bug report Add cases symlink, Google Vision extraction and benchmark embedding data, and Paperclip bug report. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 17:20:40 +00:00
Chaim	4d674bf475	Add proofreader and exporter agents + abbreviations dictionary - legal-proofreader: OCR proofreading agent (Opus) that fixes broken Hebrew text before legal analysis — corrects abbreviations (עוייד→עו"ד), broken words, and illogical sentences - legal-exporter: Final draft export agent — validates decision, exports DOCX, saves versioned drafts (טיוטה-V1.docx etc.) - abbreviations.json: Dictionary of ~70 Hebrew legal/general/planning abbreviations for automated OCR correction - legal-ceo.md: Updated workflow to include proofreader before analyst and exporter after QA Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 20:34:10 +00:00

5 Commits