legal-ai

Author	SHA1	Message	Date
Chaim	420cb819f5	feat(halacha-triage): quality-gated + prioritized review queue + metrics (#84) Backend for the halacha approval-queue triage (#84). The keyboard UI, batch actions and defer/reject (#84.4–6) already shipped; this adds the gating, prioritization and metrics the queue was missing. db.list_halachot: two opt-in triage controls: * exclude_low_quality (#84.1): drop items carrying ANY quality_flag (application / quote_unverified / truncated / non_decision / thin / nli_unsupported / near_duplicate); they belong in a 'needs extraction fix' bucket, not the chair's approve queue. * order_by_priority (#84.3): active-learning order, i.e. negatively-treated first, then most-uncertain (lowest confidence), then oldest, instead of FIFO, so the highest-value decisions surface first. halachot_pending (MCP): now gated + prioritized BY DEFAULT; include_low_quality=true reveals the needs-fix bucket. The agent review path benefits immediately. GET /api/halachot: same two params, default OFF (non-breaking; the UI opts in). metrics.halacha_backlog (#84.7): splits pending into clean vs flagged, adds deferred, reviewed_total, approve_ratio, and a pending_by_flag breakdown, so the backlog distinguishes real review work from extraction noise. Deferred (documented): #84.2 near-duplicate cluster cards and wiring the UI fetch to the new params require frontend work + an api:types regen AFTER this deploys (the new query params aren't in prod's OpenAPI until then); this is a clean follow-up. The backend fully supports both now. Verified against the live DB (read-only): - pending 177 → gated-clean 110, 0 flagged items leak into the clean queue. - priority order surfaces the lowest-confidence items first (0.55, 0.55, ...). - backlog: pending_clean=110 / pending_flagged=67 / approve_ratio=0.916, pending_by_flag={nli_unsupported:59, quote_unverified:3, thin:3, truncated:2}. - pytest tests/test_halacha_quality.py: 52 passed (no regression). 
Invariants: G1 (gate at source: SQL filter, not post-hoc); G2 (no parallel path: same list_halachot); §6 (flagged items routed to a bucket, never dropped). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 20:00:52 +00:00
Chaim	6ff2e36bf9	feat(eval): FU-5: retrieval eval harness + halacha backlog visibility (#63) Covers GAP-11 (INV-RET4/G8) and GAP-14 (INV-QA1/G10). Retrieval quality was never measured (only telemetry observation) and the halacha review backlog was invisible (the 10/19 gap was found by accident). Unit B: backlog visibility (pure code, container): - metrics.halacha_backlog(conn) → {pending_review, approved, rejected, published, total, oldest_pending_at}; surfaced in metrics.get_dashboard() (get_metrics MCP tool) and /api/system/diagnostics. Live count revealed 178 pending / 1552 total, oldest from 2026-05-03, previously invisible. Unit A: retrieval eval harness (host-side scripts): - scripts/eval_gold_bootstrap.py: seeds data/eval/gold-set.jsonl. Two sources: citations (cited==relevant via search_relevance_feedback; empty until decisions cite precedents) and known_item (query=case_name → relevant=self; a real citation-free signal, the methodology #52 checked by hand). Idempotent; preserves source='chair' rows. - scripts/eval_retrieval.py: runs the production retrieval path (search_library / search_internal) over the gold-set; computes precision@k, recall@k, MRR, nDCG@k (k=5,10); aggregates overall + per-corpus + per-practice_area; writes a report and a delta vs committed baseline.json (which records the retrieval_config it reflects). --self-test unit-checks the metric math offline. Gold-set strategy = hybrid (chair decision): bootstrap + chair review. The citation source is empty today (0 cited precedents in decisions), so the seed is known-item (77 queries: 54 internal_decisions + 23 precedent_library). The gold-set is PROVISIONAL until Dafna reviews it (the domain chair-gate). Baseline (production config: multimodal+rerank on): R@10=0.987, MRR=0.837, nDCG@10=0.872. Finding: MULTIMODAL_ENABLED=true slightly lowers known-item recall (image-page results displace exact name matches); relevant to #15. 
precedent_library weaker than internal (R@10 0.957 vs 1.0): one external precedent unfindable by name. "CI gate" realized as discipline (re-runnable harness + committed baseline + run before/after any retrieval-layer change); retrieval needs prod DB + Voyage, no CI runner has that access. Spec: docs/superpowers/specs/2026-05-31-fu5-eval-harness-design.md Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-31 14:58:13 +00:00
Chaim	f008820ec8	feat(reindex): health-check stale_embedding_case_law count (GAP-09, FU-3) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-30 22:08:27 +00:00
Chaim	677f29ddec	feat(audit): blocks_stale drift flag + health-check visibility (GAP-17, FU-7) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-30 21:36:56 +00:00
Chaim	358d82e90e	feat(retrieval): require practice_area only for internal/cases; enable searchable filter + health visibility (GAP-13, FU-2a) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 20:57:27 +00:00
Chaim	d9e5ef0f46	Add full decision writing pipeline: classify, extract, brainstorm, write, QA, export New services (11 files): - classifier.py: auto doc-type classification + party identification (Claude Haiku) - claims_extractor.py: claim extraction from pleadings (Claude Sonnet + regex) - references_extractor.py: plan/case-law/legislation detection (regex) - brainstorm.py: direction generation with 2-3 options (Claude Sonnet) - block_writer.py: 12-block decision writer (template + Claude Sonnet/Opus) - docx_exporter.py: DOCX export with David font, RTL, headings - qa_validator.py: 6 QA checks with export blocking on critical failure - learning_loop.py: draft vs final comparison + lesson extraction - metrics.py: KPIs dashboard per case and global - audit.py: action audit log - cli.py: standalone CLI with 11 commands Updated pipeline: extract → classify → chunk → embed → store → extract_references New MCP tools: 29 total (was 16) New DB tables: audit_log, decisions CRUD, claims CRUD Config: Infisical support, external service allowlist Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 10:21:47 +00:00
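
The gating and ordering that commit 420cb819f5 describes for db.list_halachot can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the table layout, the column names, storing quality flags as a plain text column, and the function name list_halachot_sketch are all hypothetical, and the repository's real schema and query are not shown in the log. The sketch only mirrors the stated behavior: both controls are opt-in, the gate is applied in SQL rather than after the fetch, and the priority order is negatively-treated first, then lowest confidence, then oldest.

```python
import sqlite3


def demo_conn() -> sqlite3.Connection:
    """Build an in-memory table with a HYPOTHETICAL schema (not the repo's)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE halachot ("
        " id INTEGER PRIMARY KEY, status TEXT, confidence REAL,"
        " created_at TEXT, negatively_treated INTEGER, quality_flags TEXT)"
    )
    conn.executemany(
        "INSERT INTO halachot VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, "pending_review", 0.90, "2026-05-03", 0, ""),
            (2, "pending_review", 0.55, "2026-05-10", 0, ""),
            (3, "pending_review", 0.80, "2026-05-12", 1, ""),
            (4, "pending_review", 0.40, "2026-05-04", 0, "nli_unsupported"),
        ],
    )
    return conn


def list_halachot_sketch(conn, exclude_low_quality=False, order_by_priority=False):
    """Return pending ids; both triage controls default to OFF (opt-in)."""
    sql = "SELECT id FROM halachot WHERE status = 'pending_review'"
    if exclude_low_quality:
        # Gate at the source: flagged rows are filtered in SQL, not post-hoc.
        sql += " AND (quality_flags IS NULL OR quality_flags = '')"
    if order_by_priority:
        # Negatively-treated first, then lowest confidence, then oldest.
        sql += " ORDER BY negatively_treated DESC, confidence ASC, created_at ASC"
    else:
        # Plain FIFO by creation time.
        sql += " ORDER BY created_at ASC"
    return [row[0] for row in conn.execute(sql)]
```

With the sample rows, enabling both controls drops the flagged item and yields ids 3, 2, 1, while the defaults return all four rows in creation order. The flagged row is only excluded from the query result; it stays in the table, in line with the commit's note that flagged items are routed to a bucket and never dropped.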

6 Commits
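
Commit 6ff2e36bf9 names four retrieval metrics (precision@k, recall@k, MRR, nDCG@k). As a rough sketch of the usual textbook definitions with binary relevance, and not the code in scripts/eval_retrieval.py (which the log does not show), the math could look like this. The function names and the example document ids are hypothetical.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """MRR over (ranked, relevant) pairs; 0.0 for an empty query set."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def ndcg_at_k(ranked, relevant, k):
    """Binary-gain nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal = sum(
        1.0 / math.log2(rank + 1)
        for rank in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal if ideal else 0.0
```

For a known-item query, where exactly one document is relevant, recall@k is either 0 or 1 and the reciprocal rank depends only on the position of that single document. This is why a result that displaces an exact name match lowers MRR and nDCG even while the document remains inside the top k.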