Halacha-extraction quality (#81) and dedup-on-insert (#82) — engine changes
(pure + tested) plus measurement/ops tooling.
halacha_quality.py
- #81.4 application gate: is_fact_dependent() (high-precision "applied to THIS
case" deixis per the strict rubric §3/§27) + FLAG_APPLICATION. compute_quality_flags
now takes rule_type and flags rule_type=='application' OR fact-dependent —
blocking auto-approve (an illustration is not a generalizable holding).
- #82.3 lexical tail signal: jaccard_shingles / normalized_levenshtein /
lexical_near_duplicate + FLAG_NEAR_DUPLICATE, for the 0.83–0.93 cosine band.
halacha_extractor.py — pass rule_type to the flag computation; re-type a
binding-labeled fact-application to 'application' (mirrors non_decision→obiter).
db.py (store_halachot_for_chunk) — dedup now fetches the nearest same-precedent
neighbor once: cosine ≥ DEDUP → skip (unchanged); cosine in [BAND, DEDUP) with
high lexical overlap → FLAG_NEAR_DUPLICATE (review, not skip — never drop a
possibly-distinct principle unreviewed).
config.py — HALACHA_DEDUP_BAND_COSINE (0.83).
Scripts:
- scripts/halacha_goldset.py (#81.7) — export stratified sample for human
tagging; score validators (P/R/F1) against the tags. Backbone for #81.8.
- scripts/halacha_batch_reconcile.py (#82.7) — conservative cross-precedent
dedup (cosine ≥0.95), dry-run report only.
- scripts/calibrate_halacha_dedup.py (#82.1) — calibrate the lexical thresholds
against the 2026-06-03 cleanup gold-set.
Deferred (documented): #82.4 merge-provenance and #82.5 DB ON CONFLICT/UNIQUE
on normalized quote are NOT included — the current skip+flag behavior is safe,
whereas a UNIQUE on normalized_quote would fail on existing dups and a blind
merge risks losing provenance; they need their own chair-reviewed migration.
#82.6 over-merge guard is moot until merge lands. #81.6 full rhetorical-role
classifier deferred (section pre-filter + application flag cover the practical
case); #81.8 blocked on the human-tagged gold-set (harness now provided).
Verified:
- pytest tests/test_halacha_quality.py — 52 passed (14 new).
- calibrate: configured (0.55,0.70) → precision 1.0 (zero false-merge), recall
0.30 — correct profile for an auto-approve-blocking signal.
- goldset export: 15-row sample CSV. batch reconcile: 819 halachot → 5
cross-precedent candidate pairs.
Invariants: G1 (normalize at source — flag at insert, not at read); §6 (no
silent swallow — suspect items flagged to review, never dropped); G2 (no
parallel path — same store_halachot_for_chunk / compute_quality_flags).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After a precedent finishes extracting, a claude_session pass folds facets of the
SAME legal question (below #82's dedup cosine — the שפר 14-vs-4 / 403-17→89
granularity gap) into one canonical; the rest are marked 'rejected' (reversible:
out of the active corpus AND the review queue, but recoverable). FOLD-ONLY —
never merges distinct legal questions, never invents.
- Engine: claude_session-as-judge (local CLI, zero cost), 'high' effort — folding
needs careful judgment. One pass per precedent, runs in _extract_impl once all
chunks are done (the prompt dedups within a chunk; this catches across chunks).
- Pure, unit-tested helpers in halacha_quality: CONSOLIDATE_SYSTEM,
build_consolidation_prompt, parse_fold_groups (fails SAFE → [] on any malformed
shape; drops <2-member groups; coerces/dedups indices).
- halacha_extractor._consolidate_precedent picks the canonical per group
(approved>pending, higher confidence, quote_verified, longer) and rejects the
rest via the existing update_halachot_batch (#84). Never rejects a canonical.
Fails OPEN on any error (no CLI / parse fail → 0 folds, data untouched).
- config: HALACHA_CONSOLIDATE_ENABLED/MODEL/EFFORT.
Verified: suite 176 passed (10 new); integration vs dev DB — a 2-facet group
folds to 1 canonical + 1 rejected (tagged), distinct rules untouched, claude
error → 0 folds (fail-open).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#81.3 — a post-extraction validator that flags halachot whose rule_statement is
NOT entailed by its supporting_quote (the model over-reaching beyond its source).
- Engine: claude_session-as-judge (local CLI, zero API cost) per chaim's standing
preference — one batched judge call per chunk, NOT a hosted NLI model.
- Pure, unit-tested helpers in halacha_quality: NLI_SYSTEM, build_nli_prompt,
parse_nli_verdicts (fails OPEN — any shape/label ambiguity → 'entailed').
- halacha_extractor._nli_check wraps the call; fails OPEN on any error (e.g. no
CLI in the container) so a flaky judge never blocks a genuine halacha.
- Non-entailed (neutral/contradiction) → quality_flag 'nli_unsupported' which
blocks auto-approve (routes to pending_review) via the existing store gate.
- config: HALACHA_NLI_ENABLED/MODEL/EFFORT (effort 'low' — entailment is simple).
Verified: suite 166 passed (10 new); LIVE smoke test against the real claude CLI
returned ['entailed','neutral'] for a supported vs unsupported rule.
Also commits TaskMaster #86 (Nevo preamble/ratio: anti-contamination strip fix +
gold-set benchmark) capturing today's strip_nevo_preamble findings.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bake the 2026-06-03 strict-cleanup rubric into the extraction pipeline so the
corpus stays clean at the source instead of accumulating duplicates, obiter
dicta, truncated quotes and thin restatements that clog the review queue.
#81 — quality gate:
- New pure module halacha_quality.py with unit-tested validators:
non-decision/obiter (Wambaugh markers), truncated-quote (mid-word cut),
thin-restatement (rule≈quote), quote-unverified.
- Validators run in halacha_extractor._process; a non-decision is re-typed
obiter; flags persist in new halachot.quality_flags column.
- Auto-approve now requires confidence>=threshold AND no quality flags;
flagged items route to pending_review regardless of confidence.
- Both extraction prompts hardened: reject undecided dicta, exclude
case-specific applications, require abstraction, forbid over-splitting.
#82 — dedup-on-insert (store_halachot_for_chunk):
- Within the same precedent, skip a halacha whose normalized supporting_quote
already exists, or whose rule-embedding has cosine>=HALACHA_DEDUP_COSINE
(0.93) against an already-stored one. Makes re-runs idempotent.
Migration: halachot.quality_flags TEXT[] (additive, idempotent ALTER).
Tests: 19 new unit tests; full suite 156 passed. Validated end-to-end against
dev DB (dedup skips dups, flag blocks auto-approve, re-run inserts 0).
Calibration: flags fire on only ~10% of current survivors (low false-positive).
Spec: docs/halacha-strict-rubric.md
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
xhigh is the quality sweet-spot for a single precedent but very slow at scale
(64-chunk case ≈ 20 min). Bulk queue-drains (process_pending over many
precedents) now use a lighter effort to cut wall-clock; interactive single
re-extraction keeps xhigh quality.
- config.HALACHA_BULK_EXTRACT_EFFORT (env, default 'high'; set 'medium' for max
speed, 'xhigh' to match single).
- extract()/_extract_impl()/_extract_chunk() take an `effort` override threaded
to claude_session.query_json; None falls back to HALACHA_EXTRACT_EFFORT (xhigh).
- process_pending_extractions(kind='halacha') passes the bulk effort; single
reextract_halachot keeps xhigh.
Verified end-to-end (mocked LLM): _extract_chunk(effort='medium') → query_json
effort='medium'; effort=None → 'xhigh' fallback. Closes the open item in #72.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Halacha extraction held ALL chunk results in memory and stored once at the very
end — a crash/interrupt mid-run (e.g. the 2026-05-31 freeze) lost everything and
re-paid the full LLM cost on retry.
Now each chunk's halachot are stored AND the chunk is checkpointed
(precedent_chunks.halacha_extracted_at) the moment it finishes:
- V25 schema: precedent_chunks.halacha_extracted_at (per-chunk checkpoint).
- db.store_halachot_for_chunk: atomic per-chunk insert (halacha_index continues
from MAX, caller serializes via an in-process store-lock) + checkpoint mark.
- db.reset_halacha_extraction (force) / mark_all_chunks_extracted (legacy backfill).
- _extract_impl rewritten: resume by default (skip checkpointed chunks; failed
chunks stay pending and are retried; status stays 'processing' until all done);
force=True wipes + redoes all. reextract_halachot passes force=True; the queue
drain (process_pending) resumes by default.
- Legacy guard: a pre-V25 precedent (halachot exist, no checkpoints) is
backfilled and treated as complete — never re-extracted (would duplicate).
Verified on 9002-24 (55 halachot, legacy): resume → legacy-backfill, NO
duplication (stays 55), all chunks checkpointed. Index continuation: store at
55,56 after max 54, no collision. Tracks #72.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-31: opus-4-8 @ xhigh extraction + overlapping driver processes (agent
fallback retries each spawn an independent `python -c` driver; process_pending is
serial WITHIN a process but the box ran 4-5 drivers in parallel) → 12-16 concurrent
xhigh `claude -p` procs → load 69 → hard reboot.
Fix: halacha_extractor.extract() now takes a Postgres advisory lock
(pg_try_advisory_lock, key 'HALA') before any work. If another extraction (any
process/agent/driver — all share the legal-ai DB) holds it, the call returns
status='busy' and the precedent stays pending for the next drain. Guarantees ONE
extraction at a time ACROSS PROCESSES — an in-process Semaphore cannot (drivers
are separate OS processes). Core logic moved to _extract_impl (unchanged) under
the lock. CHUNK_CONCURRENCY now env-tunable (HALACHA_CHUNK_CONCURRENCY, default 3).
Verified: while a lock is held, extract() returns 'busy' with no LLM call; lock
releases cleanly and the next extraction proceeds. Tracks #72.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Observed 2026-05-03: a `precedent_process_pending(halacha)` run that
chained two precedents (1110/20 → 317/10) succeeded for the first
(9 halachot, 129 chunks) and produced status=`no_halachot` for the
second despite it being a 47KB Supreme Court ruling with rich legal
analysis. A manual single-precedent re-run on 317/10 immediately
extracted 53 halachot. Diagnosis: every chunk's claude_session call
in the back-to-back run silently failed (likely Anthropic rate-limit
storm after the 1110/20 token burn), and the empty list was reported
as "Claude looked and found nothing" — same code path as a real
0-halacha ruling. The user couldn't tell the difference.
Three changes:
1. Surface chunk-level failures (halacha_extractor.py)
`_extract_chunk` now returns `(halachot, succeeded)` so the caller
can count how many chunks crashed. `extract()` uses this to
distinguish:
- `no_halachot` — chunks ran cleanly, Claude found nothing
- `extraction_failed` — ≥50% of chunks crashed AND zero halachot
came back (rate limit, subprocess crash, etc.)
When `extraction_failed`, DB status is left as 'processing' so the
request stays in the queue for the caller to retry — instead of
the old behaviour where it got marked 'completed' and silently
dropped from the queue.
2. Inter-precedent cooldown (precedent_library.py)
`process_pending_extractions` now sleeps 30s between precedents.
Anthropic rate-limits per-org, and back-to-back large rulings
(~4M tokens for 1110/20, immediately followed by another 2-3M)
was the empirical trigger. 30s gives the per-minute counter time
to drain.
3. Auto-retry on extraction_failed (precedent_library.py)
When a precedent comes back as `extraction_failed`, retry once
after a 60s cooldown before giving up. Rate-limit storms are
transient — the manual re-run of 317/10 minutes later succeeded
with 53 halachot and zero chunk failures, confirming a single
retry is sufficient. Only retries `extraction_failed`; never
`no_halachot` (Claude looked and there genuinely is no holding).
The DB status now ends up as 'failed' only after retries are
exhausted, matching the UI's terminal-failure chip.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three fixes to the precedent library after the first end-to-end test on
403-17 surfaced runtime issues:
1. Anthropic SDK fallback in claude_session. The legal-ai Docker container
does not ship the `claude` CLI, so every halacha and metadata extraction
was failing with "Claude CLI not found." Module now tries the CLI first
(zero-cost local path) and falls back to the Anthropic SDK with
ANTHROPIC_API_KEY when the binary is absent. Default model is
claude-sonnet-4-6, overridable via CLAUDE_SDK_MODEL env. The system
message gets cache_control: ephemeral so multi-chunk runs reuse the
cached instruction prefix at ~10% read cost. Adds `anthropic` to
pyproject deps.
2. precedent_metadata_extractor crashed with KeyError because the JSON
example inside the prompt template contained literal { } characters
that str.format() interpreted as placeholders. Switched to f-string
concatenation; the prompt template no longer needs format() at all.
3. Library list query stays stale after upload because the upload
mutation's onSuccess fires when the POST returns task_id, not when
SSE reports completion. Added a second invalidate inside the SSE
watcher in PrecedentUploadSheet so the new row appears with up-to-date
chunk and halachot counts the moment processing finishes.
Halacha and metadata extractors now route the long static prompt through
the new `system=` parameter so the SDK path actually caches it; the CLI
path concatenates and behaves as before.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three improvements to the precedent library based on usage feedback:
1. Auto-fill metadata at upload time. New service
precedent_metadata_extractor reads the ruling's full_text and
suggests case_name (short), summary, headnote, key_quote,
subject_tags, appeal_subtype. The merge policy fills only empty
fields, preserving everything the chair typed in the upload form.
Wired into the ingest pipeline; also exposed as a re-run endpoint
POST /api/precedent-library/{id}/extract-metadata for existing
records.
2. Edit sheet in the UI. Pencil icon on each library row opens a
pre-populated form covering every field. A Sparkles button on the
sheet runs the metadata extractor on demand and refreshes the
form. The case_number is read-only because halachot are FK'd to
it; renaming requires delete + re-upload.
3. Halacha extractor branches on is_binding. Sources marked binding
(Supreme/Administrative) keep the strict halacha prompt. Non-binding
sources (other appeals committees, district courts on planning
matters) get a different prompt that extracts applications,
interpretive principles, and persuasive conclusions — labeled with
new rule_types 'application' and 'persuasive'. The fallback also
widens chunk selection: if the chunker labeled nothing as
legal_analysis/ruling/conclusion, we now run on all chunks rather
than returning zero halachot for a usable ruling.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a third corpus of legal authority distinct from style_corpus
(Daphna's prior decisions for voice) and case_precedents (chair-attached
quotes per case). The new corpus holds chair-uploaded court rulings and
other appeals committee decisions, with binding rules (הלכות) extracted
automatically and queued for chair approval.
Pipeline (web/app.py + services/precedent_library.py):
file → extract → chunk → Voyage embed → halacha_extractor → store +
publish progress over the existing Redis SSE channel.
Schema V7 (services/db.py): extends case_law with source_kind +
extraction status fields under a CHECK constraint pinning practice_area
to the three appeals committee domains (rishuy_uvniya, betterment_levy,
compensation_197). New precedent_chunks (vector(1024)) and halachot
tables (vector(1024) over rule_statement, IVFFlat indexes, gin on
practice_areas/subject_tags). Halachot start as pending_review; only
approved/published rows are visible to search_precedent_library.
Agents: legal-writer, legal-researcher, legal-analyst, legal-ceo,
legal-qa get search_precedent_library. legal-writer prompt explains
the three-corpus distinction and CREAC use; legal-qa now verifies that
every cited halacha resolves to an approved row in the corpus.
UI: /precedents page with four tabs — library / semantic search /
pending review (J/K nav, A/R/E shortcuts, badge count) / stats.
Reuses the existing upload-sheet progress + SSE pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>