feat: Stage C — RAG advanced (#33, #47, #48, #49, #50, #51)
All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 1m35s

Six independent sub-tasks dispatched in parallel; aggregated here.

## #33 — Hide case_name column
library-list-panel.tsx: `<TableHead>` + `<TableCell>` for "שם"
get `className="hidden"` in both Court and Committee row variants.
DB column preserved for future use.

## #47 — Audit script periodic
New scripts/audit_corpus_integrity.py — 3 SQL checks (external+ערר
prefix, internal missing chair/district, cases.practice_area enum)
+ CEO wakeup on violations + cron `0 7 * * *`. First run: 0 issues.

## #48 — Parent-doc retrieval (gated, default off)
Schema V17: precedent_chunks.parent_chunk_id + chunk_role
('child'|'parent'). New chunker.chunk_document_hierarchical() —
section-aware parents (~1500 tokens) containing ~5 overlapping
children (~300 tokens each). New db.store_precedent_chunks_hierarchical
two-pass writer. Search SQL (semantic + lexical) LEFT-JOIN parent and
swap content + dedupe by parent_chunk_id when flag on. Toggle:
PARENT_DOC_RETRIEVAL_ENABLED + PARENT_DOC_{CHILD,PARENT}_SIZE_TOKENS.
Backfill ~3min and ~$0.20 — deferred to follow-up.

## #49 — Multimodal backfill
New scripts/backfill_multimodal_precedents.py with token-matching
case_number ↔ source files (PDF + DOCX via PyMuPDF). Ran in container:
26 precedents embedded, 503 pages, $0.21, 0 errors. precedent_image_embeddings
grew 3 → 29 rows. 44 remaining are style_corpus-migrated rows (no
source file on disk) — will catch up when re-uploaded.

## #50 — Closed-loop feedback + nDCG
Schema V18: search_logs + search_relevance_feedback. New telemetry.py
with fire-and-forget log_search_bg (p50 = 0.002ms — zero overhead) +
auto-infer_relevance_from_citations (reads case drafts → marks score=3
when cited precedent appears in past search top-K). Hooks added to 5
search paths. scripts/compute_ndcg.py for aggregation. Two admin API
endpoints (GET /api/admin/rag-metrics + POST .../infer). Dashboard UI
deferred — API is enough for now.

## #51 — Halacha quality monitoring
New scripts/monitor_halacha_quality.py — baseline avg confidence
(trusted=0.849, all=0.833, pending=0.694) with rolling window drift
detection. Default 5% threshold. Exits non-zero on alert for cron
integration. Recommended: `0 8 * * 1` weekly Mon 8am.

## Bonus: 230 unlinked citations → missing_precedents
Bulk-imported 230 distinct unlinked citations from
precedent_internal_citations to missing_precedents.status='open',
party='committee', with notes listing source citers. Top candidate:
ע"א 3213/97 (cited 5x). Total open missing_precedents now 237.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
2026-05-26 11:26:52 +00:00
parent 3a05e30c8d
commit 2aee398b4a
15 changed files with 2493 additions and 57 deletions

View File

@@ -132,6 +132,43 @@ def find_case_dir(case_number: str) -> Path:
CHUNK_SIZE_TOKENS = 600
CHUNK_OVERLAP_TOKENS = 100
# Parent-doc retrieval (TaskMaster #48) — hierarchical chunking + lookup.
# When enabled:
# - The ingest pipeline emits two tiers of precedent_chunks: small
# "child" chunks (~300 tokens) for high-recall semantic/lexical
# matching, and larger "parent" chunks (~1500 tokens) that contain
# ~5 children each. Children are embedded and indexed; parents
# carry the broader text the LLM gets back.
# - Search runs against children, then swaps each hit for its parent
# row before returning — so the writer sees a coherent passage
# instead of a 300-token sliver.
#
# Off by default: the schema (V17) is safe to apply even when the flag
# is false (the chunker still emits single-tier chunks and search just
# returns them unchanged). Flip to true ONLY after the corpus has been
# re-ingested with the hierarchical chunker — see precedent_library
# ingest pipeline + the backfill plan in TaskMaster #48.
PARENT_DOC_RETRIEVAL_ENABLED = (
os.environ.get("PARENT_DOC_RETRIEVAL_ENABLED", "false").lower() == "true"
)
# Child chunks are what get embedded + matched. Smaller = higher recall,
# more rows. 300 tokens (~600 chars Hebrew) is the empirical sweet spot
# referenced in the original parent-doc literature (Anthropic, LlamaIndex).
PARENT_DOC_CHILD_SIZE_TOKENS = int(
os.environ.get("PARENT_DOC_CHILD_SIZE_TOKENS", "300")
)
# Parent chunks are what get returned to the LLM. Large enough to hold
# a full rule statement plus the surrounding paragraph and any cited
# authority. 1500 tokens = ~5 children at 300 each.
PARENT_DOC_PARENT_SIZE_TOKENS = int(
os.environ.get("PARENT_DOC_PARENT_SIZE_TOKENS", "1500")
)
# Child overlap — keeps neighbouring children sharing ~50 tokens so a
# sentence on a chunk boundary still matches the natural phrasing.
PARENT_DOC_CHILD_OVERLAP_TOKENS = int(
os.environ.get("PARENT_DOC_CHILD_OVERLAP_TOKENS", "50")
)
# External service allowlist — case materials may ONLY be sent to these domains
ALLOWED_EXTERNAL_SERVICES = {
"api.voyageai.com", # Voyage AI (embeddings)