feat: Stage C — RAG advanced (#33, #47, #48, #49, #50, #51)
All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 1m35s

Six independent sub-tasks dispatched in parallel; aggregated here.

## #33 — Hide case_name column
library-list-panel.tsx: `<TableHead>` + `<TableCell>` for "שם"
get `className="hidden"` in both Court and Committee row variants.
DB column preserved for future use.

## #47 — Audit script periodic
New scripts/audit_corpus_integrity.py — 3 SQL checks (external+ערר
prefix, internal missing chair/district, cases.practice_area enum)
+ CEO wakeup on violations + cron `0 7 * * *`. First run: 0 issues.

## #48 — Parent-doc retrieval (gated, default off)
Schema V17: precedent_chunks.parent_chunk_id + chunk_role
('child'|'parent'). New chunker.chunk_document_hierarchical() —
section-aware parents (~1500 tokens) containing ~5 overlapping
children (~300 tokens each). New db.store_precedent_chunks_hierarchical
two-pass writer. Search SQL (semantic + lexical) LEFT-JOIN parent and
swap content + dedupe by parent_chunk_id when flag on. Toggle:
PARENT_DOC_RETRIEVAL_ENABLED + PARENT_DOC_{CHILD,PARENT}_SIZE_TOKENS.
Backfill ~3min and ~$0.20 — deferred to follow-up.

## #49 — Multimodal backfill
New scripts/backfill_multimodal_precedents.py with token-matching
case_number ↔ source files (PDF + DOCX via PyMuPDF). Ran in container:
26 precedents embedded, 503 pages, $0.21, 0 errors. precedent_image_embeddings
grew 3 → 29 rows. 44 remaining are style_corpus-migrated rows (no
source file on disk) — will catch up when re-uploaded.

## #50 — Closed-loop feedback + nDCG
Schema V18: search_logs + search_relevance_feedback. New telemetry.py
with fire-and-forget log_search_bg (p50 = 0.002ms — zero overhead) +
auto-infer_relevance_from_citations (reads case drafts → marks score=3
when cited precedent appears in past search top-K). Hooks added to 5
search paths. scripts/compute_ndcg.py for aggregation. Two admin API
endpoints (GET /api/admin/rag-metrics + POST .../infer). Dashboard UI
deferred — API is enough for now.

## #51 — Halacha quality monitoring
New scripts/monitor_halacha_quality.py — baseline avg confidence
(trusted=0.849, all=0.833, pending=0.694) with rolling window drift
detection. Default 5% threshold. Exits non-zero on alert for cron
integration. Recommended: `0 8 * * 1` weekly Mon 8am.

## Bonus: 230 unlinked citations → missing_precedents
Bulk-imported 230 distinct unlinked citations from
precedent_internal_citations to missing_precedents.status='open',
party='committee', with notes listing source citers. Top candidate:
ע"א 3213/97 (cited 5x). Total open missing_precedents now 237.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
2026-05-26 11:26:52 +00:00
parent 3a05e30c8d
commit 2aee398b4a
15 changed files with 2493 additions and 57 deletions

View File

@@ -172,34 +172,100 @@ async def ingest_precedent(
case_law_id = UUID(str(record["id"]))
try:
await progress("chunking", 40, f"מחלק את הטקסט ל-chunks ({page_count} עמ')")
chunks = chunker.chunk_document(text, page_offsets=page_offsets)
if not chunks:
await db.set_case_law_extraction_status(case_law_id, "completed")
await db.set_case_law_halacha_status(case_law_id, "completed")
await progress("completed", 100, "אין טקסט לעיבוד")
return {
"status": "completed",
"case_law_id": str(case_law_id),
"chunks": 0,
"halachot": 0,
}
# Parent-doc retrieval (TaskMaster #48): when enabled, emit
# two tiers (parents + children). Only children are embedded
# and indexed; parents carry retrieval context. When disabled,
# fall back to legacy single-tier chunking — identical
# behaviour to pre-V17.
if config.PARENT_DOC_RETRIEVAL_ENABLED:
await progress(
"chunking", 40,
f"מחלק את הטקסט ל-chunks היררכיים ({page_count} עמ')",
)
h_chunks = chunker.chunk_document_hierarchical(
text, page_offsets=page_offsets,
)
if not h_chunks:
await db.set_case_law_extraction_status(case_law_id, "completed")
await db.set_case_law_halacha_status(case_law_id, "completed")
await progress("completed", 100, "אין טקסט לעיבוד")
return {
"status": "completed",
"case_law_id": str(case_law_id),
"chunks": 0,
"halachot": 0,
}
await progress("embedding", 55, f"מייצר embeddings ל-{len(chunks)} chunks")
chunk_texts = [c.content for c in chunks]
chunk_vectors = await embeddings.embed_texts(chunk_texts, input_type="document")
children = [c for c in h_chunks if c.role == "child"]
parents = [c for c in h_chunks if c.role == "parent"]
await progress(
"embedding", 55,
f"מייצר embeddings ל-{len(children)} children "
f"({len(parents)} parents)",
)
child_texts = [c.content for c in children]
child_vectors = await embeddings.embed_texts(
child_texts, input_type="document",
)
# Build flat dict list for the two-pass writer.
chunk_dicts: list[dict] = []
for p in parents:
chunk_dicts.append({
"role": "parent",
"local_id": p.local_id,
"parent_local_id": None,
"chunk_index": p.chunk_index,
"content": p.content,
"section_type": p.section_type,
"page_number": p.page_number,
"embedding": None,
})
for c, v in zip(children, child_vectors):
chunk_dicts.append({
"role": "child",
"local_id": c.local_id,
"parent_local_id": c.parent_local_id,
"chunk_index": c.chunk_index,
"content": c.content,
"section_type": c.section_type,
"page_number": c.page_number,
"embedding": v,
})
counts = await db.store_precedent_chunks_hierarchical(
case_law_id, chunk_dicts,
)
stored_chunks = counts["children"]
else:
await progress(
"chunking", 40, f"מחלק את הטקסט ל-chunks ({page_count} עמ')",
)
chunks = chunker.chunk_document(text, page_offsets=page_offsets)
if not chunks:
await db.set_case_law_extraction_status(case_law_id, "completed")
await db.set_case_law_halacha_status(case_law_id, "completed")
await progress("completed", 100, "אין טקסט לעיבוד")
return {
"status": "completed",
"case_law_id": str(case_law_id),
"chunks": 0,
"halachot": 0,
}
chunk_dicts = [
{
"chunk_index": c.chunk_index,
"content": c.content,
"section_type": c.section_type,
"page_number": c.page_number,
"embedding": v,
}
for c, v in zip(chunks, chunk_vectors)
]
stored_chunks = await db.store_precedent_chunks(case_law_id, chunk_dicts)
await progress("embedding", 55, f"מייצר embeddings ל-{len(chunks)} chunks")
chunk_texts = [c.content for c in chunks]
chunk_vectors = await embeddings.embed_texts(chunk_texts, input_type="document")
chunk_dicts = [
{
"chunk_index": c.chunk_index,
"content": c.content,
"section_type": c.section_type,
"page_number": c.page_number,
"embedding": v,
}
for c, v in zip(chunks, chunk_vectors)
]
stored_chunks = await db.store_precedent_chunks(case_law_id, chunk_dicts)
# Multimodal page-image embeddings (V9). Gated by feature flag.
# Non-fatal: text path already succeeded. Only PDFs.