feat: Stage C — RAG advanced (#33, #47, #48, #49, #50, #51)
All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 1m35s
Six independent sub-tasks were dispatched in parallel; the results are aggregated here.

## #33: Hide case_name column

library-list-panel.tsx: the `<TableHead>` + `<TableCell>` for "שם" ("Name") get `className="hidden"` in both the Court and Committee row variants. The DB column is preserved for future use.

## #47: Periodic audit script

New scripts/audit_corpus_integrity.py: 3 SQL checks (external+ערר prefix, internal missing chair/district, cases.practice_area enum) + CEO wakeup on violations + cron `0 7 * * *`. First run: 0 issues.

## #48: Parent-doc retrieval (gated, default off)

Schema V17: precedent_chunks.parent_chunk_id + chunk_role ('child'|'parent'). New chunker.chunk_document_hierarchical(): section-aware parents (~1500 tokens), each containing ~5 overlapping children (~300 tokens each). New db.store_precedent_chunks_hierarchical two-pass writer. Search SQL (semantic + lexical) LEFT JOINs the parent, swaps in its content, and dedupes by parent_chunk_id when the flag is on. Toggle: PARENT_DOC_RETRIEVAL_ENABLED + PARENT_DOC_{CHILD,PARENT}_SIZE_TOKENS. Backfill (~3 min, ~$0.20) is deferred to a follow-up.

## #49: Multimodal backfill

New scripts/backfill_multimodal_precedents.py with token matching of case_number ↔ source files (PDF + DOCX via PyMuPDF). Ran in container: 26 precedents embedded, 503 pages, $0.21, 0 errors. precedent_image_embeddings grew 3 → 29 rows. The 44 remaining are style_corpus-migrated rows (no source file on disk); they will catch up when re-uploaded.

## #50: Closed-loop feedback + nDCG

Schema V18: search_logs + search_relevance_feedback. New telemetry.py with fire-and-forget log_search_bg (p50 = 0.002 ms, negligible overhead) + automatic infer_relevance_from_citations (reads case drafts → marks score=3 when a cited precedent appears in a past search's top-K). Hooks added to 5 search paths. scripts/compute_ndcg.py for aggregation. Two admin API endpoints (GET /api/admin/rag-metrics + POST .../infer). Dashboard UI deferred; the API is enough for now.
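The aggregation in scripts/compute_ndcg.py is not shown in this commit view. As a rough sketch of what nDCG@10 over the 0-3 relevance grades could look like, assuming the common exponential-gain formulation (the actual script may use linear gain or a different ideal ranking), with `dcg_at_k` and `ndcg_at_k` as hypothetical names:

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k graded results.

    `relevances` lists the 0-3 grades in ranked order (index 0 = top hit).
    """
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k=10):
    """DCG divided by the DCG of the same grades sorted into ideal order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    # A search with no relevant hits has no meaningful ideal; report 0.0.
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Under this formulation a single score=3 hit at rank 1 gives 1.0, and the same hit pushed to rank 2 is discounted by `log2(3)`.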
## #51: Halacha quality monitoring

New scripts/monitor_halacha_quality.py: baseline avg confidence (trusted=0.849, all=0.833, pending=0.694) with rolling-window drift detection. Default threshold: 5%. Exits non-zero on alert for cron integration. Recommended: `0 8 * * 1` (weekly, Mon 8am).

## Bonus: 230 unlinked citations → missing_precedents

Bulk-imported 230 distinct unlinked citations from precedent_internal_citations into missing_precedents with status='open', party='committee', and notes listing the source citers. Top candidate: ע"א 3213/97 (cited 5x). Total open missing_precedents is now 237.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -905,6 +905,108 @@ CREATE INDEX IF NOT EXISTS idx_pic_unlinked
 """


+# ── V17: Parent-doc retrieval (TaskMaster #48) ─────────────────────
+# Hierarchical chunking: tiny "child" chunks (~300 tokens) are indexed
+# and matched at search time for high recall on focused phrases, but
+# every child links upward to a larger "parent" chunk (~1500 tokens)
+# that supplies broader context to the LLM. The retrieval step swaps
+# the child hit for its parent before returning rows to callers — so
+# rule statements, multi-paragraph quotes, and "אשר על כן…" passages
+# come back whole instead of clipped mid-sentence.
+#
+# Schema layout:
+#   parent_chunk_id — self-FK on precedent_chunks. NULL for legacy
+#                     rows (single-tier chunking) and for parent
+#                     rows themselves. Cascade=SET NULL so deleting
+#                     a parent doesn't orphan the children's payload.
+#   chunk_role      — 'child' | 'parent'. Defaults to 'child' so any
+#                     row created by the pre-V17 ingestion path is
+#                     treated as a child without a parent (i.e. the
+#                     parent-doc swap is a no-op and the legacy chunk
+#                     continues to surface as-is).
+#
+# Activation is gated by ``config.PARENT_DOC_RETRIEVAL_ENABLED``. Even
+# after the schema is in place, search keeps the legacy behaviour
+# until both the chunker emits hierarchical chunks *and* the flag is
+# flipped on — so this migration is safe to apply ahead of time.
+SCHEMA_V17_SQL = """
+ALTER TABLE precedent_chunks
+    ADD COLUMN IF NOT EXISTS parent_chunk_id UUID
+        REFERENCES precedent_chunks(id) ON DELETE SET NULL;
+
+ALTER TABLE precedent_chunks
+    ADD COLUMN IF NOT EXISTS chunk_role TEXT DEFAULT 'child';
+
+DO $$ BEGIN
+    ALTER TABLE precedent_chunks ADD CONSTRAINT precedent_chunks_role_check
+        CHECK (chunk_role IN ('child', 'parent'));
+EXCEPTION WHEN duplicate_object THEN NULL; END $$;
+
+CREATE INDEX IF NOT EXISTS idx_precedent_chunks_parent
+    ON precedent_chunks(parent_chunk_id);
+CREATE INDEX IF NOT EXISTS idx_precedent_chunks_role
+    ON precedent_chunks(chunk_role);
+"""
+
+
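The chunker itself (`chunker.chunk_document_hierarchical`) is not part of this diff. The two-tier layout described in the V17 comment can be sketched roughly as follows, assuming plain whitespace tokens rather than the section-aware splitting the commit message mentions; `chunk_hierarchical` and `child_overlap` are hypothetical names, and the output dict keys mirror the input shape documented for `store_precedent_chunks_hierarchical`:

```python
def chunk_hierarchical(tokens, parent_size=1500, child_size=300, child_overlap=50):
    """Split a token list into parents, each covered by overlapping children."""
    if not 0 <= child_overlap < child_size:
        raise ValueError("child_overlap must be in [0, child_size)")
    stride = child_size - child_overlap
    chunks = []
    next_local_id = 0
    for p_start in range(0, len(tokens), parent_size):
        parent_tokens = tokens[p_start:p_start + parent_size]
        parent_id = next_local_id
        next_local_id += 1
        chunks.append({
            "role": "parent",
            "local_id": parent_id,
            "parent_local_id": None,
            "chunk_index": parent_id,
            "content": " ".join(parent_tokens),
        })
        # Slide a child-sized window across this parent only, so no child
        # ever straddles two parents.
        for c_start in range(0, len(parent_tokens), stride):
            child_tokens = parent_tokens[c_start:c_start + child_size]
            chunks.append({
                "role": "child",
                "local_id": next_local_id,
                "parent_local_id": parent_id,
                "chunk_index": next_local_id,
                "content": " ".join(child_tokens),
            })
            next_local_id += 1
            if c_start + child_size >= len(parent_tokens):
                break  # this window already reached the end of the parent
    return chunks
```

Keeping every child inside one parent is what makes the later swap safe: the parent text always contains the passage that matched.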
+# ── V18: RAG telemetry — closed-loop retrieval feedback (TaskMaster #50)
+#
+# Captures every semantic search call (query, agent, top results,
+# latency) so we can compute nDCG@10 over time and surface drift before
+# it bites. Relevance signal comes from two places:
+#   1. ``cited_in_decision`` — auto-inferred. If a precedent cited in a
+#      final draft's ``decision_paragraphs.citations`` also appears in
+#      the ``top_case_law_ids`` of a search log for the same case, that
+#      hit is treated as highly relevant (score=3).
+#   2. ``chair_marked`` — explicit feedback (future hook for the UI).
+#
+# ``top_case_law_ids`` is intentionally nullable: ``search_decisions``
+# returns document chunks from active cases (not case_law rows), so its
+# rows log the query but leave the array empty. nDCG aggregation skips
+# those.
+SCHEMA_V18_SQL = """
+CREATE TABLE IF NOT EXISTS search_logs (
+    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    search_type TEXT NOT NULL,
+        -- 'precedent_library' / 'internal_decisions'
+        -- / 'decisions' / 'case_documents' / 'similar_cases'
+    query TEXT NOT NULL,
+    practice_area TEXT,
+    case_id UUID REFERENCES cases(id) ON DELETE SET NULL,
+    user_agent TEXT,
+        -- 'writer' / 'researcher' / 'analyst' / 'manual' / 'unknown'
+    result_count INTEGER,
+    top_case_law_ids UUID[],
+        -- nullable: empty for search_decisions/search_case_documents
+        -- which return document chunks not case_law rows
+    duration_ms INTEGER,
+    created_at TIMESTAMPTZ DEFAULT NOW()
+);
+CREATE INDEX IF NOT EXISTS idx_search_logs_type ON search_logs(search_type);
+CREATE INDEX IF NOT EXISTS idx_search_logs_case ON search_logs(case_id);
+CREATE INDEX IF NOT EXISTS idx_search_logs_date ON search_logs(created_at DESC);
+
+CREATE TABLE IF NOT EXISTS search_relevance_feedback (
+    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    search_log_id UUID REFERENCES search_logs(id) ON DELETE CASCADE,
+    case_law_id UUID NOT NULL REFERENCES case_law(id) ON DELETE CASCADE,
+    rank INTEGER NOT NULL,
+        -- 1-based position in the original results (1 = top hit)
+    relevance_score INTEGER NOT NULL
+        CHECK (relevance_score IN (0, 1, 2, 3)),
+        -- 0=irrelevant, 1=marginal, 2=relevant, 3=highly relevant
+    feedback_source TEXT,
+        -- 'cited_in_decision' / 'chair_marked' / 'auto_inferred'
+    created_at TIMESTAMPTZ DEFAULT NOW(),
+    UNIQUE(search_log_id, case_law_id, feedback_source)
+);
+CREATE INDEX IF NOT EXISTS idx_relevance_log
+    ON search_relevance_feedback(search_log_id);
+CREATE INDEX IF NOT EXISTS idx_relevance_case_law
+    ON search_relevance_feedback(case_law_id);
+"""
+
+
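The ``cited_in_decision`` rule from the V18 comment can be illustrated in memory. This is only a sketch of the matching logic, not the telemetry.py implementation (which is not shown here); `infer_feedback_rows` is a hypothetical helper, and the input dicts are assumed to carry the `search_logs` column names:

```python
def infer_feedback_rows(search_logs, cited_case_law_ids):
    """Build search_relevance_feedback-shaped rows for cited precedents.

    `search_logs` are dicts with `id` and `top_case_law_ids` for one case;
    `cited_case_law_ids` are the precedents cited in that case's final draft.
    """
    cited = set(cited_case_law_ids)
    rows = []
    for log in search_logs:
        # NULL or empty arrays (document-chunk searches) contribute nothing.
        top_ids = log.get("top_case_law_ids") or []
        for rank, case_law_id in enumerate(top_ids, start=1):  # rank is 1-based
            if case_law_id in cited:
                rows.append({
                    "search_log_id": log["id"],
                    "case_law_id": case_law_id,
                    "rank": rank,
                    "relevance_score": 3,  # cited in the decision => highly relevant
                    "feedback_source": "cited_in_decision",
                })
    return rows
```

Because the table declares `UNIQUE(search_log_id, case_law_id, feedback_source)`, re-running such an inference would need an upsert or conflict-skipping insert to stay idempotent.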
 async def _run_schema_migrations(pool: asyncpg.Pool) -> None:
     async with pool.acquire() as conn:
         await conn.execute(SCHEMA_SQL)
@@ -924,7 +1026,9 @@ async def _run_schema_migrations(pool: asyncpg.Pool) -> None:
         await conn.execute(SCHEMA_V14_SQL)
         await conn.execute(SCHEMA_V15_SQL)
         await conn.execute(SCHEMA_V16_SQL)
-        logger.info("Database schema initialized (v1-v16)")
+        await conn.execute(SCHEMA_V17_SQL)
+        await conn.execute(SCHEMA_V18_SQL)
+        logger.info("Database schema initialized (v1-v18)")


 async def init_schema() -> None:
@@ -2338,10 +2442,15 @@ async def delete_case_law(case_law_id: UUID) -> bool:
 async def store_precedent_chunks(
     case_law_id: UUID, chunks: list[dict],
 ) -> int:
-    """Replace precedent chunks for a case_law row.
+    """Replace precedent chunks for a case_law row (single-tier).

     Each chunk dict has: chunk_index, content, section_type, page_number,
     embedding (list[float] or None).
+
+    All rows written here are stored with ``chunk_role='child'`` and
+    ``parent_chunk_id IS NULL`` — backward-compatible with the V17
+    schema (parent-doc lookup is a no-op for these rows). For two-tier
+    ingestion, see :func:`store_precedent_chunks_hierarchical`.
     """
     pool = await get_pool()
     async with pool.acquire() as conn:
@@ -2365,6 +2474,84 @@ async def store_precedent_chunks(
     return len(chunks)


+async def store_precedent_chunks_hierarchical(
+    case_law_id: UUID,
+    chunks: list[dict],
+) -> dict:
+    """Replace precedent chunks for a case_law row (two-tier).
+
+    Each input dict must carry:
+      * ``role``: 'child' | 'parent'
+      * ``local_id``: in-batch identifier (int) used to wire children
+        to their parent's DB UUID
+      * ``parent_local_id``: int (only for children) — references the
+        ``local_id`` of the parent in this same batch. For parents,
+        this is None.
+      * ``chunk_index``, ``content``, ``section_type``, ``page_number``
+      * ``embedding``: required for children, None for parents
+
+    Two-pass write inside a single transaction:
+      1. INSERT all parents (no FK back to children), capture
+         ``local_id → DB UUID`` map.
+      2. INSERT all children with ``parent_chunk_id`` resolved.
+
+    Returns ``{"parents": N, "children": M, "total": N+M}``.
+    """
+    parents = [c for c in chunks if c.get("role") == "parent"]
+    children = [c for c in chunks if c.get("role") == "child"]
+    if not parents and not children:
+        return {"parents": 0, "children": 0, "total": 0}
+
+    pool = await get_pool()
+    async with pool.acquire() as conn:
+        async with conn.transaction():
+            await conn.execute(
+                "DELETE FROM precedent_chunks WHERE case_law_id = $1",
+                case_law_id,
+            )
+            # Pass 1: parents — embedding intentionally NULL (parents
+            # aren't matched on; they only carry retrieval context).
+            local_to_uuid: dict[int, UUID] = {}
+            for p in parents:
+                row = await conn.fetchrow(
+                    """INSERT INTO precedent_chunks
+                       (case_law_id, chunk_index, content, section_type,
+                        page_number, embedding, chunk_role, parent_chunk_id)
+                       VALUES ($1, $2, $3, $4, $5, NULL, 'parent', NULL)
+                       RETURNING id""",
+                    case_law_id,
+                    p["chunk_index"],
+                    p["content"],
+                    p.get("section_type", "other"),
+                    p.get("page_number"),
+                )
+                local_to_uuid[int(p["local_id"])] = row["id"]
+
+            # Pass 2: children with resolved parent_chunk_id.
+            for c in children:
+                parent_uuid = local_to_uuid.get(
+                    int(c["parent_local_id"])
+                ) if c.get("parent_local_id") is not None else None
+                await conn.execute(
+                    """INSERT INTO precedent_chunks
+                       (case_law_id, chunk_index, content, section_type,
+                        page_number, embedding, chunk_role, parent_chunk_id)
+                       VALUES ($1, $2, $3, $4, $5, $6, 'child', $7)""",
+                    case_law_id,
+                    c["chunk_index"],
+                    c["content"],
+                    c.get("section_type", "other"),
+                    c.get("page_number"),
+                    c.get("embedding"),
+                    parent_uuid,
+                )
+    return {
+        "parents": len(parents),
+        "children": len(children),
+        "total": len(parents) + len(children),
+    }
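The `local_id` → UUID wiring in the two-pass write can be traced without a database. The following is a minimal in-memory sketch, assuming `uuid.uuid4()` as a stand-in for the `RETURNING id` value; `resolve_parent_links` is a hypothetical helper, not part of db.py:

```python
import uuid


def resolve_parent_links(chunks):
    """Mimic the two-pass write: assign IDs to parents, then link children."""
    parents = [c for c in chunks if c.get("role") == "parent"]
    children = [c for c in chunks if c.get("role") == "child"]

    rows = []
    local_to_uuid = {}
    # Pass 1: parents get their IDs first so children can reference them.
    for p in parents:
        row_id = uuid.uuid4()  # stand-in for the DB-generated UUID
        local_to_uuid[int(p["local_id"])] = row_id
        rows.append({"id": row_id, "chunk_role": "parent",
                     "parent_chunk_id": None, "content": p["content"]})
    # Pass 2: children resolve parent_local_id through the map.
    for c in children:
        plid = c.get("parent_local_id")
        parent_uuid = local_to_uuid.get(int(plid)) if plid is not None else None
        rows.append({"id": uuid.uuid4(), "chunk_role": "child",
                     "parent_chunk_id": parent_uuid, "content": c["content"]})
    return rows
```

As in the function above, a child whose `parent_local_id` is not present in the batch resolves to a NULL parent rather than raising, so it would behave like a legacy single-tier chunk at search time; a caller that wants stricter behaviour would have to validate the batch beforehand.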
+
+
 async def list_precedent_chunks(
     case_law_id: UUID,
     section_types: tuple[str, ...] | None = None,
@@ -2660,14 +2847,32 @@ async def search_precedent_library_semantic(
         LIMIT $2
     """

+    # Parent-doc retrieval (V17 / TaskMaster #48): the LEFT JOIN
+    # surfaces each chunk's parent_chunk's content alongside it. When
+    # ``config.PARENT_DOC_RETRIEVAL_ENABLED`` is true *and* the row has
+    # a non-null parent, the post-processing loop swaps in the parent's
+    # content so the writer sees the broader passage instead of the
+    # 300-token sliver that matched. Legacy rows (parent_chunk_id NULL)
+    # are unaffected — the JOIN returns NULL parent_* and the swap is a
+    # no-op. Index ``idx_precedent_chunks_role`` is not used here
+    # intentionally: filtering on chunk_role='child' would exclude
+    # legacy single-tier rows that default to 'child' but have no
+    # parent; an embedding-IS-NOT-NULL filter is equivalent because
+    # parents store NULL embeddings.
     chunk_sql = f"""
         SELECT pc.id AS chunk_id, pc.case_law_id, pc.content,
                pc.section_type, pc.page_number,
+               pc.parent_chunk_id,
+               parent.content AS parent_content,
+               parent.section_type AS parent_section_type,
+               parent.page_number AS parent_page_number,
                cl.case_number, cl.case_name, cl.court, cl.date AS decision_date,
                cl.precedent_level, cl.practice_area, cl.chair_name, cl.district,
                1 - (pc.embedding <=> $1) AS score
         FROM precedent_chunks pc
         JOIN case_law cl ON cl.id = pc.case_law_id
+        LEFT JOIN precedent_chunks parent
+               ON parent.id = pc.parent_chunk_id
         WHERE {' AND '.join(chunk_filters)}
           AND pc.embedding IS NOT NULL
         ORDER BY pc.embedding <=> $1
@@ -2697,10 +2902,68 @@ async def search_precedent_library_semantic(
             d["decision_date"] = d["decision_date"].isoformat()
         d["score"] = float(d["score"])
         d["type"] = "passage"
+        _maybe_swap_parent(d)
         results.append(d)

     results.sort(key=lambda x: x["score"], reverse=True)
-    return results[:limit]
+    # Dedupe: when multiple child hits share the same parent, we'd
+    # otherwise return duplicate parent content. Keep the highest-
+    # scoring hit per parent (skip if parent swap disabled or row has
+    # no parent — chunk_id alone remains unique).
+    return _dedupe_by_parent(results, limit)
+
+
+def _maybe_swap_parent(row: dict) -> None:
+    """Promote parent content into ``content`` when the flag is on
+    and the row has a non-NULL parent. Mutates ``row`` in place.
+
+    Adds debug fields ``child_content`` / ``child_section_type`` /
+    ``child_page_number`` so callers can see what originally matched.
+    Strips the ``parent_*`` keys that come back from the LEFT JOIN —
+    they're an implementation detail of the swap.
+    """
+    parent_content = row.pop("parent_content", None)
+    parent_section = row.pop("parent_section_type", None)
+    parent_page = row.pop("parent_page_number", None)
+    if (
+        config.PARENT_DOC_RETRIEVAL_ENABLED
+        and row.get("parent_chunk_id") is not None
+        and parent_content
+    ):
+        row["child_content"] = row.get("content")
+        row["child_section_type"] = row.get("section_type")
+        row["child_page_number"] = row.get("page_number")
+        row["content"] = parent_content
+        # Parent's section_type is authoritative for the swapped row
+        # (children inherit from their parent, but a parent that spans
+        # a boundary uses its first section's type — same convention).
+        if parent_section:
+            row["section_type"] = parent_section
+        if parent_page is not None:
+            row["page_number"] = parent_page
+        row["parent_swap"] = True
+
+
+def _dedupe_by_parent(rows: list[dict], limit: int) -> list[dict]:
+    """When parent-doc swap is active, multiple children sharing a
+    parent collapse to one parent row (the highest-scored child wins).
+    Rows without a parent (legacy chunks, halachot) pass through
+    unchanged.
+    """
+    if not config.PARENT_DOC_RETRIEVAL_ENABLED:
+        return rows[:limit]
+    seen_parents: set = set()
+    out: list[dict] = []
+    for r in rows:
+        pid = r.get("parent_chunk_id")
+        if pid and r.get("parent_swap"):
+            if pid in seen_parents:
+                continue
+            seen_parents.add(pid)
+        out.append(r)
+        if len(out) >= limit:
+            break
+    return out
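To see the combined effect of the swap and the dedupe on a result set, here is a condensed standalone sketch. It folds both helpers into one hypothetical function, `swap_and_dedupe`, and replaces `config.PARENT_DOC_RETRIEVAL_ENABLED` with an explicit `enabled` argument so it runs without the project's config module:

```python
def swap_and_dedupe(rows, limit, enabled=True):
    """Promote parent content and keep only the best-scoring hit per parent."""
    out = []
    seen_parents = set()
    for row in sorted(rows, key=lambda r: r["score"], reverse=True):
        row = dict(row)  # work on a copy; the caller's rows stay untouched
        parent_content = row.pop("parent_content", None)
        pid = row.get("parent_chunk_id")
        if enabled and pid is not None and parent_content:
            if pid in seen_parents:
                continue  # a higher-scoring sibling already surfaced this parent
            seen_parents.add(pid)
            row["child_content"] = row["content"]  # keep what actually matched
            row["content"] = parent_content
            row["parent_swap"] = True
        out.append(row)
        if len(out) >= limit:
            break
    return out


hits = [
    {"chunk_id": "c1", "parent_chunk_id": "p1", "content": "child one",
     "parent_content": "PARENT", "score": 0.9},
    {"chunk_id": "c2", "parent_chunk_id": "p1", "content": "child two",
     "parent_content": "PARENT", "score": 0.8},
    {"chunk_id": "c3", "parent_chunk_id": None, "content": "legacy",
     "parent_content": None, "score": 0.7},
]
```

With the flag on, the two children of `p1` collapse into a single row carrying the parent text, and the legacy row passes through; with `enabled=False`, all three rows come back with their original content.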


 async def search_precedent_library_lexical(
@@ -2815,15 +3078,32 @@ async def search_precedent_library_lexical(
         LIMIT $2
     """

+    # Parent-doc retrieval (V17) — same LEFT JOIN strategy as the
+    # semantic side. The tsvector match still runs over the child's
+    # ``content_tsv``; only the *returned* content is promoted to the
+    # parent when the flag is on and a parent exists. See
+    # :func:`search_precedent_library_semantic` for the rationale.
+    # We intentionally restrict matching to chunks with an embedding
+    # (i.e. children + legacy single-tier rows). Hierarchical parents
+    # store NULL embeddings, so even though their ``content_tsv`` is
+    # populated they're excluded here — preventing a parent from
+    # matching directly and then being "swapped" with itself.
     chunk_sql = f"""
         SELECT pc.id AS chunk_id, pc.case_law_id, pc.content,
                pc.section_type, pc.page_number,
+               pc.parent_chunk_id,
+               parent.content AS parent_content,
+               parent.section_type AS parent_section_type,
+               parent.page_number AS parent_page_number,
                cl.case_number, cl.case_name, cl.court, cl.date AS decision_date,
                cl.precedent_level, cl.practice_area, cl.chair_name, cl.district,
                ts_rank_cd(pc.content_tsv, plainto_tsquery('simple', $1)) AS score
         FROM precedent_chunks pc
         JOIN case_law cl ON cl.id = pc.case_law_id
+        LEFT JOIN precedent_chunks parent
+               ON parent.id = pc.parent_chunk_id
         WHERE {' AND '.join(chunk_filters)}
+          AND pc.embedding IS NOT NULL
           AND pc.content_tsv @@ plainto_tsquery('simple', $1)
         ORDER BY score DESC
         LIMIT $2
@@ -2847,10 +3127,11 @@ async def search_precedent_library_lexical(
             d["decision_date"] = d["decision_date"].isoformat()
         d["score"] = float(d["score"])
         d["type"] = "passage"
+        _maybe_swap_parent(d)
         results.append(d)

     results.sort(key=lambda x: x["score"], reverse=True)
-    return results[:limit]
+    return _dedupe_by_parent(results, limit)


 async def precedent_library_stats() -> dict: