FU-3: re-index on content change (GAP-09) #14
@@ -2105,7 +2105,7 @@
 "description": "embedding is updated automatically on content change (currently trigger-dependent, not GENERATED).",
 "details": "Covers GAP-09. Satisfies INV-DM3/G6. severity: High. Type: code + migration (re-embed). Depends on FU-1.",
 "testStrategy": "",
-"status": "pending",
+"status": "done",
 "dependencies": [
 "59"
 ],
@@ -2117,7 +2117,7 @@
 "description": "embedding is not GENERATED, unlike the tsvectors; a drift point.",
 "dependencies": [],
 "details": "INV-DM3/G6",
-"status": "pending",
+"status": "done",
 "testStrategy": "",
 "parentId": "61"
 },
@@ -2128,8 +2128,8 @@
 "dependencies": [
 1
 ],
-"details": "Source: DB check 2026-05-30 (precedent_image_embeddings JOIN case_law). internal_committee: 14/56 with page-images, 42 without. Derived from the GAP-02/FU-1 boundary discussion. Not a correctness gap; an improvement to multimodal coverage.",
-"status": "pending",
+"details": "Source: DB check 2026-05-30 (precedent_image_embeddings JOIN case_law). internal_committee: 14/56 with page-images, 42 without. Derived from the GAP-02/FU-1 boundary discussion. Not a correctness gap; an improvement to multimodal coverage. | CLOSED not-applicable 2026-05-30: all 42 records have document_id=NULL and there is no PDF on disk; multimodal requires rendering a PDF, which is impossible for text-only records. If a PDF is uploaded, the regular ingest handles it.",
+"status": "cancelled",
 "testStrategy": "",
 "parentId": "61"
 }
421
docs/superpowers/plans/2026-05-30-fu3-reindex-on-change.md
Normal file
@@ -0,0 +1,421 @@
# FU-3: Re-Index on Content Change (Implementation Plan)

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Detect content changes via a SHA-256 `content_hash`, expose a standalone `reindex_case_law` that re-embeds from stored `full_text` (no re-OCR, no file needed), and surface embedding drift in the health-check. Together these enforce INV-G6 for embeddings, which cannot be DB-GENERATED.

**Architecture:** Two additive `case_law` columns (V23): `content_hash` (hash of the current `full_text`, written at the create boundary) and `indexed_hash` (the hash the current chunks/embeddings were built from, set by `mark_indexed` after a successful store). A row is stale ⇔ `content_hash IS DISTINCT FROM indexed_hash`. `reindex_case_law` reuses the canonical `_chunk_embed_store` over the stored text. The backfill only computes hashes; it does not re-embed, so existing rows keep their vectors.
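
The freshness rule above can be pictured with a small in-memory model. This is only a sketch: the `Row` dataclass and its methods are hypothetical stand-ins for a `case_law` row, not part of the codebase, and `is_stale` mirrors `IS DISTINCT FROM` (a `NULL` `indexed_hash` counts as different from any hash).

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


def content_hash(text: str | None) -> str:
    """Same rule as the plan's `_content_hash`: SHA-256 hex, "" for empty/None."""
    if not text:
        return ""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class Row:
    """Hypothetical in-memory stand-in for one case_law row."""

    full_text: str
    content_hash: str = ""
    indexed_hash: str | None = None  # NULL until the first successful store

    def write_text(self, text: str) -> None:
        # Create/upsert boundary: content_hash always follows full_text.
        self.full_text = text
        self.content_hash = content_hash(text)

    def mark_indexed(self) -> None:
        # Called only after a successful chunk+embed+store.
        self.indexed_hash = self.content_hash

    def is_stale(self) -> bool:
        # `!=` against None behaves like IS DISTINCT FROM here:
        # a never-indexed row (indexed_hash is None) is stale.
        return bool(self.full_text) and self.content_hash != self.indexed_hash


row = Row(full_text="")
row.write_text("original decision text")
states = [row.is_stale()]       # written, never indexed
row.mark_indexed()
states.append(row.is_stale())   # indexed from the current text
row.write_text("amended decision text")
states.append(row.is_stale())   # text changed after indexing
print(states)  # [True, False, True]
```

A row with empty text is never reported as stale in this model, which matches the `coalesce(full_text, '') <> ''` filter used by the drift query later in the plan.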

**Tech Stack:** Python 3.12, asyncpg, PostgreSQL@localhost:5433, voyage embeddings API, pytest (offline), `.venv` at `mcp-server/.venv`.

**Spec:** [docs/superpowers/specs/2026-05-30-fu3-reindex-on-change-design.md](../specs/2026-05-30-fu3-reindex-on-change-design.md)

**Run tests:** `cd ~/legal-ai/mcp-server && .venv/bin/python -m pytest tests/test_reindex_on_change.py -v`

---

## File Structure

- **Modify** `mcp-server/src/legal_mcp/services/db.py`: `_content_hash`; V23 migration; `content_hash` in `create_external_case_law`/`create_internal_committee_decision` (not `create_case`; see Task 2 Step 4); `mark_indexed`; `list_stale_case_law`; `recompute_content_hashes`.
- **Modify** `mcp-server/src/legal_mcp/services/ingest.py`: `reindex_case_law`; call `mark_indexed` after `_chunk_embed_store` in `ingest_document`.
- **Modify** `mcp-server/src/legal_mcp/services/metrics.py`: `stale_embedding_case_law` count.
- **Modify** `mcp-server/src/legal_mcp/tools/precedent_library.py` + `server.py`: MCP tool `precedent_reindex`.
- **Create** `mcp-server/tests/test_reindex_on_change.py`.

---

## Task 1: Failing tests

**Files:** Create `mcp-server/tests/test_reindex_on_change.py`

- [ ] **Step 1: Write the failing tests**

```python
"""FU-3: re-index on content change (offline, monkeypatched I/O)."""
from __future__ import annotations

import asyncio
from uuid import uuid4

import pytest

from legal_mcp.services import db, ingest


def _run(coro):
    return asyncio.run(coro)


# ── content_hash is deterministic ──────────────────────────────────────
def test_content_hash_deterministic():
    h1 = db._content_hash("פסק דין כלשהו")
    h2 = db._content_hash("פסק דין כלשהו")
    assert h1 == h2 and len(h1) == 64  # sha256 hex


def test_content_hash_empty_is_blank():
    assert db._content_hash("") == ""
    assert db._content_hash(None) == ""


def test_content_hash_changes_with_text():
    assert db._content_hash("alpha") != db._content_hash("beta")


# ── mark_indexed copies content_hash → indexed_hash ─────────────────────
def test_mark_indexed_executes_update(monkeypatch):
    seen = {}

    class _Conn:
        async def execute(self, q, *a):
            seen["q"] = q
            seen["args"] = a
        async def __aenter__(self): return self
        async def __aexit__(self, *a): return False

    class _Pool:
        def acquire(self): return _Conn()

    async def _pool(): return _Pool()
    monkeypatch.setattr(db, "get_pool", _pool)

    cid = uuid4()
    _run(db.mark_indexed(cid))
    assert "indexed_hash" in seen["q"] and "content_hash" in seen["q"]
    assert seen["args"][0] == cid


# ── reindex_case_law re-embeds from stored text, no extractor/LLM ───────
def test_reindex_case_law_uses_stored_text(monkeypatch):
    cid = uuid4()
    calls = {"chunk_embed_store": [], "mark_indexed": []}

    async def _get_case_law(x):
        return {"id": cid, "full_text": "טקסט שמור של ההחלטה"}
    monkeypatch.setattr(ingest.db, "get_case_law", _get_case_law)

    async def _ces(case_law_id, text, page_offsets, page_count, progress):
        calls["chunk_embed_store"].append((case_law_id, text))
        return 5
    monkeypatch.setattr(ingest, "_chunk_embed_store", _ces)

    async def _mark(x):
        calls["mark_indexed"].append(x)
    monkeypatch.setattr(ingest.db, "mark_indexed", _mark)

    out = _run(ingest.reindex_case_law(cid))
    assert out["chunks"] == 5 and out["reindexed"] is True
    assert calls["chunk_embed_store"][0][1] == "טקסט שמור של ההחלטה"
    assert calls["mark_indexed"] == [cid]


def test_reindex_case_law_missing_row_raises(monkeypatch):
    async def _none(x): return None
    monkeypatch.setattr(ingest.db, "get_case_law", _none)
    with pytest.raises(ValueError, match="not found"):
        _run(ingest.reindex_case_law(uuid4()))
```

- [ ] **Step 2: Run to verify failure**

Run: `cd ~/legal-ai/mcp-server && .venv/bin/python -m pytest tests/test_reindex_on_change.py -v`
Expected: FAIL with `AttributeError: ... no attribute '_content_hash'` (and likewise for `mark_indexed` / `reindex_case_law`).

- [ ] **Step 3: Commit**

```bash
cd ~/legal-ai
git add mcp-server/tests/test_reindex_on_change.py
git commit -m "test(reindex): failing tests for content-hash re-index (FU-3)"
```

---

## Task 2: V23 + hash helpers + content_hash at write

**Files:** Modify `mcp-server/src/legal_mcp/services/db.py`

- [ ] **Step 1: Ensure `hashlib` import + add `_content_hash`**

READ the top imports of db.py. If `import hashlib` is absent, add it. Add this helper near `_canonical_case_number` (~line 1227):

```python
def _content_hash(text: str) -> str:
    """SHA-256 hex of the text: a deterministic content fingerprint (FU-3/GAP-09).

    Empty/None → "" (a row with no text has no content fingerprint).
    """
    if not text:
        return ""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```
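
One property of a raw byte hash is worth keeping in mind: two strings that render identically can differ in Unicode normalization form and therefore hash differently. The sketch below illustrates this with the standard-library `unicodedata` module. `content_hash_normalized` is a hypothetical variant for comparison only; the plan does not adopt it, and switching to it later would change every stored hash.

```python
import hashlib
import unicodedata


def content_hash(text: str) -> str:
    """Mirror of the plan's `_content_hash` (raw UTF-8 bytes, no normalization)."""
    if not text:
        return ""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def content_hash_normalized(text: str) -> str:
    """Hypothetical variant: normalize to NFC before hashing."""
    if not text:
        return ""
    return content_hash(unicodedata.normalize("NFC", text))


composed = "caf\u00e9"     # "é" as a single code point
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent

same_raw = content_hash(composed) == content_hash(decomposed)
same_nfc = content_hash_normalized(composed) == content_hash_normalized(decomposed)
print(same_raw, same_nfc)  # False True
```

With the raw hash, a re-save that only changes normalization would flag the row as stale and trigger a re-embed. That is a conservative failure mode (extra work rather than missed drift), which may be acceptable here.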

- [ ] **Step 2: Add `SCHEMA_V23_SQL` after `SCHEMA_V22_SQL` + wire it**

READ near `SCHEMA_V22_SQL` and `_run_schema_migrations`. Add after the V22 block:

```python
# ── V23: case_law content/indexed hashes — re-index on content change (GAP-09) ──
# content_hash = SHA-256 of current full_text (written at the create boundary).
# indexed_hash = the content_hash the CURRENT chunks/embeddings were built from
# (set by mark_indexed after a successful store). Stale ⇔ content_hash IS
# DISTINCT FROM indexed_hash. embedding can't be a GENERATED column (needs an
# API call), so freshness is enforced by detection + reindex_case_law + health-check.
SCHEMA_V23_SQL = """
ALTER TABLE case_law ADD COLUMN IF NOT EXISTS content_hash text NOT NULL DEFAULT '';
ALTER TABLE case_law ADD COLUMN IF NOT EXISTS indexed_hash text;
"""
```
After `await conn.execute(SCHEMA_V22_SQL)` add `await conn.execute(SCHEMA_V23_SQL)`; bump the log line to `v1-v23`.

- [ ] **Step 3: Write `content_hash` in the two case_law create functions**

In `create_external_case_law` and `create_internal_committee_decision` (db.py ~2610-2760), the `INSERT ... ON CONFLICT ... DO UPDATE` was built in FU-2a. For EACH:
1. Add `content_hash` to the INSERT column list (append after the last data column, before the closing `)`).
2. Add a matching `$N` placeholder in VALUES (next number after the current max).
3. Add `content_hash = EXCLUDED.content_hash` to the `DO UPDATE SET` clause.
4. Append `_content_hash(full_text)` as the LAST positional arg in the `conn.fetchrow(..., <args>)` call (matching the new `$N`).

CRITICAL: the new placeholder number must equal `(current highest $N) + 1`, and the new arg must be appended LAST in the args tuple so that positions line up. Read the current SQL + args carefully and count. After editing, verify that the param count equals the placeholder count. The offline tests in Step 6 only catch Python-level breakage (the SQL is just a string until it runs); the DB smoke in Task 6 confirms the count at runtime.
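
The count check described above can also be done offline. The following is a rough sketch, assuming the SQL uses asyncpg-style `$N` placeholders; `check_placeholders` and the sample statement are hypothetical and only illustrate the counting rule, they are not taken from `db.py`.

```python
import re


def check_placeholders(sql: str, args: tuple) -> None:
    """Raise ValueError unless exactly $1..$N are used and N == len(args)."""
    used = {int(n) for n in re.findall(r"\$(\d+)", sql)}
    expected = set(range(1, len(args) + 1))
    if used != expected:
        missing = sorted(expected - used)
        extra = sorted(used - expected)
        raise ValueError(f"placeholder mismatch: missing={missing} extra={extra}")


# Hypothetical upsert, shortened for illustration.
SQL = """
INSERT INTO case_law (case_number, full_text, content_hash)
VALUES ($1, $2, $3)
ON CONFLICT (case_number) DO UPDATE
SET full_text = EXCLUDED.full_text,
    content_hash = EXCLUDED.content_hash
"""

check_placeholders(SQL, ("1234-05-26", "text", "abc123"))  # passes silently

try:
    check_placeholders(SQL, ("1234-05-26", "text"))  # the new arg was forgotten
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)  # True
```

The regex is deliberately naive: it would also count a `$1` that appears inside a SQL string literal, so it suits a quick sanity check rather than a general SQL validator.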

- [ ] **Step 4: Do NOT write `content_hash` in `create_case` (out of scope)**

`create_case` (db.py ~1130-1165) inserts into `cases`, which is a different table (active appeal cases); FU-3's scope is `case_law` (the corpus). Do NOT alter `create_case` or the `cases` table here. (The spec §3 mentioned `create_case` for normalization in FU-2a; for FU-3 hashing, the scope is `case_law` only.)

- [ ] **Step 5: Add `mark_indexed`, `list_stale_case_law`, `recompute_content_hashes` (after `get_case_law`, ~line 2547)**

```python
async def mark_indexed(case_law_id: UUID) -> None:
    """Mark a case_law row's embeddings as built from its current content (FU-3).

    Sets indexed_hash := content_hash. Call AFTER a successful chunk+embed+store.
    """
    pool = await get_pool()
    async with pool.acquire() as conn:
        await conn.execute(
            "UPDATE case_law SET indexed_hash = content_hash WHERE id = $1",
            case_law_id,
        )


async def list_stale_case_law(limit: int = 500) -> list[dict]:
    """case_law rows whose embeddings are stale vs current content (GAP-09/INV-G6)."""
    pool = await get_pool()
    async with pool.acquire() as conn:
        rows = await conn.fetch(
            """SELECT id, case_number, source_kind
               FROM case_law
               WHERE coalesce(full_text, '') <> ''
                 AND content_hash IS DISTINCT FROM indexed_hash
               ORDER BY created_at LIMIT $1""",
            limit,
        )
    return [dict(r) for r in rows]


async def recompute_content_hashes() -> dict:
    """Backfill (FU-3): set content_hash for all rows; set indexed_hash=content_hash
    only where chunks already exist (those are already embedded). Rows with text but
    no chunks get indexed_hash=NULL → surface as stale. Hash-only; no re-embed."""
    pool = await get_pool()
    updated = 0
    async with pool.acquire() as conn:
        rows = await conn.fetch("SELECT id, full_text FROM case_law")
        for r in rows:
            ch = _content_hash(r["full_text"] or "")
            has_chunks = await conn.fetchval(
                "SELECT EXISTS(SELECT 1 FROM precedent_chunks WHERE case_law_id = $1)",
                r["id"])
            await conn.execute(
                "UPDATE case_law SET content_hash = $2, "
                "indexed_hash = CASE WHEN $3 THEN $2 ELSE indexed_hash END WHERE id = $1",
                r["id"], ch, bool(has_chunks))
            updated += 1
    return {"updated": updated}
```
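
The backfill rule in `recompute_content_hashes` can be checked without a database by modelling it as a pure function. This is a sketch only: `plan_backfill`, `stale_ids` and their input shapes are hypothetical, and it assumes `indexed_hash` is `NULL` for every row right after V23 is applied.

```python
from __future__ import annotations

import hashlib


def content_hash(text: str | None) -> str:
    if not text:
        return ""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_backfill(
    rows: dict[str, str | None], ids_with_chunks: set[str]
) -> dict[str, tuple[str, str | None]]:
    """Return {row_id: (content_hash, indexed_hash)} as the backfill would leave it."""
    result = {}
    for row_id, full_text in rows.items():
        ch = content_hash(full_text)
        # Rows that already have chunks are treated as embedded from current text.
        result[row_id] = (ch, ch if row_id in ids_with_chunks else None)
    return result


def stale_ids(state, rows):
    """Mirror of the drift query: non-empty text and the two hashes differ."""
    return sorted(
        row_id for row_id, (ch, ih) in state.items() if rows[row_id] and ch != ih
    )


rows = {"a": "embedded decision", "b": "cited only, never chunked", "c": None}
state = plan_backfill(rows, ids_with_chunks={"a"})
print(stale_ids(state, rows))  # ['b']
```

Note the assumption baked into the rule: a row that has chunks is trusted to have been embedded from its current text. The backfill cannot verify that, so a row whose text was edited before V23 without a re-ingest would be marked fresh.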

- [ ] **Step 6: Run the helper tests**

Run: `cd ~/legal-ai/mcp-server && .venv/bin/python -m pytest tests/test_reindex_on_change.py -k "content_hash or mark_indexed" -v`
Expected: `test_content_hash_*` (3) + `test_mark_indexed_executes_update` PASS.

- [ ] **Step 7: Commit**

```bash
cd ~/legal-ai
git add mcp-server/src/legal_mcp/services/db.py
git commit -m "feat(reindex): V23 content/indexed hashes + helpers + write content_hash (GAP-09, FU-3)"
```

---

## Task 3: `reindex_case_law` + mark_indexed on ingest

**Files:** Modify `mcp-server/src/legal_mcp/services/ingest.py`

- [ ] **Step 1: Call `mark_indexed` after successful chunk+embed+store in `ingest_document`**

READ `ingest_document` and find the line `stored_chunks = await _chunk_embed_store(case_law_id, raw_text, page_offsets, page_count, progress)` (~line 184). Immediately AFTER it, add:

```python
await db.mark_indexed(case_law_id)
```
(After a fresh ingest, chunks were just built from the current text → indexed_hash = content_hash.)

- [ ] **Step 2: Add `reindex_case_law` (append to ingest.py)**

```python
async def reindex_case_law(
    case_law_id: "UUID | str",
    progress: ProgressCb | None = None,
) -> dict:
    """Re-chunk + re-embed an existing case_law row from its STORED full_text (GAP-09).

    No re-extract / no re-OCR (uses the stored text; see feedback_no_reocr_retrofit)
    and no LLM/CLI (only chunker + voyage embeddings), so it is safe to run anywhere.
    Idempotent: store_precedent_chunks(_hierarchical) is DELETE-then-INSERT.
    """
    progress = progress or _noop_progress
    cid = case_law_id if isinstance(case_law_id, UUID) else UUID(str(case_law_id))
    row = await db.get_case_law(cid)
    if not row:
        raise ValueError(f"case_law not found: {cid}")
    text = (row.get("full_text") or "").strip()
    if not text:
        raise ValueError("case_law has no stored full_text to re-index")
    stored = await _chunk_embed_store(cid, text, None, 0, progress)
    await db.mark_indexed(cid)
    await progress("completed", 100, f"הוטמע מחדש: {stored} chunks")  # "re-embedded: N chunks"
    return {"status": "completed", "case_law_id": str(cid), "chunks": stored, "reindexed": True}
```
(`UUID`, `db`, `_chunk_embed_store`, `_noop_progress`, `ProgressCb` are already in ingest.py.)
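
`reindex_case_law` handles one row; clearing a backlog reported by `list_stale_case_law` needs a small loop around it. The driver below is a hedged sketch and not part of this plan: `reindex_all_stale` is a hypothetical name, and the two callables are injected so the example runs offline, with stubs standing in for `db.list_stale_case_law` and `ingest.reindex_case_law`.

```python
import asyncio
from typing import Awaitable, Callable


async def reindex_all_stale(
    list_stale: Callable[[], Awaitable[list[dict]]],
    reindex: Callable[[str], Awaitable[dict]],
) -> dict:
    """Re-index every stale row sequentially; collect failures instead of aborting."""
    done, failed = [], {}
    for row in await list_stale():
        row_id = str(row["id"])
        try:
            await reindex(row_id)
            done.append(row_id)
        except Exception as exc:  # one bad row should not stop the batch
            failed[row_id] = str(exc)
    return {"reindexed": done, "failed": failed}


# Offline stubs with made-up data, standing in for the real db/ingest calls.
async def _fake_list_stale() -> list[dict]:
    return [{"id": "row-1"}, {"id": "row-2"}]


async def _fake_reindex(row_id: str) -> dict:
    if row_id == "row-2":
        raise ValueError("case_law has no stored full_text to re-index")
    return {"chunks": 5, "reindexed": True}


summary = asyncio.run(reindex_all_stale(_fake_list_stale, _fake_reindex))
print(summary)
```

Running rows one at a time keeps the number of concurrent embedding requests at one, which is the cautious choice when the API's rate limits are not known. Since `list_stale_case_law` defaults to `limit=500`, a larger backlog would need repeated runs.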

- [ ] **Step 3: Run reindex tests**

Run: `cd ~/legal-ai/mcp-server && .venv/bin/python -m pytest tests/test_reindex_on_change.py -v`
Expected: ALL pass (incl `test_reindex_case_law_uses_stored_text`, `test_reindex_case_law_missing_row_raises`).

- [ ] **Step 4: Commit**

```bash
cd ~/legal-ai
git add mcp-server/src/legal_mcp/services/ingest.py
git commit -m "feat(reindex): reindex_case_law from stored text + mark_indexed on ingest (GAP-09, FU-3)"
```

---

## Task 4: Health-check drift count

**Files:** Modify `mcp-server/src/legal_mcp/services/metrics.py`

- [ ] **Step 1: Add `stale_embedding_case_law` count**

READ metrics.py and find the aggregation that holds `non_searchable_case_law` / `cases_with_stale_blocks` (added in FU-2a/FU-7). Add a sibling, mirroring the exact pattern:

```python
stale_embedding_case_law = await conn.fetchval(
    "SELECT COUNT(*) FROM case_law "
    "WHERE coalesce(full_text,'') <> '' AND content_hash IS DISTINCT FROM indexed_hash")
```
and expose it in the returned summary dict: `"stale_embedding_case_law": stale_embedding_case_law`.

- [ ] **Step 2: Smoke-import + commit**

Run: `cd ~/legal-ai/mcp-server && .venv/bin/python -c "from legal_mcp.services import metrics; print('clean')"`
```bash
cd ~/legal-ai
git add mcp-server/src/legal_mcp/services/metrics.py
git commit -m "feat(reindex): health-check stale_embedding_case_law count (GAP-09, FU-3)"
```

---

## Task 5: MCP tool `precedent_reindex`

**Files:** Modify `mcp-server/src/legal_mcp/tools/precedent_library.py`, `mcp-server/src/legal_mcp/server.py`

- [ ] **Step 1: Add the tool function in precedent_library.py (mirror `precedent_extract_metadata`)**

READ `precedent_extract_metadata` (tools/precedent_library.py ~205-216) for the `_ok`/`_err`/UUID pattern. Add:

```python
async def precedent_reindex(case_law_id: str) -> str:
    """Re-chunk + re-embed an existing precedent from its stored full_text (FU-3/GAP-09).

    For fixing embedding drift or after a content change. Does not run OCR/LLM, only
    chunking + voyage embeddings. Idempotent (deletes and rebuilds the chunks).
    """
    try:
        cid = UUID(case_law_id)
    except ValueError:
        return _err("case_law_id לא תקין")  # "invalid case_law_id"
    try:
        from legal_mcp.services import ingest
        result = await ingest.reindex_case_law(cid)
    except Exception as e:
        return _err(str(e))
    return _ok(result)
```

- [ ] **Step 2: Register in server.py (mirror the precedent tools' `@mcp.tool()` registration)**

READ server.py and find where `precedent_extract_metadata` (or another `precedent_*` tool) is registered with `@mcp.tool()` and delegated to `tools.precedent_library`. Add an equivalent registration for `precedent_reindex` following the identical pattern (decorator + delegation + the same import style). Report the exact registration block you added.

- [ ] **Step 3: Smoke-import + commit**

Run: `cd ~/legal-ai/mcp-server && .venv/bin/python -c "from legal_mcp.tools import precedent_library; import legal_mcp.server; print('clean')"`
```bash
cd ~/legal-ai
git add mcp-server/src/legal_mcp/tools/precedent_library.py mcp-server/src/legal_mcp/server.py
git commit -m "feat(reindex): precedent_reindex MCP tool (GAP-09, FU-3)"
```

---

## Task 6: Backfill + full suite + DB smoke + lint + TaskMaster

- [ ] **Step 1: Full offline suite**

Run: `cd ~/legal-ai/mcp-server && .venv/bin/python -m pytest tests/ -q`
Expected: all pass (FU-1/2a/7 + new FU-3). If a pre-existing test that calls `ingest_document` breaks because `mark_indexed` isn't stubbed, fix that fixture to stub `db.mark_indexed` (same pattern as the FU-2a `recompute_searchable` fixture fix). Report.

- [ ] **Step 2: DB smoke + backfill (real Postgres: applies V23, runs backfill)**

```bash
cd ~/legal-ai && set -a && source ~/.env 2>/dev/null && set +a
cd mcp-server && .venv/bin/python -c "
import asyncio
from legal_mcp.services import db
async def main():
    await db.get_pool()  # applies V23
    pool = await db.get_pool()
    async with pool.acquire() as c:
        cols = await c.fetchval(\"SELECT count(*) FROM information_schema.columns WHERE table_name='case_law' AND column_name IN ('content_hash','indexed_hash')\")
        print('V23 columns present:', cols, '(expect 2)')
    res = await db.recompute_content_hashes()
    print('backfill:', res)
    stale = await db.list_stale_case_law()
    print('stale after backfill:', len(stale))
asyncio.run(main())
" 2>&1 | grep -vE 'INFO|WARNING|httpx|deprecat|command not found|\^\^\^' | tail -5
```
Expected: `V23 columns present: 2`, backfill updated ~129, `stale after backfill:` a small number (rows with text but no chunks, e.g. cited_only). Report the stale count.

- [ ] **Step 3: Lint**

Run: `cd ~/legal-ai/mcp-server && .venv/bin/python -m ruff check src/legal_mcp/services/db.py src/legal_mcp/services/ingest.py 2>/dev/null; echo "exit=$?"`
Expected: no findings and `exit=0`. If ruff is not installed, the error message is hidden by `2>/dev/null` and only a non-zero `exit=` is printed.

- [ ] **Step 4: TaskMaster**: controller marks #61 + subtask 61.1 done (61.2 already cancelled), verifies via MCP.

---

## Self-Review Notes

- **GAP-09** → content_hash detection (Task 2) + reindex_case_law (Task 3) + drift health-check (Task 4) + MCP tool (Task 5).
- **No re-OCR:** reindex uses stored `full_text` only (Task 3); this honors feedback_no_reocr_retrofit.
- **Backfill is hash-only** (Task 6 Step 2): no re-embed, no API cost; existing vectors untouched.
- **#61.2 closed** (not-applicable, in the spec commit): no multimodal backfill task here.
- **Scope:** `case_law` only; `create_case`/`cases` table NOT touched (Task 2 Step 4).
- **Type consistency:** `_content_hash(text)->str`, `mark_indexed(case_law_id)`, `reindex_case_law(id)->{chunks,reindexed}`, `list_stale_case_law()`, `recompute_content_hashes()->{updated}`. Names are identical across tasks + tests.
- **Param-count risk** (Task 2 Step 3): the FU-2a upsert SQL must get exactly one new placeholder + one new arg per function; verified at runtime by the Task 6 DB smoke (a mismatch raises immediately).
@@ -0,0 +1,113 @@
# FU-3: Re-Index on Content Change (Design)

**Status:** approved for design · **Date:** 2026-05-30 · **Branch:** TBD
**Covers:** GAP-09 · **Satisfies:** INV-DM3, INV-G6, INV-ING4 (freshness) · **Task:** TaskMaster #61
**Depends on:** FU-1 (#59) · **Type:** pure code + a cheap hash backfill (zero re-embed in a regular run)
**Migration:** V23, additive (2 hash columns) + a deterministic, reversible hash backfill. No mass re-embed.

---

## 1. The problem (verified in code)

`embedding` is not a `GENERATED` column (unlike the tsvectors, which update automatically on content change). Computing an embedding requires an API call, so it cannot be made GENERATED. The code mapping found:

- **Re-ingest through `ingest_document` already re-indexes correctly**: `_chunk_embed_store` runs unconditionally and `store_precedent_chunks(_hierarchical)` are DELETE-then-INSERT. So the full path is sound.
- **3 real gaps:** (a) no **content-change detection** (there is no `content_hash`/`updated_at` on case_law); (b) no **standalone re-index entry point**: to re-embed, the **file** must be re-ingested, yet many records (for example the 42 committee decisions) were ingested from `text` with no file; (c) no **drift detection** between content and embeddings.

**Enforcing INV-G6** ("re-index on every content change") when the embedding is not GENERATED = **detection (hash) + a reindex tool that works from stored content + health-check**, exactly like the FU-7 drift pattern (detect-don't-auto-magic).

## 2. Architectural decisions (verified against ≥3 sources)

| Decision | Rationale | Sources |
|-------|--------|--------|
| `content_hash` (SHA-256 of full_text) for change detection, not a timestamp | a content hash is the recommended pattern when no reliable timestamp exists; deterministic + collision-resistant | Hash-based change detection (DeepWiki); Andy Dote content-hash; moby#9391 |
| re-index **from stored full_text**, not from re-extract/re-OCR | OCR is non-deterministic; use the stored text (consistent with [[feedback_no_reocr_retrofit]]) | RAG re-embed-on-edit (Medium); particula incremental update |
| detect→re-embed **only what changed** (no full rebuild) + health-check staleness | incremental sync; monitor recall as the index ages | apxml RAG updates; Pinecone/Weaviate (gap-audit) |
| backfill = hash only (no re-embed); existing records are already embedded | cheap, reversible, zero API cost; re-embed only when content really changed | none (derived from the current state: 80 records already embedded) |

## 3. Files

- **Modify** `services/db.py`: V23 (`content_hash`, `indexed_hash` on case_law); `_content_hash(text)`; writing `content_hash` on entry to `create_external_case_law`/`create_internal_committee_decision`/`create_case`; `mark_indexed(case_law_id)` (copies content_hash→indexed_hash); `recompute_content_hashes()` (backfill); `list_stale_case_law()` (drift query).
- **Modify** `services/ingest.py`: after a successful `_chunk_embed_store` → `mark_indexed(case_law_id)`; add `reindex_case_law(case_law_id)`, which loads the row, runs chunk+embed+store from the stored full_text, then calls `mark_indexed`.
- **Modify** `services/metrics.py`: expose the `stale_embedding_case_law` count.
- **Add** MCP tool `precedent_reindex(case_law_id)` (a thin wrapper over `ingest.reindex_case_law`) to allow manual invocation; voyage API only (no CLI/LLM → safe in a container as well).
- **Test** `tests/test_reindex_on_change.py` (new).

**Boundary:** no changes to public signatures. `reindex_case_law` is an **addition**; the existing path is unchanged.

## 4. content_hash + indexed_hash

- `_content_hash(text) -> str`: `hashlib.sha256(text.encode()).hexdigest()`; for `""`/None → `""`.
- `content_hash` = hash of the **current** full_text, written on every row write (in create_*; the write boundary, like the FU-2a normalization).
- `indexed_hash` = the hash the **current** chunks/embeddings were built from, written by `mark_indexed` after a successful store (in ingest + in reindex).
- **Fresh** ⇔ `content_hash = indexed_hash`. **Stale** ⇔ `content_hash IS DISTINCT FROM indexed_hash` (including indexed_hash=NULL = "never embedded from this content").

## 5. `reindex_case_law(case_law_id)` (GAP-09 enforcement)

```
load case_law row → full_text (stored)
  → _chunk_embed_store(case_law_id, full_text, page_offsets=None, ...)  # same canonical path
  → mark_indexed(case_law_id)                                           # indexed_hash = content_hash
return {chunks, reindexed: true}
```
- **Does not** call the extractor/OCR or an LLM: only chunk (stored text) + embed (voyage) + store. Consistent with [[feedback_no_reocr_retrofit]] and claude_session (no CLI).
- multimodal: skipped (page-images require a PDF, which text-only records lack; see §7). If a file appears in the future, the full ingest path handles it.
- idempotent (store = DELETE-then-INSERT; mark_indexed is deterministic).

## 6. Drift detection + health-check

- `list_stale_case_law()` → records with non-empty full_text and `content_hash IS DISTINCT FROM indexed_hash`.
- health-check (metrics.py) exposes the `stale_embedding_case_law` count (INV-G6 observability; a sibling of `non_searchable_case_law`/`cases_with_stale_blocks` from FU-2a/FU-7).

## 7. #61.2 (multimodal backfill): closed as not applicable

DB check (2026-05-30): 42 committee decisions have no page-images. **All** of them have `document_id=NULL` and an existing full_text, and there is no source PDF on disk (`data/internal-decisions/` contains a single file). page-images require **rendering a PDF**; text-only records have no PDF → **impossible**. #61.2 is therefore closed as not-applicable. (If a PDF is uploaded for one of them, the regular ingest path will handle multimodal.) FU-3 core re-embeds the **text** of all 42 as needed.

## 8. Behavior changes and risk

| Change | Effect | Risk |
|--------|--------|--------|
| content_hash on write | every new ingest carries a hash; fresh by virtue of being ingested | low (additive) |
| mark_indexed in ingest | new records = fresh (content=indexed) | low |
| reindex_case_law | re-embed from stored content; API cost on demand | low (an addition, manual/controlled; never runs automatically in bulk) |
| backfill hashes | content_hash for all; indexed_hash=content only if chunks exist, otherwise NULL | low (reversible, zero re-embed) |
| health-check stale count | exposes drift | low (read-only) |

## 9. Test strategy

`tests/test_reindex_on_change.py`: offline, monkeypatch. Cases:
1. `_content_hash`: deterministic; `""`/None→`""`; different text→different hash.
2. stale predicate: content≠indexed → stale; equal → fresh; indexed=NULL → stale.
3. `mark_indexed` runs an UPDATE that copies content_hash→indexed_hash (monkeypatched conn).
4. `reindex_case_law`: loads full_text, calls _chunk_embed_store and mark_indexed (monkeypatched), does not call the extractor/LLM.
5. create_* writes content_hash (monkeypatch; assert the hash is passed to the INSERT/upsert).

> Real-DB checks (V23, backfill, drift query): smoke against the local DB (5433) at the end, as in FU-2a/FU-7.

## 10. Execution order

1. Red tests.
2. V23 (`content_hash`,`indexed_hash`) + `_content_hash` + `mark_indexed` + writing content_hash in create_*.
3. `reindex_case_law` in ingest.py + the `mark_indexed` call after `_chunk_embed_store` on ingest.
4. `list_stale_case_law` + health-check `stale_embedding_case_law`.
5. MCP tool `precedent_reindex`.
6. backfill (DB smoke): `recompute_content_hashes()`; content_hash for all, indexed_hash=content if chunks exist.
7. Green tests + smoke against the DB + lint + closing #61.2 + TaskMaster #61.
@@ -258,6 +258,12 @@ async def precedent_extract_metadata(case_law_id: str) -> str:
     return await plib.precedent_extract_metadata(case_law_id)
 
 
+@mcp.tool()
+async def precedent_reindex(case_law_id: str) -> str:
+    """Re-chunk + re-embed an existing precedent from its stored full_text (FU-3/GAP-09). Does not run OCR/LLM, only chunking + voyage embeddings. Idempotent."""
+    return await plib.precedent_reindex(case_law_id)
+
+
 @mcp.tool()
 async def style_corpus_enrich(corpus_id: str, overwrite: bool = False) -> str:
     """Extract metadata (summary, outcome, key_principles, appeal_subtype) for a decision in Dafna's style corpus. Default: fills only empty fields. Send `overwrite=true` to refresh."""
@@ -3,6 +3,7 @@
 from __future__ import annotations
 
 import asyncio
+import hashlib
 import json
 import logging
 import re
@@ -1116,6 +1117,18 @@ ALTER TABLE cases ADD COLUMN IF NOT EXISTS blocks_stale boolean NOT NULL DEFAULT
 """
 
 
+# ── V23: case_law content/indexed hashes: re-index on content change (GAP-09) ──
+# content_hash = SHA-256 of current full_text (written at the create boundary).
+# indexed_hash = the content_hash the CURRENT chunks/embeddings were built from
+# (set by mark_indexed after a successful store). Stale ⇔ content_hash IS
+# DISTINCT FROM indexed_hash. embedding can't be a GENERATED column (needs an
+# API call), so freshness is enforced by detection + reindex_case_law + health-check.
+SCHEMA_V23_SQL = """
+ALTER TABLE case_law ADD COLUMN IF NOT EXISTS content_hash text NOT NULL DEFAULT '';
+ALTER TABLE case_law ADD COLUMN IF NOT EXISTS indexed_hash text;
+"""
+
+
 async def _run_schema_migrations(pool: asyncpg.Pool) -> None:
     async with pool.acquire() as conn:
         await conn.execute(SCHEMA_SQL)
@@ -1141,7 +1154,8 @@ async def _run_schema_migrations(pool: asyncpg.Pool) -> None:
         await conn.execute(SCHEMA_V20_SQL)
         await conn.execute(SCHEMA_V21_SQL)
         await conn.execute(SCHEMA_V22_SQL)
-    logger.info("Database schema initialized (v1-v22)")
+        await conn.execute(SCHEMA_V23_SQL)
+    logger.info("Database schema initialized (v1-v23)")
 
 
 async def init_schema() -> None:
@@ -1279,6 +1293,16 @@ def _canonical_case_number(s: str) -> str:
|
||||
return s.strip().replace("/", "-")
|
||||
|
||||
|
||||
def _content_hash(text: str | None) -> str:
    """SHA-256 hex of the text: a deterministic content fingerprint (FU-3/GAP-09).

    Empty/None → "" (a row with no text has no content fingerprint).
    """
    if not text:
        return ""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
|
||||
|
||||
|
||||
async def get_case_by_number(case_number: str) -> dict | None:
|
||||
pool = await get_pool()
|
||||
norm = _normalize_case_number(case_number)
|
||||
@@ -2546,6 +2570,55 @@ async def get_case_law(case_law_id: UUID) -> dict | None:
|
||||
return _row_to_case_law(row) if row else None
|
||||
|
||||
|
||||
async def mark_indexed(case_law_id: UUID) -> None:
    """Mark a case_law row's embeddings as built from its current content (FU-3).

    Sets indexed_hash := content_hash. Call AFTER a successful chunk+embed+store.
    """
    pool = await get_pool()
    async with pool.acquire() as conn:
        await conn.execute(
            "UPDATE case_law SET indexed_hash = content_hash WHERE id = $1",
            case_law_id,
        )
|
||||
|
||||
|
||||
async def list_stale_case_law(limit: int = 500) -> list[dict]:
    """case_law rows whose embeddings are stale vs current content (GAP-09/INV-G6)."""
    pool = await get_pool()
    async with pool.acquire() as conn:
        rows = await conn.fetch(
            """SELECT id, case_number, source_kind
               FROM case_law
               WHERE coalesce(full_text, '') <> ''
                 AND content_hash IS DISTINCT FROM indexed_hash
               ORDER BY created_at LIMIT $1""",
            limit,
        )
    return [dict(r) for r in rows]
|
||||
|
||||
|
||||
async def recompute_content_hashes() -> dict:
    """Backfill (FU-3): set content_hash for all rows; set indexed_hash=content_hash
    only where chunks already exist (those are already embedded). Rows with text but
    no chunks keep their indexed_hash (NULL right after the V23 migration) → surface
    as stale. Hash-only; no re-embed."""
    pool = await get_pool()
    updated = 0
    async with pool.acquire() as conn:
        rows = await conn.fetch("SELECT id, full_text FROM case_law")
        for r in rows:
            ch = _content_hash(r["full_text"] or "")
            has_chunks = await conn.fetchval(
                "SELECT EXISTS(SELECT 1 FROM precedent_chunks WHERE case_law_id = $1)",
                r["id"])
            await conn.execute(
                "UPDATE case_law SET content_hash = $2, "
                "indexed_hash = CASE WHEN $3 THEN $2 ELSE indexed_hash END WHERE id = $1",
                r["id"], ch, bool(has_chunks))
            updated += 1
    return {"updated": updated}
|
||||
|
||||
|
||||
async def add_case_law_relation(
|
||||
a_id: UUID, b_id: UUID, relation_type: str = "same_case_chain"
|
||||
) -> None:
|
||||
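The per-row rule applied by `recompute_content_hashes` can be read as a pure function. This is a sketch for illustration only: `backfill_row` is a hypothetical name, and it assumes `indexed_hash` is `None` when the column is NULL.

```python
from __future__ import annotations

import hashlib


def backfill_row(
    full_text: str | None,
    has_chunks: bool,
    indexed_hash: str | None,
) -> tuple[str, str | None]:
    """Hypothetical pure version of the backfill UPDATE.

    Returns (content_hash, indexed_hash) as the row would look afterwards.
    """
    # Same fingerprint rule as _content_hash: "" for empty/None text.
    ch = hashlib.sha256(full_text.encode("utf-8")).hexdigest() if full_text else ""
    # CASE WHEN has_chunks THEN content_hash ELSE indexed_hash END
    new_indexed = ch if has_chunks else indexed_hash
    return ch, new_indexed
```

Under this rule a row with text and existing chunks is treated as already indexed, while a row with text and no chunks ends up with differing hashes and is reported by the stale query.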
@@ -2649,11 +2722,11 @@ async def create_external_case_law(
|
||||
summary, key_quote, full_text, source_url,
|
||||
source_kind, document_id, extraction_status,
|
||||
halacha_extraction_status, practice_area, appeal_subtype,
|
||||
-                headnote, source_type, precedent_level, is_binding
+                headnote, source_type, precedent_level, is_binding, content_hash
|
||||
) VALUES (
|
||||
$1, $2, $3, $4, $5, $6, $7, $8, $9,
|
||||
'external_upload', $10, 'processing', 'pending',
|
||||
-                $11, $12, $13, $14, $15, $16
+                $11, $12, $13, $14, $15, $16, $17
|
||||
)
|
||||
ON CONFLICT (case_number) WHERE source_kind <> 'internal_committee'
|
||||
DO UPDATE SET
|
||||
@@ -2674,13 +2747,15 @@ async def create_external_case_law(
|
||||
document_id = COALESCE(EXCLUDED.document_id, case_law.document_id),
|
||||
source_kind = 'external_upload',
|
||||
extraction_status = 'processing',
|
||||
-                halacha_extraction_status = 'pending'
+                halacha_extraction_status = 'pending',
+                content_hash = EXCLUDED.content_hash
|
||||
RETURNING *
|
||||
""",
|
||||
case_number, case_name, court, decision_date, tags_json,
|
||||
summary, key_quote, full_text, source_url,
|
||||
document_id, practice_area, appeal_subtype, headnote,
|
||||
source_type, precedent_level, is_binding,
|
||||
_content_hash(full_text),
|
||||
)
|
||||
return _row_to_case_law(row)
|
||||
|
||||
@@ -2722,13 +2797,13 @@ async def create_internal_committee_decision(
|
||||
subject_tags, summary, full_text,
|
||||
source_kind, source_type, document_id,
|
||||
extraction_status, halacha_extraction_status,
|
||||
-                practice_area, appeal_subtype, is_binding, proceeding_type
+                practice_area, appeal_subtype, is_binding, proceeding_type, content_hash
|
||||
) VALUES (
|
||||
$1, $2, $3, $4, $5, $6,
|
||||
$7, $8, $9,
|
||||
'internal_committee', 'appeals_committee', $10,
|
||||
'processing', 'pending',
|
||||
-                $11, $12, $13, $14
+                $11, $12, $13, $14, $15
|
||||
)
|
||||
ON CONFLICT (case_number, proceeding_type)
|
||||
WHERE source_kind = 'internal_committee'
|
||||
@@ -2748,13 +2823,14 @@ async def create_internal_committee_decision(
|
||||
is_binding = EXCLUDED.is_binding,
|
||||
document_id = COALESCE(EXCLUDED.document_id, case_law.document_id),
|
||||
extraction_status = 'processing',
|
||||
-                halacha_extraction_status = 'pending'
+                halacha_extraction_status = 'pending',
+                content_hash = EXCLUDED.content_hash
|
||||
RETURNING *
|
||||
""",
|
||||
case_number, case_name, court, decision_date, chair_name, district,
|
||||
tags_json, summary, full_text,
|
||||
document_id, practice_area, appeal_subtype, is_binding,
|
||||
-            proceeding_type,
+            proceeding_type, _content_hash(full_text),
|
||||
)
|
||||
return _row_to_case_law(row)
|
||||
|
||||
|
||||
@@ -182,6 +182,7 @@ async def ingest_document(
|
||||
|
||||
try:
|
||||
stored_chunks = await _chunk_embed_store(case_law_id, raw_text, page_offsets, page_count, progress)
|
||||
await db.mark_indexed(case_law_id)
|
||||
|
||||
# Step 9: multimodal — uniform: flag + PDF + page_count, NOT intake type.
|
||||
if (config.MULTIMODAL_ENABLED and page_count > 0
|
||||
@@ -256,3 +257,27 @@ async def _chunk_embed_store(case_law_id, text, page_offsets, page_count, progre
|
||||
for c, v in zip(chunks, chunk_vectors)
|
||||
]
|
||||
return await db.store_precedent_chunks(case_law_id, chunk_dicts)
|
||||
|
||||
|
||||
async def reindex_case_law(
    case_law_id: "UUID | str",
    progress: ProgressCb | None = None,
) -> dict:
    """Re-chunk + re-embed an existing case_law row from its STORED full_text (GAP-09).

    No re-extract / no re-OCR (uses the stored text; see feedback_no_reocr_retrofit)
    and no LLM/CLI (only chunker + voyage embeddings), so it is safe to run anywhere.
    Idempotent: store_precedent_chunks(_hierarchical) is DELETE-then-INSERT.
    """
    progress = progress or _noop_progress
    cid = case_law_id if isinstance(case_law_id, UUID) else UUID(str(case_law_id))
    row = await db.get_case_law(cid)
    if not row:
        raise ValueError(f"case_law not found: {cid}")
    text = (row.get("full_text") or "").strip()
    if not text:
        raise ValueError("case_law has no stored full_text to re-index")
    stored = await _chunk_embed_store(cid, text, None, 0, progress)
    await db.mark_indexed(cid)
    # Progress message (Hebrew UI string): "Re-embedded: {stored} chunks"
    await progress("completed", 100, f"הוטמע מחדש: {stored} chunks")
    return {"status": "completed", "case_law_id": str(cid), "chunks": stored, "reindexed": True}
|
||||
|
||||
@@ -129,6 +129,9 @@ async def get_dashboard() -> dict:
|
||||
cases_with_stale_blocks = await conn.fetchval(
|
||||
"SELECT COUNT(*) FROM cases WHERE blocks_stale"
|
||||
)
|
||||
stale_embedding_case_law = await conn.fetchval(
|
||||
"SELECT COUNT(*) FROM case_law "
|
||||
"WHERE coalesce(full_text,'') <> '' AND content_hash IS DISTINCT FROM indexed_hash")
|
||||
|
||||
# QA summary
|
||||
qa_total = await conn.fetchval("SELECT COUNT(DISTINCT case_id) FROM qa_results")
|
||||
@@ -162,6 +165,7 @@ async def get_dashboard() -> dict:
|
||||
"case_law_entries": total_case_law,
|
||||
"non_searchable_case_law": non_searchable_case_law,
|
||||
"cases_with_stale_blocks": cases_with_stale_blocks,
|
||||
"stale_embedding_case_law": stale_embedding_case_law,
|
||||
},
|
||||
"cases_by_status": cases_by_status,
|
||||
"qa": {
|
||||
|
||||
@@ -215,6 +215,24 @@ async def precedent_extract_metadata(case_law_id: str) -> str:
|
||||
return _ok(result)
|
||||
|
||||
|
||||
async def precedent_reindex(case_law_id: str) -> str:
    """Re-chunk and re-embed an existing precedent from its stored full_text (FU-3/GAP-09).

    Use to fix embedding drift or after a content change. Does not run OCR/LLM; only
    chunking + voyage embeddings. Idempotent (deletes and rebuilds the chunks).
    """
    try:
        cid = UUID(case_law_id)
    except ValueError:
        return _err("case_law_id לא תקין")  # Hebrew UI string: "invalid case_law_id"
    try:
        from legal_mcp.services import ingest
        result = await ingest.reindex_case_law(cid)
    except Exception as e:
        return _err(str(e))
    return _ok(result)
|
||||
|
||||
|
||||
async def precedent_process_pending(kind: str = "metadata", limit: int = 20) -> str:
    """Drain the queue of extraction requests accumulated by the UI buttons. kind: 'metadata' or 'halacha'.
|
||||
|
||||
|
||||
@@ -110,6 +110,9 @@ def test_ingest_calls_recompute_searchable(monkeypatch, tmp_path):
|
||||
|
||||
async def _recompute(cid): calls["recompute"].append(cid)
|
||||
monkeypatch.setattr(ingest.db, "recompute_searchable", _recompute)
|
||||
|
||||
async def _mark_indexed(cid): return None
|
||||
monkeypatch.setattr(ingest.db, "mark_indexed", _mark_indexed)
|
||||
monkeypatch.setattr(ingest.config, "PARENT_DOC_RETRIEVAL_ENABLED", False)
|
||||
monkeypatch.setattr(ingest.config, "MULTIMODAL_ENABLED", False)
|
||||
|
||||
|
||||
89
mcp-server/tests/test_reindex_on_change.py
Normal file
@@ -0,0 +1,89 @@
|
||||
"""FU-3: re-index on content change (offline, monkeypatched I/O)."""
|
||||
from __future__ import annotations

import asyncio
from uuid import uuid4

import pytest

from legal_mcp.services import db, ingest


def _run(coro):
    return asyncio.run(coro)
|
||||
|
||||
|
||||
# ── content_hash is deterministic ──────────────────────────────────────
def test_content_hash_deterministic():
    h1 = db._content_hash("פסק דין כלשהו")
    h2 = db._content_hash("פסק דין כלשהו")
    assert h1 == h2 and len(h1) == 64  # sha256 hex


def test_content_hash_empty_is_blank():
    assert db._content_hash("") == ""
    assert db._content_hash(None) == ""


def test_content_hash_changes_with_text():
    assert db._content_hash("alpha") != db._content_hash("beta")
|
||||
|
||||
|
||||
# ── mark_indexed copies content_hash → indexed_hash ─────────────────────
def test_mark_indexed_executes_update(monkeypatch):
    seen = {}

    class _Conn:
        async def execute(self, q, *a):
            seen["q"] = q
            seen["args"] = a

        async def __aenter__(self):
            return self

        async def __aexit__(self, *a):
            return False

    class _Pool:
        def acquire(self):
            return _Conn()

    async def _pool():
        return _Pool()
    monkeypatch.setattr(db, "get_pool", _pool)

    cid = uuid4()
    _run(db.mark_indexed(cid))
    assert "indexed_hash" in seen["q"] and "content_hash" in seen["q"]
    assert seen["args"][0] == cid
|
||||
|
||||
|
||||
# ── reindex_case_law re-embeds from stored text, no extractor/LLM ───────
def test_reindex_case_law_uses_stored_text(monkeypatch):
    cid = uuid4()
    calls = {"chunk_embed_store": [], "mark_indexed": []}

    async def _get_case_law(x):
        return {"id": cid, "full_text": "טקסט שמור של ההחלטה"}
    monkeypatch.setattr(ingest.db, "get_case_law", _get_case_law)

    async def _ces(case_law_id, text, page_offsets, page_count, progress):
        calls["chunk_embed_store"].append((case_law_id, text))
        return 5
    monkeypatch.setattr(ingest, "_chunk_embed_store", _ces)

    async def _mark(x):
        calls["mark_indexed"].append(x)
    monkeypatch.setattr(ingest.db, "mark_indexed", _mark)

    out = _run(ingest.reindex_case_law(cid))
    assert out["chunks"] == 5 and out["reindexed"] is True
    assert calls["chunk_embed_store"][0][1] == "טקסט שמור של ההחלטה"
    assert calls["mark_indexed"] == [cid]


def test_reindex_case_law_missing_row_raises(monkeypatch):
    async def _none(x):
        return None
    monkeypatch.setattr(ingest.db, "get_case_law", _none)
    with pytest.raises(ValueError, match="not found"):
        _run(ingest.reindex_case_law(uuid4()))


def test_reindex_case_law_empty_text_raises(monkeypatch):
    async def _empty(x):
        return {"id": uuid4(), "full_text": " "}
    monkeypatch.setattr(ingest.db, "get_case_law", _empty)
    with pytest.raises(ValueError, match="no stored full_text"):
        _run(ingest.reindex_case_law(uuid4()))
|
||||
@@ -87,6 +87,7 @@ def patched(monkeypatch, tmp_path):
|
||||
monkeypatch.setattr(db, "set_case_law_extraction_status", _set_status)
|
||||
monkeypatch.setattr(db, "set_case_law_halacha_status", _set_status)
|
||||
monkeypatch.setattr(db, "recompute_searchable", _recompute_searchable)
|
||||
monkeypatch.setattr(db, "mark_indexed", _recompute_searchable)
|
||||
# Force flat chunking + multimodal OFF unless a test flips it.
|
||||
monkeypatch.setattr(config, "PARENT_DOC_RETRIEVAL_ENABLED", False)
|
||||
monkeypatch.setattr(config, "MULTIMODAL_ENABLED", False)
|
||||
|
||||