Chaim a16f8cd933 docs(plan): FU-2a idempotent-ingest implementation plan (7 tasks, TDD)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-05-30 20:04:49 +00:00

FU-2a: Idempotent Ingest + Write-Time Normalization + `searchable` Flag — Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Make ingest idempotent (ON CONFLICT upsert), normalize identifiers at the write boundary (type-aware), and add a materialized searchable flag — all forward-only, no identifier migration.

Architecture: Pure-code + one schema-additive migration (V21) in db.py. The two create_*_case_law functions move from app-level SELECT-then-INSERT/UPDATE to atomic INSERT … ON CONFLICT … DO UPDATE against the existing V15 partial unique indexes (predicate repeated). A new _canonical_case_number normalizes at write for identifier-keyed corpora (internal/cases), not for external (citation is its id). A new searchable boolean is recomputed from the completeness contract on ingest/metadata completion; the search-layer filter is gated behind a dry-run.

Tech Stack: Python 3.12, asyncpg, PostgreSQL (pgvector) at localhost:5433, pytest offline, local .venv at mcp-server/.venv.

Spec: docs/superpowers/specs/2026-05-30-fu2a-idempotent-ingest-design.md

Run tests: cd ~/legal-ai/mcp-server && .venv/bin/python -m pytest tests/test_idempotent_ingest.py -v

DB smoke (real Postgres): source ~/.env, then connect to localhost:5433, database legal_ai (see Task 6).

File Structure

Modify mcp-server/src/legal_mcp/services/db.py:
- add _canonical_case_number(s) (pure) near _normalize_case_number (~line 1196).
- add pure _compute_searchable(row, has_embedded_chunk) + async recompute_searchable(...).
- add SCHEMA_V21_SQL (after V20, ~line 1094) + wire into _run_schema_migrations (~line 1119).
- normalize at write in create_case, create_internal_committee_decision (NOT create_external_case_law).
- convert create_external_case_law + create_internal_committee_decision to ON CONFLICT … DO UPDATE.
Modify mcp-server/src/legal_mcp/services/ingest.py: call db.recompute_searchable(case_law_id) after statuses are set (uniform, both types).
Modify the search layer (services/hybrid_search.py and/or db.py search functions) — gated searchable = true filter (Task 6, only if dry-run is clean).
Create mcp-server/tests/test_idempotent_ingest.py — offline tests for the pure pieces + ingest wiring.

Unchanged: public signatures of ingest_precedent/ingest_internal_decision (FU-1) and the DB-create parameter lists. Normalization/upsert live inside the write boundary.

Task 1: Failing tests (pure logic + ingest wiring)

Files: Create mcp-server/tests/test_idempotent_ingest.py

Step 1: Write the failing tests

"""FU-2a: idempotent ingest + write-time normalization + searchable flag.

Offline tests for the *pure* pieces (canonical normalization, completeness
predicate) and ingest wiring. The real ON CONFLICT upsert is verified by a
DB smoke test against localhost:5433 (see plan Task 6), since it requires a
live Postgres partial unique index.
"""
from __future__ import annotations

import asyncio
from uuid import uuid4

import pytest

from legal_mcp.services import db, ingest


def _run(coro):
    return asyncio.run(coro)


# ── GAP-06: canonical normalization (pure, deterministic) ──────────────
@pytest.mark.parametrize("raw,expected", [
    ("ערר 8137/24", "8137-24"),
    ("  עע\"מ 1/20 ", "1-20"),
    ("8126-03-25", "8126-03-25"),          # month segment preserved
    ("בל\"מ 1010-01-25", "1010-01-25"),
    ("8047/23", "8047-23"),
])
def test_canonical_case_number(raw, expected):
    assert db._canonical_case_number(raw) == expected


def test_canonical_does_not_invent_month():
    # No month in input → none added (X1 §1).
    assert db._canonical_case_number("8126/24") == "8126-24"


# ── GAP-13: completeness predicate (pure) ──────────────────────────────
def _complete_row():
    return {
        "case_number": "8047-23", "case_name": "פלוני נ' הוועדה",
        "practice_area": "rishuy_uvniya", "source_kind": "internal_committee",
        "extraction_status": "completed", "headnote": "תקציר",
        "summary": "", "subject_tags": [],
    }


def test_compute_searchable_true_when_complete():
    assert db._compute_searchable(_complete_row(), has_embedded_chunk=True) is True


def test_compute_searchable_false_without_embedded_chunk():
    assert db._compute_searchable(_complete_row(), has_embedded_chunk=False) is False


def test_compute_searchable_false_without_metadata():
    row = _complete_row()
    row["headnote"] = ""
    row["summary"] = ""
    row["subject_tags"] = []
    assert db._compute_searchable(row, has_embedded_chunk=True) is False


def test_compute_searchable_false_when_extraction_incomplete():
    row = _complete_row()
    row["extraction_status"] = "pending"
    assert db._compute_searchable(row, has_embedded_chunk=True) is False


def test_compute_searchable_false_without_core_fields():
    row = _complete_row()
    row["practice_area"] = ""
    assert db._compute_searchable(row, has_embedded_chunk=True) is False


# ── ingest wires in recompute_searchable (both types) ──────────────────
def test_ingest_calls_recompute_searchable(monkeypatch, tmp_path):
    calls = {"recompute": [], "meta": [], "hal": []}

    async def _extract_text(path): return ("text", 1, [0])
    monkeypatch.setattr(ingest.extractor, "extract_text", _extract_text)
    monkeypatch.setattr(ingest.extractor, "strip_nevo_preamble", lambda t: t)
    monkeypatch.setattr(ingest.chunker, "chunk_document",
                        lambda t, page_offsets=None: [type("C", (), {
                            "chunk_index": 0, "content": "c", "section_type": "b",
                            "page_number": 1})()])

    async def _embed(texts, input_type="document"): return [[0.0] * 8 for _ in texts]
    monkeypatch.setattr(ingest.embeddings, "embed_texts", _embed)

    async def _store(cid, dicts): return len(dicts)
    monkeypatch.setattr(ingest.db, "store_precedent_chunks", _store)

    async def _create_internal(**kw): return {"id": uuid4()}
    monkeypatch.setattr(ingest.db, "create_internal_committee_decision", _create_internal)

    async def _noop(*a, **k): return None
    monkeypatch.setattr(ingest.db, "set_case_law_extraction_status", _noop)
    monkeypatch.setattr(ingest.db, "set_case_law_halacha_status", _noop)
    monkeypatch.setattr(ingest.db, "request_metadata_extraction",
                        lambda cid: calls["meta"].append(cid) or _noop())
    monkeypatch.setattr(ingest.db, "request_halacha_extraction",
                        lambda cid: calls["hal"].append(cid) or _noop())

    async def _recompute(cid): calls["recompute"].append(cid)
    monkeypatch.setattr(ingest.db, "recompute_searchable", _recompute)
    monkeypatch.setattr(ingest.config, "PARENT_DOC_RETRIEVAL_ENABLED", False)
    monkeypatch.setattr(ingest.config, "MULTIMODAL_ENABLED", False)

    from legal_mcp.services import internal_decisions
    _run(internal_decisions.ingest_internal_decision(
        case_number="8047/23", text="t", chair_name="x", practice_area="rishuy_uvniya"))
    assert len(calls["recompute"]) == 1, "ingest must recompute searchable after success"

Step 2: Run to verify failure

Run: cd ~/legal-ai/mcp-server && .venv/bin/python -m pytest tests/test_idempotent_ingest.py -v Expected: FAIL — AttributeError: module 'legal_mcp.services.db' has no attribute '_canonical_case_number' (and _compute_searchable, recompute_searchable).

Step 3: Commit

cd ~/legal-ai
git add mcp-server/tests/test_idempotent_ingest.py
git commit -m "test(ingest): failing tests for idempotent ingest + searchable (FU-2a)"

Task 2: `_canonical_case_number` + write-time normalization

Files: Modify mcp-server/src/legal_mcp/services/db.py

Step 1: Add _canonical_case_number next to _normalize_case_number (~line 1212)

def _canonical_case_number(s: str) -> str:
    """Canonical write-time form per X1 §1: trim · prefix-strip · '/'→'-'.

    Deterministic and format-only — does NOT add or remove a month segment.
    Used at the write boundary for identifier-keyed corpora (internal
    committee decisions, active cases). NOT for external precedents, whose
    canonical identifier is the full citation.
    """
    s = (s or "").strip()
    m = re.search(r"\d", s)
    if m:
        s = s[m.start():]
    return s.strip().replace("/", "-")

Step 2: Normalize at write in create_case (~line 1158)

Change the INSERT's case_number binding to normalized form. Replace case_id, case_number, title, with:

            case_id, _canonical_case_number(case_number), title,

Step 3: Normalize at write in create_internal_committee_decision (top of function body, ~line 2649)

Immediately after pool = await get_pool(), add:

    case_number = _canonical_case_number(case_number)

(Do NOT add this to create_external_case_law — external keeps its citation verbatim; that function only .strip()s, which the caller adapter already does.)
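
As a quick check of the rules above, the sketch below copies the Step 1 helper into a standalone script and runs it on a few inputs. The sample inputs (including the made-up ZZ1234/24 and the digit-free string) are illustrative assumptions, not values taken from the corpus.

```python
import re


def _canonical_case_number(s: str) -> str:
    """Standalone copy of the Step 1 helper: trim, prefix-strip, '/' -> '-'."""
    s = (s or "").strip()
    m = re.search(r"\d", s)
    if m:
        s = s[m.start():]  # drop everything before the first digit
    return s.strip().replace("/", "-")


# Hypothetical sample inputs, chosen only to exercise each rule.
samples = ["ערר 8137/24", "8126-03-25", "ZZ1234/24", "", "no digits here"]
results = {raw: _canonical_case_number(raw) for raw in samples}

for raw, canon in results.items():
    print(repr(raw), "->", repr(canon))
```

Two behaviors are worth noticing. Any non-digit prefix is dropped, Latin letters included, so ZZ1234/24 becomes 1234-24. A string with no digits passes through with only trimming and slash replacement. The first point is the reason the helper is kept away from external precedents, where the prefix is part of the citation.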

Step 4: Run normalization tests

Run: cd ~/legal-ai/mcp-server && .venv/bin/python -m pytest tests/test_idempotent_ingest.py -k "canonical" -v Expected: test_canonical_case_number (5 cases) + test_canonical_does_not_invent_month PASS.

Step 5: Commit

cd ~/legal-ai
git add mcp-server/src/legal_mcp/services/db.py
git commit -m "feat(ingest): write-time canonical case_number normalization (GAP-06, FU-2a)"

Task 3: Convert both create functions to `ON CONFLICT DO UPDATE`

Files: Modify mcp-server/src/legal_mcp/services/db.py

Step 1: Replace create_external_case_law body (lines 2566-2624, from pool = await get_pool() to return _row_to_case_law(row))

    pool = await get_pool()
    tags_json = json.dumps(subject_tags or [], ensure_ascii=False)
    async with pool.acquire() as conn:
        # Atomic upsert on the V15 partial unique index
        # uq_case_law_external_number (case_number) WHERE source_kind <> 'internal_committee'.
        # The predicate is repeated in ON CONFLICT (required for partial indexes).
        # This also subsumes the old cited_only→external_upload promotion: a
        # cited_only row with the same case_number conflicts and is promoted by
        # DO UPDATE. Scoped to the external partial index, so an internal row with
        # the same number is NOT touched (the old SELECT-without-source_kind could
        # wrongly promote it).
        row = await conn.fetchrow(
            """
            INSERT INTO case_law (
                case_number, case_name, court, date, subject_tags,
                summary, key_quote, full_text, source_url,
                source_kind, document_id, extraction_status,
                halacha_extraction_status, practice_area, appeal_subtype,
                headnote, source_type, precedent_level, is_binding
            ) VALUES (
                $1, $2, $3, $4, $5, $6, $7, $8, $9,
                'external_upload', $10, 'processing', 'pending',
                $11, $12, $13, $14, $15, $16
            )
            ON CONFLICT (case_number) WHERE source_kind <> 'internal_committee'
            DO UPDATE SET
                case_name = EXCLUDED.case_name,
                court = COALESCE(NULLIF(EXCLUDED.court, ''), case_law.court),
                date = COALESCE(EXCLUDED.date, case_law.date),
                practice_area = EXCLUDED.practice_area,
                appeal_subtype = EXCLUDED.appeal_subtype,
                subject_tags = EXCLUDED.subject_tags,
                summary = COALESCE(NULLIF(EXCLUDED.summary, ''), case_law.summary),
                headnote = EXCLUDED.headnote,
                key_quote = COALESCE(NULLIF(EXCLUDED.key_quote, ''), case_law.key_quote),
                full_text = EXCLUDED.full_text,
                source_url = COALESCE(NULLIF(EXCLUDED.source_url, ''), case_law.source_url),
                source_type = EXCLUDED.source_type,
                precedent_level = EXCLUDED.precedent_level,
                is_binding = EXCLUDED.is_binding,
                document_id = COALESCE(EXCLUDED.document_id, case_law.document_id),
                source_kind = 'external_upload',
                extraction_status = 'processing',
                halacha_extraction_status = 'pending'
            RETURNING *
            """,
            case_number, case_name, court, decision_date, tags_json,
            summary, key_quote, full_text, source_url,
            document_id, practice_area, appeal_subtype, headnote,
            source_type, precedent_level, is_binding,
        )
    return _row_to_case_law(row)

Step 2: Replace create_internal_committee_decision body (lines 2649-2708)

    pool = await get_pool()
    case_number = _canonical_case_number(case_number)
    tags_json = json.dumps(subject_tags or [], ensure_ascii=False)
    async with pool.acquire() as conn:
        # Atomic upsert on V15 partial unique index
        # uq_case_law_internal_number_proc (case_number, proceeding_type)
        # WHERE source_kind = 'internal_committee'. Predicate repeated for the
        # partial index. Replaces the old SELECT-then-INSERT/UPDATE (race-prone).
        row = await conn.fetchrow(
            """
            INSERT INTO case_law (
                case_number, case_name, court, date, chair_name, district,
                subject_tags, summary, full_text,
                source_kind, source_type, document_id,
                extraction_status, halacha_extraction_status,
                practice_area, appeal_subtype, is_binding, proceeding_type
            ) VALUES (
                $1, $2, $3, $4, $5, $6,
                $7, $8, $9,
                'internal_committee', 'appeals_committee', $10,
                'processing', 'pending',
                $11, $12, $13, $14
            )
            ON CONFLICT (case_number, proceeding_type)
                WHERE source_kind = 'internal_committee'
            DO UPDATE SET
                case_name = EXCLUDED.case_name,
                court = COALESCE(NULLIF(EXCLUDED.court, ''), case_law.court),
                date = COALESCE(EXCLUDED.date, case_law.date),
                chair_name = COALESCE(NULLIF(EXCLUDED.chair_name, ''), case_law.chair_name),
                district = COALESCE(NULLIF(EXCLUDED.district, ''), case_law.district),
                practice_area = EXCLUDED.practice_area,
                appeal_subtype = EXCLUDED.appeal_subtype,
                subject_tags = EXCLUDED.subject_tags,
                summary = COALESCE(NULLIF(EXCLUDED.summary, ''), case_law.summary),
                full_text = EXCLUDED.full_text,
                source_type = 'appeals_committee',
                source_kind = 'internal_committee',
                is_binding = EXCLUDED.is_binding,
                document_id = COALESCE(EXCLUDED.document_id, case_law.document_id),
                extraction_status = 'processing',
                halacha_extraction_status = 'pending'
            RETURNING *
            """,
            case_number, case_name, court, decision_date, chair_name, district,
            tags_json, summary, full_text,
            document_id, practice_area, appeal_subtype, is_binding,
            proceeding_type,
        )
    return _row_to_case_law(row)

Step 3: Verify import + no syntax error

Run: cd ~/legal-ai/mcp-server && .venv/bin/python -c "from legal_mcp.services import db; print('db imports')" Expected: prints db imports.
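
The conflict-target semantics used in Steps 1 and 2 can be tried offline before the Postgres smoke test. The sketch below is an assumption-heavy stand-in: it uses the standard-library sqlite3 module (SQLite also accepts a WHERE clause on the conflict target for partial indexes), a cut-down case_law table, and a hypothetical index name uq_external. It does not replace the Task 6 check against the real V15 indexes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE case_law (
    id INTEGER PRIMARY KEY,
    case_number TEXT NOT NULL,
    source_kind TEXT NOT NULL,
    case_name TEXT
);
-- Simplified stand-in for the external partial unique index (name is hypothetical).
CREATE UNIQUE INDEX uq_external ON case_law (case_number)
    WHERE source_kind <> 'internal_committee';
""")

# Same shape as the external upsert: the index predicate is repeated
# in the conflict target so the partial index can be matched.
UPSERT_EXTERNAL = """
INSERT INTO case_law (case_number, source_kind, case_name)
VALUES (?, 'external_upload', ?)
ON CONFLICT (case_number) WHERE source_kind <> 'internal_committee'
DO UPDATE SET case_name = excluded.case_name,
              source_kind = 'external_upload'
"""

PLAIN_INSERT = (
    "INSERT INTO case_law (case_number, source_kind, case_name) VALUES (?, ?, ?)"
)
# An internal row and a cited_only stub that share one case number.
conn.execute(PLAIN_INSERT, ("1-20", "internal_committee", "internal"))
conn.execute(PLAIN_INSERT, ("1-20", "cited_only", "stub"))

# Ingesting the same external precedent twice must not create a second row.
conn.execute(UPSERT_EXTERNAL, ("1-20", "first upload"))
conn.execute(UPSERT_EXTERNAL, ("1-20", "second upload"))

rows = conn.execute(
    "SELECT source_kind, case_name FROM case_law ORDER BY id"
).fetchall()
print(rows)
```

In this toy setup the cited_only stub is promoted to external_upload and then overwritten by the second call, while the internal row with the same number stays untouched because it falls outside the partial index.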

Step 4: Commit

cd ~/legal-ai
git add mcp-server/src/legal_mcp/services/db.py
git commit -m "feat(ingest): atomic ON CONFLICT upsert in create_*_case_law (GAP-03, FU-2a)"

Task 4: V21 migration — `searchable` column + recompute

Files: Modify mcp-server/src/legal_mcp/services/db.py

Step 1: Add SCHEMA_V21_SQL after SCHEMA_V20_SQL (~line 1094)

# ── V21: explicit `searchable` flag (GAP-13 / INV-DM1) ─────────────
# Materialized completeness flag — a case_law row is exposed to search only
# when it satisfies the completeness contract (02-data-model §2a). Recomputed
# on ingest/metadata completion via recompute_searchable(); not inferred at
# query time. Default false so a freshly-inserted row is excluded until proven
# complete. Health-check surfaces count(*) FILTER (WHERE NOT searchable).
SCHEMA_V21_SQL = """
ALTER TABLE case_law ADD COLUMN IF NOT EXISTS searchable boolean NOT NULL DEFAULT false;
CREATE INDEX IF NOT EXISTS idx_case_law_searchable ON case_law (searchable);
"""

Step 2: Wire V21 into _run_schema_migrations (~line 1119) and bump the log line

After await conn.execute(SCHEMA_V20_SQL) add:

        await conn.execute(SCHEMA_V21_SQL)

Change the log line "Database schema initialized (v1-v20)" → "Database schema initialized (v1-v21)".

Step 3: Add _compute_searchable (pure) + recompute_searchable (async) near the case_law helpers (after create_internal_committee_decision, ~line 2709)

def _compute_searchable(row: dict, has_embedded_chunk: bool) -> bool:
    """Completeness contract (INV-DM1 / 02-data-model §2a).

    A row is searchable IFF: canonical id present · case_name/practice_area/
    source_kind present · ≥1 chunk with a non-null embedding · extraction
    completed · metadata non-empty (≥1 of headnote/summary/subject_tags).
    Pure — `has_embedded_chunk` is supplied by the caller (cross-table check).
    """
    if not has_embedded_chunk:
        return False
    if (row.get("extraction_status") or "") != "completed":
        return False
    if not (row.get("case_number") or "").strip():
        return False
    if not (row.get("case_name") or "").strip():
        return False
    if not (row.get("practice_area") or "").strip():
        return False
    if not (row.get("source_kind") or "").strip():
        return False
    tags = row.get("subject_tags") or []
    has_meta = bool((row.get("headnote") or "").strip()) \
        or bool((row.get("summary") or "").strip()) \
        or (len(tags) > 0)
    return has_meta


async def recompute_searchable(case_law_id: "UUID | str | None" = None) -> int:
    """Recompute and persist the `searchable` flag. Idempotent / reversible.

    If case_law_id is None, recompute ALL rows (used by the V21 backfill and
    the dry-run). Returns the number of rows now marked searchable=true.
    """
    pool = await get_pool()
    async with pool.acquire() as conn:
        if case_law_id is not None:
            cid = case_law_id if isinstance(case_law_id, UUID) else UUID(str(case_law_id))
            rows = await conn.fetch(
                "SELECT * FROM case_law WHERE id = $1", cid)
        else:
            rows = await conn.fetch("SELECT * FROM case_law")
        n_true = 0
        for r in rows:
            row = dict(r)
            # subject_tags is stored jsonb; _row_to_case_law parses it, but here
            # we read raw — normalize to a list length check.
            tags = row.get("subject_tags")
            if isinstance(tags, str):
                try:
                    tags = json.loads(tags)
                except (ValueError, TypeError):
                    tags = []
            row["subject_tags"] = tags or []
            has_chunk = await conn.fetchval(
                "SELECT EXISTS(SELECT 1 FROM precedent_chunks "
                "WHERE case_law_id = $1 AND embedding IS NOT NULL)", row["id"])
            val = _compute_searchable(row, bool(has_chunk))
            await conn.execute(
                "UPDATE case_law SET searchable = $2 WHERE id = $1", row["id"], val)
            if val:
                n_true += 1
    return n_true

Step 4: Run the completeness-predicate tests

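Run: cd ~/legal-ai/mcp-server && .venv/bin/python -m pytest tests/test_idempotent_ingest.py -k "compute_searchable" -v Expected: all test_compute_searchable_* PASS. (Select by compute_searchable rather than "searchable and not ingest": pytest -k also matches the module name test_idempotent_ingest.py, so "not ingest" would deselect every test in this file.)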

Step 5: Commit

cd ~/legal-ai
git add mcp-server/src/legal_mcp/services/db.py
git commit -m "feat(data-model): V21 searchable flag + recompute_searchable (GAP-13, FU-2a)"

Task 5: Wire `recompute_searchable` into ingest

Files: Modify mcp-server/src/legal_mcp/services/ingest.py

Step 1: Call recompute after statuses are set in ingest_document

In ingest.py, find the block (added by FU-1) that sets statuses + queues extraction:

        await db.set_case_law_extraction_status(case_law_id, "completed")
        await db.set_case_law_halacha_status(case_law_id, "pending")
        await db.request_metadata_extraction(case_law_id)
        await db.request_halacha_extraction(case_law_id)

Immediately AFTER request_halacha_extraction, add:

        await db.recompute_searchable(case_law_id)

Rationale: at this point chunks+embeddings are stored and extraction_status is completed, so the completeness predicate is meaningful. Metadata may still be pending (queued), so the row may compute searchable=false until metadata fills — the metadata extractor also calls recompute (Task 5 Step 2).
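
The two-phase behavior in this rationale can be illustrated with a small sketch. It uses is_complete, a condensed and hypothetical restatement of the Task 4 completeness contract rather than the _compute_searchable function itself, and the row values are made up for the example.

```python
def is_complete(row: dict, has_embedded_chunk: bool) -> bool:
    """Condensed restatement of the completeness contract (illustrative only)."""
    core = ("case_number", "case_name", "practice_area", "source_kind")
    has_core = all((row.get(k) or "").strip() for k in core)
    has_meta = bool(
        (row.get("headnote") or "").strip()
        or (row.get("summary") or "").strip()
        or row.get("subject_tags")
    )
    return bool(
        has_embedded_chunk
        and row.get("extraction_status") == "completed"
        and has_core
        and has_meta
    )


# Hypothetical row right after ingest: chunks are embedded and extraction is
# completed, but the queued metadata extraction has not filled anything yet.
row = {
    "case_number": "8047-23", "case_name": "example",
    "practice_area": "rishuy_uvniya", "source_kind": "internal_committee",
    "extraction_status": "completed",
    "headnote": "", "summary": "", "subject_tags": [],
}
after_ingest = is_complete(row, has_embedded_chunk=True)

# Metadata extraction later fills a headnote; the second recompute flips the flag.
row["headnote"] = "short headnote"
after_metadata = is_complete(row, has_embedded_chunk=True)

print(after_ingest, after_metadata)
```

The first recompute is still useful even when it yields false: it guarantees the flag reflects the current state of the row, and the later recompute from the metadata extractor is what normally turns it on.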

Step 2: Call recompute after metadata extraction fills fields

In mcp-server/src/legal_mcp/services/precedent_metadata_extractor.py, find extract_and_apply's success path (where it persists the filled metadata fields). After the DB update that writes the extracted metadata, add a call:

        await db.recompute_searchable(case_law_id)

(db should already be imported in that module. Confirm by reading the file's imports first; if it is missing, add from legal_mcp.services import db.)

Step 3: Run the ingest-wiring test

Run: cd ~/legal-ai/mcp-server && .venv/bin/python -m pytest tests/test_idempotent_ingest.py -k "ingest_calls_recompute" -v Expected: test_ingest_calls_recompute_searchable PASS.

Step 4: Commit

cd ~/legal-ai
git add mcp-server/src/legal_mcp/services/ingest.py mcp-server/src/legal_mcp/services/precedent_metadata_extractor.py
git commit -m "feat(ingest): recompute searchable on ingest + metadata completion (GAP-13, FU-2a)"

Task 6: DB smoke + dry-run + GATED search filter

Files: Modify search layer ONLY if dry-run is clean (see Step 4).

Step 1: Apply the V21 migration to the local DB and smoke-test upsert idempotency

Run (sources env, exercises real Postgres):

cd ~/legal-ai && set -a && source ~/.env && set +a
cd mcp-server && .venv/bin/python -c "
import asyncio, uuid
from legal_mcp.services import db
async def main():
    await db.get_pool()  # runs migrations incl V21
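    # idempotent internal upsert: same (case_number, proceeding_type) twice
    cn = 'ZZ9999/24'
    r1 = await db.create_internal_committee_decision(case_number=cn, case_name='t', full_text='x', practice_area='rishuy_uvniya')
    r2 = await db.create_internal_committee_decision(case_number=cn, case_name='t2', full_text='x2', practice_area='rishuy_uvniya')
    assert r1['id'] == r2['id'], 'upsert must update, not duplicate'
    # prefix-strip drops the non-digit 'ZZ', so the stored form is 9999-24
    stored = db._canonical_case_number(cn)
    assert stored == '9999-24'
    # cleanup (scoped to the internal corpus)
    pool = await db.get_pool()
    async with pool.acquire() as c:
        await c.execute(\"DELETE FROM case_law WHERE case_number = '9999-24' AND source_kind = 'internal_committee'\")
    print('UPSERT IDEMPOTENT OK; normalized stored as', stored)
asyncio.run(main())
"

Expected: UPSERT IDEMPOTENT OK and no duplicate. (Note: ZZ9999/24 normalizes to 9999-24, not ZZ9999-24, because _canonical_case_number strips everything before the first digit; this confirms write-time normalization too. Before running, confirm that no real internal decision numbered 9999-24 exists in the local DB, since the upsert would overwrite it and the cleanup would delete it.)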

Step 2: Backfill the searchable flag (recompute, reversible)

cd ~/legal-ai && set -a && source ~/.env && set +a
cd mcp-server && .venv/bin/python -c "
import asyncio
from legal_mcp.services import db
async def main():
    n = await db.recompute_searchable()
    print('recompute_searchable: rows now searchable =', n)
asyncio.run(main())
"

Step 3: Dry-run report — which rows would drop from search if the filter is enabled

cd ~/legal-ai && set -a && source ~/.env && set +a
PGPASSWORD="$POSTGRES_PASSWORD" psql "host=$POSTGRES_HOST port=$POSTGRES_PORT dbname=$POSTGRES_DB user=$POSTGRES_USER" -c "
SELECT source_kind,
       count(*) AS total,
       count(*) FILTER (WHERE NOT searchable) AS would_drop
FROM case_law GROUP BY source_kind ORDER BY source_kind;"

Report the table to the controller. Decision gate: if would_drop includes legitimate, currently-findable precedents (e.g. external_upload / internal_committee rows that users rely on), DO NOT enable the search filter in Step 4 — stop and report; the filter waits for FU-2b. If would_drop is only genuinely-incomplete rows, proceed.
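
To show how the dry-run table reads, here is a minimal sketch that runs the same aggregate against an in-memory SQLite table. The sample rows are invented, the FILTER clause needs SQLite 3.30 or newer, and the list of source kinds treated as "currently findable" is an assumption drawn from the examples in the decision gate; the real decision stays with whoever reviews the report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE case_law (source_kind TEXT, searchable INTEGER NOT NULL DEFAULT 0)"
)
# Invented sample data: (source_kind, searchable)
conn.executemany("INSERT INTO case_law VALUES (?, ?)", [
    ("cited_only", 0), ("cited_only", 0),
    ("external_upload", 1), ("external_upload", 0),
    ("internal_committee", 1),
])

report = conn.execute("""
SELECT source_kind,
       count(*) AS total,
       count(*) FILTER (WHERE NOT searchable) AS would_drop
FROM case_law GROUP BY source_kind ORDER BY source_kind
""").fetchall()

for source_kind, total, would_drop in report:
    print(f"{source_kind:20} total={total} would_drop={would_drop}")

# Assumption: these are the kinds users can find today, so any would_drop > 0
# here calls for a manual look at the affected rows before enabling the filter.
FINDABLE_KINDS = ("external_upload", "internal_committee")
needs_review = [k for k, _, dropped in report if dropped and k in FINDABLE_KINDS]
print("needs review:", needs_review)
```

A non-empty needs_review list does not by itself fail the gate. It only flags where to check whether the dropped rows are legitimate precedents or genuinely incomplete ones.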

Step 4: (GATED) Enable searchable = true filter in the search layer

ONLY if Step 3 is clean. Read mcp-server/src/legal_mcp/services/hybrid_search.py to find the case_law WHERE clauses in search_precedent_library_hybrid / search_documents_hybrid. Add AND cl.searchable = true (alias as used in that query) to the case_law-joined precedent search paths. Add a focused test asserting a non-searchable row is excluded (monkeypatch or DB smoke). If deferred, write a one-line note in the spec §7 that the filter is pending FU-2b and skip.

Step 5: Add health-check visibility

Find the health-check endpoint/function (search def health / processing_status in web/app.py or tools/). Add a field non_searchable_case_law = SELECT count(*) FROM case_law WHERE NOT searchable. Keep it a single cheap COUNT.

Step 6: Commit

cd ~/legal-ai
git add -A mcp-server/ web/
git commit -m "feat(retrieval): gated searchable filter + health-check visibility (GAP-13, FU-2a)"

Task 7: Full suite + smoke + lint + TaskMaster

Step 1: Full test suite

Run: cd ~/legal-ai/mcp-server && .venv/bin/python -m pytest tests/ -q Expected: all pass (the 77 FU-1 tests plus the new FU-2a tests). Report the summary line.

Step 2: Smoke-import

Run: cd ~/legal-ai/mcp-server && .venv/bin/python -c "from legal_mcp.services import db, ingest, precedent_library, internal_decisions; print('clean')" Expected: clean.

Step 3: Lint changed files (if ruff available)

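Run: cd ~/legal-ai/mcp-server && .venv/bin/python -m ruff check src/legal_mcp/services/db.py src/legal_mcp/services/ingest.py 2>/dev/null; echo "exit=$?" Expected: no findings and exit=0. Because stderr is discarded, a non-zero exit with no other output most likely means ruff is not installed in the venv; in that case skip this step.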

Step 4: Mark TaskMaster #60 + subtasks done

Controller handles this (edit .taskmaster/tasks/tasks.json, verify via MCP get_task). Subtasks 60.1 (GAP-03), 60.2 (GAP-06), 60.5 (GAP-13).

Self-Review Notes

GAP-03 → Task 3 (ON CONFLICT both functions). GAP-06 → Task 2 (_canonical_case_number + write-time, type-aware). GAP-13 → Tasks 4-5 (column + recompute + wiring) and gated Task 6 (filter).
No identifier migration — FU-2b (#67) owns GAP-07/08. The V21 backfill only sets a derived, reversible flag.
Gated search filter (Task 6 Step 3-4): the behavior-visible change is contingent on a clean dry-run; otherwise deferred. Surface the dry-run table to the user.
Offline-test limitation: ON CONFLICT needs real Postgres → verified by Task 6 Step 1 smoke; offline tests cover the pure logic (normalize, completeness) and ingest wiring.
Type-consistency: _canonical_case_number, _compute_searchable(row, has_embedded_chunk), recompute_searchable(case_law_id=None) — names used identically in tests (Task 1) and impl (Tasks 2,4).
