Practice area separation: multi-tenant axis across DB, RAG, and UI

Adds two orthogonal columns, practice_area (top-level legal domain: appeals_committee / national_insurance / labor_law) and appeal_subtype (building_permit / betterment_levy / compensation_197), denormalized into cases, documents, document_chunks, decisions, and style_corpus so vector searches can filter without JOINs.

Why: the system handles two unrelated sub-domains under the same appeals committee (1xxx building permits and 8xxx/9xxx betterment/197), with different rules and writing style. Without a separation axis, search_similar() and the block-writer's precedent lookup were free to surface betterment-levy paragraphs while drafting a building-permit decision: a real risk of cross-domain contamination. The same axis also lets future domains (national insurance, labor law) coexist without separate schemas.

Schema (V4 migration in db.py):
- ALTER ... ADD COLUMN IF NOT EXISTS on all five tables + composite indexes (practice_area first).
- Idempotent backfill: case_number ~ '^1' → building_permit, '^8' → betterment_levy, '^9' → compensation_197; propagated to documents, chunks, and decisions via case_id; training-corpus rows (case_id NULL) default to appeals_committee.

Code:
- New services/practice_area.py with derive_subtype, validate, and is_override + enum constants.
- db.create_case / create_document / store_chunks / create_decision inherit practice_area from the parent case (or take an explicit override for the case_id=None training corpus).
- db.search_similar and search_similar_paragraphs accept practice_area + appeal_subtype filters using the denormalized columns.
- tools/search.py auto-resolves the filter from case_number when given.
- block_writer._build_precedents_context now passes the active case's practice_area to search_similar_paragraphs, which closes the contamination hole for the discussion-block precedent fetch.
- tools/cases.case_create auto-derives subtype from case_number; an explicit override that disagrees writes a case_subtype_override entry to audit_log so we can spot bad classifications later.
- tools/documents.document_upload_training tags new training material with practice_area + subtype end-to-end (corpus, document, chunks).

UI (web/static/index.html + web/app.py):
- New-case wizard gets a practice_area dropdown (others disabled until national_insurance / labor_law arrive) and an appeal_subtype dropdown with JS auto-fill from the case-number prefix; manual edits stick.
- Case header shows a blue badge with practice_area · subtype.
- CaseCreateRequest plumbs both fields through to cases_tools.case_create.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 16:36:48 +00:00
parent a8b79822bf
commit 26d09d648f
8 changed files with 468 additions and 34 deletions
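
The diff below calls pa.derive_subtype and pa.validate, but services/practice_area.py itself is not shown here. The following is a minimal sketch of what those helpers might look like, assuming the prefix rules from the backfill (1 → building_permit, 8 → betterment_levy, 9 → compensation_197) and the 'unknown' fallback mentioned in the diff's comment. Only the three function names come from the commit message; the signatures, constants, and error behavior are guesses for illustration.

```python
# Hypothetical stand-in for services/practice_area.py; only the names
# derive_subtype, validate, and is_override come from the commit message.
PRACTICE_AREAS = frozenset({"appeals_committee", "national_insurance", "labor_law"})
APPEAL_SUBTYPES = frozenset({"building_permit", "betterment_levy", "compensation_197"})
UNKNOWN = "unknown"  # fallback value named in the diff's comment

# First character of the case number -> subtype, mirroring the backfill rules.
_PREFIX_TO_SUBTYPE = {
    "1": "building_permit",
    "8": "betterment_levy",
    "9": "compensation_197",
}


def derive_subtype(case_number: str, practice_area: str = "appeals_committee") -> str:
    """Guess the subtype from the case-number prefix, else return 'unknown'."""
    if practice_area != "appeals_committee":
        # Assumption: the prefix rules only apply to the appeals committee.
        return UNKNOWN
    return _PREFIX_TO_SUBTYPE.get((case_number or "").strip()[:1], UNKNOWN)


def validate(practice_area: str, appeal_subtype: str) -> None:
    """Raise ValueError for values outside the known enums (assumed behavior)."""
    if practice_area not in PRACTICE_AREAS:
        raise ValueError(f"unknown practice_area: {practice_area!r}")
    if appeal_subtype not in APPEAL_SUBTYPES and appeal_subtype != UNKNOWN:
        raise ValueError(f"unknown appeal_subtype: {appeal_subtype!r}")


def is_override(
    case_number: str,
    explicit_subtype: str,
    practice_area: str = "appeals_committee",
) -> bool:
    """True when an explicit subtype disagrees with the derived one."""
    derived = derive_subtype(case_number, practice_area)
    return bool(explicit_subtype) and derived != UNKNOWN and explicit_subtype != derived
```

Under these assumptions, is_override("1234/24", "betterment_levy") is the situation that would trigger the case_subtype_override audit entry described in the commit message.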
--- a/mcp-server/src/legal_mcp/tools/documents.py
+++ b/mcp-server/src/legal_mcp/tools/documents.py
@@ -105,6 +105,8 @@ async def document_upload_training(
    decision_date: str = "",
    subject_categories: list[str] | None = None,
    title: str = "",
+    practice_area: str = "appeals_committee",
+    appeal_subtype: str = "",
 ) -> str:
    """Upload a previous decision by Dafna to the style corpus (training).

@@ -114,10 +116,13 @@ async def document_upload_training(
        decision_date: decision date (YYYY-MM-DD)
        subject_categories: categories; several may be selected (construction, non-conforming use, plan, permit, relaxation, subdivision, TAMA 38, betterment levy, compensation 197)
        title: document name
+        practice_area: legal domain (appeals_committee / national_insurance / labor_law)
+        appeal_subtype: appeal type (building_permit / betterment_levy / compensation_197).
+                        Empty = inferred automatically from the decision number
    """
    from datetime import date as date_type

-    from legal_mcp.services import extractor, embeddings, chunker
+    from legal_mcp.services import chunker, embeddings, extractor, practice_area as pa

    source = Path(file_path)
    if not source.exists():
@@ -126,6 +131,11 @@ async def document_upload_training(
    if not title:
        title = source.stem

+    # Resolve subtype: explicit > derived from decision_number > 'unknown'
+    if not appeal_subtype:
+        appeal_subtype = pa.derive_subtype(decision_number, practice_area)
+    pa.validate(practice_area, appeal_subtype)
+
    # Copy to training directory (skip if already there)
    config.TRAINING_DIR.mkdir(parents=True, exist_ok=True)
    dest = config.TRAINING_DIR / source.name
@@ -140,25 +150,29 @@ async def document_upload_training(
    if decision_date:
        d_date = date_type.fromisoformat(decision_date)

-    # Add to style corpus
+    # Add to style corpus (tagged by domain so block-writer can filter)
    corpus_id = await db.add_to_style_corpus(
        document_id=None,
        decision_number=decision_number,
        decision_date=d_date,
        subject_categories=subject_categories or [],
        full_text=text,
+        practice_area=practice_area,
+        appeal_subtype=appeal_subtype,
    )

    # Chunk and embed for RAG search over training corpus
    chunks = chunker.chunk_document(text)
    if chunks:
-        # Create a document record (no case association)
+        # Create a document record (no case association — tag explicitly)
        doc = await db.create_document(
            case_id=None,
            doc_type="decision",
            title=f"[קורפוס] {title}",
            file_path=str(dest),
            page_count=page_count,
+            practice_area=practice_area,
+            appeal_subtype=appeal_subtype,
        )
        doc_id = UUID(doc["id"])
        await db.update_document(doc_id, extracted_text=text, extraction_status="completed")
@@ -176,7 +190,10 @@ async def document_upload_training(
            }
            for c, emb in zip(chunks, embs)
        ]
-        await db.store_chunks(doc_id, None, chunk_dicts)
+        await db.store_chunks(
+            doc_id, None, chunk_dicts,
+            practice_area=practice_area, appeal_subtype=appeal_subtype,
+        )

    return json.dumps({
        "corpus_id": str(corpus_id),