Support ingestion of betterment levy (היטל השבחה) decisions into a
separate training corpus (CMPA). Key changes:
- Add .doc file extraction via LibreOffice conversion in extractor
- Add practice_area/appeal_subtype columns to style_corpus table
- Route training files to cmp/ or cmpa/ subdirs based on appeal subtype
- Fix derive_subtype to handle the ARAR-YY-NNNN format (it was matching a
  digit from the year instead of the case-number prefix)
- Expose practice_area/appeal_subtype params in MCP upload_training tool
- Add appeal_subtype filter to analyze_style for per-type style analysis
- Update betterment levy methodology in lessons.py: checklist (now
  corpus-based instead of generic), opening/closing strategies, and
  discussion rules
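
A minimal sketch of the derive_subtype fix, assuming the subtype is decided
by the first digit of the NNNN part and that the ARAR-YY- prefix is optional.
The regex and the lookup table are illustrative guesses, not the code in
services/practice_area.py; the log does not show the original bug.

```python
import re
from typing import Optional

# Mapping taken from the prefixes named in this log:
# 1xxx -> building_permit, 8xxx -> betterment_levy, 9xxx -> compensation_197.
_PREFIX_TO_SUBTYPE = {
    "1": "building_permit",
    "8": "betterment_levy",
    "9": "compensation_197",
}

# Hypothetical pattern: skip an optional "ARAR-YY-" prefix, then capture the
# first digit of the serial number so the year digits can never be matched.
_CASE_RE = re.compile(r"^(?:ARAR-\d{2}-)?(\d)", re.IGNORECASE)


def derive_subtype(case_number: str) -> Optional[str]:
    """Return the appeal subtype for a case number, or None if unknown."""
    match = _CASE_RE.match(case_number.strip())
    if match is None:
        return None
    return _PREFIX_TO_SUBTYPE.get(match.group(1))
```

With this shape, "ARAR-18-1234" resolves through the serial "1234" and not
through the "8" in the year.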
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Adds two orthogonal columns — practice_area (top-level legal domain:
appeals_committee / national_insurance / labor_law) and appeal_subtype
(building_permit / betterment_levy / compensation_197) — denormalized
into cases, documents, document_chunks, decisions, and style_corpus so
vector searches can filter without JOINs.

Why: the system handles two unrelated sub-domains under the same
appeals committee (1xxx building permits and 8xxx/9xxx betterment/197),
with different rules and writing style. Without a separation axis,
search_similar() and the block-writer's precedent lookup were free to
surface betterment-levy paragraphs while drafting a building-permit
decision — a real risk of cross-domain contamination. The same axis
also lets future domains (national insurance, labor law) coexist
without separate schemas.

Schema (V4 migration in db.py):
- ALTER ... ADD COLUMN IF NOT EXISTS on all five tables + composite
indexes (practice_area first).
- Idempotent backfill: case_number ~ '^1' → building_permit, '^8' →
betterment_levy, '^9' → compensation_197; propagated to documents,
chunks, and decisions via case_id; training-corpus rows (case_id NULL)
default to appeals_committee.
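
The migration steps above can be sketched as a list of PostgreSQL
statements. Table and column names come from this message. The index names,
the cases.id key, the TEXT column type, and the IS NULL guards used for
idempotency are assumptions, as is setting practice_area to
appeals_committee on existing cases.

```python
from typing import List

# Table names are from the commit message; everything else is illustrative.
TABLES = ["cases", "documents", "document_chunks", "decisions", "style_corpus"]
PREFIX_TO_SUBTYPE = {
    "1": "building_permit",
    "8": "betterment_levy",
    "9": "compensation_197",
}


def build_v4_statements() -> List[str]:
    """Build the V4 migration as SQL strings (a sketch, not db.py itself)."""
    stmts: List[str] = []
    for table in TABLES:
        for column in ("practice_area", "appeal_subtype"):
            stmts.append(
                f"ALTER TABLE {table} ADD COLUMN IF NOT EXISTS {column} TEXT"
            )
        # Composite index with practice_area as the leading column.
        stmts.append(
            f"CREATE INDEX IF NOT EXISTS idx_{table}_area_subtype "
            f"ON {table} (practice_area, appeal_subtype)"
        )
    # Backfill cases from the case-number prefix; the NULL guard keeps
    # reruns from overwriting rows that were already classified.
    for prefix, subtype in PREFIX_TO_SUBTYPE.items():
        stmts.append(
            "UPDATE cases SET practice_area = 'appeals_committee', "
            f"appeal_subtype = '{subtype}' "
            f"WHERE case_number ~ '^{prefix}' AND appeal_subtype IS NULL"
        )
    # Propagate to child tables via case_id.
    for child in ("documents", "document_chunks", "decisions"):
        stmts.append(
            f"UPDATE {child} AS t SET practice_area = c.practice_area, "
            "appeal_subtype = c.appeal_subtype FROM cases AS c "
            "WHERE t.case_id = c.id AND t.appeal_subtype IS NULL"
        )
    # Training-corpus rows have no parent case, so they get the default.
    stmts.append(
        "UPDATE documents SET practice_area = 'appeals_committee' "
        "WHERE case_id IS NULL AND practice_area IS NULL"
    )
    return stmts
```

Each statement is safe to rerun, which is what makes the backfill
idempotent in this sketch.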

Code:
- New services/practice_area.py with derive_subtype, validate, and
is_override + enum constants.
- db.create_case / create_document / store_chunks / create_decision
inherit practice_area from the parent case (or take an explicit
override for the case_id=None training corpus).
- db.search_similar and search_similar_paragraphs accept practice_area
+ appeal_subtype filters using the denormalized columns.
- tools/search.py auto-resolves the filter from case_number when given.
- block_writer._build_precedents_context now passes the active case's
practice_area to search_similar_paragraphs — closes the contamination
hole for the discussion-block precedent fetch.
- tools/cases.case_create auto-derives subtype from case_number; an
explicit override that disagrees writes a case_subtype_override entry
to audit_log so we can spot bad classifications later.
- tools/documents.document_upload_training tags new training material
with practice_area + subtype end-to-end (corpus, document, chunks).
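
Two of the pieces above, sketched under assumptions: how the search
functions might turn the optional filters into a WHERE fragment, and how an
override could be detected before writing to audit_log. The helper name
build_scope_filter, the %s placeholder style, and the is_override signature
are hypothetical; only the column names come from this message.

```python
from typing import List, Optional, Tuple


def build_scope_filter(
    practice_area: Optional[str] = None,
    appeal_subtype: Optional[str] = None,
) -> Tuple[str, List[str]]:
    """Return a SQL fragment and its parameters for the given filters."""
    clauses: List[str] = []
    params: List[str] = []
    if practice_area is not None:
        clauses.append("practice_area = %s")
        params.append(practice_area)
    if appeal_subtype is not None:
        clauses.append("appeal_subtype = %s")
        params.append(appeal_subtype)
    # An empty fragment means "no scoping", so callers can skip the AND.
    return " AND ".join(clauses), params


def is_override(derived: Optional[str], explicit: Optional[str]) -> bool:
    """True when an explicit subtype disagrees with the derived one."""
    # Assumption: a missing value on either side is not a disagreement.
    return derived is not None and explicit is not None and derived != explicit
```

Because the columns are denormalized onto the chunk rows, the fragment can
be appended directly to the vector query without joining back to cases.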

UI (web/static/index.html + web/app.py):
- New-case wizard gets a practice_area dropdown (options other than
appeals_committee stay disabled until national_insurance / labor_law
arrive) and an appeal_subtype dropdown with JS auto-fill from the
case-number prefix; manual edits stick.
- Case header shows a blue badge with practice_area · subtype.
- CaseCreateRequest plumbs both fields through to cases_tools.case_create.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>