legal-ai/scripts/calibrate_halacha_dedup.py
Chaim 1286a1e60d feat(halacha): application gate + lexical dedup tail + quality harnesses (#81,#82)
Halacha-extraction quality (#81) and dedup-on-insert (#82) — engine changes
(pure + tested) plus measurement/ops tooling.

halacha_quality.py
- #81.4 application gate: is_fact_dependent() (high-precision "applied to THIS
  case" deixis per the strict rubric §3/§27) + FLAG_APPLICATION. compute_quality_flags
  now takes rule_type and flags rule_type=='application' OR fact-dependent —
  blocking auto-approve (an illustration is not a generalizable holding).
- #82.3 lexical tail signal: jaccard_shingles / normalized_levenshtein /
  lexical_near_duplicate + FLAG_NEAR_DUPLICATE, for the 0.83–0.93 cosine band.
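A minimal sketch of the two lexical measures named above, to make the tail signal concrete. It is not the code in halacha_quality.py: the `_sketch` function names, the character 3-gram shingling, and the "1 - distance / longer length" normalization are all assumptions. The defaults 0.55 and 0.70 are the configured operating point reported under "Verified", and the OR combination mirrors the sweep in calibrate_halacha_dedup.py.

```python
def _shingles(text: str, k: int = 3) -> set[str]:
    """Character k-grams over whitespace-normalized text (assumed scheme)."""
    s = " ".join(text.split())
    if len(s) < k:
        return {s} if s else set()
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def jaccard_shingles_sketch(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two shingle sets, in [0, 1]."""
    sa, sb = _shingles(a, k), _shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def levenshtein_distance(a: str, b: str) -> int:
    """Classic edit distance, two-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def normalized_levenshtein_sketch(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1.0 means identical strings (assumed normalization)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / max(len(a), len(b))


def lexical_near_duplicate_sketch(a: str, b: str,
                                  jaccard_min: float = 0.55,
                                  levenshtein_min: float = 0.70) -> bool:
    """True when either measure clears its threshold."""
    return (jaccard_shingles_sketch(a, b) >= jaccard_min
            or normalized_levenshtein_sketch(a, b) >= levenshtein_min)
```

Because the signal only flags an item for review, a high-precision, low-recall threshold pair is the safer choice: a missed near-duplicate still goes through the normal path, while a false hit only costs a reviewer's glance.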

halacha_extractor.py — pass rule_type to the flag computation; re-type a
binding-labeled fact-application to 'application' (mirrors non_decision→obiter).

db.py (store_halachot_for_chunk) — dedup now fetches the nearest same-precedent
neighbor once: cosine ≥ DEDUP → skip (unchanged); cosine in [BAND, DEDUP) with
high lexical overlap → FLAG_NEAR_DUPLICATE (review, not skip — never drop a
possibly-distinct principle unreviewed).
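The three-way outcome described for store_halachot_for_chunk can be sketched as a pure function. This is an illustration only, not the code in db.py: the `DedupAction` enum, the name `decide_dedup`, and the 0.93 skip threshold are assumptions (0.93 is inferred from the "0.83–0.93 cosine band" mentioned above; 0.83 is HALACHA_DEDUP_BAND_COSINE).

```python
from enum import Enum


class DedupAction(Enum):
    SKIP = "skip"                        # near-certain duplicate: do not insert
    FLAG_NEAR_DUPLICATE = "flag"         # insert, but route to human review
    INSERT = "insert"                    # treat as a distinct principle


def decide_dedup(cosine: float, lexical_overlap: bool,
                 dedup_cosine: float = 0.93,    # assumed value of DEDUP
                 band_cosine: float = 0.83) -> DedupAction:
    """Classify a candidate against its nearest same-precedent neighbor."""
    if cosine >= dedup_cosine:
        return DedupAction.SKIP
    if band_cosine <= cosine < dedup_cosine and lexical_overlap:
        return DedupAction.FLAG_NEAR_DUPLICATE
    return DedupAction.INSERT


if __name__ == "__main__":
    for cos, lex in [(0.95, False), (0.88, True), (0.88, False), (0.50, True)]:
        print(cos, lex, decide_dedup(cos, lex).name)
```

Note that a band hit never drops the item: it is stored with the flag so a reviewer can decide whether the two statements are really the same principle.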

config.py — HALACHA_DEDUP_BAND_COSINE (0.83).

Scripts:
- scripts/halacha_goldset.py (#81.7) — export stratified sample for human
  tagging; score validators (P/R/F1) against the tags. Backbone for #81.8.
- scripts/halacha_batch_reconcile.py (#82.7) — conservative cross-precedent
  dedup (cosine ≥0.95), dry-run report only.
- scripts/calibrate_halacha_dedup.py (#82.1) — calibrate the lexical thresholds
  against the 2026-06-03 cleanup gold-set.

Deferred (documented): #82.4 merge-provenance and #82.5 DB ON CONFLICT/UNIQUE
on normalized quote are NOT included — the current skip+flag behavior is safe,
whereas a UNIQUE on normalized_quote would fail on existing dups and a blind
merge risks losing provenance; they need their own chair-reviewed migration.
#82.6 over-merge guard is moot until merge lands. #81.6 full rhetorical-role
classifier deferred (section pre-filter + application flag cover the practical
case); #81.8 blocked on the human-tagged gold-set (harness now provided).

Verified:
- pytest tests/test_halacha_quality.py — 52 passed (14 new).
- calibrate: configured (0.55,0.70) → precision 1.0 (zero false-merge), recall
  0.30 — correct profile for an auto-approve-blocking signal.
- goldset export: 15-row sample CSV. batch reconcile: 819 halachot → 5
  cross-precedent candidate pairs.
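For readers checking the calibration figures, precision, recall and F1 follow from the confusion counts as sketched below. The helper name `prf1` and the counts in the example are hypothetical (they are not the counts from the actual run); they only illustrate how zero false positives with few true positives yields the "precision 1.0, recall 0.30" profile.

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F1 from counts, guarding empty denominators."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


if __name__ == "__main__":
    # Hypothetical counts: 3 labeled dups caught, 7 missed, no distinct pair flagged.
    p, r, f1 = prf1(tp=3, fp=0, fn=7)
    print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Precision is the figure that matters for an auto-approve-blocking flag, since a false positive is the only way the signal can hold back a genuinely distinct principle.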

Invariants: G1 (normalize at source — flag at insert, not at read); §6 (no
silent swallow — suspect items flagged to review, never dropped); G2 (no
parallel path — same store_halachot_for_chunk / compute_quality_flags).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 19:55:45 +00:00

116 lines
4.8 KiB
Python
#!/usr/bin/env python3
"""#82.1: calibrate the lexical dedup thresholds against the cleanup gold-set.

The 2026-06-03 cleanup manifest (data/audit/halacha-cleanup-manifest-*.csv)
records, for each removed halacha, a ``reason`` and a ``survivor_id``, i.e. a
human-labeled set of TRUE duplicate pairs (deleted rule ↔ its survivor). This
script uses them to validate the lexical near-duplicate thresholds introduced
in #82.3 (``HALACHA`` Jaccard/Levenshtein), so the numbers in
``halacha_quality.lexical_near_duplicate`` are calibrated, not guessed.

It sweeps (jaccard_min × levenshtein_min) and reports precision/recall against:
  * positives: duplicate-labeled pairs (deleted rule ↔ survivor rule)
  * negatives: non-paired rules from the same manifest, picked at
    deterministic offsets (≈all distinct)
and marks the currently-configured operating point.

    cd ~/legal-ai/mcp-server
    .venv/bin/python ../scripts/calibrate_halacha_dedup.py \
        --manifest ../data/audit/halacha-cleanup-manifest-20260603T101747Z.csv
"""
from __future__ import annotations

import argparse
import asyncio
import csv
import sys
from pathlib import Path
from uuid import UUID

from legal_mcp.services import db, halacha_quality as hq


async def _survivor_text(survivor_id: str, manifest_map: dict[str, str]) -> str:
    """Return the survivor's rule text: manifest first, then the DB, else ""."""
    if survivor_id in manifest_map:
        return manifest_map[survivor_id]
    try:
        row = await db.get_halacha(UUID(survivor_id)) if hasattr(db, "get_halacha") else None
    except Exception:
        row = None
    if row:
        return row.get("rule_statement", "")
    # fallback: direct query
    try:
        pool = await db.get_pool()
        r = await pool.fetchrow(
            "SELECT rule_statement FROM halachot WHERE id = $1", UUID(survivor_id)
        )
        return r["rule_statement"] if r else ""
    except Exception:
        # Best effort: an unresolvable survivor just drops that pair from the positives.
        return ""


async def main(args: argparse.Namespace) -> int:
    path = Path(args.manifest)
    if not path.is_absolute():
        path = (Path.cwd() / path).resolve()
    # newline="" is what the csv module expects, so quoted multi-line fields parse correctly.
    with path.open(encoding="utf-8", newline="") as f:
        rows = list(csv.DictReader(f))
    by_id = {r["id"]: r.get("rule_statement", "") for r in rows}

    positives: list[tuple[str, str]] = []
    for r in rows:
        if "duplicate" in (r.get("reason") or "").lower() and r.get("survivor_id"):
            a = r.get("rule_statement", "")
            b = await _survivor_text(r["survivor_id"], by_id)
            if a and b:
                positives.append((a, b))

    # negatives: pair manifest rules at a fixed offset. The pairs are assumed
    # distinct; they are only checked for textual inequality, not against the
    # labeled duplicate pairs.
    rules = [r.get("rule_statement", "") for r in rows if r.get("rule_statement")]
    negatives: list[tuple[str, str]] = []
    for i in range(len(positives)):
        a = rules[i % len(rules)]
        b = rules[(i * 7 + 3) % len(rules)]  # deterministic spread, no RNG
        if a and b and a != b:
            negatives.append((a, b))

    print(f"positives (labeled dup pairs): {len(positives)}   "
          f"negatives: {len(negatives)}", flush=True)
    if not positives:
        print("no labeled duplicate pairs found in manifest; cannot calibrate", flush=True)
        return 1

    # precompute lexical scores per pair
    def scores(pairs: list[tuple[str, str]]) -> list[tuple[float, float]]:
        return [(hq.jaccard_shingles(a, b), hq.normalized_levenshtein(a, b)) for a, b in pairs]

    pos_s, neg_s = scores(positives), scores(negatives)

    print(f"\n{'jac_min':>8}{'lev_min':>8}{'P':>8}{'R':>8}{'F1':>8}", flush=True)
    best = None
    for jm in (0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70):
        for lm in (0.60, 0.65, 0.70, 0.75, 0.80, 0.85):
            # A pair counts as a predicted duplicate when either measure clears its threshold.
            tp = sum(1 for jac, lev in pos_s if jac >= jm or lev >= lm)
            fp = sum(1 for jac, lev in neg_s if jac >= jm or lev >= lm)
            fn = len(pos_s) - tp
            p = tp / (tp + fp) if (tp + fp) else 0.0
            r = tp / (tp + fn) if (tp + fn) else 0.0
            f1 = 2 * p * r / (p + r) if (p + r) else 0.0
            mark = "  <- configured" if (abs(jm - hq._LEX_JACCARD_MIN) < 1e-9
                                         and abs(lm - hq._LEX_LEVENSHTEIN_MIN) < 1e-9) else ""
            # Only the configured operating point is printed; every grid point
            # still competes for the best-F1 summary below.
            if mark:
                print(f"{jm:>8.2f}{lm:>8.2f}{p:>8.3f}{r:>8.3f}{f1:>8.3f}{mark}", flush=True)
            if best is None or f1 > best[0]:
                best = (f1, jm, lm, p, r)

    print(f"\nbest F1={best[0]:.3f} at jaccard_min={best[1]}, levenshtein_min={best[2]} "
          f"(P={best[3]:.3f}, R={best[4]:.3f})", flush=True)
    print("note: positives may include obiter/application cuts (not pure dups); "
          "use precision as the guard against false-merges.", flush=True)
    return 0


if __name__ == "__main__":
    ap = argparse.ArgumentParser(description=__doc__,
                                 formatter_class=argparse.RawDescriptionHelpFormatter)
    ap.add_argument("--manifest", required=True, help="path to halacha-cleanup-manifest-*.csv")
    args = ap.parse_args()
    sys.exit(asyncio.run(main(args)))