Halacha-extraction quality (#81) and dedup-on-insert (#82): engine changes (pure + tested) plus measurement/ops tooling.

halacha_quality.py
- #81.4 application gate: is_fact_dependent() (high-precision "applied to THIS case" deixis per the strict rubric §3/§27) + FLAG_APPLICATION. compute_quality_flags now takes rule_type and flags an item when rule_type=='application' OR the text is fact-dependent, which blocks auto-approve (an illustration is not a generalizable holding).
- #82.3 lexical tail signal: jaccard_shingles / normalized_levenshtein / lexical_near_duplicate + FLAG_NEAR_DUPLICATE, for the 0.83–0.93 cosine band.

halacha_extractor.py: pass rule_type to the flag computation; re-type a binding-labeled fact-application to 'application' (mirrors non_decision→obiter).

db.py (store_halachot_for_chunk): dedup now fetches the nearest same-precedent neighbor once:
- cosine ≥ DEDUP → skip (unchanged);
- cosine in [BAND, DEDUP) with high lexical overlap → FLAG_NEAR_DUPLICATE (review, not skip; a possibly-distinct principle is never dropped unreviewed).

config.py: HALACHA_DEDUP_BAND_COSINE (0.83).

Scripts:
- scripts/halacha_goldset.py (#81.7): export a stratified sample for human tagging; score validators (P/R/F1) against the tags. Backbone for #81.8.
- scripts/halacha_batch_reconcile.py (#82.7): conservative cross-precedent dedup (cosine ≥0.95), dry-run report only.
- scripts/calibrate_halacha_dedup.py (#82.1): calibrate the lexical thresholds against the 2026-06-03 cleanup gold-set.

Deferred (documented):
- #82.4 merge-provenance and #82.5 DB ON CONFLICT/UNIQUE on normalized quote are NOT included. The current skip+flag behavior is safe, whereas a UNIQUE on normalized_quote would fail on existing dups and a blind merge risks losing provenance; they need their own chair-reviewed migration.
- #82.6 over-merge guard is moot until merge lands.
- #81.6 full rhetorical-role classifier is deferred (section pre-filter + application flag cover the practical case).
- #81.8 is blocked on the human-tagged gold-set (harness now provided).

Verified:
- pytest tests/test_halacha_quality.py: 52 passed (14 new).
- calibrate: configured (0.55,0.70) → precision 1.0 (zero false-merge), recall 0.30, the correct profile for an auto-approve-blocking signal.
- goldset export: 15-row sample CSV.
- batch reconcile: 819 halachot → 5 cross-precedent candidate pairs.

Invariants:
- G1 (normalize at source: flag at insert, not at read);
- §6 (no silent swallow: suspect items are flagged to review, never dropped);
- G2 (no parallel path: same store_halachot_for_chunk / compute_quality_flags).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
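The dedup-on-insert rule described in the commit message above (skip, flag for review, or insert) can be sketched roughly as follows. This is an illustration only, not the code in db.py or halacha_quality.py: the function name dedup_decision is hypothetical, the upper threshold of 0.93 is inferred from the "0.83–0.93 cosine band" wording, the 3-character shingle size is an assumption, and the two similarity helpers are plain stand-ins for the real jaccard_shingles / normalized_levenshtein. Combining the two lexical signals with OR mirrors the sweep in the calibration script; the actual lexical_near_duplicate may combine them differently.

```python
from __future__ import annotations

DEDUP_COSINE = 0.93     # assumption: upper edge of the "0.83-0.93" band
BAND_COSINE = 0.83      # HALACHA_DEDUP_BAND_COSINE per the commit message
JACCARD_MIN = 0.55      # configured operating point reported under "Verified"
LEVENSHTEIN_MIN = 0.70


def jaccard_shingles(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity over character k-shingles (k=3 is an assumption)."""
    def shingles(s: str) -> set[str]:
        s = " ".join(s.split())  # collapse runs of whitespace
        if len(s) < k:
            return {s} if s else set()
        return {s[i:i + k] for i in range(len(s) - k + 1)}

    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def normalized_levenshtein(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 - edit_distance / max(len(a), len(b))."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))


def dedup_decision(cosine: float, new_text: str, neighbor_text: str) -> str:
    """Return 'skip', 'flag_near_duplicate', or 'insert' (hypothetical labels)."""
    if cosine >= DEDUP_COSINE:
        return "skip"  # confident duplicate: unchanged behavior
    if cosine >= BAND_COSINE:
        jac = jaccard_shingles(new_text, neighbor_text)
        lev = normalized_levenshtein(new_text, neighbor_text)
        if jac >= JACCARD_MIN or lev >= LEVENSHTEIN_MIN:
            return "flag_near_duplicate"  # stored, but routed to review
    return "insert"
```

The point of the middle branch is that a borderline item is still stored and only flagged, so a possibly-distinct principle is never dropped without review.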
116 lines
4.8 KiB
Python
#!/usr/bin/env python3
"""#82.1: calibrate the lexical dedup thresholds against the cleanup gold-set.

The 2026-06-03 cleanup manifest (data/audit/halacha-cleanup-manifest-*.csv)
records, for each removed halacha, a ``reason`` and a ``survivor_id``, i.e. a
human-labeled set of TRUE duplicate pairs (deleted rule ↔ its survivor). This
script uses them to validate the lexical near-duplicate thresholds introduced
in #82.3 (``HALACHA`` Jaccard/Levenshtein), so the numbers in
``halacha_quality.lexical_near_duplicate`` are calibrated, not guessed.

It sweeps (jaccard_min × levenshtein_min) and reports precision/recall against:
  * positives: duplicate-labeled pairs (deleted rule ↔ survivor rule)
  * negatives: deterministically chosen non-paired rules from the same
    manifest (≈all distinct)

and marks the currently-configured operating point.

    cd ~/legal-ai/mcp-server
    .venv/bin/python ../scripts/calibrate_halacha_dedup.py \
        --manifest ../data/audit/halacha-cleanup-manifest-20260603T101747Z.csv
"""
from __future__ import annotations

import argparse
import asyncio
import csv
import sys
from pathlib import Path
from uuid import UUID

from legal_mcp.services import db, halacha_quality as hq


async def _survivor_text(survivor_id: str, manifest_map: dict) -> str:
    """Return the survivor's rule text: manifest first, then the database."""
    if survivor_id in manifest_map:
        return manifest_map[survivor_id]
    try:
        row = await db.get_halacha(UUID(survivor_id)) if hasattr(db, "get_halacha") else None
    except Exception:
        row = None
    if row:
        return row.get("rule_statement", "")
    # fallback: direct query
    try:
        pool = await db.get_pool()
        r = await pool.fetchrow("SELECT rule_statement FROM halachot WHERE id = $1", UUID(survivor_id))
        return r["rule_statement"] if r else ""
    except Exception:
        return ""


async def main(args: argparse.Namespace) -> int:
    path = Path(args.manifest)
    if not path.is_absolute():
        path = (Path.cwd() / path).resolve()
    with path.open(encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    by_id = {r["id"]: r.get("rule_statement", "") for r in rows}

    # positives: (deleted rule, survivor rule) for every duplicate-labeled row.
    positives: list[tuple[str, str]] = []
    for r in rows:
        if "duplicate" in (r.get("reason") or "").lower() and r.get("survivor_id"):
            a = r.get("rule_statement", "")
            b = await _survivor_text(r["survivor_id"], by_id)
            if a and b:
                positives.append((a, b))

    # negatives: pair each rule with a different rule at a fixed stride. The
    # pairs are assumed distinct; they are not checked against survivor_id.
    rules = [r.get("rule_statement", "") for r in rows if r.get("rule_statement")]
    negatives: list[tuple[str, str]] = []
    for i in range(len(positives)):
        a = rules[i % len(rules)]
        b = rules[(i * 7 + 3) % len(rules)]  # deterministic spread, no RNG
        if a and b and a != b:
            negatives.append((a, b))

    print(f"positives (labeled dup pairs): {len(positives)}   "
          f"negatives: {len(negatives)}", flush=True)
    if not positives:
        print("no labeled duplicate pairs found in manifest; cannot calibrate", flush=True)
        return 1

    # precompute lexical scores per pair
    def scores(pairs):
        return [(hq.jaccard_shingles(a, b), hq.normalized_levenshtein(a, b)) for a, b in pairs]

    pos_s, neg_s = scores(positives), scores(negatives)

    print(f"\n{'jac_min':>8}{'lev_min':>8}{'P':>8}{'R':>8}{'F1':>8}", flush=True)
    best = None
    for jm in (0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70):
        for lm in (0.60, 0.65, 0.70, 0.75, 0.80, 0.85):
            # a pair counts as a near-duplicate if either signal clears its threshold
            tp = sum(1 for j, lev in pos_s if j >= jm or lev >= lm)
            fp = sum(1 for j, lev in neg_s if j >= jm or lev >= lm)
            fn = len(pos_s) - tp
            p = tp / (tp + fp) if (tp + fp) else 0.0
            r = tp / (tp + fn) if (tp + fn) else 0.0
            f1 = 2 * p * r / (p + r) if (p + r) else 0.0
            mark = "  <- configured" if (abs(jm - hq._LEX_JACCARD_MIN) < 1e-9
                                         and abs(lm - hq._LEX_LEVENSHTEIN_MIN) < 1e-9) else ""
            # only the configured operating point is printed; the best row follows below
            if mark:
                print(f"{jm:>8.2f}{lm:>8.2f}{p:>8.3f}{r:>8.3f}{f1:>8.3f}{mark}", flush=True)
            if best is None or f1 > best[0]:
                best = (f1, jm, lm, p, r)
    print(f"\nbest F1={best[0]:.3f} at jaccard_min={best[1]}, levenshtein_min={best[2]} "
          f"(P={best[3]:.3f}, R={best[4]:.3f})", flush=True)
    print("note: positives may include obiter/application cuts (not pure dups); "
          "use precision as the guard against false-merges.", flush=True)
    return 0


if __name__ == "__main__":
    ap = argparse.ArgumentParser(description=__doc__,
                                 formatter_class=argparse.RawDescriptionHelpFormatter)
    ap.add_argument("--manifest", required=True, help="path to halacha-cleanup-manifest-*.csv")
    args = ap.parse_args()
    sys.exit(asyncio.run(main(args)))