feat(halacha): strict-rubric quality gate + dedup-on-insert (#81,#82)
Bake the 2026-06-03 strict-cleanup rubric into the extraction pipeline so the corpus stays clean at the source instead of accumulating duplicates, obiter dicta, truncated quotes and thin restatements that clog the review queue. #81 — quality gate: - New pure module halacha_quality.py with unit-tested validators: non-decision/obiter (Wambaugh markers), truncated-quote (mid-word cut), thin-restatement (rule≈quote), quote-unverified. - Validators run in halacha_extractor._process; a non-decision is re-typed obiter; flags persist in new halachot.quality_flags column. - Auto-approve now requires confidence>=threshold AND no quality flags; flagged items route to pending_review regardless of confidence. - Both extraction prompts hardened: reject undecided dicta, exclude case-specific applications, require abstraction, forbid over-splitting. #82 — dedup-on-insert (store_halachot_for_chunk): - Within the same precedent, skip a halacha whose normalized supporting_quote already exists, or whose rule-embedding has cosine>=HALACHA_DEDUP_COSINE (0.93) against an already-stored one. Makes re-runs idempotent. Migration: halachot.quality_flags TEXT[] (additive, idempotent ALTER). Tests: 19 new unit tests; full suite 156 passed. Validated end-to-end against dev DB (dedup skips dups, flag blocks auto-approve, re-run inserts 0). Calibration: flags fire on only ~10% of current survivors (low false-positive). Spec: docs/halacha-strict-rubric.md Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -144,6 +144,16 @@ HALACHA_AUTO_APPROVE_THRESHOLD = float(
|
||||
os.environ.get("HALACHA_AUTO_APPROVE_THRESHOLD", "0.80")
|
||||
)
|
||||
|
||||
# Halacha dedup-on-insert — within-precedent semantic cosine ceiling. Before
|
||||
# storing a halacha, store_halachot_for_chunk skips it if its rule-embedding has
|
||||
# cosine >= this value against an already-stored halacha of the SAME precedent
|
||||
# (exact normalized supporting_quote is always skipped regardless). 0.93 is the
|
||||
# conservative auto-skip floor: the 2026-06-03 cleanup showed the 0.90-0.95 band
|
||||
# is "almost entirely" same-rule-reworded, but auto-skip is unreviewed so we sit
|
||||
# just above the manual-cleanup 0.90 to avoid dropping a genuinely distinct
|
||||
# principle. Set > 1.0 to disable semantic dedup (exact-quote dedup still runs).
|
||||
HALACHA_DEDUP_COSINE = float(os.environ.get("HALACHA_DEDUP_COSINE", "0.93"))
|
||||
|
||||
# Google Cloud Vision (OCR for scanned PDFs)
|
||||
GOOGLE_CLOUD_VISION_API_KEY = os.environ.get("GOOGLE_CLOUD_VISION_API_KEY", "")
|
||||
|
||||
|
||||
Reference in New Issue
Block a user