feat(halacha): #82.4 provenance-union on dedup-skip + #82.6 over-merge guard
All checks were successful
G12 Leak-Guard / leak-guard (pull_request) Successful in 6s
Extract the dedup decision into a pure, testable helper, `halacha_quality.dedup_action()` (skip/flag/keep), plus two improvements to the dedup-on-insert path:

#82.4: merge-with-provenance, not blind-drop. When a semantic duplicate (cosine ≥ 0.93) is skipped, the incoming row's `cites` are unioned into the surviving canonical neighbor instead of being lost. `cites` is the only provenance field that exists at insert time; canonical selection and full corroboration merging belong to the offline reconciliation path (#82.7 / #84.2, where rows already carry accumulated provenance). This is documented in the code.

#82.6: over-merge guard. The decision is PAIRWISE against the single nearest neighbor, and only the incoming row is ever skipped (no existing row is merged or deleted). There is no connected-components closure on insert, so a chain A~B~C does not collapse into a single row even when A and C are distinct. Documented in dedup_action and tested.

invariants: G1 (provenance is preserved at the source, not lost) · G2 (decision logic lives in a single testable helper; behavior-preserving refactor) · INV-G10 (no auto-merge of existing rows; tail→flag→chair review).

tests: 6 new (skip/flag/keep/over-merge/boundaries) + 59 existing halacha tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
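The provenance union described for #82.4 can be sketched as follows. This is a minimal illustration, not the code from this commit: the helper name `union_cites` is hypothetical, and it assumes `cites` is a list of hashable citation identifiers (plain strings here).

```python
from typing import Iterable, List


def union_cites(surviving: Iterable[str], incoming: Iterable[str]) -> List[str]:
    """Order-preserving union: surviving cites first, then unseen incoming ones.

    Hypothetical helper; assumes each cite is a hashable identifier.
    """
    seen = set()
    merged: List[str] = []
    for cite in list(surviving) + list(incoming):
        if cite not in seen:
            seen.add(cite)
            merged.append(cite)
    return merged


# The incoming row is skipped as a duplicate, but its cites are kept
# on the surviving neighbor instead of being dropped with the row.
canonical_cites = ["source-1", "source-2"]
incoming_cites = ["source-2", "source-3"]
merged_cites = union_cites(canonical_cites, incoming_cites)
```

Keeping the surviving row's cites first leaves its existing provenance order untouched; only cites that are new to it get appended.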
@@ -244,6 +244,35 @@ def lexical_near_duplicate(
            or normalized_levenshtein(a, b) >= levenshtein_min)


def dedup_action(
    dist: float, rule_new: str, rule_neighbor: str,
    dedup_distance: float, band_distance: float,
) -> str:
    """Decide a fresh halacha's fate vs its nearest same-precedent neighbor (#82.4).

    PAIRWISE by construction — it compares the new rule to exactly ONE neighbor
    (the nearest already-stored one), never to a cluster, so dedup-on-insert can
    NEVER collapse a chain A~B~C into a single row even when A and C are
    distinct: each insert is an independent pairwise decision and only the
    *incoming* row is ever skipped (no existing row is merged or deleted). This
    is the over-merge guard (#82.6) — connected-components closure, the central
    over-merge risk in entity-resolution, is deliberately NOT performed here.

    ``dist`` is cosine distance (1 − cosine sim) to the neighbor. Returns:
      * 'skip' — semantic duplicate (dist ≤ dedup_distance): drop the incoming
        row; the caller unions its provenance (cites) into the surviving
        neighbor rather than blind-dropping it.
      * 'flag' — lexical tail (dedup_distance < dist ≤ band_distance AND high
        lexical overlap): keep the row but mark near_duplicate → chair review.
      * 'keep' — distinct enough: store normally.
    """
    if dist <= dedup_distance:
        return "skip"
    if dist <= band_distance and lexical_near_duplicate(rule_new, rule_neighbor):
        return "flag"
    return "keep"


# ── Aggregate ──

FLAG_NON_DECISION = "non_decision"
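The over-merge guard can be seen in a toy simulation of the A~B~C chain. This is a sketch under loud assumptions, not the project's insert path: points on a line stand in for embeddings, absolute difference stands in for cosine distance, the `insert_pairwise` helper is hypothetical, the threshold 0.07 is inferred from the cosine ≥ 0.93 figure in the commit message, and the 'flag' band is left out for brevity.

```python
from typing import List

# Assumed threshold: 1 - 0.93, taken from the commit message's cosine cutoff.
DEDUP_DISTANCE = 0.07


def insert_pairwise(store: List[float], new: float,
                    dedup_distance: float = DEDUP_DISTANCE) -> str:
    """Toy dedup-on-insert: compare `new` to its single nearest stored neighbor.

    Only the incoming point can be skipped; stored points are never merged
    or removed, and no transitive (connected-components) closure is taken.
    """
    if store:
        nearest = min(store, key=lambda p: abs(p - new))
        if abs(nearest - new) <= dedup_distance:
            return "skip"
    store.append(new)
    return "keep"


# Chain: A~B and B~C are each within the threshold, but A and C are not.
A, B, C = 0.00, 0.05, 0.10
store: List[float] = []
results = [insert_pairwise(store, x) for x in (A, B, C)]
# B is skipped as a duplicate of A; C is compared only to A and is kept.
```

A connected-components pass over the same three points would link A to C through B and collapse all three into one cluster. The pairwise rule keeps A and C as separate rows.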