feat(halacha): #82.4 provenance-union on dedup-skip + #82.6 over-merge guard
All checks were successful
G12 Leak-Guard / leak-guard (pull_request) Successful in 6s
Extract the dedup decision into a pure, testable helper, `halacha_quality.dedup_action()` (skip/flag/keep), and make two improvements to the dedup-on-insert path:

#82.4: merge-with-provenance, not blind-drop. When a semantic duplicate (cosine ≥ 0.93) is skipped, the incoming row's `cites` are unioned into the surviving canonical neighbor instead of being lost. This is the only provenance field available at insert time; canonical selection and full corroboration merging belong to the offline reconciliation path (#82.7 / #84.2, where rows already carry accumulated provenance). Documented in the code.

#82.6: over-merge guard. The decision is PAIRWISE against the single nearest neighbor, and only the incoming row is ever skipped (no existing row is merged or deleted). There is no connected-components closure at insert, so a chain A~B~C does not collapse into a single row when A and C are distinct. Documented in dedup_action and tested.

invariants: G1 (provenance is preserved at the source, not lost) · G2 (decision logic lives in a single testable helper; behavior-preserving refactor) · INV-G10 (no auto-merge of existing rows; tail→flag→chair review).

tests: 6 new (skip/flag/keep/over-merge/boundaries) + 59 existing halacha tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
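The skip/flag/keep decision described in the message can be sketched as a pure function. This is a minimal illustration, not the repository's code: the argument order follows the call site in the diff, the branch logic mirrors the two conditions the diff replaces, and `lexical_near_duplicate` here is a hypothetical token-overlap stand-in for the real helper in `halacha_quality`, whose implementation is not shown. The `0.07` threshold assumes the distance is 1 minus the 0.93 cosine cutoff; `BAND_DISTANCE = 0.12` is an invented value for illustration only.

```python
DEDUP_DISTANCE = 0.07  # assumed: cosine distance equivalent of cosine >= 0.93
BAND_DISTANCE = 0.12   # hypothetical tail-band threshold, not from the source


def lexical_near_duplicate(a: str, b: str, min_jaccard: float = 0.6) -> bool:
    """Hypothetical stand-in: token-set Jaccard overlap of two statements."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return False
    overlap = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
    return overlap >= min_jaccard


def dedup_action(dist: float, incoming: str, neighbor: str,
                 dedup_distance: float, band_distance: float) -> str:
    """Pairwise decision for one incoming row against its nearest neighbor."""
    if dist <= dedup_distance:
        return "skip"  # semantic duplicate: the incoming row is not inserted
    if dist <= band_distance and lexical_near_duplicate(incoming, neighbor):
        return "flag"  # tail band: insert the row, but mark it for review
    return "keep"      # distinct enough: insert unchanged
```

Because the function takes only plain values and returns a string, the boundary cases (a distance exactly at a threshold, a lexically distant pair inside the band) can be tested without a database connection.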
@@ -4299,22 +4299,39 @@ async def store_halachot_for_chunk(
             flags = list(h.get("quality_flags") or [])
             if emb is not None and config.HALACHA_DEDUP_COSINE <= 1.0:
                 neighbor = await conn.fetchrow(
-                    "SELECT rule_statement, (embedding <=> $2) AS dist "
+                    "SELECT id, rule_statement, cites, (embedding <=> $2) AS dist "
                     "FROM halachot WHERE case_law_id = $1 "
                     "AND embedding IS NOT NULL "
                     "ORDER BY embedding <=> $2 LIMIT 1",
                     case_law_id, emb,
                 )
                 if neighbor is not None:
-                    dist = float(neighbor["dist"])
-                    if dist <= dedup_distance:
+                    # PAIRWISE decision vs the single nearest neighbor — no
+                    # cluster closure, so a chain A~B~C can't over-merge to one
+                    # row (#82.6 over-merge guard). See halacha_quality.dedup_action.
+                    action = halacha_quality.dedup_action(
+                        float(neighbor["dist"]), h["rule_statement"],
+                        neighbor["rule_statement"], dedup_distance, band_distance,
+                    )
+                    if action == "skip":
+                        # #82.4 — merge-with-provenance, not blind drop: fold the
+                        # incoming row's cites into the surviving neighbor (the
+                        # only provenance present at insert; full canonical-
+                        # selection/merge lives in the offline reconciliation
+                        # path, #82.7 / #84.2).
+                        new_cites = [c for c in (h.get("cites") or []) if c]
+                        if new_cites:
+                            await conn.execute(
+                                "UPDATE halachot SET cites = ARRAY(SELECT DISTINCT "
+                                "unnest(COALESCE(cites, '{}') || $2::text[])), "
+                                "updated_at = now() WHERE id = $1",
+                                neighbor["id"], new_cites,
+                            )
                         skipped += 1
                         continue
                     # tail band: below auto-skip but lexically near → flag.
-                    if (dist <= band_distance
-                            and halacha_quality.FLAG_NEAR_DUPLICATE not in flags
-                            and halacha_quality.lexical_near_duplicate(
-                                h["rule_statement"], neighbor["rule_statement"])):
+                    if (action == "flag"
+                            and halacha_quality.FLAG_NEAR_DUPLICATE not in flags):
                         flags.append(halacha_quality.FLAG_NEAR_DUPLICATE)
 
             confidence = float(h.get("confidence", 0.0))
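The combined effect of the provenance union (#82.4) and the pairwise guard (#82.6) can be illustrated with an in-memory simulation. This is a sketch under stated assumptions, not the production path: rows are plain dicts rather than `halachot` table rows, `insert_with_dedup` and `unit` are hypothetical names, the flag band and the `case_law_id` scoping are left out, and `0.07` is assumed to be the distance form of the 0.93 cosine cutoff. Unlike `SELECT DISTINCT`, which gives no ordering guarantee, the Python union below happens to preserve first-seen order.

```python
import math


def cosine_distance(u, v):
    """Cosine distance, 1 - cos(u, v), the quantity pgvector's <=> returns."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def insert_with_dedup(rows, incoming, dedup_distance=0.07):
    """Hypothetical in-memory analogue of the skip branch in the diff."""
    if rows:
        # Pairwise: compare only against the single nearest stored row.
        nearest = min(rows, key=lambda r: cosine_distance(r["emb"], incoming["emb"]))
        if cosine_distance(nearest["emb"], incoming["emb"]) <= dedup_distance:
            # Union the non-empty incoming cites into the surviving row.
            new_cites = [c for c in incoming["cites"] if c]
            nearest["cites"] = list(dict.fromkeys(nearest["cites"] + new_cites))
            return "skip"  # only the incoming row is dropped
    rows.append({"emb": incoming["emb"], "cites": list(incoming["cites"])})
    return "insert"


def unit(deg):
    """Unit vector at the given angle, used as a toy embedding."""
    rad = math.radians(deg)
    return (math.cos(rad), math.sin(rad))


# Chain A~B~C: A and B are 20 degrees apart, B and C are 20 degrees apart,
# but A and C are 40 degrees apart and therefore distinct.
rows = []
results = [
    insert_with_dedup(rows, {"emb": unit(0), "cites": ["s1"]}),            # A
    insert_with_dedup(rows, {"emb": unit(20), "cites": ["s2", "", "s1"]}),  # B
    insert_with_dedup(rows, {"emb": unit(40), "cites": ["s3"]}),            # C
]
```

In this run B is skipped and its citation is folded into A, while C is inserted as its own row because it is compared only against stored rows and A is too far away. A connected-components closure over the same three points would instead have merged all of them into one row.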