Retrofit: tighten yod-bet pattern, add cover-block fallback

The "על כן" ("therefore") pattern for block-yod-bet was too loose: it also
matched mid-discussion transitional sentences (e.g. "על כן, במקום בו...").
The false match advanced the forward-scan pointer, so the scan skipped
block-yod-alef ("סוף דבר").

Tightened to require an operative subject (אנו / הערר / הוועדה / ועדת הערר)
so terminal "על כן, אנו מחליטים" still matches but mid-block transitions don't.
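
A quick check of the difference, as a minimal sketch: the two patterns are
copied from the diff below, while the names LOOSE, TIGHT, samples and results
are illustrative only and do not exist in the codebase.

```python
import re

# Old (loose) and new (tightened) block-yod-bet patterns, copied from the diff.
LOOSE = re.compile(r"^על\s+כן[,\.]?")
TIGHT = re.compile(r"^על\s+כן[,\.\s]+(?:אנו|הערר|הוועדה|ועדת\s+הערר)\b")

# Hypothetical sample paragraphs taken from the examples in this message.
samples = {
    "terminal": "על כן, אנו מחליטים",
    "transition": "על כן, במקום בו",
}

# For each sample: (matched by old pattern, matched by new pattern).
results = {
    name: (LOOSE.search(text) is not None, TIGHT.search(text) is not None)
    for name, text in samples.items()
}

for name, (loose_hit, tight_hit) in results.items():
    print(f"{name}: loose={loose_hit} tight={tight_hit}")
```

Both samples match the old pattern; only the terminal one matches the new one.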

Added structural_fallback for cover blocks (alef/bet/gimel/dalet). These are
template metadata that does not appear in user-edited DOCX bodies, so retrofit
injects zero-content anchors for them and apply_user_edit can still target them
later. The frontend toast distinguishes real content gaps from fallback anchors.
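
The anchor placement can be sketched in isolation. This mirrors the index
arithmetic in the diff below, but the wrapper function add_cover_fallback and
the sample block_starts dict are assumptions made for illustration; in the
real code the logic is inline in retrofit_bookmarks.

```python
COVER_BLOCKS = ["block-alef", "block-bet", "block-gimel", "block-dalet"]


def add_cover_fallback(block_starts: dict[str, int]) -> list[str]:
    """Fill in missing cover blocks in place; return the names that were filled.

    Hypothetical helper: same arithmetic as the diff, extracted for testing.
    """
    structural_fallback: list[str] = []
    # Computed once, before any fallback anchors are added.
    first_detected_idx = min(block_starts.values()) if block_starts else 0
    for i, name in enumerate(COVER_BLOCKS):
        if name not in block_starts:
            # Paragraph i, but never at or beyond the first detected block
            # (and never below paragraph 0).
            block_starts[name] = min(i, max(0, first_detected_idx - 1))
            structural_fallback.append(name)
    return structural_fallback


# Made-up detection result: the first real block starts at paragraph 2.
starts = {"block-heh": 2, "block-vav": 7}
filled = add_cover_fallback(starts)
print(filled)   # all four cover blocks were filled
print(starts)   # alef -> 0, bet/gimel/dalet -> 1
```

Several fallback anchors can share a paragraph index when the first detected
block sits near the top of the document.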

Also expanded heading patterns based on training corpus inspection:
- block-vav: על המקרקעין חלות / במצב התכנוני / התכניות החלות
- block-zayin: טענות העוררת
- block-chet: עיקר תגובת המשיב
- block-tet: הדיון בוועדת הערר
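
A minimal sketch for checking the new headings against the added patterns.
The regexes are copied from the diff below; NEW_PATTERNS and classify are
hypothetical names, not part of the module.

```python
import re

# Only the patterns added in this commit, keyed by block name.
NEW_PATTERNS = {
    "block-vav": [
        r"^על\s+המקרקעין\s+חלות",
        r"^התכניות?\s+החלות",
        r"^במצב\s+התכנוני",
    ],
    "block-zayin": [r"^טענות\s+העוררת"],
    "block-chet": [r"^עיקר\s+תגובת\s+המשיב"],
    "block-tet": [r"^הדיון\s+בוועדת\s+הערר"],
}


def classify(heading):
    """Return the first block whose new pattern matches, else None."""
    for block, patterns in NEW_PATTERNS.items():
        if any(re.search(p, heading) for p in patterns):
            return block
    return None


for heading in ["התכניות החלות", "טענות העוררת", "עיקר תגובת המשיב"]:
    print(heading, "->", classify(heading))
```

The existing `^תגובת\s+המשיב` pattern is anchored at the start of the
paragraph, so it cannot match "עיקר תגובת המשיב"; that is why block-chet
needs its own entry.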

For case 1130-25, this raises detection from 6/12 to 11/12 blocks — only
block-yod-bet remains missing (Daphna's edit ends at "סוף דבר" + numbered
ruling, no terminal "ההחלטה" or "על כן אנו מחליטים" paragraph).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 06:57:41 +00:00
parent eac7784b87
commit 36ca713dfa
6 changed files with 154 additions and 24 deletions

@@ -85,13 +85,40 @@ _BLOCK_HEADING_PATTERNS: list[tuple[str, list[str]]] = [
     ("block-gimel", [r"^נגד\s*$", r"^—\s*נגד\s*—"]),
     ("block-dalet", [r"^החלטה\s*$"]),
     ("block-heh", [r"^רקע\s*$", r"^רקע\s+עובדתי", r"^פתח\s+דבר"]),
-    ("block-vav", [r"^תכניות\s+חלות", r"^ההליכים?\s+שבפנינו", r"^ההליכים?\s+בפני\s+הוועדה\s+המקומית"]),
-    ("block-zayin", [r"^תמצית\s+טענות", r"^טענות\s+הצדדים", r"^טענות\s+העוררי"]),
-    ("block-chet", [r"^תגובת\s+המשיב", r"^עמדת\s+הוועדה\s+המקומית", r"^תשובת"]),
-    ("block-tet", [r"^ההליכים?\s+בפני\s+ועדת\s+הערר", r"^הדיון\s+בפנינו"]),
+    ("block-vav", [
+        r"^תכניות\s+חלות",
+        r"^ההליכים?\s+שבפנינו",
+        r"^ההליכים?\s+בפני\s+הוועדה\s+המקומית",
+        r"^על\s+המקרקעין\s+חלות",
+        r"^התכניות?\s+החלות",
+        r"^במצב\s+התכנוני",
+    ]),
+    ("block-zayin", [
+        r"^תמצית\s+טענות",
+        r"^טענות\s+הצדדים",
+        r"^טענות\s+העוררי",
+        r"^טענות\s+העוררת",
+    ]),
+    ("block-chet", [
+        r"^תגובת\s+המשיב",
+        r"^עמדת\s+הוועדה\s+המקומית",
+        r"^תשובת",
+        r"^עיקר\s+תגובת\s+המשיב",
+    ]),
+    ("block-tet", [
+        r"^ההליכים?\s+בפני\s+ועדת\s+הערר",
+        r"^הדיון\s+בפנינו",
+        r"^הדיון\s+בוועדת\s+הערר",
+    ]),
     ("block-yod", [r"^דיון\s+והכרעה", r"^דיון\s*$", r"^ההכרעה"]),
     ("block-yod-alef", [r"^סוף\s+דבר", r"^סיכום\s*$"]),
-    ("block-yod-bet", [r"^ההחלטה\s*$", r"^על\s+כן[,\.]?"]),
+    # block-yod-bet "על כן" must be operative — paired with אנו/הערר/הוועדה.
+    # Loose `^על כן` alone matches mid-discussion transitions ("על כן, במקום בו...")
+    # and steals the bookmark from block-yod-alef via forward-scan.
+    ("block-yod-bet", [
+        r"^ההחלטה\s*$",
+        r"^על\s+כן[,\.\s]+(?:אנו|הערר|הוועדה|ועדת\s+הערר)\b",
+    ]),
 ]

 _COMPILED_HEADING_PATTERNS: list[tuple[str, list[re.Pattern[str]]]] = [
@@ -252,6 +279,20 @@ def retrofit_bookmarks(
     block_starts = _detect_block_starts(paragraphs)

+    # Cover-block fallback: alef/bet/gimel/dalet are template metadata
+    # (judges, case number, parties, "החלטה" title) that don't appear in
+    # the body of user-edited DOCX files; they live in headers/template.
+    # Inject zero-content anchors at the top of the document (paragraph i,
+    # capped at the paragraph before the first detected block) so
+    # apply_user_edit can still target them later.
+    structural_fallback: list[str] = []
+    cover_blocks = ["block-alef", "block-bet", "block-gimel", "block-dalet"]
+    first_detected_idx = min(block_starts.values()) if block_starts else 0
+    for i, name in enumerate(cover_blocks):
+        if name not in block_starts:
+            idx = min(i, max(0, first_detected_idx - 1))
+            block_starts[name] = idx
+            structural_fallback.append(name)
+
     # Calculate end_idx for each block = paragraph before the next block's start,
     # or last paragraph if this is the last block found.
     ordered_found = sorted(block_starts.items(), key=lambda kv: kv[1])
@@ -280,11 +321,16 @@ def retrofit_bookmarks(
     _save_docx_xml(members, doc_tree, settings_tree, output_path)

-    missing = [n for n, _ in BLOCK_ORDER if n not in block_starts and n not in existing_names]
-    logger.info("retrofit %s: added=%s missing=%s",
-                docx_path.name, added, missing)
+    missing = [
+        n for n, _ in BLOCK_ORDER
+        if n not in block_starts
+        and n not in existing_names
+    ]
+    logger.info("retrofit %s: added=%s missing=%s structural=%s",
+                docx_path.name, added, missing, structural_fallback)

     return {
         "bookmarks_added": added,
         "missing_blocks": missing,
+        "structural_fallback": structural_fallback,
         "existing_bookmarks": existing_names,
     }