Add dafna-decision-template skill — knowledge for template-based DOCX export

Documents the rules and decisions behind building DOCX files from Dafna's
decision template (טיוטת החלטה.dotx). The implementation lives in
mcp-server/src/legal_mcp/services/analysis_docx_exporter.py; this skill
captures the "why" so future improvements don't need to rediscover it.

Contents:
  SKILL.md                       5 critical rules, style mapping table,
                                 export flow, line classification,
                                 dash policy, placeholder handling,
                                 troubleshooting, future TODOs
  references/dotx-to-docx.md     why python-docx can't open .dotx +
                                 the conversion recipe
  references/rtl-runs.md         why <w:rtl/> is required on every run
                                 (otherwise Hebrew falls back to
                                 Times New Roman)
  references/style-mapping.md    XML dump of every template style,
                                 with the Title-via-theme gotcha
  references/line-classification.md  the 7 regex categories in
                                 _classify_line() with real examples

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-16 18:57:57 +00:00

Why `<w:rtl/>` is required on every run

The problem

When you create a run with python-docx on a well-defined Hebrew style (for example, Normal with cs="David"), the Hebrew text still comes out in Times New Roman.

The cause

Three of the font slots inside <w:rFonts> matter here:

w:ascii: Latin characters
w:hAnsi: high-ANSI characters, such as accented European letters
w:cs (complex script): Hebrew, Arabic, Thai

Which slot Word uses depends on the kind of text in the run and on the run-level flag <w:rtl/>. Without the flag, Word may treat the Hebrew text as LTR (for example, when it is mixed with digits or Latin text) and pick the ascii slot, which resolves to Times New Roman.
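The slot choice can be illustrated with a toy model. This is a minimal sketch using only the standard library; `pick_font` is a hypothetical helper, and it assumes a deliberately simplified rule (a run with `<w:rtl/>` uses the `w:cs` slot, any other run uses `w:ascii`). Word's actual selection logic takes more inputs than this.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def pick_font(run_xml):
    """Return the font a run would get under the simplified rule.

    Assumption: <w:rtl/> present -> w:cs slot, otherwise -> w:ascii slot.
    Returns None when the run declares no fonts of its own.
    """
    run = ET.fromstring(run_xml)
    rpr = run.find(f"{{{W}}}rPr")
    if rpr is None:
        return None
    fonts = rpr.find(f"{{{W}}}rFonts")
    if fonts is None:
        return None
    slot = "cs" if rpr.find(f"{{{W}}}rtl") is not None else "ascii"
    return fonts.get(f"{{{W}}}{slot}")


# Two otherwise identical runs; only the <w:rtl/> flag differs.
RUN_TEMPLATE = (
    '<w:r xmlns:w="{ns}"><w:rPr>'
    '<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="David"/>'
    '{flag}</w:rPr><w:t>רקע</w:t></w:r>'
)
without_flag = RUN_TEMPLATE.format(ns=W, flag="")
with_flag = RUN_TEMPLATE.format(ns=W, flag="<w:rtl/>")

print(pick_font(without_flag))  # Times New Roman
print(pick_font(with_flag))     # David
```

The same fonts are declared in both runs; the flag alone decides which one is picked.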

The fix

Mark every Hebrew run as complex-script:

from docx.oxml import OxmlElement
from docx.oxml.ns import qn

def _mark_run_rtl(run):
    """Add <w:rtl/> to the run's properties unless it is already present."""
    rPr = run._r.get_or_add_rPr()
    if rPr.find(qn("w:rtl")) is None:
        rPr.append(OxmlElement("w:rtl"))

And also at the paragraph level (in case the paragraph mark itself has an effect):

def _mark_paragraph_rtl(paragraph):
    """Add <w:rtl/> to the paragraph mark's run properties (<w:pPr>/<w:rPr>)."""
    pPr = paragraph._p.get_or_add_pPr()
    rPr = pPr.find(qn("w:rPr"))
    if rPr is None:
        rPr = OxmlElement("w:rPr")
        pPr.append(rPr)
    if rPr.find(qn("w:rtl")) is None:
        rPr.append(OxmlElement("w:rtl"))

Side effects of missing run-level RTL

Font fallback to Times New Roman: the most common symptom.
BiDi reordering of punctuation: colons, commas, and parentheses end up in the wrong place. The symptom: "(א)" turns into ")א(".
"Conflicting" numbers inside a Hebrew sequence: "בשנת 2024 פסקנו" ("in 2024 we ruled") can be displayed with the number in the wrong position.

How to verify that RTL was applied

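from docx import Document
from docx.oxml.ns import qn

# Placeholder path (assumption): point this at the exported file to check.
doc = Document("exported.docx")

for p in doc.paragraphs:
    for r in p.runs:
        rPr = r._r.find(qn("w:rPr"))
        has_rtl = rPr is not None and rPr.find(qn("w:rtl")) is not None
        # Report runs that contain Hebrew letters but lack <w:rtl/>.
        if not has_rtl and any('\u0590' <= c <= '\u05FF' for c in r.text):
            print(f"Missing RTL: {r.text[:40]!r}")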

Style-level RTL alone is not enough

A common misconception: "if the style includes <w:rtl/> in its rPr, every run will inherit it." Not so. A style supplies the default for runs that have yet to be created in the Word GUI, but runs created through python-docx get an empty rPr, which does not automatically pick up rtl from the style. That is why it has to be added manually.

Dafna's template as an example

Inspecting word/document.xml of the original template shows that every Hebrew run includes:

<w:r><w:rPr><w:rFonts w:hint="cs"/><w:rtl/></w:rPr><w:t>רקע</w:t></w:r>

<w:rtl/> is there explicitly. We mimic that.
