Improve document processing pipeline and agent workflows

- Add delete_document_chunks for reprocessing, save extracted text to disk
- Expand case directory structure (original/extracted/proofread/backup)
- Update classifier patterns (תגובה, הודעת עמדה)
- Fix proofreader agent paths for new directory layout
- Update HEARTBEAT to notify on every task completion
- Improve bidi_table with LRE/PDF directional embedding
- Add Paperclip project verification and auto-close setup issue
- Add auto-sync-cases.sh for Gitea synchronization

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 16:45:49 +00:00
parent 63c9ca184b
commit 3f759d3610
10 changed files with 164 additions and 19 deletions
--- a/.claude/agents/HEARTBEAT.md
+++ b/.claude/agents/HEARTBEAT.md
@@ -69,16 +69,17 @@ python3 /home/chaim/legal-ai/scripts/notify.py \
  "תוכן ההודעה עם סיכום מה נדרש"
 ```

-**מתי לשלוח:**
+**מתי לשלוח — תמיד:**
+- **סיום כל משימה** — עם סיכום קצר של מה בוצע
 - בקשה לקביעת תוצאה (דחייה/קבלה/חלקית)
 - בקשה לאישור כיוון נימוק
 - דוח QA שנכשל (צריך החלטה על תיקונים)
 - החלטה מוכנה לביקורת דפנה
 - כל מצב שדורש פעולה אנושית ולא יכול להתקדם לבד
+- שגיאה שלא ניתן לפתור ללא התערבות

 **מתי לא לשלוח:**
- עדכוני סטטוס רגילים
- סיום משימה שלא דורשת תגובה
+- עדכוני סטטוס ביניים (רק בסיום)
 - שגיאות טכניות שאפשר לפתור לבד

 ## 6. Release
--- a/.claude/agents/legal-analyst.md
+++ b/.claude/agents/legal-analyst.md
@@ -183,9 +183,9 @@ tools:
 1. [שאלה עקרונית — "האם..."]
 2. [שאלה יישומית — "מהם..."]

-**מילות מפתח לחיפוש:**
- nevo: "ביטוי" ו "ביטוי" ו "ועדת ערר"
- law-mate: מילה1 מילה2 מילה3
+**חיפוש תקדימים:**
+- nevo (קלאסי): "ביטוי" ו "ביטוי" ו "ועדת ערר"
+- nevo AI / law-mate: [השאלות המשפטיות מלמעלה — שאלה עקרונית + יישומית]

 **חקיקה רלוונטית:**
 - סעיף X לחוק...
--- a/.claude/agents/legal-proofreader.md
+++ b/.claude/agents/legal-proofreader.md
@@ -42,7 +42,7 @@ tools:
 1. טען את מילון ראשי התיבות: `/home/chaim/legal-ai/data/abbreviations.json`
 2. **סדר החלפה:** ארוכים לפני קצרים (למניעת החלפה חלקית)
 3. לכל מסמך:
-   - קרא את קובץ ה-MD מהדיסק (מצא אותו ב-`data/cases/` לפי הנתיב)
+   - קרא את קובץ הטקסט מתיקיית `documents/extracted/` בתיק (קובץ `.txt` עם אותו שם כמו ה-PDF המקורי)
   - החלף כל מופע של ראשי תיבות שבורים (מפתחות המילון) בצורה הנכונה (ערכי המילון)
   - ספור כמה החלפות בוצעו

@@ -58,13 +58,13 @@ tools:
 **תקן** רק מה שאתה בטוח בו (90%+). אם לא בטוח — סמן `[?]` ליד המקום הבעייתי.

 ### שלב 4: שמירה
-1. **גיבוי**: שמור עותק מקורי כ-`{filename}.pre-proofread.md`
-2. **כתוב** את הגרסה המתוקנת לקובץ ה-MD המקורי
+1. **גיבוי**: העתק את הקובץ המקורי מ-`extracted/` לתיקיית `documents/backup/` עם סיומת `.pre-proofread.txt`
+2. **כתוב** את הגרסה המתוקנת לתיקיית `documents/proofread/` (עם אותו שם קובץ כמו ב-`extracted/`)
 3. עדכן את מסד הנתונים — שנה `extraction_status` ל-`proofread`:
 ```bash
 PGPASSWORD="${PGPASSWORD:-$(grep DB_PASSWORD /home/chaim/.env | cut -d= -f2)}" \
 psql -h localhost -p 5432 -U "${DB_USER:-legal_ai}" -d "${DB_NAME:-legal_ai}" \
-c "UPDATE documents SET extraction_status = 'proofread', extracted_text = pg_read_file('/path/to/file.md') WHERE id = '{doc_id}';"
+-c "UPDATE documents SET extraction_status = 'proofread', extracted_text = pg_read_file('/path/to/file.txt') WHERE id = '{doc_id}';"
 ```
 אם עדכון DB לא אפשרי, עדכן רק את הקובץ ודווח.

@@ -90,7 +90,7 @@ psql -h localhost -p 5432 -U "${DB_USER:-legal_ai}" -d "${DB_NAME:-legal_ai}" \
 ## כללים קריטיים

 1. **אל תשנה תוכן משפטי** — רק תיקוני OCR. אם מילה נראית מוזרה אבל היא מונח משפטי — אל תגע
-2. **אל תדרוס בלי גיבוי** — תמיד `.pre-proofread.md` לפני שינוי
+2. **אל תדרוס בלי גיבוי** — תמיד העתק ל-`backup/` לפני שינוי
 3. **ראשי תיבות ארוכים קודם** — `נתבייע` (5 תווים) לפני `עייד` (3 תווים)
 4. **דווח מקומות מסופקים** — סמן `[?]` ותן לאדם להחליט
 5. **אל תמציא טקסט** — אם חסר משהו, סמן `[...]` ואל תנחש
--- a/mcp-server/src/legal_mcp/services/db.py
+++ b/mcp-server/src/legal_mcp/services/db.py
@@ -687,6 +687,16 @@ async def update_decision(decision_id: UUID, **fields) -> None:

 # ── Chunks & Vectors ───────────────────────────────────────────────

+async def delete_document_chunks(document_id: UUID) -> int:
+    """Delete all chunks for a document (used before reprocessing)."""
+    pool = await get_pool()
+    async with pool.acquire() as conn:
+        result = await conn.execute(
+            "DELETE FROM document_chunks WHERE document_id = $1", document_id
+        )
+        return int(result.split()[-1])  # e.g. "DELETE 5" -> 5
+
+
 async def store_chunks(
    document_id: UUID,
    case_id: UUID | None,
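
Review note on `delete_document_chunks`: asyncpg's `Connection.execute` returns the command status tag as a string (for example `DELETE 5`), which is why the function parses the last token. A minimal sketch of a more defensive parse follows; the helper name `rows_affected` is hypothetical and not part of this commit.

```python
def rows_affected(status_tag: str) -> int:
    """Return the row count from a command status tag such as "DELETE 5".

    Hypothetical helper, not part of the commit. Tags such as "INSERT 0 3"
    also end with the row count; anything without a trailing number yields 0.
    """
    parts = status_tag.split()
    if parts and parts[-1].isdigit():
        return int(parts[-1])
    return 0
```

Returning 0 for an unexpected tag keeps a reprocessing run from failing on a parsing detail; whether that is preferable to raising is a design choice for the caller.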
--- a/mcp-server/src/legal_mcp/services/local_classifier.py
+++ b/mcp-server/src/legal_mcp/services/local_classifier.py
@@ -19,7 +19,7 @@ logger = logging.getLogger(__name__)
 _FILENAME_RULES: list[tuple[str, str, float]] = [
    # (regex pattern on filename, doc_type, confidence)
    (r"כתב.ערר|כתב-ערר", "appeal", 1.0),
-    (r"תשובה|תשובת|תגובת|השלמת.טיעון|בקשה.להשלמת", "response", 1.0),
+    (r"תשובה|תשובת|תגובה|תגובת|השלמת.טיעון|בקשה.להשלמת|הודעת.עמדה", "response", 1.0),
    (r"פרוטוקול", "protocol", 1.0),
    (r"החלטת?.ביניים|החלטה.לתיקון", "decision", 0.95),
    (r"הוראות.תכנית|תכנית", "plan", 1.0),
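
Review note on the updated `response` rule: the `.` in patterns such as `הודעת.עמדה` matches any single character, so a space, underscore, or hyphen between the two words is accepted. A small sketch of how the new pattern behaves, assuming the classifier applies it with `re.search` (the hunk does not show this) and using a hypothetical helper `is_response`:

```python
import re

# Pattern copied from the updated "response" rule in _FILENAME_RULES.
RESPONSE_RE = re.compile(
    r"תשובה|תשובת|תגובה|תגובת|השלמת.טיעון|בקשה.להשלמת|הודעת.עמדה"
)


def is_response(filename: str) -> bool:
    """Hypothetical helper: True if the filename matches the response rule."""
    return RESPONSE_RE.search(filename) is not None
```

The file names used to exercise this are made up for illustration; real case files may follow other naming conventions.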
--- a/mcp-server/src/legal_mcp/services/processor.py
+++ b/mcp-server/src/legal_mcp/services/processor.py
@@ -3,6 +3,7 @@
 from __future__ import annotations

 import logging
+from pathlib import Path
 from uuid import UUID

 from legal_mcp.services import chunker, db, embeddings, extractor, references_extractor
@@ -37,6 +38,17 @@ async def process_document(document_id: UUID, case_id: UUID) -> dict:
            page_count=page_count,
        )

+        # Save extracted text to documents/extracted/ (best-effort, non-fatal)
+        try:
+            original_path = Path(doc["file_path"])
+            extracted_dir = original_path.parent.parent / "extracted"
+            extracted_dir.mkdir(parents=True, exist_ok=True)
+            txt_path = extracted_dir / (original_path.stem + ".txt")
+            txt_path.write_text(text, encoding="utf-8")
+            logger.info("Saved extracted text to %s", txt_path)
+        except Exception as e:
+            logger.warning("Failed to save text file (non-fatal): %s", e)
+
        # Step 1.5: Classify document — local rules first, Claude Code headless fallback
        classification_result = {}
        try:
--- a/mcp-server/src/legal_mcp/tools/cases.py
+++ b/mcp-server/src/legal_mcp/tools/cases.py
@@ -62,7 +62,12 @@ async def case_create(
    # Initialize git repo for the case
    case_dir = config.find_case_dir(case_number)
    case_dir.mkdir(parents=True, exist_ok=True)
-    (case_dir / "documents").mkdir(exist_ok=True)
+    docs_dir = case_dir / "documents"
+    docs_dir.mkdir(exist_ok=True)
+    (docs_dir / "original").mkdir(exist_ok=True)
+    (docs_dir / "extracted").mkdir(exist_ok=True)
+    (docs_dir / "proofread").mkdir(exist_ok=True)
+    (docs_dir / "backup").mkdir(exist_ok=True)
    (case_dir / "drafts").mkdir(exist_ok=True)

    # Save case metadata
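
Review note on the new layout: `processor.py` derives the `extracted/` directory from `original_path.parent.parent`, which only lands inside `documents/` when the source file sits in `documents/original/`. The sketch below shows how the per-stage paths relate under that assumption. The function name `stage_paths` is hypothetical, and the backup file name is one reading of the proofreader instructions (a `.pre-proofread.txt` suffix), not something this commit defines in code.

```python
from pathlib import Path


def stage_paths(original: Path) -> dict[str, Path]:
    """Map an original document to its extracted/proofread/backup text files.

    Assumes `original` lives in <case>/documents/original/ (hypothetical helper).
    """
    docs_dir = original.parent.parent  # <case>/documents
    stem = original.stem               # file name without the .pdf extension
    return {
        "extracted": docs_dir / "extracted" / f"{stem}.txt",
        "proofread": docs_dir / "proofread" / f"{stem}.txt",
        "backup": docs_dir / "backup" / f"{stem}.pre-proofread.txt",
    }
```

Documents uploaded before this change may still sit directly under `documents/`, in which case `parent.parent` resolves to the case directory rather than `documents/`.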
--- a/scripts/auto-sync-cases.sh
+++ b/scripts/auto-sync-cases.sh
@@ -0,0 +1,37 @@
+#!/bin/bash
+# Auto-sync case repos to Gitea
+# Runs via crontab every minute, commits and pushes any changes found.
+
+CASES_DIR="/home/chaim/legal-ai/data/cases"
+LOG="/home/chaim/legal-ai/data/.auto-sync.log"
+export GIT_AUTHOR_NAME="Ezer Mishpati" GIT_AUTHOR_EMAIL=legal@local GIT_COMMITTER_NAME="Ezer Mishpati" GIT_COMMITTER_EMAIL=legal@local GIT_TERMINAL_PROMPT=0
+
+for status_dir in "$CASES_DIR"/new "$CASES_DIR"/in-progress "$CASES_DIR"/completed; do
+    [ -d "$status_dir" ] || continue
+    for case_dir in "$status_dir"/*/; do
+        [ -d "$case_dir/.git" ] || continue
+
+        cd "$case_dir" || continue
+
+        # Check for any changes (modified, new, deleted)
+        changes=$(git status --porcelain 2>/dev/null)
+        [ -z "$changes" ] && continue
+
+        # Stage all changes
+        git add -A 2>/dev/null
+
+        # Build commit message from changed files
+        changed_files=$(git diff --cached --name-only 2>/dev/null | head -5)
+        count=$(git diff --cached --name-only 2>/dev/null | wc -l)
+        case_name=$(basename "$case_dir")
+        msg="סנכרון אוטומטי — ${count} קבצים שונו"
+
+        # Commit (identity and GIT_TERMINAL_PROMPT come from the exported variables above)
+        git commit -m "$msg" --quiet 2>/dev/null
+        if [ $? -eq 0 ]; then
+            # Push (best-effort; errors are ignored)
+            git push origin main --quiet 2>/dev/null
+            echo "$(date '+%Y-%m-%d %H:%M:%S') | $case_name | $count files synced" >> "$LOG"
+        fi
+    done
+done
--- a/scripts/bidi_table.py
+++ b/scripts/bidi_table.py
@@ -1,8 +1,8 @@
 #!/usr/bin/env python3
 """BiDi-safe box-drawing table renderer for mixed Hebrew/English terminal output.

-Uses LRM (Left-to-Right Mark, U+200E) before box-drawing characters to prevent
-the BiDi algorithm from breaking table alignment when Hebrew text is present.
+Uses Unicode directional marks to prevent the BiDi algorithm from breaking
+table alignment when Hebrew text is present.

 Usage as module:
    from scripts.bidi_table import bidi_table
@@ -14,14 +14,25 @@ Usage from CLI:

 from __future__ import annotations

-LRM = "\u200E"  # Left-to-Right Mark — invisible, prevents BiDi reordering
+import re
+
+LRM = "\u200E"  # Left-to-Right Mark
+RLM = "\u200F"  # Right-to-Left Mark
+LRE = "\u202A"  # Left-to-Right Embedding
+PDF = "\u202C"  # Pop Directional Formatting
+
+_HEB_RE = re.compile(r'[\u0590-\u05FF]')
+
+
+def _has_hebrew(text: str) -> bool:
+    return bool(_HEB_RE.search(text))


 def bidi_table(headers: list[str], rows: list[list[str]]) -> str:
    """Render a box-drawing table safe for mixed RTL/LTR terminal display."""
    ncols = len(headers)

-    # Calculate column widths
+    # Calculate column widths (visual length, not counting bidi marks)
    col_widths = [len(h) for h in headers]
    for row in rows:
        for i, cell in enumerate(row[:ncols]):
@@ -35,8 +46,10 @@ def bidi_table(headers: list[str], rows: list[list[str]]) -> str:
        for i in range(ncols):
            cell = cells[i] if i < len(cells) else ""
            padded = cell + " " * max(0, col_widths[i] - len(cell))
-            parts.append(" " + padded + " ")
-        return LRM + "│" + (LRM + "│").join(parts) + LRM + "│"
+            # Wrap each cell: LRE forces left-to-right context for the cell,
+            # so box-drawing chars stay in place. PDF closes the embedding.
+            parts.append(LRE + " " + padded + " " + PDF)
+        return LRM + "│" + ("│").join(parts) + "│"

    lines = [hline("┌", "┬", "┐")]
    lines.append(dataline(headers))
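
Review note on the LRE/PDF change: each cell is now wrapped in its own left-to-right embedding, so Hebrew text is reordered inside the cell while the `│` separators stay outside the embedding. A standalone sketch of the same idea follows. The names `render_row` and `strip_marks` are hypothetical, and widths are measured with `len`, which counts code points rather than terminal columns.

```python
LRM = "\u200E"  # Left-to-Right Mark
LRE = "\u202A"  # Left-to-Right Embedding
PDF = "\u202C"  # Pop Directional Formatting


def render_row(cells: list[str], widths: list[int]) -> str:
    """Render one table row with every cell wrapped in LRE ... PDF."""
    parts = []
    for cell, width in zip(cells, widths):
        padded = cell.ljust(width)  # pad before adding the invisible marks
        parts.append(LRE + " " + padded + " " + PDF)
    return LRM + "│" + "│".join(parts) + "│"


def strip_marks(text: str) -> str:
    """Remove the directional marks to recover the visible characters."""
    return text.translate({ord(ch): None for ch in (LRM, LRE, PDF)})
```

Padding is applied before wrapping, so the invisible marks never count toward the column width.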
--- a/web/paperclip_client.py
+++ b/web/paperclip_client.py
@@ -84,6 +84,9 @@ async def create_project(
        # Link issue to legal-ai case via plugin state
        await _link_case_to_issue(conn, issue_id, case_number)

+        # Verify project creation and close the setup issue
+        await _verify_and_close_setup_issue(conn, project_id, issue_id, identifier, case_number)
+
        return {
            "id": project_id,
            "company_id": company_id,
@@ -140,6 +143,70 @@ async def _link_case_to_issue(conn: asyncpg.Connection, issue_id: str, case_numb
    logger.info("Linked issue %s to case %s via plugin state", issue_id, case_number)


+async def _verify_and_close_setup_issue(
+    conn: asyncpg.Connection,
+    project_id: str,
+    issue_id: str,
+    identifier: str,
+    case_number: str,
+) -> None:
+    """Verify the project was created correctly, then transition the setup issue to done."""
+    # Move to in_progress while verifying
+    await conn.execute(
+        "UPDATE issues SET status = 'in_progress', started_at = now() WHERE id = $1",
+        issue_id,
+    )
+    logger.info("%s: בביצוע — מאמת יצירת פרויקט", identifier)
+
+    # Verify: project exists, issue is linked, plugin state exists
+    checks = []
+
+    project = await conn.fetchrow("SELECT id, name FROM projects WHERE id = $1::uuid", project_id)
+    checks.append(("פרויקט נוצר", project is not None))
+
+    issue = await conn.fetchrow(
+        "SELECT id, project_id FROM issues WHERE id = $1 AND project_id = $2::uuid",
+        issue_id, project_id,
+    )
+    checks.append(("משימה משויכת לפרויקט", issue is not None))
+
+    plugin_link = await conn.fetchrow(
+        "SELECT value_json FROM plugin_state WHERE scope_id = $1 AND state_key = 'legal-case-number'",
+        issue_id,
+    )
+    checks.append(("קישור למערכת המשפטית", plugin_link is not None))
+
+    all_ok = all(ok for _, ok in checks)
+    report_lines = [f"{'✓' if ok else '✕'} {name}" for name, ok in checks]
+    report = "\n".join(report_lines)
+
+    if all_ok:
+        await conn.execute(
+            "UPDATE issues SET status = 'done', completed_at = now() WHERE id = $1",
+            issue_id,
+        )
+        # Document the verification in a comment
+        await conn.execute(
+            """INSERT INTO issue_comments (id, company_id, issue_id, body)
+               VALUES ($1, (SELECT company_id FROM issues WHERE id = $2), $2,
+               $3)""",
+            str(uuid.uuid4()), issue_id,
+            f"## אימות יצירת פרויקט — ערר {case_number}\n\n{report}\n\nהפרויקט נוצר בהצלחה. משימה נסגרה אוטומטית.",
+        )
+        logger.info("%s: הושלם — פרויקט אומת ונסגר", identifier)
+    else:
+        # Leave in_progress with a warning comment
+        failed = [name for name, ok in checks if not ok]
+        await conn.execute(
+            """INSERT INTO issue_comments (id, company_id, issue_id, body)
+               VALUES ($1, (SELECT company_id FROM issues WHERE id = $2), $2,
+               $3)""",
+            str(uuid.uuid4()), issue_id,
+            f"## אימות יצירת פרויקט — ערר {case_number}\n\n{report}\n\n⚠️ בדיקות שנכשלו: {', '.join(failed)}",
+        )
+        logger.warning("%s: אימות נכשל — %s", identifier, ", ".join(failed))
+
+
 async def get_project_url(case_number: str) -> str | None:
    """Find existing Paperclip project for a case number."""
    conn = await asyncpg.connect(PAPERCLIP_DB_URL)