Add training corpus UI with Nevo proofreading pipeline

- New proofreader service strips Nevo editorial additions (front matter,
  postamble, page headers, watermarks, inline codes) from DOCX/PDF/MD
- PDF pages use Google Vision OCR for clean Hebrew RTL extraction
- New training page at #/training with drag-and-drop upload, automatic
  metadata extraction (decision number, date, categories), reviewable
  preview, and style pattern report grouped by type
- API endpoints: /api/training/{analyze,upload,corpus,patterns,
  analyze-style,analyze-style/status}
- Fix claude_session.query to pipe prompt via stdin, avoiding ARG_MAX
  overflow when analyzing 900K+ char corpus
- CLI scripts for batch proofreading and corpus upload

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-04-11 11:04:58 +00:00
parent ecda95d610
commit 32f18de049
6 changed files with 1960 additions and 3 deletions

View File

@@ -24,6 +24,9 @@ LONG_TIMEOUT = 300 # For complex tasks like block writing
def query(prompt: str, timeout: int = DEFAULT_TIMEOUT, max_turns: int = 1) -> str:
"""Send a prompt to Claude Code headless and return the text response.
Passes the prompt via stdin (not argv) to avoid the OS ARG_MAX limit —
prompts can be 500K+ chars when analyzing a full style corpus.
Args:
prompt: The prompt to send.
timeout: Max seconds to wait.
@@ -36,14 +39,18 @@ def query(prompt: str, timeout: int = DEFAULT_TIMEOUT, max_turns: int = 1) -> st
RuntimeError: If claude CLI is not available or fails.
"""
cmd = [
"claude", "-p", prompt,
"claude", "-p",
"--output-format", "json",
"--max-turns", str(max_turns),
]
try:
result = subprocess.run(
cmd, capture_output=True, text=True, timeout=timeout,
cmd,
input=prompt,
capture_output=True,
text=True,
timeout=timeout,
)
except FileNotFoundError:
raise RuntimeError("Claude CLI not found. Install Claude Code or add 'claude' to PATH.")