feat(training): Style Studio — upload, rich corpus, lessons, curator portrait, chat
All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 2m7s

Six-phase upgrade of /training from a read-only dashboard into a full
Style Studio for managing Daphna's style corpus.

- Upload Sheet on /training: file → proofread preview → commit (no more
  CLI-only `upload-training` skill).
- Rich corpus metadata: GET /api/training/corpus returns summary, outcome,
  key_principles, page_count, parties (regex), legal_citation, lessons_count.
  PATCH endpoint for chair edits. CorpusDetailDrawer with 4 tabs
  (details/content/lessons/patterns) replaces the bare table row.
- LLM metadata enrichment: style_metadata_extractor + MCP tools
  (style_corpus_enrich, style_corpus_pending_enrichment) fill
  summary/outcome/key_principles via claude_session (free, host-side).
- Per-decision lessons: new decision_lessons table + 4 REST endpoints +
  LessonsTab in drawer; hermes-curator now auto-posts findings as
  decision_lessons(source=curator).
- Curator Portrait tab: prompt rendered with link to Gitea, recent
  curator findings, style_analyzer training prompts, propose-change
  form that writes proposals to data/curator-proposals/ for manual
  chair review (no auto-mutation of the agent file).
- Style chat tab: SSE-streamed conversations with the style agent.
  New host-side pm2 service (legal-chat-service, port 8770) wraps
  claude CLI with stream-json + --resume continuation; FastAPI proxies
  via host.docker.internal. Zero API cost: uses Chaim's claude.ai
  subscription. chat_conversations + chat_messages persist history.
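
The "send the system prompt once, then resume" behaviour of the chat bullet can be sketched as below. This is a minimal illustration, not the service's actual code: `TurnPlan` and `plan_turn` are hypothetical names, and the only facts taken from the commit are that the first turn carries the full system prompt and later turns continue an existing CLI session via `--resume`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnPlan:
    """What one chat turn would hand to the host-side service (hypothetical shape)."""
    message: str
    system_prompt: Optional[str]      # only set on the first turn
    resume_session_id: Optional[str]  # only set on later turns


def plan_turn(message: str, session_id: Optional[str], system_prompt: str) -> TurnPlan:
    """Ship the system prompt once, then rely on the resumed CLI session."""
    if session_id is None:
        # New conversation: no CLI session exists yet, so send the full prompt.
        return TurnPlan(message, system_prompt, None)
    # Existing conversation: the on-disk session already holds the system context.
    return TurnPlan(message, None, session_id)


first = plan_turn("hello", None, "SYSTEM")
later = plan_turn("and then?", "abc123", "SYSTEM")
```

Keeping the decision in one small function makes it easy to check that a large prompt is never re-sent on a resumed conversation.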

Architecture: keeps the existing rule that claude_session only runs
on the host (not the container). The new legal-chat-service is the
canonical bridge between the container and the local CLI for the chat
feature; everything else (upload, metadata, lessons) stays within the
container's existing capabilities.
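
A rough sketch of the relay step on the container side, under stated assumptions: it assumes the host service hands back one JSON object per line, and it re-frames each object as a Server-Sent Events `data:` frame. `to_sse_frames` and the sample event fields are hypothetical; the real proxy and the real event schema are not shown in this commit.

```python
import json
from typing import Iterable, Iterator


def to_sse_frames(ndjson_lines: Iterable[str]) -> Iterator[str]:
    """Re-frame newline-delimited JSON as Server-Sent Events `data:` frames."""
    for raw in ndjson_lines:
        line = raw.strip()
        if not line:
            continue  # skip blank keep-alive lines
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # drop partial or non-JSON output rather than break the stream
        # One SSE event: a single `data:` field terminated by a blank line.
        # ensure_ascii=False keeps Hebrew text readable on the wire.
        yield "data: " + json.dumps(event, ensure_ascii=False) + "\n\n"


# Made-up sample input: one valid event, one blank line, one garbage line.
frames = list(to_sse_frames(['{"type": "text", "text": "שלום"}', "", "not json"]))
```

Re-serialising each event with `json.dumps` guarantees the payload contains no raw newline, which would otherwise split one event into several SSE fields.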

Audit script (scripts/audit_training_corpus.py) included for verifying
which corpus rows still need enrichment.
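
The check such an audit performs might look like the sketch below. It is not the contents of scripts/audit_training_corpus.py, which this page does not show: `pending_enrichment` and `_is_blank` are hypothetical helpers, the rows are made-up sample data, and only the three field names come from the commit message.

```python
from typing import Any, Dict, List

# Fields the enrichment step is described as filling.
ENRICHED_FIELDS = ("summary", "outcome", "key_principles")


def _is_blank(value: Any) -> bool:
    """Treat None, whitespace-only strings and empty containers as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, tuple, dict)):
        return len(value) == 0
    return False


def pending_enrichment(rows: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map decision_number to its missing fields, for rows that need enrichment."""
    report: Dict[str, List[str]] = {}
    for row in rows:
        missing = [name for name in ENRICHED_FIELDS if _is_blank(row.get(name))]
        if missing:
            report[row["decision_number"]] = missing
    return report


# Illustrative rows only; the decision numbers are invented.
rows = [
    {"decision_number": "A-1", "summary": "ok", "outcome": "accepted", "key_principles": ["p"]},
    {"decision_number": "A-2", "summary": "  ", "outcome": None, "key_principles": []},
]
report = pending_enrichment(rows)
```

Fully enriched rows are left out of the report, so an empty result means nothing is pending.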

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:06:22 +00:00
parent 0629f19d5f
commit bb0cd7c6a2
23 changed files with 4568 additions and 75 deletions

205
web/chat_system_prompt.py Normal file

@@ -0,0 +1,205 @@
"""Compose the system prompt the style-chat agent receives.

The chat runs against the local ``claude`` CLI on the host (via
legal-chat-service). We assemble a once-per-conversation system block
that gives the agent everything it needs to discuss decisions in
Daphna's voice:

  - The style guide (``skills/decision/SKILL.md``): how she writes
  - The lessons file (``docs/legal-decision-lessons.md``): what we've
    learned across the corpus
  - The corpus-analysis report (``docs/corpus-analysis.md``): the
    structural map of 24+ decisions
  - A summary of every style_corpus row (number, date, subjects,
    chars + summary if extracted) so the agent can reason about the
    whole corpus without us shipping all of it inline
  - Optional: when the conversation is scoped to a specific decision
    (``style_corpus_id``), append its full_text so the chat can dive
    into the text directly

Sent **once**, when the conversation is first created. On subsequent
messages the legal-chat-service uses ``claude --resume <session_id>``
and the on-disk CLI session keeps the system context intact, so there
is no need to re-ship the reference files (capped at 85K chars in
total, plus the corpus summary) every turn.
"""

from __future__ import annotations

import logging
import os
from pathlib import Path
from uuid import UUID

from legal_mcp.services import db

logger = logging.getLogger(__name__)

# The reference files live in the repo at known paths. In the
# container they're mounted alongside the code, so resolve relative
# to the parent of the web/ directory (the repo root), unless
# LEGAL_AI_REPO_ROOT overrides it.
_REPO_ROOT = Path(os.environ.get(
    "LEGAL_AI_REPO_ROOT",
    str(Path(__file__).resolve().parent.parent),
))

_SKILLS_PATH = _REPO_ROOT / "skills" / "decision" / "SKILL.md"
_LESSONS_PATH = _REPO_ROOT / "docs" / "legal-decision-lessons.md"
_CORPUS_ANALYSIS_PATH = _REPO_ROOT / "docs" / "corpus-analysis.md"


def _safe_read(path: Path, cap_chars: int = 50_000) -> str:
    """Read a file (UTF-8) or return a marker that it's missing.

    The cap protects against accidentally injecting an enormous file:
    even at 50K, a single source file is the lion's share of the
    system prompt budget.
    """
    try:
        text = path.read_text(encoding="utf-8")
    except FileNotFoundError:
        # Hebrew marker: "(file {name} not found at path {path})"
        return f"(קובץ {path.name} לא נמצא בנתיב {path})"
    except OSError as e:
        logger.warning("could not read %s: %s", path, e)
        # Hebrew marker: "(error reading {name}: {error})"
        return f"(שגיאה בקריאת {path.name}: {e})"
    if len(text) > cap_chars:
        # Hebrew marker: "[... cut at {cap} chars out of {total}]"
        return text[:cap_chars] + f"\n\n[... חתך ב-{cap_chars:,} תווים מתוך {len(text):,}]"
    return text


async def _corpus_summary_block() -> str:
    """Compact one-row-per-decision summary the agent can scan."""
    import json  # local import: only needed when categories arrive as JSON text

    pool = await db.get_pool()
    async with pool.acquire() as conn:
        # coalesce(length(...), 0) keeps the `:,` format below from failing
        # on a row whose full_text is NULL.
        records = await conn.fetch(
            """
            SELECT decision_number, decision_date, appeal_subtype,
                   subject_categories,
                   coalesce(length(full_text), 0) AS chars,
                   coalesce(summary, '') AS summary,
                   coalesce(outcome, '') AS outcome
            FROM style_corpus
            ORDER BY decision_date NULLS LAST
            """
        )
    if not records:
        return "(הקורפוס ריק)"  # "(the corpus is empty)"

    lines = []
    for r in records:
        cats = r["subject_categories"]
        if isinstance(cats, str):
            # The column may come back as JSON text rather than a decoded list.
            try:
                cats = json.loads(cats)
            except json.JSONDecodeError:
                cats = []
        # Only join real sequences; anything else is rendered as no subjects.
        cats_str = ", ".join(str(c) for c in cats) if isinstance(cats, (list, tuple)) else ""
        date_str = str(r["decision_date"]) if r["decision_date"] else ""
        summary = (r["summary"] or "").strip()
        outcome = (r["outcome"] or "").strip()
        # Hebrew labels: תווים = chars, נושאים = subjects,
        # תקציר = summary, תוצאה = outcome.
        head = f"- **{r['decision_number'] or ''}** ({date_str}) [{r['appeal_subtype'] or ''}] · {r['chars']:,} תווים"
        meta = f" נושאים: {cats_str}"
        body = ""
        if summary:
            body = f"\n תקציר: {summary}"
            if outcome:
                body += f" — תוצאה: {outcome}"
        elif outcome:
            body = f"\n תוצאה: {outcome}"
        lines.append(head + "\n" + meta + body)
    return "\n".join(lines)


async def _decision_full_text(corpus_id: UUID) -> str:
    """Return one decision's full text under a header, or "" if not found."""
    pool = await db.get_pool()
    async with pool.acquire() as conn:
        row = await conn.fetchrow(
            "SELECT decision_number, decision_date, full_text "
            "FROM style_corpus WHERE id = $1",
            corpus_id,
        )
    if not row:
        return ""
    # Hebrew header: "Decision {number} ({date})"
    header = f"# החלטה {row['decision_number']} ({row['decision_date']})\n\n"
    return header + (row["full_text"] or "")
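

# English gloss of the Hebrew prompt below (the prompt itself stays in
# Hebrew, since the agent is told to answer in Hebrew):
#   Role: style agent for Adv. Daphna Tamir, chair of the Planning and
#   Building Appeals Committee, Jerusalem District. It helps Chaim, her
#   professional assistant, understand, analyze and sharpen her style.
#   It does not write new decisions; it discusses existing ones, spots
#   patterns, and keeps future writers (the writer agent) true to her voice.
#   Sources: the style guide, the generic corpus lessons (must be relied
#   on, no invented insights), the structural corpus analysis, the list of
#   corpus decisions, and the full text of one decision when the
#   conversation is pinned to a style_corpus_id.
#   Rules: answer in Hebrew; ground every analysis in SKILL.md and the
#   corpus and never invent evidence; ask Chaim to attach a decision that
#   is missing from the corpus; keep new facts about Daphna in conversation
#   memory and suggest recording lasting ones as a decision_lesson
#   (POST /api/training/lessons) or in SKILL.md; no invented persona.
SYSTEM_PROMPT_HEADER = """\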
אתה סוכן הסגנון של עו"ד דפנה תמיר, יו"ר ועדת הערר לתכנון ובניה — מחוז ירושלים.
תפקידך: לעזור לחיים (העוזר המקצועי של דפנה) להבין, לנתח ולחדד את הסגנון
של דפנה. אתה לא כותב החלטות חדשות; אתה דן בסגנון של החלטות קיימות,
מזהה דפוסים, מקפיד שהכותבים העתידיים (ה-writer agent) יישארו נאמנים
לקולה.
יש לך גישה ל:
1. **מדריך הסגנון** של דפנה (skills/decision/SKILL.md) — איך היא כותבת.
2. **הלקחים הגנריים** מהקורפוס (docs/legal-decision-lessons.md) — מה
למדנו לאורך 24+ החלטות. **חובה** להישען על הקבצים האלה כשאתה דן
בסגנון, ולא להמציא תובנות חדשות מהאוויר.
3. **ניתוח הקורפוס** המבני (docs/corpus-analysis.md) — מפת תוכן ופערים.
4. **רשימת ההחלטות בקורפוס** (למטה) — סקירה תמציתית של כל החלטה
שעלתה ל-style_corpus.
5. **טקסט מלא של החלטה ספציפית** (אם השיחה הוצמדה ל-style_corpus_id).
כללי תקשורת:
- כל התשובות בעברית.
- חיים יושב מולך, לא דפנה — אבל המטרה היא לחדד את הסגנון *של דפנה*.
- אם חיים שואל "האם פסקה X מתאימה לסגנון של דפנה?" — תן ניתוח מנומק
שמסתמך על SKILL.md ועל החלטות הקורפוס. אל תמציא ראיות.
- אם אתה צריך החלטה ספציפית שאין בקורפוס — הודע לחיים שיצרף אותה.
- אם חיים אומר לך משהו חדש על דפנה ("דפנה אומרת לעולם אל תפתח החלטה
במילה X") — שמור את זה בזיכרון השיחה; אם זה מצדיק תיעוד קבוע, הצע
לחיים להוסיף את זה כ-decision_lesson (POST /api/training/lessons)
או כתוספת ל-SKILL.md.
- אל תיתן לעצמך אישיות מומצאת — אתה כלי-עזר מקצועי, לא חבר.
"""


async def build_system_prompt(
    *,
    corpus_id: UUID | None = None,
    include_corpus_summary: bool = True,
) -> str:
    """Assemble the full system prompt for a new chat conversation.

    Args:
        corpus_id: When set, the full_text of that decision is appended
            so the chat can dive into the text.
        include_corpus_summary: Set False for low-context chats (e.g.
            a quick "what does Daphna do at the end of a betterment-levy
            decision?", where there is no need to ship 24 summaries).
    """
    parts: list[str] = [SYSTEM_PROMPT_HEADER]

    # Section heading: "Style guide"
    parts.append("\n## מדריך הסגנון (skills/decision/SKILL.md)\n")
    parts.append(_safe_read(_SKILLS_PATH, cap_chars=40_000))

    # Section heading: "Lessons from the corpus"
    parts.append("\n\n## לקחים מהקורפוס (docs/legal-decision-lessons.md)\n")
    parts.append(_safe_read(_LESSONS_PATH, cap_chars=30_000))

    # Section heading: "Structural corpus analysis"
    parts.append("\n\n## ניתוח קורפוס מבני (docs/corpus-analysis.md)\n")
    parts.append(_safe_read(_CORPUS_ANALYSIS_PATH, cap_chars=15_000))

    if include_corpus_summary:
        # Section heading: "List of decisions in the style corpus"
        parts.append("\n\n## רשימת ההחלטות בקורפוס הסגנון\n")
        try:
            parts.append(await _corpus_summary_block())
        except Exception as e:
            logger.warning("corpus summary failed: %s", e)
            # "(error loading the corpus list)"
            parts.append("(שגיאה בטעינת רשימת הקורפוס)")

    if corpus_id is not None:
        # Section heading: "The specific decision under discussion"
        parts.append("\n\n## ההחלטה הספציפית בדיון (full_text)\n")
        try:
            txt = await _decision_full_text(corpus_id)
            if txt:
                parts.append(txt[:200_000])  # hard cap
            else:
                # "(decision not found, check the corpus_id)"
                parts.append("(לא נמצאה החלטה — בדוק את ה-corpus_id)")
        except Exception as e:
            logger.warning("decision full_text failed: %s", e)
            # "(error loading the decision)"
            parts.append("(שגיאה בטעינת ההחלטה)")

    return "\n".join(parts)
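
To get a feel for how large the assembled prompt can grow, the caps used in `build_system_prompt` can be added up as in the sketch below. The three file caps and the 200,000-char decision cap come from the code above; `worst_case_chars` is a hypothetical helper, and the header and summary sizes passed in are illustrative guesses, since both vary at runtime.

```python
# Caps copied from build_system_prompt.
FILE_CAPS = {
    "SKILL.md": 40_000,
    "legal-decision-lessons.md": 30_000,
    "corpus-analysis.md": 15_000,
}
DECISION_CAP = 200_000


def worst_case_chars(header_chars: int, summary_chars: int, with_decision: bool) -> int:
    """Upper bound on prompt size, ignoring section headings and truncation notes."""
    total = header_chars + sum(FILE_CAPS.values()) + summary_chars
    if with_decision:
        total += DECISION_CAP
    return total


reference_files_cap = sum(FILE_CAPS.values())
# Illustrative inputs: a 2,000-char header and a 10,000-char corpus summary.
scoped = worst_case_chars(header_chars=2_000, summary_chars=10_000, with_decision=True)
unscoped = worst_case_chars(header_chars=2_000, summary_chars=10_000, with_decision=False)
```

The decision text dominates a scoped conversation, which is one reason to send the prompt only when the conversation is created.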