feat(training): Style Studio — upload, rich corpus, lessons, curator portrait, chat

Six-phase upgrade of /training from a read-only dashboard into a full
Style Studio for managing Daphna's style corpus.

- Upload Sheet on /training: file → proofread preview → commit (no more
  CLI-only `upload-training` skill).
- Rich corpus metadata: GET /api/training/corpus returns summary, outcome,
  key_principles, page_count, parties (regex), legal_citation, lessons_count.
  PATCH endpoint for chair edits. CorpusDetailDrawer with 4 tabs (details
  /content/lessons/patterns) replaces the bare table row.
- LLM metadata enrichment: style_metadata_extractor + MCP tools
  (style_corpus_enrich, style_corpus_pending_enrichment) fill summary
  /outcome/key_principles via claude_session (free, host-side).
- Per-decision lessons: new decision_lessons table + 4 REST endpoints +
  LessonsTab in drawer; hermes-curator now auto-posts findings as
  decision_lessons(source=curator).
- Curator Portrait tab: prompt rendered with link to Gitea, recent
  curator findings, style_analyzer training prompts, propose-change
  form that writes proposals to data/curator-proposals/ for manual
  chair review (no auto-mutation of the agent file).
- Style chat tab: SSE-streamed conversations with the style agent.
  New host-side pm2 service (legal-chat-service, port 8770) wraps
  claude CLI with stream-json + --resume continuation; FastAPI proxies
  via host.docker.internal. Zero API cost — uses chaim's claude.ai
  subscription. chat_conversations + chat_messages persist history.

Architecture: keeps the existing rule that claude_session only runs
on the host (not the container). The new legal-chat-service is the
canonical bridge between the container and the local CLI for the chat
feature; everything else (upload, metadata, lessons) stays within the
container's existing capabilities.

Audit script (scripts/audit_training_corpus.py) included for verifying
which corpus rows still need enrichment.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:06:22 +00:00
parent 0629f19d5f
commit bb0cd7c6a2
23 changed files with 4568 additions and 75 deletions
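
The "Style chat tab" item in the commit message describes SSE-streamed conversations proxied by FastAPI. As a minimal sketch of what framing one event for the browser could look like, assuming the five event shapes documented in `query_streaming` below: the helper name `to_sse_frame` is hypothetical and does not appear in this commit, and the actual proxy code is not part of this excerpt.

```python
import json


def to_sse_frame(event: dict) -> str:
    """Serialize one event dict as a Server-Sent Events frame.

    Hypothetical helper for illustration only.
    """
    # json.dumps escapes raw newlines inside strings, so the payload always
    # fits on a single `data:` line. ensure_ascii=False keeps Hebrew readable.
    payload = json.dumps(event, ensure_ascii=False)
    # A blank line terminates the frame, as the SSE wire format requires.
    return f"event: {event.get('type', 'message')}\ndata: {payload}\n\n"


frame = to_sse_frame({"type": "text_delta", "text": "partial answer"})
```

Naming the SSE `event:` field after the event type would let a front-end register one listener per type instead of switching on the payload, though a single unnamed `message` stream works just as well.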


@@ -142,3 +142,175 @@ async def query_json(
"""
raw = await query(prompt, timeout=timeout, system=system)
return parse_llm_json(raw)
# ── Streaming + session continuation ────────────────────────────────
async def query_streaming(
    prompt: str,
    *,
    system: str | None = None,
    resume_session_id: str | None = None,
    timeout: int = LONG_TIMEOUT,
    cwd: str | None = None,
):
    """Stream Claude's response as an async iterator of events.

    Wraps `claude -p --output-format=stream-json` (newline-delimited JSON
    objects from the CLI) and translates each line into a small, stable
    shape that the chat service / SSE proxy can forward without leaking
    CLI internals to the browser.

    Event shapes yielded:
        {"type": "session_id", "value": "<uuid>"}    # first event, used for resume
        {"type": "text_delta", "text": "<partial>"}  # incremental assistant text
        {"type": "tool_use", "name": "...", "input": {...}}
        {"type": "error", "message": "..."}
        {"type": "done", "text": "<full response>"}

    The CLI emits a richer stream; we project to this minimal set so the
    front-end can stay stable across CLI upgrades.

    Args:
        prompt: The user message to send.
        system: Optional system instructions. Used only when starting a
            fresh conversation; when resume_session_id is set, the
            session already carries its system prompt.
        resume_session_id: Continue a prior conversation. When given,
            we don't re-send the system prompt; the CLI loads the
            entire conversation history from disk.
        timeout: Hard ceiling, in seconds, on reading the subprocess output.
        cwd: Working directory for the subprocess. None (the default)
            inherits the calling service's working directory.
    """
    if resume_session_id:
        # When resuming, system is already baked into the on-disk session;
        # sending it again would be a no-op at best and confuse the
        # conversation at worst.
        full_prompt = prompt
        cmd = [
            "claude", "-p",
            "--output-format", "stream-json",
            "--verbose",
            "--resume", resume_session_id,
        ]
    else:
        full_prompt = f"{system}\n\n{prompt}" if system else prompt
        cmd = [
            "claude", "-p",
            "--output-format", "stream-json",
            "--verbose",
        ]
    if len(full_prompt) > 200_000:
        logger.warning(
            "Streaming: large prompt (%d chars); may hit CLI input limits",
            len(full_prompt),
        )
    try:
        proc = await asyncio.create_subprocess_exec(
            *cmd,
            stdin=asyncio.subprocess.PIPE,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
            cwd=cwd,
        )
    except FileNotFoundError:
        yield {
            "type": "error",
            "message": (
                "Claude CLI not found on host; legal-chat-service must "
                "run where the `claude` binary is installed (Daphna's host, "
                "not the legal-ai container)."
            ),
        }
        return
    assert proc.stdin is not None  # for type checkers
    assert proc.stdout is not None
    # Send the prompt and close stdin so the CLI knows the user message
    # is complete.
    try:
        proc.stdin.write(full_prompt.encode("utf-8"))
        await proc.stdin.drain()
        proc.stdin.close()
    except (BrokenPipeError, ConnectionResetError):
        # CLI exited before reading the prompt: drain stderr and bail.
        # drain() can raise either exception depending on how the pipe closed.
        stderr_b = await proc.stderr.read() if proc.stderr else b""
        detail = stderr_b.decode("utf-8", errors="replace")[:300]
        yield {"type": "error", "message": f"Claude CLI closed stdin early: {detail}"}
        return
    accumulated_text: list[str] = []
    session_id_emitted = False
    # We are inside a coroutine, so the running loop is the right clock source.
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    try:
        while True:
            remaining = deadline - loop.time()
            if remaining <= 0:
                yield {"type": "error", "message": f"timed out after {timeout}s"}
                break
            try:
                line_b = await asyncio.wait_for(proc.stdout.readline(), timeout=remaining)
            except asyncio.TimeoutError:
                yield {"type": "error", "message": f"stream timed out after {timeout}s"}
                break
            if not line_b:
                # EOF: the CLI closed stdout.
                break
            line = line_b.decode("utf-8", errors="replace").strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                # Stray non-JSON line from CLI: log a snippet for debugging.
                logger.debug("non-JSON stream line: %s", line[:120])
                continue
            # The CLI's stream-json emits several event types. We only
            # care about the ones the chat service forwards.
            t = event.get("type")
            if not session_id_emitted:
                sid = event.get("session_id")
                if sid:
                    session_id_emitted = True
                    yield {"type": "session_id", "value": sid}
            if t == "assistant":
                # event["message"]["content"] is a list of blocks; we extract
                # text blocks and tool_use blocks.
                msg = event.get("message") or {}
                for block in msg.get("content") or []:
                    btype = block.get("type")
                    if btype == "text":
                        text = block.get("text") or ""
                        if text:
                            accumulated_text.append(text)
                            yield {"type": "text_delta", "text": text}
                    elif btype == "tool_use":
                        yield {
                            "type": "tool_use",
                            "name": block.get("name") or "",
                            "input": block.get("input") or {},
                        }
            elif t == "result":
                # Final synthesized result line from the CLI. We already
                # delivered the deltas, so just stop here.
                break
    finally:
        # Always reap the subprocess, even if the consumer stops iterating early.
        if proc.returncode is None:
            try:
                proc.kill()
            except ProcessLookupError:
                pass
        try:
            await proc.wait()
        except Exception:
            pass
    yield {"type": "done", "text": "".join(accumulated_text)}
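
To illustrate how a caller might consume the event stream documented in the `query_streaming` docstring, here is a minimal sketch. It does not call the real function: `fake_stream` is a hypothetical stand-in that yields the documented shapes, and `collect_turn` is an illustrative name, not something defined in this commit.

```python
import asyncio


async def fake_stream():
    """Hypothetical stand-in for query_streaming(); yields the documented shapes."""
    yield {"type": "session_id", "value": "hypothetical-session-id"}
    yield {"type": "text_delta", "text": "Hello, "}
    yield {"type": "text_delta", "text": "world"}
    yield {"type": "done", "text": "Hello, world"}


async def collect_turn(events) -> dict:
    """Fold one assistant turn into the pieces a chat service would persist."""
    session_id = None
    parts: list[str] = []
    errors: list[str] = []
    final = None
    async for ev in events:
        kind = ev.get("type")
        if kind == "session_id":
            session_id = ev["value"]      # value to pass back via --resume
        elif kind == "text_delta":
            parts.append(ev["text"])      # would be forwarded to the browser
        elif kind == "error":
            errors.append(ev["message"])
        elif kind == "done":
            final = ev["text"]
    # Fall back to the joined deltas if the stream ended without "done".
    text = final if final is not None else "".join(parts)
    return {"session_id": session_id, "text": text, "errors": errors}


turn = asyncio.run(collect_turn(fake_stream()))
print(turn["text"])
```

In the real service the `session_id` value would presumably be stored with `update_chat_conversation_session_id` and the final text with `add_chat_message`, both of which appear later in this diff.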


@@ -194,6 +194,55 @@ ALTER TABLE style_corpus ADD COLUMN IF NOT EXISTS appeal_subtype TEXT DEFAULT ''
-- Extend style_patterns with appeal_subtype so style can be analyzed separately per appeal type
ALTER TABLE style_patterns ADD COLUMN IF NOT EXISTS appeal_subtype TEXT DEFAULT '';
-- decision_lessons: per-decision learnings the chair / curator / style_analyzer
-- attaches to a corpus row. The generic legal-decision-lessons.md file stays
-- as the source of truth for cross-corpus patterns; this table stores the
-- granular "what we learned from THIS decision" notes that drive the writer's
-- future drafts and let the curator look up prior observations on the same row.
CREATE TABLE IF NOT EXISTS decision_lessons (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
style_corpus_id UUID NOT NULL REFERENCES style_corpus(id) ON DELETE CASCADE,
lesson_text TEXT NOT NULL,
category TEXT DEFAULT 'general', -- style / structure / lexicon / tabular / general
source TEXT DEFAULT 'manual', -- manual / curator / chair / style_analyzer
applied_to_skill BOOLEAN DEFAULT false, -- has this been promoted into SKILL.md?
created_by TEXT DEFAULT 'chaim',
created_at TIMESTAMPTZ DEFAULT now(),
updated_at TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX IF NOT EXISTS idx_decision_lessons_corpus ON decision_lessons(style_corpus_id);
CREATE INDEX IF NOT EXISTS idx_decision_lessons_applied ON decision_lessons(applied_to_skill);
-- chat_conversations / chat_messages: persistent history for the
-- "שיחה עם הסוכן" ("Chat with the agent") tab on /training. Each conversation can optionally be
-- scoped to a single style_corpus row (when the chair starts a chat
-- "about decision X"). claude_session_id is the value the local claude
-- CLI returns in stream-json — we pass it back via `--resume` on the
-- next message so the model continues the same conversation without
-- re-loading the system prompt every time.
CREATE TABLE IF NOT EXISTS chat_conversations (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
title TEXT NOT NULL DEFAULT 'שיחה חדשה',
style_corpus_id UUID REFERENCES style_corpus(id) ON DELETE SET NULL,
claude_session_id TEXT,
system_prompt_version TEXT DEFAULT 'v1',
created_at TIMESTAMPTZ DEFAULT now(),
last_message_at TIMESTAMPTZ DEFAULT now()
);
CREATE TABLE IF NOT EXISTS chat_messages (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
conversation_id UUID NOT NULL REFERENCES chat_conversations(id) ON DELETE CASCADE,
role TEXT NOT NULL, -- 'user' | 'assistant'
content TEXT NOT NULL,
raw_events JSONB DEFAULT '[]', -- stream-json events for the assistant turn (optional, for debug)
created_at TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX IF NOT EXISTS idx_chat_messages_conv ON chat_messages(conversation_id, created_at);
CREATE INDEX IF NOT EXISTS idx_chat_conv_corpus ON chat_conversations(style_corpus_id);
CREATE INDEX IF NOT EXISTS idx_chat_conv_last ON chat_conversations(last_message_at DESC);
-- qa_results table
CREATE TABLE IF NOT EXISTS qa_results (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
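
The chat tables in this hunk use two different delete rules: `chat_conversations.style_corpus_id` is `ON DELETE SET NULL`, while `chat_messages.conversation_id` is `ON DELETE CASCADE`. The sketch below shows the practical difference. It deliberately swaps PostgreSQL for an in-memory SQLite database so it runs with the standard library only, and it uses TEXT ids and trimmed-down tables; those simplifications are assumptions for illustration, not the project's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript(
    """
    CREATE TABLE style_corpus (id TEXT PRIMARY KEY);
    CREATE TABLE chat_conversations (
        id TEXT PRIMARY KEY,
        style_corpus_id TEXT REFERENCES style_corpus(id) ON DELETE SET NULL
    );
    CREATE TABLE chat_messages (
        id TEXT PRIMARY KEY,
        conversation_id TEXT NOT NULL
            REFERENCES chat_conversations(id) ON DELETE CASCADE,
        role TEXT NOT NULL,
        content TEXT NOT NULL
    );
    INSERT INTO style_corpus VALUES ('corpus-1');
    INSERT INTO chat_conversations VALUES ('conv-1', 'corpus-1');
    INSERT INTO chat_messages VALUES ('m1', 'conv-1', 'user', 'hi');
    INSERT INTO chat_messages VALUES ('m2', 'conv-1', 'assistant', 'hello');
    """
)


def message_count() -> int:
    return conn.execute("SELECT count(*) FROM chat_messages").fetchone()[0]


# Deleting the corpus row keeps the conversation but clears its scope.
conn.execute("DELETE FROM style_corpus WHERE id = 'corpus-1'")
scope_after_corpus_delete = conn.execute(
    "SELECT style_corpus_id FROM chat_conversations WHERE id = 'conv-1'"
).fetchone()[0]
messages_after_corpus_delete = message_count()

# Deleting the conversation removes its messages with it.
conn.execute("DELETE FROM chat_conversations WHERE id = 'conv-1'")
messages_after_conversation_delete = message_count()
```

The design choice this reflects: removing a decision from the style corpus should not destroy chat history about it, whereas messages have no meaning without their conversation.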
@@ -1609,6 +1658,284 @@ async def delete_from_style_corpus(corpus_id: UUID) -> dict:
}
async def get_style_corpus_row(corpus_id: UUID) -> dict | None:
"""Return a single style_corpus row by id, or None if missing."""
pool = await get_pool()
async with pool.acquire() as conn:
row = await conn.fetchrow(
"""
SELECT id, document_id, decision_number, decision_date,
subject_categories, full_text, summary, outcome,
key_principles, practice_area, appeal_subtype, created_at
FROM style_corpus WHERE id = $1
""",
corpus_id,
)
return dict(row) if row else None
async def update_style_corpus_metadata(
corpus_id: UUID,
*,
summary: str | None = None,
outcome: str | None = None,
key_principles: list[str] | None = None,
appeal_subtype: str | None = None,
practice_area: str | None = None,
overwrite: bool = False,
) -> dict:
"""Patch the enriched-metadata columns of a style_corpus row.
By default, only empty columns are filled — passing ``overwrite=True``
is the caller's signal that they intentionally want to replace existing
values (used by the re-extract flow when the chair runs it manually).
"""
pool = await get_pool()
async with pool.acquire() as conn:
existing = await conn.fetchrow(
"SELECT summary, outcome, key_principles, appeal_subtype, practice_area "
"FROM style_corpus WHERE id = $1",
corpus_id,
)
if not existing:
return {"updated": False, "reason": "not found"}
sets: dict = {}
if summary is not None and (overwrite or not (existing["summary"] or "").strip()):
sets["summary"] = summary
if outcome is not None and (overwrite or not (existing["outcome"] or "").strip()):
sets["outcome"] = outcome
if key_principles is not None:
current = existing["key_principles"]
if isinstance(current, str):
try:
current = json.loads(current)
except json.JSONDecodeError:
current = []
if overwrite or not (current or []):
sets["key_principles"] = json.dumps(key_principles)
if appeal_subtype is not None and (overwrite or not (existing["appeal_subtype"] or "").strip()):
sets["appeal_subtype"] = appeal_subtype
if practice_area is not None and (overwrite or not (existing["practice_area"] or "").strip()):
sets["practice_area"] = practice_area
if not sets:
return {"updated": False, "reason": "nothing to update", "fields": []}
cols = list(sets.keys())
set_clause = ", ".join(f"{c} = ${i + 2}" for i, c in enumerate(cols))
values = [sets[c] for c in cols]
await conn.execute(
f"UPDATE style_corpus SET {set_clause} WHERE id = $1",
corpus_id, *values,
)
return {"updated": True, "fields": cols}
# ── decision_lessons (per-corpus row notes) ────────────────────────
async def list_decision_lessons(corpus_id: UUID) -> list[dict]:
pool = await get_pool()
async with pool.acquire() as conn:
rows = await conn.fetch(
"SELECT id, style_corpus_id, lesson_text, category, source, "
" applied_to_skill, created_by, created_at, updated_at "
"FROM decision_lessons WHERE style_corpus_id = $1 "
"ORDER BY created_at DESC",
corpus_id,
)
return [dict(r) for r in rows]
async def add_decision_lesson(
corpus_id: UUID,
*,
lesson_text: str,
category: str = "general",
source: str = "manual",
created_by: str = "chaim",
) -> dict:
pool = await get_pool()
async with pool.acquire() as conn:
row = await conn.fetchrow(
"INSERT INTO decision_lessons "
"(style_corpus_id, lesson_text, category, source, created_by) "
"VALUES ($1, $2, $3, $4, $5) "
"RETURNING id, style_corpus_id, lesson_text, category, source, "
" applied_to_skill, created_by, created_at, updated_at",
corpus_id, lesson_text, category, source, created_by,
)
return dict(row) if row else {}
async def update_decision_lesson(
    lesson_id: UUID,
    *,
    lesson_text: str | None = None,
    category: str | None = None,
    applied_to_skill: bool | None = None,
) -> dict:
    sets: dict = {}
    if lesson_text is not None:
        sets["lesson_text"] = lesson_text
    if category is not None:
        sets["category"] = category
    if applied_to_skill is not None:
        sets["applied_to_skill"] = applied_to_skill
    if not sets:
        return {"updated": False, "reason": "nothing to update"}
    # Column names come from the fixed keys above, never from caller input,
    # so interpolating them into the statement is safe. $1 is the lesson id;
    # the patched columns are bound as $2, $3, ...
    cols = list(sets)
    set_clause = ", ".join(f"{c} = ${i + 2}" for i, c in enumerate(cols))
    # updated_at is always bumped server-side, so it needs no bound parameter.
    set_clause += ", updated_at = now()"
    values = [sets[c] for c in cols]
    pool = await get_pool()
    async with pool.acquire() as conn:
        row = await conn.fetchrow(
            f"UPDATE decision_lessons SET {set_clause} WHERE id = $1 "
            f"RETURNING id, style_corpus_id, lesson_text, category, source, "
            f"          applied_to_skill, updated_at",
            lesson_id, *values,
        )
    if not row:
        return {"updated": False, "reason": "not found"}
    return {"updated": True, **dict(row)}
async def delete_decision_lesson(lesson_id: UUID) -> dict:
pool = await get_pool()
async with pool.acquire() as conn:
result = await conn.execute(
"DELETE FROM decision_lessons WHERE id = $1", lesson_id,
)
# asyncpg returns "DELETE n"
deleted = result.split(" ", 1)[1].strip() if " " in result else "0"
return {"deleted": deleted != "0"}
async def count_decision_lessons_per_corpus() -> dict[str, int]:
"""Map style_corpus.id (str) → lesson count, for badge display in the list."""
pool = await get_pool()
async with pool.acquire() as conn:
rows = await conn.fetch(
"SELECT style_corpus_id, count(*) AS n "
"FROM decision_lessons GROUP BY style_corpus_id"
)
return {str(r["style_corpus_id"]): r["n"] for r in rows}
# ── chat (style agent conversations) ───────────────────────────────
async def create_chat_conversation(
*,
title: str = "שיחה חדשה",
style_corpus_id: UUID | None = None,
system_prompt_version: str = "v1",
) -> dict:
pool = await get_pool()
async with pool.acquire() as conn:
row = await conn.fetchrow(
"INSERT INTO chat_conversations "
"(title, style_corpus_id, system_prompt_version) "
"VALUES ($1, $2, $3) "
"RETURNING id, title, style_corpus_id, claude_session_id, "
" system_prompt_version, created_at, last_message_at",
title, style_corpus_id, system_prompt_version,
)
return dict(row) if row else {}
async def list_chat_conversations(limit: int = 50) -> list[dict]:
pool = await get_pool()
async with pool.acquire() as conn:
rows = await conn.fetch(
"""
SELECT c.id, c.title, c.style_corpus_id, c.claude_session_id,
c.created_at, c.last_message_at,
sc.decision_number,
(SELECT count(*) FROM chat_messages m WHERE m.conversation_id = c.id) AS message_count
FROM chat_conversations c
LEFT JOIN style_corpus sc ON sc.id = c.style_corpus_id
ORDER BY c.last_message_at DESC NULLS LAST
LIMIT $1
""",
limit,
)
return [dict(r) for r in rows]
async def get_chat_conversation(conv_id: UUID) -> dict | None:
pool = await get_pool()
async with pool.acquire() as conn:
row = await conn.fetchrow(
"SELECT id, title, style_corpus_id, claude_session_id, "
" system_prompt_version, created_at, last_message_at "
"FROM chat_conversations WHERE id = $1",
conv_id,
)
return dict(row) if row else None
async def delete_chat_conversation(conv_id: UUID) -> dict:
pool = await get_pool()
async with pool.acquire() as conn:
result = await conn.execute(
"DELETE FROM chat_conversations WHERE id = $1", conv_id,
)
deleted = result.split(" ", 1)[1].strip() if " " in result else "0"
return {"deleted": deleted != "0"}
async def update_chat_conversation_session_id(
conv_id: UUID, claude_session_id: str,
) -> None:
pool = await get_pool()
async with pool.acquire() as conn:
await conn.execute(
"UPDATE chat_conversations SET claude_session_id = $1, "
" last_message_at = now() "
"WHERE id = $2",
claude_session_id, conv_id,
)
async def add_chat_message(
conv_id: UUID,
*,
role: str,
content: str,
raw_events: list | None = None,
) -> dict:
pool = await get_pool()
async with pool.acquire() as conn:
row = await conn.fetchrow(
"INSERT INTO chat_messages "
"(conversation_id, role, content, raw_events) "
"VALUES ($1, $2, $3, $4) "
"RETURNING id, conversation_id, role, content, created_at",
conv_id, role, content, json.dumps(raw_events or []),
)
await conn.execute(
"UPDATE chat_conversations SET last_message_at = now() WHERE id = $1",
conv_id,
)
return dict(row) if row else {}
async def list_chat_messages(conv_id: UUID) -> list[dict]:
pool = await get_pool()
async with pool.acquire() as conn:
rows = await conn.fetch(
"SELECT id, role, content, created_at "
"FROM chat_messages WHERE conversation_id = $1 "
"ORDER BY created_at ASC",
conv_id,
)
return [dict(r) for r in rows]
async def get_style_patterns(pattern_type: str | None = None) -> list[dict]:
pool = await get_pool()
async with pool.acquire() as conn:
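
`update_style_corpus_metadata` in this file combines two ideas: a "fill only empty columns unless `overwrite=True`" policy, and a dynamically built `SET` clause whose placeholders start at `$2` because `$1` is reserved for the row id. The following is a simplified, database-free sketch of that planning step. `plan_update` is a hypothetical name, and unlike the real function it does not decode JSON-encoded `key_principles`.

```python
def plan_update(existing: dict, proposed: dict, *, overwrite: bool = False) -> tuple[str, list]:
    """Return (set_clause, values) for the columns that should be written.

    Hypothetical helper for illustration. Column names must come from a
    trusted, fixed set because they are interpolated into the SQL text.
    """
    cols: list[str] = []
    values: list = []
    for col, new_value in proposed.items():
        if new_value is None:
            continue  # None means "caller did not supply this field"
        current = existing.get(col)
        # Whitespace-only strings, empty lists and NULLs all count as empty.
        is_empty = not current.strip() if isinstance(current, str) else not current
        if overwrite or is_empty:
            cols.append(col)
            values.append(new_value)
    # $1 is the row id in the eventual UPDATE, so columns start at $2.
    set_clause = ", ".join(f"{c} = ${i + 2}" for i, c in enumerate(cols))
    return set_clause, values


existing_row = {"summary": "written by the chair", "outcome": "  ", "key_principles": []}
suggestion = {"summary": "LLM summary", "outcome": "rejected", "key_principles": ["p1"]}

fill_only = plan_update(existing_row, suggestion)
forced = plan_update(existing_row, suggestion, overwrite=True)
```

With the default policy only `outcome` and `key_principles` are planned, so the chair's `summary` survives; with `overwrite=True` all three columns are planned.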


@@ -0,0 +1,195 @@
"""Auto-extract per-decision metadata for a style_corpus row.
Populates the fields that the upload flow leaves empty — summary, outcome,
key_principles, appeal_subtype, practice_area — by asking Claude (via the
local CLI session) to read the proofread full_text and return a structured
JSON blob.
Caller policy (``apply_to_corpus``): by default we **only fill empty
columns**, so chair-edited values are preserved across re-runs. The chair
can force a refresh by passing ``overwrite=True``.
Why this is a separate module from ``precedent_metadata_extractor``:
that one fills the *external* case_law corpus (court rulings, third-party
committee decisions). This one fills the *style* corpus — Daphna's own
decisions used to teach the writer the in-house voice. The two corpora
have different schemas, different prompts, and different downstream
consumers, so coupling them would have been the wrong shortcut.
"""
from __future__ import annotations
import logging
from uuid import UUID
from legal_mcp.services import claude_session, db
logger = logging.getLogger(__name__)
# A single decision typically runs 200K-650K chars. We sample the head
# (where outcome + parties + framing live) and the tail (where the
# operative ruling sits). Picking from both edges keeps the prompt under
# 60K chars — comfortable for any Claude tier.
_HEAD_CHARS = 25_000
_TAIL_CHARS = 15_000
def _build_text_window(full_text: str) -> str:
if len(full_text) <= _HEAD_CHARS + _TAIL_CHARS:
return full_text
head = full_text[:_HEAD_CHARS]
tail = full_text[-_TAIL_CHARS:]
return (
f"{head}\n\n"
f"[... חתך: {len(full_text) - _HEAD_CHARS - _TAIL_CHARS:,} תווים מהאמצע "
f"הושמטו — שמרנו על ההתחלה (טענות + רקע) ועל הסוף (הכרעה + הוצאות) ...]"
f"\n\n{tail}"
)
# Static instructions — go via ``system`` so the SDK path can cache them
# across batch enrichment runs (24+ decisions in one pass).
METADATA_PROMPT = """אתה מסייע משפטי שמקטלג את הקורפוס הסגנוני של דפנה תמיר (יו"ר ועדת ערר).
תפקידך: לקרוא החלטה אחת ולחלץ מטא-דאטה ל-style_corpus — שדות שהמשתמש לא הזין בעת ההעלאה.
**אל תמציא**. אם המידע לא מופיע בטקסט, השאר מחרוזת ריקה או מערך ריק. אסור להסיק עובדות שלא כתובות.
## פלט נדרש
החזר JSON אחד (object אחד — לא array, לא markdown, לא הסברים):
{
"summary": "תקציר עניני ב-2-3 משפטים: מי העורר, מה דרש, מה הוכרע. סגנון יבש, ניטרלי, ללא שיפוט. דוגמה: 'ערר על דחיית בקשה להיתר לתוספת מרפסת בקומה ג׳. דפנה קיבלה את הערר חלקית — אישרה את המרפסת בהקטנה ל-12 מ״ר.'",
"outcome": "התוצאה התמציתית. אחד מאלה (או צירוף קצר): 'קבלה' / 'קבלה חלקית' / 'דחייה' / 'הסתלקות' / 'החזרה לוועדה המקומית'. אם זה לא ברור — מחרוזת ריקה.",
"key_principles": [
"עיקרון משפטי 1 שעולה מההחלטה — משפט אחד, ניסוח מופשט. למשל 'שיקול דעת מוגבל לחריגות בנייה קטנות'.",
"עיקרון 2",
"..."
],
"appeal_subtype": "תת-סוג ערר. ערכים מותרים: 'building_permit' (היתר בנייה / רישוי), 'betterment_levy' (היטל השבחה), 'compensation_197' (פיצויים ס׳ 197), 'use_change' (שימוש חורג), 'tama_38' (תמ\\"א 38), או מחרוזת ריקה אם לא ברור.",
"practice_area": "תחום משפט גנרי. ברירת מחדל: 'appeals_committee'. אם זה במובהק 'planning_law' — סמן.",
"parties_appellant": "שם העורר/ים המרכזיים בהחלטה (אחד או כמה, מופרדים בפסיק). אם זו החלטה מאוחדת — שם הצד המוביל. השאר ריק אם לא ניתן לזהות במדויק.",
"parties_respondent": "שם המשיב/ים. ברירת מחדל לעררי 1xxx ו-8xxx: 'הוועדה המקומית לתכנון ובניה ירושלים' או דומה. השאר ריק אם לא ברור."
}
## כללי איכות
1. **summary** — חייב להזכיר את התוצאה. בלי 'בית המשפט קבע ש...' (אנחנו לא בית משפט). בלי הערכת אישית.
2. **outcome** — קבלה / קבלה חלקית / דחייה / הסתלקות / החזרה לוועדה המקומית. אם דפנה הכריעה חלקית — 'קבלה חלקית'. אסור 'התקבל' או 'נדחה' בלשון פעולה — רק שם פעולה.
3. **key_principles** — 2-5 עקרונות מקסימום. כל אחד משפט אחד. לא ציטוטים מילוליים, אלא תמצות העיקרון.
4. **appeal_subtype** — תמיד פעולה אחת. אם החלטה מערבת כמה תת-סוגים — בחר את העיקרי.
5. **parties_appellant / parties_respondent** — שם בלבד, בלי 'נ׳' או 'נגד'.
החזר רק את ה-JSON. אל תכתוב שום דבר לפניו או אחריו.
"""
async def extract_decision_metadata(corpus_id: UUID | str) -> dict:
"""Run Claude over the row's full_text and return suggested fields.
Does NOT touch the DB. The caller decides what to apply.
"""
if isinstance(corpus_id, str):
corpus_id = UUID(corpus_id)
row = await db.get_style_corpus_row(corpus_id)
if not row:
return {}
full_text = (row.get("full_text") or "").strip()
if not full_text:
return {}
context = (
f"מספר החלטה: {row.get('decision_number') or ''}\n"
f"תאריך: {row.get('decision_date') or ''}\n"
f"תת-סוג נוכחי: {row.get('appeal_subtype') or ''}\n"
f"נושאים מתויגים: {row.get('subject_categories') or ''}"
)
window = _build_text_window(full_text)
user_msg = (
f"## הקלט\n{context}\n\n"
f"--- תחילת ההחלטה ---\n{window}\n--- סוף ההחלטה ---"
)
try:
result = await claude_session.query_json(user_msg, system=METADATA_PROMPT)
except Exception as e:
logger.warning("style_metadata_extractor: query failed: %s", e)
return {}
if not isinstance(result, dict):
logger.warning(
"style_metadata_extractor: expected JSON object, got %s",
type(result).__name__,
)
return {}
out: dict = {}
if isinstance(result.get("summary"), str):
out["summary"] = result["summary"].strip()
if isinstance(result.get("outcome"), str):
out["outcome"] = result["outcome"].strip()
kp = result.get("key_principles") or []
if isinstance(kp, list):
out["key_principles"] = [str(p).strip() for p in kp if str(p).strip()]
if isinstance(result.get("appeal_subtype"), str):
st = result["appeal_subtype"].strip()
# Open enum — but log values outside the documented list so we can
# tighten the prompt later if needed.
known = {
"building_permit", "betterment_levy", "compensation_197",
"use_change", "tama_38", "",
}
if st not in known:
logger.info("style_metadata: unknown appeal_subtype=%r (kept)", st)
out["appeal_subtype"] = st
if isinstance(result.get("practice_area"), str):
out["practice_area"] = result["practice_area"].strip()
# Parties: not stored in the schema today, but worth surfacing in the
# extractor's return value so callers (and the UI's drawer) can display
# them. The list endpoint extracts via regex; LLM output is the
# higher-quality fallback when regex fails.
if isinstance(result.get("parties_appellant"), str):
out["parties_appellant"] = result["parties_appellant"].strip()
if isinstance(result.get("parties_respondent"), str):
out["parties_respondent"] = result["parties_respondent"].strip()
return out
async def extract_and_apply(
    corpus_id: UUID | str, *, overwrite: bool = False,
) -> dict:
    """Convenience: extract → apply → return summary of what changed.

    Idempotent under default ``overwrite=False``: re-runs only fill empty
    fields. Use ``overwrite=True`` to refresh values the chair (or a prior
    extraction) already wrote.
    """
    if isinstance(corpus_id, str):
        corpus_id = UUID(corpus_id)
    suggested = await extract_decision_metadata(corpus_id)
    if not suggested:
        return {"extracted": False, "applied": False, "reason": "no suggestion"}
    # The prompt tells the model to return "" / [] when a field is unknown.
    # Map those to None so that overwrite=True never replaces an existing
    # value with an empty one.
    update_result = await db.update_style_corpus_metadata(
        corpus_id,
        summary=suggested.get("summary") or None,
        outcome=suggested.get("outcome") or None,
        key_principles=suggested.get("key_principles") or None,
        appeal_subtype=suggested.get("appeal_subtype") or None,
        practice_area=suggested.get("practice_area") or None,
        overwrite=overwrite,
    )
    return {
        "extracted": True,
        "applied": update_result.get("updated", False),
        "fields_set": update_result.get("fields", []),
        "suggested": suggested,
    }
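
The extractor's `_build_text_window` keeps the head and tail of a long decision and replaces the middle with a marker. Below is a small standalone sketch of the same head-plus-tail sampling, with the sizes passed as parameters so the behaviour can be checked on short strings. `build_window` and its English marker text are hypothetical; the module itself uses fixed `_HEAD_CHARS` / `_TAIL_CHARS` constants and a Hebrew marker.

```python
def build_window(text: str, head: int, tail: int) -> str:
    """Keep the first `head` and last `tail` characters of `text`.

    Hypothetical illustration of the sampling idea; texts that already fit
    within head + tail are returned unchanged.
    """
    if len(text) <= head + tail:
        return text
    omitted = len(text) - head - tail
    marker = f"[... {omitted:,} chars omitted from the middle ...]"
    # Slice from an explicit start index so tail=0 yields an empty tail
    # (text[-0:] would return the whole string).
    return f"{text[:head]}\n\n{marker}\n\n{text[len(text) - tail:]}"


sample = "A" * 30 + "B" * 50 + "C" * 20
window = build_window(sample, head=30, tail=20)
```

With the module's constants (25,000 head and 15,000 tail characters) the sampled text is capped at 40,000 characters plus the marker, which matches the comment's goal of staying under roughly 60K characters once the instructions are added.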