feat(training): Style Studio — upload, rich corpus, lessons, curator portrait, chat
All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 2m7s

Six-phase upgrade of /training from a read-only dashboard into a full
Style Studio for managing Daphna's style corpus.

- Upload Sheet on /training: file → proofread preview → commit (no more
  CLI-only `upload-training` skill).
- Rich corpus metadata: GET /api/training/corpus returns summary, outcome,
  key_principles, page_count, parties (regex), legal_citation, lessons_count.
  PATCH endpoint for chair edits. CorpusDetailDrawer with 4 tabs (details
  /content/lessons/patterns) replaces the bare table row.
- LLM metadata enrichment: style_metadata_extractor + MCP tools
  (style_corpus_enrich, style_corpus_pending_enrichment) fill summary
  /outcome/key_principles via claude_session (free, host-side).
- Per-decision lessons: new decision_lessons table + 4 REST endpoints +
  LessonsTab in drawer; hermes-curator now auto-posts findings as
  decision_lessons(source=curator).
- Curator Portrait tab: prompt rendered with link to Gitea, recent
  curator findings, style_analyzer training prompts, propose-change
  form that writes proposals to data/curator-proposals/ for manual
  chair review (no auto-mutation of the agent file).
- Style chat tab: SSE-streamed conversations with the style agent.
  New host-side pm2 service (legal-chat-service, port 8770) wraps
  claude CLI with stream-json + --resume continuation; FastAPI proxies
  via host.docker.internal. Zero API cost — uses chaim's claude.ai
  subscription. chat_conversations + chat_messages persist history.

Architecture: keeps the existing rule that claude_session only runs
on the host (not the container). The new legal-chat-service is the
canonical bridge between the container and the local CLI for the chat
feature; everything else (upload, metadata, lessons) stays within the
container's existing capabilities.

Audit script (scripts/audit_training_corpus.py) included for verifying
which corpus rows still need enrichment.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
2026-05-27 10:06:22 +00:00
parent 0629f19d5f
commit bb0cd7c6a2
23 changed files with 4568 additions and 75 deletions

View File

@@ -35,6 +35,7 @@
| `compute_ndcg.py` | python | חישוב nDCG@10 על `search_relevance_feedback` (TaskMaster #50, Stage C). aggregation לפי `search_type` ולפי שבוע, כולל top-cited case_law ו-coverage %. דגלים: `--k 10`, `--weeks 12`, `--pretty`. read-only, פלט JSON. משמש גם את `GET /api/admin/rag-metrics` (מיובא inline) — שינוי חתימה ב-`compute()` ישבור את ה-endpoint | ידני / cron עתידי לדיווח שבועי |
| `backfill_multimodal_precedents.py` | python | Backfill voyage-multimodal-3 page embeddings על רשומות `case_law` (external_upload + internal_committee) שחסרות `precedent_image_embeddings`. בונה אינדקס קבצים מ-`data/precedent-library/` ו-`data/internal-decisions/`, מנסה התאמה לפי tokens של מספרי תיק (כולל parts-match לפורמטים שונים של Nevo doc-id). מדלג על רשומות בלי קובץ-מקור או עם MD בלבד (PyMuPDF לא מרנדר MD). תומך `--dry-run` (default) / `--apply` / `--only external_upload\|internal_committee` / `--limit N`. רץ בקונטיינר (יש `/data` + Voyage env). **הופעל 2026-05-26**: 70 חסרים → 26 backfilled (503 pages, ~$0.21 voyage tokens), 44 אין-קובץ-מקור. ניתן להריץ שוב אחרי שיועלו עוד PDF/DOCX לספרייה | ידני |
| `monitor_halacha_quality.py` | python | מנטר איכות חילוץ הלכות. בודק drift של `avg(confidence)` בין baseline היסטורי לחלון אחרון. מחזיר JSON מטריקות + alert ב-stderr אם drift > threshold (ברירת מחדל 5%). 2 סדרות: trusted (approved+published) ו-all_extracted. תומך `--window N` / `--threshold X` / `--min-sample N` / `--silent` / `--exit-on-alert`. רץ ב-container או מקומית עם `mcp-server/.venv` (אין תלות ב-LLM, רק SQL). **תזמון מומלץ**: `0 8 * * 1` (יום ראשון 08:00, שבועי) | `0 8 * * 1` (לתזמן) |
| `audit_training_corpus.py` | python | audit של `style_corpus` — לכל החלטה: שדות מטא-דאטה מאוכלסים (`summary`/`outcome`/`key_principles`/`appeal_subtype`/`subject_categories`), קישור ל-`documents` (FK + chunks + embeddings). מפיק `data/audit/corpus-YYYY-MM-DD.json` + summary בקונסול. דרוש `POSTGRES_URL` או POSTGRES_*. אין תלויות חיצוניות מלבד asyncpg. **רץ מהמכונה המקומית** (לא קונטיינר) — חיבור ישיר ל-Postgres :5433 | ידני / קדם-עבודה לפני enrichment של מטא-דאטה |
## תיקיית `.archive/` — סקריפטים שהושלמו

196
scripts/audit_training_corpus.py Executable file
View File

@@ -0,0 +1,196 @@
#!/usr/bin/env python
"""Audit the style_corpus table — list each decision with what's populated and what's missing.
Produces a JSON report at data/audit/corpus-YYYY-MM-DD.json so we can see at a glance
which corpus entries lack summary/outcome/key_principles/appeal_subtype/chunks/embeddings.
Run with the mcp-server venv (has asyncpg):
POSTGRES_URL=postgres://... ./mcp-server/.venv/bin/python scripts/audit_training_corpus.py
Without POSTGRES_URL, falls back to the per-field env vars used by web/mcp-server config.
"""
from __future__ import annotations
import asyncio
import json
import os
import re
import sys
from datetime import UTC, date, datetime
from pathlib import Path
import asyncpg
def _build_dsn() -> str:
if url := os.environ.get("POSTGRES_URL"):
return url
return (
f"postgres://{os.environ.get('POSTGRES_USER', 'legal_ai')}:"
f"{os.environ.get('POSTGRES_PASSWORD', '')}@"
f"{os.environ.get('POSTGRES_HOST', '127.0.0.1')}:"
f"{os.environ.get('POSTGRES_PORT', '5433')}/"
f"{os.environ.get('POSTGRES_DB', 'legal_ai')}"
)
async def audit() -> dict:
dsn = _build_dsn()
conn = await asyncpg.connect(dsn)
try:
rows = await conn.fetch(
"""
SELECT id, decision_number, decision_date, subject_categories,
length(full_text) AS chars,
summary,
outcome,
key_principles,
practice_area,
appeal_subtype,
document_id,
created_at
FROM style_corpus
ORDER BY decision_date NULLS LAST, decision_number
"""
)
# Chunk + embedding counts for each related document — by direct FK first,
# then by title-match for legacy rows where style_corpus.document_id is NULL.
chunk_counts = await conn.fetch(
"""
SELECT d.id AS doc_id, d.title,
count(c.id) AS chunks,
count(c.embedding) FILTER (WHERE c.embedding IS NOT NULL) AS chunks_with_emb
FROM documents d
LEFT JOIN document_chunks c ON c.document_id = d.id
WHERE d.title LIKE '[קורפוס]%' OR d.id IN (SELECT document_id FROM style_corpus WHERE document_id IS NOT NULL)
GROUP BY d.id, d.title
"""
)
finally:
await conn.close()
by_doc_id = {r["doc_id"]: r for r in chunk_counts}
# Index corpus documents by every digit cluster in their title so we can
# match against style_corpus.decision_number regardless of formatting
# (e.g. style_corpus has "1109-25" but title may say "ARAR-25-1109" or
# "ערר 1009-25"). Each digit run >=3 chars becomes a key.
by_digit: dict[str, dict] = {}
for r in chunk_counts:
title = r["title"] or ""
for tok in re.findall(r"\d{3,}", title):
by_digit.setdefault(tok, r)
decisions = []
gaps_total = {
"summary": 0, "outcome": 0, "key_principles": 0,
"appeal_subtype": 0, "subject_categories": 0,
"chunks": 0, "embeddings": 0, "document_id": 0,
}
for row in rows:
cats = row["subject_categories"]
if isinstance(cats, str):
try:
cats = json.loads(cats)
except json.JSONDecodeError:
cats = []
cats = cats or []
kp = row["key_principles"]
if isinstance(kp, str):
try:
kp = json.loads(kp)
except json.JSONDecodeError:
kp = []
kp = kp or []
# Resolve chunks: prefer FK, fall back to digit-cluster match on decision_number.
chunks = 0
chunks_with_emb = 0
if row["document_id"] and row["document_id"] in by_doc_id:
r = by_doc_id[row["document_id"]]
chunks = r["chunks"]
chunks_with_emb = r["chunks_with_emb"]
elif row["decision_number"]:
for tok in re.findall(r"\d{3,}", row["decision_number"]):
if tok in by_digit:
r = by_digit[tok]
chunks = r["chunks"]
chunks_with_emb = r["chunks_with_emb"]
break
missing = []
if not row["summary"]:
missing.append("summary")
gaps_total["summary"] += 1
if not row["outcome"]:
missing.append("outcome")
gaps_total["outcome"] += 1
if not kp:
missing.append("key_principles")
gaps_total["key_principles"] += 1
if not row["appeal_subtype"]:
missing.append("appeal_subtype")
gaps_total["appeal_subtype"] += 1
if not cats:
missing.append("subject_categories")
gaps_total["subject_categories"] += 1
if chunks == 0:
missing.append("chunks")
gaps_total["chunks"] += 1
elif chunks_with_emb < chunks:
missing.append(f"embeddings({chunks_with_emb}/{chunks})")
gaps_total["embeddings"] += 1
if row["document_id"] is None:
missing.append("document_id")
gaps_total["document_id"] += 1
decisions.append({
"id": str(row["id"]),
"decision_number": row["decision_number"] or "",
"decision_date": row["decision_date"].isoformat() if row["decision_date"] else None,
"chars": row["chars"],
"subject_categories": cats,
"practice_area": row["practice_area"] or "",
"appeal_subtype": row["appeal_subtype"] or "",
"summary_len": len(row["summary"] or ""),
"outcome_len": len(row["outcome"] or ""),
"key_principles_count": len(kp),
"chunks": chunks,
"chunks_with_embeddings": chunks_with_emb,
"document_id": str(row["document_id"]) if row["document_id"] else None,
"missing": missing,
"created_at": row["created_at"].isoformat() if row["created_at"] else None,
})
return {
"generated_at": datetime.now(UTC).isoformat(),
"total_decisions": len(decisions),
"gaps_total": gaps_total,
"decisions": decisions,
}
async def main() -> int:
report = await audit()
out_dir = Path(__file__).resolve().parents[1] / "data" / "audit"
out_dir.mkdir(parents=True, exist_ok=True)
today = date.today().isoformat()
out_file = out_dir / f"corpus-{today}.json"
out_file.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
# Console summary
print(f"Total decisions: {report['total_decisions']}")
print("Gaps by field (count of decisions missing it):")
for field, n in report["gaps_total"].items():
bar = "" * min(n, 60)
print(f" {field:25s} {n:3d} {bar}")
print(f"\nReport written to {out_file}")
return 0
if __name__ == "__main__":
sys.exit(asyncio.run(main()))

View File

@@ -0,0 +1,48 @@
/**
* pm2 ecosystem entry for legal-chat-service — the host-side SSE bridge
* to ``claude`` CLI that powers the /training chat tab.
*
* Why pm2:
* - Auto-restart if the process dies (claude CLI subprocess failures
* should never leave the service in a half-dead state).
* - Log rotation matches paperclip's behavior so the chair sees
* consistent log paths under ~/.pm2/logs/.
*
* Install (once):
* pm2 start /home/chaim/legal-ai/scripts/legal-chat-service.config.cjs
* pm2 save
*
* Smoke test:
* curl http://127.0.0.1:8770/health
* # → {"ok":true,"service":"legal-chat-service"}
*
* Update:
* pm2 restart legal-chat-service
*
* Stop:
* pm2 stop legal-chat-service
*/
module.exports = {
apps: [
{
name: "legal-chat-service",
cwd: "/home/chaim/legal-ai/mcp-server",
// Run the in-package server via the venv interpreter so all
// imports (claude_session, etc) resolve.
script: "/home/chaim/legal-ai/mcp-server/.venv/bin/python",
args: "-m legal_mcp.chat_service.server --port 8770",
// claude CLI looks up credentials under HOME — make sure it
// sees Daphna's session, not an empty container HOME.
env: {
HOME: "/home/chaim",
PATH: "/home/chaim/.local/bin:/usr/local/bin:/usr/bin:/bin",
PYTHONUNBUFFERED: "1",
},
restart_delay: 5000,
max_restarts: 10,
autorestart: true,
max_memory_restart: "500M",
},
],
};