19 Commits

Author SHA1 Message Date
e4651a9d06 feat(#99 / T10): get_style_guide: golden ratios measured from the corpus, shown next to the target
style_distance.measure_corpus_ratios(): splits every decision in style_corpus into sections
(chunker) and computes the mean section percentage: an "_all" aggregate plus per-outcome figures (when available). Cached.
get_style_guide shows a "measured in practice" line with ⚠️ when the value falls outside the target range.

Current state: style_corpus.outcome is not populated → the all-decisions aggregate is shown (n=48:
background 26.4% / claims 9.7% / discussion 43.8% / summary 20.1%); the per-outcome split is future-ready.
The measurement also exposes the limits of section detection (the intent of T10: flag a gap for review). Partially
overlaps T7, which measures adherence per-draft; this measures the corpus. A measurement failure is displayed, not swallowed.
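
A minimal sketch of the "mean section percentage" idea described in this commit. The function names and the input shape (one dict of section lengths per decision) are assumptions made for illustration; the real `measure_corpus_ratios()` is not shown on this page and may differ, for example in how it treats decisions that lack a section.

```python
from collections import defaultdict
from statistics import mean


def section_percentages(section_lengths: dict[str, int]) -> dict[str, float]:
    """Convert absolute section lengths of one decision into percentages."""
    total = sum(section_lengths.values())
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in section_lengths.items()}


def measure_ratios(decisions: list[dict[str, int]]) -> dict[str, float]:
    """Average each section's percentage over the decisions that contain it."""
    per_section: dict[str, list[float]] = defaultdict(list)
    for lengths in decisions:
        for name, pct in section_percentages(lengths).items():
            per_section[name].append(pct)
    return {name: round(mean(vals), 1) for name, vals in per_section.items()}


def outside_target(measured: float, target: tuple[float, float]) -> bool:
    """True when the measured value falls outside the (low, high) target range."""
    low, high = target
    return not (low <= measured <= high)
```

A caller could render the warning marker whenever `outside_target(...)` is true for a section.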

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 21:01:42 +00:00
a571ad535b fix(#88+#87): automatic DB↔file sync + claims_coverage distinguishes the statement of appeal from correspondence
#88 (DB↔file, lessons #35): drafts/decision.md was rewritten only in save_block_content;
renumber_all_blocks and the other store_block paths left the file stale → QA failed
twice on the same issue (CMPA-62). Fix: _update_draft_file became an automatic hook
(takes decision_id, resolves the case internally) called from store_block (every persist) and from
renumber_all_blocks. legal-qa reads from the DB anyway → both sides are always identical.

#87 (claims_coverage, 1033-25): claims from correspondence (claim_type='reply': a response or
supplementary argument) were marked "unanswered" as false positives. Fix: check_claims_coverage
requires an answer only for statement-of-appeal claims (claim_type='claim', appellant); reply/correspondence
claims are excluded. On full acceptance the threshold is relaxed (0.2→0.4) because the appellant won in full.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 20:54:31 +00:00
afc1548bca chore(style-acq T11): regen API types (learning + methodology endpoints)
npm run api:types: syncs the generated types.ts with the new endpoints
(/api/learning/pairs, style-distance, promote). The code uses hand-written types
(learning.ts), so this is hygiene for the future, not a dependency. Closes T11.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 20:44:41 +00:00
e096c51037 fix(#85): claude_session: retry on transient failures of claude -p
The root cause of #85 became clear: `claude -p` occasionally fails with a fast exit + empty stderr on
large/slow prompts (CEO write_interim_draft, learning_loop distillation), and
**the same prompt succeeds on a rerun**: a transient failure, not nesting (verified: nested claude
from bash and a 70K prompt both succeeded; the failure is non-deterministic).

query() wraps spawn+communicate in a retry loop (MAX_RETRIES=3, linear backoff
5s*attempt). FileNotFoundError and timeout remain deterministic (no retry).
An empty response is also treated as transient.

Verified e2e: distillation on 1130-25 ran successfully → pair=analyzed (9 changes,
6 style_method, 33.8% diff). Also fixes the CEO's write_interim_draft.
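
The retry policy described in this commit can be sketched as follows. Only `MAX_RETRIES`, `RETRY_BACKOFF_BASE` and the "linear backoff, no retry for deterministic errors" rule come from the commit; `run_with_retry`, `TransientCLIError` and the injectable `sleep` are hypothetical names introduced here, and whether the real `query()` sleeps after the final attempt is not stated.

```python
import asyncio

MAX_RETRIES = 3
RETRY_BACKOFF_BASE = 5  # seconds; sleep = base * attempt_number


class TransientCLIError(RuntimeError):
    """Hypothetical marker for a failure worth retrying (e.g. non-zero exit)."""


async def run_with_retry(attempt_fn, *, max_retries=MAX_RETRIES,
                         backoff_base=RETRY_BACKOFF_BASE, sleep=asyncio.sleep):
    """Call attempt_fn until it returns a non-empty result.

    Transient errors and empty responses are retried with linear backoff.
    Any other exception (FileNotFoundError, a timeout error, ...) is not
    caught here, so it propagates immediately without a retry.
    """
    last_err = None
    for attempt in range(1, max_retries + 1):
        try:
            result = await attempt_fn()
        except TransientCLIError as exc:
            last_err = exc
        else:
            if result:
                return result
            last_err = TransientCLIError("empty response")
        if attempt < max_retries:
            await sleep(backoff_base * attempt)
    raise RuntimeError(f"failed after {max_retries} attempts: {last_err}")
```

Passing `sleep` as a parameter keeps the sketch testable without real delays.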

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 20:08:54 +00:00
85c5a4aacb Merge pull request 'feat(halacha-triage): quality-gated + prioritized review queue + metrics (#84)' (#93) from worktree-task84-halacha-triage into main
All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 1m25s
2026-06-06 20:01:27 +00:00
420cb819f5 feat(halacha-triage): quality-gated + prioritized review queue + metrics (#84)
Backend for the halacha approval-queue triage (#84). The keyboard UI, batch
actions and defer/reject (#84.4–6) already shipped; this adds the gating,
prioritization and metrics the queue was missing.

db.list_halachot — two opt-in triage controls:
  * exclude_low_quality (#84.1): drop items carrying ANY quality_flag
    (application / quote_unverified / truncated / non_decision / thin /
    nli_unsupported / near_duplicate) — they belong in a 'needs extraction fix'
    bucket, not the chair's approve queue.
  * order_by_priority (#84.3): active-learning order — negatively-treated
    first, then most-uncertain (lowest confidence), then oldest — instead of
    FIFO, so the highest-value decisions surface first.
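
As an illustration of the two controls above, here is an in-memory sketch of the same gating and ordering. The actual change applies them in SQL; the `triage` function and the dict-based rows are assumptions for this example, and the key names mirror the columns mentioned in the diff (`quality_flags`, `corroboration_negative`, `confidence`, `created_at`).

```python
def triage(items, exclude_low_quality=False, order_by_priority=False):
    """Filter and order review-queue rows held as plain dicts."""
    # Gate: a clean item has an empty or missing quality_flags list.
    rows = [
        item for item in items
        if not (exclude_low_quality and item.get("quality_flags"))
    ]
    if order_by_priority:
        rows.sort(key=lambda item: (
            -item.get("corroboration_negative", 0),   # negatively-treated first
            item.get("confidence") is None,           # missing confidence last
            item.get("confidence") or 0.0,            # lowest confidence first
            item["created_at"],                       # then oldest
        ))
    return rows
```

With both flags off the input order is returned unchanged, which stands in for the default ordering of the real query.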

halachot_pending (MCP) — now gated + prioritized BY DEFAULT; include_low_quality=
true reveals the needs-fix bucket. The agent review path benefits immediately.

GET /api/halachot — same two params, default OFF (non-breaking; the UI opts in).

metrics.halacha_backlog (#84.7) — splits pending into clean vs flagged, adds
deferred, reviewed_total, approve_ratio, and a pending_by_flag breakdown, so the
backlog distinguishes real review work from extraction noise.
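
A rough sketch of the clean/flagged split and the per-flag breakdown, computed over in-memory rows. `backlog_summary` is a hypothetical helper; the real metric also reports `deferred`, `reviewed_total` and `approve_ratio`, whose exact definitions are not given here and are therefore left out.

```python
from collections import Counter


def backlog_summary(pending):
    """Split pending items into clean vs flagged and count items per flag."""
    clean = [h for h in pending if not h.get("quality_flags")]
    flagged = [h for h in pending if h.get("quality_flags")]
    # An item carrying several flags is counted once under each of them.
    by_flag = Counter(flag for h in flagged for flag in h["quality_flags"])
    return {
        "pending_clean": len(clean),
        "pending_flagged": len(flagged),
        "pending_by_flag": dict(by_flag),
    }
```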

Deferred (documented): #84.2 near-duplicate cluster cards and wiring the UI
fetch to the new params require frontend work + an api:types regen AFTER this
deploys (the new query params aren't in prod's OpenAPI until then) — a clean
follow-up. The backend fully supports both now.

Verified against the live DB (read-only):
- pending 177 → gated-clean 110, 0 flagged items leak into the clean queue.
- priority order surfaces the lowest-confidence items first (0.55, 0.55, ...).
- backlog: pending_clean=110 / pending_flagged=67 / approve_ratio=0.916,
  pending_by_flag={nli_unsupported:59, quote_unverified:3, thin:3, truncated:2}.
- pytest tests/test_halacha_quality.py — 52 passed (no regression).

Invariants: G1 (gate at source — SQL filter, not post-hoc); G2 (no parallel
path — same list_halachot); §6 (flagged items routed to a bucket, never dropped).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 20:00:52 +00:00
32ef259843 Merge pull request 'feat(halacha): application gate + lexical dedup tail + quality harnesses (#81,#82)' (#92) from worktree-task81-82-halacha-engine into main
All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 1m25s
2026-06-06 19:56:22 +00:00
1286a1e60d feat(halacha): application gate + lexical dedup tail + quality harnesses (#81,#82)
Halacha-extraction quality (#81) and dedup-on-insert (#82) — engine changes
(pure + tested) plus measurement/ops tooling.

halacha_quality.py
- #81.4 application gate: is_fact_dependent() (high-precision "applied to THIS
  case" deixis per the strict rubric §3/§27) + FLAG_APPLICATION. compute_quality_flags
  now takes rule_type and flags rule_type=='application' OR fact-dependent —
  blocking auto-approve (an illustration is not a generalizable holding).
- #82.3 lexical tail signal: jaccard_shingles / normalized_levenshtein /
  lexical_near_duplicate + FLAG_NEAR_DUPLICATE, for the 0.83–0.93 cosine band.

halacha_extractor.py — pass rule_type to the flag computation; re-type a
binding-labeled fact-application to 'application' (mirrors non_decision→obiter).

db.py (store_halachot_for_chunk) — dedup now fetches the nearest same-precedent
neighbor once: cosine ≥ DEDUP → skip (unchanged); cosine in [BAND, DEDUP) with
high lexical overlap → FLAG_NEAR_DUPLICATE (review, not skip — never drop a
possibly-distinct principle unreviewed).
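
The three-way decision described above can be summarized in a small sketch. The threshold values are the defaults shown in config.py (0.93 and 0.83); `dedup_action` and its string results are hypothetical, and the real code compares pgvector distances (1 - cosine) rather than similarities.

```python
DEDUP_COSINE = 0.93       # at or above: treated as the same principle, skipped
DEDUP_BAND_COSINE = 0.83  # lower edge of the "suspicious" tail band


def dedup_action(cosine: float, lexically_near: bool) -> str:
    """Decide what to do with a new halacha given its nearest neighbor."""
    if cosine >= DEDUP_COSINE:
        return "skip"
    if cosine >= DEDUP_BAND_COSINE and lexically_near:
        # Not dropped: flagged so a human reviews the possible paraphrase.
        return "flag_near_duplicate"
    return "insert"
```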

config.py — HALACHA_DEDUP_BAND_COSINE (0.83).

Scripts:
- scripts/halacha_goldset.py (#81.7) — export stratified sample for human
  tagging; score validators (P/R/F1) against the tags. Backbone for #81.8.
- scripts/halacha_batch_reconcile.py (#82.7) — conservative cross-precedent
  dedup (cosine ≥0.95), dry-run report only.
- scripts/calibrate_halacha_dedup.py (#82.1) — calibrate the lexical thresholds
  against the 2026-06-03 cleanup gold-set.

Deferred (documented): #82.4 merge-provenance and #82.5 DB ON CONFLICT/UNIQUE
on normalized quote are NOT included — the current skip+flag behavior is safe,
whereas a UNIQUE on normalized_quote would fail on existing dups and a blind
merge risks losing provenance; they need their own chair-reviewed migration.
#82.6 over-merge guard is moot until merge lands. #81.6 full rhetorical-role
classifier deferred (section pre-filter + application flag cover the practical
case); #81.8 blocked on the human-tagged gold-set (harness now provided).

Verified:
- pytest tests/test_halacha_quality.py — 52 passed (14 new).
- calibrate: configured (0.55,0.70) → precision 1.0 (zero false-merge), recall
  0.30 — correct profile for an auto-approve-blocking signal.
- goldset export: 15-row sample CSV. batch reconcile: 819 halachot → 5
  cross-precedent candidate pairs.

Invariants: G1 (normalize at source — flag at insert, not at read); §6 (no
silent swallow — suspect items flagged to review, never dropped); G2 (no
parallel path — same store_halachot_for_chunk / compute_quality_flags).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 19:55:45 +00:00
366d89e6bb Merge pull request 'feat(nevo): backfill leaked preamble + ratio gold-set benchmark (#86)' (#91) from worktree-task86-nevo-backfill-benchmark into main
All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 1m25s
2026-06-06 19:46:25 +00:00
fb51a0e869 feat(nevo): backfill leaked preamble + ratio gold-set benchmark (#86)
#86.2 backfill + #86.3 benchmark, plus a #86.1 over-strip fix found en route.

extractor.py
- extract_nevo_ratio(): capture Nevo's מיני-רציו block (editorial holdings
  summary) before it is stripped — a free professional gold-set (#86.3).
- _DECISION_START hardening (#86.2): the merged #86.1 regex over-stripped.
  (a) פסק-דין headers are markdown-wrapped (**פסק  דין**); the old anchor
      required the keyword as the first line char with one separator, so it
      missed the header and matched a citation 32K deep (עמ"נ 50567-07-21,
      losing 45% of the body). Now tolerates leading markdown + 0-3 seps,
      and the final-nun form (דין ן vs דינו נ).
  (b) bare השופט/הנשיא matched CITATIONS ("השופט מ' חשין, פסקה 23"). The
      authoring-judge line ends with a colon; we now require it.

ingest.py
- capture the ratio before stripping and store it on the row (best-effort,
  non-fatal); also strip the text-upload path (was file-only).

db.py
- add case_law.nevo_ratio column (additive); allow it in update_case_law.

scripts/backfill_nevo_preamble.py (#86.2) — dry-run-by-default data migration:
finds historically-leaked rulings, captures ratio→nevo_ratio, rewrites
full_text (+content_hash), reindexes, and FLAGS (never deletes) halachot whose
quote lives in the removed preamble (review_status=pending_review +
nevo_preamble_leak flag). Safety guard: rows with keep%<--min-keep (60) are
excluded from --apply as suspected over-strip. --apply writes backup+manifest
to data/audit/ first. Chair-gated — NOT applied here.
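
A sketch of the keep% safety guard, under the assumption that "keep%" means the share of the original text that survives stripping. `keep_percent`, `plan_backfill` and the row keys are hypothetical; only the default of 60 and the dry-run-by-default behavior come from the description above.

```python
def keep_percent(original: str, stripped: str) -> float:
    """Percentage of the original text length that remains after stripping."""
    if not original:
        return 100.0
    return 100.0 * len(stripped) / len(original)


def plan_backfill(rows, min_keep: float = 60.0, apply: bool = False):
    """Partition rows into those safe to rewrite and suspected over-strips."""
    eligible, suspected = [], []
    for row in rows:
        pct = keep_percent(row["full_text"], row["stripped_text"])
        (eligible if pct >= min_keep else suspected).append(row["id"])
    return {
        "mode": "apply" if apply else "dry-run",  # dry-run unless asked
        "eligible": eligible,
        "suspected_over_strip": suspected,
    }
```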

scripts/nevo_ratio_benchmark.py (#86.3) — LLM-as-judge (local claude_session,
zero cost) measures recall/precision/granularity of our halachot vs the Nevo
ratio. Works pre- and post-backfill (reads nevo_ratio, falls back to full_text).
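
The commit does not define its three metrics, so the following is only one plausible reading: recall as the share of Nevo holdings covered by our halachot, precision as the share of our halachot supported by the ratio, and granularity as how many halachot we extract per Nevo holding. The function name and the counts in the usage line are illustrative, not taken from the benchmark.

```python
def ratio_benchmark(gold_total: int, gold_matched: int,
                    ours_total: int, ours_supported: int) -> dict[str, float]:
    """Compute recall / precision / granularity from judge-assigned counts."""
    return {
        "recall": gold_matched / gold_total if gold_total else 0.0,
        "precision": ours_supported / ours_total if ours_total else 0.0,
        "granularity": ours_total / gold_total if gold_total else 0.0,
    }


example = ratio_benchmark(gold_total=4, gold_matched=3, ours_total=6, ours_supported=6)
```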

Verified:
- pytest tests/test_nevo_preamble.py — 12 passed (incl. citation/markdown
  over-strip regressions).
- backfill dry-run: 19 leaked rulings, 27 contaminated halachot, all ≥75%
  keep (the 32K over-strip is gone).
- benchmark on בג"ץ 1764/05: recall=0.875 precision=1.0 granularity=1.75x.

Invariants: G1 (normalize at source — strip/capture at ingest, not at read);
no silent swallow (contaminated halachot flagged + reported, not dropped);
data-migration is dry-run-default with backup+manifest, chair-gated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 19:45:43 +00:00
12bdec10fa Merge pull request 'fix(claude_session): surface real CLI error + sanitize nested env (#85)' (#90) from worktree-task85-claude-session-nested into main
All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 1m27s
2026-06-06 19:30:22 +00:00
8ec24cf822 fix(claude_session): surface real CLI error + sanitize nested env (#85)
write_interim_draft failed for all blocks from the CEO MCP instance with
"Claude CLI failed (exit 1): unknown error". Two fixes:

1. Error surfacing (the certain win): on non-zero exit, capture and log
   both stderr AND stdout (the CLI sometimes writes its diagnostic to
   stdout or nowhere), so the next occurrence is diagnosable instead of
   collapsing to "unknown error". This is why #85 was unsolved — the real
   error was swallowed (engineering rule §6: no silent swallow).

2. Defensive hardening: strip Claude Code session markers (CLAUDECODE,
   CLAUDE_CODE_*, CLAUDE_AGENT_*, AI_AGENT, CLAUDE_EFFORT) from the env of
   nested `claude -p` calls and run them from $HOME, decoupling them from
   the parent agent's session/project state. Aligns query() with the
   existing query_streaming() path (which already sets cwd=HOME). Auth/
   config vars are preserved.
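
A small variant of the env sanitization in item 2, written to take the environment as an argument so it can be exercised without touching `os.environ`. The marker names are the ones listed above; `clean_env` itself and the sample variables are hypothetical.

```python
SESSION_MARKER_PREFIXES = ("CLAUDECODE", "CLAUDE_CODE_", "CLAUDE_AGENT_")
SESSION_MARKER_EXACT = frozenset({"AI_AGENT", "CLAUDE_EFFORT"})


def clean_env(env: dict[str, str]) -> dict[str, str]:
    """Return a copy of env without Claude Code session markers."""
    return {
        key: value
        for key, value in env.items()
        if key not in SESSION_MARKER_EXACT
        and not key.startswith(SESSION_MARKER_PREFIXES)
    }


parent = {
    "CLAUDECODE": "1",
    "CLAUDE_CODE_ENTRYPOINT": "cli",
    "AI_AGENT": "x",
    "CLAUDE_CONFIG_DIR": "/home/user/.claude",
    "PATH": "/usr/bin",
}
child = clean_env(parent)
```

Auth and config variables such as `CLAUDE_CONFIG_DIR` and `PATH` pass through because they match none of the marker prefixes.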

Note: the original adapter-context failure could not be reproduced in a
plain interactive session (nested claude -p succeeds there in both old and
new code), so the env markers are a suspect, not a proven cause. The real
value is the diagnostics. Verified: nested query() returns PONG from
inside a CLAUDECODE=1 session; unit tests cover env sanitization.

Invariants: G1 (normalize at source — fix the spawn, not readers),
G2 (no parallel path — same query()), §6 (no silent error swallow).
INV: feedback_claude_session_local_only preserved (all calls stay local).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 19:29:36 +00:00
3b9f77daa8 Merge pull request 'feat(style-acq T8): analyze_corpus: remove LIMIT 20 (full coverage)' (#89) from worktree-style-acquisition-mvp into main
All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 1m24s
2026-06-06 19:25:40 +00:00
32a6e2b57b Merge pull request 'fix(style-acq T9): real automatic numbering in DOCX export' (#88) from worktree-style-acquisition-mvp into main
All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 1m28s
2026-06-06 19:24:02 +00:00
37c00bac13 Merge pull request 'feat(style-acq T14): chair gate for approving curator proposals → merged into the profile' (#87) from worktree-style-acquisition-mvp into main
All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 1m42s
2026-06-06 19:18:13 +00:00
6313fcd316 Merge pull request 'feat(style-acq T6+T13): alignment ledger + style-distance metric in the UI' (#86) from worktree-style-acquisition-mvp into main
All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 38s
2026-06-06 19:13:32 +00:00
7b1c0c1a32 Merge pull request 'feat(style-acq T12): /methodology: editable transition phrases + anti-patterns' (#85) from worktree-style-acquisition-mvp into main
All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 1m25s
2026-06-06 19:09:15 +00:00
3b3e1e3bbf Merge pull request 'docs: FU-14 GAP-54: closed as resolved-by-FU-1 (case-law ingestion already unified)' (#84) from docs/gap54-closure into main
All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 9s
2026-06-06 19:03:14 +00:00
37dcb30604 docs: FU-14 GAP-54: closed as resolved-by-FU-1 (unified case-law ingestion)
Verification (G2: do not re-solve): case-law ingestion is already unified by FU-1. Both case-law
paths (precedent_library + internal_decisions) go through
the canonical ingest.ingest_document with symmetric enum validation + citation-guard
(documented in 01-ingest §4). The third path (training→style_corpus) is a deliberately
separate corpus. Verified by test_unified_ingest (9/9). No code, only closure documentation.

Invariants: confirms that INV-ING1 + G2 hold. doc-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 19:02:55 +00:00
26 changed files with 1784 additions and 59 deletions

View File

@@ -463,6 +463,7 @@ The draft's biggest structural error was adding the "נבאר" doctrinal paragra
- **Problem:** legal-writer updates `decision_blocks` in the DB, but legal-qa reads from `drafts/decision.md` on disk. In CMPA-62 the writer reported updating block headers in DB but the file did not re-sync, causing QA-2 to fail on exactly the same issue twice.
- **Lesson:** Single source of truth is mandatory — either the writer must write to BOTH the DB and the decision.md file in one atomic step, or there must be an automatic `regenerate-draft` hook that runs after every block update so the file always reflects the latest DB state. Two unsynchronized sources will keep producing the same false-fail loop.
- **Owner:** Infrastructure task — not a writer/QA prompt fix.
- **✅ RESOLVED (GAP-88, 2026-06-06):** `block_writer._update_draft_file` is now an automatic regenerate hook called from `store_block` (every persist) **and** `renumber_all_blocks` — so `drafts/decision.md` always reflects `decision_blocks`. legal-qa already validates against the DB; both sides are now identical.
---

View File

@@ -88,7 +88,7 @@
| GAP-51 | `set_outcome` enum-mismatch (3≠4); contradictory vocabularies | INV-TOOL1/UI1 | Medium | `block_writer.py:442` vs. `lessons.py:11`, `workflow.py:145` | single SSoT for outcome |
| GAP-52 | most tools are non-idempotent (case_create/document_upload/precedent_attach) | INV-TOOL3, G3 | Medium | `server.py`, tools/ | upsert/ON CONFLICT |
| GAP-53 | no limit caps (precedent_library_list/search_*/list_chair_feedback) | INV-TOOL5 | Low | tools/ | clamp to max |
| GAP-54 | 3 case-law ingestion paths with asymmetric validation; citation-guard undocumented | INV-ING1, G2 | Medium | `precedent_library.py`, `internal_decisions.py` | unify (consistent with GAP-01/05) |
| GAP-54 | 3 case-law ingestion paths with asymmetric validation; citation-guard undocumented | INV-ING1, G2 | Medium | `precedent_library.py`, `internal_decisions.py` | **resolved by FU-1**: both case-law paths (library+internal) go through the canonical `ingest.ingest_document` (symmetric enum validation + citation-guard, documented in 01-ingest §4); the third path (training→`style_corpus`) is a deliberately separate corpus (style, not case law). Verified by `test_unified_ingest.py` |
| GAP-55 | Infisical dead code; undocumented config source (Coolify-only) | INV-ENV2, G2 | Medium | `mcp-server/.../config.py` | document Coolify as SSoT / isolate Infisical |
| GAP-56 | hard-coded UUIDs (company/agent), consistent with GAP-26 | INV-ENV3/INT5 | High | `web/paperclip_client.py:36-62`, `web/app.py:3976` | config-driven |
| GAP-57 | plaintext creds by default (`paperclip:paperclip`) | INV-ENV4, G9, §6 | High | `web/paperclip_client.py:21`, `web/app.py:3789,3964` | empty default + fail-loud |
@@ -207,6 +207,7 @@
- **Slice 7, 2026-06-06: ✅ GAP-48 completed.** The `drafting` family (18 tools) was converted to the envelope. export_docx/revise_draft/apply_user_edit use `err` for failures (so the agent and the user see the failure at the envelope level), with `failed_gates` riding in `data`; 6 app.py consumers (get_decision_template/apply_user_edit×2/revise_draft/list_bookmarks/export_docx) were wired with an envelope-status check; `test_export_qa_gate` was updated to the contract (182/182 passing). **GAP-48 closed: all ~12 families are uniform.**
- **Slice 8, 2026-06-06: ✅ GAP-49 (the critical part).** The misleading name `precedent_search_library` (citations attached to a case) was renamed to `search_case_precedents`, which removes the dangerous inversion against `search_precedent_library` (the authoritative library, the CREAC source). The old name is kept as a deprecated alias (in server.py) → zero breakage for live agents. Docstrings were clarified; app.py (typeahead) + the legal-researcher/legal-writer docs + the precedent_library docstring were updated. The 5 remaining search tools search distinct corpora under reasonable names, so no mass rename was done (high churn, low value). 182/182 passing. **⚠ After merge+deploy:** cross-company sync of the agent doc (frontmatter `search_case_precedents`). Remaining in FU-14: GAP-50 (merging the block tools: touches the writing process, requires a chair decision), GAP-54, GAP-47 part B.
- **Slice 9, 2026-06-06: ✅ GAP-50 (chair decision).** Mapping showed that the block tools are not "needless duplication": `write_block`/`write_all_blocks`/`save_block_content`/`write_interim_draft` serve different flows (CLI/initial-draft versus the writer process, "the fix goes in the file, not in the DB"). The only real duplication is `draft_section` (per-section context, nearly abandoned), which overlaps `get_block_context` (per-block, canonical). Decided (chair): **draft_section deprecated** (the docstring in server.py+drafting.py points to get_block_context; draft-decision.md updated), with no removal and no merging of the writing tools (preserving the deliberate writing process). 182/182 passing. **GAP-49+50 closed.** Remaining in FU-14: GAP-54 (unifying case-law ingestion), GAP-47 part B (chair guidelines→DB).
- **Slice 10, 2026-06-06: ✅ GAP-54 (closed as resolved-by-FU-1).** Verification (G2: do not re-solve): `ingest.ingest_document` is the canonical path; `precedent_library` and `internal_decisions` both go through it with symmetric enum validation + citation-guard (documented in 01-ingest §4); training→`style_corpus` is a deliberately separate corpus. 9/9 `test_unified_ingest` pass, so there is no code to write. **FU-14 nearly complete: only GAP-47 part B remains** (moving the chair guidelines from `analysis-and-research.md` into the DB), a separate UI + analyst-flow feature, not urgent.
### FU-15 — deploy/env/secrets
- **Covers:** GAP-55..62 · **invariants:** INV-ENV1..ENV5 · **effort:** M · **dependencies:**

View File

@@ -154,6 +154,14 @@ HALACHA_AUTO_APPROVE_THRESHOLD = float(
# principle. Set > 1.0 to disable semantic dedup (exact-quote dedup still runs).
HALACHA_DEDUP_COSINE = float(os.environ.get("HALACHA_DEDUP_COSINE", "0.93"))
# Halacha dedup TAIL band (#82.3) — the [BAND_COSINE, DEDUP_COSINE) range is too
# low to auto-skip but suspicious. A halacha whose nearest same-precedent
# neighbor sits in this band AND has high LEXICAL overlap (Jaccard/Levenshtein
# on rule_statement) is flagged 'near_duplicate' (blocks auto-approve → review),
# not skipped — catching paraphrases the cosine threshold misses without
# dropping a possibly-distinct principle unreviewed. 0.83 from the same cleanup.
HALACHA_DEDUP_BAND_COSINE = float(os.environ.get("HALACHA_DEDUP_BAND_COSINE", "0.83"))
# Halacha NLI entailment validator (#81.3) — after extraction, a claude_session
# judge checks each halacha's rule_statement is entailed by its supporting_quote.
# Non-entailed (neutral/contradiction) → quality flag 'nli_unsupported' that

View File

@@ -1088,37 +1088,39 @@ async def save_block_content(case_id: UUID, block_id: str, content: str) -> dict
result["generation_type"] = "claude-code"
result["model_used"] = "claude-code"
await store_block(UUID(decision["id"]), result)
await store_block(UUID(decision["id"]), result) # store_block syncs the file (#35)
await db.mark_blocks_stale(case_id, False)
# Also write/update the draft file on disk
await _update_draft_file(case_id, UUID(decision["id"]))
return result
async def _update_draft_file(case_id: UUID, decision_id: UUID) -> None:
"""Rebuild drafts/decision.md from all blocks in DB."""
from pathlib import Path
case = await db.get_case(case_id)
if not case:
return
case_dir = config.find_case_dir(case["case_number"])
draft_dir = case_dir / "drafts"
draft_dir.mkdir(parents=True, exist_ok=True)
async def _update_draft_file(decision_id: UUID) -> None:
"""Rebuild drafts/decision.md from all blocks in DB — the single
regenerate-draft hook (lessons #35 / GAP-88). Called after EVERY
decision_blocks mutation (store_block, renumber) so the on-disk file never
drifts from the DB. legal-qa validates against the DB; export and the chair
read the file — keeping them identical kills the "QA fails twice on the same
already-fixed issue" loop (CMPA-62). Resolves case from decision_id so no
caller has to thread case_id through."""
pool = await db.get_pool()
async with pool.acquire() as conn:
case_row = await conn.fetchrow(
"SELECT c.case_number FROM decisions d JOIN cases c ON c.id = d.case_id "
"WHERE d.id = $1",
decision_id,
)
if not case_row:
return
rows = await conn.fetch(
"SELECT content FROM decision_blocks WHERE decision_id = $1 AND content != '' ORDER BY block_index",
decision_id,
)
draft_dir = config.find_case_dir(case_row["case_number"]) / "drafts"
draft_dir.mkdir(parents=True, exist_ok=True)
draft_path = draft_dir / "decision.md"
draft_path.write_text("\n\n".join(row["content"] for row in rows if row["content"]), encoding="utf-8")
logger.info("Draft file updated: %s (%d blocks)", draft_path, len(rows))
logger.info("Draft file synced: %s (%d blocks)", draft_path, len(rows))
# ── Renumbering ───────────────────────────────────────────────────
@@ -1172,6 +1174,11 @@ async def renumber_all_blocks(decision_id: UUID) -> dict:
)
updated += 1
# #35 — renumber mutates content via raw UPDATE (bypasses store_block), so
# sync the draft file here too, otherwise the file keeps stale numbering.
if updated:
await _update_draft_file(decision_id)
return {"total_paragraphs": current_num - 1, "blocks_updated": updated}
@@ -1204,6 +1211,9 @@ async def store_block(decision_id: UUID, block_result: dict) -> None:
block_result["model_used"],
block_result["temperature"],
)
# #35 — regenerate the on-disk draft on every persist so DB and file stay
# identical (legal-qa reads DB; export/chair read the file).
await _update_draft_file(decision_id)
async def write_and_store_block(

View File

@@ -29,6 +29,7 @@ from __future__ import annotations
import asyncio
import json
import logging
import os
from legal_mcp.config import parse_llm_json
@@ -40,15 +41,39 @@ logger = logging.getLogger(__name__)
DEFAULT_TIMEOUT = 1800
LONG_TIMEOUT = 3600 # opus block writing on full case context
# #85 — `claude -p` fails intermittently with a fast non-zero exit and empty
# stderr (observed on large/slow cold prompts: CEO write_interim_draft,
# learning_loop distillation). The SAME prompt succeeds on retry, so the bail is
# transient — retry with linear backoff. Timeouts and "CLI not found" are
# deterministic and are NOT retried.
# #85 — two complementary hardenings for the same symptom (`claude -p` failing
# with a fast non-zero exit + empty stderr on large/slow cold prompts: CEO
# write_interim_draft, learning_loop distillation):
#
# 1. CLEAN ENV (defensive): a running Claude Code session exports markers into
# child processes; a *nested* ``claude -p`` inherits them. Stripping them lets
# every nested invocation launch as a clean top-level session. Could not be
# reproduced deterministically, so it's a suspect, not a proven cause. Auth/
# config (CLAUDE_CONFIG_DIR, ANTHROPIC_*, PATH, HOME) are kept.
# 2. RETRY (the real fix): the SAME large prompt that exits 1 once succeeds on a
# plain retry — the bail is transient. Retry with linear backoff. Timeouts and
# "CLI not found" stay deterministic and are NOT retried.
# See TaskMaster legal-ai #85.
_SESSION_MARKER_PREFIXES = ("CLAUDECODE", "CLAUDE_CODE_", "CLAUDE_AGENT_")
_SESSION_MARKER_EXACT = frozenset({"AI_AGENT", "CLAUDE_EFFORT"})
MAX_RETRIES = 3
RETRY_BACKOFF_BASE = 5 # seconds; sleep = base * attempt_number
def _clean_subprocess_env() -> dict[str, str]:
"""Copy the current env minus Claude Code session markers.
Lets a nested ``claude -p`` start fresh instead of detecting it is
already inside a Claude Code session (#85).
"""
env = dict(os.environ)
for key in list(env):
if key in _SESSION_MARKER_EXACT or key.startswith(_SESSION_MARKER_PREFIXES):
del env[key]
return env
async def query(
prompt: str,
timeout: int = DEFAULT_TIMEOUT,
@@ -112,6 +137,8 @@ async def query(
stdin=asyncio.subprocess.PIPE,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
env=_clean_subprocess_env(),
cwd=os.path.expanduser("~"),
)
except FileNotFoundError:
# Deterministic — never retry.
@@ -139,8 +166,11 @@ async def query(
raise RuntimeError(f"Claude CLI timed out after {timeout}s")
if proc.returncode != 0:
stderr = stderr_b.decode("utf-8", errors="replace").strip()[:500] or "unknown error"
last_err = f"exit {proc.returncode}: {stderr}"
# The CLI sometimes writes its diagnostic to stdout (or nowhere)
# rather than stderr (#85) — surface whichever is present.
stderr = stderr_b.decode("utf-8", errors="replace").strip()
stdout = stdout_b.decode("utf-8", errors="replace").strip()
last_err = f"exit {proc.returncode}: {(stderr or stdout or 'no output')[:500]}"
else:
stdout = stdout_b.decode("utf-8", errors="replace").strip()
if stdout:
@@ -256,6 +286,7 @@ async def query_streaming(
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
cwd=cwd,
env=_clean_subprocess_env(),
)
except FileNotFoundError:
yield {

View File

@@ -619,6 +619,12 @@ ALTER TABLE case_law ADD COLUMN IF NOT EXISTS practice_area TEXT DEFAULT '';
ALTER TABLE case_law ADD COLUMN IF NOT EXISTS appeal_subtype TEXT DEFAULT '';
ALTER TABLE case_law ADD COLUMN IF NOT EXISTS headnote TEXT DEFAULT '';
-- chair-editable abstract shown in search results.
ALTER TABLE case_law ADD COLUMN IF NOT EXISTS nevo_ratio TEXT DEFAULT '';
-- The Nevo editorial מיני-רציו block, captured at ingest *before* it is
-- stripped from the body (#86.3). Kept separate from `headnote` (which is
-- our own abstract) so it can serve as a free professional gold-set for
-- benchmarking halacha-extraction recall/precision. Empty when the source
-- is not a Nevo export or carries no mini-ratio.
ALTER TABLE case_law ADD COLUMN IF NOT EXISTS source_type TEXT DEFAULT '';
-- 'court_ruling' | 'appeals_committee'
@@ -3263,7 +3269,7 @@ async def update_case_law(case_law_id: UUID, **fields) -> dict | None:
"""
allowed = {
"case_number", "case_name", "court", "date", "practice_area", "appeal_subtype",
"subject_tags", "summary", "headnote", "key_quote", "source_url",
"subject_tags", "summary", "headnote", "nevo_ratio", "key_quote", "source_url",
"source_type", "precedent_level", "is_binding", "district", "chair_name",
"proceeding_type", "citation_formatted",
}
@@ -3693,6 +3699,7 @@ async def store_halachot_for_chunk(
"""
threshold = config.HALACHA_AUTO_APPROVE_THRESHOLD
dedup_distance = 1.0 - config.HALACHA_DEDUP_COSINE # cosine sim → distance
band_distance = 1.0 - config.HALACHA_DEDUP_BAND_COSINE # tail-band ceiling (#82.3)
pool = await get_pool()
inserted = 0
skipped = 0
@@ -3716,21 +3723,32 @@ async def store_halachot_for_chunk(
if norm_quote and norm_quote in existing_quotes:
skipped += 1
continue
# 2) semantic near-duplicate (rule embedding cosine)
# 2) semantic near-duplicate (rule embedding cosine) — fetch the
# nearest same-precedent neighbor once so we can both auto-skip
# (cosine ≥ DEDUP) and flag the lexical tail (#82.3).
emb = h.get("embedding")
flags = list(h.get("quality_flags") or [])
if emb is not None and config.HALACHA_DEDUP_COSINE <= 1.0:
dup = await conn.fetchval(
"SELECT 1 FROM halachot WHERE case_law_id = $1 "
"AND embedding IS NOT NULL AND (embedding <=> $2) <= $3 "
"LIMIT 1",
case_law_id, emb, dedup_distance,
neighbor = await conn.fetchrow(
"SELECT rule_statement, (embedding <=> $2) AS dist "
"FROM halachot WHERE case_law_id = $1 "
"AND embedding IS NOT NULL "
"ORDER BY embedding <=> $2 LIMIT 1",
case_law_id, emb,
)
if dup:
if neighbor is not None:
dist = float(neighbor["dist"])
if dist <= dedup_distance:
skipped += 1
continue
# tail band: below auto-skip but lexically near → flag.
if (dist <= band_distance
and halacha_quality.FLAG_NEAR_DUPLICATE not in flags
and halacha_quality.lexical_near_duplicate(
h["rule_statement"], neighbor["rule_statement"])):
flags.append(halacha_quality.FLAG_NEAR_DUPLICATE)
confidence = float(h.get("confidence", 0.0))
flags = h.get("quality_flags") or []
auto_approve = confidence >= threshold and not flags
review_status = "approved" if auto_approve else "pending_review"
reviewer = (
@@ -3774,7 +3792,19 @@ async def list_halachot(
practice_area: str | None = None,
limit: int = 200,
offset: int = 0,
exclude_low_quality: bool = False,
order_by_priority: bool = False,
) -> list[dict]:
"""List halachot with optional triage controls (#84).
exclude_low_quality — drop items carrying ANY quality_flag (application /
truncated_quote / quote_unverified / non_decision / thin_restatement /
nli_unsupported / near_duplicate). These belong in a 'needs extraction
fix' bucket, not the chair's approve queue (#84.1).
order_by_priority — replace FIFO with an active-learning order (#84.3):
negatively-treated first, then most-uncertain (lowest confidence), then
oldest — so the chair sees the highest-value decisions first.
"""
pool = await get_pool()
conditions = []
params: list = []
@@ -3791,7 +3821,16 @@ async def list_halachot(
conditions.append(f"${idx} = ANY(h.practice_areas)")
params.append(practice_area)
idx += 1
if exclude_low_quality:
# a clean item has an empty/NULL quality_flags array
conditions.append("COALESCE(array_length(h.quality_flags, 1), 0) = 0")
where_sql = f"WHERE {' AND '.join(conditions)}" if conditions else ""
order_sql = (
"ORDER BY corroboration_negative DESC, h.confidence ASC NULLS LAST, "
"h.created_at ASC"
if order_by_priority
else "ORDER BY h.case_law_id, h.halacha_index"
)
params.extend([limit, offset])
sql = f"""
SELECT h.id, h.case_law_id, h.halacha_index, h.rule_statement,
@@ -3819,7 +3858,7 @@ async def list_halachot(
GROUP BY halacha_id
) cor ON cor.halacha_id = h.id
{where_sql}
ORDER BY h.case_law_id, h.halacha_index
{order_sql}
LIMIT ${idx} OFFSET ${idx + 1}
"""
rows = await pool.fetch(sql, *params)
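
To make the `order_by_priority` ordering concrete, here is a minimal pure-Python sketch of the same sort key. It assumes `corroboration_negative` is a non-NULL count, and the sample rows are made up for illustration; the real ordering is performed by the SQL above, not by this helper.

```python
from datetime import datetime


def priority_key(row: dict) -> tuple:
    """Sketch of: corroboration_negative DESC, confidence ASC NULLS LAST, created_at ASC."""
    conf = row.get("confidence")
    return (
        -row.get("corroboration_negative", 0),  # more negative treatment first
        conf is None,                            # False sorts before True, so NULLs go last
        conf if conf is not None else 0.0,       # lowest confidence (most uncertain) first
        row["created_at"],                       # oldest first as the final tie-break
    )


# Hypothetical rows, not taken from the real database.
rows = [
    {"id": "a", "corroboration_negative": 0, "confidence": 0.9, "created_at": datetime(2026, 1, 1)},
    {"id": "b", "corroboration_negative": 2, "confidence": 0.8, "created_at": datetime(2026, 3, 1)},
    {"id": "c", "corroboration_negative": 0, "confidence": 0.4, "created_at": datetime(2026, 2, 1)},
    {"id": "d", "corroboration_negative": 0, "confidence": None, "created_at": datetime(2025, 1, 1)},
]
ordered = [r["id"] for r in sorted(rows, key=priority_key)]
```

The negatively-treated item comes first regardless of confidence, and the item with no confidence value lands at the end even though it is the oldest.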

@@ -362,12 +362,24 @@ _NEVO_MARKERS = ("ספרות:", "חקיקה שאוזכרה:", "מיני-רציו
# preamble: bibliography + מיני-רציו). Two families:
# - ועדת ערר / district openings (בפנינו / הערר שבנדון / ...)
# - COURT-RULING openings (#86.1): a פסק-דין header or the authoring judge's
# line ("השופט/ת X:", "כב' השופט", "הנשיא"). Without these, Nevo court
# judgments — exactly the ones carrying a מיני-רציו — slipped through unstripped
# (e.g. בג"ץ 1764/05), risking that the extractor reads Nevo's answer key.
# line. Without these, Nevo court judgments — exactly the ones carrying a
# מיני-רציו — slipped through unstripped (e.g. בג"ץ 1764/05).
#
# #86.2 hardening — two over-strip bugs found while backfilling:
# 1. ``פסק-דין`` headers are often markdown-wrapped (``**פסק דין**``); the old
# ``^פסק[- ]דין`` required the keyword to be the very first char of the line
# and allowed only one separator, so it missed the header and fell through
# to a citation 32K deep (עמ"נ 50567-07-21). We now tolerate leading
# markdown/whitespace and 0-3 separators.
# 2. Bare ``השופט``/``הנשיא`` matched *citations* ("השופט מ' חשין, פסקה 23"),
# stripping real decision body. The authoring-judge line ends with a COLON
# ("השופט י' עמית:"); citations use a comma. We now require the colon.
_DECISION_START = re.compile(
r"^(בפנינו|לפנינו|לפניי|הערר שבנדון|ועדת הערר לתכנון|רקע עובדתי|עסקינן|"
r"פסק[- ]דין|פסק[- ]דינו|כב(?:וד)?['׳]?\s*השופט|המשנה לנשיא|הנשיא|השופט)",
r"^[ \t>*_#]{0,6}(?:"
r"בפנינו|לפנינו|לפניי|הערר שבנדון|ועדת הערר לתכנון|רקע עובדתי|עסקינן|"
r"פסק[ \t\-]{0,3}די(?:ן|נו)|" # פסק-דין / פסק דין / **פסק דין** header (final-nun ן vs דינו)
r"(?:כב(?:וד)?['׳\"]?\s*)?(?:ה?שופט[ת]?|ה?נשיא[ה]?|המשנה לנשיא)\s+[^\n,]{1,40}:" # author line → colon
r")",
re.MULTILINE,
)
@@ -388,3 +400,41 @@ def strip_nevo_preamble(text: str) -> str:
logger.debug("Stripped %d chars of Nevo preamble", m.start())
return stripped
return text
_RATIO_MARKER = "מיני-רציו:"
def extract_nevo_ratio(text: str) -> str:
"""Return the Nevo מיני-רציו block (editorial holdings summary), or ''.
The mini-ratio is Nevo's own headnote — a concise, professionally-written
list of the holdings. We capture it *before* :func:`strip_nevo_preamble`
discards it, to serve as a free gold-set for benchmarking how well our
halacha extractor covers the real holdings (#86.3).
The block runs from the ``מיני-רציו:`` marker to whichever comes first:
the decision body (``_DECISION_START``) or the next preamble marker
(bibliography / legislation). Returns '' when there is no mini-ratio.
"""
if not text:
return ""
start = text.find(_RATIO_MARKER)
if start == -1:
return ""
body = text[start + len(_RATIO_MARKER):]
# End at the earliest of: decision body start, or a following preamble
# marker (ספרות: / חקיקה שאוזכרה: / ...). Both are measured relative to
# the ratio body so we never run past it into the judgment itself.
end = len(body)
dm = _DECISION_START.search(body)
if dm:
end = min(end, dm.start())
for marker in _NEVO_MARKERS:
if marker == _RATIO_MARKER:
continue
pos = body.find(marker)
if pos != -1:
end = min(end, pos)
return body[:end].strip()
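
The #86.2 colon rule can be checked in isolation. The sketch below copies only the author-line alternative of `_DECISION_START` into a standalone pattern (the name `AUTHOR_LINE` is hypothetical and not part of the module) and runs it against two authoring lines and one citation.

```python
import re

# Author-line alternative only, copied from the new _DECISION_START pattern.
# AUTHOR_LINE is an illustrative name, not an identifier from the codebase.
AUTHOR_LINE = re.compile(
    r"^[ \t>*_#]{0,6}"
    r"(?:כב(?:וד)?['׳\"]?\s*)?(?:ה?שופט[ת]?|ה?נשיא[ה]?|המשנה לנשיא)\s+[^\n,]{1,40}:",
    re.MULTILINE,
)

authoring = "השופט י' עמית:\n\nגוף ההחלטה..."
honorific = "כב' השופטת ד' ברק-ארז:\n\nגוף ההחלטה..."
citation = "השופט מ' חשין, פסקה 23 (להלן עניין קהתי), יש לבחון..."

# The name segment [^\n,]{1,40} cannot cross a comma, so a citation never
# reaches the required colon and is left alone.
is_author = AUTHOR_LINE.search(authoring) is not None
is_honorific_author = AUTHOR_LINE.search(honorific) is not None
is_citation_match = AUTHOR_LINE.search(citation) is not None
```

Requiring the colon trades a little recall (an author line without a colon would be missed) for not stripping real decision body at a cited judge's name.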

@@ -592,10 +592,16 @@ async def _extract_impl(case_law_id: UUID, force: bool = False,
flags = halacha_quality.compute_quality_flags(
coerced["rule_statement"], coerced["supporting_quote"],
coerced["reasoning_summary"], coerced["quote_verified"],
coerced["rule_type"],
)
coerced["quality_flags"] = flags
if halacha_quality.FLAG_NON_DECISION in flags and coerced["rule_type"] != "obiter":
coerced["rule_type"] = "obiter"
# #81.4 — a binding-labeled rule that reads as a case-application is
# re-typed application (it carries FLAG_APPLICATION either way).
elif (halacha_quality.FLAG_APPLICATION in flags
and coerced["rule_type"] == "binding"):
coerced["rule_type"] = "application"
cleaned.append(coerced)
# #81.3 NLI entailment — one batched judge call per chunk (fail-open).
if config.HALACHA_NLI_ENABLED and cleaned:

@@ -128,6 +128,91 @@ def is_thin_restatement(rule_statement: str, supporting_quote: str) -> bool:
return overlap >= _THIN_OVERLAP and len_ratio <= _THIN_LEN_RATIO
# ── Fact-dependent application: not a generalizable holding (#81.4) ──
#
# The strict rubric's cut_application (docs/halacha-strict-rubric.md §3, §27):
# a determination that rests on the case's specific facts/parties/amounts is an
# illustration, not a holding — it must not enter the corpus as a binding rule.
# The extractor already classifies ``rule_type='application'``; this is a
# HIGH-PRECISION secondary catch for rules the model mislabeled as binding,
# using only the unambiguous "applied to THIS case" deixis (bare party words
# like "המערער" appear in genuine rules too, so they are deliberately excluded).
_FACT_DEPENDENT_MARKERS = (
"במקרה דנן",
"במקרה שבפנינו",
"במקרה שלפנינו",
"במקרה שלפניי",
"בענייננו",
"בנדון דידן",
"בנדון דנן",
"במקרה שלנו",
"בנסיבות המקרה שלפנינו",
"בנסיבות תיק זה",
"בתיק שלפנינו",
"בערר שלפנינו",
"בערר דנן",
)
def is_fact_dependent(rule_statement: str) -> bool:
"""True when the rule is phrased as an application to THIS case (not a holding)."""
norm = normalize_text(rule_statement)
return any(marker in norm for marker in _FACT_DEPENDENT_MARKERS)
# ── Lexical near-duplicate signal (the 0.83–0.90 cosine tail) — #82.3 ──
#
# Embedding cosine alone misses paraphrases that float just below the dedup
# threshold (0.93). A secondary lexical signal — Jaccard over word-shingles +
# normalized Levenshtein on the rule_statement — catches "same rule, reworded"
# in that band without lowering the global cosine threshold. Hybrid
# lexical+semantic beats either alone (arXiv:1805.11611). Pure functions.
def _shingles(text: str, k: int = 2) -> set[str]:
words = [w for w in re.split(r"[^א-ת0-9]+", normalize_text(text)) if w]
if len(words) < k:
return {" ".join(words)} if words else set()
return {" ".join(words[i : i + k]) for i in range(len(words) - k + 1)}
def jaccard_shingles(a: str, b: str, k: int = 2) -> float:
sa, sb = _shingles(a, k), _shingles(b, k)
if not sa or not sb:
return 0.0
return len(sa & sb) / len(sa | sb)
def normalized_levenshtein(a: str, b: str) -> float:
"""1.0 == identical, 0.0 == fully different (1 - edit distance / max len)."""
a, b = normalize_text(a), normalize_text(b)
if not a and not b:
return 1.0
if not a or not b:
return 0.0
# classic DP edit distance (rule_statements are short — a few hundred chars)
prev = list(range(len(b) + 1))
for i, ca in enumerate(a, 1):
cur = [i]
for j, cb in enumerate(b, 1):
cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
prev = cur
return 1.0 - prev[-1] / max(len(a), len(b))
_LEX_JACCARD_MIN = 0.55
_LEX_LEVENSHTEIN_MIN = 0.70
def lexical_near_duplicate(
a: str, b: str, jaccard_min: float = _LEX_JACCARD_MIN,
levenshtein_min: float = _LEX_LEVENSHTEIN_MIN,
) -> bool:
"""High lexical overlap → likely the same rule reworded (for the cosine tail)."""
return (jaccard_shingles(a, b) >= jaccard_min
or normalized_levenshtein(a, b) >= levenshtein_min)
# ── Aggregate ──
FLAG_NON_DECISION = "non_decision"
@@ -135,6 +220,8 @@ FLAG_TRUNCATED_QUOTE = "truncated_quote"
FLAG_THIN_RESTATEMENT = "thin_restatement"
FLAG_QUOTE_UNVERIFIED = "quote_unverified"
FLAG_NLI_UNSUPPORTED = "nli_unsupported" # rule not entailed by its quote (#81.3)
FLAG_APPLICATION = "application" # fact-dependent, not a holding (#81.4)
FLAG_NEAR_DUPLICATE = "near_duplicate" # cosine-tail lexical dup (#82.3)
# ── NLI entailment check (rule_statement ⊨ supporting_quote) — #81.3 ──
@@ -250,6 +337,7 @@ def compute_quality_flags(
supporting_quote: str,
reasoning_summary: str = "",
quote_verified: bool = True,
rule_type: str = "binding",
) -> list[str]:
"""Return the list of quality flags for one halacha (empty == clean).
@@ -264,4 +352,9 @@ def compute_quality_flags(
flags.append(FLAG_THIN_RESTATEMENT)
if not quote_verified:
flags.append(FLAG_QUOTE_UNVERIFIED)
# #81.4 — an application (fact-dependent) item is an illustration, not a
# generalizable holding: never auto-approve it. Trust the model's
# rule_type='application' and add a high-precision deixis catch.
if rule_type == "application" or is_fact_dependent(rule_statement):
flags.append(FLAG_APPLICATION)
return flags
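
As a worked example of the shingle signal, the sketch below re-implements `jaccard_shingles` on its own. `normalize_text` is not shown in this diff, so a plain `strip()` stands in for it here (an assumption); scores from the real module may differ if the real normalizer alters the text.

```python
import re


def shingles(text: str, k: int = 2) -> set[str]:
    # Stand-in for normalize_text(): plain strip() (assumption, see lead-in).
    words = [w for w in re.split(r"[^א-ת0-9]+", text.strip()) if w]
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i : i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 2) -> float:
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


base = "נטל ההוכחה בהיטל השבחה מוטל על הוועדה המקומית"
reworded = base + " בלבד"
other = "המועד להגשת ערר הוא שלושים יום"

# base has 8 words (7 bigrams); reworded has 9 words (8 bigrams); 7 are shared.
same_rule_score = jaccard(base, reworded)
# No bigram is shared between the two unrelated rules.
distinct_score = jaccard(base, other)
```

With the 0.55 Jaccard minimum from the diff, the reworded pair (7/8 = 0.875) would be flagged and the unrelated pair (0.0) would not.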

@@ -158,9 +158,14 @@ async def ingest_document(
except Exception as e:
await progress("failed", 100, f"כשל בחילוץ טקסט: {e}")
raise
raw_text = extractor.strip_nevo_preamble((raw_text or "")).strip()
raw_text = (raw_text or "")
else:
raw_text = (text or "").strip()
raw_text = (text or "")
# Capture the Nevo מיני-רציו (editorial holdings summary) BEFORE stripping
# it out — it is a free professional gold-set for benchmarking halacha
# extraction (#86.3). Stored on the case_law row below once we have its id.
nevo_ratio = extractor.extract_nevo_ratio(raw_text)
raw_text = extractor.strip_nevo_preamble(raw_text).strip()
if not raw_text:
await progress("failed", 100, "לא נמצא טקסט בקובץ")
raise ValueError("no extractable text in file")
@@ -180,6 +185,13 @@ async def ingest_document(
)
case_law_id = UUID(str(record["id"]))
# Persist the captured mini-ratio (best-effort; never block ingest on it).
if nevo_ratio:
try:
await db.update_case_law(case_law_id, nevo_ratio=nevo_ratio)
except Exception as e: # noqa: BLE001 — additive metadata, non-fatal
logger.warning("could not store nevo_ratio for %s: %s", case_law_id, e)
try:
stored_chunks = await _chunk_embed_store(case_law_id, raw_text, page_offsets, page_count, progress)
await db.mark_indexed(case_law_id)

@@ -117,12 +117,33 @@ async def halacha_backlog(conn) -> dict:
oldest = await conn.fetchval(
"SELECT MIN(created_at) FROM halachot WHERE review_status = 'pending_review'"
)
# #84.7 — split the pending bucket: how many are genuine candidates (clean)
# vs flagged 'needs extraction fix', and the breakdown by flag, so the chair
# sees how much of the backlog is real review vs extraction noise.
pending_clean = await conn.fetchval(
"SELECT COUNT(*) FROM halachot WHERE review_status = 'pending_review' "
"AND COALESCE(array_length(quality_flags, 1), 0) = 0"
)
flag_rows = await conn.fetch(
"SELECT flag, COUNT(*) AS n FROM ("
" SELECT unnest(quality_flags) AS flag FROM halachot "
" WHERE review_status = 'pending_review'"
") t GROUP BY flag ORDER BY n DESC"
)
pending_total = counts.get("pending_review", 0)
reviewed = counts.get("approved", 0) + counts.get("rejected", 0) + counts.get("published", 0)
return {
"pending_review": counts.get("pending_review", 0),
"pending_review": pending_total,
"pending_clean": pending_clean, # real review candidates (#84.1)
"pending_flagged": pending_total - pending_clean, # needs-fix bucket
"approved": counts.get("approved", 0),
"rejected": counts.get("rejected", 0),
"deferred": counts.get("deferred", 0),
"published": counts.get("published", 0),
"total": sum(counts.values()),
"reviewed_total": reviewed,
"approve_ratio": round(counts.get("approved", 0) / reviewed, 3) if reviewed else None,
"pending_by_flag": {r["flag"]: r["n"] for r in flag_rows},
"oldest_pending_at": oldest.isoformat() if oldest else None,
}
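
The derived backlog fields are simple arithmetic over the status counts. A minimal sketch with made-up numbers, assuming `counts` is a plain status-to-count mapping as the surrounding code suggests (`backlog_summary` is a hypothetical helper name):

```python
def backlog_summary(counts: dict[str, int], pending_clean: int) -> dict:
    """Sketch of the derived fields only; the DB queries are omitted."""
    pending_total = counts.get("pending_review", 0)
    reviewed = (
        counts.get("approved", 0)
        + counts.get("rejected", 0)
        + counts.get("published", 0)
    )
    return {
        "pending_clean": pending_clean,
        "pending_flagged": pending_total - pending_clean,
        "reviewed_total": reviewed,
        # None instead of a ZeroDivisionError when nothing has been reviewed yet
        "approve_ratio": round(counts.get("approved", 0) / reviewed, 3) if reviewed else None,
    }


# Hypothetical counts, for illustration only.
summary = backlog_summary(
    {"pending_review": 120, "approved": 60, "rejected": 15, "published": 5},
    pending_clean=45,
)
empty = backlog_summary({}, pending_clean=0)
```

Note that `approve_ratio` divides approved items by everything reviewed, including published ones, so it is a share of reviewed items rather than an approve/reject ratio.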

@@ -104,7 +104,7 @@ CLAIMS_CHECK_PROMPT = """אתה בודק איכות החלטות משפטיות.
"""
async def check_claims_coverage(blocks: list[dict], claims: list[dict]) -> dict:
async def check_claims_coverage(blocks: list[dict], claims: list[dict], outcome: str = "") -> dict:
"""Semantic check (via Claude) that every claim is addressed in the discussion."""
yod = next((b for b in blocks if b["block_id"] == "block-yod"), None)
if not yod or not yod.get("content"):
@@ -114,16 +114,26 @@ async def check_claims_coverage(blocks: list[dict], claims: list[dict]) -> dict:
if not claims:
return {"name": "claims_coverage", "passed": True, "errors": [], "severity": "critical"}
# Filter: only APPELLANT claims from original pleadings.
# Committee/permit_applicant claims are defensive positions, not claims
# that need to be "addressed" in the discussion.
# #87/GAP-87: only the appellant's claims from the APPEAL PLEADING itself
# must be addressed. claim_type: 'claim' = the appeal pleading (כתב ערר,
# mandatory), 'response' = the statement of response (כתב תשובה), 'reply' =
# reply / supplementary argument / correspondence (תגובה/השלמת-טיעון/תכתובת):
# supplementary material, NOT a standalone duty to answer, especially on full
# acceptance. Counting reply/correspondence claims as "unanswered" produced
# false QA fails (1033-25).
source_claims = [
c for c in claims
if c.get("source_document", "") != "block-zayin"
and c.get("claim_type") == "claim"
and c.get("party_role") == "appellant"
]
if not source_claims:
# Fallback: appellant/respondent pleadings, excluding supplementary replies.
source_claims = [
c for c in claims
if c.get("source_document", "") != "block-zayin"
and c.get("claim_type") != "reply"
and c.get("party_role") in ("appellant", "respondent")
]
if not source_claims:
# Fallback: all non-block-zayin claims
source_claims = [c for c in claims if c.get("source_document", "") != "block-zayin"]
if not source_claims:
source_claims = claims
@@ -165,9 +175,14 @@ async def check_claims_coverage(blocks: list[dict], claims: list[dict]) -> dict:
total = len(source_claims)
covered = len(addressed) + len(partial)
# On full acceptance the appellant prevailed in full — not every sub-claim
# needs individual treatment (the chair noted this for correspondence claims,
# 1033-25). Relax the missing-tolerance accordingly.
allowed_missing_ratio = 0.4 if outcome == "full_acceptance" else 0.2
return {
"name": "claims_coverage",
"passed": len(missing) <= total * 0.2, # Allow up to 20% missing
"passed": len(missing) <= total * allowed_missing_ratio,
"errors": errors,
"severity": "critical",
"details": f"{covered}/{total} טענות נענו ({covered/total*100:.0f}%), {len(partial)} חלקית, {len(missing)} חסרות",
@@ -361,8 +376,10 @@ async def validate_decision(case_id: UUID) -> dict:
# Get claims
claims = await db.get_claims(case_id)
# Determine appeal type
# Determine appeal type + outcome (outcome relaxes claims coverage on full acceptance — #87)
appeal_type = case.get("appeal_type", "licensing")
from legal_mcp.services.lessons import canonical_outcome
outcome = canonical_outcome(decision.get("outcome", "") or "")
# Run all checks
# Run sync checks
@@ -370,7 +387,7 @@ async def validate_decision(case_id: UUID) -> dict:
check_neutral_background(blocks),
]
# Async check: claims coverage with Claude
results.append(await check_claims_coverage(blocks, claims))
results.append(await check_claims_coverage(blocks, claims, outcome))
# More sync checks
results.extend([
check_weight_compliance(blocks, appeal_type),
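
The effect of the relaxed threshold is easiest to see on a small case. A minimal sketch of the pass rule alone (`coverage_passes` is a hypothetical helper, not a function in the module):

```python
def coverage_passes(missing: int, total: int, outcome: str = "") -> bool:
    """Sketch of the pass rule: tolerate 40% missing on full acceptance, else 20%."""
    allowed_missing_ratio = 0.4 if outcome == "full_acceptance" else 0.2
    return missing <= total * allowed_missing_ratio


# 3 of 10 source claims unanswered:
default_result = coverage_passes(missing=3, total=10)                        # 3 > 2.0
accepted_result = coverage_passes(missing=3, total=10, outcome="full_acceptance")  # 3 <= 4.0
boundary_result = coverage_passes(missing=2, total=10)                       # 2 <= 2.0
```

The same draft therefore fails QA under any other outcome but passes when the appeal was accepted in full.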

@@ -27,6 +27,62 @@ _BLOCK_TO_SECTION = {
"block-yod-alef": "summary",
}
# chunker section_type → golden-ratio section (for corpus measurement, T10)
_CHUNK_SECTION_TO_GOLDEN = {
"facts": "background", "intro": "background",
"appellant_claims": "claims", "respondent_claims": "claims",
"legal_analysis": "discussion",
"conclusion": "summary", "ruling": "summary",
}
_CORPUS_RATIOS_CACHE: dict | None = None
async def measure_corpus_ratios() -> dict:
"""Measure ACTUAL section %-of-total from Dafna's style_corpus, averaged per
outcome — the empirical counterpart to lessons.GOLDEN_RATIOS (T10). Splits each
decision via chunker (accurate, not the filtered exemplars). Cached for the
process. Returns {outcome: {"n": int, "sections": {sec: pct}}}."""
global _CORPUS_RATIOS_CACHE
if _CORPUS_RATIOS_CACHE is not None:
return _CORPUS_RATIOS_CACHE
from legal_mcp.services.chunker import _split_into_sections
pool = await db.get_pool()
async with pool.acquire() as conn:
rows = await conn.fetch("SELECT full_text, outcome FROM style_corpus WHERE full_text <> ''")
# Per-outcome AND an "_all" aggregate. style_corpus.outcome is currently
# unpopulated for the imported corpus, so per-outcome may be empty — "_all"
# is the meaningful signal today, and per-outcome becomes live once outcomes
# are backfilled. No silent loss: callers see which buckets have data via n.
by_outcome: dict[str, list[dict]] = {}
for r in rows:
sect_words: dict[str, int] = {}
for stype, stext in _split_into_sections(r["full_text"]):
g = _CHUNK_SECTION_TO_GOLDEN.get(stype)
if g:
sect_words[g] = sect_words.get(g, 0) + len(stext.split())
total = sum(sect_words.values())
if total < 100: # sections didn't parse — skip
continue
pct = {s: w / total * 100 for s, w in sect_words.items()}
by_outcome.setdefault("_all", []).append(pct)
outcome = canonical_outcome(r["outcome"] or "")
if outcome:
by_outcome.setdefault(outcome, []).append(pct)
result: dict = {}
for outcome, decs in by_outcome.items():
avg = {}
for sec in ("background", "claims", "discussion", "summary"):
vals = [d.get(sec, 0.0) for d in decs]
if vals:
avg[sec] = round(sum(vals) / len(vals), 1)
result[outcome] = {"n": len(decs), "sections": avg}
_CORPUS_RATIOS_CACHE = result
return result
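
The averaging step can be sketched without the database or the chunker. The example below assumes each decision has already been reduced to word counts per golden-ratio section; `average_ratios` and the sample counts are hypothetical.

```python
SECTIONS = ("background", "claims", "discussion", "summary")


def average_ratios(decisions: list[dict[str, int]], min_words: int = 100) -> dict:
    """Sketch: per-decision section percentages, then a plain mean per section."""
    per_decision = []
    for words in decisions:
        total = sum(words.values())
        if total < min_words:  # sections did not parse, so skip the decision
            continue
        per_decision.append({s: w / total * 100 for s, w in words.items()})
    if not per_decision:
        return {"n": 0, "sections": {}}
    avg = {
        # a section missing from a decision counts as 0% for that decision
        s: round(sum(d.get(s, 0.0) for d in per_decision) / len(per_decision), 1)
        for s in SECTIONS
    }
    return {"n": len(per_decision), "sections": avg}


sample = [
    {"background": 100, "discussion": 300},                               # 25% / 75%
    {"background": 40, "claims": 40, "discussion": 80, "summary": 40},    # 20/20/40/20
    {"background": 10, "discussion": 20},                                 # under 100 words, skipped
]
measured = average_ratios(sample)
```

Because a missing section is averaged in as 0%, a decision whose claims section was not detected pulls the corpus-wide claims figure down; this is one way section-detection limits show up in the measured ratios.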
def count_anti_patterns(text: str) -> dict:
"""Count each anti-pattern occurrence in text. Lower = closer to Dafna."""

@@ -170,6 +170,41 @@ async def get_style_guide() -> str:
)
result += "\n"
# T10 — measured-from-corpus ratios alongside the targets, ⚠️ flags a gap
# (actual average outside the target range → revisit the target or the corpus).
try:
from legal_mcp.services.style_distance import measure_corpus_ratios
measured = await measure_corpus_ratios()
if measured:
result += "### נמדד מהקורפוס בפועל (ממוצע) — ⚠️ = פער מהיעד\n\n"
result += "| קבוצה | רקע | טענות | דיון | סיכום |\n|---|------|-------|------|-------|\n"
# Per-outcome rows (flagged vs that outcome's target), when outcomes exist.
for outcome in VALID_OUTCOMES:
m = measured.get(outcome)
if not m:
continue
tgt = GOLDEN_RATIOS[outcome]
cells = []
for sec in ("background", "claims", "discussion", "summary"):
val = m["sections"].get(sec)
if val is None:
cells.append("")
continue
lo, hi = tgt[sec]
cells.append(f"{val}%" + ("" if lo <= val <= hi else " ⚠️"))
result += f"| {outcome_labels[outcome]} (n={m['n']}) | " + " | ".join(cells) + " |\n"
# "_all" aggregate — the meaningful row today (corpus outcome unpopulated);
# shown informationally (no single target to flag against).
allm = measured.get("_all")
if allm:
cells = [f"{allm['sections'].get(s, '')}%" if allm['sections'].get(s) is not None else ""
for s in ("background", "claims", "discussion", "summary")]
result += f"| כל ההחלטות (n={allm['n']}) | " + " | ".join(cells) + " |\n"
result += ("\n_⚠ = הממוצע בפועל חורג מטווח-היעד; שקול לעדכן יעד ב-/methodology או לבדוק את הקורפוס. "
"פיצול לפי-תוצאה יופיע כש-`style_corpus.outcome` יאוכלס._\n\n")
except Exception as e: # surfaced, not swallowed
result += f"_מדידת יחסי-זהב מהקורפוס נכשלה: {e}_\n\n"
# Opening and summary strategies
result += "## אסטרטגיות פתיחה וסיכום לפי תוצאה\n\n"
for outcome in VALID_OUTCOMES:

@@ -356,7 +356,22 @@ async def halacha_review(
return _ok(row)
async def halachot_pending(limit: int = 100) -> str:
"""Queue of halachot awaiting approval (review_status='pending_review')."""
rows = await db.list_halachot(review_status="pending_review", limit=limit)
async def halachot_pending(limit: int = 100, include_low_quality: bool = False) -> str:
"""Queue of halachot awaiting approval (review_status='pending_review').
By default (#84.1, #84.3) the queue is **filtered**: halachot carrying any
quality flag (application / unverified quote / truncated / obiter / thin
restatement / unsupported / near-duplicate) are hidden (they belong in the
'needs extraction fix' bucket, not the approval queue), and it is **sorted by
priority** (negatively treated first, then the most uncertain, then the oldest).
Args:
limit: maximum number of items.
include_low_quality: True to also expose quality-flagged items (the 'needs fix' bucket).
"""
rows = await db.list_halachot(
review_status="pending_review",
limit=limit,
exclude_low_quality=not include_low_quality,
order_by_priority=True,
)
return _ok(rows)

@@ -0,0 +1,44 @@
from __future__ import annotations
import os
from legal_mcp.services import claude_session as cs
def test_clean_env_strips_session_markers(monkeypatch):
"""Nested claude -p must not inherit the parent session markers (#85)."""
for k in (
"CLAUDECODE",
"CLAUDE_CODE_ENTRYPOINT",
"CLAUDE_CODE_SESSION_ID",
"CLAUDE_CODE_EXECPATH",
"CLAUDE_CODE_SSE_PORT",
"CLAUDE_AGENT_SDK_VERSION",
"AI_AGENT",
"CLAUDE_EFFORT",
):
monkeypatch.setenv(k, "x")
env = cs._clean_subprocess_env()
assert "CLAUDECODE" not in env
assert "AI_AGENT" not in env
assert "CLAUDE_EFFORT" not in env
assert not any(k.startswith("CLAUDE_CODE_") for k in env)
assert not any(k.startswith("CLAUDE_AGENT_") for k in env)
def test_clean_env_keeps_auth_and_path(monkeypatch):
"""Auth/config + PATH/HOME must survive — they are needed by the CLI."""
monkeypatch.setenv("CLAUDECODE", "1")
monkeypatch.setenv("CLAUDE_CONFIG_DIR", "/home/chaim/.claude")
monkeypatch.setenv("ANTHROPIC_BASE_URL", "https://example")
monkeypatch.setenv("PATH", os.environ.get("PATH", "/usr/bin"))
env = cs._clean_subprocess_env()
# CLAUDE_CONFIG_DIR carries credentials — must NOT be stripped.
assert env.get("CLAUDE_CONFIG_DIR") == "/home/chaim/.claude"
assert env.get("ANTHROPIC_BASE_URL") == "https://example"
assert "PATH" in env
assert "CLAUDECODE" not in env

@@ -181,3 +181,75 @@ def test_consolidation_priority_prefers_approved_then_confidence():
"quote_verified": True, "rule_statement": "x"}
# approved sorts before higher-confidence pending → kept as canonical
assert min([approved, pending_hi], key=he._consolidation_priority)["id"] == "a"
# ── #81.4 fact-dependent / application ──
@pytest.mark.parametrize("rule", [
"במקרה דנן ועדת הערר קבעה כי ההיתר בטל",
"בענייננו אין הצדקה לפיצוי",
"בערר שלפנינו הוכח כי השומה שגויה",
])
def test_is_fact_dependent_hits(rule):
assert hq.is_fact_dependent(rule) is True
@pytest.mark.parametrize("rule", [
"ועדת הערר מוסמכת לדון בהיטל השבחה",
"נטל ההוכחה מוטל על המבקש",
"פגיעה תכנונית מזכה בפיצוי לפי סעיף 197",
])
def test_is_fact_dependent_misses(rule):
assert hq.is_fact_dependent(rule) is False
def test_application_flag_from_rule_type():
flags = hq.compute_quality_flags(
"נטל ההוכחה על המבקש", "נטל ההוכחה על המבקש כאמור",
rule_type="application",
)
assert hq.FLAG_APPLICATION in flags
def test_application_flag_from_deixis_even_if_binding():
flags = hq.compute_quality_flags(
"במקרה דנן נדחה הערר", "כפי שקבענו במקרה דנן נדחה הערר",
rule_type="binding",
)
assert hq.FLAG_APPLICATION in flags
def test_clean_binding_rule_has_no_flags():
flags = hq.compute_quality_flags(
"ועדת הערר מוסמכת לדון בטענות חוקתיות הנוגעות לתכנית",
"הוועדה מוסמכת לדון אף בטענות מסוג זה, ככל שהן נוגעות לתכנית שבנדון.",
rule_type="binding",
)
assert flags == []
# ── #82.3 lexical near-duplicate signal ──
def test_jaccard_high_for_reworded_same_rule():
a = "נטל ההוכחה בהיטל השבחה מוטל על הוועדה המקומית"
b = "נטל ההוכחה בהיטל השבחה מוטל על הוועדה המקומית בלבד"
assert hq.jaccard_shingles(a, b) >= 0.5
def test_jaccard_low_for_distinct_rules():
a = "ועדת הערר מוסמכת לדון בהיטל השבחה"
b = "המועד להגשת ערר הוא שלושים יום"
assert hq.jaccard_shingles(a, b) < 0.2
def test_normalized_levenshtein_identical_and_disjoint():
assert hq.normalized_levenshtein("אבג", "אבג") == 1.0
assert hq.normalized_levenshtein("", "אבג") == 0.0
def test_lexical_near_duplicate_band():
a = "נטל ההוכחה בהיטל השבחה מוטל על הוועדה המקומית"
b = "נטל ההוכחה בהיטל השבחה מוטל על הוועדה המקומית, כך נפסק"
assert hq.lexical_near_duplicate(a, b) is True
c = "המועד להגשת ערר על שומה הוא שלושים ימים"
assert hq.lexical_near_duplicate(a, c) is False

@@ -55,3 +55,64 @@ def test_markers_past_400_chars_still_detected():
text = header + _PREAMBLE + "השופטת ע' ארבל:\n\nגוף ההחלטה..."
out = ex.strip_nevo_preamble(text)
assert out.startswith("השופטת ע' ארבל:")
# ── extract_nevo_ratio (#86.3 gold-set capture) ──
def test_extract_ratio_returns_block_before_body():
text = _PREAMBLE + "השופט ס' ג'ובראן:\n\nגוף ההחלטה..."
ratio = ex.extract_nevo_ratio(text)
assert "העותרים לא הוכיחו טעם מיוחד" in ratio
assert "המחוקק הגביל את הזמן" in ratio
# must not bleed into the judgment body
assert "גוף ההחלטה" not in ratio
assert "השופט ס' ג'ובראן" not in ratio
def test_extract_ratio_stops_at_following_marker():
# ratio first, then a bibliography marker AFTER it
text = (
"מיני-רציו:\n* עיקרון אחד בלבד.\n\n"
"פסקי דין שאוזכרו:\nבג\"ץ 1/00\n\n"
"פסק-דין\nגוף..."
)
ratio = ex.extract_nevo_ratio(text)
assert "עיקרון אחד בלבד" in ratio
assert "פסקי דין שאוזכרו" not in ratio
assert "בג\"ץ 1/00" not in ratio
def test_extract_ratio_empty_when_no_marker():
assert ex.extract_nevo_ratio("פסק דין\nהשופט כהן: ...") == ""
assert ex.extract_nevo_ratio("") == ""
# ── #86.2 over-strip regressions ──
def test_citation_judge_line_is_not_a_decision_start():
# "השופט מ' חשין, פסקה 23" is a CITATION (comma, no colon) — must NOT be
# treated as the decision opening, or 32K of real body gets stripped.
body = (
"**פסק דין**\n\n"
"שני ערעורים לפניי. כפי שנפסק מפי כבוד \n\n"
"השופט מ' חשין, פסקה 23 (להלן עניין קהתי), יש לבחון...\n"
)
text = _PREAMBLE + body
out = ex.strip_nevo_preamble(text)
assert out.startswith("**פסק דין**")
assert "השופט מ' חשין, פסקה" in out # citation kept inside body
assert "מיני-רציו" not in out
def test_markdown_wrapped_pdin_header_is_stripped():
text = _PREAMBLE + "**פסק דין**\n\nשני ערעוריה הנדונים..."
out = ex.strip_nevo_preamble(text)
assert out.startswith("**פסק דין**")
assert "מיני-רציו" not in out
def test_author_line_with_colon_still_strips():
text = _PREAMBLE + "כב' השופטת ד' ברק-ארז:\n\nגוף ההחלטה..."
out = ex.strip_nevo_preamble(text)
assert out.startswith("כב' השופטת ד' ברק-ארז:")
assert "מיני-רציו" not in out

@@ -36,6 +36,11 @@
| `multimodal_backfill.py` | python | Backfill voyage-multimodal-3 page embeddings for existing case documents. Idempotent (skips by default), forces `MULTIMODAL_ENABLED=true` for the run, runs from the container. Stage C; see `docs/voyage-upgrades-plan.md` | Manual, per case (`python multimodal_backfill.py 8174-24 8137-24`) |
| `backfill_chunk_pages.py` | python | Backfill `page_number` on existing `document_chunks`. The legacy chunker did not track pages, so `page_number=NULL` blocks the multimodal hybrid boost (text+image join on the same page). Re-extracts every PDF (re-OCR if needed, ~$0.0015/page), computes page_offsets, and updates the chunks. Idempotent | Manual, per case (`python backfill_chunk_pages.py 8174-24 8137-24`) |
| `rechunk_legacy_precedents.py` | python | **#57**: re-chunk + re-embed case law that was ingested before the chunker fix (#55). Selects every `case_law` with a tiny chunk (`length(trim(content))<50`, the fingerprint of the old chunker) and runs `ingest.reindex_case_law` (re-chunk+re-embed from the stored `full_text` only; no re-OCR/LLM, feedback_no_reocr_retrofit; idempotent DELETE-then-INSERT). Idempotent at the batch level (re-fetches the affected set on every run). Flag `--limit N`. Runs with the mcp-server venv (`cd mcp-server && .venv/bin/python ../scripts/rechunk_legacy_precedents.py`) | One-off: data migration of legacy case law (fixed 2026-06-03) |
| `backfill_nevo_preamble.py` | python | **#86.2**: data migration that strips the Nevo preamble/ratio which leaked into case law ingested before the #86.1 fix. Finds every `case_law` whose `full_text` is still shortened by `strip_nevo_preamble(full_text)` (a historical leak), and: (1) captures the mini-ratio into `case_law.nevo_ratio` (gold-set for #86.3); (2) rewrites the stripped `full_text` + recomputes `content_hash`; (3) runs `reindex_case_law` (re-chunk+embed, no re-OCR/LLM); (4) **flags (does not delete)** halachot whose `supporting_quote` lies inside the removed preamble → `pending_review` + quality_flag `nevo_preamble_leak`. **Safety guard:** rows with keep%<`--min-keep` (default 60) are excluded from `--apply` as suspected over-strips (unless `--include-suspicious`). **Dry-run by default**; `--apply` first writes a backup JSON + manifest CSV to `data/audit/`. Idempotent. Runs with the mcp-server venv. **Chair-gated** (verify the manifest before apply) | Data migration: dry-run done (19 rulings, 27 contaminated halachot); apply awaits approval |
| `nevo_ratio_benchmark.py` | python | **#86.3**: measures halacha-extraction quality against Nevo's mini-ratio (a free professional gold-set). For every ruling with `nevo_ratio` (or one derived from `full_text` if the backfill has not run yet), a local LLM judge (`claude_session`, zero cost) semantically maps the system's halachot against Nevo's holdings and produces **recall** (coverage of Nevo's holdings), **precision** (share of our halachot that are mapped), and **granularity** (decomposition ratio; an over-extraction signal for #81.5). `--case <num>` / `--all [--limit N]` / `--model` / `--out`. Writes a CSV to `data/audit/`. Runs with the mcp-server venv (requires a local Claude CLI). Verified on בג"ץ 1764/05: recall 0.875, precision 1.0, granularity 1.75x | Manual: quality measurement (CI/ad-hoc) |
| `halacha_goldset.py` | python | **#81.7**: gold-set harness for halacha-extraction quality. `export --n N` exports a stratified sample (by precedent×rule_type) to a CSV with empty labeling columns (`is_holding`/`correct_type`/`quote_complete`) for manual labeling (Chaim/Dafna). `score --in <csv>` reads the labeled CSV and measures each validator (`compute_quality_flags`/`is_fact_dependent`/`is_quote_truncated`/`is_thin_restatement`) against the human ground truth: P/R/F1 + confusion. Basis for #81.8 (calibrating the approval threshold). Imports the same validators the extractor runs. Runs with the mcp-server venv | Manual: export→label→score |
| `halacha_batch_reconcile.py` | python | **#82.7**: offline cross-ruling dedup (conservative, **dry-run only**). Dedup-on-insert compares only within a ruling; here the threshold is stricter (cosine ≥0.95, `--cosine`) and non-destructive: it finds near-duplicate halacha pairs across different rulings (pgvector `<=>` exact) with a lexical signal (Jaccard/Levenshtein) and reports them to a CSV in `data/audit/` for the chair's review. Does not skip/merge/delete. `--include-pending`. Runs with the mcp-server venv. Verified: 819 halachot → 5 candidate pairs | Manual: review report |
| `calibrate_halacha_dedup.py` | python | **#82.1**: calibrates the lexical dedup thresholds (#82.3) against the cleanup gold-set. Reads `halacha-cleanup-manifest-*.csv` (human-labeled duplicate↔survivor pairs), loads the survivor text from the DB, and sweeps (jaccard_min × levenshtein_min) with P/R/F1, marking the configured operating point. Verified that (0.55, 0.70) → **precision 1.0** (zero false merges), recall 0.30; suitable as a secondary signal that blocks auto-approve. `--manifest <path>`. Runs with the mcp-server venv | One-off: calibration (done 2026-06-06) |
| `audit_corpus_integrity.py` | python | Periodic check of corpus consistency: 3 read-only SQL checks on `case_law` and `cases`: (A) `external_upload` with an internal prefix `ערר`/`בל"מ`; (B) `internal_committee` missing `chair_name`/`district`; (C) `cases.practice_area` outside {`rishuy_uvniya`, `betterment_levy`, `compensation_197`, `''`}. Writes a cumulative log to `data/logs/corpus_integrity_audit.log` and, when violations are found, sends a wakeup to the CEO in Paperclip (best-effort, only if `PAPERCLIP_API_URL`+`PAPERCLIP_API_KEY` are set). Flag: `--no-notify`. Idempotent, exits 0. **Daily cron at 07:00**: `0 7 * * * /home/chaim/legal-ai/mcp-server/.venv/bin/python /home/chaim/legal-ai/scripts/audit_corpus_integrity.py` | `0 7 * * *` (cron) |
| `backfill_legal_arguments.py` | python | Backfill `legal_arguments` for cases with existing `claims` (TaskMaster #36). Groups raw propositions into distinct legal arguments (~6-12 per side) via `argument_aggregator.aggregate_claims_to_arguments` (Claude CLI). Supports `--dry-run`/`--apply`/`--force`/`--case <num>...`. **Must run on the local machine** (not the container), because `claude_session` requires the Claude CLI | Manual, per case (`python scripts/backfill_legal_arguments.py --apply --case 1017-03-26`) |
| `upload_blam_decisions.py` | python | One-off (2026-05-26): uploads 2 בל"מ decisions to `case_law` (8126/24 סופר נוח, 8047/23 הרנון) via a direct `ingest_internal_decision` call, bypassing the MCP server, which had not yet been reloaded after `proceeding_type` was added. **Do not run again** | One-off; move to `.archive/` when convenient |

@@ -0,0 +1,240 @@
#!/usr/bin/env python3
"""#86.2 — backfill: strip leaked Nevo preamble/ratio from already-ingested rulings.
Court rulings ingested BEFORE the #86.1 fix kept their Nevo preamble
(bibliography + מיני-רציו) because the old ``_DECISION_START`` regex only
matched ועדת-ערר openings, not ``פסק-דין``/judge openings. For those rows the
preamble is baked into the stored ``full_text`` AND into the chunks — and the
מיני-רציו (Nevo's editorial answer-key) may have leaked into extracted
halachot, contaminating the corpus.
This script finds every case_law row whose stored ``full_text`` would still be
shortened by the CURRENT ``strip_nevo_preamble`` (i.e. a pre-fix leak), and:
1. captures the מיני-רציו into ``case_law.nevo_ratio`` (gold-set for #86.3),
unless that column is already populated;
2. rewrites ``full_text`` to the stripped body + recomputes ``content_hash``;
3. re-chunks + re-embeds via ``ingest.reindex_case_law`` (no re-OCR, no LLM);
4. flags — never deletes — halachot whose supporting_quote lives entirely in
the removed preamble region: review_status -> 'pending_review' plus a
'nevo_preamble_leak' quality_flag, so the chair can re-judge them (#84).
DRY-RUN BY DEFAULT. ``--apply`` performs the migration and first writes a JSON
backup + CSV manifest to ``data/audit/`` (per the code-protocol data-migration
rule). Idempotent: a re-run finds nothing because stripped rows no longer match.
Run with the MCP server venv (config loads ~/.env / Infisical for POSTGRES +
VOYAGE, same as the live MCP tools):
cd ~/legal-ai/mcp-server
.venv/bin/python ../scripts/backfill_nevo_preamble.py # dry-run
.venv/bin/python ../scripts/backfill_nevo_preamble.py --apply # migrate
.venv/bin/python ../scripts/backfill_nevo_preamble.py --limit 3 # smoke
"""
from __future__ import annotations
import argparse
import asyncio
import csv
import json
import sys
from datetime import datetime, timezone
from pathlib import Path
from legal_mcp.services import db, ingest
from legal_mcp.services.extractor import extract_nevo_ratio, strip_nevo_preamble
from legal_mcp.services.halacha_quality import normalize_text
REPO_ROOT = Path(__file__).resolve().parent.parent
AUDIT_DIR = REPO_ROOT / "data" / "audit"
# Safety: a clean strip removes only the Nevo preamble (a small head). If the
# strip would discard more than this fraction of the document, treat it as a
# suspected over-strip (a citation/heading false-match) and DO NOT auto-apply
# — surface it for manual review instead. Destroying real decision body is
# far worse than leaving a preamble in place.
DEFAULT_MIN_KEEP_PCT = 60
async def _scan(conn, limit: int | None) -> list[dict]:
    """Return rows whose stored full_text still carries a Nevo preamble."""
    rows = await conn.fetch(
        "SELECT id, case_number, full_text, nevo_ratio "
        "FROM case_law WHERE full_text <> '' ORDER BY case_number"
    )
    hits: list[dict] = []
    for r in rows:
        full = r["full_text"] or ""
        stripped = strip_nevo_preamble(full)
        if stripped == full:
            continue  # no leak (already clean, or never had a preamble)
        removed = full[: len(full) - len(stripped)]
        ratio = extract_nevo_ratio(full)
        keep_pct = round(100 * len(stripped) / len(full)) if full else 0
        hits.append({
            "id": r["id"],
            "case_number": r["case_number"],
            "full_text": full,
            "stripped": stripped,
            "removed": removed,
            "ratio": ratio,
            "keep_pct": keep_pct,
            "had_ratio_stored": bool((r["nevo_ratio"] or "").strip()),
        })
        if limit and len(hits) >= limit:
            break
    return hits
async def _contaminated_halachot(conn, case_law_id, removed: str) -> list[dict]:
    """Halachot whose supporting_quote sits entirely inside the removed preamble."""
    norm_removed = normalize_text(removed)
    if not norm_removed:
        return []
    rows = await conn.fetch(
        "SELECT id, halacha_index, supporting_quote, review_status, quality_flags "
        "FROM halachot WHERE case_law_id = $1",
        case_law_id,
    )
    bad = []
    for r in rows:
        q = normalize_text(r["supporting_quote"] or "")
        if len(q) >= 20 and q in norm_removed:
            bad.append(dict(r))
    return bad
async def main(args: argparse.Namespace) -> int:
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    pool = await db.get_pool()
    async with pool.acquire() as conn:
        hits = await _scan(conn, args.limit)
        for h in hits:
            h["contaminated"] = await _contaminated_halachot(conn, h["id"], h["removed"])

    # Partition into safe (auto-appliable) vs suspicious (manual review).
    for h in hits:
        h["suspicious"] = h["keep_pct"] < args.min_keep
    safe = [h for h in hits if not h["suspicious"]]
    suspicious = [h for h in hits if h["suspicious"]]

    n = len(hits)
    total_contam = sum(len(h["contaminated"]) for h in hits)
    print(f"leaked rulings found: {n} (contaminated halachot: {total_contam}; "
          f"safe: {len(safe)}, suspicious<{args.min_keep}%: {len(suspicious)})", flush=True)
    for h in hits:
        # Two-character marker column: "! " flags a suspicious row.
        print(
            f" {'! ' if h['suspicious'] else '  '}{h['case_number']}: "
            f"keep {h['keep_pct']}%, -{len(h['removed']):,} preamble chars, "
            f"ratio={len(h['ratio'])} chars, "
            f"{len(h['contaminated'])} contaminated halachot"
            + ("" if h["ratio"] else " [no mini-ratio]")
            + (" [ratio already stored]" if h["had_ratio_stored"] else ""),
            flush=True,
        )
    if suspicious:
        print(f"\n{len(suspicious)} ruling(s) below {args.min_keep}% keep — "
              "EXCLUDED from --apply (suspected over-strip). Review manually or "
              "pass --include-suspicious to force.", flush=True)
    if not hits:
        print("nothing to backfill — corpus clean ✓", flush=True)
        return 0
    apply_set = hits if args.include_suspicious else safe

    # Always write a manifest (dry-run included) for the audit trail.
    AUDIT_DIR.mkdir(parents=True, exist_ok=True)
    manifest = AUDIT_DIR / f"nevo-backfill-manifest-{ts}.csv"
    with manifest.open("w", encoding="utf-8", newline="") as f:
        w = csv.writer(f)
        w.writerow(["case_law_id", "case_number", "keep_pct", "preamble_chars",
                    "ratio_chars", "contaminated_halachot", "suspicious", "applied"])
        for h in hits:
            will_apply = args.apply and (not h["suspicious"] or args.include_suspicious)
            w.writerow([h["id"], h["case_number"], h["keep_pct"], len(h["removed"]),
                        len(h["ratio"]), len(h["contaminated"]), h["suspicious"], will_apply])
    print(f"manifest: {manifest}", flush=True)
    if not args.apply:
        print("\nDRY-RUN — no changes written. Re-run with --apply to migrate.", flush=True)
        return 0

    # Backup the BEFORE state before mutating anything.
    backup = AUDIT_DIR / f"nevo-backfill-backup-{ts}.json"
    with backup.open("w", encoding="utf-8") as f:
        json.dump([
            {
                "id": str(h["id"]),
                "case_number": h["case_number"],
                "full_text": h["full_text"],
                "ratio": h["ratio"],
                "contaminated": [
                    {"id": str(c["id"]), "halacha_index": c["halacha_index"],
                     "review_status": c["review_status"],
                     "quality_flags": list(c["quality_flags"] or [])}
                    for c in h["contaminated"]
                ],
            }
            for h in apply_set
        ], f, ensure_ascii=False, indent=2)
    print(f"backup: {backup}", flush=True)

    n_apply = len(apply_set)
    ok, failed = 0, []
    for i, h in enumerate(apply_set, 1):
        cid, cn = h["id"], h["case_number"]
        try:
            async with pool.acquire() as conn:
                async with conn.transaction():
                    # 1+2: rewrite full_text + content_hash; store ratio if absent.
                    await conn.execute(
                        "UPDATE case_law SET full_text = $2, content_hash = $3 WHERE id = $1",
                        cid, h["stripped"], db._content_hash(h["stripped"]),
                    )
                    if h["ratio"] and not h["had_ratio_stored"]:
                        await conn.execute(
                            "UPDATE case_law SET nevo_ratio = $2 WHERE id = $1",
                            cid, h["ratio"],
                        )
                    # 4: flag (never delete) contaminated halachot.
                    for c in h["contaminated"]:
                        flags = list(c["quality_flags"] or [])
                        if "nevo_preamble_leak" not in flags:
                            flags.append("nevo_preamble_leak")
                        await conn.execute(
                            "UPDATE halachot SET review_status = 'pending_review', "
                            "quality_flags = $2 WHERE id = $1",
                            c["id"], flags,
                        )
            # 3: reindex outside the txn (its own DELETE-then-INSERT + embeddings).
            res = await ingest.reindex_case_law(cid)
            ok += 1
            print(f"[{i}/{n_apply}] OK {cn}: -> {res['chunks']} chunks, "
                  f"{len(h['contaminated'])} halachot flagged", flush=True)
        except Exception as e:  # noqa: BLE001 — per-row, keep going
            failed.append((cn, str(e)))
            print(f"[{i}/{n_apply}] FAIL {cn}: {e}", flush=True)

    print(f"\nDONE — {ok}/{n_apply} migrated, {len(failed)} failed"
          + (f", {len(suspicious)} suspicious skipped" if suspicious and not args.include_suspicious else ""),
          flush=True)
    for cn, e in failed:
        print(f" FAILED {cn}: {e}", flush=True)
    return 0 if not failed else 1
if __name__ == "__main__":
    ap = argparse.ArgumentParser(description=__doc__,
                                 formatter_class=argparse.RawDescriptionHelpFormatter)
    ap.add_argument("--apply", action="store_true",
                    help="perform the migration (default: dry-run)")
    ap.add_argument("--limit", type=int, default=None,
                    help="process only the first N leaked rulings")
    ap.add_argument("--min-keep", type=int, default=DEFAULT_MIN_KEEP_PCT,
                    help=f"min%% of doc that must remain after strip to auto-apply "
                         f"(default {DEFAULT_MIN_KEEP_PCT}); lower = suspected over-strip")
    ap.add_argument("--include-suspicious", action="store_true",
                    help="force --apply on rows below --min-keep (use with care)")
    args = ap.parse_args()
    sys.exit(asyncio.run(main(args)))
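
The over-strip guard used by the script above can be illustrated in isolation. This is a minimal sketch, not the script's own code: `keep_pct` and `partition_hits` are hypothetical helper names, and the sample strings are invented. It shows how the retained fraction is computed and how rows that fall below the `--min-keep` threshold are set aside for manual review.

```python
DEFAULT_MIN_KEEP_PCT = 60  # same default value as the script's --min-keep


def keep_pct(full: str, stripped: str) -> int:
    """Percentage of the original text that survives the strip (0 for empty input)."""
    return round(100 * len(stripped) / len(full)) if full else 0


def partition_hits(hits: list[dict], min_keep: int = DEFAULT_MIN_KEEP_PCT):
    """Split hits into (safe, suspicious) according to the keep-percent guard."""
    safe = [h for h in hits if h["keep_pct"] >= min_keep]
    suspicious = [h for h in hits if h["keep_pct"] < min_keep]
    return safe, suspicious


# Hypothetical sample: a short preamble versus a strip that eats most of the text.
body = "x" * 900
hits = [
    {"case_number": "A", "keep_pct": keep_pct("p" * 100 + body, body)},
    {"case_number": "B", "keep_pct": keep_pct("p" * 700 + "y" * 300, "y" * 300)},
]
safe, suspicious = partition_hits(hits)
print([h["case_number"] for h in safe], [h["case_number"] for h in suspicious])
```

Treating a low keep percentage as suspicious rather than as an error keeps the dry-run report complete while leaving the destructive step to a human decision.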


@@ -0,0 +1,115 @@
#!/usr/bin/env python3
"""#82.1 — calibrate the lexical dedup thresholds against the cleanup gold-set.
The 2026-06-03 cleanup manifest (data/audit/halacha-cleanup-manifest-*.csv)
records, for each removed halacha, a ``reason`` and a ``survivor_id`` — i.e. a
human-labeled set of TRUE duplicate pairs (deleted rule ↔ its survivor). This
script uses them to validate the lexical near-duplicate thresholds introduced
in #82.3 (``HALACHA`` Jaccard/Levenshtein), so the numbers in
``halacha_quality.lexical_near_duplicate`` are calibrated, not guessed.
It sweeps (jaccard_min × levenshtein_min) and reports precision/recall against:
* positives — duplicate-labeled pairs (deleted rule ↔ survivor rule)
* negatives — random non-paired rules from the same manifest (≈all distinct)
and marks the currently-configured operating point.
cd ~/legal-ai/mcp-server
.venv/bin/python ../scripts/calibrate_halacha_dedup.py \
--manifest ../data/audit/halacha-cleanup-manifest-20260603T101747Z.csv
"""
from __future__ import annotations
import argparse
import asyncio
import csv
import sys
from pathlib import Path
from uuid import UUID
from legal_mcp.services import db, halacha_quality as hq
async def _survivor_text(survivor_id: str, manifest_map: dict) -> str:
    if survivor_id in manifest_map:
        return manifest_map[survivor_id]
    try:
        row = await db.get_halacha(UUID(survivor_id)) if hasattr(db, "get_halacha") else None
    except Exception:
        row = None
    if row:
        return row.get("rule_statement", "")
    # fallback: direct query
    try:
        pool = await db.get_pool()
        r = await pool.fetchrow("SELECT rule_statement FROM halachot WHERE id = $1", UUID(survivor_id))
        return r["rule_statement"] if r else ""
    except Exception:
        return ""
async def main(args: argparse.Namespace) -> int:
    path = Path(args.manifest)
    if not path.is_absolute():
        path = (Path.cwd() / path).resolve()
    with path.open(encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    by_id = {r["id"]: r.get("rule_statement", "") for r in rows}

    positives: list[tuple[str, str]] = []
    for r in rows:
        if "duplicate" in (r.get("reason") or "").lower() and r.get("survivor_id"):
            a = r.get("rule_statement", "")
            b = await _survivor_text(r["survivor_id"], by_id)
            if a and b:
                positives.append((a, b))

    # negatives: pair each deleted rule with a different, non-survivor rule.
    rules = [r.get("rule_statement", "") for r in rows if r.get("rule_statement")]
    negatives: list[tuple[str, str]] = []
    for i in range(len(positives)):
        a = rules[i % len(rules)]
        b = rules[(i * 7 + 3) % len(rules)]  # deterministic spread, no RNG
        if a and b and a != b:
            negatives.append((a, b))

    print(f"positives (labeled dup pairs): {len(positives)} "
          f"negatives: {len(negatives)}", flush=True)
    if not positives:
        print("no labeled duplicate pairs found in manifest — cannot calibrate", flush=True)
        return 1

    # precompute lexical scores per pair
    def scores(pairs):
        return [(hq.jaccard_shingles(a, b), hq.normalized_levenshtein(a, b)) for a, b in pairs]

    pos_s, neg_s = scores(positives), scores(negatives)
    print(f"\n{'jac_min':>8}{'lev_min':>8}{'P':>8}{'R':>8}{'F1':>8}", flush=True)
    best = None
    for jm in (0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70):
        for lm in (0.60, 0.65, 0.70, 0.75, 0.80, 0.85):
            tp = sum(1 for j, l in pos_s if j >= jm or l >= lm)
            fp = sum(1 for j, l in neg_s if j >= jm or l >= lm)
            fn = len(pos_s) - tp
            p = tp / (tp + fp) if (tp + fp) else 0.0
            r = tp / (tp + fn) if (tp + fn) else 0.0
            f1 = 2 * p * r / (p + r) if (p + r) else 0.0
            mark = " <- configured" if (abs(jm - hq._LEX_JACCARD_MIN) < 1e-9
                                        and abs(lm - hq._LEX_LEVENSHTEIN_MIN) < 1e-9) else ""
            if mark:
                print(f"{jm:>8.2f}{lm:>8.2f}{p:>8.3f}{r:>8.3f}{f1:>8.3f}{mark}", flush=True)
            if best is None or f1 > best[0]:
                best = (f1, jm, lm, p, r)
    print(f"\nbest F1={best[0]:.3f} at jaccard_min={best[1]}, levenshtein_min={best[2]} "
          f"(P={best[3]:.3f}, R={best[4]:.3f})", flush=True)
    print("note: positives may include obiter/application cuts (not pure dups); "
          "use precision as the guard against false-merges.", flush=True)
    return 0
if __name__ == "__main__":
    ap = argparse.ArgumentParser(description=__doc__,
                                 formatter_class=argparse.RawDescriptionHelpFormatter)
    ap.add_argument("--manifest", required=True, help="path to halacha-cleanup-manifest-*.csv")
    args = ap.parse_args()
    sys.exit(asyncio.run(main(args)))
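
The grid sweep in the calibration script can be tried without a database or a manifest. The sketch below is an assumption-laden illustration: `prf` and `sweep` are hypothetical names, the score pairs are invented, and the real script obtains its scores from `hq.jaccard_shingles` and `hq.normalized_levenshtein`. A pair counts as a predicted duplicate when either score clears its threshold, matching the `or` in the script.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F1, with 0.0 wherever a denominator would be zero."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


def sweep(pos_scores, neg_scores, jac_grid, lev_grid):
    """Return (f1, jac_min, lev_min, precision, recall) for the first best grid point."""
    best = None
    for jm in jac_grid:
        for lm in lev_grid:
            # Either lexical score clearing its threshold predicts "duplicate".
            tp = sum(1 for j, lv in pos_scores if j >= jm or lv >= lm)
            fp = sum(1 for j, lv in neg_scores if j >= jm or lv >= lm)
            fn = len(pos_scores) - tp
            p, r, f1 = prf(tp, fp, fn)
            if best is None or f1 > best[0]:
                best = (f1, jm, lm, p, r)
    return best


# Invented (jaccard, levenshtein) scores for labeled duplicates and non-duplicates.
pos = [(0.80, 0.90), (0.55, 0.72), (0.30, 0.66)]
neg = [(0.10, 0.40), (0.45, 0.62), (0.05, 0.20)]
best = sweep(pos, neg, (0.40, 0.50, 0.60), (0.60, 0.70, 0.80))
print(best)
```

Because ties keep the earliest grid point, the sweep favors the loosest thresholds among equally good candidates; a reviewer who cares mostly about false merges would compare the precision column rather than F1 alone.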


@@ -0,0 +1,106 @@
#!/usr/bin/env python3
"""#82.7 — offline CROSS-precedent halacha dedup (conservative, dry-run reporter).
Dedup-on-insert (db.store_halachot_for_chunk) only compares within a single
precedent — the 2026-06-03 audit showed cosine ≥0.90 is reliable only
within-precedent. Across precedents the same principle legitimately recurs, so
this batch job is deliberately STRICTER (cosine ≥0.95) and NON-DESTRUCTIVE: it
only reports candidate cross-precedent near-duplicate pairs to a CSV for the
chair to review. Nothing is skipped, merged, or deleted.
Pairs are found with pgvector's exact cosine (``<=>``) per halacha against
halachot in OTHER precedents; a secondary lexical check (Jaccard/Levenshtein)
is reported alongside so the reviewer can tell "same rule" from "same topic".
cd ~/legal-ai/mcp-server
.venv/bin/python ../scripts/halacha_batch_reconcile.py # cosine ≥0.95
.venv/bin/python ../scripts/halacha_batch_reconcile.py --cosine 0.97
"""
from __future__ import annotations
import argparse
import asyncio
import csv
import sys
from datetime import datetime, timezone
from pathlib import Path
from legal_mcp.services import db, halacha_quality as hq
REPO_ROOT = Path(__file__).resolve().parent.parent
AUDIT_DIR = REPO_ROOT / "data" / "audit"
async def main(args: argparse.Namespace) -> int:
    cosine = args.cosine
    max_dist = 1.0 - cosine
    statuses = ("approved", "published") if not args.include_pending else (
        "approved", "published", "pending_review")
    pool = await db.get_pool()
    async with pool.acquire() as conn:
        rows = await conn.fetch(
            "SELECT h.id, h.case_law_id, cl.case_number, h.rule_statement "
            "FROM halachot h JOIN case_law cl ON cl.id = h.case_law_id "
            "WHERE h.embedding IS NOT NULL AND h.review_status = ANY($1::text[]) "
            "ORDER BY h.case_law_id, h.halacha_index",
            list(statuses),
        )
        print(f"scanning {len(rows)} halachot for cross-precedent pairs "
              f"(cosine ≥ {cosine})...", flush=True)
        seen: set[frozenset] = set()
        pairs: list[dict] = []
        for r in rows:
            # nearest neighbor in a DIFFERENT precedent
            nb = await conn.fetchrow(
                "SELECT h2.id, cl2.case_number, h2.rule_statement, "
                " (h2.embedding <=> (SELECT embedding FROM halachot WHERE id = $1)) AS dist "
                "FROM halachot h2 JOIN case_law cl2 ON cl2.id = h2.case_law_id "
                "WHERE h2.embedding IS NOT NULL AND h2.case_law_id <> $2 "
                " AND h2.review_status = ANY($3::text[]) "
                "ORDER BY h2.embedding <=> (SELECT embedding FROM halachot WHERE id = $1) "
                "LIMIT 1",
                r["id"], r["case_law_id"], list(statuses),
            )
            if nb is None or float(nb["dist"]) > max_dist:
                continue
            # Unordered key so that (a, b) and (b, a) are reported once.
            key = frozenset({str(r["id"]), str(nb["id"])})
            if key in seen:
                continue
            seen.add(key)
            pairs.append({
                "case_a": r["case_number"], "id_a": r["id"], "rule_a": r["rule_statement"],
                "case_b": nb["case_number"], "id_b": nb["id"], "rule_b": nb["rule_statement"],
                "cosine": round(1.0 - float(nb["dist"]), 4),
                "jaccard": round(hq.jaccard_shingles(r["rule_statement"], nb["rule_statement"]), 3),
                "levenshtein": round(hq.normalized_levenshtein(r["rule_statement"], nb["rule_statement"]), 3),
            })

    pairs.sort(key=lambda p: -p["cosine"])
    print(f"found {len(pairs)} cross-precedent candidate pair(s)", flush=True)
    for p in pairs[:30]:
        print(f" cos={p['cosine']} jac={p['jaccard']} lev={p['levenshtein']} "
              f"{p['case_a']} ↔ {p['case_b']}: {p['rule_a'][:60]}...", flush=True)
    if pairs:
        ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        AUDIT_DIR.mkdir(parents=True, exist_ok=True)
        out = AUDIT_DIR / f"halacha-cross-precedent-{ts}.csv"
        with out.open("w", encoding="utf-8", newline="") as f:
            w = csv.DictWriter(f, fieldnames=list(pairs[0].keys()))
            w.writeheader()
            w.writerows(pairs)
        print(f"\nreport: {out} (review-only — nothing changed)", flush=True)
    return 0
if __name__ == "__main__":
    ap = argparse.ArgumentParser(description=__doc__,
                                 formatter_class=argparse.RawDescriptionHelpFormatter)
    ap.add_argument("--cosine", type=float, default=0.95,
                    help="min cosine for a cross-precedent candidate (default 0.95)")
    ap.add_argument("--include-pending", action="store_true",
                    help="also scan pending_review halachot (default: approved/published only)")
    args = ap.parse_args()
    sys.exit(asyncio.run(main(args)))
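
The pairing logic of the reconcile script can be sketched in plain Python, with no pgvector involved. Everything here is illustrative: `cosine_similarity` and `cross_precedent_pairs` are hypothetical names and the two-dimensional vectors are invented. pgvector's `<=>` operator returns cosine distance, that is `1 - cosine similarity`, which is why the script converts the `--cosine` threshold into `max_dist = 1.0 - cosine`.

```python
import math


def cosine_similarity(a, b) -> float:
    """Plain cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cross_precedent_pairs(items, min_cosine: float = 0.95):
    """For each item, take its nearest neighbour in a different case and keep
    the pair once if the similarity clears the threshold."""
    seen: set[frozenset] = set()
    pairs = []
    for it in items:
        others = [o for o in items if o["case"] != it["case"]]
        if not others:
            continue
        nb = max(others, key=lambda o: cosine_similarity(it["vec"], o["vec"]))
        sim = cosine_similarity(it["vec"], nb["vec"])
        if sim < min_cosine:
            continue
        key = frozenset({it["id"], nb["id"]})  # unordered: (a, b) == (b, a)
        if key in seen:
            continue
        seen.add(key)
        pairs.append((it["id"], nb["id"], round(sim, 4)))
    return pairs


# Invented embeddings: h1 and h2 are near-identical but belong to different cases.
items = [
    {"id": "h1", "case": "A", "vec": (1.0, 0.0)},
    {"id": "h2", "case": "B", "vec": (0.99, 0.05)},
    {"id": "h3", "case": "A", "vec": (0.0, 1.0)},
    {"id": "h4", "case": "B", "vec": (0.6, 0.8)},
]
pairs = cross_precedent_pairs(items)
print(pairs)
```

Keying on a `frozenset` of the two ids is what prevents the same pair from being reported twice when each halacha is the other's nearest neighbour.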

scripts/halacha_goldset.py Normal file

@@ -0,0 +1,149 @@
#!/usr/bin/env python3
"""#81.7 — gold-set harness for halacha-extraction quality.
Two modes — the human tagging in between is the only manual step:
export — dump a stratified sample of halachot to a CSV with EMPTY label
columns for חיים/דפנה to fill (is_holding, correct_type,
quote_complete). Stratified across precedents and rule_types so
the set isn't dominated by one ruling.
score — read the tagged CSV back and measure each pure validator
(compute_quality_flags / is_fact_dependent / is_quote_truncated /
is_thin_restatement) against the human labels: precision, recall,
F1 per validator + a confusion summary. This is the ground-truth
#81.8 needs to recalibrate the auto-approve threshold.
The validators here are the SAME ones the live extractor runs, imported
directly — so the score reflects production behavior, not a reimplementation.
cd ~/legal-ai/mcp-server
.venv/bin/python ../scripts/halacha_goldset.py export --n 150
# ... חיים/דפנה fill is_holding / correct_type / quote_complete ...
.venv/bin/python ../scripts/halacha_goldset.py score --in ../data/audit/halacha-goldset-<ts>.csv
"""
from __future__ import annotations
import argparse
import asyncio
import csv
import sys
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import Path
from legal_mcp.services import db, halacha_quality as hq
REPO_ROOT = Path(__file__).resolve().parent.parent
AUDIT_DIR = REPO_ROOT / "data" / "audit"
# Columns the human fills. is_holding: 1 if a real generalizable holding, 0 if
# obiter/application/fact-recitation/non-rule. correct_type: binding/interpretive/
# obiter/application. quote_complete: 1 if the quote is a whole, untruncated span.
LABEL_COLS = ["is_holding", "correct_type", "quote_complete"]
EXPORT_COLS = [
"id", "case_number", "halacha_index", "rule_type", "review_status",
"confidence", "rule_statement", "supporting_quote", *LABEL_COLS,
]
async def _export(n: int) -> int:
    rows = await db.list_halachot(limit=5000)
    # stratify: round-robin across (case_law_id, rule_type) buckets.
    buckets: dict = defaultdict(list)
    for r in rows:
        buckets[(r["case_law_id"], r.get("rule_type"))].append(r)
    sample: list[dict] = []
    keys = list(buckets.values())
    i = 0
    while len(sample) < n and any(keys):
        b = keys[i % len(keys)]
        if b:
            sample.append(b.pop())
        i += 1
        if i > n * 50:
            break
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    AUDIT_DIR.mkdir(parents=True, exist_ok=True)
    out = AUDIT_DIR / f"halacha-goldset-{ts}.csv"
    with out.open("w", encoding="utf-8", newline="") as f:
        w = csv.DictWriter(f, fieldnames=EXPORT_COLS, extrasaction="ignore")
        w.writeheader()
        for r in sample:
            w.writerow({**{k: r.get(k, "") for k in EXPORT_COLS},
                        **{lc: "" for lc in LABEL_COLS}})
    print(f"exported {len(sample)} halachot for tagging → {out}", flush=True)
    print(f"fill columns: {', '.join(LABEL_COLS)} (is_holding/quote_complete = 1/0)", flush=True)
    return 0
def _prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return round(p, 3), round(r, 3), round(f1, 3)
def _score(path: Path) -> int:
    with path.open(encoding="utf-8") as f:
        rows = [r for r in csv.DictReader(f) if (r.get("is_holding") or "").strip() != ""]
    if not rows:
        print("no labeled rows (is_holding empty everywhere) — nothing to score", flush=True)
        return 1

    # A validator FLAG is a prediction of "NOT a clean holding" (should be
    # rejected/reviewed). Ground truth NOT-holding = is_holding == 0.
    # We score each validator as a detector of not-holding.
    counters: dict[str, dict[str, int]] = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})

    def tally(name: str, predicted_bad: bool, truly_bad: bool):
        c = counters[name]
        if predicted_bad and truly_bad:
            c["tp"] += 1
        elif predicted_bad and not truly_bad:
            c["fp"] += 1
        elif not predicted_bad and truly_bad:
            c["fn"] += 1
        else:
            c["tn"] += 1

    for r in rows:
        rule = r.get("rule_statement", "")
        quote = r.get("supporting_quote", "")
        rtype = r.get("rule_type", "binding")
        quote_complete = (r.get("quote_complete") or "1").strip() not in ("0", "false", "")
        truly_not_holding = (r.get("is_holding") or "").strip() in ("0", "false")
        flags = hq.compute_quality_flags(rule, quote, "", quote_complete, rtype)
        tally("any_flag", bool(flags), truly_not_holding)
        tally("application", hq.FLAG_APPLICATION in flags, truly_not_holding)
        tally("non_decision", hq.FLAG_NON_DECISION in flags, truly_not_holding)
        tally("thin_restatement", hq.FLAG_THIN_RESTATEMENT in flags, truly_not_holding)
        # quote-truncation scored against quote_complete label specifically
        tally("truncated_quote", hq.is_quote_truncated(quote), not quote_complete)

    print(f"scored {len(rows)} labeled halachot\n", flush=True)
    print(f"{'validator':<18}{'P':>7}{'R':>7}{'F1':>7} tp/fp/fn/tn", flush=True)
    for name, c in counters.items():
        p, rec, f1 = _prf(c["tp"], c["fp"], c["fn"])
        print(f"{name:<18}{p:>7}{rec:>7}{f1:>7} "
              f"{c['tp']}/{c['fp']}/{c['fn']}/{c['tn']}", flush=True)
    return 0
async def main(args: argparse.Namespace) -> int:
    if args.mode == "export":
        return await _export(args.n)
    return _score(Path(args.infile))


if __name__ == "__main__":
    ap = argparse.ArgumentParser(description=__doc__,
                                 formatter_class=argparse.RawDescriptionHelpFormatter)
    sub = ap.add_subparsers(dest="mode", required=True)
    pe = sub.add_parser("export", help="dump a sample CSV for human tagging")
    pe.add_argument("--n", type=int, default=150, help="sample size (default 150)")
    ps = sub.add_parser("score", help="measure validators against a tagged CSV")
    ps.add_argument("--in", dest="infile", required=True, help="tagged CSV path")
    args = ap.parse_args()
    sys.exit(asyncio.run(main(args)))
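
The stratified export can be reproduced on toy data to see why one large ruling does not dominate the sample. This is a simplified variation, not the script's exact loop: `stratified_sample` is a hypothetical name, the rows are invented, and it pops from the front of each bucket, whereas the script pops from the end and walks the buckets by index.

```python
from collections import defaultdict


def stratified_sample(rows: list[dict], n: int) -> list[dict]:
    """Round-robin over (case_law_id, rule_type) buckets until n rows are taken."""
    buckets: dict = defaultdict(list)
    for r in rows:
        buckets[(r["case_law_id"], r.get("rule_type"))].append(r)
    queues = list(buckets.values())
    sample: list[dict] = []
    while len(sample) < n and any(queues):
        for q in queues:
            if q and len(sample) < n:
                sample.append(q.pop(0))
    return sample


# Invented rows: case 1 has five halachot, cases 2 and 3 have one each.
rows = (
    [{"case_law_id": 1, "rule_type": "binding", "idx": i} for i in range(5)]
    + [{"case_law_id": 2, "rule_type": "obiter", "idx": 0}]
    + [{"case_law_id": 3, "rule_type": "binding", "idx": 0}]
)
sample = stratified_sample(rows, 4)
print([r["case_law_id"] for r in sample])
```

A plain `rows[:4]` would have returned four halachot from case 1; the round-robin takes one from every bucket before returning to the largest one.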


@@ -0,0 +1,173 @@
#!/usr/bin/env python3
"""#86.3 — benchmark halacha-extraction quality against Nevo's מיני-רציו gold-set.
Nevo's editorial מיני-רציו is a free, professionally-written list of a ruling's
holdings. By comparing the halachot WE extracted against it we get an honest,
zero-cost measurement of extraction quality per ruling:
* recall — fraction of Nevo's holdings that our halachot cover
* precision — fraction of our halachot that map to a Nevo holding
* granularity — our_count / nevo_holding_count (over-decomposition signal,
the #81.5 concern: e.g. 14 ours vs 4 Nevo = 3.5x)
The gold-truth ratio is read from ``case_law.nevo_ratio`` (populated by
``backfill_nevo_preamble.py`` / ingest). For rulings not yet backfilled it
falls back to computing the ratio on-the-fly from the stored ``full_text``,
so the harness works before and after the migration.
An LLM-as-judge (local ``claude_session``, zero API cost) does the semantic
mapping — string overlap can't tell "same holding, different words" from a
genuinely new holding. The judge is asked to count, not to rewrite.
Run with the MCP server venv (needs the local ``claude`` CLI):
cd ~/legal-ai/mcp-server
.venv/bin/python ../scripts/nevo_ratio_benchmark.py --case 'בג"ץ 1764/05'
.venv/bin/python ../scripts/nevo_ratio_benchmark.py --all --limit 5
.venv/bin/python ../scripts/nevo_ratio_benchmark.py --all # full corpus
"""
from __future__ import annotations
import argparse
import asyncio
import csv
import json
import sys
from datetime import datetime, timezone
from pathlib import Path
from legal_mcp.services import claude_session, db
from legal_mcp.services.extractor import extract_nevo_ratio
REPO_ROOT = Path(__file__).resolve().parent.parent
AUDIT_DIR = REPO_ROOT / "data" / "audit"
_JUDGE_SYSTEM = (
"אתה בוחן-איכות משפטי. נתונים לך (א) רשימת ההלכות (מיני-רציו) שכתב עורך נבו "
"עבור פסק-דין — אמת-המידה; (ב) רשימת ההלכות שמערכת אוטומטית חילצה מאותו "
"פסק-דין. משימתך: למפות סמנטית בין השתיים (אותו עיקרון משפטי בניסוח שונה = "
"התאמה), ולספור. החזר JSON בלבד, ללא טקסט נוסף."
)
def _judge_prompt(ratio: str, ours: list[str]) -> str:
    ours_block = "\n".join(f"{i}. {s}" for i, s in enumerate(ours, 1)) or "(אין)"
    return (
        f"מיני-רציו של נבו (אמת-מידה):\n{ratio}\n\n"
        f"ההלכות שחולצו על-ידי המערכת ({len(ours)}):\n{ours_block}\n\n"
        "החזר JSON עם המפתחות:\n"
        '{"nevo_holdings": <מספר העקרונות הנפרדים במיני-רציו>,\n'
        ' "covered": <כמה מעקרונות נבו מכוסים ע"י לפחות הלכה אחת שלנו>,\n'
        ' "ours_total": <מספר ההלכות שלנו>,\n'
        ' "ours_mapped": <כמה מההלכות שלנו ממופות לעיקרון נבו כלשהו>,\n'
        ' "notes": "<עד 2 משפטים: מה הוחמץ / מה עודף>"}'
    )
async def _bench_one(row: dict, model: str | None) -> dict:
    cn = row["case_number"]
    ratio = (row.get("nevo_ratio") or "").strip() or extract_nevo_ratio(row.get("full_text") or "")
    result = {"case_number": cn, "nevo_holdings": 0, "covered": 0,
              "ours_total": 0, "ours_mapped": 0, "recall": None,
              "precision": None, "granularity": None, "notes": "", "error": ""}
    if not ratio:
        result["error"] = "no mini-ratio"
        return result
    halachot = await db.list_halachot(case_law_id=row["id"], limit=500)
    ours = [h["rule_statement"] for h in halachot
            if h.get("review_status") in ("approved", "published", "pending_review")
            and (h.get("rule_statement") or "").strip()]
    result["ours_total"] = len(ours)
    if not ours:
        result["error"] = "no extracted halachot"
        return result
    try:
        verdict = await claude_session.query_json(
            _judge_prompt(ratio, ours), system=_JUDGE_SYSTEM, model=model, effort="low",
        )
    except Exception as e:  # noqa: BLE001
        result["error"] = f"judge failed: {e}"
        return result
    if not isinstance(verdict, dict):
        result["error"] = "judge returned non-dict"
        return result
    nh = int(verdict.get("nevo_holdings") or 0)
    cov = int(verdict.get("covered") or 0)
    ot = int(verdict.get("ours_total") or len(ours))
    om = int(verdict.get("ours_mapped") or 0)
    result.update({
        "nevo_holdings": nh, "covered": cov, "ours_total": ot, "ours_mapped": om,
        "recall": round(cov / nh, 3) if nh else None,
        "precision": round(om / ot, 3) if ot else None,
        "granularity": round(ot / nh, 2) if nh else None,
        "notes": str(verdict.get("notes") or "")[:300],
    })
    return result
async def main(args: argparse.Namespace) -> int:
    pool = await db.get_pool()
    async with pool.acquire() as conn:
        if args.case:
            rows = await conn.fetch(
                "SELECT id, case_number, nevo_ratio, full_text FROM case_law "
                "WHERE case_number = $1", args.case,
            )
        else:
            # rulings that have (or can derive) a ratio
            rows = await conn.fetch(
                "SELECT id, case_number, nevo_ratio, full_text FROM case_law "
                "WHERE nevo_ratio <> '' OR full_text LIKE '%מיני-רציו:%' "
                "ORDER BY case_number"
            )
    rows = [dict(r) for r in rows]
    if args.limit:
        rows = rows[: args.limit]
    if not rows:
        print("no rulings with a mini-ratio found", flush=True)
        return 0

    print(f"benchmarking {len(rows)} ruling(s)...", flush=True)
    results = []
    for i, row in enumerate(rows, 1):
        res = await _bench_one(row, args.model)
        results.append(res)
        if res["error"]:
            print(f"[{i}/{len(rows)}] {res['case_number']}: SKIP ({res['error']})", flush=True)
        else:
            print(f"[{i}/{len(rows)}] {res['case_number']}: "
                  f"recall={res['recall']} precision={res['precision']} "
                  f"granularity={res['granularity']}x "
                  f"(nevo={res['nevo_holdings']}, ours={res['ours_total']})", flush=True)

    scored = [r for r in results if r["recall"] is not None]
    if scored:
        avg = lambda k: round(sum(r[k] for r in scored) / len(scored), 3)  # noqa: E731
        print(f"\n=== {len(scored)} scored — mean recall={avg('recall')} "
              f"precision={avg('precision')} granularity={avg('granularity')}x ===", flush=True)

    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    AUDIT_DIR.mkdir(parents=True, exist_ok=True)
    out = Path(args.out) if args.out else AUDIT_DIR / f"nevo-ratio-benchmark-{ts}.csv"
    with out.open("w", encoding="utf-8", newline="") as f:
        w = csv.DictWriter(f, fieldnames=list(results[0].keys()))
        w.writeheader()
        w.writerows(results)
    print(f"report: {out}", flush=True)
    return 0
if __name__ == "__main__":
    ap = argparse.ArgumentParser(description=__doc__,
                                 formatter_class=argparse.RawDescriptionHelpFormatter)
    g = ap.add_mutually_exclusive_group(required=True)
    g.add_argument("--case", help="benchmark a single case_number")
    g.add_argument("--all", action="store_true", help="benchmark all rulings with a mini-ratio")
    ap.add_argument("--limit", type=int, default=None, help="cap the number of rulings")
    ap.add_argument("--model", default=None, help="judge model (default: CLI session default)")
    ap.add_argument("--out", default=None, help="output CSV path (default: data/audit/)")
    args = ap.parse_args()
    sys.exit(asyncio.run(main(args)))
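
The three benchmark metrics are simple ratios over the judge's four counts. The sketch below pulls that arithmetic out of `_bench_one` so it can be checked by hand; `metrics` is a hypothetical helper name and the verdict is invented, reusing the 14-versus-4 over-decomposition example from the script's docstring.

```python
def metrics(verdict: dict, ours_count: int) -> dict:
    """Recall, precision and granularity from the judge's counts.

    A metric is None when its denominator is zero.
    """
    nh = int(verdict.get("nevo_holdings") or 0)
    cov = int(verdict.get("covered") or 0)
    ot = int(verdict.get("ours_total") or ours_count)
    om = int(verdict.get("ours_mapped") or 0)
    return {
        "recall": round(cov / nh, 3) if nh else None,       # Nevo holdings we cover
        "precision": round(om / ot, 3) if ot else None,     # our halachot that map to Nevo
        "granularity": round(ot / nh, 2) if nh else None,   # over-decomposition signal
    }


# Invented verdict: 4 Nevo holdings, 3 covered; 14 extracted halachot, 9 mapped.
verdict = {"nevo_holdings": 4, "covered": 3, "ours_total": 14, "ours_mapped": 9}
result = metrics(verdict, ours_count=14)
print(result)
```

The counts come from an LLM judge, so nothing here guarantees `covered <= nevo_holdings`; a recall above 1.0 in the report would indicate an inconsistent verdict rather than a good result.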


@@ -1113,6 +1113,52 @@ export interface paths {
patch?: never;
trace?: never;
};
"/api/cases/{case_number}/decision-blocks": {
parameters: {
query?: never;
header?: never;
path?: never;
cookie?: never;
};
/**
* Api Get Decision Blocks
* @description Return all 12 decision blocks as JSON (empty blocks included).
*
* Read path for the interactive block viewer — content lives in
* decision_blocks but was previously only reachable via DOCX export.
*/
get: operations["api_get_decision_blocks_api_cases__case_number__decision_blocks_get"];
put?: never;
post?: never;
delete?: never;
options?: never;
head?: never;
patch?: never;
trace?: never;
};
"/api/cases/{case_number}/decision-blocks/{block_id}": {
parameters: {
query?: never;
header?: never;
path?: never;
cookie?: never;
};
get?: never;
/**
* Api Update Decision Block
* @description Save inline-edited content for a single decision block.
*
* Writes to decision_blocks (upsert, status='draft') and rebuilds the
* on-disk decision.md. Creates a decision row if none exists yet.
*/
put: operations["api_update_decision_block_api_cases__case_number__decision_blocks__block_id__put"];
post?: never;
delete?: never;
options?: never;
head?: never;
patch?: never;
trace?: never;
};
"/api/cases/{case_number}/learn": {
parameters: {
query?: never;
@@ -1959,6 +2005,88 @@ export interface paths {
patch?: never;
trace?: never;
};
"/api/learning/pairs": {
parameters: {
query?: never;
header?: never;
path?: never;
cookie?: never;
};
/**
* Api Learning Pairs
* @description Reconciliation ledger (INV-LRN4): all decisions and their comparison status against the final version.
* Optional status: final_received / analyzed / lessons_folded.
*/
get: operations["api_learning_pairs_api_learning_pairs_get"];
put?: never;
post?: never;
delete?: never;
options?: never;
head?: never;
patch?: never;
trace?: never;
};
"/api/learning/style-distance/{case_number}": {
parameters: {
query?: never;
header?: never;
path?: never;
cookie?: never;
};
/**
* Api Learning Style Distance
* @description Style-distance metric (T7) for a case: whether the draft is converging on דפנה's style.
*/
get: operations["api_learning_style_distance_api_learning_style_distance__case_number__get"];
put?: never;
post?: never;
delete?: never;
options?: never;
head?: never;
patch?: never;
trace?: never;
};
"/api/learning/pairs/{pair_id}": {
parameters: {
query?: never;
header?: never;
path?: never;
cookie?: never;
};
/**
* Api Learning Pair Detail
* @description Detail of a ledger row, including the distillation proposal (analysis) for chair approval (T14).
*/
get: operations["api_learning_pair_detail_api_learning_pairs__pair_id__get"];
put?: never;
post?: never;
delete?: never;
options?: never;
head?: never;
patch?: never;
trace?: never;
};
"/api/learning/pairs/{pair_id}/promote": {
parameters: {
query?: never;
header?: never;
path?: never;
cookie?: never;
};
get?: never;
put?: never;
/**
* Api Learning Promote
* @description Chair gate (INV-G10/LRN1): approves style lessons and transition phrases from the distillation proposal
* and embeds them in the channels the writer consumes (methodology overrides → T15). Advances status.
*/
post: operations["api_learning_promote_api_learning_pairs__pair_id__promote_post"];
delete?: never;
options?: never;
head?: never;
patch?: never;
trace?: never;
};
"/api/admin/skills": {
parameters: {
query?: never;
@@ -2254,7 +2382,14 @@ export interface paths {
head?: never;
/**
* Api Resolve Feedback
* @description Mark feedback as resolved.
* @description Mark feedback as resolved. When ``fold`` is true (default) and the entry
* has an extracted lesson, also wake the CEO to fold that lesson into the
* right knowledge file (the feedback→agent-knowledge loop).
*
* The fold is fire-and-forget (BackgroundTask) and best-effort — resolving
* never fails because Paperclip is down. Pass ``fold=false`` for pure
* bookkeeping resolves (e.g. from the per-case drafts panel) to avoid
* spawning a CEO run per click.
*/
patch: operations["api_resolve_feedback_api_feedback__feedback_id__resolve_patch"];
trace?: never;
@@ -2566,7 +2701,13 @@ export interface paths {
path?: never;
cookie?: never;
};
/** Halachot List */
/**
* Halachot List
* @description List halachot. ``exclude_low_quality`` hides flagged items (#84.1) and
* ``order_by_priority`` switches to the active-learning order (#84.3). Both
* default off so existing callers are unaffected; the review-queue view opts
* in.
*/
get: operations["halachot_list_api_halachot_get"];
put?: never;
post?: never;
@@ -2746,6 +2887,11 @@ export interface components {
/** Issue Id */
issue_id?: string | null;
};
/** BlockUpdateRequest */
BlockUpdateRequest: {
/** Content */
content: string;
};
/** Body_api_create_feedback_api_feedback_post */
Body_api_create_feedback_api_feedback_post: {
/**
@@ -3475,6 +3621,19 @@ export interface components {
/** Citation Formatted */
citation_formatted?: string | null;
};
/** PromoteLearningRequest */
PromoteLearningRequest: {
/**
* Lessons
* @default []
*/
lessons: string[];
/**
* Phrases
* @default []
*/
phrases: string[];
};
/** ReviseRequest */
ReviseRequest: {
/** Revisions */
@@ -5263,6 +5422,73 @@ export interface operations {
};
};
};
api_get_decision_blocks_api_cases__case_number__decision_blocks_get: {
parameters: {
query?: never;
header?: never;
path: {
case_number: string;
};
cookie?: never;
};
requestBody?: never;
responses: {
/** @description Successful Response */
200: {
headers: {
[name: string]: unknown;
};
content: {
"application/json": unknown;
};
};
/** @description Validation Error */
422: {
headers: {
[name: string]: unknown;
};
content: {
"application/json": components["schemas"]["HTTPValidationError"];
};
};
};
};
api_update_decision_block_api_cases__case_number__decision_blocks__block_id__put: {
parameters: {
query?: never;
header?: never;
path: {
case_number: string;
block_id: string;
};
cookie?: never;
};
requestBody: {
content: {
"application/json": components["schemas"]["BlockUpdateRequest"];
};
};
responses: {
/** @description Successful Response */
200: {
headers: {
[name: string]: unknown;
};
content: {
"application/json": unknown;
};
};
/** @description Validation Error */
422: {
headers: {
[name: string]: unknown;
};
content: {
"application/json": components["schemas"]["HTTPValidationError"];
};
};
};
};
api_learn_api_cases__case_number__learn_post: {
parameters: {
query?: never;
@@ -6575,6 +6801,135 @@ export interface operations {
};
};
};
api_learning_pairs_api_learning_pairs_get: {
parameters: {
query?: {
status?: string;
limit?: number;
};
header?: never;
path?: never;
cookie?: never;
};
requestBody?: never;
responses: {
/** @description Successful Response */
200: {
headers: {
[name: string]: unknown;
};
content: {
"application/json": unknown;
};
};
/** @description Validation Error */
422: {
headers: {
[name: string]: unknown;
};
content: {
"application/json": components["schemas"]["HTTPValidationError"];
};
};
};
};
api_learning_style_distance_api_learning_style_distance__case_number__get: {
parameters: {
query?: never;
header?: never;
path: {
case_number: string;
};
cookie?: never;
};
requestBody?: never;
responses: {
/** @description Successful Response */
200: {
headers: {
[name: string]: unknown;
};
content: {
"application/json": unknown;
};
};
/** @description Validation Error */
422: {
headers: {
[name: string]: unknown;
};
content: {
"application/json": components["schemas"]["HTTPValidationError"];
};
};
};
};
api_learning_pair_detail_api_learning_pairs__pair_id__get: {
parameters: {
query?: never;
header?: never;
path: {
pair_id: string;
};
cookie?: never;
};
requestBody?: never;
responses: {
/** @description Successful Response */
200: {
headers: {
[name: string]: unknown;
};
content: {
"application/json": unknown;
};
};
/** @description Validation Error */
422: {
headers: {
[name: string]: unknown;
};
content: {
"application/json": components["schemas"]["HTTPValidationError"];
};
};
};
};
api_learning_promote_api_learning_pairs__pair_id__promote_post: {
parameters: {
query?: never;
header?: never;
path: {
pair_id: string;
};
cookie?: never;
};
requestBody: {
content: {
"application/json": components["schemas"]["PromoteLearningRequest"];
};
};
responses: {
/** @description Successful Response */
200: {
headers: {
[name: string]: unknown;
};
content: {
"application/json": unknown;
};
};
/** @description Validation Error */
422: {
headers: {
[name: string]: unknown;
};
content: {
"application/json": components["schemas"]["HTTPValidationError"];
};
};
};
};
api_list_skills_api_admin_skills_get: {
parameters: {
query?: never;
@@ -7580,6 +7935,8 @@ export interface operations {
practice_area?: string;
limit?: number;
offset?: number;
exclude_low_quality?: boolean;
order_by_priority?: boolean;
};
header?: never;
path?: never;
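The `PromoteLearningRequest` schema and the `/api/learning/pairs/{pair_id}/promote` path in this diff suggest a request shaped roughly as below. This is a minimal sketch, not the project's client: `build_promote_request` is a hypothetical helper, and the only facts taken from the diff are the path template, the POST method, and the two list fields that default to empty.

```python
import json
from urllib.parse import quote


def build_promote_request(pair_id, lessons=None, phrases=None):
    """Hypothetical helper: assemble method, path and JSON body for the promote call.

    Both lists default to [] to mirror the @default annotations on
    PromoteLearningRequest in the generated types.
    """
    body = {
        "lessons": list(lessons or []),
        "phrases": list(phrases or []),
    }
    # Percent-encode the path parameter so an id containing "/" cannot
    # change the route.
    path = f"/api/learning/pairs/{quote(pair_id, safe='')}/promote"
    # ensure_ascii=False keeps Hebrew lesson text readable in the payload.
    return "POST", path, json.dumps(body, ensure_ascii=False)
```

The response type is generated as `unknown`, so a caller would have to validate whatever comes back rather than rely on the types.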


@@ -6031,7 +6031,13 @@ async def halachot_list(
practice_area: str = "",
limit: int = 200,
offset: int = 0,
exclude_low_quality: bool = False,
order_by_priority: bool = False,
):
"""List halachot. ``exclude_low_quality`` hides flagged items (#84.1) and
``order_by_priority`` switches to the active-learning order (#84.3). Both
default off so existing callers are unaffected; the review-queue view opts
in."""
cid: UUID | None = None
if case_law_id:
try:
@@ -6043,6 +6049,8 @@ async def halachot_list(
review_status=review_status or None,
practice_area=practice_area or None,
limit=limit, offset=offset,
exclude_low_quality=exclude_low_quality,
order_by_priority=order_by_priority,
)
return {"items": rows, "count": len(rows)}
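The docstring above describes two opt-in flags whose implementation lives in a data layer that this diff does not show. As a rough in-memory illustration of the intended semantics only: the `Halacha` class, its `low_quality` and `priority` fields, and `list_halachot` are all assumptions made for the sketch, not the project's schema or query.

```python
from dataclasses import dataclass


@dataclass
class Halacha:
    id: int
    low_quality: bool = False  # hypothetical flag; the real column is not shown in the diff
    priority: float = 0.0      # hypothetical active-learning score


def list_halachot(rows, limit=200, offset=0,
                  exclude_low_quality=False, order_by_priority=False):
    """Filter and order rows the way the two new flags are described."""
    items = list(rows)
    if exclude_low_quality:
        # Hide flagged items (the behaviour attributed to #84.1).
        items = [r for r in items if not r.low_quality]
    if order_by_priority:
        # Highest score first; assumed stand-in for the active-learning order (#84.3).
        items = sorted(items, key=lambda r: r.priority, reverse=True)
    page = items[offset:offset + limit]
    # Same response shape as the endpoint: count is the page size, not the total.
    return {"items": page, "count": len(page)}
```

With both flags left at `False` the input order and contents pass through unchanged, which matches the stated goal that existing callers are unaffected.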