מחיל את scripts/_pipeline_runtime.py (מ-P0) על final_learning_pipeline: 3 הצעדים
([1]ingest/Opus-distillation [2]enroll-style-corpus [3]style-panel) רצים דרך אותו
runtime עמידות — מימוש אחד לשני הפייפליינים (G2), לא מימוש מקביל.
קריסה/OOM בפאנל-הסגנון [3] ממשיכה מ-[3] במקום לשלם שוב על דיסטילציית-ה-Opus [1]
(היקרה). thread יציב לכל תיק (learning:{case}); dry-run = preview נפרד. CLI זהה +
--fresh. שגיאת ingest קריטית → raise → halt + clean non-zero exit (resume מנסה שוב).
degradation חיננית כמו ב-P0 (ללא langgraph → ליניארי).
אימות: py_compile OK; מיובא נקי ב-venv המשותף (langgraph נעדר, lazy import). מנגנון
ה-runtime עצמו מכוסה ב-test_pipeline_runtime.py (P0) — אותו runtime.
Invariants: INV-DUR1 (עמידות), G2 (runtime יחיד), G3 (idempotency).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
style_lesson_panel.py: before writing 2/2-keep lessons, skip any whose normalized
lesson_text already exists on the corpus (any source), and collapse duplicates within
a run. Makes the run-learning button safe to click repeatedly (the curator may re-run
the pipeline) — it converges instead of piling up duplicate decision_lessons.
Verified on בל"מ 8126-03-25: re-running --apply with 7 existing lessons wrote 0
("1 כפילויות דולגו"), count stayed 7.
Invariants: INV-LRN1/G10 unchanged (proposals only, manual fold).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
הכפתורים "הרץ למידת-קול"/"הרץ אימות-הלכות" מעירים את הרמס, ובמקום שהסוכן
(DeepSeek) ירכיב כמה קריאות-כלי (שביר), הוא מריץ עכשיו פקודה דטרמיניסטית אחת.
חדש:
- scripts/final_learning_pipeline.py — (1) ingest_final_version עם נתיב-הסופי
(מדלג אם הזוג כבר analyzed; --force לחידוש), (2) רישום לקורפוס-הסגנון
(idempotent — סוגר את הפער שפאנל-הסגנון דרש corpus_id), (3) style_lesson_panel
--apply. --dry-run להרצה בטוחה.
- scripts/final_halacha_pipeline.py — extract_internal_citations →
corroboration.build_all → halacha_panel_approve --apply. --dry-run / --limit.
briefs הרמס (web/paperclip_client._curator_task_brief) פושטו לפקודה-אחת לכל
task — חסין מול הרצת-סוכן. תוקנו שני הפערים שזוהו: ingest דרש file_path,
ופאנל-הסגנון דרש style_corpus.
נלווה: תיקון help מיושן של halacha_panel_approve (--apply מחווט). SCRIPTS.md.
אומת: שני ה-pipelines רצו dry-run על בל"מ 8126-03-25 (skip-ingest, קורפוס,
פאנלים) בהצלחה. Invariants: INV-LRN1/LRN5/G10 (הפיך, שער-יו"ר ידני נשמר),
INV-DM7. G2 — תזמור של יכולות קיימות, לא מסלול-מקביל.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
מוסיף מסלול ייעודי לקליטת ההחלטה החתומה של היו"ר, ומפעיל אותו דרך שני
שלבים אוטומטיים מדורגים עם פאנלי-סוכנים (אוטו-אישור + אסקלציה ליו"ר).
Backend (web/):
- POST /api/cases/{case}/final/upload — קליטת final חיצוני: שמירה קנונית
(סופי-{case}.docx + עותק קורפוס-סגנון תחת case_number מלא כדי שבל"מ לא
יתנגש עם ערר באותו מספר), פתיחת draft_final_pairs (final_received). לא נוגע
ב-active_draft ולא מריץ retrofit (נבדל מ-exports/upload ו-mark-final → לא G2).
- POST .../final/run-learning + .../final/run-halacha — שלבים מדורגים שמעירים
worker מקומי (claude/DeepSeek/Gemini מקומיים בלבד) דרך הרחבת
wake_curator_for_final עם param task=learning|halacha.
פאנל-סגנון חדש (scripts/style_lesson_panel.py): שני שופטים (DeepSeek+Gemini)
על-גבי דיסטילציית-ה-Opus; הסכמה 2/2-keep → decision_lesson
(source=panel:deepseek+gemini); substance מדולג (INV-LRN5); הפיך + גיבוי CSV.
פאנל-הלכות: docstring/SCRIPTS.md עודכנו (--apply מחווט).
Frontend (web-ui/): כפתור "העלאת החלטה סופית של היו"ר" + שני כפתורים מדורגים
"הרץ למידת-קול"/"הרץ אימות-הלכות" ב-drafts-panel; כל התוויות בעברית
(badge מקור-לקח: "פאנל: דיפסיק+גמיני", "הרמס (סקירה)"...).
Spec: docs/spec/07-learning.md §0.6. Invariants: INV-LRN1/LRN4/LRN5, G10
(שער-יו"ר ידני להטמעה ל-SKILL.md/lessons.md — הפאנלים יוצרים הצעות בלבד);
G2 (מסלול-סופי הוא יכולת חסרה, לא מסלול-מקביל).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
הדף הציג את התורים באופן לא-אחיד (by_status גולמי), בלי הבחנה בין "ממתין"
(בקלוג: status=pending) ל"בתור" (התור הפעיל: requested_at IS NOT NULL), בלי
הצגת הפריט שרץ כרגע, ובלי שום שליטה בתהליכים.
מה נוסף:
1. כרטיסי-תור אחידים — בתור / ממתין(בקלוג) / בעיבוד / הושלם / נכשל + "רץ עכשיו"
(citation/case_number של הפריט בעיבוד) לכל drain (אחזור-פסיקה, מטא-דאטה,
הלכות, יומונים). שערי-אנוש (אישור-הלכות, פסיקה-חסרה) נשארים מוני-סטטוס.
2. פאנל ניהול-תהליכים בסגנון "שירותי Windows":
- דמון (court-fetch-service/xvfb/chat/reaper): הפעל-מחדש / עצור / הפעל.
- cron drain: "הרץ עכשיו" (pm2 restart) + מתג הפעל/כבה תזמון.
3. כל תגי-הסטטוס מתורגמים לעברית.
מנגנון:
- הפעל/כבה תזמון = דגל ב-DB (טבלה drain_controls). pm2 cron_restart מחיה תהליך
שעוצר ב-stop, לכן ה"כיבוי" האמין הוא דגל שכל drain בודק ב-startup (no-op מיידי
כשכבוי). הקונטיינר כותב/קורא ישירות מ-DB.
- הרץ-עכשיו + restart/stop/start = proxy ל-pm2 דרך endpoint חדש בגשר-המארח
(court_fetch_service /pm2/control), מאובטח Bearer + whitelist ל-legal-* בלבד.
- יומונים: drain_digests הועבר מ-crontab ל-pm2 (legal-digest-drain.config.cjs)
כדי שיופיע ויהיה שליט כמו כל drain. drain_halacha_queue.py הובא לבקרת-גרסאות.
Invariants: מקיים G2 (הרחבת /operations + הגשר הקיים, לא מסלול מקביל) ו-G1
(drain_controls = מקור-אמת יחיד לכיבוי, נורמליזציה במקור ולא תיקון-בקריאה).
אין בליעת שגיאות שקטה (הגשר מחזיר {ok,error}; המוטציות מציגות toast).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The 211 open missing_precedents include 99 Supreme serial-format rulings
(בג"ץ/בר"מ/עע"מ NNNN/YY) with no נט-format triple — fetchable only from
supremedecisions.court.gov.il. Decoded its public JSON API (no browser, no
CAPTCHA, no smart-card); validated live on בג"ץ 3483/05 + בר"מ 10212/16.
- court_fetch_supreme.py: rewrite. POST Home/SearchVerdicts with a structured
`document` ({Year:"YYYY", CaseNum, OldMainNumFormat:true, SearchText:[…]}) +
X-Requested-With header → records; GET Home/Download?path=&fileName=&type=4 →
PDF. The earlier attempt failed only on the request shape (string vs object).
2-digit→4-digit year; try candidate docs best-first (פסק-דין→pages), skipping
the published-report 's'-prefix files the free endpoint WAF-blocks.
- orchestrator: on successful ingest, close matching open missing_precedents
(link to the new case_law). End-to-end validated (בר"מ 10212/16 → corpus).
- backfill_missing_precedents.py: enqueue fetchable open gaps (supreme + net)
into court_fetch_jobs; the drainer fetches+ingests+closes. dry-run default.
- X13 spec + SCRIPTS.md updated (Tier-0 decoded, no longer a limitation).
Very old un-digitized Supreme cases (e.g. בג"ץ 389/87 → 0 records) → manual.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Gemini key is stored in Infisical as GOOGLE_GEMINI_API_KEY
(nautilus /external-apis/gemini). Align the panel to read that canonical name
first, falling back to bare GEMINI_API_KEY for back-compat — so an
Infisical→.env sync keeps working.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The halacha extraction queue was stuck (same class as the metadata issue): 26
precedents requested extraction with no drainer, plus 1 orphaned in 'processing'
(status=processing, requested_at cleared → never re-picked by the queue).
- db.requeue_stale_processing_extractions(kind): re-stamp orphaned 'processing'
rows (requested_at IS NULL) so they re-drain; halacha extractor force=False
resumes from chunk checkpoints (no duplicates).
- process_pending_extractions calls it at the top — fully unattended, safe under
the global advisory lock. Mirrors the digests-drain self-heal.
- legal-halacha-drain.config.cjs: pm2 cron (every 2h, conservative — Claude is
slow/rate-limited and each run adds to the chair's pending_review queue).
drain_halacha_queue.py stays on claude_session (high reasoning quality for
holding/ratio; NOT moved to Gemini). SCRIPTS.md.
The chair-approval gate (INV-G10) is untouched — this only produces halachot;
Daphna still approves each in /approvals.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The /precedents metadata queue was stuck — 24 rows requested, nothing draining
them — and the agentic claude CLI hit error_max_turns on what is a single
structured text→JSON task (slow + flaky). Metadata extraction is bounded
extraction, the wrong fit for an agentic loop.
- gemini_session.py: query_json drop-in (gemini-2.5-flash, JSON mode, httpx —
no new SDK dep). Reads GEMINI_API_KEY (~/.env; SoT Infisical
nautilus:/external-apis/gemini). Host-side only — no LLM from the container.
- precedent_metadata_extractor: claude_session.query_json → gemini_session.
Validated live: rich, accurate fields (case_name/summary/appeal_subtype/tags).
- process_pending_extractions: kind-aware cooldown — metadata 2s (Gemini, fast),
halacha keeps 30s (Claude rate limits).
- drain_metadata_queue.py + legal-metadata-drain.config.cjs (pm2 cron */15) so
the queue never clogs again. SCRIPTS.md.
- X8 INV-FP5 updated: per-task engine choice (Gemini=bounded metadata,
claude_session=agentic halacha), both host-side, single canonical queue (G2).
Agentic/voice-sensitive work (writing, analysis, halacha) stays on claude_session
(Daphna's subscription). Gemini cost ≈ $0.10/1M tokens — negligible.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Periodic safety net for the multi-judge approval panel: samples panel-approved
halachot, re-runs the same 3-judge KEEP vote, and surfaces any that now lean
DROP — candidate false-keeps a human should glance at. Report-only by default;
--flag reopens flips to pending_review. Baseline 0/15 on the 2026-06-07 batch.
Closes the loop the literature prescribes (Trust-or-Escalate / selective
prediction): monitor the auto-decision error rate rather than trusting it
blindly. Reuses halacha_panel_approve's judges (single source of truth).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
drain_digests רץ תחת flock (drainer יחיד), אז כל שורה 'processing' בתחילת ריצה
היא שריד מריצה קודמת שנקטעה באמצע-שורה (סשן/מכסה). מאפסים אותה ל-'pending'
לריצה חוזרת — סוגר את הפער האחרון ל-resume אוטומטי מלא ללא התערבות.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The chair cannot review every pending halacha. Three independent-lineage judges
(Opus via claude_session · DeepSeek · Gemini-2.5-flash — #1 on LegalBench) vote
on the COARSE axis we proved reliable across models (92%): "is this a genuine,
keepable rule?". Only an agreed verdict acts; every split escalates to the chair
(INV-G10). Buckets: clean→KEEP?; nli_unsupported→entailment re-adjudication;
extraction-defects→re-extraction.
halacha_panel_calibrate.py calibrates the voting policy on the gold-set's
is_holding (the coarse label) per Trust-or-Escalate (ICLR 2025): unanimous →
94.9% precision / 78% coverage; majority → 92.9% / 99%; ZERO false-drops in
both (the panel never rejects a good rule). Chosen policy (chair-approved):
clean→majority-2/3, nli→asymmetric (majority-reject, unanimous-approve),
defects→re-extraction. Reversible (--apply backs up review_status+flags first).
Sources: Panel-of-LLM-Evaluators (PoLL) · Trust-or-Escalate (ICLR 2025,
arXiv:2407.18370) · selective-prediction / learning-to-defer.
Invariants: upholds G10 (human gate — splits escalate, panel only collapses the
queue) and G9 (provenance — reviewer records the panel + policy). Read paths only
in calibrate; --apply writes review_status/quality_flags reversibly with backup.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
COURT_FETCH_SHARED_SECRET + LEGAL_CHAT_SHARED_SECRET migrated to Infisical
nautilus:/legal-ai (2026-06-07). Updated the pm2 config comments: the stale
"migrate to Infisical once the MCP server is back" TODO is now done; local
env files remain the runtime source, Infisical is the SoT/record.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ה-cron של drain_digests הוא מנגנון ה-resume (pending-based, idempotent, host-side,
לא תלוי בסשן). חיזוק: אם enrich נכשל באמצע (מכסת claude נגמרה) השורה נשארה
'completed' עם שדות ריקים → לא היתה מטופלת שוב. עכשיו drain מאפס בתחילתו כל
digest 'completed' עם concept_tag ריק *וגם* underlying_citation ריק (= חילוץ
שמעולם לא נחת; שורה תקינה תמיד מכילה לפחות מראה-מקום) → pending לריצה חוזרת.
כך כל קטיעה/מכסה מתאוששת אוטומטית בריצת ה-cron הבאה.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- scripts/drain_court_fetch.py: drives orchestrator.drain_pending (host-only;
no-op when queue empty). Mirrors drain_halacha_queue.py.
- scripts/legal-court-fetch-drain.config.cjs: pm2 cron (hourly :17, one-shot),
COURT_FETCH_DRAIN_CRON override.
- fix: orchestrator default service URL 127.0.0.1 → 10.0.1.1 (the service binds
the docker0 gateway; the host can't reach it on loopback). Found live — the
first drain failed "connection refused" until corrected.
- SCRIPTS.md entries.
Validated end-to-end in PRODUCTION on a real digest: עת"מ 43830-12-24
(החברה להגנת הטבע) fetched from נט המשפט → case_law (79 chunks, source_url),
digest relinked (INV-DIG3 closed), halacha queued pending_review. job=done.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The gold-set's human role tags were made while seeing a claude AI recommendation,
so human↔AI agreement (~100%) is anchoring, not an independent accuracy signal.
This adds a third, genuinely independent judge — a DIFFERENT model (DeepSeek,
direct OpenAI-compatible API) classifies rule_role BLIND (never sees the human
tag nor the first AI's answer) — and reports an inter-rater agreement matrix.
Finding (100 tagged items): ai↔human 100% (anchored) vs deepseek↔human 50%
fine-grained — BUT 92% on the coarse axis (generalizable-rule vs application/
obiter). Conclusion: the fine sub-type (holding/interpretive/procedural) is an
inherently fuzzy boundary two capable models split differently; the coarse
"is this a real rule" axis is robust across models. Use the coarse axis as
ground truth; treat the sub-type as advisory, never as a gate.
Zero chair tagging, read-only on the gold-set. Key from ~/.hermes deepseek env.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
שלוש שכבות-הגנה נגד דליפת-זיכרון מדפדפנים יתומים, + טיפול בדליפה הגדולה
בפועל בשרת (task-master-mcp).
- camofox_client.py:
- asyncio.wait_for קשיח סביב כל ה-fetch (COURT_FETCH_HARD_TIMEOUT_S=180ש')
— hang → ביטול → async-with tear-down → reap.
- _reap_orphan_browsers(): הורג camoufox-bin יתומים (ppid=1) לפני ואחרי כל
fetch. סדרתיות (INV-CF4) → כל ppid=1 הוא שארית בטוחה.
- scripts/reap_orphan_procs.py: reaper כללי ל-task-master-mcp (~3GB יתומים)
+ camoufox-bin. רק ppid=1; /proc טהור. --dry-run / --loop N.
- scripts/legal-reaper.config.cjs: דמון pm2 (loop 180s, max_memory_restart 100M).
- X13 spec + SCRIPTS.md: תיעוד שכבות-ההגנה.
max_memory_restart בשירות (1.5G) כבר נותן רשת-ביטחון ברמת-התהליך.
Invariants: מקיים INV-CF4 (politeness/serial) — ללא שינוי חוזה.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The extractor classified rule_type by SOURCE bindingness (higher-court→binding,
committee→persuasive) instead of by rule KIND. The gold-set proved it: 'binding'
appeared on 19/19 external rulings & 0 committees; 'persuasive' on 13/13
committees & 0 external — only 58% agreement with the human role tags. The two
axes (authority vs rule role) were crammed into one enum.
This splits them per INV-DM7:
- authority (binding/persuasive) — DERIVED from case_law.precedent_level
(עליון/מנהלי→binding, ועדת_ערר_מחוזית→persuasive), never stored, never
LLM-guessed. New helper halacha_quality.derive_authority; surfaced read-only
in list_halachot / goldset_list / search results.
- rule_type — now the rule ROLE only: holding/interpretive/procedural/
application/obiter. Both extractor prompts unified to this vocabulary;
_coerce_halacha no longer defaults rule_type from the source; legacy
binding→holding / persuasive→interpretive fold for safety.
UI: authority shown as a separate read-only badge (gold=מחייב / muted=משכנע)
across the review queue, precedent detail, and gold-set; the gold-set role
selector drops binding/persuasive and adds מהותי (holding).
Migration: scripts/halacha_rule_role_backfill.py re-classifies the 276 pre-split
binding/persuasive rows into a genuine role via local claude_session (run after
deploy). Gold-set correct_type/ai_correct_type 'binding'→'holding' via SQL.
Sources (≥3, per research-decision policy): OASIS LegalRuleML v1.0
(appliesAuthority/Strength as metadata orthogonal to rule logic) · SemEval-2023
Task 6 LegalEval (rhetorical roles by function, authority kept separate) ·
Bluebook signals (weight-of-authority is a separate dimension).
Invariants: ESTABLISHES INV-DM7. Upholds G1 (normalize at source — extractor
classifies role, system derives authority) and G2 (single source of truth —
authority derived, not a parallel stored field). Tests: 211 pass + new
derive_authority/coerce coverage. web-ui build + tsc clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The chair wanted an independent recommendation beside each tag, to reconsider
his own judgments. Adds a NON-ground-truth AI second-opinion:
- schema: halacha_goldset.ai_is_holding / ai_correct_type / ai_rationale /
ai_generated_at (additive).
- db.goldset_set_ai_recommendation + goldset_list now returns the ai_* fields.
- scripts/goldset_ai_recommend.py — local claude_session judges is_holding +
type + a one-line rationale per item, INDEPENDENTLY (own legal rubric).
Independent of the rule-based validators #81.8 measures → no circularity.
Never auto-applied; QA aid only.
- web-ui: each card shows "🤖 המלצת AI: הלכה/לא · type" + rationale and an
agreement/disagreement chip vs the human tag (amber on disagree); a
"⚠ אי-הסכמות AI (N)" filter to review only the conflicts.
Methodology note kept explicit: the human stays the ground truth; the AI is a
prompt to reconsider, not to copy.
Verified: tsc --noEmit 0; generator stores recs and flags disagreements with
existing human tags.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cross-precedent recurrence of a principle is real but is NOT citation
corroboration (X11) — the 5 candidate pairs have ZERO citations between their
precedents. Recording them in halacha_citation_corroboration would fabricate
citation data and inflate corroboration_count. This adds a proper, separate
halacha-level link for parallel authority.
Schema (V28): equivalent_halachot — symmetric (halacha_a < halacha_b, CHECK +
UNIQUE), non-citation, cross-precedent-only. ON DELETE CASCADE.
db.py:
- link_equivalent_halachot (idempotent; rejects same-id and SAME-precedent pairs
— parallel authority is cross-precedent by definition), unlink, and
list_equivalent_for_halacha.
- list_halachot gains include_equivalents → _annotate_equivalents attaches an
`equivalents` list (both directions) per row.
API: include_equivalents on GET /api/halachot; GET/POST/DELETE
/api/halachot/{id}/equivalents for the chair to view/link/unlink manually.
scripts/halacha_batch_reconcile.py: --link records found cross-precedent pairs
as equivalent_halachot (non-destructive, idempotent).
web-ui: Halacha.equivalents type; the clean review queue fetches
include_equivalents; the review card shows a gold "עיקרון מקביל ב-N" badge + an
expandable list (case + rule + similarity) labeled "אסמכתה מקבילה — לא ציטוט".
Populated the 5 reviewed pairs (chair decision: keep all + link as parallel
authority). Verified: 5 rows; the 1023-20 hub annotates 3 of its halachot with
equivalents; tsc --noEmit exits 0.
Invariants: G1 (model recurrence at source in its own table, not by abusing the
citator); G2 (no parallel path — extends list_halachot); citator integrity
preserved (corroboration stays citation-only).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Halacha-extraction quality (#81) and dedup-on-insert (#82) — engine changes
(pure + tested) plus measurement/ops tooling.
halacha_quality.py
- #81.4 application gate: is_fact_dependent() (high-precision "applied to THIS
case" deixis per the strict rubric §3/§27) + FLAG_APPLICATION. compute_quality_flags
now takes rule_type and flags rule_type=='application' OR fact-dependent —
blocking auto-approve (an illustration is not a generalizable holding).
- #82.3 lexical tail signal: jaccard_shingles / normalized_levenshtein /
lexical_near_duplicate + FLAG_NEAR_DUPLICATE, for the 0.83–0.93 cosine band.
halacha_extractor.py — pass rule_type to the flag computation; re-type a
binding-labeled fact-application to 'application' (mirrors non_decision→obiter).
db.py (store_halachot_for_chunk) — dedup now fetches the nearest same-precedent
neighbor once: cosine ≥ DEDUP → skip (unchanged); cosine in [BAND, DEDUP) with
high lexical overlap → FLAG_NEAR_DUPLICATE (review, not skip — never drop a
possibly-distinct principle unreviewed).
config.py — HALACHA_DEDUP_BAND_COSINE (0.83).
Scripts:
- scripts/halacha_goldset.py (#81.7) — export stratified sample for human
tagging; score validators (P/R/F1) against the tags. Backbone for #81.8.
- scripts/halacha_batch_reconcile.py (#82.7) — conservative cross-precedent
dedup (cosine ≥0.95), dry-run report only.
- scripts/calibrate_halacha_dedup.py (#82.1) — calibrate the lexical thresholds
against the 2026-06-03 cleanup gold-set.
Deferred (documented): #82.4 merge-provenance and #82.5 DB ON CONFLICT/UNIQUE
on normalized quote are NOT included — the current skip+flag behavior is safe,
whereas a UNIQUE on normalized_quote would fail on existing dups and a blind
merge risks losing provenance; they need their own chair-reviewed migration.
#82.6 over-merge guard is moot until merge lands. #81.6 full rhetorical-role
classifier deferred (section pre-filter + application flag cover the practical
case); #81.8 blocked on the human-tagged gold-set (harness now provided).
Verified:
- pytest tests/test_halacha_quality.py — 52 passed (14 new).
- calibrate: configured (0.55,0.70) → precision 1.0 (zero false-merge), recall
0.30 — correct profile for an auto-approve-blocking signal.
- goldset export: 15-row sample CSV. batch reconcile: 819 halachot → 5
cross-precedent candidate pairs.
Invariants: G1 (normalize at source — flag at insert, not at read); §6 (no
silent swallow — suspect items flagged to review, never dropped); G2 (no
parallel path — same store_halachot_for_chunk / compute_quality_flags).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#86.2 backfill + #86.3 benchmark, plus a #86.1 over-strip fix found en route.
extractor.py
- extract_nevo_ratio(): capture Nevo's מיני-רציו block (editorial holdings
summary) before it is stripped — a free professional gold-set (#86.3).
- _DECISION_START hardening (#86.2): the merged #86.1 regex over-stripped.
(a) פסק-דין headers are markdown-wrapped (**פסק דין**); the old anchor
required the keyword as the first line char with one separator, so it
missed the header and matched a citation 32K deep (עמ"נ 50567-07-21,
losing 45% of the body). Now tolerates leading markdown + 0-3 seps,
and the final-nun form (דין ן vs דינו נ).
(b) bare השופט/הנשיא matched CITATIONS ("השופט מ' חשין, פסקה 23"). The
authoring-judge line ends with a colon; we now require it.
ingest.py
- capture the ratio before stripping and store it on the row (best-effort,
non-fatal); also strip the text-upload path (was file-only).
db.py
- add case_law.nevo_ratio column (additive); allow it in update_case_law.
scripts/backfill_nevo_preamble.py (#86.2) — dry-run-by-default data migration:
finds historically-leaked rulings, captures ratio→nevo_ratio, rewrites
full_text (+content_hash), reindexes, and FLAGS (never deletes) halachot whose
quote lives in the removed preamble (review_status=pending_review +
nevo_preamble_leak flag). Safety guard: rows with keep%<--min-keep (60) are
excluded from --apply as suspected over-strip. --apply writes backup+manifest
to data/audit/ first. Chair-gated — NOT applied here.
scripts/nevo_ratio_benchmark.py (#86.3) — LLM-as-judge (local claude_session,
zero cost) measures recall/precision/granularity of our halachot vs the Nevo
ratio. Works pre- and post-backfill (reads nevo_ratio, falls back to full_text).
Verified:
- pytest tests/test_nevo_preamble.py — 12 passed (incl. citation/markdown
over-strip regressions).
- backfill dry-run: 19 leaked rulings, 27 contaminated halachot, all ≥75%
keep (the 32K over-strip is gone).
- benchmark on בג"ץ 1764/05: recall=0.875 precision=1.0 granularity=1.75x.
Invariants: G1 (normalize at source — strip/capture at ingest, not at read);
no silent swallow (contaminated halachot flagged + reported, not dropped);
data-migration is dry-run-default with backup+manifest, chair-gated.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
הספ (docs/spec/, G1–G11) חובר לסוכני Paperclip דרך INV-AG1 אבל לא למסלול
שבו רוב הקוד נכתב בפועל — הסשן האינטראקטיבי של Claude Code. סוגר את הפער
לפני מחזור-2 (FU-9..15), שהוא כולו כתיבת-קוד.
שלוש שכבות אכיפה:
1. תיעוד — CLAUDE.md §"פרוטוקול כתיבת-קוד" + docs/spec בטבלת-הייחוס
2. hook — scripts/spec-guard.sh (PreToolUse על Edit/Write/MultiEdit, רשום
ב-.claude/settings.json) מזכיר פעם-בסשן בכל נגיעה בקובץ-קוד; non-blocking
3. PR — .gitea/PULL_REQUEST_TEMPLATE.md עם סעיף-חובה "Invariants"
המקבילה האינטראקטיבית ל-INV-AG1 שכבר אוכף על הסוכנים (HEARTBEAT §"קריאת-ספ").
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds scripts/rechunk_legacy_precedents.py: selects every case_law with a tiny
chunk (content<50 — the pre-fix chunker fingerprint) and runs
ingest.reindex_case_law (re-chunk+re-embed from stored full_text only, no
re-OCR/LLM, idempotent). Batch-idempotent (re-queries the affected set).
Run result (2026-06-03): 73 precedents reindexed, 0 failed. Tiny chunks
483 -> 4 (99.2%); total precedent_chunks 5019 -> 3115 (fragments merged).
Search verified healthy (substantial coherent passages, no errors).
The 4 residual tiny chunks are isolated section headings ('דיון',
'טענות המשיבים', ...) emitted by the CURRENT (fixed) chunker — not legacy
fragments — and are already filtered at query time (>=50, #55). Minor
chunker edge case, candidate #55 follow-up.
The DB chunk migration is already applied to prod; this commit is the script
+ SCRIPTS.md entry only (no app code change, no deploy needed).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Covers GAP-11 (INV-RET4/G8) and GAP-14 (INV-QA1/G10). Retrieval quality was
never measured (only telemetry observation) and the halacha review backlog was
invisible (the 10/19 gap was found by accident).
Unit B — backlog visibility (pure code, container):
- metrics.halacha_backlog(conn) → {pending_review, approved, rejected, published,
total, oldest_pending_at}; surfaced in metrics.get_dashboard() (get_metrics MCP
tool) and /api/system/diagnostics. Live count revealed 178 pending / 1552 total,
oldest from 2026-05-03 — previously invisible.
Unit A — retrieval eval harness (host-side scripts):
- scripts/eval_gold_bootstrap.py — seeds data/eval/gold-set.jsonl. Two sources:
citations (cited==relevant via search_relevance_feedback — empty until decisions
cite precedents) and known_item (query=case_name → relevant=self; a real
citation-free signal, the methodology #52 checked by hand). Idempotent; preserves
source='chair' rows.
- scripts/eval_retrieval.py — runs the production retrieval path (search_library /
search_internal) over the gold-set; computes precision@k, recall@k, MRR, nDCG@k
(k=5,10); aggregates overall + per-corpus + per-practice_area; writes a report and
a delta vs committed baseline.json (which records the retrieval_config it reflects).
--self-test unit-checks the metric math offline.
Gold-set strategy = hybrid (chair decision): bootstrap + chair review. The citation
source is empty today (0 cited precedents in decisions), so the seed is known-item
(77 queries: 54 internal_decisions + 23 precedent_library). The gold-set is
PROVISIONAL until Dafna reviews it (the domain chair-gate).
Baseline (production config: multimodal+rerank on): R@10=0.987, MRR=0.837,
nDCG@10=0.872. Finding: MULTIMODAL_ENABLED=true slightly lowers known-item recall
(image-page results displace exact name matches) — relevant to #15. precedent_library
weaker than internal (R@10 0.957 vs 1.0) — one external precedent unfindable by name.
"CI gate" realized as discipline (re-runnable harness + committed baseline + run
before/after any retrieval-layer change) — retrieval needs prod DB + Voyage, no CI
runner has that access.
Spec: docs/superpowers/specs/2026-05-31-fu5-eval-harness-design.md
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Dry-run surfaced 2 rows with בל"מ prefix but proceeding_type=ערר. Since the
migration strips the prefix, a wrong proceeding_type would silently lose the
בל"מ signal — must be chair-adjudicated, not auto-applied. Chair table now
flags 4 rows: 2 DUP_CHECK (8047-23) + 2 PROC_MISMATCH.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>