ה-backfill של citation_formatted חשף קריסה ב-apply_to_record: כשפסק-דין
חיצוני מכיל docket שכבר שייך לרשומה כפולה אחרת, נרמול case_number → docket-נקי
נתקל ב-uq_case_law_external_number ומפיל את כל המיזוג (כולל הציטוט).
דוגמה: 'ע"א 3213/97' → '3213/97' שכבר קיים (כפילות נקר).
- db.case_number_collides(case_number, exclude_id) — בודק אם docket כבר שייך
לרשומה לא-internal אחרת (האינדקס החלקי).
- apply_to_record — מדלג על נרמול ה-case_number כשיש התנגשות (כפילות לדדופ
בהמשך, לא ענייננו כאן) וממשיך לכתוב את הציטוט. no-silent-swallow: מתעד warning.
- scripts/backfill_precedent_citations.py — try/except per-row + מונה שגיאות,
כך ששורה אחת לא מפילה את האצווה.
אומת: ריצה-מחדש מלאה ללא קריסה (0 שגיאות); ההתנגשות תועדה ודולגה כצפוי;
פסיקת בית-משפט: 224/228 מולאו, 4 נמנעו (חסר צדדים/תאריך — abstention, INV-AH).
test_fu2b_reconcile ✓.
Invariants: INV-AH (abstention) · G1 (נרמול-בכתיבה נשמר, רק לא קורס) ·
חוקה §6 (אין בליעה שקטה — דילוג מתועד).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
צד-המטא-דאטה (precedent_metadata_extractor) קרא רק GEMINI_API_KEY בעוד בסביבה
קיים GOOGLE_GEMINI_API_KEY — תוקן ב-PR #255 (fallback). הבאג המשני שנותר: כש-
extract_and_apply החזיר 'no_metadata' (כשל-Gemini), מסלול-הדריינר
process_pending_extractions התיישב ל-metadata_status='completed' ללא-תנאי, כך
שהרשומה ננטשה בשקט עם מטא ריק והדריינר לא חזר אליה (נצפה: da2d9ccb '4491-02-21',
5fabdac5 '14306-09-23' — completed אך court/date/summary ריקים).
תיקון (G1 — אבחנת-מקור):
- extract_and_apply מבדיל תוצאה-ריקה: יש full_text → 'extraction_failed' (חולף,
בר-retry); אין full_text → 'no_metadata' (אין מה לחלץ).
- process_pending_extractions (metadata): 'extraction_failed' → חוזר ל-'pending'
(משמר את חותם-התור) במקום להתיישב 'completed'. retry-loop הקיים מנסה שוב,
ואחרי-מיצוי הרשומה נשארת בתור. מסלול reextract_metadata כבר עקבי (חוזר pending
על כל מה שאינו completed/no_changes).
תיקון-נתון (בוצע ידנית דרך כלי-MCP precedent_extract_metadata): da2d9ccb +
5fabdac5 חולצו-מחדש בהצלחה (court/date/summary/headnote/tags מלאים). 0 נותרו
external 'completed-but-empty'.
הערה: מפתח-Gemini אינו נדרש ב-Coolify — המחלץ רץ רק מקומית (precedent_library →
extract_and_apply, host ~/.env עם GOOGLE_GEMINI_API_KEY); app.py מייבא רק את
הקבוע PLACEHOLDER_PENDING_EXTRACTION, לא את פונקציית-החילוץ.
בדיקות: test_metadata_extract_failure_status (transient/permanent/missing). כל
335 בדיקות mcp עוברות. guards נקיים.
Invariants: G1 (אבחנת-מקור, לא התיישבות-בקריאה), INV-G3/X16 (עמידות — בר-retry),
G12 (leak-guard נקי).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
שני באגים בקליטת-פסיקה חיצונית (התגלו בתיק 1132-09-24 שהועלה דרך "פסקה חסרה"):
1. case_number קיבל את מחרוזת-הציטוט המלאה במקום דוקט נקי. הסיבה: overwrite_case_number=True
הועבר רק לנתיב-הפנימי (internal_decisions); נתיב-הדריינר ל-external השאיר את הציטוט שב-
case_number (precedent_library: case_number=citation). היקף: 122 רשומות external_upload.
2. source_type לא נאכף מול precedent_level — רק ה-prompt ביקש מה-LLM. כשה-LLM פלט
level=ועדת_ערר_מחוזית אך source_type=court_ruling, ההחלטה סווגה בספרייה כ"פסיקת בית משפט".
תיקון (ב-apply_to_record, כך שכל הנתיבים נהנים):
• case_number מנורמל לדוקט הנקי כש-(א) caller כופה או (ב) הערך הנוכחי ציטוט-צורני (רווח/אורך>20);
guard _is_clean_docket מבטיח שלעולם לא נכתב ערך לא-דוקט לשדה-הזהות (LLM-זבל נדחה).
• _source_type_for_level גוזר source_type מ-precedent_level ודורס אי-עקביות (ועדת_ערר_*→
appeals_committee; עליון/מנהלי→court_ruling) — מקור-אמת אחד, לא הישענות על עקביות-LLM.
נבדק: 18 unit-tests (docket-validation, level→type mapping) + 3 integration-tests מול
apply_to_record עם DB מדומה (נרמול, אי-דריסת-דוקט-תקין, דחיית-זבל, אכיפת-עקביות). py_compile נקי.
תיקון-נקודתי כבר בוצע ידנית ל-1132-09-24. Backfill ל-122 בנפרד (TaskMaster #141).
Invariants: G1 (תיקון-במקור), G2 (אותו extractor — בלי מסלול מקביל), INV-AH (מקור-אמת
דטרמיניסטי לסיווג, לא ניחוש-LLM). G11 (זהות-תיק נקייה).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The /precedents metadata queue was stuck — 24 rows requested, nothing draining
them — and the agentic claude CLI hit error_max_turns on what is a single
structured text→JSON task (slow + flaky). Metadata extraction is bounded
extraction, the wrong fit for an agentic loop.
- gemini_session.py: query_json drop-in (gemini-2.5-flash, JSON mode, httpx —
no new SDK dep). Reads GEMINI_API_KEY (~/.env; SoT Infisical
nautilus:/external-apis/gemini). Host-side only — no LLM from the container.
- precedent_metadata_extractor: claude_session.query_json → gemini_session.
Validated live: rich, accurate fields (case_name/summary/appeal_subtype/tags).
- process_pending_extractions: kind-aware cooldown — metadata 2s (Gemini, fast),
halacha keeps 30s (Claude rate limits).
- drain_metadata_queue.py + legal-metadata-drain.config.cjs (pm2 cron */15) so
the queue never clogs again. SCRIPTS.md.
- X8 INV-FP5 updated: per-task engine choice (Gemini=bounded metadata,
claude_session=agentic halacha), both host-side, single canonical queue (G2).
Agentic/voice-sensitive work (writing, analysis, halacha) stays on claude_session
(Daphna's subscription). Gemini cost ≈ $0.10/1M tokens — negligible.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wire db.recompute_searchable into the ingest pipeline (after statuses are set) and into
extract_and_apply (after fields are persisted to DB, success path only).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Until now, "case_number" was the only stored identifier for a precedent.
But a *citation per the Israeli unified citation rules* is a different
beast — it has bold parties, an unbold prefix (court abbrev + panel/
district parenthetical + case number), and an unbold trailing reporter
(נבו / פ"ד...). Without storing it as a first-class field we couldn't
hand the chair a one-click "copy as citation" experience for pasting
into decisions.
Changes:
- Schema V19: case_law.citation_formatted TEXT (Markdown — parties
wrapped in **…** so the copy helper can render <strong> for Word/Docs
paste and keep plain-text fallback meaningful).
- Metadata extractor: composes citation_formatted from the document
text per the unified citation rules, with worked examples for ע"א /
עת"מ / ערר / בל"מ in the prompt. Refuses to store half-formed strings.
- PATCH /api/precedent-library/{id} accepts citation_formatted so the
chair can correct LLM mistakes.
- /precedents/[id]: dedicated "מראה מקום" block with bold rendering,
a copy-to-clipboard button (text/html + text/plain so Word keeps
the bolds), and an inline edit textarea.
- /precedents list rows: link displays the formatted citation when
available, with a small inline copy button — falls back to the bare
case_number for older rows.
Backfill of existing rows happens by re-stamping the extraction queue
once V19 has rolled out and the new field is reachable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The missing-precedents drawer + general precedent upload both required
the user to type chair_name, district, practice_area, court, date etc.
upfront — even though those fields can be (and already are, post-upload)
extracted from the document text by the LLM. The metadata-extraction
wakeup also only fired for the /precedent-library/upload path, leaving
missing-precedents committee uploads stuck with whatever stub the user
typed.
Changes:
- Extractor learns chair_name + district, overwrites the new
PLACEHOLDER_PENDING_EXTRACTION sentinel for internal_committee rows
(the DB CHECK forces non-empty; we stamp the placeholder at insert).
- missing_precedent_upload no longer 400s on missing chair/district;
it infers district from the citation when possible, falls back to
the placeholder, and always fires pc_wake_for_precedent_extraction
so the LLM can fill in the rest.
- Both upload sheets default to file (+ citation) only; every other
field is tucked into a closed <details> labeled "אופציונלי — דריסה
ידנית של שדות שיחולצו אוטומטית". Required validators on chair/
district/practice_area dropped — the LLM fills them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same case_number can exist as both a regular appeal (ערר) and an
extension-of-time request (בל"מ), and we were inferring the difference
from appeal_subtype prefixes — fragile, and case-number lookups
weren't disambiguated. Now stored as a first-class field on both
case_law (corpus) and cases (live cases), with partial unique indexes
on (case_number, proceeding_type).
- SCHEMA_V15: column + CHECK constraints + backfill from
appeal_subtype LIKE 'extension_request_%' + partial unique indexes
replace the old global UNIQUE(case_number).
- derive_proceeding_type() centralizes the inference rule
(extension_request_* → בל"מ; subject regex fallback; default ערר).
- Metadata extractor prompt asks Claude to populate the new field
explicitly; apply_to_record writes it for internal_committee rows.
- internal_decision_upload, case_create, case_update accept an
optional proceeding_type; FastAPI request models expose it.
- Wizard + edit dialog get a sided Select; case header renders the
resolved label (ערר / בל"מ).
- Uploaded the 2 staged בל"מ decisions on betterment levy:
8126/24 (סופר נוח, 13 chunks), 8047/23 (הרנון, 48 chunks).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "חלץ מטא-דאטה" / "חלץ הלכות" buttons in the UI were returning 404
for any precedent with `source_kind != 'external_upload'`. The original
restriction was meant to keep LLM extraction off internal-committee
imports (their metadata supposedly came from the case file system),
but the same precedent rows can still need re-extraction when ingest
produces broken data — e.g. the corrupted `subject_tags` value
`['[','"','ה','י',...]` that motivated this change (an early ingest
stored a JSON literal into a TEXT[] column, which Postgres split into
single chars).
Two changes here:
1. db.request_metadata_extraction / request_halacha_extraction:
drop the `AND source_kind='external_upload'` filter. The extractor
already preserves user values (only fills empty fields), so this
is safe.
2. precedent_metadata_extractor.extract_and_apply: detect the
character-by-character corruption above and treat it as empty so
the freshly-extracted tags actually replace the broken ones.
Heuristic: 3+ elements where every element is at most 2 chars
(legitimate tags are multi-character Hebrew words).
Coolify deploy required for the FastAPI container to pick this up.
Two follow-ups after running the metadata extractor on 403-17:
1. Library table: shadcn TableCell defaults to whitespace-nowrap and
the table wrapper has overflow-x-auto, so the long citation forced
a horizontal scrollbar inside the row. Override on the citation
cell only — whitespace-normal + break-words + min/max-w to keep the
column readable. Same for the case-name cell. Row aligns to top so
wrapping doesn't push neighbours up.
2. Extractor now also fills source_type (court_ruling /
appeals_committee). The previous round added decision_date_iso,
precedent_level, and court but left source_type empty. Same
closed-enum + merge-only-if-empty policy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first end-to-end run on 403-17 surfaced three fields the auto-fill
left blank because the chair didn't set them in the upload form: date,
precedent_level, and court. All three are right there in the ruling's
header text — there's no reason to require manual entry.
Prompt now asks for:
- decision_date_iso (YYYY-MM-DD parsed from "ניתנה היום, … 5 בספטמבר 2022"
style signatures)
- precedent_level (closed enum: עליון/מנהלי/ועדת_ערר_ארצית/ועדת_ערר_מחוזית)
- court (the full court name from the title block)
Validation is unchanged: precedent_level only accepts the four enum
values; decision_date_iso is parsed into a Python date object before
being handed to update_case_law (asyncpg doesn't coerce strings to
DATE columns); court is stored verbatim.
Merge policy is unchanged — only fills empty fields. Anything the
chair typed in the upload form survives.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three fixes to the precedent library after the first end-to-end test on
403-17 surfaced runtime issues:
1. Anthropic SDK fallback in claude_session. The legal-ai Docker container
does not ship the `claude` CLI, so every halacha and metadata extraction
was failing with "Claude CLI not found." Module now tries the CLI first
(zero-cost local path) and falls back to the Anthropic SDK with
ANTHROPIC_API_KEY when the binary is absent. Default model is
claude-sonnet-4-6, overridable via CLAUDE_SDK_MODEL env. The system
message gets cache_control: ephemeral so multi-chunk runs reuse the
cached instruction prefix at ~10% read cost. Adds `anthropic` to
pyproject deps.
2. precedent_metadata_extractor crashed with KeyError because the JSON
example inside the prompt template contained literal { } characters
that str.format() interpreted as placeholders. Switched to f-string
concatenation; the prompt template no longer needs format() at all.
3. Library list query stays stale after upload because the upload
mutation's onSuccess fires when the POST returns task_id, not when
SSE reports completion. Added a second invalidate inside the SSE
watcher in PrecedentUploadSheet so the new row appears with up-to-date
chunk and halachot counts the moment processing finishes.
Halacha and metadata extractors now route the long static prompt through
the new `system=` parameter so the SDK path actually caches it; the CLI
path concatenates and behaves as before.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three improvements to the precedent library based on usage feedback:
1. Auto-fill metadata at upload time. New service
precedent_metadata_extractor reads the ruling's full_text and
suggests case_name (short), summary, headnote, key_quote,
subject_tags, appeal_subtype. The merge policy fills only empty
fields, preserving everything the chair typed in the upload form.
Wired into the ingest pipeline; also exposed as a re-run endpoint
POST /api/precedent-library/{id}/extract-metadata for existing
records.
2. Edit sheet in the UI. Pencil icon on each library row opens a
pre-populated form covering every field. A Sparkles button on the
sheet runs the metadata extractor on demand and refreshes the
form. The case_number is read-only because halachot are FK'd to
it; renaming requires delete + re-upload.
3. Halacha extractor branches on is_binding. Sources marked binding
(Supreme/Administrative) keep the strict halacha prompt. Non-binding
sources (other appeals committees, district courts on planning
matters) get a different prompt that extracts applications,
interpretive principles, and persuasive conclusions — labeled with
new rule_types 'application' and 'persuasive'. The fallback also
widens chunk selection: if the chunker labeled nothing as
legal_analysis/ruling/conclusion, we now run on all chunks rather
than returning zero halachot for a usable ruling.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>