All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 3m5s
Observed 2026-05-03: a `precedent_process_pending(halacha)` run that
chained two precedents (1110/20 → 317/10) succeeded for the first
(9 halachot, 129 chunks) and produced status=`no_halachot` for the
second despite it being a 47KB Supreme Court ruling with rich legal
analysis. A manual single-precedent re-run on 317/10 immediately
extracted 53 halachot. Diagnosis: every chunk's claude_session call
in the back-to-back run silently failed (likely Anthropic rate-limit
storm after the 1110/20 token burn), and the empty list was reported
as "Claude looked and found nothing" — same code path as a real
0-halacha ruling. The user couldn't tell the difference.
Three changes:
1. Surface chunk-level failures (halacha_extractor.py)
`_extract_chunk` now returns `(halachot, succeeded)` so the caller
can count how many chunks crashed. `extract()` uses this to
distinguish:
- `no_halachot` — chunks ran cleanly, Claude found nothing
- `extraction_failed` — ≥50% of chunks crashed AND zero halachot
came back (rate limit, subprocess crash, etc.)
When `extraction_failed`, DB status is left as 'processing' so the
request stays in the queue for the caller to retry — instead of
the old behaviour where it got marked 'completed' and silently
dropped from the queue.
2. Inter-precedent cooldown (precedent_library.py)
`process_pending_extractions` now sleeps 30s between precedents.
Anthropic rate-limits per-org, and back-to-back large rulings
(~4M tokens for 1110/20, immediately followed by another 2-3M)
was the empirical trigger. 30s gives the per-minute counter time
to drain.
3. Auto-retry on extraction_failed (precedent_library.py)
When a precedent comes back as `extraction_failed`, retry once
after a 60s cooldown before giving up. Rate-limit storms are
transient — the manual re-run of 317/10 minutes later succeeded
with 53 halachot and zero chunk failures, confirming a single
retry is sufficient. Only retries `extraction_failed`; never
`no_halachot` (Claude looked and there genuinely is no holding).
The DB status now ends up as 'failed' only after retries are
exhausted, matching the UI's terminal-failure chip.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
474 lines
22 KiB
Python
474 lines
22 KiB
Python
"""Extract binding legal rules (הלכות) from external court rulings.
|
||
|
||
Runs Claude (via the local headless ``claude -p`` bridge) over the
|
||
legal_analysis / ruling / conclusion chunks of a precedent, returns a
|
||
structured list of halachot, validates each one against the source text,
|
||
embeds the rule statement, and stores everything as ``pending_review`` in
|
||
the ``halachot`` table.
|
||
|
||
All extraction is idempotent — calling ``extract(case_law_id)`` twice
|
||
deletes prior rows for that precedent first.
|
||
|
||
Trust model:
|
||
Per chair decision, NO halacha is auto-published. Every extracted
|
||
halacha enters with ``review_status='pending_review'``. The chair
|
||
approves/rejects via the UI, and only ``approved`` (or ``published``)
|
||
rows are visible to ``search_precedent_library`` and the writing
|
||
agents.
|
||
"""
|
||
|
||
from __future__ import annotations
|
||
|
||
import asyncio
|
||
import logging
|
||
import re
|
||
from uuid import UUID
|
||
|
||
from legal_mcp import config
|
||
from legal_mcp.config import parse_llm_json
|
||
from legal_mcp.services import claude_session, db, embeddings, proofreader
|
||
|
||
logger = logging.getLogger(__name__)
|
||
|
||
|
||
# Concurrency model mirrors claims_extractor — each ``claude -p`` subprocess
|
||
# holds ~300 MB RSS, so we cap parallel chunks to keep the box healthy.
|
||
CHUNK_CONCURRENCY = 3
|
||
CHUNK_RETRY_ATTEMPTS = 1
|
||
|
||
# If at least this fraction of chunks crash and the precedent yields zero
|
||
# halachot, treat the run as `extraction_failed` rather than `no_halachot`.
|
||
# Picked at 0.5 so a precedent that genuinely has no holdings (e.g. a remand
|
||
# ruling that just sends the case back) isn't misflagged just because a few
|
||
# chunks timed out, while a real rate-limit storm — which kills nearly every
|
||
# call — is correctly distinguished and re-tried by the caller.
|
||
EXTRACTION_FAILURE_THRESHOLD = 0.5
|
||
|
||
# Sections from which to extract. facts/intro/appellant_claims/respondent_claims
|
||
# never contain holdings, only positions, so we skip them.
|
||
EXTRACTABLE_SECTIONS = ("legal_analysis", "ruling", "conclusion")
|
||
|
||
|
||
# Two prompts — choose by source's is_binding flag.
|
||
#
|
||
# The binding prompt extracts strict halachot (rules a future panel MUST
|
||
# follow). It rejects obiter dicta, factual findings, and citations of
|
||
# other rulings that the present court only mentioned in passing.
|
||
#
|
||
# The persuasive prompt is for sources that don't establish binding law
|
||
# (most appeals committee decisions, district courts on planning matters,
|
||
# etc.). For those, the value is in **how the panel reasoned and applied**
|
||
# established law to facts — not in new halachot. The user explicitly
|
||
# wants to be able to cite "another committee reached the same conclusion"
|
||
# even though it is not binding.
|
||
#
|
||
# The schema's rule_type field accepts six values:
|
||
# binding | interpretive | procedural | obiter | application | persuasive
|
||
|
||
HALACHA_EXTRACTION_PROMPT_BINDING = """אתה משפטן בכיר המתמחה בדיני תכנון ובניה (ועדות ערר, היטל השבחה, פיצויים לפי סעיף 197 לחוק התכנון והבניה). תפקידך: לחלץ הלכות מחייבות מתוך פסק דין/החלטה משפטית של ערכאה עליונה (עליון / מנהלי).
|
||
|
||
## הגדרות מחייבות
|
||
|
||
הלכה (binding rule) = כלל משפטי שהפסק קובע או מאמץ ומיישם, באופן שניתן להסתמך עליו בהחלטות עתידיות.
|
||
|
||
לא-הלכה (אין לחלץ):
|
||
- אמרת אגב (obiter dicta) — הערות שאינן הכרחיות להכרעה.
|
||
- ממצאים עובדתיים ספציפיים לתיק ("העורר לא הוכיח X").
|
||
- ציטוטי הלכות מפסקי דין אחרים שלא אומצו במפורש בפסק זה.
|
||
- הצהרות על דין קיים שאינן מיושמות בהכרעה.
|
||
|
||
הבחנה קריטית: כאשר הפסק מצטט הלכה מפסק קודם, חלץ אותה רק אם בית המשפט בפסק הנוכחי **מאמץ ומחיל** אותה (לא רק מזכיר אותה ברקע).
|
||
|
||
## תחומים אפשריים (practice_areas) — תחומי ועדת הערר בלבד
|
||
- rishuy_uvniya — רישוי ובניה (תיקי 1xxx: היתרים, שימוש חורג, תכניות, קווי בניין, גובה, חניה)
|
||
- betterment_levy — היטל השבחה (תיקי 8xxx: שומה, מערכות, תכניות המקנות בה, מועד קובע, סופיות ההחלטה)
|
||
- compensation_197 — פיצויים לפי ס' 197 (תיקי 9xxx: פגיעה במקרקעין, ירידת ערך, ס' 200/פטור)
|
||
|
||
הלכה אחת יכולה לחול על כמה תחומים — practice_areas הוא array ולא string יחיד.
|
||
|
||
## סוגי הלכה (rule_type)
|
||
- binding — הלכה מחייבת שהוחלה על התיק.
|
||
- interpretive — פרשנות סעיף חוק/תכנית שאומצה.
|
||
- procedural — כלל פרוצדורלי (סמכות, מועדים, הליכי שמיעה).
|
||
- obiter — אמרת אגב חשובה (חלץ רק אם משמעותית; סמן confidence נמוך).
|
||
|
||
## פלט נדרש
|
||
החזר JSON array בלבד, ללא markdown, ללא הסברים. דוגמה:
|
||
[
|
||
{
|
||
"rule_statement": "ניסוח הכלל בלשון משפטית מדויקת בגוף שלישי, 1-3 משפטים.",
|
||
"rule_type": "binding",
|
||
"reasoning_summary": "תמצית ההיגיון: למה בית המשפט הגיע לכלל הזה (1-2 משפטים).",
|
||
"supporting_quote": "ציטוט מילולי מדויק מהפסק התומך בכלל. חייב להופיע מילה במילה בטקסט הקלט.",
|
||
"page_reference": "פס' 12 / עמ' 8 — ככל שניתן לזהות מהקלט.",
|
||
"practice_areas": ["betterment_levy"],
|
||
"subject_tags": ["מועד_קביעת_שומה", "סופיות_ההחלטה"],
|
||
"cites": ["עע\\"מ 3975/22"],
|
||
"confidence": 0.85
|
||
}
|
||
]
|
||
|
||
## כללי איכות
|
||
1. **נאמנות מוחלטת לציטוט** — supporting_quote חייב להיות הדבקה מדויקת מהקלט. אם אין ציטוט מתאים — אל תמציא הלכה.
|
||
2. **מספר הלכות** — פסק רגיל מכיל 1-4 הלכות מחייבות. אל תמתח את הרשימה. אם אין הלכה — החזר [].
|
||
3. **לא לפצל יתר על המידה** — אם שני סעיפים מבטאים את אותו עיקרון, אחד את הניסוח.
|
||
4. **שפה** — rule_statement בעברית משפטית מקצועית, לא צמצום מילולי של הציטוט.
|
||
5. **subject_tags** — 2-5 תגיות בעברית, snake_case (חניה, קווי_בניין, שיקול_דעת, פגם_פרוצדורלי, סמכות, מועדים, פגיעה_במקרקעין, ירידת_ערך).
|
||
6. **confidence** — 0..1. מתחת ל-0.7 = ספק לגבי היות זה הלכה מחייבת.
|
||
"""
|
||
|
||
|
||
HALACHA_EXTRACTION_PROMPT_PERSUASIVE = """אתה משפטן בכיר המתמחה בדיני תכנון ובניה. תפקידך: לחלץ עקרונות, יישומים ומסקנות מתוך החלטה של ועדת ערר אחרת או של בית משפט שאינו ערכאה עליונה לסוגיה.
|
||
|
||
## חשוב — מה לחלץ ומה לא
|
||
|
||
המקור הזה **אינו** מקור להלכות מחייבות חדשות (binding rules). הלכות מחייבות מגיעות מהעליון/מנהלי. עם זאת, יש כאן ערך משמעותי שצריך לחלץ — איך הפנל הזה ניתח ויישם את הדין הקיים. כשנכתוב החלטה עתידית, נצטט מהמקור הזה כ"גם ועדת הערר ב-X הגיעה למסקנה דומה" — לא כסמכות מחייבת, אלא כתמיכה משכנעת.
|
||
|
||
**יש לחלץ:**
|
||
- **יישום של הלכה ידועה** (rule_type=`application`) — הפנל החיל הלכה ידועה (של עליון/מנהלי) על עובדות הנידונות. תצטט את ניסוח הכלל **כפי שהוצג כאן** (לא בהכרח כפי שנקבע במקור) ואת התוצאה.
|
||
- **עקרון פרשני שאומץ** (rule_type=`interpretive`) — איך הפנל פירש סעיף חוק / תכנית, באופן שניתן לאמץ.
|
||
- **כלל פרוצדורלי** (rule_type=`procedural`) — קביעות בנושאי סמכות, מועדים, הליך.
|
||
- **מסקנה מנומקת ומשכנעת** (rule_type=`persuasive`) — מסקנה שלמה של הפנל בסוגיה, עם ההיגיון התומך, ניתנת לציטוט כאסמכתא משכנעת.
|
||
|
||
**אין לחלץ:**
|
||
- ממצאים עובדתיים ספציפיים לתיק ("העורר לא הוכיח X").
|
||
- ציטוטים מפסקי דין אחרים ללא ניתוח של הפנל.
|
||
- אמרות אגב חסרות חשיבות.
|
||
|
||
## תחומים אפשריים (practice_areas) — תחומי ועדת הערר בלבד
|
||
- rishuy_uvniya — רישוי ובניה (תיקי 1xxx: היתרים, שימוש חורג, תכניות, קווי בניין, גובה, חניה)
|
||
- betterment_levy — היטל השבחה (תיקי 8xxx: שומה, מערכות, תכניות המקנות בה, מועד קובע, סופיות ההחלטה)
|
||
- compensation_197 — פיצויים לפי ס' 197 (תיקי 9xxx: פגיעה במקרקעין, ירידת ערך, ס' 200/פטור)
|
||
|
||
## פלט נדרש
|
||
החזר JSON array בלבד, ללא markdown, ללא הסברים:
|
||
[
|
||
{
|
||
"rule_statement": "ניסוח הכלל / המסקנה / היישום בלשון משפטית מדויקת, 1-3 משפטים.",
|
||
"rule_type": "application",
|
||
"reasoning_summary": "תמצית ההיגיון של הפנל (1-2 משפטים).",
|
||
"supporting_quote": "ציטוט מילולי מדויק מהקלט שתומך בכלל. חייב להופיע מילה במילה.",
|
||
"page_reference": "פס' 12 / עמ' 8 — ככל שניתן לזהות.",
|
||
"practice_areas": ["betterment_levy"],
|
||
"subject_tags": ["מועד_קביעת_שומה", "תכנית_רחביה"],
|
||
"cites": ["עע\\"מ 3975/22"],
|
||
"confidence": 0.85
|
||
}
|
||
]
|
||
|
||
## כללי איכות
|
||
1. **נאמנות מוחלטת לציטוט** — supporting_quote חייב להיות הדבקה מדויקת מהקלט. אם אין ציטוט מתאים — אל תוסיף את ההלכה.
|
||
2. **מספר הלכות** — החלטה ארוכה של ועדת ערר יכולה להניב 2-8 פריטים (יישומים + מסקנות). אם אין מה לחלץ — החזר [].
|
||
3. **rule_type מדויק** — application = יישום הלכה ידועה. interpretive = פרשנות. procedural = פרוצדורה. persuasive = מסקנה כללית בעלת ערך כאסמכתא.
|
||
4. **לא לפצל יתר על המידה** — שני סעיפים זהים מבחינה רעיונית = פריט אחד.
|
||
5. **שפה** — עברית משפטית מקצועית, גוף שלישי.
|
||
6. **subject_tags** — 2-5 תגיות בעברית, snake_case.
|
||
7. **confidence** — 0..1. דייק.
|
||
"""
|
||
|
||
|
||
_VALID_PRACTICE_AREAS = {"rishuy_uvniya", "betterment_levy", "compensation_197"}
|
||
_VALID_RULE_TYPES = {
|
||
"binding", "interpretive", "procedural", "obiter",
|
||
"application", "persuasive",
|
||
}
|
||
|
||
|
||
def _normalize_for_comparison(text: str) -> str:
|
||
"""Normalize Hebrew text for substring matching.
|
||
|
||
Collapses whitespace and unifies the half-dozen Hebrew quote-mark
|
||
variants. Use ``proofreader._fix_hebrew_quotes`` for the quote part
|
||
so we stay consistent with the proofreader pipeline.
|
||
"""
|
||
fixed = proofreader._fix_hebrew_quotes(text)
|
||
# Collapse all whitespace (newlines, tabs, multiple spaces) to a single space.
|
||
return re.sub(r"\s+", " ", fixed).strip()
|
||
|
||
|
||
def _verify_quote(supporting_quote: str, full_text: str) -> bool:
|
||
"""Return True if ``supporting_quote`` appears verbatim in ``full_text``
|
||
after Hebrew quote/whitespace normalization.
|
||
|
||
The LLM occasionally trims a leading/trailing word from the quote;
|
||
we accept the quote if at least 90% of its characters match a
|
||
contiguous substring of the source.
|
||
"""
|
||
if not supporting_quote.strip():
|
||
return False
|
||
normalized_quote = _normalize_for_comparison(supporting_quote)
|
||
normalized_text = _normalize_for_comparison(full_text)
|
||
if not normalized_quote:
|
||
return False
|
||
if normalized_quote in normalized_text:
|
||
return True
|
||
# Fallback: try the inner 90% of the quote (drops boundary trim).
|
||
if len(normalized_quote) >= 30:
|
||
trim = max(2, len(normalized_quote) // 20)
|
||
inner = normalized_quote[trim:-trim]
|
||
if inner and inner in normalized_text:
|
||
return True
|
||
return False
|
||
|
||
|
||
def _coerce_halacha(raw: dict, is_binding: bool = True) -> dict | None:
|
||
"""Validate and normalize one LLM-returned halacha dict.
|
||
|
||
Returns ``None`` if the entry is missing required fields. ``is_binding``
|
||
only affects the default rule_type when the LLM returned an unknown
|
||
value — for binding sources we default to ``binding``, otherwise to
|
||
``persuasive`` (never pretend an appeals committee created halacha).
|
||
"""
|
||
if not isinstance(raw, dict):
|
||
return None
|
||
rule_statement = (raw.get("rule_statement") or "").strip()
|
||
supporting_quote = (raw.get("supporting_quote") or "").strip()
|
||
if not rule_statement or not supporting_quote:
|
||
return None
|
||
|
||
default_rule_type = "binding" if is_binding else "persuasive"
|
||
rule_type = (raw.get("rule_type") or default_rule_type).strip().lower()
|
||
if rule_type not in _VALID_RULE_TYPES:
|
||
rule_type = default_rule_type
|
||
# Guard: don't let a non-binding source produce 'binding' rule_type
|
||
if not is_binding and rule_type == "binding":
|
||
rule_type = "persuasive"
|
||
|
||
practice_areas_raw = raw.get("practice_areas") or []
|
||
if isinstance(practice_areas_raw, str):
|
||
practice_areas_raw = [practice_areas_raw]
|
||
practice_areas = [p for p in practice_areas_raw if p in _VALID_PRACTICE_AREAS]
|
||
|
||
subject_tags_raw = raw.get("subject_tags") or []
|
||
if isinstance(subject_tags_raw, str):
|
||
subject_tags_raw = [subject_tags_raw]
|
||
subject_tags = [str(t).strip() for t in subject_tags_raw if str(t).strip()]
|
||
|
||
cites_raw = raw.get("cites") or []
|
||
if isinstance(cites_raw, str):
|
||
cites_raw = [cites_raw]
|
||
cites = [str(c).strip() for c in cites_raw if str(c).strip()]
|
||
|
||
try:
|
||
confidence = float(raw.get("confidence", 0.0))
|
||
except (TypeError, ValueError):
|
||
confidence = 0.0
|
||
confidence = max(0.0, min(1.0, confidence))
|
||
|
||
return {
|
||
"rule_statement": rule_statement,
|
||
"rule_type": rule_type,
|
||
"reasoning_summary": (raw.get("reasoning_summary") or "").strip(),
|
||
"supporting_quote": supporting_quote,
|
||
"page_reference": (raw.get("page_reference") or "").strip(),
|
||
"practice_areas": practice_areas,
|
||
"subject_tags": subject_tags,
|
||
"cites": cites,
|
||
"confidence": confidence,
|
||
}
|
||
|
||
|
||
async def _extract_chunk(
|
||
chunk_text: str,
|
||
section_type: str,
|
||
chunk_index: int,
|
||
chunk_total: int,
|
||
context: str,
|
||
is_binding: bool,
|
||
) -> tuple[list[dict], bool]:
|
||
"""Run the halacha extractor on one chunk with retry.
|
||
|
||
Returns ``(halachot, succeeded)`` so the caller can distinguish "Claude
|
||
said there are no halachot here" (`(_, True)`) from "every attempt
|
||
crashed/timed out" (`(_, False)`). Without this distinction a precedent
|
||
that hit a rate-limit storm looks identical to one that genuinely has no
|
||
halachot — and gets silently marked `no_halachot`.
|
||
|
||
The prompt branches on ``is_binding`` so non-binding sources (other
|
||
appeals committees, district courts) yield application/persuasive
|
||
entries rather than a forced 0-result strict halacha pass.
|
||
"""
|
||
base_prompt = (
|
||
HALACHA_EXTRACTION_PROMPT_BINDING if is_binding
|
||
else HALACHA_EXTRACTION_PROMPT_PERSUASIVE
|
||
)
|
||
chunk_label = f" (חלק {chunk_index + 1}/{chunk_total})" if chunk_total > 1 else ""
|
||
# Pass the static instruction prompt as `system` so the SDK path can cache
|
||
# it (5-min ephemeral). Only the per-chunk content varies via `prompt`.
|
||
user_msg = (
|
||
f"## הקלט\n"
|
||
f"סוג קטע: {section_type}\n"
|
||
f"{context}{chunk_label}\n\n"
|
||
f"--- תחילת הטקסט ---\n{chunk_text}\n--- סוף הטקסט ---"
|
||
)
|
||
last_err: Exception | None = None
|
||
for attempt in range(CHUNK_RETRY_ATTEMPTS + 1):
|
||
try:
|
||
result = await claude_session.query_json(user_msg, system=base_prompt)
|
||
except Exception as e:
|
||
last_err = e
|
||
logger.warning(
|
||
"halacha_extractor chunk %d/%d attempt %d raised: %s",
|
||
chunk_index + 1, chunk_total, attempt + 1, e,
|
||
)
|
||
continue
|
||
if isinstance(result, list):
|
||
return result, True
|
||
logger.warning(
|
||
"halacha_extractor chunk %d/%d attempt %d returned non-list (%s)",
|
||
chunk_index + 1, chunk_total, attempt + 1, type(result).__name__,
|
||
)
|
||
logger.error(
|
||
"halacha_extractor chunk %d/%d failed after %d attempts: %s",
|
||
chunk_index + 1, chunk_total, CHUNK_RETRY_ATTEMPTS + 1, last_err,
|
||
)
|
||
return [], False
|
||
|
||
|
||
async def extract(case_law_id: UUID | str) -> dict:
|
||
"""Extract halachot from an uploaded precedent and store them.
|
||
|
||
Idempotent: replaces any existing halachot for this case_law_id.
|
||
All inserted rows start as ``review_status='pending_review'``.
|
||
|
||
Returns:
|
||
``{"status": "...", "extracted": N, "verified": M, "stored": K, ...}``
|
||
"""
|
||
if isinstance(case_law_id, str):
|
||
case_law_id = UUID(case_law_id)
|
||
|
||
record = await db.get_case_law(case_law_id)
|
||
if not record:
|
||
return {"status": "not_found", "extracted": 0, "stored": 0}
|
||
|
||
is_binding = bool(record.get("is_binding"))
|
||
|
||
# Try the targeted sections first (legal_analysis / ruling / conclusion).
|
||
# If the chunker labeled everything as 'other' (common when a ruling
|
||
# uses non-standard headings or the section markers aren't bracketed
|
||
# cleanly), fall back to ALL chunks — better to over-include than to
|
||
# silently skip a ruling that has reasoning under an unexpected label.
|
||
chunks = await db.list_precedent_chunks(
|
||
case_law_id, section_types=EXTRACTABLE_SECTIONS,
|
||
)
|
||
if not chunks:
|
||
chunks = await db.list_precedent_chunks(case_law_id)
|
||
if chunks:
|
||
logger.info(
|
||
"halacha_extractor: case_law=%s — no targeted sections, "
|
||
"falling back to all %d chunks",
|
||
case_law_id, len(chunks),
|
||
)
|
||
if not chunks:
|
||
await db.set_case_law_halacha_status(case_law_id, "completed")
|
||
return {"status": "no_chunks", "extracted": 0, "stored": 0}
|
||
|
||
await db.set_case_law_halacha_status(case_law_id, "processing")
|
||
await db.delete_halachot(case_law_id)
|
||
|
||
citation = record.get("case_number", "")
|
||
court = record.get("court", "")
|
||
date_str = str(record.get("date") or "")
|
||
context = f"מקור: {citation} — {court}, {date_str}"
|
||
|
||
sem = asyncio.Semaphore(CHUNK_CONCURRENCY)
|
||
|
||
async def _bounded(idx: int, chunk_row: dict) -> tuple[list[dict], bool]:
|
||
async with sem:
|
||
return await _extract_chunk(
|
||
chunk_row["content"], chunk_row["section_type"],
|
||
idx, len(chunks), context, is_binding,
|
||
)
|
||
|
||
chunk_results = await asyncio.gather(
|
||
*[_bounded(i, c) for i, c in enumerate(chunks)]
|
||
)
|
||
raw_halachot: list[dict] = []
|
||
failed_chunks = 0
|
||
for items, ok in chunk_results:
|
||
raw_halachot.extend(items)
|
||
if not ok:
|
||
failed_chunks += 1
|
||
|
||
# If most chunks failed (rate limit storm, claude_session crash, etc.)
|
||
# do NOT touch the DB status — leave it 'processing' so the caller can
|
||
# retry without the request falling out of the queue. The caller
|
||
# (`process_pending_extractions`) is responsible for either retrying or
|
||
# finalising the status as 'failed' after retries are exhausted. This
|
||
# is the bug that produced 317/10's silent `no_halachot` after a
|
||
# 129-chunk neighbour saturated the API.
|
||
failure_rate = failed_chunks / len(chunks) if chunks else 0
|
||
if failure_rate >= EXTRACTION_FAILURE_THRESHOLD and not raw_halachot:
|
||
logger.error(
|
||
"halacha_extractor: case_law=%s extraction_failed — "
|
||
"%d/%d chunks failed (rate=%.0f%%), no halachot retrieved. "
|
||
"DB status left as 'processing' for caller-level retry.",
|
||
case_law_id, failed_chunks, len(chunks), failure_rate * 100,
|
||
)
|
||
return {
|
||
"status": "extraction_failed",
|
||
"extracted": 0,
|
||
"stored": 0,
|
||
"failed_chunks": failed_chunks,
|
||
"total_chunks": len(chunks),
|
||
}
|
||
|
||
if not raw_halachot:
|
||
await db.set_case_law_halacha_status(case_law_id, "completed")
|
||
return {
|
||
"status": "no_halachot",
|
||
"extracted": 0,
|
||
"stored": 0,
|
||
"failed_chunks": failed_chunks,
|
||
"total_chunks": len(chunks),
|
||
}
|
||
|
||
# Validate against the full text of the precedent for the quote check.
|
||
full_text = record.get("full_text") or ""
|
||
|
||
cleaned: list[dict] = []
|
||
for raw in raw_halachot:
|
||
coerced = _coerce_halacha(raw, is_binding=is_binding)
|
||
if coerced is None:
|
||
continue
|
||
coerced["quote_verified"] = _verify_quote(
|
||
coerced["supporting_quote"], full_text,
|
||
)
|
||
cleaned.append(coerced)
|
||
|
||
if not cleaned:
|
||
await db.set_case_law_halacha_status(case_law_id, "completed")
|
||
return {"status": "no_valid_halachot", "extracted": len(raw_halachot), "stored": 0}
|
||
|
||
# Embed rule_statement + reasoning_summary so semantic search hits the
|
||
# rule directly rather than the surrounding chunk centroid.
|
||
embed_inputs = [
|
||
f"{h['rule_statement']} — {h['reasoning_summary']}".strip(" —")
|
||
for h in cleaned
|
||
]
|
||
try:
|
||
vectors = await embeddings.embed_texts(embed_inputs, input_type="document")
|
||
except Exception as e:
|
||
logger.error("halacha_extractor: embeddings failed: %s", e)
|
||
vectors = [None] * len(cleaned)
|
||
|
||
for halacha, vec in zip(cleaned, vectors):
|
||
halacha["embedding"] = vec
|
||
|
||
stored = await db.store_halachot(case_law_id, cleaned)
|
||
|
||
verified = sum(1 for h in cleaned if h["quote_verified"])
|
||
await db.set_case_law_halacha_status(case_law_id, "completed")
|
||
|
||
logger.info(
|
||
"halacha_extractor: case_law=%s extracted=%d cleaned=%d verified=%d stored=%d",
|
||
case_law_id, len(raw_halachot), len(cleaned), verified, stored,
|
||
)
|
||
return {
|
||
"status": "completed",
|
||
"extracted": len(raw_halachot),
|
||
"valid": len(cleaned),
|
||
"verified": verified,
|
||
"stored": stored,
|
||
}
|