legal-ai/scripts/derive_missing_from_cited_only.py
Chaim 161e370a4c
feat(precedents): unify cited_only↔missing_precedents by deriving missing precedents (#143, G2)
Two parallel systems covered the same concept ("a cited precedent whose text was
never ingested"): the missing_precedents table (the chair's manual acquisition
queue) and case_law rows with source_kind='cited_only' (stubs from the citation
graph / X11). Overlap ≈ 0, so the 31 stubs did not appear on /missing-precedents.

Decision (G2): missing_precedents = the single SoT for the queue; cited_only = a
derived discovery source (the way digests feed the radar). An 'open'
missing_precedents record is derived for every stub.

Fix:
- court_citation.citation_dedup_key: a **designator-aware** dedup key
  (`{designator}|{docket}`). **Fixes a flaw in the analysis plan:** dedup on the
  number alone would wrongly merge the same docket across different courts
  (בג"ץ 389/87 ≠ ע"א 389/87, i.e. High Court of Justice vs. civil appeal; 18 such
  cases in the existing data).
- Schema V40: missing_precedents gains citation_norm (dedup key) + discovery_source
  (manual|cited_only|digest|writer) + index. **No UNIQUE**: the corpus legitimately
  holds the same docket in different courts; uniqueness is enforced
  designator-aware in the creation path.
- create_missing_precedent: computes citation_norm on write (G1), accepts
  discovery_source + linked_case_law_id. find_missing_precedent_by_citation: dedups
  via citation_norm (falls back to the raw citation when there is no number).
- scripts/derive_missing_from_cited_only.py: backfills citation_norm for 291 rows +
  derives 31 (dry-run: 31 would be created, 0 deduped). linked_case_law_id=stub,
  status=open → promote-in-place on text upload via the existing ON CONFLICT.
  Idempotent.
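
The designator-aware key described above can be sketched roughly as follows. This is only an illustration: `dedup_key_sketch`, its regex, and the set of quote characters are assumptions made for this example, not the actual `court_citation.citation_dedup_key` implementation, whose normalization rules are not shown here.

```python
import re

# Hypothetical sketch; the real citation_dedup_key may normalize differently.
# A docket is assumed to look like digits joined by "/" or "-", e.g. "389/87".
_DOCKET_RE = re.compile(r"\d+(?:[/-]\d+)+")
# ASCII quotes plus Hebrew gershayim/geresh, which often vary between sources.
_QUOTE_CHARS = "\"'\u05f4\u05f3`"


def dedup_key_sketch(citation: str) -> str:
    """Return '{designator}|{docket}', or '' when no docket number is found."""
    match = _DOCKET_RE.search(citation)
    if match is None:
        return ""
    # Everything before the docket is treated as the court designator;
    # quote marks and whitespace are dropped so format variants collapse.
    designator = "".join(
        ch
        for ch in citation[: match.start()]
        if ch not in _QUOTE_CHARS and not ch.isspace()
    )
    return f"{designator}|{match.group(0)}"


if __name__ == "__main__":
    # Same docket, different courts: the keys must differ.
    print(dedup_key_sketch('בג"ץ 389/87') != dedup_key_sketch('ע"א 389/87'))
    # Same court, different quote style and spacing: the keys must match.
    print(dedup_key_sketch('בג"ץ 389/87') == dedup_key_sketch("בג\u05f4ץ  389/87"))
```

Keeping the designator in the key is what prevents the number-only merge the commit message warns about, while the empty-string result for citations without a number leaves room for the raw-citation fallback.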

Context dependency: #140 (definition of cited_only). Coordinated with #136
(digest→MP: same citation_norm + create path). The data fix will be run after
deployment.

Tests: test_dedup_key_is_designator_aware (בג"ץ≠ע"א, ערר≠בל"מ, format variants
merge). All 356 pass. Guards clean.

Invariants: G2 (single SoT for the queue, cited_only derived), G1 (citation_norm
normalized on write), G3 (idempotent upsert), G10 (manual upload gate preserved),
G12.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 08:56:34 +00:00

"""Derive missing_precedents 'open' gaps from cited_only stubs (#143, G2).

Two parallel systems described the same concept, "a cited precedent whose text
isn't in the corpus": the ``missing_precedents`` queue (the chair's acquisition
list) and ``case_law`` rows with ``source_kind='cited_only'`` (citation-only
stubs seeded by the X11 / corpus-graph). Overlap was ~0, so the 31 cited_only
stubs never surfaced on /missing-precedents.

This makes ``missing_precedents`` the single source-of-truth FOR THE QUEUE and
``cited_only`` a DERIVED discovery source (like digests feed the radar):

  1. Backfill ``citation_norm`` (designator-aware dedup key) for every existing
     missing_precedent; this is required before the dedup below can match.
  2. For each cited_only stub, derive an 'open' missing_precedent (deduped on
     citation_norm), with ``discovery_source='cited_only'``,
     ``linked_case_law_id`` = the stub (its canonical identity is known; status
     stays 'open' until the text is uploaded → promote-in-place), and notes
     listing the precedents that cite it.

Idempotent / re-runnable. Dry-run by default; ``--apply`` to write.

Host-only. Run:
    HOME=/home/chaim mcp-server/.venv/bin/python scripts/derive_missing_from_cited_only.py [--apply]
"""
from __future__ import annotations

import asyncio
import os
import sys

# Make the mcp-server sources importable when running from the repo checkout.
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "mcp-server", "src"))

from legal_mcp.services import court_citation, db
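

async def _backfill_citation_norm(pool, apply: bool) -> int:
    """Fill citation_norm on rows that lack it; return how many rows qualify."""
    rows = await pool.fetch(
        "SELECT id, citation FROM missing_precedents "
        "WHERE COALESCE(citation_norm, '') = ''"
    )
    n = 0
    for r in rows:
        norm = court_citation.citation_dedup_key(r["citation"] or "")
        if not norm:
            # No usable key (e.g. no docket number): leave the row untouched.
            continue
        if apply:
            await pool.execute(
                "UPDATE missing_precedents SET citation_norm = $2 WHERE id = $1",
                r["id"], norm,
            )
        # Counted in dry-run too, so the report shows what would change.
        n += 1
    return n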


async def _citing_precedents_note(pool, stub_id) -> str:
    """Build the notes text: provenance plus up to 8 precedents citing the stub."""
    rows = await pool.fetch(
        """SELECT DISTINCT cl.case_number
           FROM precedent_internal_citations p
           JOIN case_law cl ON cl.id = p.source_case_law_id
           WHERE p.cited_case_law_id = $1 AND COALESCE(cl.case_number,'') <> ''
           ORDER BY cl.case_number LIMIT 8""",
        stub_id,
    )
    citers = [r["case_number"] for r in rows]
    # Hebrew: "derived from cited_only (the citation graph)".
    base = "נגזר מ-cited_only (גרף-הציטוטים)"
    if citers:
        # Hebrew: "cited by: ...".
        return f"{base}; מצוטט ע\"י: {', '.join(citers)}"
    return base


async def main(apply: bool) -> int:
    pool = await db.get_pool()

    # Step 1: backfill the dedup key on existing queue rows.
    backfilled = await _backfill_citation_norm(pool, apply)
    print(f"citation_norm backfill (existing rows){'' if apply else ' [dry]'}: {backfilled}")

    # Step 2: derive an 'open' queue entry for each cited_only stub.
    stubs = await pool.fetch(
        "SELECT id, case_number, case_name FROM case_law "
        "WHERE source_kind = 'cited_only' ORDER BY case_number"
    )
    print(f"cited_only stubs: {len(stubs)}")

    created = 0
    skipped = 0
    for s in stubs:
        citation = (s["case_number"] or "").strip()
        if not citation:
            print(f"  SKIP (no case_number) id={s['id']}")
            continue
        # Already in the queue (matched via citation_norm): nothing to derive.
        existing = await db.find_missing_precedent_by_citation(citation)
        if existing:
            skipped += 1
            continue
        norm = court_citation.citation_dedup_key(citation)
        print(f"  + {citation:<22} norm={norm!r}  name={(s['case_name'] or '')[:24]!r}")
        if apply:
            note = await _citing_precedents_note(pool, s["id"])
            await db.create_missing_precedent(
                citation=citation,
                case_name=s["case_name"] or None,
                discovery_source="cited_only",
                linked_case_law_id=s["id"],
                notes=note,
            )
        # Counted in dry-run too ("would create").
        created += 1

    print(f"\n{'created' if apply else 'would create'}: {created}   already-present (deduped): {skipped}")
    if not apply:
        print("(dry-run: pass --apply to write)")
    return 0


if __name__ == "__main__":
    sys.exit(asyncio.run(main("--apply" in sys.argv)))
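
The re-runnable behaviour of the derive loop can be illustrated without a database. The sketch below is an assumption-laden stand-in: `QueueSketch` and its in-memory dict are hypothetical and replace the real `missing_precedents` table and the `db.find_missing_precedent_by_citation` / `db.create_missing_precedent` calls, and the key function is passed in rather than imported from `court_citation`.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class QueueSketch:
    """Hypothetical in-memory stand-in for the missing_precedents queue."""

    rows: dict[str, dict] = field(default_factory=dict)

    def derive(self, stubs: list[dict], key_fn: Callable[[str], str]) -> tuple[int, int]:
        """Create one 'open' row per unseen key; return (created, skipped)."""
        created = skipped = 0
        for stub in stubs:
            key = key_fn(stub["case_number"])
            if key in self.rows:
                # Already queued: a second run must not duplicate it.
                skipped += 1
                continue
            self.rows[key] = {
                "status": "open",
                "discovery_source": "cited_only",
                "linked_case_law_id": stub["id"],
            }
            created += 1
        return created, skipped


if __name__ == "__main__":
    queue = QueueSketch()
    stubs = [{"id": 1, "case_number": "A 1/01"}, {"id": 2, "case_number": "B 2/02"}]
    print(queue.derive(stubs, str.strip))  # first run creates both rows
    print(queue.derive(stubs, str.strip))  # second run only skips
```

Because the existence check and the insert use the same key, a second pass over the same stubs creates nothing, which is the property the script relies on when it is re-run after deployment.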