docs(principles): move research into docs/precedent-corpus-redesign/ (README + research-full) (#153)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 11:36:38 +00:00
parent dd8064d94c
commit 8d409edc9d
13 changed files with 2399 additions and 2 deletions
--- a/docs/precedent-corpus-redesign/README.md
+++ b/docs/precedent-corpus-redesign/README.md
@@ -0,0 +1,142 @@
+# Precedent Corpus Redesign: Importance, Retrieval, and Filtering (Research + Decision)
+
+> **Source:** multi-agent deep research (deep-research, 2026-06-20): 6 angles, 25 sources (23 primary, 2 secondary),
+> 114 claims extracted, 25 checked by 3-vote adversarial verification (**21 confirmed · 4 refuted**). Research question:
+> how to rank or clean a corpus of ~3,562 extracted legal principles **with high automation and almost no
+> human review** (chaim's foundational constraint). Companion document: [`legal-principles-redesign.md`](../legal-principles-redesign.md) §8.
+
+---
+
+## The unambiguous recommendation
+
+**Do not cut (no destructive cull). Keep everything, rank by importance at retrieval time, and gate with a calibrated
+selective-prediction mechanism so that only a tiny, statistically bounded fraction reaches the chair.** (Option B/C, not A.)
+
+The reason in one sentence: **the mechanical signal is reliable (is a judgment cited, and by whom), but the interpretive call
+"is this principle important" is exactly where automation fails**, so principles must not be destroyed on the basis of a noisy
+importance score; it is better to keep everything and let query-time ranking (which knows the draft's context) surface what is relevant.
+
+---
+
+## Findings (verified)
+
+### 1. Citation-network centrality is a scalable importance signal, but **only at the judgment level**, and only moderately strong
+`confidence: high · 3-0`
+- Importance labels can be derived algorithmically from citation patterns, with no manual annotation (Swiss Criticality, ACL 2025:
+  138,531 cases via LD-Label + recency-weighted Citation-Label).
+- Centrality methods (Derlén & Lindholm: PageRank/HITS/betweenness over 9,125 CJEU judgments; Fowler/Jeon
+  at SCOTUS) establish case-level centrality as a quantitative importance proxy. **HITS/eigenvector are preferred over raw degree**
+  because degree treats every citing case as equal.
+- **Critical limit:** predictive power is **only moderate**. In JURIX 2023, ordinal regression on the court's
+  Importance Score reached **F1≈0.655**; the citation distribution is heavy-tailed (preferential attachment). All the evidence
+  is at the judgment level; **none of it validates importance at the principle/holding level.**
+- **What this means for us:** judgment centrality is a strong prior on a principle's parent case, **not** a per-principle score.
+- Sources: arxiv 2410.13460v2 · ssrn 2910926 · polisci.umn s6.pdf · ResearchGate 376422421 · Nature s41598-021-82430-x
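The difference between raw degree and an eigenvector-style measure can be seen on a toy graph. The sketch below is illustrative only: the edge list is invented, and the PageRank is a plain power iteration with a conventional damping factor of 0.85, not the configuration used in any of the cited studies.

```python
def in_degree(edges):
    """Count inbound citations per case; every citing case counts equally."""
    counts = {}
    for citing, cited in edges:
        counts[cited] = counts.get(cited, 0) + 1
        counts.setdefault(citing, 0)  # keep citing-only cases in the result
    return counts


def pagerank(edges, damping=0.85, iterations=100):
    """Plain power-iteration PageRank over (citing, cited) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Cases that cite nothing spread their mass uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for node in nodes:
            for target in out_links[node]:
                new_rank[target] += damping * rank[node] / len(out_links[node])
        rank = new_rank
    return rank


# Hypothetical graph: A is cited by two uncited cases (x1, x2);
# B is cited by two cases (h1, h2) that are themselves cited.
edges = [
    ("x1", "A"), ("x2", "A"),
    ("h1", "B"), ("h2", "B"),
    ("y1", "h1"), ("y2", "h1"), ("y3", "h2"), ("y4", "h2"),
]
degree = in_degree(edges)
rank = pagerank(edges)
# A and B tie on in-degree (2 each), but B ranks higher under PageRank
# because its citing cases carry more weight.
print(degree["A"], degree["B"], rank["B"] > rank["A"])
```

Either score is computed per judgment, so it can only act as a prior on the parent case of a principle, in line with the limit stated above.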
+
+### 2. Holding-level extraction is commercially viable, but **holding-level importance/treatment labeling is the error-prone step**
+`confidence: high · 3-0` (one sub-claim refuted)
+- West's patented KeyNumber system classifies individual headnotes (~6 per opinion, sometimes 50+) into a taxonomy of
+  90,000+ classes via cosine similarity, which shows that **holding-level extraction is viable**.
+- **But** Hellyer (2018, Law Library Journal): Shepard's and KeyCite missed or mislabeled **about one-third**, and BCite
+  **more than two-thirds**, of negative treatment relationships (sample of 357); the three citators agreed on only 53/357. **The errors
+  are in the interpretive editorial analysis, not in the mechanical identification**: "the significant problems occur in the editorial
+  analysis process, after the initial process of identifying the citing cases".
+- **Refuted (0-3):** the claim that West retains only 1-3 holdings per opinion. **Commercial practice does not support aggressive
+  holding pruning.**
+- **What this means for us:** the mechanical signal (cited? by whom?) is reliable; on "is this principle important", even commercial
+  systems with human editors err badly. **This argues against a destructive cull driven by an interpretive score.**
+- Sources: USPTO US7580939 · aallnet LLJ 110n4
+
+### 3. Context-aware retrieval at query time **beats** context-free importance ranking
+`confidence: high · 3-0`
+- ICAIL 2021 "Context-Aware Legal Citation Recommendation" (Stanford RegLab + CMU): using the local textual context
+  of the draft improves recommendation quality over context-free baselines. **Relevance is context-dependent: unknown
+  at cull time, available at query time.** A static (offline) importance score cannot capture passage-specific relevance, so
+  destroying principles with a low static score risks losing items that are highly relevant in a context the cull never saw.
+- Source: arxiv 2106.10776
+
+### 4. Network features become **more predictive over time**; content-similarity features decay
+`confidence: medium · 2-1`
+- Mones et al. (Scientific Reports 2021, CJEU 1955-2014): structural features (common-neighbor/Adamic-Adar)
+  show a significant increase in predictive power as the network matures, while TF-IDF decays. A **self-reinforcing citation graph**
+  is therefore a more durable asset than a one-time content-based cull. (Caveat from the verification pass: preferential attachment
+  is a *nodal, decaying* feature; the structural, increasing features are the common-neighbor ones.)
+- Source: Nature s41598-021-82430-x
+
+### 5. Calibrated selective evaluation: **only a tiny fraction reaches a human**
+`confidence: high · 3-0`
+- Cascaded Selective Evaluation (ICLR 2025) routes each item to the weakest model that is still confident enough and escalates
+  the rest. It reached **over 80% human agreement** on ChatArena at 79.1% coverage, with GPT-4 invoked on only 17.5% of instances.
+  A confidence threshold can therefore be calibrated so that only a small, measured share goes to review.
+- Source: ICLR 2025 (proceedings.iclr.cc 08dabd5...)
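A minimal sketch of how such a threshold could be calibrated. It is not the paper's procedure: it assumes a fixed grid of candidate thresholds, a Hoeffding upper bound on the disagreement rate, and a Bonferroni split of δ across the grid, which is a simpler and more conservative stand-in. The function names and the calibration data are hypothetical.

```python
import math


def risk_upper_bound(errors, n, delta):
    """Hoeffding upper confidence bound on the true disagreement rate."""
    if n == 0:
        return 1.0  # nothing accepted, so nothing can be certified
    return min(1.0, errors / n + math.sqrt(math.log(1.0 / delta) / (2.0 * n)))


def calibrate_threshold(calibration, grid, alpha, delta):
    """Pick the lowest threshold in `grid` whose accepted set is certified.

    calibration: list of (confidence, agrees_with_human) pairs.
    A threshold passes when the upper bound on disagreement among items with
    confidence >= threshold is at most alpha. delta is split across the grid
    so that all passing thresholds hold simultaneously.
    """
    per_test_delta = delta / len(grid)
    passing = []
    for lam in grid:
        accepted = [agrees for conf, agrees in calibration if conf >= lam]
        errors = sum(1 for agrees in accepted if not agrees)
        if risk_upper_bound(errors, len(accepted), per_test_delta) <= alpha:
            passing.append(lam)
    return min(passing) if passing else None


def route(confidence, threshold):
    """Auto-handle confident items; escalate everything else."""
    if threshold is not None and confidence >= threshold:
        return "auto"
    return "escalate"


# Hypothetical calibration set of 500 items: 300 confident and always right,
# 200 uncertain and right only half the time.
calibration = [(0.95, True)] * 300 + [(0.5, True)] * 100 + [(0.5, False)] * 100
threshold = calibrate_threshold(
    calibration, grid=[0.5, 0.6, 0.7, 0.8, 0.9], alpha=0.1, delta=0.1
)
print(threshold, route(0.95, threshold), route(0.5, threshold))
```

When no threshold can be certified the function returns `None` and every item is escalated, which is the safe default for a small calibration set.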
+
+### 6. Selective Conformal Risk Control (SCRC): **a conditional risk guarantee at level 1−α**
+`confidence: high · 3-0`
+- SCRC provides a conditional risk-control guarantee: a **provable error bound** holds on the items that are "closed automatically",
+  so what reaches a human is set by a statistical guarantee rather than by luck. This is the mechanism for turning "zero review" into a **mathematically guaranteed** target.
+- Sources: arxiv 2512.12844 (SCRC) · 2407.18370 · 2511.07396
+
+### 7. **Natural citing behavior as implicit supervision** (instead of bulk review)
+`confidence: high · 3-0`
+- Joachims et al. (clicks as implicit relevance; Radlinski/Joachims): natural user signals are a reliable relevance signal
+  once position bias is handled. **Our analogue:** which judgments and binding rules Daphna *actually cites* in her decisions is the
+  supervision, instead of reviewing hundreds up front. It is self-correcting and strengthens with use.
+- Sources: Cornell joachims_etal_17a · radlinski_joachims_05a · arxiv 2403.18962
+
+---
+
+## Refuted claims (do not build on these)
+| Claim | Vote | Source |
+|-------|------|--------|
+| Raw degree is the most stable predictor, better than PageRank | 1-2 | ResearchGate 376422421 |
+| HITS (hubs/authorities) is significantly better than citation counts | 1-2 | polisci.umn s6 |
+| Link prediction on the citation graph ranks precedents with strong accuracy | 0-3 | Nature s41598 |
+| West retains only 1-3 holdings per opinion (support for holding pruning) | 0-3 | USPTO US7580939 |
+
+> Takeaway from the refutations: **citation count is a legitimate but non-decisive signal, and the exact metric (degree/PageRank/HITS)
+> is not settled, so do not over-engineer it; and do not cite commercial practice as support for holding pruning.**
+
+---
+
+## Synthesis against our system's data
+
+| Research finding | State in our system (verified in the DB) |
+|------------------|------------------------------------------|
+| Importance is reliable only at the judgment level | Gold matching at the principle level **failed**: match_context = reference list; 62/112 cited judgments have no principles; median cosine 0.52 |
+| Citation count is a signal with a tail | There is real spread: 7×(1), 6×(1), 4×(4), 3×(8), 2×(38), 1×(269), with a clear head of "settled rulings" |
+| Do not cut on a noisy interpretive score | The blind cull would have cut ~66%, including gold rulings (49% of the principles are from gold judgments) |
+| Rank at query time (context-aware) | We have RAG (`search_precedent_library`/halacha), the natural injection point for a boost |
+| Implicit supervision from the chair's citations | We have `precedent_internal_citations` (Daphna's citations), updated with every new decision |
+| Guaranteed zero review (SCRC/cascade) | Replaces the pending_review queues with a calibrated conformal gate |
+
+**The resulting decision:**
+1. **Cancel the destructive cull** as the default. The corpus stays whole (reversible, and already restored to pristine).
+2. **The importance layer is a ranking prior, not a destruction filter.** `importance_score(principle) ∝ centrality of the source judgment
+   (citation counts in tiers: Daphna ≫ other chair ≫ general) × authority × recency`, injected as a boost in RRF at retrieval time.
+3. **Noise is handled by ranking, not deletion.** A low-importance principle simply sinks and does not surface; no ruling is lost.
+4. **Human review → practically zero:** only "certain junk" (≤1 vote in the panel / quality-flags) is demoted automatically (reversibly);
+   the rest stays; there is no approval queue. If we later want a decision gate, conformal (SCRC) bounds the escalation rate mathematically.
+5. **Active learning:** Daphna's future citations feed the prior automatically (refresh job), with no review.
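Decision 2 can be sketched as follows. Everything numeric here is an assumption made for illustration: the tier weights, the log damping of citation counts, the 10-year recency half-life, the multiplicative form of the boost, and `prior_weight`. Only the overall shape (tiered citations × authority × recency, applied as a boost on top of RRF) comes from the decision above; `k=60` is the constant commonly used with RRF.

```python
import math

# Hypothetical tier weights expressing "Daphna >> other chair >> general".
TIER_WEIGHTS = {"chair": 9.0, "other_chair": 3.0, "general": 1.0}


def importance_prior(citations_by_tier, authority, age_years, half_life_years=10.0):
    """Judgment-level prior: tiered citation centrality x authority x recency."""
    weighted = sum(TIER_WEIGHTS[tier] * count for tier, count in citations_by_tier.items())
    centrality = math.log1p(weighted)  # damp the heavy tail of citation counts
    recency = 0.5 ** (age_years / half_life_years)
    return centrality * authority * recency


def rrf_with_prior(rankings, priors, k=60, prior_weight=0.5):
    """Reciprocal Rank Fusion, then a multiplicative boost from the prior.

    rankings: ranked lists of principle ids (e.g. dense and lexical retrieval).
    priors:   principle id -> prior of its parent judgment (missing means 0).
    Nothing is filtered out: a low prior only lowers the position.
    """
    fused = {}
    for ranking in rankings:
        for rank, pid in enumerate(ranking, start=1):
            fused[pid] = fused.get(pid, 0.0) + 1.0 / (k + rank)

    max_prior = max(priors.values(), default=0.0) or 1.0
    boosted = {
        pid: score * (1.0 + prior_weight * priors.get(pid, 0.0) / max_prior)
        for pid, score in fused.items()
    }
    return sorted(boosted, key=boosted.get, reverse=True)


# p1 and p2 tie on plain RRF; p2's parent judgment is cited by the chair.
priors = {
    "p1": importance_prior({"general": 2}, authority=1.0, age_years=0.0),
    "p2": importance_prior({"chair": 2}, authority=1.0, age_years=0.0),
}
print(rrf_with_prior([["p1", "p2", "p3"], ["p2", "p1", "p3"]], priors))
```

Because the prior only rescales fused scores, a principle with no citations still appears in the results, which matches decision 3: noise sinks instead of being deleted.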
+
+> **What remains valid from the work already built (PR #304/#305):** the three-model extraction regime + cap of 5 **for future extraction**
+> (prevents new noise from growing at the source: quality-at-source) stays; what changes is the **attitude toward the existing corpus**: ranking, not
+> destruction. The terminology (binding rule/interpretive rule/principle) and the synthesis stay.
+
+## Open questions (for internal validation, from the research)
+1. Can a per-principle importance score (not just per-judgment) be validated through the correlation between the chair's
+   retrieval-then-citation behavior and an algorithmic signal? (The unproven core; needs an internal study on our own corpus.)
+2. What are the minimum calibration-set size and refresh cadence for calibrating from the chair's citing behavior in a small
+   single-author corpus? (Trust-or-Escalate used 500 i.i.d. examples.)
+3. How should internal citations (the chair's decision→decision) be weighted against external ones (court centrality)? Internal is rarer but better aligned with her style.
+4. Does aggressive query-time ranking hurt precision/latency at our scale (~3,562), or is the set small enough
+   that there is no practical downside; in other words, **does the cull solve a problem we actually have?**
+
+---
+
+## Sources (25: 23 primary, 2 secondary)
+מרכזיות/legal-IR: arxiv 2410.13460v2 · ssrn 2910926 · polisci.umn s6.pdf · ResearchGate 376422421 ·
+arxiv 2106.10776 · Nature s41598-021-82430-x · USPTO US7580939 · aallnet LLJ 110n4 ·
+law.northwestern updating · guides.law.stanford keynumbersystem.
+Selective-prediction/conformal: ICLR 2025 08dabd5 · arxiv 2512.12844 · arxiv 2407.18370 · vlm-uncertainty ·
+openreview JJPAy8mvrQ · arxiv 2511.07396 · arxiv 2605.18796.
+Implicit-feedback/active-learning: Cornell joachims_etal_17a · radlinski_joachims_05a · dl.acm 1229181 · arxiv 2403.18962.
+RAG pruning vs rank: arxiv 2407.12170 · 2511.00505 · 2409.13694v2.
--- a/docs/precedent-corpus-redesign/research-full.md
+++ b/docs/precedent-corpus-redesign/research-full.md
@@ -0,0 +1,152 @@
+# Full deep research (raw): the precedent corpus
+
+> Raw appendix to [`README.md`](README.md). Full output of the deep-research engine (2026-06-20).
+
+**Statistics:** 6 angles · 25 sources · 114 claims extracted · 25 verified · 21 confirmed · 4 refuted · 108 agent calls · 108 agents.
+
+
+## Executive summary (verbatim)
+
+For your specific situation, the evidence points to option (B)/(C): rank-by-importance at retrieval time rather than a destructive cull, with selective-prediction gating that keeps human review near-zero. Automated importance signals from citation-network centrality (PageRank/HITS/degree) are a genuine, scalable proxy for PRECEDENT-level importance — derivable algorithmically without manual annotation (Swiss Criticality, Fowler/Jeon, Derlén & Lindholm) — but they are only moderately predictive (JURIX 2023 F1≈0.655) and are NOT validated at the holding/principle granularity you actually extract. Commercial systems (West KeyNumber patent, Shepard's/KeyCite) do operate at holding-level headnotes via cosine similarity, but their interpretive/editorial labels are substantially error-prone (one-third to two-thirds mislabeled), confirming that holding-level importance judgment is exactly where automation degrades — so you should not destructively prune on a noisy holding-level score. The robust path is to keep all extracted principles (reversible), attach multiple importance signals (precedent-level citation centrality + your chair's actual citation behavior as implicit supervision), rank at query time, and use a calibrated selective-prediction/conformal gate (Trust-or-Escalate cascade, SCRC) so only a tiny, statistically-bounded fraction ever escalates to the human — with a provable agreement guarantee at level 1−α.
+
+
+## Pipeline log
+
+- Q: Research question for a production legal-AI system (RAG that helps a planning-ap…
+- Decomposed into 6 angles: Citation-network importance & legal IR ranking, Headnote/holding selection at commercial citators, Selective prediction / conformal abstention thresholds, Multi-model agreement & trust-or-escalate routing, Implicit feedback active learning vs upfront review, RAG corpus pruning vs rank-at-retrieval
+- Citation-network importance & legal IR ranking: 6 results
+- Headnote/holding selection at commercial citators: 6 results
+- Headnote/holding selection at commercial citators: 4 novel (2 filtered)
+- Selective prediction / conformal abstention thresholds: 6 results
+- Selective prediction / conformal abstention thresholds: 5 novel (1 filtered)
+- Multi-model agreement & trust-or-escalate routing: 6 results
+- Multi-model agreement & trust-or-escalate routing: 3 novel (3 filtered)
+- Implicit feedback active learning vs upfront review: 6 results
+- Implicit feedback active learning vs upfront review: 4 novel (2 filtered)
+- RAG corpus pruning vs rank-at-retrieval: 6 results
+- RAG corpus pruning vs rank-at-retrieval: 3 novel (3 filtered)
+- Fetched 25 sources → 114 claims → verifying top 25
+- "Importance/criticality labels for legal decisions …": 3-0 ✓
+- "Case criticality is operationalized via a two-tier…": 3-0 ✓
+- "Derlén & Lindholm apply network-citation analysis …": 3-0 ✓
+- "Citation-network centrality scores (specifically t…": 3-0 ✓
+- "Network centrality measures correlate only reasona…": 3-0 ✓
+- "An ordinal regression model using network centrali…": 3-0 ✓
+- "Among centrality metrics, simple Degree (in-degree…": 1-2 ✗
+- "Citation counts alone (degree centrality / inward …": 1-2 ✗
+- "The authors construct importance scores using two …": 3-0 ✓
+- "Simple degree centrality (counting inward citation…": 3-0 ✓
+- "A deep-learning citation recommendation tool (BiLS…": 3-0 ✓
+- "Leveraging the local textual context surrounding a…": 3-0 ✓
+- "In a real judicial citation network (CJEU, 1955-20…": 3-0 ✓
+- "A link-prediction model on the citation graph pred…": 0-3 ✗
+- "Over time, structural/network features (preferenti…": 2-1 ✓
+- "West's commercial system classifies legal headnote…": 3-0 ✓
+- "The system does NOT treat all headnotes/holdings a…": 0-3 ✗
+- "The patented method operates at the granularity of…": 3-0 ✓
+- "Commercial citators' negative-treatment/holding la…": 3-0 ✓
+- "The error source is editorial analysis (the interp…": 3-0 ✓
+- "Selective evaluation with a calibrated confidence …": 3-0 ✓
+- "Cascaded Selective Evaluation routes each instance…": 3-0 ✓
+- "On ChatArena the cascade achieved over 80% human a…": 3-0 ✓
+- "Selective Conformal Risk Control (SCRC) is a frame…": 3-0 ✓
+- "SCRC provides a conditional risk-control guarantee…": 3-0 ✓
+- Verify done: 25 claims → 21 confirmed, 4 killed
+
+## Full findings (verbatim)
+
+
+### Finding 1: Citation-network centrality is a scalable, manual-annotation-free importance signal, but it works at PRECEDENT/case level, not holding/principle level, and is only moderately predictive.
+**confidence:** high · **vote:** 3-0 across all constituent claims
+**Sources:** https://arxiv.org/html/2410.13460v2, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2910926, http://users.polisci.umn.edu/~trj/MyPapers/s6.pdf, https://www.researchgate.net/publication/376422421_Centrality_Scores_and_Precedent_Value_in_Legal_Network_Analysis, https://www.nature.com/articles/s41598-021-82430-x
+
+Merges claims [0],[1],[2],[3],[4],[5],[7],[10]. Importance labels can be derived algorithmically from citation patterns, yielding far larger datasets than manual annotation (Swiss Criticality, ACL 2025: 138,531 cases via LD-Label + recency-weighted Citation-Label). Network-centrality methods (Derlén & Lindholm: PageRank, HITS, betweenness on 9,125 CJEU judgments; Fowler/Jeon at SCOTUS) establish case-level centrality as a quantitative importance proxy. Eigenvector/HITS approaches are preferred over raw degree because degree treats all citing cases equally regardless of the citing case's own importance. CRITICAL LIMIT: predictive power is only moderate — JURIX 2023 ordinal regression on the court's Importance Score achieved F1≈0.655 ('to an extent' indicates precedent value), and CJEU citation distributions are heavy-tailed/preferential-attachment (few highly-cited cases) confirming a meaningful but skewed signal. All evidence is scoped to precedent/case level; none validates principle/holding-level importance. For your ~3,562 principles (~12/precedent), this means: use precedent-level centrality as a strong prior on a principle's parent case, but do not treat it as a per-principle importance score.
+
+
+### Finding 2: Holding-level extraction IS achievable (commercial citators do it), but holding-level IMPORTANCE/treatment labeling is the error-prone editorial step, so a destructive cull keyed on a noisy holding-level score is risky.
+**confidence:** high · **vote:** 3-0; one constituent refuted (selective top-1-3 retention)
+**Sources:** https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/7580939, https://www.aallnet.org/wp-content/uploads/2018/12/LLJ_110n4_02_hellyer.pdf
+
+Merges claims [12],[13],[14],[15]. West's patented KeyNumber system classifies individual headnotes (discrete holdings, ~6 per opinion, sometimes 50+) into a 90,000+ class taxonomy via cosine similarity over noun-word-pair vectors with composite scoring — proving holding-level extraction/classification is commercially viable. BUT Hellyer (2018, Law Library Journal) shows Shepard's and KeyCite missed/mislabeled ~one-third, and BCite over two-thirds, of negative citing relationships (357-sample); the three citators agreed only 53/357 times. The errors arise specifically in the EDITORIAL ANALYSIS (interpretive treatment/holding labeling), not the mechanical step of identifying citing cases — 'the significant problems occur in the editorial analysis process, after the initial process of identifying the citing cases.' Implication for you: mechanical signals (does a precedent get cited; by whom) are the reliable part; interpretive 'is this principle important' judgment is exactly where even commercial systems with human editors err badly. Note: the claim that West selectively retains only 1-3 holdings per case was REFUTED (vote 0-3) — commercial practice does NOT support aggressive holding-level pruning. This argues against a destructive cull driven by an interpretive importance score.
+
+
+### Finding 3: Context-aware, query-time retrieval (using the local textual context of the draft) outperforms context-free importance ranking for choosing which authority to surface, favoring rank-at-retrieval over a pre-pruned static corpus.
+**confidence:** high · **vote:** 3-0
+**Sources:** https://arxiv.org/pdf/2106.10776
+
+Merges claims [8],[9]. The ICAIL 2021 'Context-Aware Legal Citation Recommendation using Deep Learning' (Stanford RegLab + CMU) builds a citation recommender for opinion drafting and finds that leveraging local textual context improves recommendation quality over context-free baselines (collaborative filtering on citation lists). Context-based deep models (BiLSTM/RoBERTa) beat context-free methods because they exploit semantics to judge which citation fits the passage. This directly supports your option (B): the right-to-surface principle depends on the draft's local context, which is unknowable at cull time but available at query time. A static importance score (computed once, offline) cannot capture passage-specific relevance — so destroying low-static-importance principles risks discarding items that are highly relevant in a context the cull never saw. Caveat: context-aware recommendation and importance ranking are complementary, not mutually exclusive; the paper benchmarks against a citation-list baseline, not a centrality ranker.
+
+
+### Finding 4: Structural/network features become MORE predictive over time while content-similarity features decay, supporting maintaining a persistent citation graph (which improves as the corpus matures) rather than freezing a one-time content-based cull.
+**confidence:** medium · **vote:** 2-1
+**Sources:** https://www.nature.com/articles/s41598-021-82430-x
+
+Claim [11]. On the CJEU judicial citation network (1955-2014), Mones et al. (Scientific Reports 2021) found structural/common-neighbor features 'display a significant increase of predictive power' over time while document-content (TF-IDF) features show 'decreasing trends' — the network becomes increasingly informative as it matures. This implies a citation-graph-backed importance ranking is a more durable, self-improving asset than a one-shot content-similarity prune. CAVEATS lowering confidence to medium: (1) the verification flagged a misattribution — preferential attachment is a NODAL (decreasing) feature in the paper, not structural-increasing; the correctly-structural-increasing features are common-neighbor/Adamic-Adar indices. (2) The 'more durable than content similarity' framing is the claim's inference. (3) The paper itself flags automation-bias risk and that its recommendations operate at CASE level, not paragraph/holding level — reinforcing the precedent-vs-principle granularity caution. Still, the core direction (keep and grow the graph; rank at query time) is supported.
+
+
+### Finding 5: Selective prediction with calibrated thresholds gives a distribution-free, provable guarantee that auto-accepted judgments agree with the human at level 1−α (w.p. ≥1−δ), so the human reviews only a tiny calibrated fraction, directly satisfying the near-zero-review constraint.
+**confidence:** high · **vote:** 3-0
+**Sources:** https://proceedings.iclr.cc/paper_files/paper/2025/file/08dabd5345b37fffcbe335bd578b15a0-Paper-Conference.pdf
+
+Merges claims [16],[17],[18]. ICLR 2025 'Trust or Escalate' (Cascaded Selective Evaluation) formulates threshold selection as a multiple-hypothesis-testing problem on a small calibration set (|D_cal|=500, δ=0.1), guaranteeing P(f_LM(x)=y_human | c_LM(x)≥λ) ≥ 1−α with probability ≥1−δ — distribution-free (only i.i.d. calibration assumed, built on Bates et al. 2021 risk-controlling sets and Angelopoulos et al. 2022 Learn-then-Test). The cascade routes cheap judges first and escalates to a stronger model only when not confident, abstaining when none are confident. Empirically on ChatArena: >80% human agreement at 79.1% coverage, 88.1% of covered instances handled by cheap models, GPT-4 invoked on only 17.5% of instances, 91% guarantee-success vs <60% for point-estimate calibration. For you: this is the mechanism to keep human review near-zero — calibrate against a small set of the chair's own accept/reject decisions, auto-accept high-confidence principles, auto-reject low-confidence ones, and escalate to the human ONLY the calibrated uncertain middle, with a provable agreement bound.
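The guarantee quoted inline above can be written out in display form. This is only a restatement of that formula; reading the outer probability as being taken over the draw of the calibration set $D_{\mathrm{cal}}$, and writing the calibrated threshold as $\hat\lambda$, are interpretive assumptions rather than quotes from the paper.

$$
\Pr_{D_{\mathrm{cal}}}\!\Big[\; \Pr\big(f_{LM}(x) = y_{\mathrm{human}} \;\big|\; c_{LM}(x) \ge \hat\lambda\big) \;\ge\; 1-\alpha \;\Big] \;\ge\; 1-\delta
$$

With the reported setting $|D_{\mathrm{cal}}| = 500$ and $\delta = 0.1$, the bound covers only items with $c_{LM}(x) \ge \hat\lambda$; items below the threshold are the ones passed up the cascade or abstained on.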
+
+
+### Finding 6: Conformal-risk-control variants (SCRC) extend the guarantee to abstention: risk is bounded ONLY on accepted (non-abstained) samples via two calibration thresholds, giving a principled accept/abstain/reject gate suited to a noisy KB triage.
+**confidence:** high · **vote:** 3-0
+**Sources:** https://arxiv.org/html/2512.12844
+
+Merges claims [19],[20]. Selective Conformal Risk Control (Xu, Guo, Wei, 2025) combines conformal prediction with selective classification using two thresholds: λ₁ controls which samples are accepted (else abstain/defer), λ₂ controls prediction-set size. Theorem 2 guarantees E[l(C(X),Y) | g(X)≥1−λ₁] ≤ α — expected loss on ACCEPTED samples is bounded below a user-chosen target risk α; the calibration-only variant (SCRC-I) gives the bound w.p. ≥1−δ. This formalizes a three-way KB gate: auto-keep (accept) where conformal risk is provably low, auto-discard candidates, and defer the rest to the human — with risk controlled on exactly the items you act on automatically. Caveat: guarantees rely on exchangeability of calibration/test data, and 'risk' is a general bounded loss (expectation, not a probability); the source is current (Dec 2025) and peer-discussed but newer than the established Trust-or-Escalate line.
+
+
+### Finding 7: RECOMMENDATION: do NOT do a destructive holding-level cull; rank-by-importance at retrieval time over a reversibly-retained corpus, gated by a selective-prediction layer calibrated to the chair's natural citing behavior.
+**confidence:** medium · **vote:** synthesis of high-confidence findings; recommendation is inference
+**Sources:** https://www.researchgate.net/publication/376422421_Centrality_Scores_and_Precedent_Value_in_Legal_Network_Analysis, https://www.aallnet.org/wp-content/uploads/2018/12/LLJ_110n4_02_hellyer.pdf, https://arxiv.org/pdf/2106.10776, https://proceedings.iclr.cc/paper_files/paper/2025/file/08dabd5345b37fffcbe335bd578b15a0-Paper-Conference.pdf, https://arxiv.org/html/2512.12844
+
+Synthesis. Choose option B/C, not A. Rationale chain: (1) per-principle importance scoring is only moderately reliable even with citation networks (F1≈0.655) and is the editorial step where commercial citators err one-third-to-two-thirds — too noisy to justify irreversible deletion; (2) the right principle to surface is context-dependent (ICAIL 2021), unknowable at cull time but available at query time; (3) the citation graph is a self-improving asset (Scientific Reports 2021). CONCRETE DESIGN: (a) Keep all ~3,562 principles; attach precedent-level citation-centrality (PageRank/degree on your internal + external citation graph) as a prior, NOT a per-principle delete trigger; rank principles at retrieval time fusing centrality prior + context-aware semantic similarity to the draft block. (b) Mark obviously-redundant/low-quality principles with a reversible 'demoted/suppressed' flag (review_status) rather than deleting — your system already has reversible review_status gating per the project context. (c) Make the chair's NATURAL behavior the supervision signal: log which principles/precedents she actually cites in finalized decisions (implicit feedback) and which retrieved items she ignores; use these as the calibration labels. (d) Wrap auto-keep/demote in a Trust-or-Escalate / SCRC gate calibrated on ~500 of those implicit accept/ignore signals, so only a tiny calibrated fraction (target-α) ever reaches her for explicit review, with a provable agreement bound. This keeps human upfront review at zero and converges via use. Confidence is medium because the recommendation composes high-confidence findings into a design choice the literature supports directionally but does not test end-to-end on a holding-level legal KB.
+
+
+## Refuted claims (verbatim)
+
+- **[1-2]** Among centrality metrics, simple Degree (in-degree / citation count) was the most stable predictor of precedent value across network and sub-network configurations, outperforming more complex measures like PageRank in robustness.  
+  Source: https://www.researchgate.net/publication/376422421_Centrality_Scores_and_Precedent_Value_in_Legal_Network_Analysis
+- **[1-2]** Citation counts alone (degree centrality / inward citations) are an insufficient proxy for legal importance; a Kleinberg HITS-style hubs-and-authorities measure that combines inward AND outward citations is superior and reveals importance information not evident in simple citation counts.  
+  Source: http://users.polisci.umn.edu/~trj/MyPapers/s6.pdf
+- **[0-3]** A link-prediction model on the citation graph predicts which prior cases a new case will cite with strong accuracy — 95% of cases have a median rank below 292 — demonstrating that citation-network structure alone can rank precedents by likely relevance/importance for retrieval.  
+  Source: https://www.nature.com/articles/s41598-021-82430-x
+- **[0-3]** The system does NOT treat all headnotes/holdings as equally important — it selectively retains only the most relevant one to three holdings per case by similarity, demonstrating commercial citator practice of holding-level selection/pruning rather than keeping everything.  
+  Source: https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/7580939
+
+## Open questions (verbatim)
+
+- Can a per-PRINCIPLE importance score be validated (not just per-precedent)? E.g., does a principle's retrieval-then-citation rate by the chair correlate with any algorithmic signal well enough to gate on — this is the unproven core of your use case and would need an internal study on your own corpus.
+- What is the minimum reliable calibration-set size and refresh cadence for the chair's implicit citing behavior, given a small single-author corpus (the project notes ~5,243 principles but a low-data style-acquisition regime)? Trust-or-Escalate used 500 i.i.d. examples; can implicit signals from one chair's decisions reach that volume, and how fast does exchangeability degrade as her preferences evolve?
+- Should the importance prior combine INTERNAL citations (the chair's own decision-to-decision citations) with EXTERNAL precedent citations, and at what weighting — internal signals are scarcer but far more aligned to her style than generic court-citation centrality?
+- Does aggressive query-time ranking (vs. culling) measurably hurt RAG precision/latency at your corpus scale (~3,562-5,243 items), or is the retrieval set small enough that ranking-only with reversible demotion has no practical downside — i.e., is culling solving a problem you actually have?
+
+## All sources
+
+- [primary] https://arxiv.org/html/2410.13460v2  · angle: Citation-network importance & legal IR ranking · claims: 5
+- [primary] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2910926  · angle: Citation-network importance & legal IR ranking · claims: 5
+- [primary] https://www.researchgate.net/publication/376422421_Centrality_Scores_and_Precedent_Value_in_Legal_Network_Analysis  · angle: Citation-network importance & legal IR ranking · claims: 5
+- [primary] http://users.polisci.umn.edu/~trj/MyPapers/s6.pdf  · angle: Citation-network importance & legal IR ranking · claims: 5
+- [primary] https://arxiv.org/pdf/2106.10776  · angle: Citation-network importance & legal IR ranking · claims: 5
+- [primary] https://www.nature.com/articles/s41598-021-82430-x  · angle: Citation-network importance & legal IR ranking · claims: 5
+- [primary] https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/7580939  · angle: Headnote/holding selection at commercial citators · claims: 5
+- [primary] https://www.aallnet.org/wp-content/uploads/2018/12/LLJ_110n4_02_hellyer.pdf  · angle: Headnote/holding selection at commercial citators · claims: 5
+- [secondary] https://library.law.northwestern.edu/cases/updating  · angle: Headnote/holding selection at commercial citators · claims: 2
+- [secondary] https://guides.law.stanford.edu/cases/keynumbersystem  · angle: Headnote/holding selection at commercial citators · claims: 4
+- [primary] https://proceedings.iclr.cc/paper_files/paper/2025/file/08dabd5345b37fffcbe335bd578b15a0-Paper-Conference.pdf  · angle: Selective prediction / conformal abstention thresholds · claims: 5
+- [primary] https://arxiv.org/html/2512.12844  · angle: Selective prediction / conformal abstention thresholds · claims: 5
+- [primary] https://arxiv.org/pdf/2407.18370  · angle: Selective prediction / conformal abstention thresholds · claims: 5
+- [primary] https://sinatayebati.github.io/vlm-uncertainty/  · angle: Selective prediction / conformal abstention thresholds · claims: 5
+- [primary] https://openreview.net/forum?id=JJPAy8mvrQ  · angle: Selective prediction / conformal abstention thresholds · claims: 4
+- [primary] https://arxiv.org/abs/2407.18370  · angle: Multi-model agreement & trust-or-escalate routing · claims: 4
+- [primary] https://arxiv.org/pdf/2511.07396  · angle: Multi-model agreement & trust-or-escalate routing · claims: 5
+- [primary] https://arxiv.org/html/2605.18796  · angle: Multi-model agreement & trust-or-escalate routing · claims: 4
+- [primary] https://www.cs.cornell.edu/~tj/publications/joachims_etal_17a.pdf  · angle: Implicit feedback active learning vs upfront review · claims: 5
+- [primary] https://www.cs.cornell.edu/people/tj/publications/radlinski_joachims_05a.pdf  · angle: Implicit feedback active learning vs upfront review · claims: 5
+- [primary] https://dl.acm.org/doi/10.1145/1229179.1229181  · angle: Implicit feedback active learning vs upfront review · claims: 5
+- [primary] https://arxiv.org/pdf/2403.18962  · angle: Implicit feedback active learning vs upfront review · claims: 5
+- [primary] https://arxiv.org/abs/2407.12170  · angle: RAG corpus pruning vs rank-at-retrieval · claims: 3
+- [primary] https://arxiv.org/abs/2511.00505  · angle: RAG corpus pruning vs rank-at-retrieval · claims: 4
+- [primary] https://arxiv.org/html/2409.13694v2  · angle: RAG corpus pruning vs rank-at-retrieval · claims: 4