02-deep-research-importance-recommendation.md (was README) + 03-deep-research-full-output.md Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
23 KiB
מחקר-עומק מלא (גולמי) — קורפוס-הפסיקה
נספח גולמי ל-
precedent-corpus-redesign.md. פלט מלא של מנוע deep-research (2026-06-20).
סטטיסטיקה: 6 זוויות · 25 מקורות · 114 טענות חולצו · 25 אומתו · 21 אושרו · 4 הופרכו · 108 קריאות-סוכן · 108 סוכנים.
תקציר-מנהלים (verbatim)
For your specific situation, the evidence points to option (B)/(C): rank-by-importance at retrieval time rather than a destructive cull, with selective-prediction gating that keeps human review near-zero. Automated importance signals from citation-network centrality (PageRank/HITS/degree) are a genuine, scalable proxy for PRECEDENT-level importance — derivable algorithmically without manual annotation (Swiss Criticality, Fowler/Jeon, Derlén & Lindholm) — but they are only moderately predictive (JURIX 2023 F1≈0.655) and are NOT validated at the holding/principle granularity you actually extract. Commercial systems (West KeyNumber patent, Shepard's/KeyCite) do operate at holding-level headnotes via cosine similarity, but their interpretive/editorial labels are substantially error-prone (one-third to two-thirds mislabeled), confirming that holding-level importance judgment is exactly where automation degrades — so you should not destructively prune on a noisy holding-level score. The robust path is to keep all extracted principles (reversible), attach multiple importance signals (precedent-level citation centrality + your chair's actual citation behavior as implicit supervision), rank at query time, and use a calibrated selective-prediction/conformal gate (Trust-or-Escalate cascade, SCRC) so only a tiny, statistically-bounded fraction ever escalates to the human — with a provable agreement guarantee at level 1−α.
לוג-הצינור
- Q: Research question for a production legal-AI system (RAG that helps a planning-ap…
- Decomposed into 6 angles: Citation-network importance & legal IR ranking, Headnote/holding selection at commercial citators, Selective prediction / conformal abstention thresholds, Multi-model agreement & trust-or-escalate routing, Implicit feedback active learning vs upfront review, RAG corpus pruning vs rank-at-retrieval
- Citation-network importance & legal IR ranking: 6 results
- Headnote/holding selection at commercial citators: 6 results
- Headnote/holding selection at commercial citators: 4 novel (2 filtered)
- Selective prediction / conformal abstention thresholds: 6 results
- Selective prediction / conformal abstention thresholds: 5 novel (1 filtered)
- Multi-model agreement & trust-or-escalate routing: 6 results
- Multi-model agreement & trust-or-escalate routing: 3 novel (3 filtered)
- Implicit feedback active learning vs upfront review: 6 results
- Implicit feedback active learning vs upfront review: 4 novel (2 filtered)
- RAG corpus pruning vs rank-at-retrieval: 6 results
- RAG corpus pruning vs rank-at-retrieval: 3 novel (3 filtered)
- Fetched 25 sources → 114 claims → verifying top 25
- "Importance/criticality labels for legal decisions …": 3-0 ✓
- "Case criticality is operationalized via a two-tier…": 3-0 ✓
- "Derlén & Lindholm apply network-citation analysis …": 3-0 ✓
- "Citation-network centrality scores (specifically t…": 3-0 ✓
- "Network centrality measures correlate only reasona…": 3-0 ✓
- "An ordinal regression model using network centrali…": 3-0 ✓
- "Among centrality metrics, simple Degree (in-degree…": 1-2 ✗
- "Citation counts alone (degree centrality / inward …": 1-2 ✗
- "The authors construct importance scores using two …": 3-0 ✓
- "Simple degree centrality (counting inward citation…": 3-0 ✓
- "A deep-learning citation recommendation tool (BiLS…": 3-0 ✓
- "Leveraging the local textual context surrounding a…": 3-0 ✓
- "In a real judicial citation network (CJEU, 1955-20…": 3-0 ✓
- "A link-prediction model on the citation graph pred…": 0-3 ✗
- "Over time, structural/network features (preferenti…": 2-1 ✓
- "West's commercial system classifies legal headnote…": 3-0 ✓
- "The system does NOT treat all headnotes/holdings a…": 0-3 ✗
- "The patented method operates at the granularity of…": 3-0 ✓
- "Commercial citators' negative-treatment/holding la…": 3-0 ✓
- "The error source is editorial analysis (the interp…": 3-0 ✓
- "Selective evaluation with a calibrated confidence …": 3-0 ✓
- "Cascaded Selective Evaluation routes each instance…": 3-0 ✓
- "On ChatArena the cascade achieved over 80% human a…": 3-0 ✓
- "Selective Conformal Risk Control (SCRC) is a frame…": 3-0 ✓
- "SCRC provides a conditional risk-control guarantee…": 3-0 ✓
- Verify done: 25 claims → 21 confirmed, 4 killed
הממצאים המלאים (verbatim)
ממצא 1 — Citation-network centrality is a scalable, manual-annotation-free importance signal — but it works at PRECEDENT/case level, not holding/principle level, and is only moderately predictive.
confidence: high · vote: 3-0 across all constituent claims מקורות: https://arxiv.org/html/2410.13460v2, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2910926, http://users.polisci.umn.edu/~trj/MyPapers/s6.pdf, https://www.researchgate.net/publication/376422421_Centrality_Scores_and_Precedent_Value_in_Legal_Network_Analysis, https://www.nature.com/articles/s41598-021-82430-x
Merges claims [0],[1],[2],[3],[4],[5],[7],[10]. Importance labels can be derived algorithmically from citation patterns, yielding far larger datasets than manual annotation (Swiss Criticality, ACL 2025: 138,531 cases via LD-Label + recency-weighted Citation-Label). Network-centrality methods (Derlén & Lindholm: PageRank, HITS, betweenness on 9,125 CJEU judgments; Fowler/Jeon at SCOTUS) establish case-level centrality as a quantitative importance proxy. Eigenvector/HITS approaches are preferred over raw degree because degree treats all citing cases equally regardless of the citing case's own importance. CRITICAL LIMIT: predictive power is only moderate — JURIX 2023 ordinal regression on the court's Importance Score achieved F1≈0.655 ('to an extent' indicates precedent value), and CJEU citation distributions are heavy-tailed/preferential-attachment (few highly-cited cases) confirming a meaningful but skewed signal. All evidence is scoped to precedent/case level; none validates principle/holding-level importance. For your ~3,562 principles (~12/precedent), this means: use precedent-level centrality as a strong prior on a principle's parent case, but do not treat it as a per-principle importance score.
ממצא 2 — Holding-level extraction IS achievable (commercial citators do it), but holding-level IMPORTANCE/treatment labeling is the error-prone editorial step — so a destructive cull keyed on a noisy holding-level score is risky.
confidence: high · vote: 3-0; one constituent refuted (selective top-1-3 retention) מקורות: https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/7580939, https://www.aallnet.org/wp-content/uploads/2018/12/LLJ_110n4_02_hellyer.pdf
Merges claims [12],[13],[14],[15]. West's patented KeyNumber system classifies individual headnotes (discrete holdings, ~6 per opinion, sometimes 50+) into a 90,000+ class taxonomy via cosine similarity over noun-word-pair vectors with composite scoring — proving holding-level extraction/classification is commercially viable. BUT Hellyer (2018, Law Library Journal) shows Shepard's and KeyCite missed/mislabeled ~one-third, and BCite over two-thirds, of negative citing relationships (357-sample); the three citators agreed only 53/357 times. The errors arise specifically in the EDITORIAL ANALYSIS (interpretive treatment/holding labeling), not the mechanical step of identifying citing cases — 'the significant problems occur in the editorial analysis process, after the initial process of identifying the citing cases.' Implication for you: mechanical signals (does a precedent get cited; by whom) are the reliable part; interpretive 'is this principle important' judgment is exactly where even commercial systems with human editors err badly. Note: the claim that West selectively retains only 1-3 holdings per case was REFUTED (vote 0-3) — commercial practice does NOT support aggressive holding-level pruning. This argues against a destructive cull driven by an interpretive importance score.
ממצא 3 — Context-aware, query-time retrieval (using the local textual context of the draft) outperforms context-free importance ranking for choosing which authority to surface — favoring rank-at-retrieval over a pre-pruned static corpus.
confidence: high · vote: 3-0 מקורות: https://arxiv.org/pdf/2106.10776
Merges claims [8],[9]. The ICAIL 2021 'Context-Aware Legal Citation Recommendation using Deep Learning' (Stanford RegLab + CMU) builds a citation recommender for opinion drafting and finds that leveraging local textual context improves recommendation quality over context-free baselines (collaborative filtering on citation lists). Context-based deep models (BiLSTM/RoBERTa) beat context-free methods because they exploit semantics to judge which citation fits the passage. This directly supports your option (B): the right-to-surface principle depends on the draft's local context, which is unknowable at cull time but available at query time. A static importance score (computed once, offline) cannot capture passage-specific relevance — so destroying low-static-importance principles risks discarding items that are highly relevant in a context the cull never saw. Caveat: context-aware recommendation and importance ranking are complementary, not mutually exclusive; the paper benchmarks against a citation-list baseline, not a centrality ranker.
ממצא 4 — Structural/network features become MORE predictive over time while content-similarity features decay — supporting maintaining a persistent citation graph (which improves as the corpus matures) rather than freezing a one-time content-based cull.
confidence: medium · vote: 2-1 מקורות: https://www.nature.com/articles/s41598-021-82430-x
Claim [11]. On the CJEU judicial citation network (1955-2014), Mones et al. (Scientific Reports 2021) found structural/common-neighbor features 'display a significant increase of predictive power' over time while document-content (TF-IDF) features show 'decreasing trends' — the network becomes increasingly informative as it matures. This implies a citation-graph-backed importance ranking is a more durable, self-improving asset than a one-shot content-similarity prune. CAVEATS lowering confidence to medium: (1) the verification flagged a misattribution — preferential attachment is a NODAL (decreasing) feature in the paper, not structural-increasing; the correctly-structural-increasing features are common-neighbor/Adamic-Adar indices. (2) The 'more durable than content similarity' framing is the claim's inference. (3) The paper itself flags automation-bias risk and that its recommendations operate at CASE level, not paragraph/holding level — reinforcing the precedent-vs-principle granularity caution. Still, the core direction (keep and grow the graph; rank at query time) is supported.
ממצא 5 — Selective prediction with calibrated thresholds gives a distribution-free, provable guarantee that auto-accepted judgments agree with the human at level 1−α (w.p. ≥1−δ), so the human reviews only a tiny calibrated fraction — directly satisfying the near-zero-review constraint.
confidence: high · vote: 3-0 מקורות: https://proceedings.iclr.cc/paper_files/paper/2025/file/08dabd5345b37fffcbe335bd578b15a0-Paper-Conference.pdf
Merges claims [16],[17],[18]. ICLR 2025 'Trust or Escalate' (Cascaded Selective Evaluation) formulates threshold selection as a multiple-hypothesis-testing problem on a small calibration set (|D_cal|=500, δ=0.1), guaranteeing P(f_LM(x)=y_human | c_LM(x)≥λ) ≥ 1−α with probability ≥1−δ — distribution-free (only i.i.d. calibration assumed, built on Bates et al. 2021 risk-controlling sets and Angelopoulos et al. 2022 Learn-then-Test). The cascade routes cheap judges first and escalates to a stronger model only when not confident, abstaining when none are confident. Empirically on ChatArena: >80% human agreement at 79.1% coverage, 88.1% of covered instances handled by cheap models, GPT-4 invoked on only 17.5% of instances, 91% guarantee-success vs <60% for point-estimate calibration. For you: this is the mechanism to keep human review near-zero — calibrate against a small set of the chair's own accept/reject decisions, auto-accept high-confidence principles, auto-reject low-confidence ones, and escalate to the human ONLY the calibrated uncertain middle, with a provable agreement bound.
ממצא 6 — Conformal-risk-control variants (SCRC) extend the guarantee to abstention: risk is bounded ONLY on accepted (non-abstained) samples via two calibration thresholds — giving a principled accept/abstain/reject gate suited to a noisy KB triage.
confidence: high · vote: 3-0 מקורות: https://arxiv.org/html/2512.12844
Merges claims [19],[20]. Selective Conformal Risk Control (Xu, Guo, Wei, 2025) combines conformal prediction with selective classification using two thresholds: λ₁ controls which samples are accepted (else abstain/defer), λ₂ controls prediction-set size. Theorem 2 guarantees E[l(C(X),Y) | g(X)≥1−λ₁] ≤ α — expected loss on ACCEPTED samples is bounded below a user-chosen target risk α; the calibration-only variant (SCRC-I) gives the bound w.p. ≥1−δ. This formalizes a three-way KB gate: auto-keep (accept) where conformal risk is provably low, auto-discard candidates, and defer the rest to the human — with risk controlled on exactly the items you act on automatically. Caveat: guarantees rely on exchangeability of calibration/test data, and 'risk' is a general bounded loss (expectation, not a probability); the source is current (Dec 2025) and peer-discussed but newer than the established Trust-or-Escalate line.
ממצא 7 — RECOMMENDATION: do NOT do a destructive holding-level cull; rank-by-importance at retrieval time over a reversibly-retained corpus, gated by a selective-prediction layer calibrated to the chair's natural citing behavior.
confidence: medium · vote: synthesis of high-confidence findings; recommendation is inference מקורות: https://www.researchgate.net/publication/376422421_Centrality_Scores_and_Precedent_Value_in_Legal_Network_Analysis, https://www.aallnet.org/wp-content/uploads/2018/12/LLJ_110n4_02_hellyer.pdf, https://arxiv.org/pdf/2106.10776, https://proceedings.iclr.cc/paper_files/paper/2025/file/08dabd5345b37fffcbe335bd578b15a0-Paper-Conference.pdf, https://arxiv.org/html/2512.12844
Synthesis. Choose option B/C, not A. Rationale chain: (1) per-principle importance scoring is only moderately reliable even with citation networks (F1≈0.655) and is the editorial step where commercial citators err one-third-to-two-thirds — too noisy to justify irreversible deletion; (2) the right principle to surface is context-dependent (ICAIL 2021), unknowable at cull time but available at query time; (3) the citation graph is a self-improving asset (Scientific Reports 2021). CONCRETE DESIGN: (a) Keep all ~3,562 principles; attach precedent-level citation-centrality (PageRank/degree on your internal + external citation graph) as a prior, NOT a per-principle delete trigger; rank principles at retrieval time fusing centrality prior + context-aware semantic similarity to the draft block. (b) Mark obviously-redundant/low-quality principles with a reversible 'demoted/suppressed' flag (review_status) rather than deleting — your system already has reversible review_status gating per the project context. (c) Make the chair's NATURAL behavior the supervision signal: log which principles/precedents she actually cites in finalized decisions (implicit feedback) and which retrieved items she ignores; use these as the calibration labels. (d) Wrap auto-keep/demote in a Trust-or-Escalate / SCRC gate calibrated on ~500 of those implicit accept/ignore signals, so only a tiny calibrated fraction (target-α) ever reaches her for explicit review, with a provable agreement bound. This keeps human upfront review at zero and converges via use. Confidence is medium because the recommendation composes high-confidence findings into a design choice the literature supports directionally but does not test end-to-end on a holding-level legal KB.
טענות שהופרכו (verbatim)
- [1-2] Among centrality metrics, simple Degree (in-degree / citation count) was the most stable predictor of precedent value across network and sub-network configurations, outperforming more complex measures like PageRank in robustness.
מקור: https://www.researchgate.net/publication/376422421_Centrality_Scores_and_Precedent_Value_in_Legal_Network_Analysis - [1-2] Citation counts alone (degree centrality / inward citations) are an insufficient proxy for legal importance; a Kleinberg HITS-style hubs-and-authorities measure that combines inward AND outward citations is superior and reveals importance information not evident in simple citation counts.
מקור: http://users.polisci.umn.edu/~trj/MyPapers/s6.pdf - [0-3] A link-prediction model on the citation graph predicts which prior cases a new case will cite with strong accuracy — 95% of cases have a median rank below 292 — demonstrating that citation-network structure alone can rank precedents by likely relevance/importance for retrieval.
מקור: https://www.nature.com/articles/s41598-021-82430-x - [0-3] The system does NOT treat all headnotes/holdings as equally important — it selectively retains only the most relevant one to three holdings per case by similarity, demonstrating commercial citator practice of holding-level selection/pruning rather than keeping everything.
מקור: https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/7580939
שאלות-פתוחות (verbatim)
- Can a per-PRINCIPLE importance score be validated (not just per-precedent)? E.g., does a principle's retrieval-then-citation rate by the chair correlate with any algorithmic signal well enough to gate on — this is the unproven core of your use case and would need an internal study on your own corpus.
- What is the minimum reliable calibration-set size and refresh cadence for the chair's implicit citing behavior, given a small single-author corpus (the project notes ~5,243 principles but a low-data style-acquisition regime)? Trust-or-Escalate used 500 i.i.d. examples; can implicit signals from one chair's decisions reach that volume, and how fast does exchangeability degrade as her preferences evolve?
- Should the importance prior combine INTERNAL citations (the chair's own decision-to-decision citations) with EXTERNAL precedent citations, and at what weighting — internal signals are scarcer but far more aligned to her style than generic court-citation centrality?
- Does aggressive query-time ranking (vs. culling) measurably hurt RAG precision/latency at your corpus scale (~3,562-5,243 items), or is the retrieval set small enough that ranking-only with reversible demotion has no practical downside — i.e., is culling solving a problem you actually have?
כל המקורות
- [primary] https://arxiv.org/html/2410.13460v2 · זווית: Citation-network importance & legal IR ranking · טענות: 5
- [primary] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2910926 · זווית: Citation-network importance & legal IR ranking · טענות: 5
- [primary] https://www.researchgate.net/publication/376422421_Centrality_Scores_and_Precedent_Value_in_Legal_Network_Analysis · זווית: Citation-network importance & legal IR ranking · טענות: 5
- [primary] http://users.polisci.umn.edu/~trj/MyPapers/s6.pdf · זווית: Citation-network importance & legal IR ranking · טענות: 5
- [primary] https://arxiv.org/pdf/2106.10776 · זווית: Citation-network importance & legal IR ranking · טענות: 5
- [primary] https://www.nature.com/articles/s41598-021-82430-x · זווית: Citation-network importance & legal IR ranking · טענות: 5
- [primary] https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/7580939 · זווית: Headnote/holding selection at commercial citators · טענות: 5
- [primary] https://www.aallnet.org/wp-content/uploads/2018/12/LLJ_110n4_02_hellyer.pdf · זווית: Headnote/holding selection at commercial citators · טענות: 5
- [secondary] https://library.law.northwestern.edu/cases/updating · זווית: Headnote/holding selection at commercial citators · טענות: 2
- [secondary] https://guides.law.stanford.edu/cases/keynumbersystem · זווית: Headnote/holding selection at commercial citators · טענות: 4
- [primary] https://proceedings.iclr.cc/paper_files/paper/2025/file/08dabd5345b37fffcbe335bd578b15a0-Paper-Conference.pdf · זווית: Selective prediction / conformal abstention thresholds · טענות: 5
- [primary] https://arxiv.org/html/2512.12844 · זווית: Selective prediction / conformal abstention thresholds · טענות: 5
- [primary] https://arxiv.org/pdf/2407.18370 · זווית: Selective prediction / conformal abstention thresholds · טענות: 5
- [primary] https://sinatayebati.github.io/vlm-uncertainty/ · זווית: Selective prediction / conformal abstention thresholds · טענות: 5
- [primary] https://openreview.net/forum?id=JJPAy8mvrQ · זווית: Selective prediction / conformal abstention thresholds · טענות: 4
- [primary] https://arxiv.org/abs/2407.18370 · זווית: Multi-model agreement & trust-or-escalate routing · טענות: 4
- [primary] https://arxiv.org/pdf/2511.07396 · זווית: Multi-model agreement & trust-or-escalate routing · טענות: 5
- [primary] https://arxiv.org/html/2605.18796 · זווית: Multi-model agreement & trust-or-escalate routing · טענות: 4
- [primary] https://www.cs.cornell.edu/~tj/publications/joachims_etal_17a.pdf · זווית: Implicit feedback active learning vs upfront review · טענות: 5
- [primary] https://www.cs.cornell.edu/people/tj/publications/radlinski_joachims_05a.pdf · זווית: Implicit feedback active learning vs upfront review · טענות: 5
- [primary] https://dl.acm.org/doi/10.1145/1229179.1229181 · זווית: Implicit feedback active learning vs upfront review · טענות: 5
- [primary] https://arxiv.org/pdf/2403.18962 · זווית: Implicit feedback active learning vs upfront review · טענות: 5
- [primary] https://arxiv.org/abs/2407.12170 · זווית: RAG corpus pruning vs rank-at-retrieval · טענות: 3
- [primary] https://arxiv.org/abs/2511.00505 · זווית: RAG corpus pruning vs rank-at-retrieval · טענות: 4
- [primary] https://arxiv.org/html/2409.13694v2 · זווית: RAG corpus pruning vs rank-at-retrieval · טענות: 4