Files

Chaim 1cf1f30dcd docs(principles): number+rename research files for final-synthesis assembly (#153 )

02-deep-research-importance-recommendation.md (was README) + 03-deep-research-full-output.md

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-20 11:38:49 +00:00

23 KiB

Raw Blame History

מחקר-עומק מלא (גולמי) — קורפוס-הפסיקה

נספח גולמי ל-precedent-corpus-redesign.md. פלט מלא של מנוע deep-research (2026-06-20).

סטטיסטיקה: 6 זוויות · 25 מקורות · 114 טענות חולצו · 25 אומתו · 21 אושרו · 4 הופרכו · 108 קריאות-סוכן · 108 סוכנים.

תקציר-מנהלים (verbatim)

For your specific situation, the evidence points to option (B)/(C): rank-by-importance at retrieval time rather than a destructive cull, with selective-prediction gating that keeps human review near-zero. Automated importance signals from citation-network centrality (PageRank/HITS/degree) are a genuine, scalable proxy for PRECEDENT-level importance — derivable algorithmically without manual annotation (Swiss Criticality, Fowler/Jeon, Derlén & Lindholm) — but they are only moderately predictive (JURIX 2023 F1≈0.655) and are NOT validated at the holding/principle granularity you actually extract. Commercial systems (West KeyNumber patent, Shepard's/KeyCite) do operate at holding-level headnotes via cosine similarity, but their interpretive/editorial labels are substantially error-prone (one-third to two-thirds mislabeled), confirming that holding-level importance judgment is exactly where automation degrades — so you should not destructively prune on a noisy holding-level score. The robust path is to keep all extracted principles (reversible), attach multiple importance signals (precedent-level citation centrality + your chair's actual citation behavior as implicit supervision), rank at query time, and use a calibrated selective-prediction/conformal gate (Trust-or-Escalate cascade, SCRC) so only a tiny, statistically-bounded fraction ever escalates to the human — with a provable agreement guarantee at level 1−α.

לוג-הצינור

Q: Research question for a production legal-AI system (RAG that helps a planning-ap…
Decomposed into 6 angles: Citation-network importance & legal IR ranking, Headnote/holding selection at commercial citators, Selective prediction / conformal abstention thresholds, Multi-model agreement & trust-or-escalate routing, Implicit feedback active learning vs upfront review, RAG corpus pruning vs rank-at-retrieval
Citation-network importance & legal IR ranking: 6 results
Headnote/holding selection at commercial citators: 6 results
Headnote/holding selection at commercial citators: 4 novel (2 filtered)
Selective prediction / conformal abstention thresholds: 6 results
Selective prediction / conformal abstention thresholds: 5 novel (1 filtered)
Multi-model agreement & trust-or-escalate routing: 6 results
Multi-model agreement & trust-or-escalate routing: 3 novel (3 filtered)
Implicit feedback active learning vs upfront review: 6 results
Implicit feedback active learning vs upfront review: 4 novel (2 filtered)
RAG corpus pruning vs rank-at-retrieval: 6 results
RAG corpus pruning vs rank-at-retrieval: 3 novel (3 filtered)
Fetched 25 sources → 114 claims → verifying top 25
"Importance/criticality labels for legal decisions …": 3-0 ✓
"Case criticality is operationalized via a two-tier…": 3-0 ✓
"Derlén & Lindholm apply network-citation analysis …": 3-0 ✓
"Citation-network centrality scores (specifically t…": 3-0 ✓
"Network centrality measures correlate only reasona…": 3-0 ✓
"An ordinal regression model using network centrali…": 3-0 ✓
"Among centrality metrics, simple Degree (in-degree…": 1-2 ✗
"Citation counts alone (degree centrality / inward …": 1-2 ✗
"The authors construct importance scores using two …": 3-0 ✓
"Simple degree centrality (counting inward citation…": 3-0 ✓
"A deep-learning citation recommendation tool (BiLS…": 3-0 ✓
"Leveraging the local textual context surrounding a…": 3-0 ✓
"In a real judicial citation network (CJEU, 1955-20…": 3-0 ✓
"A link-prediction model on the citation graph pred…": 0-3 ✗
"Over time, structural/network features (preferenti…": 2-1 ✓
"West's commercial system classifies legal headnote…": 3-0 ✓
"The system does NOT treat all headnotes/holdings a…": 0-3 ✗
"The patented method operates at the granularity of…": 3-0 ✓
"Commercial citators' negative-treatment/holding la…": 3-0 ✓
"The error source is editorial analysis (the interp…": 3-0 ✓
"Selective evaluation with a calibrated confidence …": 3-0 ✓
"Cascaded Selective Evaluation routes each instance…": 3-0 ✓
"On ChatArena the cascade achieved over 80% human a…": 3-0 ✓
"Selective Conformal Risk Control (SCRC) is a frame…": 3-0 ✓
"SCRC provides a conditional risk-control guarantee…": 3-0 ✓
Verify done: 25 claims → 21 confirmed, 4 killed

הממצאים המלאים (verbatim)

ממצא 1 — Citation-network centrality is a scalable, manual-annotation-free importance signal — but it works at PRECEDENT/case level, not holding/principle level, and is only moderately predictive.

confidence: high · vote: 3-0 across all constituent claims מקורות: https://arxiv.org/html/2410.13460v2, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2910926, http://users.polisci.umn.edu/~trj/MyPapers/s6.pdf, https://www.researchgate.net/publication/376422421_Centrality_Scores_and_Precedent_Value_in_Legal_Network_Analysis, https://www.nature.com/articles/s41598-021-82430-x

Merges claims [0],[1],[2],[3],[4],[5],[7],[10]. Importance labels can be derived algorithmically from citation patterns, yielding far larger datasets than manual annotation (Swiss Criticality, ACL 2025: 138,531 cases via LD-Label + recency-weighted Citation-Label). Network-centrality methods (Derlén & Lindholm: PageRank, HITS, betweenness on 9,125 CJEU judgments; Fowler/Jeon at SCOTUS) establish case-level centrality as a quantitative importance proxy. Eigenvector/HITS approaches are preferred over raw degree because degree treats all citing cases equally regardless of the citing case's own importance. CRITICAL LIMIT: predictive power is only moderate — JURIX 2023 ordinal regression on the court's Importance Score achieved F1≈0.655 ('to an extent' indicates precedent value), and CJEU citation distributions are heavy-tailed/preferential-attachment (few highly-cited cases) confirming a meaningful but skewed signal. All evidence is scoped to precedent/case level; none validates principle/holding-level importance. For your ~3,562 principles (~12/precedent), this means: use precedent-level centrality as a strong prior on a principle's parent case, but do not treat it as a per-principle importance score.

ממצא 2 — Holding-level extraction IS achievable (commercial citators do it), but holding-level IMPORTANCE/treatment labeling is the error-prone editorial step — so a destructive cull keyed on a noisy holding-level score is risky.

confidence: high · vote: 3-0; one constituent refuted (selective top-1-3 retention) מקורות: https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/7580939, https://www.aallnet.org/wp-content/uploads/2018/12/LLJ_110n4_02_hellyer.pdf

Merges claims [12],[13],[14],[15]. West's patented KeyNumber system classifies individual headnotes (discrete holdings, ~6 per opinion, sometimes 50+) into a 90,000+ class taxonomy via cosine similarity over noun-word-pair vectors with composite scoring — proving holding-level extraction/classification is commercially viable. BUT Hellyer (2018, Law Library Journal) shows Shepard's and KeyCite missed/mislabeled ~one-third, and BCite over two-thirds, of negative citing relationships (357-sample); the three citators agreed only 53/357 times. The errors arise specifically in the EDITORIAL ANALYSIS (interpretive treatment/holding labeling), not the mechanical step of identifying citing cases — 'the significant problems occur in the editorial analysis process, after the initial process of identifying the citing cases.' Implication for you: mechanical signals (does a precedent get cited; by whom) are the reliable part; interpretive 'is this principle important' judgment is exactly where even commercial systems with human editors err badly. Note: the claim that West selectively retains only 1-3 holdings per case was REFUTED (vote 0-3) — commercial practice does NOT support aggressive holding-level pruning. This argues against a destructive cull driven by an interpretive importance score.

ממצא 3 — Context-aware, query-time retrieval (using the local textual context of the draft) outperforms context-free importance ranking for choosing which authority to surface — favoring rank-at-retrieval over a pre-pruned static corpus.

confidence: high · vote: 3-0 מקורות: https://arxiv.org/pdf/2106.10776

Merges claims [8],[9]. The ICAIL 2021 'Context-Aware Legal Citation Recommendation using Deep Learning' (Stanford RegLab + CMU) builds a citation recommender for opinion drafting and finds that leveraging local textual context improves recommendation quality over context-free baselines (collaborative filtering on citation lists). Context-based deep models (BiLSTM/RoBERTa) beat context-free methods because they exploit semantics to judge which citation fits the passage. This directly supports your option (B): the right-to-surface principle depends on the draft's local context, which is unknowable at cull time but available at query time. A static importance score (computed once, offline) cannot capture passage-specific relevance — so destroying low-static-importance principles risks discarding items that are highly relevant in a context the cull never saw. Caveat: context-aware recommendation and importance ranking are complementary, not mutually exclusive; the paper benchmarks against a citation-list baseline, not a centrality ranker.

ממצא 4 — Structural/network features become MORE predictive over time while content-similarity features decay — supporting maintaining a persistent citation graph (which improves as the corpus matures) rather than freezing a one-time content-based cull.

confidence: medium · vote: 2-1 מקורות: https://www.nature.com/articles/s41598-021-82430-x

Claim [11]. On the CJEU judicial citation network (1955-2014), Mones et al. (Scientific Reports 2021) found structural/common-neighbor features 'display a significant increase of predictive power' over time while document-content (TF-IDF) features show 'decreasing trends' — the network becomes increasingly informative as it matures. This implies a citation-graph-backed importance ranking is a more durable, self-improving asset than a one-shot content-similarity prune. CAVEATS lowering confidence to medium: (1) the verification flagged a misattribution — preferential attachment is a NODAL (decreasing) feature in the paper, not structural-increasing; the correctly-structural-increasing features are common-neighbor/Adamic-Adar indices. (2) The 'more durable than content similarity' framing is the claim's inference. (3) The paper itself flags automation-bias risk and that its recommendations operate at CASE level, not paragraph/holding level — reinforcing the precedent-vs-principle granularity caution. Still, the core direction (keep and grow the graph; rank at query time) is supported.

ממצא 5 — Selective prediction with calibrated thresholds gives a distribution-free, provable guarantee that auto-accepted judgments agree with the human at level 1−α (w.p. ≥1−δ), so the human reviews only a tiny calibrated fraction — directly satisfying the near-zero-review constraint.

confidence: high · vote: 3-0 מקורות: https://proceedings.iclr.cc/paper_files/paper/2025/file/08dabd5345b37fffcbe335bd578b15a0-Paper-Conference.pdf

Merges claims [16],[17],[18]. ICLR 2025 'Trust or Escalate' (Cascaded Selective Evaluation) formulates threshold selection as a multiple-hypothesis-testing problem on a small calibration set (|D_cal|=500, δ=0.1), guaranteeing P(f_LM(x)=y_human | c_LM(x)≥λ) ≥ 1−α with probability ≥1−δ — distribution-free (only i.i.d. calibration assumed, built on Bates et al. 2021 risk-controlling sets and Angelopoulos et al. 2022 Learn-then-Test). The cascade routes cheap judges first and escalates to a stronger model only when not confident, abstaining when none are confident. Empirically on ChatArena: >80% human agreement at 79.1% coverage, 88.1% of covered instances handled by cheap models, GPT-4 invoked on only 17.5% of instances, 91% guarantee-success vs <60% for point-estimate calibration. For you: this is the mechanism to keep human review near-zero — calibrate against a small set of the chair's own accept/reject decisions, auto-accept high-confidence principles, auto-reject low-confidence ones, and escalate to the human ONLY the calibrated uncertain middle, with a provable agreement bound.

ממצא 6 — Conformal-risk-control variants (SCRC) extend the guarantee to abstention: risk is bounded ONLY on accepted (non-abstained) samples via two calibration thresholds — giving a principled accept/abstain/reject gate suited to a noisy KB triage.

confidence: high · vote: 3-0 מקורות: https://arxiv.org/html/2512.12844

Merges claims [19],[20]. Selective Conformal Risk Control (Xu, Guo, Wei, 2025) combines conformal prediction with selective classification using two thresholds: λ₁ controls which samples are accepted (else abstain/defer), λ₂ controls prediction-set size. Theorem 2 guarantees E[l(C(X),Y) | g(X)≥1−λ₁] ≤ α — expected loss on ACCEPTED samples is bounded below a user-chosen target risk α; the calibration-only variant (SCRC-I) gives the bound w.p. ≥1−δ. This formalizes a three-way KB gate: auto-keep (accept) where conformal risk is provably low, auto-discard candidates, and defer the rest to the human — with risk controlled on exactly the items you act on automatically. Caveat: guarantees rely on exchangeability of calibration/test data, and 'risk' is a general bounded loss (expectation, not a probability); the source is current (Dec 2025) and peer-discussed but newer than the established Trust-or-Escalate line.

ממצא 7 — RECOMMENDATION: do NOT do a destructive holding-level cull; rank-by-importance at retrieval time over a reversibly-retained corpus, gated by a selective-prediction layer calibrated to the chair's natural citing behavior.

confidence: medium · vote: synthesis of high-confidence findings; recommendation is inference מקורות: https://www.researchgate.net/publication/376422421_Centrality_Scores_and_Precedent_Value_in_Legal_Network_Analysis, https://www.aallnet.org/wp-content/uploads/2018/12/LLJ_110n4_02_hellyer.pdf, https://arxiv.org/pdf/2106.10776, https://proceedings.iclr.cc/paper_files/paper/2025/file/08dabd5345b37fffcbe335bd578b15a0-Paper-Conference.pdf, https://arxiv.org/html/2512.12844

Synthesis. Choose option B/C, not A. Rationale chain: (1) per-principle importance scoring is only moderately reliable even with citation networks (F1≈0.655) and is the editorial step where commercial citators err one-third-to-two-thirds — too noisy to justify irreversible deletion; (2) the right principle to surface is context-dependent (ICAIL 2021), unknowable at cull time but available at query time; (3) the citation graph is a self-improving asset (Scientific Reports 2021). CONCRETE DESIGN: (a) Keep all ~3,562 principles; attach precedent-level citation-centrality (PageRank/degree on your internal + external citation graph) as a prior, NOT a per-principle delete trigger; rank principles at retrieval time fusing centrality prior + context-aware semantic similarity to the draft block. (b) Mark obviously-redundant/low-quality principles with a reversible 'demoted/suppressed' flag (review_status) rather than deleting — your system already has reversible review_status gating per the project context. (c) Make the chair's NATURAL behavior the supervision signal: log which principles/precedents she actually cites in finalized decisions (implicit feedback) and which retrieved items she ignores; use these as the calibration labels. (d) Wrap auto-keep/demote in a Trust-or-Escalate / SCRC gate calibrated on ~500 of those implicit accept/ignore signals, so only a tiny calibrated fraction (target-α) ever reaches her for explicit review, with a provable agreement bound. This keeps human upfront review at zero and converges via use. Confidence is medium because the recommendation composes high-confidence findings into a design choice the literature supports directionally but does not test end-to-end on a holding-level legal KB.

טענות שהופרכו (verbatim)

[1-2] Among centrality metrics, simple Degree (in-degree / citation count) was the most stable predictor of precedent value across network and sub-network configurations, outperforming more complex measures like PageRank in robustness.
מקור: https://www.researchgate.net/publication/376422421_Centrality_Scores_and_Precedent_Value_in_Legal_Network_Analysis
[1-2] Citation counts alone (degree centrality / inward citations) are an insufficient proxy for legal importance; a Kleinberg HITS-style hubs-and-authorities measure that combines inward AND outward citations is superior and reveals importance information not evident in simple citation counts.
מקור: http://users.polisci.umn.edu/~trj/MyPapers/s6.pdf
[0-3] A link-prediction model on the citation graph predicts which prior cases a new case will cite with strong accuracy — 95% of cases have a median rank below 292 — demonstrating that citation-network structure alone can rank precedents by likely relevance/importance for retrieval.
מקור: https://www.nature.com/articles/s41598-021-82430-x
[0-3] The system does NOT treat all headnotes/holdings as equally important — it selectively retains only the most relevant one to three holdings per case by similarity, demonstrating commercial citator practice of holding-level selection/pruning rather than keeping everything.
מקור: https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/7580939

שאלות-פתוחות (verbatim)

Can a per-PRINCIPLE importance score be validated (not just per-precedent)? E.g., does a principle's retrieval-then-citation rate by the chair correlate with any algorithmic signal well enough to gate on — this is the unproven core of your use case and would need an internal study on your own corpus.
What is the minimum reliable calibration-set size and refresh cadence for the chair's implicit citing behavior, given a small single-author corpus (the project notes ~5,243 principles but a low-data style-acquisition regime)? Trust-or-Escalate used 500 i.i.d. examples; can implicit signals from one chair's decisions reach that volume, and how fast does exchangeability degrade as her preferences evolve?
Should the importance prior combine INTERNAL citations (the chair's own decision-to-decision citations) with EXTERNAL precedent citations, and at what weighting — internal signals are scarcer but far more aligned to her style than generic court-citation centrality?
Does aggressive query-time ranking (vs. culling) measurably hurt RAG precision/latency at your corpus scale (~3,562-5,243 items), or is the retrieval set small enough that ranking-only with reversible demotion has no practical downside — i.e., is culling solving a problem you actually have?

כל המקורות

[primary] https://arxiv.org/html/2410.13460v2 · זווית: Citation-network importance & legal IR ranking · טענות: 5
[primary] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2910926 · זווית: Citation-network importance & legal IR ranking · טענות: 5
[primary] https://www.researchgate.net/publication/376422421_Centrality_Scores_and_Precedent_Value_in_Legal_Network_Analysis · זווית: Citation-network importance & legal IR ranking · טענות: 5
[primary] http://users.polisci.umn.edu/~trj/MyPapers/s6.pdf · זווית: Citation-network importance & legal IR ranking · טענות: 5
[primary] https://arxiv.org/pdf/2106.10776 · זווית: Citation-network importance & legal IR ranking · טענות: 5
[primary] https://www.nature.com/articles/s41598-021-82430-x · זווית: Citation-network importance & legal IR ranking · טענות: 5
[primary] https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/7580939 · זווית: Headnote/holding selection at commercial citators · טענות: 5
[primary] https://www.aallnet.org/wp-content/uploads/2018/12/LLJ_110n4_02_hellyer.pdf · זווית: Headnote/holding selection at commercial citators · טענות: 5
[secondary] https://library.law.northwestern.edu/cases/updating · זווית: Headnote/holding selection at commercial citators · טענות: 2
[secondary] https://guides.law.stanford.edu/cases/keynumbersystem · זווית: Headnote/holding selection at commercial citators · טענות: 4
[primary] https://proceedings.iclr.cc/paper_files/paper/2025/file/08dabd5345b37fffcbe335bd578b15a0-Paper-Conference.pdf · זווית: Selective prediction / conformal abstention thresholds · טענות: 5
[primary] https://arxiv.org/html/2512.12844 · זווית: Selective prediction / conformal abstention thresholds · טענות: 5
[primary] https://arxiv.org/pdf/2407.18370 · זווית: Selective prediction / conformal abstention thresholds · טענות: 5
[primary] https://sinatayebati.github.io/vlm-uncertainty/ · זווית: Selective prediction / conformal abstention thresholds · טענות: 5
[primary] https://openreview.net/forum?id=JJPAy8mvrQ · זווית: Selective prediction / conformal abstention thresholds · טענות: 4
[primary] https://arxiv.org/abs/2407.18370 · זווית: Multi-model agreement & trust-or-escalate routing · טענות: 4
[primary] https://arxiv.org/pdf/2511.07396 · זווית: Multi-model agreement & trust-or-escalate routing · טענות: 5
[primary] https://arxiv.org/html/2605.18796 · זווית: Multi-model agreement & trust-or-escalate routing · טענות: 4
[primary] https://www.cs.cornell.edu/~tj/publications/joachims_etal_17a.pdf · זווית: Implicit feedback active learning vs upfront review · טענות: 5
[primary] https://www.cs.cornell.edu/people/tj/publications/radlinski_joachims_05a.pdf · זווית: Implicit feedback active learning vs upfront review · טענות: 5
[primary] https://dl.acm.org/doi/10.1145/1229179.1229181 · זווית: Implicit feedback active learning vs upfront review · טענות: 5
[primary] https://arxiv.org/pdf/2403.18962 · זווית: Implicit feedback active learning vs upfront review · טענות: 5
[primary] https://arxiv.org/abs/2407.12170 · זווית: RAG corpus pruning vs rank-at-retrieval · טענות: 3
[primary] https://arxiv.org/abs/2511.00505 · זווית: RAG corpus pruning vs rank-at-retrieval · טענות: 4
[primary] https://arxiv.org/html/2409.13694v2 · זווית: RAG corpus pruning vs rank-at-retrieval · טענות: 4

23 KiB Raw Blame History Unescape Escape