feat(retrieval): add voyage rerank-2 cross-encoder stage (feature flag)

Stage B of voyage-upgrades-plan rewritten: instead of context-3 (which 4 POCs showed inconsistent improvement), add a cross-encoder rerank layer on top of voyage-3. Default off (VOYAGE_RERANK_ENABLED=false). POC validation (785-doc corpus, 12 queries, claude-haiku-4-5 judge): - mean@3 +4.5% (4.306 → 4.500) - practical-category queries +11.6% (3.78 → 4.22) - latency +702ms per query - no schema change, no re-embed, no double storage Plumbing: - config: VOYAGE_RERANK_ENABLED / _MODEL / _FETCH_K env vars - embeddings.voyage_rerank() wraps voyageai client.rerank - services/rerank.py: maybe_rerank() helper — fetches FETCH_K candidates via the bi-encoder then reranks to top-K. Fail-open if Voyage rerank is unavailable. - tools/search.py: search_decisions, search_case_documents, find_similar_cases all wrapped - services/precedent_library.search_library wrapped Smoke-tested locally with flag on/off — produces expected behaviour and latency profile. Ready for production rollout via Coolify env flip after deploy. POCs (kept under scripts/ for reference): - voyage_context3_poc{_long}.py — context-3 evaluation (rejected) - voyage_multimodal_poc.py — multimodal-3 (stage C, deferred) - voyage_rerank_judge_poc.py — single-case rerank benchmark - voyage_rerank_corpus_poc.py — full-corpus rerank validation Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:43:41 +00:00
parent 688ba37d9c
commit 26c3fddf41
13 changed files with 1578 additions and 100 deletions
--- a/scripts/SCRIPTS.md
+++ b/scripts/SCRIPTS.md
@@ -17,6 +17,11 @@
 | `deploy-track-changes.sh` | bash | סנכרון skills CMP↔CMPA + בדיקות + הנחיות deploy לארכיטקטורת Track Changes | ידני |
 | `retrofit_case.py` | python | retrofit רטרואקטיבי — מזריק bookmarks לקובץ קיים של תיק ספציפי ומגדיר אותו כ-active_draft | ידני (חד-פעמי לתיק) |
 | `reembed_voyage.py` | python | Re-embed כל הוקטורים ב-DB עם המודל ב-`VOYAGE_MODEL` (לאחר שינוי מודל). 5 טבלאות, 1024 דמ', batches של 100. ראה `docs/voyage-upgrades-plan.md` | ידני (אחרי החלפת `VOYAGE_MODEL`) |
+| `voyage_context3_poc.py` | python | POC #1 — voyage-3 vs voyage-context-3 על פסיקה אחת קצרה (קלמנוביץ, 63 chunks). הכרעה: context-3 לא מציג שיפור עקבי | בנצ'מרק חד-פעמי, נשמר לרפרנס |
+| `voyage_context3_poc_long.py` | python | POC #2 — voyage-context-3 על פסיקה ארוכה (אהרון ברק 219 chunks) עם sliding windows. הכרעה: context-3 לא משתפר על פסיקה גדולה | בנצ'מרק חד-פעמי, נשמר לרפרנס |
+| `voyage_multimodal_poc.py` | python | POC #3 — voyage-multimodal-3 על דוח שמאי (89 עמודים). הכרעה: שיפור משמעותי לטבלאות + 22 עמודי image-only שhttp text-OCR מאבד | בנצ'מרק חד-פעמי, מוכן לשלב C |
+| `voyage_rerank_judge_poc.py` | python | POC #4 — voyage-3 vs rerank-2 vs context-3 על אהרון ברק, 18 שאילתות, claude-haiku-4-5 כ-judge. הכרעה: rerank-2 ניצח עם +9% mean@3 | בנצ'מרק חד-פעמי |
+| `voyage_rerank_corpus_poc.py` | python | POC #5 — voyage-3 vs rerank-2 על קורפוס מלא (785 docs). הכרעה: +4.5% mean@3 כללי, +11.6% על P queries (practical) | בנצ'מרק חד-פעמי, אישר את שלב B |

 ## תיקיית `.archive/` — סקריפטים שהושלמו

--- a/scripts/voyage_context3_poc.py
+++ b/scripts/voyage_context3_poc.py
@@ -0,0 +1,182 @@
+"""POC: Compare voyage-3 vs voyage-context-3 retrieval on case 403/17.
+
+Pulls all chunks of "אהרון ברק - תכנית רחביה" (case_law_id=e151fc25-...),
+runs them through voyage-context-3 in a single contextualized_embed call,
+then runs benchmark queries and compares rankings against the existing
+voyage-3 embeddings (already in the DB).
+
+No DB writes — all comparisons in memory. Output: ranking table for each
+query showing top-10 from both models side-by-side.
+
+Usage:
+    /home/chaim/legal-ai/mcp-server/.venv/bin/python \\
+        /home/chaim/legal-ai/scripts/voyage_context3_poc.py
+"""
+from __future__ import annotations
+
+import asyncio
+import math
+import os
+import sys
+import time
+
+# Load ~/.env
+ENV_PATH = os.path.expanduser("~/.env")
+if os.path.isfile(ENV_PATH):
+    with open(ENV_PATH) as f:
+        for line in f:
+            line = line.strip()
+            if line and not line.startswith("#") and "=" in line:
+                k, v = line.split("=", 1)
+                os.environ.setdefault(k, v)
+
+import asyncpg  # noqa: E402
+import voyageai  # noqa: E402
+
+
+# Using קלמנוביץ/לויתן (52K chars, 63 chunks, ~18K tokens)
+# — fits in single context-3 call (32K token limit per inner list).
+# אהרון ברק (60K tokens) requires splitting; we'll handle that after POC.
+CASE_ID = "436efd48-c8ab-49f0-b3a9-52bf15ea806d"  # בר"מ 25226-04-25
+CONTEXT_MODEL = "voyage-context-3"
+BASELINE_MODEL = "voyage-3"  # already in DB
+
+QUERIES = [
+    "סמכות ועדת ערר",
+    "פיצויים לפי סעיף 197",
+    "ירידת ערך מקרקעין",
+    "תכנית פוגעת",
+    "שיקול דעת ועדה מקומית",
+    "חוות דעת שמאי מכריע",
+    "מקרקעין גובלים",
+    "תקופת התיישנות תביעה",
+    "אינטרס ציבורי בתכנון",
+    "דחיית תביעת פיצויים",
+]
+
+
+def cosine(a: list[float], b: list[float]) -> float:
+    dot = sum(x * y for x, y in zip(a, b))
+    na = math.sqrt(sum(x * x for x in a))
+    nb = math.sqrt(sum(y * y for y in b))
+    return dot / (na * nb) if na and nb else 0.0
+
+
+def parse_pgvector(s: str) -> list[float]:
+    """pgvector text format: '[0.1,0.2,...]'."""
+    return [float(x) for x in s.strip("[]").split(",")]
+
+
+async def main():
+    api_key = os.environ["VOYAGE_API_KEY"]
+    pg_pw = os.environ["POSTGRES_PASSWORD"]
+
+    voyage = voyageai.Client(api_key=api_key)
+
+    pool = await asyncpg.create_pool(
+        host="127.0.0.1", port=5433, user="legal_ai",
+        password=pg_pw, database="legal_ai",
+        min_size=1, max_size=2,
+    )
+
+    # 1. Pull all chunks + their existing voyage-3 embeddings
+    rows = await pool.fetch("""
+        SELECT chunk_index, content, embedding::text AS emb_text
+        FROM precedent_chunks
+        WHERE case_law_id = $1
+        ORDER BY chunk_index
+    """, CASE_ID)
+    print(f"[load] {len(rows)} chunks from case 403/17")
+
+    chunks = [r["content"] for r in rows]
+    indices = [r["chunk_index"] for r in rows]
+    baseline_embs = [parse_pgvector(r["emb_text"]) for r in rows]
+
+    # 2. Embed all chunks with voyage-context-3 — single contextualized call
+    total_chars = sum(len(c) for c in chunks)
+    print(f"[context] embedding {len(chunks)} chunks, {total_chars:,} chars total")
+    start = time.time()
+    result = voyage.contextualized_embed(
+        inputs=[chunks],  # one document = one inner list
+        model=CONTEXT_MODEL,
+        input_type="document",
+    )
+    elapsed = time.time() - start
+    # ContextualizedEmbeddingsObject: result.results = list of per-document
+    # embeddings. result.results[0].embeddings = list of chunk embeddings.
+    context_embs = result.results[0].embeddings
+    total_tokens = getattr(result, "total_tokens", "?")
+    print(f"[context] done in {elapsed:.1f}s — total_tokens={total_tokens}")
+    assert len(context_embs) == len(chunks), "embedding count mismatch"
+
+    # 3. For each query — embed twice and compare top-10
+    print("\n" + "=" * 100)
+    print(f"{'Q':<3} {'baseline (voyage-3)':<48} {'context-3':<48}")
+    print("=" * 100)
+
+    rank_overlaps = []
+    score_lifts = []
+
+    for q_idx, query in enumerate(QUERIES, 1):
+        # Baseline query embedding (regular embed)
+        q_baseline = voyage.embed(
+            [query], model=BASELINE_MODEL, input_type="query"
+        ).embeddings[0]
+        # Context query embedding — must use contextualized_embed even for
+        # single-string queries (regular embed() rejects voyage-context-3).
+        q_context = voyage.contextualized_embed(
+            inputs=[[query]],
+            model=CONTEXT_MODEL,
+            input_type="query",
+        ).results[0].embeddings[0]
+
+        # Score every chunk under both models
+        scores_b = sorted(
+            [(cosine(q_baseline, e), i) for i, e in enumerate(baseline_embs)],
+            reverse=True,
+        )
+        scores_c = sorted(
+            [(cosine(q_context, e), i) for i, e in enumerate(context_embs)],
+            reverse=True,
+        )
+
+        top10_b = [i for _, i in scores_b[:10]]
+        top10_c = [i for _, i in scores_c[:10]]
+
+        # Compute overlap and avg score in top-3
+        overlap = len(set(top10_b) & set(top10_c))
+        avg_b_top3 = sum(s for s, _ in scores_b[:3]) / 3
+        avg_c_top3 = sum(s for s, _ in scores_c[:3]) / 3
+        rank_overlaps.append(overlap)
+        score_lifts.append(avg_c_top3 - avg_b_top3)
+
+        print(f"\n[Q{q_idx}] {query}")
+        print(f"  overlap top-10: {overlap}/10  |  avg score top-3: "
+              f"baseline={avg_b_top3:.3f}  context-3={avg_c_top3:.3f}  "
+              f"Δ={avg_c_top3 - avg_b_top3:+.3f}")
+        for rank in range(5):
+            sb, ib = scores_b[rank]
+            sc, ic = scores_c[rank]
+            cb = chunks[ib].replace("\n", " ").strip()[:50]
+            cc = chunks[ic].replace("\n", " ").strip()[:50]
+            print(f"  #{rank+1}  [{indices[ib]:3d}] {sb:.3f} {cb:<55}  "
+                  f"|  [{indices[ic]:3d}] {sc:.3f} {cc}")
+
+    # Summary
+    print("\n" + "=" * 100)
+    print("SUMMARY")
+    print("=" * 100)
+    avg_overlap = sum(rank_overlaps) / len(rank_overlaps)
+    avg_lift = sum(score_lifts) / len(score_lifts)
+    print(f"Avg overlap top-10:  {avg_overlap:.1f}/10  "
+          f"(higher = models agree more)")
+    print(f"Avg score lift top-3 (context - baseline): {avg_lift:+.4f}")
+    print(f"\nNote: cosine scores are not directly comparable across models.")
+    print(f"What matters more is which CHUNKS bubble to the top —")
+    print(f"reading the actual content above tells the real story.")
+
+    await pool.close()
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
--- a/scripts/voyage_context3_poc_long.py
+++ b/scripts/voyage_context3_poc_long.py
@@ -0,0 +1,238 @@
+"""POC #2: voyage-3 vs voyage-context-3 on a LONG case (אהרון ברק 403/17).
+
+Case is 178K chars / 219 chunks / ~60K tokens — too big for a single
+contextualized_embed call (32K token limit per inner list). We split the
+chunks into overlapping sliding windows (~80 chunks each, ~22K tokens)
+and merge: each chunk gets the embedding from the window where it sits
+*most centrally* (max symmetric context on both sides).
+
+The hypothesis: voyage-context-3 should shine here because the case is
+full of internal references ("ראה לעיל סעיף 13", "להבדיל מעניין X",
+"תוצאת הבחינה ב-בר"מ 1975/24 שנידונה לעיל"). voyage-3 embeds chunks
+in isolation; context-3 sees ~80 surrounding chunks per embedding.
+
+No DB writes. Output: side-by-side ranking comparison + summary.
+
+Usage:
+    /home/chaim/legal-ai/mcp-server/.venv/bin/python \\
+        /home/chaim/legal-ai/scripts/voyage_context3_poc_long.py
+"""
+from __future__ import annotations
+
+import asyncio
+import math
+import os
+import sys
+import time
+
+ENV_PATH = os.path.expanduser("~/.env")
+if os.path.isfile(ENV_PATH):
+    with open(ENV_PATH) as f:
+        for line in f:
+            line = line.strip()
+            if line and not line.startswith("#") and "=" in line:
+                k, v = line.split("=", 1)
+                os.environ.setdefault(k, v)
+
+import asyncpg  # noqa: E402
+import voyageai  # noqa: E402
+
+
+CASE_ID = "e151fc25-cf12-4563-b638-a86323f8413b"  # 403/17 אהרון ברק (178K chars)
+CONTEXT_MODEL = "voyage-context-3"
+BASELINE_MODEL = "voyage-3"
+
+# Sliding-window split params. With 219 chunks and ~60K tokens total
+# (~275 tokens/chunk average), 3 windows of 80 chunks each is ~22K tokens
+# per call — comfortably under 32K.
+WINDOW_SIZE = 80
+WINDOW_STRIDE = 70  # overlap = WINDOW_SIZE - WINDOW_STRIDE = 10
+
+# Mix of:
+#   (a) generic queries (also tested in POC #1)
+#   (b) queries that require *internal* document context
+QUERIES = [
+    # generic
+    "תכנית רחביה הוראות בנייה",
+    "פיצויים לפי סעיף 197 ירידת ערך",
+    "השפעת תכנית על שווי מקרקעין",
+    "סמכות ועדת ערר לדון בפיצויים",
+    "תוספת זכויות בנייה כפיצוי",
+    # internal-context — should benefit context-3
+    "ההבחנה בין השבחה לפיצויים",
+    "מה נקבע לגבי תמ\"א 38 בפסק הדין",
+    "ההלכה שנקבעה בעניין רובע 3",
+    "כלל הנטרול של זכויות תכנוניות",
+    "הסכמת השופט אלרון לחוות הדעת",
+]
+
+
+def cosine(a: list[float], b: list[float]) -> float:
+    dot = sum(x * y for x, y in zip(a, b))
+    na = math.sqrt(sum(x * x for x in a))
+    nb = math.sqrt(sum(y * y for y in b))
+    return dot / (na * nb) if na and nb else 0.0
+
+
+def parse_pgvector(s: str) -> list[float]:
+    return [float(x) for x in s.strip("[]").split(",")]
+
+
+def build_windows(n: int, size: int, stride: int) -> list[tuple[int, int]]:
+    """Return list of (start, end) ranges (end exclusive) covering 0..n.
+
+    Last window extends to n exactly. Overlap = size - stride.
+    """
+    windows = []
+    start = 0
+    while start < n:
+        end = min(start + size, n)
+        windows.append((start, end))
+        if end == n:
+            break
+        start += stride
+    return windows
+
+
+def assign_chunk_to_window(
+    chunk_idx: int, windows: list[tuple[int, int]],
+) -> int:
+    """Pick the window where chunk_idx sits most centrally (max symmetric
+    distance to either edge). Ties broken by larger window."""
+    best = -1
+    best_score = -1
+    for w_idx, (s, e) in enumerate(windows):
+        if not (s <= chunk_idx < e):
+            continue
+        # symmetric distance: min(distance to s, distance to e-1)
+        dist = min(chunk_idx - s, (e - 1) - chunk_idx)
+        if dist > best_score:
+            best_score = dist
+            best = w_idx
+    return best
+
+
+async def main():
+    api_key = os.environ["VOYAGE_API_KEY"]
+    pg_pw = os.environ["POSTGRES_PASSWORD"]
+
+    voyage = voyageai.Client(api_key=api_key)
+
+    pool = await asyncpg.create_pool(
+        host="127.0.0.1", port=5433, user="legal_ai",
+        password=pg_pw, database="legal_ai",
+        min_size=1, max_size=2,
+    )
+
+    rows = await pool.fetch("""
+        SELECT chunk_index, content, embedding::text AS emb_text
+        FROM precedent_chunks
+        WHERE case_law_id = $1
+        ORDER BY chunk_index
+    """, CASE_ID)
+    n = len(rows)
+    print(f"[load] {n} chunks from אהרון ברק 403/17")
+
+    chunks = [r["content"] for r in rows]
+    indices = [r["chunk_index"] for r in rows]
+    baseline_embs = [parse_pgvector(r["emb_text"]) for r in rows]
+
+    # Build windows
+    windows = build_windows(n, WINDOW_SIZE, WINDOW_STRIDE)
+    print(f"[windows] {len(windows)} windows: "
+          f"{', '.join(f'[{s}:{e})' for s, e in windows)}")
+
+    # Embed each window with context-3
+    window_embs: list[list[list[float]]] = []  # [window][chunk_in_window][dim]
+    total_call_tokens = 0
+    total_start = time.time()
+    for w_idx, (s, e) in enumerate(windows):
+        sub_chunks = chunks[s:e]
+        sub_chars = sum(len(c) for c in sub_chunks)
+        start = time.time()
+        result = voyage.contextualized_embed(
+            inputs=[sub_chunks],
+            model=CONTEXT_MODEL,
+            input_type="document",
+        )
+        elapsed = time.time() - start
+        toks = getattr(result, "total_tokens", 0)
+        total_call_tokens += toks
+        print(f"  [window {w_idx}] [{s}:{e}) — {len(sub_chunks)} chunks, "
+              f"{sub_chars:,} chars, {toks} tokens — {elapsed:.1f}s")
+        window_embs.append(result.results[0].embeddings)
+    total_elapsed = time.time() - total_start
+    print(f"[context] all windows done in {total_elapsed:.1f}s, "
+          f"{total_call_tokens} total tokens")
+
+    # Merge: for each chunk, pick the embedding from its most-central window
+    context_embs: list[list[float]] = []
+    chunk_window_choice = []
+    for i in range(n):
+        w_idx = assign_chunk_to_window(i, windows)
+        chunk_window_choice.append(w_idx)
+        s, _ = windows[w_idx]
+        context_embs.append(window_embs[w_idx][i - s])
+    print(f"[merge] window distribution: "
+          f"{[chunk_window_choice.count(j) for j in range(len(windows))]}")
+
+    # Run queries
+    print("\n" + "=" * 100)
+    print(f"{'Q':<3} {'baseline (voyage-3)':<48} {'context-3 (windowed)':<48}")
+    print("=" * 100)
+
+    rank_overlaps = []
+    for q_idx, query in enumerate(QUERIES, 1):
+        q_baseline = voyage.embed(
+            [query], model=BASELINE_MODEL, input_type="query"
+        ).embeddings[0]
+        q_context = voyage.contextualized_embed(
+            inputs=[[query]],
+            model=CONTEXT_MODEL,
+            input_type="query",
+        ).results[0].embeddings[0]
+
+        scores_b = sorted(
+            [(cosine(q_baseline, e), i) for i, e in enumerate(baseline_embs)],
+            reverse=True,
+        )
+        scores_c = sorted(
+            [(cosine(q_context, e), i) for i, e in enumerate(context_embs)],
+            reverse=True,
+        )
+
+        top10_b = [i for _, i in scores_b[:10]]
+        top10_c = [i for _, i in scores_c[:10]]
+        overlap = len(set(top10_b) & set(top10_c))
+        rank_overlaps.append(overlap)
+
+        print(f"\n[Q{q_idx}] {query}")
+        print(f"  overlap top-10: {overlap}/10  |  "
+              f"avg score top-3: baseline="
+              f"{sum(s for s, _ in scores_b[:3])/3:.3f}  "
+              f"context-3={sum(s for s, _ in scores_c[:3])/3:.3f}")
+        for rank in range(5):
+            sb, ib = scores_b[rank]
+            sc, ic = scores_c[rank]
+            cb = chunks[ib].replace("\n", " ").strip()[:50]
+            cc = chunks[ic].replace("\n", " ").strip()[:50]
+            print(f"  #{rank+1}  [{indices[ib]:3d}] {sb:.3f} {cb:<55}  "
+                  f"|  [{indices[ic]:3d}] {sc:.3f} {cc}")
+
+    print("\n" + "=" * 100)
+    print("SUMMARY")
+    print("=" * 100)
+    avg = sum(rank_overlaps) / len(rank_overlaps)
+    print(f"Avg overlap top-10:  {avg:.1f}/10")
+    print(f"Per-query overlap:   {rank_overlaps}")
+    print(f"Total context-3 tokens used: {total_call_tokens:,}  "
+          f"(in {len(windows)} calls)")
+    print(f"\nNote: cosine across models not directly comparable. The")
+    print(f"meaningful test is *which chunks bubble to the top* — read")
+    print(f"the actual text above to judge relevance.")
+
+    await pool.close()
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
--- a/scripts/voyage_multimodal_poc.py
+++ b/scripts/voyage_multimodal_poc.py
@@ -0,0 +1,213 @@
+"""POC #3: voyage-3 (text) vs voyage-multimodal-3.5 (page images) on a
+real appraisal PDF (89 pages, full of tables / signatures / numerical
+data — the corpus class where multimodal should help most).
+
+Document under test:
+  baf10153-d2fc-4481-b250-9fe87440ce69
+  "נספח - שומה מכרעת (אבלין דוידזון שמאמא) - 15.09.24"
+  case 8137-24, 89 pages, 2.1 MB
+
+The pipeline:
+  1. Pull the existing voyage-3 text-chunk embeddings from `document_chunks`.
+  2. Render each PDF page → PNG (PyMuPDF, dpi=144).
+  3. Embed all pages via voyage-multimodal-3.5.
+  4. Run benchmark queries (mix of generic + table-specific + visual)
+     against both: text top-K and page top-K.
+
+The comparison is *qualitative* — text and image embeddings are
+different "spaces" returning different ID types (chunk_id vs page_num).
+What we look at is whether image-based retrieval surfaces tables,
+signatures, or numerical data that text-only OCR loses.
+
+No DB writes.
+
+Usage:
+    /home/chaim/legal-ai/mcp-server/.venv/bin/python \\
+        /home/chaim/legal-ai/scripts/voyage_multimodal_poc.py
+"""
+from __future__ import annotations
+
+import asyncio
+import io
+import math
+import os
+import time
+
+ENV_PATH = os.path.expanduser("~/.env")
+if os.path.isfile(ENV_PATH):
+    with open(ENV_PATH) as f:
+        for line in f:
+            line = line.strip()
+            if line and not line.startswith("#") and "=" in line:
+                k, v = line.split("=", 1)
+                os.environ.setdefault(k, v)
+
+import asyncpg  # noqa: E402
+import voyageai  # noqa: E402
+import fitz  # PyMuPDF  # noqa: E402
+from PIL import Image  # noqa: E402
+
+
+DOCUMENT_ID = "baf10153-d2fc-4481-b250-9fe87440ce69"
+PDF_PATH = (
+    "/home/chaim/legal-ai/data/cases/8137-24/documents/originals/"
+    "נספח - שומה מכרעת (אבלין דוידזון שמאמא) - 15.09.24.pdf"
+)
+TEXT_MODEL = "voyage-3"
+MULTIMODAL_MODEL = "voyage-multimodal-3"  # check supported: 3.5 may not exist yet
+DPI = 144
+# voyage-multimodal: max 1000 inputs/call, 320M pixels/call (rough),
+# so 89 pages at 1240×1750 ≈ 192M pixels = single call.
+
+QUERIES = [
+    # generic-textual (both should handle)
+    "שיטת ההיוון בשומה",
+    "מתודולוגיית הערכת שווי",
+    # table/numerical (multimodal should help)
+    "טבלת השוואת ערכים לפני ואחרי התכנית",
+    "שווי המקרקעין במצב הקודם",
+    "שווי המקרקעין במצב החדש",
+    "ירידת ערך באחוזים",
+    # visual elements (text-only loses)
+    "חתימת השמאי",
+    "תרשים גוש וחלקה",
+    "מפת מיקום הנכס",
+    # context-heavy
+    "מסקנת השמאי המכריע",
+    "עקרון הצפיפות בתכנית",
+]
+
+
+def cosine(a: list[float], b: list[float]) -> float:
+    dot = sum(x * y for x, y in zip(a, b))
+    na = math.sqrt(sum(x * x for x in a))
+    nb = math.sqrt(sum(y * y for y in b))
+    return dot / (na * nb) if na and nb else 0.0
+
+
+def parse_pgvector(s: str) -> list[float]:
+    return [float(x) for x in s.strip("[]").split(",")]
+
+
+def render_pdf_pages(pdf_path: str, dpi: int) -> list[Image.Image]:
+    """Render each page → PIL.Image (RGB)."""
+    doc = fitz.open(pdf_path)
+    images: list[Image.Image] = []
+    for page in doc:
+        pix = page.get_pixmap(dpi=dpi)
+        png_bytes = pix.tobytes("png")
+        img = Image.open(io.BytesIO(png_bytes)).convert("RGB")
+        images.append(img)
+    doc.close()
+    return images
+
+
+async def main():
+    api_key = os.environ["VOYAGE_API_KEY"]
+    pg_pw = os.environ["POSTGRES_PASSWORD"]
+
+    voyage = voyageai.Client(api_key=api_key)
+
+    # 1. Render PDF pages
+    print(f"[render] {PDF_PATH}")
+    start = time.time()
+    images = render_pdf_pages(PDF_PATH, DPI)
+    elapsed = time.time() - start
+    print(f"[render] {len(images)} pages in {elapsed:.1f}s, "
+          f"{images[0].size}px @ {DPI}dpi")
+
+    # 2. Pull existing text chunks + voyage-3 embeddings
+    pool = await asyncpg.create_pool(
+        host="127.0.0.1", port=5433, user="legal_ai",
+        password=pg_pw, database="legal_ai",
+        min_size=1, max_size=2,
+    )
+    rows = await pool.fetch("""
+        SELECT id, chunk_index, page_number, content,
+               embedding::text AS emb_text
+        FROM document_chunks
+        WHERE document_id = $1
+        ORDER BY chunk_index
+    """, DOCUMENT_ID)
+    print(f"[text] {len(rows)} text chunks loaded (voyage-3 in DB)")
+    text_contents = [r["content"] for r in rows]
+    text_chunk_pages = [r["page_number"] for r in rows]
+    text_embs = [parse_pgvector(r["emb_text"]) for r in rows]
+
+    # 3. Multimodal embed — try multimodal-3 first, fall back if needed
+    target_model = "voyage-multimodal-3"
+    print(f"[multimodal] embedding {len(images)} pages with {target_model}…")
+    start = time.time()
+    try:
+        mm_result = voyage.multimodal_embed(
+            inputs=[[img] for img in images],  # list of single-image inputs
+            model=target_model,
+            input_type="document",
+            truncation=True,
+        )
+    except voyageai.error.InvalidRequestError as e:
+        print(f"  [error] {e}")
+        await pool.close()
+        return
+    elapsed = time.time() - start
+    image_embs = mm_result.embeddings
+    mm_tokens = getattr(mm_result, "total_tokens", "?")
+    image_tokens = getattr(mm_result, "image_pixels", "?")
+    text_tokens_mm = getattr(mm_result, "text_tokens", "?")
+    print(f"[multimodal] done in {elapsed:.1f}s — "
+          f"total_tokens={mm_tokens}  text_tokens={text_tokens_mm}  "
+          f"image_pixels={image_tokens}")
+    assert len(image_embs) == len(images), "embedding count mismatch"
+    print(f"[multimodal] embedding dim = {len(image_embs[0])}")
+
+    # 4. Run queries
+    print("\n" + "=" * 100)
+    print("QUERY RESULTS — top-5 chunks (text/voyage-3) "
+          "vs top-5 pages (multimodal)")
+    print("=" * 100)
+
+    for q_idx, query in enumerate(QUERIES, 1):
+        # Text-side: voyage-3 query embedding
+        q_text = voyage.embed(
+            [query], model=TEXT_MODEL, input_type="query"
+        ).embeddings[0]
+        # Multimodal-side: same model, query input_type
+        q_mm = voyage.multimodal_embed(
+            inputs=[[query]],
+            model=target_model,
+            input_type="query",
+        ).embeddings[0]
+
+        text_scores = sorted(
+            [(cosine(q_text, e), i) for i, e in enumerate(text_embs)],
+            reverse=True,
+        )[:5]
+        mm_scores = sorted(
+            [(cosine(q_mm, e), i) for i, e in enumerate(image_embs)],
+            reverse=True,
+        )[:5]
+
+        print(f"\n[Q{q_idx}] {query}")
+        print(f"  --- text (voyage-3) top-5 ---")
+        for s, i in text_scores:
+            page = text_chunk_pages[i] if text_chunk_pages[i] else "?"
+            preview = text_contents[i].replace("\n", " ").strip()[:70]
+            print(f"    {s:.3f}  page={page:>3}  chunk={i:>3}  {preview}")
+        print(f"  --- multimodal (image-only) top-5 ---")
+        for s, i in mm_scores:
+            print(f"    {s:.3f}  page={i+1:>3}  (image)")
+
+    # Token / cost summary
+    print("\n" + "=" * 100)
+    print("SUMMARY")
+    print("=" * 100)
+    print(f"PDF: {len(images)} pages @ {DPI}dpi → {target_model}")
+    print(f"Total multimodal tokens: {mm_tokens}")
+    print(f"Embedding dim: {len(image_embs[0])}")
+    print(f"Time: {elapsed:.1f}s for full doc")
+
+    await pool.close()
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
--- a/scripts/voyage_rerank_corpus_poc.py
+++ b/scripts/voyage_rerank_corpus_poc.py
@@ -0,0 +1,318 @@
+"""POC #5 — full precedent_library corpus benchmark.
+
+Tests R1 (voyage-3) vs R2 (voyage-3 + rerank-2) on the *real* corpus that
+search_precedent_library queries against:
+
+  precedent_chunks  — 385 rows from 3 precedent cases
+  halachot          — 400 rule statements with reasoning summaries
+
+Total: 785 documents. The MCP tool merges results from both tables so the
+benchmark mirrors production retrieval. R3 (context-3) is dropped — it
+would require windowed re-embedding of 3 cases which we already proved
+doesn't help (POC #2). The question now is: does rerank-2's +9% on a
+single case generalize to a heterogeneous corpus?
+
+Also measures end-to-end latency: pure voyage-3 vs voyage-3 + rerank.
+
+Usage:
+    /home/chaim/legal-ai/mcp-server/.venv/bin/python \\
+        /home/chaim/legal-ai/scripts/voyage_rerank_corpus_poc.py
+"""
+from __future__ import annotations
+
+import asyncio
+import json
+import math
+import os
+import re
+import subprocess
+import sys
+import time
+from collections import defaultdict
+
+ENV_PATH = os.path.expanduser("~/.env")
+if os.path.isfile(ENV_PATH):
+    with open(ENV_PATH) as f:
+        for line in f:
+            line = line.strip()
+            if line and not line.startswith("#") and "=" in line:
+                k, v = line.split("=", 1)
+                os.environ.setdefault(k, v)
+
+import asyncpg  # noqa: E402
+import voyageai  # noqa: E402
+
+
+TEXT_MODEL = "voyage-3"
+RERANK_MODEL = "rerank-2"
+JUDGE_MODEL = "claude-haiku-4-5-20251001"
+TOP_VEC = 50  # voyage-3 retrieve depth
+TOP_K = 10    # final returned to "agent"
+JUDGE_K = 5   # how many top results to actually judge per retriever
+
+# 12 queries spanning typical use cases by Daphna's agents:
+# precedent search for citing in decision blocks י-יא.
+QUERIES = [
+    # K — keyword
+    ("K1", "פיצויים לפי סעיף 197"),
+    ("K2", "תמ\"א 38 והשבחה"),
+    ("K3", "כלל הנטרול בשמאות"),
+    # C — conceptual
+    ("C1", "תכלית היטל ההשבחה"),
+    ("C2", "מה מקנה לבעלים זכות לפיצוי"),
+    ("C3", "ההבחנה בין השבחה לפיצויים"),
+    # N — narrative / context-aware
+    ("N1", "מה נקבע לגבי תמ\"א 38 בפסיקה"),
+    ("N2", "ההלכה לעניין נטרול ציפיות"),
+    ("N3", "תכנית פוגעת ושומה"),
+    # P — practical (drafting needs — what an agent typically asks)
+    ("P1", "פסיקה שדנה בתכנית מתאר ארצית"),
+    ("P2", "מתי מותר לוועדה לדחות פיצויים"),
+    ("P3", "שיקול דעת הוועדה המקומית"),
+]
+
+
+def cosine(a, b):
+    dot = sum(x * y for x, y in zip(a, b))
+    na = math.sqrt(sum(x * x for x in a))
+    nb = math.sqrt(sum(y * y for y in b))
+    return dot / (na * nb) if na and nb else 0.0
+
+
+def parse_pgvector(s):
+    return [float(x) for x in s.strip("[]").split(",")]
+
+
+BATCH_JUDGE_PROMPT = """אתה שופט רלוונטיות במשפט ישראלי.
+לפניך שאילתה ומספר פסקאות מפסקי דין/הלכות. דרג כל פסקה 1-5 לפי רלוונטיות.
+
+5 — תשובה ישירה למה שנשאל
+4 — מאד רלוונטי, מכיל מידע ליבה
+3 — רלוונטי חלקית, נוגע בעקיפין
+2 — מעט קשור, רעש סביב הנושא
+1 — לא רלוונטי בכלל
+
+השאילתה:
+{query}
+
+הפסקאות:
+{chunks_block}
+
+החזר JSON בלבד: {{"scores": {{"<id>": <1-5>, ...}}}}
+ללא טקסט נוסף, ללא ```."""
+
+
+def batch_judge(query: str, items: list[tuple[str, str]]) -> dict[str, int]:
+    """Judge (id, text) pairs via claude CLI. Returns {id: score}."""
+    blocks = []
+    for cid, content in items:
+        snippet = content.replace("\n", " ").strip()[:1500]
+        blocks.append(f"<id={cid}>\n{snippet}\n</id>")
+    prompt = BATCH_JUDGE_PROMPT.format(
+        query=query, chunks_block="\n\n".join(blocks))
+    proc = subprocess.run(
+        ["claude", "-p", "--model", JUDGE_MODEL],
+        input=prompt, capture_output=True, text=True, timeout=180,
+    )
+    out = proc.stdout.strip()
+    out = re.sub(r"^```(?:json)?\s*", "", out)
+    out = re.sub(r"\s*```$", "", out)
+    try:
+        data = json.loads(out)
+        raw = data.get("scores", {})
+        return {str(k): int(v) for k, v in raw.items()
+                if str(v).isdigit() and 1 <= int(v) <= 5}
+    except (json.JSONDecodeError, ValueError, TypeError) as e:
+        print(f"   [judge parse fail: {e}; out={out[:200]!r}]")
+        return {}
+
+
+async def main():
+    voyage_key = os.environ["VOYAGE_API_KEY"]
+    pg_pw = os.environ["POSTGRES_PASSWORD"]
+
+    try:
+        subprocess.run(["claude", "--version"], capture_output=True,
+                       text=True, timeout=10, check=True)
+    except (subprocess.CalledProcessError, FileNotFoundError, TimeoutError):
+        sys.exit("claude CLI not found")
+
+    voyage = voyageai.Client(api_key=voyage_key)
+
+    pool = await asyncpg.create_pool(
+        host="127.0.0.1", port=5433, user="legal_ai",
+        password=pg_pw, database="legal_ai",
+        min_size=1, max_size=2,
+    )
+
+    # Load full corpus: precedent_chunks + halachot
+    pc_rows = await pool.fetch("""
+        SELECT 'pc:' || id::text AS doc_id,
+               content,
+               embedding::text AS emb_text
+        FROM precedent_chunks
+        WHERE content IS NOT NULL AND embedding IS NOT NULL
+    """)
+    h_rows = await pool.fetch("""
+        SELECT 'h:' || id::text AS doc_id,
+               TRIM(BOTH ' —' FROM rule_statement || ' — ' ||
+                    COALESCE(reasoning_summary, '')) AS content,
+               embedding::text AS emb_text
+        FROM halachot
+        WHERE rule_statement IS NOT NULL AND embedding IS NOT NULL
+    """)
+    all_rows = list(pc_rows) + list(h_rows)
+    print(f"[load] corpus: {len(pc_rows)} precedent_chunks + "
+          f"{len(h_rows)} halachot = {len(all_rows)} total")
+
+    doc_ids = [r["doc_id"] for r in all_rows]
+    contents = [r["content"] for r in all_rows]
+    embs = [parse_pgvector(r["emb_text"]) for r in all_rows]
+
+    # Latency measurement: 5 queries, time the two pipelines
+    print("\n[latency] measuring 5 sample queries…")
+    sample = QUERIES[:5]
+    r1_lat = []
+    r2_lat = []
+    for _, query in sample:
+        # R1: voyage-3 embed + cosine top-10
+        t0 = time.time()
+        q_emb = voyage.embed([query], model=TEXT_MODEL,
+                             input_type="query").embeddings[0]
+        scores = sorted([(cosine(q_emb, e), i) for i, e in enumerate(embs)],
+                        reverse=True)[:TOP_K]
+        r1_lat.append(time.time() - t0)
+        # R2: voyage-3 embed + cosine top-50 + rerank-2 → top-10
+        t0 = time.time()
+        q_emb = voyage.embed([query], model=TEXT_MODEL,
+                             input_type="query").embeddings[0]
+        cands = sorted([(cosine(q_emb, e), i) for i, e in enumerate(embs)],
+                       reverse=True)[:TOP_VEC]
+        cand_texts = [contents[i] for _, i in cands]
+        rr = voyage.rerank(query=query, documents=cand_texts,
+                           model=RERANK_MODEL, top_k=TOP_K)
+        r2_lat.append(time.time() - t0)
+    print(f"  R1 (voyage-3 only)         avg={sum(r1_lat)/5*1000:.0f}ms"
+          f"  min={min(r1_lat)*1000:.0f}  max={max(r1_lat)*1000:.0f}")
+    print(f"  R2 (voyage-3 + rerank-2)   avg={sum(r2_lat)/5*1000:.0f}ms"
+          f"  min={min(r2_lat)*1000:.0f}  max={max(r2_lat)*1000:.0f}")
+    print(f"  Δ (rerank overhead)        avg={(sum(r2_lat)-sum(r1_lat))/5*1000:.0f}ms")
+
+    # Retrieval functions
+    def r1_baseline(query: str, k: int = TOP_K) -> list[int]:
+        q = voyage.embed([query], model=TEXT_MODEL,
+                         input_type="query").embeddings[0]
+        scores = sorted([(cosine(q, e), i) for i, e in enumerate(embs)],
+                        reverse=True)
+        return [i for _, i in scores[:k]]
+
+    def r2_rerank(query: str, k: int = TOP_K) -> list[int]:
+        cands = r1_baseline(query, k=TOP_VEC)
+        cand_texts = [contents[i] for i in cands]
+        rr = voyage.rerank(query=query, documents=cand_texts,
+                           model=RERANK_MODEL, top_k=k)
+        return [cands[r.index] for r in rr.results]
+
+    retrievers = [("R1-voyage3", r1_baseline),
+                  ("R2-rerank2", r2_rerank)]
+
+    print(f"\n[judge] running {len(QUERIES)} queries × 2 retrievers, "
+          f"top-{JUDGE_K} judged…")
+
+    all_results = []
+    for qid, query in QUERIES:
+        print(f"\n[{qid}] {query}")
+        retr_results = {}
+        for r_name, r_fn in retrievers:
+            try:
+                retr_results[r_name] = r_fn(query, k=JUDGE_K)
+            except Exception as e:
+                print(f"  {r_name}: FAILED — {e}")
+                retr_results[r_name] = []
+        union = sorted({i for top in retr_results.values() for i in top})
+        items = [(doc_ids[i], contents[i]) for i in union]
+        print(f"  judging {len(items)} unique docs…")
+        scores_map = batch_judge(query, items)
+        for r_name, top in retr_results.items():
+            scores = [scores_map.get(doc_ids[i], 0) for i in top]
+            mean3 = sum(scores[:3]) / 3 if len(scores) >= 3 else 0
+            mean5 = sum(scores) / len(scores) if scores else 0
+            mrr = 0.0
+            for r, s in enumerate(scores):
+                if s >= 4:
+                    mrr = 1.0 / (r + 1)
+                    break
+            print(f"  {r_name}: doc_ids={[doc_ids[i][:14] for i in top]}  "
+                  f"scores={scores}  m@3={mean3:.2f}  m@5={mean5:.2f}  "
+                  f"MRR={mrr:.3f}")
+            all_results.append({
+                "qid": qid, "category": qid[0], "query": query,
+                "retriever": r_name,
+                "doc_ids": [doc_ids[i] for i in top],
+                "scores": scores, "mean3": mean3, "mean5": mean5, "mrr": mrr,
+            })
+
+    # Aggregate
+    print("\n" + "=" * 100)
+    print("AGGREGATED RESULTS — full precedent_library corpus (785 docs)")
+    print("=" * 100)
+    by_r = defaultdict(lambda: {"mean3": [], "mean5": [], "mrr": []})
+    by_cat_r = defaultdict(lambda: {"mean3": [], "mean5": [], "mrr": []})
+    for r in all_results:
+        by_r[r["retriever"]]["mean3"].append(r["mean3"])
+        by_r[r["retriever"]]["mean5"].append(r["mean5"])
+        by_r[r["retriever"]]["mrr"].append(r["mrr"])
+        ck = (r["category"], r["retriever"])
+        by_cat_r[ck]["mean3"].append(r["mean3"])
+        by_cat_r[ck]["mean5"].append(r["mean5"])
+        by_cat_r[ck]["mrr"].append(r["mrr"])
+
+    print(f"\nOverall ({len(QUERIES)} queries):")
+    print(f"{'retriever':<14} {'mean@3':>8} {'mean@5':>8} {'MRR':>8}")
+    avg = lambda xs: sum(xs) / len(xs) if xs else 0
+    for r_name, _ in retrievers:
+        m = by_r[r_name]
+        print(f"{r_name:<14} {avg(m['mean3']):>8.3f} "
+              f"{avg(m['mean5']):>8.3f} {avg(m['mrr']):>8.3f}")
+    # Improvement
+    r1m = avg(by_r["R1-voyage3"]["mean3"])
+    r2m = avg(by_r["R2-rerank2"]["mean3"])
+    if r1m > 0:
+        print(f"\nR2 vs R1 improvement: "
+              f"mean@3 {(r2m - r1m) / r1m * 100:+.1f}%")
+
+    print(f"\nBy category:")
+    print(f"{'cat':<3} {'retriever':<14} {'mean@3':>8} {'mean@5':>8} "
+          f"{'MRR':>8}")
+    for cat in ["K", "C", "N", "P"]:
+        for r_name, _ in retrievers:
+            m = by_cat_r[(cat, r_name)]
+            if not m["mean3"]:
+                continue
+            print(f"{cat:<3} {r_name:<14} {avg(m['mean3']):>8.3f} "
+                  f"{avg(m['mean5']):>8.3f} {avg(m['mrr']):>8.3f}")
+
+    print(f"\nPer-query winner (highest mean@3):")
+    print(f"{'qid':<4} {'query':<40} {'winner':<14} {'scores'}")
+    by_q = defaultdict(list)
+    for r in all_results:
+        by_q[r["qid"]].append(r)
+    for qid, results in sorted(by_q.items()):
+        max_s = max(r["mean3"] for r in results)
+        winners = [r["retriever"] for r in results if r["mean3"] == max_s]
+        scores = " | ".join(f"{r['retriever'][:7]}={r['mean3']:.2f}"
+                            for r in results)
+        q_str = next(q for qid_, q in QUERIES if qid_ == qid)[:38]
+        print(f"{qid:<4} {q_str:<40} {','.join(w[:8] for w in winners):<14} "
+              f"{scores}")
+
+    out_path = "/tmp/voyage_rerank_corpus_results.json"
+    with open(out_path, "w") as f:
+        json.dump(all_results, f, ensure_ascii=False, indent=2)
+    print(f"\nSaved to {out_path}")
+
+    await pool.close()
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
--- a/scripts/voyage_rerank_judge_poc.py
+++ b/scripts/voyage_rerank_judge_poc.py
@@ -0,0 +1,361 @@
+"""POC #4: Comprehensive retrieval benchmark with LLM-as-judge.
+
+Compares 3 retrievers on אהרון ברק 403/17 (219 chunks):
+  R1 — voyage-3 (current production baseline)
+  R2 — voyage-3 + voyage-rerank-2 (retrieve 50, rerank, top-10)
+  R3 — voyage-context-3 (windowed, from POC #2)
+
+Judges relevance with claude-haiku-4-5 — for each (query, chunk) pair the
+judge returns 1-5. Aggregates: mean relevance@3, @5, @10, MRR (rank of
+first 4+ chunk), per-query winner.
+
+20 queries grouped into 3 categories so we can see *which* query types
+benefit from which retriever:
+  K — keyword/lexical (term-heavy, specific entity)
+  C — conceptual (abstract idea, principle)
+  N — narrative/contextual (requires document-internal reference)
+
+Usage (key passed via env, NOT stored in script):
+    ANTHROPIC_API_KEY=... \\
+    /home/chaim/legal-ai/mcp-server/.venv/bin/python \\
+        /home/chaim/legal-ai/scripts/voyage_rerank_judge_poc.py
+"""
+from __future__ import annotations
+
+import asyncio
+import json
+import math
+import os
+import sys
+import time
+from collections import defaultdict
+
+ENV_PATH = os.path.expanduser("~/.env")
+if os.path.isfile(ENV_PATH):
+    with open(ENV_PATH) as f:
+        for line in f:
+            line = line.strip()
+            if line and not line.startswith("#") and "=" in line:
+                k, v = line.split("=", 1)
+                os.environ.setdefault(k, v)
+
+import re
+import subprocess
+
+import asyncpg  # noqa: E402
+import voyageai  # noqa: E402
+
+
+CASE_ID = "e151fc25-cf12-4563-b638-a86323f8413b"  # אהרון ברק 403/17
+TEXT_MODEL = "voyage-3"
+CONTEXT_MODEL = "voyage-context-3"
+RERANK_MODEL = "rerank-2"
+JUDGE_MODEL = "claude-haiku-4-5-20251001"
+
+WINDOW_SIZE = 80
+WINDOW_STRIDE = 70
+
+# 18 queries × 3 retrievers × top-5 = 270 judge calls. ~$0.05 with haiku.
+QUERIES = [
+    # K — keyword/lexical
+    ("K1", "תכנית רחביה הוראות בנייה"),
+    ("K2", "תמ\"א 38"),
+    ("K3", "תכנית 9988"),
+    ("K4", "סעיף 197 לחוק התכנון והבניה"),
+    ("K5", "השופט גרוסקופף"),
+    ("K6", "ועדה מקומית ירושלים"),
+    # C — conceptual / abstract principles
+    ("C1", "כלל הנטרול של זכויות תכנוניות"),
+    ("C2", "אינטרס הציבור בתכנון"),
+    ("C3", "תכלית היטל ההשבחה"),
+    ("C4", "תכנית פוגעת לעומת תכנית משביחה"),
+    ("C5", "ההבחנה בין השבחה לפיצויים"),
+    ("C6", "מהותו של היטל ההשבחה"),
+    # N — narrative / context-dependent
+    ("N1", "מה נקבע לגבי תמ\"א 38 בפסק הדין"),
+    ("N2", "מסקנת בית המשפט בעניין רובע 3"),
+    ("N3", "ההלכה שנקבעה בעניין שמעוני"),
+    ("N4", "ההבדל בין המקרה שלפנינו לעניין רון"),
+    ("N5", "סוף דבר ותוצאת פסק הדין"),
+    ("N6", "הסכמת השופטים האחרים לחוות הדעת"),
+]
+
+
+def cosine(a, b):
+    dot = sum(x * y for x, y in zip(a, b))
+    na = math.sqrt(sum(x * x for x in a))
+    nb = math.sqrt(sum(y * y for y in b))
+    return dot / (na * nb) if na and nb else 0.0
+
+
+def parse_pgvector(s):
+    return [float(x) for x in s.strip("[]").split(",")]
+
+
+def build_windows(n, size, stride):
+    out = []
+    s = 0
+    while s < n:
+        e = min(s + size, n)
+        out.append((s, e))
+        if e == n:
+            break
+        s += stride
+    return out
+
+
+def central_window(idx, windows):
+    best, best_d = -1, -1
+    for w_idx, (s, e) in enumerate(windows):
+        if not (s <= idx < e):
+            continue
+        d = min(idx - s, (e - 1) - idx)
+        if d > best_d:
+            best_d = d
+            best = w_idx
+    return best
+
+
+BATCH_JUDGE_PROMPT = """אתה שופט רלוונטיות במשפט ישראלי.
+לפניך שאילתה ומספר פסקאות מפסק דין. דרג כל פסקה בנפרד 1-5 לפי רלוונטיות.
+
+סולם:
+5 — תשובה ישירה ומדויקת לשאילתה
+4 — מאד רלוונטי, מכיל מידע ליבה
+3 — רלוונטי חלקית, נוגע בעקיפין בנושא
+2 — מעט קשור, רעש סביב הנושא
+1 — לא רלוונטי בכלל
+
+השאילתה:
+{query}
+
+הפסקאות:
+{chunks_block}
+
+החזר JSON בלבד, בפורמט: {{"scores": {{"<id>": <1-5>, ...}}}}
+ללא טקסט נוסף, ללא explanations, ללא ```."""
+
+
+def batch_judge(query: str,
+                items: list[tuple[int, str]]) -> dict[int, int]:
+    """Judge a list of (chunk_idx, content) pairs in a single CLI call.
+
+    Returns: dict[chunk_idx → score 1-5]. Returns 0 for parse failures.
+    """
+    chunks_block_lines = []
+    for ci, content in items:
+        snippet = content.replace("\n", " ").strip()[:1500]
+        chunks_block_lines.append(f"<id={ci}>\n{snippet}\n</id>")
+    prompt = BATCH_JUDGE_PROMPT.format(
+        query=query,
+        chunks_block="\n\n".join(chunks_block_lines),
+    )
+    proc = subprocess.run(
+        ["claude", "-p", "--model", JUDGE_MODEL],
+        input=prompt, capture_output=True, text=True, timeout=120,
+    )
+    out = proc.stdout.strip()
+    # Strip ```json fences if any
+    out = re.sub(r"^```(?:json)?\s*", "", out)
+    out = re.sub(r"\s*```$", "", out)
+    try:
+        data = json.loads(out)
+        raw = data.get("scores", {})
+        return {int(k): int(v) for k, v in raw.items()
+                if str(v).isdigit() and 1 <= int(v) <= 5}
+    except (json.JSONDecodeError, ValueError, TypeError) as e:
+        print(f"   [judge parse fail: {e}; out={out[:200]!r}]")
+        return {}
+
+
+async def main():
+    voyage_key = os.environ["VOYAGE_API_KEY"]
+    pg_pw = os.environ["POSTGRES_PASSWORD"]
+
+    # Verify Claude CLI is available (uses OAuth from ~/.claude/.credentials)
+    try:
+        subprocess.run(["claude", "--version"], capture_output=True,
+                       text=True, timeout=10, check=True)
+    except (subprocess.CalledProcessError, FileNotFoundError, TimeoutError):
+        sys.exit("claude CLI not found or not authenticated")
+
+    voyage = voyageai.Client(api_key=voyage_key)
+
+    # Load chunks + voyage-3 embeddings
+    pool = await asyncpg.create_pool(
+        host="127.0.0.1", port=5433, user="legal_ai",
+        password=pg_pw, database="legal_ai",
+        min_size=1, max_size=2,
+    )
+    rows = await pool.fetch("""
+        SELECT chunk_index, content, embedding::text AS emb_text
+        FROM precedent_chunks
+        WHERE case_law_id = $1
+        ORDER BY chunk_index
+    """, CASE_ID)
+    chunks = [r["content"] for r in rows]
+    chunk_indices = [r["chunk_index"] for r in rows]
+    baseline_embs = [parse_pgvector(r["emb_text"]) for r in rows]
+    n = len(chunks)
+    print(f"[load] {n} chunks loaded")
+
+    # Compute context-3 (windowed) embeddings — same as POC #2
+    windows = build_windows(n, WINDOW_SIZE, WINDOW_STRIDE)
+    print(f"[context-3] embedding {len(windows)} windows…")
+    win_embs = []
+    for s, e in windows:
+        result = voyage.contextualized_embed(
+            inputs=[chunks[s:e]],
+            model=CONTEXT_MODEL,
+            input_type="document",
+        )
+        win_embs.append(result.results[0].embeddings)
+    context_embs = []
+    for i in range(n):
+        w = central_window(i, windows)
+        s, _ = windows[w]
+        context_embs.append(win_embs[w][i - s])
+    print(f"[context-3] done")
+
+    # Retrieval functions
+    def r1_baseline(query: str, k: int = 10) -> list[int]:
+        q = voyage.embed([query], model=TEXT_MODEL,
+                         input_type="query").embeddings[0]
+        scores = sorted(
+            [(cosine(q, e), i) for i, e in enumerate(baseline_embs)],
+            reverse=True,
+        )
+        return [i for _, i in scores[:k]]
+
+    def r2_rerank(query: str, k: int = 10) -> list[int]:
+        # 1) voyage-3 retrieve top-50
+        cands = r1_baseline(query, k=50)
+        cand_texts = [chunks[i] for i in cands]
+        # 2) voyage-rerank-2 over the 50
+        rr = voyage.rerank(
+            query=query, documents=cand_texts,
+            model=RERANK_MODEL, top_k=k,
+        )
+        # rr.results: list of RerankingResult(index=..., relevance_score=...)
+        # `index` refers to position in cand_texts → map back to chunk idx
+        return [cands[r.index] for r in rr.results]
+
+    def r3_context(query: str, k: int = 10) -> list[int]:
+        q = voyage.contextualized_embed(
+            inputs=[[query]],
+            model=CONTEXT_MODEL,
+            input_type="query",
+        ).results[0].embeddings[0]
+        scores = sorted(
+            [(cosine(q, e), i) for i, e in enumerate(context_embs)],
+            reverse=True,
+        )
+        return [i for _, i in scores[:k]]
+
+    retrievers = [("R1-voyage3", r1_baseline),
+                  ("R2-rerank2", r2_rerank),
+                  ("R3-context3", r3_context)]
+
+    # Run all queries × all retrievers, judging top-5 per pair.
+    # Strategy: for each query, gather the union of all retrievers' top-K
+    # and judge them in ONE batched CLI call → 18 calls total instead of 270.
+    all_results = []
+    JUDGE_TOP_K = 5
+    print(f"\n[judge] running {len(QUERIES)} queries × "
+          f"{len(retrievers)} retrievers × top-{JUDGE_TOP_K} — batched per query…")
+
+    for qid, query in QUERIES:
+        print(f"\n[{qid}] {query}")
+        # Collect retrievals first
+        retr_results = {}
+        for r_name, r_fn in retrievers:
+            try:
+                retr_results[r_name] = r_fn(query, k=JUDGE_TOP_K)
+            except Exception as e:
+                print(f"  {r_name}: FAILED — {e}")
+                retr_results[r_name] = []
+        # Union of unique chunk indices to judge
+        union = sorted({i for top in retr_results.values() for i in top})
+        items = [(i, chunks[i]) for i in union]
+        print(f"  judging {len(items)} unique chunks via batch CLI…")
+        scores_map = batch_judge(query, items)
+        # Build per-retriever score lists
+        for r_name, top in retr_results.items():
+            scores = [scores_map.get(i, 0) for i in top]
+            mean3 = sum(scores[:3]) / 3 if len(scores) >= 3 else 0
+            mean5 = sum(scores) / len(scores) if scores else 0
+            mrr = 0.0
+            for r, s in enumerate(scores):
+                if s >= 4:
+                    mrr = 1.0 / (r + 1)
+                    break
+            print(f"  {r_name}: chunks={[chunk_indices[i] for i in top]}  "
+                  f"scores={scores}  mean@3={mean3:.2f}  mean@5={mean5:.2f}  "
+                  f"MRR={mrr:.3f}")
+            all_results.append({
+                "qid": qid, "category": qid[0], "query": query,
+                "retriever": r_name,
+                "chunks": [chunk_indices[i] for i in top],
+                "scores": scores,
+                "mean3": mean3, "mean5": mean5, "mrr": mrr,
+            })
+
+    # Aggregate
+    print("\n" + "=" * 100)
+    print("AGGREGATED RESULTS")
+    print("=" * 100)
+
+    by_retriever = defaultdict(lambda: {"mean3": [], "mean5": [], "mrr": []})
+    by_cat_retriever = defaultdict(
+        lambda: {"mean3": [], "mean5": [], "mrr": []})
+    for r in all_results:
+        by_retriever[r["retriever"]]["mean3"].append(r["mean3"])
+        by_retriever[r["retriever"]]["mean5"].append(r["mean5"])
+        by_retriever[r["retriever"]]["mrr"].append(r["mrr"])
+        cat_key = (r["category"], r["retriever"])
+        by_cat_retriever[cat_key]["mean3"].append(r["mean3"])
+        by_cat_retriever[cat_key]["mean5"].append(r["mean5"])
+        by_cat_retriever[cat_key]["mrr"].append(r["mrr"])
+
+    print("\nOverall (across all 18 queries):")
+    print(f"{'retriever':<14} {'mean@3':>8} {'mean@5':>8} {'MRR':>8}")
+    for r_name, _ in retrievers:
+        m = by_retriever[r_name]
+        avg = lambda xs: sum(xs) / len(xs) if xs else 0
+        print(f"{r_name:<14} {avg(m['mean3']):>8.3f} "
+              f"{avg(m['mean5']):>8.3f} {avg(m['mrr']):>8.3f}")
+
+    print("\nBy category (K=keyword, C=conceptual, N=narrative):")
+    print(f"{'cat':<3} {'retriever':<14} {'mean@3':>8} {'mean@5':>8} {'MRR':>8}")
+    for cat in ["K", "C", "N"]:
+        for r_name, _ in retrievers:
+            m = by_cat_retriever[(cat, r_name)]
+            avg = lambda xs: sum(xs) / len(xs) if xs else 0
+            print(f"{cat:<3} {r_name:<14} {avg(m['mean3']):>8.3f} "
+                  f"{avg(m['mean5']):>8.3f} {avg(m['mrr']):>8.3f}")
+
+    print("\nPer-query winner (highest mean@3, ties shown):")
+    print(f"{'qid':<4} {'query':<45} {'winner':<24} {'scores'}")
+    by_query = defaultdict(list)
+    for r in all_results:
+        by_query[r["qid"]].append(r)
+    for qid, results in sorted(by_query.items()):
+        max_score = max(r["mean3"] for r in results)
+        winners = [r["retriever"] for r in results if r["mean3"] == max_score]
+        scores = " | ".join(f"{r['retriever'][:7]}={r['mean3']:.2f}"
+                            for r in results)
+        q_str = next(q for qid_, q in QUERIES if qid_ == qid)[:42]
+        print(f"{qid:<4} {q_str:<45} {','.join(w[:8] for w in winners):<24} "
+              f"{scores}")
+
+    # Save raw results to JSON for further analysis
+    out_path = "/tmp/voyage_rerank_judge_results.json"
+    with open(out_path, "w") as f:
+        json.dump(all_results, f, ensure_ascii=False, indent=2)
+    print(f"\nRaw results saved to {out_path}")
+
+    await pool.close()
+
+
+if __name__ == "__main__":
+    asyncio.run(main())