feat(retrieval): add voyage rerank-2 cross-encoder stage (feature flag)

Stage B of voyage-upgrades-plan rewritten: instead of adopting context-3
(which showed inconsistent improvement across 4 POCs), add a cross-encoder
rerank layer on top of voyage-3. Default off (VOYAGE_RERANK_ENABLED=false).

POC validation (785-doc corpus, 12 queries, claude-haiku-4-5 judge):
- mean@3 +4.5% (4.306 → 4.500)
- practical-category queries +11.6% (3.78 → 4.22)
- latency +702ms per query
- no schema change, no re-embed, no double storage
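
The relative lifts quoted above follow directly from the before/after means. A minimal sketch of that arithmetic; `relative_lift` is an illustrative helper, not a function from this repository:

```python
def relative_lift(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100


# Before/after mean@3 values as quoted in the POC validation list.
overall = relative_lift(4.306, 4.500)
practical = relative_lift(3.78, 4.22)

print(f"mean@3 overall:   {overall:+.1f}%")
print(f"mean@3 practical: {practical:+.1f}%")
```

With only 12 queries (3 per category), these percentages are best read as directional rather than statistically tight.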

Plumbing:
- config: VOYAGE_RERANK_ENABLED / _MODEL / _FETCH_K env vars
- embeddings.voyage_rerank() wraps voyageai client.rerank
- services/rerank.py: maybe_rerank() helper — fetches FETCH_K candidates
  via the bi-encoder then reranks to top-K. Fail-open if Voyage rerank is
  unavailable.
- tools/search.py: search_decisions, search_case_documents,
  find_similar_cases all wrapped
- services/precedent_library.search_library wrapped
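
The fail-open behaviour of the rerank helper can be sketched roughly as follows. This is an illustration under assumptions, not the code in services/rerank.py: the signature, the injected `fetch_candidates` / `rerank_fn` callables, and the default values are all hypothetical. In the real service, `rerank_fn` would presumably wrap the voyageai client's `rerank` call and return the `index` of each result, as the POC scripts do.

```python
from typing import Callable, Sequence


def maybe_rerank(
    query: str,
    fetch_candidates: Callable[[str, int], list[str]],
    rerank_fn: Callable[[str, Sequence[str], int], list[int]],
    *,
    top_k: int = 10,
    fetch_k: int = 50,
    enabled: bool = False,
) -> list[str]:
    """Return top_k documents, reranked when enabled; fail open on errors."""
    if not enabled:
        # Flag off: plain bi-encoder retrieval, no extra latency.
        return fetch_candidates(query, top_k)
    candidates = fetch_candidates(query, max(fetch_k, top_k))
    if not candidates:
        return candidates
    try:
        # rerank_fn returns indices into `candidates`, best first.
        order = rerank_fn(query, candidates, top_k)
    except Exception:
        # Fail open: keep the bi-encoder ordering if reranking is unavailable.
        return candidates[:top_k]
    return [candidates[i] for i in order]


# Toy stand-ins for the bi-encoder search and the reranker (hypothetical).
docs = ["alpha", "beta", "gamma", "delta"]


def fake_fetch(query: str, k: int) -> list[str]:
    return docs[:k]


def reversing_reranker(query: str, documents: Sequence[str], top_k: int) -> list[int]:
    return list(range(len(documents) - 1, -1, -1))[:top_k]


def broken_reranker(query: str, documents: Sequence[str], top_k: int) -> list[int]:
    raise RuntimeError("rerank service unavailable")


reranked = maybe_rerank("q", fake_fetch, reversing_reranker, top_k=2, fetch_k=4, enabled=True)
fallback = maybe_rerank("q", fake_fetch, broken_reranker, top_k=2, fetch_k=4, enabled=True)
disabled = maybe_rerank("q", fake_fetch, reversing_reranker, top_k=2, fetch_k=4)
print(reranked, fallback, disabled)
```

Injecting the two callables keeps the helper independent of the database and of the Voyage client, which makes the fallback path easy to exercise in a unit test.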

Smoke-tested locally with flag on/off — produces expected behaviour and
latency profile. Ready for production rollout via Coolify env flip after
deploy.
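
For reference, reading the flag and its companions from the environment might look like the sketch below. The expanded names `VOYAGE_RERANK_MODEL` and `VOYAGE_RERANK_FETCH_K` are inferred from the shorthand in the Plumbing list, and the defaults for model and fetch depth are assumptions borrowed from the POC scripts (`rerank-2`, 50 candidates); only the default-off flag is stated in this commit. `RerankConfig` and `load_rerank_config` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class RerankConfig:
    enabled: bool
    model: str
    fetch_k: int


def load_rerank_config(env: Optional[Mapping[str, str]] = None) -> RerankConfig:
    """Build the rerank settings from environment variables."""
    env = os.environ if env is None else env
    raw_flag = env.get("VOYAGE_RERANK_ENABLED", "false")
    return RerankConfig(
        enabled=raw_flag.strip().lower() in TRUTHY,
        model=env.get("VOYAGE_RERANK_MODEL", "rerank-2"),        # assumed default
        fetch_k=int(env.get("VOYAGE_RERANK_FETCH_K", "50")),     # assumed default
    )


default_cfg = load_rerank_config({})
flipped_cfg = load_rerank_config(
    {"VOYAGE_RERANK_ENABLED": "true", "VOYAGE_RERANK_FETCH_K": "30"}
)
print(default_cfg)
print(flipped_cfg)
```

Accepting an explicit mapping makes the on/off smoke test reproducible without touching the process environment.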

POCs (kept under scripts/ for reference):
- voyage_context3_poc{_long}.py — context-3 evaluation (rejected)
- voyage_multimodal_poc.py — multimodal-3 (stage C, deferred)
- voyage_rerank_judge_poc.py — single-case rerank benchmark
- voyage_rerank_corpus_poc.py — full-corpus rerank validation

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:43:41 +00:00
parent 688ba37d9c
commit 26c3fddf41
13 changed files with 1578 additions and 100 deletions


@@ -17,6 +17,11 @@
| `deploy-track-changes.sh` | bash | Sync skills CMP↔CMPA + checks + deploy instructions for the Track Changes architecture | Manual |
| `retrofit_case.py` | python | Retroactive retrofit: injects bookmarks into an existing file of a specific case and sets it as active_draft | Manual (one-time per case) |
| `reembed_voyage.py` | python | Re-embed all vectors in the DB with the model in `VOYAGE_MODEL` (after a model change). 5 tables, 1024 dims, batches of 100. See `docs/voyage-upgrades-plan.md` | Manual (after changing `VOYAGE_MODEL`) |
| `voyage_context3_poc.py` | python | POC #1: voyage-3 vs voyage-context-3 on one short ruling (קלמנוביץ, 63 chunks). Verdict: context-3 shows no consistent improvement | One-off benchmark, kept for reference |
| `voyage_context3_poc_long.py` | python | POC #2: voyage-context-3 on a long ruling (אהרון ברק, 219 chunks) with sliding windows. Verdict: context-3 does not improve on a large ruling | One-off benchmark, kept for reference |
| `voyage_multimodal_poc.py` | python | POC #3: voyage-multimodal-3 on an appraiser's report (89 pages). Verdict: significant improvement for tables + 22 image-only pages that text OCR loses | One-off benchmark, ready for stage C |
| `voyage_rerank_judge_poc.py` | python | POC #4: voyage-3 vs rerank-2 vs context-3 on אהרון ברק, 18 queries, claude-haiku-4-5 as judge. Verdict: rerank-2 won with +9% mean@3 | One-off benchmark |
| `voyage_rerank_corpus_poc.py` | python | POC #5: voyage-3 vs rerank-2 on the full corpus (785 docs). Verdict: +4.5% mean@3 overall, +11.6% on P (practical) queries | One-off benchmark, confirmed stage B |
## `.archive/` directory: completed scripts


@@ -0,0 +1,182 @@
"""POC: Compare voyage-3 vs voyage-context-3 retrieval on one short case.

Pulls all chunks of קלמנוביץ/לויתן (בר"מ 25226-04-25, see CASE_ID below),
runs them through voyage-context-3 in a single contextualized_embed call,
then runs benchmark queries and compares rankings against the existing
voyage-3 embeddings (already in the DB).

No DB writes; all comparisons are in memory. Output: for each query, a
side-by-side ranking table (top-5 rows shown, overlap computed on top-10).

Usage:
    /home/chaim/legal-ai/mcp-server/.venv/bin/python \\
        /home/chaim/legal-ai/scripts/voyage_context3_poc.py
"""
from __future__ import annotations

import asyncio
import math
import os
import sys
import time

# Load ~/.env
ENV_PATH = os.path.expanduser("~/.env")
if os.path.isfile(ENV_PATH):
    with open(ENV_PATH) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                k, v = line.split("=", 1)
                os.environ.setdefault(k, v)

import asyncpg  # noqa: E402
import voyageai  # noqa: E402

# Using קלמנוביץ/לויתן (52K chars, 63 chunks, ~18K tokens)
# — fits in single context-3 call (32K token limit per inner list).
# אהרון ברק (60K tokens) requires splitting; we'll handle that after POC.
CASE_ID = "436efd48-c8ab-49f0-b3a9-52bf15ea806d" # בר"מ 25226-04-25
CONTEXT_MODEL = "voyage-context-3"
BASELINE_MODEL = "voyage-3" # already in DB
QUERIES = [
    "סמכות ועדת ערר",
    "פיצויים לפי סעיף 197",
    "ירידת ערך מקרקעין",
    "תכנית פוגעת",
    "שיקול דעת ועדה מקומית",
    "חוות דעת שמאי מכריע",
    "מקרקעין גובלים",
    "תקופת התיישנות תביעה",
    "אינטרס ציבורי בתכנון",
    "דחיית תביעת פיצויים",
]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def parse_pgvector(s: str) -> list[float]:
    """pgvector text format: '[0.1,0.2,...]'."""
    return [float(x) for x in s.strip("[]").split(",")]


async def main():
    api_key = os.environ["VOYAGE_API_KEY"]
    pg_pw = os.environ["POSTGRES_PASSWORD"]

    voyage = voyageai.Client(api_key=api_key)
    pool = await asyncpg.create_pool(
        host="127.0.0.1", port=5433, user="legal_ai",
        password=pg_pw, database="legal_ai",
        min_size=1, max_size=2,
    )

    # 1. Pull all chunks + their existing voyage-3 embeddings
    rows = await pool.fetch("""
        SELECT chunk_index, content, embedding::text AS emb_text
        FROM precedent_chunks
        WHERE case_law_id = $1
        ORDER BY chunk_index
    """, CASE_ID)
    # Report the case actually loaded (CASE_ID), not the 403/17 case of POC #2.
    print(f"[load] {len(rows)} chunks from case {CASE_ID}")

    chunks = [r["content"] for r in rows]
    indices = [r["chunk_index"] for r in rows]
    baseline_embs = [parse_pgvector(r["emb_text"]) for r in rows]

    # 2. Embed all chunks with voyage-context-3 — single contextualized call
    total_chars = sum(len(c) for c in chunks)
    print(f"[context] embedding {len(chunks)} chunks, {total_chars:,} chars total")
    start = time.time()
    result = voyage.contextualized_embed(
        inputs=[chunks],  # one document = one inner list
        model=CONTEXT_MODEL,
        input_type="document",
    )
    elapsed = time.time() - start
    # ContextualizedEmbeddingsObject: result.results = list of per-document
    # embeddings. result.results[0].embeddings = list of chunk embeddings.
    context_embs = result.results[0].embeddings
    total_tokens = getattr(result, "total_tokens", "?")
    print(f"[context] done in {elapsed:.1f}s — total_tokens={total_tokens}")
    assert len(context_embs) == len(chunks), "embedding count mismatch"

    # 3. For each query — embed twice and compare top-10
    print("\n" + "=" * 100)
    print(f"{'Q':<3} {'baseline (voyage-3)':<48} {'context-3':<48}")
    print("=" * 100)

    rank_overlaps = []
    score_lifts = []

    for q_idx, query in enumerate(QUERIES, 1):
        # Baseline query embedding (regular embed)
        q_baseline = voyage.embed(
            [query], model=BASELINE_MODEL, input_type="query"
        ).embeddings[0]

        # Context query embedding — must use contextualized_embed even for
        # single-string queries (regular embed() rejects voyage-context-3).
        q_context = voyage.contextualized_embed(
            inputs=[[query]],
            model=CONTEXT_MODEL,
            input_type="query",
        ).results[0].embeddings[0]

        # Score every chunk under both models
        scores_b = sorted(
            [(cosine(q_baseline, e), i) for i, e in enumerate(baseline_embs)],
            reverse=True,
        )
        scores_c = sorted(
            [(cosine(q_context, e), i) for i, e in enumerate(context_embs)],
            reverse=True,
        )

        top10_b = [i for _, i in scores_b[:10]]
        top10_c = [i for _, i in scores_c[:10]]

        # Compute overlap and avg score in top-3
        overlap = len(set(top10_b) & set(top10_c))
        avg_b_top3 = sum(s for s, _ in scores_b[:3]) / 3
        avg_c_top3 = sum(s for s, _ in scores_c[:3]) / 3
        rank_overlaps.append(overlap)
        score_lifts.append(avg_c_top3 - avg_b_top3)

        print(f"\n[Q{q_idx}] {query}")
        print(f" overlap top-10: {overlap}/10 | avg score top-3: "
              f"baseline={avg_b_top3:.3f} context-3={avg_c_top3:.3f} "
              f"Δ={avg_c_top3 - avg_b_top3:+.3f}")
        for rank in range(5):
            sb, ib = scores_b[rank]
            sc, ic = scores_c[rank]
            cb = chunks[ib].replace("\n", " ").strip()[:50]
            cc = chunks[ic].replace("\n", " ").strip()[:50]
            print(f" #{rank+1} [{indices[ib]:3d}] {sb:.3f} {cb:<55} "
                  f"| [{indices[ic]:3d}] {sc:.3f} {cc}")

    # Summary
    print("\n" + "=" * 100)
    print("SUMMARY")
    print("=" * 100)
    avg_overlap = sum(rank_overlaps) / len(rank_overlaps)
    avg_lift = sum(score_lifts) / len(score_lifts)
    print(f"Avg overlap top-10: {avg_overlap:.1f}/10 "
          f"(higher = models agree more)")
    print(f"Avg score lift top-3 (context - baseline): {avg_lift:+.4f}")
    print("\nNote: cosine scores are not directly comparable across models.")
    print("What matters more is which CHUNKS bubble to the top —")
    print("reading the actual content above tells the real story.")

    await pool.close()


if __name__ == "__main__":
    asyncio.run(main())


@@ -0,0 +1,238 @@
"""POC #2: voyage-3 vs voyage-context-3 on a LONG case (אהרון ברק 403/17).
Case is 178K chars / 219 chunks / ~60K tokens — too big for a single
contextualized_embed call (32K token limit per inner list). We split the
chunks into overlapping sliding windows (~80 chunks each, ~22K tokens)
and merge: each chunk gets the embedding from the window where it sits
*most centrally* (max symmetric context on both sides).
The hypothesis: voyage-context-3 should shine here because the case is
full of internal references ("ראה לעיל סעיף 13", "להבדיל מעניין X",
"תוצאת הבחינה ב-בר"מ 1975/24 שנידונה לעיל"). voyage-3 embeds chunks
in isolation; context-3 sees ~80 surrounding chunks per embedding.
No DB writes. Output: side-by-side ranking comparison + summary.
Usage:
    /home/chaim/legal-ai/mcp-server/.venv/bin/python \\
        /home/chaim/legal-ai/scripts/voyage_context3_poc_long.py
"""
from __future__ import annotations

import asyncio
import math
import os
import sys
import time

ENV_PATH = os.path.expanduser("~/.env")
if os.path.isfile(ENV_PATH):
    with open(ENV_PATH) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                k, v = line.split("=", 1)
                os.environ.setdefault(k, v)

import asyncpg  # noqa: E402
import voyageai  # noqa: E402

CASE_ID = "e151fc25-cf12-4563-b638-a86323f8413b" # 403/17 אהרון ברק (178K chars)
CONTEXT_MODEL = "voyage-context-3"
BASELINE_MODEL = "voyage-3"
# Sliding-window split params. With 219 chunks and ~60K tokens total
# (~275 tokens/chunk average), 3 windows of 80 chunks each is ~22K tokens
# per call — comfortably under 32K.
WINDOW_SIZE = 80
WINDOW_STRIDE = 70 # overlap = WINDOW_SIZE - WINDOW_STRIDE = 10
# Mix of:
# (a) generic queries (also tested in POC #1)
# (b) queries that require *internal* document context
QUERIES = [
    # generic
    "תכנית רחביה הוראות בנייה",
    "פיצויים לפי סעיף 197 ירידת ערך",
    "השפעת תכנית על שווי מקרקעין",
    "סמכות ועדת ערר לדון בפיצויים",
    "תוספת זכויות בנייה כפיצוי",
    # internal-context — should benefit context-3
    "ההבחנה בין השבחה לפיצויים",
    "מה נקבע לגבי תמ\"א 38 בפסק הדין",
    "ההלכה שנקבעה בעניין רובע 3",
    "כלל הנטרול של זכויות תכנוניות",
    "הסכמת השופט אלרון לחוות הדעת",
]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def parse_pgvector(s: str) -> list[float]:
    return [float(x) for x in s.strip("[]").split(",")]


def build_windows(n: int, size: int, stride: int) -> list[tuple[int, int]]:
    """Return list of (start, end) ranges (end exclusive) covering 0..n.

    Last window extends to n exactly. Overlap = size - stride.
    """
    windows = []
    start = 0
    while start < n:
        end = min(start + size, n)
        windows.append((start, end))
        if end == n:
            break
        start += stride
    return windows


def assign_chunk_to_window(
    chunk_idx: int, windows: list[tuple[int, int]],
) -> int:
    """Pick the window where chunk_idx sits most centrally (max symmetric
    distance to either edge). Ties go to the earlier window, because only a
    strictly larger distance replaces the current best."""
    best = -1
    best_score = -1
    for w_idx, (s, e) in enumerate(windows):
        if not (s <= chunk_idx < e):
            continue
        # symmetric distance: min(distance to s, distance to e-1)
        dist = min(chunk_idx - s, (e - 1) - chunk_idx)
        if dist > best_score:
            best_score = dist
            best = w_idx
    return best


async def main():
    api_key = os.environ["VOYAGE_API_KEY"]
    pg_pw = os.environ["POSTGRES_PASSWORD"]

    voyage = voyageai.Client(api_key=api_key)
    pool = await asyncpg.create_pool(
        host="127.0.0.1", port=5433, user="legal_ai",
        password=pg_pw, database="legal_ai",
        min_size=1, max_size=2,
    )

    rows = await pool.fetch("""
        SELECT chunk_index, content, embedding::text AS emb_text
        FROM precedent_chunks
        WHERE case_law_id = $1
        ORDER BY chunk_index
    """, CASE_ID)
    n = len(rows)
    print(f"[load] {n} chunks from אהרון ברק 403/17")

    chunks = [r["content"] for r in rows]
    indices = [r["chunk_index"] for r in rows]
    baseline_embs = [parse_pgvector(r["emb_text"]) for r in rows]

    # Build windows
    windows = build_windows(n, WINDOW_SIZE, WINDOW_STRIDE)
    print(f"[windows] {len(windows)} windows: "
          f"{', '.join(f'[{s}:{e})' for s, e in windows)}")

    # Embed each window with context-3
    window_embs: list[list[list[float]]] = []  # [window][chunk_in_window][dim]
    total_call_tokens = 0
    total_start = time.time()
    for w_idx, (s, e) in enumerate(windows):
        sub_chunks = chunks[s:e]
        sub_chars = sum(len(c) for c in sub_chunks)
        start = time.time()
        result = voyage.contextualized_embed(
            inputs=[sub_chunks],
            model=CONTEXT_MODEL,
            input_type="document",
        )
        elapsed = time.time() - start
        toks = getattr(result, "total_tokens", 0)
        total_call_tokens += toks
        print(f" [window {w_idx}] [{s}:{e}) — {len(sub_chunks)} chunks, "
              f"{sub_chars:,} chars, {toks} tokens — {elapsed:.1f}s")
        window_embs.append(result.results[0].embeddings)
    total_elapsed = time.time() - total_start
    print(f"[context] all windows done in {total_elapsed:.1f}s, "
          f"{total_call_tokens} total tokens")

    # Merge: for each chunk, pick the embedding from its most-central window
    context_embs: list[list[float]] = []
    chunk_window_choice = []
    for i in range(n):
        w_idx = assign_chunk_to_window(i, windows)
        chunk_window_choice.append(w_idx)
        s, _ = windows[w_idx]
        context_embs.append(window_embs[w_idx][i - s])
    print(f"[merge] window distribution: "
          f"{[chunk_window_choice.count(j) for j in range(len(windows))]}")

    # Run queries
    print("\n" + "=" * 100)
    print(f"{'Q':<3} {'baseline (voyage-3)':<48} {'context-3 (windowed)':<48}")
    print("=" * 100)

    rank_overlaps = []
    for q_idx, query in enumerate(QUERIES, 1):
        q_baseline = voyage.embed(
            [query], model=BASELINE_MODEL, input_type="query"
        ).embeddings[0]
        q_context = voyage.contextualized_embed(
            inputs=[[query]],
            model=CONTEXT_MODEL,
            input_type="query",
        ).results[0].embeddings[0]

        scores_b = sorted(
            [(cosine(q_baseline, e), i) for i, e in enumerate(baseline_embs)],
            reverse=True,
        )
        scores_c = sorted(
            [(cosine(q_context, e), i) for i, e in enumerate(context_embs)],
            reverse=True,
        )

        top10_b = [i for _, i in scores_b[:10]]
        top10_c = [i for _, i in scores_c[:10]]
        overlap = len(set(top10_b) & set(top10_c))
        rank_overlaps.append(overlap)

        print(f"\n[Q{q_idx}] {query}")
        print(f" overlap top-10: {overlap}/10 | "
              f"avg score top-3: baseline="
              f"{sum(s for s, _ in scores_b[:3])/3:.3f} "
              f"context-3={sum(s for s, _ in scores_c[:3])/3:.3f}")
        for rank in range(5):
            sb, ib = scores_b[rank]
            sc, ic = scores_c[rank]
            cb = chunks[ib].replace("\n", " ").strip()[:50]
            cc = chunks[ic].replace("\n", " ").strip()[:50]
            print(f" #{rank+1} [{indices[ib]:3d}] {sb:.3f} {cb:<55} "
                  f"| [{indices[ic]:3d}] {sc:.3f} {cc}")

    print("\n" + "=" * 100)
    print("SUMMARY")
    print("=" * 100)
    avg = sum(rank_overlaps) / len(rank_overlaps)
    print(f"Avg overlap top-10: {avg:.1f}/10")
    print(f"Per-query overlap: {rank_overlaps}")
    print(f"Total context-3 tokens used: {total_call_tokens:,} "
          f"(in {len(windows)} calls)")
    print("\nNote: cosine across models not directly comparable. The")
    print("meaningful test is *which chunks bubble to the top* — read")
    print("the actual text above to judge relevance.")

    await pool.close()


if __name__ == "__main__":
    asyncio.run(main())


@@ -0,0 +1,213 @@
"""POC #3: voyage-3 (text) vs voyage-multimodal-3 (page images) on a
real appraisal PDF (89 pages, full of tables / signatures / numerical
data: the corpus class where multimodal should help most).

Document under test:
    baf10153-d2fc-4481-b250-9fe87440ce69
    "נספח - שומה מכרעת (אבלין דוידזון שמאמא) - 15.09.24"
    case 8137-24, 89 pages, 2.1 MB

The pipeline:
    1. Pull the existing voyage-3 text-chunk embeddings from `document_chunks`.
    2. Render each PDF page → PNG (PyMuPDF, dpi=144).
    3. Embed all pages via voyage-multimodal-3.
    4. Run benchmark queries (mix of generic + table-specific + visual)
       against both: text top-K and page top-K.

The comparison is *qualitative*: text and image embeddings are
different "spaces" returning different ID types (chunk_id vs page_num).
What we look at is whether image-based retrieval surfaces tables,
signatures, or numerical data that text-only OCR loses.

No DB writes.

Usage:
    /home/chaim/legal-ai/mcp-server/.venv/bin/python \\
        /home/chaim/legal-ai/scripts/voyage_multimodal_poc.py
"""
from __future__ import annotations

import asyncio
import io
import math
import os
import time

ENV_PATH = os.path.expanduser("~/.env")
if os.path.isfile(ENV_PATH):
    with open(ENV_PATH) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                k, v = line.split("=", 1)
                os.environ.setdefault(k, v)

import asyncpg  # noqa: E402
import voyageai  # noqa: E402
import fitz  # PyMuPDF  # noqa: E402
from PIL import Image  # noqa: E402

DOCUMENT_ID = "baf10153-d2fc-4481-b250-9fe87440ce69"
PDF_PATH = (
    "/home/chaim/legal-ai/data/cases/8137-24/documents/originals/"
    "נספח - שומה מכרעת (אבלין דוידזון שמאמא) - 15.09.24.pdf"
)
TEXT_MODEL = "voyage-3"
MULTIMODAL_MODEL = "voyage-multimodal-3"  # check supported: 3.5 may not exist yet
DPI = 144
# voyage-multimodal: max 1000 inputs/call, 320M pixels/call (rough),
# so 89 pages at 1240×1750 ≈ 192M pixels = single call.

QUERIES = [
    # generic-textual (both should handle)
    "שיטת ההיוון בשומה",
    "מתודולוגיית הערכת שווי",
    # table/numerical (multimodal should help)
    "טבלת השוואת ערכים לפני ואחרי התכנית",
    "שווי המקרקעין במצב הקודם",
    "שווי המקרקעין במצב החדש",
    "ירידת ערך באחוזים",
    # visual elements (text-only loses)
    "חתימת השמאי",
    "תרשים גוש וחלקה",
    "מפת מיקום הנכס",
    # context-heavy
    "מסקנת השמאי המכריע",
    "עקרון הצפיפות בתכנית",
]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def parse_pgvector(s: str) -> list[float]:
    return [float(x) for x in s.strip("[]").split(",")]


def render_pdf_pages(pdf_path: str, dpi: int) -> list[Image.Image]:
    """Render each page → PIL.Image (RGB)."""
    doc = fitz.open(pdf_path)
    images: list[Image.Image] = []
    for page in doc:
        pix = page.get_pixmap(dpi=dpi)
        png_bytes = pix.tobytes("png")
        img = Image.open(io.BytesIO(png_bytes)).convert("RGB")
        images.append(img)
    doc.close()
    return images


async def main():
    api_key = os.environ["VOYAGE_API_KEY"]
    pg_pw = os.environ["POSTGRES_PASSWORD"]
    voyage = voyageai.Client(api_key=api_key)

    # 1. Render PDF pages
    print(f"[render] {PDF_PATH}")
    start = time.time()
    images = render_pdf_pages(PDF_PATH, DPI)
    elapsed = time.time() - start
    print(f"[render] {len(images)} pages in {elapsed:.1f}s, "
          f"{images[0].size}px @ {DPI}dpi")

    # 2. Pull existing text chunks + voyage-3 embeddings
    pool = await asyncpg.create_pool(
        host="127.0.0.1", port=5433, user="legal_ai",
        password=pg_pw, database="legal_ai",
        min_size=1, max_size=2,
    )
    rows = await pool.fetch("""
        SELECT id, chunk_index, page_number, content,
               embedding::text AS emb_text
        FROM document_chunks
        WHERE document_id = $1
        ORDER BY chunk_index
    """, DOCUMENT_ID)
    print(f"[text] {len(rows)} text chunks loaded (voyage-3 in DB)")
    text_contents = [r["content"] for r in rows]
    text_chunk_pages = [r["page_number"] for r in rows]
    text_embs = [parse_pgvector(r["emb_text"]) for r in rows]

    # 3. Multimodal embed. No fallback model is wired in: if the request is
    #    rejected, the error is printed and the script exits.
    target_model = MULTIMODAL_MODEL
    print(f"[multimodal] embedding {len(images)} pages with {target_model}")
    start = time.time()
    try:
        mm_result = voyage.multimodal_embed(
            inputs=[[img] for img in images],  # list of single-image inputs
            model=target_model,
            input_type="document",
            truncation=True,
        )
    except voyageai.error.InvalidRequestError as e:
        print(f" [error] {e}")
        await pool.close()
        return
    elapsed = time.time() - start

    image_embs = mm_result.embeddings
    mm_tokens = getattr(mm_result, "total_tokens", "?")
    image_pixels = getattr(mm_result, "image_pixels", "?")
    text_tokens_mm = getattr(mm_result, "text_tokens", "?")
    print(f"[multimodal] done in {elapsed:.1f}s — "
          f"total_tokens={mm_tokens} text_tokens={text_tokens_mm} "
          f"image_pixels={image_pixels}")
    assert len(image_embs) == len(images), "embedding count mismatch"
    print(f"[multimodal] embedding dim = {len(image_embs[0])}")

    # 4. Run queries
    print("\n" + "=" * 100)
    print("QUERY RESULTS — top-5 chunks (text/voyage-3) "
          "vs top-5 pages (multimodal)")
    print("=" * 100)

    for q_idx, query in enumerate(QUERIES, 1):
        # Text-side: voyage-3 query embedding
        q_text = voyage.embed(
            [query], model=TEXT_MODEL, input_type="query"
        ).embeddings[0]
        # Multimodal-side: same model, query input_type
        q_mm = voyage.multimodal_embed(
            inputs=[[query]],
            model=target_model,
            input_type="query",
        ).embeddings[0]

        text_scores = sorted(
            [(cosine(q_text, e), i) for i, e in enumerate(text_embs)],
            reverse=True,
        )[:5]
        mm_scores = sorted(
            [(cosine(q_mm, e), i) for i, e in enumerate(image_embs)],
            reverse=True,
        )[:5]

        print(f"\n[Q{q_idx}] {query}")
        print(" --- text (voyage-3) top-5 ---")
        for s, i in text_scores:
            page = text_chunk_pages[i] if text_chunk_pages[i] else "?"
            preview = text_contents[i].replace("\n", " ").strip()[:70]
            print(f" {s:.3f} page={page:>3} chunk={i:>3} {preview}")
        print(" --- multimodal (image-only) top-5 ---")
        for s, i in mm_scores:
            print(f" {s:.3f} page={i+1:>3} (image)")

    # Token / cost summary
    print("\n" + "=" * 100)
    print("SUMMARY")
    print("=" * 100)
    print(f"PDF: {len(images)} pages @ {DPI}dpi → {target_model}")
    print(f"Total multimodal tokens: {mm_tokens}")
    print(f"Embedding dim: {len(image_embs[0])}")
    print(f"Time: {elapsed:.1f}s for full doc")

    await pool.close()


if __name__ == "__main__":
    asyncio.run(main())


@@ -0,0 +1,318 @@
"""POC #5 — full precedent_library corpus benchmark.

Tests R1 (voyage-3) vs R2 (voyage-3 + rerank-2) on the *real* corpus that
search_precedent_library queries against:
    precedent_chunks — 385 rows from 3 precedent cases
    halachot — 400 rule statements with reasoning summaries

Total: 785 documents. The MCP tool merges results from both tables so the
benchmark mirrors production retrieval. R3 (context-3) is dropped — it
would require windowed re-embedding of 3 cases which we already proved
doesn't help (POC #2). The question now is: does rerank-2's +9% on a
single case generalize to a heterogeneous corpus?

Also measures end-to-end latency: pure voyage-3 vs voyage-3 + rerank.

Usage:
    /home/chaim/legal-ai/mcp-server/.venv/bin/python \\
        /home/chaim/legal-ai/scripts/voyage_rerank_corpus_poc.py
"""
from __future__ import annotations

import asyncio
import json
import math
import os
import re
import subprocess
import sys
import time
from collections import defaultdict

ENV_PATH = os.path.expanduser("~/.env")
if os.path.isfile(ENV_PATH):
    with open(ENV_PATH) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                k, v = line.split("=", 1)
                os.environ.setdefault(k, v)

import asyncpg  # noqa: E402
import voyageai  # noqa: E402

TEXT_MODEL = "voyage-3"
RERANK_MODEL = "rerank-2"
JUDGE_MODEL = "claude-haiku-4-5-20251001"
TOP_VEC = 50 # voyage-3 retrieve depth
TOP_K = 10 # final returned to "agent"
JUDGE_K = 5 # how many top results to actually judge per retriever
# 12 queries spanning typical use cases by Daphna's agents:
# precedent search for citing in decision blocks י-יא.
QUERIES = [
    # K — keyword
    ("K1", "פיצויים לפי סעיף 197"),
    ("K2", "תמ\"א 38 והשבחה"),
    ("K3", "כלל הנטרול בשמאות"),
    # C — conceptual
    ("C1", "תכלית היטל ההשבחה"),
    ("C2", "מה מקנה לבעלים זכות לפיצוי"),
    ("C3", "ההבחנה בין השבחה לפיצויים"),
    # N — narrative / context-aware
    ("N1", "מה נקבע לגבי תמ\"א 38 בפסיקה"),
    ("N2", "ההלכה לעניין נטרול ציפיות"),
    ("N3", "תכנית פוגעת ושומה"),
    # P — practical (drafting needs — what an agent typically asks)
    ("P1", "פסיקה שדנה בתכנית מתאר ארצית"),
    ("P2", "מתי מותר לוועדה לדחות פיצויים"),
    ("P3", "שיקול דעת הוועדה המקומית"),
]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def parse_pgvector(s):
    return [float(x) for x in s.strip("[]").split(",")]


BATCH_JUDGE_PROMPT = """אתה שופט רלוונטיות במשפט ישראלי.
לפניך שאילתה ומספר פסקאות מפסקי דין/הלכות. דרג כל פסקה 1-5 לפי רלוונטיות.
5 — תשובה ישירה למה שנשאל
4 — מאד רלוונטי, מכיל מידע ליבה
3 — רלוונטי חלקית, נוגע בעקיפין
2 — מעט קשור, רעש סביב הנושא
1 — לא רלוונטי בכלל
השאילתה:
{query}
הפסקאות:
{chunks_block}
החזר JSON בלבד: {{"scores": {{"<id>": <1-5>, ...}}}}
ללא טקסט נוסף, ללא ```."""


def batch_judge(query: str, items: list[tuple[str, str]]) -> dict[str, int]:
    """Judge (id, text) pairs via claude CLI. Returns {id: score}."""
    blocks = []
    for cid, content in items:
        snippet = content.replace("\n", " ").strip()[:1500]
        blocks.append(f"<id={cid}>\n{snippet}\n</id>")
    prompt = BATCH_JUDGE_PROMPT.format(
        query=query, chunks_block="\n\n".join(blocks))
    proc = subprocess.run(
        ["claude", "-p", "--model", JUDGE_MODEL],
        input=prompt, capture_output=True, text=True, timeout=180,
    )
    out = proc.stdout.strip()
    # Strip a leading/trailing markdown fence if the judge added one anyway.
    out = re.sub(r"^```(?:json)?\s*", "", out)
    out = re.sub(r"\s*```$", "", out)
    try:
        data = json.loads(out)
        raw = data.get("scores", {})
        return {str(k): int(v) for k, v in raw.items()
                if str(v).isdigit() and 1 <= int(v) <= 5}
    except (json.JSONDecodeError, ValueError, TypeError, AttributeError) as e:
        # AttributeError covers valid JSON that is not an object (e.g. a list).
        print(f" [judge parse fail: {e}; out={out[:200]!r}]")
        return {}


async def main():
voyage_key = os.environ["VOYAGE_API_KEY"]
pg_pw = os.environ["POSTGRES_PASSWORD"]
try:
subprocess.run(["claude", "--version"], capture_output=True,
text=True, timeout=10, check=True)
except (subprocess.CalledProcessError, FileNotFoundError, TimeoutError):
sys.exit("claude CLI not found")
voyage = voyageai.Client(api_key=voyage_key)
pool = await asyncpg.create_pool(
host="127.0.0.1", port=5433, user="legal_ai",
password=pg_pw, database="legal_ai",
min_size=1, max_size=2,
)
# Load full corpus: precedent_chunks + halachot
pc_rows = await pool.fetch("""
SELECT 'pc:' || id::text AS doc_id,
content,
embedding::text AS emb_text
FROM precedent_chunks
WHERE content IS NOT NULL AND embedding IS NOT NULL
""")
h_rows = await pool.fetch("""
SELECT 'h:' || id::text AS doc_id,
TRIM(BOTH '' FROM rule_statement || '' ||
COALESCE(reasoning_summary, '')) AS content,
embedding::text AS emb_text
FROM halachot
WHERE rule_statement IS NOT NULL AND embedding IS NOT NULL
""")
all_rows = list(pc_rows) + list(h_rows)
print(f"[load] corpus: {len(pc_rows)} precedent_chunks + "
f"{len(h_rows)} halachot = {len(all_rows)} total")
doc_ids = [r["doc_id"] for r in all_rows]
contents = [r["content"] for r in all_rows]
embs = [parse_pgvector(r["emb_text"]) for r in all_rows]
# Latency measurement: 5 queries, time the two pipelines
print("\n[latency] measuring 5 sample queries…")
sample = QUERIES[:5]
r1_lat = []
r2_lat = []
for _, query in sample:
# R1: voyage-3 embed + cosine top-10
t0 = time.time()
q_emb = voyage.embed([query], model=TEXT_MODEL,
input_type="query").embeddings[0]
scores = sorted([(cosine(q_emb, e), i) for i, e in enumerate(embs)],
reverse=True)[:TOP_K]
r1_lat.append(time.time() - t0)
# R2: voyage-3 embed + cosine top-50 + rerank-2 → top-10
t0 = time.time()
q_emb = voyage.embed([query], model=TEXT_MODEL,
input_type="query").embeddings[0]
cands = sorted([(cosine(q_emb, e), i) for i, e in enumerate(embs)],
reverse=True)[:TOP_VEC]
cand_texts = [contents[i] for _, i in cands]
rr = voyage.rerank(query=query, documents=cand_texts,
model=RERANK_MODEL, top_k=TOP_K)
r2_lat.append(time.time() - t0)
print(f" R1 (voyage-3 only) avg={sum(r1_lat)/5*1000:.0f}ms"
f" min={min(r1_lat)*1000:.0f} max={max(r1_lat)*1000:.0f}")
print(f" R2 (voyage-3 + rerank-2) avg={sum(r2_lat)/5*1000:.0f}ms"
f" min={min(r2_lat)*1000:.0f} max={max(r2_lat)*1000:.0f}")
print(f" Δ (rerank overhead) avg={(sum(r2_lat)-sum(r1_lat))/5*1000:.0f}ms")
# Retrieval functions
def r1_baseline(query: str, k: int = TOP_K) -> list[int]:
q = voyage.embed([query], model=TEXT_MODEL,
input_type="query").embeddings[0]
scores = sorted([(cosine(q, e), i) for i, e in enumerate(embs)],
reverse=True)
return [i for _, i in scores[:k]]
def r2_rerank(query: str, k: int = TOP_K) -> list[int]:
cands = r1_baseline(query, k=TOP_VEC)
cand_texts = [contents[i] for i in cands]
rr = voyage.rerank(query=query, documents=cand_texts,
model=RERANK_MODEL, top_k=k)
return [cands[r.index] for r in rr.results]
retrievers = [("R1-voyage3", r1_baseline),
("R2-rerank2", r2_rerank)]
print(f"\n[judge] running {len(QUERIES)} queries × 2 retrievers, "
f"top-{JUDGE_K} judged…")
all_results = []
for qid, query in QUERIES:
print(f"\n[{qid}] {query}")
retr_results = {}
for r_name, r_fn in retrievers:
try:
retr_results[r_name] = r_fn(query, k=JUDGE_K)
except Exception as e:
print(f" {r_name}: FAILED — {e}")
retr_results[r_name] = []
union = sorted({i for top in retr_results.values() for i in top})
items = [(doc_ids[i], contents[i]) for i in union]
print(f" judging {len(items)} unique docs…")
scores_map = batch_judge(query, items)
for r_name, top in retr_results.items():
scores = [scores_map.get(doc_ids[i], 0) for i in top]
mean3 = sum(scores[:3]) / 3 if len(scores) >= 3 else 0
mean5 = sum(scores) / len(scores) if scores else 0
mrr = 0.0
for r, s in enumerate(scores):
if s >= 4:
mrr = 1.0 / (r + 1)
break
print(f" {r_name}: doc_ids={[doc_ids[i][:14] for i in top]} "
f"scores={scores} m@3={mean3:.2f} m@5={mean5:.2f} "
f"MRR={mrr:.3f}")
            all_results.append({
                "qid": qid, "category": qid[0], "query": query,
                "retriever": r_name,
                "doc_ids": [doc_ids[i] for i in top],
                "scores": scores, "mean3": mean3, "mean5": mean5, "mrr": mrr,
            })

    # Aggregate
    print("\n" + "=" * 100)
    print("AGGREGATED RESULTS — full precedent_library corpus (785 docs)")
    print("=" * 100)
    by_r = defaultdict(lambda: {"mean3": [], "mean5": [], "mrr": []})
    by_cat_r = defaultdict(lambda: {"mean3": [], "mean5": [], "mrr": []})
    for r in all_results:
        by_r[r["retriever"]]["mean3"].append(r["mean3"])
        by_r[r["retriever"]]["mean5"].append(r["mean5"])
        by_r[r["retriever"]]["mrr"].append(r["mrr"])
        ck = (r["category"], r["retriever"])
        by_cat_r[ck]["mean3"].append(r["mean3"])
        by_cat_r[ck]["mean5"].append(r["mean5"])
        by_cat_r[ck]["mrr"].append(r["mrr"])

    print(f"\nOverall ({len(QUERIES)} queries):")
    print(f"{'retriever':<14} {'mean@3':>8} {'mean@5':>8} {'MRR':>8}")
    avg = lambda xs: sum(xs) / len(xs) if xs else 0
    for r_name, _ in retrievers:
        m = by_r[r_name]
        print(f"{r_name:<14} {avg(m['mean3']):>8.3f} "
              f"{avg(m['mean5']):>8.3f} {avg(m['mrr']):>8.3f}")

    # Improvement
    r1m = avg(by_r["R1-voyage3"]["mean3"])
    r2m = avg(by_r["R2-rerank2"]["mean3"])
    if r1m > 0:
        print(f"\nR2 vs R1 improvement: "
              f"mean@3 {(r2m - r1m) / r1m * 100:+.1f}%")

    print("\nBy category:")
    print(f"{'cat':<3} {'retriever':<14} {'mean@3':>8} {'mean@5':>8} "
          f"{'MRR':>8}")
    for cat in ["K", "C", "N", "P"]:
        for r_name, _ in retrievers:
            m = by_cat_r[(cat, r_name)]
            if not m["mean3"]:
                continue
            print(f"{cat:<3} {r_name:<14} {avg(m['mean3']):>8.3f} "
                  f"{avg(m['mean5']):>8.3f} {avg(m['mrr']):>8.3f}")

    print("\nPer-query winner (highest mean@3):")
    print(f"{'qid':<4} {'query':<40} {'winner':<14} {'scores'}")
    by_q = defaultdict(list)
    for r in all_results:
        by_q[r["qid"]].append(r)
    for qid, results in sorted(by_q.items()):
        max_s = max(r["mean3"] for r in results)
        winners = [r["retriever"] for r in results if r["mean3"] == max_s]
        scores = " | ".join(f"{r['retriever'][:7]}={r['mean3']:.2f}"
                            for r in results)
        q_str = next(q for qid_, q in QUERIES if qid_ == qid)[:38]
        print(f"{qid:<4} {q_str:<40} {','.join(w[:8] for w in winners):<14} "
              f"{scores}")

    out_path = "/tmp/voyage_rerank_corpus_results.json"
    with open(out_path, "w") as f:
        json.dump(all_results, f, ensure_ascii=False, indent=2)
    print(f"\nSaved to {out_path}")
    await pool.close()


if __name__ == "__main__":
    asyncio.run(main())
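The `R2 vs R1 improvement` line above reports a relative change in mean@3. As a rough sanity check, the figures quoted in the commit message can be run through the same formula. This is a minimal sketch: `relative_improvement` is a hypothetical helper name, not part of the script, and the inputs are only the rounded means from the commit message.

```python
def relative_improvement(baseline: float, candidate: float) -> float:
    """Percentage change of candidate over baseline (same formula as the script)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate - baseline) / baseline * 100


# Rounded means quoted in the commit message (not recomputed from raw data).
overall = relative_improvement(4.306, 4.500)
practical = relative_improvement(3.78, 4.22)

print(f"mean@3 overall   {overall:+.1f}%")
print(f"mean@3 practical {practical:+.1f}%")
```

Because the inputs are already rounded, the last digit of the result may differ slightly from what the script prints from the raw per-query scores.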

@@ -0,0 +1,361 @@
"""POC #4: Comprehensive retrieval benchmark with LLM-as-judge.
Compares 3 retrievers on אהרון ברק 403/17 (219 chunks):
R1 — voyage-3 (current production baseline)
R2 — voyage-3 + voyage-rerank-2 (retrieve 50, rerank, top-10)
R3 — voyage-context-3 (windowed, from POC #2)
Judges relevance with claude-haiku-4-5: for each (query, chunk) pair the
judge returns 1-5. Aggregates: mean relevance@3, @5, MRR (reciprocal rank of
the first chunk scored 4+), per-query winner.
18 queries grouped into 3 categories so we can see *which* query types
benefit from which retriever:
K — keyword/lexical (term-heavy, specific entity)
C — conceptual (abstract idea, principle)
N — narrative/contextual (requires document-internal reference)
Usage (key passed via env, NOT stored in script):
ANTHROPIC_API_KEY=... \\
/home/chaim/legal-ai/mcp-server/.venv/bin/python \\
/home/chaim/legal-ai/scripts/voyage_rerank_judge_poc.py
"""
from __future__ import annotations
import asyncio
import json
import math
import os
import sys
import time
from collections import defaultdict
ENV_PATH = os.path.expanduser("~/.env")
if os.path.isfile(ENV_PATH):
    with open(ENV_PATH) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                k, v = line.split("=", 1)
                os.environ.setdefault(k, v)
import re
import subprocess
import asyncpg # noqa: E402
import voyageai # noqa: E402
CASE_ID = "e151fc25-cf12-4563-b638-a86323f8413b" # אהרון ברק 403/17
TEXT_MODEL = "voyage-3"
CONTEXT_MODEL = "voyage-context-3"
RERANK_MODEL = "rerank-2"
JUDGE_MODEL = "claude-haiku-4-5-20251001"
WINDOW_SIZE = 80
WINDOW_STRIDE = 70
# 18 queries × 3 retrievers × top-5 = 270 judge calls. ~$0.05 with haiku.
QUERIES = [
# K — keyword/lexical
("K1", "תכנית רחביה הוראות בנייה"),
("K2", "תמ\"א 38"),
("K3", "תכנית 9988"),
("K4", "סעיף 197 לחוק התכנון והבניה"),
("K5", "השופט גרוסקופף"),
("K6", "ועדה מקומית ירושלים"),
# C — conceptual / abstract principles
("C1", "כלל הנטרול של זכויות תכנוניות"),
("C2", "אינטרס הציבור בתכנון"),
("C3", "תכלית היטל ההשבחה"),
("C4", "תכנית פוגעת לעומת תכנית משביחה"),
("C5", "ההבחנה בין השבחה לפיצויים"),
("C6", "מהותו של היטל ההשבחה"),
# N — narrative / context-dependent
("N1", "מה נקבע לגבי תמ\"א 38 בפסק הדין"),
("N2", "מסקנת בית המשפט בעניין רובע 3"),
("N3", "ההלכה שנקבעה בעניין שמעוני"),
("N4", "ההבדל בין המקרה שלפנינו לעניין רון"),
("N5", "סוף דבר ותוצאת פסק הדין"),
("N6", "הסכמת השופטים האחרים לחוות הדעת"),
]
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def parse_pgvector(s):
    return [float(x) for x in s.strip("[]").split(",")]


def build_windows(n, size, stride):
    out = []
    s = 0
    while s < n:
        e = min(s + size, n)
        out.append((s, e))
        if e == n:
            break
        s += stride
    return out


def central_window(idx, windows):
    best, best_d = -1, -1
    for w_idx, (s, e) in enumerate(windows):
        if not (s <= idx < e):
            continue
        d = min(idx - s, (e - 1) - idx)
        if d > best_d:
            best_d = d
            best = w_idx
    return best
BATCH_JUDGE_PROMPT = """אתה שופט רלוונטיות במשפט ישראלי.
לפניך שאילתה ומספר פסקאות מפסק דין. דרג כל פסקה בנפרד 1-5 לפי רלוונטיות.
סולם:
5 — תשובה ישירה ומדויקת לשאילתה
4 — מאד רלוונטי, מכיל מידע ליבה
3 — רלוונטי חלקית, נוגע בעקיפין בנושא
2 — מעט קשור, רעש סביב הנושא
1 — לא רלוונטי בכלל
השאילתה:
{query}
הפסקאות:
{chunks_block}
החזר JSON בלבד, בפורמט: {{"scores": {{"<id>": <1-5>, ...}}}}
ללא טקסט נוסף, ללא explanations, ללא ```."""
def batch_judge(query: str,
                items: list[tuple[int, str]]) -> dict[int, int]:
    """Judge a list of (chunk_idx, content) pairs in a single CLI call.

    Returns: dict[chunk_idx → score 1-5]. On a parse failure the dict is
    empty; chunks missing from it are scored 0 by the caller.
    """
    chunks_block_lines = []
    for ci, content in items:
        snippet = content.replace("\n", " ").strip()[:1500]
        chunks_block_lines.append(f"<id={ci}>\n{snippet}\n</id>")
    prompt = BATCH_JUDGE_PROMPT.format(
        query=query,
        chunks_block="\n\n".join(chunks_block_lines),
    )
    proc = subprocess.run(
        ["claude", "-p", "--model", JUDGE_MODEL],
        input=prompt, capture_output=True, text=True, timeout=120,
    )
    out = proc.stdout.strip()
    # Strip ```json fences if any
    out = re.sub(r"^```(?:json)?\s*", "", out)
    out = re.sub(r"\s*```$", "", out)
    try:
        data = json.loads(out)
        raw = data.get("scores", {})
        # Keep only integer scores inside the 1-5 scale.
        return {int(k): int(v) for k, v in raw.items()
                if str(v).isdigit() and 1 <= int(v) <= 5}
    except (json.JSONDecodeError, ValueError, TypeError) as e:
        print(f" [judge parse fail: {e}; out={out[:200]!r}]")
        return {}
async def main():
    voyage_key = os.environ["VOYAGE_API_KEY"]
    pg_pw = os.environ["POSTGRES_PASSWORD"]

    # Verify Claude CLI is available (uses OAuth from ~/.claude/.credentials)
    try:
        subprocess.run(["claude", "--version"], capture_output=True,
                       text=True, timeout=10, check=True)
    except (subprocess.CalledProcessError, FileNotFoundError,
            subprocess.TimeoutExpired):
        # subprocess.run signals a timeout with subprocess.TimeoutExpired,
        # which is not a subclass of the built-in TimeoutError.
        sys.exit("claude CLI not found or not authenticated")

    voyage = voyageai.Client(api_key=voyage_key)

    # Load chunks + voyage-3 embeddings
    pool = await asyncpg.create_pool(
        host="127.0.0.1", port=5433, user="legal_ai",
        password=pg_pw, database="legal_ai",
        min_size=1, max_size=2,
    )
    rows = await pool.fetch("""
        SELECT chunk_index, content, embedding::text AS emb_text
        FROM precedent_chunks
        WHERE case_law_id = $1
        ORDER BY chunk_index
    """, CASE_ID)
    chunks = [r["content"] for r in rows]
    chunk_indices = [r["chunk_index"] for r in rows]
    baseline_embs = [parse_pgvector(r["emb_text"]) for r in rows]
    n = len(chunks)
    print(f"[load] {n} chunks loaded")

    # Compute context-3 (windowed) embeddings (same as POC #2)
    windows = build_windows(n, WINDOW_SIZE, WINDOW_STRIDE)
    print(f"[context-3] embedding {len(windows)} windows…")
    win_embs = []
    for s, e in windows:
        result = voyage.contextualized_embed(
            inputs=[chunks[s:e]],
            model=CONTEXT_MODEL,
            input_type="document",
        )
        win_embs.append(result.results[0].embeddings)
    # Each chunk takes its embedding from the window where it sits most
    # centrally, so it gets the most surrounding context.
    context_embs = []
    for i in range(n):
        w = central_window(i, windows)
        s, _ = windows[w]
        context_embs.append(win_embs[w][i - s])
    print("[context-3] done")

    # Retrieval functions
    def r1_baseline(query: str, k: int = 10) -> list[int]:
        q = voyage.embed([query], model=TEXT_MODEL,
                         input_type="query").embeddings[0]
        scores = sorted(
            [(cosine(q, e), i) for i, e in enumerate(baseline_embs)],
            reverse=True,
        )
        return [i for _, i in scores[:k]]

    def r2_rerank(query: str, k: int = 10) -> list[int]:
        # 1) voyage-3 retrieve top-50
        cands = r1_baseline(query, k=50)
        cand_texts = [chunks[i] for i in cands]
        # 2) voyage-rerank-2 over the 50
        rr = voyage.rerank(
            query=query, documents=cand_texts,
            model=RERANK_MODEL, top_k=k,
        )
        # rr.results: list of RerankingResult(index=..., relevance_score=...)
        # `index` refers to position in cand_texts → map back to chunk idx
        return [cands[r.index] for r in rr.results]

    def r3_context(query: str, k: int = 10) -> list[int]:
        q = voyage.contextualized_embed(
            inputs=[[query]],
            model=CONTEXT_MODEL,
            input_type="query",
        ).results[0].embeddings[0]
        scores = sorted(
            [(cosine(q, e), i) for i, e in enumerate(context_embs)],
            reverse=True,
        )
        return [i for _, i in scores[:k]]

    retrievers = [("R1-voyage3", r1_baseline),
                  ("R2-rerank2", r2_rerank),
                  ("R3-context3", r3_context)]

    # Run all queries × all retrievers, judging top-5 per pair.
    # Strategy: for each query, gather the union of all retrievers' top-K
    # and judge them in ONE batched CLI call → 18 calls total instead of 270.
    all_results = []
    JUDGE_TOP_K = 5
    print(f"\n[judge] running {len(QUERIES)} queries × "
          f"{len(retrievers)} retrievers × top-{JUDGE_TOP_K} — batched per query…")
    for qid, query in QUERIES:
        print(f"\n[{qid}] {query}")
        # Collect retrievals first
        retr_results = {}
        for r_name, r_fn in retrievers:
            try:
                retr_results[r_name] = r_fn(query, k=JUDGE_TOP_K)
            except Exception as e:
                print(f" {r_name}: FAILED — {e}")
                retr_results[r_name] = []
        # Union of unique chunk indices to judge
        union = sorted({i for top in retr_results.values() for i in top})
        items = [(i, chunks[i]) for i in union]
        print(f" judging {len(items)} unique chunks via batch CLI…")
        scores_map = batch_judge(query, items)
        # Build per-retriever score lists (chunks the judge skipped score 0)
        for r_name, top in retr_results.items():
            scores = [scores_map.get(i, 0) for i in top]
            mean3 = sum(scores[:3]) / 3 if len(scores) >= 3 else 0
            mean5 = sum(scores) / len(scores) if scores else 0
            # MRR: reciprocal rank of the first chunk scored 4 or higher
            mrr = 0.0
            for r, s in enumerate(scores):
                if s >= 4:
                    mrr = 1.0 / (r + 1)
                    break
            print(f" {r_name}: chunks={[chunk_indices[i] for i in top]} "
                  f"scores={scores} mean@3={mean3:.2f} mean@5={mean5:.2f} "
                  f"MRR={mrr:.3f}")
            all_results.append({
                "qid": qid, "category": qid[0], "query": query,
                "retriever": r_name,
                "chunks": [chunk_indices[i] for i in top],
                "scores": scores,
                "mean3": mean3, "mean5": mean5, "mrr": mrr,
            })

    # Aggregate
    print("\n" + "=" * 100)
    print("AGGREGATED RESULTS")
    print("=" * 100)
    by_retriever = defaultdict(lambda: {"mean3": [], "mean5": [], "mrr": []})
    by_cat_retriever = defaultdict(
        lambda: {"mean3": [], "mean5": [], "mrr": []})
    for r in all_results:
        by_retriever[r["retriever"]]["mean3"].append(r["mean3"])
        by_retriever[r["retriever"]]["mean5"].append(r["mean5"])
        by_retriever[r["retriever"]]["mrr"].append(r["mrr"])
        cat_key = (r["category"], r["retriever"])
        by_cat_retriever[cat_key]["mean3"].append(r["mean3"])
        by_cat_retriever[cat_key]["mean5"].append(r["mean5"])
        by_cat_retriever[cat_key]["mrr"].append(r["mrr"])

    def avg(xs):
        return sum(xs) / len(xs) if xs else 0

    print(f"\nOverall (across all {len(QUERIES)} queries):")
    print(f"{'retriever':<14} {'mean@3':>8} {'mean@5':>8} {'MRR':>8}")
    for r_name, _ in retrievers:
        m = by_retriever[r_name]
        print(f"{r_name:<14} {avg(m['mean3']):>8.3f} "
              f"{avg(m['mean5']):>8.3f} {avg(m['mrr']):>8.3f}")

    print("\nBy category (K=keyword, C=conceptual, N=narrative):")
    print(f"{'cat':<3} {'retriever':<14} {'mean@3':>8} {'mean@5':>8} {'MRR':>8}")
    for cat in ["K", "C", "N"]:
        for r_name, _ in retrievers:
            m = by_cat_retriever[(cat, r_name)]
            print(f"{cat:<3} {r_name:<14} {avg(m['mean3']):>8.3f} "
                  f"{avg(m['mean5']):>8.3f} {avg(m['mrr']):>8.3f}")

    print("\nPer-query winner (highest mean@3, ties shown):")
    print(f"{'qid':<4} {'query':<45} {'winner':<24} {'scores'}")
    by_query = defaultdict(list)
    for r in all_results:
        by_query[r["qid"]].append(r)
    for qid, results in sorted(by_query.items()):
        max_score = max(r["mean3"] for r in results)
        winners = [r["retriever"] for r in results if r["mean3"] == max_score]
        scores = " | ".join(f"{r['retriever'][:7]}={r['mean3']:.2f}"
                            for r in results)
        q_str = next(q for qid_, q in QUERIES if qid_ == qid)[:42]
        print(f"{qid:<4} {q_str:<45} {','.join(w[:8] for w in winners):<24} "
              f"{scores}")

    # Save raw results to JSON for further analysis
    out_path = "/tmp/voyage_rerank_judge_results.json"
    with open(out_path, "w") as f:
        json.dump(all_results, f, ensure_ascii=False, indent=2)
    print(f"\nRaw results saved to {out_path}")
    await pool.close()


if __name__ == "__main__":
    asyncio.run(main())
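The per-retriever metrics in this script (mean@3, mean@5, and the reciprocal rank of the first chunk scored 4 or higher) can be isolated into small pure functions, which makes them easier to check by hand. The sketch below is illustrative only: `mean_at_k` and `reciprocal_rank` are hypothetical names, and the score list is made up rather than taken from a real run. Note that `mean_at_k` follows the script's mean@3 rule (0 when fewer than `k` results came back), whereas the script's mean@5 averages however many results were returned.

```python
def mean_at_k(scores: list[int], k: int) -> float:
    """Mean judge score over the top-k results; 0.0 if fewer than k exist."""
    return sum(scores[:k]) / k if len(scores) >= k else 0.0


def reciprocal_rank(scores: list[int], threshold: int = 4) -> float:
    """1/rank of the first result scored >= threshold, or 0.0 if none is."""
    for rank, score in enumerate(scores, start=1):
        if score >= threshold:
            return 1.0 / rank
    return 0.0


# Made-up judge scores in rank order; 0 stands for a chunk the judge skipped.
scores = [3, 5, 4, 0, 2]

print(f"mean@3={mean_at_k(scores, 3):.2f} "
      f"mean@5={mean_at_k(scores, 5):.2f} "
      f"RR={reciprocal_rank(scores):.3f}")
```

Averaging `reciprocal_rank` over all queries gives the MRR column in the aggregated tables. A skipped chunk counts as 0, so a judge parse failure pulls the means down instead of being excluded from them.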