ezer-mishpati/legal-ai

Chaim 8a815ecff5

Build & Deploy / build-and-deploy (push) Successful in 16s

fix(retrieval): rewrite chunk-page retrofit to skip OCR

The first-pass retrofit re-extracted via extractor.extract_text, which
re-runs Google Vision OCR on scanned pages. OCR is non-deterministic,
so the new text didn't match the chunk content stored in the DB
(produced by the original OCR run) — only ~7% of chunks were located.

New approach (no OCR cost):

1. Use the stored documents.extracted_text from the DB — the exact
   text the chunks were produced from, so chunk lookups match.
2. Anchor page boundaries via PyMuPDF direct text reads (free, no
   OCR). Pages with usable direct text are anchored by snippet match;
   OCR-only pages are linearly interpolated between anchors.
3. Search each chunk in extracted_text using a whitespace-tolerant
   helper — needed because the chunker joins paragraphs with single
   '\n' while extracted_text uses '\n\n' as page separators.
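
Steps 2 and 3 above can be sketched roughly as follows. This is an illustrative sketch only, not the code in backfill_chunk_pages.py: the function names (find_whitespace_tolerant, interpolate_page_starts, page_for_offset) and the anchor representation (0-based page index mapped to a character offset in extracted_text) are assumptions made for the example.

```python
import re
from bisect import bisect_right
from typing import Optional


def find_whitespace_tolerant(
    haystack: str, needle: str, start: int = 0
) -> Optional[tuple[int, int]]:
    """Locate needle in haystack, letting any whitespace run match any other.

    Hypothetical helper: a chunk joined with '\n' still matches text that
    separates the same paragraphs with '\n\n'.
    """
    tokens = needle.split()
    if not tokens:
        return None
    # Escape each token literally and allow one or more whitespace chars between.
    pattern = r"\s+".join(re.escape(tok) for tok in tokens)
    match = re.compile(pattern).search(haystack, start)
    return (match.start(), match.end()) if match else None


def interpolate_page_starts(
    anchors: dict[int, int], page_count: int, text_len: int
) -> list[int]:
    """Estimate the start offset of every page from a few anchored pages.

    anchors maps a 0-based page index to a known start offset (assumed to come
    from a snippet match on directly readable pages). Unanchored pages are
    spaced linearly between their nearest anchors; the start of the text and
    the end of the text act as implicit anchors.
    """
    known = {0: 0, page_count: text_len}
    known.update(anchors)
    pages = sorted(known)
    starts = [0] * page_count
    for left, right in zip(pages, pages[1:]):
        span = right - left
        for page in range(left, right):
            frac = (page - left) / span
            starts[page] = round(known[left] + frac * (known[right] - known[left]))
    return starts


def page_for_offset(starts: list[int], offset: int) -> int:
    """Return the 1-based page number whose estimated range contains offset."""
    return bisect_right(starts, offset)


if __name__ == "__main__":
    text = "first para\n\nsecond para"
    print(find_whitespace_tolerant(text, "first para\nsecond para"))
    starts = interpolate_page_starts({2: 200}, page_count=4, text_len=400)
    print(starts, page_for_offset(starts, 150))
```

A regex built from escaped tokens keeps the match literal apart from whitespace, so punctuation in legal text does not need special handling. Interpolated page numbers are estimates and may be off by a page near boundaries on OCR-only pages.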

Verified on 8174-24 (5 docs, 307 chunks) + 8137-24 (9 docs, 512
chunks): 100% chunks tagged, 13s total, $0 cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-03 20:04:33 +00:00

.archive

LLM session: async, 30min timeout, semantic chunking + parallel

2026-04-30 14:21:35 +00:00

auto-sync-cases.sh

Fix case repo sync + auto-create Gitea repos + add sync indicator

2026-04-14 15:28:16 +00:00

backfill_chunk_pages.py

fix(retrieval): rewrite chunk-page retrofit to skip OCR

2026-05-03 20:04:33 +00:00

backup-db.sh

Add full decision writing pipeline: classify, extract, brainstorm, write, QA, export