fix(extractor): apply Hebrew quote fixer to direct PDF extraction path
All checks were successful
Build & Deploy / build-and-deploy (push) Successful in 1m40s
Born-digital Hebrew PDFs from legal software often encode gershayim (״) as double-yod (יי), producing the same corruption patterns as OCR. The fixer was only called after Google Cloud Vision OCR, so digitally created PDFs that passed quality checks received no correction.

Changes:
- Apply _fix_hebrew_quotes() in the direct extraction path
- Add 'בליימ' → 'בל"מ' (בקשה להארכת מועד, "request for an extension of time"; systematic corruption in 1017-03-26)
- Add 'תמייא' → 'תמ"א' (תכנית מתאר ארצית, "national outline plan")
- Update docstring to reflect the broader scope

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
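For readers without the full file, here is a minimal sketch of how a mapping-driven fixer of this kind can work. The names `ABBREV_FIXES`, `ABBREV_PATTERN`, and `fix_hebrew_quotes` are illustrative stand-ins; the way the pattern is compiled is an assumption, since the real `_ABBREV_PATTERN` definition is not part of this diff.

```python
import re

# Illustrative subset of the mapping, using entries that appear in this commit.
ABBREV_FIXES: dict[str, str] = {
    'מייר': 'מ"ר',
    'בליימ': 'בל"מ',
    'תמייא': 'תמ"א',
}

# Assumption: a plain alternation over the escaped keys, longest first so that
# a longer key is never shadowed by a shorter one that shares its prefix.
# The project's real pattern may add word-boundary guards not shown here.
ABBREV_PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(ABBREV_FIXES, key=len, reverse=True))
)


def fix_hebrew_quotes(text: str) -> str:
    """Replace each known double-yod corruption with its quoted abbreviation."""
    return ABBREV_PATTERN.sub(lambda m: ABBREV_FIXES[m.group()], text)
```

Because this sketch has no boundary checks, it would also rewrite a key that happens to occur inside a longer word; whether the real pattern guards against that cannot be told from the hunks shown.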
@@ -109,6 +109,9 @@ _HEBREW_ABBREV_FIXES: dict[str, str] = {
     'מייר': 'מ"ר',
     'יחייד': 'יח"ד',
     'בייכ': 'ב"כ',
+    # Patterns where double-yod (יי) substitutes for gershayim (״) in born-digital PDFs
+    'בליימ': 'בל"מ',  # בקשה להארכת מועד — appears in RTL legal docs
+    'תמייא': 'תמ"א',  # תכנית מתאר ארצית
 }
 
 _ABBREV_PATTERN = re.compile(
@@ -117,7 +120,12 @@ _ABBREV_PATTERN = re.compile(
 
 
 def _fix_hebrew_quotes(text: str) -> str:
-    """Fix known Hebrew abbreviation quote replacements from Google Vision OCR."""
+    """Fix known Hebrew abbreviation quote replacements.
+
+    Applied to both Google Vision OCR output and direct PyMuPDF extraction —
+    some born-digital PDFs encode gershayim (״) as double-yod (יי), producing
+    the same corruption patterns as OCR.
+    """
     return _ABBREV_PATTERN.sub(lambda m: _HEBREW_ABBREV_FIXES[m.group()], text)
 
 
@@ -189,7 +197,7 @@ async def _extract_pdf(path: Path) -> tuple[str, int, list[int]]:
         text = page.get_text().strip()
 
         if len(text) > 50 and _text_quality_ok(text):
-            pages_text.append(text)
+            pages_text.append(_fix_hebrew_quotes(text))
             logger.debug("Page %d: direct extraction (%d chars, quality OK)", page_num + 1, len(text))
         else:
             reason = "insufficient text" if len(text) <= 50 else "low quality OCR layer"
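The per-page routing touched by the last hunk can be sketched in isolation as follows. This is an assumption-laden illustration, not the project's code: `route_pages`, `quality_ok`, and `fix` are hypothetical names, the quality check and fixer are passed in as callables, and treating rejected pages as "needs OCR" is inferred from the commit message rather than shown in the diff.

```python
from typing import Callable


def route_pages(
    raw_pages: list[str],
    quality_ok: Callable[[str], bool],
    fix: Callable[[str], str],
) -> tuple[list[str], list[int]]:
    """Split pages into directly accepted text and indices left for OCR."""
    accepted: list[str] = []
    needs_ocr: list[int] = []
    for page_num, raw in enumerate(raw_pages):
        text = raw.strip()
        # Same threshold as the diff: more than 50 characters and a passing quality check.
        if len(text) > 50 and quality_ok(text):
            # The change in this commit: the fixer now runs on the direct path too.
            accepted.append(fix(text))
        else:
            # Assumption: rejected pages fall through to an OCR step not shown here.
            needs_ocr.append(page_num)
    return accepted, needs_ocr
```

Passing the fixer in as a parameter keeps the sketch testable without PyMuPDF or Google Cloud Vision; the real function calls `_fix_hebrew_quotes` directly.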