feat(upload): accept legacy .doc, convert via LibreOffice in container
Legacy Hebrew .doc precedents (e.g. nevo.co.il CP1255 OLE2) can now be uploaded directly through the precedent-library, missing-precedent, and training upload paths — the frontend already advertised .doc but the backend gate rejected it before reaching the extractor. - web/app.py: add .doc to ALLOWED_EXTENSIONS (covers all paths that share the set: precedent library, missing-precedent, training). - Dockerfile: install libreoffice-writer-nogui (no X11/Java) so the extractor's existing _extract_doc LibreOffice conversion works in the Coolify container (was missing → would fail at runtime). - extractor.py: isolate the LibreOffice user profile per call to avoid a profile-lock failure on concurrent .doc conversions. Verified in python:3.12-slim (prod base): .doc→.docx→text yields text byte-identical to a native Word .docx save (103 paragraphs, 24,341 chars). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -262,8 +262,15 @@ def _ocr_with_google_vision(image_bytes: bytes, page_num: int) -> str:
|
||||
def _extract_doc(path: Path) -> str:
|
||||
"""Extract text from legacy .doc file by converting to .docx via LibreOffice."""
|
||||
with tempfile.TemporaryDirectory() as tmp_dir:
|
||||
# Isolate the LibreOffice user profile per call: headless soffice
|
||||
# locks a single shared profile, so concurrent .doc conversions would
|
||||
# otherwise fail with a profile-lock error.
|
||||
result = subprocess.run(
|
||||
["libreoffice", "--headless", "--convert-to", "docx", str(path), "--outdir", tmp_dir],
|
||||
[
|
||||
"libreoffice",
|
||||
f"-env:UserInstallation=file://{tmp_dir}/lo-profile",
|
||||
"--headless", "--convert-to", "docx", str(path), "--outdir", tmp_dir,
|
||||
],
|
||||
capture_output=True, text=True, timeout=120,
|
||||
)
|
||||
if result.returncode != 0:
|
||||
|
||||
Reference in New Issue
Block a user