feat(upload): accept legacy .doc, convert via LibreOffice in container

Legacy Hebrew .doc precedents (e.g. nevo.co.il CP1255 OLE2) can now be
uploaded directly through the precedent-library, missing-precedent, and
training upload paths — the frontend already advertised .doc but the
backend gate rejected it before reaching the extractor.

- web/app.py: add .doc to ALLOWED_EXTENSIONS (covers all paths that share
  the set: precedent library, missing-precedent, training).
- Dockerfile: install libreoffice-writer-nogui (no X11/Java) so the
  extractor's existing _extract_doc LibreOffice conversion works in the
  Coolify container (was missing → would fail at runtime).
- extractor.py: isolate the LibreOffice user profile per call to avoid a
  profile-lock failure on concurrent .doc conversions.

Verified in python:3.12-slim (prod base): .doc→.docx→text yields text
byte-identical to a native Word .docx save (103 paragraphs, 24,341 chars).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-06-03 13:47:47 +00:00
parent db6bad5d1e
commit 476c2fc5d1
3 changed files with 12 additions and 4 deletions

View File

@@ -32,9 +32,10 @@ RUN pip install --no-cache-dir ./mcp-server
FROM python:3.12-slim AS runner
WORKDIR /app
# Install Node.js 20.x
# Install Node.js 20.x + LibreOffice Writer (headless .doc→.docx conversion
# in extractor.py:_extract_doc — needed for legacy Hebrew .doc precedents).
RUN apt-get update && apt-get install -y --no-install-recommends \
curl ca-certificates git \
curl ca-certificates git libreoffice-writer-nogui \
&& curl -fsSL https://deb.nodesource.com/setup_20.x | bash - \
&& apt-get install -y --no-install-recommends nodejs \
&& rm -rf /var/lib/apt/lists/*