Auto-strip Nevo preambles and separate style analysis per appeal subtype

- Add strip_nevo_preamble() to extractor.py — auto-removes Nevo database
  headers (bibliography, legislation, mini-ratio) during training upload
- Add appeal_subtype column to style_patterns table — patterns are now
  stored per subtype instead of globally mixed
- Update clear_style_patterns() to support subtype-scoped deletion
- Pass appeal_subtype through analyze_corpus → store → upsert pipeline
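
The per-subtype storage and scoped deletion described above can be sketched as follows. This is only an illustration under stated assumptions: the commit does not show the schema or the database driver, so the use of SQLite, the `pattern_text` and `frequency` columns, the `_sketch` function names, and the `subtype_a` / `subtype_b` values are all hypothetical.

```python
import sqlite3


def init_db(conn):
    # Hypothetical schema: only the table name and the appeal_subtype column
    # come from the commit message; the other columns are assumptions.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS style_patterns (
               appeal_subtype TEXT NOT NULL,
               pattern_text   TEXT NOT NULL,
               frequency      INTEGER NOT NULL DEFAULT 1,
               UNIQUE (appeal_subtype, pattern_text)
           )"""
    )


def upsert_pattern_sketch(conn, appeal_subtype, pattern_text, frequency):
    # The same pattern text may exist once per subtype; a repeat upload
    # for the same subtype overwrites the stored frequency.
    conn.execute(
        """INSERT INTO style_patterns (appeal_subtype, pattern_text, frequency)
           VALUES (?, ?, ?)
           ON CONFLICT (appeal_subtype, pattern_text)
           DO UPDATE SET frequency = excluded.frequency""",
        (appeal_subtype, pattern_text, frequency),
    )


def clear_style_patterns_sketch(conn, appeal_subtype=None):
    # With no subtype, clear everything; otherwise delete only that subtype.
    if appeal_subtype is None:
        conn.execute("DELETE FROM style_patterns")
    else:
        conn.execute(
            "DELETE FROM style_patterns WHERE appeal_subtype = ?",
            (appeal_subtype,),
        )


conn = sqlite3.connect(":memory:")
init_db(conn)
upsert_pattern_sketch(conn, "subtype_a", "opening formula", 3)
upsert_pattern_sketch(conn, "subtype_b", "opening formula", 5)
clear_style_patterns_sketch(conn, "subtype_a")  # subtype_b rows are untouched
```

Keeping the subtype in the uniqueness constraint is what stops patterns from different subtypes from overwriting each other.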

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 14:03:06 +00:00
parent ba39707c70
commit 5dd24729e2
4 changed files with 65 additions and 18 deletions

@@ -152,8 +152,9 @@ async def document_upload_training(
     if source.resolve() != dest.resolve():
         shutil.copy2(str(source), str(dest))
-    # Extract text
+    # Extract text and strip Nevo preamble
     text, page_count = await extractor.extract_text(str(dest))
+    text = extractor.strip_nevo_preamble(text)
     # Parse date
     d_date = None
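
The hunk above calls `extractor.strip_nevo_preamble(text)` but does not show its body. A minimal sketch of what such a step might look like, assuming the preamble is a run of labelled header sections followed by a heading that opens the decision itself. The label strings and body headings below are placeholders, not the strings matched in `extractor.py`, and the `_sketch` name is hypothetical.

```python
# Placeholder labels for the bibliography, legislation, and mini-ratio
# sections; the real strings are not shown in this commit.
PREAMBLE_LABELS = ("Bibliography:", "Legislation cited:", "Mini-ratio:")
# Placeholder headings assumed to mark the start of the decision text.
BODY_HEADINGS = ("JUDGMENT", "DECISION")


def strip_nevo_preamble_sketch(text: str) -> str:
    """Drop everything before the first body heading, but only when a
    preamble label appears ahead of it; otherwise return text unchanged."""
    lines = text.splitlines()
    saw_preamble = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if any(stripped.startswith(label) for label in PREAMBLE_LABELS):
            saw_preamble = True
        elif saw_preamble and stripped in BODY_HEADINGS:
            # Keep the heading itself and everything after it.
            return "\n".join(lines[i:])
    return text


sample = (
    "Bibliography:\nSome article\n"
    "Legislation cited:\nSome statute\n"
    "Mini-ratio:\nShort summary\n"
    "JUDGMENT\nBody text."
)
cleaned = strip_nevo_preamble_sketch(sample)
```

Returning the input unchanged when no preamble label is found keeps the step safe to run on every training upload, including documents that did not come from Nevo.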