feat(storage): seal INV-STG1 write path — 15 dual-write seals + CI leak-guard + tripwire
All checks were successful
G12 Leak-Guard / leak-guard (pull_request) Successful in 5s
After the cutover to s3-only, an audit found 15 blob-write sites that bypass storage.py (uploads/finalize/exports/training/research-backup/precedents/bulletins/draft): a file would land in the old folders but **not** in MinIO → it would be lost on cleanup, not served, and not backed up.

The pipeline (ingest/extract) still reads by file_path from disk, so eliminating disk writes entirely requires full read-wiring (Phase 2, a separate task). The safe fix for now is a **dual-write seal**.

- storage.py: `mirror`/`mirror_file` (+ sync): best-effort persist to S3 when the backend is s3/dual (no-op on filesystem; an S3 failure is logged and does not break the request, following the DualBackend philosophy).
- web/app.py: helpers `_seal_blob`/`_seal_blob_file` + 14 sealed sites (storage.mirror runs after the disk write; the disk copy stays for the pipeline). block_writer.py: draft sealed (async).
- **CI leak-guard** (test_storage_write_leak_guard): fails on any blob write to disk (write_bytes/write_text/shutil.copy*/open(wb)) in web/ + services that lacks a `# noqa: STG1` marker. All benign writes (fallbacks/tmp/staging/git-metadata/flag/state) are marked, each with a justification. storage.py is exempt (it is the implementation).
- **tripwire** (scripts/storage_leak_tripwire.py): runtime monitoring for blobs on disk that are not in MinIO (JSON key match, bucket chosen per file). Verified live: 0 leaks.

invariants: INV-STG1 (all I/O goes through storage or is mirrored to it) · INV-STG6 · feedback_silent_swallow (mirror logs a warning, no bare except).

Phase 2 (read-wire the pipeline → drop the disk copy) is a follow-up.

tests: 4 mirror + 1 leak-guard + 6 serve_blob + 18 existing storage tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
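The dual-write seal described in this message can be illustrated with a minimal, synchronous sketch. It is not the code from storage.py or web/app.py: the names `mirror` and `seal_blob` echo the commit message, but the signatures, the `put` callable standing in for the S3 client, and the `backend` argument are assumptions made for illustration (the real helpers are described as async with sync variants).

```python
import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger("storage")

# Hypothetical uploader type: takes (key, data) and raises on failure.
PutFn = Callable[[str, bytes], None]


def mirror(key: str, data: bytes, put: PutFn, backend: str = "s3") -> bool:
    """Best-effort copy of an already-written blob to object storage.

    Returns True if the blob was mirrored, False if it was skipped or failed.
    """
    if backend not in ("s3", "dual"):
        return False  # filesystem backend: nothing to mirror
    try:
        put(key, data)
        return True
    except Exception as exc:  # broad on purpose, but logged instead of swallowed
        logger.warning("mirror failed for %s: %s", key, exc)
        return False


def seal_blob(path: Path, data: bytes, key: str, put: PutFn,
              backend: str = "s3") -> bool:
    """Write the disk copy the pipeline still reads, then mirror it."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)  # noqa: STG1 (disk copy kept for the pipeline)
    return mirror(key, data, put, backend)
```

The disk write happens first and is never rolled back, so a storage outage degrades to the pre-seal behaviour plus a warning in the log rather than a failed request.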
97
scripts/storage_leak_tripwire.py
Normal file
@@ -0,0 +1,97 @@
#!/usr/bin/env python3
"""INV-STG1 runtime tripwire: detect blobs that leaked to the old disk folders
without reaching MinIO (the detective control complementing the CI leak-guard).

After the s3-only cutover, every blob written under DATA_DIR/{cases,
precedent-library,internal-decisions,digests,training} should ALSO be in MinIO
(the upload/finalize paths keep a disk copy for the pipeline but mirror to S3 via
storage.mirror; see web/app.py _seal_blob). A file present on disk but ABSENT
from the matching S3 bucket means a write bypassed the seal → it would be lost on
disk cleanup and is not served/backed-up. This script reports them.

Classifies disk files into documents/derived buckets exactly like the migration
(``*/extracted/*`` and ``*/thumbnails/*`` → legal-derived; the rest →
legal-documents) and compares against the live bucket key-sets (proper JSON key
match, so Hebrew filenames with spaces compare correctly). Read-only.

Run locally (needs the `legalminio` mcli alias):
    python3 scripts/storage_leak_tripwire.py                      # full scan
    python3 scripts/storage_leak_tripwire.py --since 2026-06-11   # only newer files
"""
from __future__ import annotations

import argparse
import datetime
import json
import subprocess
import sys
from pathlib import Path

MCLI = str(Path.home() / ".local" / "bin" / "mcli")
DATA = Path("/home/chaim/legal-ai/data")
CATS = ["cases", "precedent-library", "internal-decisions", "digests", "training"]
# non-blob disk files that legitimately stay on disk / in git-per-case
SKIP_SUFFIX = {".tmp", ".log"}
SKIP_NAME = {"case.json", "notes.md", ".pull.log"}


def _bucket_for(rel: str) -> str:
    return ("legal-derived" if ("/extracted/" in rel or "/thumbnails/" in rel)
            else "legal-documents")


def _s3_keys(bucket: str) -> set[str]:
    out = subprocess.run([MCLI, "ls", "--recursive", "--json", f"legalminio/{bucket}"],
                         capture_output=True, text=True, env={"TERM": "xterm", "HOME": str(Path.home())})
    keys: set[str] = set()
    for ln in out.stdout.splitlines():  # one JSON object per line
        try:
            k = json.loads(ln).get("key", "")
        except json.JSONDecodeError:
            continue  # skip non-JSON noise in the listing
        if k and "/.git/" not in k:
            keys.add(k)
    return keys


def main(args: argparse.Namespace) -> int:
    s3 = {b: _s3_keys(b) for b in ("legal-documents", "legal-derived")}
    since = None
    if args.since:
        since = datetime.datetime.fromisoformat(args.since).timestamp()

    leaked: list[str] = []
    scanned = 0
    for cat in CATS:
        root = DATA / cat
        if not root.exists():
            continue
        for f in root.rglob("*"):
            if not f.is_file() or "/.git/" in f.as_posix():
                continue
            if f.suffix in SKIP_SUFFIX or f.name in SKIP_NAME:
                continue
            if since is not None and f.stat().st_mtime < since:
                continue
            rel = f.relative_to(DATA).as_posix()  # same form as the bucket keys
            scanned += 1
            if rel not in s3[_bucket_for(rel)]:
                leaked.append(rel)

    print(f"scanned {scanned} disk blobs across {CATS}")
    if not leaked:
        print("✓ no leaks: every disk blob is present in MinIO.")
        return 0
    print(f"⚠ {len(leaked)} LEAKED blobs (on disk, NOT in MinIO):")
    for r in leaked[:50]:
        print(f"  {r}  → expected in {_bucket_for(r)}")
    if len(leaked) > 50:
        print(f"  … and {len(leaked) - 50} more")
    return 1


if __name__ == "__main__":
    ap = argparse.ArgumentParser(description=__doc__,
                                 formatter_class=argparse.RawDescriptionHelpFormatter)
    ap.add_argument("--since", help="ISO date: only check files modified on/after")
    sys.exit(main(ap.parse_args()))
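The comparison at the heart of the tripwire can be exercised without MinIO or the `mcli` binary. The sketch below is an illustration only: `parse_keys` and `find_leaks` are hypothetical helper names, and the listing text is hand-built sample data in the one-JSON-object-per-line shape the script parses. It shows why decoding each line as JSON, rather than string-matching raw output, keeps Hebrew filenames with spaces comparable to disk paths.

```python
import json


def bucket_for(rel: str) -> str:
    """Same classification rule as the script: derived artifacts vs. documents."""
    derived = "/extracted/" in rel or "/thumbnails/" in rel
    return "legal-derived" if derived else "legal-documents"


def parse_keys(listing: str) -> set[str]:
    """Collect the "key" field from a JSON-lines listing, skipping noise."""
    keys: set[str] = set()
    for line in listing.splitlines():
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        key = obj.get("key", "") if isinstance(obj, dict) else ""
        if key and "/.git/" not in key:
            keys.add(key)
    return keys


def find_leaks(disk_paths: list[str], listings: dict[str, str]) -> list[str]:
    """Return disk paths that are missing from their expected bucket."""
    keys = {bucket: parse_keys(text) for bucket, text in listings.items()}
    return [p for p in disk_paths if p not in keys.get(bucket_for(p), set())]


if __name__ == "__main__":
    # Sample data only; json.dumps escapes the Hebrew name, json.loads restores it.
    listings = {
        "legal-documents": json.dumps({"key": "cases/101/כתב ערר.pdf"}) + "\nnot json",
        "legal-derived": json.dumps({"key": "cases/101/extracted/a.txt"}),
    }
    disk = [
        "cases/101/כתב ערר.pdf",
        "cases/101/extracted/a.txt",
        "cases/102/new.pdf",
    ]
    for leak in find_leaks(disk, listings):
        print(f"{leak} -> expected in {bucket_for(leak)}")
```

Keeping the comparison in a pure function like this also makes it easy to unit-test the classification rule separately from the subprocess call.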