Files
legal-ai/scripts/legal-halacha-supervisor.config.cjs
Chaim 49acde591e
All checks were successful
G12 Leak-Guard / leak-guard (pull_request) Successful in 6s
fix(db): serialise schema migrations with an advisory lock + stagger drain crons
legal-halacha-drain crashed 29× with asyncpg DeadlockDetectedError. Root cause:
every short-lived cron drain re-runs the idempotent schema migrations on startup
(get_pool → _run_schema_migrations), and three jobs (metadata-drain, halacha-drain,
halacha-supervisor) all fired on the same minute (*/15 / top-of-hour). Two
processes running the DDL concurrently took AccessExclusiveLock in opposite order
→ Postgres killed one with a deadlock.

Two-layer fix:
- Root cause: wrap _run_schema_migrations in a session-level pg_advisory_lock so
  only one process applies DDL at a time; concurrent migrators wait instead of
  deadlocking. DDL body extracted to _apply_schema_ddl. Idempotent, schema
  unchanged.
- Defence-in-depth: give each cron drain a distinct firing minute —
  metadata :00, supervisor :05, halacha-drain :10, digest :12, court-fetch :17 —
  so siblings no longer start at the same instant. SCRIPTS.md updated to match.

Invariants: G1 (fix at source — the single migration path — not the symptom);
G2 (no parallel control path introduced).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 08:08:19 +00:00

49 lines
2.3 KiB
JavaScript
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
/**
* pm2 ecosystem entry for legal-halacha-supervisor — the permanent health
* manager for the halacha-extraction drain (legal-halacha-drain).
*
* Runs ONE health tick per cron fire (~every 15 min) and takes ZERO Claude quota
* itself (only the drain calls Opus); it just reads the DB/logs/pm2 and pokes the
* existing drain via the established run-now mechanism. Each tick:
* • re-triggers the one-shot drain when idle and the queue is non-empty
* • restarts a HUNG run (online but no new chunk-checkpoint > 25 min — the
* out-log only updates when a whole CASE finishes, so mtime is not liveness)
* • backs off on rate-limit (claude_session 429) until the CLI's parsed reset
* • verifies crash-safe per-chunk staging is committing (nothing lost)
*
* BURST (manual "run continuously now" window): source of truth is
* drain_controls.burst_until in the DB — the SAME value the /operations page
* reads/writes (G1 single source; G2 no parallel control path). While it is in
* the future the supervisor LIFTS the drain's night-window; otherwise the drain
* keeps its normal 23:0005:00 IDT window. Burst auto-expires at its deadline.
* Manual front-ends: the /operations toggle, or:
* mcp-server/.venv/bin/python scripts/halacha_drain_supervisor.py burst-on
* mcp-server/.venv/bin/python scripts/halacha_drain_supervisor.py burst-off
*
* Pattern (mirrors legal-halacha-drain): cron_restart fires the script;
* autorestart:false → one-shot per tick (pm2 shows "stopped" between ticks).
*
* Install (once):
* pm2 start /home/chaim/legal-ai/scripts/legal-halacha-supervisor.config.cjs
* pm2 save
*/
// Staggered to minute :05 of the */15 cycle (:05,:20,:35,:50) so it never shares
// a firing minute with legal-metadata-drain (:00) or legal-halacha-drain (:10) —
// avoids the schema-migration DDL deadlock when sibling drains start together.
const cron = process.env.HALACHA_SUPERVISOR_CRON || "5-59/15 * * * *";
module.exports = {
apps: [
{
name: "legal-halacha-supervisor",
cwd: "/home/chaim/legal-ai",
script: "/home/chaim/legal-ai/mcp-server/.venv/bin/python",
args: "scripts/halacha_drain_supervisor.py",
env: { HOME: "/home/chaim", PYTHONUNBUFFERED: "1" },
autorestart: false, // one-shot per cron tick
cron_restart: cron,
max_memory_restart: "300M",
},
],
};