fix(supervisor): re-probe claude.ai quota instead of waiting blindly for the reported reset
All checks were successful
G12 Leak-Guard / leak-guard (pull_request) Successful in 8s
All checks were successful
G12 Leak-Guard / leak-guard (pull_request) Successful in 8s
When the halacha drain hit a 429, the supervisor recorded the reset time the error reported (e.g. "resets 6:50pm UTC") and then HELD until that timestamp, re-reading it from its own state every tick without ever checking whether quota had actually returned. claude.ai usually frees up quota earlier than the message claims, so the drain sat idle for hours after it could have resumed — and only a manual kick (clear cooldown + trigger) got it going again. Now, on any tick where we'd otherwise hold on a cooldown, run a cheap live probe (`quota_available()` → a tiny `claude -p` call, cost ~0) and resume the instant it succeeds — at most one probe per 15-min tick, only while we believe we're limited. Conservative on failure (non-zero exit / timeout / limit message → stay held), so a flaky probe never resumes the drain into a real 429. Adds a claude_bin() resolver so the probe works under pm2 cron where PATH is bare. Invariants: G1 (resume decision driven by actual quota state, not a guessed timestamp); no new control path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -70,10 +70,48 @@ def pm2_bin():
|
||||
return "pm2"
|
||||
|
||||
|
||||
def claude_bin():
|
||||
"""Resolve the claude CLI — PATH may be bare under pm2 cron, so fall back to
|
||||
the known install location (same binary the drain's claude_session uses)."""
|
||||
for c in ["claude", "/home/chaim/.local/bin/claude",
|
||||
*glob("/home/chaim/.nvm/versions/node/*/bin/claude")]:
|
||||
try:
|
||||
if subprocess.run([c, "--version"], capture_output=True,
|
||||
timeout=10).returncode == 0:
|
||||
return c
|
||||
except Exception:
|
||||
continue
|
||||
return "claude"
|
||||
|
||||
|
||||
PM2 = pm2_bin()
|
||||
CLAUDE = claude_bin()
|
||||
_ENV = {**os.environ, "HOME": "/home/chaim"}
|
||||
|
||||
|
||||
def quota_available() -> bool:
|
||||
"""Cheap live probe: is the claude.ai quota actually usable right now?
|
||||
|
||||
The 429 reset time claude.ai reports is often conservative — quota frees up
|
||||
earlier. Rather than trust that timestamp and wait blindly, we re-probe with
|
||||
a tiny `claude -p` call and resume the moment it succeeds. Conservative on
|
||||
failure: any non-zero exit, timeout, or limit message → treat as still
|
||||
limited (so a flaky probe never resumes the drain into a real 429)."""
|
||||
try:
|
||||
r = subprocess.run([CLAUDE, "-p", "Reply with exactly: OK"],
|
||||
capture_output=True, text=True, timeout=60, env=_ENV,
|
||||
cwd=REPO)
|
||||
except Exception:
|
||||
return False
|
||||
if r.returncode != 0:
|
||||
return False
|
||||
out = ((r.stdout or "") + (r.stderr or ""))
|
||||
low = out.lower()
|
||||
if "usage limit" in low or "session limit" in low or 'api_error_status":429' in out:
|
||||
return False
|
||||
return "OK" in out
|
||||
|
||||
|
||||
# ── DB access (via the repo venv; the module self-configures) ────────────────
|
||||
def _venv_py(code: str, timeout: int = 120) -> str:
|
||||
r = subprocess.run([VENV_PY, "-c", code], capture_output=True, text=True,
|
||||
@@ -319,8 +357,20 @@ def tick():
|
||||
cd_dt = datetime.fromisoformat(prev["cooldown_until"])
|
||||
except Exception:
|
||||
cd_dt = None
|
||||
cooldown_until = cd_dt.isoformat() if cd_dt else None
|
||||
in_cooldown = bool(cd_dt and now < cd_dt)
|
||||
# Don't trust the reported reset time — re-probe. claude.ai usually frees up
|
||||
# quota EARLIER than the 429 message claims, and the old code then sat idle
|
||||
# until that (conservative) timestamp. When we'd otherwise hold, a tiny live
|
||||
# probe lets us resume the instant quota is actually back (≤ one tick), no
|
||||
# manual kick. Runs at most once per tick and only while we think we're
|
||||
# limited, so the cost is negligible.
|
||||
if in_cooldown and quota_available():
|
||||
notes.append(
|
||||
f"בדיקת-מכסה הצליחה — המכסה חזרה לפני האיפוס המדווח "
|
||||
f"({cd_dt.astimezone(IDT):%H:%M IDT}); מתחדש מיד.")
|
||||
cd_dt = None
|
||||
in_cooldown = False
|
||||
cooldown_until = cd_dt.isoformat() if cd_dt else None
|
||||
weekly = bool(cd_dt and (cd_dt - now) > timedelta(hours=WEEKLY_GAP_HOURS))
|
||||
|
||||
# progress-based liveness (chunk checkpoints, NOT log mtime)
|
||||
|
||||
Reference in New Issue
Block a user