fix(supervisor): re-probe claude.ai quota instead of blindly waiting for the reported reset #242

Merged

chaim merged 1 commits from worktree-supervisor-quota-probe into main

2026-06-13 09:36:20 +00:00

Author	SHA1	Message	Date
Chaim	013fe39ea7	fix(supervisor): re-probe claude.ai quota instead of waiting blindly for the reported reset All checks were successful G12 Leak-Guard / leak-guard (pull_request) Successful in 8s Details When the halacha drain hit a 429, the supervisor recorded the reset time the error reported (e.g. "resets 6:50pm UTC") and then HELD until that timestamp, re-reading it from its own state every tick without ever checking whether quota had actually returned. claude.ai usually frees up quota earlier than the message claims, so the drain sat idle for hours after it could have resumed — and only a manual kick (clear cooldown + trigger) got it going again. Now, on any tick where we'd otherwise hold on a cooldown, run a cheap live probe (`quota_available()` → a tiny `claude -p` call, cost ~0) and resume the instant it succeeds — at most one probe per 15-min tick, only while we believe we're limited. Conservative on failure (non-zero exit / timeout / limit message → stay held), so a flaky probe never resumes the drain into a real 429. Adds a claude_bin() resolver so the probe works under pm2 cron where PATH is bare. Invariants: G1 (resume decision driven by actual quota state, not a guessed timestamp); no new control path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-13 09:35:45 +00:00

Author

SHA1

Message

Date

Chaim

013fe39ea7

fix(supervisor): re-probe claude.ai quota instead of waiting blindly for the reported reset

G12 Leak-Guard / leak-guard (pull_request) Successful in 8s

Details

When the halacha drain hit a 429, the supervisor recorded the reset time the
error reported (e.g. "resets 6:50pm UTC") and then HELD until that timestamp,
re-reading it from its own state every tick without ever checking whether quota
had actually returned. claude.ai usually frees up quota earlier than the message
claims, so the drain sat idle for hours after it could have resumed — and only a
manual kick (clear cooldown + trigger) got it going again.

Now, on any tick where we'd otherwise hold on a cooldown, run a cheap live probe
(`quota_available()` → a tiny `claude -p` call, cost ~0) and resume the instant
it succeeds — at most one probe per 15-min tick, only while we believe we're
limited. Conservative on failure (non-zero exit / timeout / limit message →
stay held), so a flaky probe never resumes the drain into a real 429. Adds a
claude_bin() resolver so the probe works under pm2 cron where PATH is bare.

Invariants: G1 (resume decision driven by actual quota state, not a guessed
timestamp); no new control path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-13 09:35:45 +00:00

fix(supervisor): re-probe claude.ai quota instead of blindly waiting for the reported reset #242

1 Commits