feat(operations): "Active Agents" screen + run management (live-runs/log/cancel) (G12/X15, #119)

A panel in /operations that shows which Paperclip agents are working right now
(running + queued), their live output, and controlled actions: stopping a run
and resetting a session. Closes the blind spot in which an agent-driven drain
(for example, the CEO heartbeat draining the halacha queue) bypasses the
/operations controls, which only know about pm2 services, while the raw output
is reachable only in the Paperclip UI.

Data source: Paperclip heartbeat-runs API (verified live):
  GET  /api/companies/{cid}/live-runs        - running + queued (agentName/status/issue/outputSilence)
  GET  /api/heartbeat-runs/{id}/log          - NDJSON of the agent's output
  GET  /api/heartbeat-runs/{id}/events       - timeline
  POST /api/heartbeat-runs/{id}/cancel       - controlled stop (not a kill; respects watchdog + checkpoint)
  POST /api/agents/{id}/runtime-state/reset-session

Architecture (G12/INV-PORT1): all new contact with Paperclip goes through the
gate only: web/paperclip_client.py (shell) → re-export in
web/agent_platform_port.py → web/app.py consumes from the gate. leak_guard.py
passes (the seam is intact). A direct kill on process_pid is forbidden (it
bypasses the gate).
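
A minimal sketch of the kind of check a leak guard could run to keep this seam
intact. This is not the contents of leak_guard.py, which are not shown here; the
rule (only the port module may import the client), `imports_client`, and
`find_leaks` are all hypothetical names chosen for illustration.

```python
import ast

# Assumption: only the port module is allowed to import the Paperclip client.
CLIENT_MODULE = "paperclip_client"
ALLOWED_IMPORTERS = {"web/agent_platform_port.py"}


def imports_client(source: str) -> bool:
    """Return True if the source imports the client module directly."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # e.g. "import web.paperclip_client"
            if any(a.name.split(".")[-1] == CLIENT_MODULE for a in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            # e.g. "from web.paperclip_client import cancel_run"
            if (node.module or "").split(".")[-1] == CLIENT_MODULE:
                return True
            # e.g. "from web import paperclip_client"
            if any(a.name == CLIENT_MODULE for a in node.names):
                return True
    return False


def find_leaks(files: dict[str, str]) -> list[str]:
    """Paths that import the client without being an allowed importer."""
    return sorted(
        path for path, source in files.items()
        if path not in ALLOWED_IMPORTERS and imports_client(source)
    )
```

Under this rule, web/app.py importing `pc_cancel_run` from the port is clean,
while any other module importing the client directly would be reported.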

Backend:
- paperclip_client: list_live_runs / get_run_log / get_run_events / cancel_run / reset_agent_session
- agent_platform_port: re-export pc_list_live_runs / pc_get_run_log / pc_get_run_events / pc_cancel_run / pc_reset_agent_session
- app.py: GET /api/operations/agents (aggregates CMP+CMPA, tolerant of a single-company failure),
  GET .../runs/{id}/log, GET .../runs/{id}/events, POST .../runs/{id}/cancel,
  POST .../agents/{id}/reset-session
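
One way the failure-tolerant aggregation could look, as a sketch only. The
response shape (`runs`/`errors`), the `companyId` tag, and the function name
`aggregate_live_runs` are assumptions for illustration; the real handler in
app.py may differ. `list_runs` stands in for the port's `pc_list_live_runs`.

```python
import asyncio


async def aggregate_live_runs(company_ids: list[str], list_runs) -> dict:
    """Fetch live runs for every company concurrently.

    A company whose fetch fails is reported under "errors" instead of
    failing the whole response.
    """
    results = await asyncio.gather(
        *(list_runs(cid) for cid in company_ids), return_exceptions=True,
    )
    runs: list[dict] = []
    errors: dict[str, str] = {}
    for cid, result in zip(company_ids, results):
        if isinstance(result, BaseException):
            errors[cid] = str(result)
            continue
        # Tag each row with its company so the UI can group the runs.
        runs.extend({**row, "companyId": cid} for row in result)
    return {"runs": runs, "errors": errors}
```

With `return_exceptions=True`, a failure for one company is returned as a value
rather than raised, so the other company's runs still reach the panel.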

Frontend: "Active Agents" panel in /operations (polling 4s) + a dialog for the
live log (parses NDJSON → readable text) + Stop/Reset buttons. Added hooks to
operations.ts.
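
The NDJSON → readable text step, sketched in Python to match the backend code
in this commit (the actual parsing lives in the frontend). The record keys
`stream` and `chunk` are assumptions; the real field names in the adapter's log
are not shown in this commit.

```python
import json


def ndjson_to_text(content: str) -> str:
    """Turn an NDJSON log (one JSON object per line) into readable lines."""
    lines: list[str] = []
    for raw in content.splitlines():
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between records
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            lines.append(raw)  # keep unparseable lines verbatim
            continue
        if not isinstance(record, dict):
            lines.append(str(record))
            continue
        # Hypothetical keys: "stream" (stdout/stderr) and "chunk" (the text).
        stream = record.get("stream", "stdout")
        text = str(record.get("chunk", "")).rstrip()
        lines.append(f"[{stream}] {text}")
    return "\n".join(lines)
```

Keeping malformed lines verbatim means a partially written last line of a
still-running log is shown rather than dropped.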

Safety: cancel on the halacha drainer is safe: extraction is checkpointed
per-chunk, resumable, and self-heals rows left in processing.
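
Why a cancel loses at most the in-flight chunk can be illustrated with a small
sketch. This is not the drainer's code; `drain`, `Cancelled`, and the dict-based
checkpoint are hypothetical stand-ins for the real extractor and its storage.

```python
class Cancelled(Exception):
    """Stands in for a run being stopped while a chunk is in flight."""


def drain(chunks: list[str], checkpoint: dict, process) -> int:
    """Process chunks in order, recording progress after each completed one."""
    for index in range(checkpoint.get("next", 0), len(chunks)):
        process(chunks[index])          # a cancel may interrupt here
        checkpoint["next"] = index + 1  # only completed chunks are recorded
    return checkpoint.get("next", 0)
```

Because the checkpoint advances only after a chunk completes, a run cancelled
mid-chunk resumes at that same chunk on the next drain and repeats nothing
before it.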

Invariants: upholds G12/INV-PORT1 (platform gate). Touches X6 (UI↔API).
api:types will be run after deploy (live openapi.json).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 13:26:30 +00:00
parent 130ddc3a7e
commit 2f094b8d84
5 changed files with 477 additions and 0 deletions


@@ -720,6 +720,77 @@ async def get_issue_interactions(issue_ids: list[str]) -> list[dict]:
await conn.close()
# ── Agent-run observability + control ───────────────────────────────────────
# Live view of which agents are actually working right now + their output, and
# the controls to manage a stuck/runaway run. These wrap Paperclip's own
# heartbeat-run API (verified live): we use the *graceful* platform endpoints
# (cancel / reset-session), never a raw kill on the run's process_pid, which
# would bypass the platform's watchdog and our extractors' per-chunk
# checkpointing. The only seam allowed to call these is web/agent_platform_port.


async def list_live_runs(company_id: str) -> list[dict]:
    """Queued + running heartbeat runs for a company (GET .../live-runs).

    Each row carries ``agentName``/``status``/``issueId``/``startedAt`` and an
    ``outputSilence`` block (``level`` ok|suspicion|critical): the platform's
    own liveness signal, surfaced so the UI can flag a stalled run.
    """
    resp = await pc_request(
        "GET", f"/api/companies/{company_id}/live-runs", raise_on_error=True,
    )
    data = resp.json()
    return data if isinstance(data, list) else []


async def get_run_log(run_id: str) -> dict:
    """Full output log of a heartbeat run (GET /api/heartbeat-runs/{id}/log).

    Returns the platform payload as-is: ``{runId, store, logRef, content}``
    where ``content`` is the NDJSON stream the adapter captured.
    """
    resp = await pc_request(
        "GET", f"/api/heartbeat-runs/{run_id}/log", raise_on_error=True,
    )
    return resp.json()


async def get_run_events(run_id: str) -> list[dict]:
    """Lifecycle/event timeline of a heartbeat run (.../events)."""
    resp = await pc_request(
        "GET", f"/api/heartbeat-runs/{run_id}/events", raise_on_error=True,
    )
    data = resp.json()
    return data if isinstance(data, list) else []


async def cancel_run(run_id: str) -> dict:
    """Gracefully cancel a queued/running heartbeat run (POST .../cancel).

    The platform stops the run cleanly (process-group teardown + status flip),
    respecting the watchdog. Safe for the halacha drain: its extractor is
    checkpointed per-chunk and resumes on the next drain, so a cancel loses at
    most the in-flight chunk.
    """
    resp = await pc_request(
        "POST", f"/api/heartbeat-runs/{run_id}/cancel", json={},
        raise_on_error=True,
    )
    return resp.json()


async def reset_agent_session(agent_id: str) -> dict:
    """Reset an agent's runtime session (.../runtime-state/reset-session).

    Clears a wedged session so the next wakeup starts clean. This is the smart
    alternative to cancelling individual runs when an agent loops.
    """
    resp = await pc_request(
        "POST", f"/api/agents/{agent_id}/runtime-state/reset-session", json={},
        raise_on_error=True,
    )
    return resp.json()


async def respond_to_interaction(
    issue_id: str, interaction_id: str, payload: dict,
) -> dict: