fix(claude_session): surface real CLI error + sanitize nested env (#85)

write_interim_draft failed for all blocks from the CEO MCP instance with "Claude CLI failed (exit 1): unknown error". Two fixes: 1. Error surfacing (the certain win): on non-zero exit, capture and log both stderr AND stdout (the CLI sometimes writes its diagnostic to stdout or nowhere), so the next occurrence is diagnosable instead of collapsing to "unknown error". This is why #85 was unsolved — the real error was swallowed (engineering rule §6: no silent swallow). 2. Defensive hardening: strip Claude Code session markers (CLAUDECODE, CLAUDE_CODE_*, CLAUDE_AGENT_*, AI_AGENT, CLAUDE_EFFORT) from the env of nested `claude -p` calls and run them from $HOME, decoupling them from the parent agent's session/project state. Aligns query() with the existing query_streaming() path (which already sets cwd=HOME). Auth/ config vars are preserved. Note: the original adapter-context failure could not be reproduced in a plain interactive session (nested claude -p succeeds there in both old and new code), so the env markers are a suspect, not a proven cause. The real value is the diagnostics. Verified: nested query() returns PONG from inside a CLAUDECODE=1 session; unit tests cover env sanitization. Invariants: G1 (normalize at source — fix the spawn, not readers), G2 (no parallel path — same query()), §6 (no silent error swallow). INV: feedback_claude_session_local_only preserved (all calls stay local). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 19:29:36 +00:00
parent 37c00bac13
commit 8ec24cf822
2 changed files with 87 additions and 2 deletions
--- a/mcp-server/src/legal_mcp/services/claude_session.py
+++ b/mcp-server/src/legal_mcp/services/claude_session.py
@@ -29,6 +29,7 @@ from __future__ import annotations
 import asyncio
 import json
 import logging
+import os

 from legal_mcp.config import parse_llm_json

@@ -40,6 +41,32 @@ logger = logging.getLogger(__name__)
 DEFAULT_TIMEOUT = 1800
 LONG_TIMEOUT = 3600  # opus block writing on full case context

+# Environment markers a running Claude Code session exports into its child
+# processes. When an agent itself runs as ``claude -p`` (e.g. the CEO MCP
+# instance launched by Paperclip's claude_local adapter), its MCP server
+# inherits these, and a *nested* ``claude -p`` we spawn here inherits them
+# too. Stripping them makes every nested invocation launch as a clean
+# top-level session, decoupled from the parent's session/project state —
+# defensive hardening against the #85 nested-``exit 1`` failure (which could
+# not be reproduced in a plain interactive session, so the markers are a
+# suspect, not a proven cause). Auth/config vars (CLAUDE_CONFIG_DIR,
+# ANTHROPIC_*, PATH, HOME) are kept. See TaskMaster legal-ai #85.
+_SESSION_MARKER_PREFIXES = ("CLAUDECODE", "CLAUDE_CODE_", "CLAUDE_AGENT_")
+_SESSION_MARKER_EXACT = frozenset({"AI_AGENT", "CLAUDE_EFFORT"})
+
+
+def _clean_subprocess_env() -> dict[str, str]:
+    """Copy the current env minus Claude Code session markers.
+
+    Lets a nested ``claude -p`` start fresh instead of detecting it is
+    already inside a Claude Code session (the #85 nested-exit-1 bug).
+    """
+    env = dict(os.environ)
+    for key in list(env):
+        if key in _SESSION_MARKER_EXACT or key.startswith(_SESSION_MARKER_PREFIXES):
+            del env[key]
+    return env
+

 async def query(
    prompt: str,
@@ -100,6 +127,8 @@ async def query(
            stdin=asyncio.subprocess.PIPE,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
+            env=_clean_subprocess_env(),
+            cwd=os.path.expanduser("~"),
        )
    except FileNotFoundError:
        raise RuntimeError(
@@ -125,9 +154,20 @@ async def query(
        raise RuntimeError(f"Claude CLI timed out after {timeout}s")

    if proc.returncode != 0:
-        stderr = stderr_b.decode("utf-8", errors="replace").strip()[:500] or "unknown error"
+        # Surface the real cause: the CLI sometimes writes its diagnostic to
+        # stdout (or nowhere) rather than stderr, so a stderr-only message
+        # collapsed to "unknown error" and hid the actual failure (#85).
+        stderr = stderr_b.decode("utf-8", errors="replace").strip()
+        stdout = stdout_b.decode("utf-8", errors="replace").strip()
+        diagnostic = stderr or stdout or "no output on stderr/stdout"
        size_info = f"; prompt_len={len(full_prompt):,} chars" if len(full_prompt) > 100_000 else ""
-        raise RuntimeError(f"Claude CLI failed (exit {proc.returncode}): {stderr}{size_info}")
+        logger.error(
+            "claude -p failed (exit %s): stderr=%r stdout=%r",
+            proc.returncode, stderr[:500], stdout[:500],
+        )
+        raise RuntimeError(
+            f"Claude CLI failed (exit {proc.returncode}): {diagnostic[:500]}{size_info}"
+        )

    stdout = stdout_b.decode("utf-8", errors="replace").strip()
    if not stdout:
@@ -232,6 +272,7 @@ async def query_streaming(
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
            cwd=cwd,
+            env=_clean_subprocess_env(),
        )
    except FileNotFoundError:
        yield {