The gold-set's human role tags were made while seeing a claude AI recommendation, so human↔AI agreement (~100%) is anchoring, not an independent accuracy signal. This adds a third, genuinely independent judge — a DIFFERENT model (DeepSeek, direct OpenAI-compatible API) classifies rule_role BLIND (never sees the human tag nor the first AI's answer) — and reports an inter-rater agreement matrix. Finding (100 tagged items): ai↔human 100% (anchored) vs deepseek↔human 50% fine-grained — BUT 92% on the coarse axis (generalizable-rule vs application/ obiter). Conclusion: the fine sub-type (holding/interpretive/procedural) is an inherently fuzzy boundary two capable models split differently; the coarse "is this a real rule" axis is robust across models. Use the coarse axis as ground truth; treat the sub-type as advisory, never as a gate. Zero chair tagging, read-only on the gold-set. Key from ~/.hermes deepseek env. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
7.8 KiB
7.8 KiB