legal-ai/scripts/goldset_independent_judge.py at b2ea0c28dd36d5b9a5bf7af7173b784e0f80142e

Files

Chaim 808c2e4c46 feat(goldset): independent second-judge for rule_role (break AI-anchoring)

The gold-set's human role tags were made while seeing a claude AI recommendation,
so human↔AI agreement (~100%) is anchoring, not an independent accuracy signal.
This adds a third, genuinely independent judge — a DIFFERENT model (DeepSeek,
direct OpenAI-compatible API) classifies rule_role BLIND (never sees the human
tag nor the first AI's answer) — and reports an inter-rater agreement matrix.

Finding (100 tagged items): ai↔human 100% (anchored) vs deepseek↔human 50%
fine-grained — BUT 92% on the coarse axis (generalizable-rule vs application/
obiter). Conclusion: the fine sub-type (holding/interpretive/procedural) is an
inherently fuzzy boundary two capable models split differently; the coarse
"is this a real rule" axis is robust across models. Use the coarse axis as
ground truth; treat the sub-type as advisory, never as a gate.

Zero chair tagging, read-only on the gold-set. Key from ~/.hermes deepseek env.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-07 20:12:58 +00:00

7.8 KiB

Raw Blame History

View Raw

7.8 KiB Raw Blame History

7.8 KiB

Raw Blame History