legal-ai

ezer-mishpati/legal-ai

Fork 0

Commit Graph

Author	SHA1	Message	Date
Chaim	808c2e4c46	feat(goldset): independent second-judge for rule_role (break AI-anchoring) The gold-set's human role tags were made while seeing a claude AI recommendation, so human↔AI agreement (~100%) is anchoring, not an independent accuracy signal. This adds a third, genuinely independent judge — a DIFFERENT model (DeepSeek, direct OpenAI-compatible API) classifies rule_role BLIND (never sees the human tag nor the first AI's answer) — and reports an inter-rater agreement matrix. Finding (100 tagged items): ai↔human 100% (anchored) vs deepseek↔human 50% fine-grained — BUT 92% on the coarse axis (generalizable-rule vs application/ obiter). Conclusion: the fine sub-type (holding/interpretive/procedural) is an inherently fuzzy boundary two capable models split differently; the coarse "is this a real rule" axis is robust across models. Use the coarse axis as ground truth; treat the sub-type as advisory, never as a gate. Zero chair tagging, read-only on the gold-set. Key from ~/.hermes deepseek env. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 20:12:58 +00:00

Author

SHA1

Message

Date

Chaim

808c2e4c46

feat(goldset): independent second-judge for rule_role (break AI-anchoring)

The gold-set's human role tags were made while seeing a claude AI recommendation,
so human↔AI agreement (~100%) is anchoring, not an independent accuracy signal.
This adds a third, genuinely independent judge — a DIFFERENT model (DeepSeek,
direct OpenAI-compatible API) classifies rule_role BLIND (never sees the human
tag nor the first AI's answer) — and reports an inter-rater agreement matrix.

Finding (100 tagged items): ai↔human 100% (anchored) vs deepseek↔human 50%
fine-grained — BUT 92% on the coarse axis (generalizable-rule vs application/
obiter). Conclusion: the fine sub-type (holding/interpretive/procedural) is an
inherently fuzzy boundary two capable models split differently; the coarse
"is this a real rule" axis is robust across models. Use the coarse axis as
ground truth; treat the sub-type as advisory, never as a gate.

Zero chair tagging, read-only on the gold-set. Key from ~/.hermes deepseek env.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-07 20:12:58 +00:00

1 Commits