ezer-mishpati/legal-ai

Commit Graph

Author	SHA1	Message	Date

Chaim

0e35060d3d

feat(goldset): AI second-opinion per item (QA aid) — compare vs human tag

The chair wanted an independent recommendation beside each tag, to reconsider
his own judgments. Adds a NON-ground-truth AI second-opinion:

- schema: halacha_goldset.ai_is_holding / ai_correct_type / ai_rationale /
  ai_generated_at (additive).
- db.goldset_set_ai_recommendation + goldset_list now returns the ai_* fields.
- scripts/goldset_ai_recommend.py — local claude_session judges is_holding +
  type + a one-line rationale per item, INDEPENDENTLY (own legal rubric).
  Independent of the rule-based validators #81.8 measures → no circularity.
  Never auto-applied; QA aid only.
- web-ui: each card shows "🤖 המלצת AI: הלכה/לא · type" ("AI recommendation:
  holding / not · type") plus the rationale and an agreement/disagreement chip
  vs the human tag (amber on disagree); a "⚠ אי-הסכמות AI (N)"
  ("AI disagreements (N)") filter shows only the conflicts.
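The agreement chip and the disagreement filter described above could be derived roughly as follows. This is a minimal sketch, not the repository's code: the `GoldsetItem` dataclass and both helper functions are hypothetical, and only the field names `is_holding` and `ai_is_holding` come from the commit message.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GoldsetItem:
    # Hypothetical container; only the two field names mirror the commit message.
    is_holding: Optional[bool]     # human tag, None until tagged
    ai_is_holding: Optional[bool]  # AI second opinion, None until generated


def agreement(item: GoldsetItem) -> str:
    """Return 'pending', 'agree', or 'disagree' for the per-card chip."""
    if item.is_holding is None or item.ai_is_holding is None:
        return "pending"  # nothing to compare yet
    return "agree" if item.is_holding == item.ai_is_holding else "disagree"


def disagreements(items: List[GoldsetItem]) -> List[GoldsetItem]:
    """Keep only the items where the AI opinion conflicts with the human tag."""
    return [it for it in items if agreement(it) == "disagree"]
```

The human tag is never modified here; the AI field is only read for comparison, which matches the "never auto-applied" rule stated above.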

Methodology note kept explicit: the human stays the ground truth; the AI is a
prompt to reconsider, not to copy.

Verified: tsc --noEmit 0; generator stores recs and flags disagreements with
existing human tags.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-07 14:24:35 +00:00

Chaim

632fe73857

feat(goldset): separate court rulings from committee decisions in tagging

Tagging is easier one source-type at a time. goldset_list now returns
case_law.source_type; the page adds:
- a filter (הכל / פסקי דין / ועדת ערר, i.e. All / Court rulings / Appeals
  committee) with live counts,
- a group-sort so even in "הכל" (All) all court rulings come first, then all
  committee decisions,
- a per-card source badge (פסק-דין / ועדת ערר, i.e. Court ruling / Appeals
  committee).
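The filter counts and the group-sort can be illustrated with a short sketch. The `source_type` values `"court"` and `"committee"` are assumptions (the commit does not show the stored values), and the helper names are hypothetical; the actual page does this in the web UI rather than in Python.

```python
from collections import Counter
from typing import Dict, List, Optional

# Assumed source_type values and their display order: court rulings first.
ORDER = {"court": 0, "committee": 1}


def live_counts(items: List[Dict]) -> Counter:
    """Count items per source_type, for the filter labels."""
    return Counter(it["source_type"] for it in items)


def group_sort(items: List[Dict]) -> List[Dict]:
    """Court rulings first, then committee decisions.

    sorted() is stable, so the original order is preserved inside each group.
    Unknown source types sort last.
    """
    return sorted(items, key=lambda it: ORDER.get(it["source_type"], len(ORDER)))


def filter_by(items: List[Dict], source_type: Optional[str] = None) -> List[Dict]:
    """None means the 'All' view; otherwise keep a single source type."""
    if source_type is None:
        return group_sort(items)
    return [it for it in items if it["source_type"] == source_type]
```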

Verified: tsc --noEmit 0; source_type splits the live batch 58 court / 92 committee.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-07 13:55:06 +00:00

Chaim

ac279220c4

feat(goldset): interactive gold-set tagging page (#81.7/#81.8)

Replaces the CSV-edit workflow with an in-app tagging page so the chair/Dafna
can label the extraction-quality gold-set by clicking, and see validator
precision/recall live.

Schema (V29): halacha_goldset — a stratified, human-tagged evaluation batch
(is_holding / correct_type / quote_complete, NULL until tagged).

db.py:
- goldset_create_sample (stratified round-robin over case×rule_type, idempotent),
- goldset_list (items + halacha content + the machine's own labels),
- goldset_tag (partial — one field at a time for keyboard tagging),
- goldset_score (ports the script's P/R/F1: each validator scored as a
  not-a-holding detector against the human tags — the #81.8 input).
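As an illustration of the stratified round-robin idea behind goldset_create_sample, here is a minimal in-memory sketch. It assumes each row carries `case_id` and `rule_type` keys (hypothetical names) and it is only deterministic, whereas the real function is described as idempotent against the database.

```python
from collections import defaultdict
from typing import Dict, List


def stratified_round_robin(rows: List[Dict], n: int) -> List[Dict]:
    """Pick up to n rows, cycling over (case_id, rule_type) strata.

    Each pass takes at most one row from every stratum, so no single case or
    rule type dominates the sample.
    """
    strata = defaultdict(list)
    for row in rows:
        strata[(row["case_id"], row["rule_type"])].append(row)

    # Sorted keys make the result deterministic for the same input.
    queues = [strata[key] for key in sorted(strata)]

    picked: List[Dict] = []
    depth = 0
    while len(picked) < n:
        took_any = False
        for queue in queues:
            if depth < len(queue):
                picked.append(queue[depth])
                took_any = True
                if len(picked) == n:
                    break
        if not took_any:
            break  # every stratum is exhausted
        depth += 1
    return picked
```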

API: GET /api/goldset, POST /api/goldset/sample, GET /api/goldset/score,
PATCH /api/goldset/{id}.
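The numbers behind the score endpoint can be sketched like this: a validator is treated as a detector of "not a holding", so a true positive is an item the validator flagged that the human tagged as not a holding. The function name and the input shape are hypothetical; untagged items are skipped on the assumption that they carry no ground truth yet.

```python
from typing import Dict, Iterable, Optional, Tuple


def score_validator(pairs: Iterable[Tuple[bool, Optional[bool]]]) -> Dict[str, float]:
    """Score one validator as a not-a-holding detector.

    pairs: (flagged_by_validator, human_is_holding) per item;
    human_is_holding is None while the item is untagged.
    """
    tp = fp = fn = 0
    for flagged, is_holding in pairs:
        if is_holding is None:
            continue  # untagged: no ground truth to score against
        not_holding = not is_holding
        if flagged and not_holding:
            tp += 1
        elif flagged:
            fp += 1  # flagged a real holding
        elif not_holding:
            fn += 1  # missed a non-holding
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```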

web-ui:
- lib/api/goldset.ts (hooks),
- components/goldset/goldset-panel.tsx — card-per-item, keyboard-first
  (J/K nav, H/N holding, C/X quote), progress bar, hide-tagged toggle, and a
  collapsible live score table,
- app/goldset/page.tsx + nav link "מדגם-זהב" ("Gold sample") under
  ידע ולמידה ("Knowledge and learning").

Methodology guard kept explicit in UI + docstrings: tags are HUMAN ground truth,
no AI pre-fill (circular bias). Populated a 150-item stratified batch.

Verified: backend create/list/tag/score against the live DB; tsc --noEmit 0;
py_compile ok. (Local Turbopack build blocked by worktree symlink — CI builds clean.)

Invariants: G1 (eval set modeled at source in its own table); G2 (reuses the same
halacha_quality validators the extractor runs — no parallel scoring logic).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-06 21:52:05 +00:00

3 Commits