feat(eval): FU-5 — retrieval eval harness + halacha backlog visibility (#63)

Covers GAP-11 (INV-RET4/G8) and GAP-14 (INV-QA1/G10). Retrieval quality was never measured (only telemetry observation) and the halacha review backlog was invisible (the 10/19 gap was found by accident).

Unit B: backlog visibility (pure code, container):
- metrics.halacha_backlog(conn) → {pending_review, approved, rejected, published, total, oldest_pending_at}; surfaced in metrics.get_dashboard() (get_metrics MCP tool) and /api/system/diagnostics. Live count revealed 178 pending / 1552 total, oldest from 2026-05-03; previously invisible.

Unit A: retrieval eval harness (host-side scripts):
- scripts/eval_gold_bootstrap.py: seeds data/eval/gold-set.jsonl. Two sources: citations (cited==relevant via search_relevance_feedback; empty until decisions cite precedents) and known_item (query=case_name → relevant=self; a real citation-free signal, the methodology #52 checked by hand). Idempotent; preserves source='chair' rows.
- scripts/eval_retrieval.py: runs the production retrieval path (search_library / search_internal) over the gold-set; computes precision@k, recall@k, MRR, nDCG@k (k=5,10); aggregates overall + per-corpus + per-practice_area; writes a report and a delta vs committed baseline.json (which records the retrieval_config it reflects). --self-test unit-checks the metric math offline.

Gold-set strategy = hybrid (chair decision): bootstrap + chair review. The citation source is empty today (0 cited precedents in decisions), so the seed is known-item (77 queries: 54 internal_decisions + 23 precedent_library). The gold-set is PROVISIONAL until Dafna reviews it (the domain chair-gate).

Baseline (production config: multimodal+rerank on): R@10=0.987, MRR=0.837, nDCG@10=0.872.

Finding: MULTIMODAL_ENABLED=true slightly lowers known-item recall (image-page results displace exact name matches); relevant to #15. precedent_library is weaker than internal (R@10 0.957 vs 1.0): one external precedent is unfindable by name.

"CI gate" realized as discipline (re-runnable harness + committed baseline + run before/after any retrieval-layer change): retrieval needs prod DB + Voyage, and no CI runner has that access.

Spec: docs/superpowers/specs/2026-05-31-fu5-eval-harness-design.md

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
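
For readers unfamiliar with the metrics named above (precision@k, recall@k, MRR, nDCG@k), the following is a minimal sketch of the standard binary-relevance definitions. It is not taken from scripts/eval_retrieval.py; the function names and signatures are hypothetical, and the script's actual implementation may differ (for example in how it treats result lists shorter than k).

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots that hold a relevant id (denominator is k)."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant id, or 0.0 if none is retrieved.
    MRR is the mean of this value over all queries."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-gain nDCG: DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Known-item style query: exactly one relevant id, retrieved at rank 2.
ranked_ids = ["a", "b", "c", "d"]
relevant_ids = {"b"}
print(precision_at_k(ranked_ids, relevant_ids, 5))  # 0.2
print(recall_at_k(ranked_ids, relevant_ids, 5))     # 1.0
print(reciprocal_rank(ranked_ids, relevant_ids))    # 0.5
```

With a known-item gold-set each query has a single relevant document, so per-query recall@k is either 0 or 1 and the aggregate R@k is simply the share of queries whose target appears in the top k.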
2026-05-31 14:58:13 +00:00
parent cfcac80de2
commit 6ff2e36bf9
10 changed files with 776 additions and 10 deletions
--- a/.taskmaster/tasks/tasks.json
+++ b/.taskmaster/tasks/tasks.json
@@ -2175,9 +2175,9 @@
        "id": "63",
        "title": "[FU-5] eval-harness + backlog visibility",
        "description": "Measure precision/recall on a gold-set + expose the halacha backlog in the health check.",
-        "details": "Covers GAP-11,14. Satisfies INV-RET4/G8/QA1/G10. severity: High. Type: code + chair decision (building the gold-set). Depends on FU-2.",
+        "details": "Covers GAP-11,14. Satisfies INV-RET4/G8/QA1/G10. severity: High. Type: code + chair decision (building the gold-set). Depends on FU-2. | DONE 2026-05-31: Unit B (GAP-14): halacha_backlog exposed in metrics.get_dashboard + /api/system/diagnostics (revealed 178 pending_review out of 1552, oldest 2026-05-03). Unit A (GAP-11): scripts/eval_gold_bootstrap.py (citations+known_item) + scripts/eval_retrieval.py (P/R/MRR/nDCG@5,10, self-test, baseline+config). gold-set=77 known-item queries (citation source empty: 0 citations in decisions). Production baseline: R@10=0.987 MRR=0.837; finding: MULTIMODAL=true slightly lowers known-item recall (relevant to #15). gold-set=provisional until Dafna's review (chair-gate; the domain). spec: docs/superpowers/specs/2026-05-31-fu5-eval-harness-design.md",
        "testStrategy": "",
-        "status": "pending",
+        "status": "done",
        "dependencies": [
          "60"
        ],
@@ -2189,9 +2189,10 @@
            "description": "Currently only telemetry.log_search_bg; retrieval quality is not measured.",
            "dependencies": [],
            "details": "INV-RET4/G8",
-            "status": "pending",
+            "status": "done",
            "testStrategy": "",
-            "parentId": "63"
+            "parentId": "63",
+            "updatedAt": "2026-05-31T14:55:38.289Z"
          },
          {
            "id": 2,
@@ -2199,12 +2200,13 @@
            "description": "Count of pending_review in the health check (10/19 was discovered by accident).",
            "dependencies": [],
            "details": "INV-QA1/G10",
-            "status": "pending",
+            "status": "done",
            "testStrategy": "",
-            "parentId": "63"
+            "parentId": "63",
+            "updatedAt": "2026-05-31T14:55:38.295Z"
          }
        ],
-        "updatedAt": "2026-05-30T17:37:34.741136+00:00"
+        "updatedAt": "2026-05-31T14:55:38.295Z"
      },
      {
        "id": "64",
@@ -2418,9 +2420,9 @@
    ],
    "metadata": {
      "version": "1.0.0",
-      "lastModified": "2026-05-31T14:11:37.689Z",
+      "lastModified": "2026-05-31T14:55:38.296Z",
      "taskCount": 70,
-      "completedCount": 62,
+      "completedCount": 63,
      "tags": [
        "legal-ai"
      ]