feat(digests): digest_kind classification — robust extraction for all issue types (X12)

~2% of the "כל יום" (Kol Yom) issues are non-decisions (legislative
updates/announcements/greetings) with no ruling → the decision-centric
extraction returned empty → both-empty → retried in a loop by self-heal.

- SCHEMA_V32: `digest_kind` (decision/announcement/other) + a cheap backfill of
  legacy rows (has citation → decision, otherwise announcement), run before
  the self-heal relies on it.
- extractor: the prompt classifies and always extracts concept/headline/summary;
  underlying_* only for decision. extract normalizes digest_kind.
- enrich: persists digest_kind; a successful extraction always ends with a
  non-empty kind (defaults by citation if the model omitted it).
- drain self-heal: failure definition = completed with digest_kind='' (instead
  of both-empty) → announcements are no longer retried forever.
- db: digest_kind in _DIGEST_COLS + update-whitelist (flows to search/list/API).
- X12 spec: documents digest_kind + the corrected failure definition.
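
The fallback described in the SCHEMA_V32 and enrich bullets can be sketched
roughly as follows. This is a minimal illustration, not the code in this
commit: `resolve_digest_kind` and `VALID_KINDS` are hypothetical names, and
treating an unrecognized model value the same as an omitted one is an
assumption.

```python
# Hypothetical sketch of the "default by citation" rule; names are illustrative.
VALID_KINDS = {"decision", "announcement", "other"}


def resolve_digest_kind(model_kind, citation):
    """Return a non-empty digest_kind for a successfully extracted digest."""
    # Normalize whatever the model returned (may be None or oddly cased).
    kind = (model_kind or "").strip().lower()
    if kind in VALID_KINDS:
        return kind
    # Assumption: a missing or unrecognized kind falls back to the same rule
    # as the legacy backfill: has citation -> decision, otherwise announcement.
    return "decision" if (citation or "").strip() else "announcement"
```

Because the function never returns an empty string, an empty `digest_kind` on
a completed row can only mean that the extraction never landed.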

Verified: V32 classified 533 rows (525 decision + 8 announcement, 0
unclassified; self-heal does not touch them). extract: 5163→decision+citation ·
5060→announcement+concept, citation empty (not both-empty).
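
To make the difference between the two failure definitions concrete, here is a
small in-memory sketch that mirrors the old and new WHERE clauses. The helper
names (`old_needs_retry`, `new_needs_retry`) and the dict-shaped rows are
hypothetical; the real check runs as SQL in the drain script.

```python
# Hypothetical in-memory mirror of the self-heal predicates (illustration only).
def _blank(value):
    # Mirrors coalesce(col, '') = '' in SQL: NULL and '' both count as blank.
    return (value or "") == ""


def old_needs_retry(row):
    """Old heuristic: completed, has analysis text, concept and citation both empty."""
    return (
        row.get("extraction_status") == "completed"
        and _blank(row.get("concept_tag"))
        and _blank(row.get("underlying_citation"))
        and not _blank(row.get("analysis_text"))
    )


def new_needs_retry(row):
    """New definition: completed, has analysis text, digest_kind empty."""
    return (
        row.get("extraction_status") == "completed"
        and _blank(row.get("digest_kind"))
        and not _blank(row.get("analysis_text"))
    )


# An announcement as the old decision-centric extractor left it: nothing
# extracted, so the old check kept resetting it. After the backfill it carries
# a kind, so the new check leaves it alone.
announcement = {
    "extraction_status": "completed",
    "analysis_text": "legislative update",
    "concept_tag": "",
    "underlying_citation": None,
    "digest_kind": "announcement",
}
```

With this row, `old_needs_retry(announcement)` is true while
`new_needs_retry(announcement)` is false, which is the loop this commit removes.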

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 06:02:08 +00:00
parent 5bf2ea0262
commit 83d1a8253c
5 changed files with 67 additions and 21 deletions

@@ -36,20 +36,24 @@ CONCURRENCY = int(os.environ.get("DIGEST_DRAIN_CONCURRENCY", "3"))
 async def main() -> int:
     pool = await db.get_pool()
-    # Self-heal: an enrich that failed mid-LLM (e.g. the local claude
-    # subscription window was exhausted) can leave a row 'completed' with no
-    # concept_tag AND no underlying_citation — a real digest always extracts at
-    # least a citation, so "both empty" means the extraction never landed. Reset
-    # those to 'pending' so the next run retries (idempotent auto-resume). Safe:
-    # successfully-enriched rows always have a concept_tag or citation.
+    # get_pool() runs schema migrations first — incl. the V32 digest_kind backfill
+    # that classifies legacy rows — so the failure check below is safe from the
+    # very first run (no legacy row has digest_kind='').
+    #
+    # Self-heal: a successful enrich ALWAYS sets digest_kind (decision/announcement
+    # /other). So a 'completed' row with digest_kind='' means the extraction never
+    # landed (e.g. the local claude subscription window was exhausted) — reset to
+    # 'pending' to retry (idempotent auto-resume). This correctly does NOT touch
+    # announcements (digest_kind='announcement', legitimately no citation), which
+    # the old "both fields empty" heuristic wrongly retried forever.
     healed = await pool.execute(
         "UPDATE digests SET extraction_status = 'pending' "
         "WHERE extraction_status = 'completed' "
-        "AND coalesce(concept_tag,'') = '' AND coalesce(underlying_citation,'') = '' "
+        "AND coalesce(digest_kind,'') = '' "
         "AND coalesce(analysis_text,'') <> ''"
     )
     if healed and healed != "UPDATE 0":
-        print(f"self-heal: reset failed-empty digests → pending ({healed})", flush=True)
+        print(f"self-heal: reset unclassified (failed) digests → pending ({healed})", flush=True)
     # Self-heal stale 'processing': flock guarantees a single drainer, so at the
     # start of THIS run any row left 'processing' is from a previous run that was
     # killed mid-row (session/quota cutoff). Reset to 'pending' so it is retried.