feat(digests): digest_kind classification — robust extraction for all issue types (X12)
~2% of the "כל יום" ("Every Day") issues are non-decisions (legislative updates/announcements/greetings) with no ruling → the decision-centric extraction returned empty → both-empty → retried in an endless cycle by self-heal.

- SCHEMA_V32: `digest_kind` (decision/announcement/other) + a cheap backfill of legacy rows (has citation → decision, otherwise announcement), applied before the self-heal relies on it.
- extractor: the prompt classifies and always extracts concept/headline/summary; underlying_* only for decision. extract normalizes digest_kind.
- enrich: stores digest_kind; a successful extraction always ends with a non-empty kind (defaulting by citation if the model omitted it).
- drain self-heal: failure definition = completed with digest_kind='' (instead of both-empty) → announcements are no longer retried forever.
- db: digest_kind in _DIGEST_COLS + update-whitelist (flows to search/list/API).
- X12 spec: documents digest_kind + the corrected failure definition.

Verified: V32 classified 533 (525 decision + 8 announcement, 0 unclassified; self-heal does not touch them). extract: 5163→decision+citation · 5060→announcement+concept, citation empty (not both-empty).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
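The defaulting rule described in the message (keep the model's label when it is one of the three kinds, otherwise fall back on whether a citation exists) can be sketched roughly as follows. The function name `normalize_digest_kind`, the `VALID_KINDS` set, and the strip/lowercase handling are assumptions for illustration; the commit only states that extract normalizes `digest_kind` and that the default follows the citation.

```python
from typing import Optional

# The three kinds named in the commit message.
VALID_KINDS = {"decision", "announcement", "other"}


def normalize_digest_kind(raw: Optional[str], citation: Optional[str]) -> str:
    """Return a non-empty digest kind (hypothetical helper).

    A recognised label from the model is kept (after trimming and
    lowercasing, which is an assumption). Anything else falls back to the
    citation-based default: a citation implies a decision, no citation
    implies an announcement.
    """
    kind = (raw or "").strip().lower()
    if kind in VALID_KINDS:
        return kind
    return "decision" if (citation or "").strip() else "announcement"
```

Because the fallback never returns an empty string, any row that passed through such a step would be distinguishable from a row whose extraction never landed.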
@@ -36,20 +36,24 @@ CONCURRENCY = int(os.environ.get("DIGEST_DRAIN_CONCURRENCY", "3"))
 
 async def main() -> int:
     pool = await db.get_pool()
-    # Self-heal: an enrich that failed mid-LLM (e.g. the local claude
-    # subscription window was exhausted) can leave a row 'completed' with no
-    # concept_tag AND no underlying_citation — a real digest always extracts at
-    # least a citation, so "both empty" means the extraction never landed. Reset
-    # those to 'pending' so the next run retries (idempotent auto-resume). Safe:
-    # successfully-enriched rows always have a concept_tag or citation.
+    # get_pool() runs schema migrations first — incl. the V32 digest_kind backfill
+    # that classifies legacy rows — so the failure check below is safe from the
+    # very first run (no legacy row has digest_kind='').
+    #
+    # Self-heal: a successful enrich ALWAYS sets digest_kind (decision/announcement
+    # /other). So a 'completed' row with digest_kind='' means the extraction never
+    # landed (e.g. the local claude subscription window was exhausted) — reset to
+    # 'pending' to retry (idempotent auto-resume). This correctly does NOT touch
+    # announcements (digest_kind='announcement', legitimately no citation), which
+    # the old "both fields empty" heuristic wrongly retried forever.
     healed = await pool.execute(
         "UPDATE digests SET extraction_status = 'pending' "
         "WHERE extraction_status = 'completed' "
-        "AND coalesce(concept_tag,'') = '' AND coalesce(underlying_citation,'') = '' "
+        "AND coalesce(digest_kind,'') = '' "
         "AND coalesce(analysis_text,'') <> ''"
     )
     if healed and healed != "UPDATE 0":
-        print(f"self-heal: reset failed-empty digests → pending ({healed})", flush=True)
+        print(f"self-heal: reset unclassified (failed) digests → pending ({healed})", flush=True)
     # Self-heal stale 'processing': flock guarantees a single drainer, so at the
     # start of THIS run any row left 'processing' is from a previous run that was
     # killed mid-row (session/quota cutoff). Reset to 'pending' so it is retried.
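To make the behavioural difference between the two failure definitions concrete, here is a small self-contained sketch that runs both predicates against three made-up rows. It uses an in-memory `sqlite3` table in place of the project's real database pool, and the row contents, `OLD_PREDICATE`/`NEW_PREDICATE` constants and `reset_ids` helper are all hypothetical; only the column names and the WHERE conditions are taken from the hunk above.

```python
import sqlite3

# Failure definitions before and after this commit (conditions from the diff).
OLD_PREDICATE = "coalesce(concept_tag,'') = '' AND coalesce(underlying_citation,'') = ''"
NEW_PREDICATE = "coalesce(digest_kind,'') = ''"

# Hypothetical rows: (id, status, concept_tag, underlying_citation, digest_kind, analysis_text)
ROWS = [
    (1, "completed", "tag", "citation", "decision", "text"),  # healthy decision
    (2, "completed", "", "", "announcement", "text"),          # announcement, nothing to cite
    (3, "completed", "", "", "", "text"),                      # extraction never landed
]


def reset_ids(predicate: str) -> list:
    """Apply the self-heal UPDATE with the given predicate; return ids reset to 'pending'."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE digests (id INTEGER PRIMARY KEY, extraction_status TEXT, "
        "concept_tag TEXT, underlying_citation TEXT, digest_kind TEXT, analysis_text TEXT)"
    )
    con.executemany("INSERT INTO digests VALUES (?, ?, ?, ?, ?, ?)", ROWS)
    con.execute(
        "UPDATE digests SET extraction_status = 'pending' "
        "WHERE extraction_status = 'completed' "
        f"AND {predicate} "
        "AND coalesce(analysis_text,'') <> ''"
    )
    ids = [r[0] for r in con.execute(
        "SELECT id FROM digests WHERE extraction_status = 'pending' ORDER BY id"
    )]
    con.close()
    return ids


print("old:", reset_ids(OLD_PREDICATE))  # old: [2, 3]
print("new:", reset_ids(NEW_PREDICATE))  # new: [3]
```

Under these assumed rows, the old predicate resets the announcement (row 2) on every run, while the new one resets only the row whose `digest_kind` is still empty.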