fix: prevent heartbeat timeout state pollution in validation loop

Fixes a critical bug where heartbeat_timeout_seconds was initialized once at function scope, causing timeout values to leak between different indexing attempts during validation cycles.

**The Bug:**
When validating multiple active indexing attempts, the first attempt meeting the condition (total_batches > 0 AND completed_batches == 0) would modify the shared heartbeat_timeout_seconds variable from 30 minutes to 12 hours. This modified value would then persist for ALL subsequent attempts in that validation cycle.

**Impact:**
- Stalled indexing operations that should be detected and terminated within 30 minutes remained undetected for up to 12 hours
- Worker resources stayed allocated to dead/stuck processes
- Legitimate indexing operations were delayed due to resource starvation
- Users experienced "stuck" connectors that appeared working but were deadlocked

**The Fix:**
Move heartbeat_timeout_seconds initialization inside the loop (after lock_beat.reacquire()) to ensure each attempt is evaluated with its own independent timeout threshold.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
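
The leak described above can be reproduced in isolation. The following is a minimal sketch, not the code from tasks.py: the `Attempt` dataclass, `EXTENDED_TIMEOUT_SECONDS`, `needs_extended_timeout`, and the two `timeouts_*` functions are hypothetical names introduced for illustration, and the 30-minute and 12-hour values are taken from the description in this message rather than from the source file.

```python
from dataclasses import dataclass

# Values as described in the commit message; the real constants live in the codebase.
HEARTBEAT_TIMEOUT_SECONDS = 30 * 60       # 30 minutes
EXTENDED_TIMEOUT_SECONDS = 12 * 60 * 60   # 12 hours (hypothetical name)


@dataclass
class Attempt:
    """Hypothetical stand-in for an indexing attempt record."""
    id: int
    total_batches: int
    completed_batches: int


def needs_extended_timeout(attempt: Attempt) -> bool:
    # Condition quoted in the commit message.
    return attempt.total_batches > 0 and attempt.completed_batches == 0


def timeouts_buggy(attempts: list[Attempt]) -> dict[int, int]:
    # Initialized once at function scope: an override leaks into later iterations.
    timeout = HEARTBEAT_TIMEOUT_SECONDS
    result = {}
    for attempt in attempts:
        if needs_extended_timeout(attempt):
            timeout = EXTENDED_TIMEOUT_SECONDS
        result[attempt.id] = timeout
    return result


def timeouts_fixed(attempts: list[Attempt]) -> dict[int, int]:
    result = {}
    for attempt in attempts:
        # Reset per attempt, so each one starts from the default threshold.
        timeout = HEARTBEAT_TIMEOUT_SECONDS
        if needs_extended_timeout(attempt):
            timeout = EXTENDED_TIMEOUT_SECONDS
        result[attempt.id] = timeout
    return result


attempts = [
    Attempt(id=1, total_batches=10, completed_batches=0),  # qualifies for 12 hours
    Attempt(id=2, total_batches=10, completed_batches=4),  # should keep 30 minutes
]
buggy = timeouts_buggy(attempts)
fixed = timeouts_fixed(attempts)
```

In this sketch, attempt 2 inherits the 12-hour threshold in the function-scope variant only because attempt 1 was processed first; resetting the variable at the top of each iteration removes that ordering dependence.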
2026-04-04 14:32:41 +00:00 · 2025-10-18 22:03:47 -05:00
1 changed file with 3 additions and 2 deletions
--- a/backend/onyx/background/celery/tasks/docprocessing/tasks.py
+++ b/backend/onyx/background/celery/tasks/docprocessing/tasks.py
@@ -152,8 +152,6 @@ def validate_active_indexing_attempts(
    """
    logger.info("Validating active indexing attempts")

-    heartbeat_timeout_seconds = HEARTBEAT_TIMEOUT_SECONDS
-
    with get_session_with_current_tenant() as db_session:

        # Find all active indexing attempts
@@ -171,6 +169,9 @@ def validate_active_indexing_attempts(
        for attempt in active_attempts:
            lock_beat.reacquire()

+            # Initialize timeout for each attempt to prevent state pollution
+            heartbeat_timeout_seconds = HEARTBEAT_TIMEOUT_SECONDS
+
            # Double-check the attempt still exists and has the same status
            fresh_attempt = get_index_attempt(db_session, attempt.id)
            if not fresh_attempt or fresh_attempt.status.is_terminal():