Compare commits

1 Commits

Author SHA1 Message Date
Emerson Gomes
ad1f6b5f14 fix: prevent heartbeat timeout state pollution in validation loop
Fixes a critical bug where heartbeat_timeout_seconds was initialized once
at function scope, causing timeout values to leak between different indexing
attempts during validation cycles.

**The Bug:**
When validating multiple active indexing attempts, the first attempt meeting
the condition (total_batches > 0 AND completed_batches == 0) would modify
the shared heartbeat_timeout_seconds variable from 30 minutes to 12 hours.
This modified value would then persist for ALL subsequent attempts in that
validation cycle.
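
A minimal sketch of the leak described above, not the project's actual code. The `Attempt` class, the `EXTENDED_TIMEOUT_SECONDS` name, and the helper function are hypothetical stand-ins; only the 30-minute and 12-hour values and the batch condition come from this commit message.

```python
from dataclasses import dataclass

HEARTBEAT_TIMEOUT_SECONDS = 30 * 60  # 30 minutes, per the commit message
EXTENDED_TIMEOUT_SECONDS = 12 * 60 * 60  # hypothetical name for the 12-hour value


@dataclass
class Attempt:
    """Hypothetical stand-in for an indexing attempt record."""

    id: int
    total_batches: int
    completed_batches: int


def timeouts_with_shared_state(attempts: list[Attempt]) -> dict[int, int]:
    """Reproduce the bug: the timeout is initialized once, outside the loop."""
    heartbeat_timeout_seconds = HEARTBEAT_TIMEOUT_SECONDS
    applied: dict[int, int] = {}
    for attempt in attempts:
        # The first attempt matching this condition raises the shared value...
        if attempt.total_batches > 0 and attempt.completed_batches == 0:
            heartbeat_timeout_seconds = EXTENDED_TIMEOUT_SECONDS
        # ...and every later attempt inherits it, whether it matches or not.
        applied[attempt.id] = heartbeat_timeout_seconds
    return applied


attempts = [Attempt(1, 10, 0), Attempt(2, 10, 5)]
leaked = timeouts_with_shared_state(attempts)
print(leaked)  # {1: 43200, 2: 43200}
```

Attempt 2 has already completed batches, so it should keep the 30-minute threshold, yet it is evaluated against the 12-hour value left behind by attempt 1.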

**Impact:**
- Stalled indexing operations that should have been detected and terminated
  within 30 minutes remained undetected for up to 12 hours
- Worker resources stayed allocated to dead or stuck processes
- Legitimate indexing operations were delayed by resource starvation
- Users saw "stuck" connectors that appeared to be working but were deadlocked

**The Fix:**
Move heartbeat_timeout_seconds initialization inside the loop (after
lock_beat.reacquire()) to ensure each attempt is evaluated with its own
independent timeout threshold.
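
The effect of the fix can be sketched in the same spirit. This is an illustration under stated assumptions, not the real function: `Attempt`, `EXTENDED_TIMEOUT_SECONDS`, and `timeouts_with_per_attempt_state` are hypothetical names, and the real loop also reacquires a lock and re-reads each attempt.

```python
from dataclasses import dataclass

HEARTBEAT_TIMEOUT_SECONDS = 30 * 60  # 30 minutes, per the commit message
EXTENDED_TIMEOUT_SECONDS = 12 * 60 * 60  # hypothetical name for the 12-hour value


@dataclass
class Attempt:
    """Hypothetical stand-in for an indexing attempt record."""

    id: int
    total_batches: int
    completed_batches: int


def timeouts_with_per_attempt_state(attempts: list[Attempt]) -> dict[int, int]:
    """Apply the fix: reset the timeout at the top of every iteration."""
    applied: dict[int, int] = {}
    for attempt in attempts:
        # Each attempt starts from the default, so nothing carries over.
        heartbeat_timeout_seconds = HEARTBEAT_TIMEOUT_SECONDS
        if attempt.total_batches > 0 and attempt.completed_batches == 0:
            heartbeat_timeout_seconds = EXTENDED_TIMEOUT_SECONDS
        applied[attempt.id] = heartbeat_timeout_seconds
    return applied


attempts = [Attempt(1, 10, 0), Attempt(2, 10, 5)]
isolated = timeouts_with_per_attempt_state(attempts)
print(isolated)  # {1: 43200, 2: 1800}
```

Only attempt 1, which has batches queued but none completed, receives the extended threshold. Attempt 2 is checked against the default 30 minutes.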

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-18 22:03:47 -05:00

@@ -152,8 +152,6 @@ def validate_active_indexing_attempts(
     """
     logger.info("Validating active indexing attempts")
-    heartbeat_timeout_seconds = HEARTBEAT_TIMEOUT_SECONDS
     with get_session_with_current_tenant() as db_session:
         # Find all active indexing attempts
@@ -171,6 +169,9 @@ def validate_active_indexing_attempts(
         for attempt in active_attempts:
             lock_beat.reacquire()
+            # Initialize timeout for each attempt to prevent state pollution
+            heartbeat_timeout_seconds = HEARTBEAT_TIMEOUT_SECONDS
             # Double-check the attempt still exists and has the same status
             fresh_attempt = get_index_attempt(db_session, attempt.id)
             if not fresh_attempt or fresh_attempt.status.is_terminal():